I had a bunch of Skills sitting in a folder. None of them were callable as APIs.
So I built a runtime to fix that.
The problem
If you use Claude Code, Copilot, or Codex, you've probably created Agent Skills: the SKILL.md files that tell the AI what to do.
I had a bunch of them. But they were stuck. I couldn't plug them into a product, trigger them from a webhook, or let any service call them with a POST request.
Each skill was trapped inside the tool that created it.
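For context, a SKILL.md is a markdown file with a short YAML frontmatter followed by instructions. A minimal sketch (this particular agent is hypothetical, and frontmatter fields beyond name and description vary by tool):

```markdown
---
name: code-review
description: Reviews JavaScript code and reports issues with a score
---

Review the code you are given. Report a numeric score, a list of
issues with severities, and a short written summary.
```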
What I wanted
```
take a SKILL.md → get a POST /run endpoint
```
No new framework to learn. No infrastructure to set up. Just point at a skill, configure the model, and deploy.
What I built
Skrun, an open-source runtime that takes Agent Skills and turns them into callable APIs.
```shell
skrun init --from-skill ./my-existing-skill
# reads SKILL.md, generates agent.yaml

skrun deploy
# validates, builds, pushes
# → POST http://localhost:4000/api/agents/dev/my-skill/run
```
Then you call it:
```shell
curl -X POST http://localhost:4000/api/agents/dev/code-review/run \
  -H "Authorization: Bearer dev-token" \
  -H "Content-Type: application/json" \
  -d '{"input": {"code": "function add(a,b) { return a + b; }"}}'
```
```json
{
  "status": "completed",
  "output": {
    "score": 60,
    "issues": [
      {"severity": "warning", "description": "Use const instead of var"}
    ],
    "review": "Lacks error handling..."
  }
}
```
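Since any service can hit the endpoint, a thin wrapper is enough to integrate it. A sketch of a pair of shell helpers — the URL pattern and dev token follow the curl example, but the helper names and the `dev` environment are my own:

```shell
# Build the run-endpoint URL for an environment ($1) and agent name ($2).
skrun_url() {
  printf 'http://localhost:4000/api/agents/%s/%s/run' "$1" "$2"
}

# POST a JSON payload ($3) to the agent's run endpoint.
skrun_run() {
  curl -sS -X POST "$(skrun_url "$1" "$2")" \
    -H "Authorization: Bearer dev-token" \
    -H "Content-Type: application/json" \
    -d "$3"
}

# Usage (requires the local registry to be running):
# skrun_run dev code-review '{"input": {"code": "const x = 1;"}}'
```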
Multi-model
You pick the provider in agent.yaml, not in your code. Anthropic, OpenAI, Google, Mistral, Groq. If one fails, it falls back to the next.
```yaml
model:
  provider: google
  name: gemini-2.5-flash
  fallback:
    provider: openai
    name: gpt-4o
```
Tool calling
Two approaches.
You can bundle your own CLI tools with the agent. Create a scripts/ directory, write whatever you want (shell, Node, Python), declare them in agent.yaml:
```yaml
tools:
  - name: eslint_check
    script: scripts/eslint-check.sh
    description: "Run ESLint on JavaScript code"
```
The LLM calls the tool when it needs to. Skrun executes the script, returns the result. Your agent can run a linter, query a database, call an internal API.
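As a sketch of what such a script's logic could look like — a real `eslint_check` would invoke ESLint, but here a grep-based count of `var` stands in so the example is self-contained, and the JSON-on-stdout contract is my assumption, not Skrun's documented interface:

```shell
# Hypothetical core of a bundled tool: takes source code as $1 and
# emits a JSON summary on stdout.
lint_summary() {
  code="$1"
  # count occurrences of "var" as a stand-in lint check
  count=$(printf '%s\n' "$code" | grep -o 'var' | wc -l | tr -d ' ')
  printf '{"var_count": %s}\n' "$count"
}

lint_summary 'var a = 1; var b = 2;'
# → {"var_count": 2}
```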
Or use MCP servers. Any MCP server from the npm ecosystem works via npx:
```yaml
mcp_servers:
  - name: browser
    transport: stdio
    command: npx
    args: ["-y", "@playwright/mcp", "--headless"]
```
Stateful
Agents can persist key-value state across runs. Run the same agent twice and it remembers what happened last time.
Roadmap
v0.1 runs on a local registry server. Next up:

- Cloud deploy (the architecture has a RuntimeAdapter interface ready for sandboxed VMs)
- Caller-provided API keys
- Streaming responses
- A hub to discover and share agents
The numbers
4 packages (@skrun-dev/schema, cli, runtime, api), 10 CLI commands, 154 tests, 6 demo agents, MIT license.
Try it
```shell
npm install -g @skrun-dev/cli

git clone https://github.com/skrun-dev/skrun.git
cd skrun && pnpm install && pnpm build
```
Set a Google API key in .env, start the registry (pnpm dev:registry), and follow the "Try an example" section in the README.
github.com/skrun-dev/skrun
I'd love feedback on:

- The agent.yaml format (does the I/O contract make sense?)
- The skill import flow
- What agents you'd build with this