SkillForge — Self-Evolving Skill Synthesis Framework

Why SkillForge?

Modern AI agents face a fundamental skill acquisition bottleneck

⚠

Manual Authoring Doesn't Scale

Human-crafted skills require domain experts, are expensive to produce, and cannot keep pace with the diversity of tasks agents encounter.

≠

Cognitive Misalignment

Skills designed around human intuition often degrade agent performance, because what makes sense to a human expert does not necessarily match how LLM agents reason and act.

↻

Experience Is Wasted

Agents solve tasks from scratch without converting past experience into reusable knowledge. Every task starts at zero.

SkillForge addresses these three problems through three complementary mechanisms:

1

Co-Evolutionary Synthesis

A Skill Generator and a Surrogate Verifier co-evolve, iteratively refining skills through structured feedback loops.

2

Tiered Memory

Experience is progressively distilled from raw episodes into cross-task patterns and executable rules, with adaptive retrieval.

3

Failure-Driven Evolution

Skills evolve in response to diagnosed failure patterns. LLM-driven reflection provides causal insights for targeted repairs.
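As a rough illustration of the Tiered Memory mechanism, the sketch below distills raw episodes into recurring patterns and then promotes the frequent ones to rules. The `Episode` class, the `distill_patterns` and `promote_rules` helpers, and the `min_support` threshold are hypothetical names chosen for this example; they are not SkillForge's documented API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical raw episode: one task attempt and its outcome."""
    task_type: str
    lesson: str   # short note extracted from the attempt
    passed: bool


def distill_patterns(episodes):
    """Episodic -> semantic: count lessons that recur across failed episodes."""
    return Counter(e.lesson for e in episodes if not e.passed)


def promote_rules(patterns, min_support=2):
    """Semantic -> procedural: keep lessons seen at least `min_support` times."""
    return sorted(lesson for lesson, n in patterns.items() if n >= min_support)


episodes = [
    Episode("csv_to_parquet", "validate schema before writing", passed=False),
    Episode("json_to_parquet", "validate schema before writing", passed=False),
    Episode("csv_to_parquet", "pin the pyarrow version", passed=False),
    Episode("csv_to_parquet", "validate schema before writing", passed=True),
]

patterns = distill_patterns(episodes)
rules = promote_rules(patterns)
print(rules)
```

Only the lesson that recurs across two failed episodes is promoted; the one-off lesson stays in the pattern tier until more evidence accumulates.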

Architecture

A modular, plug-in framework with eight key components

Skill Generator

Creates & refines multi-artifact skill packages

↔

Surrogate Verifier

Co-evolving test generation with information isolation

→

Evolution Engine

Diagnose-before-prescribe improvement

↓

Skill Bank

Persistent evolving skill repository

↓

Episodic Memory

Raw outcomes per task

→

Semantic Memory

Cross-task patterns

→

Procedural Memory

Executable rules

↓

Adaptive Retrieval Controller

Thompson Sampling-based policy selection for when and what to retrieve
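To make the Thompson Sampling idea concrete, here is a minimal Beta-Bernoulli sketch that chooses among retrieval policies and learns from pass/fail outcomes. The class name, the policy names, and the simulated success rates are assumptions made for illustration; the actual controller may use a different reward model.

```python
import random


class ThompsonRetrievalController:
    """Beta-Bernoulli Thompson Sampling over hypothetical retrieval policies."""

    def __init__(self, policies, seed=None):
        # One Beta(alpha, beta) posterior per policy, starting from a uniform prior.
        self.posteriors = {p: [1.0, 1.0] for p in policies}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success probability per policy and pick the largest sample."""
        samples = {
            p: self.rng.betavariate(a, b) for p, (a, b) in self.posteriors.items()
        }
        return max(samples, key=samples.get)

    def update(self, policy, success):
        """Record whether the task passed after using `policy`."""
        if success:
            self.posteriors[policy][0] += 1.0
        else:
            self.posteriors[policy][1] += 1.0


# Toy simulation with made-up success rates, only to show the update loop.
true_rates = {"no_retrieval": 0.3, "episodic": 0.5, "procedural": 0.8}
controller = ThompsonRetrievalController(true_rates, seed=0)
env = random.Random(1)
counts = {p: 0 for p in true_rates}
for _ in range(2000):
    policy = controller.select()
    counts[policy] += 1
    controller.update(policy, env.random() < true_rates[policy])
print(counts)
```

Sampling from the posterior, rather than always picking the current best estimate, keeps some exploration of weaker policies while most selections shift toward the policy that succeeds most often.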

Workflow

Co-evolutionary skill synthesis in iterative cycles

flowchart TD
    A[Task Input] --> B[Skill Generator]
    B --> C[Generate Skill Package v0]
    C --> D[Execute Skill in Environment]
    D --> E[Surrogate Verifier]
    E --> F{Tests Pass?}
    F -->|No| G[Generate Failure Diagnostics]
    G --> H[Append Feedback to Context]
    H --> B
    F -->|Yes| I[Ground Truth Oracle]
    I --> J{Oracle Pass?}
    J -->|No| K[Escalate Verifier Tests]
    K --> E
    J -->|Yes| L[Deploy Evolved Skill]
    L --> M[Store in Skill Bank]
    M --> N[Update Memory Tiers]
    N --> O[Adaptive Retrieval Update]
    O --> P{More Tasks?}
    P -->|Yes| A
    P -->|No| Q[Final Skill Portfolio]

    subgraph Co-Evolution Loop
        B
        C
        D
        E
        F
        G
        H
        I
        J
        K
    end

    subgraph Memory and Retrieval
        M
        N
        O
    end

    style A fill:#4CAF50,color:#fff
    style Q fill:#2196F3,color:#fff
    style L fill:#FF9800,color:#fff

Convergence: Skills typically converge within 3-5 evolution rounds, with the surrogate verifier absorbing ~60% of iteration cost before escalating to the oracle.
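The control flow in the flowchart can be sketched as a bounded loop. This is a simplified reading of the diagram, not the framework's implementation: `co_evolve` and the four callables passed to it are hypothetical stand-ins, and the toy skill is just a set of handled cases.

```python
def co_evolve(task, generate, find_failures, oracle_passes, escalate,
              initial_tests=(), max_rounds=5):
    """Return (skill, rounds_used), or (None, max_rounds) if nothing passes."""
    feedback = []                 # diagnostics appended to the generator context
    tests = list(initial_tests)   # surrogate test suite, grown on escalation
    for round_no in range(1, max_rounds + 1):
        skill = generate(task, feedback)
        failures = find_failures(skill, tests)
        if not failures:
            if oracle_passes(skill):
                return skill, round_no          # both gates passed: deploy
            tests.extend(escalate(skill, tests))  # oracle disagreed: strengthen tests
            failures = find_failures(skill, tests) or ["oracle rejected the skill"]
        feedback.extend(failures)               # loop back to the generator
    return None, max_rounds


# Toy stand-ins: a "skill" is the set of input cases it handles.
REQUIRED_CASES = ("header", "nulls", "types")


def generate(task, feedback):
    return frozenset(["header", *feedback])


def find_failures(skill, tests):
    return [case for case in tests if case not in skill]


def oracle_passes(skill):
    return skill >= frozenset(REQUIRED_CASES)


def escalate(skill, tests):
    # Add a test for the first required case the suite does not cover yet.
    return [c for c in REQUIRED_CASES if c not in tests][:1]


skill, rounds = co_evolve("csv to parquet", generate, find_failures,
                          oracle_passes, escalate, initial_tests=["header"])
print(sorted(skill), rounds)
```

In this toy run the oracle is consulted only after the surrogate tests pass, and each oracle rejection adds one surrogate test, so later rounds catch the same gap without another oracle call.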

Evaluation

Rigorous, evaluation-driven design with built-in benchmarking

Pass Rate

tasks passed / total tasks

Primary quality metric — binary pass/fail per task

Correctness Score

Σ assertions passed / total assertions

Fine-grained per-assertion accuracy measure

Evolution Efficiency

final pass rate / evolution rounds

Quality improvement per iteration

Transfer Score

cross-model rate / self-evolved rate

Portability across AI models

Token Cost

tokens_gen + tokens_verify

Total computational cost of evolution

Failure Diversity

H(failure categories)

Entropy of failure mode distribution
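The metric definitions above translate directly into a few small functions. The function names are illustrative, pooling assertions across tasks for the Correctness Score is one possible reading of the formula, and using base-2 logarithms for the entropy is an assumption.

```python
import math
from collections import Counter


def pass_rate(task_results):
    """Tasks passed / total tasks, for a list of booleans."""
    return sum(task_results) / len(task_results)


def correctness_score(assertion_results):
    """Assertions passed / total assertions, pooled over all tasks."""
    passed = sum(sum(task) for task in assertion_results)
    total = sum(len(task) for task in assertion_results)
    return passed / total


def evolution_efficiency(final_pass_rate, evolution_rounds):
    """Final pass rate divided by the number of evolution rounds."""
    return final_pass_rate / evolution_rounds


def transfer_score(cross_model_rate, self_evolved_rate):
    """Pass rate on a target model relative to the model that evolved the skill."""
    return cross_model_rate / self_evolved_rate


def token_cost(tokens_gen, tokens_verify):
    """Generation tokens plus verification tokens."""
    return tokens_gen + tokens_verify


def failure_diversity(failure_categories):
    """Shannon entropy (in bits) of the failure-category distribution."""
    counts = Counter(failure_categories)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


tasks = [True, True, False, True]
assertions = [[True, True], [True, False], [False, False], [True]]
failures = ["timeout", "timeout", "format", "logic"]

print(pass_rate(tasks))
print(correctness_score(assertions))
print(failure_diversity(failures))
```

A higher Failure Diversity value means failures are spread across more categories, whereas a value near zero means one failure mode dominates.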

Benchmark Results

Demonstrative results showing framework capabilities

Note: These results are synthetic/demonstrative and illustrate the expected output format.

Pass Rate (%) by Condition (chart; conditions compared: No-Skill Baseline, Human-Curated, One-Shot Self-Gen, SkillForge (3 rounds), SkillForge (5 rounds), SkillForge (Full))

Cross-Model Transfer

Target Model	With Skills (%)	No Skill (%)	Δ (pp)
Model A (self-evolved)	72.3	31.2	+41.1
Model B (transferred)	66.8	28.5	+38.3
Model C (transferred)	62.1	22.3	+39.8
Model D (transferred)	55.4	12.8	+42.6
Model E (transferred)	50.2	9.1	+41.1
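As a worked example, the Transfer Score defined in the Evaluation section (cross-model rate / self-evolved rate) can be computed from the demonstrative table above. The numbers are copied from the table; the variable names are illustrative only.

```python
# (target model, with-skills pass rate %, no-skill pass rate %) from the table.
rows = [
    ("Model A", 72.3, 31.2),
    ("Model B", 66.8, 28.5),
    ("Model C", 62.1, 22.3),
    ("Model D", 55.4, 12.8),
    ("Model E", 50.2, 9.1),
]

self_evolved_rate = rows[0][1]   # Model A evolved the skills itself

summary = {}
for name, with_skills, no_skill in rows:
    delta = with_skills - no_skill              # gain in percentage points
    transfer = with_skills / self_evolved_rate  # 1.0 means no loss on transfer
    summary[name] = (delta, transfer)
    print(f"{name}: delta={delta:+.1f} pp, transfer score={transfer:.3f}")
```

By construction the self-evolved model scores exactly 1.0, and the transferred models fall below it as their with-skills pass rate drops.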

Human vs AI Simulation

A controlled comparison illustrating framework effectiveness

Note: These results are synthetic/demonstrative.

Human-Authored Skills

66% Avg Pass Rate

40 min Avg Authoring Time

Better ambiguity interpretation
Creative problem-solving
Domain intuition for edge cases

VS

AI-Generated Skills

73% Avg Pass Rate

8 min Avg Generation Time

Systematic sub-task coverage
Format and precision compliance
5x faster than human authoring

Key Insight

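In this comparison, AI-evolved skills outperform human-authored ones by encoding agent-native reasoning patterns rather than following human assumptions. Human authors still contribute ambiguity interpretation, creative problem-solving, and domain intuition, so the optimal approach is human-AI collaboration: high-level strategy from humans combined with AI-refined executable details.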

Agentic Platform

Discoverable agents for any orchestration framework

skillforge.evolver

SkillEvolver

Evolve a verified skill package for a task through co-evolutionary verification

Stateful

skillforge.retriever

SkillRetriever

Search the Skill Bank for existing skills matching a task description

Stateless

skillforge.executor

SkillExecutor

Execute a task augmented with a specific skill from the Skill Bank

Stateful

skillforge.evaluator

SkillEvaluator

Benchmark and compare skill quality using synthetic test generation

Stateless

skillforge.memory

MemoryConsultant

Query tiered memory for relevant experience, patterns, and procedural rules

Stateless

Integration Protocols

🔧 Tool Registration

Register as callable tools in LangChain, AutoGen, Semantic Kernel, or OpenAI function calling.

create_tools(forge)

🤖 Sub-Agent Delegation

Provide SkillForge agents to orchestrators via the Agent Provider interface.

provider.get_agent("skillforge.evolver")

⚡ Event-Driven

Subscribe to lifecycle events for reactive workflows — skill.evolved, task.failed, memory.promoted, and more.

@bus.on("skill.evolved")
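The decorator form shown in this card suggests a publish/subscribe pattern. Below is a minimal in-process sketch of that pattern. The `EventBus` class, its `emit` method, and the payload fields are assumptions for illustration; only the `@bus.on(...)` form and the event names come from the text above.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus; SkillForge's real bus may differ."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_name):
        """Decorator that registers a handler for `event_name`."""
        def register(handler):
            self._handlers[event_name].append(handler)
            return handler
        return register

    def emit(self, event_name, payload):
        """Call every handler registered for `event_name`, in order."""
        for handler in self._handlers[event_name]:
            handler(payload)


bus = EventBus()
evolved_log = []


@bus.on("skill.evolved")
def record_evolved_skill(payload):
    # The payload fields used here are illustrative assumptions.
    evolved_log.append((payload["skill_id"], payload["version"]))


bus.emit("skill.evolved", {"skill_id": "csv-to-parquet", "version": 3})
bus.emit("task.failed", {"task_id": "t-17"})   # no handler registered: ignored
print(evolved_log)
```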

🚀 MCP Server

Expose as a Model Context Protocol server for VS Code and Copilot agents.

python -m skillforge.agentic.mcp_server

Multi-Agent Collaboration

Orchestrator

↓

SkillRetriever

↓

Skill exists?

Yes

↓

SkillExecutor

↓

Result

No

↓

SkillEvolver

↓

MemoryConsultant
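The routing in the diagram above can be sketched as a single orchestrator function: look for an existing skill, execute with it if one is found, and otherwise consult memory and evolve a new one. The function and its four callables are hypothetical stand-ins for the agents, and passing memory hints into the evolver is one interpretation of the Evolver and MemoryConsultant link.

```python
def handle_task(task, retrieve, execute, evolve, consult_memory):
    """Route a task the way the collaboration diagram describes."""
    skill = retrieve(task)                       # SkillRetriever
    if skill is not None:
        return {"route": "execute", "result": execute(task, skill)}  # SkillExecutor
    hints = consult_memory(task)                 # MemoryConsultant
    return {"route": "evolve", "skill": evolve(task, hints)}         # SkillEvolver


# Toy stand-ins so the routing can be exercised without any real agents.
skill_bank = {"csv": "csv-skill-v3"}


def retrieve(task):
    return next((s for key, s in skill_bank.items() if key in task), None)


def execute(task, skill):
    return f"ran '{task}' with {skill}"


def consult_memory(task):
    return ["validate inputs first"]


def evolve(task, hints):
    return f"new-skill({len(hints)} hints)"


hit = handle_task("convert csv", retrieve, execute, evolve, consult_memory)
miss = handle_task("parse xml", retrieve, execute, evolve, consult_memory)
print(hit["route"], miss["route"])
```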

Platform Setup

Integrate SkillForge into your AI coding platform in minutes

VS Code Copilot Chat

Custom agents appear in the @ picker. Skills appear as / slash commands.

Setup

Clone the repo and open in VS Code
Install the GitHub Copilot Chat extension
Agents and skills are auto-discovered from .github/

Usage

Type @SkillForge Evolver to evolve a new skill
Type @SkillForge Retriever to find existing skills
Type @SkillForge Evaluator to benchmark skills
Type /skillforge-evolve, /skillforge-retrieve, or /skillforge-evaluate for guided workflows

Files

`.github/copilot-instructions.md`	Project-wide instructions
`.github/agents/*.agent.md`	3 custom agents
`.github/skills/*/SKILL.md`	3 skill workflows

Claude Code / Workspace

CLAUDE.md is loaded automatically. Skills are discovered from .claude/skills/.

Setup

Clone the repo
Open in Claude Code or attach as a Claude workspace
CLAUDE.md is read automatically at session start

Usage

Claude reads CLAUDE.md for project context, conventions, and architecture
Skills in .claude/skills/ provide guided workflows for evolving, retrieving, and evaluating skills
Ask Claude to “evolve a skill for X” or “find a skill for Y”

Files

`CLAUDE.md`	Project context & conventions
`.claude/skills/*/SKILL.md`	3 skill workflows

OpenAI Codex

AGENTS.md at the repo root is automatically read by Codex at session start.

Setup

Clone the repo
Open in Codex
AGENTS.md is read automatically — no extra configuration needed

Usage

Codex reads AGENTS.md for the full project schema, conventions, workflows, and agent interfaces
Reference skills/*.md for framework-agnostic skill definitions
Ask Codex to “evolve a skill”, “retrieve a skill”, or “evaluate skills”

Files

`AGENTS.md`	Full project schema & workflows
`skills/*.md`	3 cross-platform skill definitions

Integration

Plug-in architecture for any AI system

📦

SDK / Library

Import and call SkillForge APIs directly in your Python application

from skillforge import SkillForge

⚙

Middleware

Wrap existing agent pipelines with transparent skill augmentation

@middleware.enhance
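One way to picture transparent skill augmentation is a decorator that enriches the instruction before the wrapped agent sees it. The `SkillMiddleware` class, its keyword-matching lookup, and the prompt format are assumptions made for this sketch; only the `@middleware.enhance` form comes from the text above.

```python
import functools


class SkillMiddleware:
    """Hypothetical middleware that appends matching skills to an instruction."""

    def __init__(self, skill_bank):
        self.skill_bank = skill_bank   # maps a keyword to a skill description

    def enhance(self, agent_fn):
        @functools.wraps(agent_fn)
        def wrapper(instruction):
            matching = [
                skill for key, skill in self.skill_bank.items()
                if key in instruction.lower()
            ]
            if matching:
                instruction += "\n\nRelevant skills:\n" + "\n".join(matching)
            return agent_fn(instruction)   # the agent itself is unchanged
        return wrapper


middleware = SkillMiddleware(
    {"parquet": "Validate the schema before writing Parquet."}
)


@middleware.enhance
def echo_agent(instruction):
    # Stand-in for a real agent call; it just returns what it was given.
    return instruction


print(echo_agent("Convert CSV to Parquet"))
```

Because the wrapper keeps the original call signature, existing pipeline code can keep calling the agent as before.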

🌐

REST API

Deploy as a standalone service with HTTP endpoints

POST /v1/skills/evolve
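A client call to this endpoint might be built as shown below, using only the Python standard library. The base URL and the JSON payload shape are assumptions that mirror the Quick Start task dictionary; the service's actual request schema is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # hypothetical deployment address


def build_evolve_request(task, max_evolution_rounds=5):
    """Build (but do not send) a POST request for the evolve endpoint."""
    body = json.dumps(
        {"task": task, "max_evolution_rounds": max_evolution_rounds}
    )
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/skills/evolve",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_evolve_request(
    {
        "instruction": "Build a data pipeline that validates and transforms CSV to Parquet",
        "environment": {"tools": ["python", "pandas", "pyarrow"]},
    }
)
print(req.get_method(), req.full_url)
# To send it against a running service: urllib.request.urlopen(req)
```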

🔧

Evaluation Pipeline

Integrate skill quality gates into CI/CD workflows

python -m skillforge.evaluate

🔌

Plug-in

Framework-specific adapters for popular agent systems

AgentAdapter(agent, forge)

Quick Start

from skillforge import SkillForge

forge = SkillForge.from_config("config.yaml")

# Define your task
task = {
    "instruction": "Build a data pipeline that validates and transforms CSV to Parquet",
    "environment": {"tools": ["python", "pandas", "pyarrow"]},
}

# Evolve a skill through co-evolutionary verification
skill = forge.evolve_skill(task, max_evolution_rounds=5)

# Use the evolved skill with any agent
# (`agent` is your existing agent object; it is not created in this snippet)
result = forge.execute_with_skill(agent, task, skill)

print(f"Skill v{skill.version}: {skill.accuracy:.0%} accuracy")

Ready to Forge Better Skills?

SkillForge is open source and ready for integration into your AI systems. Start evolving skills that outperform human-authored ones.
