The Tacit Ceiling of AI Knowledge Synthesis
Why Turning Documents into Expert-Level Skills Is So Difficult
The core problem I’m trying to solve with cogworks is how to use AI agents to programmatically encode topic knowledge from diverse sources and automate the creation of high-quality agent skills (you can read the cogworks origin story here).
The ultimate goal is extraordinary quality. The synthesis produced by cogworks-encode should be difficult to distinguish from something written by someone with 20 years of domain expertise, and the resulting skill should produce expert-level performance on novel inputs, not just on variations of the training sources.
Reaching acceptable quality is relatively easy; reaching extraordinary quality requires solving the deeper problem of whether the synthesis captures how experts think, not only what they produce.
Although this post focuses on the specific challenges of the cogworks pipeline, it touches on a broader question: how to bridge the gap between knowledge and action in AI systems in a way that generalises beyond the training data.
I’m still figuring out how to reliably hit verifiable quality with cogworks. Since context is loaded by the same model doing the synthesis and skill generation, there’s a risk that my tests confirm the implementation rather than the intended behaviour. So, wherever I make claims about quality or generalisation, read them as theoretically grounded but not externally validated. I’m working toward testing cogworks-generated skills against “best-in-class” skills via SkillsBench, but I’m not there yet.
For cogworks, extraordinary quality hinges on both stages of the pipeline:

Stage 1 – Knowledge synthesis (cogworks-encode): synthesising diverse sources into a coherent, comprehensive knowledge base.

Stage 2 – Skill generation (cogworks-learn): distilling that synthesis into a lean, immediately usable agent skill.
Synthesis captures what’s known and aims to be comprehensive: everything captured, with no redundancy or unresolved contradictions. Skill generation, on the other hand, captures what to do and demands ruthless distillation. The second stage depends entirely on the first, yet the two require almost opposite cognitive operations.
The knowledge structure that works for understanding doesn’t always work for action. Nor can you fix a bad synthesis with good skill-writing: if the synthesis missed something critical or resolved a contradiction wrongly, the skill inherits that flaw. And even a strong synthesis can fail at the skill stage, because distillation is inherently lossy; strip away the wrong thing and you remove the mechanism that drives good decisions.
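The opposing pressures of the two stages can be sketched in a few lines. This is purely illustrative: the function names and data shapes are my own stand-ins, not the actual cogworks interfaces, and the length-based ranking is a crude placeholder for real decision-impact judgment.

```python
def encode(sources: list[str]) -> dict[str, str]:
    """Stage 1 sketch: comprehensive capture, with redundancy removed."""
    synthesis: dict[str, str] = {}
    for src in sources:
        for claim in src.split(". "):
            key = claim.lower().strip(" .")
            if key and key not in synthesis:  # capture everything, but only once
                synthesis[key] = claim.strip(" .")
    return synthesis


def learn(synthesis: dict[str, str], keep: int = 3) -> list[str]:
    """Stage 2 sketch: ruthless distillation down to a few load-bearing claims."""
    # A real pipeline would rank by decision impact; length is a crude stand-in.
    return sorted(synthesis.values(), key=len, reverse=True)[:keep]
```

The point of the sketch is the asymmetry: `encode` only ever adds (subject to deduplication), while `learn` only ever discards, and a mistake in `encode` is invisible to `learn`.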
What Makes an Expert an Expert: Tacit Knowledge
So how does expert knowledge differ structurally from novice knowledge?
Explicit knowledge (know-what) can be written down. Tacit knowledge (know-how) is what experts develop through practice and often cannot fully articulate. The expert acts without explicitly reflecting on every principle involved.
AI synthesis extracts explicit knowledge well: facts, relationships, and patterns are all there on the page. But in many domains the most valuable layer is tacit: judgment about edge cases, and intuitions about when rules apply, when they break down, and when to ignore them. You can memorise every chess book ever written and still lack Kasparov’s intuitive grasp of the game.
When synthesis captures only explicit patterns, the resulting skill performs well on familiar cases but degrades on edge cases and novel situations that require underlying judgment. This is where a ceiling appears, since document synthesis is constrained by what is expressed or at least defensibly inferable from the available material.
Extraordinary quality here requires moving beyond extracting what experts say, to approximating the structure of their reasoning.
Conceptual Models and Decision Principles
What I’m learning is that the real differentiator is the conceptual model embedded in the source material. Capturing facts and examples isn’t enough; you need the underlying structure of thinking that produces correct decisions.
Experts have internalised conceptual models; in many domains, that model beneath the rules is what drives expert judgment. When experts teach or write, they describe rules or patterns, but the underlying model that actually produces good judgment usually stays implicit.
To ensure extraordinary synthesis quality, the focus must shift toward extracting decision principles of the form: “When X, do Y, because Z.” When sources conflict, synthesis must resolve the tension by explaining under what conditions each approach applies, rather than merely flagging disagreement.
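A decision principle of this form can be made concrete as a small record, with an extra field for the conditions under which conflicting advice applies. The schema below is a hypothetical sketch, not the cogworks data format; the example principle is my own illustration.

```python
from dataclasses import dataclass


@dataclass
class DecisionPrinciple:
    when: str                     # X: the triggering situation
    do: str                       # Y: the recommended action
    because: str                  # Z: the mechanism that makes Y the right call
    applies_when: str = "always"  # resolves conflicts between sources

    def render(self) -> str:
        return f"When {self.when}, {self.do}, because {self.because}."


p = DecisionPrinciple(
    when="a test needs heavy mocking",
    do="reconsider the unit boundary",
    because="heavy mocking couples the test to implementation",
)
```

The `applies_when` field is the conflict-resolution lever: two sources that disagree become two principles with disjoint `applies_when` conditions, rather than a flagged contradiction.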
Structural Rationale: The Mechanism-Level “Why”
“Standard” synthesis (using an AI agent without the cogworks toolchain) usually extracts surface rationale, i.e. the stated justification (“write tests first because it improves API design”).
Structural rationale captures the mechanism, e.g. “write tests first because otherwise implementation details shape test structure, causing tests to validate structure rather than intended behaviour.”
The structural “why” enables you to correctly identify when the pattern doesn’t apply because you understand what the pattern is protecting against. The surface “why” doesn’t do this.
The distinction matters enormously for agent skill generalisation. A skill built on surface rationale performs well on cases resembling the sources. A skill built on structural rationale should perform better on novel cases because it understands what the rule is defending against. It knows why the rule exists, not just that it exists.
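One way to see why structural rationale generalises is to treat a rule as defending against a named failure mode, and apply it only when that failure is actually possible in the current context. All names and the dictionary shape below are illustrative assumptions, not cogworks internals.

```python
def should_apply(rule: dict, context: set[str]) -> bool:
    """Apply a rule only if the failure it defends against can occur here."""
    return rule["defends_against"] in context


tdd = {
    "rule": "write tests first",
    # Surface rationale ("it improves API design") gives no test like this;
    # the structural rationale names a checkable failure mode:
    "defends_against": "implementation details shaping test structure",
}

# Context where the failure mode is live (code exists before the tests):
risky = {"implementation details shaping test structure"}
# Context where it cannot occur (tests derived from a spec, no code yet):
safe: set[str] = set()
```

A rule annotated only with surface rationale has no `defends_against` entry to check, so the skill can only pattern-match on situations resembling the sources.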
The problem is that structural rationale is often absent from the sources. Authors describe what they do and why it works, but rarely articulate the mechanism by which it prevents failure.
Although synthesis can’t extract what isn’t there, it can probe for it.
The Execution Challenge: From Synthesis to Skill
The cogworks workflow runs synthesis through an 8-phase process that culminates in narrative, then extracts a Decision Skeleton (the bridge between knowledge structure and decision structure), gates on user review, applies skill-writing expertise, and validates the output (see the system deep dive here).
Compared to naive prompting, this scaffolding improves structure and conflict resolution, and this is also where I initially tried to enforce discipline with the “Expert Subtraction Principle” in cogworks-encode:
### The Expert Subtraction Principle
**Core Philosophy:** Experts are systems thinkers who leverage their extensive knowledge and deep understanding to reduce complexity. Novices add. Experts subtract until nothing superfluous remains.
**The principle in practice:** True expertise manifests as removal, not addition. The expert's value is knowing what to leave out. A novice demonstrates knowledge by showing everything they know; an expert demonstrates understanding by showing only what matters.
In hindsight it’s obvious why this alone wasn’t sufficient. The model doing the synthesis isn’t an expert in any of the source reference domains, so it can’t reliably subtract what’s non-essential without additional safeguards, because it doesn’t independently know what matters.
A more deliberate approach was to make subtraction explicit. The pipeline uses a Compression Guard that cross-checks any removed content against a Critical Distinctions Registry (CDR) — a catalogue of non-negotiable distinctions extracted before compression begins — and a Pre-Review Coverage Gate that maps every source capability to the synthesis output before user review. For example, if a source distinguishes between “testing behaviour” and “testing implementation,” that distinction is logged in the CDR and any compression that collapses those into one concept is flagged before it reaches the skill stage.
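The Compression Guard check reduces to set membership: every proposed merge of two concepts is looked up against the registry of distinctions that must survive compression. The sketch below is a minimal illustration under that assumption; the real CDR format and guard logic in cogworks are richer than this.

```python
# Critical Distinctions Registry: pairs of concepts that must never collapse.
CDR: set[frozenset[str]] = {
    frozenset({"testing behaviour", "testing implementation"}),
}


def guard_merge(a: str, b: str) -> bool:
    """True if collapsing concepts a and b is safe, False if the CDR flags it."""
    return frozenset({a, b}) not in CDR
```

Using `frozenset` pairs makes the check order-independent: merging “testing implementation” into “testing behaviour” is flagged just as merging in the other direction would be.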
Reframing subtraction as evaluation rather than intuition improves the synthesis, but it still doesn’t solve the leap from synthesis to skill: generating a skill is not simply another compression pass over the synthesis. Skills must activate at the right moment, apply correctly in context, resist deviation pressure, and generalise beyond the training sources.
The SKILL.md format handles activation and application, but generalisation beyond the source material is where reliability most often degrades.
How cogworks Is Trying to Bridge the Gap
Currently, cogworks attempts three concrete strategies:
1. Explicitly extract the reasoning behind each pattern: why it works and what fails if you ignore it.
2. Capture boundary conditions: when and how experts would deviate from standard rules.
3. Generate adversarial test cases to probe edge cases and structural gaps.
Selection, however, remains a bottleneck. A synthesis may contain ten nuanced patterns, but the Decision Skeleton distils these to the five to seven most important decisions. That selection is inherently “LLM-subjective”: choosing which rules are load-bearing requires domain judgment that cannot be fully automated.
The best solution I have for this at the moment is to generate hard, edge-case questions before reading the sources, then, after synthesis, get the agent to check whether those questions are answered convincingly. Gaps often reveal missing tacit knowledge, but this is not a robust solution, since guarding against context pollution and drift remains an unsolved problem, as I’ve already mentioned.
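The probe step can be approximated mechanically. In the sketch below, the keyword-overlap check is a deliberately crude stand-in for the agent’s judgment of whether a question is answered “convincingly”; the function name and heuristic are my own assumptions.

```python
def unanswered(questions: list[str], synthesis: str) -> list[str]:
    """Return the pre-reading questions whose key terms never appear in the synthesis."""
    text = synthesis.lower()
    gaps = []
    for q in questions:
        # Treat words longer than four characters as the question's key terms.
        terms = [w for w in q.lower().split() if len(w) > 4]
        if not any(t in text for t in terms):
            gaps.append(q)
    return gaps
```

In the actual pipeline the agent, not a string match, judges coverage; the value of the pattern is that the questions are fixed before the sources can anchor them.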
For skill generation, the key bridging artifact is the Decision Skeleton — a structured extraction of the most important decisions, each with a trigger, options, right call, failure mode, and boundary conditions. The skill is then built around this skeleton, rather than the narrative structure of the synthesis.
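A single Decision Skeleton entry can be sketched as a record whose fields mirror the post’s description (trigger, options, right call, failure mode, boundary conditions). The concrete schema and the example entry below are my own illustration, not the cogworks artifact format.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    trigger: str                  # when does this decision arise?
    options: list[str]            # the realistic choices
    right_call: str               # the default expert choice
    failure_mode: str             # what goes wrong if you choose badly
    boundaries: list[str] = field(default_factory=list)  # when to deviate


skeleton = [
    Decision(
        trigger="a test requires more than two mocks",
        options=["add the mocks", "restructure the unit under test"],
        right_call="restructure the unit under test",
        failure_mode="tests validate wiring rather than behaviour",
        boundaries=["thin adapters over third-party APIs"],
    ),
]
```

Building the skill around five to seven such records, rather than the narrative order of the synthesis, is what lets the skill activate on triggers instead of topics.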
Ideally, the skill would then be tested against novel cases not covered in the sources, but proper evaluation requires domain expertise, which creates a bootstrapping problem. As an approximation, we generate novel cases from the synthesis itself and check whether the skill handles them correctly.
Recursive Improvement and the Tacit Ceiling
The cogworks pipeline itself is a single linear pass. There’s no built-in iteration loop, but you can re-run cogworks on the same sources (or feed its output back as a source) to iteratively improve quality. In practice, this manual recursive improvement often surfaces additional explicit structure from the sources (though it can also amplify existing blind spots if not carefully reviewed), but does not break through to tacit knowledge. Once the synthesis stabilises across iterations and the remaining gaps are purely tacit, you hit the wall.
When the synthesis plateaus and what’s missing is the kind of thing you’d need to ask domain experts directly to uncover, that’s when I stop.
On the meta-question of whether recursive improvement actually works, I honestly think so. The first round fixes obvious quality gaps like missing reasoning and poor structure. The second round tightens things up by removing non-essential elements and sharpening the decision rules. Beyond that, returns diminish because the synthesis is now capturing all the explicit knowledge available and what’s left is tacit knowledge that more iterations won’t unlock without better source material.
So, in my experience, 2-3 rounds tend to capture most of the gains. Anything beyond that and you start rephrasing the same content.
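The manual recursion described above amounts to a fixed-point loop: feed the previous output back as a source and stop when the synthesis stabilises. The sketch below assumes a hypothetical `run_pipeline` callable standing in for a full cogworks pass; the plateau test (identical output hashes) is a simplification of “the remaining gaps are purely tacit”.

```python
import hashlib


def iterate(sources: list[str], run_pipeline, max_rounds: int = 3) -> str:
    """Re-run the pipeline, feeding output back in, until the synthesis stabilises."""
    synthesis, last_digest = "", None
    for _ in range(max_rounds):
        synthesis = run_pipeline(sources + ([synthesis] if synthesis else []))
        digest = hashlib.sha256(synthesis.encode()).hexdigest()
        if digest == last_digest:  # plateau: further rounds only rephrase
            break
        last_digest = digest
    return synthesis
```

Capping `max_rounds` at three matches the observation above that two to three rounds capture most of the gains.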
Conclusion: Extraordinary Quality and Its Limits
Extraordinary synthesis means capturing the structural reasoning behind decisions, not merely restating documented patterns. Extraordinary skills must enable correct decisions in novel edge cases, and not be confined to familiar ones.
The gap between good and extraordinary comes down to how deeply the synthesis explains the underlying reasoning beyond the surface patterns, how well the decision skeleton identifies what choices the skill needs to enable, and whether the resulting skill generalises beyond the specific cases in the source material.
But there is a ceiling.
In domains where knowledge is highly explicit (API specifications, protocol documentation, well-documented procedures) the ceiling is high. In domains dominated by tacit judgment (system design, architecture decisions, anything where the interesting questions are “when” rather than “how”) the ceiling is lower, and pipeline refinement alone cannot extract what the sources never contained.
And across all of this, the hardest problem by far remains objective, external validation.
I’ll keep working on it.
Attribution: This Post’s Image is Courtesy of Yulia Gapeenko

