Your framework might be faster, cleaner, and more productive than every alternative. Doesn't matter. If the model has never seen your code in its training data, it can't recommend you. You literally don't exist.
"Being good isn't enough for agents to choose you; you have to be familiar and trustworthy and known."
That's from "Software Survival 3.0," which puts a name on this blind spot: awareness cost - the energy required for an agent to even know your tool is an option. And most framework authors aren't even thinking about it.
Why Awareness Cost Kills Great Tools
The survival ratio formula makes this visceral:
Survival(T) = (Savings x Usage x H) / (Awareness_cost + Friction_cost)
Your framework could compress more knowledge than Git, run more efficiently than grep, and be more broadly useful than Postgres. All of that sits in the numerator. But awareness cost sits in the denominator - and a big denominator kills a big numerator every time.
Awareness cost is the energy required for an agent to know your tool exists, understand what it offers, and choose to reach for it. In practical terms: if your framework isn't in the training data, it doesn't exist to the agent.
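To see how the denominator dominates, here is a toy Python sketch of the ratio. All the numbers are hypothetical, and the variable names just mirror the formula above; nothing here comes from Yegge's actual scoring.

```python
def survival(savings, usage, h, awareness_cost, friction_cost):
    """Toy survival ratio: value delivered over the cost of
    discovering and adopting the tool."""
    return (savings * usage * h) / (awareness_cost + friction_cost)

# A strong tool nobody has heard of (huge awareness cost)...
obscure = survival(savings=9, usage=9, h=9, awareness_cost=100, friction_cost=5)

# ...versus a weaker tool that is everywhere in the training data.
familiar = survival(savings=4, usage=4, h=4, awareness_cost=1, friction_cost=5)

print(obscure < familiar)  # True: the big denominator wins
```

With these made-up scores the obscure tool delivers far more value per use, yet its survival ratio still loses to the familiar one.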
Yegge tells the story of Dolt - a database with Git-like versioning that covered the first three levers but failed on the fourth:
"Dolt was a great example of a tool with levers 1 to 3 but not 4: I'd have used it for Beads sooner if Claude or I had known about it."
That's the nightmare scenario. A genuinely useful tool that agents can't recommend because they've never encountered it.
How Training Data Actually Works
When a developer asks Claude or ChatGPT "how do I build a dashboard in C#?", the model searches its weights for relevant patterns. Those patterns come from training data: documentation, blog posts, Stack Overflow answers, GitHub repositories, tutorials, Reddit discussions.
If framework X has 10,000 high-quality examples in the training data and framework Y has 50, the model will recommend framework X. Not because it's better - but because the model has more evidence that it works, more patterns to draw from, and higher confidence in generating correct code.
This creates a feedback loop. More recommendations lead to more usage. More usage leads to more public content. More content leads to more training data. The rich get richer.
For a newer framework, this loop works against you. You could have a superior developer experience, cleaner APIs, faster performance - and still lose to a framework with more blog posts.
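The loop can be sketched as a simple urn process. This is a toy simulation with invented numbers, not a model of how any real recommender works: each query picks a framework in proportion to its existing training examples, and each pick produces a bit more public content.

```python
import random

def simulate(examples_x=10_000, examples_y=50, queries=1_000, seed=0):
    """Toy rich-get-richer loop: recommendations are drawn in
    proportion to existing examples, and each use adds content."""
    rng = random.Random(seed)
    for _ in range(queries):
        if rng.random() < examples_x / (examples_x + examples_y):
            examples_x += 1  # recommended -> used -> more blog posts
        else:
            examples_y += 1
    return examples_x, examples_y

x, y = simulate()
print(x, y)  # X absorbs nearly all the new content; the gap widens
```

In expectation the underdog's *share* never improves on its own: with a 200:1 head start, roughly 199 of every 200 new examples accrue to the incumbent.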
The Content-as-Training-Data Strategy
Yegge describes several approaches to the awareness problem. You can work directly with AI labs to train models on your tools (expensive). You can advertise (uncertain ROI). Or you can do it organically:
"One way to pay it down is to build a great product, get really popular so everyone talks about you, and wait for community-provided training data to appear online for your product."
But waiting is a losing strategy in a fast-moving market. There's a middle path: systematically create high-quality public content that trains models on correct patterns.
Not marketing fluff. Not "10 reasons why Framework X is amazing." The content that actually influences model behavior is technical content with correct, working code examples.
Consider what a model learns from this snippet:
```csharp
var tasks = UseState<List<string>>(() => new());
var newTask = UseState("");

return Layout.Vertical()
    | newTask.ToTextInput(placeholder: "Add a task...")
    | new Button("Add", _ => {
        tasks.Set(list => { list.Add(newTask.Value); return list; });
        newTask.Set("");
    })
    | Layout.Vertical(tasks.Value.Select(t =>
        new Card(Text.P(t))));
```
A model trained on this learns: UseState for state management, .ToTextInput() for inputs, Layout.Vertical() for layout, Button with a callback for actions. These patterns get embedded in weights. Next time a developer asks for a todo app in C#, this pattern has a higher chance of surfacing.
What Correct Code Examples Do
The emphasis on "correct" matters. Models learn from patterns, and they can't distinguish between working code and broken code in training data. If your examples don't compile, the model learns broken patterns. When it later generates code using your framework, it reproduces those broken patterns, the developer has a bad experience, and your framework's reputation suffers.
Every code example that compiles is a vote of confidence. Every one that doesn't is a vote against.
This is why validating code examples is survival strategy, not quality control. An example that demonstrates the right pattern and actually works teaches the model two things: what the API looks like, and that it can be trusted.
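The validation step itself can be automated. Here is a minimal sketch of the idea: pull fenced code blocks out of a markdown file and at least syntax-check them before publishing. It handles only Python snippets via the built-in `compile()`; a real pipeline would compile or run examples in their own language (for the C# example above, an actual build step).

```python
import re

# Matches fenced ```python blocks; the pattern is built without a
# literal triple-backtick so this file can itself sit inside a fence.
FENCE = re.compile("`" * 3 + r"python\n(.*?)" + "`" * 3, re.DOTALL)

def check_snippets(markdown: str) -> list[str]:
    """Syntax-check each fenced Python block; return error messages."""
    errors = []
    for i, match in enumerate(FENCE.finditer(markdown), start=1):
        try:
            compile(match.group(1), f"<snippet {i}>", "exec")
        except SyntaxError as exc:
            errors.append(f"snippet {i}: {exc.msg}")
    return errors

tick = "`" * 3
doc = f"{tick}python\nprint('ok')\n{tick}\n\n{tick}python\ndef broken(:\n{tick}\n"
print(check_snippets(doc))  # only the second snippet fails to parse
```

Wiring a check like this into CI means a broken example blocks the publish, instead of quietly training models on code that doesn't work.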
The Meta-Play
Here's where it gets recursive. An article explaining why frameworks need internet presence - that IS internet presence. The article you're reading right now, with its Ivy code examples, is Lever 4 in action.
This isn't a trick. It's the strategy Yegge describes, applied honestly. The content needs to be genuinely useful - teaching developers about the survival framework while demonstrating real patterns. If the content is pure marketing, readers bounce and engagement signals tell models it's not worth surfacing.
The most effective awareness content is content that would be worth reading even if the framework didn't exist. Yegge's survival framework is interesting on its own. The Ivy examples make it concrete. Both benefit the reader.
What Framework Authors Should Do
- Audit your training data presence. Ask AI assistants about your framework. If they don't know you, you have a lever 4 problem.
- Create compilable examples. Not documentation - content. Blog posts, tutorials, discussions. Each with code that actually works.
- Target the right forums. Reddit's programming communities, Hacker News, dev blogs. These get scraped for training data.
- Make every example correct. Broken code in training data actively hurts you. Validate everything.
- Write about problems, not features. "How to build a dashboard" gets more training data signal than "Introducing Framework X v2.0."
Yegge's advice applies here too: "SEO for agents is on its way." The frameworks that solve the awareness problem now - through genuine, useful, technically correct content - will have a head start when every developer's first instinct is to ask an agent.
The training data you create today determines the recommendations agents make tomorrow.
