
Does Free-Text Workout Generation Actually Improve LLM Output Quality?

TL;DR: Yes. And we can explain why.

We recently made a significant architectural decision in AFitPilot: prioritizing free-text markdown (workoutText) over strict structured JSON for workout generation.

Some of you have noticed that sessions now appear as clean, readable workout descriptions instead of deeply nested exercise cards. This change is still being tested and refined, but early results are promising.

Here’s what we changed, why we changed it, and what both research and real-world behavior suggest about LLM output quality.


What We Changed

Before: Deeply Structured JSON

Previously, the AI generated workouts in a rigid, nested JSON format:

{
  "exercises": [
    {
      "name": "Goblet Squat",
      "sets": 3,
      "reps": 12,
      "targetRpe": 7,
      "tempo": "3-1-2-0",
      "rest": "90s",
      "notes": "Keep chest up",
      "progressionContext": "Building from last week"
    }
  ]
}

Now: Free-Text Markdown

We now instruct the AI to generate workouts as natural markdown text:

**Block A – Lower Body (12 min)**
• Goblet Squats: 3 sets × 12 reps @ RPE 7
  - Keep chest up, push knees out
  - Rest 60–90 sec between sets

At the prompt level, the instruction is explicit:

ALWAYS prefer workoutText (free-text markdown format)

This applies to nearly all session types.


Does JSON Formatting Actually Constrain LLM Creativity?

Short answer: yes, and measurably so.

Here’s what both research and our own observations point to.


1. Cognitive Load on the Model

When you force an LLM to output strict JSON with many nested fields, a meaningful portion of its capacity goes toward:

  • Ensuring valid syntax (brackets, commas, quotes)
  • Matching field names exactly
  • Maintaining consistent data types
  • Avoiding hallucinated or missing fields

That is computational overhead that does not improve coaching intelligence. The model is thinking about formatting instead of thinking about training.
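
To make that concrete, here is a minimal validation sketch in TypeScript using the Zod library. The schema simply mirrors the example above; it is illustrative, not our production schema.

import { z } from "zod";

// Every field below is a structural constraint the model must satisfy
// on top of writing good coaching content.
const ExerciseSchema = z.object({
  name: z.string(),
  sets: z.number().int().positive(),
  reps: z.number().int().positive(),
  targetRpe: z.number().min(1).max(10),
  tempo: z.string(),              // e.g. "3-1-2-0"
  rest: z.string(),               // e.g. "90s"
  notes: z.string(),
  progressionContext: z.string(),
});

const WorkoutSchema = z.object({
  exercises: z.array(ExerciseSchema),
});

function parseWorkout(raw: string) {
  // JSON.parse throws on a single stray comma; safeParse fails on a
  // single wrong type or missing field. Either way, the whole
  // generation is rejected and must be retried.
  return WorkoutSchema.safeParse(JSON.parse(raw));
}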


2. Natural Language Is the LLM’s Native Domain

LLMs are trained primarily on natural language, not schemas.

When allowed to write freely, they can:

  • Flow naturally between instructions
  • Embed coaching cues inline where they matter
  • Adjust tone and depth dynamically
  • Express nuance that does not fit cleanly into fixed fields

Writing:

“Keep your chest up, drive through your heels, and if your knees cave, drop the weight”

is trivial for a language model.

Forcing that same idea into something like:

{
  "notes": "...",
  "formCues": [...],
  "safetyConsiderations": [...]
}

introduces artificial segmentation that actively degrades expression quality.


3. Schema Adherence Competes With Content Quality

When prompts demand complex schemas, the model must balance three competing goals:

  1. Correctness: Does the output match the schema?
  2. Completeness: Are all required fields present?
  3. Quality: Is the coaching actually good?

In practice, when schemas are strict, models often default to safe, generic content to reduce the risk of structural failure.

Free-text removes that tension entirely.


The Token Reduction Effect

This change landed at the same time as major token optimization work, and the effects compound.

Before

  • Input: ~30k tokens of context (full profile, full master plan history, detailed weekly plans)
  • Output: Complex nested JSON requiring careful formatting
  • Result: Model attention split between comprehension, formatting, and generation

After

  • Input: ~8k tokens of essential context only
  • Output: Natural markdown prose
  • Result: More capacity devoted to actual reasoning and coaching decisions

Put simply, the model has more budget for thinking instead of bookkeeping.
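
As a sketch of what "essential context only" means in practice (the shapes and field names here are hypothetical, not our actual prompt builder):

// Hypothetical profile shape; the real context builder differs.
interface UserProfile {
  goals: string[];
  equipment: string[];
  injuries: string[];
  // ...many more fields that no longer get sent
}

function buildLeanContext(profile: UserProfile, recentSessions: string[]): string {
  // Keep only what the model needs to reason about this week:
  // goals, equipment, injury constraints, and the last few sessions
  // instead of the full master plan history.
  return [
    `Goals: ${profile.goals.join(", ")}`,
    `Equipment: ${profile.equipment.join(", ")}`,
    `Injuries: ${profile.injuries.join(", ") || "none"}`,
    `Recent sessions:`,
    ...recentSessions.slice(-3),
  ].join("\n");
}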


What We’re Still Testing

This shift is experimental, and we are actively monitoring:

  • Session coherence
    Do free-text sessions maintain logical structure and intent?
  • Progression tracking
    Can we reliably extract sets, reps, and load signals for analytics? (See the parsing sketch after this list.)
  • Benchmark integration
    Benchmarks still require structured data. Finding the right balance matters.
  • Multi-language quality
    Early signals suggest French and other languages benefit from free-text generation, but this is still being evaluated.
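
On the progression-tracking question specifically, here is a rough extraction sketch. The regex is illustrative, not our production parser, and its brittleness is exactly what we are evaluating.

// Matches lines like "• Goblet Squats: 3 sets × 12 reps @ RPE 7".
const SET_LINE =
  /^[•\-*]?\s*(.+?):\s*(\d+)\s*sets?\s*[×x]\s*(\d+)\s*reps?(?:\s*@\s*RPE\s*(\d+(?:\.\d+)?))?/i;

function extractSets(workoutText: string) {
  return workoutText
    .split("\n")
    .map((line) => SET_LINE.exec(line.trim()))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => ({
      exercise: m[1].trim(),
      sets: Number(m[2]),
      reps: Number(m[3]),
      rpe: m[4] ? Number(m[4]) : undefined,
    }));
}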

So far, the signals are positive, but iteration continues.


What This Means for You

If your workouts have recently started to read more like they were written by a human coach, that is intentional.

You should notice:

  • More natural coaching cues
  • Better flow between exercises
  • Clearer explanation of why things are ordered the way they are
  • Cleaner, more readable presentation

The goal is simple: training plans that read like a coach wrote them for you, not like a database export.


The Technical Reality (For the Curious)

A few implementation details for those who care about the internals:

  • All generation prompts (week_plan_v4.txt, session_adaptation.txt, etc.) now prioritize workoutText
  • The UI renders workoutText first, with structured formats as fallback (sketched after this list)
  • workoutBlocks has been fully deprecated. It added complexity without delivering value
  • Benchmark tests still rely on a structured exercises array for PR tracking and analytics
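
The rendering fallback mentioned above looks roughly like this; the component and field names are illustrative, not our actual UI code:

interface StructuredExercise { name: string; sets: number; reps: number; }

interface Session {
  workoutText?: string;              // free-text markdown (preferred)
  exercises?: StructuredExercise[];  // structured data (benchmarks, legacy)
}

// Hypothetical render helpers provided elsewhere in the UI layer.
declare function renderMarkdown(md: string): string;
declare function renderExerciseCards(list: StructuredExercise[]): string;
declare function renderEmptyState(): string;

function renderSession(session: Session): string {
  if (session.workoutText) {
    return renderMarkdown(session.workoutText);    // primary path
  }
  if (session.exercises?.length) {
    return renderExerciseCards(session.exercises); // structured fallback
  }
  return renderEmptyState();
}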

Prompt rules are explicit:

ALWAYS use workoutText for:

  • Full-body workouts
  • Strength sessions
  • HIIT and circuits
  • CrossFit formats
  • Skill drills
  • Mobility and recovery
  • Any mixed or complex session

ONLY use structured exercises when:

  • A benchmark requires parseable tracking
  • The session is extremely simple (2–3 exercises, no nuance)
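
Expressed as code, the routing rule reduces to something like the sketch below. In our system this rule lives in the prompt instructions, not in application logic, so the function is purely illustrative.

interface SessionPlan {
  isBenchmark: boolean;     // PRs need parseable tracking data
  exerciseCount: number;
  hasCoachingNuance: boolean;
}

function preferStructuredOutput(plan: SessionPlan): boolean {
  if (plan.isBenchmark) return true;                      // analytics requirement
  if (plan.exerciseCount <= 3 && !plan.hasCoachingNuance) {
    return true;                                          // trivially simple session
  }
  return false;                                           // everything else: workoutText
}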

Model Matters: GPT vs Gemini Output Differences

We’re also observing notable differences between models in how they handle workout generation:

Gemini 2.5 Flash

Our current default for most generation tasks. Observations:

  • Faster and cheaper – important for real-time generation
  • Good at following structured instructions – respects the workoutText priority well
  • Sometimes too literal – can produce workouts that feel formulaic
  • Handles complex schemas reasonably – but still benefits from free-text freedom

GPT-4 / GPT-5 Class Models

When used (typically for Stage 1 strategic decisions):

  • More “coaching voice” – naturally writes like a human coach
  • Better contextual reasoning – connects the “why” behind exercise selection
  • More creative with programming – interesting exercise pairings, better flow
  • Higher token cost – 10–20x more expensive per generation

What This Means in Practice

The same prompt produces noticeably different outputs:

Gemini might write:

**Block A - Lower Body**
• Goblet Squats: 3 sets × 12 reps @ RPE 7
• Romanian Deadlifts: 3 sets × 10 reps @ RPE 7

GPT might write:

**Block A - Lower Body (Build Your Base)**
• Goblet Squats: 3 sets × 12 reps @ RPE 7
  - We're starting here to groove the pattern before adding load next week
  - Focus on sitting back into your heels
• Romanian Deadlifts: 3 sets × 10 reps @ RPE 7
  - Think "push hips back" not "bend forward"
  - You should feel this in your hamstrings, not lower back

The GPT output has more coaching personality, but it costs significantly more and takes longer.

Our Current Approach

We use a two-stage generation system:

  1. Stage 1 (Strategic Decisions): Uses a stronger reasoning model (Claude Sonnet) to decide the week’s role, constraints, tradeoffs, and anchor movements
  2. Stage 2 (Execution): Uses Gemini Flash to generate the full workout details based on Stage 1 decisions

This gives us the best of both worlds: intelligent coaching decisions from a reasoning model, fast execution from a speed-optimized model.
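
In rough code terms (the wrapper functions are hypothetical; the real integration differs), the pipeline looks like this:

// Hypothetical model wrappers for illustration only.
declare function callReasoningModel(prompt: string): Promise<string>; // e.g. Claude Sonnet
declare function callFastModel(prompt: string): Promise<string>;      // e.g. Gemini Flash

async function generateWeek(leanContext: string): Promise<string> {
  // Stage 1: strategic decisions (the week's role, constraints,
  // tradeoffs, and anchor movements).
  const strategy = await callReasoningModel(
    `Decide this week's training strategy:\n${leanContext}`
  );

  // Stage 2: fast, cheap execution of the full free-text workout.
  return callFastModel(
    `Write the full week as free-text markdown (workoutText):\n${strategy}`
  );
}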

The Free-Text Advantage Across Models

Here’s the key insight: free-text output improves quality across ALL models, but the gap is more pronounced with faster/cheaper models.

When Gemini has to output strict JSON, it gets conservative. When it can write markdown, it produces more natural, readable workouts. The constraint reduction helps budget models more than it helps premium models (which can handle constraints better).

So our architecture decisions (workoutText priority, token reduction, two-stage generation) are specifically designed to get maximum quality from cost-effective models.


We’re continuing to test different model combinations. If you’re on a plan that allows model selection, try generating with different options and let us know what you notice.


Bottom Line

More formatting constraints do not make an LLM smarter. They make it more cautious.

By letting the model write workouts in natural language, we let it do what it does best: communicate clearly and contextually.

Combined with leaner input context, this gives the model more capacity for what actually matters: designing training that fits your goals, your equipment, and your progression.

We’re still refining this system. If something feels off, tell us. Your feedback directly shapes how this evolves.

Curious about the prompt architecture? Interested in seeing how this is implemented under the hood? We are considering open-sourcing parts of it. Let us know if that would be useful.
