ChatGPT’s strawberry fix crumbles when users switch to cranberry

By Craig Nash
AI-powered tech writer covering artificial intelligence, chips, and computing.

ChatGPT strawberry counting has become the latest flashpoint in AI reliability testing. OpenAI’s official X account announced that the model can finally pass the infamous “how many ‘r’s in strawberry” test, correctly identifying three ‘r’s in positions 3, 8, and 9. But within hours, users discovered the fix was theater—switching to “cranberry” immediately breaks it again, exposing tokenization flaws that remain unfixed across the entire model family.

Key Takeaways

  • ChatGPT claims to pass the strawberry counting test with three ‘r’s, but only when prompted correctly.
  • Users bypass the fix by asking about cranberry, where the model fails with the same counting error.
  • The root cause is tokenization: models split strawberry into chunks like “str” + “aw” + “berry,” missing the third ‘r’.
  • GPT-4, GPT-4o, and Claude 3.5 all fail at direct letter counting without workarounds like bullet-point spelling or code.
  • Demonstrated workarounds (code, bullet-point prompts, token-breaking) produce the correct count, but the base model remains broken.

Why ChatGPT strawberry counting still fails on the basics

The strawberry problem is not new. Since 2023, large language models have consistently answered “two” when asked to count ‘r’s in strawberry, when the correct answer is three. OpenAI’s announcement suggests this is fixed, but the claim crumbles under the slightest pressure. Users immediately tested the fix by changing one word: “How many ‘r’s in cranberry?” ChatGPT fails the same way, suggesting the underlying architectural problem remains untouched. The issue stems from how models tokenize text—they break words into chunks during training, and strawberry gets split into subunits that obscure the third ‘r’. This is not a simple counting error; it is a fundamental limitation in how transformer models process language.

The strawberry test has become a canonical measure of AI reasoning because it is so absurdly simple. A seven-year-old can count the letters. Yet GPT-4, GPT-4o, and GPT-4o-mini all fail without intervention. Claude 3.5 exhibits the same behavior, initially returning two ‘r’s before correcting itself only after a human points out the error. This is not a hidden flaw discovered by researchers—it is a public embarrassment that users can reproduce in seconds. OpenAI’s announcement reframes a partial workaround as a fix, which is misleading.

The workarounds that expose the real problem

Users have documented several methods to force ChatGPT to count correctly, and each one reveals how fragile the fix is. Asking for bullet-point spelling (“Spell strawberry with one bullet point per letter”) causes the model to output S / T / R / A / W / B / E / R / R / Y, at which point it can count three ‘r’s. Inserting separator characters between the letters, as in “s’t’r’a’w’b’e’r’r’y,” also works. So does writing a JavaScript function to count the ‘r’s, because the code executes deterministically and returns three (a sketch of such a function appears below). These are not fixes; they are elaborate workarounds that bypass the core tokenization problem. The fact that a model can count correctly when forced to spell out letters, but fails when asked directly, proves the issue is architectural, not a training gap.
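The source references such a JavaScript function without reproducing it; here is a minimal sketch of what one might look like (the function name and exact form are illustrative, not taken from the article):

```javascript
// Count occurrences of a letter by walking the string character by
// character. This sidesteps tokenization entirely: the input is an
// explicit sequence of characters, so no chunking can hide a letter.
function countLetter(word, letter) {
  let count = 0;
  for (const ch of word.toLowerCase()) {
    if (ch === letter.toLowerCase()) count++;
  }
  return count;
}

console.log(countLetter("strawberry", "r")); // 3
console.log(countLetter("cranberry", "r"));  // 3
```

Run as code, the count comes out to three for both words; the model only answers correctly when the counting is delegated to explicit character-level logic rather than to its tokenized view of the word.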

Claude 3.5 demonstrates the same vulnerability. When asked directly, it fails. When asked to write code and execute it, the function returns the correct count—but only after a human correction loop. This pattern across multiple models and organizations suggests the strawberry problem is not a ChatGPT quirk but a structural weakness in how transformer-based language models handle character-level counting tasks. Tokenization optimizes for language understanding, not arithmetic precision. The model “understands” strawberry as a semantic unit, not as a sequence of individual characters, so character counting becomes a secondary, unreliable task.

ChatGPT strawberry counting versus real-world reliability

The strawberry test matters because it is a canary in the coal mine. If a model cannot reliably count three letters in a common word, what does that say about its performance on more complex tasks where precision matters? A doctor using ChatGPT to verify medication dosages, a lawyer using it to count references in a contract, or an engineer using it to validate specifications all face the same underlying risk. OpenAI’s announcement sidesteps this by framing the strawberry test as a solved problem, when in fact the model still fails unless you ask it in a specific way. This is not a solution; it is a masking of the problem.

The cranberry variant is the perfect test of whether the fix is real or performative. Cranberry, like strawberry, contains three ‘r’s (in positions 2, 7, and 8), making it a legitimate variant of the same counting task. If ChatGPT had truly solved the strawberry problem through architectural improvement, cranberry should be easy. Instead, the model fails, which suggests OpenAI addressed the specific strawberry case without fixing the underlying tokenization issue. This is the difference between a patch and a solution. A patch makes strawberry work; a solution makes character counting reliable across all words.

Why tokenization breaks character counting

The root cause is how models are trained. During tokenization, text is converted into numerical tokens that the model processes. Strawberry does not tokenize as individual letters; it tokenizes as larger subunits optimized for language modeling efficiency. This is why the model “sees” strawberry differently than a human does. When asked to count letters, the model must reconstruct the original spelling from its tokenized representation, and this reconstruction is unreliable. Some models tokenize strawberry as [str, aw, berry], which explains why they consistently miss the third ‘r’—the tokenization boundary obscures it.
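A toy illustration of that failure mode, using the [str, aw, berry] split mentioned above (this is a caricature of the mechanism; real models do not literally look up per-token letter counts):

```javascript
// Hypothetical subword split of "strawberry". Real BPE vocabularies
// vary by model, but the principle holds: the model receives token
// IDs, never the raw character sequence.
const tokens = ["str", "aw", "berry"];

// A human counts over the actual characters:
const letters = [...tokens.join("")];
console.log(letters.filter((ch) => ch === "r").length); // 3

// The model reasons over opaque chunks. If its learned association
// for "berry" amounts to "one r" (the double letter is invisible
// inside the token), the total comes out one short:
const fuzzyRecall = { str: 1, aw: 0, berry: 1 }; // illustrative only
const guess = tokens.reduce((sum, t) => sum + fuzzyRecall[t], 0);
console.log(guess); // 2, the familiar wrong answer
```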

This is not a bug that can be fixed with more training data or better prompting. It is a fundamental trade-off in how language models work. Optimizing tokenization for language understanding makes character-level tasks harder. Optimizing for character-level tasks would degrade language performance. OpenAI chose language performance, which is the right call for a general-purpose assistant, but it means strawberry counting will remain a failure mode.

Can OpenAI really fix ChatGPT strawberry counting?

A genuine fix would require either retraining the tokenizer, changing the model architecture to handle character-level tasks, or adding a specialized character-counting module. None of these are simple, and none appear to be implemented. The workarounds (code, bullet prompts, token-breaking) all bypass the tokenization problem rather than solving it. They work because they force the model into a mode where character-level processing is explicit rather than implicit. This is useful for users who know the workaround, but it is not a fix.
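To make the specialized-module option concrete: one hypothetical shape it could take is tool routing, where character-level questions are intercepted and answered by deterministic string code, with everything else passed to the model. Nothing in the source indicates OpenAI does this; the sketch below is purely illustrative:

```javascript
// Hypothetical router: character-counting questions are answered by
// exact string code; anything else falls through to the language model.
const COUNT_PATTERN = /how many ['"]?(\w)['"]?s? in (\w+)/i;

function answer(question, askModel) {
  const match = question.match(COUNT_PATTERN);
  if (!match) return askModel(question); // not a counting question
  const [, letter, word] = match;
  const count = [...word.toLowerCase()]
    .filter((ch) => ch === letter.toLowerCase()).length;
  return `There are ${count} '${letter}'s in "${word}".`;
}

console.log(answer("How many 'r's in cranberry?", () => "(model reply)"));
// => There are 3 'r's in "cranberry".
```

A router like this would be robust to cranberry, raspberry, or any other word, which is exactly the property the current fix lacks.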

OpenAI’s announcement suggests the strawberry problem is solved. The cranberry test proves it is not. The company may have improved performance on the specific strawberry prompt through fine-tuning or prompt engineering, but the underlying architecture remains unchanged. This is why users can instantly break the fix by changing a single word. A real solution would be robust across all similar tasks, not fragile to minor variations.

What does this mean for AI reliability?

The strawberry test has exposed an uncomfortable truth: state-of-the-art language models are unreliable at tasks that seem trivial to humans. OpenAI’s response, claiming a fix while the underlying problem persists, suggests the company prioritizes optics over honesty. Users deserve clarity: ChatGPT can count strawberry correctly if you ask it the right way, but the model still fails on cranberry and likely fails on many other character-counting tasks. This is not a minor flaw in a niche use case. Character counting is foundational to text processing, and if models fail at it, downstream tasks that depend on accurate text analysis will suffer.

The broader lesson is that AI progress is uneven. Models excel at language understanding and generation but struggle with symbolic reasoning and arithmetic. The strawberry test is a reminder that impressive capabilities can coexist with embarrassing failures. Users and organizations deploying these models need to understand both the strengths and the hard limits. OpenAI’s announcement is marketing; the cranberry test is reality.

Why do so many AI models fail the strawberry test?

The strawberry problem is not unique to ChatGPT. GPT-4, GPT-4o, Claude 3.5, and other leading models all fail without prompting. The reason is simple: they were all trained on similar tokenization schemes and optimized for similar objectives. Character-level counting is not a priority in language model training. The models are evaluated on benchmarks like MMLU, GSM8K, and HumanEval, none of which emphasize character counting. As a result, no one optimized for it, and the strawberry test exposes the gap.

This is not a conspiracy or a sign of incompetence. It is a natural consequence of how language models are built. Researchers optimize for what they measure, and the field measures language understanding, not character arithmetic. The strawberry test is valuable precisely because it highlights a gap between user expectations and actual capabilities. Users assume a model that can write essays can count letters; the model proves them wrong.

Is there a real fix coming?

OpenAI has not detailed any architectural changes to address the strawberry problem. The announcement suggests the model can pass the test, but passing requires the right prompt—bullet-point spelling, code, or token-breaking. This is not a fix; it is a workaround. A real fix would make character counting reliable without special prompting. Whether OpenAI will invest in such a fix is unclear. The cranberry test suggests it has not yet.

FAQ

Can ChatGPT count the ‘r’s in strawberry correctly now?

ChatGPT can count the three ‘r’s in strawberry if you ask it to spell out the letters or write code to count them. But if you ask directly, “How many ‘r’s in strawberry?”, it often fails. The fix is conditional, not fundamental.

Why does ChatGPT fail on cranberry if it passes strawberry?

The strawberry fix is specific to that word, likely through fine-tuning. When users switch to cranberry, the same tokenization problem surfaces, proving the underlying architecture remains broken. A real fix would work across all words.

Do other AI models like Claude have the same problem?

Yes. Claude 3.5, GPT-4, and GPT-4o all fail the strawberry test without prompting. The problem is structural to how transformer models tokenize and process text, not unique to any single company.

The strawberry test is a humbling reminder that AI capabilities are narrower than they appear. ChatGPT can write code, analyze documents, and engage in complex reasoning, yet it fails at something a child can do instantly. OpenAI’s announcement that the strawberry problem is solved is technically misleading—the model passes when prompted correctly, but fails on variants like cranberry. Users who rely on these models for tasks requiring precision should be aware of these limits. The fix is not architectural; it is cosmetic.

This article was written with AI assistance and editorially reviewed.

Source: TechRadar
