Why AI Coding Tools Misread Your Diagrams (The Claude Code Fix)
AI tools think in text. Diagrams, screenshots, and whiteboard photos force a translation step where information gets lost.
> Key Takeaways
- > But each one is a lossy summary of a region of pixels.
- > Let me show you what happens at fourteen nodes.
- > You might not know this, but everything on this YouTube channel is driven by text files and my agentic setup. Creating video scripts, des...
> Transcript
I was using Claude Code to fix layout bugs in a custom diagramming tool I built. At first, I set up a simple loop and got amazing results very early on. I told it to build to some specification, take screenshots with Playwright, check whether the result was right, and iterate until it worked. I went out for lunch and came back to a really good proof-of-concept diagramming tool. And then it stopped working. As soon as it got into moderately complex cases, tokens per iteration skyrocketed and the model started missing things I felt were obvious. Out of frustration, I stopped giving the system guidance and started telling it exactly what to move, where, and how.
Even with this level of instruction, it was still struggling, and I was no better off than writing code by hand. I changed tactics. I had it parse SVGs as text instead of taking screenshots. Later, I even had it build a special calculator suite for itself to ensure it was always measuring the images the same way. My tokens per iteration dropped, the bug regressions mostly disappeared, and the number of complex cases it could solve on its own went up dramatically, without me hand-holding it through every step. If modern LLMs are image-native models in the first place, why did switching from uploading screenshots to sending text give me this unlock?
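To make "parse SVGs as text" concrete, here is a minimal sketch of the idea, not the actual calculator suite from the video. It reads box geometry straight out of the SVG markup and checks for overlapping rectangles, the kind of layout bug that would otherwise need a screenshot. The diagram, the ids, and the function names `read_boxes` and `overlapping_pairs` are all hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical three-box diagram; ids and coordinates are made up for illustration.
SVG_SOURCE = """<svg xmlns="http://www.w3.org/2000/svg" width="400" height="200">
  <rect id="api" x="20" y="40" width="120" height="60"/>
  <rect id="cache" x="120" y="70" width="120" height="60"/>
  <rect id="db" x="280" y="40" width="100" height="60"/>
</svg>"""


def read_boxes(svg_text):
    """Return {id: (x, y, width, height)} for every <rect> in the SVG."""
    root = ET.fromstring(svg_text)
    boxes = {}
    for rect in root.iter(SVG_NS + "rect"):
        boxes[rect.get("id")] = tuple(
            float(rect.get(attr, "0")) for attr in ("x", "y", "width", "height")
        )
    return boxes


def overlapping_pairs(boxes):
    """Return id pairs (in sorted order) whose rectangles overlap with nonzero area."""
    ids = sorted(boxes)
    pairs = []
    for i, a in enumerate(ids):
        ax, ay, aw, ah = boxes[a]
        for b in ids[i + 1:]:
            bx, by, bw, bh = boxes[b]
            # Standard axis-aligned overlap test on both axes.
            if ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah:
                pairs.append((a, b))
    return pairs


boxes = read_boxes(SVG_SOURCE)
print(overlapping_pairs(boxes))
```

Because the measurement lives in code, every iteration measures the same way, and the agent only has to read a short list of id pairs instead of interpreting pixels.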
The underlying pattern here has a name, text native, and it comes from the catalog of augmented coding patterns. Link in the description if you want to take a look after this video. Today, I'm going to cover why text gives you better results in situations like this, what is really going on with modern LLMs when you send them a picture, and why text is a better option for almost all of these tasks. When you write A arrow B in text, the model reads that with only a few tokens, and with very high precision about what the input tokens are. Images work slightly differently. A vision encoder slices the picture into patches and processes each patch into vision tokens, which then get placed into an embedding space.
Now, in most frontier models, the text tokens and the vision tokens live in the same embedding space, but each vision token is a lossy summary of a region of pixels. It still works, and there are still real situations where you want the vision side of your LLM to work. But I would not recommend using vision LLMs when you have structured information that you can simply get in a text-equivalent format. Use that instead. I wanted to show you the quality degradation in practice, so I set up a little experiment on Opus 4.7. I generated four Mermaid diagrams ranging from simple to complex, and I gave it the same task each time: list every component directly connected to the cache, count the nodes, and count the edges.
I sent each task to Opus 4.7 two ways: first with the Mermaid source text, and again in a new context window with the task and the rendered PNG. So let's look at what happens at 14 nodes. The text-only approach got it right, fast and cheap. The image approach misread the diagram and said there's a dependency on the message queue. A bit tricky, but a human looking carefully could have spotted this. It also cost six and a half times what the text prompt cost, and the text prompt gave the correct answer. At 28 nodes, the text prompt still has no problems, but the image prompt missed two real neighbors and invented two that were not there.
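The three questions in that task have exact answers once the diagram is text. As a rough sketch, assuming a very narrow subset of Mermaid flowchart syntax (only plain `-->` edges with optional square-bracket labels), the following answers them deterministically. The diagram is a made-up example, not one of the four from the experiment, and `parse_edges` and `neighbors` are hypothetical helper names.

```python
import re

# Hypothetical diagram, not one of the four used in the experiment.
MERMAID_SOURCE = """flowchart LR
    web[Web App] --> api[API Gateway]
    api --> cache[Cache]
    api --> queue[Message Queue]
    cache --> db[Database]
    worker[Worker] --> queue
    worker --> db
"""

# Matches only the simplest edge form: `id[optional label] --> id[optional label]`.
# Real Mermaid has many more arrow and node shapes that this ignores.
EDGE = re.compile(r"^\s*(\w+)(?:\[[^\]]*\])?\s*-->\s*(\w+)(?:\[[^\]]*\])?\s*$")


def parse_edges(source):
    """Return (source_id, target_id) pairs for every simple `-->` edge line."""
    edges = []
    for line in source.splitlines():
        match = EDGE.match(line)
        if match:
            edges.append((match.group(1), match.group(2)))
    return edges


def neighbors(edges, node):
    """Return the sorted ids directly connected to `node` in either direction."""
    found = set()
    for src, dst in edges:
        if src == node:
            found.add(dst)
        elif dst == node:
            found.add(src)
    return sorted(found)


edges = parse_edges(MERMAID_SOURCE)
nodes = {name for edge in edges for name in edge}

print("neighbors of cache:", neighbors(edges, "cache"))
print("node count:", len(nodes))
print("edge count:", len(edges))
```

In this example the message queue shares a neighbor with the cache but is not connected to it, which is the kind of near miss the image prompt got wrong.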
And these are tidy computer-generated diagrams. A hand-drawn one on a whiteboard would fare even worse. So why is the model doing this? Let me show you what's happening inside any vision-capable model. The image gets chopped into a bunch of patches. Each one becomes a vector with thousands of dimensions. At the first projection layer, four kilobytes per patch isn't unheard of, so this whole image is already over two megabytes of vector data, and deeper layers add even more.
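The arithmetic behind those figures can be checked with a few lines. Every number below is an illustrative assumption chosen to match the "four kilobytes per patch" figure, not a measurement of any specific model: a 336 by 336 pixel image, 14-pixel patches, 2048 dimensions per patch, and 16-bit floats.

```python
import math

# All values are illustrative assumptions, not measurements of a specific model.
IMAGE_WIDTH = 336      # pixels
IMAGE_HEIGHT = 336     # pixels
PATCH_SIZE = 14        # pixels per patch side
EMBED_DIM = 2048       # dimensions per patch vector
BYTES_PER_VALUE = 2    # 16-bit floats


def patch_count(width, height, patch):
    """Number of patches needed to tile the image, rounding partial patches up."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def bytes_per_patch(dim, bytes_per_value):
    """Size in bytes of one patch vector."""
    return dim * bytes_per_value


patches = patch_count(IMAGE_WIDTH, IMAGE_HEIGHT, PATCH_SIZE)
per_patch = bytes_per_patch(EMBED_DIM, BYTES_PER_VALUE)
total_bytes = patches * per_patch

print(f"{patches} patches x {per_patch} bytes = {total_bytes / 1_000_000:.2f} MB")
```

Under these assumptions the image becomes 576 patches at 4,096 bytes each, about 2.36 MB of vector data, while the text "A arrow B" is a handful of tokens.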
Then those vectors flow through layers of something called attention, which is where every patch learns what other patches are most relevant to it. You can see, for example, when we pick one patch, what highlighted areas the model thinks are most relevant to that patch. We can see what this looks like for our entire image in a dimensionally compressed space, where similar things drift closer together. By the end, the diagram structure is gone. We picked one square and looked at what the model was actually attending to. Instead of the boxes it's connected to, it's pulling from these other boxes, just because they look similar and it's trained to look at them. Researchers have some idea why images do not work as well as text.
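A toy version of the attention step shows why look-alike patches can win. This sketch computes scaled dot-product attention weights for one patch against all patches, using made-up 4-dimensional vectors. It is a simplification: real encoders apply learned query and key projections, use many heads and layers, and work with far larger vectors, so treat the numbers as illustration only.

```python
import math

# Hypothetical patch vectors; the names and values are invented for illustration.
PATCHES = {
    "box_cache": [1.0, 0.9, 0.1, 0.0],
    "box_queue": [1.0, 0.8, 0.2, 0.0],    # looks like box_cache, not connected to it
    "arrow_to_db": [0.1, 0.2, 1.0, 0.3],  # the edge that actually leaves box_cache
    "background": [0.0, 0.1, 0.0, 1.0],
}


def attention_weights(query_name, patches):
    """Softmax of scaled dot products between one patch and every patch."""
    query = patches[query_name]
    scale = math.sqrt(len(query))
    scores = {
        name: sum(q * k for q, k in zip(query, vec)) / scale
        for name, vec in patches.items()
    }
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


weights = attention_weights("box_cache", PATCHES)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {weight:.2f}")
```

With these vectors, the unconnected but similar-looking box gets more weight than the arrow that carries the actual relationship, because the score depends on visual similarity and not on graph structure.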
A 2024 paper titled Eyes Wide Shut put it this way: vision encoders overlook crucial visual details and systematically fail to sort important visual patterns. The encoder itself has blind spots. Scaling doesn't fix them. A 2025 paper specifically on diagrams reframes it: diagrams are a form of visual language that encodes abstract concepts and relationships through structured symbols. They pose unique challenges distinct from natural image processing.
The failures here compound. One wrong patch corrupts the agent's understanding for the rest of the conversation. Once you embrace text, almost every other step in your workflow gets simpler, cheaper, and more accurate. You might not know this, but everything on this YouTube channel is driven by text files and my agentic setup. Creating video scripts, designing all the graphic assets, and even editing the video itself are over 90% text-driven. I do the last 10% myself, the part I don't know how to put into words yet. I haven't spent any time editing video graphics by hand.
The channel banner, my website, all of it is just text files and a Git repo. When I want to change part of my video production pipeline, I work with my agent to edit a file or build a deterministic tool. I do use image capabilities for things like automatically checking whether the right thing is on screen at the right time. But Claude never really sees my videos, and it doesn't need to. Almost all my time goes into video plans. The rest is 15 minutes of recording, minor tweaks on video edits, and a bit of stuff YouTube won't put onto an API. So why have a vision model at all?
Well, screenshots are the right choice when the pixels themselves are the answer: a visual bug, identifying something in a photo, a visual design review, something you truly need eyes on, not a symbolic data model that was thrown into a PNG. Be sure to apply this distinction in your agentic architectures if you want to see reliability go up and costs stay low. If you want to go deeper, I teach hands-on courses for developers and teams who want to get good at agentic development. I also work with leadership groups figuring out how to run their orgs in this new world. Links in the description. If this is useful, subscribe.
New videos on agentic programming every week. Until then, remember, Ship the Loop.