Some people think in pictures. Others think in words. The cortex spends roughly thirty percent of its real estate on vision and less than ten on language — and yet we narrate thought as if it were a sentence. This is a field study of the gap.
No one is purely one or the other. Most minds blend — drafting a sentence while a faint image flickers, or rotating a mental object while sub-vocalising its name. But the dominant gear differs from person to person, and the cost of being miscast is real: the visualizer who is taught to "show your work" in equations alone, the verbalizer who is told to "see it in your head" before they can. Schools, workplaces, and cultures often privilege the verbal pole because it is easier to grade. The visualizer's strongest asset — direct, parallel, manipulable mental scenes — is largely invisible to a marker.
A useful asymmetry to remember: language is serial, vision is parallel. A sentence arrives one word at a time; a scene arrives all at once. This asymmetry shows up in everything from how children solve geometry, to how engineers debug systems, to how AI models do — or don't — imagine.
"Words and language, written or spoken, do not seem to play any role in my mechanism of thought."
Special relativity began with a thought experiment: a sixteen-year-old Einstein imagined himself riding alongside a beam of light. The mathematics came later — many years later, by his own account. He repeatedly described his cognition as combinatorial play with "more or less clear images"; words were a translation step, not the substrate.
"I do not need any models, drawings or experiments. I can picture them all in my mind."
Tesla famously claimed to build machines entirely in his head — running them mentally for weeks, then disassembling them to inspect for wear, all before any metal was cut. Whatever the embellishment, his contemporaries attested that working prototypes regularly emerged the first time he committed a design to materials.
"Painting is poetry that is seen rather than felt, and poetry is painting that is felt rather than seen."
Leonardo's notebooks are not illustrated text — they are visual reasoning with text annotations. Anatomy, hydraulics, flight, optics: each investigated by drawing, each conclusion drawn from the drawing rather than from prose. He called the eye "the window of the soul" — for him it was also the workbench.
"My mind is similar to an Internet search engine, set to locate photos."
Grandin's autism is bound up with one of the most extreme cases of visual cognition on record. She designed humane livestock-handling facilities by walking, mentally, through the animal's eye-line — anticipating shadows, reflections, and movement vectors that a verbal-dominant designer would simply not notice. Half of US cattle now pass through systems she designed.
"I have far more confidence in the one man who works mentally and bodily at a matter than in the six who only talk about it."
Faraday had no formal mathematical training. His field-line visualization — invisible curves of force filling space — was a working tool he could literally see in his mind's eye. Maxwell, mathematically gifted and receptive to Faraday's pictures, later wrote the field equations precisely because Faraday had already seen the field.
"I saw the atoms gambolling before my eyes… one of the snakes had seized hold of its own tail."
Kekulé said the ring structure of benzene came to him in a daydream — a snake biting its own tail. Whether that was a genuine reverie or an after-the-fact narrative, the underlying cognitive move is canonical: structural insight arriving as image first, formula second.
"Most of the fundamental ideas of science are essentially simple, and may, as a rule, be expressed in a language comprehensible to everyone."
Albert Einstein · 1938

Three things jump out. First: vision is the dominant tenant. A quarter to a third of the cortical sheet is dedicated to processing what the eye delivers — far more than any other modality. Second: language is small by comparison. The classical Broca and Wernicke regions, even with their extended networks, are a thin slice of the pie. Third: language is not even a unitary subsystem — it borrows generously from motor regions (for articulation), auditory regions (for phonology), and the prefrontal cortex (for syntax and planning).
The naive folk theory — "thought = inner speech" — amounts to the modality with the smallest dedicated footprint claiming the throne. The richer truth is that most of cognition runs below the speech layer, in regions evolution invested in vastly more aggressively.
If language doesn't fit into one cortical district, what is it? Better metaphor: an operating system. Vision, motor, memory, emotion all run as native processes. Language is the shared interface through which they coordinate, schedule, and report — both internally (inner speech) and externally (conversation, instruction, writing).
The implication: a "verbal" thinker isn't running thought in the language layer alone — they are simply logging and routing more of it through the language API. A "visual" thinker keeps more cognition native to the visual subsystem, never serializing it into words. Both are using the whole stack; they differ in which calls cross the OS boundary.
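To make the metaphor concrete, here is a toy sketch in Python; every name in it is invented, a cartoon of the claim rather than a cognitive model.

```python
# Toy illustration of the OS metaphor. Every name here is invented;
# this is a cartoon of the claim, not a cognitive model.

class VisualSubsystem:
    def simulate(self, scene):
        # Native, parallel work: the result is a structured scene,
        # not a sentence.
        return {"object": scene, "rotated": True, "fits": True}

class LanguageAPI:
    def serialize(self, result):
        # Crossing the OS boundary: a parallel result is flattened
        # into a serial report.
        return f"The {result['object']} fits after rotation."

vision, language = VisualSubsystem(), LanguageAPI()
native = vision.simulate("sofa")

# A "verbal" thinker routes the step through the language layer;
# a "visual" thinker keeps operating on `native` directly.
print(language.serialize(native))   # "The sofa fits after rotation."
```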
Large language models are trained almost entirely on text; the medium of cognition is the token sequence. Even when given image inputs, as multimodal LLMs are, the picture is collapsed into embeddings and processed inside the language scaffolding.
Effectively: a brilliant verbalizer with no mind's eye. Excellent at tasks that compress to language. Brittle at tasks that don't.
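For the curious, this is roughly what "collapsed into embeddings" means in practice. The sketch below follows the common LLaVA-style projection pattern; all shapes, names, and random stand-ins are assumptions for illustration, not any specific model's internals.

```python
# Minimal sketch of "collapsed into embeddings": a LLaVA-style linear
# projection from vision-encoder patches into the text-token embedding
# space. All shapes and the random stand-ins are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image):
    # Stand-in for a ViT: any image becomes 196 patch vectors of width 768.
    return rng.normal(size=(196, 768))

def project(patches, d_model=4096):
    # The learned piece: a linear map into the LLM's embedding space.
    W = 0.01 * rng.normal(size=(768, d_model))
    return patches @ W

patch_tokens = project(vision_encoder("any pixels"))
text_tokens = rng.normal(size=(12, 4096))   # an embedded 12-token prompt

# From here on the transformer sees one serial sequence: the picture
# has become 196 pseudo-words inside the language scaffolding.
sequence = np.concatenate([patch_tokens, text_tokens], axis=0)
print(sequence.shape)   # (208, 4096)
```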
Image and video generators operate in pixel and latent visual space. Genuinely fluent in images, motion, lighting. Yet they don't understand what they generate — there is no propositional model, no symbolic reasoning, no goal-directed planning.
Effectively: a visualizer with no inner narrator. Conjures scenes but can't critique them.
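The other half, caricatured under the same caveat: the denoiser stub, shapes, and schedule below are invented, but the moral is faithful. The whole computation is tensor traffic, with no symbolic layer anywhere in the loop.

```python
# Minimal caricature of a latent diffusion sampling loop: the entire
# "thought" is a latent tensor being denoised. The denoiser stub,
# shapes, and schedule are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def denoiser(z, t, prompt_embedding):
    # Stand-in for a trained U-Net / DiT: predicts the noise in z.
    return 0.1 * z + 0.01 * t * prompt_embedding

z = rng.normal(size=(4, 32, 32))       # pure noise latent
prompt = rng.normal(size=(4, 32, 32))  # embedded caption

for t in range(50, 0, -1):             # iterative denoising
    z = z - denoiser(z, t / 50.0, prompt)

# z now "is" the image (a VAE decode, not shown, turns it into pixels).
# At no point did a proposition, symbol, or plan exist in the loop.
print(z.shape)
```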
The two halves of the AI ecosystem map suspiciously well onto the two thinking modes — and neither half, alone, can do what ordinary human cognition does in a coffee-shop conversation.
An internal canvas the model can both generate on and read from. Not "produce an image and forget" — but place an object, rotate it, occlude it, query its new geometry, all without leaving the workspace.
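A toy version of such a workspace (hypothetical design, minimal sketch) makes the defining property visible: queries are answered from the rendered canvas itself, not from the commands that drew on it, and objects persist between operations.

```python
# A toy "mental canvas": a workspace the reasoner writes to and reads
# back from. The design is hypothetical; the point is that queries are
# answered from the rendered state, and objects persist between steps.
import numpy as np

class Canvas:
    def __init__(self, size=64):
        self.grid = np.zeros((size, size), dtype=bool)
        self.objects = {}                       # name -> (N, 2) points

    def place(self, name, points):
        self.objects[name] = np.asarray(points, dtype=float)
        self._render()

    def rotate(self, name, degrees, about):
        t = np.radians(degrees)
        R = np.array([[np.cos(t), -np.sin(t)],
                      [np.sin(t),  np.cos(t)]])
        self.objects[name] = (self.objects[name] - about) @ R.T + about
        self._render()

    def _render(self):
        # Re-rasterise everything: the canvas, not the command history,
        # is the source of truth.
        self.grid[:] = False
        for pts in self.objects.values():
            for x, y in np.rint(pts).astype(int):
                if 0 <= x < self.grid.shape[1] and 0 <= y < self.grid.shape[0]:
                    self.grid[y, x] = True

    def query(self, x, y):
        # "Look at" the workspace to answer a question about geometry.
        return bool(self.grid[y, x])

c = Canvas()
c.place("bar", [(10, 10), (20, 10), (30, 10)])   # a horizontal bar
c.rotate("bar", 90, about=np.array([10.0, 10.0]))
print(c.query(10, 30))   # True: after rotation the bar extends upward
```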
Not just text-to-image. A model that, when reasoning verbally about a system, can spawn a sketch, inspect it, edit it, and let the inspection update its proposition. And vice versa — let an image's contents shape the next sentence.
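A miniature of that loop, with the "image" reduced to a one-line geometric rendering. The scenario and numbers are invented; the point is the direction of information flow, where the inspection rather than the original sentence gets the last word.

```python
# A concrete miniature of the loop: a verbal claim is "sketched" into
# geometry, the sketch is inspected, and the inspection revises the
# claim. The scenario and numbers are invented for illustration.
import math

claim = "a 10x10 plate rotated 45 degrees still fits a 12-unit slot"

def rendered_width(side, degrees):
    # The one-line sketch: bounding width of the rotated square.
    t = math.radians(degrees)
    return side * (abs(math.cos(t)) + abs(math.sin(t)))

width = rendered_width(10, 45)        # inspect: ~14.14 units
if width > 12:
    # The image gets the last word; the proposition is updated.
    claim = f"the rotated plate is {width:.1f} units wide and does not fit"
print(claim)
```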
Mental objects must persist between thoughts. Tesla mentally ran a turbine for weeks; current visual generators forget the previous frame in milliseconds. Without persistence, no simulation, no design, no mental engineering.
Real visualizers learn space by moving through it. AI agents that train in simulated and physical environments — robotics, drones, embodied research labs — accumulate the spatial intuitions a chat-only model never can. You learn the world by bumping into it.
The capacity to swap between a propositional representation ("the lever rotates 30° around point P") and a vivid image of the same — and to debug discrepancies between the two. This is what Faraday and Maxwell did between them; future AI may need to do it inside one model.
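Worked out, the lever example looks like this: the propositional route is an exact rotation matrix, the imagistic route is simulated here as a noisy rendering, and the debugging step is the comparison between them. All values are illustrative.

```python
# The lever example, both ways. Propositional route: an exact rotation
# matrix. Imagistic route: a noisy stand-in for a rendered scene.
# Debugging is the comparison between them. Values are illustrative.
import numpy as np

P = np.array([2.0, 1.0])                        # pivot point P
t = np.radians(30)                              # "rotates 30 degrees"
R = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

lever = np.array([[2.0, 1.0], [6.0, 1.0]])      # lever endpoints

# What the sentence says should happen:
predicted = (lever - P) @ R.T + P

# What the "mind's eye" (here: a jittered rendering) shows:
rendered = predicted + np.random.default_rng(1).normal(0, 0.05, (2, 2))

# The dual-coding move: quantify and chase the discrepancy.
print(f"max mismatch: {np.abs(predicted - rendered).max():.3f} units")
```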
Reasoning models already self-critique in text. The next step is critique in images — a model that looks at its own generated diagram and flags "this gear can't actually mesh" or "the perspective is impossible". Currently the visual side hallucinates with confidence because nothing inside is calling it out.
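A toy of what such a critic might check. The rule below is real gear geometry (meshing spur gears must share a module, the pitch diameter divided by tooth count), but the drawn gears and the checker itself are invented for illustration.

```python
# Toy visual critic. The rule is real gear geometry (meshing spur
# gears must share a module = pitch diameter / tooth count); the drawn
# gears and the checker itself are invented for illustration.
def module(pitch_diameter, teeth):
    return pitch_diameter / teeth

gear_a = {"pitch_diameter": 40.0, "teeth": 20}   # module 2.0
gear_b = {"pitch_diameter": 45.0, "teeth": 30}   # module 1.5

if abs(module(**gear_a) - module(**gear_b)) > 1e-9:
    print("critique: these gears can't actually mesh")
```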