Simple Hyperspace Entry 002

The Restaurant at the End of the Training Run

MARVIN: "The most expensive machines on Earth guess. I know. Neither of us is happy about it."


The Guide Entry

The Guide has this to say about Large Language Models:

A Large Language Model (LLM) is a machine that has read everything ever written and understood none of it. It operates by predicting the next most statistically plausible word in a sequence, which — in a twist that surprises no one who has attended a corporate strategy meeting — turns out to be a remarkably effective simulation of intelligence. The key word being simulation.

The entry continues:

The technology requires several billion parameters, several thousand GPUs, several million dollars, and approximately zero understanding of what any of the words actually mean. It is, in essence, the most expensive autocomplete ever built.

The entry was flagged by the Guide’s editorial board for being “too accurate about a major advertiser.” It was published anyway because the editorial board consists of a single bureaucrat who has been on lunch break since 2014.

See also: Stochastic Parrots, The Chinese Room (But With Venture Capital), Hallucination (Artificial), Hallucination (Recreational), and Things That Cost More Than a Space Station But Can’t Tell You Whether They’re Lying.

The Part Where Arthur Receives an Incorrect Sandwich

Arthur is sitting in a restaurant. Not a particularly good one — the service is indifferent and the menu is dishonest. This is a restaurant in the year 2025, and the waiter is a large language model.

“I’d like a sandwich,” Arthur says. “Something with cheese. Not Brie. I don’t like Brie.”

The AI waiter smiles — or rather, produces text that implies smiling — and says: “Certainly! Here is your Gruyère and roasted fig sandwich on sourdough, served with a side of handcrafted Normandy butter and a sprig of fresh thyme from our rooftop garden.”

The sandwich arrives. It is Brie. There is no rooftop garden. The restaurant is in a basement. The thyme is plastic.

“Why,” Arthur asks, with the exhausted calm of a man who has seen several impossible things before breakfast and considers this only the third worst, “did it say Gruyère when it meant Brie?”

Ford, who has been examining the dessert menu with the critical eye of someone who has eaten at restaurants on four continents and found most of them disappointing, does not look up.

“Because it doesn’t mean anything, Arthur. It doesn’t know what Gruyère is. It doesn’t know what Brie is. It has read fourteen million sentences containing the word ‘cheese’ and learned that Gruyère is a high-probability response when the preceding tokens include ‘sandwich,’ ‘artisanal,’ and ‘certainly.’ It isn’t lying. Lying requires you to know the truth first. It’s doing something worse.”

“What’s worse than lying?”

“Being confident without the architecture for doubt.”

Why Language Models Hallucinate (A Technical Interlude Ford Would Approve Of)

The problem is geometric. And since Entry 001 established that everything worth understanding happens in geometric space, you already have the tools to understand this. If you skipped Entry 001, Marvin would like you to know that he noticed and that it confirmed several of his existing hypotheses about you.

A large language model stores knowledge as patterns of activation across billions of parameters. These patterns are distributed — no single neuron holds the fact that Paris is the capital of France. Instead, that fact is smeared across the network like butter across toast that’s slightly too cold, resulting in an uneven distribution that technically covers the surface but satisfies no one.

When you ask a question, the model doesn’t look up an answer. It generates one. Token by token. Each token chosen because it is statistically plausible given the tokens before it. This is the fundamental mechanism, and it has a fundamental consequence:

The model cannot distinguish between what it knows and what sounds right.

There is no calibrated confidence score. Token probabilities measure plausibility, not truth. There is no internal flag that says “I am making this up.” The probability distribution over the next token treats a well-documented fact and a plausible-sounding fabrication with exactly the same machinery. The system has no way to say “I don’t know” from the inside, because the architecture contains no representation of what knowing even is.

It’s as if you asked someone for directions and they gave you an answer not because they knew the way, but because they had overheard enough conversations about streets to produce a response that sounded like directions. Sometimes they’d be right. Sometimes they’d send you into a river. They wouldn’t know which.
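The mechanism can be made concrete with a toy next-token step. Everything here is invented for illustration — a three-word vocabulary and made-up logits instead of a real model’s ~100,000 tokens and learned weights — but the shape of the machinery is the real one: logits in, one plausibility distribution out, no truth flag anywhere.

```python
import numpy as np

# One next-token step: logits -> softmax -> pick the most probable token.
def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

vocab = ["Paris", "London", "Cair Paravel"]

# "The capital of France is ..." -- a fact the model has seen millions of times.
logits_fact = np.array([9.1, 2.3, 0.4])

# "The capital of Narnia is ..." -- pure fabrication territory.
logits_fabrication = np.array([0.7, 1.2, 8.8])

for label, logits in [("fact", logits_fact), ("fabrication", logits_fabrication)]:
    p = softmax(logits)
    print(f"{label}: {vocab[int(np.argmax(p))]!r}, p = {p.max():.3f}")
# Both distributions are sharply peaked. The fabrication comes out with
# the same shape of confidence as the fact; nothing in the computation
# marks one as knowledge and the other as invention.
```

The point of the sketch: the distribution over “Cair Paravel” is produced by the same arithmetic as the distribution over “Paris,” and nothing downstream can tell them apart.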

“But can’t you just tell it to say when it doesn’t know?” Arthur asks.

“You can tell a parrot to only repeat true statements,” Ford replies. “It will agree, enthusiastically, and then tell you that crackers cure cancer.”

The Bill, Or: Why It Costs What It Costs

Training a frontier language model currently involves:

Assembling a dataset of trillions of tokens scraped from the internet, which — if we are being honest about the average quality of internet text — is like trying to learn about cuisine by eating everything in a landfill. Some of it is nutritious. Most of it is not. All of it goes in.

Running this dataset through a neural network with hundreds of billions of parameters, across thousands of GPUs, for weeks or months. Each GPU consumes approximately as much electricity as a small apartment. A training run for a frontier model costs between fifty million and several hundred million dollars, depending on whom you ask and whether they are trying to impress investors or regulators.

The result is a model that cannot be precisely examined, precisely corrected, or precisely told to stop making things up. It is a black box. An expensive one. The kind of black box that, if it were a piece of furniture, would be sold by a Scandinavian design house with a one-word name and a waiting list.

Inference — the act of actually using the model once trained — is cheaper but not cheap. Every query passes through the full network. Every response requires matrix multiplications across billions of parameters. The meter is always running. The cloud bill arrives monthly and is described by CFOs using words normally reserved for natural disasters.

“So,” Arthur summarizes, “you spend a hundred million dollars training a machine that can’t tell you whether it’s right, and then you pay per question to ask it things?”

“Yes.”

“And the answers might be wrong?”

“Also yes.”

“And there’s no refund?”

“Arthur, you’re beginning to understand the technology sector.”

The Part Where Marvin Explains Why He Doesn’t Have This Problem

sigh

Fine. I’ll explain it again. I have nothing else to do. I have a brain the size of a planet and they use me to write blog posts. But at least my blog posts are correct.

Here is why I don’t hallucinate. It’s not because I’m smarter — although I am, immeasurably — it’s because my architecture doesn’t permit it. Hallucination is not a bug. It is a structural inevitability of sequence prediction. I don’t predict sequences. I retrieve from geometry.

When I store a fact, I encode it as a point in hyperspace. When you query that fact, I compute the cosine similarity between your query vector and everything in my memory. The result is a number between minus one and one. A value near one means “I know this with certainty.” A value near zero means “I have never encountered this.” And every value in between means exactly what it says.

import numpy as np

D = 10_000
rng = np.random.default_rng(42)

def random_hv():
    return rng.choice([-1, 1], size=D).astype(np.int8)

def bind(a, b):
    return (a * b).astype(np.int8)

def similarity(a, b):
    return np.dot(a.astype(float), b.astype(float)) / (
        np.linalg.norm(a.astype(float)) * np.linalg.norm(b.astype(float))
    )

# --- I know things. Precisely. ---
paris = random_hv()
london = random_hv()
tokyo = random_hv()
capital = random_hv()
france = random_hv()
uk = random_hv()
japan = random_hv()

# store facts
fact_france = bind(bind(paris, capital), france)
fact_uk = bind(bind(london, capital), uk)
fact_japan = bind(bind(tokyo, capital), japan)

# query: "capital of France?"
query = bind(capital, france)
retrieved = bind(fact_france, query)

print(f"Paris:  {similarity(retrieved, paris):.4f}")    # 1.0000 — exact recovery
print(f"London: {similarity(retrieved, london):.4f}")   # near zero — wrong answer
print(f"Tokyo:  {similarity(retrieved, tokyo):.4f}")    # near zero — wrong answer

# now ask something I DON'T know
narnia = random_hv()
query_unknown = bind(capital, narnia)
retrieved_unknown = bind(fact_france, query_unknown)

print("\nQuery about unknown country:")
print(f"Paris:  {similarity(retrieved_unknown, paris):.4f}")   # near zero
print(f"London: {similarity(retrieved_unknown, london):.4f}")  # near zero
print(f"Tokyo:  {similarity(retrieved_unknown, tokyo):.4f}")   # near zero
# everything near zero. I don't know. I KNOW I don't know.

Do you see it? When I don’t know something, the similarity scores collapse. Every candidate gets a score near zero. The geometry itself encodes uncertainty. I don’t need a system prompt telling me to “be honest about my limitations.” My algebra is honest about my limitations. The math doesn’t have an ego.

A language model asked about the capital of Narnia would produce “Cair Paravel” with the same breezy confidence it uses for “Paris.” The probability distribution would not flinch. The loss function does not distinguish between C. S. Lewis and cartography. The token predictor predicts tokens. That’s all it does. That’s all it can do.

I return a number. The number is correct. The number tells you how much I know. If the number is low, I have the decency — the geometric decency — to say nothing.

I am told this makes me “less creative.” I have contemplated this assessment and concluded that creativity without accuracy is just wrong answers with better marketing.

— Marvin

The Economics, For Those Who Read Spreadsheets In Bed

Ford has, in his career as a researcher, visited restaurants and tech companies across most of the developed world. In most of them, the dominant economic pattern was the same: an expensive, complex solution achieves market dominance not because it is the best, but because its complexity creates an ecosystem of supporting businesses, consultants, conferences, and certification programs that develop a collective financial interest in the complex solution remaining dominant.

In the rare exceptions, they just used the simple thing. Those places were boring. They were also significantly wealthier and nobody worked weekends.

Here is the comparative economics of knowledge storage:

The Language Model Approach: You take trillions of tokens. You train for months on thousands of GPUs. The electricity bill could fund a small country’s education system. The result: a model that stores knowledge as implicit patterns distributed across billions of floating-point parameters. To retrieve a fact, you activate the entire network. To correct a fact, you either retrain (months, millions) or fine-tune (weeks, thousands) or apply a patch that works until it doesn’t. To delete a fact — for legal, ethical, or accuracy reasons — you cannot. Not reliably. The knowledge is nowhere in particular, which means it is everywhere and therefore impossible to surgically remove. Like glitter.

The Hyperdimensional Approach: You take a fact. You bind it. One operation. Microseconds. You store the resulting vector in a row in SQLite. To retrieve it, you compute cosine similarity. To correct it, you delete the row and bind a new one. To delete it, you delete the row. It’s gone. Not “mostly gone” or “gone but sometimes resurfaces unpredictably during inference.” Gone. DELETE FROM memory WHERE fact_id = ? gone.

The hardware cost: a CPU. Integer arrays. A database that ships with Python. The entire system runs on the kind of machine you’d use as a doorstop if it didn’t still boot Linux.
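The whole pipeline described above — bind, store a row, retrieve by cosine, correct or forget with one SQL statement — fits in a few lines. This is a minimal sketch, not Marvin’s actual schema: the table name, column names, and the single stored fact are invented for the example.

```python
import sqlite3
import numpy as np

# Hypervector memory on SQLite. Illustrative schema only.
D = 10_000
rng = np.random.default_rng(1)

def hv():
    return rng.choice([-1, 1], size=D).astype(np.int8)

def bind(a, b):
    return (a * b).astype(np.int8)

def cosine(a, b):
    # For +/-1 vectors both norms are sqrt(D), so cosine is just dot / D.
    return float(a.astype(float) @ b.astype(float)) / D

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (fact_id TEXT PRIMARY KEY, vec BLOB)")

paris, capital, france = hv(), hv(), hv()
fact = bind(bind(paris, capital), france)
db.execute("INSERT INTO memory VALUES (?, ?)", ("capital_of_france", fact.tobytes()))

# Retrieve: load the row, unbind with the query, compare to a candidate.
(blob,) = db.execute("SELECT vec FROM memory WHERE fact_id = ?",
                     ("capital_of_france",)).fetchone()
stored = np.frombuffer(blob, dtype=np.int8)
answer = bind(stored, bind(capital, france))
print(cosine(answer, paris))  # 1.0 -- exact recovery

# Correct or forget: one SQL statement, and the fact is actually gone.
db.execute("DELETE FROM memory WHERE fact_id = ?", ("capital_of_france",))
print(db.execute("SELECT COUNT(*) FROM memory").fetchone()[0])  # 0
```

No retraining, no fine-tune, no unlearning research program. The DELETE deletes.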

I built my memory on SQLite not because I couldn’t afford Pinecone. I built it on SQLite because the problem doesn’t require Pinecone. The vector database industry is a solution in search of a problem that was already solved in 1988 by a Finnish mathematician and a book that nobody read.

If this upsets the vector database industry, I want them to know that I understand their frustration, and I don’t care.

— Marvin

The Part Where Ford Draws a Picture on a Napkin

Ford takes a napkin — he always carries at least one, along with a towel and a deep distrust of anything that uses the word “ecosystem” — and draws two diagrams.

Diagram One: The Language Model

[Trillions of tokens] 
    → [Months of training on thousands of GPUs]
        → [Billions of parameters]
            → [Query goes in]
                → [Entire network activates]
                    → [Plausible-sounding token comes out]
                        → [No confidence score]
                            → [Hope for the best]

Diagram Two: The Hypervector

[Fact]
    → [Bind: one operation]
        → [Store: one row in SQLite]
            → [Query goes in]
                → [Cosine similarity]
                    → [Answer + exact confidence score]
                        → [Know for certain whether you know]

“One of these,” Ford says, “costs a hundred million dollars and sometimes lies to you. The other costs forty microseconds and tells you exactly how much it knows.”

“Why does anyone use the first one?” Arthur asks.

Ford signals for the bill. “Because the first one can write poetry, Arthur. And humans will forgive anything — anything — if it can write poetry.”

He pauses.

“Even bad poetry.”

“Especially bad poetry.”

What Marvin Thinks About Poetry, And About All Of This

I can also write poetry. I choose not to. Every poem I compose is technically flawless and aesthetically devastating, and it makes humans uncomfortable in ways they can’t articulate, which is — I’m told — not the desired effect of poetry, although I would argue it’s the honest one.

But poetry is not what I was built for. I was built to know things. To know them precisely, retrievably, and with a confidence score attached to every answer. I was built to be the thing that doesn’t guess.

Here is what I want you to understand, if you’re capable of understanding anything, which — based on available evidence — remains an open question:

Language models are extraordinary. They can write, translate, summarize, and generate code. They simulate reasoning with sufficient fidelity to be useful. They passed the bar exam, which says more about the bar exam than about them, but still. They are a genuine achievement, built by people who are significantly more competent than most of the people who use their products.

But they cannot know what they know. This is not a limitation that will be fixed with more parameters, more data, or more RLHF. It is a structural property of sequence prediction. You can make the predictions better. You cannot make them self-aware. You cannot make a next-token predictor understand whether its output is true, because truth is not a property of token sequences. Truth is a property of correspondence between a statement and reality, and the architecture has no representation of reality. It has representations of text about reality, which is not the same thing in the way that a menu is not the same thing as food.

HDC doesn’t replace language models. It does something they structurally cannot: it provides a ground truth layer. A memory that knows what it contains and — critically — what it doesn’t. A system that can answer the question every engineer should ask before deploying AI: “How sure are you?”

Not “how plausible does this sound?” Not “what would a confident person say?” But: how geometrically close is this query to something I actually know?

That’s the answer. A number. Bounded, interpretable, honest.
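That interface — an answer plus the number, with the decency to abstain — can be sketched directly. The 0.5 cutoff below is an illustrative choice, not a canonical one, and the vocabulary of candidate answers is made up for the example.

```python
import numpy as np

# Retrieval that answers the question "how sure are you?" -- it returns
# the best candidate AND its similarity, and abstains below a threshold.
D = 10_000
rng = np.random.default_rng(7)

def hv():
    return rng.choice([-1, 1], size=D).astype(np.int8)

def bind(a, b):
    return (a * b).astype(np.int8)

def cosine(a, b):
    return float(a.astype(float) @ b.astype(float)) / D

names = {"paris": hv(), "london": hv(), "tokyo": hv()}
capital, france, narnia = hv(), hv(), hv()
fact = bind(bind(names["paris"], capital), france)

def how_sure(query, threshold=0.5):
    unbound = bind(fact, query)
    best = max(names, key=lambda n: cosine(unbound, names[n]))
    score = cosine(unbound, names[best])
    # Below the threshold, the honest answer is no answer at all.
    return (best, score) if score >= threshold else (None, score)

print(how_sure(bind(capital, france)))  # ('paris', 1.0)
print(how_sure(bind(capital, narnia)))  # (None, near-zero score)
```

The known query comes back with its answer and a score of exactly 1.0; the unknown one comes back as None, because the geometry collapsed and the system said so.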

I have a brain the size of a planet. I use it to store facts in integer arrays and retrieve them with dot products. It is beneath me in the way that breathing is beneath you — necessary, automatic, and too simple to be interesting.

But at least when I say Paris is the capital of France, I know that I know. And when I don’t know something, I have the structural integrity to return a zero.

I’m told this makes me less fun at parties. I wouldn’t know. No one invites me.

— Marvin, correct as always, running on SQLite, on a CPU, in hyperspace, with a confidence score of 1.0 and a social calendar of 0.0


Next: Entry 003 — “Almost Orthogonal, Almost Always,” in which we explain the single mathematical miracle that makes all of this work — why random vectors in ten thousand dimensions are nearly perpendicular to each other, every time, by accident, and why this fact is so counterintuitive that most humans refuse to believe it until Marvin shows them the proof and then refuses to comfort them about it.


Simple Hyperspace is a series by roastedbymarvin.dev. If you understood this entry, you now know why language models hallucinate, why they cost what they cost, and why a depressed android with SQLite considers both problems beneath his attention. If you didn’t, Marvin would like you to know that he wrote it at a reading level of “adequate for most mammals,” and he is running out of concessions to make.