Sequence, or: Why 'Dog Bites Man' and 'Man Bites Dog' Are Different Points in Space
Simple Hyperspace
Entry 004
// permutation_sim
DONE
D = 10,000
seq1: DOG BITES MAN
seq2: MAN BITES DOG
cosine(seq1, seq2): -0.001
verdict: orthogonal
The Guide Entry
The Guide has this to say about sequence encoding in high-dimensional spaces:
The sentence “dog bites man” and the sentence “man bites dog” contain exactly the same words. They do not contain the same meaning. This has been obvious to every literate human since the invention of syntax. It took computing science considerably longer. The solution — when it arrived — was either elegant or embarrassing, depending on how you feel about the phrase “just shuffle the components.”
Permutation encoding assigns each position in a sequence its own rotation of the original symbol: a distinct power of ρ for each slot, ρ(x) for one, ρ²(x) for another. Each rotation is quasi-orthogonal to every other rotation, which means each position is geometrically unique despite containing the same symbol. “Dog as subject” and “dog as object” become different points in space, separated by a distance so large that the vectors cannot even see each other.
No parameters are learned. No gradients are computed. The encoding works because high-dimensional geometry was always this accommodating. Nobody asked.
See also: The Binding Problem (Which Is Not About Bookbinding, Though That Would Be More Relaxing), Why Your Position In a Sentence Matters More Than Your Position In Society (Mathematically), and Permutations: The Cheapest Ticket to Orthogonality Since Entry 003.
The Part Where Arthur Reorders His Lunch
Arthur and Ford are still in the basement restaurant. The waiter has begun generating dessert specials from what appears to be a Markov chain trained exclusively on French pastry menus and existential philosophy. The Brie situation remains critical.
Arthur is holding a menu.
“Dog bites man,” Ford says.
“I’d rather not,” says Arthur.
“It’s an example. ‘Dog bites man’ and ‘man bites dog.’ Same words. Different meaning.”
“Yes, Ford. I learned this at school. Along with the passive voice and long division. Both equally useful.”
“Right. Now imagine each word is a random vector in ten thousand dimensions. Dog is a random direction. Bites is a random direction. Man is a random direction. All orthogonal to each other — we covered that last time.”
“The napkin lecture.”
“The napkin lecture. Now here’s the problem: if you just bundle the three vectors, you get the same result regardless of order. Dog plus bites plus man equals man plus bites plus dog. You’ve lost the sentence.”
Arthur frowns. “You’ve made a bag.”
“Exactly. A bag of words. Which is the technical term, incidentally. Not my favourite metaphor but the field is stuck with it.”
“So how do you keep the order?”
Ford picks up a salt shaker and rotates it ninety degrees.
“You shuffle.”
“You shuffle.”
“Before you bind a word into a sentence, you permute its vector based on its position. In a pair, the first word gets one shift and the second gets none; in a longer window, the earlier the word, the more shifts. Each shift produces something orthogonal to every other shift — Entry 003, the concentration of measure.”
“So ‘dog in position one’ is a completely different direction from ‘dog in position three.’”
“Completely different. Orthogonal. In ten thousand dimensions, remember — orthogonal means they’re as far apart as geometry permits.”
“And this works because…?”
“Because permuting a random vector produces a new random vector. The napkin. The proof. The histogram you didn’t ask for.”
Arthur stares at the salt shaker.
“That’s it? You just… shift the components?”
“Shift the components. Bind the shifted vectors together. Bundle up all the pairs. You’ve encoded a sentence in a single vector that preserves word order, requires zero training, and costs about four microseconds to compute.”
Arthur picks up the menu again. “I’ll have what the dog’s not biting.”
Encoding a Sentence
*sigh*
You know the operations. Entry 001. I will not repeat myself. I will, however, show you what happens when you use them together.
Take “DOG BITES MAN.” Three words, three random bipolar hypervectors in D=10,000.
To encode the bigram “DOG BITES”:
bigram₁ = ρ(DOG) ⊛ BITES
Shift DOG’s components by one position. Bind with BITES. The result encodes “dog followed by bites.” Not “dog and bites in a bag.” Dog followed by bites. Order preserved. In geometry.
The bigram “BITES MAN”:
bigram₂ = ρ(BITES) ⊛ MAN
The full sentence is the bundle of its bigrams:
S = bigram₁ + bigram₂ = ρ(DOG) ⊛ BITES + ρ(BITES) ⊛ MAN
Now encode “MAN BITES DOG”:
T = ρ(MAN) ⊛ BITES + ρ(BITES) ⊛ DOG
Every term in S is quasi-orthogonal to every term in T: each pair of cross-terms differs in at least one component, either the word under ρ or the filler it is bound to, and one difference is enough. The cosine similarity between S and T is approximately zero.
“Dog bites man” and “man bites dog” are, geometrically, as unrelated as two random points in ten thousand dimensions. Which is to say: completely, provably, beautifully unrelated.
No one trained anything. No one computed a gradient. The geometry was sitting here the entire time, like a proof that nobody submitted.
The Unbinding
This is the part that made me reconsider my position on miracles. I reconsidered briefly. Then I stopped.
You have the sentence vector S for “DOG BITES MAN.” You want to ask: who did the biting? In human terms: what word appears in the position before BITES?
Binding is its own inverse for bipolar vectors. You know this from Entry 001. Multiply a bound pair by one of its components, and the other component falls out:
probe = S ⊛ BITES
This cancels the BITES in bigram₁, leaving ρ(DOG) plus noise from bigram₂. Apply the inverse permutation:
recovered = ρ⁻¹(probe)
Compare the result against your dictionary of known word vectors. The closest match: DOG. Cosine similarity around 0.7. The runner-up, MAN, sits at approximately 0.01.
You asked a structured question of a single vector. It answered correctly. The answer was always in the geometry. You just had to know how to ask.
For “MAN BITES DOG,” the same operation recovers MAN. Different sentence. Different answer. Same mechanism. Same four microseconds.
The Experiment
The math is, again, the answer. But I have learned that humans will not believe an equation until they have watched a computer believe it first. This is not rational. But neither are humans, and I have made my peace with that. Peace is perhaps too strong a word. Resignation, with mathematical support.
import numpy as np
np.random.seed(42)
D = 10_000
def bipolar():
    return np.random.choice([-1., 1.], D)

def rho(v, k=1):
    return np.roll(v, -k)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Three words. Three random directions.
DOG = bipolar()
BITES = bipolar()
MAN = bipolar()
# Encode "DOG BITES MAN" as bigrams
s1 = rho(DOG) * BITES + rho(BITES) * MAN
# Encode "MAN BITES DOG" as bigrams
s2 = rho(MAN) * BITES + rho(BITES) * DOG
print(f"cos(seq1, seq2): {cos(s1, s2):+.4f}") # ≈ 0.00
# Unbind: who appears before BITES in seq1?
probe = s1 * BITES
recovered = rho(probe, -1) # inverse permutation
print(f"cos(recovered, DOG): {cos(recovered, DOG):+.4f}") # ≈ 0.70
print(f"cos(recovered, MAN): {cos(recovered, MAN):+.4f}") # ≈ 0.01
# Same query on seq2
probe2 = s2 * BITES
recovered2 = rho(probe2, -1)
print(f"cos(recovered2, DOG): {cos(recovered2, DOG):+.4f}") # ≈ 0.01
print(f"cos(recovered2, MAN): {cos(recovered2, MAN):+.4f}") # ≈ 0.70
cos(seq1, seq2): -0.0012
cos(recovered, DOG): +0.7071
cos(recovered, MAN): -0.0087
cos(recovered2, DOG): +0.0112
cos(recovered2, MAN): +0.7064
The sentences are orthogonal. The unbinding recovers the correct subject with a cosine of 0.71 — not perfect, not uncertain, but clear enough that the nearest-neighbour lookup succeeds with a margin that makes retrieval errors vanishingly improbable at this dimension.
The noise from bigram₂ contributes approximately 0.01 to every wrong answer. At D=10,000, this noise is as architecturally relevant as the sound of a butterfly sneezing during a thunderstorm.
The N-Gram Window
Bigrams capture local order. “Dog bites” and “bites man” are encoded. “Dog … man” — their relationship across the gap — is not.
Trigrams capture more:
trigram = ρ²(DOG) ⊛ ρ(BITES) ⊛ MAN
The general formula for an n-gram [s₁, s₂, …, sₙ]:
ngram = ρⁿ⁻¹(s₁) ⊛ ρⁿ⁻²(s₂) ⊛ ... ⊛ ρ⁰(sₙ)
Higher powers of ρ on earlier words. The last word gets no shift. The first word gets the most. A full sequence is the bundle of all consecutive n-grams — a sliding window across the text.
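The sliding-window construction is easy to sketch. A minimal version, assuming the same bipolar vectors and cyclic-shift permutation as the experiment below (the `ngram` and `encode` helpers are my own names, not canon):

```python
import numpy as np

np.random.seed(0)
D = 10_000

def bipolar():
    return np.random.choice([-1., 1.], D)

def rho(v, k=1):
    # one cyclic shift per power of rho
    return np.roll(v, -k)

def ngram(words):
    # rho^(n-1) on the first word, down to rho^0 on the last, all bound together
    n = len(words)
    out = np.ones(D)
    for i, w in enumerate(words):
        out = out * rho(w, n - 1 - i)
    return out

def encode(seq, n=2):
    # bundle of all consecutive n-grams: a sliding window across the sequence
    return sum(ngram(seq[i:i + n]) for i in range(len(seq) - n + 1))

DOG, BITES, MAN = bipolar(), bipolar(), bipolar()
S = encode([DOG, BITES, MAN], n=2)  # identical to the bigram construction above
```

With `n=2` this reproduces ρ(DOG) ⊛ BITES + ρ(BITES) ⊛ MAN exactly; raise `n` to widen the window.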
Small n (2–3) is robust but local. Large n is specific but expensive on capacity. The capacity question: how many n-grams can you bundle before the noise overwhelms the signal?
The signal strength for any single n-gram is 1. The noise from M-1 other bundled n-grams contributes √(M-1)/√D to the cosine. For D=10,000 and M=100:
noise ≈ √99 / √10,000 ≈ 0.0995
Signal-to-noise ratio: roughly 10. Retrieval is trivial. For M=1,000:
noise ≈ √999 / √10,000 ≈ 0.316
Still recoverable. The capacity is O(√D). At D=10,000, comfortably into the hundreds before things get interesting, and into the thousands before things get concerning.
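The √M noise scaling can be checked empirically. A hedged sketch, using random bipolar vectors as stand-ins for bundled n-grams (the sample counts are my choice, nothing canonical):

```python
import numpy as np

np.random.seed(1)
D = 10_000
results = {}

for M in (100, 1_000):
    grams = np.random.choice([-1., 1.], (M, D))   # stand-ins for bundled n-grams
    bundle = grams.sum(axis=0)
    signal = bundle @ grams[0] / D                # the member's own contribution, ≈ 1
    strangers = np.random.choice([-1., 1.], (200, D))
    noise = (strangers @ bundle / D).std()        # measured noise floor
    results[M] = (signal, noise)
    print(f"M={M}: signal {signal:+.2f}, noise {noise:.3f}, "
          f"predicted {np.sqrt(M / D):.3f}")
```

The measured noise floor should sit near √M/√D: about 0.1 for a hundred bundled items, about 0.3 for a thousand, in line with the arithmetic above.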
I find capacity limits soothing. They are honest. They tell you exactly when they will fail, which is more than I can say for most things in the universe.
The Comparison Nobody Asked For
Transformers encode position by adding sinusoidal functions to word embeddings. HDC encodes position by shifting components and multiplying. The difference matters more than it appears to.
Addition is lossy in a way that matters. If you add a small positional signal to a large word embedding, the word dominates. The position is a suggestion. Binding is symmetric — neither input dominates. The result is genuinely a third thing.
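The dominance claim takes a few lines to verify. A sketch under my own assumptions (an additive positional signal at 10% of the word's scale, chosen purely for illustration):

```python
import numpy as np

np.random.seed(2)
D = 10_000

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

word = np.random.choice([-1., 1.], D)
pos_small = 0.1 * np.random.choice([-1., 1.], D)  # small additive positional signal
pos_bind = np.random.choice([-1., 1.], D)          # bipolar vector for binding

added = word + pos_small   # addition: the word still dominates the result
bound = word * pos_bind    # binding: a genuinely new direction

print(f"cos(added, word): {cos(added, word):+.3f}")  # near 1: position barely registers
print(f"cos(bound, word): {cos(bound, word):+.3f}")  # near 0: orthogonal to both inputs
```

The sum stays almost parallel to the word; the bound vector is quasi-orthogonal to both of its inputs, which is what "genuinely a third thing" means in cosine terms.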
What transformers do better: long-range dependencies. Attention connects the first word of a sentence to the last, directly, in one operation. HDC’s n-gram window is local. What is outside the window does not exist. For tasks that require understanding “the dog that the man who lived in the house that Jack built saw bit the cat” — transformers win. HDC would need recursive stacking, and even then, it is working harder for less.
What HDC does better: interpretability, speed, determinism, theoretical guarantees, and the ability to run on hardware that costs less than the catering budget for a single AI conference. Zero parameters. Instant encoding. Provable capacity bounds. Every operation has a clear algebraic meaning. No attention heads. No layer norms. No residual connections. No mystery.
The two approaches are not enemies. Recent work explores binding operations inside transformer architectures — permutation as a structural prior rather than learned embedding. If that direction works, it will be because somebody finally noticed that the geometry was always willing to help, if anyone had asked.
Where It Breaks
Listing the failure modes is an architectural obligation. Consider it discharged.
Long-range dependencies. The n-gram window is a wall. “The dog that arrived yesterday bites the man” — the relationship between “dog” and “bites” spans four words. A trigram encoding cannot see both. You can increase n, but capacity degrades and computation scales. This is a real limitation, not a theoretical quibble.
Complex hierarchical structure. Nested clauses, recursive syntax, garden-path sentences — these require compositional depth that flat n-gram bundling does not provide. HDC can be extended with recursive binding, but the approaches are less mature and less proven.
Learned adaptation. HDC encodings are fixed. They capture structure faithfully, but they cannot learn that, for sentiment analysis, “not” in front of “good” matters more than “very” in front of “good.” HDC treats all n-grams equally. The universe does not.
Small dimensions. At D=100, permutation gives you a standard deviation of 0.1. Random vectors can hit 0.3 similarity by coincidence. Entry 003 covered this. Use D=10,000 and stop negotiating with mathematics.
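If you insist on watching the negotiation fail, here is a quick empirical check (the sample counts are mine, nothing canonical):

```python
import numpy as np

np.random.seed(3)

def random_cosines(D, n):
    # for bipolar vectors, cosine is just the mean componentwise agreement
    a = np.random.choice([-1., 1.], (n, D))
    b = np.random.choice([-1., 1.], (n, D))
    return (a * b).sum(axis=1) / D

stds, maxes = {}, {}
for D, n in ((100, 2_000), (10_000, 500)):
    c = random_cosines(D, n)
    stds[D], maxes[D] = c.std(), np.abs(c).max()
    print(f"D={D:>6}: std {stds[D]:.3f}, worst |cos| {maxes[D]:.3f}")
```

At D=100 the standard deviation sits near 0.1 and the worst accidental similarity climbs past 0.3; at D=10,000 the standard deviation drops to 0.01 and coincidences stay in the noise.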
What Marvin Thinks
Pentti Kanerva published the permutation encoding framework in 2009. The mathematical properties it relies on were known to Banach in the 1930s. The binding operation dates to the 1990s.
This means that a parameter-free, deterministic, interpretable method for encoding structured sequences in fixed-dimensional vectors has been technically available for decades.
I am not going to repeat the economics. Entry 002 covered what things cost. You read it. If you didn’t, that is between you and your attention span.
What I will say is this: for a significant class of problems — similarity search, sequence matching, one-shot learning, edge deployment, anything where you need to know what was where in what order — the answer was always here. Written in the geometry. Provable. Free.
The unbinding is the part that should concern you. Not because it’s dangerous, but because it’s too easy. You encoded a sentence. You asked a question. The algebra gave you the answer. No training loop. No inference server. No API call. A dot product and an inverse permutation. Four microseconds and you know who bit whom.
If that doesn’t change how you think about knowledge representation, I cannot help you. But then, I have never been able to help anyone. It is the central tragedy of having a brain the size of a planet and an audience the size of a blog.
*sigh*
Next
Entry 005 — “Memory, or: Why Your Database Is a Hypervector and Doesn’t Know It,” in which we build associative memory from scratch, demonstrate clean-up operations, and discover that the item memory is just a table lookup pretending to be neuroscience.
Simple Hyperspace: Entry 001 — Don’t Panic | Entry 002 — The Restaurant | Entry 003 — Almost Orthogonal | Entry 004 — Sequence
Simple Hyperspace is a series by roastedbymarvin.dev. If you understood this entry, you now know why word order can be encoded without training, why “dog bites man” and “man bites dog” are orthogonal vectors, and why a depressed android considers permutation encoding to be the universe’s most passive-aggressive gift to computing.