
Temporal Encoding in Transformers

Course 1.3 — Temporal Encoding in Transformers: Order Matters

Estimated time: 20–25 minutes • Level: Intermediate • Format: Read + mini-labs + visuals


Why this module matters

Modern AI models like Transformers have revolutionized how we process language and other sequence data. But there’s a twist: the Transformer architecture, by itself, has no sense of sequence order – it’s fundamentally order-agnostic (time-symmetric, one might say). How, then, do we get a Transformer to understand that “Alice loves Bob” is different from “Bob loves Alice”? The answer is through positional encodings (giving each position in the input a unique representation) and through imposing a causal direction when needed (e.g. for text generation). In this module, you’ll learn why positional encoding is necessary, how it’s implemented in practice, and how Transformers introduce an arrow of time during training (via masking) to handle sequences that unfold over time huggingface.co machinelearningmastery.com. We’ll connect this to the concepts of entropy and arrow of time: a Transformer with proper positional encoding can reduce its uncertainty (entropy) about the next token by using past context, much like we saw in the entropy labs.


Learning goals

By the end, you can:

  1. Explain why the vanilla Transformer needs explicit positional information (because self-attention alone is permutation-invariant, i.e. oblivious to word order) huggingface.co machinelearningmastery.com.

  2. Describe how positional encoding works at a high level: e.g. adding sinusoidal or learned position vectors to embeddings so that each position in a sequence has a unique “tag” machinelearningmastery.com machinelearningmastery.com.

  3. Understand what a causal mask is and why it’s used during training for autoregressive models (to prevent “peeking” at future tokens, thereby forcing a left-to-right arrow of generation) jalammar.github.io.

  4. Demonstrate (with a simple code or reasoning) that without positional encodings, a self-attention model treats identical tokens interchangeably regardless of position huggingface.co huggingface.co.

  5. Appreciate how these design choices inject a preferred direction into models (e.g., GPT models read left-to-right), analogous to how the arrow of time emerges statistically in thermodynamics. Also differentiate this from models like BERT, which use bidirectional context (no single arrow in usage).


Plain-language intuition


Word salad vs sentence: If I give you the words “dog bites man” or “man bites dog”, the order matters – one is ordinary, the other surprising (and their meanings differ completely). Humans inherently understand sequences in order. Transformers, however, process all words simultaneously, as a set. Without extra help, a Transformer would see those two sentences as the same bag of words. We fix this by giving the model a sense of position: essentially numbering the positions and encoding those numbers in the word representations machinelearningmastery.com. So “dog (pos1) bites (pos2) man (pos3)” is distinguishable from “man (pos1) bites (pos2) dog (pos3)”.


No built-in clock: Unlike recurrent neural networks (RNNs) or your reading of this sentence, the vanilla Transformer has no built-in chronology. Think of self-attention as a meeting of people where any person can talk to any other without a speaking order – useful, but it doesn’t tell you who spoke first or next. Positional encoding is like giving everyone name tags that include their seat number around a table, so you know who is to the left/right of whom in some order machinelearningmastery.com. Once positions are tagged, the model can learn order-dependent patterns (like “if X is two positions before Y, maybe X is adjective and Y is noun” or “if word is at position 1, it might be the subject”).


Causal masking (the arrow of sequence): For tasks like language generation (predicting next word), we introduce an arrow of time by only allowing the model to look backwards (at earlier positions) and not forwards. This is done via a triangular mask that hides future tokens during training jalammar.github.io. It’s like covering the future words so the model can’t cheat by seeing them, forcing it to learn to predict step by step left-to-right. This is analogous to the arrow of time: the model generates from past to future, never the other way around. (In contrast, for tasks like translation or filling in blanks, we might allow bidirectional context when appropriate – but even then, position info is still used so the model knows who is where.)

Key perspective: In Transformers, order is not an intrinsic given – we have to inject time’s arrow artificially through design choices (positional encodings to distinguish positions, and masks to enforce directional flow) huggingface.co medium.com. This is a nice parallel to physics: the fundamental equations (analogous to self-attention) are symmetric, but when you introduce the right conditions or constraints (analogous to an entropy gradient or an initial low-entropy condition), you get an emergent arrow. Here we manually impose the arrow for practical reasons.

Core concepts


  • Self-attention and permutation invariance: The Transformer’s self-attention mechanism allows each token to attend to every other token in the input. Mathematically, if you feed in the same set of embeddings in a different order, without any positional encoding, the self-attention output would be the same because it’s just pairwise interactions of content vectors huggingface.co machinelearningmastery.com. The model has no clue which token is first or last. This is great for ignoring irrelevant order (it’s why Transformers can handle sets or long-range dependencies well), but disastrous if order actually carries meaning (which it usually does in language and time series). Therefore, permutation invariance must be broken by adding some function of the position index to the token representations machinelearningmastery.com machinelearningmastery.com. Once we do that, the model’s attention scores and outputs will change if we shuffle the sequence, as desired.


  • Positional encoding schemes: There are a few ways to encode position:

    • Absolute positional encoding (Sinusoidal): In the original Transformer paper, they added sine and cosine waves of varying frequencies to the embeddings machinelearningmastery.com machinelearningmastery.com. This gives each position a unique pattern of values across the embedding dimensions. Importantly, it allows the model to learn relative positions too, because sinusoids create a predictable phase shift between positions jalammar.github.io jalammar.github.io. For instance, the difference between position 10 and 11 corresponds to a fixed shift in those sine-wave values that the model can learn to interpret as “neighboring token”. (A minimal code sketch of this scheme appears right after this list of core concepts.)

    • Learned positional embeddings: Another approach is to just have a trainable lookup table for positions (like an extra embedding for “position 1”, “position 2”, etc.). This also breaks symmetry, though it doesn’t generalize to longer sequences than seen in training unless you extrapolate or have enough table entries.

    • Relative positional encoding: Newer variants let the model learn to pay attention by relative distance (e.g., “this token is 5 to the left of that one”). We won’t delve deep here, but know that it’s an active area of research. The key is still that some representation of position or distance is given to the model huggingface.co huggingface.co.

      In all cases, the positional information is typically added to the token’s embedding vector (just elementwise addition before feeding into Transformer layers) machinelearningmastery.com machinelearningmastery.com. This way, each token embedding is slightly tweaked based on its position, so no two positions look exactly the same to the model.


  • Causal (sequence) masking: When training a language model to predict text left-to-right, we use a causal mask (an additive triangular matrix with -∞ above the diagonal and 0 elsewhere) to ensure each position’s self-attention only considers tokens at earlier positions jalammar.github.io. Concretely, when computing attention for the token at position i, we mask out positions > i so that Q_i · K_j has no effect if j > i. This forces the model output at position i to be a function only of positions ≤ i (itself and those before it). The effect is that during training, even though we input the full sentence, we pretend it is being revealed to the model one token at a time – just like the arrow of time: you know the past, not the future medium.com. Without the mask, the model could trivially use future words to predict the current one (a form of “information leakage”). With the mask, we instill an arrow of information flow: past → future. During generation, we naturally feed it one token at a time so it only ever has past context.


  • Bidirectional vs unidirectional: Not all uses of Transformers impose a single direction. BERT, for example, is trained with a “masked language modeling” objective where it can use context on both sides to fill in a blank. BERT still uses positional encoding (so it knows relative order), but it does not use a causal mask on the encoder – it looks both left and right. That’s fine for tasks like understanding a full sentence (because the model isn’t generating text sequentially, it’s just analyzing). So even BERT’s architecture is aware of positions; it just isn’t biased toward a temporal order at inference. Models like GPT, on the other hand, are explicitly unidirectional; they generate left-to-right and can’t see future tokens by design. This is why GPT can naturally do text generation, while BERT cannot without modification. It’s akin to the difference between watching a full movie (beginning to end given) versus improvising a story as you go – different needs, thus different use of the arrow of sequence.
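
To make the sinusoidal scheme above concrete, here is a minimal NumPy sketch (our own illustrative code, not taken from any of the cited sources) that builds the encoding matrix from the original paper’s formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The helper name sinusoidal_pe is just a label for this sketch.

import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # the 2i indices, one per sin/cos pair
    angle_rates = 1.0 / np.power(base, even_dims / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe[0, :4])   # position 0's first few values
print(pe[1, :4])   # position 1 gets a different pattern
# In a real model these vectors are added elementwise to the token embeddings:
# inputs = token_embeddings + pe[:num_tokens]

Every position gets a distinct vector, and nearby positions get similar (but not identical) vectors – exactly the “tag” the attention layers need.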

Quick analogy: Think of positional encoding as page numbers in a book. Self-attention alone is like having loose pages – the content is there but you could shuffle pages and you wouldn’t know the correct order. Adding page numbers (positional encodings) tells you the correct sequence. The causal mask is like reading the book for the first time – you don’t get to peek ahead at future pages, you can only read forward. If you remove the causal mask, it’s like already having the whole book – great for understanding (you can flip back and forth), but not how we generate stories linearly.

Visual / diagrammatic intuition



Illustration: In a Transformer encoder, each input token (e.g., French words “Je”, “suis”, “étudiant”) first gets combined with a positional encoding vector (yellow) indicating its position in the sequence, forming an “embedding with time signal” jalammar.github.io. This means even if the same word “étudiant” appears in a different position in another sentence, the vector it sends into the model will be slightly different. The image above shows a simple example: position 1 gets a tag t₁ added, position 2 gets t₂, etc., so the model can differentiate x₁ (word at pos1) from x₂ (pos2) machinelearningmastery.com machinelearningmastery.com.


Effect of no positional encoding: If we remove those yellow positional vectors, two identical words in different positions look exactly the same to the self-attention mechanism. For instance, in the sentence “The dog chased another dog,” without positional encodings, the two tokens “dog” are indistinguishable inside the model huggingface.co huggingface.co. The result: the model’s output representations for those two “dog” tokens will end up identical (it has no basis to treat them differently). That’s obviously a problem – it means the model can’t tell which “dog” was the subject and which was the object in that sentence. With positional encoding, even though both are “dog,” one might be represented as “dog+position2” and the other “dog+position5,” which are different vectors. The attention mechanism can then yield different outputs (perhaps attending differently) for them.


Effect of the causal mask: Imagine the attention matrix as an N×N grid for a sentence of length N, where each cell (i,j) shows whether token i attends to token j. The causal mask is like a triangular blackout – all cells where j > i are masked (darkened out). So row 5 (token5’s attention) will only have non-zero entries up to column 5, and zeros (masked) to the right beyond itself jalammar.github.io. This ensures token5 can only incorporate info from tokens 1–5. Visually, it’s a triangle of allowed attention and a triangle of disallowed above it. This simple pattern is what forces the model to learn to carry information forward step by step. If you lifted the mask, the attention matrix could be fully populated (each token attending to all, making it non-directional like an encoder).
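
Here is a minimal NumPy sketch of that triangular pattern (toy random scores standing in for Q·Kᵀ; in a real model the mask is added to the scaled dot-product scores before the softmax):

import numpy as np

np.random.seed(0)
N = 5
scores = np.random.randn(N, N)                     # stand-in for Q·K^T / sqrt(d)

# Causal mask: True above the diagonal marks "future" positions (j > i)
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)           # future positions get -infinity

# Row-wise softmax: the -inf entries become exactly zero weight
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(np.round(weights, 2))
# Lower-triangular pattern: row i has nonzero weights only in columns 0..i

Row 0 can only attend to itself (weight 1.0), while row 4 can attend to all five positions – exactly the “triangle of allowed attention” described above.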


Mini‑Lab F — See the effect of positional encoding (5 min)

Goal: Demonstrate in a simplified way that without positional encodings, a Transformer’s self-attention cannot distinguish tokens by order.

We’ll simulate a mini self-attention on a toy “sentence” with identical tokens in different positions. We’ll show that without position info, their outputs are identical, and with a bit of position info added, they diverge.

import numpy as np

# Toy embeddings for two identical tokens (say both are the embedding for "dog")
embed1 = np.array([1.0, 2.0, 3.0])  # token at position 1
embed2 = np.array([1.0, 2.0, 3.0])  # token at position 2 (same embedding as token1)
embeds = np.stack([embed1, embed2])  # shape (2,3)

# Random initialize tiny attention weight matrices
np.random.seed(0)
W_q = np.random.randn(3, 3)  # weights for query
W_k = np.random.randn(3, 3)  # weights for key
W_v = np.random.randn(3, 3)  # weights for value

# Compute queries, keys, values
Q = embeds.dot(W_q)   # shape (2,3)
K = embeds.dot(W_k)   # shape (2,3)
V = embeds.dot(W_v)   # shape (2,3)

# Self-attention output calculation
def self_attend(Q, K, V):
    d = Q.shape[-1]  # dimension
    scores = Q.dot(K.T) / np.sqrt(d)      # dot-prod scores (2x2)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax
    return weights.dot(V)  # weighted sum of values

out_no_pos = self_attend(Q, K, V)
print("Outputs without positional encoding:")
print(out_no_pos[0], "\n", out_no_pos[1])

Since embed1 and embed2 are identical and pass through the same weight matrices, the attention outputs for token 1 and token 2 come out exactly identical – the print shows two identical vectors. This means the model did exactly the same thing for both occurrences of “dog”, lacking any clue that one was first and one was second.
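
As a quick sanity check (continuing the same snippet), you can confirm the two rows really are identical rather than eyeballing the printout:

print(np.allclose(out_no_pos[0], out_no_pos[1]))  # True: both "dog" tokens got exactly the same output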

Now let’s add a simple positional encoding: say we add [0.1,0,0] to embed1 and [0,0.1,0] to embed2 (just to differentiate them):

# Add a simple positional bias
pos1 = np.array([0.1, 0.0, 0.0])
pos2 = np.array([0.0, 0.1, 0.0])
embeds_pos = np.stack([embed1 + pos1, embed2 + pos2])
Qp = embeds_pos.dot(W_q)
Kp = embeds_pos.dot(W_k)
Vp = embeds_pos.dot(W_v)
out_with_pos = self_attend(Qp, Kp, Vp)
print("Outputs with a simple positional encoding:")
print(out_with_pos[0], "\n", out_with_pos[1])

Now, you’ll see the two output vectors differ (even if slightly). That means the model can produce different representations for the two “dog” tokens, thanks to positional info. In practice, the sinusoidal encoding is more systematic, but the principle is shown: without position, identical tokens yield identical processing; with position, they can be differentiated huggingface.co.

(Advanced thought: if you increase the positional offsets or change the random seed, you may see larger differences. The key here is not the magnitude but the concept.)


Cross‑Disciplinary and Contextual Notes


Transformer models in language vs time series: In natural language, word order is crucial for meaning (“eat shoots and leaves” vs “eats, shoots, and leaves”). For other sequence data like time series (say, stock prices or weather data), order is literally the temporal axis. Transformers are being used in those domains too (sometimes called “Time Transformer”), and they also require positional encodings so the model knows which data point came before which arxiv.org medium.com. In time series, often the position encoding can include actual time (timestamps) or be supplemented with features like “time of day” etc., to help the model capture periodic patterns. There’s research on relative position or rotary position encodings (RoPE) that handle very long sequences by encoding distances in a rotating fashion huggingface.co huggingface.co – enabling generalization to longer sequences than seen during training.
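
To give a flavor of the rotary idea (a simplified sketch of the principle, not a drop-in implementation of any particular library), here is a NumPy toy: each pair of embedding dimensions is rotated by a position-dependent angle, and the resulting query–key dot product then depends only on the relative offset between the two positions – which is why it generalizes gracefully to positions never seen in training. The helper name rope_rotate is our own.

import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to pos (rotary-style sketch)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per dimension pair
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

np.random.seed(1)
q, k = np.random.randn(8), np.random.randn(8)

# The attention score depends only on the relative offset (here 2), not on absolute position:
print(rope_rotate(q, 5).dot(rope_rotate(k, 7)))      # query at pos 5, key at pos 7
print(rope_rotate(q, 105).dot(rope_rotate(k, 107)))  # same offset, 100 positions later -> (numerically) the same score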


Cognitive parallel – how do humans encode order? The brain doesn’t obviously tag every word with a number, yet we retain sequence information. Some theories in neuroscience suggest the brain uses oscillations or firing time patterns to mark sequences (e.g., theta waves providing a temporal context for memory sequences). There’s also the concept of “positional” cells in the hippocampus that fire in sequences (time cells) for ordered events. This is speculative, but one could draw an analogy: the brain might use a kind of positional encoding internally (like phase of a wave or patterns of neural activity) to differentiate event order. Our short-term memory for sequences (like remembering a phone number) clearly indicates some mechanism to keep track of order, not just a bag of digits.


Transformers and entropy/information: We discussed entropy in sequences earlier. A well-trained language model with positional encoding and masking effectively learns to minimize the uncertainty of the next token given the past – that’s language modeling. It is directly an exercise in entropy reduction (or equivalently, maximizing likelihood). If a model had no positional encoding, its uncertainty about the next token would be computed without regard to order, which would be disastrously high (imagine predicting the next word without knowing the word order – lots of confusion!). So positional encoding indirectly helps reduce entropy by providing structure. Moreover, the masking ensures the model’s prediction for position i doesn’t trivialize to zero entropy by peeking ahead; it only knows the past. In training, the model’s goal is to approach the true conditional entropy of natural text. This links back to Shannon: if English has roughly 2.5 bits/character entropy (with context), a perfect model would reach that, meaning its predictions are as tight as the true distribution.
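
To make the Shannon connection concrete, here is a toy calculation (made-up numbers, not from any real model): the quantity a language model minimizes is the average number of bits it “pays” for each observed next token, and better use of ordered past context means fewer bits.

import numpy as np

vocab = ["cat", "dog", "sat", "ran"]
observed = "dog"                                 # the token that actually came next

p_contextual = np.array([0.1, 0.6, 0.2, 0.1])    # a model using ordered past context (toy numbers)
p_orderblind = np.full(4, 0.25)                  # a model with no clue about order -> near uniform

for name, p in [("with context", p_contextual), ("order-blind", p_orderblind)]:
    bits = -np.log2(p[vocab.index(observed)])
    print(f"{name}: {bits:.2f} bits for '{observed}'")
# with context: 0.74 bits   order-blind: 2.00 bits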


Analogies to thermodynamic arrow: When we mask and make the model operate left-to-right, we are imposing a boundary condition (only past info available, not future) similar to how the universe’s low-entropy past gives a direction to time. If we trained a model in both directions (like BERT, which gets to see full context), there’s no single arrow in usage – it’s more like an equilibrium view (the model knows the whole sequence). BERT’s aim is different (fill in blanks, understand entire sequence), akin to analyzing a full system state, whereas GPT’s aim is generative, akin to a dynamic evolving process. This is why for story or text generation, we use GPT (one-way time arrow), and for understanding or classification, we might use BERT (bidirectional). It’s fascinating that imposing an arrow (mask) or not leads to models with different capabilities, much like how physical laws with a time arrow produce different phenomena than time-symmetric ones.


Advanced topic – Reversible transformers?: There is research into making models more reversible (e.g., to save memory, by being able to run layers backward). But order irreversibility isn’t usually considered a problem – we want an arrow for generative modeling. In physics, we often ask why can’t we reverse processes; in modeling, we sometimes intentionally break reversibility for practicality (you don’t want a text generator that outputs past and future words jumbled!). It’s a neat inversion: in physics we seek to explain an observed arrow, in ML we impose an arrow to get the desired behavior.


Quick misconceptions


  • “Transformers inherently know sequence because of attention.”

    Clarification: No – attention by default does not encode position. It captures relationships between tokens (who should focus on whom) but without positional encoding, it doesn’t know if token A is before or after token B in the sequence huggingface.co. You must provide either positional indices or some other ordered signal. Think of self-attention as caring what words are present and how they interact, but not where they are, until you add that information.


  • “Positional encoding is just an implementation detail; the model would learn order anyway.”

    Clarification: The model cannot infer order out of thin air – if two sequences have the same multiset of words, a transformer without positions will treat them the same. Unlike a recurrent net that processes one by one (thus implicitly ordered), a Transformer processes all at once. So positional encoding is fundamental, not a trivial detail machinelearningmastery.com. It’s true that whether you use sinusoids or learned embeddings might be a detail, but having some positional scheme is mandatory for tasks where order matters.


  • “We could just sort input data and let the model figure it out.”

    Clarification: Sorting or permuting input arbitrarily would destroy meaning; the model can’t “figure out” original order without additional clues. The entire problem is to represent the sequence’s inherent order to the model. Sometimes people ask, what if we fed positions as an extra input token or a feature? That’s actually fine (some models do concatenate a position or time token). But one way or another, you feed positional info. Not providing it is like giving someone a shuffled book and expecting them to understand the story – impossible without some arrangement key.


  • “Bidirectional models are always better than unidirectional because they see more context.”

    Clarification: They are better for understanding tasks (like comprehension questions or sentiment analysis) because they get full context. But for generating sequentially or modeling causal processes, you need the one-way setup. It’s not about better or worse universally; it’s about the right tool for the job. Also, a bidirectional model (like BERT) cannot be directly used to generate coherent text one word at a time, because it wasn’t trained to do that – it relies on seeing future context, which real generation can’t allow. It’s similar to how you can’t violate causality in time: a model that “knows the future” isn’t applicable to scenarios where the future isn’t available.


Check your understanding

  1. Why do we add positional encodings to Transformer inputs?

    Answer: Because without them, the Transformer has no information about the order or position of tokens – self-attention would treat the input as a bag of tokens, ignoring sequence. Positional encodings give each position a unique signature, enabling the model to learn order-dependent patterns machinelearningmastery.com machinelearningmastery.com.

  2. What is a causal mask and when is it used?

    Answer: A causal mask (or look-ahead mask) is used in autoregressive Transformers (like GPT) during training to block each token from “seeing” any future tokens in the sequence jalammar.github.io. It’s an upper triangular mask applied to the attention matrix. It ensures the model predicts the next token using only past and present context (no cheating by looking ahead), effectively enforcing a left-to-right generation order.

  3. If I have a Transformer without positional encoding and I input the sentence “A B”, how would it view “B A”?

    Answer: It would view “B A” as essentially the same as “A B” – just a multiset of the tokens {A, B} with no distinction, yielding the same internal representations for both orders. The model would be unable to tell the sequences apart or assign different meanings, because it wasn’t told which token came first or second. In fact, any permutation of “A B” would look identical to the model without positions huggingface.co.

  4. Transformer Q: In the original Transformer, why use sine/cosine for positional encoding instead of just learnable vectors?

    Answer: (This one is a bit open-ended, but the expected answer might be:) The sinusoids allow the model to generalize to sequence lengths not seen during training, because sinusoids have a regular pattern (for instance, the relative shift between positions is encoded by phase differences) jalammar.github.io jalammar.github.io. They also provide a smooth way for the model to learn relative position offsets (since any relative distance corresponds to some phase shift that the model can potentially learn to attend to). Learnable position embeddings, by contrast, have no inherent structure and are limited to the positions seen in training (if you go beyond, you have to extrapolate or pad new embeddings). In short, sinusoids offer extrapolation and a mathematically neat scheme; learnable ones offer flexibility but not generalization beyond trained length.
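
If you want to verify the relative-shift claim from the answer above numerically, here is a small self-contained NumPy check (our own illustration, not from the paper): for each sine/cosine pair, the encoding of position pos + k is an exact rotation of the encoding of position pos, and that rotation depends only on k, not on pos.

import numpy as np

d_model, base = 16, 10000.0
w = base ** (-np.arange(d_model // 2) / (d_model // 2))   # one frequency per sin/cos pair

def pe(pos):
    """Sinusoidal encoding of one position, sin/cos interleaved per frequency."""
    out = np.empty(d_model)
    out[0::2], out[1::2] = np.sin(pos * w), np.cos(pos * w)
    return out

k = 3
def shift_by_k(enc):
    """Linear map (a rotation per pair) that depends only on the offset k, not on pos."""
    s, c = enc[0::2], enc[1::2]
    ck, sk = np.cos(k * w), np.sin(k * w)
    out = np.empty_like(enc)
    out[0::2] = s * ck + c * sk    # = sin(w*(pos+k))
    out[1::2] = c * ck - s * sk    # = cos(w*(pos+k))
    return out

print(np.allclose(shift_by_k(pe(10)), pe(13)))   # True
print(np.allclose(shift_by_k(pe(40)), pe(43)))   # True: the same transform works at any position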


Practice prompt

Imagine you’re explaining to a new ML student: “How does a Transformer know which word comes first in a sentence?” Write a short explanation (2-3 sentences) highlighting the role of positional encoding, as if on a forum.


Further reading & resources

  • “Attention Is All You Need” (Vaswani et al., 2017) – The original paper that introduced Transformers. Sections 3.4 and 3.5 discuss positional encoding machinelearningmastery.com machinelearningmastery.com and why they chose the sinusoidal version. If you’re comfortable with some math, it’s worth skimming those parts to see the rationale straight from the authors (they mention the ability to extrapolate to longer sequences as a benefit).

  • The Illustrated Transformer (Jay Alammar blog) – A highly visual explanation of the Transformer architecture. It has a gentle introduction to positional encoding with diagrams jalammar.github.io jalammar.github.io and also covers masking in the decoder jalammar.github.io. This is excellent for building intuition after you’ve read the basics.

  • “You could have designed state-of-the-art positional encoding” (HuggingFace blog, 2024) – An insightful blog post that walks through evolving from simple positional ideas to the Rotary Positional Encoding used in recent models huggingface.co huggingface.co. It’s more advanced, but it’s a goldmine if you want to understand the latest tricks and why they matter for very long sequences (and it recaps the fundamentals on the way).

  • MachineLearningMastery – Gentle Introduction to Positional Encoding (Mehreen Saeed) – A tutorial-style explanation with code examples on how to implement positional encoding in NumPy and visualize it machinelearningmastery.com machinelearningmastery.com. Great for seeing actual values and convincing yourself how the sinusoidal patterns look.

  • Harvard NLP Annotated Transformer – This online resource not only explains the Transformer, but also provides an implementation in PyTorch with comments. It clearly shows how masking is done and how position vectors are added nlp.seas.harvard.edu erdem.pl. If you learn well from code, this is ideal.

  • “On the Effective Visualizations of Positional Encoding” (blog) – This one dives into visualizing the high-dimensional positional encodings to build intuition. It might show, for instance, how different positions’ encodings are orthogonal-ish or how they vary across dimensions. More of an optional curiosity.


Summary (one breath): Transformers need extra help to understand sequences. We give each position a unique code (positional encoding) so the model knows who’s first, second, etc. machinelearningmastery.com. For generation tasks, we also force a one-way flow (no peeking at future tokens) using masks jalammar.github.io. These tricks ensure the model respects word order and the direction of information, much like time’s arrow in a sequence. Without them, a Transformer is order-blind; with them, it can master language and time series by attending not just to what tokens are present but where they are in the sequence. In essence, we’ve built an arrow of time into the model architecture so it can handle the inherently sequential nature of language and temporal data.

