

Causal, Bidirectional, Time
Course 1.5 — Temporal Asymmetry: Causal vs. Bidirectional Transformers
Series: Thermodynamic Time → Entropy → Temporal Encoding in Transformers
Estimated time: 25–30 minutes • Level: Beginner → Intermediate • Format: Read + mini‑labs (Python) + visuals
Why this course matters
Transformers come in two flavors that treat sequence order differently, essentially creating an engineered arrow of time in how they process data:
Causal (autoregressive) models – e.g. GPT-style – process tokens only in one direction (usually left-to-right). They excel at generating sequences one step at a time (like predicting the next word as you type).
Bidirectional (non-causal) models – e.g. BERT-style – process tokens with no fixed forward direction, using context on both sides. They excel at understanding text and filling in blanks (like guessing a missing word in a sentence when you can see the whole sentence).
This temporal asymmetry determines what each model can do easily, what’s hard, and how we train and use them. In this course, you’ll learn how we impose an ordering on Transformers (through masks and training objectives), why it matters for tasks (generation vs. comprehension), and when to choose one over the other. You’ll also run mini-experiments to see how a causal mask changes behavior, how generation differs from fill-in, and how uncertainty or attention shifts when order is restricted. In short, we’ll see why giving a Transformer a past and future (or only a past) makes all the difference in its abilities.
Learning goals
By the end of this course, you can:
Explain causal masking in self-attention and why it forbids “peeking at the future.”
Contrast next-token prediction vs. masked-token prediction objectives, and describe what each model (causal vs. bidirectional) learns and excels at.
Demonstrate with code why causal models are natural text generators, while bidirectional models excel at fill-in and are not straightforward for sequential generation.
Diagnose typical failure modes for each approach (e.g., exposure bias in causal generation, or the awkward iterative process needed to make BERT generate text).
Apply temporal asymmetry concepts to domains beyond text (planning, forecasting, robotics, etc.), understanding when a task demands a one-way model versus a two-way context model.
Plain‑language intuition
Causal LM (left-to-right thinking): Imagine you’re writing a story. You start at the first word and move forward—at each point, you only consider what’s already written (the past), because you haven’t written the future yet. A causal Transformer works exactly like this. It generates one token at a time, always asking: “Given everything I’ve seen so far, what comes next?” This is perfect for tasks like text generation or forecasting, where you don’t know the future and have to create or predict it step by step.
Bidirectional LM (all-context thinking): Now imagine you have a complete sentence with one missing word, like a puzzle. You can look at both the left part and the right part of the sentence to figure out the missing word. A bidirectional Transformer (like BERT) is built for this scenario: during training it sees sentences with some tokens masked out, and it learns to use the full surrounding context to predict those missing tokens. It’s great at understanding meaning and context because it looks both forward and backward around each token—but it’s not naturally equipped to generate a sequence from scratch, since in generation you don’t have a future context to look at.
Both types of models use the same building blocks (self-attention, embeddings, etc.), but the difference in how they treat time (one-way vs. two-way context) gives them distinct capabilities. Think of it like the difference between an author writing a story (one word after another, forward-only) and an editor reviewing a full draft (looking at the whole text to find the best word to fill a blank or fix a sentence). We need to decide which “time view” to give our Transformer depending on the task.
Core concepts (the “puzzle pieces”)
Causal masking – enforcing the arrow of time: In a causal Transformer, we impose a strict rule: each token can only attend to itself and tokens before it in the sequence, never to tokens that come later. We implement this via a mask on the attention matrix: any attention scores that involve a “future” token (relative to a given position) are set to a very low value (negative infinity, effectively), so that after the softmax, those positions get zero weight. The result is a lower-triangular attention pattern. This masking creates an arrow of time in the model – information flows from left to right. During training, this prevents the model from cheating (using information from the future output it’s trying to predict), and during generation, it ensures the model only uses what it’s seen so far. Without this mask, self-attention is completely agnostic to order (it would treat the sequence like a bag of words, as you learned in earlier modules), so the mask is what breaks that symmetry and introduces causality.
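To make the mechanism concrete, here is a minimal numpy sketch of one attention head with a causal mask (the matrices Q, K, V and the head size d are random placeholders, not taken from any real model); Mini-Lab 0 below pokes at the same idea interactively.
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                                      # toy sequence length and head dimension
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))   # random queries, keys, values

scores = Q @ K.T / np.sqrt(d)                    # raw attention logits, shape (L, L)
causal_mask = np.triu(np.ones((L, L)), k=1)      # 1 wherever column j lies in row i's future
scores[causal_mask == 1] = -1e9                  # "minus infinity" so softmax gives ~0 weight there

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax; each row is lower-triangular
output = weights @ V                             # every position mixes only itself and earlier tokens
print(np.round(weights, 2))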
Training objectives shape capability: The way we train a model dictates what it’s good at. A causal language model (CLM) is trained with a next-token prediction objective. This means given a sequence, it learns to predict each token in order, using only the previous tokens (thanks to the mask). Essentially, it learns to model $P(x_t \mid x_{<t})$ for each position t. Models like GPT are trained this way, which makes them naturally suited to generate text by sampling one token at a time from $P(x_{t+1} \mid x_{\le t})$. A masked language model (MLM) (bidirectional, like BERT) is trained differently: we randomly hide some tokens in the input and train the model to predict those missing tokens using all the other tokens (both left and right context). This objective teaches the model to understand a full sequence and use context clues to fill in gaps. It’s excellent for tasks where you need holistic understanding (classification, Q&A, etc.) because the model learns to integrate information from the whole sequence. However, because it doesn’t predict the next token in sequence, it doesn’t learn how to continue a sequence in a fluent way. In summary: CLM training = generation skill, MLM training = comprehension skill.
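As a concrete (and deliberately simplified) sketch of how the two objectives differ in their training targets, the snippet below builds next-token pairs for a CLM and masked positions for an MLM; the token IDs, the 15% mask rate, and the -100 "ignore" convention are illustrative choices, and real pipelines add refinements such as special tokens and BERT's 80/10/10 corruption rule.
import numpy as np

rng = np.random.default_rng(0)
tokens = [5, 12, 7, 9, 3, 14]           # a toy sequence of token IDs
MASK_ID = 0                             # stand-in ID for the [MASK] token

# Causal LM: inputs are positions 1..T-1, targets are the same sequence shifted by one.
clm_inputs, clm_targets = tokens[:-1], tokens[1:]
print("CLM (input -> target):", list(zip(clm_inputs, clm_targets)))

# Masked LM: corrupt ~15% of positions and ask the model to recover the originals there,
# using the full left and right context; all other positions are ignored in the loss.
mlm_inputs, mlm_targets = list(tokens), [-100] * len(tokens)   # -100 = "ignore" placeholder
for i in range(len(tokens)):
    if rng.random() < 0.15:
        mlm_targets[i] = tokens[i]      # model must predict the original token here
        mlm_inputs[i] = MASK_ID         # ...from a masked input
print("MLM inputs :", mlm_inputs)
print("MLM targets:", mlm_targets)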
Why a bidirectional model isn’t a natural generator: Suppose you have a BERT-like model that’s really good at predicting missing words in sentences. If you ask it to generate a sentence from scratch, how would it proceed? At the start of generation, there’s no right-hand context yet (nothing “ahead” to look at), which is exactly what a bidirectional model is not trained for. You could try a workaround: give the model an empty or placeholder future context, generate one token, then update the context and repeat. For example, one could attempt to generate by repeatedly doing: mask the next position, have BERT fill it in, append that token, and continue. But this is clunky and inefficient – after each new word, the model has to reconsider the entire sequence afresh. More importantly, the model wasn’t trained to do this iterative process; it was trained to fill in random isolated gaps with the benefit of full context, not to produce a coherent sequence from nothing. As a result, using a vanilla bidirectional model for generation often results in degenerate output (repetition, contradictions, or needing many cycles of re-evaluation). In contrast, an autoregressive model (GPT-style) was explicitly trained to do exactly this one-step-at-a-time generation, so it just naturally rolls forward to produce text. This is why for tasks like open-ended text generation, storytelling, code writing, etc., we almost always choose a causal model.
Recency vs. global context usage: Because a causal model always conditions on a prefix (the past), it often exhibits a recency bias – intuitively, the most recent tokens usually carry a lot of information about what should come next (think of how in English, if you see “the cat sat on the”, you strongly expect “mat” next; the words just before the blank are the biggest clue). Causal Transformers tend to emphasize those recent tokens in their attention patterns. Meanwhile, a bidirectional model can attend to clues on both the left and the right of a position. It doesn’t inherently favor the token immediately before the blank more than the token after the blank; it will learn to use whichever signals are helpful from either side. In practical terms, a causal model might pay disproportionate attention to the last few tokens of the context (since that’s what it has to go on to predict the next token), whereas a bidirectional model, when filling in a mask, might look equally to earlier text and later text surrounding the mask. This difference also means bidirectional models can capture long-range dependencies in a somewhat more direct way (because they can directly look at far-left and far-right context simultaneously), while causal models might indirectly model long-range info through many intermediate steps. Important: Modern Transformers often incorporate positional encoding schemes and architectural tweaks that help with long-range context (for both types of models), but the fundamental distinction remains: one has an inherent one-directional focus, the other has a whole-sequence view.
Evaluation gotchas – avoiding leakage: The directionality of a model isn’t just a training detail; it affects how we evaluate and use the model on real tasks. If you’re doing a task like forecasting stock prices or generating the next sentence in a story, you must ensure your model only has access to past data when making a prediction about the future. Using a bidirectional approach in such cases can lead to information leakage, where the model accidentally uses information from later timesteps when making its predictions, inflating its apparent accuracy and invalidating the results. For example, if you tried to use a BERT-like model for a forecasting task and gave it the whole timeline, it might “peek” at later data points to predict earlier ones (since it has no built-in mask to stop that) – giving you impressively good results that wouldn’t hold up in a true future scenario. Therefore, tasks with a time component require careful chronological training and testing splits. On the other hand, if your task is understanding a fully given text (e.g., sentiment analysis on a review, where the whole review is available), a bidirectional model is fine – there’s no notion of future data to worry about. The key is: align your model’s view with the task’s information flow. Causal models for tasks where the future is unknown, bidirectional for tasks where you can see the whole input at once.
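To make the “careful chronological splits” point concrete, here is a tiny sketch (the synthetic series and the cutoff value are made up) contrasting a leaky random shuffle with a proper time-based split:
import random

random.seed(0)
series = [(t, 100 + t + random.gauss(0, 1)) for t in range(120)]   # (time step, value) pairs

# Leaky for forecasting: a random shuffle lets "future" rows end up in the training set.
shuffled = series[:]
random.shuffle(shuffled)
leaky_train, leaky_test = shuffled[:96], shuffled[96:]

# Correct: train strictly on the earlier part of the timeline, test on the later part.
cutoff = 96
train = [row for row in series if row[0] < cutoff]
test = [row for row in series if row[0] >= cutoff]
print(f"chronological split: train covers t=0..{cutoff - 1}, test covers t={cutoff}..119")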
Visual / diagrammatic intuition
It helps to see what a causal mask looks like versus no mask at all. Below is a matrix representation of self-attention permissions for a sequence of length $L=6$. A 1 means the token in that row can attend to the token in that column; 0 means it cannot (because of masking):
Causal mask (each token sees only itself and earlier tokens):
col→ 1 2 3 4 5 6
row
1 (tok1) [ 1 0 0 0 0 0 ] ← Token1 can only attend to itself (position 1).
2 (tok2) [ 1 1 0 0 0 0 ] ← Token2 attends to positions ≤2 (itself and token1).
3 (tok3) [ 1 1 1 0 0 0 ] ← Token3 attends to positions ≤3 (1,2,3).
4 (tok4) [ 1 1 1 1 0 0 ] ← Token4 attends to positions ≤4 (1,2,3,4).
5 (tok5) [ 1 1 1 1 1 0 ] ← Token5 attends to positions ≤5.
6 (tok6) [ 1 1 1 1 1 1 ] ← Token6 attends to positions ≤6 (everything before or itself).
Bidirectional (no mask, every token attends to every token):
col→ 1 2 3 4 5 6
row
1 (tok1) [ 1 1 1 1 1 1 ] ← Token1 can attend to all positions (even those to its right).
2 (tok2) [ 1 1 1 1 1 1 ] ← Token2 can attend to all positions (1 through 6).
... and so on for each row (all 1’s in every row).
In the causal mask, notice how it’s all 1’s on the diagonal and below (left side), and 0’s above the diagonal. It’s lower-triangular, with the diagonal included because each token also attends to itself. This means any token can see backward (and itself), but not forward. It gives a clear directionality: information flows from earlier tokens to later tokens, but not vice versa. In the bidirectional case, the matrix is full of 1’s, meaning no restrictions – token 6 can see token 1, and token 1 can also see token 6. There’s complete symmetry; the model has no built-in notion of “past” or “future” in the attention layer. This visualization should make it clear how masks define what each position knows about others, which in turn defines the model’s sense of order.
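If you want to reproduce these two permission matrices yourself, two numpy one-liners are enough (a minimal sketch):
import numpy as np

L = 6
causal = np.tril(np.ones((L, L), dtype=int))    # 1s on and below the diagonal, 0s above it
bidirectional = np.ones((L, L), dtype=int)      # no restriction: every token sees every token
print(causal)
print(bidirectional)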
Mini‑Lab 0 — Causal mask in action (2–3 min): Let’s verify the effect of a causal mask on attention weights with a small example. We’ll create a random attention score matrix and apply a causal mask to it, then compare a particular token’s attention distribution with and without the mask.
import numpy as np
L = 5
rng = np.random.default_rng(42)
scores = rng.normal(size=(L, L)) # random attention logits for a 5-token sequence
# Create a causal mask: 1 for positions that should be masked (j > i), else 0.
mask = np.triu(np.ones((L, L)), k=1) # upper triangular matrix above the diagonal
# Apply the mask: set scores of future positions to a very large negative number
scores_masked = scores.copy()
scores_masked[mask == 1] = -1e9 # effectively -∞
# Define softmax for each row
def softmax(x):
e = np.exp(x - x.max(axis=-1, keepdims=True))
return e / e.sum(axis=-1, keepdims=True)
# Compare attention weights for token at row 3 (the 4th token, index 3)
print("Attention for token 4 (no mask) :", np.round(softmax(scores)[3], 3))
print("Attention for token 4 (causal) :", np.round(softmax(scores_masked)[3], 3))
Expected output: The first line shows token 4’s attention distribution when it could attend to everyone (no mask). The second line shows its attention when restricted causally. In the masked case, you should see that any weight on position 5 (the only position to its right in this 5-token example) has been eliminated (set to 0), and the attention has been renormalized over tokens 1 through 4. For example, if without masking token 4 was placing, say, 25% of its attention on token 5, with the mask that weight goes to 0 and the remaining 75% is rescaled to sum to 1 across tokens 1–4. This confirms that the mask forces the model to ignore future tokens entirely.
Mini‑Lab A — Autoregressive generation vs. fill‑in (7–8 min): This experiment illustrates the practical difference between a model that generates text sequentially and one that fills in blanks. We’ll use a tiny corpus of text to build two simple models:
A forward bigram model: it learns $P(\text{next word} \mid \text{current word})$ from the corpus, akin to a causal LM that can sample the next word given the current word.
A bidirectional fill-in model: given a word on the left and a word on the right, it guesses a word that could plausibly fit in the middle. (This will be a crude heuristic using bigram probabilities, just to mimic the idea.)
from collections import defaultdict, Counter
import random
random.seed(0)
corpus = "the cat sat on the mat the cat sat on the mat the dog sat on the log".split()
# Build bigram frequency tables for forward (word -> next word) and backward (word -> previous word).
# (Only the forward table is used below; the backward table is built just to show the symmetric idea.)
forward_counts = defaultdict(Counter)
backward_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
forward_counts[w1][w2] += 1
backward_counts[w2][w1] += 1
# Function to sample next word given current word based on learned frequencies
def sample_next_word(curr_word):
if curr_word not in forward_counts:
return random.choice(corpus)
# Randomly choose a next word proportionally to observed frequencies
next_words = list(forward_counts[curr_word].items()) # list of (word, count)
total = sum(count for _, count in next_words)
r = random.random() * total
cum = 0
for word, count in next_words:
cum += count
if cum >= r:
return word
# Autoregressive generation: start with a seed and keep sampling next words
def generate_sequence(seed="the", length=8):
seq = [seed]
for _ in range(length - 1):
seq.append(sample_next_word(seq[-1]))
return " ".join(seq)
# Fill-in prediction: guess the middle word given a left and right word
def predict_middle(left_word, right_word):
best_guess = None
best_score = 0
# Try every word in vocab as a candidate
for w in set(corpus):
# Estimate P(w | left) as frequency(left->w) and P(right | w) as frequency(w->right)
if left_word in forward_counts:
p_left = forward_counts[left_word][w] / sum(forward_counts[left_word].values()) if w in forward_counts[left_word] else 0
else:
p_left = 0
if w in forward_counts:
p_right = forward_counts[w][right_word] / sum(forward_counts[w].values()) if right_word in forward_counts[w] else 0
else:
p_right = 0
score = p_left * p_right
if score > best_score:
best_score = score
best_guess = w
return best_guess
# Test the generation vs fill-in
print("Causal generation (starting with 'the'):", generate_sequence("the"))
print("Fill-in for 'the [MASK] sat':", predict_middle("the", "sat"))
print("Fill-in for 'on [MASK] mat':", predict_middle("on", "mat"))
What to observe:
The causal generation line will output a random 8-word sequence generated from our tiny corpus statistics, for example something like: "the cat sat on the mat the dog". It looks somewhat repetitive due to the small corpus, but it’s a valid left-to-right sequence. This is analogous to how a GPT model generates text – one word after another, based on what came before.
The fill-in results will show what our simple model predicts for a blank in a context. For "the [MASK] sat", it should guess "cat" (since “the cat sat” is common in our data). For "on [MASK] mat", it will likely guess "the" (because “on the mat” appears). This mimicry of BERT shows how a bidirectional model uses both left and right context: e.g., to fill “[MASK]” in “on [MASK] mat”, it knows a word that makes sense with “on ___ mat” must lead into “mat” and also follow “on”, which “the” does.
Takeaway: The forward model can naturally continue the sequence given a starting word (like an ongoing thought), but it has no mechanism to fill in a missing middle word because it never trained to do that. The bidirectional-like fill-in can suggest a missing word given full context, but it doesn’t naturally continue beyond that single insertion. This lab demonstrates why causal models are used for generation (they extend text easily) and bidirectional models for understanding or in-filling (they excel at using surrounding context to predict a piece).
Mini‑Lab B — Why generating with a bidirectional model is hard (4–5 min): To solidify the above point, let’s outline (in pseudocode or thought experiment) how one might try to generate text with a BERT-like model, and why it’s inefficient:
Start with an empty sequence or a special start token.
We want to generate, say, a 5-word sentence. We could initialize a sequence with 5 mask tokens: [MASK] [MASK] [MASK] [MASK] [MASK].
Now, feed this sequence into the bidirectional model. It will treat this as “all context” (which is mostly blanks). It might predict something for each mask, but let’s say we decide to fill the first mask position with the highest-probability word it predicts for that position.
Suppose it predicts “[MASK]_1 = The”. Now our sequence is: The [MASK] [MASK] [MASK] [MASK].
To decide the second word, we feed the sequence in again (with masks still at positions 2–5). The model sees “The ____ ____ ____ ____” and suggests a fill for the next mask (which could be any of them, but you might choose a left-to-right filling strategy).
We set the second word, then repeat for the third, etc., each time rerunning a full forward pass of BERT on a mostly masked sequence.
This process is clearly cumbersome: every one of the five tokens requires a full re-encoding of the whole (mostly masked) sequence. A causal model also emits one token per step, but each step is exactly the incremental prediction it was trained for, and implementations can cache the computation for already-seen tokens instead of redoing it. Moreover, because BERT wasn’t trained to generate text in this manner, there’s no guarantee of coherence – the model might not maintain a consistent style or context through each independent fill-in operation, and errors can compound (since it’s not used to conditioning on its own partially generated outputs). In practice, research has gone into making bidirectional models generate (such as adding a separate decoding mechanism, or using approaches like masked-to-unmasked sequence generation in iterations), but these are advanced techniques beyond a vanilla BERT. The bottom line: if you need fluent text generation, use a model that was built for that (causal/autoregressive), rather than trying to jury-rig a bidirectional model to do a job it wasn’t trained for.
(Feel free to actually try a simplified version of this with our predict_middle function: for instance, generate a sentence by repeatedly filling in the next mask at the end of the sequence. You’ll likely find the output gets stuck or doesn’t make much sense, reinforcing why we don’t generate text with plain BERT.)
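Here is one rough way to run that thought experiment in code, reusing corpus, random, and predict_middle from Mini-Lab A (run that cell first); the choice to pin the first and last word and the fallback to a random corpus word are arbitrary assumptions, which is part of the point.
def fill_in_generate(first="the", last="mat", n_masks=3):
    seq = [first] + ["[MASK]"] * n_masks + [last]
    while "[MASK]" in seq:                        # one full "pass" per remaining mask
        for i, tok in enumerate(seq):
            if tok == "[MASK]" and seq[i - 1] != "[MASK]":
                # The right neighbor may itself still be masked; falling back to the final
                # word is a crude placeholder and one reason this scheme works so poorly.
                right = seq[i + 1] if seq[i + 1] != "[MASK]" else last
                guess = predict_middle(seq[i - 1], right)
                seq[i] = guess if guess is not None else random.choice(sorted(set(corpus)))
                break
    return " ".join(seq)

print(fill_in_generate())
On this tiny corpus the fill-in heuristic rarely has a confident candidate for these half-masked contexts, so the output tends to come out as disjointed filler – which is exactly the failure mode described above.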
Mini‑Lab C — Recency bias demonstration (optional, 4–5 min): Let’s examine the recency bias in a causal setup. We’ll create another random attention score matrix for a slightly longer sequence and introduce a bias that simulates “preferring recent tokens.” Then we’ll apply a causal mask and see how the attention distribution skews toward the end of the sequence.
import numpy as np
L = 8
rng = np.random.default_rng(1)   # a fresh generator (reusing Mini-Lab 0's rng would work too)
scores = rng.normal(size=(L, L)) # fresh random attention logits for 8 tokens
# Apply causal mask to forbid attending to future tokens
mask = np.triu(np.ones((L, L)), k=1)
scores[mask == 1] = -1e9
# Now add a recency bias: penalize attention to far-back tokens
alpha = 0.3 # strength of penalty per step of distance
i_idx = np.arange(L).reshape(-1, 1)
j_idx = np.arange(L).reshape(1, -1)
distance = np.maximum(0, i_idx - j_idx) # how many steps back j is from i (0 if j is not behind i)
scores_with_bias = scores - alpha * distance # subtract penalty: farther back => larger distance => lower score
attn = softmax(scores_with_bias)  # row-wise softmax helper defined in Mini-Lab 0
print("Attention for last token (index 7) with recency bias:")
print(np.round(attn[7], 3))
Expected: The printed vector is the attention weights for token 8 (index 7), after we applied the causal mask and a recency penalty. You should see the weights generally increasing toward the end of the vector: the largest values on the last few positions (indices 5–7, i.e., the token itself and its immediate predecessors) and values near zero at indices 0 and 1 (the random logits add some noise, so the trend won’t be perfectly monotonic). This is an exaggerated simulation, but real trained causal models often show a strong tendency to attend most to the immediately preceding token and recent history, with diminishing focus on far-back tokens (unless those tokens are crucial like the start of a story setting the theme). A bidirectional model, by contrast, if we did a similar analysis (without any causal mask and bias), might distribute attention based on content and relevance rather than just position recency, since it doesn’t have an inherent directional bias injected by the architecture.
Quick misconceptions to retire
“Bidirectional Transformers are strictly better than causal Transformers.”
This is a misunderstanding. They are better at different things. A bidirectional model is great for tasks where you have the full context and need to understand or classify it (it sees everything at once, which generally yields richer representations for that purpose). A causal model is necessary for tasks where you need to produce output sequentially or respect a time flow (it’s the only choice when future data isn’t available at input time). Many real applications even use both: e.g., a system might use BERT to understand a query and GPT to generate a response. So it’s not about one being universally better, but which is appropriate for the task’s information structure.
“Causal models (like GPT) don’t understand context as well because they only look one way.”
It’s true that bidirectional models have the advantage of seeing both left and right context for a given token, which can make certain predictions easier (like understanding a word with ambiguous meaning by looking at the following words). However, causal models can still develop a deep understanding of context – they just do it in a unidirectional manner. Over many layers, a causal Transformer can propagate information from far earlier tokens to later ones, effectively building an understanding of the whole sequence as it goes. For instance, GPT-3 can grasp context from hundreds of tokens away to generate a coherent continuation. The “one-way” restriction means it might need more layers or training data to infer some relationships, but it doesn’t mean the model is blind to anything beyond the immediate next token. It simply has to learn to carry information forward internally. In practice, large causal LMs demonstrate very strong understanding; they just operate with the limitation that they can’t revise their understanding with future tokens – they have to predict on the fly.
“The attention mask is just an implementation detail; as long as the model sees positions, it knows the order.”
Positional encodings (from the last module) give tokens a sense of position, but without a causal mask, the model would still be bidirectional in how it can use that information. The mask is absolutely fundamental for tasks that require a direction. It’s not just a minor tweak – it changes the entire behavior of the model. With no mask, self-attention could, for example, notice a word at the end of a sentence and use it to interpret an earlier word (which is what BERT does). With a causal mask, that’s impossible – the model must deduce everything from the past. So think of the mask as part of the model’s design, not a low-level hack. It’s what separates a Transformer used for generation from one used for bidirectional analysis.
“We can evaluate time-series models or future-predicting models in the same way as any other ML model.”
Caution: when working with temporal or sequential data, you have to respect the timeline. That means if you’re evaluating a model that predicts future values, you should never train on future data relative to your test set (use time-based splits), and you shouldn’t feed future inputs into the model during inference. A common mistake is taking a dataset with time stamps, shuffling it randomly, training a bidirectional model, and being impressed by its accuracy – only to realize it was effectively using future information to predict the past (a form of data leakage). Causal models and proper masking prevent that by design. Always ensure your evaluation method matches the reality of the task’s time flow: e.g., for forecasting, you’d train on 2010–2019 data and test on 2020, so the model is truly predicting unknown future points.
Check your understanding
Why does a Transformer need a causal mask to generate text, instead of just learning order from positional encodings?
Answer: Without a causal mask, self-attention is permutation-invariant – the model could technically pay attention to tokens in any order, including looking ahead. Positional encodings alone tell the model where a token is, but not which tokens it’s allowed to use for prediction. The causal mask explicitly forbids using future tokens. This aligns the model with the generation process (left-to-right). In short, positional encodings give a sense of order, but the causal mask enforces the directional flow needed for proper autoregressive generation.
Write the factorization of the joint probability $P(x_1, x_2, \ldots, x_T)$ that a causal language model learns (for a sequence of tokens $x_1$ to $x_T$).
Answer: A causal language model factorizes the joint probability as a product of conditionals in one direction:
$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \ldots, x_{t-1}).$
In words, the probability of the whole sequence is the probability of $x_1$ times the probability of $x_2$ given $x_1$, times $P(x_3 \mid x_1, x_2)$, and so on, up to $P(x_T \mid x_1, \ldots, x_{T-1})$. (For $x_1$, since there are no previous tokens, it’s often conditioned on a start-of-sequence token or just treated as $P(x_1)$.)
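As a tiny worked example of this factorization, the sketch below scores a sentence with the forward bigram tables from Mini-Lab A (run that cell first so corpus and forward_counts exist); under the bigram approximation each conditional is truncated to just the previous word.
import math

def bigram_logprob(words):
    # log P(w1..wT) ≈ log P(w1) + sum_t log P(w_t | w_{t-1}) under the bigram approximation
    logp = math.log(corpus.count(words[0]) / len(corpus))        # crude unigram estimate for the first word
    for prev, curr in zip(words, words[1:]):
        total = sum(forward_counts[prev].values())
        logp += math.log(forward_counts[prev][curr] / total)     # P(curr | prev); unseen pairs would error out
    return logp

print(bigram_logprob("the cat sat on the mat".split()))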
Why can’t we directly use a standard bidirectional model (like BERT) to generate long texts in a left-to-right fashion?
Answer: Because a bidirectional model isn’t trained to do one-step-at-a-time prediction. It expects to have both left and right context for any position it’s trying to predict. When generating text left-to-right, once you reach a certain position, there is no “right context” yet (nothing has been generated beyond that point). A BERT-like model doesn’t know how to proceed in that scenario, at least not without some special method. If you try to force it by masking one position at a time and filling it in, you have to repeatedly re-run the model and it’s likely to produce inconsistent or repetitive results. In short, bidirectional models lack the mechanism to carry forward state or outputs – they’re not sequencers, they’re gap-fillers on a fixed canvas. Causal models are sequencers by design.
Practice prompt
Prompt: A stakeholder says, “We have a really powerful BERT language model that understands our documents. Can we just use that to generate our monthly reports automatically? It knows the language well, right?” How would you explain the limitation here and guide them toward a better solution?
In your answer, explain in simple terms why a model like BERT, which reads text with both left and right context, isn’t suited to generate new text word-by-word. Mention that BERT is like a reader or analyst, not a writer – it can fill in blanks or understand a whole document, but it doesn’t know how to start from the beginning and write out a document since it was never trained to do that. Point out that for generation, we use causal models (like GPT) that are specifically trained to produce text one word after another. If relevant, note that there are ways to get BERT to generate text (for example, by adding a generative component or doing something called “sequence-to-sequence” fine-tuning), but those are essentially turning it into a different kind of model. Conclude by recommending the appropriate approach (e.g., use a GPT-style model or a dedicated report-generation model) for the task of generating text.
(This exercise helps ensure you can articulate the differences in non-technical terms – a key skill when explaining AI capabilities to product or business stakeholders.)
Cross‑Disciplinary Applications (beyond NLP)
Cognitive Psychology — forward recall in memory: Human memory for sequences shows a phenomenon known as the forward contiguity effect. If you memorize a list and then freely recall items, you’re more likely to recall an item and then follow it with the very next item from the original list, rather than a previous item. In other words, we tend to go forward in time when remembering sequences. This is like an internal causal model – once we start recalling, we naturally continue forward. A Transformer trained causally exhibits a similar behavior: it generates or recalls ideas in forward order. On the other hand, when we try to fill in a memory gap (like trying to remember a missing detail from an event), we use context from before and after that gap (if we have it) – analogous to a bidirectional model filling in a blank. Takeaway: The direction in which a model (or a human) processes information can bias the way it recalls or predicts information. Causal Transformers align with the human tendency to recall/ predict forward in time.
Robotics and Control — real-time decision-making: In robotics or any control system (autonomous driving, industrial control, etc.), decisions must be made using current and past sensor data, since future data hasn’t happened yet. This is inherently a causal scenario. A planning algorithm or policy network that only uses past information is like a causal Transformer – it can be safely deployed in real-time without magically using future inputs. If you tried to use a bidirectional model in real-time decision-making, it would be as if the robot was somehow looking into the future, which is impossible; in practice, you’d be giving it information that wouldn’t be available live, causing a train-test mismatch. Bidirectional models can still be useful in robotics offline – for example, analyzing a complete trajectory after the fact to identify where something went wrong (since you have the full sequence of states). But for online operation, you need that arrow-of-time restriction. Takeaway: When an AI system needs to act in the moment (like a robot), its model should respect the flow of time (causal) to be realistic and safe.
Time-Series Forecasting — avoiding data leakage: In fields like finance, weather, or healthcare, we often train models on historical time-series data to predict future trends. It’s crucial to not “cheat” by using future data when predicting the present or near future. Causal Transformers naturally fit here because their architecture won’t allow using any future timestep’s data when predicting the current timestep. If we used a bidirectional Transformer naively on time-series, we’d have to be extremely careful to mask out future timesteps during training, otherwise the model might just use them to get trivial accuracy. In fact, many time-series models explicitly use a causal structure or at least ensure no future info is used (even classical ones like ARIMA follow this principle). Additionally, when evaluating forecasts, we must simulate the real scenario – e.g., always predict on data from earlier dates to later dates. Takeaway: The causal vs. bidirectional distinction isn’t just about NLP – it’s about whether your model is allowed to use future information. In forecasting, the answer must be “no,” and thus a causal setup aligns with the problem structure.
Policy, Law, and Ethics — controlling information flow: In some applications, using future information is not just impossible, but also improper. Consider algorithmic decision-making in legal or policy contexts (say, an AI judge assistant or a loan approval system). It would be unfair or against policy for the model to use information that wasn’t available at decision-time (imagine an AI that somehow uses data from after the decision point to make the decision – that’s leakage in time and could bias results). By using a causal model or enforcing a causal mechanism, we can ensure the system only considers past and present information for making a decision, which is transparent and acceptable. A bidirectional model might mix in future data (even if that “future” is just later in a document or later in a case file) in ways that are not allowed. For instance, when reviewing the text of a law, a bidirectional model could theoretically use knowledge of a later clause to interpret an earlier clause – which a human judge wouldn’t do if they’re reading in order. This is a bit abstract, but the point is that choosing a model with the right temporal processing can be a tool for governance. Takeaway: Enforcing an arrow of time in AI models isn’t just a technical choice, it can also be about aligning with human norms of fairness and causality in decision-making.
Physics Analogy — symmetry vs. asymmetry in time: In physics, many fundamental laws (like Newton’s laws or Maxwell’s equations) are time-symmetric – they don’t care if time runs forward or backward. Yet, the reason we experience a forward arrow of time is due to initial conditions and entropy (the second law of thermodynamics states that entropy tends to increase, giving a direction to time). Similarly, the Transformer’s core mechanism is time-symmetric; by default, self-attention doesn’t prefer past or future – it treats the sequence collectively. To make it useful for tasks that have a direction (like language, which we produce left-to-right, or any causal process), we have to impose asymmetry (with masks and directed training). In a way, the training objective and masking act like the “entropy/arrow of time” for the model, forcing it to develop internal dynamics that respect a forward direction. Without these, a Transformer has no built-in arrow of time and would fail at tasks like generation because it would treat future and past as interchangeable (which is like a physics world where effects could precede causes – chaotic and unphysical!). Takeaway: Often, to get meaningful behavior, whether in physics or machine learning, we need to break symmetry and introduce the proper constraints that reflect how the world (or our task) works. The causal vs. bidirectional choice is a prime example of engineering the right asymmetry into the model.
Reading & watching
Vaswani et al., 2017 – “Attention Is All You Need” – The original Transformer paper. See how they implement masked self-attention for the decoder part of the model. It’s a foundational reference for why the mask is needed in generation, and it also introduced positional encodings for order.
Devlin et al., 2018 – “BERT: Pre-training of Deep Bidirectional Transformers” – The paper that introduced BERT. It explains the masked language modeling objective (and next sentence prediction) that enabled training a bidirectional Transformer. The discussion in Section 3 highlights how BERT is different from a left-to-right model and why those differences matter.
Illustrated Transformers (e.g., blogs by Jay Alammar) – These visual guides show how attention masks work. They often include diagrams of the triangular causal mask and explain in intuitive terms how information flow is restricted or allowed. This can reinforce your understanding of masked vs. unmasked self-attention.
“Causal LMs vs. Masked LMs” tutorials and articles – There are many online resources (Medium articles, StackExchange answers) that directly compare autoregressive (causal) training and bidirectional training. These can provide additional examples and analogies – for instance, some compare it to reading text with a sliding window vs. filling blanks in a completed text.
Transformer-XL, GPT-2/3 (autoregressive) and BART, T5 (encoder-decoder) papers – If you’re interested in advanced models: Transformer-XL (Dai et al., 2019) shows how to handle very long sequences with a causal model (introducing recurrence to carry state beyond a single segment). GPT-2/3 demonstrate the power of pure causal models on huge scales. BART and T5 are sequence-to-sequence models that cleverly combine a bidirectional encoder (to fully understand input) with a causal decoder (to generate output) – effectively using the right tool for the right part of the job.
Mask prediction for generation research – For the curious, look up methods like MaskGIT or GLM (Generalized Language Model) which explore non-left-to-right generation (often by filling in multiple masks iteratively). These are beyond the scope of this course, but they show how researchers try to blend the strengths of both approaches.
Summary
Transformers don’t inherently know which way time flows in a sequence – we have to tell them. Causal Transformers do this by looking only to the past (using a mask) and are trained to predict the next token, making them ideal for any task where you generate or forecast step by step. Bidirectional Transformers drop the mask and look both directions, which makes them excel at understanding context and filling in missing pieces, but they can’t generate forward sequences without special tricks. Choosing between them comes down to your task: Do you need an output in a real-time or sequential order? Go causal. Do you need deep understanding of an entire piece of data? Bidirectional is your friend.