

Course 1.2 — Uncertainty, Information, and Shannon Entropy
Estimated time: ~25 minutes • Level: Beginner → Intermediate • Format: Read + mini-labs + visuals
Why this module matters (in one minute)
How does a text message differ from random noise? Why can’t you compress a truly random file to something shorter? The answers lie in Shannon entropy, which quantifies uncertainty (or “surprise”) in information. This concept is the bridge between thermodynamics and information theory: it’s the same mathematics of entropy, applied to communication and data. Understanding Shannon entropy is crucial for everything from data compression (ZIP files) to machine learning (model predictions). In this module, you’ll learn what information really means in a quantitative sense (“information is uncertainty reduction”), see how patterns in data reduce entropy, and get a glimpse of how this ties back to physical entropy and forward to how AI models handle predictability quantamagazine.org.
Learning goals
By the end of this module, you can:
Define Shannon entropy in plain language and mathematically, as a measure of uncertainty or surprise in a set of possible messages quantamagazine.org.
Explain with examples how patterns or constraints in data (e.g. predictable structures in English text) lower entropy, and how that relates to making communication more efficient quantamagazine.org quantamagazine.org.
Connect Shannon’s entropy formula to Boltzmann’s: understand that if all outcomes are equally likely, Shannon’s entropy (per message) looks just like thermodynamic entropy (per microstate) quantamagazine.org. Also, recall the anecdote of John von Neumann advising Shannon to use the term “entropy” quantamagazine.org.
Demonstrate basic calculations of entropy for simple scenarios (coins, dice, text), and see the link between entropy and the minimum bits needed to communicate or compress that data quantamagazine.org quantamagazine.org.
Appreciate that Shannon entropy sets a fundamental limit: no clever coding can compress data below its entropy without losing information quantamagazine.org quantamagazine.org. This is analogous to the Second Law — there’s a “no-go” limit on compression like there’s a no-go direction in time without added work.
Plain-language intuition
The coin flip game. Imagine a game: I flip a coin and you have to tell someone the outcome. If the coin is a normal 50/50 coin, you have some uncertainty — the result could be “Heads” or “Tails” with equal chance. On average, you’ll need to send 1 bit of information (“H” or “T”) to communicate it. Now imagine the coin is double-headed (always lands Heads). Before I even flip, you know it will be Heads. There’s zero uncertainty and thus zero information needed to communicate the outcome (your friend already knows it too!). Shannon entropy H is basically asking: “On average, how many yes/no questions (bits) would I need to identify the outcome?” quantamagazine.org quantamagazine.org. More uncertainty = more bits.
Surprise and information. Intuitively, a message is informative if it’s surprising. If someone texts you “hello” for the 100th time, it’s not very informative (highly predictable, low entropy). But a totally random string of characters is surprising and thus carries a lot of information in Shannon’s sense (high entropy – though it might be meaningless). Shannon entropy captures this by looking at all possible messages and their probabilities: if one out of N outcomes is certain, H = 0 (no surprise). If all N outcomes are equally likely, that’s the maximal surprise situation (you have no idea what will happen), and entropy is high quantamagazine.org quantamagazine.org. Formally, for equally likely outcomes, H = log₂ N bits. For example, 8 equally likely outcomes = 3 bits of entropy (since 2³ = 8).
Patterns reduce entropy. Think of reading in English: the letter “Q” is almost always followed by “U”. So if you see a “Q”, your uncertainty about the next letter is greatly reduced. Because of patterns like this (plus uneven letter frequencies like E being common), English text has lower entropy than a random string of letters quantamagazine.org quantamagazine.org. Shannon famously estimated English’s entropy around 1–2 bits per character (depending on assumptions), much less than the ~4.7 bits/letter for 26 equally likely letters quantamagazine.org. In other words, English is redundant/predictable, which is why we can compress it (ZIP files) and why autocorrect/predictive text works. This is directly analogous to physical entropy: constraints (like “Q must be followed by U”) reduce the space of possibilities, which is like reducing the number of microstates – hence lower entropy.
Key perspective: Shannon entropy measures ignorance or uncertainty about a message quantamagazine.org. It’s not about the meaning of the message, but about how much you don’t know before it’s revealed. A higher entropy source (e.g. random noise) is harder to predict/compress because you’re more ignorant about what’s coming, while a structured low-entropy source (like a repetitive signal) is easier to predict/compress (you had some “knowledge” built-in due to the pattern).
Core concepts
Information as uncertainty reduction: In Shannon’s framework, the “information” gained when an event occurs is directly related to how unlikely that event was. If something very unexpected happens, you gain a lot of information (because you weren’t sure about it at all) quantamagazine.org quantamagazine.org. If something obvious happens, you gain little or no information. Formally, the information content of a specific outcome x with probability p(x) is defined as $I(x) = -\log₂ p(x)$ bits quantamagazine.org. (So a 1/1024 chance event gives 10 bits of info if it occurs, whereas a guaranteed event gives 0 bits.) Shannon entropy H is the average of $I(x)$ over all possible outcomes: $H = -\sum_x p(x) \log₂ p(x)$ plato.stanford.edu. This is the formula that measures the uncertainty before knowing which outcome occurs.
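A minimal sketch of these two definitions in Python (the helper names surprisal and entropy are just illustrative, mirroring the formulas above):
import math

def surprisal(p):
    # information content of a single outcome with probability p, in bits: log2(1/p)
    return math.log2(1 / p)

def entropy(probs):
    # average surprisal, weighted by each outcome's probability
    return sum(p * surprisal(p) for p in probs if p > 0)

print(surprisal(1 / 1024))            # 10.0 bits: a 1-in-1024 event is very informative
print(surprisal(1.0))                 # 0.0 bits: a certain event tells you nothing
print(round(entropy([0.9, 0.1]), 3))  # 0.469 bits: average surprise of a 90/10 coin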
Entropy of a distribution: Shannon’s formula $H(X) = -\sum_i p_i \log₂ p_i$ quantifies the uncertainty in a random variable X that takes values i with probabilities p_i. Key points:
H is maximized when all outcomes are equally likely (max uncertainty) plato.stanford.edu. In that case $H = \log₂ N$ (bits) for N outcomes, matching the form of Boltzmann’s entropy $S = k_B \ln W$ when every microstate is equally likely quantamagazine.org. In fact, if all probabilities p_i = 1/Ω, then $H = \log₂ Ω$, which aside from the log base and the factor of Boltzmann’s constant is the same as thermodynamic entropy. Shannon deliberately chose the term “entropy” because of this analogy (famously, John von Neumann told him: “Nobody really knows what entropy is, so if you call your quantity entropy, you’ll win every argument!” quantamagazine.org).
H goes to 0 when one outcome has p=1 (and others 0). Zero entropy means no uncertainty — you know exactly what will happen; thus no information is gained when it occurs quantamagazine.org. This corresponds to a completely ordered situation in thermodynamics (only 1 microstate possible).
Patterns, redundancy, and conditional entropy: If a source has structure (like English text or a predictable weather pattern), the entropy per symbol is lower. You can formalize this with conditional entropy: e.g. $H(\text{next letter} \mid \text{previous letters})$ for English is much lower than the entropy of a random letter, because previous context reduces uncertainty quantamagazine.org. Shannon introduced the concept of redundancy: the fraction of information that is predictable from context. For English, he estimated about 50% redundancy – meaning half the letters are in some sense predictable from the rest quantamagazine.org. This is why you can often guess a word before it’s finished, and why compression works (it removes those predictable bits). In thermodynamics, analogously, if I tell you a constraint (like “the gas’s energy is fixed” or “Q is followed by U”), the entropy is calculated over the reduced set of possibilities consistent with that constraint, which lowers it.
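To make conditional entropy concrete, here is a toy sketch (not Shannon’s actual estimation procedure; the sample phrase and function names are just illustrative) that estimates H(next letter | current letter) from bigram counts and compares it to the plain letter entropy:
import math
from collections import Counter

def letter_entropy(text):
    # H(letter): entropy of single-letter frequencies
    counts = Counter(text)
    n = len(text)
    return sum(c/n * math.log2(n/c) for c in counts.values())

def conditional_entropy(text):
    # H(next | current) = -sum over bigrams of p(current, next) * log2 p(next | current)
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    n_pairs = len(text) - 1
    h = 0.0
    for (current, nxt), c in pair_counts.items():
        p_pair = c / n_pairs                          # joint probability of the bigram
        p_next_given_current = c / context_counts[current]
        h -= p_pair * math.log2(p_next_given_current)
    return h

sample = "the rain in spain stays mainly in the plain".replace(" ", "")
print(round(letter_entropy(sample), 3), "bits/letter with no context")
print(round(conditional_entropy(sample), 3), "bits/letter given the previous letter")
On a sample this tiny the conditional estimate is biased low, but the direction of the effect (knowing the previous letter reduces uncertainty about the next one) is the point.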
Shannon’s source coding theorem: This fundamental result (we won’t prove it here) states that H bits per symbol is the optimal compression limit for a source. In other words, you can compress data to about H bits of information per original symbol without losing any information, but not further quantamagazine.org quantamagazine.org. If you try to compress it more than that, you must be throwing away information (lossy compression) or it just won’t work reliably. This is analogous to a “limit” like the Second Law: it’s a one-way guard rail. For example, if you have a source with H = 2 bits per symbol, you can’t losslessly encode symbols in <2 bits on average – you’ll inevitably get errors or need a lossy scheme. Shannon’s theorem thus gives a quantitative ceiling on how much we can remove “wasted space” in data. It’s why random noise (max entropy) is incompressible (it’s already at the limit) quantamagazine.org, and why highly structured data (low entropy) can be squashed dramatically (e.g. a text file full of spaces compresses a ton, because it’s mostly predictable).
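You can watch this guard rail in action with Python’s standard-library zlib (a DEFLATE compressor, which is not entropy-optimal but illustrates the point). This is only a sketch; the exact sizes it prints will vary:
import os
import zlib

random_data = os.urandom(100_000)                      # near-maximum-entropy bytes
structured_data = b"the cat sat on the mat. " * 4_000  # highly redundant text

for name, data in [("random", random_data), ("structured", structured_data)]:
    compressed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes "
          f"({100 * len(compressed) / len(data):.1f}% of original)")
The random bytes typically come out a few bytes larger after “compression” (the entropy limit is already saturated), while the repetitive text collapses to a tiny fraction of its original size.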
Historical connection: In 1948 Claude Shannon was working at Bell Labs on transmitting messages efficiently and reliably over noisy channels quantamagazine.org. He formulated entropy in terms of probabilities of messages and laid the foundation of modern information theory. At first glance, it had “nothing to do with steam engines,” he wrote, but if the probabilities are uniform, Shannon’s entropy formula becomes identical to Boltzmann’s quantamagazine.org. This deep link suggested that entropy is a fundamental concept bridging physics and information. E.T. Jaynes later developed this connection further, viewing thermodynamics as a problem of inference (the Principle of Maximum Entropy says that, given partial information, the best guess distribution is the one with maximum entropy consistent with what you know quantamagazine.org).
Visual intuition: Probability distributions & entropy
Consider three different probability distributions for a single event:
Certain outcome: P(A)=1.0 (always A). – Entropy H = 0 bits (no uncertainty).
Biased outcome: P(A)=0.9, P(B)=0.1. – Entropy is >0 but not max. (Mostly certain but a little surprise if B happens.)
Fair outcome: P(A)=0.5, P(B)=0.5. – Entropy is 1 bit (maximum for two outcomes, since it’s equally likely).
We can visualize probability as bars and entropy as a measure of “spread”. The fair coin (two equal bars) has the greatest spread/uncertainty quantamagazine.org. The biased coin has less spread. The certain case has all mass on one outcome (zero spread).
For multiple outcomes, imagine a bar chart of probabilities. Entropy is highest when the chart is “flat” (all bars equal) and lower when it’s “spiky” (one outcome dominates). A flat 4-outcome distribution (25% each) has H = 2 bits. If one outcome has 70% and others share 30%, H is lower (less than 2 bits) because you have a decent guess most of the time.
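A quick check of those two numbers, assuming the “spiky” case splits the remaining 30% as 10% each:
import math

def entropy(probs):
    return sum(p * math.log2(1/p) for p in probs if p > 0)

flat = [0.25, 0.25, 0.25, 0.25]    # flat bar chart: maximum uncertainty
spiky = [0.70, 0.10, 0.10, 0.10]   # one outcome dominates

print(round(entropy(flat), 3), "bits")   # 2.0
print(round(entropy(spiky), 3), "bits")  # about 1.357, i.e. less than 2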
Example: Weather in two cities:
City S (Sunnyvale): 90% chance of sun, 10% chance of rain on any given day.
City P (Pittsburgh): 50% sun, 50% rain.
Pittsburgh’s weather entropy is higher (more uncertain day to day). Numerically: H_Sunnyvale = –(0.9 log₂0.9 + 0.1 log₂0.1) ≈ 0.47 bits; H_Pittsburgh = –(0.5 log₂0.5 + 0.5 log₂0.5) = 1 bit. It takes about 1 bit to communicate Pittsburgh’s daily weather (since it’s a coin flip), but only ~0.47 bits for Sunnyvale’s (since “sun” is so likely, you can often predict it without needing a full 1-bit of information) quantamagazine.org.
Mini‑Lab D — Computing entropy for simple distributions (5 min)
Goal: Calculate Shannon entropy for different scenarios to build intuition.
Let’s compute and compare:
A fair coin vs a biased coin vs a two-headed coin (certainty).
A fair 6-sided die vs a loaded die.
import math

def shannon_entropy(probabilities):
    # H = -sum(p * log2(p)); outcomes with p = 0 are skipped (they contribute nothing)
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# 1. Coin scenarios
coins = {
    "two-headed coin (100% Heads)": [1.0, 0.0],
    "biased coin (90% Heads)": [0.9, 0.1],
    "fair coin (50/50)": [0.5, 0.5]
}
for desc, probs in coins.items():
    print(desc, "Entropy =", round(shannon_entropy(probs), 3), "bits")
Expected output snippet:
two-headed coin (100% Heads) Entropy = 0.0 bits
biased coin (90% Heads) Entropy = 0.469 bits
fair coin (50/50) Entropy = 1.0 bits
This matches our intuition: a fair coin carries 1 bit of entropy (uncertainty), whereas a coin that almost always lands Heads has less uncertainty (~0.47 bits), and a coin that is guaranteed Heads has 0 bits of entropy (no uncertainty at all).
Now let’s compare dice:
# 2. Dice scenarios
dice = {
    "fair die (6 sides equal)": [1/6]*6,
    "loaded die (one face 50%, others 10% each)": [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
}
for desc, probs in dice.items():
    print(desc, "Entropy =", round(shannon_entropy(probs), 3), "bits")
Expected:
fair die Entropy ≈ 2.585 bits
loaded die Entropy ≈ 2.161 bits
The fair 6-sided die has entropy log₂6 ≈ 2.585 bits. The loaded die (one face much more likely) has lower entropy, about 2.16 bits, reflecting that you have some predictability.
Through these examples, notice: more bias = lower entropy. Maximum entropy occurs when all outcomes are equally likely (most uncertainty).
Mini‑Lab E — Entropy in text: letters vs random (5 min)
Goal: Measure the entropy of characters in a sample of text versus a random string, to see how natural language has lower entropy due to letter frequency imbalances.
We’ll take a sample sentence and compare it to the same length of random letters.
import random
import math
from collections import Counter

# Sample text (you can replace this with any phrase or paragraph)
text = "the cat sat on the mat"
text = text.replace(" ", "")  # remove spaces for letter analysis
letters = list(text)

# Generate a random string of the same length from the lowercase letters a-z
random_letters = [chr(random.randint(97, 122)) for _ in range(len(letters))]  # a-z

def entropy_of_sequence(seq):
    counts = Counter(seq)
    total = len(seq)
    return -sum((count/total) * math.log2(count/total) for count in counts.values())

H_text = entropy_of_sequence(letters)
H_random = entropy_of_sequence(random_letters)
print("Text sample:", text)
print("Letter entropy of text sample:", round(H_text, 3), "bits/letter")
print("Random string:", ''.join(random_letters))
print("Letter entropy of random string:", round(H_random, 3), "bits/letter")
Output might look like:
Text sample: thecatsatonthemat
Letter entropy of text sample: 2.89 bits/letter
Random string: qlzkgp...
Letter entropy of random string: ~3.9 bits/letter
For a sample this short (17 characters), the random string’s entropy is capped at log₂17 ≈ 4.09 bits/letter; a much longer random string would approach log₂26 ≈ 4.7 bits, since all letters appear roughly equally. The sample English text has a noticeably lower entropy (here ≈ 2.89 bits per letter). That’s because in English text, letters like t, h, e, a… appear more frequently than others, and some letters (z, q, etc.) might not appear at all in a short sample, skewing the distribution quantamagazine.org. Also, our text is small; a larger sample would be needed for a precise estimate, but it still shows that not all letters are equal. For large samples, English’s letter-frequency entropy comes out around 4 bits per character, and only 1–2 bits once context is taken into account; both are well below the 4.7-bit maximum.
Takeaway: Natural language is redundant: certain letters and patterns dominate, lowering the entropy. A truly random text has maximal entropy by definition (every letter equally likely). This is why we can compress English text (there’s “wasted” predictability), but a random string is already as compressed (information-dense) as it gets.
(For a fun experiment: try increasing the sample text size, or compare the entropy of letter pairs (bigrams) in the first N letters of a novel vs a shuffled version of the same letters. Single-letter entropy won’t change under shuffling, since the letter counts stay the same, but the pair statistics will.)
Cross‑Disciplinary Applications
Data compression & file formats: Shannon’s work directly underpins how ZIP files, MP3s, and PNG images work. These formats use clever coding (Huffman coding, arithmetic coding, etc.) to approach the entropy limit. For example, a BMP image (raw pixels) might be 10 MB, but a PNG of the same image could be 2 MB. Why? Because the image probably has large areas of one color or other regularities – i.e. lower entropy than “maximally random” pixels. The compression algorithm finds those patterns and encodes them with fewer bits quantamagazine.org quantamagazine.org. However, if you try to compress an already compressed file (which is near the entropy limit), you’ll gain little. This is Shannon’s theorem in action: you can’t squeeze water from a stone – or bits from an already entropy-saturated source.
Machine Learning (model uncertainty): In ML, especially probabilistic models, Shannon entropy is used to quantify uncertainty of predictions. For example, when a language model like GPT outputs a probability distribution over next words, we can calculate the entropy of that distribution. A high entropy means the model is very unsure (it’s spread out over many possibilities); low entropy means it’s confident about a few options. This links to the concept of perplexity in language models: perplexity is 2^entropy stats.stackexchange.com. If a model has entropy = 5 bits per word on average, perplexity = 2⁵ = 32, meaning on average it’s as “confused” as if it had 32 equally likely options for each word. Lower perplexity (lower entropy) means a better, more predictive model. Training often aims to reduce entropy of predictions on data (without overfitting). Also, loss functions like cross-entropy (log loss) directly come from the entropy formula – essentially measuring how many extra bits you’re using because your predicted distribution isn’t perfectly aligned with reality spotintelligence.com.
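As a toy illustration (the next-word distributions below are made up, not taken from any real model), here is how entropy, perplexity, and cross-entropy relate numerically:
import math

def entropy(probs):
    return sum(p * math.log2(1/p) for p in probs if p > 0)

def cross_entropy(p_true, q_pred):
    # average bits paid when outcomes follow p_true but the code/model is built for q_pred
    return sum(pt * math.log2(1/qp) for pt, qp in zip(p_true, q_pred) if pt > 0)

confident = [0.90, 0.05, 0.03, 0.02]   # model strongly favours one next word
unsure = [0.25, 0.25, 0.25, 0.25]      # model has no idea

for name, dist in [("confident", confident), ("unsure", unsure)]:
    h = entropy(dist)
    print(f"{name}: entropy = {h:.3f} bits, perplexity = {2**h:.2f}")

# If the true next word is the first option (one-hot truth), the loss is -log2(0.90) ≈ 0.152 bits;
# a model that only assigned it 0.25 would pay 2 bits.
print(round(cross_entropy([1, 0, 0, 0], confident), 3), "bits of cross-entropy")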
Physics – information and thermodynamics convergence: Modern physics has embraced the Shannon interpretation of entropy. Notably, Black Hole entropy (Bekenstein-Hawking entropy) is often discussed as the amount of information “lost” when something falls into a black hole. It’s proportional to the black hole’s event horizon area, and physicists say it counts the information content hidden inside quantamagazine.org. Likewise, in quantum mechanics, von Neumann entropy is the analog of Shannon entropy for quantum states, measuring our uncertainty about a quantum system. The trend is that in many fields, entropy = “missing information.” There’s even a principle that the laws of physics might fundamentally limit information processing (e.g., Landauer’s principle linking bit erasure to heat, which we saw). This cross-pollination means when you learn Shannon entropy, you’re also touching on ideas relevant to physical entropy and vice versa quantamagazine.org.
Cognitive science – human language and prediction: Our brains are sensitive to information content. Experiments in psycholinguistics show that people read more quickly through predictable words and slow down on surprising words. Essentially, your brain internally predicts what’s coming next; when the prediction is wrong (high surprise, high entropy situation), it takes a moment to adjust. This aligns with the predictive processing theory in neuroscience, which posits that the brain constantly minimizes surprise (entropy) by updating its model of the world. Shannon’s concepts give a framework for measuring surprise (“You said a completely unexpected word, that carries high information!”) and this even feeds into how we design communication interfaces or texts for readability (known as information density in language – people tend to naturally make content more uniform in information rate).
Economics and finance: Information entropy has even been applied in finance – for example, to quantify the unpredictability of stock price movements. A stock that’s perfectly random in movement has high entropy; if there’s a pattern (say seasonal effects), entropy is lower. Portfolio theory also relates: a diversified portfolio (spreading investments) is analogous to a high-entropy distribution (not putting all probability weight on one outcome). There’s even a measure called “information entropy of market indicators” used in some econophysics research. While these applications are more metaphorical, they underline that entropy as “uncertainty measure” is a versatile concept.
Linguistics and coding theory: Shannon’s entropy is at the heart of error-correcting codes and communication theory. Engineers design codes that approach the Shannon limit (maximum data rate through a noisy channel). For instance, your WiFi or cell phone uses sophisticated coding (Turbo codes, LDPC codes) that get close to channel capacity – which is defined by the entropy of the noise and signal. Also, in linguistics, people analyze entropy of various languages (for instance, does a language with more letters have higher per-letter entropy or do frequencies adjust?). It turns out languages tend to have similar per-syllable or per-second information rates – probably an efficiency shaped by human processing limits.
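Channel coding is beyond this module, but a tiny simulation (parameters made up for illustration) hints at the trade-off that real codes negotiate far more cleverly: a naive 3× repetition code with majority-vote decoding cuts the error rate on a noisy binary channel, but only by spending three channel uses per bit, whereas the channel capacity 1 − H(p) is the rate ideal codes can approach with vanishing error.
import math
import random

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1/p) + (1 - p) * math.log2(1/(1 - p))

def channel(bit, flip_prob):
    # binary symmetric channel: each transmitted bit is flipped with probability flip_prob
    return bit ^ (random.random() < flip_prob)

random.seed(0)
flip_prob = 0.1
message = [random.randint(0, 1) for _ in range(20_000)]

# Uncoded transmission: one channel use per message bit
uncoded_errors = sum(channel(b, flip_prob) != b for b in message)

# 3x repetition code: three channel uses per message bit, decoded by majority vote
repetition_errors = 0
for b in message:
    received = [channel(b, flip_prob) for _ in range(3)]
    repetition_errors += (sum(received) >= 2) != b

print(f"uncoded error rate:    {uncoded_errors / len(message):.3f}")     # about 0.10
print(f"repetition error rate: {repetition_errors / len(message):.3f}")  # about 0.028, at rate 1/3
print(f"channel capacity:      {1 - binary_entropy(flip_prob):.3f} bits per channel use")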
Quick misconceptions to retire
“Information = meaning.”
Clarification: In everyday language, we say “information” meaning meaningful content. In Shannon’s sense, information is purely about reduction of uncertainty plato.stanford.edu plato.stanford.edu. If a message is random gibberish, it can have high entropy (lots of bits) but zero meaning. Conversely, a very meaningful message (“Earth has gravity”) might actually be low entropy if you already expected it. Shannon’s entropy deliberately ignores semantics – it’s all about statistical properties. Don’t equate high entropy with “useful” or “meaningful” automatically. It just means unpredictable to the receiver.
“Entropy in information theory is different from entropy in physics.”
Clarification: They are indeed used in different contexts, but mathematically they are deeply connected quantamagazine.org quantamagazine.org. In fact, Shannon entropy is formally analogous to the entropy concept in statistical mechanics. The difference is what the probabilities refer to: in info theory, probabilities are of messages or symbols; in thermo, probabilities are of microstates. The unit is different (bits vs Boltzmann’s constant units), and one is often about subjective knowledge (what you know about the system) while the other can be seen as an objective property of the system. Modern views (as in Jaynes’ work quantamagazine.org) blur this distinction, treating thermodynamic entropy as basically the Shannon entropy of the microstate distribution. In short: they’re not different formulas, just different applications. But remember the context when you use the word entropy!
“If we just find a clever algorithm, we can compress data arbitrarily.”
Clarification: No, Shannon’s theory proves a limit. If data has entropy H per symbol, you can’t losslessly compress below H bits per symbol on average quantamagazine.org. If someone claims to compress random data 10:1 with no loss, they’ve likely exploited patterns in a specific file (not truly random) or it’s simply impossible. Many have tried to find “magic” compression that beats Shannon’s limit, but it always fails on truly high-entropy data. This is analogous to perpetual motion machines and the Second Law – you can’t beat the fundamental bound. You can use domain-specific knowledge to compress better (or accept some loss for more compression), but you can’t break the entropy barrier.
“More entropy = more ‘difficult’ or complex.”
Clarification: Be careful: a highly entropic source is difficult to predict/compress, yes. But in some contexts “complex” might be used differently. For example, a perfectly random sequence has maximum entropy but we might say it has no structure – it’s “complex” in one sense but very simple in another (no pattern). Meanwhile, something like the digits of π are deterministic (low entropy if you know the formula) but look random. The term “entropy” specifically quantifies unpredictability. It doesn’t directly measure other kinds of complexity (like algorithmic complexity). So use the term precisely – high Shannon entropy means “I don’t know what’s coming next at all.” It doesn’t inherently mean “highly structured” (actually it’s the opposite: highly structured means lower entropy).
Check your understanding
What is Shannon entropy, and why are 100 flips of a fair 50/50 coin more informative than 100 flips of a two-headed coin?
Answer: Shannon entropy is a measure of uncertainty or average information content. 100 fair coin flips have high entropy because each flip is uncertain (1 bit each, ~100 bits for 100 flips). A two-headed coin has zero entropy per flip (always the same outcome, no new information). So the fair coin sequence carries more information in Shannon’s sense because you’re uncertain and get new bits of info with each flip quantamagazine.org quantamagazine.org.
How do patterns in a message affect its entropy?
Answer: Patterns (like certain letters always following others, or some words appearing very frequently) reduce entropy. They make some outcomes more likely than others (uneven probabilities) or make parts of the message predictable from other parts. This lowers the average uncertainty per symbol. For example, knowing “Q” is followed by “U” in English means after a “Q” your uncertainty about the next letter is nearly zero, contributing little to entropy quantamagazine.org. Essentially, redundancy and structure = lower entropy.
Explain the meaning of the equation $H = -\sum p_i \log₂ p_i$ in simple terms.
Answer: That equation says: to find the entropy (uncertainty) of a situation, for each possible outcome i, take the probability of i times $\log₂(1/p_i)$ (which is like the “surprise” of i), and sum them up. It’s an average surprise. In simple terms: list all possible outcomes, for each one assign a number for how surprising it is (surprise = high when probability is low), then take the weighted average surprise using the probability as the weight. That gives the entropy in bits quantamagazine.org quantamagazine.org. High entropy means on average a lot of surprise (very unpredictable outcomes), low entropy means on average not much surprise (some outcomes dominate).
Why can’t you compress a random data file much, but you can compress a text file a lot?
Answer: A random file has (by design) no patterns or redundancy – effectively maximum entropy. That means there’s no shorter way to represent it without losing information; every bit is carrying essential info. A text file, however, has lots of redundancy (common letters, repeating words, etc.), so its entropy per character is lower. You can encode those frequent patterns more efficiently (e.g., “the” might be coded in fewer bits than an uncommon word) quantamagazine.org. Compression algorithms exploit this to remove redundancy, approaching the Shannon entropy limit of the data. So random data is incompressible (no redundancy to remove), whereas structured data isn’t full entropy and can be compressed.
Practice prompt
You meet someone who says, “I don’t get entropy – why is random stuff considered ‘information’ while meaningful text sometimes is called redundant?!” How would you clarify the difference between Shannon information and semantic meaning in a few sentences?
Further reading & resources
Quanta Magazine (2022) – “How Claude Shannon’s Concept of Entropy Quantifies Information” – An excellent popular article quantamagazine.org quantamagazine.org. It walks through Shannon’s thought experiments (like the weighted coin flips, the guessing game for letters) and explains how entropy relates to yes/no questions and compression. Great storytelling and examples to reinforce these concepts.
Claude Shannon’s original paper (1948), “A Mathematical Theory of Communication” – The foundational document of information theory. The first part is quite readable, introducing entropy, examples, and the famous logarithmic formula informationphilosopher.com informationphilosopher.com. (Later parts delve into channel capacity and coding theorems which are more technical.)
Khan Academy – Information Theory lessons – Short videos and articles explaining entropy, redundancy, error-correcting codes, etc., in an accessible way youtube.com. Good for visualizing concepts like binary questions and probabilistic information.
James Gleick, The Information: A History, a Theory, a Flood – A book (for general audience) that covers the history of information theory, including Shannon’s work, in the context of human communication developments. It provides rich historical anecdotes and explanations without heavy math.
Stanford Encyclopedia of Philosophy – “Information” (Section on Shannon) – A more philosophical take, but sections of it clearly define Shannon entropy and distinguish it from other interpretations of information plato.stanford.edu plato.stanford.edu. Useful if you want a rigorous definition and to see how it fits into broader information concepts.
“StatQuest: Entropy and Information Gain” (YouTube) – A friendly video by Josh Starmer that explains entropy in the context of data science (particularly for decision trees) in simple terms. It visually shows how entropy changes when data is more mixed vs more pure, which can reinforce the intuition of entropy as “mixed-up-ness” in information.
Shannon and Weaver’s book (1949), The Mathematical Theory of Communication – This is basically Shannon’s paper plus a more qualitative discussion by Warren Weaver. It’s a classic text that might still be in college libraries; it elaborates on the implications of Shannon’s theory in lay terms.
Summary: Shannon entropy measures how unpredictable a source of information is quantamagazine.org. If you’re totally unsure what a message will be, entropy is high; if the message is predictable, entropy is lower. It’s calculated by the famous formula $-\sum p \log p$, which is just the weighted average of surprise for each outcome. High entropy sources need more bits to describe, low entropy sources can be compressed. Unlike thermodynamic entropy tied to physical disorder, Shannon’s entropy is about our knowledge (or ignorance) of a message quantamagazine.org – yet the math and concepts align closely. The key takeaway: patterns = predictability = lower entropy, and entropy sets a hard limit on compression and efficient communication quantamagazine.org quantamagazine.org. Next, with this grounding in uncertainty and information, we’ll tackle how these ideas manifest in AI sequence models – specifically, how transformers encode time and sequence, injecting their own “arrow” into data.