

Information, Time, Memory
Course 1.4 — Information & the Arrow of Time: Maxwell’s Demon and the Thermodynamics of Memory
Series: Thermodynamic Time → Entropy → Temporal Encoding in Transformers
Estimated time: 25–30 minutes • Level: Beginner → Intermediate • Format: Read + mini‑labs (Python) + visuals
Why this course matters
Memory and time’s one-way street are deeply connected. The famed Maxwell’s demon paradox showed that gaining information (about molecules) can seemingly reverse entropy locally – but there’s a catch: erasing that information inevitably produces heat quantamagazine.org quantamagazine.org. In short, “information is physical,” as Landauer put it quantamagazine.org, and the Second Law holds firm when you count the demon’s memory. This nails down why we remember the past (low entropy left records) but can’t remember the future: recording or deleting information pushes entropy into the environment, enforcing a forward arrow of time quantamagazine.org. Whether it’s a brain or a hard drive, storing and erasing information has an unavoidable thermodynamic cost, tying information to the arrow of time.
Learning goals
By the end, you can:
Explain why we only have reliable memories of the past (and not the future) in terms of thermodynamics: past low entropy states can imprint records, whereas creating or erasing those records today must increase entropy elsewhere, preserving the arrow of time quantamagazine.org.
Describe Maxwell’s demon and its resolution: any local entropy decrease by information sorting is paid for by entropy increase when the demon’s memory is erased (Landauer’s principle: ≥k<sub>B</sub>T ln2 of heat per bit) quantamagazine.org quantamagazine.org. No demon or computer can beat this cost.
Compute the Landauer limit – the minimum heat required to erase one bit – and compare it to real-world computing. (Spoiler: at room temperature it’s ~2.8×10⁻²¹ J, while modern processors dissipate billions of times more en.wikipedia.org.)
Apply these ideas to real technology and science: see how lossless compression hits Shannon’s entropy limit, why cryptographic keys need high entropy, how reversible computing aims to reduce heat, how ecologists use Shannon’s index for biodiversity, and even parallels in neuroscience and AI.
Plain‑language intuition
The past is written, the future is blank: We can remember the past because it left physical traces (photographs, brain synapses, etc.) when entropy was lower. To create those memory traces (e.g. burn a CD, form a memory) you must increase entropy in the environment (waste heat). No such traces from the future exist yet. In essence, any process that records information (decreases uncertainty for you) necessarily dumps entropy into surroundings, pointing time’s arrow forward.
Maxwell’s demon exorcised: Imagine a tiny demon that sorts fast (hot) and slow (cold) gas molecules, seemingly making one side hot, the other cold – apparently reversing entropy. The demon’s trick: it gains information about each molecule. But when its memory fills up, it must erase old data to keep working. That erasure dumps at least k<sub>B</sub>T ln2 of heat per bit into the environment, outweighing the entropy it removed quantamagazine.org. Outcome: no net violation of the Second Law. The “demon” (or any clever device) pays the piper via heat when resetting memory.
“No free lunch” for bits: Rolf Landauer proved that any logical irreversibility – like resetting a register to 0 (losing information about its prior state) – has a minimum energy cost. At 300 K, erasing 1 bit releases ≥2.8×10⁻²¹ J as heat en.wikipedia.org. This is tiny, but real. Erasing a billion bits (roughly 125 MB) releases on the order of 10⁻¹² J (still tiny) – yet current computers use far more energy per operation en.wikipedia.org. Importantly, if a computation is done reversibly (no information is erased), in principle it could approach zero energy dissipation en.wikipedia.org. But every time you delete data or forget something, you increase entropy. In everyday terms: deleting files, compressing data, or even just tidying your desk – all have hidden heat costs!
Core concepts (the “puzzle pieces”)
1) Memory, records, and the arrow of time
Past vs. future: We have records (memory) of the past because the universe began in a low-entropy state, allowing information to be imprinted as entropy rose. Imagine photographing an egg as it cooks – the film (or sensor) absorbs some energy/information, increasing entropy, while capturing the low-entropy moment. You cannot photograph the future because it hasn’t happened – no orderly state has imprinted on a medium. In thermodynamic terms, memory formation is an irreversible process: you reduce uncertainty about the past event by increasing entropy in your camera, brain, or notebook.
“Entropy was lower then”: Ultimately, as noted by physicist Sean Carroll, the reason we can form a reliable memory of the past is that entropy was lower in the past than it is now boardgamegeek.com. Any information you store (a measured value, a written note) today correlates with an earlier state and carries an entropy price. This idea aligns the psychological arrow of time (we remember yesterday, not tomorrow) with the thermodynamic arrow – memory-making is a physical, entropy-increasing act. If you try to “remember” the future by predicting it, you’ll either be wrong (increasing entropy in your brain as you update your knowledge later), or if somehow correct, it’s because some record from the past was involved. Either way, entropy rises consistent with the Second Law.
2) Maxwell’s Demon: information has a heat cost
The thought experiment: James Clerk Maxwell in 1867 imagined a demon sorting molecules to create a hot and cold side from an initially uniform gas. This seems to lower thermodynamic entropy without work – a direct challenge to the Second Law quantamagazine.org quantamagazine.org. The demon achieves this by acquiring information: it measures molecule speeds and positions. For decades, the paradox tantalized scientists.
Landauer’s principle to the rescue: In 1961, Rolf Landauer identified the key: erasing one bit of information dissipates ≥k<sub>B</sub>T ln2 of heat en.wikipedia.org. Later, in 1982, Charles Bennett applied this to the demon: the measurement itself can be done reversibly, with negligible energy, but eventually the demon must erase its memory to continue working quantamagazine.org. That erasure produces enough entropy (heat) to satisfy the Second Law. In other words, the demon’s entropy export (from memory erasure) is at least the entropy it removed by sorting quantamagazine.org. The ledger balances, and entropy overall still increases.
“Information is physical”: Landauer famously emphasized this motto quantamagazine.org. The demon paradox cemented the idea that information (bits, knowledge) isn’t abstract magic – it has physical consequences. A memory register full of bits is like a physical subsystem; if you randomize it (erase to all zeros), you increase the entropy of the universe by the corresponding amount. This also explained why no clever being can break the Second Law: any gain of information that lowers entropy locally must be paid for by entropy (energy dispersal) released when that information is used or erased. The Second Law remains safe quantamagazine.org quantamagazine.org.
3) “Entropy + information” — a useful heuristic (not a new conservation law)
Some have described a combined quantity “entropy + information” that stays constant in closed systems, i.e. if a system gains information (uncertainty reduced), an equivalent entropy increase is dumped to the environment. This interpretation comes from thought experiments like Maxwell’s demon, where the demon’s gained information is often regarded as “negative entropy”. But be careful: this is not an independent law of nature, just a way to keep score. It’s essentially the Second Law in disguise. In reality, only physical entropy is conserved/increased, and information is accounted for in those entropy terms. For instance, when the demon measures molecules, the entropy of the gas decreases, but the total entropy (gas + demon’s memory + heat bath) does not decrease quantamagazine.org. We can say the demon’s information has entropy value k<sub>B</sub> ln2 per bit, which will be released upon erasure. Use this lens as a heuristic to reason about processes (it helps to ask “where did the entropy go?” when you see information being gained), but remember it all ultimately reduces to standard thermodynamics. There is no separate “information entropy” that violates physics; it’s the same entropy accounting, just partitioned into different forms (physical disorder vs knowledge gain) quantamagazine.org quantamagazine.org.
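To make this bookkeeping concrete, here is a minimal sketch (an illustrative toy calculation, not a simulation of any real device) of a Szilard-engine-style cycle: each measured bit lets the demon extract at most k<sub>B</sub>T ln2 of work from the gas, while erasing that bit at reset costs at least k<sub>B</sub>T ln2 of heat, so the net gain over a full cycle is never positive.
import math

kB = 1.380649e-23                   # Boltzmann constant, J/K
T = 300.0                           # bath temperature, K
bit_cost = kB * T * math.log(2)     # k_B T ln 2, joules per bit

n_bits = 1000                       # bits the demon measures in one working stint
work_extracted = n_bits * bit_cost  # upper bound on work gained from sorting (Szilard engine)
erasure_heat = n_bits * bit_cost    # Landauer lower bound on heat to reset the memory

print(f"Work extracted (best case): {work_extracted:.3e} J")
print(f"Heat paid to erase memory : {erasure_heat:.3e} J")
print(f"Net gain over a full cycle: {work_extracted - erasure_heat:.3e} J (never positive)")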
Visual / diagrammatic intuition

A highly ordered state (all pixels identical) has extremely low Shannon entropy – easy to describe, with no uncertainty. Here, a uniform gray image can be described in a few words (“all gray”), and a computer file of it can be compressed to almost nothing.
A highly disordered state (pixels random noise) has maximal entropy – impossible to succinctly describe, high uncertainty in every part. This random noise image would require specifying every pixel (no shortcuts), and its file won’t compress smaller because it’s already as unpredictable as possible.
Think of the first image like a low-entropy physical state (all molecules neatly arranged, or all bits the same) and the second like a high-entropy state (molecules randomized, or bits random). Any process that increases order (going from noise to uniform) must expend work and dump entropy elsewhere – akin to compressing the random image, which inevitably produces heat in the compressor. Conversely, disorder can easily increase on its own (like the gray image becoming noisy if noise is introduced). This mirrors how thermodynamic entropy and information entropy parallel each other: more order = less entropy = fewer bits needed, while more disorder/unpredictability = more entropy = more bits needed to describe.
Three mini‑labs
If you can’t run code now, read the Expected outcome lines to cement the intuition.
Lab A — Compressing structured vs. random data (showing Shannon’s limit)
Let’s see how well we can losslessly compress an orderly text versus a random text using the built-in zlib compressor.
import zlib, random
# Create a highly ordered text (repeated pattern) and a random text:
ordered_text = "ABCDE" * 10000 # 50,000 characters, repeating "ABCDE"
random_letters = "".join(random.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(10000))
# Compress both
compressed_ord = zlib.compress(ordered_text.encode('utf-8'))
compressed_rand = zlib.compress(random_letters.encode('utf-8'))
print(f"Ordered text length: {len(ordered_text)} -> Compressed length: {len(compressed_ord)} bytes")
print(f"Random text length: {len(random_letters)} -> Compressed length: {len(compressed_rand)} bytes")
Expected outcome: The ordered text (highly redundant) compresses massively (e.g. 50,000 → ~40 bytes!), whereas the random text (high entropy) stays large (e.g. 10,000 → ~6,200 bytes). In fact, truly random data can even grow when “compressed” due to overhead. This demonstrates Shannon’s source coding theorem: you can’t losslessly compress data on average below its entropy en.wikipedia.org. The pattern data had low entropy (few surprises, repeated “ABCDE”), so zlib finds a super-short description for it. The random data was near maximum entropy (each letter equally likely; no repeats to exploit), so compression yields little gain. Bottom line: More predictability ⇒ fewer bits; more randomness ⇒ more bits. This is exactly the information-theory twin of the Second Law: you can’t “compress” entropy for free en.wikipedia.org (just like you can’t compress a gas without expending work).
Lab B — Entropy of image data (uniform vs. noise)
We’ll quantify the Shannon entropy of two 64×64 grayscale images: one uniform gray, one pure noise. Entropy is in bits per pixel (assuming base‑2 logs).
import math, numpy as np
from collections import Counter
def shannon_entropy(values):
    # Shannon entropy (in bits) of the empirical distribution of `values`
    n = len(values)
    freq = Counter(values)
    return -sum((count/n) * math.log2(count/n) for count in freq.values())
# Uniform image (all pixels = 128), and random image (pixels 0–255 uniform)
uniform_pixels = [128] * (64*64)
random_pixels = np.random.randint(0, 256, size=(64*64,)).tolist()
print("H(uniform image) =", round(shannon_entropy(uniform_pixels), 3), "bits/pixel")
print("H(random image) =", round(shannon_entropy(random_pixels), 3), "bits/pixel")
Expected outcome: H(uniform image) ≈ 0.0 bits/pixel, H(random image) ≈ 8.0 bits/pixel. A uniform image has essentially zero uncertainty (every pixel the same, no information gained by revealing a pixel’s value), whereas an 8-bit random image has ~8 bits of entropy per pixel (256 possible values, each equally likely ⇒ log₂256 = 8). This echoes the idea that a completely ordered state carries no new information (low entropy), while a completely random state carries maximal information (high entropy). If you saved both images as PNG files, the random one would be dramatically larger. (PNG compression would shrink the uniform image to a tiny file, but can’t do much for noise.) This links back to thermodynamics: an ordered crystal has low entropy (little “surprise” in its structure), whereas a random gas has high entropy (much “surprise” in exact microstates). Shannon’s entropy gives a common currency to compare these situations in bits.
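As a quick follow-up (a minimal sketch that reuses uniform_pixels and random_pixels from the snippet above, with zlib standing in for PNG compression), you can check the file-size claim directly; exact sizes vary, but the uniform image shrinks to a few dozen bytes while the noise stays near its raw 4 KB.
import zlib
raw_uniform = bytes(uniform_pixels)   # 4096 identical byte values
raw_random = bytes(random_pixels)     # 4096 unpredictable byte values
print("Uniform image:", len(raw_uniform), "->", len(zlib.compress(raw_uniform)), "bytes compressed")
print("Random image: ", len(raw_random), "->", len(zlib.compress(raw_random)), "bytes compressed")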
Lab C — Landauer’s limit: joules per bit (at room temp)
Let’s calculate the minimum energy required to erase one bit at T = 300 K (room temperature), and the energy for a larger data quantity, using Landauer’s formula E_min = k<sub>B</sub> T ln 2.
import math
kB = 1.380649e-23 # Boltzmann’s constant, J/K
T = 300.0 # temperature in K
energy_per_bit = kB * T * math.log(2)
print("Minimum heat to erase 1 bit at 300K:", f"{energy_per_bit:.3e}", "J")
print("Erasing 1e9 bits (125 MB):", f"{energy_per_bit * 1e9:.3e}", "J")
Expected outcome: Minimum heat to erase 1 bit at 300 K: ~2.8×10⁻²¹ J. Erasing 10⁹ bits (~125 MB) dissipates at least ~2.8×10⁻¹² J (still tiny). For perspective, 2.8×10⁻²¹ J is about 0.018 eV – on the order of thermal noise in a molecule. Modern computers are nowhere near this limit: as of 2012, they used about a billion times more energy per operation than the Landauer limit en.wikipedia.org en.wikipedia.org. This gap is due to many inefficiencies (transistor switching energy, resistance, etc.), but Landauer’s principle sets a floor we cannot beat unless we perform only reversible operations. No matter how advanced computers get, if they irreversibly erase information, they must emit heat ≥ k<sub>B</sub>T ln2 per bit. No physical computer can be more thermodynamically efficient than this limit. It’s a counterpart to the Second Law: just as a heat engine can’t be 100% efficient, a computation can’t erase bits for free. (Reversible computing offers a way out in principle by never losing information – but it requires designing logic that can run backward, which is an active research challenge.) en.wikipedia.org
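To ground those comparisons, here is a short sketch that continues from the snippet above (it reuses energy_per_bit) and restates the numbers in electron-volts and at the roughly billion-fold gap quoted for ca.-2012 hardware:
eV = 1.602176634e-19                        # joules per electron-volt
print(f"Landauer limit at 300 K: {energy_per_bit:.3e} J  (~{energy_per_bit / eV:.3f} eV)")
print(f"A billion times the limit: {energy_per_bit * 1e9:.3e} J per operation")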
Quick misconceptions to retire
“Maxwell’s demon violates the Second Law.” – No, it appears to locally, but when you include the demon’s information processing, the Second Law holds. The demon’s measurement doesn’t create entropy (it can be done reversibly), but erasing the measurement results (to reuse its memory) necessarily dumps heat ≥ k<sub>B</sub>T ln2 per bit quantamagazine.org, more than offsetting the entropy it removed. In short, information acquisition can’t be used for free work once you account for the full cycle. The entropy cost is just moved to the demon’s side of the ledger.
“Erasing information can be done without energy if we build a better device.” – In practice we can approach the Landauer limit with ultra-efficient or reversible logic, but the principle itself is fundamental. Any logically irreversible operation has that energy cost quantamagazine.org. You could only avoid it by avoiding erasure altogether (as in reversible computing, which instead of deleting bits, keeps all of them but in transformed ways). As long as bits are erased or forgotten, entropy is generated. There’s no clever engineering that can bypass this fact, though engineers can reduce overhead to get closer to the limit.
“Entropy in information theory is totally different from thermodynamic entropy.” – They are used in different contexts (information entropy measures uncertainty in a probability distribution, thermodynamic entropy measures molecular disorder), but mathematically they are deeply analogous – both are given by a form of –∑ p ln p (Boltzmann’s S = k<sub>B</sub> ln Ω is related too). They even share units if you choose constants and log base appropriately (1 bit of Shannon entropy corresponds to an entropy increase of k<sub>B</sub> ln 2 in physical entropy). However, be careful not to conflate them one-to-one: physical entropy involves energy and heat, whereas Shannon entropy involves information content. The safe takeaway is that information entropy can be treated like an entropy that a physical system carries when that information is embodied physically en.wikipedia.org quantamagazine.org. When you erase or lose that information, it shows up as ordinary entropy (heat). But a random message’s Shannon entropy isn’t automatically heat unless you try to extract work from it or erase it in a physical device. Context matters!
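A tiny sketch of that unit conversion (the only ingredient is the choice of log base): one bit of Shannon entropy, once physically embodied and then erased, corresponds to k<sub>B</sub> ln2 of thermodynamic entropy, i.e. k<sub>B</sub>T ln2 of heat at temperature T.
import math
kB = 1.380649e-23                       # Boltzmann constant, J/K
entropy_per_bit = kB * math.log(2)      # thermodynamic entropy equivalent of one bit, J/K
print(f"1 bit  <->  {entropy_per_bit:.3e} J/K of entropy")
print(f"Heat released at 300 K: {entropy_per_bit * 300:.3e} J per bit erased")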
Check your understanding
Why can’t we “remember” the future in the same way we remember the past?
Hint: What would it take, entropy-wise, to have a memory of something that hasn’t happened yet?
Answer: Because no low-entropy record of the future exists for us to observe. All reliable memories correspond to physical traces laid down when entropy was lower boardgamegeek.com. To “remember” (accurately know) a future event, we’d need information from it, which would require that event to have influenced our low-entropy past – a physical impossibility. Any attempt to predict the future in advance either fails or, if successful, effectively creates a new record (increasing entropy now). The Second Law dictates that entropy (disorder) increases toward the future, so we can’t have knowledge of future microstates without doing work and increasing entropy. In summary: we remember the past because it left ordered imprints; we can’t remember the future because it hasn’t and because acquiring such information would break thermodynamic constraints.
What “pays the price” for Maxwell’s demon’s clever sorting of molecules?
Answer: The erasure of the demon’s memory. The demon lowers entropy in the gas by knowing which molecules are fast or slow. But when the demon’s memory (full of those measurements) is reset to start again, that erasure must dump at least k<sub>B</sub>T ln2 of heat per bit to the environment quantamagazine.org. That emitted entropy outweighs the gas’s entropy drop, preserving the Second Law. In short: the heat from the demon’s tiny “memory dump” pays for the entropy it removed, so overall entropy still increases. quantamagazine.org
State Landauer’s principle in your own words.
Answer: Erasing one bit of information in any physical system will release a minimum amount of heat into the environment: at least k<sub>B</sub>T ln2 joules en.wikipedia.org. This is the fundamental cost of forgetting. If you don’t erase (i.e. you use reversible operations), you can in principle avoid this heat, but whenever a computation or process loses information (merges two states into one, like resetting memory to zero), you necessarily increase entropy by that amount. Often summarized as: “Deleting a bit releases a bit of heat.”
Practice prompt
Prompt: In a short paragraph, explain how Maxwell’s demon uses information to seemingly break the Second Law, and why incorporating Landauer’s principle saves the day. Include the numerical value for the heat cost of erasing one bit at room temperature (from Lab C) to ground your explanation. Your answer: (Imagine you’re explaining to a curious student or colleague, referencing the demon’s measurement and memory erasure, and quantifying the heat per bit.)
Reading & watching
Rolf Landauer, “Information is Physical” (Physics Today, 1991) – The classic article by Landauer explaining in accessible terms why erasing information has a thermodynamic cost and how this links to computation limits quantamagazine.org. Great historical insight into the mantra that information must be embodied physically.
Quanta Magazine, “How Maxwell’s Demon Continues to Startle Scientists” (2021) – A fascinating overview of the Maxwell’s demon saga from 19th-century paradox to modern experiments. Explains Shannon’s and Landauer’s contributions and even recent studies making demon-like systems (and why they still obey the Second Law) quantamagazine.org quantamagazine.org.
Quanta Magazine, “How Shannon Entropy Quantifies Information” (2022) – Background reading on Shannon’s entropy, i.e. the information measure that underpins compression and data entropy. Useful to connect the dots between uncertainty in messages and physical entropy (features Claude Shannon’s work in an intuitive way) en.wikipedia.org.
David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Ch. 1 – An open-access textbook chapter that gently introduces Shannon entropy with everyday examples and even touches on the link to thermodynamic entropy en.wikipedia.org. (MacKay was a master of clear explanation; this is a great resource for diving deeper into information theory basics.)
Plenio & Vitelli (2001), “The physics of forgetting: Landauer’s erasure principle” – A concise review article (available online) that discusses Landauer’s principle in both classical and quantum contexts. It elaborates on why erasing a bit causes heat and discusses thought experiments and proofs. Good for those curious about the deeper physics (and it’s only a few pages).
Stanford Encyclopedia of Philosophy, “Information Entropy” – For a rigorous conceptual take, this entry clarifies the distinctions and connections between thermodynamic entropy and information entropy in philosophical terms en.wikipedia.org en.wikipedia.org. It can solidify understanding and clear up misconceptions if the dual usage of “entropy” is confusing.
Seth Lloyd (lecture): “Physics of Information” – A popular talk by quantum computing pioneer Seth Lloyd, discussing how information and thermodynamics intersect. It’s an engaging overview touching on Maxwell’s demon, Landauer’s limit, and even black holes and information. Useful to reinforce that these ideas span from everyday tech to fundamental physics.
Entropy & Information in real-world contexts
Finally, to see why this topic matters across disciplines, let’s look at a few real-world contexts where entropy and information theory meet:
Computer Science – Data compression and entropy: Shannon’s source coding theorem tells us that the average bits needed to encode data can’t go below the source entropy H (without losing information) en.wikipedia.org. Practical compression algorithms (Huffman, LZ, etc.) approach this bound by giving shorter codes to frequent symbols and longer codes to rare ones. In plain terms: more predictability ⇒ fewer bits; more uncertainty ⇒ more bits. This is exactly what we saw in Lab A. Why it matters: It’s the information-theory analogue of the Second Law – you can’t “compress” pure randomness. Try zipping a truly random file and you’ll get no size reduction, or even a slight increase. High entropy data has no redundancy to squeeze out. Our Quanta reading en.wikipedia.org bridges uncertainty to bit budgets well. Thermo tie-in: When you compress data on real hardware, you typically have to throw away or reorder bits (if lossy or rearranging for redundancy removal), which is a physical process that dissipates energy. And whenever you erase the now-redundant bits (like clearing out repeated patterns), Landauer’s principle says you’ll release heat (k<sub>B</sub>T ln2 per erased bit) en.wikipedia.org. In other words, organizing information in one place exports entropy to the environment. In an extreme view, algorithms that massively compress data must dump proportional heat – emphasizing an interplay between logical efficiency and thermodynamic cost. (This cost is usually negligible in everyday computing, since 10⁻²¹ J per bit is tiny, but it’s there in principle and becomes relevant in very energy-constrained computing.)
Mini-exercise: Take a large text file, estimate its Shannon entropy (e.g. by character frequencies), and then compress it with a tool. Divide compressed size (in bits) by original length – you’ll get close to the entropy per character. Try the same with a highly random file (or encrypted file) – you should see the ratio approach 1 (meaning you can’t compress it further). This empirically verifies that entropy sets a hard limit on lossless compression en.wikipedia.org.
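Here is a minimal sketch of that exercise (using in-memory strings as stand-ins for files; swap in your own data): it compares a first-order entropy estimate with the bits zlib actually spends.
import math, os, zlib
from collections import Counter

def bits_per_byte(data):
    # First-order Shannon entropy estimate (byte frequencies only) vs. bits zlib actually uses
    n = len(data)
    freq = Counter(data)
    H = -sum((c / n) * math.log2(c / n) for c in freq.values())
    used = 8 * len(zlib.compress(data)) / n
    return H, used

repetitive = ("the quick brown fox jumps over the lazy dog. " * 500).encode("utf-8")  # stand-in text
random_data = os.urandom(20000)                                                       # stand-in for an encrypted file

for name, data in [("repetitive text", repetitive), ("random bytes  ", random_data)]:
    H, used = bits_per_byte(data)
    print(f"{name}: entropy estimate {H:.2f} bits/byte, zlib used {used:.2f} bits/byte")
# Note: zlib also exploits repeated phrases, so the repetitive text compresses well below
# its first-order estimate; for the random bytes both numbers sit right at ~8 bits/byte.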
Takeaway: Entropy is the foundation of compression. No clever algorithm can beat Shannon’s entropy limit on average – mirroring how no machine can beat thermodynamic limits.
Cryptography & Security – Entropy = unpredictability: In security, entropy = strength. Cryptographic keys, random number generators (RNGs), one-time pads – all demand high unpredictability. Standards like NIST SP 800-90B outline how to design and test entropy sources (e.g. electronic noise, jitter) to feed random bit generators tsapps.nist.gov tsapps.nist.gov. If your RNG has low entropy, the keys it produces might be guessable. Real incidents have occurred: a study found thousands of network devices that generated easily guessable (even duplicate!) keys due to insufficient entropy in their random sources tsapps.nist.gov – a disaster for security. Why it matters here: Shannon entropy measures uncertainty in a distribution, which directly translates to how hard a key is to guess. A 128-bit truly random key has 2¹²⁸ possibilities – effectively unguessable. But if an RNG only had, say, 40 bits of entropy (perhaps it was poorly seeded), then there are effectively 2⁴⁰ ≈ 10¹² possible keys – within reach of brute force. NIST’s guidelines essentially demand that cryptographic keys have entropy close to their bit-length (112-bit minimum for many applications, meaning ~2¹¹² possibilities) tsapps.nist.gov tsapps.nist.gov. Thermo tie-in: Many hardware random generators harness thermal noise or quantum effects – literally tapping into entropy produced by physical processes. So when your computer gathers randomness (from clock drift, device noise, user mouse movements), it’s converting microscopic entropy into secure keys. If an RNG is mis-designed such that it doesn’t actually gather enough entropy, then you’re retroactively lowering entropy in the system by reusing random bits – and the informational “order” this creates (like repeated keys) is a vulnerability. In essence, low information entropy = potential for order = predictability, which for an attacker means an opportunity.
Mini-exercise: Check your operating system’s random source (e.g. /dev/random on Linux) – it often provides an estimate of available entropy. Write a short script to sample bytes from it and estimate the Shannon entropy of the distribution of bytes. (It should be ~8 bits/byte if working properly.) Then try a biased source (e.g. a RNG that outputs more 0s than 1s) and see the entropy drop. This will demonstrate why even subtle biases can weaken security by reducing entropy.
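A minimal sketch of that exercise (using os.urandom as a portable stand-in for /dev/random, and a deliberately skewed generator for the biased case):
import math, os, random
from collections import Counter

def entropy_bits_per_byte(data):
    # Shannon entropy of the empirical byte distribution
    n = len(data)
    freq = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in freq.values())

good = os.urandom(100000)                                             # OS entropy source
biased = bytes(random.choice([0, 0, 0, 255]) for _ in range(100000))  # skewed toward zero bytes

print(f"os.urandom : {entropy_bits_per_byte(good):.3f} bits/byte (should be very close to 8)")
print(f"biased RNG : {entropy_bits_per_byte(biased):.3f} bits/byte (far below 8 -> guessable keys)")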
Takeaway: No entropy, no security. Modern systems measure and budget entropy carefully – it’s the currency of secrecy. And it’s another reminder that information entropy behaves like a conserved quantity: to get unpredictable bits, you usually need to draw from a physical entropy well. Mismanage that, and the second law (and attackers) will catch up with you.
Computing hardware – Landauer’s limit and reversible computing: As we computed in Lab C, each bit erased carries an unavoidable energy cost en.wikipedia.org. Today’s CPUs are far above this limit, but as we pack more transistors and try to reduce power, Landauer’s principle looms as a fundamental barrier. Why it matters: It provides a target for how efficient computation could become and motivates research into reversible computing – circuits that ideally erase 0 bits of information, reusing them instead of dropping them to garbage. Reversible logic (theoretically) can operate with arbitrarily low energy dissipation (in practice, other losses intervene before Landauer’s limit). Real-world status: In the 2020s, researchers have demonstrated logic gates and even small CPUs that are reversible and can run on significantly less energy, though still not anywhere near zero. There’s a great Communications of the ACM article titled “Taking the Heat” that discusses how, as we approach nanoscale and quantum computing, avoiding Landauer erasures becomes important for controlling heat cacm.acm.org en.wikipedia.org. Even quantum computers must heed this – qubits manipulated reversibly can, in principle, avoid heat, but if you measure (erase superposition information), you dump entropy. Thermo tie-in: This is a direct application of thermodynamic entropy ideas to computing. It’s essentially designing computers that don’t increase entropy (or do so as little as possible) during calculation. The dream is a computer that, like a frictionless engine, does work without waste heat – which requires it to never lose information. It flips bits like perfectly elastic collisions, time-reversible. We’re not there yet, but the physics is clear: to keep advancing computing performance in an energy-constrained world, we may need to incorporate these thermodynamic insights.
Mini-exercise: Identify irreversible steps in a computing task you do. For example, copying a file then deleting the original – deletion is irreversible. Or logic operations like NAND (which merges two inputs into one output, losing info) – those are inherently irreversible. How might you compute the same result without discarding information? (Hint: reversible computing would output the result and leave enough info to reconstruct the inputs – like computing function f(x) while also outputting x so you can undo f.) This exercise will make you appreciate how ubiquitous information destruction is in computing, and hence why today’s computers produce so much heat.
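A toy sketch of that hint (not a real reversible-logic framework): NAND merges four input pairs into two outputs, so the inputs cannot be recovered, whereas a CNOT-style gate that carries one input forward can be run backward to undo itself.
def nand(a, b):
    # Irreversible: (0,0), (0,1) and (1,0) all map to 1, so the inputs are lost
    return 1 - (a & b)

def cnot(a, b):
    # Reversible: (a, b) -> (a, a XOR b); applying it twice restores the original inputs
    return a, a ^ b

for a in (0, 1):
    for b in (0, 1):
        out = cnot(a, b)
        undone = cnot(*out)
        print(f"NAND({a},{b}) = {nand(a, b)}    CNOT({a},{b}) = {out} -> undo -> {undone}")
# The distinction NAND throws away is exactly the kind of lost information whose
# erasure Landauer's principle prices at k_B*T*ln2 per bit.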
Takeaway: Landauer’s principle is the bridge between information theory and hardware: it tells us the ultimate efficiency limit. Every bit of information lost is entropy gained in the physical world. Future computing paradigms might need to work with this fact, not against it.
Ecology – Biodiversity as Shannon entropy: Ecologists quantify biodiversity with the Shannon index H′, which is exactly the Shannon entropy of the species abundance distribution en.wikipedia.org en.wikipedia.org. If p_i is the fraction of individuals that are species i, then H′ = –∑ p<sub>i</sub> ln p<sub>i</sub>. A high H′ means an ecosystem has many species with relative balance (no single species dominates), analogous to a high-entropy message where many symbols are equally likely. Why it matters: This gives a single number for “diversity” that accounts for both richness (number of species) and evenness (how evenly individuals are spread). For example, a forest with 10 equally-common species is more diverse (higher H′) than one with 10 species where one species constitutes 90% of individuals. Conservationists use this index to compare habitats. They even convert entropy to an “effective number of species” by exp(H′) en.wikipedia.org – essentially asking, “this community has the diversity equivalent to X equally-common species.” That makes H′ very intuitive: if H′ = 2.3 (using ln), exp(2.3) ≈ 10, meaning the diversity ~ 10 equally abundant species. Thermo tie-in: This is a beautiful crossover where an information metric illuminates a physical/ecological system. It reinforces that Shannon entropy is a broad concept of “uncertainty” or “mixed-ness.” In an ecosystem, more entropy = more uncertainty in guessing what species the next observed individual will be en.wikipedia.org en.wikipedia.org. That’s conceptually similar to a gas: higher entropy = more uncertainty in microstate (e.g. if we pick a random molecule, hard to guess its state/species). There’s no literal heat here, but the math is identical.
Mini-exercise: Calculate H′ for two hypothetical communities: (A) 4 species with proportions [0.25,0.25,0.25,0.25]; (B) 4 species with proportions [0.85,0.05,0.05,0.05]. You’ll find H′<sub>A</sub> = ln(4) ≈ 1.386 (in base e, or 2 bits in base 2), and H′<sub>B</sub> much lower (≈0.59). Exponentiating (base e) gives effective species: A has 4, B has ~1.8. This quantifies what you intuitively see: community B is dominated by one species (low diversity).
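A short sketch that reproduces those numbers:
import math

def shannon_diversity(proportions):
    # H' = -sum p_i ln p_i (natural log, as ecologists usually report it)
    return -sum(p * math.log(p) for p in proportions if p > 0)

communities = {"A": [0.25, 0.25, 0.25, 0.25], "B": [0.85, 0.05, 0.05, 0.05]}
for name, props in communities.items():
    H = shannon_diversity(props)
    print(f"Community {name}: H' = {H:.3f} nats, effective species = {math.exp(H):.2f}")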
Takeaway: Entropy = diversity. Shannon’s formula, born in information theory, finds real-world use in ecology (and beyond – even in economics for market diversity, in sociology for income diversity, etc.). It’s one more instance of the counting of possibilities linking different fields.
Neuroscience & AI – Efficient coding and memory constraints: Brains, it turns out, often seem to follow information-theoretic and thermodynamic principles.
Efficient coding in sensory systems: In the 1950s, Horace Barlow hypothesized that sensory neurons reduce redundancy in inputs – essentially performing compression to make the most of limited capacity en.wikipedia.org en.wikipedia.org. For example, the retina decorrelates visual input (a form of compression) because the optic nerve has far lower bandwidth than the raw eye data stream en.wikipedia.org. Experimental evidence supports this: the neural code in eyes and ears approximates removing predictable components and coding surprises, much like how JPEG removes redundant image information. The brain is constrained by energy and channel capacity, so an efficient (near-entropy-limit) code is favored by natural selection. This is an info-theoretic spin on why the brain’s early processing looks like compression/transformation – it’s maximizing information transfer with minimal spikes (since spikes cost metabolic energy).
Predictive coding & free energy principle: The brain might also minimize “surprise” (prediction error) at higher cognitive levels. Karl Friston’s free energy principle posits that brains avoid states of high surprise (high Shannon entropy) by continuously updating internal models. In simpler terms, your brain tries to model the world such that incoming stimuli are as predicted as possible, minimizing the Shannon entropy of sensory inputs given your model. This is still debated, but it’s a grand attempt to apply information entropy (surprise) minimization to neural dynamics and behavior. It connects to thermodynamics via an analogy: minimizing surprise is like minimizing a free energy in a thermodynamic system – a principle of least action for cognitive states.
Memory, forgetting, and thermodynamics: Every memory you form (storing information in synapses) presumably has a small thermodynamic cost – neurons use ATP, release some heat. Forgetting (perhaps a “pruning” of synapses or overwriting of memories) might be analogized to Landauer erasure in a loose sense: if you consider the brain+environment, disposing of memories might increase entropy (though brains often reuse and rewire rather than hard-erasing like a computer). There’s interesting work on the limit of human memory capacity and energy: our brains run on ~20 W; if we were at Landauer’s limit for every bit stored/erased, how close are we? (Spoiler: brains are far less efficient than Landauer’s limit, but they have other design constraints!)
LLMs and temporal contiguity: Here’s a striking recent finding – large language models, trained purely on text, show a “temporal contiguity” effect similar to human memory. In human free recall, if you remember one item from a list, you’re likely to next remember an item that was near it in the original list (a forward order bias). Large language models, in their attention patterns and text generations, tend to retrieve content with a similar forward-neighbor bias. Researchers have found that certain attention heads in Transformers (so-called “induction heads”) mimic the behavior of human episodic memory models neurips.cc. Essentially, because the training data has sequential structure, the model develops an implicit memory for that structure – an echo of how our brains link memories by temporal proximity. The fact that LLMs (information processors) and brains (biophysical systems) exhibit analogously entropy-minimizing strategies (like compressing context or biasing to recent context) suggests that there are convergent principles. Both are under pressure to efficiently encode and recall sequences – the AI because it’s learned to predict text (Shannon-style minimization of surprise), the brain because it’s adapted to a world where sequential recall is useful.
Why it matters: It shows that information theory isn’t just about engineering; it might also elucidate cognitive science. Concepts like entropy, redundancy reduction, surprise minimization appear in theories of perception (efficient coding) en.wikipedia.org en.wikipedia.org, memory, and even AI behaviors. It’s a rich area where thermodynamics, information, and biology intersect.
Mini-exercise: Take a simple dataset of natural images. Compute the pixel intensity entropy. Then perform a PCA or whitening (which removes correlations/redundancy), and compute the entropy of the transformed data’s distribution. You’ll often see the entropy go up after whitening – because the data is now spread out (less predictable per dimension). But note: the total entropy of the whole dataset doesn’t change under invertible transforms; you’re just moving redundancy around. The brain’s trick is doing something like whitening in the early visual system to spread information across neurons more evenly. This maximizes the use of each spike (each spike carries more new info on average).
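If you don't have an image dataset handy, here is a lighter-weight toy variant (a synthetic, slowly varying signal with simple delta coding standing in for the decorrelation step): predicting each sample from its neighbour and keeping only the surprise leaves a residual that needs far fewer bits per sample, which is the efficient-coding intuition from the first bullet above.
import math, random
from collections import Counter

def entropy_bits(symbols):
    # Shannon entropy (bits/sample) of the empirical distribution
    n = len(symbols)
    freq = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in freq.values())

random.seed(0)
signal, level = [], 128
for _ in range(50000):                       # synthetic correlated signal: a slow random walk in 0..255
    level = min(255, max(0, level + random.randint(-2, 2)))
    signal.append(level)

residual = [signal[0]] + [signal[i] - signal[i - 1] for i in range(1, len(signal))]  # keep only the "surprise"

print(f"Raw signal entropy        : {entropy_bits(signal):.2f} bits/sample")
print(f"Delta-coded (decorrelated): {entropy_bits(residual):.2f} bits/sample")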
Takeaway: Brains and AI systems seem to obey “entropy economics.” They have finite resources and adapt to efficiently encode information from the environment. Redundancy is reduced (like a compressor), predictions are made to reduce surprise (like an adaptive codec or an engine minimizing free energy), and even memory retrieval shows statistical biases reminiscent of information-optimal strategies. All these can be viewed through the lens of entropy and information – reinforcing that these concepts are fundamental in both technology and nature.
Quick prompts (optional): Why can’t we compress truly random data? (Because its Shannon entropy is maximal – no patterns to exploit en.wikipedia.org.) – How does NIST ensure an “entropy source” is good for cryptography? (They require statistical tests and a minimum min-entropy so that outputs are unpredictable tsapps.nist.gov.) – If two ecosystems have the same number of species, what factor does Shannon diversity capture that a simple species count doesn’t? (It captures evenness – a more even distribution gives higher entropy en.wikipedia.org.) – In ML terms, how is cross-entropy minimization (in training) related to the brain’s surprise minimization? (Cross-entropy is basically expected surprise – both the model and brain try to reduce unexpected events, aligning model predictions with actual data, thus lowering entropy of errors.)