21 Chapter 13: Research and Industry Applications - Where Algorithms Meet Reality

From Theory to Trillion-Dollar Industries

“The best way to predict the future is to invent it.” - Alan Kay

“And the best way to invent it is to understand the algorithms that make it possible.” - Every algorithm researcher ever

21.1 13.1 Introduction: Algorithms in the Wild

Here’s something that might surprise you: every major technological breakthrough of the last 50 years has algorithms at its core.

Think about it: - Google Search (1998): PageRank algorithm turned web search from terrible to magical, creating a $2 trillion company - Netflix recommendations (2006): Matrix factorization algorithms keep 230 million subscribers binge-watching - Bitcoin (2009): Cryptographic hash algorithms enabled decentralized currency, spawning a trillion-dollar market - AlphaGo (2016): Monte Carlo tree search + deep learning beat the world Go champion, a feat experts thought was decades away - COVID-19 vaccines (2020): Sequence alignment algorithms helped develop vaccines in record time, saving millions of lives - ChatGPT (2022): Transformer algorithms changed how we interact with computers - AlphaFold (2020): Deep learning algorithms solved the 50-year-old protein folding problem

But here’s what’s really exciting: we’re just getting started. The algorithms you’ve learned in this book aren’t just academic exercises—they’re the foundation for the next generation of breakthroughs.

In this chapter, we’ll explore: 1. What problems algorithm researchers are tackling right now 2. How algorithms power modern AI and machine learning 3. The challenges of processing data at planetary scale 4. How cryptography keeps our digital world secure 5. The ethical implications when algorithms make life-changing decisions 6. How you can contribute to algorithmic research

Let’s see where algorithms are taking us!

21.2 13.2 Current Research Trends in Algorithms

21.2.1 13.2.1 Beyond Worst-Case Analysis: Algorithms for the Real World

For decades, algorithm analysis focused obsessively on worst-case complexity. If quicksort has O(n²) worst case, we worried about it constantly, even though it almost never happens in practice.

But around 2000, researchers started asking: “What if we analyzed algorithms the way they actually perform?”

This led to several revolutionary frameworks:

21.2.1.1 Smoothed Analysis

Proposed by Spielman and Teng (2001), smoothed analysis asks: What’s the average performance when inputs are slightly perturbed by random noise?

Why this matters: Real-world inputs are never perfectly adversarial. There’s always some randomness—measurement errors, rounding, unpredictable human behavior.

Classic example - The Simplex Algorithm:

The simplex algorithm (1947) for linear programming has exponential worst-case complexity, but works incredibly well in practice. For 50 years, this was a mystery.

Spielman and Teng proved: under smoothed analysis, simplex runs in polynomial time! The “worst cases” are so fragile that the tiniest random perturbation destroys them.

Impact: This earned Spielman the Nevanlinna Prize (essentially the Nobel of computer science). It explained why many algorithms work far better than their worst-case suggests.

import numpy as np
import time

def demonstrate_smoothed_analysis():
    """
    Demonstrate smoothed analysis with quicksort.
    
    Worst-case input: sorted array → O(n²)
    But add tiny noise → back to O(n log n)!
    """
    print("=== Smoothed Analysis: Quicksort ===\n")
    
    n = 10000
    
    # Worst-case input: sorted array
    sorted_array = list(range(n))
    
    # Add tiny noise (smoothing)
    smoothed_array = sorted_array.copy()
    noise_level = 0.01  # Swap 1% of elements
    num_swaps = int(n * noise_level)
    for _ in range(num_swaps):
        i, j = np.random.randint(0, n, 2)
        smoothed_array[i], smoothed_array[j] = smoothed_array[j], smoothed_array[i]
    
    # Time worst-case (approximation with Python's sort)
    start = time.time()
    _ = sorted(sorted_array)
    worst_time = time.time() - start
    
    # Time smoothed case
    start = time.time()
    _ = sorted(smoothed_array)
    smoothed_time = time.time() - start
    
    print(f"Array size: {n:,}")
    print(f"Noise level: {noise_level*100}% element swaps")
    print(f"\nWorst-case input (sorted): {worst_time*1000:.3f}ms")
    print(f"Smoothed input (1% noise): {smoothed_time*1000:.3f}ms")
    print(f"\nKey insight: Tiny perturbations destroy worst cases!")
    print("This explains why many algorithms work better than theory predicts.")

21.2.1.2 Instance-Optimal Algorithms

An algorithm is instance-optimal if it’s the best possible for every input, not just worst-case.

Example: Fagin et al.’s instance-optimal join algorithms (2003) for database queries. These algorithms detect what kind of join you’re doing (easy or hard) and adapt automatically.

Why this matters: Traditional “one-size-fits-all” algorithms are being replaced by algorithms that adapt to input characteristics.

21.2.1.3 Fine-Grained Complexity

Around 2015, researchers realized: many problems seem to require specific running times (like O(n²) for edit distance), and we can’t do better even though we can’t prove it.

The Strong Exponential Time Hypothesis (SETH): A conjecture that k-SAT requires 2^n time for some k.

If SETH is true, it implies lower bounds for hundreds of problems: - Edit distance requires Ω(n²) - Longest common subsequence requires Ω(n²) - Frequent itemset mining requires exponential time

Impact: This explains why, despite 50 years of effort, we haven’t found faster algorithms for certain problems. They’re probably impossible!

Current research: Mapping the landscape of which problems are “equivalent” in difficulty. If you solve one problem faster, you can solve hundreds of others faster.

21.2.2 13.2.2 Quantum Algorithms: The Revolution That’s Actually Happening

Quantum computing used to be science fiction. Not anymore.

Current state (2024): - Google’s Willow chip (2024): 105 qubits with error correction - IBM’s quantum computers: Available via cloud - Amazon Braket: Quantum computing as a service - IonQ, Rigetti, and others: Commercial quantum computers

But here’s the crucial question: What can quantum computers actually do?

21.2.2.1 Shor’s Algorithm: Breaking the Internet

In 1994, Peter Shor proved that quantum computers can factor integers in polynomial time: O((log N)³).

Why this matters: RSA encryption (which secures most of the internet) relies on factoring being hard. A large quantum computer would break RSA instantly.

How it works (simplified): 1. Factoring N reduces to finding the period of a function f(x) = a^x mod N 2. Quantum computers can find periods exponentially faster using the Quantum Fourier Transform 3. Once you know the period, you can factor N efficiently

The quantum threat timeline: - 2024: Current quantum computers can factor ~10-digit numbers - ~2030-2035: Cryptographically relevant quantum computers predicted - Right now: Organizations are transitioning to “post-quantum cryptography”

Real-world response: NIST standardized post-quantum cryptographic algorithms in 2024. Banks, governments, and tech companies are already upgrading their systems.

21.2.2.2 Grover’s Algorithm: Quantum Search

Grover’s algorithm (1996) searches an unsorted database in O(√N) instead of O(N).

Why this matters: This is a provable quadratic speedup for a fundamental problem.

Implications: - Brute-force search: If classical takes 2^128 operations, quantum takes 2^64 (still hard, but concerning) - Symmetric cryptography: Need to double key sizes (AES-128 → AES-256) - NP-complete problems: Get √ speedup (not enough to solve them efficiently, but still significant)

The intuition: Classical search checks possibilities one by one. Quantum search uses “amplitude amplification” to boost the probability of the correct answer, checking many possibilities simultaneously through superposition.

21.2.2.3 Quantum Simulation: The Killer App

The most promising quantum application isn’t breaking codes—it’s simulating quantum systems.

Why classical computers struggle: Simulating n quantum particles requires 2^n classical bits. For just 300 particles, that’s more atoms than in the universe!

Quantum advantage: Quantum computers naturally simulate quantum systems efficiently.

Applications already being explored: - Drug discovery: Simulating molecular interactions (Pfizer, Biogen are investing heavily) - Materials science: Designing better batteries, superconductors, catalysts - Optimization: Portfolio optimization, route planning (D-Wave’s quantum annealers) - Machine learning: Quantum neural networks (jury still out on practical advantage)

21.2.2.4 The Limitations

Important reality check: Quantum computers aren’t magic.

What quantum computers DON’T speed up: - Sorting: Still Ω(n log n) (maybe √n speedup on some measures) - Graph problems: Most remain hard - Matrix multiplication: No proven speedup - Database operations: No fundamental speedup beyond Grover

The engineering challenge: Quantum states are fragile. Noise and “decoherence” corrupt calculations. Current quantum computers work for seconds before errors dominate.

Error correction: Requires ~1000 physical qubits per logical qubit. This is why we’re still years from breaking RSA despite having quantum computers today.

21.2.3 13.2.3 Learning-Augmented Algorithms: When ML Meets Classical CS

Imagine combining the worst-case guarantees of classical algorithms with the pattern-recognition power of machine learning. That’s the promise of learning-augmented algorithms.

21.2.3.1 The Concept

Traditional algorithms: Designed by humans, work for all inputs, worst-case guarantees.

Machine learning: Learn from data, work great on typical inputs, no guarantees.

Learning-augmented algorithms: Use ML predictions + classical algorithms as backup.

The framework (Lykouris & Vassilvitskii, 2018): - ML provides “hints” or predictions - Algorithm uses hints when they’re good - Falls back to classical algorithm when predictions are wrong - Guarantee: Never worse than O(α) × classical, often much better

21.2.3.2 Learned Index Structures

One of the most successful examples: learned indexes (Kraska et al., 2018).

Traditional B-tree index: O(log n) lookup, works for any data distribution.

Learned index: Train neural network to predict position of key in sorted array. When predictions are accurate, lookup is O(1)!

The trick: Use B-tree as safety net. Structure is:

Prediction: NN(key) → approximate position
Verify: Check nearby positions
Fallback: If not found quickly, use B-tree

Results: Google reported 3-5x speedup on real database workloads.

Why it works: Real data has patterns! Dates, IDs, names follow distributions ML can learn.

21.2.3.3 Learned Caching

Cache eviction (which item to remove when cache is full) is fundamental to systems performance.

Traditional: LRU (Least Recently Used) - Evict item unused for longest time - No lookahead, purely reactive

Learning-augmented: Belady-inspired - ML predicts when items will be used next - Evict item that won’t be needed for longest time - Falls back to LRU if predictions are poor

Results (Lykouris & Vassilvitskii, 2018): Up to 50% improvement in cache hit rate.

Real deployment: Partially deployed in CDN systems, file systems.

21.2.3.4 Learned Optimizers

Database query optimization is NP-hard. Traditional optimizers use heuristics.

Learned optimizers (Marcus & Papaemmanouil, 2018): - Train on past query execution times - Learn which join orders, which indexes to use - Adapt to specific workload patterns

Results: PostgreSQL with learned optimizer: 2-3x faster on analytics workloads.

Deployment: Still mostly research, but major databases (Oracle, SQL Server) are incorporating ML.

21.2.3.5 The Theory

Consistency-robustness tradeoff: You can’t be arbitrarily good when predictions are accurate AND arbitrarily close to optimal when they’re wrong.

Formal results: For many problems, we now know: - The best possible consistency (how good with perfect predictions) - The best possible robustness (how bad with worst predictions) - The tradeoff curve between them

Open problems: Most learning-augmented algorithms are still being discovered. Active research areas: - Learned scheduling - Learned routing - Learned compression - Learned streaming algorithms

21.2.4 13.2.4 Differential Privacy: Computing on Sensitive Data

With data breaches everywhere, researchers asked: Can we learn from data without violating privacy?

21.2.4.1 The Problem

Traditional anonymization fails. AOL released “anonymized” search logs in 2006—journalists identified specific users within days.

Netflix released “anonymized” viewing data in 2007—researchers de-anonymized users by correlating with IMDB reviews.

The insight: Simply removing names doesn’t protect privacy. Statistical patterns can reveal individuals.

21.2.4.2 Differential Privacy (Dwork et al., 2006)

Definition: An algorithm is ε-differentially private if changing one person’s data changes the output distribution by at most e^ε.

Intuitive meaning: Observing the output teaches you almost nothing about any individual.

How it works: Add carefully calibrated random noise to results.

Example: Census data

True count of city population: 1,234,567
Add Laplace noise: ±300 (depending on privacy parameter ε)
Released count: 1,234,823

Privacy guarantee: Even if you know everyone else's data, 
you can't tell if any specific person is in the dataset.

21.2.4.3 Real-World Deployment

U.S. Census Bureau (2020): First census using differential privacy. Injected noise to protect individuals while maintaining statistical accuracy.

Apple (2016-): Uses local differential privacy for: - Autocorrect suggestions - Safari web tracking data - Health data trends

Google (2014-): RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) for Chrome telemetry.

21.2.4.4 The Algorithms

Laplace mechanism: For numeric queries (counts, sums)

Add noise from Laplace(Δf/ε) distribution
where Δf = sensitivity (max change from one person)

Exponential mechanism: For choosing from a set of options

Probability of choosing option o ∝ exp(ε × quality(o) / 2Δ)
Better options more likely, but randomized for privacy

Sparse vector technique: For answering many queries efficiently.

21.2.4.5 The Cost of Privacy

Accuracy vs. Privacy tradeoff: More privacy (smaller ε) means more noise, less accurate results.

Typical values: - ε = 0.1: Very strong privacy, significant noise - ε = 1: Strong privacy, moderate noise - ε = 10: Weak privacy, little noise

Composition: Privacy budget depletes with each query. Answer n queries → effective privacy ≈ √n × ε (with advanced composition).

Current research: - How to allocate privacy budget optimally? - Can we get better accuracy for the same privacy? - Local vs. central differential privacy tradeoffs

21.2.5 13.2.5 Algorithmic Fairness: Eliminating Bias

Algorithms are making life-changing decisions: loan approvals, hiring, criminal sentencing, medical diagnoses. But what if the algorithms are biased?

21.2.5.1 How Bias Creeps In

Historical bias: Training data reflects past discrimination - Example: Amazon’s hiring algorithm (discontinued 2018) penalized résumés mentioning “women’s” (as in “women’s chess club”) - Why: Historical hires were mostly male, algorithm learned to prefer male candidates

Representation bias: Training data doesn’t represent everyone - Example: Facial recognition works worse for darker skin tones (Buolamwini & Gebru, 2018) - Why: Training datasets over-represented lighter skin tones

Measurement bias: Labels reflect biased decisions - Example: COMPAS recidivism prediction (Northpointe) - Why: Historical arrest data reflects policing patterns, not just crime patterns

21.2.5.2 Defining Fairness

Turns out, “fairness” isn’t one thing. Multiple mathematical definitions exist, and they’re mutually exclusive!

Individual fairness: Similar people treated similarly - Formally: d(x₁, x₂) small → |f(x₁) - f(x₂)| small - Problem: Defining “similar” is subjective

Group fairness (Demographic parity): Equal outcomes across groups - Formally: P(Ŷ=1|A=a) = P(Ŷ=1|A=b) for protected attribute A - Example: Loan approval rate same for all races - Problem: May be unfair if groups have different qualification distributions

Equal opportunity: Equal true positive rates across groups - Formally: P(Ŷ=1|Y=1,A=a) = P(Ŷ=1|Y=1,A=b) - Example: Among qualified applicants, approval rate same for all races - Used when false negatives are more concerning than false positives

Calibration: Predictions equally accurate across groups - Formally: P(Y=1|Ŷ=p,A=a) = P(Y=1|Ŷ=p,A=b) = p - Example: If algorithm says 70% risk, actual risk should be 70% for all groups

Impossibility result (Kleinberg et al., 2016): You can’t satisfy calibration, equal opportunity, AND balance (equal positive predictive value) simultaneously unless base rates are equal or the classifier is perfect.

This means: We must make value judgments about which fairness criterion matters most for each application.

21.2.5.3 Fairness Algorithms

Preprocessing: Clean training data - Reweighing (Kamiran & Calders, 2012): Weight training examples to balance groups - Learning fair representations (Zemel et al., 2013): Transform features to remove bias

In-processing: Constrained optimization - Zafar et al. (2017): Add fairness constraints to loss function - Agarwal et al. (2018): Reduction approach—convert any ML algorithm to fair version

Post-processing: Adjust predictions - Hardt et al. (2016): Calibrate thresholds per group to achieve equal opportunity - Pleiss et al. (2017): Isotonic regression for calibration across groups

21.2.5.4 Real-World Examples

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions): - Used in criminal sentencing across the U.S. - ProPublica investigation (2016): Found racial disparities in false positive rates - Northpointe response: Algorithm is calibrated (predictions equally accurate across races) - The controversy: Both sides were mathematically correct! Different fairness definitions.

Hiring algorithms: - Amazon shut down AI recruiting tool (2018) due to gender bias - HireVue uses video interviews + AI scoring (ongoing fairness concerns) - LinkedIn’s approach: Separate models for different groups, combined carefully

Credit scoring: - Apple Card investigation (2019): Appeared to give men higher limits than women - Problem: Hard to debug—algorithm is proprietary, individual factors confidential - Regulation: EU’s GDPR includes “right to explanation” for algorithmic decisions

21.2.5.5 Current Research

Multi-objective optimization: Can we be fair to multiple groups simultaneously?

Long-term fairness: Short-term equal outcomes might not lead to long-term fairness. Example: If algorithm rejects qualified minority applicants, they don’t build credit history, perpetuating inequality.

Feedback loops: Biased predictions → biased actions → biased future data → more biased predictions. How to break the cycle?

Fairness without demographics: Can we ensure fairness without knowing sensitive attributes? (Important for privacy, but algorithmically challenging)

21.3 13.3 Algorithms in AI and Machine Learning

Machine learning has transformed from academic curiosity to world-changing technology. Let’s understand the algorithms that make it work.

21.3.1 13.3.1 Deep Learning: The Revolution

In 2012, a neural network called AlexNet won the ImageNet competition by a shocking margin. It started the deep learning revolution that gave us: - Image recognition better than humans - Real-time language translation - Self-driving cars - ChatGPT and Large Language Models

But how do neural networks actually learn?

21.3.1.1 Backpropagation: The Learning Algorithm

The setup: A neural network is a function f(x; θ) where θ are parameters (weights). We want to minimize loss L(f(x; θ), y).

The challenge: Networks have millions of parameters. How do we compute ∂L/∂θ for each one?

Naive approach: Finite differences

∂L/∂θᵢ ≈ (L(θ + εeᵢ) - L(θ)) / ε

For n parameters, this requires n forward passes. For a million parameters, that’s impossibly slow!

Backpropagation (Rumelhart et al., 1986): Use the chain rule to compute all gradients in one backward pass.

How it works:

Forward pass: Compute network output

Layer 1: h₁ = σ(W₁x + b₁)
Layer 2: h₂ = σ(W₂h₁ + b₂)
...
Output: ŷ = σ(Wₙhₙ₋₁ + bₙ)
Loss: L = (ŷ - y)²

Backward pass: Compute gradients layer by layer

∂L/∂ŷ = 2(ŷ - y)
∂L/∂Wₙ = ∂L/∂ŷ × ∂ŷ/∂Wₙ = ∂L/∂ŷ × hₙ₋₁
∂L/∂hₙ₋₁ = ∂L/∂ŷ × ∂ŷ/∂hₙ₋₁ = ∂L/∂ŷ × Wₙ
(continue backwards through layers)

The magic: Each gradient computation reuses calculations from the layer above. Total cost: one forward pass + one backward pass, regardless of number of parameters!

Time complexity: O(E) where E = number of edges in network (typically E ≈ n for n parameters).

Why this matters: Without backpropagation, training deep networks would be impossible. It’s the algorithm that makes deep learning feasible.

21.3.1.2 Stochastic Gradient Descent: The Optimization Workhorse

Once we have gradients, how do we optimize?

Gradient descent: θ ← θ - η∇L(θ)

Problem: Computing L(θ) requires entire dataset. For millions of examples, one update takes forever!

Stochastic Gradient Descent (SGD): Use one random example at a time

Pick random example (x, y)
Compute gradient ∇L(f(x; θ), y)
Update: θ ← θ - η∇L

Mini-batch SGD: Use small batches (typically 32-256 examples) - Balances speed vs. gradient accuracy - Enables parallel computation on GPUs - Reduces gradient noise

Why SGD works: Individual gradients are noisy, but on average point toward optimum. The noise even helps escape bad local minima!

21.3.1.3 Modern Optimizers

Momentum (1964): Accelerate in consistent directions

v ← βv + ∇L    (velocity accumulates gradients)
θ ← θ - ηv     (update includes momentum)

Effect: Smoother optimization, faster convergence, dampens oscillations.

Adam (Kingma & Ba, 2014): Adaptive learning rates per parameter

m ← β₁m + (1-β₁)∇L           (first moment: mean)
v ← β₂v + (1-β₂)(∇L)²        (second moment: variance)
m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)  (bias correction)
θ ← θ - η m̂/√(v̂ + ε)          (update)

Effect: Parameters with large gradients get smaller updates (more conservative). Parameters with small gradients get larger updates (more aggressive).

Why Adam is popular: Works well with minimal hyperparameter tuning. Default choice for many applications.

Current research: Better optimizers (AdamW, LAMB), understanding why SGD generalizes better than sophisticated methods, adversarial examples.

21.3.2 13.3.2 Transformers: The Architecture Revolution

In 2017, “Attention is All You Need” (Vaswani et al.) introduced Transformers. They now power: - GPT (ChatGPT, GPT-4) - BERT (Google Search) - T5 (translation) - DALL-E, Midjourney (image generation) - AlphaFold (protein folding)

21.3.2.1 The Self-Attention Mechanism

The problem: Understanding context in sequences. In “The animal didn’t cross the street because it was too tired”, what does “it” refer to?

RNNs/LSTMs: Process sequentially, struggle with long-range dependencies.

Transformers: Process entire sequence simultaneously using attention.

How attention works:

For each position i, compute how much to “attend” to each other position j:

Query, Key, Value: For each word, compute three vectors

Qᵢ = Wᵠxᵢ  (what I'm looking for)
Kᵢ = Wₖxᵢ  (what I contain)
Vᵢ = Wᵥxᵢ  (what I'll contribute)

Attention scores: How relevant is position j to position i?

scoreᵢⱼ = Qᵢ · Kⱼ / √d  (dot product, scaled)

Softmax: Convert scores to probabilities

αᵢⱼ = exp(scoreᵢⱼ) / Σₖ exp(scoreᵢₖ)

Weighted sum: Output is weighted combination of values

outputᵢ = Σⱼ αᵢⱼVⱼ

Intuition: “The animal” has high attention to “it” and “tired”, learning that “it” refers to the animal, not the street.

Time complexity: O(n²d) where n = sequence length, d = dimension - Quadratic in sequence length (problem for long sequences!) - But parallelizes perfectly (unlike RNNs)

21.3.2.2 Multi-Head Attention

Run attention multiple times in parallel with different learned projections:

head₁ = Attention(Q₁, K₁, V₁)
head₂ = Attention(Q₂, K₂, V₂)
...
output = Concat(head₁, head₂, ...) × Wₒ

Why: Different heads learn different relationships. One might focus on syntax, another on semantics, another on coreference.

Typical setup: 8-16 heads. GPT-3 uses 96 heads!

21.3.2.3 Positional Encoding

Problem: Attention is permutation-invariant. “Dog bites man” and “Man bites dog” look the same!

Solution: Add position information to input embeddings

PE(pos, 2i) = sin(pos/10000^(2i/d))
PE(pos, 2i+1) = cos(pos/10000^(2i/d))

Why sinusoidal: Allows model to learn relative positions. Also extrapolates to longer sequences than training.

21.3.2.4 The Full Transformer Architecture

Encoder (for understanding):

Input → Embedding + Positional Encoding
     → Multi-Head Attention
     → Add & Normalize
     → Feed-Forward Network
     → Add & Normalize
     → (repeat for multiple layers)

Decoder (for generation):

(similar to encoder, but with masked attention 
 to prevent looking at future tokens)

Training objective: Predict next token

Given "The cat sat on the"
Predict "mat"

Scaling laws (Kaplan et al., 2020): Performance improves smoothly with: - Model size (number of parameters) - Data size (number of training tokens) - Compute (GPU hours)

Power law: Loss ∝ N^(-α) where N is model size.

This led to: - GPT-2 (2019): 1.5B parameters - GPT-3 (2020): 175B parameters - GPT-4 (2023): ~1.7T parameters (estimated)

21.3.2.5 Efficient Transformers

The n² problem: Standard attention is quadratic in sequence length.

Solutions:

Sparse attention (Child et al., 2019): Only attend to subset of positions - Local attention: nearby tokens - Global attention: special tokens - Reduces to O(n√n) or O(n log n)

Linformer (Wang et al., 2020): Project keys/values to lower dimension - Reduces to O(nd) where d << n

Flash Attention (Dao et al., 2022): Optimize memory access patterns - Same complexity, but 2-4x faster wall-clock time - Key innovation: algorithmic improvements for GPUs

Current state: Can now handle sequences up to 100k+ tokens (vs. 512 in original Transformer).

21.3.3 13.3.3 Reinforcement Learning: Learning by Doing

Reinforcement learning (RL) achieved: - AlphaGo beating world champion (2016) - OpenAI Five beating Dota 2 pros (2018) - AlphaStar mastering StarCraft II (2019) - ChatGPT’s human-like responses (2022)

How does RL work?

21.3.3.1 The RL Framework

Setup: - Agent in environment - At each timestep: observe state s, take action a, receive reward r - Goal: maximize cumulative reward

The challenge: Actions have delayed consequences. Sacrificing a piece in chess might lead to winning 20 moves later.

Value function: V(s) = expected cumulative reward starting from state s

Q-function: Q(s,a) = expected cumulative reward from state s after action a

The Bellman equation:

Q(s,a) = r + γ × max_a' Q(s',a')
where s' = next state after action a
      γ = discount factor (typically 0.99)

Intuitive meaning: Value of state = immediate reward + discounted future value.

21.3.3.2 Q-Learning: The Classic Algorithm

Q-learning (Watkins, 1989): Learn Q-function through experience

Algorithm:

Initialize Q(s,a) arbitrarily
Loop:
    Observe state s
    Choose action a (ε-greedy: random with probability ε)
    Take action, observe reward r and next state s'
    Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

The update rule: Move Q-value toward observed reward + future value.

Exploration vs. exploitation: ε-greedy balances trying new actions (exploration) with using known good actions (exploitation).

Convergence: Provably converges to optimal Q-function if all state-action pairs are visited infinitely often.

21.3.3.3 Deep Q-Networks (DQN)

The scaling problem: Q-learning stores Q(s,a) in table. For Atari games: 10^9 states × 18 actions = 18 billion entries!

Solution (Mnih et al., 2015): Approximate Q with neural network

Q(s,a; θ) ≈ Q*(s,a)

Training: Use TD (temporal difference) error as loss

Loss = [r + γ max_a' Q(s',a'; θ⁻) - Q(s,a; θ)]²
where θ⁻ = target network parameters (updated periodically)

Key innovations:

Experience replay: Store past experiences (s,a,r,s’) in buffer, sample randomly for training - Breaks correlation between consecutive samples - Improves data efficiency

Target network: Use old parameters θ⁻ for computing targets - Stabilizes learning (target isn’t constantly moving) - Update periodically (every 10k steps)

Results: DQN learned to play 49 Atari games from pixels, achieving human-level performance on many.

21.3.3.4 Policy Gradient Methods

Alternative approach: Learn policy π(a|s) directly (probability of action a in state s).

REINFORCE (Williams, 1992): Increase probability of actions that led to high reward

∇J(θ) = E[∇ log π(a|s; θ) × R]
where R = cumulative reward

Intuition: If an action led to good outcome, make it more likely. If bad outcome, make it less likely.

Actor-Critic: Combine value function (critic) with policy (actor)

Actor: π(a|s; θ)
Critic: V(s; w)
Update actor using critic's value estimate as baseline

Advantage: Reduces variance, learns faster.

21.3.3.5 Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017): Current state-of-the-art policy gradient method.

The problem: Policy gradients are unstable. One bad update can destroy learned policy.

PPO’s solution: Constrain policy updates

Maximize: min(ratio × A, clip(ratio, 1-ε, 1+ε) × A)
where ratio = π_new(a|s) / π_old(a|s)
      A = advantage estimate
      ε = clipping parameter (typically 0.2)

Effect: Limits how much policy can change per update. More stable, more reliable.

Applications: - OpenAI Five (Dota 2) - AlphaStar (StarCraft II) - ChatGPT (RLHF: Reinforcement Learning from Human Feedback)

21.3.3.6 AlphaGo: Putting It All Together

AlphaGo combined multiple techniques:

Supervised learning: Train on expert human games

Policy network: predict human moves
Value network: evaluate board positions

Self-play RL: Play against itself millions of times

Policy improvement via policy gradient
Value updates via TD learning

Monte Carlo Tree Search: Smart exploration during game play

Selection: Choose promising moves (UCB formula)
Expansion: Add new nodes
Simulation: Use neural nets to evaluate
Backpropagation: Update statistics

The innovation: Deep learning (pattern recognition) + tree search (planning)

Results: - AlphaGo beat Lee Sedol 4-1 (2016) - AlphaGo Zero learned from scratch, beat AlphaGo 100-0 (2017) - AlphaZero generalized to chess and shogi (2017)

Impact: Showed that RL + deep learning could master complex strategic games, opening path to real-world applications.

21.4 13.4 Big Data Algorithms: Computing at Planetary Scale

When Google processes 8.5 billion searches per day, Facebook handles 4 petabytes of new data daily, and Netflix streams 250 million hours of video daily, traditional algorithms break down.

Welcome to big data algorithms: techniques for when data doesn’t fit in memory.

21.4.1 13.4.1 The MapReduce Revolution

21.4.1.1 The Problem

Traditional algorithm analysis assumes: data fits in RAM, you can access any element instantly.

Reality in 2004 (when Google invented MapReduce): - Web: billions of pages - Indexing: terabytes of data - Distributed across thousands of machines - Machines fail constantly

Traditional approach doesn’t work: Can’t load everything into one machine’s memory. Can’t write algorithms assuming reliable hardware.

21.4.1.2 The MapReduce Paradigm

Key insight: Most data processing has two phases: 1. Map: Apply function to each element independently 2. Reduce: Aggregate results

Example - Word Count:

Input: Millions of documents

Map phase:
Document 1 → ("hello", 1), ("world", 1), ("hello", 1)
Document 2 → ("world", 1), ("foo", 1)
...

Shuffle phase (automatic):
("hello", [1, 1, ...])
("world", [1, 1, ...])
("foo", [1, ...])

Reduce phase:
("hello", [1, 1, ...]) → ("hello", 4521)
("world", [1, 1, ...]) → ("world", 3892)
("foo", [1, ...]) → ("foo", 1023)

What makes MapReduce powerful:

Automatic parallelization: Framework handles distributing work - No explicit thread management - No message passing - Just write map() and reduce() functions

Fault tolerance: If machine fails, rerun just that task - Map tasks are idempotent (can rerun safely) - Output written to distributed file system - Automatic retry on failure

Data locality: Move computation to data - Minimize network transfer - Process data where it’s stored

Scalability: Add more machines → proportionally faster - Google runs MapReduce on 100,000+ machines - No algorithmic changes needed

21.4.1.3 MapReduce Algorithms

Many algorithms can be expressed in MapReduce:

PageRank:

Map: For each page, emit (link_target, pagerank/num_links)
Reduce: Sum contributions to get new pagerank
Iterate until convergence

Inverted index (Google Search):

Map: For each document, emit (word, doc_id)
Reduce: Collect all doc_ids for each word

Join (database operation):

Map: Emit (join_key, (table_name, record))
Reduce: Combine records with same key from different tables

Matrix multiplication:

A × B where A is n×m, B is m×p

Map: For each A[i,k], emit ((i,j), A[i,k] × B[k,j]) for all j
Reduce: Sum all contributions for each (i,j)

21.4.1.4 Limitations and Evolution

MapReduce limitations: - High latency (disk I/O between stages) - Not great for iterative algorithms - Programmer has to think in map/reduce paradigm

Apache Spark (2012): In-memory successor - Keep data in RAM between operations - 10-100x faster for iterative algorithms - More expressive programming model

Apache Flink (2014): True streaming - Process data as it arrives (real-time) - Event time processing - Exactly-once guarantees even with failures

21.4.2 13.4.2 Streaming Algorithms: Computing in One Pass

The constraint: Data arrives as stream, you can only look at it once, using limited memory.

Applications: - Network monitoring (terabytes/day of traffic) - Social media analytics (millions of posts/second) - Financial trading (microsecond decisions) - Sensor networks (billions of IoT devices)

21.4.2.1 Count-Min Sketch

Problem: Count frequency of millions of distinct items, but you only have memory for thousands of counters.

Naive approach: Hash table → O(n) space where n = number of distinct items. If n = billions, you’re out of memory.

Count-Min Sketch (Cormode & Muthukrishnan, 2005):

Data structure: w × d array of counters (typically w=2000, d=5)
Hash functions: h₁, h₂, ..., hₐ

Update(item):
    for i = 1 to d:
        count[i][hᵢ(item)]++

Query(item):
    return min(count[i][hᵢ(item)] for i = 1 to d)

Why it works: - True count ≤ returned count (never underestimate) - With high probability: returned count ≤ true count + ε × total_items - Space: O((1/ε) × log(1/δ)) where δ = failure probability

Applications: - Network traffic analysis - Top-k frequent items - Heavy hitters detection

Real deployment: Google uses Count-Min Sketch in their data centers for monitoring.

21.4.2.2 HyperLogLog

Problem: Count number of distinct items in stream.

Naive approach: Hash set → O(n) space.

HyperLogLog (Flajolet et al., 2007):

Space: O(log log n) → Yes, double logarithm!

Algorithm:
1. Hash each item to binary string
2. Count leading zeros: ρ(hash(item))
3. Keep maximum: M = max(ρ(hash(item)) for all items)
4. Estimate: distinct_count ≈ 2^M

Refinement: Use m buckets, combine estimates

Why it works: If you’ve seen n distinct items, you expect one hash to have log₂(n) leading zeros.

Accuracy: Error ≈ 1.04/√m where m = number of buckets.

Example: 1% error with just 16 KB of memory, for billions of distinct items!

Real deployment: - Reddit uses HyperLogLog for unique visitor counts - Azure CosmosDB uses it internally - Redis has HyperLogLog built-in

21.4.2.3 Bloom Filters

Problem: Test set membership (“have I seen this before?”) with limited memory.

Bloom Filter (Bloom, 1970):

Data structure: bit array of size m
Hash functions: k different hash functions

Add(item):
    for each hash function h:
        set bit[h(item)] = 1

Query(item):
    for each hash function h:
        if bit[h(item)] = 0:
            return "definitely not present"
    return "probably present"

Properties: - No false negatives (if it says “not present”, it’s really not present) - Possible false positives (if it says “present”, might be wrong) - False positive probability ≈ (1 - e^(-kn/m))k

Optimal parameters: k = (m/n) × ln(2) hash functions minimizes false positives.

Applications: - Web browsers: Check if URL is malicious before visiting - Databases: Avoid expensive disk lookups - Distributed systems: Check if data is cached

Chrome’s Safe Browsing: Uses Bloom filter to check 1M+ malicious URLs locally before querying Google’s servers.

21.4.3 13.4.3 Graph Processing at Scale

The challenge: Social networks have billions of users, trillions of connections. How do you compute PageRank, find communities, detect fraud?

21.4.3.1 Pregel: Thinking Like a Vertex

Pregel (Google, 2010): Framework for graph algorithms on billions of nodes.

Programming model:

class Vertex:
    def compute(self, messages):
        # Process messages from neighbors
        # Update vertex state
        # Send messages to neighbors
        # Vote to halt or continue

Computation proceeds in supersteps:
1. All vertices process messages in parallel
2. Send messages for next superstep
3. Repeat until all vertices halt

Example - PageRank:

def compute(self, messages):
    if superstep > 0:
        self.pagerank = 0.15 + 0.85 × sum(messages)
    
    if superstep < 30:  # 30 iterations
        for neighbor in self.neighbors:
            send_message(neighbor, self.pagerank / len(self.neighbors))
    else:
        vote_to_halt()

Why this works at scale: - Vertices process independently (massive parallelism) - Only send messages to neighbors (limited communication) - Automatic fault tolerance (rerun failed partitions) - Graph partitioning optimizes locality

Deployed at: Google (Pregel), Facebook (Apache Giraph), Twitter (Cassovary)

21.4.3.2 GraphX and GraphFrames

Built on Spark, these provide graph algorithms with: - Connected components - PageRank - Triangle counting - Shortest paths - Community detection

Integration with machine learning: Can combine graph structure with node features for: - Node classification - Link prediction - Graph neural networks

21.5 13.5 Security and Cryptographic Algorithms

Every secure website, encrypted message, digital signature, and blockchain depends on clever algorithms. Let’s explore the algorithms keeping the digital world secure—and the ones threatening to break it.

21.5.1 13.5.1 Public-Key Cryptography: The Mathematics of Secrets

21.5.1.1 The Problem That Seemed Impossible

Before 1976, secret communication required shared secret keys. If Alice and Bob wanted secure communication: 1. Meet in person to exchange key 2. Or trust a courier 3. Or use complex key distribution centers

For the internet: How do billions of people establish shared secrets?

The miracle: Public-key cryptography (Diffie-Hellman 1976, RSA 1977)

The idea: Two keys - Public key: Freely shared, used to encrypt - Private key: Kept secret, used to decrypt

Amazing property: Knowing the public key doesn’t help you figure out the private key (assuming certain mathematical problems are hard).

21.5.1.2 RSA: The Algorithm That Secured the Internet

RSA (Rivest, Shamir, Adleman, 1977) is based on a simple idea: multiplying primes is easy, factoring is hard.

Key generation:

1. Choose two large primes p, q (typically 1024 bits each)
2. Compute N = p × q (the modulus)
3. Compute φ(N) = (p-1)(q-1) (Euler's totient)
4. Choose public exponent e (commonly 65537)
5. Compute private exponent d where e × d ≡ 1 (mod φ(N))

Public key: (N, e)
Private key: (N, d)

Encryption: message^e mod N = ciphertext

Decryption: ciphertext^d mod N = message

Why it works (mathematically):

(message^e)^d ≡ message^(ed) (mod N)
Since ed ≡ 1 (mod φ(N)), by Euler's theorem:
message^(ed) ≡ message (mod N)

Security basis: Factoring N to find p and q is believed to be hard. Best known classical algorithm: General Number Field Sieve, time ≈ exp((log N)^(1/3)).

For 2048-bit N: Would take millions of years with current computers.

21.5.1.3 The Quantum Threat to RSA

Shor’s algorithm (1994): Quantum computer can factor N in polynomial time: O((log N)³).

Timeline: - 2012: Factor 21 using quantum computer - 2019: Factor 35 (still pathetic) - 2024: Still nowhere near breaking real RSA - But: Cryptographically relevant quantum computers might arrive by 2030-2035

The response: NIST’s post-quantum cryptography standardization (completed 2024).

Selected algorithms: - CRYSTALS-Kyber (key exchange): Based on “learning with errors” (LWE) - CRYSTALS-Dilithium (digital signatures): Based on lattice problems - FALCON (signatures): Based on NTRU lattices - SPHINCS+ (signatures): Based on hash functions

Why lattices: Best known quantum algorithms only achieve modest speedup against lattice problems. Believed to be quantum-resistant.

21.5.2 13.5.2 Blockchain and Cryptocurrencies

Love them or hate them, cryptocurrencies are a triumph of algorithm design. Let’s understand the algorithms, not the hype.

21.5.2.1 The Byzantine Generals Problem

The challenge: How do distributed parties agree on something when some might be malicious?

Byzantine Generals Problem (Lamport, 1982): - n generals surrounding city, need to coordinate attack - Some generals might be traitors - Communication by messenger (can be intercepted) - Goal: All loyal generals decide on same plan

Classical result: Need n ≥ 3f + 1 generals to tolerate f traitors.

Blockchain’s innovation: Use computational work (proof-of-work) instead of assuming number of honest parties.

21.5.2.2 Bitcoin’s Proof-of-Work

The algorithm:

Block contains:
- Previous block hash
- Transactions
- Nonce (random number)

Mining:
    repeat:
        nonce = random()
        hash = SHA256(SHA256(block_data || nonce))
        if hash < target:
            broadcast block
            break

The target: Adjusted so blocks found every ~10 minutes. Currently requires ~80 zeros in binary representation (2^80 hashes expected).

Why this secures Bitcoin:

Immutability: To change past transaction, you’d need to: 1. Remine that block (2^80 hashes) 2. Remine all subsequent blocks 3. Outpace the rest of the network

Consensus: Longest chain wins. Attackers would need 51% of network’s computational power to create longer chain.

Incentives: Miners get reward (currently 6.25 BTC ≈ $260k) + transaction fees. More profitable to mine honestly than attack.

21.5.2.3 The Energy Cost

Current state (2024): - Bitcoin network: ~400 EH/s (exahashes per second = 10^18) - Energy consumption: ~150 TWh/year (comparable to Argentina) - Carbon footprint: ~90 MT CO₂/year

The inefficiency: Only one miner wins per block. All other computation is “wasted” (though it provides security).

21.5.2.4 Alternative Consensus: Proof-of-Stake

Proof-of-Stake (Ethereum 2.0, 2022): Validators chosen based on stake (how much cryptocurrency they hold), not computational work.

Algorithm:

1. Validators lock up stake (32 ETH minimum)
2. Randomly selected to propose blocks (probability ∝ stake)
3. Other validators vote on validity
4. Rewards for honest behavior, penalties for malicious behavior

Advantages: - 99.95% less energy than proof-of-work - Faster finality (blocks confirmed in minutes, not hours) - 51% attack requires owning 51% of currency (very expensive)

Challenge: “Nothing at stake” problem—validators could vote for multiple chains. Solved through slashing (destroying stake of malicious validators).

Results: Ethereum’s “Merge” (Sept 2022) reduced energy consumption from 112 TWh/year to 0.01 TWh/year.

21.5.3 13.5.3 Zero-Knowledge Proofs: Proving Without Revealing

The amazing idea: Prove you know something without revealing what you know.

Example: Prove you know solution to Sudoku puzzle without showing the solution.

Applications: - Anonymous credentials (prove you’re over 18 without showing ID) - Private blockchain transactions (Zcash) - Scaling blockchains (zkRollups) - Password-less authentication

21.5.3.1 Interactive Zero-Knowledge

Original protocol (Goldwasser, Micali, Rackoff, 1985):

Prover-Verifier interaction (for graph 3-coloring):

Prover knows valid coloring of graph
Verifier wants to verify, but not learn coloring

Repeat many times:
    1. Prover randomly permutes colors, commits to new coloring
    2. Verifier randomly picks an edge
    3. Prover reveals colors of both endpoints
    4. Verifier checks: different colors? If yes, continue

After n rounds:
    If prover is honest: always passes
    If prover is cheating: probability of passing = (1 - 1/|E|)^n ≈ 0

Properties: - Completeness: Honest prover convinces verifier - Soundness: Cheating prover caught with high probability - Zero-knowledge: Verifier learns nothing except validity

21.5.3.2 Non-Interactive Zero-Knowledge (SNARKs)

Problem with interactive: Requires back-and-forth. Not suitable for blockchain.

SNARKs (Succinct Non-interactive ARguments of Knowledge): - Prover generates single proof - Anyone can verify - Proof is short (hundreds of bytes) - Verification is fast (milliseconds)

How it works (simplified):

1. Convert statement to arithmetic circuit
2. Use cryptographic pairing to create proof
3. Proof: π = combination of circuit values and randomness
4. Verification: Check pairing equation e(π, g) = e(h, vk)

Applications:

Zcash: Private transactions - Prove “I have money to send” without revealing how much or to whom - Transaction size: ~300 bytes - Verification: ~5ms

zkRollups: Scaling Ethereum - Bundle thousands of transactions - Generate proof that all transitions are valid - Post proof to blockchain (not all transaction data) - Result: 100x increase in throughput

Challenges: - Trusted setup (some schemes require initial ceremony) - Computational cost of proof generation (seconds to minutes) - Complexity of writing circuits

Current research: STARK proofs (no trusted setup), recursive composition (proofs of proofs), practical tooling.

21.6 13.6 Ethical Implications: When Algorithms Make Decisions

Algorithms aren’t neutral. They encode choices, reflect biases, and have real impacts on people’s lives. Let’s confront the ethical challenges head-on.

21.6.1 13.6.1 The Accountability Problem

Question: When an algorithm makes a mistake, who’s responsible?

21.6.1.1 Case Study: Tesla Autopilot

March 2018: Tesla Model X on Autopilot crashes into highway barrier, killing driver.

The algorithm: Neural network trained on millions of miles of driving data. Makes predictions 10 times per second.

The failure: Misclassified concrete barrier as continuation of road.

Questions: - Was the algorithm defective? - Was the driver misusing it? - Did Tesla adequately communicate limitations? - Should the algorithm have recognized its own uncertainty?

Current state: No clear legal framework. Liability unclear. Regulations being developed.

21.6.1.2 Case Study: Algorithmic Hiring

Amazon’s hiring algorithm (disclosed 2018): - Trained on 10 years of résumés from successful hires - Automatically ranked candidates - Discovered to penalize résumés mentioning “women’s” (as in women’s chess club)

The problem: Historical hires were biased → algorithm learned bias.

Amazon’s response: Discontinued the tool.

Questions: - Is it illegal? (Disparate impact under Civil Rights Act) - Even if algorithm is more accurate than humans, is it fair? - Should protected attributes be included (to ensure fairness) or excluded (to prevent discrimination)?

No easy answers: Companies now use fairness-aware ML, but what “fair” means is contested.

21.6.2 13.6.2 Transparency vs. Performance

The dilemma: Most accurate models (deep learning) are least interpretable.

Example: COMPAS recidivism prediction - Predicts whether criminal defendant will reoffend - Used in sentencing decisions across U.S. - Proprietary algorithm, opaque to defendants and judges

Arguments for opacity: - More accurate predictions - Gaming prevention (can’t manipulate score if don’t know how it works) - Trade secrets

Arguments for transparency: - Right to explanation (GDPR) - Ability to challenge decisions - Public oversight and accountability - Trust

Current approaches:

LIME (Local Interpretable Model-Agnostic Explanations): - Approximate black-box model locally with simple model - “For this specific case, decision was based on…”

SHAP (Shapley Additive Explanations): - Use game theory to assign importance to features - “Feature X contributed +0.3 to prediction”

Attention visualization: For neural networks, show what parts of input the model focused on.

Limitations: Explanations are post-hoc. Don’t guarantee the model makes sense globally.

21.6.3 13.6.3 Privacy vs. Utility

The fundamental tradeoff: More data and less privacy → better algorithms. But at what cost?

21.6.3.1 Surveillance Capitalism

Business model: 1. Collect data on user behavior 2. Train algorithms to predict behavior 3. Sell predictions to advertisers 4. Use algorithms to manipulate behavior (maximize engagement)

Concerns: - Filter bubbles: Algorithms show you content you’ll engage with, creating echo chambers - Addiction: Algorithms optimized for engagement, not well-being - Manipulation: Political microtargeting, radicalization - Surveillance: Everything tracked, profiled, monetized

21.6.3.2 Case Study: Cambridge Analytica

What happened (2018 disclosure): - Harvested data from 87 million Facebook users - Built psychological profiles - Microtargeted political ads in 2016 elections - Attempted to manipulate voter behavior

The algorithm: Psychographic modeling - Five-factor personality model (OCEAN) - Predict personality from Facebook likes - Target messages based on personality

Questions: - Was this illegal? (Debatable) - Was it effective? (Disputed) - Should microtargeting be allowed? - What regulations are needed?

Responses: - GDPR (EU): Strict data protection, consent requirements - CCPA (California): Consumer data rights - Proposed federal regulations (U.S.)

21.6.4 13.6.4 Autonomous Weapons

The prospect: Weapons that select and engage targets without human intervention.

Current state: - Military drones (human in loop) - Autonomous defensive systems (ship/base protection) - Research into fully autonomous systems

The trolley problem, militarized:

Scenario: Autonomous drone identifies target in civilian area. Estimates: - 90% chance of eliminating high-value target - 10% chance of civilian casualties

Should it engage?

Arguments against: - Lack of human judgment - Risk of accidents (misidentification) - Lowering threshold for using force - Arms race concerns - Violation of human dignity (killed by algorithm)

Arguments for: - Potentially more discriminate than human soldiers - Faster reaction time (defensive systems) - Protects own soldiers - Enemies will develop anyway

Current policy: - UN discussing regulation - Many AI researchers oppose autonomous weapons - Some nations committed to keeping “human in loop” - No international treaty (yet)

21.6.5 13.6.5 Algorithmic Justice

The reality: Algorithms are increasingly used in criminal justice.

Applications: - Predictive policing (where to patrol) - Risk assessment (bail, sentencing, parole) - Facial recognition (identifying suspects) - Gang databases (often algorithmic)

21.6.5.1 Predictive Policing

The algorithm: Predict where crime likely to occur - Input: Historical crime data - Output: “hotspots” for patrol

Problem: Historical data reflects biased policing - More patrols in minority neighborhoods → more arrests → algorithm predicts more crime in those areas → more patrols (feedback loop)

Studies: - Lum & Isaac (2016): Showed predictive policing amplifies bias - Algorithmic bias compounds over time

Real impact: - Oakland Police discontinued use (2018) - LAPD scaled back program (2020)

21.6.5.2 Risk Assessment

COMPAS scores: Predict recidivism risk (1-10 scale)

ProPublica investigation (2016): - False positive rate (predicted to reoffend, didn’t): 45% for Black defendants, 23% for white defendants - False negative rate (predicted not to reoffend, did): 28% for Black defendants, 48% for white defendants

Northpointe response: Algorithm is calibrated - Among defendants scored 7, recidivism rate is similar across races - Both perspectives are mathematically correct (impossibility theorem!)

Policy questions: - Should risk assessment be used at all? - If used, which fairness criterion matters? - Should it be open source for auditing? - What role for human judgment?

21.6.6 13.6.6 The Path Forward

What can we do?

For researchers: - Publish datasets and code for reproducibility - Report failures, not just successes - Consider societal impact, not just technical novelty - Engage with ethicists, policymakers, affected communities

For practitioners: - Algorithmic impact assessments - Diverse teams (not just demographics, but perspectives) - Regular audits for bias - Clear documentation of limitations - Channels for feedback and recourse

For regulators: - Right to explanation for consequential decisions - Auditing requirements for high-risk applications - Liability frameworks for algorithmic harm - Funding for algorithmic accountability research

For individuals: - Data literacy: understand what algorithms can/can’t do - Advocate for transparency and accountability - Support ethical AI organizations - Vote for representatives who prioritize these issues

The goal: Harness the power of algorithms while protecting human rights, dignity, and autonomy.

21.7 13.7 Reading and Analyzing Research Papers

Want to contribute to algorithmic research? Start by reading papers. Here’s how.

21.7.1 13.7.1 Anatomy of a Research Paper

Typical structure:

Abstract: 150-300 words summarizing contribution
- What to look for: Main result, key innovation, performance improvement
Introduction: Motivation and context
- What to look for: What problem are they solving? Why does it matter? What’s new?
Related Work: Comparison to prior work
- What to look for: How does this improve on previous approaches? What gap does it fill?
Technical Content: The meat of the paper
- Algorithm description: Precise steps
- Theoretical analysis: Correctness proofs, complexity bounds
- Experimental evaluation: Benchmarks, comparisons
Results: What they achieved
- What to look for: Quantitative improvements, limitations, when it works well/poorly
Conclusion: Summary and future work
- What to look for: Open problems, potential applications

21.7.2 13.7.2 How to Read a Paper (Three-Pass Method)

First pass (5-10 minutes): - Read title, abstract, introduction, conclusion - Skim section headings - Goal: What is this paper about? Is it relevant to me?

Second pass (1 hour): - Read carefully, but skip proofs - Look at figures, tables, graphs - Note key contributions and techniques - Goal: Understand the main ideas and results

Third pass (several hours): - Read everything in detail - Work through proofs and derivations - Try to reproduce key results - Think critically: What assumptions? What limitations? What’s missing? - Goal: Deep understanding, ability to critique and extend

21.7.3 13.7.3 Critical Reading Questions

For algorithms: - Is the algorithm clearly described? Could you implement it? - Is the complexity analysis tight? Are there hidden constants? - What assumptions are made? Do they hold in practice? - Are there cases where the algorithm fails or performs poorly?

For experiments: - Are benchmarks realistic? Representative? - Is comparison fair? (Same hardware, fair baselines?) - Are error bars / confidence intervals provided? - Can results be reproduced? (Code/data available?)

For theory: - Are proofs rigorous? Any gaps? - Are bounds tight? Lower bounds provided? - Do theorems match experimental results? - What about constants hidden by big-O notation?

21.7.4 13.7.4 Where to Find Papers

Major venues:

Theory: - FOCS (Foundations of Computer Science) - STOC (Symposium on Theory of Computing) - SODA (Algorithms and Discrete Algorithms)

Machine Learning: - NeurIPS (Neural Information Processing Systems) - ICML (International Conference on Machine Learning) - ICLR (International Conference on Learning Representations)

Databases/Systems: - SIGMOD (Management of Data) - VLDB (Very Large Databases) - OSDI (Operating Systems Design and Implementation)

Archives: - arXiv.org: Preprints (not peer-reviewed, but most recent) - Google Scholar: Search engine for papers - Semantic Scholar: AI-powered paper search

Recommendation: Start with survey papers and tutorial articles, then dive into specific papers.

21.8 13.8 Chapter Project: Research Paper Analysis

Let’s put it all together by analyzing a real research paper.

21.8.1 13.8.1 Project Description

Choose a recent algorithmic research paper (published in the last 5 years) and perform a comprehensive analysis:

Summary: Summarize the paper in your own words (1-2 pages)
- What problem does it solve?
- What is the key innovation?
- What are the main results?
Technical Deep Dive: Explain the algorithm in detail
- Provide pseudocode
- Explain time/space complexity
- Describe key proof techniques
Implementation: Implement the algorithm
- Test on example inputs
- Compare with baseline approaches
- Reproduce key experimental results
Critical Analysis:
- What are the strengths?
- What are the limitations?
- What assumptions might not hold?
- Where might the algorithm fail?
Extensions: Propose improvements or variations
- Can you extend to related problems?
- Can you improve worst-case or average-case performance?
- Can you simplify the algorithm?
Impact Assessment: Consider broader implications
- What are potential applications?
- Are there ethical concerns?
- What future research does this enable?

21.8.2 13.8.2 Example Paper Choices

Learning-Augmented Algorithms: - Lykouris & Vassilvitskii (2018): “Competitive Caching with Machine Learned Advice”

Differential Privacy: - Dwork et al. (2014): “The Algorithmic Foundations of Differential Privacy”

Graph Algorithms: - Cohen et al. (2017): “Sketching and Streaming Algorithms for Analyzing Massive Graphs”

Quantum Algorithms: - Harrow et al. (2009): “Quantum Algorithm for Linear Systems of Equations” (HHL)

ML/Deep Learning: - Vaswani et al. (2017): “Attention is All You Need” (Transformers) - He et al. (2015): “Deep Residual Learning for Image Recognition” (ResNet)

Fairness: - Hardt et al. (2016): “Equality of Opportunity in Supervised Learning”

21.8.3 13.8.3 Analysis Template

# Paper Analysis: [Title]

## 1. Citation
[Full citation in standard format]

## 2. One-Sentence Summary
[What is the single most important contribution?]

## 3. Problem Statement
- **What problem does this paper address?**
- **Why is this problem important?**
- **What makes this problem challenging?**

## 4. Prior Work
- **What did previous approaches do?**
- **What were their limitations?**
- **What gap does this paper fill?**

## 5. Key Innovation
- **What is the main new idea?**
- **What makes this approach better?**

## 6. Algorithm Description
- **High-level overview**
- **Detailed pseudocode**
- **Key subroutines**
- **Data structures used**

## 7. Theoretical Analysis
- **Time complexity**: [with derivation]
- **Space complexity**: [with derivation]
- **Correctness proof**: [sketch]
- **Optimality**: [lower bounds, if provided]

## 8. Experimental Evaluation
- **Datasets used**
- **Baselines compared against**
- **Key results** [with numbers]
- **Where it works well / poorly**

## 9. Implementation
[Your implementation with code]

## 10. Reproduction
- **Were you able to reproduce results?**
- **Any discrepancies?**
- **Insights from implementation**

## 11. Critical Analysis
### Strengths
- [What does this paper do well?]

### Limitations
- [What are the weaknesses?]

### Assumptions
- [What assumptions are made? Are they realistic?]

## 12. Extensions
- **Possible improvements**
- **Related problems this could solve**
- **Open questions**

## 13. Broader Impact
- **Applications**
- **Ethical considerations**
- **Future research directions**

## 14. Your Assessment
- **Would you recommend this paper? Why?**
- **What did you learn?**
- **How might you build on this work?**

21.9 13.9 Summary: Algorithms Shaping the Future

We’ve journeyed through the cutting edge of algorithmic research and seen how algorithms are transforming our world:

Current research trends: - Beyond worst-case analysis: algorithms for real-world data - Quantum algorithms: the coming revolution - Learning-augmented algorithms: ML meets classical CS - Differential privacy: computing on sensitive data - Algorithmic fairness: eliminating bias

AI and ML: - Deep learning: backpropagation and SGD - Transformers: attention revolutionizing everything - Reinforcement learning: algorithms that learn by doing

Big Data: - MapReduce and Spark: distributed computing at scale - Streaming algorithms: processing infinite data - Graph processing: analyzing networks with billions of edges

Cryptography: - Public-key cryptography securing the internet - Quantum threat to current systems - Blockchain and cryptocurrencies - Zero-knowledge proofs: proving without revealing

Ethics: - Accountability for algorithmic decisions - Transparency vs. performance tradeoffs - Privacy vs. utility - Algorithmic justice

The future is algorithmic. The problems we’ll solve, the technologies we’ll build, and the challenges we’ll face will all be shaped by the algorithms we design.

Your role: You now have the foundation to understand, analyze, and contribute to this future. The algorithms you’ve learned in this book are the building blocks. What you build with them is up to you.

21.10 13.10 Exercises

21.10.1 Understanding

Smoothed Analysis: Explain why sorted input (worst-case for quicksort) is fragile under perturbation.
Quantum Advantage: Why do quantum computers provide exponential speedup for factoring but not for sorting?
Fairness Impossibility: Prove that you can’t simultaneously achieve calibration and equal opportunity with different base rates.

21.10.2 Analysis

Paper Reading: Choose a paper from STOC/FOCS/SODA 2023. Apply the three-pass method. Write a 5-page analysis.
Algorithm Comparison: Compare Count-Min Sketch vs. exact counting. For what error rates does Count-Min Sketch use less space?
Privacy-Utility Tradeoff: For Census data with differential privacy (ε=1), calculate expected error in population count.

21.10.3 Implementation

Learning-Augmented Cache: Implement LRU and learning-augmented caching. Generate realistic workload with patterns. Compare hit rates.
Streaming Distinct Count: Implement HyperLogLog. Test on stream of web requests. Compare space usage vs. exact hash set.
Fair Classifier: Take a biased dataset (COMPAS or equivalent). Train fair classifier using different fairness definitions. Compare accuracy-fairness tradeoffs.

21.10.4 Research

Extend an Algorithm: Choose a streaming algorithm. Propose and implement an improvement for a specific use case.
Fairness Metrics: Design a new fairness metric for recommendation systems. Prove it’s achievable (or show it conflicts with existing metrics).
Literature Survey: Survey recent papers (last 3 years) on one topic from this chapter. Identify trends and open problems.

21.11 13.11 Further Reading

21.11.1 Books

Algorithms: - Mitzenmacher & Upfal: “Probability and Computing” (randomized algorithms) - Roughgarden: “Twenty Lectures on Algorithmic Game Theory”

Machine Learning: - Goodfellow, Bengio, Courville: “Deep Learning” - Sutton & Barto: “Reinforcement Learning: An Introduction”

Cryptography: - Katz & Lindell: “Introduction to Modern Cryptography” - Boneh & Shoup: “A Graduate Course in Applied Cryptography”

Ethics: - O’Neil: “Weapons of Math Destruction” - Noble: “Algorithms of Oppression” - Eubanks: “Automating Inequality”

21.11.2 Papers (Foundational)

Algorithms: - Spielman & Teng (2001): “Smoothed Analysis of Algorithms” - Muthukrishnan (2005): “Data Streams: Algorithms and Applications”

Machine Learning: - Vaswani et al. (2017): “Attention is All You Need” - Goodfellow et al. (2014): “Generative Adversarial Networks”

Fairness: - Dwork et al. (2012): “Fairness Through Awareness” - Hardt et al. (2016): “Equality of Opportunity in Supervised Learning”

21.11.3 Online Resources

arXiv.org: Latest research preprints
Papers With Code: Papers + implementations
Distill.pub: Clear ML explanations
CACM Research Highlights: Accessible explanations

You’ve completed your journey through advanced algorithms! From ancient algorithmic ideas to the cutting edge of quantum computing and AI, you now understand the foundations of computer science and the algorithms shaping our future.

The next chapter is yours to write.

What will you build? What problems will you solve? What algorithms will you invent?

The future of computing awaits. Go make it happen.