14 Chapter 6: Randomized Algorithms - The Power of Controlled Chaos
14.1 When Dice Make Better Decisions
“God does not play dice with the universe.” - Einstein
“But randomized algorithms do, and they win.” - Computer Scientists
14.2 Introduction: Embracing Uncertainty for Certainty
Imagine you’re at a party with 30 people. What are the odds that two people share the same birthday?
Your intuition might say it’s unlikely—after all, there are 365 days in a year. But mathematics says otherwise: the probability is over 70%! This counterintuitive result, known as the Birthday Paradox, illustrates a fundamental principle of randomized algorithms: probability often defies intuition, and we can exploit this to our advantage.
14.2.1 The Paradox of Random Success
Consider this seemingly impossible scenario:
- You need to check if two files are identical
- The files are on different continents (network latency is huge)
- The files are massive (terabytes)
Deterministic approach: Send entire file across network—takes hours, costs fortune.
Randomized approach:
1. Pick 100 random positions
2. Compare bytes at those positions
3. If all match, declare “probably identical” with 99.999…% confidence
4. Takes seconds, costs pennies!

(That confidence level assumes the files, if different, differ in many positions; in practice, comparing random fingerprints or hashes of the files gives strong guarantees even for a single-byte difference.)
This is the magic of randomized algorithms: trading absolute certainty for near-certainty with massive efficiency gains.
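The idea can be sketched in a few lines of Python. This is a toy version, with in-memory byte strings standing in for remote files, and `probably_identical` is a name invented here for illustration:

```python
import random

def probably_identical(file_a: bytes, file_b: bytes, samples: int = 100) -> bool:
    """Monte Carlo equality check: compare bytes at random positions.

    Returns False only on a definite mismatch; True means "probably identical".
    (Sizes are compared first -- that part is a cheap exact test.)
    """
    if len(file_a) != len(file_b):
        return False
    for _ in range(samples):
        i = random.randrange(len(file_a))
        if file_a[i] != file_b[i]:
            return False  # Definitely different
    return True  # Probably identical

# Identical data always passes; heavily different data almost never does
a = bytes(1000)        # 1000 zero bytes
b = bytes([1]) * 1000  # 1000 one bytes
print(probably_identical(a, a))  # True
print(probably_identical(a, b))  # False
```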
14.2.2 Why Randomness?
Randomized algorithms offer unique advantages:
- Simplicity: Often much simpler than deterministic alternatives
- Speed: Expected running time frequently beats worst-case deterministic
- Robustness: No pathological inputs (adversary can’t predict random choices)
- Impossibility Breaking: Solve problems with no deterministic solution
- Load Balancing: Natural distribution of work
- Symmetry Breaking: Resolve ties and deadlocks elegantly
14.2.3 Real-World Impact
Randomized algorithms power critical systems:
Internet Security:
- RSA Encryption: Randomized primality testing
- TLS/SSL: Random nonces prevent replay attacks
- Password Hashing: Random salts defeat rainbow tables

Big Data:
- MinHash: Find similar documents in billions
- HyperLogLog: Count distinct elements in streams
- Bloom Filters: Space-efficient membership testing

Machine Learning:
- Stochastic Gradient Descent: Random sampling speeds training
- Random Forests: Random feature selection improves accuracy
- Monte Carlo Tree Search: Game-playing AI (AlphaGo)

Distributed Systems:
- Consistent Hashing: Random node placement
- Gossip Protocols: Random peer selection
- Byzantine Consensus: Random leader election
14.2.4 Chapter Roadmap
We’ll master the art and science of randomized algorithms:
- Section 6.1: Fundamentals - Las Vegas vs Monte Carlo algorithms
- Section 6.2: Randomized QuickSort and selection algorithms
- Section 6.3: Probabilistic analysis and concentration inequalities
- Section 6.4: Hash functions and fingerprinting techniques
- Section 6.5: Advanced algorithms - MinCut, primality testing
- Section 6.6: Streaming algorithms and sketching
- Section 6.7: Project - Comprehensive randomized algorithm library
14.3 Section 6.1: Fundamentals of Randomized Algorithms
14.3.1 Understanding Randomness in Computing
Before we dive into specific algorithms, let’s understand what we mean by “randomized algorithms” and why adding randomness—seemingly making things less predictable—actually makes algorithms better.
14.3.1.1 A Simple Example: Finding Your Friend in a Crowd
Imagine you’re looking for your friend in a massive stadium with 50,000 people. You have two strategies:
Strategy 1 (Deterministic): Start at Section A, Row 1, Seat 1. Check every seat in order.
- Worst case: Your friend is in the last seat—you check all 50,000 seats!
- Problem: If someone knew your strategy, they could always put your friend in the worst spot.

Strategy 2 (Randomized): Pick random sections and rows to check.
- Expected case: On average, you’ll find them after checking half the seats (25,000).
- Key insight: No one can force a worst case—every arrangement is equally likely to be good or bad!
This is the power of randomization: it eliminates predictable worst cases.
14.3.2 Two Flavors of Randomized Algorithms
Randomized algorithms come in two main types, named after famous gambling cities (appropriately enough!):
14.3.2.1 Las Vegas Algorithms: “Always Right, Sometimes Slow”
The Guarantee: These algorithms ALWAYS give you the correct answer, but the time they take is random.
Real-Life Analogy: Think of shuffling a deck of cards to find all the aces. You’ll always find all four aces eventually (correctness guaranteed), but sometimes you’ll get lucky and find them quickly, other times it takes longer.
Characteristics:
- ✅ Output is always correct
- ⏱️ Running time varies (we analyze expected/average time)
- 🔍 Can verify the answer is correct
- 📊 No error probability—only time varies
Example - Finding a Restaurant: You’re in a new city looking for a good restaurant. You randomly walk around until you find one with good reviews. You’ll definitely find one (correct), but it might take 5 minutes or 50 minutes (random time).
14.3.2.2 Monte Carlo Algorithms: “Always Fast, Usually Right”
The Guarantee: These algorithms ALWAYS finish quickly, but might occasionally give a wrong answer.
Real-Life Analogy: A medical test that takes exactly 5 minutes. It correctly identifies illness 99% of the time, but has a 1% false positive/negative rate. The test always takes 5 minutes (fixed time), but might be wrong (small error probability).
Characteristics:
- ⏱️ Running time is fixed and predictable
- ⚠️ Might give wrong answer (with small probability)
- ❓ Cannot always verify if answer is correct
- 📊 Has bounded error probability
Example - Opinion Polling: Instead of asking all 300 million Americans their opinion (correct but slow), you ask 1,000 random people (fast but might be slightly wrong). The poll takes exactly one day (fixed time) but carries a ±3 percentage-point margin of error (a small chance of being noticeably off).
14.3.3 Why Do We Accept Uncertainty?
You might wonder: “Why would I want an algorithm that might be wrong?” Here’s why:
Massive Speed Improvements: A Monte Carlo algorithm might run in 1 second with 99.9999% accuracy, while a deterministic algorithm takes 1 hour for 100% accuracy.
Good Enough is Perfect: If a medical test is 99.99% accurate, is it worth waiting 10x longer for 100%?
We Can Boost Accuracy: Run the algorithm multiple times! If a one-sided test errs with probability 1%, ten independent runs together err with probability only 0.01^10 = 10^-20.
Real World is Uncertain: Your computer already has random hardware failures (cosmic rays flip bits!). If hardware has a 10^-15 error rate, why demand 0% error from algorithms?
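The accuracy-boosting claim is easy to see in simulation. Here is a toy sketch (the `flaky_test` function is invented for illustration): a one-sided test that errs 10% of the time, repeated k times:

```python
import random

def flaky_test() -> bool:
    """A one-sided Monte Carlo test: the correct answer is True,
    but it wrongly reports False 10% of the time."""
    return random.random() >= 0.10

def amplified(k: int) -> bool:
    """Accept iff any of k independent runs says True.
    For one-sided error, a single True answer is trustworthy."""
    return any(flaky_test() for _ in range(k))

trials = 50_000
for k in (1, 2, 4):
    errors = sum(not amplified(k) for _ in range(trials))
    print(f"k={k}: observed error rate {errors / trials:.4%}")
# The error rate should fall roughly as 0.1**k: ~10%, ~1%, ~0.01%
```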
14.4 Section 6.2: Randomized QuickSort - Learning from Card Shuffling
14.4.1 The Problem with Regular QuickSort
Before we see how randomization helps, let’s understand the problem it solves.
14.4.1.1 Regular QuickSort: The Predictable Approach
Imagine you’re organizing a deck of 52 cards by number. Regular QuickSort works like this:
- Pick the first card as the “pivot” (say it’s a 7)
- Divide into two piles:
- Left pile: All cards less than 7
- Right pile: All cards greater than 7
- Recursively sort each pile
- Combine: Left pile + 7 + Right pile = Sorted!
The Fatal Flaw: What if the cards are already sorted?
- First pivot: Ace (1) → Left pile: empty, Right pile: 51 cards
- Next pivot: 2 → Left pile: empty, Right pile: 50 cards
- And so on…
We get the most unbalanced splits possible! This is like trying to balance a see-saw with all the kids on one side.
Time complexity: O(n²) - absolutely terrible for large datasets!
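To see the difference concretely, here is a small counting experiment (not from the text) comparing a first-element pivot against a random pivot on already-sorted input:

```python
import random
import sys

def quicksort_comparisons(arr, random_pivot):
    """Count partition work (elements compared against the pivot)
    for quicksort with either a first-element or a random pivot."""
    if len(arr) <= 1:
        return 0
    pivot = arr[random.randrange(len(arr))] if random_pivot else arr[0]
    left = [x for x in arr if x < pivot]
    right = [x for x in arr if x > pivot]
    # len(arr) units of work at this level, plus the two recursive calls
    return len(arr) + quicksort_comparisons(left, random_pivot) \
                    + quicksort_comparisons(right, random_pivot)

sys.setrecursionlimit(10_000)  # The deterministic case recurses n deep
sorted_input = list(range(2_000))
print("first-element pivot:", quicksort_comparisons(sorted_input, False))
print("random pivot:       ", quicksort_comparisons(sorted_input, True))
# First-element pivot does ~n²/2 comparisons here; random pivot ~2n ln n
```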
14.4.2 Enter Randomized QuickSort: The Magic of Random Pivots
The solution is beautifully simple: pick a random card as the pivot!
14.4.2.1 Why Random Pivots Save the Day
Let’s understand this with an analogy:
Scenario: You’re dividing 100 students into two groups for a game.
Bad approach (deterministic): Always pick the shortest student as the divider.
- If students line up by height (worst case), you get groups of 0 and 99!

Good approach (randomized): Pick a random student as the divider.
- Sometimes you get 20 vs 80 (not great)
- Sometimes you get 45 vs 55 (pretty good!)
- Sometimes you get 50 vs 50 (perfect!)
- On average: You get reasonably balanced groups

The Mathematical Magic:
- Probability of picking a “good” pivot (between 25th and 75th percentile): 50%
- With good pivots, we get balanced splits
- Expected number of times we split: O(log n)
- Total expected work: O(n log n) - MUCH better!
14.4.2.2 Step-by-Step Example
Let’s sort the array [3, 7, 1, 9, 2, 5] using randomized QuickSort:
Step 1: Pick a random pivot
- Randomly choose position 3 → pivot = 9
- Partition: [3, 7, 1, 2, 5] | 9 | [ ]
- Left has 5 elements, right has 0 (not great, but okay)

Step 2: Recursively sort the left side [3, 7, 1, 2, 5]
- Random pivot: position 1 → pivot = 7
- Partition: [3, 1, 2, 5] | 7 | [ ]

Step 3: Sort [3, 1, 2, 5]
- Random pivot: position 0 → pivot = 3
- Partition: [1, 2] | 3 | [5]
- Nice balanced split!

Step 4: Sort [1, 2]
- Random pivot: 2
- Partition: [1] | 2 | [ ]
Final result: [1, 2, 3, 5, 7, 9]
Notice how even with one bad split (step 1), we still got good overall performance because other splits were balanced!
14.4.2.3 The Implementation
Now that we understand WHY it works, here’s the code:
import random

def randomized_quicksort(arr):
    """
    Las Vegas algorithm: Always sorts correctly.
    Expected O(n log n), worst case O(n²) but rare.
    """
    if len(arr) <= 1:
        return arr
    # The KEY INNOVATION: Pick a random pivot instead of first/last element
    pivot = arr[random.randint(0, len(arr) - 1)]
    # Partition around pivot (same as regular QuickSort)
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]  # Handle duplicates
    right = [x for x in arr if x > pivot]
    # Recursively sort and combine
    return randomized_quicksort(left) + middle + randomized_quicksort(right)

14.4.2.4 Why This Simple Change Is So Powerful
Mathematical Insight:
- With n elements, there are n possible pivots
- A “good” pivot lands between the 25th and 75th percentile
- Probability of a good pivot = 50% (half the elements are good!)
- Expected depth of recursion ≈ 2 log₂ n (since we get good pivots half the time)
- Total expected comparisons ≈ 2n ln n ≈ 1.39n log₂ n

Practical Impact: Sorting 1 million items:
- Worst case (deterministic): ~1 trillion comparisons
- Expected (randomized): ~20 million comparisons
- That’s 50,000 times faster!
14.4.3 Monte Carlo Algorithms: The Speed-Accuracy Tradeoff
Now let’s explore Monte Carlo algorithms, which trade a tiny bit of accuracy for massive speed gains.
14.4.3.1 Primality Testing: Is This Number Prime?
The Challenge: Checking if a huge number (say, 100 digits) is prime.
Naive Approach: Try dividing by all numbers up to √n
- For a 100-digit number, that’s ~10^50 divisions
- Would take longer than the age of the universe!

Monte Carlo Solution: Miller-Rabin Test
- Takes only ~1000 operations
- Might incorrectly say a composite number is prime
- BUT: Error probability < 0.0000000001%
- Good enough for cryptography!
14.4.3.2 How Miller-Rabin Works (Intuitive Explanation)
Think of it like a “prime number detector” test:
The Fermat Test Foundation:
- If n is prime, then for any a not divisible by n: a^(n-1) ≡ 1 (mod n)
- This is like saying: “Prime numbers have a special mathematical fingerprint”
The Problem: Some composite numbers (liars) also pass this test!
The Miller-Rabin Improvement:
- Uses a more sophisticated test that catches most liars
- Tests multiple random values
- Each test catches at least 75% of liars
- After k tests, probability of being fooled ≤ (1/4)^k
Analogy: It’s like having a counterfeit bill detector:
- One test might miss 25% of fakes
- Two tests miss only 6.25% of fakes
- Ten tests miss only ~0.0001% of fakes
- Good enough for practical use!
Let’s implement this powerful algorithm:
import random

def miller_rabin_primality(n, k=10):
    """
    Monte Carlo algorithm: Tests if n is prime.
    Error probability ≤ (1/4)^k
    With k=10: Error ≤ 0.0001%
    Args:
        n: Number to test
        k: Number of rounds (higher = more accurate)
    Returns:
        False if definitely composite
        True if probably prime
    """
    # Handle simple cases
    if n < 2:
        return False
    if n == 2 or n == 3:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False  # Even numbers (except 2) aren't prime
    # Express n-1 as 2^r * d (where d is odd)
    # This is the mathematical setup for the test
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    # Run k rounds of testing
    for _ in range(k):
        # Pick a random "witness" number
        a = random.randrange(2, n - 1)
        # Compute a^d mod n
        x = pow(a, d, n)
        # Check if this witness proves n is composite
        if x == 1 or x == n - 1:
            continue  # This witness doesn't prove anything
        # Square x repeatedly (r-1 times)
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break  # This witness doesn't prove composite
        else:
            # If we never got n-1, then n is definitely composite
            return False
    # Passed all tests - probably prime!
    return True

14.4.5 Understanding Probability in Randomized Algorithms
Before we go further, let’s understand key probability concepts using everyday examples.
14.4.5.1 Expected Value: Your “Average” Outcome
Concept: Expected value is what you’d get “on average” if you repeated something many times.
Real-World Example - Rolling a Die:
- Possible outcomes: 1, 2, 3, 4, 5, 6
- Each has probability 1/6
- Expected value = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6) = 3.5
You can’t actually roll 3.5, but if you roll many times and average, you’ll get close to 3.5!
Algorithm Example - Searching:
- Linear search in array of n elements
- Best case: 1 comparison (element is first)
- Worst case: n comparisons (element is last)
- Expected case: (n+1)/2 comparisons (on average, halfway through)
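The expected-case claim is easy to check empirically; here is a small simulation (not from the text) that searches for a uniformly random target many times and averages the comparison counts:

```python
import random

def linear_search_comparisons(arr, target):
    """Return the number of comparisons linear search makes."""
    for i, x in enumerate(arr):
        if x == target:
            return i + 1
    return len(arr)

# Search for a uniformly random element many times and average
n = 101
arr = list(range(n))
trials = 50_000
avg = sum(linear_search_comparisons(arr, random.randrange(n))
          for _ in range(trials)) / trials
print(f"average comparisons: {avg:.1f}, predicted (n+1)/2 = {(n + 1) / 2}")
# The average should come out close to 51
```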
14.4.5.2 The Birthday Paradox: When Intuition Fails
This famous paradox shows why we need math, not intuition, for probabilities.
The Question: In a room of 23 people, what’s the probability that two share a birthday?
Intuitive (Wrong) Answer: “23 out of 365 days ≈ 6%? Very unlikely!”

Actual (Surprising) Answer:
- Probability > 50%!
- With 50 people: > 97%
- With 100 people: > 99.99999%

Why Our Intuition Is Wrong: We think about one person matching another specific person. But we should think about ANY pair matching!
- With 23 people, there are 253 possible pairs
- Each pair has a small chance of matching
- But with 253 chances, it adds up quickly!
Algorithm Application - Hash Collisions: This same principle explains why hash tables have collisions sooner than expected!
Let’s see this in action:
import random

def birthday_paradox_simulation(n_people=23, n_days=365, trials=10000):
    """
    Simulate the birthday paradox to verify probability.
    This demonstrates how randomized events can be analyzed.
    """
    collisions = 0
    for _ in range(trials):
        birthdays = []
        for _ in range(n_people):
            birthday = random.randint(1, n_days)
            if birthday in birthdays:
                collisions += 1
                break  # Found a match!
            birthdays.append(birthday)
    probability = collisions / trials
    # Exact theoretical value: 1 - (365/365)(364/365)...((365-n+1)/365)
    p_unique = 1.0
    for i in range(n_people):
        p_unique *= (n_days - i) / n_days
    print(f"With {n_people} people:")
    print(f"Simulated probability of shared birthday: {probability:.2%}")
    print(f"Theoretical probability: {1 - p_unique:.2%}")
    return probability

# Try it out!
# birthday_paradox_simulation(23)  # Should be close to 50%
# birthday_paradox_simulation(50)  # Should be close to 97%

14.4.5.3 Amplification: Making Algorithms More Reliable
The Problem: Your Monte Carlo algorithm is 90% accurate. Not good enough?
The Solution: Run it multiple times and vote!
Analogy - Medical Testing:
- One test: 90% accurate
- Two independent tests agreeing: ~99% accurate
- Three independent tests agreeing: ~99.9% accurate

How It Works Mathematically: If the error probability is p < 1/2 and we run k independent trials, taking the majority vote:
- Error probability ≤ (4p(1−p))^(k/2)
- For small p, this shrinks roughly like p^(k/2)
Example:
import math
from collections import Counter

def amplify_accuracy(monte_carlo_func, input_data, desired_accuracy=0.99):
    """
    Boost accuracy by running algorithm multiple times.
    This is like getting multiple medical opinions!
    """
    # Calculate how many runs we need
    single_accuracy = 0.9  # Assume 90% accurate
    single_error = 1 - single_accuracy
    # To get 99% accuracy, we need error < 0.01
    # With majority voting: error ≈ single_error^(k/2)
    # So: 0.01 ≥ 0.1^(k/2)
    # Therefore: k ≥ 2 * log(0.01) / log(0.1) ≈ 4
    k = math.ceil(2 * math.log(1 - desired_accuracy) / math.log(single_error))
    # Run k times and take majority
    results = []
    for _ in range(k):
        results.append(monte_carlo_func(input_data))
    # Return most common result
    most_common = Counter(results).most_common(1)[0][0]
    return most_common

14.5 Section 6.3: Randomized Selection - Finding Needles in Haystacks
14.5.1 The Selection Problem
Goal: Find the kth smallest element in an unsorted array.
Examples:
- Find the median (k = n/2)
- Find the 90th percentile (k = 0.9n)
- Find the third smallest (k = 3)
14.5.2 The Naive Approach
Method 1: Sort then Select
- Sort the entire array: O(n log n)
- Return element at position k: O(1)
- Total: O(n log n)
Problem: We’re doing too much work! We sort everything when we only need one element.
Analogy: It’s like organizing your entire bookshelf alphabetically just to find the 10th book. Wasteful!
14.5.3 QuickSelect: The Randomized Solution
Key Insight: We can use QuickSort’s partitioning idea but only recurse on ONE side!
14.5.3.1 How QuickSelect Works
Imagine finding the 30th tallest person in a group of 100:
1. Pick a random person as “pivot” (say they’re 5’10”)
2. Divide into two groups:
   - Shorter than 5’10”: 45 people
   - Taller than 5’10”: 54 people
3. Determine which group to search:
   - We want the 30th tallest
   - There are 54 people taller than the pivot
   - So the 30th tallest is in the “taller” group
   - It’s the 30th person in that group
4. Recursively search just that group
   - We’ve eliminated 46 people from consideration!
5. Keep going until we find our target
14.5.3.2 Why It’s Fast
- Each partition cuts the problem roughly in half (on average)
- We only recurse on one side (unlike QuickSort which does both)
- Expected comparisons: n + n/2 + n/4 + … = 2n = O(n)
That’s linear time! Much better than O(n log n) for sorting.
14.5.3.3 The Implementation
import random

def quickselect(arr, k):
    """
    Find the kth smallest element (0-indexed).
    Las Vegas algorithm: Always returns correct answer.
    Expected time: O(n)
    Worst case: O(n²) but very rare with random pivots
    Example:
        arr = [3, 7, 1, 9, 2, 5]
        quickselect(arr, 2) returns 3 (the 3rd smallest element)
    """
    if len(arr) == 1:
        return arr[0]
    # Random pivot is the key!
    pivot = arr[random.randint(0, len(arr) - 1)]
    # Partition into three groups
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    larger = [x for x in arr if x > pivot]
    # Determine which group contains our target
    if k < len(smaller):
        # kth smallest is in the 'smaller' group
        return quickselect(smaller, k)
    elif k < len(smaller) + len(equal):
        # kth smallest is the pivot
        return pivot
    else:
        # kth smallest is in the 'larger' group
        # Adjust k to be relative to the larger group
        return quickselect(larger, k - len(smaller) - len(equal))

14.5.3.4 Practical Applications of QuickSelect
1. Finding Medians in Data Analysis
   - Dataset: Customer purchase amounts
   - Need: Find median purchase (not affected by billionaire outliers)
   - QuickSelect: O(n) vs Sorting: O(n log n)
2. Percentile Calculations
   - Finding 95th percentile response time in web servers
   - Identifying top 10% performers without sorting everyone
3. Statistical Sampling
   - Quickly finding quartiles for box plots
   - Real-time analytics on streaming data
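As a usage sketch for the first application, here is a median helper built on quickselect (restated compactly so the snippet runs on its own; the `median` wrapper and sample data are invented for illustration):

```python
import random

def quickselect(arr, k):
    """kth smallest (0-indexed), expected O(n) -- as in the section above."""
    pivot = arr[random.randrange(len(arr))]
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    if k < len(smaller):
        return quickselect(smaller, k)
    if k < len(smaller) + len(equal):
        return pivot
    return quickselect([x for x in arr if x > pivot],
                       k - len(smaller) - len(equal))

def median(data):
    """Median without sorting: one or two quickselect calls."""
    n = len(data)
    if n % 2 == 1:
        return quickselect(data, n // 2)
    return (quickselect(data, n // 2 - 1) + quickselect(data, n // 2)) / 2

purchases = [12.5, 8.0, 99999.0, 15.0, 9.5]  # one billionaire-sized outlier
print(median(purchases))  # 12.5 -- unaffected by the outlier
```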
14.6 Section 6.4: Hash Functions and Randomization
14.6.1 Understanding Hash Tables First
Before diving into universal hashing, let’s understand why randomization helps with hash tables.
14.6.1.1 The Hash Table Dream
Imagine you’re building a library catalog system:
- Goal: Find any book instantly by its ISBN
- Naive approach: Check every book (slow!)
- Array approach: Use ISBN as array index (wastes massive space!)
- Hash table approach: Transform ISBN into small array index
14.6.1.2 The Problem: Collisions
Scenario: Two different books map to the same shelf location!
This is like two different people having the same locker combination. What do we do?
Deterministic Problem: If an attacker knows your hash function, they can deliberately cause collisions:
- Send 1000 items that all hash to the same bucket
- Your O(1) lookup becomes O(n) - disaster!
14.6.2 Universal Hashing: Randomization to the Rescue
The Solution: Pick a random hash function from a family of functions!
Analogy:
- Instead of always using the same locker assignment rule
- Randomly choose from 100 different assignment rules
- Attacker can’t predict which rule you’ll use
- Can’t deliberately cause collisions!
14.6.2.1 What Makes a Hash Family “Universal”?
A family of hash functions is universal if:
- For any two different keys x and y
- The probability (over the random choice of h) that h(x) = h(y) is ≤ 1/m
- Where m is the table size
In Simple Terms: The chance of any two items colliding is as small as if we assigned them random locations!
14.6.2.2 The Carter-Wegman Construction
One elegant universal hash family:
h(x) = ((ax + b) mod p) mod m
Where:
- p is a prime number larger than your universe of keys
- a is randomly chosen from {1, 2, …, p-1}
- b is randomly chosen from {0, 1, …, p-1}
- m is your table size
Let’s implement this:
import random

class UniversalHashTable:
    """
    Hash table using universal hashing for guaranteed performance.
    This prevents adversarial attacks on hash table performance!
    """
    def __init__(self, initial_size=16):
        self.size = self._next_prime(initial_size)
        self.prime = self._next_prime(2**32)  # Large prime
        # Randomly select hash function parameters
        self.a = random.randint(1, self.prime - 1)
        self.b = random.randint(0, self.prime - 1)
        # Initialize empty buckets
        self.buckets = [[] for _ in range(self.size)]
        self.num_items = 0
        print(f"Selected hash function: h(x) = (({self.a}*x + {self.b}) mod {self.prime}) mod {self.size}")

    def _next_prime(self, n):
        """Find the next prime number >= n."""
        def is_prime(num):
            if num < 2:
                return False
            for i in range(2, int(num**0.5) + 1):
                if num % i == 0:
                    return False
            return True
        while not is_prime(n):
            n += 1
        return n

    def _hash(self, key):
        """
        Universal hash function.
        Probability of collision for any two keys ≤ 1/size
        """
        # Convert key to integer if needed
        if isinstance(key, str):
            key = sum(ord(c) * (31**i) for i, c in enumerate(key))
        # Apply universal hash function
        return ((self.a * key + self.b) % self.prime) % self.size

    def insert(self, key, value):
        """Insert key-value pair."""
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        # Check if key already exists
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # Update
                return
        # Add new key-value pair
        bucket.append((key, value))
        self.num_items += 1
        # Resize if load factor too high
        if self.num_items > self.size * 0.75:
            self._resize()

    def get(self, key):
        """Retrieve value for key."""
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(f"Key '{key}' not found")

    def _resize(self):
        """
        Resize table and rehash with new random function.
        This maintains the universal hashing guarantee!
        """
        old_buckets = self.buckets
        # Double size and pick new hash function
        self.size = self._next_prime(self.size * 2)
        self.a = random.randint(1, self.prime - 1)
        self.b = random.randint(0, self.prime - 1)
        self.buckets = [[] for _ in range(self.size)]
        self.num_items = 0
        # Rehash all items
        for bucket in old_buckets:
            for key, value in bucket:
                self.insert(key, value)

14.6.3 Bloom Filters: Space-Efficient Membership Testing
14.6.3.1 The Problem
You’re building a web crawler that shouldn’t visit the same URL twice:
- Billions of URLs to track
- Storing all URLs in a set would take terabytes of RAM
- Need a space-efficient solution
14.6.3.2 The Bloom Filter Solution
Trade-off: Use WAY less space, but accept small false positive rate.
How it works:
1. Create a bit array (like a row of light switches)
2. Use multiple hash functions
3. To add an item: Turn on bits at positions given by the hash functions
4. To check an item: See if all corresponding bits are on

The Catch:
- Can have false positives (say item is present when it’s not)
- NEVER has false negatives (if it says item is absent, it definitely is)

Real-World Analogy: It’s like a bouncer with a partial guest list:
- If your name’s not on the list, you’re definitely not invited (no false negatives)
- If your name IS on the list, you’re probably invited (small chance of error)
14.6.3.3 Understanding Bloom Filter Parameters
The math behind optimal Bloom filter parameters:
- m = number of bits
- n = expected number of items
- k = number of hash functions
- p = false positive probability

Optimal formulas:
- Bits needed: m = -n × ln(p) / (ln(2)²)
- Hash functions: k = (m/n) × ln(2)

Example: To track 1 million URLs with a 1% false positive rate:
- Need only 9.6 bits per item = 1.2 MB total!
- Compare to storing actual URLs: ~50 bytes per URL = 50 MB
- That’s 40× space savings!
Let’s implement it:
import math
import random
import hashlib

class BloomFilter:
    """
    Space-efficient probabilistic data structure for membership testing.
    Use case: "Have I seen this before?" when storing everything is too expensive.
    """
    def __init__(self, expected_items=1000000, false_positive_rate=0.01):
        """
        Initialize Bloom filter with optimal parameters.
        Args:
            expected_items: How many items you expect to add
            false_positive_rate: Acceptable error rate (e.g., 0.01 = 1%)
        """
        # Calculate optimal size and number of hash functions
        self.m = self._optimal_bit_size(expected_items, false_positive_rate)
        self.k = self._optimal_hash_count(self.m, expected_items)
        # Initialize bit array (using list of booleans for clarity)
        self.bits = [False] * self.m
        self.items_added = 0
        print(f"Bloom Filter initialized:")
        print(f"  Expected items: {expected_items:,}")
        print(f"  False positive rate: {false_positive_rate:.1%}")
        print(f"  Bits needed: {self.m:,} ({self.m/8/1024:.1f} KB)")
        print(f"  Hash functions: {self.k}")
        print(f"  Bits per item: {self.m/expected_items:.1f}")

    def _optimal_bit_size(self, n, p):
        """
        Calculate optimal number of bits.
        Formula: m = -n × ln(p) / (ln(2)²)
        """
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m, n):
        """
        Calculate optimal number of hash functions.
        Formula: k = (m/n) × ln(2)
        """
        return max(1, int(m / n * math.log(2)))

    def _hash(self, item, seed):
        """
        Generate hash with different seed for each hash function.
        In practice, you'd use MurmurHash or similar.
        """
        # Simple hash for demonstration
        data = f"{item}{seed}".encode('utf-8')
        hash_hex = hashlib.md5(data).hexdigest()
        return int(hash_hex, 16) % self.m

    def add(self, item):
        """
        Add item to the filter.
        Sets k bits to True.
        """
        for i in range(self.k):
            bit_index = self._hash(item, i)
            self.bits[bit_index] = True
        self.items_added += 1

    def might_contain(self, item):
        """
        Check if item might be in the set.
        Returns:
            True: Item MIGHT be in the set (or false positive)
            False: Item is DEFINITELY NOT in the set
        """
        for i in range(self.k):
            bit_index = self._hash(item, i)
            if not self.bits[bit_index]:
                return False  # Definitely not in set
        return True  # Might be in set

    def current_false_positive_rate(self):
        """
        Calculate current false positive probability.
        Formula: (1 - e^(-kn/m))^k
        """
        if self.items_added == 0:
            return 0
        # Probability that a bit is still 0
        prob_zero = math.exp(-self.k * self.items_added / self.m)
        # Probability of false positive
        return (1 - prob_zero) ** self.k

# Example usage showing space efficiency
def bloom_filter_demo():
    """Demonstrate Bloom filter efficiency."""
    # Track 10 million URLs with 0.1% false positive rate
    bloom = BloomFilter(expected_items=10_000_000, false_positive_rate=0.001)
    # Add some URLs
    urls_visited = [
        "https://example.com",
        "https://google.com",
        "https://github.com"
    ]
    for url in urls_visited:
        bloom.add(url)
    # Check membership
    test_urls = [
        "https://example.com",   # Should be found
        "https://facebook.com"   # Should not be found
    ]
    for url in test_urls:
        if bloom.might_contain(url):
            print(f"✓ {url} might have been visited")
        else:
            print(f"✗ {url} definitely not visited")
    # Compare space usage at full capacity (10 million URLs)
    actual_storage = 10_000_000 * 50  # ~50 bytes per URL
    bloom_storage = bloom.m / 8  # bits to bytes
    print(f"\nSpace comparison for 10,000,000 URLs:")
    print(f"  Storing actual URLs: {actual_storage:,} bytes")
    print(f"  Bloom filter: {bloom_storage:,.0f} bytes")
    print(f"  Space saved: {(1 - bloom_storage/actual_storage)*100:.1f}%")

14.7 Section 6.5: Streaming Algorithms - Processing Infinite Data
14.7.1 The Streaming Challenge
Imagine you’re monitoring Twitter:
- 500 million tweets per day
- Can’t store everything in memory
- Need real-time statistics
- Data arrives continuously
The Constraint: You can only make ONE PASS through the data!
14.7.2 Reservoir Sampling: Fair Sampling from Streams
14.7.2.1 The Problem
You want to maintain a random sample of tweets, but you don’t know how many total tweets there will be!
14.7.2.2 The Solution: Reservoir Sampling
Analogy: Imagine you’re at a parade and want to photograph 10 random floats:
- You don’t know how many floats there will be
- You can only keep 10 photos in your camera
- You want each float to have an equal chance of being photographed

The Algorithm:
1. Keep the first k items in your “reservoir”
2. For item n (where n > k):
   - With probability k/n, include it
   - If including, randomly replace one existing item
3. Magic: This gives uniform probability to all items!
14.7.2.3 Why It Works (Intuitive Explanation)
For any item to be in the final sample:
- It needs to be selected when it arrives
- It needs to survive all future replacements

Math Magic:
- Probability item i is in the final sample = k/N (where N is the total number of items)
- This is exactly uniform sampling!
Let’s implement it:
import random

class ReservoirSampler:
    """
    Maintain a uniform random sample from a stream of unknown size.
    Applications:
    - Sampling tweets for sentiment analysis
    - Random sampling from database queries
    - A/B testing with streaming data
    """
    def __init__(self, sample_size=100):
        """
        Initialize reservoir.
        Args:
            sample_size: Number of items to maintain in sample
        """
        self.k = sample_size
        self.reservoir = []
        self.items_seen = 0

    def add(self, item):
        """
        Process new item from stream.
        Maintains uniform probability for all items seen so far.
        """
        self.items_seen += 1
        if len(self.reservoir) < self.k:
            # Haven't filled reservoir yet
            self.reservoir.append(item)
        else:
            # Randomly decide whether to include this item
            # Probability = k / items_seen
            j = random.randint(1, self.items_seen)
            if j <= self.k:
                # Include this item, replace random existing item
                replace_index = random.randint(0, self.k - 1)
                self.reservoir[replace_index] = item

    def get_sample(self):
        """Get current random sample."""
        return self.reservoir.copy()

    def sample_probability(self):
        """
        Probability that any specific item is in the sample.
        Should be k/n for uniform sampling.
        """
        if self.items_seen == 0:
            return 0
        return min(1.0, self.k / self.items_seen)

# Demonstration
def reservoir_sampling_demo():
    """
    Demonstrate that reservoir sampling is truly uniform.
    """
    sampler = ReservoirSampler(sample_size=10)
    # Stream of 1000 items
    stream = range(1000)
    for item in stream:
        sampler.add(item)
    sample = sampler.get_sample()
    print(f"Random sample of 10 from 1000 items: {sorted(sample)}")
    print(f"Each item had probability {sampler.sample_probability():.1%} of being selected")
    # Verify uniformity with multiple runs
    counts = {}
    for _ in range(10000):
        sampler = ReservoirSampler(sample_size=1)
        for item in range(10):
            sampler.add(item)
        selected = sampler.get_sample()[0]
        counts[selected] = counts.get(selected, 0) + 1
    print("\nUniformity test (should be ~1000 each):")
    for item in sorted(counts.keys()):
        print(f"  Item {item}: {counts[item]} times")

14.7.3 Count-Min Sketch: Frequency Estimation
14.7.3.1 The Problem
Count how many times each hashtag appears on Twitter:
- Millions of different hashtags
- Can’t maintain a counter for each
- Need approximate counts
14.7.3.2 The Solution: Count-Min Sketch
Intuition:

- Use multiple small hash tables instead of one huge one
- Each hashtag increments one counter in each table
- Take the minimum across tables (reduces overestimation from collisions)
Why “Count-Min”?

- We COUNT in multiple tables
- Take the MINimum to reduce error from hash collisions
import math
import random

class CountMinSketch:
    """
    Estimate frequencies in data streams with limited memory.

    Guarantees: Estimate ≥ True Count (never underestimates)
                Estimate ≤ True Count + ε×N with probability 1-δ
    """
    def __init__(self, epsilon=0.01, delta=0.01):
        """
        Initialize Count-Min Sketch.

        Args:
            epsilon: Error bound (e.g., 0.01 = within 1% of stream size)
            delta: Failure probability (e.g., 0.01 = 99% confidence)
        """
        # Calculate dimensions
        self.width = int(math.ceil(math.e / epsilon))
        self.depth = int(math.ceil(math.log(1 / delta)))

        # Initialize counter tables
        self.tables = [[0] * self.width for _ in range(self.depth)]

        # Random hash functions (simplified - use better hashes in production)
        self.hash_params = []
        for _ in range(self.depth):
            a = random.randint(1, 2**31 - 1)
            b = random.randint(0, 2**31 - 1)
            self.hash_params.append((a, b))

        print("Count-Min Sketch initialized:")
        print(f"  Width: {self.width} (controls accuracy)")
        print(f"  Depth: {self.depth} (controls confidence)")
        print(f"  Total memory: {self.width * self.depth} counters")

    def _hash(self, item, table_index):
        """Hash item for specific table."""
        a, b = self.hash_params[table_index]
        # Convert item to integer
        if isinstance(item, str):
            item_hash = hash(item)
        else:
            item_hash = item
        # Universal hash function
        return ((a * item_hash + b) % (2**31 - 1)) % self.width

    def add(self, item, count=1):
        """Add occurrences of item."""
        for i in range(self.depth):
            j = self._hash(item, i)
            self.tables[i][j] += count

    def estimate(self, item):
        """
        Estimate count for item.
        Returns minimum across all tables (reduces overestimation).
        """
        estimates = []
        for i in range(self.depth):
            j = self._hash(item, i)
            estimates.append(self.tables[i][j])
        return min(estimates)

# Example: Tracking word frequencies in text stream
def count_min_demo():
    """
    Demonstrate Count-Min Sketch for word frequency.
    """
    sketch = CountMinSketch(epsilon=0.001, delta=0.01)

    # Simulate text stream
    text_stream = """
    the quick brown fox jumps over the lazy dog
    the fox was quick and the dog was lazy
    """ * 100  # Repeat for volume
    words = text_stream.lower().split()

    # Add words to sketch
    for word in words:
        sketch.add(word)

    # Check frequencies
    test_words = ["the", "fox", "dog", "cat"]
    print("\nWord frequency estimates:")
    for word in test_words:
        true_count = words.count(word)
        estimate = sketch.estimate(word)
        error = estimate - true_count
        print(f"  '{word}': true={true_count}, estimate={estimate}, error={error}")

14.7.4 HyperLogLog: Counting Unique Elements
14.7.4.1 The Problem
Count unique users visiting your website:

- Billions of visits
- The same users visit multiple times
- Can’t store all user IDs
14.7.4.2 The HyperLogLog Magic
Key Insight:

- In random bit strings, rare patterns indicate large sets
- Like inferring crowd size from the rarest jersey number you see

Intuition:

- If you flip coins, getting 10 heads in a row is rare
- If you see this pattern, you probably flipped LOTS of coins
- HyperLogLog uses this principle with hash functions

Amazing Properties:

- Count billions of unique items
- Using only ~16KB of memory!
- Error rate ~2%
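To make the coin-flip intuition concrete, here is a minimal HyperLogLog sketch. This is an illustrative toy, not the fully bias-corrected production variant: the choice of 2^b registers, the SHA-1-based hash, and the small-range linear-counting fallback are all standard but simplified design choices for the demo.

import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not production-grade)."""
    def __init__(self, b=10):
        self.b = b                      # use m = 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
        idx = h & (self.m - 1)          # low b bits pick a register
        w = h >> self.b                 # remaining 64-b bits
        # Position of the leftmost 1-bit in w ("how many heads in a row")
        rank = (64 - self.b) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Bias-correction constant (approximation, valid for m >= 128)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m * z
        # Small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

With b=10 (1024 one-byte registers, ~1 KB), adding 100,000 distinct user IDs yields an estimate within a few percent of the truth, even though duplicates cost nothing.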
This algorithm is so elegant and powerful that it’s used by Redis, Google BigQuery, and many other systems!
14.9 Section 6.6: Analyzing Randomized Algorithms
14.9.1 Understanding Performance Through Probability
When we analyze randomized algorithms, we can’t just say “it takes X steps” because X is now random! Instead, we need to understand the probability distribution of running times.
14.9.1.1 Expected Value: The Average Case
Definition: If an algorithm has different possible running times, the expected value is the average, weighted by probability.
Example - Finding an Item in Random Position:
Array has 4 slots: [_, _, _, _]
Item could be in position: 1, 2, 3, or 4 (each with probability 1/4)
Comparisons needed: 1, 2, 3, or 4
Expected comparisons = 1×(1/4) + 2×(1/4) + 3×(1/4) + 4×(1/4) = 2.5
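A quick simulation confirms the 2.5 figure. This is a throwaway sketch; the 4-slot array, the fixed seed, and the trial count are arbitrary choices for the demo.

import random

random.seed(0)

def comparisons_to_find(arr, target):
    """Linear search; return how many comparisons were needed."""
    for i, x in enumerate(arr, start=1):
        if x == target:
            return i

trials = 100_000
total = 0
for _ in range(trials):
    arr = [0, 0, 0, 0]
    arr[random.randrange(4)] = 1   # item lands in a uniformly random slot
    total += comparisons_to_find(arr, 1)

print(f"Average comparisons: {total / trials:.3f}")  # close to 2.5

The empirical average converges to the weighted sum 1×(1/4) + 2×(1/4) + 3×(1/4) + 4×(1/4) = 2.5.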
14.9.1.2 High Probability Bounds
“Expected” performance is nice, but what if we get unlucky? We want stronger guarantees!
High Probability Statement: “The algorithm finishes in O(n log n) time with probability ≥ 1 - 1/n²”
What this means:

- For n = 1000: fails less than once in a million runs
- For n = 1,000,000: fails less than once in a trillion runs
- As input grows, failure becomes astronomically unlikely!
14.9.2 Concentration Inequalities: Why Randomized Algorithms Don’t Get Unlucky
These mathematical tools prove that random events cluster around their expected values.
14.9.2.1 Markov’s Inequality: The Weakest Bound
Statement: For non-negative random variable X: P(X ≥ k × E[X]) ≤ 1/k
In Plain English: “The probability of being k times worse than expected is at most 1/k”
Example: If QuickSort expects 100 comparisons:

- P(≥ 1000 comparisons) ≤ 1/10 = 10%
- P(≥ 10000 comparisons) ≤ 1/100 = 1%
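Markov’s inequality is easy to check empirically. The sketch below uses a geometric random variable (fair-coin flips until the first heads, expected value 2) as an arbitrary illustrative example, and compares the bound with the actual tail:

import random

random.seed(1)

def flips_until_heads():
    """Fair-coin flips until the first heads; E[X] = 2."""
    n = 1
    while random.random() >= 0.5:
        n += 1
    return n

samples = [flips_until_heads() for _ in range(100_000)]
mean = sum(samples) / len(samples)

for k in (2, 5, 10):
    tail = sum(1 for x in samples if x >= k * mean) / len(samples)
    print(f"P(X >= {k}*E[X]) = {tail:.4f}   (Markov bound: {1/k:.2f})")

Notice how loose the bound is: the actual tails decay geometrically, far faster than 1/k. That looseness is exactly why we want the stronger Chernoff bounds below.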
14.9.2.2 Chernoff Bounds: Much Stronger Guarantees
For sums of independent random events, we get exponentially decreasing failure probability!
Example - Coin Flips: Flip 1000 fair coins. Expected heads = 500.

- Probability of ≥ 600 heads ≈ 10⁻¹⁰
- Probability of ≥ 700 heads ≈ 10⁻³⁶ (essentially impossible!)
This is why randomized algorithms are reliable despite using randomness!
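These coin-flip tails can be computed exactly and compared against a Chernoff bound. The sketch below uses one common form of the bound, P(X ≥ (1+δ)μ) ≤ exp(−δ²μ/3) for 0 < δ ≤ 1; other forms exist with slightly different constants.

import math

def binom_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def chernoff_bound(n, p, k):
    """Chernoff bound P(X >= (1+d)*np) <= exp(-d^2 * np / 3), for 0 < d <= 1."""
    mu = n * p
    d = k / mu - 1
    return math.exp(-d * d * mu / 3)

n, p = 1000, 0.5
for k in (550, 600, 700):
    print(f"P(heads >= {k}): exact = {binom_tail(n, p, k):.2e}, "
          f"Chernoff bound <= {chernoff_bound(n, p, k):.2e}")

The exact tail always sits below the bound, and both shrink exponentially as the threshold moves away from the mean: that exponential collapse is what makes randomized algorithms dependable.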
Let’s see these principles in action:
import random
import math

class RandomizedAnalysis:
    """
    Tools for analyzing and demonstrating randomized algorithm behavior.
    """
    @staticmethod
    def analyze_quicksort_concentration(n=1000, trials=1000):
        """
        Demonstrate that QuickSort concentrates around expected time.
        """
        def count_comparisons_quicksort(arr):
            """Count comparisons in randomized QuickSort."""
            if len(arr) <= 1:
                return 0
            pivot = arr[random.randint(0, len(arr) - 1)]
            comparisons = len(arr) - 1  # Compare all elements to pivot
            left = [x for x in arr if x < pivot]
            right = [x for x in arr if x > pivot]
            # Recursively count comparisons
            return (comparisons
                    + count_comparisons_quicksort(left)
                    + count_comparisons_quicksort(right))

        # Run many trials
        comparison_counts = []
        for _ in range(trials):
            arr = list(range(n))
            random.shuffle(arr)
            comparison_counts.append(count_comparisons_quicksort(arr))

        # Calculate statistics
        expected = 2 * n * math.log(n)  # Theoretical expectation (~2n ln n)
        actual_mean = sum(comparison_counts) / trials

        # Count how many runs land far from expected
        far_from_expected = sum(1 for c in comparison_counts
                                if abs(c - expected) > 0.5 * expected)

        print(f"QuickSort Analysis (n={n}, trials={trials}):")
        print(f"  Theoretical expected: {expected:.0f}")
        print(f"  Actual average: {actual_mean:.0f}")
        print(f"  Min comparisons: {min(comparison_counts)}")
        print(f"  Max comparisons: {max(comparison_counts)}")
        print(f"  Runs >50% from expected: {far_from_expected}/{trials} = {far_from_expected/trials:.1%}")
        print("  Conclusion: QuickSort strongly concentrates around expected value!")
        return comparison_counts

    @staticmethod
    def demonstrate_amplification():
        """
        Show how repetition reduces error probability.
        """
        def unreliable_prime_test(n):
            """
            Fake primality test that's right 75% of the time.
            (For demonstration only!)
            """
            if n < 2:
                return False
            if n == 2:
                return True
            true_answer = all(n % i != 0 for i in range(2, min(int(n**0.5) + 1, 100)))
            # Simulate 75% accuracy
            if random.random() < 0.75:
                return true_answer
            else:
                return not true_answer  # Wrong answer

        def amplified_prime_test(n, k=10):
            """Run test k times and take majority vote."""
            votes = [unreliable_prime_test(n) for _ in range(k)]
            return sum(votes) > k // 2

        # Test on known primes and composites
        test_numbers = [17, 18, 19, 20, 23, 24, 29, 30]
        true_answers = [True, False, True, False, True, False, True, False]

        # Single test accuracy
        single_correct = 0
        for num, truth in zip(test_numbers, true_answers):
            single_correct += sum(unreliable_prime_test(num) == truth
                                  for _ in range(1000))

        # Amplified test accuracy
        amplified_correct = 0
        for num, truth in zip(test_numbers, true_answers):
            amplified_correct += sum(amplified_prime_test(num, k=10) == truth
                                     for _ in range(1000))

        print("\nError Amplification Demo:")
        print(f"  Single test accuracy: {single_correct/8000:.1%}")
        print(f"  Amplified (10 runs) accuracy: {amplified_correct/8000:.1%}")
        print(f"  Improvement: {(amplified_correct-single_correct)/single_correct:.1%}")

# Run the demonstrations
analyzer = RandomizedAnalysis()
# analyzer.analyze_quicksort_concentration()
# analyzer.demonstrate_amplification()

14.10 Section 6.7: Advanced Randomized Algorithms
14.10.1 Karger’s Min-Cut Algorithm: Finding Bottlenecks
14.10.1.1 The Problem
Find the minimum cut in a network - the smallest number of connections that, if removed, would split the network into two parts.
Real-World Applications:

- Finding the weakest points in internet infrastructure
- Identifying community boundaries in social networks
- Circuit design and reliability analysis
14.10.1.2 The Elegant Randomized Solution
Karger’s Algorithm:

1. Pick a random edge
2. “Contract” it (merge the two endpoints into one node)
3. Repeat until only 2 nodes remain
4. Count the edges between them

Why It Works (Intuition):

- Min-cut edges are “rare” (there are few of them)
- A random edge is unlikely to be in the min-cut
- If we avoid min-cut edges, we find the min-cut!

Success Probability:

- Single run: ≥ 2/n² (seems small!)
- But run n² log n times: success probability > 1 - 1/n
- Multiple runs find the true min-cut with high probability
Let’s implement this beautiful algorithm:
import random
import copy
import math

class KargerMinCut:
    """
    Randomized algorithm for finding minimum cut in a graph.
    Simple, elegant, and probabilistically correct!
    """
    def __init__(self, graph):
        """
        Initialize with graph represented as adjacency list.

        Example graph:
        {
            'A': ['B', 'C', 'D'],
            'B': ['A', 'C'],
            'C': ['A', 'B', 'D'],
            'D': ['A', 'C']
        }
        """
        self.original_graph = graph

    def contract_edge(self, graph, u, v):
        """
        Contract edge (u, v) - merge v into u.

        This is like combining two cities into a metropolis!
        All of v's connections become u's connections.
        """
        for neighbor in graph[v]:
            # Add v's edges to u
            if neighbor != u:  # Skip self-loops
                graph[u].append(neighbor)
            # Update the neighbor's connections
            graph[neighbor] = [u if x == v else x for x in graph[neighbor]]
        # Remove v from graph
        del graph[v]
        # Remove any self-loops created
        graph[u] = [x for x in graph[u] if x != u]

    def single_min_cut_trial(self):
        """
        One trial of Karger's algorithm.
        Returns the cut size found (might not be minimum!).
        """
        # Make a copy since we'll modify the graph
        graph = copy.deepcopy(self.original_graph)
        vertices = list(graph.keys())

        # Contract down to 2 nodes
        while len(vertices) > 2:
            # Pick random edge
            u = random.choice(vertices)
            if not graph[u]:  # No edges from u
                vertices.remove(u)
                continue
            v = random.choice(graph[u])
            # Contract this edge
            self.contract_edge(graph, u, v)
            vertices.remove(v)

        # Count edges between final two nodes
        if len(graph) == 2:
            remaining = list(graph.keys())
            return len(graph[remaining[0]])
        return float('inf')

    def find_min_cut(self, confidence=0.99):
        """
        Find minimum cut with high probability.

        Args:
            confidence: Desired probability of success (e.g., 0.99 = 99%)
        Returns:
            Minimum cut size found
        """
        n = len(self.original_graph)

        # Calculate trials needed for desired confidence:
        #   Probability of success in one trial ≥ 2/(n(n-1))
        #   Probability of failure in k trials ≤ (1 - 2/(n(n-1)))^k
        #   We want this ≤ 1 - confidence
        single_success_prob = 2 / (n * (n - 1))
        trials_needed = int(math.ceil(math.log(1 - confidence)
                                      / math.log(1 - single_success_prob)))
        print(f"Running {trials_needed} trials for {confidence:.1%} confidence...")

        # Run multiple trials
        min_cut = float('inf')
        for trial in range(trials_needed):
            cut_size = self.single_min_cut_trial()
            if cut_size < min_cut:
                min_cut = cut_size
                print(f"  Trial {trial+1}: Found cut of size {cut_size}")
        return min_cut

# Example usage
def karger_demo():
    """
    Demonstrate Karger's algorithm on a simple graph.
    """
    # Create a simple graph with known min-cut
    graph = {
        'A': ['B', 'C', 'D'],
        'B': ['A', 'C', 'E'],
        'C': ['A', 'B', 'D', 'E'],
        'D': ['A', 'C', 'E', 'F'],
        'E': ['B', 'C', 'D', 'F'],
        'F': ['D', 'E']
    }

    print("Graph structure:")
    print("  A---B")
    print("  |\\  |\\")
    print("  | \\ | E")
    print("  |  \\|/|")
    print("  C---D-F")
    print("\nThe min-cut is 2 (cut edges D-F and E-F to separate F)")

    karger = KargerMinCut(graph)
    min_cut = karger.find_min_cut(confidence=0.99)
    print(f"\nKarger's algorithm found min-cut: {min_cut}")

14.10.2 Monte Carlo Integration: Using Randomness for Math
14.10.2.1 The Problem
Calculate the area under a complex curve or the value of π.
Traditional Approach: Complex calculus, might be impossible for some functions!
Monte Carlo Approach: Throw random darts and count how many land under the curve!
14.10.2.2 Estimating π with Random Points
The Setup:

- Square from -1 to 1 (area = 4)
- Circle of radius 1 inside (area = π)
- Ratio of circle to square = π/4

The Algorithm:

1. Throw random points in the square
2. Count how many land in the circle
3. π ≈ 4 × (points in circle) / (total points)
import random
import math

def estimate_pi(num_samples=1000000):
    """
    Estimate π using Monte Carlo simulation.

    This is like throwing darts at a circular dartboard
    inside a square frame!
    """
    inside_circle = 0
    for _ in range(num_samples):
        # Random point in square [-1, 1] × [-1, 1]
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # Check if point is inside unit circle
        if x*x + y*y <= 1:
            inside_circle += 1

    # Estimate π
    pi_estimate = 4 * inside_circle / num_samples

    # Calculate statistics
    actual_pi = math.pi
    error = abs(pi_estimate - actual_pi)
    relative_error = error / actual_pi * 100

    print("Monte Carlo π Estimation:")
    print(f"  Samples: {num_samples:,}")
    print(f"  Points in circle: {inside_circle:,}")
    print(f"  Estimate: {pi_estimate:.6f}")
    print(f"  Actual π: {actual_pi:.6f}")
    print(f"  Error: {error:.6f} ({relative_error:.3f}%)")

    # Calculate theoretical standard error
    p = math.pi / 4  # True probability of landing in the circle
    std_error = 4 * math.sqrt(p * (1 - p) / num_samples)
    print(f"  Theoretical std error: {std_error:.6f}")
    return pi_estimate

# Try with different sample sizes to see convergence
def monte_carlo_convergence_demo():
    """Show how the estimate improves with more samples."""
    sample_sizes = [100, 1000, 10000, 100000, 1000000]
    print("\nMonte Carlo Convergence:")
    for n in sample_sizes:
        # Inline estimate (skips the detailed printout of estimate_pi)
        inside = sum(random.uniform(-1, 1)**2 + random.uniform(-1, 1)**2 <= 1
                     for _ in range(n))
        estimate = 4 * inside / n
        error = abs(estimate - math.pi)
        print(f"  n={n:>8,}: π≈{estimate:.4f}, error={error:.4f}")

14.11 Section 6.8: Real-World Applications
14.11.1 Load Balancing with Randomization
14.11.1.1 The Problem
You have 1000 servers and millions of requests. How do you distribute load fairly?
Deterministic Approach: Round-robin, but if some requests are heavier, servers become imbalanced!
Randomized Solution: Power of Two Choices
- Pick TWO random servers
- Send request to the less loaded one
- This simple change dramatically improves balance!
Why It Works:

- Random choice alone: maximum load ≈ log n / log log n
- Power of two choices: maximum load ≈ log log n (MUCH better!)
import random
import math

class LoadBalancer:
    """
    Demonstrate different load balancing strategies.
    """
    def __init__(self, num_servers=100):
        self.num_servers = num_servers
        self.loads = [0] * num_servers

    def random_assignment(self, num_requests=10000):
        """Pure random assignment."""
        self.loads = [0] * self.num_servers
        for _ in range(num_requests):
            server = random.randint(0, self.num_servers - 1)
            self.loads[server] += 1
        return max(self.loads)

    def power_of_two_choices(self, num_requests=10000):
        """Pick two random servers, use the less loaded one."""
        self.loads = [0] * self.num_servers
        for _ in range(num_requests):
            # Pick two random servers
            server1 = random.randint(0, self.num_servers - 1)
            server2 = random.randint(0, self.num_servers - 1)
            # Choose the less loaded one
            if self.loads[server1] <= self.loads[server2]:
                self.loads[server1] += 1
            else:
                self.loads[server2] += 1
        return max(self.loads)

    def compare_strategies(self):
        """Compare load balancing strategies."""
        random_max = self.random_assignment()
        two_choices_max = self.power_of_two_choices()
        print(f"\nLoad Balancing Comparison ({self.num_servers} servers, 10000 requests):")
        print("  Random assignment:")
        print(f"    Max load: {random_max}")
        print(f"    Expected: ~{10000/self.num_servers + 3*math.sqrt(10000/self.num_servers):.0f}")
        print("  Power of two choices:")
        print(f"    Max load: {two_choices_max}")
        print(f"    Improvement: {random_max/two_choices_max:.1f}x better!")

# Demo
balancer = LoadBalancer(100)
balancer.compare_strategies()

14.11.2 A/B Testing with Statistical Significance
In web development and product design, randomized experiments help make data-driven decisions.
import random
import math

class ABTest:
    """
    Simple A/B testing framework with statistical significance.
    """
    def __init__(self, name="Experiment"):
        self.name = name
        self.group_a = {'visitors': 0, 'conversions': 0}
        self.group_b = {'visitors': 0, 'conversions': 0}

    def assign_visitor(self):
        """Randomly assign visitor to group A or B."""
        if random.random() < 0.5:
            self.group_a['visitors'] += 1
            return 'A'
        else:
            self.group_b['visitors'] += 1
            return 'B'

    def record_conversion(self, group):
        """Record a conversion for the specified group."""
        if group == 'A':
            self.group_a['conversions'] += 1
        else:
            self.group_b['conversions'] += 1

    def analyze_results(self):
        """
        Calculate statistical significance of results,
        using the normal approximation to the binomial.
        """
        # Calculate conversion rates
        rate_a = self.group_a['conversions'] / max(self.group_a['visitors'], 1)
        rate_b = self.group_b['conversions'] / max(self.group_b['visitors'], 1)
        n_a = self.group_a['visitors']
        n_b = self.group_b['visitors']

        if n_a < 100 or n_b < 100:
            print(f"Need more data! (A: {n_a} visitors, B: {n_b} visitors)")
            return

        # Pooled conversion rate for variance calculation
        pooled_rate = (self.group_a['conversions'] + self.group_b['conversions']) / (n_a + n_b)
        # Standard error
        se = math.sqrt(pooled_rate * (1 - pooled_rate) * (1/n_a + 1/n_b))
        # Z-score
        z = (rate_b - rate_a) / se if se > 0 else 0
        # P-value (two-tailed test), via the normal CDF approximation
        p_value = 2 * (1 - self._normal_cdf(abs(z)))

        print(f"\nA/B Test Results: {self.name}")
        print(f"  Group A: {rate_a:.2%} conversion ({self.group_a['conversions']}/{n_a})")
        print(f"  Group B: {rate_b:.2%} conversion ({self.group_b['conversions']}/{n_b})")
        print(f"  Relative improvement: {((rate_b/rate_a - 1) * 100):.1f}%")
        print(f"  Z-score: {z:.3f}")
        print(f"  P-value: {p_value:.4f}")
        if p_value < 0.05:
            print("  Result: STATISTICALLY SIGNIFICANT! (p < 0.05)")
            winner = 'B' if rate_b > rate_a else 'A'
            print(f"  Winner: Group {winner}")
        else:
            print("  Result: Not statistically significant (need more data)")

    def _normal_cdf(self, z):
        """Approximate normal CDF using the error function."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Demo A/B test
def ab_test_demo():
    """
    Simulate an A/B test with different conversion rates.
    """
    test = ABTest("Button Color Test")

    # Simulate visitors:
    #   Group A (blue button): 10% conversion
    #   Group B (green button): 12% conversion (20% better!)
    for _ in range(5000):
        group = test.assign_visitor()
        # Simulate conversion based on group
        if group == 'A':
            if random.random() < 0.10:  # 10% conversion
                test.record_conversion('A')
        else:
            if random.random() < 0.12:  # 12% conversion
                test.record_conversion('B')

    test.analyze_results()

# Run the demo
# ab_test_demo()

14.12 Summary: The Power of Controlled Randomness
14.12.1 Key Takeaways
- Randomization Eliminates Worst Cases
  - No adversary can force bad performance
  - Every input becomes “average case”
- Two Types of Randomized Algorithms
  - Las Vegas: always correct, random running time
  - Monte Carlo: fixed running time, probably correct
- Probability Concentrates
  - Random events cluster around their expectations
  - Bad luck is exponentially unlikely
  - Multiple runs boost confidence
- Simplicity and Elegance
  - Randomized algorithms are often simpler
  - Easier to implement and understand
  - Natural parallelization
- Real-World Impact
  - Used in databases, networks, and security
  - Powers big data analytics
  - Essential for modern computing
14.12.2 When to Use Randomization
✅ Use When:

- Worst-case is much worse than average
- You need a simple, practical solution
- You’re dealing with massive data
- You want to prevent adversarial inputs
- A small error probability is acceptable

❌ Avoid When:

- Absolute correctness is required
- You need reproducible results
- Random number generation is limited
- Hard real-time guarantees are essential
14.12.3 Final Thought
“In the face of complexity, randomness is often our best strategy.”
Randomized algorithms show us that embracing uncertainty can lead to more certain outcomes. By giving up a tiny bit of determinism, we gain massive improvements in simplicity, speed, and robustness.
Next chapter, we’ll explore the limits of computation itself with NP-Completeness!