14 Chapter 6: Randomized Algorithms - The Power of Controlled Chaos
14.1 When Dice Make Better Decisions
“God does not play dice with the universe.” - Einstein
“But randomized algorithms do, and they win.” - Computer Scientists
14.2 Introduction: Embracing Uncertainty for Certainty
Imagine you’re at a party with 30 people. What are the odds that two people share the same birthday?
Your intuition might say it’s unlikely—after all, there are 365 days in a year. But mathematics says otherwise: the probability is over 70%! This counterintuitive result, known as the Birthday Paradox, illustrates a fundamental principle of randomized algorithms: probability often defies intuition, and we can exploit this to our advantage.
14.2.1 The Paradox of Random Success
Consider this seemingly impossible scenario:
- You need to check if two files are identical
- The files are on different continents (network latency is huge)
- The files are massive (terabytes)
Deterministic approach: Send entire file across network—takes hours, costs fortune.
Randomized approach:
1. Pick 100 random positions
2. Compare bytes at those positions
3. If all match, declare “probably identical” with 99.999…% confidence
4. Takes seconds, costs pennies!

(That confidence level assumes the files, if different, differ in many positions; in practice, comparing random fingerprints or hashes of the files gives strong guarantees even for a single-byte difference.)
This is the magic of randomized algorithms: trading absolute certainty for near-certainty with massive efficiency gains.
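The idea can be sketched in a few lines of Python. This is a toy version, with in-memory byte strings standing in for remote files, and `probably_identical` is a name invented here for illustration:

```python
import random

def probably_identical(file_a: bytes, file_b: bytes, samples: int = 100) -> bool:
    """Monte Carlo equality check: compare bytes at random positions.

    Returns False only on a definite mismatch; True means "probably identical".
    (Sizes are compared first -- that part is a cheap exact test.)
    """
    if len(file_a) != len(file_b):
        return False
    for _ in range(samples):
        i = random.randrange(len(file_a))
        if file_a[i] != file_b[i]:
            return False  # Definitely different
    return True  # Probably identical

# Identical data always passes; heavily different data almost never does
a = bytes(1000)        # 1000 zero bytes
b = bytes([1]) * 1000  # 1000 one bytes
print(probably_identical(a, a))  # True
print(probably_identical(a, b))  # False
```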
14.2.2 Why Randomness?
Randomized algorithms offer unique advantages:
- Simplicity: Often much simpler than deterministic alternatives
- Speed: Expected running time frequently beats worst-case deterministic
- Robustness: No pathological inputs (adversary can’t predict random choices)
- Impossibility Breaking: Solve problems with no deterministic solution
- Load Balancing: Natural distribution of work
- Symmetry Breaking: Resolve ties and deadlocks elegantly
14.2.3 Real-World Impact
Randomized algorithms power critical systems:
Internet Security:
- RSA Encryption: Randomized primality testing
- TLS/SSL: Random nonces prevent replay attacks
- Password Hashing: Random salts defeat rainbow tables

Big Data:
- MinHash: Find similar documents in billions
- HyperLogLog: Count distinct elements in streams
- Bloom Filters: Space-efficient membership testing

Machine Learning:
- Stochastic Gradient Descent: Random sampling speeds training
- Random Forests: Random feature selection improves accuracy
- Monte Carlo Tree Search: Game-playing AI (AlphaGo)

Distributed Systems:
- Consistent Hashing: Random node placement
- Gossip Protocols: Random peer selection
- Byzantine Consensus: Random leader election
14.2.4 Chapter Roadmap
We’ll master the art and science of randomized algorithms:
- Section 6.1: Fundamentals - Las Vegas vs Monte Carlo algorithms
- Section 6.2: Randomized QuickSort and selection algorithms
- Section 6.3: Probabilistic analysis and concentration inequalities
- Section 6.4: Hash functions and fingerprinting techniques
- Section 6.5: Advanced algorithms - MinCut, primality testing
- Section 6.6: Streaming algorithms and sketching
- Section 6.7: Project - Comprehensive randomized algorithm library
14.3 Section 6.1: Fundamentals of Randomized Algorithms
14.3.1 Understanding Randomness in Computing
Before we dive into specific algorithms, let’s understand what we mean by “randomized algorithms” and why adding randomness—seemingly making things less predictable—actually makes algorithms better.
14.3.1.1 A Simple Example: Finding Your Friend in a Crowd
Imagine you’re looking for your friend in a massive stadium with 50,000 people. You have two strategies:
Strategy 1 (Deterministic): Start at Section A, Row 1, Seat 1. Check every seat in order.
- Worst case: Your friend is in the last seat—you check all 50,000 seats!
- Problem: If someone knew your strategy, they could always put your friend in the worst spot.

Strategy 2 (Randomized): Pick random sections and rows to check.
- Expected case: On average, you’ll find them after checking half the seats (25,000).
- Key insight: No one can force a worst case—every arrangement is equally likely to be good or bad!
This is the power of randomization: it eliminates predictable worst cases.
14.3.2 Two Flavors of Randomized Algorithms
Randomized algorithms come in two main types, named after famous gambling cities (appropriately enough!):
14.3.2.1 Las Vegas Algorithms: “Always Right, Sometimes Slow”
The Guarantee: These algorithms ALWAYS give you the correct answer, but the time they take is random.
Real-Life Analogy: Think of shuffling a deck of cards to find all the aces. You’ll always find all four aces eventually (correctness guaranteed), but sometimes you’ll get lucky and find them quickly, other times it takes longer.
Characteristics:
- ✅ Output is always correct
- ⏱️ Running time varies (we analyze expected/average time)
- 🔍 Can verify the answer is correct
- 📊 No error probability—only time varies
Example - Finding a Restaurant: You’re in a new city looking for a good restaurant. You randomly walk around until you find one with good reviews. You’ll definitely find one (correct), but it might take 5 minutes or 50 minutes (random time).
14.3.2.2 Monte Carlo Algorithms: “Always Fast, Usually Right”
The Guarantee: These algorithms ALWAYS finish quickly, but might occasionally give a wrong answer.
Real-Life Analogy: A medical test that takes exactly 5 minutes. It correctly identifies illness 99% of the time, but has a 1% false positive/negative rate. The test always takes 5 minutes (fixed time), but might be wrong (small error probability).
Characteristics:
- ⏱️ Running time is fixed and predictable
- ⚠️ Might give wrong answer (with small probability)
- ❓ Cannot always verify if answer is correct
- 📊 Has bounded error probability
Example - Opinion Polling: Instead of asking all 300 million Americans their opinion (correct but slow), you ask 1,000 random people (fast but might be slightly wrong). The poll takes exactly one day (fixed time) but carries a ±3 percentage-point margin of error (a small chance of being noticeably off).
14.3.3 Why Do We Accept Uncertainty?
You might wonder: “Why would I want an algorithm that might be wrong?” Here’s why:
Massive Speed Improvements: A Monte Carlo algorithm might run in 1 second with 99.9999% accuracy, while a deterministic algorithm takes 1 hour for 100% accuracy.
Good Enough is Perfect: If a medical test is 99.99% accurate, is it worth waiting 10x longer for 100%?
We Can Boost Accuracy: Run the algorithm multiple times! If a one-sided test errs with probability 1%, ten independent runs together err with probability only 0.01^10 = 10^-20.
Real World is Uncertain: Your computer already has random hardware failures (cosmic rays flip bits!). If hardware has a 10^-15 error rate, why demand 0% error from algorithms?
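The accuracy-boosting claim is easy to see in simulation. Here is a toy sketch (the `flaky_test` function is invented for illustration): a one-sided test that errs 10% of the time, repeated k times:

```python
import random

def flaky_test() -> bool:
    """A one-sided Monte Carlo test: the correct answer is True,
    but it wrongly reports False 10% of the time."""
    return random.random() >= 0.10

def amplified(k: int) -> bool:
    """Accept iff any of k independent runs says True.
    For one-sided error, a single True answer is trustworthy."""
    return any(flaky_test() for _ in range(k))

trials = 50_000
for k in (1, 2, 4):
    errors = sum(not amplified(k) for _ in range(trials))
    print(f"k={k}: observed error rate {errors / trials:.4%}")
# The error rate should fall roughly as 0.1**k: ~10%, ~1%, ~0.01%
```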
14.4 Section 6.2: Randomized QuickSort - Learning from Card Shuffling
14.4.1 The Problem with Regular QuickSort
Before we see how randomization helps, let’s understand the problem it solves.
14.4.1.1 Regular QuickSort: The Predictable Approach
Imagine you’re organizing a deck of 52 cards by number. Regular QuickSort works like this:
- Pick the first card as the “pivot” (say it’s a 7)
- Divide into two piles:
- Left pile: All cards less than 7
- Right pile: All cards greater than 7
- Recursively sort each pile
- Combine: Left pile + 7 + Right pile = Sorted!
The Fatal Flaw: What if the cards are already sorted?
- First pivot: Ace (1) → Left pile: empty, Right pile: 51 cards
- Next pivot: 2 → Left pile: empty, Right pile: 50 cards
- And so on…
We get the most unbalanced splits possible! This is like trying to balance a see-saw with all the kids on one side.
Time complexity: O(n²) - absolutely terrible for large datasets!
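To see the difference concretely, here is a small counting experiment (not from the text) comparing a first-element pivot against a random pivot on already-sorted input:

```python
import random
import sys

def quicksort_comparisons(arr, random_pivot):
    """Count partition work (elements compared against the pivot)
    for quicksort with either a first-element or a random pivot."""
    if len(arr) <= 1:
        return 0
    pivot = arr[random.randrange(len(arr))] if random_pivot else arr[0]
    left = [x for x in arr if x < pivot]
    right = [x for x in arr if x > pivot]
    # len(arr) units of work at this level, plus the two recursive calls
    return len(arr) + quicksort_comparisons(left, random_pivot) \
                    + quicksort_comparisons(right, random_pivot)

sys.setrecursionlimit(10_000)  # The deterministic case recurses n deep
sorted_input = list(range(2_000))
print("first-element pivot:", quicksort_comparisons(sorted_input, False))
print("random pivot:       ", quicksort_comparisons(sorted_input, True))
# First-element pivot does ~n²/2 comparisons here; random pivot ~2n ln n
```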
14.4.2 Enter Randomized QuickSort: The Magic of Random Pivots
The solution is beautifully simple: pick a random card as the pivot!
14.4.2.1 Why Random Pivots Save the Day
Let’s understand this with an analogy:
Scenario: You’re dividing 100 students into two groups for a game.
Bad approach (deterministic): Always pick the shortest student as the divider.
- If students line up by height (worst case), you get groups of 0 and 99!

Good approach (randomized): Pick a random student as the divider.
- Sometimes you get 20 vs 80 (not great)
- Sometimes you get 45 vs 55 (pretty good!)
- Sometimes you get 50 vs 50 (perfect!)
- On average: You get reasonably balanced groups

The Mathematical Magic:
- Probability of picking a “good” pivot (between 25th and 75th percentile): 50%
- With good pivots, we get balanced splits
- Expected number of times we split: O(log n)
- Total expected work: O(n log n) - MUCH better!
14.4.2.2 Step-by-Step Example
Let’s sort the array [3, 7, 1, 9, 2, 5] using randomized QuickSort:
Step 1: Pick a random pivot
- Randomly choose position 3 → pivot = 9
- Partition: [3, 7, 1, 2, 5] | 9 | [ ]
- Left has 5 elements, right has 0 (not great, but okay)

Step 2: Recursively sort the left side [3, 7, 1, 2, 5]
- Random pivot: position 1 → pivot = 7
- Partition: [3, 1, 2, 5] | 7 | [ ]

Step 3: Sort [3, 1, 2, 5]
- Random pivot: position 0 → pivot = 3
- Partition: [1, 2] | 3 | [5]
- Nice balanced split!

Step 4: Sort [1, 2]
- Random pivot: 2
- Partition: [1] | 2 | [ ]
Final result: [1, 2, 3, 5, 7, 9]
Notice how even with one bad split (step 1), we still got good overall performance because other splits were balanced!
14.4.2.3 The Implementation
Now that we understand WHY it works, here’s the code:
import random

def randomized_quicksort(arr):
    """
    Las Vegas algorithm: Always sorts correctly.
    Expected O(n log n), worst case O(n²) but rare.
    """
    if len(arr) <= 1:
        return arr
    # The KEY INNOVATION: Pick a random pivot instead of first/last element
    pivot = arr[random.randint(0, len(arr) - 1)]
    # Partition around pivot (same as regular QuickSort)
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]  # Handle duplicates
    right = [x for x in arr if x > pivot]
    # Recursively sort and combine
    return randomized_quicksort(left) + middle + randomized_quicksort(right)

14.4.2.4 Why This Simple Change Is So Powerful
Mathematical Insight:
- With n elements, there are n possible pivots
- A “good” pivot lands between the 25th and 75th percentile
- Probability of a good pivot = 50% (half the elements are good!)
- Expected depth of recursion ≈ 2 log₂ n (since we get good pivots half the time)
- Total expected comparisons ≈ 2n ln n ≈ 1.39n log₂ n

Practical Impact: Sorting 1 million items:
- Worst case (deterministic): ~1 trillion comparisons
- Expected (randomized): ~20 million comparisons
- That’s 50,000 times faster!
14.4.3 Monte Carlo Algorithms: The Speed-Accuracy Tradeoff
Now let’s explore Monte Carlo algorithms, which trade a tiny bit of accuracy for massive speed gains.
14.4.3.1 Primality Testing: Is This Number Prime?
The Challenge: Checking if a huge number (say, 100 digits) is prime.
Naive Approach: Try dividing by all numbers up to √n
- For a 100-digit number, that’s ~10^50 divisions
- Would take longer than the age of the universe!

Monte Carlo Solution: Miller-Rabin Test
- Takes only ~1000 operations
- Might incorrectly say a composite number is prime
- BUT: Error probability < 0.0000000001%
- Good enough for cryptography!
14.4.3.2 How Miller-Rabin Works (Intuitive Explanation)
Think of it like a “prime number detector” test:
The Fermat Test Foundation:
- If n is prime, then for any a not divisible by n: a^(n-1) ≡ 1 (mod n)
- This is like saying: “Prime numbers have a special mathematical fingerprint”
The Problem: Some composite numbers (liars) also pass this test!
The Miller-Rabin Improvement:
- Uses a more sophisticated test that catches most liars
- Tests multiple random values
- Each test catches at least 75% of liars
- After k tests, probability of being fooled ≤ (1/4)^k
Analogy: It’s like having a counterfeit bill detector:
- One test might miss 25% of fakes
- Two tests miss only 6.25% of fakes
- Ten tests miss only ~0.0001% of fakes
- Good enough for practical use!
Let’s implement this powerful algorithm:
import random

def miller_rabin_primality(n, k=10):
    """
    Monte Carlo algorithm: Tests if n is prime.
    Error probability ≤ (1/4)^k
    With k=10: Error ≤ 0.0001%
    Args:
        n: Number to test
        k: Number of rounds (higher = more accurate)
    Returns:
        False if definitely composite
        True if probably prime
    """
    # Handle simple cases
    if n < 2:
        return False
    if n == 2 or n == 3:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False  # Even numbers (except 2) aren't prime
    # Express n-1 as 2^r * d (where d is odd)
    # This is the mathematical setup for the test
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    # Run k rounds of testing
    for _ in range(k):
        # Pick a random "witness" number
        a = random.randrange(2, n - 1)
        # Compute a^d mod n
        x = pow(a, d, n)
        # Check if this witness proves n is composite
        if x == 1 or x == n - 1:
            continue  # This witness doesn't prove anything
        # Square x repeatedly (r-1 times)
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break  # This witness doesn't prove composite
        else:
            # If we never got n-1, then n is definitely composite
            return False
    # Passed all tests - probably prime!
    return True

14.4.5 Understanding Probability in Randomized Algorithms
Before we go further, let’s understand key probability concepts using everyday examples.
14.4.5.1 Expected Value: Your “Average” Outcome
Concept: Expected value is what you’d get “on average” if you repeated something many times.
Real-World Example - Rolling a Die:
- Possible outcomes: 1, 2, 3, 4, 5, 6
- Each has probability 1/6
- Expected value = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6) = 3.5
You can’t actually roll 3.5, but if you roll many times and average, you’ll get close to 3.5!
Algorithm Example - Searching:
- Linear search in array of n elements
- Best case: 1 comparison (element is first)
- Worst case: n comparisons (element is last)
- Expected case: (n+1)/2 comparisons (on average, halfway through)
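The expected-case claim is easy to check empirically; here is a small simulation (not from the text) that searches for a uniformly random target many times and averages the comparison counts:

```python
import random

def linear_search_comparisons(arr, target):
    """Return the number of comparisons linear search makes."""
    for i, x in enumerate(arr):
        if x == target:
            return i + 1
    return len(arr)

# Search for a uniformly random element many times and average
n = 101
arr = list(range(n))
trials = 50_000
avg = sum(linear_search_comparisons(arr, random.randrange(n))
          for _ in range(trials)) / trials
print(f"average comparisons: {avg:.1f}, predicted (n+1)/2 = {(n + 1) / 2}")
# The average should come out close to 51
```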
14.4.5.2 The Birthday Paradox: When Intuition Fails
This famous paradox shows why we need math, not intuition, for probabilities.
The Question: In a room of 23 people, what’s the probability that two share a birthday?
Intuitive (Wrong) Answer: “23 out of 365 days ≈ 6%? Very unlikely!”

Actual (Surprising) Answer:
- Probability > 50%!
- With 50 people: > 97%
- With 100 people: > 99.99999%

Why Our Intuition Is Wrong: We think about one person matching another specific person. But we should think about ANY pair matching!
- With 23 people, there are 253 possible pairs
- Each pair has a small chance of matching
- But with 253 chances, it adds up quickly!
Algorithm Application - Hash Collisions: This same principle explains why hash tables have collisions sooner than expected!
Let’s see this in action:
import random

def birthday_paradox_simulation(n_people=23, n_days=365, trials=10000):
    """
    Simulate the birthday paradox to verify probability.
    This demonstrates how randomized events can be analyzed.
    """
    collisions = 0
    for _ in range(trials):
        birthdays = []
        for _ in range(n_people):
            birthday = random.randint(1, n_days)
            if birthday in birthdays:
                collisions += 1
                break  # Found a match!
            birthdays.append(birthday)
    probability = collisions / trials
    # Exact theoretical value: 1 - (365/365)(364/365)...((365-n+1)/365)
    p_unique = 1.0
    for i in range(n_people):
        p_unique *= (n_days - i) / n_days
    print(f"With {n_people} people:")
    print(f"Simulated probability of shared birthday: {probability:.2%}")
    print(f"Theoretical probability: {1 - p_unique:.2%}")
    return probability

# Try it out!
# birthday_paradox_simulation(23)  # Should be close to 50%
# birthday_paradox_simulation(50)  # Should be close to 97%

14.4.5.3 Amplification: Making Algorithms More Reliable
The Problem: Your Monte Carlo algorithm is 90% accurate. Not good enough?
The Solution: Run it multiple times and vote!
Analogy - Medical Testing:
- One test: 90% accurate
- Two independent tests agreeing: ~99% accurate
- Three independent tests agreeing: ~99.9% accurate

How It Works Mathematically: If the error probability is p < 1/2 and we run k independent trials, taking the majority vote:
- Error probability ≤ (4p(1−p))^(k/2)
- For small p, this shrinks roughly like p^(k/2)
Example:
import math
from collections import Counter

def amplify_accuracy(monte_carlo_func, input_data, desired_accuracy=0.99):
    """
    Boost accuracy by running algorithm multiple times.
    This is like getting multiple medical opinions!
    """
    # Calculate how many runs we need
    single_accuracy = 0.9  # Assume 90% accurate
    single_error = 1 - single_accuracy
    # To get 99% accuracy, we need error < 0.01
    # With majority voting: error ≈ single_error^(k/2)
    # So: 0.01 ≥ 0.1^(k/2)
    # Therefore: k ≥ 2 * log(0.01) / log(0.1) ≈ 4
    k = math.ceil(2 * math.log(1 - desired_accuracy) / math.log(single_error))
    # Run k times and take majority
    results = []
    for _ in range(k):
        results.append(monte_carlo_func(input_data))
    # Return most common result
    most_common = Counter(results).most_common(1)[0][0]
    return most_common

14.5 Section 6.3: Randomized Selection - Finding Needles in Haystacks
14.5.1 The Selection Problem
Goal: Find the kth smallest element in an unsorted array.
Examples:
- Find the median (k = n/2)
- Find the 90th percentile (k = 0.9n)
- Find the third smallest (k = 3)
14.5.2 The Naive Approach
Method 1: Sort then Select
- Sort the entire array: O(n log n)
- Return element at position k: O(1)
- Total: O(n log n)
Problem: We’re doing too much work! We sort everything when we only need one element.
Analogy: It’s like organizing your entire bookshelf alphabetically just to find the 10th book. Wasteful!
14.5.3 QuickSelect: The Randomized Solution
Key Insight: We can use QuickSort’s partitioning idea but only recurse on ONE side!
14.5.3.1 How QuickSelect Works
Imagine finding the 30th tallest person in a group of 100:
1. Pick a random person as “pivot” (say they’re 5’10”)
2. Divide into two groups:
   - Shorter than 5’10”: 45 people
   - Taller than 5’10”: 54 people
3. Determine which group to search:
   - We want the 30th tallest
   - There are 54 people taller than the pivot
   - So the 30th tallest is in the “taller” group
   - It’s the 30th person in that group
4. Recursively search just that group
   - We’ve eliminated 46 people from consideration!
5. Keep going until we find our target
14.5.3.2 Why It’s Fast
- Each partition cuts the problem roughly in half (on average)
- We only recurse on one side (unlike QuickSort which does both)
- Expected comparisons: n + n/2 + n/4 + … = 2n = O(n)
That’s linear time! Much better than O(n log n) for sorting.
14.5.3.3 The Implementation
import random

def quickselect(arr, k):
    """
    Find the kth smallest element (0-indexed).
    Las Vegas algorithm: Always returns correct answer.
    Expected time: O(n)
    Worst case: O(n²) but very rare with random pivots
    Example:
        arr = [3, 7, 1, 9, 2, 5]
        quickselect(arr, 2) returns 3 (the 3rd smallest element)
    """
    if len(arr) == 1:
        return arr[0]
    # Random pivot is the key!
    pivot = arr[random.randint(0, len(arr) - 1)]
    # Partition into three groups
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    larger = [x for x in arr if x > pivot]
    # Determine which group contains our target
    if k < len(smaller):
        # kth smallest is in the 'smaller' group
        return quickselect(smaller, k)
    elif k < len(smaller) + len(equal):
        # kth smallest is the pivot
        return pivot
    else:
        # kth smallest is in the 'larger' group
        # Adjust k to be relative to the larger group
        return quickselect(larger, k - len(smaller) - len(equal))

14.5.3.4 Practical Applications of QuickSelect
1. Finding Medians in Data Analysis
   - Dataset: Customer purchase amounts
   - Need: Find median purchase (not affected by billionaire outliers)
   - QuickSelect: O(n) vs Sorting: O(n log n)
2. Percentile Calculations
   - Finding 95th percentile response time in web servers
   - Identifying top 10% performers without sorting everyone
3. Statistical Sampling
   - Quickly finding quartiles for box plots
   - Real-time analytics on streaming data
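As a usage sketch for the first application, here is a median helper built on quickselect (restated compactly so the snippet runs on its own; the `median` wrapper and sample data are invented for illustration):

```python
import random

def quickselect(arr, k):
    """kth smallest (0-indexed), expected O(n) -- as in the section above."""
    pivot = arr[random.randrange(len(arr))]
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    if k < len(smaller):
        return quickselect(smaller, k)
    if k < len(smaller) + len(equal):
        return pivot
    return quickselect([x for x in arr if x > pivot],
                       k - len(smaller) - len(equal))

def median(data):
    """Median without sorting: one or two quickselect calls."""
    n = len(data)
    if n % 2 == 1:
        return quickselect(data, n // 2)
    return (quickselect(data, n // 2 - 1) + quickselect(data, n // 2)) / 2

purchases = [12.5, 8.0, 99999.0, 15.0, 9.5]  # one billionaire-sized outlier
print(median(purchases))  # 12.5 -- unaffected by the outlier
```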
14.6 Section 6.4: Hash Functions and Randomization
14.6.1 Understanding Hash Tables First
Before diving into universal hashing, let’s understand why randomization helps with hash tables.
14.6.1.1 The Hash Table Dream
Imagine you’re building a library catalog system:
- Goal: Find any book instantly by its ISBN
- Naive approach: Check every book (slow!)
- Array approach: Use ISBN as array index (wastes massive space!)
- Hash table approach: Transform ISBN into small array index
14.6.1.2 The Problem: Collisions
Scenario: Two different books map to the same shelf location!
This is like two different people having the same locker combination. What do we do?
Deterministic Problem: If an attacker knows your hash function, they can deliberately cause collisions:
- Send 1000 items that all hash to the same bucket
- Your O(1) lookup becomes O(n) - disaster!
14.6.2 Universal Hashing: Randomization to the Rescue
The Solution: Pick a random hash function from a family of functions!
Analogy:
- Instead of always using the same locker assignment rule
- Randomly choose from 100 different assignment rules
- Attacker can’t predict which rule you’ll use
- Can’t deliberately cause collisions!
14.6.2.1 What Makes a Hash Family “Universal”?
A family of hash functions is universal if:
- For any two different keys x and y
- The probability (over the random choice of h) that h(x) = h(y) is ≤ 1/m
- Where m is the table size
In Simple Terms: The chance of any two items colliding is as small as if we assigned them random locations!
14.6.2.2 The Carter-Wegman Construction
One elegant universal hash family:
h(x) = ((ax + b) mod p) mod m
Where:
- p is a prime number larger than your universe of keys
- a is randomly chosen from {1, 2, …, p-1}
- b is randomly chosen from {0, 1, …, p-1}
- m is your table size
Let’s implement this:
import random

class UniversalHashTable:
    """
    Hash table using universal hashing for guaranteed performance.
    This prevents adversarial attacks on hash table performance!
    """
    def __init__(self, initial_size=16):
        self.size = self._next_prime(initial_size)
        self.prime = self._next_prime(2**32)  # Large prime
        # Randomly select hash function parameters
        self.a = random.randint(1, self.prime - 1)
        self.b = random.randint(0, self.prime - 1)
        # Initialize empty buckets
        self.buckets = [[] for _ in range(self.size)]
        self.num_items = 0
        print(f"Selected hash function: h(x) = (({self.a}*x + {self.b}) mod {self.prime}) mod {self.size}")

    def _next_prime(self, n):
        """Find the next prime number >= n."""
        def is_prime(num):
            if num < 2:
                return False
            for i in range(2, int(num**0.5) + 1):
                if num % i == 0:
                    return False
            return True
        while not is_prime(n):
            n += 1
        return n

    def _hash(self, key):
        """
        Universal hash function.
        Probability of collision for any two keys ≤ 1/size
        """
        # Convert key to integer if needed
        if isinstance(key, str):
            key = sum(ord(c) * (31**i) for i, c in enumerate(key))
        # Apply universal hash function
        return ((self.a * key + self.b) % self.prime) % self.size

    def insert(self, key, value):
        """Insert key-value pair."""
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        # Check if key already exists
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # Update
                return
        # Add new key-value pair
        bucket.append((key, value))
        self.num_items += 1
        # Resize if load factor too high
        if self.num_items > self.size * 0.75:
            self._resize()

    def get(self, key):
        """Retrieve value for key."""
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(f"Key '{key}' not found")

    def _resize(self):
        """
        Resize table and rehash with new random function.
        This maintains the universal hashing guarantee!
        """
        old_buckets = self.buckets
        # Double size and pick new hash function
        self.size = self._next_prime(self.size * 2)
        self.a = random.randint(1, self.prime - 1)
        self.b = random.randint(0, self.prime - 1)
        self.buckets = [[] for _ in range(self.size)]
        self.num_items = 0
        # Rehash all items
        for bucket in old_buckets:
            for key, value in bucket:
                self.insert(key, value)

14.6.3 Bloom Filters: Space-Efficient Membership Testing
14.6.3.1 The Problem
You’re building a web crawler that shouldn’t visit the same URL twice:
- Billions of URLs to track
- Storing all URLs in a set would take terabytes of RAM
- Need a space-efficient solution
14.6.3.2 The Bloom Filter Solution
Trade-off: Use WAY less space, but accept small false positive rate.
How it works:
1. Create a bit array (like a row of light switches)
2. Use multiple hash functions
3. To add an item: Turn on bits at positions given by the hash functions
4. To check an item: See if all corresponding bits are on

The Catch:
- Can have false positives (say item is present when it’s not)
- NEVER has false negatives (if it says item is absent, it definitely is)

Real-World Analogy: It’s like a bouncer with a partial guest list:
- If your name’s not on the list, you’re definitely not invited (no false negatives)
- If your name IS on the list, you’re probably invited (small chance of error)
14.6.3.3 Understanding Bloom Filter Parameters
The math behind optimal Bloom filter parameters:
- m = number of bits
- n = expected number of items
- k = number of hash functions
- p = false positive probability

Optimal formulas:
- Bits needed: m = -n × ln(p) / (ln(2)²)
- Hash functions: k = (m/n) × ln(2)

Example: To track 1 million URLs with a 1% false positive rate:
- Need only 9.6 bits per item = 1.2 MB total!
- Compare to storing actual URLs: ~50 bytes per URL = 50 MB
- That’s 40× space savings!
Let’s implement it:
import math
import random
import hashlib

class BloomFilter:
    """
    Space-efficient probabilistic data structure for membership testing.
    Use case: "Have I seen this before?" when storing everything is too expensive.
    """
    def __init__(self, expected_items=1000000, false_positive_rate=0.01):
        """
        Initialize Bloom filter with optimal parameters.
        Args:
            expected_items: How many items you expect to add
            false_positive_rate: Acceptable error rate (e.g., 0.01 = 1%)
        """
        # Calculate optimal size and number of hash functions
        self.m = self._optimal_bit_size(expected_items, false_positive_rate)
        self.k = self._optimal_hash_count(self.m, expected_items)
        # Initialize bit array (using list of booleans for clarity)
        self.bits = [False] * self.m
        self.items_added = 0
        print(f"Bloom Filter initialized:")
        print(f"  Expected items: {expected_items:,}")
        print(f"  False positive rate: {false_positive_rate:.1%}")
        print(f"  Bits needed: {self.m:,} ({self.m/8/1024:.1f} KB)")
        print(f"  Hash functions: {self.k}")
        print(f"  Bits per item: {self.m/expected_items:.1f}")

    def _optimal_bit_size(self, n, p):
        """
        Calculate optimal number of bits.
        Formula: m = -n × ln(p) / (ln(2)²)
        """
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m, n):
        """
        Calculate optimal number of hash functions.
        Formula: k = (m/n) × ln(2)
        """
        return max(1, int(m / n * math.log(2)))

    def _hash(self, item, seed):
        """
        Generate hash with different seed for each hash function.
        In practice, you'd use MurmurHash or similar.
        """
        # Simple hash for demonstration
        data = f"{item}{seed}".encode('utf-8')
        hash_hex = hashlib.md5(data).hexdigest()
        return int(hash_hex, 16) % self.m

    def add(self, item):
        """
        Add item to the filter.
        Sets k bits to True.
        """
        for i in range(self.k):
            bit_index = self._hash(item, i)
            self.bits[bit_index] = True
        self.items_added += 1

    def might_contain(self, item):
        """
        Check if item might be in the set.
        Returns:
            True: Item MIGHT be in the set (or false positive)
            False: Item is DEFINITELY NOT in the set
        """
        for i in range(self.k):
            bit_index = self._hash(item, i)
            if not self.bits[bit_index]:
                return False  # Definitely not in set
        return True  # Might be in set

    def current_false_positive_rate(self):
        """
        Calculate current false positive probability.
        Formula: (1 - e^(-kn/m))^k
        """
        if self.items_added == 0:
            return 0
        # Probability that a bit is still 0
        prob_zero = math.exp(-self.k * self.items_added / self.m)
        # Probability of false positive
        return (1 - prob_zero) ** self.k

# Example usage showing space efficiency
def bloom_filter_demo():
    """Demonstrate Bloom filter efficiency."""
    # Track 10 million URLs with 0.1% false positive rate
    bloom = BloomFilter(expected_items=10_000_000, false_positive_rate=0.001)
    # Add some URLs
    urls_visited = [
        "https://example.com",
        "https://google.com",
        "https://github.com"
    ]
    for url in urls_visited:
        bloom.add(url)
    # Check membership
    test_urls = [
        "https://example.com",   # Should be found
        "https://facebook.com"   # Should not be found
    ]
    for url in test_urls:
        if bloom.might_contain(url):
            print(f"✓ {url} might have been visited")
        else:
            print(f"✗ {url} definitely not visited")
    # Compare space usage at full capacity (10 million URLs)
    actual_storage = 10_000_000 * 50  # ~50 bytes per URL
    bloom_storage = bloom.m / 8  # bits to bytes
    print(f"\nSpace comparison for 10,000,000 URLs:")
    print(f"  Storing actual URLs: {actual_storage:,} bytes")
    print(f"  Bloom filter: {bloom_storage:,.0f} bytes")
    print(f"  Space saved: {(1 - bloom_storage/actual_storage)*100:.1f}%")

14.7 Section 6.5: Streaming Algorithms - Processing Infinite Data
14.7.1 The Streaming Challenge
Imagine you’re monitoring Twitter:
- 500 million tweets per day
- Can’t store everything in memory
- Need real-time statistics
- Data arrives continuously
The Constraint: You can only make ONE PASS through the data!
14.7.2 Reservoir Sampling: Fair Sampling from Streams
14.7.2.1 The Problem
You want to maintain a random sample of tweets, but you don’t know how many total tweets there will be!
14.7.2.2 The Solution: Reservoir Sampling
Analogy: Imagine you’re at a parade and want to photograph 10 random floats:
- You don’t know how many floats there will be
- You can only keep 10 photos in your camera
- You want each float to have an equal chance of being photographed

The Algorithm:
1. Keep the first k items in your “reservoir”
2. For item n (where n > k):
   - With probability k/n, include it
   - If including, randomly replace one existing item
3. Magic: This gives uniform probability to all items!
14.7.2.3 Why It Works (Intuitive Explanation)
For any item to be in the final sample:
- It needs to be selected when it arrives
- It needs to survive all future replacements

Math Magic:
- Probability item i is in the final sample = k/N (where N is the total number of items)
- This is exactly uniform sampling!
Let’s implement it:
import random

class ReservoirSampler:
    """
    Maintain a uniform random sample from a stream of unknown size.
    Applications:
    - Sampling tweets for sentiment analysis
    - Random sampling from database queries
    - A/B testing with streaming data
    """
    def __init__(self, sample_size=100):
        """
        Initialize reservoir.
        Args:
            sample_size: Number of items to maintain in sample
        """
        self.k = sample_size
        self.reservoir = []
        self.items_seen = 0

    def add(self, item):
        """
        Process new item from stream.
        Maintains uniform probability for all items seen so far.
        """
        self.items_seen += 1
        if len(self.reservoir) < self.k:
            # Haven't filled reservoir yet
            self.reservoir.append(item)
        else:
            # Randomly decide whether to include this item
            # Probability = k / items_seen
            j = random.randint(1, self.items_seen)
            if j <= self.k:
                # Include this item, replace random existing item
                replace_index = random.randint(0, self.k - 1)
                self.reservoir[replace_index] = item

    def get_sample(self):
        """Get current random sample."""
        return self.reservoir.copy()

    def sample_probability(self):
        """
        Probability that any specific item is in the sample.
        Should be k/n for uniform sampling.
        """
        if self.items_seen == 0:
            return 0
        return min(1.0, self.k / self.items_seen)

# Demonstration
def reservoir_sampling_demo():
    """
    Demonstrate that reservoir sampling is truly uniform.
    """
    sampler = ReservoirSampler(sample_size=10)
    # Stream of 1000 items
    stream = range(1000)
    for item in stream:
        sampler.add(item)
    sample = sampler.get_sample()
    print(f"Random sample of 10 from 1000 items: {sorted(sample)}")
    print(f"Each item had probability {sampler.sample_probability():.1%} of being selected")
    # Verify uniformity with multiple runs
    counts = {}
    for _ in range(10000):
        sampler = ReservoirSampler(sample_size=1)
        for item in range(10):
            sampler.add(item)
        selected = sampler.get_sample()[0]
        counts[selected] = counts.get(selected, 0) + 1
    print("\nUniformity test (should be ~1000 each):")
    for item in sorted(counts.keys()):
        print(f"  Item {item}: {counts[item]} times")

14.7.3 Count-Min Sketch: Frequency Estimation
14.7.3.1 The Problem
Count how many times each hashtag appears on Twitter:
- Millions of different hashtags
- Can’t maintain a counter for each
- Need approximate counts
14.7.3.2 The Solution: Count-Min Sketch
Intuition:

- Use multiple small hash tables instead of one huge one
- Each hashtag increments one counter in each table
- Take the minimum across tables (reduces overestimation from collisions)
Why “Count-Min”?

- We COUNT in multiple tables
- Take the MINimum to reduce error from hash collisions
import math
import random

class CountMinSketch:
    """
    Estimate frequencies in data streams with limited memory.

    Guarantees: Estimate ≥ True Count (never underestimates)
                Estimate ≤ True Count + ε×N with probability 1-δ
    """
    def __init__(self, epsilon=0.01, delta=0.01):
        """
        Initialize Count-Min Sketch.

        Args:
            epsilon: Error bound (e.g., 0.01 = within 1% of stream size)
            delta: Failure probability (e.g., 0.01 = 99% confidence)
        """
        # Calculate dimensions
        self.width = int(math.ceil(math.e / epsilon))
        self.depth = int(math.ceil(math.log(1 / delta)))

        # Initialize counter tables
        self.tables = [[0] * self.width for _ in range(self.depth)]

        # Random hash functions (simplified - use better hashes in production)
        self.hash_params = []
        for _ in range(self.depth):
            a = random.randint(1, 2**31 - 1)
            b = random.randint(0, 2**31 - 1)
            self.hash_params.append((a, b))

        print("Count-Min Sketch initialized:")
        print(f"  Width: {self.width} (controls accuracy)")
        print(f"  Depth: {self.depth} (controls confidence)")
        print(f"  Total memory: {self.width * self.depth} counters")

    def _hash(self, item, table_index):
        """Hash item for specific table."""
        a, b = self.hash_params[table_index]
        # Convert item to integer
        if isinstance(item, str):
            item_hash = hash(item)
        else:
            item_hash = item
        # Universal hash function
        return ((a * item_hash + b) % (2**31 - 1)) % self.width

    def add(self, item, count=1):
        """Add occurrences of item."""
        for i in range(self.depth):
            j = self._hash(item, i)
            self.tables[i][j] += count

    def estimate(self, item):
        """
        Estimate count for item.
        Returns minimum across all tables (reduces overestimation).
        """
        estimates = []
        for i in range(self.depth):
            j = self._hash(item, i)
            estimates.append(self.tables[i][j])
        return min(estimates)

# Example: Tracking word frequencies in text stream
def count_min_demo():
    """
    Demonstrate Count-Min Sketch for word frequency.
    """
    sketch = CountMinSketch(epsilon=0.001, delta=0.01)

    # Simulate text stream
    text_stream = """
    the quick brown fox jumps over the lazy dog
    the fox was quick and the dog was lazy
    """ * 100  # Repeat for volume
    words = text_stream.lower().split()

    # Add words to sketch
    for word in words:
        sketch.add(word)

    # Check frequencies
    test_words = ["the", "fox", "dog", "cat"]
    print("\nWord frequency estimates:")
    for word in test_words:
        true_count = words.count(word)
        estimate = sketch.estimate(word)
        error = estimate - true_count
        print(f"  '{word}': true={true_count}, estimate={estimate}, error={error}")

14.7.4 HyperLogLog: Counting Unique Elements
14.7.4.1 The Problem
Count unique users visiting your website:

- Billions of visits
- The same users visit multiple times
- Can’t store all user IDs
14.7.4.2 The HyperLogLog Magic
Key Insight:

- In random bit strings, rare patterns indicate large sets
- Like inferring crowd size from the rarest jersey number you see

Intuition:

- If you flip coins, getting 10 heads in a row is rare
- If you see this pattern, you probably flipped LOTS of coins
- HyperLogLog uses this principle with hash functions

Amazing Properties:

- Count billions of unique items
- Using only ~16KB of memory!
- Error rate ~2%
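To make the coin-flip intuition concrete, here is a minimal HyperLogLog sketch. This is an illustrative toy, not the fully bias-corrected production variant: the choice of 2^b registers, the SHA-1-based hash, and the small-range linear-counting fallback are all standard but simplified design choices for the demo.

import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not production-grade)."""
    def __init__(self, b=10):
        self.b = b                      # use m = 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')
        idx = h & (self.m - 1)          # low b bits pick a register
        w = h >> self.b                 # remaining 64-b bits
        # Position of the leftmost 1-bit in w ("how many heads in a row")
        rank = (64 - self.b) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Bias-correction constant (approximation, valid for m >= 128)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m * z
        # Small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

With b=10 (1024 one-byte registers, ~1 KB), adding 100,000 distinct user IDs yields an estimate within a few percent of the truth, even though duplicates cost nothing.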
This algorithm is so elegant and powerful that it’s used by Redis, Google BigQuery, and many other systems!
14.9 Section 6.6: Analyzing Randomized Algorithms
14.9.1 Understanding Performance Through Probability
When we analyze randomized algorithms, we can’t just say “it takes X steps” because X is now random! Instead, we need to understand the probability distribution of running times.
14.9.1.1 Expected Value: The Average Case
Definition: If an algorithm has different possible running times, the expected value is the average, weighted by probability.
Example - Finding an Item in Random Position:
Array has 4 slots: [_, _, _, _]
Item could be in position: 1, 2, 3, or 4 (each with probability 1/4)
Comparisons needed: 1, 2, 3, or 4
Expected comparisons = 1×(1/4) + 2×(1/4) + 3×(1/4) + 4×(1/4) = 2.5
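A quick simulation confirms the 2.5 figure. This is a throwaway sketch; the 4-slot array, the fixed seed, and the trial count are arbitrary choices for the demo.

import random

random.seed(0)

def comparisons_to_find(arr, target):
    """Linear search; return how many comparisons were needed."""
    for i, x in enumerate(arr, start=1):
        if x == target:
            return i

trials = 100_000
total = 0
for _ in range(trials):
    arr = [0, 0, 0, 0]
    arr[random.randrange(4)] = 1   # item lands in a uniformly random slot
    total += comparisons_to_find(arr, 1)

print(f"Average comparisons: {total / trials:.3f}")  # close to 2.5

The empirical average converges to the weighted sum 1×(1/4) + 2×(1/4) + 3×(1/4) + 4×(1/4) = 2.5.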
14.9.1.2 High Probability Bounds
“Expected” performance is nice, but what if we get unlucky? We want stronger guarantees!
High Probability Statement: “The algorithm finishes in O(n log n) time with probability ≥ 1 - 1/n²”
What this means:

- For n = 1000: fails less than once in a million runs
- For n = 1,000,000: fails less than once in a trillion runs
- As input grows, failure becomes astronomically unlikely!
14.9.2 Concentration Inequalities: Why Randomized Algorithms Don’t Get Unlucky
These mathematical tools prove that random events cluster around their expected values.
14.9.2.1 Markov’s Inequality: The Weakest Bound
Statement: For non-negative random variable X: P(X ≥ k × E[X]) ≤ 1/k
In Plain English: “The probability of being k times worse than expected is at most 1/k”
Example: If QuickSort expects 100 comparisons:

- P(≥ 1000 comparisons) ≤ 1/10 = 10%
- P(≥ 10000 comparisons) ≤ 1/100 = 1%
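Markov’s inequality is easy to check empirically. The sketch below uses a geometric random variable (fair-coin flips until the first heads, expected value 2) as an arbitrary illustrative example, and compares the bound with the actual tail:

import random

random.seed(1)

def flips_until_heads():
    """Fair-coin flips until the first heads; E[X] = 2."""
    n = 1
    while random.random() >= 0.5:
        n += 1
    return n

samples = [flips_until_heads() for _ in range(100_000)]
mean = sum(samples) / len(samples)

for k in (2, 5, 10):
    tail = sum(1 for x in samples if x >= k * mean) / len(samples)
    print(f"P(X >= {k}*E[X]) = {tail:.4f}   (Markov bound: {1/k:.2f})")

Notice how loose the bound is: the actual tails decay geometrically, far faster than 1/k. That looseness is exactly why we want the stronger Chernoff bounds below.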
14.9.2.2 Chernoff Bounds: Much Stronger Guarantees
For sums of independent random events, we get exponentially decreasing failure probability!
Example - Coin Flips: Flip 1000 fair coins. Expected heads = 500.

- Probability of ≥ 600 heads ≈ 10⁻¹⁰
- Probability of ≥ 700 heads ≈ 10⁻³⁶ (essentially impossible!)
This is why randomized algorithms are reliable despite using randomness!
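These coin-flip tails can be computed exactly and compared against a Chernoff bound. The sketch below uses one common form of the bound, P(X ≥ (1+δ)μ) ≤ exp(−δ²μ/3) for 0 < δ ≤ 1; other forms exist with slightly different constants.

import math

def binom_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def chernoff_bound(n, p, k):
    """Chernoff bound P(X >= (1+d)*np) <= exp(-d^2 * np / 3), for 0 < d <= 1."""
    mu = n * p
    d = k / mu - 1
    return math.exp(-d * d * mu / 3)

n, p = 1000, 0.5
for k in (550, 600, 700):
    print(f"P(heads >= {k}): exact = {binom_tail(n, p, k):.2e}, "
          f"Chernoff bound <= {chernoff_bound(n, p, k):.2e}")

The exact tail always sits below the bound, and both shrink exponentially as the threshold moves away from the mean: that exponential collapse is what makes randomized algorithms dependable.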
Let’s see these principles in action:
import random
import math

class RandomizedAnalysis:
    """
    Tools for analyzing and demonstrating randomized algorithm behavior.
    """
    @staticmethod
    def analyze_quicksort_concentration(n=1000, trials=1000):
        """
        Demonstrate that QuickSort concentrates around expected time.
        """
        def count_comparisons_quicksort(arr):
            """Count comparisons in randomized QuickSort."""
            if len(arr) <= 1:
                return 0
            pivot = arr[random.randint(0, len(arr) - 1)]
            comparisons = len(arr) - 1  # Compare all elements to pivot
            left = [x for x in arr if x < pivot]
            right = [x for x in arr if x > pivot]
            # Recursively count comparisons
            return (comparisons
                    + count_comparisons_quicksort(left)
                    + count_comparisons_quicksort(right))

        # Run many trials
        comparison_counts = []
        for _ in range(trials):
            arr = list(range(n))
            random.shuffle(arr)
            comparison_counts.append(count_comparisons_quicksort(arr))

        # Calculate statistics
        expected = 2 * n * math.log(n)  # Theoretical expectation (~2n ln n)
        actual_mean = sum(comparison_counts) / trials

        # Count how many runs land far from expected
        far_from_expected = sum(1 for c in comparison_counts
                                if abs(c - expected) > 0.5 * expected)

        print(f"QuickSort Analysis (n={n}, trials={trials}):")
        print(f"  Theoretical expected: {expected:.0f}")
        print(f"  Actual average: {actual_mean:.0f}")
        print(f"  Min comparisons: {min(comparison_counts)}")
        print(f"  Max comparisons: {max(comparison_counts)}")
        print(f"  Runs >50% from expected: {far_from_expected}/{trials} = {far_from_expected/trials:.1%}")
        print("  Conclusion: QuickSort strongly concentrates around expected value!")
        return comparison_counts

    @staticmethod
    def demonstrate_amplification():
        """
        Show how repetition reduces error probability.
        """
        def unreliable_prime_test(n):
            """
            Fake primality test that's right 75% of the time.
            (For demonstration only!)
            """
            if n < 2:
                return False
            if n == 2:
                return True
            true_answer = all(n % i != 0 for i in range(2, min(int(n**0.5) + 1, 100)))
            # Simulate 75% accuracy
            if random.random() < 0.75:
                return true_answer
            else:
                return not true_answer  # Wrong answer

        def amplified_prime_test(n, k=10):
            """Run test k times and take majority vote."""
            votes = [unreliable_prime_test(n) for _ in range(k)]
            return sum(votes) > k // 2

        # Test on known primes and composites
        test_numbers = [17, 18, 19, 20, 23, 24, 29, 30]
        true_answers = [True, False, True, False, True, False, True, False]

        # Single test accuracy
        single_correct = 0
        for num, truth in zip(test_numbers, true_answers):
            single_correct += sum(unreliable_prime_test(num) == truth
                                  for _ in range(1000))

        # Amplified test accuracy
        amplified_correct = 0
        for num, truth in zip(test_numbers, true_answers):
            amplified_correct += sum(amplified_prime_test(num, k=10) == truth
                                     for _ in range(1000))

        print("\nError Amplification Demo:")
        print(f"  Single test accuracy: {single_correct/8000:.1%}")
        print(f"  Amplified (10 runs) accuracy: {amplified_correct/8000:.1%}")
        print(f"  Improvement: {(amplified_correct-single_correct)/single_correct:.1%}")

# Run the demonstrations
analyzer = RandomizedAnalysis()
# analyzer.analyze_quicksort_concentration()
# analyzer.demonstrate_amplification()

14.10 Section 6.7: Advanced Randomized Algorithms
14.10.1 Karger’s Min-Cut Algorithm: Finding Bottlenecks
14.10.1.1 The Problem
Find the minimum cut in a network - the smallest number of connections that, if removed, would split the network into two parts.
Real-World Applications:

- Finding the weakest points in internet infrastructure
- Identifying community boundaries in social networks
- Circuit design and reliability analysis
14.10.1.2 The Elegant Randomized Solution
Karger’s Algorithm:

1. Pick a random edge
2. “Contract” it (merge the two endpoints into one node)
3. Repeat until only 2 nodes remain
4. Count the edges between them

Why It Works (Intuition):

- Min-cut edges are “rare” (there are few of them)
- A random edge is unlikely to be in the min-cut
- If we avoid min-cut edges, we find the min-cut!

Success Probability:

- Single run: ≥ 2/n² (seems small!)
- But run n² log n times: success probability > 1 - 1/n
- Multiple runs find the true min-cut with high probability
Let’s implement this beautiful algorithm:
import random
import copy
import math

class KargerMinCut:
    """
    Randomized algorithm for finding minimum cut in a graph.
    Simple, elegant, and probabilistically correct!
    """
    def __init__(self, graph):
        """
        Initialize with graph represented as adjacency list.

        Example graph:
        {
            'A': ['B', 'C', 'D'],
            'B': ['A', 'C'],
            'C': ['A', 'B', 'D'],
            'D': ['A', 'C']
        }
        """
        self.original_graph = graph

    def contract_edge(self, graph, u, v):
        """
        Contract edge (u, v) - merge v into u.

        This is like combining two cities into a metropolis!
        All of v's connections become u's connections.
        """
        for neighbor in graph[v]:
            # Add v's edges to u
            if neighbor != u:  # Skip self-loops
                graph[u].append(neighbor)
            # Update the neighbor's connections
            graph[neighbor] = [u if x == v else x for x in graph[neighbor]]
        # Remove v from graph
        del graph[v]
        # Remove any self-loops created
        graph[u] = [x for x in graph[u] if x != u]

    def single_min_cut_trial(self):
        """
        One trial of Karger's algorithm.
        Returns the cut size found (might not be minimum!).
        """
        # Make a copy since we'll modify the graph
        graph = copy.deepcopy(self.original_graph)
        vertices = list(graph.keys())

        # Contract down to 2 nodes
        while len(vertices) > 2:
            # Pick random edge
            u = random.choice(vertices)
            if not graph[u]:  # No edges from u
                vertices.remove(u)
                continue
            v = random.choice(graph[u])
            # Contract this edge
            self.contract_edge(graph, u, v)
            vertices.remove(v)

        # Count edges between final two nodes
        if len(graph) == 2:
            remaining = list(graph.keys())
            return len(graph[remaining[0]])
        return float('inf')

    def find_min_cut(self, confidence=0.99):
        """
        Find minimum cut with high probability.

        Args:
            confidence: Desired probability of success (e.g., 0.99 = 99%)
        Returns:
            Minimum cut size found
        """
        n = len(self.original_graph)

        # Calculate trials needed for desired confidence:
        #   Probability of success in one trial ≥ 2/(n(n-1))
        #   Probability of failure in k trials ≤ (1 - 2/(n(n-1)))^k
        #   We want this ≤ 1 - confidence
        single_success_prob = 2 / (n * (n - 1))
        trials_needed = int(math.ceil(math.log(1 - confidence)
                                      / math.log(1 - single_success_prob)))
        print(f"Running {trials_needed} trials for {confidence:.1%} confidence...")

        # Run multiple trials
        min_cut = float('inf')
        for trial in range(trials_needed):
            cut_size = self.single_min_cut_trial()
            if cut_size < min_cut:
                min_cut = cut_size
                print(f"  Trial {trial+1}: Found cut of size {cut_size}")
        return min_cut

# Example usage
def karger_demo():
    """
    Demonstrate Karger's algorithm on a simple graph.
    """
    # Create a simple graph with known min-cut
    graph = {
        'A': ['B', 'C', 'D'],
        'B': ['A', 'C', 'E'],
        'C': ['A', 'B', 'D', 'E'],
        'D': ['A', 'C', 'E', 'F'],
        'E': ['B', 'C', 'D', 'F'],
        'F': ['D', 'E']
    }

    print("Graph structure:")
    print("  A---B")
    print("  |\\  |\\")
    print("  | \\ | E")
    print("  |  \\|/|")
    print("  C---D-F")
    print("\nThe min-cut is 2 (cut edges D-F and E-F to separate F)")

    karger = KargerMinCut(graph)
    min_cut = karger.find_min_cut(confidence=0.99)
    print(f"\nKarger's algorithm found min-cut: {min_cut}")

14.10.2 Monte Carlo Integration: Using Randomness for Math
14.10.2.1 The Problem
Calculate the area under a complex curve or the value of π.
Traditional Approach: Complex calculus, might be impossible for some functions!
Monte Carlo Approach: Throw random darts and count how many land under the curve!
14.10.2.2 Estimating π with Random Points
The Setup:

- Square from -1 to 1 (area = 4)
- Circle of radius 1 inside (area = π)
- Ratio of circle to square = π/4

The Algorithm:

1. Throw random points in the square
2. Count how many land in the circle
3. π ≈ 4 × (points in circle) / (total points)
import random
import math

def estimate_pi(num_samples=1000000):
    """
    Estimate π using Monte Carlo simulation.

    This is like throwing darts at a circular dartboard
    inside a square frame!
    """
    inside_circle = 0
    for _ in range(num_samples):
        # Random point in square [-1, 1] × [-1, 1]
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        # Check if point is inside unit circle
        if x*x + y*y <= 1:
            inside_circle += 1

    # Estimate π
    pi_estimate = 4 * inside_circle / num_samples

    # Calculate statistics
    actual_pi = math.pi
    error = abs(pi_estimate - actual_pi)
    relative_error = error / actual_pi * 100

    print("Monte Carlo π Estimation:")
    print(f"  Samples: {num_samples:,}")
    print(f"  Points in circle: {inside_circle:,}")
    print(f"  Estimate: {pi_estimate:.6f}")
    print(f"  Actual π: {actual_pi:.6f}")
    print(f"  Error: {error:.6f} ({relative_error:.3f}%)")

    # Calculate theoretical standard error
    p = math.pi / 4  # True probability of landing in the circle
    std_error = 4 * math.sqrt(p * (1 - p) / num_samples)
    print(f"  Theoretical std error: {std_error:.6f}")
    return pi_estimate

# Try with different sample sizes to see convergence
def monte_carlo_convergence_demo():
    """Show how the estimate improves with more samples."""
    sample_sizes = [100, 1000, 10000, 100000, 1000000]
    print("\nMonte Carlo Convergence:")
    for n in sample_sizes:
        # Inline estimate (skips the detailed printout of estimate_pi)
        inside = sum(random.uniform(-1, 1)**2 + random.uniform(-1, 1)**2 <= 1
                     for _ in range(n))
        estimate = 4 * inside / n
        error = abs(estimate - math.pi)
        print(f"  n={n:>8,}: π≈{estimate:.4f}, error={error:.4f}")

14.11 Section 6.8: Real-World Applications
14.11.1 Load Balancing with Randomization
14.11.1.1 The Problem
You have 1000 servers and millions of requests. How do you distribute load fairly?
Deterministic Approach: Round-robin, but if some requests are heavier, servers become imbalanced!
Randomized Solution: Power of Two Choices
- Pick TWO random servers
- Send request to the less loaded one
- This simple change dramatically improves balance!
Why It Works:

- Random choice alone: maximum load ≈ log n / log log n
- Power of two choices: maximum load ≈ log log n (MUCH better!)
import random
import math

class LoadBalancer:
    """
    Demonstrate different load balancing strategies.
    """
    def __init__(self, num_servers=100):
        self.num_servers = num_servers
        self.loads = [0] * num_servers

    def random_assignment(self, num_requests=10000):
        """Pure random assignment."""
        self.loads = [0] * self.num_servers
        for _ in range(num_requests):
            server = random.randint(0, self.num_servers - 1)
            self.loads[server] += 1
        return max(self.loads)

    def power_of_two_choices(self, num_requests=10000):
        """Pick two random servers, use the less loaded one."""
        self.loads = [0] * self.num_servers
        for _ in range(num_requests):
            # Pick two random servers
            server1 = random.randint(0, self.num_servers - 1)
            server2 = random.randint(0, self.num_servers - 1)
            # Choose the less loaded one
            if self.loads[server1] <= self.loads[server2]:
                self.loads[server1] += 1
            else:
                self.loads[server2] += 1
        return max(self.loads)

    def compare_strategies(self):
        """Compare load balancing strategies."""
        random_max = self.random_assignment()
        two_choices_max = self.power_of_two_choices()
        print(f"\nLoad Balancing Comparison ({self.num_servers} servers, 10000 requests):")
        print("  Random assignment:")
        print(f"    Max load: {random_max}")
        print(f"    Expected: ~{10000/self.num_servers + 3*math.sqrt(10000/self.num_servers):.0f}")
        print("  Power of two choices:")
        print(f"    Max load: {two_choices_max}")
        print(f"    Improvement: {random_max/two_choices_max:.1f}x better!")

# Demo
balancer = LoadBalancer(100)
balancer.compare_strategies()

14.11.2 A/B Testing with Statistical Significance
In web development and product design, randomized experiments help make data-driven decisions.
import random
import math

class ABTest:
    """
    Simple A/B testing framework with statistical significance.
    """
    def __init__(self, name="Experiment"):
        self.name = name
        self.group_a = {'visitors': 0, 'conversions': 0}
        self.group_b = {'visitors': 0, 'conversions': 0}

    def assign_visitor(self):
        """Randomly assign visitor to group A or B."""
        if random.random() < 0.5:
            self.group_a['visitors'] += 1
            return 'A'
        else:
            self.group_b['visitors'] += 1
            return 'B'

    def record_conversion(self, group):
        """Record a conversion for the specified group."""
        if group == 'A':
            self.group_a['conversions'] += 1
        else:
            self.group_b['conversions'] += 1

    def analyze_results(self):
        """
        Calculate statistical significance of results,
        using the normal approximation to the binomial.
        """
        # Calculate conversion rates
        rate_a = self.group_a['conversions'] / max(self.group_a['visitors'], 1)
        rate_b = self.group_b['conversions'] / max(self.group_b['visitors'], 1)
        n_a = self.group_a['visitors']
        n_b = self.group_b['visitors']

        if n_a < 100 or n_b < 100:
            print(f"Need more data! (A: {n_a} visitors, B: {n_b} visitors)")
            return

        # Pooled conversion rate for variance calculation
        pooled_rate = (self.group_a['conversions'] + self.group_b['conversions']) / (n_a + n_b)
        # Standard error
        se = math.sqrt(pooled_rate * (1 - pooled_rate) * (1/n_a + 1/n_b))
        # Z-score
        z = (rate_b - rate_a) / se if se > 0 else 0
        # P-value (two-tailed test), via the normal CDF approximation
        p_value = 2 * (1 - self._normal_cdf(abs(z)))

        print(f"\nA/B Test Results: {self.name}")
        print(f"  Group A: {rate_a:.2%} conversion ({self.group_a['conversions']}/{n_a})")
        print(f"  Group B: {rate_b:.2%} conversion ({self.group_b['conversions']}/{n_b})")
        print(f"  Relative improvement: {((rate_b/rate_a - 1) * 100):.1f}%")
        print(f"  Z-score: {z:.3f}")
        print(f"  P-value: {p_value:.4f}")
        if p_value < 0.05:
            print("  Result: STATISTICALLY SIGNIFICANT! (p < 0.05)")
            winner = 'B' if rate_b > rate_a else 'A'
            print(f"  Winner: Group {winner}")
        else:
            print("  Result: Not statistically significant (need more data)")

    def _normal_cdf(self, z):
        """Approximate normal CDF using the error function."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Demo A/B test
def ab_test_demo():
    """
    Simulate an A/B test with different conversion rates.
    """
    test = ABTest("Button Color Test")

    # Simulate visitors:
    #   Group A (blue button): 10% conversion
    #   Group B (green button): 12% conversion (20% better!)
    for _ in range(5000):
        group = test.assign_visitor()
        # Simulate conversion based on group
        if group == 'A':
            if random.random() < 0.10:  # 10% conversion
                test.record_conversion('A')
        else:
            if random.random() < 0.12:  # 12% conversion
                test.record_conversion('B')

    test.analyze_results()

# Run the demo
# ab_test_demo()

14.12 Summary: The Power of Controlled Randomness
14.12.1 Key Takeaways
- Randomization Eliminates Worst Cases
  - No adversary can force bad performance
  - Every input becomes “average case”
- Two Types of Randomized Algorithms
  - Las Vegas: always correct, random running time
  - Monte Carlo: fixed running time, probably correct
- Probability Concentrates
  - Random events cluster around their expectations
  - Bad luck is exponentially unlikely
  - Multiple runs boost confidence
- Simplicity and Elegance
  - Randomized algorithms are often simpler
  - Easier to implement and understand
  - Natural parallelization
- Real-World Impact
  - Used in databases, networks, and security
  - Powers big data analytics
  - Essential for modern computing
14.12.2 When to Use Randomization
✅ Use When:

- Worst-case is much worse than average
- You need a simple, practical solution
- You’re dealing with massive data
- You want to prevent adversarial inputs
- A small error probability is acceptable

❌ Avoid When:

- Absolute correctness is required
- You need reproducible results
- Random number generation is limited
- Hard real-time guarantees are essential
14.12.3 Final Thought
“In the face of complexity, randomness is often our best strategy.”
Randomized algorithms show us that embracing uncertainty can lead to more certain outcomes. By giving up a tiny bit of determinism, we gain massive improvements in simplicity, speed, and robustness.
Next chapter, we’ll explore the limits of computation itself with NP-Completeness!