Real-World Anonymization: Why It’s Trickier Than You Think
The Illusion of Anonymity
Let’s look at a typical “anonymized” dataset:
# Original customer data
original_data = [
{"id": 1, "name": "John Doe", "age": 34, "zipcode": "90210", "purchase": "$299"},
{"id": 2, "name": "Jane Smith", "age": 28, "zipcode": "90001", "purchase": "$199"},
{"id": 3, "name": "Bob Johnson", "age": 45, "zipcode": "90003", "purchase": "$399"}
]
# "Anonymized" version
anonymized_data = [
{"user_id": "A742", "age_group": "30-40", "region": "LA", "purchase": "$299"},
{"user_id": "B234", "age_group": "20-30", "region": "LA", "purchase": "$199"},
{"user_id": "C891", "age_group": "40-50", "region": "LA", "purchase": "$399"}
]
Looks safe, right? Wrong. Each row still carries a near-unique combination of attributes (age group, region, and an exact purchase amount), and that is often all an attacker needs. Here’s why:
How Anonymized Data Gets Compromised
1. The Linkage Attack
Let’s say we have this “anonymized” health dataset:
# "Anonymized" health records
health_data = [
{"patient_id": "X47", "age": 38, "zipcode": "90210", "condition": "diabetes"},
{"patient_id": "Y82", "age": 29, "zipcode": "90001", "condition": "hypertension"},
{"patient_id": "Z93", "age": 46, "zipcode": "90003", "condition": "asthma"}
]
And a public voter registration database:
# Publicly available voter records
voter_records = [
{"name": "John Doe", "age": 38, "zipcode": "90210"},
{"name": "Jane Smith", "age": 29, "zipcode": "90001"},
{"name": "Bob Johnson", "age": 46, "zipcode": "90003"}
]
How the Attack Works:
- Find unique combinations in both datasets
- Match patterns (like age + zipcode)
- Re-identify individuals
def demonstrate_linkage_attack(health_record, voter_records):
    # Join the two datasets on the quasi-identifiers (age + zipcode)
    for voter in voter_records:
        if (voter["age"] == health_record["age"] and
                voter["zipcode"] == health_record["zipcode"]):
            return f"Found match: {voter['name']} has {health_record['condition']}"
    return "No match found"
# Example usage
print(demonstrate_linkage_attack(health_data[0], voter_records))
# Output: Found match: John Doe has diabetes
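The single-record check generalizes directly. Here is a minimal sketch, reusing the toy datasets above (reidentify_all is a hypothetical helper name, not a standard function), that re-identifies every record in one pass:
def reidentify_all(health_data, voter_records):
    # Attempt the same quasi-identifier join for every released record
    matches = []
    for record in health_data:
        for voter in voter_records:
            if (voter["age"] == record["age"] and
                    voter["zipcode"] == record["zipcode"]):
                matches.append((voter["name"], record["condition"]))
    return matches

for name, condition in reidentify_all(health_data, voter_records):
    print(f"{name} -> {condition}")
# John Doe -> diabetes
# Jane Smith -> hypertension
# Bob Johnson -> asthma
With three records this looks trivial, but the same join works at scale whenever age and zipcode combinations are unique; Sweeney estimated that ZIP code, birth date, and sex alone uniquely identify the large majority of Americans [10].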
Real Attack Examples
- The Netflix Prize Dataset (2007)
  - What happened: Netflix released 100 million “anonymized” movie ratings
  - The attack: researchers cross-referenced the ratings with public IMDb reviews
  - Result: successfully identified 84% of users [1]
- The AOL Search Data Leak (2006)
  - Released: 20 million web searches from 650,000 users
  - Identification method: unique patterns in the search queries
  - Famous case: user #4417749 identified as Thelma Arnold [9]
Better Anonymization Techniques
1. K-Anonymity
Ensures each record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers (fields like age and zipcode) [10].
# Bad anonymization (vulnerable)
bad_data = [
{"age": 28, "zipcode": "90210", "disease": "flu"},
{"age": 29, "zipcode": "90213", "disease": "cold"},
{"age": 30, "zipcode": "90215", "disease": "fever"}
]
# K-anonymity (k=3)
k_anonymous_data = [
{"age_range": "25-30", "zipcode": "902**", "disease": "flu"},
{"age_range": "25-30", "zipcode": "902**", "disease": "cold"},
{"age_range": "25-30", "zipcode": "902**", "disease": "fever"}
]
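A quick way to verify a release is to measure the size of the smallest quasi-identifier group; the dataset is k-anonymous for exactly that k. A minimal sketch against the two datasets above (k_of is a hypothetical helper name):
from collections import Counter

def k_of(records, quasi_identifiers):
    # Group records by their quasi-identifier values; the smallest
    # group size is the k for which the dataset is k-anonymous.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

print(k_of(bad_data, ["age", "zipcode"]))                # 1: every record is unique
print(k_of(k_anonymous_data, ["age_range", "zipcode"]))  # 3: meets k=3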
2. Differential Privacy in Action
import numpy as np

def count_with_privacy(data, query_function, epsilon=0.1):
    # Laplace mechanism: a counting query has sensitivity 1, so noise
    # drawn with scale 1/epsilon gives epsilon-differential privacy
    true_count = query_function(data)
    noise = np.random.laplace(0, 1 / epsilon)
    return max(0, int(round(true_count + noise)))
# Example usage
def count_disease(data, disease):
    return sum(1 for record in data if record["disease"] == disease)

flu_count = count_with_privacy(k_anonymous_data,
                               lambda d: count_disease(d, "flu"))
print(f"Private count of flu cases: {flu_count}")
Best Practices for Data Anonymization
- Use multiple techniques
  - Combine k-anonymity with differential privacy
  - Example: first group the data, then add noise (see the sketch at the end of this section)
- Consider temporal aspects
  - Data collected over time can reveal patterns even when each snapshot looks safe
  - Solution: regularly rotate identifiers
# Bad practice: a static identifier links every purchase to the same user
user_123_purchases = [
    {"date": "2023-01-01", "amount": 100},
    {"date": "2023-01-02", "amount": 150},
    {"date": "2023-01-03", "amount": 200}
]
# Better: rotating identifiers break the link between purchases
user_purchases = [
    {"user_id": "A742", "date": "2023-01-01", "amount": 100},
    {"user_id": "B234", "date": "2023-01-02", "amount": 150},
    {"user_id": "C891", "date": "2023-01-03", "amount": 200}
]
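To close the loop on the first best practice above, here is a minimal sketch of the group-then-noise pipeline, reusing count_with_privacy and count_disease from earlier (private_group_counts and the naive even budget split are illustrative assumptions, not a standard API):
def private_group_counts(records, diseases, total_epsilon=0.5):
    # Assumes records have already been generalized to k-anonymous groups.
    # Naive sequential composition: split the total budget evenly per query.
    per_query_epsilon = total_epsilon / len(diseases)
    return {
        disease: count_with_privacy(records,
                                    lambda d, x=disease: count_disease(d, x),
                                    epsilon=per_query_epsilon)
        for disease in diseases
    }

print(private_group_counts(k_anonymous_data, ["flu", "cold", "fever"]))
# e.g. {'flu': 3, 'cold': 0, 'fever': 1} (noisy; varies per run)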
Additional References
[9] Barbaro, M., & Zeller, T. (2006). A Face Is Exposed for AOL Searcher No. 4417749. The New York Times.
[10] Sweeney, L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.