Masking PII data the right way
Simplification is even more important with complex topics.
Even though regulations protecting users’ data have been in place for quite some time, companies still struggle to set up their data pipelines in a way that robustly covers PII processing. Unfortunately, the technologies we use were not designed with masking PII in mind. Software engineers love immutability: they lean towards functional programming languages, stateless transformations, append-only models and all the other beautiful techniques that simplify our lives. Then, suddenly, your team gets a question from one of your stakeholders:
Is there a place in our system where we store raw PII?
In institutions without the right governance, data protection rules and awareness, the true answer can be devastating. The question is so tricky that some companies that already have procedures in place aren’t even aware that they are not protecting PII properly. In case of a data breach this can lead to severe financial consequences. The first half of this year alone generated around 100M euro in fines for GDPR breaches.
I want to challenge you. You might have worked with systems like this before or even built them yourself. You might be aware of email addresses, IP addresses and other types of PII. You might even know that you can apply K-anonymization (e.g. instead of storing an age, you store an age group like 16-24) and that some data naturally forms groups, like gender or zip codes. But have you considered combinations of seemingly innocent columns? Check out this research on the topic: it only takes a zip code, gender and date of birth to uniquely identify 87% of Americans.
What is PII?
In the contemporary world we leave a lot of trails, both while browsing the internet and while interacting with the physical world. Any information that can reasonably be linked, directly or indirectly, to the identity of a person is considered PII. This includes very direct and obvious data such as full name, email address, IP address or account number, but also all sorts of identifiers such as cookies or accurate GPS trails.
As you saw in the previous section, sometimes a combination of seemingly irrelevant features can form PII too. The vague definition of PII as a reasonable link to a person doesn’t help either.
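To make the point about combinations concrete, here is a minimal sketch that measures how identifying a set of columns is. The records and the column names (zip, gender, birth_year) are made up for illustration; the idea is simply to count how many rows share each combination of values.

```python
from collections import Counter

# Hypothetical toy records; column names and values are assumptions.
records = [
    {"zip": "02139", "gender": "F", "birth_year": 1985},
    {"zip": "02139", "gender": "M", "birth_year": 1985},
    {"zip": "94105", "gender": "F", "birth_year": 1990},
    {"zip": "94105", "gender": "M", "birth_year": 1990},
]


def smallest_group(rows, columns):
    """Return k: the size of the smallest group sharing the same values in `columns`."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(groups.values())


def unique_fraction(rows, columns):
    """Return the share of rows that are the only member of their group."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    unique_rows = sum(
        1 for row in rows if groups[tuple(row[c] for c in columns)] == 1
    )
    return unique_rows / len(rows)
```

In this toy data each column on its own puts every person in a group of two, but zip and gender together single out every row. Running the same check on real tables is a cheap way to spot risky column combinations.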
You might be wondering how social media sites can even work if they show your full name to almost anyone. First of all, that’s the core of their business, so they are allowed to process it according to their terms and conditions (that you, of course, read). Secondly, you can always make your profile private and share that information only with the people you want.
Why should I care?
Fines, fines, fines… The EU was first with GDPR (the EU General Data Protection Regulation), which took effect in 2018, and record fines are issued year after year, with almost 100M euro in the first half of 2022. CCPA (the California Consumer Privacy Act) followed in the US, and more states are considering similar laws. Brazil introduced its own regulation in 2020: LGPD, or Lei Geral de Proteção de Dados Pessoais.
I won’t get into the details of how those acts work. In general, they apply to processing the data of people from certain parts of the world. Different acts may have different definitions of PII, different rules for processing that information and different rights a person has with regard to that data. For instance, with GDPR you have to:
- control what PII is stored in your systems (databases, topics, queues…) and how it is processed (what the purpose is of every process that accesses that data),
- allow people to retrieve and / or delete such data on their request.
Getting the processing part right is the trickiest: in general you can’t store more PII than necessary. You might be tempted to store some information for future use cases, or to keep more detailed data even though your current use cases don’t require that level of detail (age vs age group).
Certain laws apply to specific groups of people. One such law is COPPA (the Children’s Online Privacy Protection Act), which requires parental consent for collecting PII of children under 13.
ByteDance (the company behind TikTok) is one of the most famous examples of a company that notoriously breaches privacy agreements. That poses a serious security threat for governments across the world because, according to the law, Chinese companies are forced to cooperate with state intelligence. They collect data, oftentimes without the user’s consent, and this includes biometric information. Not all data is synchronized with Chinese servers, though; the majority is stored in local data centres outside China. Nonetheless, TikTok doesn’t comply with COPPA, effectively violating children’s privacy. Although they don’t seem to care about the lawsuits, I bet not all companies would survive such negative PR.
Masking techniques
Given that your company has to process PII, you have to explore the available options. Not every technique can be applied to every scenario. For instance, you can’t just anonymize all PII in production: even though that’s an effective method, you can no longer reverse the mapping.
K-anonymization is a technique that translates a concrete value assigned to a user (e.g. zip code) into a more general group (e.g. city), so that each record becomes indistinguishable from at least k-1 others. A similar technique is averaging (replacing a value with the typical value for a certain group of people).
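As a rough sketch of such generalization, the helpers below coarsen an age into a bucket and a zip code into a prefix. The bucket edges and the choice of a zip prefix (instead of a proper zip-to-city lookup) are assumptions made to keep the example self-contained.

```python
def age_group(age: int) -> str:
    """Map an exact age to a coarse bucket (the bucket edges are an assumption)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    buckets = [(0, 15), (16, 24), (25, 34), (35, 49), (50, 64)]
    for low, high in buckets:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"


def zip_prefix(zip_code: str, keep: int = 3) -> str:
    """Keep only the first `keep` characters of a zip code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)
```

For example, age_group(19) gives "16-24" and zip_prefix("02139") gives "021**". Whether the resulting groups are large enough still has to be checked against the actual data.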
Encryption uses cryptography to mask the values. Salted hashing is often used instead because of its relatively low performance impact, but keep in mind that a hash is one-way, while encryption can be reversed by whoever holds the key. To increase security, keys should be rotated and separate keys should be used for each user.
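The hashing variant can be sketched with a keyed hash (HMAC-SHA256) from the Python standard library, where a random per-user key plays the role of the salt. The function names are hypothetical, and where the keys are stored and how they are rotated is left out on purpose.

```python
import hashlib
import hmac
import secrets


def new_user_key() -> bytes:
    """Generate a random per-user secret (to be stored separately from the data)."""
    return secrets.token_bytes(32)


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256) of the value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same value hashed with the same key always gives the same result, so joins still work for one user, while a different key gives an unrelated result. A keyed hash like this cannot be reversed; getting the original value back requires real encryption or a lookup table.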
Redaction is a technique of applying a mask to values, e.g. showing XXX-XX-XXXX instead of the SSN. It is usually used to limit access to certain information. For instance, a call center worker may be required to validate personal information about the client. They should only be presented with, e.g., the last 4 digits of a number if that’s what the security policy requires. Redaction can also mean returning certain values as NULL, or blacking out certain parts of documents in OCR or speech-to-text problems.
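A minimal sketch of redaction for SSN-shaped values could look like this. The regular expression only covers the NNN-NN-NNNN layout, and the show_last4 flag is a hypothetical stand-in for whatever the security policy allows.

```python
import re

# Matches values shaped like 123-45-6789 (an assumption about the input format).
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def redact_ssn(text: str, show_last4: bool = False) -> str:
    """Mask every SSN-shaped value in `text`, optionally keeping the last 4 digits."""

    def mask(match):
        tail = match.group(3) if show_last4 else "XXXX"
        return f"XXX-XX-{tail}"

    return SSN_PATTERN.sub(mask, text)
```

A call centre screen would call redact_ssn(note, show_last4=True), while an export for analytics would keep the default and mask the whole number.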
Tokenization is an interesting technique where PII is replaced with a seemingly random token that privileged users can later map back to the original value via a table lookup (a secure vault). It is imperative not to derive tokens from values with the same deterministic function across many users: such deterministic mapping always produces the same token for the same value and is therefore susceptible to reverse engineering.
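The following sketch shows the core of tokenization with random tokens. TokenVault is a hypothetical in-memory stand-in for a secure vault; a real one would be a hardened service with access checks and persistent storage.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secure vault mapping random tokens to values."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # A fresh random token on every call: the token carries no
        # information about the value and cannot be recomputed from it.
        token = secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # In a real system this lookup would sit behind an access check.
        return self._token_to_value[token]
```

Because tokens are random rather than derived from the value, two users with the same email address get unrelated tokens, which is exactly what prevents guessing values from tokens.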
Data at rest, in transit and in use should all be masked or encrypted. Usually it is not secure enough to have just permission-based access on top of the raw data. Think of a case where an attacker gains access to your warehouse. It wouldn’t be smart to store passwords in plaintext there, would it?
- data at rest is usually handled by your queue, database or warehouse provider. Platforms such as MSK or Snowflake offer such encryption,
- data in transit is any data that flows through the network. At the very least you should assume a zero-trust policy and encrypt all internal communication between processing nodes,
- protecting data in use is the programmer’s responsibility. Storing PII in an unencrypted on-disk cache is not wise and can lead to data breaches. The same goes for writing it to unencrypted logs.
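For the logging case, one possible safety net is a filter that scrubs obvious PII before a record is written. The sketch below uses Python's standard logging module and only handles email addresses; the pattern, the logger name and the placeholder text are assumptions, and a filter like this complements, rather than replaces, not logging PII in the first place.

```python
import logging
import re

# A deliberately simple email pattern; real-world addresses can be messier.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails(logging.Filter):
    """Replace email addresses in log messages before they are written."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result.
        record.msg = EMAIL_PATTERN.sub("<redacted-email>", record.getMessage())
        record.args = ()
        return True  # keep the record, just with a cleaned message


logger = logging.getLogger("pii_demo")
logger.addFilter(RedactEmails())
```

With the filter attached, a call such as logger.warning("login failed for %s", email) is emitted with the address replaced by the placeholder.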
Let’s now explore how PII can be handled in systems that frequently deal with such data.
Data Warehouses and databases
Data warehouses are usually the place where important information from all departments of the company ends up. In my view, treating PII there doesn’t look much different from protecting it in a regular database. There are many flavours of how you can protect that data.
First and foremost, with warehouses like Snowflake or Redshift, data is encrypted in the storage layer. Access to data is restricted by user and role permissions. Here’s an overview of how you can address various access patterns in a warehouse:
- column-level security. Certain users shouldn’t be able to view certain columns. The Data Science team might be interested in many user features but not necessarily in their email or IP address. This can be achieved with views and RBAC,
- row-level security. If you expose order tables to external parties, such as the companies that execute those orders, they should only see the records that are relevant to them,
- certain columns should be masked. Back to our call centre example: all the agent needs is the last 4 digits of the SSN. In Snowflake you can use a CASE statement with a CURRENT_ROLE condition to filter that out. A common convention is to embed the NON_PII string within the user’s role so that the check condition can be expressed as CURRENT_ROLE() NOT LIKE '%NON_PII%',
- tables with PII shouldn’t be viewed by certain users. A solution is to store such tables in a separate schema and grant access to it only to privileged people.
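The role-based masking logic can be modelled outside the warehouse as well. The sketch below is plain Python, not Snowflake code: customer_view is a hypothetical function that mimics what a masking view with a CURRENT_ROLE() check would return, and the column names are made up.

```python
def mask_ssn(ssn: str) -> str:
    """Keep only the last 4 digits, e.g. for a call centre agent."""
    return "XXX-XX-" + ssn[-4:]


def customer_view(rows, current_role: str):
    """Mimic a masking view: roles containing NON_PII get masked columns."""
    non_pii = "NON_PII" in current_role  # same idea as LIKE '%NON_PII%'
    for row in rows:
        yield {
            "customer_id": row["customer_id"],
            "ssn": mask_ssn(row["ssn"]) if non_pii else row["ssn"],
            "email": None if non_pii else row["email"],
        }
```

The important property is that the decision is made in one place, based on the role, and that callers with a NON_PII role never receive the raw values at all.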
In reality, this still poses a risk of users getting indirect information about PII stored in the warehouse. Many of the discussed solutions use views. Unfortunately (at least in this case), warehouses apply a lot of optimizations when processing views, including pushdowns. A user can read the view definition, so they can learn about the existence of e.g. masked columns. I won’t get into much detail, but similarly to the SQL injection used by hackers, someone might be able to obtain information that they shouldn’t have access to. UDFs are oftentimes even less secure. Snowflake mitigates this with SECURE VIEWS. Their code can’t be viewed and certain optimizations are not applied to them, which may cost some performance, but they address the majority of such nasty cases.
Data Lakes
Before technologies like Delta Lake or Iceberg, modifying data in a data lake was a mess. It usually required full scans and rewrites of large files. Table formats for analytics enable teams to apply CRUD operations in easier and more performant ways.
With data lakes, the same masking and access principles apply. They can no longer be enforced at the row or column level, but data should still be appropriately masked. Deleting data is as easy as issuing a DELETE statement. If statistics are gathered for the user’s identifier, it shouldn’t take weeks to remove the data for a user. Except that the data will not actually be removed. Usually maintenance jobs take care of data that is no longer referenced, but until then the delete is just a tombstone, and with time travel you can still access the user’s data.
To speed up the process, you can optimize your storage with Z-ordering. It physically sorts the data in files for better multi-dimensional query performance. You can read more about Z-order curves and why they preserve local order in low-dimensional space here.
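The idea behind a Z-order curve fits in a few lines: interleave the bits of the coordinates and sort by the resulting number. This is only a sketch of the curve itself for two non-negative integer dimensions; how a particular engine maps real column values onto such coordinates is not shown here.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of a 2D point with non-negative coords."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sorting by the Z-value keeps points that are close in both dimensions
# close together in the sorted order.
points = [(0, 0), (3, 3), (1, 0), (0, 1), (2, 2), (1, 1)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
```

Here the four points of the unit square around the origin end up next to each other in the sorted order, ahead of (2, 2) and (3, 3), which is why rows for nearby values tend to land in the same files.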
Remember that you can force Delta Lake to remove that data earlier with the VACUUM command.
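Why a logical delete is not enough on its own can be illustrated with a toy model. ToyVersionedTable below is entirely hypothetical and is not Delta Lake's API; it only imitates the behaviour of keeping old snapshots around until a vacuum-like cleanup runs.

```python
class ToyVersionedTable:
    """Toy model of a table with time travel; not a real table format API."""

    def __init__(self, rows):
        self.versions = [list(rows)]  # every write appends a new snapshot

    def delete_user(self, user_id):
        # A logical delete: the latest snapshot no longer has the rows,
        # but older snapshots are left untouched.
        latest = self.versions[-1]
        self.versions.append([r for r in latest if r["user_id"] != user_id])

    def read(self, version=-1):
        return self.versions[version]

    def vacuum(self):
        # Physically drop everything except the latest snapshot.
        self.versions = [self.versions[-1]]
```

After delete_user the current read no longer shows the user, yet read(0) still does. Only after vacuum is the old snapshot really gone, which is the state a deletion request actually requires.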
Streaming systems
Streaming technologies are usually designed to process large amounts of data too. This means that tools like Kafka are designed as an immutable stream and with scalability in mind. It can be difficult at times to control who can read which topic, and it is difficult to tell whether a system that shouldn’t consume data is actually consuming it.
Deleting customer data is a challenge. For compacted topics it is just a matter of writing a tombstone for a certain key. For regular topics you can only prune partitions, and this may remove far more data than necessary. We shouldn’t treat Kafka as a database, and with a relatively short retention period we should be able to comply with the security policy. Any data that needs to live longer should be stored in a data lake or a warehouse.
Usually, an alternative approach is preferred. Instead of modifying data on the topic, you can crypto-shred the encryption key (delete it so that the ciphertext can never be decrypted again) or remove the mapping from the lookup tables used for tokenization. Technically this makes the data on the topic anonymous, which is generally considered acceptable under regulations such as GDPR.
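Crypto-shredding can be sketched as follows. KeyStore is a hypothetical per-user key store, and the cipher is a toy XOR keystream built from SHAKE-256 purely so the example runs with the standard library; it is not authenticated and must not be used for real data, where a vetted cipher such as AES-GCM belongs.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # TOY cipher for illustration only: XOR with a SHAKE-256 keystream.
    # Replace with a vetted authenticated cipher in any real system.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyStore:
    """Hypothetical per-user key store; deleting a key shreds the user's data."""

    def __init__(self):
        self._keys = {}

    def encrypt(self, user_id: str, plaintext: str) -> bytes:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        return nonce + _keystream_xor(key, nonce, plaintext.encode("utf-8"))

    def decrypt(self, user_id: str, message: bytes) -> str:
        key = self._keys[user_id]  # raises KeyError once the key is shredded
        nonce, ciphertext = message[:16], message[16:]
        return _keystream_xor(key, nonce, ciphertext).decode("utf-8")

    def shred(self, user_id: str) -> None:
        # The encrypted messages stay on the topic, but become unreadable.
        self._keys.pop(user_id, None)
```

The messages themselves are never touched: once shred is called for a user, every copy of their encrypted data, on topics, in backups or in checkpoints, becomes undecipherable at the same time.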
There is a more critical issue with streaming systems: changes in a user’s permissions must be reflected across the company immediately. User-facing teams should gather consent information about the user, and this information should be exposed across the company (e.g. via a common API or a table). Streaming technologies offer joining capabilities that apply here.
Stateful stream processing may also store PII internally, and regulations apply to that storage too:
- stream processing engines usually use checkpoints for fault tolerance. Just like time travel in data warehouses, checkpoints can be used to restore a job to a certain point in the past (possibly recovering PII state with it). They should be pruned regularly,
- any state stored in long-living stateful jobs should be pruned with TTLs. In addition, jobs should have a way of removing specific values from their state (e.g. by reading a dedicated topic or calling an API on a schedule).
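Both requirements for job state, expiry by TTL and removal on request, can be sketched with a small keyed store. TtlState is a hypothetical class, not the state API of any particular stream processing engine; the injectable clock is there only to make the behaviour easy to test.

```python
import time


class TtlState:
    """Toy keyed state with a time-to-live, as a stateful job might keep."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, written_at)

    def put(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if self._clock() - written_at > self._ttl:
            del self._entries[key]  # expired: prune lazily on read
            return None
        return value

    def forget(self, key):
        # Called when a deletion request arrives, e.g. from a dedicated topic.
        self._entries.pop(key, None)
```

The TTL bounds how long PII can linger without anyone acting, while forget covers the case where a user asks for deletion before the TTL expires.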
My favourite model
We live in a world where asynchronous, decentralized applications are common. With centralized role-based access and SSO, organizations can gain control over permissions. Technically, though, data removal is usually a challenge. These systems were not designed with deletes in mind, especially random deletes. Moreover, users may be interested in getting all the personal information gathered about them. Having an easy way of accessing it would be beneficial.
I advocate for centralizing access to PII. The topic of PII is so complex that we should strive to get as much control over it as possible. I base my designs on a centralized vault at their heart:
- this vault is technically a key value store,
- the key here is any user identifier,
- the value is a list of named pairs. The name corresponds to a PII column and the value is a mapping from token to actual value,
- these random tokens are later used in data storage,
- users willing to access the actual data can do so by joining information in the vault with the data stored in an event / table,
- serving a user’s request to retrieve PII is a matter of querying that storage,
- serving a user’s request to delete PII is a matter of removing certain pairs from that mapping. This gives fine-grained control and technically anonymizes the data in storage systems, as it can no longer be decrypted or mapped back to the original values,
- you only need to care about securing the vault, so you can put more effort into it.
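The list above can be condensed into a short in-memory sketch. PiiVault and its method names are hypothetical; a real vault would be a secured, persistent service, but the shape of the data (user id, then column name, then token-to-value pairs) follows the description.

```python
import secrets


class PiiVault:
    """In-memory sketch of the vault: user id -> {column -> {token: value}}."""

    def __init__(self):
        self._store = {}

    def tokenize(self, user_id: str, column: str, value: str) -> str:
        token = secrets.token_urlsafe(16)
        self._store.setdefault(user_id, {}).setdefault(column, {})[token] = value
        return token  # only the token is written to events and tables

    def resolve(self, user_id: str, column: str, token: str):
        # Returns None once the mapping is gone, i.e. the data is anonymized.
        return self._store.get(user_id, {}).get(column, {}).get(token)

    def export(self, user_id: str) -> dict:
        # Serves a "give me my data" request straight from the vault.
        return {
            column: list(pairs.values())
            for column, pairs in self._store.get(user_id, {}).items()
        }

    def forget(self, user_id: str, column=None) -> None:
        # Drop one column's pairs, or everything known about the user.
        if column is None:
            self._store.pop(user_id, None)
        else:
            self._store.get(user_id, {}).pop(column, None)
```

Downstream systems only ever see tokens. A retrieval request is served by export, and a deletion request by forget, after which every token already written to topics, lakes or warehouses resolves to nothing.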
Of course, like any architectural decision, it has its benefits and drawbacks. Most importantly, it is very generic with regard to storage and processing systems. I don’t have any particular implementation of such a vault in mind: it can be distributed, sharded, store columns separately for better security, etc.
Conclusion
Understanding PII is tricky. Following all the regulations is tricky. Implementing the systems the right way is tricky. My goal here was to gather the information you need to explore this topic further and to share some of my thoughts about PII processing.
I strongly believe that architectural simplicity and strong control over PII are necessary for any company to feel confident about its data. In any case, it requires a lot of operational effort to get things right. Just imagine the process of key rotation for encrypted fields, and keeping it all in sync with decentralized applications that each have their own release process. Tokenization seems like a middle ground between security, flexibility and ease of implementation. Just imagine the mess if the regulations changed once again.