AWS Database Blog
Dynamic data masking in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, and Babelfish for Aurora PostgreSQL
Data obfuscation, or data anonymization, refers to a collection of techniques used to disguise or alter sensitive data to minimize the risk of data breaches, while still allowing legitimate access for necessary operations. This process is commonly used in industries such as financial, healthcare, and government industries, where confidentiality of sensitive information is important and demanded by industry compliance standards and regulations.
Data obfuscation techniques can vary in the level of privacy protection they offer. These techniques include:
- Encryption – This technique converts data into a coded language using a mathematical algorithm, which can only be deciphered with a specific key.
- Hashing – This technique uses unique SALTs fed to one-way hash functions to convert data into a fixed-length code that is irreversible, making it impossible to reverse engineer the original data.
- Tokenization – This technique substitutes sensitive data elements with non-sensitive equivalents, referred to as a token, that have no intrinsic or exploitable meaning or value. The token is a reference that maps back to the sensitive data through a tokenization system.
- Data masking – This technique uses masking patterns to partially or completely hide the data by replacing it with fictitious or obscured data.
There are a variety of different techniques available to support data masking in databases, each with their trade-offs. In this post, we explore dynamic data masking, a technique that returns anonymized data from a query without modifying the underlying data.
Many commercial database systems, such as Oracle and Microsoft SQL Server, provide dynamic data masking, and customers migrating from these database systems to PostgreSQL might require a similar functionality in PostgreSQL. In this post, we discuss a dynamic data masking technique based on dynamic masking views. These views mask personally identifiable information (PII) columns for unauthorized users. This post discusses how to implement this technique in Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition including Babelfish for Aurora PostgreSQL. In the last section, we discuss the limitations of dynamic data masking techniques.
Dynamic data masking with masking views
This post discusses a dynamic data masking (PGDDM) package that accepts a source table as input and generates a view which, based on the persona of the user accessing the view, masks the PII columns in the source table using the masking pattern declared for those columns. The masking view masks the PII columns for the unauthorized users. The authorized users see the data in those columns unmasked. The code for PGDDM can be found in the accompanying GitHub repo.
To make the PGDDM package globally available, in the case of Babelfish, place the package in the sys
schema in Babelfish_db
on the PostgreSQL endpoint of Babelfish. To load the PGDDM artifacts, sign in to Babelfish using the PostgreSQL endpoint and load the script PGDDM.SQL. This script loads all the artifacts into the sys
schema in this database. The sys
schema is global to Babelfish and makes all these artifacts visible to all the databases on a Babelfish instance.
In the case of Aurora PostgreSQL-Compatible and Amazon RDS for PostgreSQL, place the PGDDM package in the sys
schema in a given database. To load the content, use the script PGDDM.SQL. A database in PostgreSQL is independent, so to make the PGDDM package common to all the databases, you can use PostgreSQL template databases to automatically install the package in new databases. The template content doesn’t propagate to the existing databases.
Solution overview
PGDDM has five main components:
- Source tables – The tables containing PII columns
- Pii_masked_columns table – A table that lists all the PII columns in a source table, and the masking pattern to use to mask a specific PII column
- Unmasked_roles table – A table that specifies the authorized roles who can see the PII columns in a source table unmasked
- Masking artifacts – A set of functions and procedures used to generate masking views and apply masking patterns
- Masking views – The views that mask PII columns for unauthorized users, using the declared masking pattern
The following diagram shows the end-to-end process for enforcing dynamic data masking using masking views.
To implement this workflow, complete the following steps:
- Declare the masking patterns for the PII columns in each table.
- Declare the unmasked roles; users with these roles are authorized to see the unmasked data in the source tables.
- Run the procedure
GenMaskingView
and accompanying functions to generate masking views. - Grant permission to users to use the masking views. Revoke permissions from the source tables as needed.
- Use the masking views. These views check the user roles, and if unauthorized, apply the declared masking pattern to the PII columns using the masking functions defined in the dynamic data masking package.
PGDDM assumes that the source table and the masking views are located in separate database schemas. This is because the masking views and the source table have the same name.
Masking functions in PGDDM
The PGDDM package allows for the following masking patterns. A masking pattern is a function that masks the data in a PII column based on a regular expression pattern. The following table lists the available masking patterns.
Masking Pattern | Column Data Type | Masking Pattern Example | Input Column Example | Output |
default() |
Text |
MASKED WITH (FUNCTION = default()) |
admin |
X |
default() |
Number |
MASKED WITH (FUNCTION = default()) |
100 |
0 |
partial(n, xxxxx, m) |
Text |
MASKED WITH (FUNCTION = partial(0, xxxxx, 8) |
admin |
xxxxxxxx |
email() |
Text |
MASKED WITH (FUNCTION=email()) |
john@efgh.biz |
joXXXXXXXX.biz |
random(n, 1m) |
Number |
MASKED WITH (FUNCTION = random(1, 100)) |
7102933 |
15 |
The table pii_masked_columns
keeps track of the masking pattern for each PII column in a specific source table. This table has the following layout:
- Database_name – Name of the database where the source table is located
- Schema_name – Name of the schema where the source table is located
- Table_name – Name of the source table
- Column_name – Name of the PII column
- Masking – Pattern to mask the data
The following table lists examples of the entries in the pii_masked_columns
table.
database_name | schema_name | table_name | column_name | masking |
users |
source_schema |
user_job |
title |
MASKED WITH (FUNCTION = default()) |
users |
source_schema |
user_job |
job |
MASKED WITH (FUNCTION = default()) |
users |
source_schema |
user_job |
email |
MASKED WITH (FUNCTION = email()) |
users |
source_schema |
user_bank |
bank_name |
MASKED WITH (FUNCTION = partial(0,XXXXXXXX, 5)) |
users |
source_schema |
user_bank |
account_id |
MASKED WITH (FUNCTION = random(1, 100)) |
users |
source_schema |
user_bank |
balance |
MASKED WITH (FUNCTION = random(100,500)) |
Setting up authorized users for viewing unmasked data
Authorized users (those who can see the PII data unmasked) are defined using the unmasked_roles
table. The table has the following layout:
- Role – Role of the user who is authorized to see the data unmasked
- Database_name – Name of the database where the source table is located
- Schema_name – Name of the schema where the source table is located
- Table_name – Name of the source table
The following table lists examples of the entries in the unmasked_roles table.
role | database_name | schema_name | table_name |
admin |
users |
source_schema |
user_location |
admin |
users |
source_schema |
job |
staff |
users |
source_schema |
users |
hr |
users |
source_schema |
users |
postgres |
users |
source_schema |
user_location |
hr |
users |
source_schema |
user_job |
hr |
users |
source_schema |
user_bank |
hr |
users |
source_schema |
user_location |
Build dynamic data masking views
The masking procedure GenMaskingView
is used to generate the masking view for a specific table.
The following is the syntax for using the procedure in Amazon RDS for PostgreSQL and Aurora PostgreSQL-Compatible:
For example:
This call generates the following SQL statement that creates a masking view:
In this technique, the masking view checks the authorization based on the CURRENT_ROLE
(the role in which the view is executing). This implies that a user may have a different authorization pattern based on the role the user has in the current execution context.
The following is the syntax for using the procedure in Babelfish for Aurora PostgreSQL (TSQL endpoint):
For example:
This statement generates the following SQL statement that creates a masking view:
If the definition for the underlying source table changes, the masking view must be regenerated. Additionally, the unmasked_roles
and pii_masked_columns
tables must reflect the changes in the source tables before the masking views are generated.
When dealing with many tables, it can become cumbersome to keep track of the tables whose definition has changed. You can use the procedure MaskingReconciliation
to generate dynamic data masking views for all the tables in a schema.
The following is the syntax for Amazon RDS for PostgreSQL and Aurora PostgreSQL-Compatible:
The following is the syntax for Babelfish for Aurora PostgreSQL (TSQL endpoint):
Query dynamic masking views
After users are given permissions to access dynamic data masking views, they can view the data in the source table by selecting from these views. An unauthorized user sees the PII columns masked, whereas an authorized sees them unmasked. For example, an unauthorized user sees the following entries in the user_bank
source bank when selecting from the user_bank
view.
bank_id | user_id | bank_name | account_id | balance |
1 | 1 | xxxxxxxx | 75 | 300 |
2 | 1 | bankxxxxxxxxress | 19 | 379 |
3 | 2 | xxxxxxxxbank | 56 | 401 |
4 | 2 | xxxxxxxx | 20 | 222 |
5 | 3 | xxxxxxxxbank | 45 | 283 |
6 | 4 | xxxxxxxx | 20 | 295 |
7 | 5 | xxxxxxxx | 80 | 195 |
8 | 6 | bankxxxxxxxxress | 23 | 361 |
9 | 7 | xxxxxxxx | 20 | 284 |
An authorized user using the same view will see the PII data unmasked.
bank_id | user_id | bank_name | account_id | balance |
1 | 1 | bank1 | 7,102,933 | 500,000 |
2 | 1 | bank_in_congress | 8,100,033 | 100,000 |
3 | 2 | southern bank | 1,111,133 | 200,000 |
4 | 2 | bank4 | 8,188,833 | 90,000 |
5 | 3 | southern bank | 3,333,222 | 700,000 |
6 | 4 | bank1 | 9,019,292 | 15,000 |
7 | 5 | bank5 | 1,111,111 | 60,000 |
8 | 6 | Bank_in_congress | 2,222,222 | 3,000 |
9 | 7 | bank1 | 7,887,603 | 650,000 |
Limitations of dynamic data masking
Although dynamic data masking can often be simpler to get started with, it has several limitations that you must be aware of:
- Read-only nature – Dynamically masked data can’t be written back to the database, and it’s not suitable for development and testing environments where data needs to be modified
- Performance impact – The dynamic, real-time masking process can introduce additional processing overhead, and potentially impact query performance
- Complex configuration – Setting up masking rules with granular access controls or queries that require federation to remote systems can be complex and require careful management
- Potential for bypass – The dynamic masking process can potentially be bypassed through various methods, including advanced SQL queries, privilege escalation, and brute-force techniques
- Inference vulnerabilities – Although masking hides sensitive data, a user with access to masked data might still be able to infer sensitive information through pattern analysis
Clean up
If you no longer need the setup presented in this post, make sure to delete all the associated resources to avoid being charged in the future:
- On the Amazon RDS console, choose Databases in the navigation pane.
- Select the DB instance you want to delete, and on the Actions menu, choose Delete.
- Enter
delete me
to confirm and choose Delete.
Conclusion
In this post, we showed how you can implement dynamic data masking views in Aurora PostgreSQL-Compatible and Amazon RDS for PostgreSQL by generating masking views to mask the PII data for unauthorized users. We also noted that dynamic data masking is primarily for masking sensitive data as needed in production environments for authorized users with limited access, but not for full data manipulation or development purposes. You can find the source code for the dynamic data masking artifacts in the GitHub repo.
About the Author
Ezat Karimi is a Sr. Solutions Architect with the Amazon Database Migration Accelerator team.