AWS Partner Network (APN) Blog

Building a privacy preserving chatbot with Amazon Bedrock

Skyflow

By Sean Falconer, Head of Marketing – Skyflow
By Manish Ahluwalia, Technology Executive – Skyflow
By Farooq Ashraf, Sr. Solutions Architect – AWS
By Ravi Yadav, Sr. Container Specialist – AWS

In this post, we discuss an approach to creating a privacy-preserving chatbot using Amazon Bedrock and Skyflow Data Privacy Vault. We explore the importance of safeguarding sensitive information, such as personally identifiable information (PII) and intellectual property, throughout the AI model’s lifecycle. Finally, we present architectures for model training and inference that protect PII without sacrificing the usefulness of AI models.

Skyflow is an AWS Partner and AWS Marketplace Seller whose zero-trust data privacy vault simplifies how companies isolate, protect, and govern their customers’ most sensitive data while facilitating region-specific compliance through data localization. Skyflow Data Privacy Vault is a software-as-a-service (SaaS) offering that supports multi-tenant and single-tenant deployment models.

Protecting sensitive data from exposure

In the AWS blog post on how to Build a multi-tenant chatbot with RAG using Amazon Bedrock and Amazon EKS, a Retrieval Augmented Generation (RAG) model was used with Amazon Bedrock and Amazon Elastic Kubernetes Service (Amazon EKS) to build a multi-tenant chatbot.

Consider the situation where a company wants to use this approach for internal operations. The chatbot provides an interface that lets non-technical users within the company access information from internal data and documents. Although every user of the chatbot is an employee, what they can see in a response depends on their individual roles. For example, HR might be able to see employee home addresses, while someone in customer support should only be able to access a customer’s name, official title, and work email. You need guardrails to control who can see what, when, where, and for how long.

Sensitive data detection, de-identification, and fine-grained access control

In data management, the data privacy vault architectural pattern is gaining traction as a solution to protect sensitive information such as customer PII. A data privacy vault securely isolates, protects, and governs sensitive customer data, addressing compliance with regulations such as General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA).

The vault serves as a shared service for managing PII, storing sensitive data separately from existing systems to boost security and compliance. It de-identifies data through tokenization, replacing sensitive data with tokens that reference the plaintext stored in the vault, keeping actual data hidden in external systems.

A data privacy vault enforces a zero-trust model, granting access only with explicit permission via fine-grained policies.
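The tokenization and zero-trust access model described above can be sketched in a few lines of Python. This is an illustrative toy, not Skyflow’s implementation: the `DataPrivacyVault` class, its methods, and the role-based policy model are all hypothetical stand-ins, assuming in-memory storage in place of a real encrypted, audited vault.

```python
import uuid

class DataPrivacyVault:
    """Toy vault: keeps plaintext inside itself and hands out opaque tokens.
    A real vault (e.g. Skyflow) adds encryption, audit logs, and governance."""

    def __init__(self):
        self._store = {}     # token -> (field_type, plaintext), never leaves the vault
        self._policies = {}  # role -> set of field types that role may reveal

    def grant(self, role, field_types):
        self._policies[role] = set(field_types)

    def tokenize(self, field_type, value):
        token = f"tok_{field_type}_{uuid.uuid4().hex[:8]}"
        self._store[token] = (field_type, value)
        return token

    def detokenize(self, token, role):
        field_type, value = self._store[token]
        # Zero trust: no access unless an explicit policy grants it
        if field_type not in self._policies.get(role, set()):
            return "[REDACTED]"
        return value

vault = DataPrivacyVault()
vault.grant("hr", {"home_address", "name"})
vault.grant("support", {"name"})

token = vault.tokenize("home_address", "123 Main St")
print(vault.detokenize(token, "hr"))       # 123 Main St
print(vault.detokenize(token, "support"))  # [REDACTED]
```

External systems store and pass around only the token; the HR role sees the plaintext, while the support role, whose policy does not cover home addresses, gets a redacted value.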

Figure 1 shows a traditional architecture in which sensitive data (a phone number) crosses architecture components in plaintext, compared with a data privacy vault architecture (Figure 2) that keeps sensitive data isolated and avoids data sprawl (see IEEE Privacy Engineering).

Figure 1 – Example of a traditional system where sensitive data is copied throughout, leading to data sprawl

Figure 2 – Example of a data privacy vault architecture where sensitive data is isolated and protected by the vault; only non-sensitive tokens are replicated, eliminating sensitive data sprawl

Skyflow provides a data privacy vault as a service, supporting a variety of deployment models including multi-tenant, single-tenant, and deployment via an Amazon Virtual Private Cloud (Amazon VPC) with dedicated tenancy. Additionally, Skyflow supports both a built-in managed encryption key service and Bring Your Own Encryption Keys.

In the context of LLMs, Skyflow helps address privacy concerns during the entire LLM lifecycle, from training, fine-tuning, and RAG, to inference.

  • Model training – Skyflow enables privacy-safe model training by excluding sensitive data from datasets used in the model training process.
  • Vectorization – Skyflow prevents sensitive data from being vectorized during RAG model data ingestion.
  • Inference – Skyflow also prevents sensitive data from being exposed during inference, whether it arrives in prompts or in RAG context.
  • Intelligent guardrails – Enable responsible and ethical use of language models. Guardrails prevent harmful output, address bias and fairness, and help achieve compliance with policies and regulations.
  • Governance and access control – Define precise access policies around de-identified data, controlling who sees what, when, and where, based on who is providing the prompt. Authorized users receive the data they need, only for the exact amount of time they need it. The access is logged and available for auditing purposes.
  • Integrated compute environment – Skyflow Data Privacy Vault seamlessly integrates into existing data infrastructure, adding an effective layer of data protection. Skyflow prevents plaintext sensitive data from flowing into LLMs, revealing sensitive data only to authorized users.

Preserving privacy during data ingestion and vectorization

You can combine the principles laid out previously to create a multi-tenant architecture that can run a privacy-safe RAG model in Amazon Bedrock. In Figure 3, the data for Tenant A undergoes a privacy-enhancing de-identification process with Skyflow before it’s integrated into the multi-tenant RAG system described in the post, Build a multi-tenant chatbot with RAG using Amazon Bedrock and Amazon EKS.

In this architecture, Skyflow serves as a crucial intermediary: a privacy gateway that detects sensitive data within larger datasets and then redacts it.

Figure 3 – Example of a privacy-safe multi-tenant RAG system; the Skyflow vault serves as a privacy gateway, de-identifying sensitive data during the RAG pipeline

By adopting this approach, you not only minimize the compliance footprint, but also maintain data in its current location, rather than duplicating it in the model hosting environment.

Skyflow uses a variety of machine learning (ML) based approaches to automatically detect and redact sensitive data. This includes detection within documents, images, and audio files. Users can also define their own allow and deny terms to protect company-specific terms that they don’t want shared with the model. Skyflow’s redaction capabilities are versatile and provide a variety of options for de-identification.
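To make the detection-and-redaction step concrete, here is a minimal Python sketch using regular expressions. This is an illustrative assumption, not Skyflow’s implementation: the actual detection is ML-based and covers far more entity types (including documents, images, and audio), while the patterns and the `redact` helper below handle only a few toy cases.

```python
import re

# Illustrative patterns only; real detection covers many more entity types
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text, deny_terms=()):
    """Replace detected entities with numbered placeholders like [ EMAIL_1 ].
    The same value always maps to the same placeholder, which preserves
    relationships within the document."""
    mapping = {}
    counters = {}

    def placeholder(entity, value):
        if value not in mapping:
            counters[entity] = counters.get(entity, 0) + 1
            mapping[value] = f"[ {entity}_{counters[entity]} ]"
        return mapping[value]

    for entity, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, e=entity: placeholder(e, m.group()), text)
    # User-defined deny terms: company-specific strings the model must not see
    for term in deny_terms:
        text = text.replace(term, placeholder("DENY_TERM", term))
    return text, mapping

note = "Reach Jane at jane@example.com or 555-867-5309 re: Project Falcon."
clean, mapping = redact(note, deny_terms=["Project Falcon"])
print(clean)  # Reach Jane at [ EMAIL_1 ] or [ PHONE_1 ] re: [ DENY_TERM_1 ].
```

The `deny_terms` parameter mirrors the allow/deny-term capability described above: “Project Falcon” is a hypothetical company-specific term that gets de-identified alongside the detected PII.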

Redaction example

Consider a physician’s note in which HIPAA-regulated data and other forms of sensitive information have been detected and de-identified:

Patient: [ PATIENT_NAME_1 ]
Date of Birth: [ DOB_1 ]
Physician: [ PHYSICIAN_NAME_1 ]
Date of Visit: [ DATE_1 ]

Reason for Visit: Follow-up from previous visit in fall on [ DATE_2 ].

Diagnosis: None

Treatment Plan: Tylenol for pain, follow-up in 6 weeks

Notes: Patient has recovered fully from last visit. Radiologist report by
[ RADIOLOGIST_NAME_1 ] dated [ DATE_3 ] shows no evidence of fractures.

Summary:

[ PATIENT_NAME_1 ] is a [ AGE_1 ]-year old male who presented to the clinic
in a follow-up from a previous visit for a fall. He reports that he has
recovered fully from the fall and has no pain. Radiologist report by
[ RADIOLOGIST_NAME_1 ] shows no evidence of fractures. [ PATIENT_NAME_1 ]
was advised to take Tylenol for pain and follow up in 6 weeks.

Signature: [ PHYSICIAN_NAME_1 ]

This approach preserves relevant data relationships within the model. After redaction, direct identifiers are removed, allowing the data to be used across various AI tools with reduced privacy concerns. The de-identified data is then used to train embeddings or applied to models like Amazon Titan.

Since the source data no longer contains PII, the resulting embeddings are also free of PII, reducing overfitting risks and preventing the model from learning PII-related characteristics.

Now let’s look at how you can preserve data privacy while models are in-use.

Preserving privacy during inference

During inference, the goal is to support natural user interactions: the chatbot should be able to accept PII from users and refer to them by it where appropriate.

To achieve this, you can use Skyflow’s privacy gateway to re-identify the sensitive data, applying configured access policies to control access. The following components support privacy-safe inference in the system architecture.

  • Skyflow integration – Skyflow serves as the initial collection point for user-initiated conversations, replacing PII with unique vault-generated tokens that can only be reversed by Skyflow.
  • Multi-tenant chat system – The conversation, now using tokens as stand-ins for PII, is forwarded to the multi-tenant chat system. This system conducts user authentication and authorization processes. Skyflow can be configured to preserve OpenID Connect (OIDC) interactions, allowing existing authentication flows to operate without modification.
  • Access control – Upon successful authentication, the chatbot gains access to tenant-specific content, such as embeddings and the LLM. This allows the chatbot to engage with users naturally, incorporating the replaced PII tokens as needed. Refer to the previously mentioned blog, Build a multi-tenant chatbot with RAG using Amazon Bedrock and Amazon EKS, for a comprehensive understanding of those components of the following architecture.

The result is an architecture (Figure 4) that supports privacy-safe inference: even if users provide their name and Social Security number during a chat interaction, those details remain private.

Figure 4 – An architecture that provides privacy-safe model inference for a chatbot; the Skyflow vault de-identifies sensitive data from prompts before inference takes place
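The tokenize, infer, detokenize flow described above can be sketched end to end in Python. Everything here is a hypothetical stand-in: `fake_llm` takes the place of an Amazon Bedrock invocation (a real call would go through the boto3 `bedrock-runtime` client), and the token format and single detection pattern are illustrative assumptions, not Skyflow’s actual scheme.

```python
import re

VAULT = {}  # token -> plaintext; lives behind the privacy gateway

def tokenize_prompt(prompt):
    """De-identify a prompt before it reaches the model."""
    def swap(match):
        token = f"<TOK{len(VAULT)}>"
        VAULT[token] = match.group()
        return token
    # SSN-shaped values stand in for full PII detection
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", swap, prompt)

def fake_llm(prompt):
    # Stand-in for an Amazon Bedrock invocation; the reply echoes tokens,
    # never plaintext, because the model only ever saw tokens
    return f"Acknowledged: {prompt}"

def detokenize(text, authorized):
    # Re-identify only for authorized users, per the vault's access policies
    for token, value in VAULT.items():
        text = text.replace(token, value if authorized else "[REDACTED]")
    return text

user_prompt = "Please update my SSN to 123-45-6789"
safe_prompt = tokenize_prompt(user_prompt)   # the model never sees the SSN
model_reply = fake_llm(safe_prompt)
print(detokenize(model_reply, authorized=True))
print(detokenize(model_reply, authorized=False))
```

An authorized user sees their own SSN echoed back naturally; an unauthorized viewer of the same conversation sees only a redacted value, and the model itself handled nothing but tokens.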

The chatbot integrates with Amazon DynamoDB to handle user session state management. This integration remains unchanged, with one key distinction: the stored dataset, including both content and metadata, no longer contains sensitive identifiers. Instead, it retains only the redacted versions.
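As a sketch of what this looks like, the snippet below persists only redacted content in session state. This is an illustrative simplification: a plain dict stands in for the DynamoDB table (a real deployment would use the boto3 DynamoDB client), and the record shape and token format are hypothetical.

```python
session_table = {}  # stand-in for the DynamoDB session-state table

def save_turn(session_id, tenant_id, redacted_message):
    """Persist one chat turn; both content and metadata hold only tokens."""
    item = {
        "session_id": session_id,
        "tenant_id": tenant_id,
        "content": redacted_message,  # already de-identified upstream
    }
    session_table.setdefault(session_id, []).append(item)

# The message was redacted by the privacy gateway before it got here
save_turn("sess-1", "tenant_a", "My SSN is <TOK_SSN_1>")
print(session_table["sess-1"][0]["content"])  # My SSN is <TOK_SSN_1>
```

Because no plaintext identifier ever reaches the table, a breach of the session store exposes only tokens that are useless without the vault.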

Key features and benefits of this approach include the following:

  • Privacy in multi-tenant environments – Skyflow’s reversible tokenization creates distinct redacted identifiers for each tenant, thus preserving privacy in a multi-tenant setting.
  • Transparent reversibility to original PII – The tokenization process is transparently reversible to the original PII, facilitating natural and two-way chat conversations between users and chatbots.
  • Global uniqueness for the tenant – The tokenization scheme is globally unique for each tenant, allowing a seamless end-to-end session without the need for explicit integration between Skyflow and the DynamoDB instance.

This approach maintains the functionality of DynamoDB within the chatbot system while upholding privacy standards and enabling secure, natural interactions between users and chatbots.
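One way to picture the “global uniqueness for the tenant” property is deterministic tokenization keyed per tenant, for example with HMAC. This is an assumption for illustration, not Skyflow’s actual algorithm, and it shows only the uniqueness property; reversibility additionally requires the vault to keep a token-to-plaintext mapping. The tenant keys below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical per-tenant keys held by the vault
TENANT_KEYS = {"tenant_a": b"key-a", "tenant_b": b"key-b"}

def tenant_token(tenant, value):
    """Deterministic, tenant-scoped token for a sensitive value."""
    digest = hmac.new(TENANT_KEYS[tenant], value.encode(), hashlib.sha256)
    return f"tok_{digest.hexdigest()[:12]}"

# Deterministic within a tenant: DynamoDB session lookups keep working
assert tenant_token("tenant_a", "jane@example.com") == \
       tenant_token("tenant_a", "jane@example.com")

# Distinct across tenants: no cross-tenant linkage of the same value
assert tenant_token("tenant_a", "jane@example.com") != \
       tenant_token("tenant_b", "jane@example.com")
```

Determinism is what lets the chatbot and DynamoDB operate on tokens end to end without any explicit integration with the vault, while tenant-scoped keys keep one tenant’s tokens meaningless to another.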

Conclusion

Integrating Skyflow into a multi-tenant RAG model secures sensitive data throughout a generative AI application’s lifecycle. Skyflow’s AI gateway allows organizations to safely deploy AI on AWS. For more information, visit Skyflow on AWS Marketplace or contact Skyflow for a demo.





Skyflow – AWS Partner Spotlight

Skyflow is an AWS Partner that isolates, protects, and governs sensitive information like PII, PHI, and PCI-regulated data. Skyflow customers can host data privacy vaults anywhere in the world simultaneously, with total control over data residency and access.

Contact Skyflow | Partner Overview | AWS Marketplace