Introduction
Let’s be honest: the current AI hype is exhausting.
Every company I talk to wants to “do something with AI.” Most don’t even know why - it just feels like something they should be doing. The result? Teams rush to train models on whatever data they can get their hands on, often without thinking about what’s actually inside those datasets.
I’ve seen it too many times: internal data lakes, storage buckets, BigQuery tables - all thrown into the AI blender without a second thought. Privacy? Compliance? Data classification? That’ll come later - maybe.
This is exactly why GCP Sensitive Data Protection exists. It helps you identify sensitive information - like PII, secrets, or payment data - across key services like:
- Cloud Storage
- BigQuery
- Cloud SQL
- Secrets Manager
- Vertex AI Datasets
And the best part? You can send findings directly to Google Security Command Center (SCC), Pub/Sub, or BigQuery, and automatically assign them to the right teams for remediation. It’s a scalable way to bring data governance and privacy back into the AI development loop.
How to Configure Sensitive Data Discovery in Google Cloud
Setting up GCP Sensitive Data Protection isn’t complicated, but doing it right ensures that you’re actually catching sensitive data. Here’s how to configure discovery from the ground up.
1. Choose What to Scan: Set the Discovery Target Type
Start by selecting the discovery target type. This defines the kind of resources you want to scan - for example, Cloud Storage, BigQuery, or Cloud SQL. GCP offers support for a growing number of services, so make sure you choose the relevant ones for your environment.

2. Define the Scope: Organization, Folder, or Project
Next, choose whether to scan your entire organization, a specific folder, or just a project. This is an important decision - you can only have one configuration per organization, folder, or project. Keep in mind that centralizing this at the org level often makes the most sense for governance.
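To make steps 1 and 2 concrete, here's a minimal sketch using the google-cloud-dlp Python client. It creates an org-scoped discovery configuration with a single BigQuery target. The organization ID and container project are placeholders, and the exact DiscoveryConfig field names are assumptions based on my reading of the DLP v2 API - double-check them against your client library version.

import google.cloud.dlp_v2 as dlp_v2

# Placeholder identifiers - replace with your own values.
ORG_ID = "123456789012"
CONTAINER_PROJECT = "sdp-container-project"  # service agent container project (step 6)
PARENT = f"organizations/{ORG_ID}/locations/global"

client = dlp_v2.DlpServiceClient()

# Assumed DiscoveryConfig shape: scoped to the whole organization,
# with one BigQuery discovery target that profiles everything it can see.
discovery_config = {
    "display_name": "org-wide-bigquery-discovery",
    "status": "RUNNING",
    "org_config": {
        "project_id": CONTAINER_PROJECT,
        "location": {"organization_id": ORG_ID},
    },
    "targets": [
        {"big_query_target": {"filter": {"other_tables": {}}}},
    ],
}

response = client.create_discovery_config(
    request={"parent": PARENT, "discovery_config": discovery_config}
)
print("Created discovery config:", response.name)

For a folder- or project-scoped configuration, the parent string and the scope block change accordingly; the rest of the sketch stays the same.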

3. Control When and What to Profile: Create a Discovery Schedule
This step lets you fine-tune when profiling happens and what data gets scanned.
- Use filters to include or exclude datasets or resources.
- Set minimum conditions to delay profiling until a table has enough rows or is old enough.
- Add a time condition to skip outdated tables entirely - useful for avoiding noisy or irrelevant results.
This gives you the flexibility to profile only what really matters, reducing cost and improving scan quality.
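As a rough illustration of step 3, here's what the filter and condition portion of a BigQuery target might look like with the same Python client. The regex pattern and thresholds are made-up examples, and the field names are assumptions to verify against your library version.

# Assumed shape of a BigQuery discovery target with filters and minimum conditions.
big_query_target = {
    "filter": {
        "tables": {
            "include_regexes": {
                "patterns": [
                    {
                        # Hypothetical filter: only profile analytics datasets.
                        "project_id_regex": ".*",
                        "dataset_id_regex": "analytics_.*",
                    }
                ]
            }
        }
    },
    "conditions": {
        # Skip tables created before this date to avoid outdated results.
        "created_after": "2024-01-01T00:00:00Z",
        # Delay profiling until a table has enough rows or is old enough.
        "or_conditions": {
            "min_row_count": 1000,
            "min_age": {"seconds": 86400},  # at least one day old
        },
    },
}

This dictionary would take the place of the bare big_query_target from the earlier sketch.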

4. Select Relevant InfoTypes to Scan For
You don’t need to scan everything. Focus on InfoTypes that are relevant to your business - like credit card numbers, national IDs, or custom identifiers. This reduces noise and helps you avoid false positives.
If you're feeding these results into automated workflows, staying targeted here is critical.
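One way to stay targeted is to define an inspect template that contains only the detectors you care about and point your discovery configuration at it. The snippet below is a sketch with a few example infoTypes and a made-up custom identifier; adjust the parent and the detector list to your environment.

import google.cloud.dlp_v2 as dlp_v2

client = dlp_v2.DlpServiceClient()
PARENT = "organizations/123456789012"  # placeholder; projects/PROJECT_ID also works

template = {
    "display_name": "ai-training-data-detectors",
    "inspect_config": {
        # Built-in detectors relevant to the business.
        "info_types": [
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "EMAIL_ADDRESS"},
            {"name": "IBAN_CODE"},
        ],
        # Hypothetical custom identifier, e.g. internal employee IDs.
        "custom_info_types": [
            {
                "info_type": {"name": "EMPLOYEE_ID"},
                "regex": {"pattern": r"EMP-\d{6}"},
            }
        ],
        # Raise the likelihood threshold to cut down on false positives.
        "min_likelihood": "LIKELY",
    },
}

response = client.create_inspect_template(
    request={"parent": PARENT, "inspect_template": template}
)
print("Created inspect template:", response.name)

The returned template name is what you then reference from the discovery configuration.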

5. Choose Where to Send the Scan Results
By default, Sensitive Data Protection findings can be integrated with Security Command Center (SCC) - and this is usually the best path. It centralizes your findings, supports auto-assignment, and gives your security teams one place to work from.
You can also send findings to Pub/Sub, but this often leads to fragmentation.
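For completeness, this is roughly what the actions part of a discovery configuration looks like if you also want profiles exported to BigQuery and a Pub/Sub notification for new profiles - again a sketch with placeholder names and assumed field names, while the SCC integration itself is the default path described above.

# Assumed shape of the actions list in a discovery configuration.
actions = [
    {
        # Keep a queryable copy of every profile in BigQuery.
        "export_data": {
            "profile_table": {
                "project_id": "sdp-container-project",  # placeholder
                "dataset_id": "sensitive_data_protection_discovery",
                "table_id": "discovery_profiles",
            }
        }
    },
    {
        # Notify downstream automation when a new profile is generated.
        "pub_sub_notification": {
            "topic": "projects/sdp-container-project/topics/sdp-profiles",  # placeholder
            "event": "NEW_PROFILE",
        }
    },
]

These entries would go into the actions list of the discovery configuration sketched earlier.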

6. Create or Reuse a Service Agent Container
Each discovery configuration needs a service agent container project, which provides the necessary identity and permissions.
You can either:
- Create a new project for this.
- Or reuse an existing one.
Make sure the DLP API service agent has these roles:
- roles/dlp.orgDriver
- roles/storage.admin
- roles/resourcemanager.tagUser
Without these, discovery jobs will fail.
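The service agent that needs these roles lives in the container project and, as far as I can tell, follows the usual service agent naming pattern. A tiny helper like the one below just prints the gcloud bindings you'd run at the organization level - the project number and org ID are placeholders.

# Placeholders - replace with your container project number and org ID.
CONTAINER_PROJECT_NUMBER = "123456789012"
ORG_ID = "987654321098"

# Assumed naming pattern for the DLP API service agent.
dlp_service_agent = (
    f"serviceAccount:service-{CONTAINER_PROJECT_NUMBER}"
    "@dlp-api.iam.gserviceaccount.com"
)

required_roles = [
    "roles/dlp.orgDriver",
    "roles/storage.admin",
    "roles/resourcemanager.tagUser",
]

for role in required_roles:
    print(
        f"gcloud organizations add-iam-policy-binding {ORG_ID} "
        f"--member='{dlp_service_agent}' --role='{role}'"
    )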

7. Configure Fallback Processing Locations (Optional)
By default, Sensitive Data Protection processes data in the same region where it’s stored. But not every feature is available in all regions. If needed, set a fallback location to ensure scans still work when regional support is limited.

8. Set the Configuration Storage Location
Finally, define where the discovery configuration itself will be stored. This setting does not impact where data is profiled - it only determines the region for the configuration metadata.

What Happens Next: Viewing Your Sensitive Data Discovery Results
After the configuration is in place and your discovery schedules are running, you might need to wait a bit - sometimes even hours - before you see any results. But once the scans complete, GCP gives you a nice aggregated view of where sensitive data lives across your environment.

Your findings also automatically propagate to Security Command Center (SCC) if configured. It’s incredibly useful, especially if you integrate this with automation - for example, assigning findings to the correct team based on project or label.

You’ll also see that your service agent container project now shows data in BigQuery. Here's a simple example query that counts the discovered data profiles per project:
SELECT
  file_store_profile.project_id AS project_id,
  COUNT(*) AS data_source_count
FROM
  `sensitive_data_protection_discovery.discovery_profiles_latest_v1*`
GROUP BY
  project_id
ORDER BY
  data_source_count DESC

This is just the beginning. Once you know where sensitive data lives, you can start asking the more important question: “Should it be there in the first place?” And that’s exactly where security context, governance, and yes - sometimes even automation with Cloud Run Functions - can take over.
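As a taste of what that automation could look like, here's a bare-bones Cloud Run function sketch that reacts to the Pub/Sub notifications from step 5 and routes them by project prefix. The payload parsing and field names are assumptions for illustration - the real message format depends on what you publish to the topic.

import base64
import json

import functions_framework

# Hypothetical routing table: project prefix -> owning team.
TEAM_BY_PROJECT_PREFIX = {
    "payments-": "team-payments",
    "marketing-": "team-marketing",
}

@functions_framework.cloud_event
def route_profile_notification(cloud_event):
    """Route a discovery notification to the team that owns the project."""
    # Pub/Sub delivers the payload base64-encoded inside 'message.data'.
    raw = base64.b64decode(cloud_event.data["message"]["data"])

    # Assumption for this sketch: the payload is JSON with a projectId field.
    notification = json.loads(raw.decode("utf-8"))
    project_id = notification.get("projectId", "unknown")

    team = next(
        (team for prefix, team in TEAM_BY_PROJECT_PREFIX.items()
         if project_id.startswith(prefix)),
        "team-security",  # fallback owner
    )
    print(f"Assigning findings for {project_id} to {team}")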
Common Pitfall: DLP API Not Enabled or Project Not Fully Ready
One issue I’ve seen more than once: you configure everything correctly, hit "create," and then get this error:
“Failed to enable the DLP API service. Please try enabling manually here: https://console.cloud.google.com/apis/library/dlp.googleapis.com?project=YOUR_PROJECT_ID”
There are usually two root causes:
- The DLP API isn’t enabled yet. You’ll need to manually enable it for your container project.
- The project hasn’t fully initialized. Even if you just created the container project, it may take a few minutes before it’s ready for DLP to hook into it. In that case:
  - Wait a few minutes
  - Manually verify billing is enabled
  - And yes… you might need to restart the configuration process from step one.
It’s frustrating, but once it’s set up, it runs very reliably.
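If you want to sanity-check both the API and billing from code before retrying, something like this works with the Service Usage and Cloud Billing APIs (google-api-python-client with default credentials); the project ID is a placeholder.

from googleapiclient import discovery

PROJECT_ID = "sdp-container-project"  # placeholder container project

# 1. Is the DLP API enabled?
serviceusage = discovery.build("serviceusage", "v1")
service_name = f"projects/{PROJECT_ID}/services/dlp.googleapis.com"
state = serviceusage.services().get(name=service_name).execute().get("state")
print("DLP API state:", state)
if state != "ENABLED":
    # Starts a long-running operation; give it a minute before retrying.
    serviceusage.services().enable(name=service_name).execute()

# 2. Is billing attached to the project?
cloudbilling = discovery.build("cloudbilling", "v1")
billing_info = (
    cloudbilling.projects()
    .getBillingInfo(name=f"projects/{PROJECT_ID}")
    .execute()
)
print("Billing enabled:", billing_info.get("billingEnabled", False))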