AWS Glue Integration

Automated data protection using Glue and the Playground

Author: Dilraj Singh

About this Sample

This scenario explores automatic data discovery and protection for an unstructured text file. As data lands in S3, it is de-identified before it is made available for further processing.

This code sample combines AWS Glue with Protegrity’s API Playground. AWS Glue is a serverless data integration service that is widely used as an ETL tool to move data to and from AWS and non-AWS data sources. We will use Glue to orchestrate an ETL job that picks up data received in an S3 bucket, sends it to the Playground for automatic data classification and protection, and writes the result to a target output bucket.

Prerequisites

  • Protegrity API Playground activated account
  • Access to an AWS non-production account
  • Access to Glue, S3, IAM, Lambda, and CloudWatch

IAM Role Setup

  1. Create 2 new IAM roles:
  • LambdaInvokeGlue for the Lambda service. The role must be able to run Glue jobs and create logs.
  • GlueS3ReadWrite for the Glue service. The role must be able to read from and write to S3 and create logs. You can start from the AWSGlueServiceRole policy and tune it as necessary.
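
The permissions described for the LambdaInvokeGlue role can be written down as IAM policy documents. The following is a minimal sketch rather than a vetted policy: the helper names are hypothetical, and the wildcard region and account in the resource ARNs are placeholders that you would normally narrow to your own environment.

```python
import json


def lambda_invoke_glue_policy(job_name: str) -> dict:
    """Permissions for the LambdaInvokeGlue role: start one Glue job and write logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:StartJobRun"],
                # Region and account are wildcards here (assumption); narrow them.
                "Resource": f"arn:aws:glue:*:*:job/{job_name}",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


def trust_policy(service: str) -> dict:
    """Trust policy that lets an AWS service (e.g. lambda.amazonaws.com) assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(lambda_invoke_glue_policy("GlueClassifyProtectWrite"), indent=2))
    print(json.dumps(trust_policy("lambda.amazonaws.com"), indent=2))
```

The printed JSON can be pasted into the IAM console when creating the role. The GlueS3ReadWrite role would use glue.amazonaws.com as the trusted service instead.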

Lambda Setup

  1. Go to AWS Lambda and create a new Lambda function. Call it GlueTrigger (Lambda function names cannot contain spaces). Set the runtime to Python 3.13 and select the LambdaInvokeGlue IAM role as the function’s execution role.

  2. Once the serverless function is created, paste the following into the code source and click Deploy:

    
    import boto3
    from urllib.parse import unquote_plus
    
    def lambda_handler(event, context):
        # Name of the Glue job to start; must match the job created in Glue Setup
        JobName = 'GlueClassifyProtectWrite'
        # Use the region in which the Glue job was created
        glue_client = boto3.client('glue', region_name='us-east-1')
        print(event)
        source_bucket = event['Records'][0]['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications (e.g. spaces as '+')
        source_key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    
        if source_key.endswith('.txt'):
            # Pass bucket and object key to the Glue job
            response = glue_client.start_job_run(
                JobName=JobName,
                Arguments={
                    '--JobName': JobName,
                    '--source_bucket': source_bucket,
                    '--source_key': source_key
                }
            )
    
            return {
                'statusCode': 200,
                'body': f"Started Glue job with ID {response['JobRunId']}"
            }
        else:
            return {
                'statusCode': 200,
                'body': "Unsupported file format provided, ignoring."
            }
        

    This is the trigger definition that will start a Glue job when a text file is found in the S3 bucket.
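
To see what the handler reads from the incoming event, it can help to run the extraction step locally against a hand-written event. The sketch below is illustrative only: extract_s3_object is a hypothetical helper, and the sample event is trimmed to the fields the handler uses, with made-up bucket and key names. It also decodes the object key, because S3 event notifications deliver keys URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_s3_object(event: dict) -> tuple[str, str]:
    """Return (bucket, key) for the first record of an S3 event notification."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Keys are URL-encoded in the event; a space arrives as '+'.
    key = unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


# Trimmed-down sample event (hypothetical bucket and key names).
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket-input"},
                "object": {"key": "letters/dispute+letter.txt"},
            }
        }
    ]
}

if __name__ == "__main__":
    print(extract_s3_object(sample_event))
```

The same event can be pasted as a test event in the Lambda console to exercise the deployed function without uploading a file.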

S3 Setup

  1. Create 2 buckets for this exercise. Call them anything you’d like, and to distinguish between the two, append -input and -output to their names. The policy of the -input bucket should allow the GlueS3ReadWrite role to read files with s3:GetObject, and the policy of the -output bucket should allow the role to write files with s3:PutObject.

  2. Set up a new event notification on the -input bucket. Call it TextTrigger and set the suffix filter to .txt so that only text files are picked up. Set the destination to the GlueTrigger Lambda function.
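
The event notification can also be expressed as the configuration document that boto3 accepts. This is a sketch under assumptions: the helper names are hypothetical, and the event type s3:ObjectCreated:* is a guess at a sensible choice, since this sample does not state which event types to select.

```python
def text_trigger_configuration(lambda_arn: str) -> dict:
    """Build a notification configuration that forwards new .txt objects to a Lambda."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "TextTrigger",
                "LambdaFunctionArn": lambda_arn,
                # Assumption: react to every kind of object creation.
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".txt"}]}
                },
            }
        ]
    }


def apply_configuration(bucket: str, lambda_arn: str) -> None:
    """Apply the configuration to a bucket (replaces any existing notifications)."""
    import boto3  # imported here so the builder above works without AWS access

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=text_trigger_configuration(lambda_arn),
    )
```

Note that put_bucket_notification_configuration overwrites the bucket's whole notification configuration, and that S3 must also be allowed to invoke the function through a resource-based permission on the Lambda side.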

Glue Setup

  1. The final step is creating the Glue job. Open AWS Glue and create a new ETL job. Let’s call it GlueClassifyProtectWrite (if you choose a different name, make sure to update the Lambda function). Set the type to Spark, version to Glue 5.0 and the language to Python 3. The execution role should be set to the Glue service role, GlueS3ReadWrite. You can keep the other defaults.

  2. In the script, paste the following code. In the next step we will update the sample to match your own environment.

    
    import requests
    import boto3
    import sys
    from awsglue.utils import getResolvedOptions
    
    args = getResolvedOptions(sys.argv,
                              ['JobName',
                               'source_bucket',
                               'source_key'])
    
    # S3 Bucket and File configuration
    source_bucket = args['source_bucket']
    source_key = args['source_key']
    # Replace with the name of your own -output bucket
    target_bucket = "api-playground-glue-output"
    target_key = source_key.replace(".txt", "-protected.txt")
    
    # Connection to S3
    s3 = boto3.client('s3', region_name = "")
    
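    # Playground Login
    # TLS certificate verification is left enabled (the requests default),
    # because this call sends your credentials over the network
    logon_response = requests.post('https://api.playground.protegrity.com/auth/login',
                                  headers={'Content-Type': 'application/json'},
                                  json={ "email": "",
                                         "password": ""})
    # Stop the job early if the login was rejected
    logon_response.raise_for_status()
    
    # Retrieve JWT Token to authenticate requests
    response_data = logon_response.json()
    JWT_TOKEN = response_data['jwt_token']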
    
    # API Playground URL
    API_URL     = "https://api.playground.protegrity.com/v1/ai"
    API_KEY   = ""
    API_VERSION = "v1"
    
    # Request headers
    headers = {
        'Content-Type': 'application/json',
        'x-api-key': API_KEY,
        'Authorization': f"Bearer {JWT_TOKEN}"
    }
    
    # Read function
    def read_bucket(source_bucket, source_key):
        try:
            # Get the file from the source bucket
            response = s3.get_object(Bucket=source_bucket, Key=source_key)
            # Get the file content (binary); it is not logged because it is still unprotected
            file_content = response['Body'].read()
            # Assuming it's a text file, decode it to a string
            file_text = file_content.decode('utf-8')
            print(f"Successfully read the file {source_bucket}/{source_key}")
        except Exception as e:
            print(f"Error: {e}")
            # Re-raise so the job fails instead of continuing without data
            raise
        return file_text
    
    # Write function
    def write_bucket(target_bucket, target_key, file_text):
        try:
            s3.put_object(Bucket=target_bucket, Key=target_key, Body=file_text)
            print(f"Successfully copied {source_key} from {source_bucket} to {target_bucket}/{target_key}")
        except Exception as e:
            print(f"Error: {e}")
            # Re-raise so a failed write marks the job run as failed
            raise
    
    # Classify and Protect function – Calls API Playground
    def classify_protect(file_text):
        data_json = {"operation": "protect",
                     "options": {"type": "mask", "tags": False, "threshold": 0.6},
                     "data": [file_text]}
        response = requests.post(API_URL, json=data_json, headers=headers)
        # Surface HTTP errors (e.g. a rejected token) instead of parsing an error body
        response.raise_for_status()
        return response.json()
    
    # Run the job
    file_text = read_bucket(source_bucket, source_key)
    api_response = classify_protect(file_text)
    response_data = api_response["results"]
    write_bucket(target_bucket, target_key, response_data)
        
  3. Adjust the script with your environment details, specifically the following lines:

  • within the S3 Bucket and File configuration section, set target_bucket to the name of your -output bucket
  • within the Connection to S3 section, provide the region of your input and output buckets, e.g. s3 = boto3.client('s3', region_name = "us-east-1")
  • within the Playground Login section, specify the email and password you use to authenticate with the Playground
  • within the API Playground URL section, provide the API Key used to authorize your Playground requests

  4. Save your Glue job. We’re ready to roll!
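
The console settings chosen in the Glue Setup steps can also be captured as arguments for the boto3 create_job call, which makes the job easier to recreate. This is a sketch only: glue_job_definition is a hypothetical helper, the script location is a placeholder you would point at your uploaded script, and the worker type and worker count are assumed values standing in for the console defaults.

```python
def glue_job_definition(script_location: str, role: str = "GlueS3ReadWrite") -> dict:
    """Build keyword arguments for glue.create_job that mirror the console steps."""
    return {
        "Name": "GlueClassifyProtectWrite",
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "5.0",
        # Assumed sizing; adjust to whatever your account uses by default.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


def create_job(script_location: str) -> str:
    """Create the job and return its name."""
    import boto3  # imported here so the builder above works without AWS access

    response = boto3.client("glue").create_job(**glue_job_definition(script_location))
    return response["Name"]
```

Keeping credentials out of the script itself (for example in job parameters or a secrets store) is worth considering before scripting the deployment this way.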

Automatic Data Classification and Protection of Unstructured Files

This scenario showcases processing of an unstructured text file. The file contents are classified and protected entirely by the Protegrity API Playground.

  1. Go to the -input bucket and drop a text file there that you wish to de-identify. You can use our sample or provide your own.

    
    Alexandra Rivera
    Address: 1258 Maplewood Drive
    Springfield, IL 62704
    Phone: (217) 555-3927
    Email: arivera82@email.com
    
    Date: March 21, 2025
    
    To:
    Customer Disputes Department
    First Horizon Credit Bank
    4801 Westlake Blvd
    Austin, TX 73301
    
    Subject: Dispute of Unauthorized Credit Card Charge
    
    Dear Customer Disputes Department,
    
    I am writing to formally dispute a charge on my credit card account that I did not authorize.
    
    Cardholder Name: Alexandra Rivera
    Credit Card Number: 3709888761001982
    Date of Charge: March 17, 2025
    Amount: $198.76
    Merchant Name: "TechMart Online – NY"
    
    I did not authorize this transaction and have never conducted business with the above-mentioned merchant. I became aware of this charge after reviewing my recent statement and immediately verified that neither I nor anyone with authorized access to my account made this purchase.
    
    In accordance with the Fair Credit Billing Act, I am requesting that this charge be removed from my account, that any related interest or fees be reversed, and that a corrected statement be issued as soon as possible. Please investigate this matter and notify me of the outcome.
    
    Enclosed with this letter is a copy of my most recent statement highlighting the disputed charge. I have also taken the precaution of temporarily suspending the card to prevent any further unauthorized use.
    
    Please confirm receipt of this letter and provide a timeline for the resolution of this issue. Should you require any additional information or documentation, feel free to contact me at the number or email address listed above.
    
    Thank you for your prompt attention to this matter.
    
    Sincerely,
    Alexandra Rivera
        
  2. The processing will take up to a minute to complete. Monitor the Glue logs to see the progress.

  3. Once the job finishes, you will see a de-identified file in your -output bucket. It will look similar to this:

    
    ################
    Address: 1258 ###############
    ###########, ## 62704
    Phone: ##############
    Email: ###################
    
    Date: March 21, 2025
    
    To:
    Customer Disputes Department
    First Horizon Credit Bank
    4801 #############
    ######, ## 73301
    
    Subject: Dispute of Unauthorized Credit Card Charge
    
    Dear Customer Disputes Department,
    
    I am writing to formally dispute a charge on my credit card account that I did not authorize.
    
    Cardholder Name: ################
    Credit Card Number: ################
    Date of Charge: March 17, 2025
    Amount: $198.76
    Merchant Name: "TechMart Online – NY"
    
    I did not authorize this transaction and have never conducted business with the above-mentioned merchant. I became aware of this charge after reviewing my recent statement and immediately verified that neither I nor anyone with authorized access to my account made this purchase.
    
    In accordance with the Fair Credit Billing Act, I am requesting that this charge be removed from my account, that any related interest or fees be reversed, and that a corrected statement be issued as soon as possible. Please investigate this matter and notify me of the outcome.
    
    Enclosed with this letter is a copy of my most recent statement highlighting the disputed charge. I have also taken the precaution of temporarily suspending the card to prevent any further unauthorized use.
    
    Please confirm receipt of this letter and provide a timeline for the resolution of this issue. Should you require any additional information or documentation, feel free to contact me at the number or email address listed above.
    
    Thank you for your prompt attention to this matter.
    
    Sincerely,
    ################
        

    Feel free to adjust the options in the classify_protect function of your Glue job: raise or lower the threshold to match your risk tolerance, set tags to True to add classification tags, or choose another protection type.
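
One way to experiment with these options is to build the request body in a small helper instead of editing the literal dictionary. The sketch below reuses the payload shape from the Glue script; build_payload is a hypothetical name, the 0 to 1 range check on threshold is an assumption, and the only protection type confirmed by this sample is "mask", so consult the Playground documentation for other values.

```python
def build_payload(
    text: str,
    protection_type: str = "mask",
    tags: bool = False,
    threshold: float = 0.6,
) -> dict:
    """Build the request body sent to the Playground by classify_protect."""
    # Assumption: the threshold is a confidence score between 0 and 1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is assumed to be between 0 and 1")
    return {
        "operation": "protect",
        "options": {"type": protection_type, "tags": tags, "threshold": threshold},
        "data": [text],
    }


if __name__ == "__main__":
    # Same request as the sample script, but with tags on and a higher threshold.
    print(build_payload("Phone: (217) 555-3927", tags=True, threshold=0.8))
```

With the defaults, the helper produces the same body as the script above, so it can be swapped into classify_protect without changing behavior.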

Summary

Pairing AWS Glue with Protegrity lets you accept custom file uploads to the cloud while keeping tight control over what is being shared. With the flexibility of Glue and the API-first approach of the Playground (and, by extension, Protegrity), the job is automated and scalable, so that data security best practices are embedded in your cloud platform from the start.

This sample can be further extended to other file formats, other target systems, and platforms. You may want to leverage a data catalog to scan and store the file metadata (recommended for structured files). You may also decide to combine this example with our other samples (such as data unprotection in Snowflake).



Last modified March 21, 2025