AWS Glue Integration
Author: Dilraj Singh
About this Sample
In this scenario we will explore the use case of automatic data discovery and protection of an unstructured text file. As data lands in S3, it will be de-identified before it is made available for future processing.
This code sample combines AWS Glue with Protegrity’s API Playground. AWS Glue is a serverless data integration service that is widely used as an ETL tool to move data to and from AWS and non-AWS data sources. We will use Glue to orchestrate an ETL job that picks up data received in an S3 bucket, sends it to the Playground for automatic data classification and protection, and writes the result to a target output bucket.
Disclaimer
Non-GA Functionality: for demonstration purposes only. Note that Protegrity GenAI Security features are currently in Preview. Protegrity releases and supports an official product, the S3 Cloud Storage Protector, to protect data in S3. The product is optimized for best performance, scalability, and security. We advise using this sample only to demonstrate the functionality.
Prerequisites
- Protegrity API Playground activated account
- Access to an AWS non-production account
- Access to Glue, S3, IAM, Lambda, and CloudWatch
IAM Role Setup
- Create 2 new IAM roles:
  - LambdaInvokeGlue for the Lambda service. The role must be able to run Glue jobs and create logs.
  - GlueS3ReadWrite for the Glue service. The role must be able to read from and write to S3 and create logs. You can use the AWSGlueServiceRole policy and tune it as necessary.
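As a rough illustration of what the LambdaInvokeGlue role needs, the sketch below builds a permissions policy and a trust policy as plain Python dictionaries. The action names are standard IAM actions, but the helper names, the account ID, and the region in the ARN are hypothetical, and you may want to narrow the log permissions for your own account.

```python
import json


def lambda_invoke_glue_policy(glue_job_arn: str) -> dict:
    """Permissions policy sketch for the LambdaInvokeGlue role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the Lambda function start the Glue job
                "Effect": "Allow",
                "Action": ["glue:StartJobRun"],
                "Resource": glue_job_arn,
            },
            {
                # Lets the Lambda function write its own CloudWatch logs
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


def trust_policy(service_principal: str) -> dict:
    """Trust policy that lets an AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical account ID and region, for illustration only
job_arn = "arn:aws:glue:us-east-1:123456789012:job/GlueClassifyProtectWrite"
print(json.dumps(lambda_invoke_glue_policy(job_arn), indent=2))
print(json.dumps(trust_policy("lambda.amazonaws.com"), indent=2))
```

The same trust_policy helper can be called with "glue.amazonaws.com" for the GlueS3ReadWrite role.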
Lambda Setup
- Go to AWS Lambda and create a new Lambda function. Call it: Glue Trigger. Set the runtime to Python 3.13. Attach the LambdaInvokeGlue IAM role to the function execution.
- Once the serverless function is created, paste the following into the code source and hit deploy:
    import boto3
    from urllib.parse import unquote_plus

    def lambda_handler(event, context):
        # Name of the Glue job to start; must match the job created in Glue Setup
        JobName = 'GlueClassifyProtectWrite'
        glue_client = boto3.client('glue', region_name='us-east-1')
        print(event)

        source_bucket = event['Records'][0]['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        source_key = unquote_plus(event['Records'][0]['s3']['object']['key'])

        if source_key.endswith('.txt'):
            # Pass bucket and object key to the Glue job
            response = glue_client.start_job_run(
                JobName=JobName,
                Arguments={
                    '--JobName': JobName,
                    '--source_bucket': source_bucket,
                    '--source_key': source_key
                }
            )
            return {
                'statusCode': 200,
                'body': f"Started Glue job with ID {response['JobRunId']}"
            }
        else:
            return {
                'statusCode': 200,
                'body': "Unsupported file format provided, ignoring."
            }
This is the trigger definition that will start a Glue job when a text file is found in the S3 bucket.
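To check the event parsing locally before deploying, a minimal sketch along these lines can help. It only mirrors the part of the handler that reads the bucket and key, without calling Glue. The helper names and the bucket and key in the sample event are hypothetical, and the event is trimmed to the fields the handler reads.

```python
from urllib.parse import unquote_plus


def extract_s3_object(event: dict) -> tuple:
    """Return (bucket, key) from the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # S3 URL-encodes object keys in event notifications (spaces arrive as '+')
    key = unquote_plus(record["object"]["key"])
    return bucket, key


def is_supported(key: str) -> bool:
    """Mirror the handler's check: only .txt objects start a Glue job."""
    return key.endswith(".txt")


# Trimmed-down event with hypothetical bucket and key names
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-demo-input"},
                "object": {"key": "letters/dispute+letter.txt"},
            }
        }
    ]
}

bucket, key = extract_s3_object(sample_event)
print(bucket, key, is_supported(key))
```

Decoding the key matters for object names that contain spaces or special characters, because the encoded key would not match the object stored in the bucket.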
S3 Setup
- Create 2 buckets for this exercise. Call them anything you’d like, and to distinguish between the two, append -input and -output to their names. The policy of the input bucket should allow GlueS3ReadWrite to read files with s3:GetObject, and the policy of the output bucket should allow the role to write files with s3:PutObject.
- Set up a new event notification on the -input bucket. Call it TextTrigger and filter the suffix to only pick up files ending with .txt. Set the destination as the Glue Trigger Lambda function.
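If you prefer to script the event notification rather than click through the console, the sketch below shows one possible shape of the configuration. It assumes the notification should fire on all object-creation events ("s3:ObjectCreated:*"), which the steps above do not state explicitly. The helper names and the Lambda ARN are hypothetical, and S3 must separately be permitted to invoke the function.

```python
def text_trigger_configuration(lambda_arn: str) -> dict:
    """Build the notification configuration for the TextTrigger event."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "TextTrigger",
                "LambdaFunctionArn": lambda_arn,
                # Assumption: trigger on every kind of object creation
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": ".txt"}]
                    }
                },
            }
        ]
    }


def apply_configuration(bucket: str, configuration: dict) -> None:
    """Attach the configuration to the bucket (needs AWS credentials)."""
    import boto3  # imported here so the builder above runs without boto3

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=configuration
    )


# Hypothetical function ARN, for illustration only
config = text_trigger_configuration(
    "arn:aws:lambda:us-east-1:123456789012:function:my-trigger-function"
)
print(config)
```

Calling apply_configuration with the name of your -input bucket would replace that bucket's existing notification configuration, so review it first if the bucket already has other notifications.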
Glue Setup
- The final step is creating the Glue job. Open AWS Glue and create a new ETL job. Let’s call it GlueClassifyProtectWrite (if you choose a different name, make sure to update the Lambda function). Set the type to Spark, version to Glue 5.0 and the language to Python 3. The execution role should be set to the Glue service role, GlueS3ReadWrite. You can keep the other defaults.
- In the script, paste the following code. In the next step we will update the sample to match your own environment.
    import sys

    import boto3
    import requests
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ['JobName', 'source_bucket', 'source_key'])

    # S3 Bucket and File configuration
    source_bucket = args['source_bucket']
    source_key = args['source_key']
    target_bucket = "api-playground-glue-output"
    target_key = source_key.replace(".txt", "-protected.txt")

    # Connection to S3
    s3 = boto3.client('s3', region_name = "")

    # Playground Login
    # Note: verify=False turns off TLS certificate verification
    logon_response = requests.post(
        'https://api.playground.protegrity.com/auth/login',
        headers={'Content-Type': 'application/json'},
        verify=False,
        json={"email": "", "password": ""})

    # Retrieve JWT Token to authenticate requests
    response_data = logon_response.json()
    JWT_TOKEN = response_data['jwt_token']

    # API Playground URL
    API_URL = "https://api.playground.protegrity.com/v1/ai"
    API_KEY = ""
    API_VERSION = "v1"

    # Request headers
    headers = {
        'Content-Type': 'application/json',
        'x-api-key': API_KEY,
        'Authorization': f"Bearer {JWT_TOKEN}"
    }

    # Read function
    def read_bucket(source_bucket, source_key):
        try:
            # Get the file from the source bucket
            response = s3.get_object(Bucket=source_bucket, Key=source_key)
            # Get the file content (binary)
            file_content = response['Body'].read()
            # Assuming it's a text file, decode it to a string
            file_text = file_content.decode('utf-8')
            print(f"Successfully read the file {source_bucket}/{source_key}")
        except Exception as e:
            print(f"Error: {e}")
            # Re-raise so the job run fails instead of continuing without data
            raise
        return file_text

    # Write function
    def write_bucket(target_bucket, target_key, file_text):
        try:
            s3.put_object(Bucket=target_bucket, Key=target_key, Body=file_text)
            print(f"Successfully copied {source_key} from {source_bucket} to {target_bucket}/{target_key}")
        except Exception as e:
            print(f"Error: {e}")
            # Re-raise so a failed write is visible in the job run status
            raise

    # Classify and Protect function: calls the API Playground
    def classify_protect(file_text):
        data_json = {
            "operation": "protect",
            "options": {"type": "mask", "tags": False, "threshold": 0.6},
            "data": [file_text]
        }
        response = requests.post(API_URL, json=data_json, headers=headers, verify=False)
        return response.json()

    # Run the job
    file_text = read_bucket(source_bucket, source_key)
    api_response = classify_protect(file_text)
    response_data = api_response["results"]
    write_bucket(target_bucket, target_key, response_data)
- Adjust the script with your environment details, specifically the lines:
  - within the S3 Bucket and File configuration section, set target_bucket to the name of your -output bucket
  - within the Connection to S3 section, provide the region of your input and output buckets, e.g. s3 = boto3.client('s3', region_name = "us-east-1")
  - within the Playground Login section, specify your email and password used to authenticate with the Playground
  - within the API Playground URL section, provide your API Key used to authorize your Playground requests
- Save your Glue job. We’re ready to roll!
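The script derives the output object name with source_key.replace(".txt", "-protected.txt"), which rewrites every occurrence of ".txt" in the key. If your keys may contain ".txt" elsewhere, for example in a folder name, a suffix-only variant such as the sketch below may be preferable. The helper name protected_key and its parameters are hypothetical and not part of the original sample.

```python
def protected_key(source_key: str, suffix: str = ".txt",
                  marker: str = "-protected") -> str:
    """Insert a marker before the file suffix, touching only the end of the key."""
    if not source_key.endswith(suffix):
        raise ValueError(f"Expected a key ending with {suffix!r}: {source_key!r}")
    # Cut off the trailing suffix only, then rebuild the key around the marker
    stem = source_key[: -len(suffix)]
    return f"{stem}{marker}{suffix}"


print(protected_key("letters/dispute.txt"))
print(protected_key("archive.txt.d/dispute.txt"))
# For comparison, str.replace changes the folder name as well
print("archive.txt.d/dispute.txt".replace(".txt", "-protected.txt"))
```

Because the Lambda trigger only forwards keys ending in .txt, the ValueError branch should not be reached in this setup; it is there to make misuse obvious.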
Automatic Data Classification and Protection of Unstructured Files
This scenario showcases processing of an unstructured text file. The file contents are classified and protected entirely by the Protegrity API Playground.
- Go to the -input bucket and drop there a text file that you wish to de-identify. You can use our sample or provide your own.

    Alexandra Rivera
    Address: 1258 Maplewood Drive
    Springfield, IL 62704
    Phone: (217) 555-3927
    Email: arivera82@email.com
    Date: March 21, 2025

    To: Customer Disputes Department
    First Horizon Credit Bank
    4801 Westlake Blvd
    Austin, TX 73301

    Subject: Dispute of Unauthorized Credit Card Charge

    Dear Customer Disputes Department,

    I am writing to formally dispute a charge on my credit card account that I did not authorize.

    Cardholder Name: Alexandra Rivera
    Credit Card Number: 3709888761001982
    Date of Charge: March 17, 2025
    Amount: $198.76
    Merchant Name: "TechMart Online – NY"

    I did not authorize this transaction and have never conducted business with the above-mentioned merchant. I became aware of this charge after reviewing my recent statement and immediately verified that neither I nor anyone with authorized access to my account made this purchase.

    In accordance with the Fair Credit Billing Act, I am requesting that this charge be removed from my account, that any related interest or fees be reversed, and that a corrected statement be issued as soon as possible.

    Please investigate this matter and notify me of the outcome. Enclosed with this letter is a copy of my most recent statement highlighting the disputed charge. I have also taken the precaution of temporarily suspending the card to prevent any further unauthorized use.

    Please confirm receipt of this letter and provide a timeline for the resolution of this issue. Should you require any additional information or documentation, feel free to contact me at the number or email address listed above.

    Thank you for your prompt attention to this matter.

    Sincerely,
    Alexandra Rivera

- The processing will take up to a minute to complete. Monitor the Glue logs to see the progress.
- Once the job finalizes, you will see a de-identified file in your -output bucket. It will look akin to this:

    ################
    Address: 1258 ###############
    ###########, ## 62704
    Phone: ##############
    Email: ###################
    Date: March 21, 2025

    To: Customer Disputes Department
    First Horizon Credit Bank
    4801 #############
    ######, ## 73301

    Subject: Dispute of Unauthorized Credit Card Charge

    Dear Customer Disputes Department,

    I am writing to formally dispute a charge on my credit card account that I did not authorize.

    Cardholder Name: ################
    Account Number: ################
    Credit Card Number: ################
    Date of Charge: March 17, 2025
    Amount: $198.76
    Merchant Name: "TechMart Online – NY"

    I did not authorize this transaction and have never conducted business with the above-mentioned merchant. I became aware of this charge after reviewing my recent statement and immediately verified that neither I nor anyone with authorized access to my account made this purchase.

    In accordance with the Fair Credit Billing Act, I am requesting that this charge be removed from my account, that any related interest or fees be reversed, and that a corrected statement be issued as soon as possible.

    Please investigate this matter and notify me of the outcome. Enclosed with this letter is a copy of my most recent statement highlighting the disputed charge. I have also taken the precaution of temporarily suspending the card to prevent any further unauthorized use.

    Please confirm receipt of this letter and provide a timeline for the resolution of this issue. Should you require any additional information or documentation, feel free to contact me at the number or email address listed above.

    Thank you for your prompt attention to this matter.

    Sincerely,
    ################
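Besides watching the logs, the state of a job run can be polled with the Glue get_job_run call, using the run ID that the Lambda function reports when it starts the job. The sketch below is one way to do that. The function name, polling interval, and retry limit are arbitrary choices, and FakeGlueClient is a hypothetical stand-in so the example runs without AWS access; in practice you would pass boto3.client('glue').

```python
import time

# Job run states after which no further progress is expected
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(glue_client, job_name, run_id,
                     poll_seconds=10, max_polls=30, sleep=time.sleep):
    """Poll a Glue job run until it reaches a final state and return that state."""
    for _ in range(max_polls):
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in FINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"Job run {run_id} did not finish after {max_polls} polls")


class FakeGlueClient:
    """Hypothetical stand-in for boto3.client('glue') so the sketch runs offline."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"JobRunState": next(self._states)}}


fake = FakeGlueClient(["STARTING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(fake, "GlueClassifyProtectWrite", "jr_example",
                               sleep=lambda seconds: None)
print(final_state)
```

Injecting the client and the sleep function keeps the polling logic testable without waiting on a real job.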
Feel free to adjust the configuration of your Glue job by setting your own risk tolerance (a lower or higher threshold), adding classification tags, or choosing another protection type.
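One way to make these settings easier to adjust is to build the request body in a small helper, as in the sketch below. The payload shape and the default values ("mask", tags off, threshold 0.6) are taken from the Glue script above. The helper name build_protect_request is hypothetical, and which other values the Playground accepts for type or threshold is not covered here, so check the Playground documentation before changing them.

```python
def build_protect_request(text: str, protection_type: str = "mask",
                          tags: bool = False, threshold: float = 0.6) -> dict:
    """Assemble the JSON body that the Glue script sends to the Playground."""
    return {
        "operation": "protect",
        "options": {
            # Values other than the script's defaults are not verified here
            "type": protection_type,
            "tags": tags,
            "threshold": threshold,
        },
        "data": [text],
    }


# Same request as the script, but with classification tags switched on
request_body = build_protect_request("Phone: (217) 555-3927", tags=True)
print(request_body)
```

In the Glue script, the returned dictionary would take the place of data_json inside classify_protect.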
Summary
Pairing AWS Glue with Protegrity is a powerful value proposition: the combination facilitates custom file uploads to the Cloud whilst achieving high security and control over what is being shared. With the flexibility of Glue and the API-first approach of the Playground (and, by extension, Protegrity), the job is automated and scalable, ensuring that the best practices of data security are embedded in your cloud platform from the get-go.
This sample can be further extended to other file formats, other target systems, and platforms. You may want to leverage a data catalog to scan and store the file metadata (recommended for structured files). You may also decide to combine this example with our other samples (such as data unprotection in Snowflake).
Last modified March 21, 2025