Parse and query CloudTrail logs with AWS Glue, Amazon Redshift Spectrum and Athena

05/11/2018
Posted in AWS Blog
05/11/2018 Jo Evens

Building on the Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena blog post on the AWS Big Data blog, this post will demonstrate how to convert CloudTrail log files into parquet format and query those optimized log files with Amazon Redshift Spectrum and Athena.

The people over at awslabs did a great job in providing scripts that allow the conversion through AWS Glue ETL jobs. I’ll be using their scripts throughout this post.

Depending on your use case, either Redshift Spectrum or Athena will come up as the best fit:
If you want  ad-hoq, multi-partitioning and complex data types go with Athena.
If on the other hand you want to integrate wit existing redshift tables, do lots of joins or aggregates go with Redshift Spectrum.

Setting things up

Users, roles and policies

For the purpose of this demo, I’ve created a demo-user with following policies attached:

  1. AmazonAthenaFullAccess
  2. AmazonRedshiftFullAccess
  3. An inline policy allowing read-only access to the CloudTrail logs on S3 and the scripts bucket.
  4. An inline policy allowing read-write access to the S3 bucket containing the Glue ETL scripts

The Glue service role contains:

  1. The managed AWSGlueServiceRole
  2. An inline policy giving read-write access to the CloudTrail logs on S3

The Redshift service role contains:

  1. The managed AWSGlueConsoleFullAccess role
  2. An inline policy giving read access to the cloudtrail logs on S3

In order to use Athena and Redshift from SQL editors, please add port 443 and 5439 to your VPC’s default security group.

Awslabs scripts

The readme.md on the github project page explains how to build and deploy the scripts. In this case, I’ve uploaded the scripts to another bucket, not the bucket containing the CloudTrail logs.

Glue

Once the scripts are in place, create the Glue ETL job using the AWS CLI:

Now, to actually start the job, you can select it in the AWS Glue console, under ETL – Jobs, and click Action – Run Job, or through the CLI:

You can follow up on progress by using:

Until the JobRunState is Succeeded:

 

Athena

Launch your favorite SQL editor (Additional drivers) ,  or open Athena in the AWS console.
Let’s see what our table looks like:

You’ll notice 4 columns starting with json_. These contain some more nested JSON data.
For example:

You can use the JSON extract functionality in Athena to dive in deeper:

Something more useful to interpret:

Redshift Spectrum

Now that we have our tables and database in the Glue catalog, querying with Redshift Spectrum is easy.
First make sure you have a Redshift cluster running, then create the external schema:

Our tables are detected automatically (Thank you Glue).

Creating a session with psql:

And a query making a bit more sense than COUNT(*) :

 

Share this AWSome post
, ,

Leave a Reply

Your email address will not be published. Required fields are marked *