awslabs/timely-security-analytics

Name: timely-security-analytics

Owner: Amazon Web Services - Labs

Description: Demo code for the Timely Security Analytics and Analysis 2015 Re:Invent presentation.

Created: 2015-09-24 20:55:26.0

Updated: 2017-10-27 11:04:48.0

Pushed: 2015-10-09 15:45:56.0


Size: 150

Language: Scala

README

timely-security-analytics Overview

This repo contains demo code for the Timely Security Analytics and Analysis presentation at the Amazon Web Services re:Invent 2015 conference. We are open sourcing this project so that others may learn from it and potentially build on it. It contains three interesting, independent units, described in the sections below.

This page will be updated with a link to the video of the talk, when it is available.

CloudTrailToSQL

Do you want to run SQL queries over your CloudTrail logs? Well, you're in the right place. This code is written so you can cut and paste it into a spark-shell and then start running SQL queries (which is why there is no package declaration in this code and I've used as few libraries as possible). All the libraries it requires are already included in the build of Spark 1.4.1 that's available through Amazon Elastic MapReduce (EMR). For more information, see https://aws.amazon.com/blogs/aws/new-apache-spark-on-amazon-emr/

How to use it
  1. Provision an EMR cluster with Spark 1.4.1 and an IAM role that has CloudTrail and S3 describe and read permissions.
  2. SSH to the cluster and run the spark-shell, e.g. spark-shell --master yarn-client --num-executors 40 --conf spark.executor.cores=2
  3. Cut and paste the contents of CloudTrailToSQL.scala (found in this package) into your Spark Shell (once the scala> prompt is available)
  4. Run the following commands:
    val cloudtrail = CloudTrailToSQL.createTable(sc, sqlContext) //creates and registers the Spark SQL table
    CloudTrailToSQL.runSampleQuery(sqlContext) //runs a sample query
    
    Note that these commands will take some time to run, as they load your CloudTrail data from S3 and cache it in memory on the Spark cluster. Run the sample query again and you'll see the speed-up that the in-memory caching provides. (A consolidated session sketch appears at the end of this section.)
  5. Run any SQL query you want over the data, e.g.
    ontext.sql("select distinct eventSource, eventName, userIdentity.principalId from cloudtrail where userIdentity.principalId = userIdentity.accountId").show(99999) //Find services and APIs called with root credentials
    
  6. You can create a Hive table (one that will persist after your program exits) by running
    val cloudtrail = CloudTrailToSQL.createHiveTable(sc, sqlContext)
    
Additional uses

You can configure and invoke GeoIP lookup functions using code like that below. To do this, you will need a copy of the MaxMind GeoIP database. See the Dependencies section of this documentation.

    import reinvent.securityanalytics.utilities.Configuration
    import reinvent.securityanalytics.GeoIPLookup
    val config = new Configuration("<YOUR BUCKET>", "config/reinventConfig.properties")
    val geoIP = new GeoIPLookup(config)
    geoIP.registerUDFs(sqlContext) //Registers UDFs that you can use for lookups.
    sqlContext.sql("select distinct sourceIpAddress, city(sourceIpAddress), country(sourceIpAddress) from cloudtrail").collect.foreach(println)
    

CloudTrailProfileAnalyzer

How to use it
  1. Fill out a config file and upload it to S3.
  2. (Optional) License MaxMind's GeoIP DB to enable the GeoIP functionality.
  3. Start an EMR cluster with Spark 1.4.1.
  4. Compile the code with "mvn package".
  5. Upload the fat jar (e.g., cloudtrailanalysisdemo-1.0-SNAPSHOT-jar-with-dependencies.jar) to your EMR cluster.
  6. Submit it using spark-submit. See resources/startStreaming.sh for an example; a hypothetical invocation is also sketched below. Make sure to pass the bucket and key that point to your config file.
  7. Look for alerts via the subscriptions set up on your SNS topic.
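
For illustration, a hypothetical spark-submit invocation. The main class name (reinvent.securityanalytics.CloudTrailProfileAnalyzer) and the argument order are assumptions; resources/startStreaming.sh shows the actual command.

    spark-submit --master yarn-client \
      --class reinvent.securityanalytics.CloudTrailProfileAnalyzer \
      cloudtrailanalysisdemo-1.0-SNAPSHOT-jar-with-dependencies.jar \
      <YOUR BUCKET> config/reinventConfig.properties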
Future work

Dependencies

This code's key dependencies include Apache Spark and, for the optional GeoIP lookups, MaxMind's GeoIP database. For a full list, including versions, please see the pom.xml file included in the repo.


