nprapps/russia-explainer-serverless-chrome

Name: russia-explainer-serverless-chrome

Owner: NPR visuals team

Description: Run headless Chrome/Chromium on AWS Lambda (maybe Azure, & GCP later)

Forked from: eads/serverless-chrome

Created: 2017-05-10 17:41:52.0

Updated: 2017-09-07 17:00:44.0

Pushed: 2017-05-23 15:20:10.0

Homepage:

Size: 175686

Language: JavaScript

README

serverless-chrome

Serverless Chrome contains everything you need to get started running headless Chrome on AWS Lambda (possibly Azure and GCP Functions soon).

The aim of this project is to provide the scaffolding for using Headless Chrome during a serverless function invocation. Serverless Chrome takes care of building and bundling the Chrome binaries and making sure Chrome is running when your serverless function executes. In addition, this project also provides a few “example” handlers for common patterns (e.g. taking a screenshot of a page, printing to PDF, some scraping, etc.)

Why? Because it's neat. It also opens up interesting possibilities for using the Chrome Debugger Protocol in serverless architectures.

Contents
  1. What is it?
  2. Installation
  3. Setup
  4. Testing
  5. Configuration and Deployment
  6. Known Issues / Limitations
  7. Roadmap
  8. Troubleshooting
Installation

Install with the following commands:

git clone https://github.com/adieuadieu/serverless-chrome
cd serverless-chrome
yarn install

(It is possible to exchange yarn for npm if yarn is too hipster for you. No problem.)

Or, if you have serverless installed globally:

serverless install -u https://github.com/adieuadieu/serverless-chrome
Setup
Credentials

You must configure your AWS credentials either by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by using an AWS profile. You can read more about this in the Serverless Credentials Guide.

In short, either:

export AWS_PROFILE=<your-profile-name>

or

export AWS_ACCESS_KEY_ID=<your-key-here>
export AWS_SECRET_ACCESS_KEY=<your-secret-key-here>
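
Before deploying, it can help to confirm which of the two options is actually set in your shell. The following is a minimal sketch, not part of serverless-chrome: the function name `credentialOptionsPresent` is hypothetical, and it only checks the environment variables named above. It does not validate the credentials or model which option the AWS SDK would prefer.

```javascript
// Hypothetical helper (not part of serverless-chrome): reports which of the
// credential options described above are present in an environment object.
function credentialOptionsPresent(env) {
  return {
    // Option 1: a named AWS profile
    profile: Boolean(env.AWS_PROFILE),
    // Option 2: an explicit key pair; both halves must be set
    keys: Boolean(env.AWS_ACCESS_KEY_ID && env.AWS_SECRET_ACCESS_KEY),
  }
}

const found = credentialOptionsPresent(process.env)
if (!found.profile && !found.keys) {
  console.warn('No AWS credentials found in the environment')
} else {
  console.log('AWS credential options present:', found)
}
```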
Testing

Test with yarn test or just yarn ava to skip the linter.

Setup and Deployment
yarn deploy

This package bundles a lambda-execution-environment-ready headless Chrome binary which allows you to deploy from any OS. The current build is:

Configuration

You can override default configuration in the /config.js file generated at the root of the project after a yarn install. See the defaults in src/config.js for a full list of configuration options.

Example Handlers

Currently there are only two, very basic “proof of concept” type functions:

captureScreenshot: Capture Screenshot of a given URL

When you deploy the serverless function, it creates a Lambda function that takes a screenshot of a URL you provide. You can provide this URL to the Lambda function via the AWS API Gateway. After a successful deploy, an API endpoint will be provided. Use this URL to call the Lambda function with a url in the query string. E.g. https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome?url=https://google.com/
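
If the page you want to capture has a query string of its own, the url parameter needs to be percent-encoded, otherwise its `&`-separated parts would be read as parameters of the endpoint itself. A minimal sketch of building the request URL follows. The helper name `buildInvokeUrl` is hypothetical, and the endpoint is the placeholder from the example above, not a real deployment.

```javascript
// Hypothetical helper: builds the request URL for a deployed endpoint,
// percent-encoding the target page so its own query string stays intact.
function buildInvokeUrl(endpoint, targetUrl) {
  const invokeUrl = new URL(endpoint)
  invokeUrl.searchParams.set('url', targetUrl)
  return invokeUrl.toString()
}

// Placeholder endpoint from the example above; substitute your own.
const endpoint = 'https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome'
console.log(buildInvokeUrl(endpoint, 'https://example.com/search?q=chrome&page=2'))
```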

We're using API Gateway as our method to execute the function, but of course it's possible to use any other available trigger to kick things off, be it an event from S3, SNS, DynamoDB, etc. TODO: explain how.
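
As a rough illustration of what another trigger might involve, the sketch below reads the target URL from either an API Gateway proxy event or an SNS event. It is not part of serverless-chrome: `urlFromEvent` is a hypothetical helper, and it assumes the SNS publisher sends the URL as the plain-text message body.

```javascript
// Hypothetical helper: extracts the target URL from different Lambda event shapes.
function urlFromEvent(event) {
  // API Gateway proxy event: ?url=... arrives in queryStringParameters
  if (event.queryStringParameters && event.queryStringParameters.url) {
    return event.queryStringParameters.url
  }

  // SNS event: the published message is at Records[n].Sns.Message.
  // Assumption: the message body is the URL itself, as plain text.
  const [record] = event.Records || []
  if (record && record.Sns && record.Sns.Message) {
    return record.Sns.Message
  }

  throw new Error('No URL found in the invocation event')
}

console.log(urlFromEvent({ queryStringParameters: { url: 'https://www.chromium.org/' } }))
console.log(urlFromEvent({ Records: [{ Sns: { Message: 'https://www.chromium.org/' } }] }))
```

A custom handler could call such a helper in place of destructuring queryStringParameters directly.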

/config.js

import captureScreenshot from './src/handlers/captureScreenshot'

export default {
  handler: captureScreenshot
}

printToPdf: Print a given URL to PDF

The printToPdf handler will create a PDF from a URL it's provided. You can provide this URL to the Lambda function via the AWS API Gateway. After a successful deploy, an API endpoint will be provided. Use this URL to call the Lambda function with a url in the query string. E.g. https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome?url=https://google.com/

Note: Headless Chrome currently doesn't expose any configuration options (paper size, orientation, margins, etc) for printing to PDF. You can follow Chromium's progress on this here and here. You can get some sense of the upcoming configuration options from the modifications to the Chrome Debugging Protocol here.

We're using API Gateway as our method to execute the function, but of course it's possible to use any other available trigger to kick things off, be it an event from S3, SNS, DynamoDB, etc. TODO: explain how.

/config.js

import printToPdf from './src/handlers/printToPdf'

export default {
  handler: printToPdf
}

Custom Handlers

You can provide your own handler via the /config.js file created when you initialize the project with yarn install. The config accepts a handler property. Pass it a function which returns a Promise when complete. For example:

/config.js

export default {
  handler: async function(invocationEventData, executionContext) {
    const { queryStringParameters: { url } } = invocationEventData
    const stuff = await doSomethingWith(url)
    return stuff
  }
}


The first parameter, invocationEventData, is the event data with which the Lambda function is invoked. It's the first parameter provided by Lambda. The second, executionContext, is the context object Lambda passes as its second parameter; it contains useful runtime information.

serverless-chrome calls the Lambda handler's callback() for you when your handler function completes. The result of your handler is passed to callback with callback(null, yourHandlerResult). If your handler throws an error, callback is called with callback(yourHandlerError).
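
The callback behaviour described above can be pictured with a small wrapper. This is only a sketch of the idea, not serverless-chrome's actual code: `wrapHandler` is a hypothetical name, and the real project also takes care of starting Chrome before your handler runs.

```javascript
// Hypothetical wrapper: adapts a Promise-returning handler to Lambda's
// callback style. Not the project's real implementation.
function wrapHandler(handler) {
  return function lambdaHandler(event, context, callback) {
    Promise.resolve()
      // Run the handler inside the chain so synchronous throws are caught too
      .then(() => handler(event, context))
      .then(
        result => callback(null, result), // success: callback(null, yourHandlerResult)
        error => callback(error) // failure: callback(yourHandlerError)
      )
  }
}

// Usage with a trivial handler and a logging callback
const lambdaHandler = wrapHandler(async event => ({ statusCode: 200, body: event.body }))
lambdaHandler({ body: 'ok' }, {}, (error, result) => console.log(error, result))
```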

For example, to create a handler which returns the version info of the Chrome Debugger Protocol, you could modify /config.js to:

import Cdp from 'chrome-remote-interface'

export default {
  async handler (event) {
    const versionInfo = await Cdp.Version()

    return {
      statusCode: 200,
      body: JSON.stringify({
        versionInfo,
      }),
      headers: {
        'Content-Type': 'application/json',
      },
    }
  },
}


To capture all of the Network Request events made when loading a URL, you could modify /config.js to something like:

import Cdp from 'chrome-remote-interface'
import { sleep } from './src/utils'

const LOAD_TIMEOUT = 1000 * 30

export default {
  async handler (event) {
    const requestsMade = []
    let loaded = false

    const loading = async (startTime = Date.now()) => {
      if (!loaded && Date.now() - startTime < LOAD_TIMEOUT) {
        await sleep(100)
        await loading(startTime)
      }
    }

    const [tab] = await Cdp.List()
    const client = await Cdp({ host: '127.0.0.1', target: tab })

    const { Network, Page } = client

    Network.requestWillBeSent(params => requestsMade.push(params))

    Page.loadEventFired(() => {
      loaded = true
    })

    // https://chromedevtools.github.io/debugger-protocol-viewer/tot/Network/#method-enable
    await Network.enable()

    // https://chromedevtools.github.io/debugger-protocol-viewer/tot/Page/#method-enable
    await Page.enable()

    // https://chromedevtools.github.io/debugger-protocol-viewer/tot/Page/#method-navigate
    await Page.navigate({ url: 'https://www.chromium.org/' })

    // wait until page is done loading, or timeout
    await loading()

    // It's important that we close the websocket connection,
    // or our Lambda function will not exit properly
    await client.close()

    return {
      statusCode: 200,
      body: JSON.stringify({
        requestsMade,
      }),
      headers: {
        'Content-Type': 'application/json',
      },
    }
  },
}


See src/handlers for more examples.
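
The network-capture example above waits for the page by polling a flag every 100 ms until a timeout passes. One possible alternative, shown here only as a sketch, is to race the load event against a timer. `withTimeout` is a hypothetical helper and is not part of serverless-chrome or chrome-remote-interface.

```javascript
// Hypothetical helper: resolves to true if `promise` fulfils before `ms`
// milliseconds elapse, and to false if the timer wins the race.
function withTimeout(promise, ms) {
  let timer
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(false), ms)
  })
  return Promise.race([promise.then(() => true), timeout])
    // Clear the timer so a pending timeout does not keep the process alive
    .finally(() => clearTimeout(timer))
}

// Inside a handler, the polling loop could then be replaced with something like:
//   const pageLoaded = new Promise(resolve => Page.loadEventFired(resolve))
//   await Page.navigate({ url: 'https://www.chromium.org/' })
//   const finished = await withTimeout(pageLoaded, LOAD_TIMEOUT)
withTimeout(Promise.resolve(), 1000).then(finished => console.log('finished:', finished))
```

Either way, the handler continues after the timeout, so the result may contain only the requests seen up to that point.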

TODO: talk about CDP and chrome-remote-interface

Known Issues / Limitations
  1. Requires a hack to the Chrome code to disable /dev/shm.
  2. Limited /tmp size on Lambda.
  3. Running this on Lambda may not be the most cost-efficient option compared with EC2.
Roadmap

1.0

  1. Don't force the use of Serverless-framework. See Issue #4
    1. Refactor the headless Chrome bundle and Chrome spawning code into an npm package
    2. Create a Serverless plugin, using above npm package
  2. OMG OMG Get unit tests up to snuff!
  3. Example serverless services using headless-chrome
    1. Printing a URL to a PDF
    2. Loading a page and taking a screenshot, with options on viewport size and device settings
    3. DOM manipulation and scraping

Future

  1. Support for Google Cloud Functions
  2. Support for Azure Functions?
  3. Example handler with nightmarejs (if this is even possible?)
Troubleshooting

I keep getting a timeout error when deploying and it's really annoying.

Indeed, that is annoying. I've had the same problem, and so that's why it's now here in this troubleshooting section. This may be an issue in the underlying AWS SDK when using a slower Internet connection. Try changing the AWS_CLIENT_TIMEOUT environment variable to a higher value. For example, in your command prompt enter the following and try deploying again:

export AWS_CLIENT_TIMEOUT=3000000

Aaaaaarggghhhhhh!!!

Uuurrrggghhhhhh! Have you tried filing an Issue?




This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.