wtsi-hgi/irobot

Name: irobot

Owner: Wellcome Trust Sanger Institute - Human Genetics Informatics

Description: iRODS data brokerage service

Created: 2017-01-05 12:18:51.0

Updated: 2017-08-23 10:37:21.0

Pushed: 2017-10-10 15:45:50.0

Homepage:

Size: 548

Language: Python

README

iRobot

iRODS data brokerage service: An authenticated agent requests data objects from iRODS via HTTP; these are staged on local disk before being sent out as the response. The service also acts as a precache, to presumptively seed upstream systems with data, and manages a connection pool to iRODS.

Work in Progress
Installation

iRobot is fully containerised, using Docker. The container image can be built using the build.sh script:

build.sh AUTH_METHOD [USER] 

Build script requirements:

If using Kerberos authentication, the Kerberos client packages also need to be available on the Docker host. Additionally, it is expected that the host's /etc/krb5.conf would be bind mounted into the container at runtime; this configuration file is also used at build-time to determine the default Kerberos realm and permitted encryption algorithms.

Before running the build script, create the irods_environment.json.template file, based on irods_environment.json.template.sample, and set the irods_host, irods_port, irods_zone_name and, potentially, irods_cwd and irods_home values appropriately. The ${user} tags in the template file will be replaced with USER by the build script.

The AUTH_METHOD can be either native (nat) or kerberos (krb). If the USER is omitted, then the current login is used. The script then builds an image named hgi/irobot, with the USER given as its tag.

To launch the container:

docker run -v /path/to/your/precache/directory:/precache \
           -v /etc/krb5.conf:/etc/krb5.conf \
           -v /path/to/your/irobot.conf:/home/USER/irobot.conf \
           -p 5000:5000 \
           hgi/irobot:USER

(Note that bind mounting /etc/krb5.conf is only necessary when using Kerberos authentication. A cron job runs in the Kerberos container to renew credentials with the KDC periodically; if the container is down for any significant amount of time, this may fail and you'll have to rebuild the image.)

Configuration

The irobot.conf configuration is not copied into the image and ought to be bind mounted at runtime, into USER's home directory. This allows you to make configuration changes without rebuilding. An example configuration can be found in irobot.conf.sample.

Precache Policy
iRODS
HTTP API

Note that it is recommended that the HTTP API is only served over TLS (e.g., using a reverse proxy), to avoid authentication credentials being exposed as plain-text over an unencrypted connection.

HTTP Basic Authentication

This is only needed if using HTTP basic authentication.

Arvados Authentication

This is only needed if using Arvados authentication.

Logging

Log messages are tab-delimited timestamp (ISO8601 UTC), level and message records.
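As a minimal sketch of consuming this format, the following splits a record into its three fields. The names `LogRecord` and `parse_log_line` are hypothetical, and the timestamp handling assumes second precision (only the first 19 characters are parsed), since the exact timestamp layout is not shown here.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split one tab-delimited record into timestamp, level and message."""
    # Split at most twice, so any tabs inside the message are preserved
    stamp, level, message = line.rstrip("\n").split("\t", 2)
    # Assumption: the timestamp starts with YYYY-MM-DDTHH:MM:SS and is UTC;
    # whatever zone designator follows is ignored
    parsed = datetime.strptime(stamp[:19], "%Y-%m-%dT%H:%M:%S")
    return LogRecord(parsed.replace(tzinfo=timezone.utc), level, message)
```

Limiting the split to two tabs means a message that itself contains tabs is still returned whole.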

API
Gateway Timeout

iRobot essentially acts as an iRODS gateway through this HTTP API. If any operation takes an overly long time to complete (per the respective configuration), then a 504 Gateway Timeout response will be issued. (This may not be due to iRODS, but that will be the most likely culprit.) If this happens regularly, it may be indicative of a configuration or networking problem between iRobot and iRODS.

(Note that any client would also, presumably, hang up on an overly long-running connection.)

Authentication

All HTTP requests must include the Authorization header with a value that can be handled by any one of the configured authentication handlers. That is:

If the respective authentication handler cannot authenticate the payload it's given (or no Authorization header exists), a 401 Unauthorized response will be returned. If the payload can be authenticated, but the user (that is, the iRODS account under which iRobot operates) does not have the necessary access to the requested resource, a 403 Forbidden response will be returned.
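For the HTTP basic authentication handler, a client might build the header as sketched below. This follows the standard Basic scheme (RFC 7617); the function name `basic_auth_header` is hypothetical, and the header format expected by the Arvados handler is not covered here.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an Authorization header for the HTTP Basic scheme."""
    # The credentials are joined with a colon and base64 encoded; this is
    # an encoding, not encryption, hence the advice to serve over TLS
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```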

Data Object Endpoint

iRobot exposes a single, parametrised endpoint at its root, taking the iRODS full path (collection name and data object, interspersed with slash characters) as its parameter. Note that, as the absolute path is taken as the parameter, the initial slash is assumed to be there so shouldn't be used in the URL.

That is, for example, for data object data_object in collection /full/path/to/my:

https://irobot:5000/full/path/to/my/data_object (correct)
https://irobot:5000//full/path/to/my/data_object (incorrect: the initial slash is repeated)

Any special characters in the iRODS path should be percent encoded. If the requested data object does not exist in iRODS, then a 404 Not Found response will be returned.
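The two rules above (drop the initial slash, percent encode special characters) can be sketched with the standard library as follows. The helper name `data_object_url` and the base URL are assumptions for illustration.

```python
from urllib.parse import quote


def data_object_url(base_url: str, irods_path: str) -> str:
    """Build the endpoint URL for an absolute iRODS data object path."""
    if not irods_path.startswith("/"):
        raise ValueError("expected an absolute iRODS path")
    # Drop the initial slash and percent encode everything except the
    # slashes separating collections
    encoded = quote(irods_path.lstrip("/"), safe="/")
    return f"{base_url.rstrip('/')}/{encoded}"
```

For example, `data_object_url("https://irobot:5000", "/full/path/to/my/data_object")` gives `https://irobot:5000/full/path/to/my/data_object`.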

GET and HEAD Response Summary

Status | Semantics
:----: | :--------
200 | Return data object
202 | Data object still being fetched from iRODS; ETA returned, if possible
206 | Return ranges of data object
304 | Data object matches that expected by client
401 | Authentication failure
403 | Access denied to iRobot iRODS user
404 | No such data object on iRODS
405 | Method not allowed (only GET, HEAD, POST, DELETE and OPTIONS are supported)
406 | Unsupported requested media type
416 | Invalid range request
502 | An invalid operation occurred while interacting with iRODS
504 | Response timeout
507 | Precache full

A HEAD request can be made to the data object endpoint to facilitate discovery and status tracking, without the overhead of a full GET. That is, the same actions described below will be invoked on a HEAD request, but only the response headers will be returned.

Using the Accept Request Header

The Accept request header is used productively to fetch an appropriate representation of the specific data object, per the semantics of HTTP content negotiation:

Note that, arguably, serving very different representations from the same endpoint breaks the true purpose of content negotiation. However, the protocol followed by iRobot is seen as a better trade-off, given its primary objective of fetching data. If this representation duplicity is too much for you to stomach, you can simply stick a reverse proxy in front of iRobot with an appropriate set of rewrite rules.

Client Cache Validity

The response will always include the ETag header with its value corresponding to the MD5 checksum of the data object cached by iRobot, as calculated by iRODS. (iRobot will also calculate its own MD5 sum, to check they match.) This will allow the client to verify it is requesting the same version of the data object that it is expecting.

A client can ensure this programmatically by using the If-None-Match request header, with the given entity tag. If the tags match, a 304 Not Modified response will be returned; otherwise, a full response will be returned.

This behaviour also applies to range requests. A client that wishes to fetch a range it doesn't have, from a data object it has seen before, can either make two requests – the first with the If-None-Match header, the second without – or make a single request without the If-None-Match header and inspect the response itself.
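A client-side sketch of both halves of this check follows: sending a known entity tag, and verifying downloaded data against one. It assumes the entity tag is the hexadecimal MD5 digest, possibly quoted or weak-prefixed as HTTP allows; the function names are hypothetical.

```python
import hashlib


def conditional_headers(etag: str) -> dict:
    """Headers asking for a 304 Not Modified if the entity tag still matches."""
    return {"If-None-Match": etag}


def matches_etag(data: bytes, etag: str) -> bool:
    """Check downloaded data against the MD5 checksum carried in an ETag."""
    # Strip the optional weak-validator prefix and surrounding quotes
    if etag.startswith("W/"):
        etag = etag[2:]
    expected = etag.strip('"').lower()
    return hashlib.md5(data).hexdigest() == expected
```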

Fetching Data

Fetching of the data supports range requests using the Range request header. If this header is present and the data exists in its entirety, it will be returned with a 206 Partial Content response under the multipart/byteranges media type, where byte ranges in the response will have the media type application/octet-stream and include an entity tag of the range MD5 checksum, if one exists. The ranges may therefore be chunked differently than requested, so that they align with the precache checksum chunk size, but the requested range will be fully satisfied.

If the Range request header is omitted, then the entirety of the data will be returned as a 200 OK response, with media type application/octet-stream. If a range request is not satisfiable due to the request being out-of-bounds, then a 416 Range Not Satisfiable response will be issued.

Note that an initial range request (i.e., for data that has yet to be precached) will still fetch the entirety of the data into the precache; there is no short-cutting.
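As an illustration of the request side, this sketch builds a Range header value for one or more byte ranges using the standard `bytes=` syntax. The helper name `range_header` is hypothetical; an end of `None` stands for "to the end of the data".

```python
from typing import Iterable, Optional, Tuple


def range_header(ranges: Iterable[Tuple[int, Optional[int]]]) -> dict:
    """Build a Range header from inclusive (start, end) byte offsets."""
    parts = []
    for start, end in ranges:
        if start < 0 or (end is not None and end < start):
            raise ValueError(f"invalid byte range: {start}-{end}")
        # An open-ended range is written with nothing after the hyphen
        parts.append(f"{start}-{'' if end is None else end}")
    if not parts:
        raise ValueError("at least one range is required")
    return {"Range": "bytes=" + ",".join(parts)}
```

Since the returned ranges may be chunked differently than requested, a client should read the Content-Range of each part of the response rather than assume it mirrors the request.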

Precache Saturation

If the constraints of the precache are impossible to satisfy (e.g., trying to fetch a data object that's bigger than the precache), then a 507 Insufficient Storage response will be returned.

ETA Responses

An ETA response indicates when data may be available. It will have media type application/vnd.irobot.eta. This will have an empty content body (i.e., content length of 0 bytes) and, if it can be calculated, a response header iRobot-ETA containing an ISO8601 UTC timestamp and an indication of confidence (in whole seconds) of when the data will be available. For example:

iRobot-ETA: 2017-09-25T12:34:56Z+0000 +/- 123

A client may choose to use this information to inform the rate at which it reissues requests.
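A hedged sketch of that idea: parse the header in the form shown in the example above, then wait until the pessimistic end of the confidence window. The names `parse_eta` and `retry_delay`, and the retry policy itself, are assumptions rather than anything iRobot prescribes. The header is only present when an ETA can be calculated, so a caller needs a fallback delay for when it is absent.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Matches values such as "2017-09-25T12:34:56Z+0000 +/- 123"
_ETA = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z(?:\+0000)?\s*\+/-\s*(\d+)$"
)


def parse_eta(value: str) -> Tuple[datetime, timedelta]:
    """Return the estimated time (UTC) and the confidence interval."""
    match = _ETA.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised iRobot-ETA value: {value!r}")
    when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return when.replace(tzinfo=timezone.utc), timedelta(seconds=int(match.group(2)))


def retry_delay(value: str, now: datetime, minimum: float = 1.0) -> float:
    """Seconds to wait before reissuing the request (hypothetical policy)."""
    when, confidence = parse_eta(value)
    # Wait for the late end of the window, but never less than the minimum
    return max(minimum, (when + confidence - now).total_seconds())
```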

Metadata Response

When fetching data object metadata, the response will be of media type application/vnd.irobot.metadata+json: A JSON object with the following keys:

AVUs are JSON objects with the following keys:

POST

Seed the precache with the data object and its metadata, and calculate checksums; thus warranting its title of “precache”!

Status | Semantics
:----: | :--------
201 | Seeded the precache with data object
202 | Seed the precache with data object; ETA returned, if possible
401 | Authentication failure
403 | Access denied to iRobot iRODS user
404 | No such data object on iRODS
405 | Method not allowed (only GET, HEAD, POST, DELETE and OPTIONS are supported)
409 | Inflight or contended data object could not be refetched
502 | An invalid operation occurred while interacting with iRODS
504 | Response timeout
507 | Precache full

Note that if the data object is already in the precache, this action will forcibly refetch it, provided the filesystem metadata (file size, checksum and timestamps) has changed and the precached data object is not currently inflight or contended. That is, it is not being fetched from iRODS or being pushed by iRobot to a connected client.
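A seeding request could be prepared as in the sketch below, which only constructs the request and leaves sending it (for example with `urllib.request.urlopen`) to the caller. The names `seed_request` and `SEED_OUTCOMES` are hypothetical; the outcomes restate the success and conflict rows of the status table for this method.

```python
from urllib.request import Request

# Selected POST outcomes, taken from the status table for this method
SEED_OUTCOMES = {
    201: "seeded",
    202: "seeding in progress",
    409: "inflight or contended; could not be refetched",
}


def seed_request(url: str, authorization: str) -> Request:
    """Prepare a bodiless POST that asks iRobot to seed the precache."""
    return Request(url, method="POST", headers={"Authorization": authorization})
```

A 202 outcome may carry an ETA, so a client can poll with HEAD until the data object is available rather than reissuing the POST.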

DELETE

Delete a data object and its associated metadata from the precache. This does not delete data from iRODS and is only for precache management; it should be used sparingly – in exceptional circumstances – as the precache is designed to manage itself automatically.

Status | Semantics
:----: | :--------
204 | Data object removed from precache
401 | Authentication failure
404 | No such data object in precache
405 | Method not allowed (only GET, HEAD, POST, DELETE and OPTIONS are supported)
409 | Inflight or contended data object could not be deleted from the precache
504 | Response timeout

A data object can only be deleted from the precache if it is currently not inflight or contended.

Administrative Endpoints

Administrative endpoints are exposed at the root; they have a higher priority in the routing tree than the data object endpoints, but should never mask data objects as they cannot be contained within the iRODS “root collection”. Only GET and HEAD requests can be made to these endpoints, which can return the following:

Status | Semantics
:----: | :--------
200 | Return the administrative data
401 | Authentication failure
405 | Method not allowed (only GET, HEAD and OPTIONS are supported)
406 | Unsupported requested media type
504 | Response timeout

Administrative endpoints will only ever return application/json. If the Accept request header diverges from this, a 406 Not Acceptable response will be returned.

/status

iRobot's current state:

/config

iRobot's current configuration, as a JSON object.

/manifest

An overview of the contents of the precache. This will return a JSON array of objects of the following form:

Error Responses

All 400- and 500-series errors (i.e., client and server errors, respectively) will be returned as application/json. The response body will be a JSON object with three elements: status, containing the HTTP status code; reason, containing the HTTP status reason; and description, containing a human-readable description of the problem.
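A client might decode such a body as sketched here. The three keys are those named above; the type name `IRobotError`, the function `parse_error` and the sample values in the usage note are hypothetical.

```python
import json
from typing import NamedTuple


class IRobotError(NamedTuple):
    status: int
    reason: str
    description: str


def parse_error(body: bytes) -> IRobotError:
    """Decode the JSON body of a 400- or 500-series response."""
    payload = json.loads(body)
    # A KeyError here means the body did not have the documented shape
    return IRobotError(
        int(payload["status"]), payload["reason"], payload["description"]
    )
```

For instance, a body of `{"status": 404, "reason": "Not Found", "description": "No such data object"}` would decode to an `IRobotError` whose `status` is `404`.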

