newsdev/archieml-ruby

Name: archieml-ruby

Owner: NYT Newsroom Developers

Description: Ruby parser for the Archie Markup Language (ArchieML)

Created: 2015-03-05 20:16:20.0

Updated: 2017-06-29 22:22:45.0

Pushed: 2016-07-11 23:05:12.0

Homepage: null

Size: 28

Language: Ruby

README

Archieml

Parse Archie Markup Language (ArchieML) documents into Ruby Hashes.

Read about the ArchieML specification at archieml.org.

The current version is v0.3.0.

Installation

gem install archieml

Usage

require 'archieml'

Archieml.load("key: value")
# => {"key"=>"value"}

File.write("text.aml", "key: value")
Archieml.load_file("text.aml")
# => {"key"=>"value"}

Using with Google Documents

We use archieml at The New York Times to parse Google Documents containing AML. This requires a little upfront work to download the document and convert it into text that archieml can load.

The first step is authenticating with the Google Drive API, and accessing the document. For this, you will need a user account that is authorized to view the document you wish to download.

For this example, I'm going to use the official google-api-client Ruby gem, but you can use another library or authentication method if you like. Whatever mechanism you choose, you'll need to be able to export the document as either text or HTML; from that point on, the instructions are identical.

The full example is at examples/google_drive.rb.

First, install the gem directly, or using a Gemfile:

gem install google-api-client

Next, open up irb and run the following code to authorize a user and initialize an OAuth client. Note that if you want to use this on a server, you'll have to set up a more reusable way of authorizing users.

require 'google/api_client'
require 'google/api_client/client_secrets'
require 'google/api_client/auth/installed_app'

client = Google::APIClient.new(:application_name => 'Ruby Drive sample', :application_version => '1.0.0')
client_secrets = Google::APIClient::ClientSecrets.load
flow = Google::APIClient::InstalledAppFlow.new(
  :client_id => client_secrets.client_id,
  :client_secret => client_secrets.client_secret,
  :scope => ['https://www.googleapis.com/auth/drive']
)
client.authorization = flow.authorize

Log into your Google account and authorize the application to access your Google Drive files.

Now that you have an authenticated client, you can make an API call to a document saved in Drive. Create a document with some basic AML inside (such as “key: value”), save it, and note the long string of characters at the end of the URL:

https://docs.google.com/a/nytimes.com/document/d/[FILE_ID]/edit

In this example, FILE_ID defaults to a public test file.

FILE_ID = "1JjYD90DyoaBuRYNxa4_nqrHKkgZf1HrUj30i3rTWX1s"
drive = client.discovered_api('drive', 'v2')

result = client.execute(
  :api_method => drive.files.get,
  :parameters => { 'fileId' => FILE_ID })
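If you would rather paste the whole document URL than copy the ID by hand, the ID can be pulled out with a regular expression. This is a minimal sketch: extract_file_id is a hypothetical helper, not part of archieml or google-api-client, and it assumes the URL contains a /document/d/[FILE_ID] segment like the one shown above.

```ruby
# Hypothetical helper (not part of archieml or google-api-client):
# returns the file ID from a Google Docs URL, or nil if none is found.
def extract_file_id(url)
  # Capture the path segment that follows "/document/d/".
  match = url.match(%r{/document/d/([^/?#]+)})
  match && match[1]
end

doc_url = "https://docs.google.com/a/nytimes.com/document/d/" \
          "1JjYD90DyoaBuRYNxa4_nqrHKkgZf1HrUj30i3rTWX1s/edit"

extract_file_id(doc_url)
# => "1JjYD90DyoaBuRYNxa4_nqrHKkgZf1HrUj30i3rTWX1s"
```

Returning nil for an unrecognized URL lets the caller decide whether to fall back to a default ID or stop with an error.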

If the request succeeds, result should now hold the file's metadata. The next step is to download the body of the file. The metadata has a property called exportLinks, which maps each format the document can be exported as to a download URL. Let's start with text/plain.

text_url = result.data['exportLinks']['text/plain']
text_aml = client.execute(uri: text_url).body
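Because exportLinks is keyed by format, looking up a format the document does not offer simply yields nil, and the failure only surfaces later when the download is attempted. The sketch below shows one way to fail earlier with a clearer message. pick_export_url is a hypothetical helper, and sample_links is a stand-in for result.data['exportLinks'] with placeholder URLs; it assumes only that the links behave like a Hash keyed by MIME type.

```ruby
# Hypothetical helper: returns the first export URL available for the
# given MIME types, tried in order, and raises if none is present.
def pick_export_url(export_links, preferred)
  preferred.each do |mime_type|
    url = export_links[mime_type]
    return url if url
  end
  raise ArgumentError, "no export link for any of: #{preferred.join(', ')}"
end

# Stand-in for result.data['exportLinks']; these URLs are placeholders.
sample_links = {
  'text/plain'      => 'https://example.com/export?format=txt',
  'application/pdf' => 'https://example.com/export?format=pdf'
}

pick_export_url(sample_links, ['text/html', 'text/plain'])
# => "https://example.com/export?format=txt"
```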

text_aml should now contain your document in plain text! You're all set to run the text through the ArchieML parser.

require 'archieml'
parsed = Archieml.load(text_aml)

Inspect parsed and make sure it contains the data you entered into the document.

We take a few extra steps to make working with Google Documents more useful. With a little more prep, we generally process the documents to preserve links, keep bullet points working, and remove smart quotes.

Unfortunately, Google strips out links when you export as text/plain, so if you want to preserve them, you have to export the document in a different format, text/html.

html_url = result.data['exportLinks']['text/html']
html_data = client.execute(uri: html_url).body

At the other extreme, html_data now contains far too much data - there's a whole DOM represented in that text! We want to turn that HTML body back into plain text so that ArchieML can load it, and we want to preserve any links that we find.

The following is a lightweight DOM traverser that requires the nokogiri gem (gem install nokogiri). It moves through the HTML document and constructs a simple text representation of it, leaving out things like images and tables that AML would ignore anyway.

require 'cgi'
require 'nokogiri'

# Recursively converts a node's children into plain text, using the
# handlers in @node_types and skipping any element without a handler.
def convert(node)
  str = ''
  node.children.each do |child|
    if func = @node_types[child.name || child.type]
      str += func.call(child)
    end
  end
  return str
end

@node_types = {
  'text' => lambda { |node| return node.content },
  'span' => lambda { |node| convert(node) },
  'p'    => lambda { |node| return convert(node) + "\n" },
  'li'   => lambda { |node| return '* ' + convert(node) + "\n" },
  'a'    => lambda { |node|
    return convert(node) unless node.attributes['href'] && node.attributes['href'].value

    # Google changes all links to be served from a google domain.
    # We need to strip off the real url, which has been moved to the
    # "q" querystring parameter.

    href = node.attributes['href'].value
    if !href.index('?').nil? && parsed_url = CGI.parse(href.split('?', 2)[1])
      # CGI.parse returns an empty array for missing keys, so check for the key itself.
      href = parsed_url['q'][0] if parsed_url.key?('q')
    end

    str = "<a href=\"#{href}\">"
    str += convert(node)
    str += "</a>"
    return str
  }
}

%w(ul ol).each { |tag| @node_types[tag] = @node_types['span'] }
%w(h1 h2 h3 h4 h5 h6 br hr).each { |tag| @node_types[tag] = @node_types['p'] }

html_doc = Nokogiri::HTML(html_data)
html_aml = convert(html_doc.children[1].children[1])

require 'archieml'
aml = Archieml.load(html_aml)
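The link-unwrapping step can also be tried on its own, without Nokogiri. The sketch below is an alternative to the CGI.parse approach, using URI from the standard library. unwrap_google_link is a hypothetical helper, and the redirect URL in the example is an assumed illustration of the shape Google produces, with the real destination in the "q" parameter as described in the comments above.

```ruby
require 'uri'

# Hypothetical helper: returns the destination stored in the "q" query
# parameter of a rewritten link. Links without a usable "q" parameter,
# or that cannot be parsed, are returned unchanged.
def unwrap_google_link(href)
  query = URI.parse(href).query
  return href if query.nil?

  # decode_www_form yields [key, value] pairs with percent-escapes decoded.
  pair = URI.decode_www_form(query).assoc('q')
  pair && !pair[1].empty? ? pair[1] : href
rescue URI::InvalidURIError, ArgumentError
  href
end

# Assumed example of a rewritten link.
unwrap_google_link('https://www.google.com/url?q=https://www.nytimes.com/&sa=D')
# => "https://www.nytimes.com/"

unwrap_google_link('https://www.nytimes.com/')
# => "https://www.nytimes.com/"
```

Falling back to the original href keeps ordinary links intact, so the helper is safe to apply to every anchor in the document.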

aml should now have your document with links included, and bullet points should continue to work (we transformed each <li> element into a separate line beginning with a *).

One additional step we perform is removing smart quotes. You can run html_aml through this before calling Archieml.load:

html_aml.gsub!(/<[^<>]*>/) do |match|
  match.gsub("‘", "'")
       .gsub("’", "'")
       .gsub("“", '"')
       .gsub("”", '"')
end

aml = Archieml.load(html_aml)
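To see what this substitution does and does not touch, it helps to run it on a small sample. The sketch below wraps the same pattern in a hypothetical helper, straighten_tag_quotes, and uses String#tr in place of the chained gsub calls. The sample string is made up for illustration.

```ruby
# Hypothetical wrapper around the substitution above: straightens curly
# quotes inside anything that looks like a tag and leaves other text alone.
def straighten_tag_quotes(text)
  text.gsub(/<[^<>]*>/) do |tag|
    # tr maps each curly quote to the straight quote in the same position.
    tag.tr("\u2018\u2019\u201C\u201D", %q(''""))
  end
end

sample = "<a href=\u201Chttp://example.com\u201D>\u201CQuoted\u201D text</a>"

straighten_tag_quotes(sample)
# => "<a href=\"http://example.com\">“Quoted” text</a>"
```

Only the quotes inside the angle brackets change. The curly quotes around the link text are left as typed.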

Tests

Test examples are stored in a submodule. You may need to run git submodule update --init to fetch them.

There is a full test suite using rspec. Run bundle install, and then rspec to execute it.

Changelog

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.