OHDSI/MedlineXmlToDatabase

Name: MedlineXmlToDatabase

Owner: Observational Health Data Sciences and Informatics

Description: A command line Java application for parsing MEDLINE XML files and inserting the data into a relational database

Created: 2014-10-21 15:46:52.0

Updated: 2017-11-10 02:19:37.0

Pushed: 2017-12-21 08:07:13.0

Homepage:

Size: 9894

Language: Java

README

MedlineXmlToDatabase

This is a Java application for loading MEDLINE XML files into a relational database (currently supporting SQL Server and PostgreSQL). The application was designed with two goals in mind:

  1. Everything in the XML files needs to go into the database*.

  2. Any changes in the XML structure that occur over the years should not require changing the program.

  * Since 2017 the application deviates from the first goal by omitting inline tags in text fields. For example, abstracts can contain <I> and <B> tags, but these are ignored when inserting into the database.
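The omission of inline tags can be illustrated with a small sketch. This is not the application's own code: the class and method names are hypothetical, and it assumes that the text inside an inline tag is kept while the tag itself is dropped.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not part of MedlineXmlToDatabase.
public class InlineTagSketch {

    /** Returns the character data of an XML fragment with all tags dropped. */
    public static String textWithoutInlineTags(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                // Keep text nodes only; start and end tags are skipped.
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<AbstractText>Growth of <I>E. coli</I> was <B>reduced</B>.</AbstractText>";
        // prints: Growth of E. coli was reduced.
        System.out.println(textWithoutInlineTags(xml));
    }
}
```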

The application is run in two phases:

  1. During analysis, the structure and contents of a large set of XML files are analysed, and a database structure is built to accommodate the data. This is typically done only once a year.

  2. During parse, all XML files in a folder are parsed and their contents are inserted into the database. This is typically done every time new XML files are available from MEDLINE.
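As a rough illustration of what an analysis pass could collect, the sketch below walks an XML document and records, for every element path that holds text, the longest value seen. This is a simplified assumption about the kind of information needed to size database columns, not a description of how the application actually derives its tables. The class name, method name, and sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not part of MedlineXmlToDatabase.
public class StructureAnalysisSketch {

    /** Maps each element path containing text to the longest trimmed text length observed. */
    public static Map<String, Integer> maxTextLengthPerPath(String xml) {
        Map<String, Integer> maxLengths = new TreeMap<>();
        List<String> path = new ArrayList<>();          // names of the currently open elements
        List<StringBuilder> texts = new ArrayList<>();  // text collected per open element
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path.add(reader.getLocalName());
                    texts.add(new StringBuilder());
                } else if (event == XMLStreamConstants.CHARACTERS && !texts.isEmpty()) {
                    texts.get(texts.size() - 1).append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    // Build the path key, e.g. "PubmedArticle/PMID", before closing the element.
                    StringBuilder key = new StringBuilder();
                    for (String name : path) {
                        if (key.length() > 0) key.append('/');
                        key.append(name);
                    }
                    int length = texts.remove(texts.size() - 1).toString().trim().length();
                    path.remove(path.size() - 1);
                    if (length > 0) {
                        Integer previous = maxLengths.get(key.toString());
                        if (previous == null || length > previous) {
                            maxLengths.put(key.toString(), length);
                        }
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return maxLengths;
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticle><PMID>12345</PMID>"
                + "<Article><ArticleTitle>A title</ArticleTitle></Article></PubmedArticle>";
        System.out.println(maxTextLengthPerPath(xml));
    }
}
```

Because the structure is derived from the data rather than hard-coded, a new element appearing in a later release would simply show up as a new path, which is in the spirit of the second design goal.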

Note that the application works directly off the gzipped XML files, so there is no need to unzip them.
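Reading gzipped XML without unzipping it first can be done in Java by wrapping the input in a `GZIPInputStream` and handing it to a streaming parser. The following is a minimal sketch of that general technique, not the application's own reader. The class and method names are hypothetical, and the demo compresses a small in-memory document instead of opening a real MEDLINE file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration, not part of MedlineXmlToDatabase.
public class GzipXmlSketch {

    /** Counts elements with the given name, decompressing the stream on the fly. */
    public static int countElements(InputStream gzippedXml, String elementName) {
        try (InputStream in = new GZIPInputStream(gzippedXml)) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Could not read gzipped XML", e);
        }
    }

    /** Helper for the demo: gzips a string in memory. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gz = gzip("<PubmedArticleSet><PubmedArticle/><PubmedArticle/></PubmedArticleSet>");
        System.out.println(countElements(new ByteArrayInputStream(gz), "PubmedArticle"));
    }
}
```

For a file on disk, the same method could be called with a `FileInputStream` over the `.xml.gz` file.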

Features

Technology

This is a pure Java application that can only be used through the command line.

System Requirements

Requires Java 1.7 or higher, and permission to create tables and write data in the target database. Java can be downloaded from http://www.java.com.

Dependencies

Getting Started

  1. Download all xml.gz files from MEDLINE (see http://www.nlm.nih.gov/databases/license/license.html for licensing information)

  2. Create an ini file based on the example in the iniFileExamples folder. It should point to the folder containing the xml.gz files, and to the server and schema where the data should be uploaded.

  3. Under the Releases tab, download MedlineXmlToDatabase*.zip, and unzip the file. Alternatively, you can download the source code and use the included Ant file to build the Jar file.

  4. From the command line, use `java -jar MedlineXmlToDatabase.jar -analyse -ini <path to ini file>` to create the database structure.

  5. From the command line, use `java -jar MedlineXmlToDatabase.jar -parse -ini <path to ini file>` to load the data from the xml files into the database.

Optionally, you can also include the MeSH database:

  1. Download the XML gz files (descxxxx.gz and suppxxxx.gz) from NLM (see https://www.nlm.nih.gov/mesh/download_mesh.html)

  2. Add the path to the gz files to the ini file under `MESH_XML_FOLDER`

  3. From the command line, use `java -jar MedlineXmlToDatabase.jar -parse_mesh -ini <path to ini file>` to load the MeSH data from the XML files into the database.

Getting Involved

License

MedlineXmlToDatabase is licensed under Apache License 2.0

Development

MedlineXmlToDatabase was developed in Eclipse. Contributions are welcome.

Development status

Beta testing

Acknowledgements

Martijn Schuemie is the author of this application.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.