IBM/build-knowledge-base-with-domain-specific-documents

Name: build-knowledge-base-with-domain-specific-documents

Owner: International Business Machines

Description: Create a knowledge base using domain specific documents and the mammoth python library

Created: 2018-04-03 16:45:35.0

Updated: 2018-05-20 16:57:38.0

Pushed: 2018-05-20 16:57:37.0

Homepage: null

Size: 941

Language: Jupyter Notebook


README

Build-knowledge-base-with-domain-specific-documents

Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.

In any business, Word documents are a common occurrence. They contain information in the form of raw text, tables, and images, all of which hold important facts. The data used in this code pattern comes from two Wikipedia articles: the first is taken from the Wikipedia page of oncologist Suresh H. Advani, and the second is from the Wikipedia page about Oncology. These files are zipped up as Archive.zip.

In the figure below, there is textual information about the oncologist Suresh H. Advani in a Word document. The table lists the awards conferred on him by various organizations.

In this Code Pattern, we address the problem of extracting knowledge from the text and tables in Word documents. A knowledge graph is built from the extracted knowledge, making it queryable.

Some of the challenges in extracting knowledge from word documents are:

  1. Natural Language Processing (NLP) tools cannot directly access the text inside Word documents. The Word documents need to be converted to plain text files first (see the conversion sketch below).
  2. Business and domain experts understand the keywords and entities that are present in the documents, but training an NLP tool to extract domain-specific keywords and entities is a big effort. It is also impractical in many scenarios to find a sufficient number of documents to train the NLP tool to process the text.
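As a minimal sketch of the conversion step mentioned in challenge 1, the snippet below uses the mammoth package to pull both raw text and table-preserving HTML out of a .docx file. The file path is illustrative, not part of the repository.

```python
import mammoth

# Extract the free-floating text only. The path below is illustrative.
with open("data/sample.docx", "rb") as docx_file:
    plain_text = mammoth.extract_raw_text(docx_file).value

# Convert to HTML, which keeps <table> markup so tables can be analyzed too.
with open("data/sample.docx", "rb") as docx_file:
    html = mammoth.convert_to_html(docx_file).value

print(plain_text[:200])
print(html[:200])
```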

This pattern uses the following methodology to overcome these challenges: the best of both worlds, a combination of a training-based and a rules-based approach, is used to extract knowledge out of the documents.
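As a rough illustration of that combined approach, the sketch below merges entities returned by a trained NLU model with entities found by a simple dictionary lookup. The keyword list, entity types, and function name are hypothetical stand-ins for the pattern's actual configuration files.

```python
# Purely illustrative: these domain keywords and types are hypothetical
# stand-ins for the rules defined in the pattern's configuration files.
DOMAIN_KEYWORDS = {"chemotherapy": "Treatment", "oncologist": "Profession"}

def augment_entities(nlu_entities, text):
    """Merge NLU-detected entities with rule-based dictionary matches."""
    entities = {e["text"].lower(): e["type"] for e in nlu_entities}
    for keyword, entity_type in DOMAIN_KEYWORDS.items():
        # The rules pass catches domain terms the trained model may miss.
        if keyword in text.lower():
            entities.setdefault(keyword, entity_type)
    return entities
```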

What makes this Code Pattern valuable:

This Code Pattern is intended to help developers and data scientists give structure to unstructured data, which can significantly shape their analysis and allow the data to be processed further for better insights. In this pattern we will demonstrate the following steps:

  1. The unstructured text data from the docx files (HTML tables and free-floating text) that needs to be analyzed and correlated is extracted from the documents using Python code.
  2. The text is classified using Watson NLU and also tagged using the code pattern Extend Watson text classification (a minimal NLU call is sketched after this list).
  3. The text is correlated with other text using the code pattern Correlate documents.
  4. The results are filtered using Python code.
  5. The knowledge graph is constructed.
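As referenced in step 2, here is a minimal sketch of a Watson NLU call, written against the current ibm-watson Python SDK (the 2018-era notebook used the older watson_developer_cloud package). The API key, service URL, and input text are placeholders.

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, RelationsOptions,
)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Credentials come from the wdc-NLU-service instance created in step 1.
authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2022-04-07",
                                     authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

# Ask NLU for entities and relations in the extracted plain text.
response = nlu.analyze(
    text="Suresh H. Advani is an Indian oncologist.",
    features=Features(entities=EntitiesOptions(), relations=RelationsOptions()),
).get_result()

for entity in response.get("entities", []):
    print(entity["type"], "->", entity["text"])
```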
Video

Included components
Featured technologies

Steps

Follow these steps to set up and run this code pattern. The steps are described in detail below.

  1. Create IBM Cloud services
  2. Run using a Jupyter notebook in the IBM Watson Studio
  3. Analyze the results
1. Create IBM Cloud services

Create the following IBM Cloud service and name it wdc-NLU-service: Watson Natural Language Understanding.

2. Run using a Jupyter notebook in the IBM Watson Studio
  1. Create a new Watson Studio project
  2. Create the notebook
  3. Run the notebook
  4. Upload data
  5. Save and Share
2.1 Create a new Watson Studio project

2.2 Create the notebook

2.3 Run the notebook

When a notebook is executed, each code cell in the notebook is run, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be blank (the cell has never been executed), a number (the relative order in which the cell was executed), or a * (the cell is currently executing).

There are several ways to execute the code cells in your notebook:

2.4 Upload data
Upload the data and configuration to the notebook

Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure it conforms to the JSON structure given in data/config_classification.txt.
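A small sketch of loading and inspecting that configuration follows. The file path comes from the repository, but the schema itself should be taken from the shipped file rather than from this snippet.

```python
import json

# Load the classification configuration shipped with the code pattern.
# The authoritative schema is the file itself (data/config_classification.txt);
# a custom configuration must mirror that structure.
with open("data/config_classification.txt") as f:
    config_classification = json.load(f)

# Print the top-level keys to verify a custom file matches the expected shape.
print(list(config_classification.keys()))
```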

2.5 Save and Share
How to save your work:

Under the File menu, there are several ways to save your notebook:

How to share your work:

You can share your notebook by selecting the "Share" button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a "read-only" version of your notebook. You have several options to specify exactly what you want shared from your notebook:

3. Analyze the results

In the Process section of the notebook, the files are loaded. First the configuration files (config_classification.txt and config_relations.txt) are loaded. The unstructured information is extracted using the Python package mammoth. Mammoth converts the docx files to HTML, from which the text in tables is analyzed along with the free-floating text. The results from Watson NLU are analyzed and augmented using the configuration files: the entities are augmented using config_classification.txt and the relationships are augmented using config_relations.txt. The results are then filtered and formatted to pick up the relevant relations and discard those that are not relevant. The filtered relationships are sent to the draw graph function in the notebook, which constructs the knowledge graph.
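As a minimal sketch of that last step, the function below builds and renders a directed graph from (subject, relation, object) triples using networkx. The function name and example triples are illustrative, not the notebook's actual code.

```python
import networkx as nx
import matplotlib.pyplot as plt

def draw_graph(triples):
    """Build and render a knowledge graph from (subject, relation, object) triples."""
    graph = nx.DiGraph()
    for subject, relation, obj in triples:
        # Each filtered relationship becomes a labeled directed edge.
        graph.add_edge(subject, obj, label=relation)

    pos = nx.spring_layout(graph, seed=42)
    nx.draw(graph, pos, with_labels=True, node_color="lightblue", node_size=2000)
    nx.draw_networkx_edge_labels(
        graph, pos, edge_labels=nx.get_edge_attributes(graph, "label")
    )
    plt.show()

# Illustrative triples of the kind the filtering step might produce.
draw_graph([
    ("Suresh H. Advani", "specializes in", "Oncology"),
    ("Suresh H. Advani", "awarded", "Padma Bhushan"),
])
```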

Learn more

Troubleshooting

See DEBUGGING.md.

License

Apache 2.0

