Name: build-knowledge-base-with-domain-specific-documents
Owner: International Business Machines
Description: Create a knowledge base using domain specific documents and the mammoth python library
Created: 2018-04-03 16:45:35.0
Updated: 2018-05-20 16:57:38.0
Pushed: 2018-05-20 16:57:37.0
Homepage: null
Size: 941
Language: Jupyter Notebook
Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.
In any business, Word documents are a common occurrence. They contain information in the form of raw text, tables, and images, all of which hold important facts. The data used in this code pattern comes from two Wikipedia articles: the first is taken from the Wikipedia page of the oncologist Suresh H. Advani, and the second is from the Wikipedia page about Oncology. These files are zipped up as Archive.zip.
In the figure below, a Word document contains textual information about the oncologist Suresh H. Advani; the table lists the awards he has received from various organisations.
In this code pattern, we address the problem of extracting knowledge from the text and tables in Word documents. A knowledge graph is built from the extracted knowledge, making it queryable.
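To make "queryable" concrete, here is a minimal sketch of how extracted facts can be stored and queried as a graph. The triples and the use of `networkx` are illustrative assumptions, not the notebook's exact code:

```python
import networkx as nx

# Hypothetical (subject, relation, object) triples of the kind
# this pattern extracts from the documents.
triples = [
    ("Suresh H. Advani", "is a", "oncologist"),
    ("Suresh H. Advani", "received", "Padma Shri"),
    ("Oncology", "is a branch of", "medicine"),
]

# Store each triple as a directed edge labeled with its relation.
graph = nx.DiGraph()
for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)

# Query the graph: everything we know about Suresh H. Advani.
for _, obj, data in graph.edges("Suresh H. Advani", data=True):
    print(f"Suresh H. Advani {data['relation']} {obj}")
```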
Extracting knowledge from Word documents poses several challenges. This pattern uses the following methodology to overcome them:
- The Python package `mammoth` is used to convert `.docx` files to HTML (a semi-structured format); a minimal sketch of this step follows the list.
- The best of both worlds: a combined training-based and rules-based approach is used to extract knowledge out of the documents.
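As a minimal sketch of the conversion step (the file name is a hypothetical placeholder):

```python
import mammoth

# Convert a .docx file to HTML; tables in the document come through
# as <table> markup that can be parsed alongside the running text.
with open("sample.docx", "rb") as docx_file:  # placeholder file name
    result = mammoth.convert_to_html(docx_file)

html = result.value                # the generated HTML
for message in result.messages:    # warnings raised during conversion
    print(message)
```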
In this pattern we demonstrate how to extract knowledge from the text and tables in Word documents and represent it as a queryable knowledge graph. What makes this code pattern valuable is that it helps developers and data scientists give structure to unstructured data; this can shape their analysis significantly and allows the data to be processed further for better insights.
IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market.
Watson Natural Language Understanding: An IBM Cloud service that can analyze text to extract metadata from content, such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles, using natural language understanding.
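For reference, a sketch of calling the service with the current `ibm-watson` Python SDK; the notebook itself may use an older SDK, and the credentials below are placeholders:

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, RelationsOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Use the credentials of your wdc-NLU-service instance here.
nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url("YOUR_SERVICE_URL")

# Ask for the entities and relations that the knowledge graph is built from.
response = nlu.analyze(
    text="Suresh H. Advani is an oncologist who received the Padma Shri.",
    features=Features(entities=EntitiesOptions(),
                      relations=RelationsOptions())).get_result()

print(response["entities"])
print(response["relations"])
```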
Follow these steps to set up and run this code pattern. The steps are described in detail below.
Create the following IBM Cloud service and name it `wdc-NLU-service`: Watson Natural Language Understanding.
1. Log in or sign up for IBM's Watson Studio.
2. Select the `New Project` option from the Watson Studio landing page and choose the `Jupyter Notebooks` option.
3. Create a new `Cloud Object Storage` service or select an existing one from your IBM Cloud account.
4. Note the `Assets` and `Settings` tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
5. From the `Assets` tab, click the `+ New notebook` button.
6. Use the `From URL` tab to specify the URL to the notebook in this repository.
7. Click the `Create` button.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:

- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- A `*`, which indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:
- One cell at a time: select the cell, and then press the `Play` button in the toolbar.
- Batch mode, in sequential order: from the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the first cell under the currently selected cell and then continue executing all cells that follow.
- At a scheduled time: press the `Schedule` button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
To add the data and configuration files to your project:

1. From the `My Projects > Default` page, use `Find and Add Data` (look for the `10/01` icon) and its `Files` tab.
2. Click `browse` and navigate to Archive.zip.
3. Click `browse` and navigate to config_relations.txt.
4. Click `browse` and navigate to config_classification.txt.

Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure to conform to the JSON structure given in data/config_classification.txt.
Under the `File` menu, there are several ways to save your notebook:

- `Save` will simply save the current state of your notebook, without any version information.
- `Save Version` will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the `Revert To Version` menu item.
You can share your notebook by selecting the `Share` button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a "read-only" version of your notebook. You have several options to specify exactly what you want shared from your notebook:

- `Only text and output`: will remove all code cells from the notebook view.
- `All content excluding sensitive code cells`: will remove any code cells that contain a sensitive tag. For example, `# @hidden_cell` is used to protect your dashDB credentials from being shared (see the example below).
- `All content, including code`: displays the notebook as is.

A variety of `download as` options are also available in the menu.
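For example, a cell that starts with the sensitive tag is dropped from shared views; everything in this sketch is a placeholder:

```python
# @hidden_cell
# This cell is excluded when sharing with
# "All content excluding sensitive code cells".
credentials = {
    "apikey": "YOUR_API_KEY",   # placeholder, not a real key
    "url": "YOUR_SERVICE_URL",  # placeholder
}
```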
In the Process section of the notebook, the files are loaded. First the configuration files (config_classification.txt and config_relations.txt) are loaded. The unstructured information is extracted using the Python package `mammoth`, which converts the .docx files to HTML, from which the text in tables is analysed along with the free-floating text. The results from Watson NLU are analyzed and augmented using the configuration files: the entities are augmented using config_classification.txt, and the relationships are augmented using config_relations.txt. The results are then filtered and formatted to keep the relevant relations and discard those that are not. The filtered relationships are sent to the draw graph function in the notebook, which constructs the knowledge graph. A condensed sketch of this flow follows.
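A condensed sketch of that flow, assuming `BeautifulSoup` for pulling text out of the HTML; the helper name `augment_and_filter` stands in for the notebook's own augmentation and filtering logic and is hypothetical:

```python
import mammoth
from bs4 import BeautifulSoup

# 1. Convert the .docx file to HTML so tables survive as structure.
with open("sample.docx", "rb") as docx_file:    # placeholder file name
    html = mammoth.convert_to_html(docx_file).value

# 2. Extract the text, table cells included, for Watson NLU.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ")

# 3. Analyze `text` with Watson NLU (see the earlier sketch), then
#    augment the entities with config_classification.txt and the
#    relations with config_relations.txt, and filter out irrelevant
#    relations -- augment_and_filter is a hypothetical helper:
# triples = augment_and_filter(nlu_results, classification_rules, relation_rules)

# 4. Feed the filtered triples to the notebook's graph-drawing step,
#    e.g. a networkx graph as in the earlier sketch.
```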