IBM/watson-document-classifier

Name: watson-document-classifier

Owner: International Business Machines

Description: Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, uses Watson Studio.

Created: 2017-07-06 18:38:40.0

Updated: 2018-05-18 19:19:12.0

Pushed: 2018-03-22 01:40:57.0

Homepage: https://developer.ibm.com/code/patterns/extend-watson-text-classification/

Size: 1150

Language: Jupyter Notebook

README

Augmented Classification of text with Watson Natural Language Understanding and IBM Data Science Experience

Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.

In this code pattern, we will use Jupyter notebooks in Watson Studio to augment the output of the IBM Watson Natural Language Understanding API through a configurable mechanism for text classification.

When the reader has completed this code pattern, they will understand how to:

This code pattern is intended for developers who want to learn a method for augmenting the classification metadata obtained from the Watson Natural Language Understanding API when historical data is scarce. In such situations, the traditional approach of training a Text Analytics model yields poorer results than expected. What distinguishes this code pattern is its configurable mechanism for text classification, which gives a developer a head start on text from a specialized domain for which no generally available English parser exists.

Included components
Featured technologies

Watch the Video

Steps

Follow these steps to set up and run this code pattern. The steps are described in detail below.

  1. Sign up for Watson Studio
  2. Create IBM Cloud services
  3. Create the notebook
  4. Add the data and configuration file
  5. Update the notebook with service credentials
  6. Run the notebook
  7. Download the results
  8. Analyze the results

1. Sign up for Watson Studio

Sign up for IBM's Watson Studio. When you create a project in Watson Studio, a free-tier Object Storage service is created in your IBM Cloud account. Take note of your service names, as you will need to select them in the following steps.

Note: When creating your Object Storage service, select the Free storage type in order to avoid having to pay an upgrade fee.

2. Create IBM Cloud services

Create a Watson Natural Language Understanding service in IBM Cloud and name it wdc-NLU-service.

3. Create the notebook

4. Add the data and configuration file
Add the data and configuration to the notebook

Note: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure to conform to the JSON structure given in configuration/sample_config.txt.
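Before uploading your own configuration file, it can help to confirm that it at least parses as JSON. The following is a minimal sketch of such a check, not part of the notebook: the function name check_config_text is hypothetical, and the check covers JSON syntax only. It does not verify that the file follows the structure in configuration/sample_config.txt.

```python
import json


def check_config_text(config_text):
    """Parse configuration text and return the result.

    Raises ValueError with the line and column of the first syntax error.
    This checks JSON syntax only, not the structure the notebook expects.
    """
    try:
        return json.loads(config_text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"configuration is not valid JSON "
            f"(line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err


# A trailing comma is a common mistake that makes a file invalid JSON.
try:
    check_config_text('{"a": 1,}')
    trailing_comma_ok = True
except ValueError:
    trailing_comma_ok = False
```

To check a local file, read it with open(path, encoding="utf-8").read() and pass the text to the function.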

Fix-up file names for your own data and configuration files

If you use your own data and configuration files, you will need to update the variables that refer to the data and configuration files in the Jupyter Notebook.

In the notebook, update the global variables in the cell following the 2.3 Global Variables section.

Set sampleTextFileName to the name of your data file and sampleConfigFileName to the name of your configuration file.
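As an illustration, the updated cell might look like the sketch below. The two variable names come from the notebook; the file names assigned to them are placeholders for whatever files you added to the project, and the rest of the cell is not shown.

```python
# Placeholder file names (assumptions): use the names of the files
# you added to your Watson Studio project.
sampleTextFileName = "my_domain_text.txt"
sampleConfigFileName = "my_config.txt"
```

The names must match the uploaded file names exactly, including case and extension.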

5. Update the notebook with service credentials
Add the Watson Natural Language Understanding credentials to the notebook

Select the cell below the 2.1 Add your service credentials from IBM Cloud for the Watson services section in the notebook to update the credentials for Watson Natural Language Understanding.

Open the Watson Natural Language Understanding service in your IBM Cloud Dashboard and click on your service, which you should have named wdc-NLU-service.

Once the service is open, click the Service Credentials menu on the left.

In the Service Credentials view that opens, select the credentials you would like to use in the notebook from the KEY NAME column. Click View credentials and copy the username and password key values, which are displayed in JSON format.

Update the username and password key values in the cell below the 2.1 Add your service credentials from IBM Cloud for the Watson services section.
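If you prefer to paste the whole JSON shown by View credentials rather than copying each value by hand, a small helper can pull out the two keys. This is a sketch under assumptions, not the notebook's own code: the function name extract_nlu_credentials is hypothetical, the url key in the sample is illustrative, and every value shown is a placeholder.

```python
import json


def extract_nlu_credentials(credentials_json):
    """Return the username and password from a credentials JSON string.

    Any other keys in the JSON are ignored. Raises ValueError if either
    key is missing or empty.
    """
    creds = json.loads(credentials_json)
    missing = [key for key in ("username", "password") if not creds.get(key)]
    if missing:
        raise ValueError("missing credential keys: " + ", ".join(missing))
    return {"username": creds["username"], "password": creds["password"]}


# Placeholder values only; never commit real credentials to a repository.
example_json = (
    '{"url": "https://example.invalid/api", '
    '"username": "my-username", "password": "my-password"}'
)
nlu_credentials = extract_nlu_credentials(example_json)
```

The resulting values are what you would enter for the username and password keys in the credentials cell.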

Add the Object Storage credentials to the notebook

6. Run the notebook

When a notebook is executed, each code cell in the notebook is executed in order, from top to bottom.

IMPORTANT: The first time you run your notebook, you will need to install the necessary packages in section 1.1 and then Restart the kernel.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

There are several ways to execute the code cells in your notebook:

7. Download the results

8. Analyze the results

After you run each cell of the notebook under Classify text, the results are displayed.

The configuration JSON controls the way the text is classified. The classification process is divided into two stages: Base Tagging and Domain Tagging. The Base Tagging stage can be used to specify keyword-based classification, regular-expression-based classification, and tagging based on chunking expressions. The Domain Tagging stage can be used to specify classification that is specific to the domain, in order to augment the results from Watson Natural Language Understanding.

We can modify the configuration JSON to add more keywords or regular expressions. In this way, we can augment the text classification without any changes to the code. If required, we can also add more stages to the configuration JSON and, with corresponding code modifications, further enhance the text classification results.
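The idea of configuration-driven tagging can be sketched in a few lines of Python. The example below is illustrative only and makes several assumptions: the configuration layout (stages, steps, type, tag, keywords, pattern), the tag names, and the classify function are all hypothetical and do not reproduce the structure in configuration/sample_config.txt or the notebook's code. It covers keyword and regular expression steps and omits chunking expressions and the Watson Natural Language Understanding call.

```python
import re

# Hypothetical configuration layout, for illustration only.
config = {
    "stages": [
        {
            "name": "Base Tagging",
            "steps": [
                {"type": "keywords", "tag": "VEHICLE", "keywords": ["car", "truck"]},
                {"type": "regex", "tag": "DATE", "pattern": r"\b\d{4}-\d{2}-\d{2}\b"},
            ],
        }
    ]
}


def classify(text, config):
    """Return (stage name, tag, matched text) tuples, in configuration order."""
    tags = []
    for stage in config["stages"]:
        for step in stage["steps"]:
            if step["type"] == "keywords":
                # Match any listed keyword as a whole word, ignoring case.
                alternatives = "|".join(re.escape(k) for k in step["keywords"])
                pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
            elif step["type"] == "regex":
                pattern = re.compile(step["pattern"])
            else:
                continue  # Unknown step types are skipped in this sketch.
            for match in pattern.finditer(text):
                tags.append((stage["name"], step["tag"], match.group(0)))
    return tags


text = "The truck and the van were delivered on 2018-03-22."
before = classify(text, config)

# Adding a keyword changes the result without touching classify().
config["stages"][0]["steps"][0]["keywords"].append("van")
after = classify(text, config)
```

In this sketch, before contains the truck and date matches, and after also contains the van match, which mirrors how editing the configuration alone extends the classification.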

The classification results show that the keywords and regular expressions specified in the configuration are correctly tagged in the analyzed text that is displayed.

Troubleshooting

See DEBUGGING.md.

License

Apache 2.0


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.