IBM/gdpr-fingerprint-pii

Name: gdpr-fingerprint-pii

Owner: International Business Machines

Description: Use Watson Natural Language Understanding and Watson Knowledge Studio to fingerprint personal data from unstructured documents

Created: 2017-09-26 18:04:24.0

Updated: 2018-05-22 23:18:21.0

Pushed: 2018-02-16 06:04:15.0

Homepage: https://developer.ibm.com/code/patterns/fingerprinting-personal-data-from-unstructured-text

Size: 16901

Language: Java


README

Fingerprinting personal data from unstructured documents


The General Data Protection Regulation (GDPR) is a new EU regulation that comes into effect in May 2018. It applies to any organization, including those outside the EU, that collects and processes personal data, and it aims to give individuals more control over how their personal data is used.

Right to be forgotten - Under GDPR, organizations around the world must not only protect personal data but also erase personal data on request from individuals.

When a customer requests that all his or her personal data be deleted, an organization needs to identify all the documents where that customer's personal data reside. This code pattern addresses the need to identify personal data in the provided documents. As part of the code pattern, we will also see how to assign a confidence score that indicates how reliably the identified personal data pinpoints an individual uniquely.

Let us try to understand this with the example chat transcript below:

Agent: This is Thomas. How can I help you?
Customer: This is Alex. I want to change my plan to corporate plan
Agent: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Customer: yes
Agent: For verification purpose may I know your date of birth and email id
Customer: My data of birth is 10-Aug-1979 and my email id is alex@gmail.com
Agent: Which plan do you want to migrate to
Customer: Plan 450 unlimited
Agent: Can I have your company name and date of joining
Customer: I work for IBM and doj 01-Feb-99
Agent: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Customer: No
Agent: Thanks for calling Vodaphone. Have a good day
Customer: you too

Personal Data extracted from the above text:

Name: Alex
Date of birth: 10-Aug-1979
Email id: alex@gmail.com
Company: IBM
Date of joining: 01-Feb-99

The overall confidence score is also calculated:

Confidence score: 0.7

This code pattern gives you step-by-step instructions for building this solution; see Steps below.

Flow


Architecture/Flow diagram
1. Viewer passes input text to the Personal Data Extractor.
2. Personal Data Extractor passes the text to NLU.
3. NLU extracts personal data from the input text, using the custom model to provide the response.
4. Personal Data Extractor passes the NLU output to the Regex component.
5. The Regex component uses the regular expressions provided in the configuration to extract additional personal data, which is used to augment the NLU output.
6. The augmented personal data is passed to the Scorer component.
7. The Scorer component uses the configuration to compute an overall document score, and the result is passed back to the Personal Data Extractor component.
8. This data is then passed to the Viewer component.
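
The same flow, expressed as a hedged Java sketch. The interfaces and names below are hypothetical stand-ins for the actual components, not the repository's API:

    import java.util.Map;

    // Steps 2-3: NLU extraction with the custom WKS model
    interface NluClient { Map<String, String> analyze(String text); }
    // Steps 4-5: regex-based augmentation of the NLU output
    interface RegexExtractor { Map<String, String> augment(String text, Map<String, String> nluOutput); }
    // Steps 6-7: overall document score from the augmented personal data
    interface Scorer { double score(Map<String, String> personalData); }

    public class PersonalDataFlow {
        public static void run(String inputText, NluClient nlu, RegexExtractor regex, Scorer scorer) {
            Map<String, String> entities = nlu.analyze(inputText);
            Map<String, String> augmented = regex.augment(inputText, entities);
            double confidence = scorer.score(augmented);
            // Step 8: hand the result to the viewer (printed here for simplicity)
            System.out.println(augmented + " -> confidence " + confidence);
        }
    }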

Included Components

Watson Natural Language Understanding, Watson Knowledge Studio, and the Liberty for Java runtime on IBM Cloud.

Watch the Overview Video

Steps

  1. Prerequisites
  2. Concepts used
  3. Application deployment
  4. Develop Watson Knowledge Studio model
  5. Deploy WKS model to Watson Natural Language Understanding
  6. Verify that configuration parameters are correct
  7. Analyze results
  8. Consuming the output by other applications
1. Prerequisites
2. Concepts used
2.1 Data extraction methods

We have to define what personal data (e.g. Name, Email id) we want to extract. This is done in two ways in this code pattern:
A) using a custom model built with Watson Knowledge Studio (WKS), and
B) using regular expressions. Details of how these are used are explained in subsequent sections.

2.2 Configuration

We use a configuration file to drive personal data extraction. Personal data are classified into categories, each category is assigned a weight, and the configuration specifies which personal data belong to which category.

A sample configuration is shown below:

Categories: Very_High,High,Medium,Low
Very_High_Weight: 50
High_Weight: 40
Medium_Weight: 20
Low_Weight: 10
Very_High_PIIs: MobileNumber,EmailId
High_PIIs: Person,DOB
Medium_PIIs: Name,DOJ
Low_PIIs: Company
regex_params: DOB,DOJ
DOB_regex: (0[1-9]|[12][0-9]|3[01])[- /.](Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[- /.](19|20)\d\d
DOJ_regex: (0[1-9]|[12][0-9]|3[01])[- /.](Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[- /.]\d\d
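
Since these are simple key-value pairs, the application can read them with java.util.Properties. A minimal sketch, assuming a hypothetical config.properties file holding the keys above:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;

    public class PiiConfigReader {
        public static void main(String[] args) throws IOException {
            Properties config = new Properties();
            // "config.properties" is a hypothetical file name
            try (FileInputStream in = new FileInputStream("config.properties")) {
                config.load(in);
            }
            // comma-separated values become lists
            List<String> categories = Arrays.asList(config.getProperty("Categories").split(","));
            for (String category : categories) {
                int weight = Integer.parseInt(config.getProperty(category + "_Weight").trim());
                List<String> piis = Arrays.asList(config.getProperty(category + "_PIIs").split(","));
                System.out.println(category + " (weight " + weight + "): " + piis);
            }
            // entity types extracted via regular expressions, and their patterns
            for (String param : config.getProperty("regex_params").split(",")) {
                System.out.println(param + " -> " + config.getProperty(param + "_regex"));
            }
        }
    }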

If you want to change the configuration, follow the template below:

Categories: <new set of comma-separated categories> e.g. Categories: MyCategory1,MyCategory2,MyCategory3
<category_name>_Weight: weight for each category e.g. MyCategory1_Weight: 40
<category_name>_PIIs: personal data (entity types) for each category e.g. MyCategory1_PIIs: EmailId,EmployeeId
regex_params: entity types that are to be extracted using regular expressions e.g. regex_params: Date
<regex_param>_regex: regular expression with which an entity is to be extracted from the text e.g. Date_regex: (0[1-9]|[12][0-9]|3[01])
2.3 Brief description of application components

2.3.1 Personal Data Extractor component:

The Personal Data Extractor component is the controller that orchestrates the flow of data between all the components. It also integrates with NLU.
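
A minimal sketch of that NLU integration, assuming the Watson Java SDK of this period; the version date, credentials, and model ID are placeholders (the custom model ID comes from deploying the WKS model to NLU in step 5):

    import com.ibm.watson.developer_cloud.natural_language_understanding.v1.NaturalLanguageUnderstanding;
    import com.ibm.watson.developer_cloud.natural_language_understanding.v1.model.AnalysisResults;
    import com.ibm.watson.developer_cloud.natural_language_understanding.v1.model.AnalyzeOptions;
    import com.ibm.watson.developer_cloud.natural_language_understanding.v1.model.EntitiesOptions;
    import com.ibm.watson.developer_cloud.natural_language_understanding.v1.model.Features;

    public class NluCall {
        public static void main(String[] args) {
            // username/password credentials as used by NLU instances of this period
            NaturalLanguageUnderstanding nlu =
                    new NaturalLanguageUnderstanding("2018-03-16", "<nlu-username>", "<nlu-password>");

            // request entities using the custom WKS model deployed to NLU
            EntitiesOptions entities = new EntitiesOptions.Builder()
                    .model("<wks-model-id>")
                    .build();
            Features features = new Features.Builder().entities(entities).build();
            AnalyzeOptions parameters = new AnalyzeOptions.Builder()
                    .text("My data of birth is 10-Aug-1979 and my email id is alex@gmail.com")
                    .features(features)
                    .build();

            AnalysisResults results = nlu.analyze(parameters).execute();
            System.out.println(results);
        }
    }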

2.3.2 Regex component:

The Regex component parses the input text using the regular expressions provided in the configuration to extract personal data that the custom model does not cover; the matches are used to augment the NLU output.
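
A minimal sketch of this step with java.util.regex, using the DOB pattern from the sample configuration; the matching shown here is illustrative, not the repository's exact implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexExtract {
        public static void main(String[] args) {
            // DOB pattern from the sample configuration above
            Pattern dob = Pattern.compile(
                    "(0[1-9]|[12][0-9]|3[01])[- /.](Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[- /.](19|20)\\d\\d");
            String text = "My data of birth is 10-Aug-1979 and my email id is alex@gmail.com";
            List<String> hits = new ArrayList<>();
            Matcher m = dob.matcher(text);
            while (m.find()) {
                hits.add(m.group()); // each match becomes a DOB entity merged into the NLU output
            }
            System.out.println(hits); // [10-Aug-1979]
        }
    }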

2.3.3 Scorer component:

The Scorer component calculates the score of a document, a value between 0 and 1, based on the personal data identified and the configuration data. It uses the algorithm below:

let score = 0
for each category {
   cat_weight = weight of the category
   cat_entity_types = list of entity types identified for the category
   for each entity type in cat_entity_types {
      score = score + ( ( cat_weight / 100 ) * ( 100 - score ) )
   }
}
score = score / 100  // to make it between 0 and 1
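
The same algorithm as a runnable Java sketch. The class, method, and map names are hypothetical, and it assumes the category weights and the entity types actually found in the document have already been collected:

    import java.util.List;
    import java.util.Map;

    public class DocumentScorer {
        // categoryWeights: e.g. {Very_High=50, High=40, Medium=20, Low=10}
        // foundByCategory: entity types identified in the document, grouped by category
        public static double score(Map<String, Integer> categoryWeights,
                                   Map<String, List<String>> foundByCategory) {
            double score = 0.0;
            for (Map.Entry<String, List<String>> entry : foundByCategory.entrySet()) {
                int catWeight = categoryWeights.getOrDefault(entry.getKey(), 0);
                for (String entityType : entry.getValue()) {
                    // each identified entity moves the score toward 100 by the category's weight
                    score = score + (catWeight / 100.0) * (100.0 - score);
                }
            }
            return score / 100.0; // normalize to the 0..1 range
        }
    }

For example, two entity types found in a category with weight 40 give score = 0 + 0.4 × 100 = 40 after the first, then 40 + 0.4 × (100 − 40) = 64 after the second, for a final normalized score of 0.64. Each additional finding raises the score, but it can never exceed 1.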
2.3.4 Viewer component:

The Viewer component is the user interface of the application. A user can browse for a file containing a chat transcript and submit it to the Personal Data Extractor component. The extracted personal data are then shown in a tree view, along with the overall confidence score.

3. Application deployment
3.1 Deploy Java Liberty application to IBM Cloud

You can deploy the Java Liberty application using the Deploy to IBM Cloud button or using manual steps.

3.1.1 Deploy using “Deploy to IBM Cloud”

Click the Deploy to IBM Cloud button above to deploy the application to IBM Cloud. You will be presented with a toolchain view and asked to deploy the application. Click the Deploy button. The application should get deployed; ensure that it has started and that an NLU service has been created and bound to the newly deployed application.

3.1.2 Deploy using Manual steps

If you used the Deploy to IBM Cloud button to deploy the application, skip this section and jump to section 4, Develop Watson Knowledge Studio model. Otherwise, complete sections 3.1.2.1 Create NLU service instance and 3.1.2.2 Deploy the Java application on IBM Cloud below.

3.1.2.1 Create NLU service instance
4. Develop Watson Knowledge Studio model
4.1 Import Artifacts
4.1.1 Type Systems

You can learn more about Type Systems here. A Type System can either be created from scratch or imported from an existing Type System JSON file. You can create your own Type System or use the one provided in this repository. If you wish to import it, download the file named TypeSystems.json under the WKS folder of this repository to your local file system. The JSON file has entity types such as Name, PhoneNo, EmailId, and Address. You can edit, add, or delete entity types to suit your requirements.

4.1.2 Documents

You can learn more about Documents here. We need a set of documents to train and evaluate the WKS model. These documents contain the unstructured text from which we will identify personal data; see the sample files under the SampleChatTranscripts folder. Training a WKS model well requires a large and varied set of documents; to keep this exercise manageable, we use a smaller set.

You can either use your own set of documents or the ones provided in this git repository under WKS/Documents.zip. If you decide to use the documents provided in this repo, download the file to your local file system.

4.2 Create Project

Log in to WKS.

4.3 Import type system
4.4 Import Documents
4.5 Create and assign annotation sets
4.6 Human Annotation
5. Deploy WKS model to Watson Natural Language Understanding
6. Verify that configuration parameters are correct
7. Analyze Results
8. Consuming the output by other applications


Learn more

License

Apache 2.0

