GoogleCloudPlatform/dataflow-opinion-analysis

Name: dataflow-opinion-analysis

Owner: Google Cloud Platform

Description: Opinion Analysis of News, Threaded Conversations, and User Generated Content

Created: 2017-05-05 19:17:47.0

Updated: 2018-05-22 21:20:04.0

Pushed: 2018-05-22 21:20:03.0

Homepage:

Size: 104784

Language: Java

README

Sample: Opinion Analysis of News, Threaded Conversations, and User Generated Content

This sample uses Cloud Dataflow to build an opinion analysis processing pipeline for news, threaded conversations in forums such as Hacker News, Reddit, or Twitter, and other user-generated content such as email.

Opinion Analysis can be used for lead generation purposes, user research, or automated testimonial harvesting.

About the sample

This sample contains three components:

How to run the sample

The steps for configuring and running this sample are as follows:

Prerequisites

Set up your Google Cloud Platform project and permissions

Install the tools needed to compile and deploy the code in this sample, if they are not already on your system: git, Google Cloud SDK, Python (for orchestration scripts), and Java and Maven (for Dataflow pipelines)

Create and set up a Cloud Storage bucket and Cloud Pub/Sub topics

Create or verify a configuration for your project
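
Before going further, it can help to confirm that the tools listed in the prerequisites are actually on your PATH. The following is a minimal sketch, not part of the sample itself. The function name check_tools is hypothetical, and the binary names (for example python rather than python3) are assumptions that may differ on your system.

```shell
#!/bin/sh
# Report which of the given command-line tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and returns non-zero
# if at least one tool is missing. (check_tools is a hypothetical helper.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the prerequisites. The binary names are assumptions;
# adjust them to match your installation.
if check_tools git gcloud python java mvn; then
    echo "all tools found"
fi
```

Any tool reported as missing needs to be installed before you build or deploy the pipelines.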

Important: This tutorial uses several billable components of Google Cloud Platform. New Cloud Platform users may be eligible for a free trial.

Clone the sample code

To clone the GitHub repository to your computer, run the following command:

git clone https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis

Go to the dataflow-opinion-analysis directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.

cd dataflow-opinion-analysis

Specify cron jobs for the App Engine scheduling app

After you deploy the App Engine application, it uses the App Engine Cron Service to schedule sending messages to the Cloud Pub/Sub control topics. If the control Cloud Pub/Sub topic specified in your Python scripts (e.g. startjdbcimport.py) does not exist, the application creates it.

You can see the cron jobs in the Cloud Console under:

Compute > App Engine > Task queues > Cron Jobs

You can also see the control topic in the Cloud Console:

Big Data > Pub/Sub

Create the BigQuery dataset

Table schema definitions are located in the *Schema.json files in the bigquery directory. View definitions are located in the shell script build_views.sh.
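
One way to turn the schema files into tables is to generate a `bq mk` command for each of them. The sketch below only prints the commands so you can review them first. It assumes the dataset is named opinions (the name used by the verification query in this README) and that each table is named after its schema file with the Schema.json suffix removed. That naming rule and the function name print_table_commands are assumptions, so check them against the actual file names in the bigquery directory.

```shell
#!/bin/sh
# Print a `bq mk` command for the dataset, followed by one
# `bq mk --table` command per *Schema.json file in a directory.
# Assumption: the table name is the file name minus "Schema.json".
print_table_commands() {
    schema_dir=$1
    dataset=$2
    echo "bq mk --dataset $dataset"
    for schema in "$schema_dir"/*Schema.json; do
        [ -e "$schema" ] || continue    # the glob matched no files
        table=$(basename "$schema" Schema.json)
        echo "bq mk --table $dataset.$table $schema"
    done
}

# Run from the repository root; "opinions" is the assumed dataset name.
print_table_commands bigquery opinions
```

After reviewing the output, you can pipe it to `sh` to execute the commands. The views defined in build_views.sh are not covered by this sketch.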

Deploy the Dataflow pipelines
Download and install the Sirocco sentiment analysis packages

If you would like to use this sample for deep textual analysis, download and install Sirocco, a framework maintained by @datancoffee.

mvn install:install-file \
  -DgroupId=sirocco.sirocco-sa \
  -DartifactId=sirocco-sa \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-sa-x.y.z.jar \
  -DgeneratePom=true

mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar \
  -DgeneratePom=true

Build and Deploy your Controller pipeline to Cloud Dataflow

Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly, as described in the Release Notes for version 0.6.4.

cd scripts
cp run_controljob_template.sh run_controljob.sh
cd ..
scripts/run_controljob.sh &

Run a verification job

Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly, as described in the Release Notes for version 0.6.4.

You can use the included news articles (from Google's blogs) in the src/test/resources/input directory to run a test pipeline.

command=start_gcs_import

#standardSQL
SELECT * FROM opinions.sentiment
ORDER BY DocumentTime DESC
LIMIT 100

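
If you prefer the command line to the BigQuery web UI, the verification query can also be run with the `bq` tool from the Cloud SDK. The following is a sketch under that assumption. The helper name verification_query is hypothetical, and the command is printed rather than executed so that you can inspect it first.

```shell
#!/bin/sh
# Build the verification query against opinions.sentiment.
# The optional first argument overrides the row limit (default 100).
verification_query() {
    limit=${1:-100}
    printf 'SELECT * FROM opinions.sentiment ORDER BY DocumentTime DESC LIMIT %s' "$limit"
}

# --use_legacy_sql=false selects standard SQL, which is what the
# #standardSQL tag requests in the web UI.
echo "bq query --use_legacy_sql=false '$(verification_query 100)'"
```

If the test pipeline has written results, the query returns the most recent documents first.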
Clean up

Now that you have tested the sample, delete the cloud resources you created to prevent further billing for them on your account.
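
The cleanup can be sketched as a short list of delete commands. The example below prints them instead of running them, because deletion is irreversible. The dataset name opinions is taken from the verification query in this README. The bucket name, topic name, and the function name print_cleanup_commands are placeholders introduced here, so substitute the names you chose during setup.

```shell
#!/bin/sh
# Print (do not run) the commands that delete the sample's resources:
# the BigQuery dataset, the Cloud Storage bucket, and any Pub/Sub topics.
# Arguments: dataset, bucket, then zero or more topic names.
print_cleanup_commands() {
    dataset=$1
    bucket=$2
    shift 2
    echo "bq rm -r -f -d $dataset"        # -r: include tables, -f: no prompt
    echo "gsutil rm -r gs://$bucket"      # removes the bucket and its objects
    for topic in "$@"; do
        echo "gcloud pubsub topics delete $topic"
    done
}

# Placeholder bucket and topic names; replace them with your own.
print_cleanup_commands opinions my-sample-bucket my-control-topic
```

Any Dataflow jobs that are still running continue to incur charges until they are stopped, for example with `gcloud dataflow jobs cancel` or from the Cloud Console.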

License:

Copyright 2017 Google Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

