Name: dataflow-opinion-analysis
Owner: Google Cloud Platform
Description: Opinion Analysis of News, Threaded Conversations, and User Generated Content
Created: 2017-05-05 19:17:47.0
Updated: 2018-05-22 21:20:04.0
Pushed: 2018-05-22 21:20:03.0
Size: 104784
Language: Java
This sample uses Cloud Dataflow to build an opinion analysis processing pipeline for news, threaded conversations in forums like Hacker News, Reddit, or Twitter, and other user-generated content such as email.
Opinion analysis can be used for lead generation, user research, or automated testimonial harvesting.
This sample contains three components:
The steps for configuring and running this sample are as follows:
Setup your Google Cloud Platform project and permissions
Select or Create a Google Cloud Platform project. In the Google Cloud Console, select Create Project.
Enable billing for your project.
Enable the Google Dataflow, Compute Engine, Google Cloud Storage, and other APIs necessary to run the example.
Install the tools necessary for compiling and deploying the code in this sample, if they are not already on your system: git, the Google Cloud SDK, Python (for orchestration scripts), and Java and Maven (for Dataflow pipelines):
Install git.
Install the Google Cloud SDK.
Install Python 2.7.
Install Python pip.
Download and install the Java Development Kit (JDK) version 1.8 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
Install Apache Maven.
Create and setup a Cloud Storage bucket and Cloud Pub/Sub topics
Create a Cloud Storage bucket for your project. This bucket will be used for staging your code, as well as for temporary input/output files. For consistency with this sample, select the Multi-Regional storage class and the United States location.
Create the following folders in this bucket: staging, input, output, temp, indexercontrol.
Create the following Pub/Sub topics: indexercommands, documents.
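If you prefer the command line over the console, the bucket and topics can be created with gsutil and gcloud as sketched below; the bucket name is a placeholder and must be globally unique. The folders themselves are easiest to create in the Cloud Storage browser.

```
gsutil mb -c multi_regional -l us gs://your-opinion-analysis-bucket
gcloud pubsub topics create indexercommands
gcloud pubsub topics create documents
```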
Create or verify a configuration for your project
Authenticate with the Cloud Platform. Run the following command to get Application Default Credentials.
gcloud auth application-default login
Create a new configuration for your project if it does not exist already
gcloud init
Verify your configurations
gcloud config configurations list
Important: This tutorial uses several billable components of Google Cloud Platform. New Cloud Platform users may be eligible for a free trial.
To clone the GitHub repository to your computer, run the following command:
git clone https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis
Go to the dataflow-opinion-analysis directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.
cd dataflow-opinion-analysis
In the App Engine Console create an App Engine app
In shell, activate the configuration for the project where you want to deploy the app
gcloud config configurations activate <config-name>
Include the Python API client in your App Engine app
pip install -t scheduler/lib/ google-api-python-client
Adjust the schedule for your ETL jobs by editing the scheduler/cron.yaml file. You define tasks for the App Engine Task Scheduler in YAML format. For a complete description of how to use YAML to specify jobs for the Cron Service, including the schedule format, see Scheduled Tasks with Cron for Python.
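For reference, a cron.yaml entry takes the following shape; the handler URL and schedule below are illustrative placeholders, not necessarily the ones used by this sample:

```
cron:
- description: start the JDBC import job
  url: /startjdbcimport
  schedule: every 24 hours
```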
Update the control topic name indexercommands
in the scheduler scripts to the name you used when you created the Pub/Sub topic. Edit the following files:
scheduler/startjdbcimport.py
scheduler/startsocialimport.py
scheduler/startstatscalc.py
Upload the scheduling application to App Engine.
gcloud app deploy --version=1 scheduler/app.yaml scheduler/cron.yaml
After you deploy the App Engine application, it uses the App Engine Cron Service
to schedule sending messages to the Cloud Pub/Sub control topics. If the control Cloud Pub/Sub topic
specified in your Python scripts (e.g. startjdbcimport.py
) does not exist, the application creates it.
You can see the cron jobs in the Cloud Console under:
Compute > App Engine > Task queues > Cron Jobs
You can also see the control topic in the Cloud Console:
Big Data > Pub/Sub
Make sure you've activated the gcloud configuration for the project where you want to create your BigQuery dataset
gcloud config configurations activate <config-name>
In shell, go to the bigquery
directory where the build scripts and schema files for BigQuery tables and views are located
cd bigquery
Run the build_dataset.sh
script to create the dataset, tables, and views. The script will use the PROJECT_ID variable from your active gcloud configuration, and create a new dataset in BigQuery named 'opinions'. In this dataset it will create several tables and views necessary for this sample.
./build_dataset.sh
[optional] Later on, if you make changes to the table schema or views, you can update the definitions of these objects by running update commands:
./build_tables.sh update
./build_views.sh update
Table schema definitions are located in the *Schema.json files in the bigquery
directory. View definitions are located in the shell script build_views.sh.
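For orientation, the build scripts presumably wrap the bq command-line tool; creating the dataset and one table by hand would look roughly like this (the table name and schema file name are illustrative):

```
bq mk --dataset opinions
bq mk --table opinions.sentiment sentimentSchema.json
```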
If you would like to use this sample for deep textual analysis, download and install Sirocco, a framework maintained by @datancoffee.
Download the latest Sirocco Java framework jar file.
Download the latest Sirocco model file.
Go to the directory where the downloaded sirocco-sa-x.y.z.jar and sirocco-mo-x.y.z.jar files are located.
Install the Sirocco framework in your local Maven repository. Replace x.y.z with the downloaded version.
mvn install:install-file \
  -DgroupId=sirocco.sirocco-sa \
  -DartifactId=sirocco-sa \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-sa-x.y.z.jar \
  -DgeneratePom=true
mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar \
  -DgeneratePom=true
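Once installed this way, a Maven build can resolve the jars using the same coordinates; a sketch of a dependency entry (version placeholder as above):

```
<dependency>
  <groupId>sirocco.sirocco-sa</groupId>
  <artifactId>sirocco-sa</artifactId>
  <version>x.y.z</version>
</dependency>
```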
Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly as described in the Release Notes for version 0.6.4.
Go to the dataflow-opinion-analysis/scripts directory and make a copy of the run_controljob_template.sh file:
cd scripts
cp run_controljob_template.sh run_controljob.sh
Edit the run_controljob.sh file in your favorite text editor, e.g. nano. Specifically, set the values of the variables used for parameterizing your control Dataflow pipeline: set PROJECT_ID, DATASET_ID, and the other variables at the beginning of the shell script.
Go back to the dataflow-opinion-analysis directory and run a command to deploy the control Dataflow pipeline to Cloud Dataflow.
./scripts/run_controljob.sh &
Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly as described in the Release Notes for version 0.6.4.
You can use the included news articles (from Google's blogs) in the src/test/resources/input
directory to run a test pipeline.
Upload the files in the src/test/resources/input
directory into the GCS input
bucket. Use the Cloud Storage browser to find the input
directory you created in Prerequisites. Then, upload all files from your local src/test/resources/input
directory.
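Alternatively, the upload can be done from the shell with gsutil; the bucket name below is a placeholder for the bucket you created in Prerequisites:

```
gsutil -m cp src/test/resources/input/* gs://your-opinion-analysis-bucket/input/
```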
Use the Pub/Sub console to send a command to start a file import job. Find the indexercommands topic in the Pub/Sub console. Click on its name.
Click the "Publish Message" button. In the Message box, copy the following command and click "Publish".
command=start_gcs_import
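The same control message can also be published from the shell instead of the console, assuming the topic name from Prerequisites:

```
gcloud pubsub topics publish indexercommands --message="command=start_gcs_import"
```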
In the Dataflow Console observe how a new input job is created. It will have a “-gcsdocimport” suffix.
Once the Dataflow job successfully finishes, you can review the data it wrote to your target BigQuery dataset. Use the BigQuery console to review the dataset.
Enter the following query to list new documents that were indexed by the Dataflow job. The sample query uses the Standard SQL dialect of BigQuery.
#standardSQL
SELECT * FROM opinions.sentiment
ORDER BY DocumentTime DESC
LIMIT 100
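The same query can also be run from the shell with the bq tool:

```
bq query --use_legacy_sql=false 'SELECT * FROM opinions.sentiment ORDER BY DocumentTime DESC LIMIT 100'
```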
Now that you have tested the sample, delete the cloud resources you created to prevent further billing for them on your account.
Stop the control Cloud Dataflow job in the Dataflow Cloud Console.
Disable and delete the App Engine application as described in Disable or delete your application in the Google App Engine documentation.
Delete the Cloud Pub/Sub topic. You can delete the topic and associated subscriptions from the Cloud Pub/Sub section of the Cloud Console.
Copyright 2017 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.