yahoo/maha

Name: maha

Owner: Yahoo Inc.

Description: A framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid.

Created: 2017-10-02 14:17:41.0

Updated: 2018-05-24 18:32:01.0

Pushed: 2018-05-24 18:32:02.0

Homepage:

Size: 2397

Language: Scala


README

Maha

A centralised library for building reporting APIs on top of multiple data stores to exploit them for what they do best.

We run millions of analytics queries every day across multiple data sources: Hive, Oracle, Druid, etc. We needed a way to utilize the data stores in our architecture for what each does best, which meant easily tuning and identifying the sets of use cases where each data store fits best. Our goal became to build a centralized system that could make these decisions on the fly at query time and also take care of end-to-end query execution. The system needed to take in all available heuristics, apply any constraints already defined in the system, and select the best data store to run the query. It would then generate the underlying queries and pass all available information to the query execution layer in order to facilitate further optimization at that layer.

Key Features!

Modules in maha
Getting Started
Installing Maha API Library
<dependency>
    <groupId>com.yahoo.maha</groupId>
    <artifactId>maha-api-jersey</artifactId>
    <version>5.2</version>
    <type>pom</type>
</dependency>
<repositories>
    <repository>
        <id>bintray-yahoo-maven</id>
        <name>bintray</name>
        <url>http://yahoo.bintray.com/maven</url>
    </repository>
</repositories>
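If you build with sbt rather than Maven, an equivalent declaration would look like the following. This is a sketch that simply mirrors the Maven coordinates and Bintray repository shown above:

```scala
// build.sbt -- sketch mirroring the Maven coordinates above
resolvers += "bintray-yahoo-maven" at "http://yahoo.bintray.com/maven"

libraryDependencies += "com.yahoo.maha" % "maha-api-jersey" % "5.2"
```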
Example Implementation of Maha APIs: Druid Wikiticker Example

For this example, you need a Druid instance running locally with the wikiticker dataset indexed into Druid. Please take a look at http://druid.io/docs/latest/tutorials/quickstart.html

Creating Fact Definition for Druid Wikiticker

  ColumnContext.withColumnContext { implicit dc: ColumnContext =>
    Fact.newFact(
      "wikiticker_stats_datasource", DailyGrain, DruidEngine, Set(WikiSchema),
      Set(
        DimCol("channel", StrType())
        , DimCol("cityName", StrType())
        , DimCol("comment", StrType(), annotations = Set(EscapingRequired))
        , DimCol("countryIsoCode", StrType(10))
        , DimCol("countryName", StrType(100))
        , DimCol("isAnonymous", StrType(5))
        , DimCol("isMinor", StrType(5))
        , DimCol("isNew", StrType(5))
        , DimCol("isRobot", StrType(5))
        , DimCol("isUnpatrolled", StrType(5))
        , DimCol("metroCode", StrType(100))
        , DimCol("namespace", StrType(100, (Map("Main" -> "Main Namespace", "User" -> "User Namespace", "Category" -> "Category Namespace", "User Talk"-> "User Talk Namespace"), "Unknown Namespace")))
        , DimCol("page", StrType(100))
        , DimCol("regionIsoCode", StrType(10))
        , DimCol("regionName", StrType(200))
        , DimCol("user", StrType(200))
      ),
      Set(
      FactCol("count", IntType())
      ,FactCol("added", IntType())
      ,FactCol("deleted", IntType())
      ,FactCol("delta", IntType())
      ,FactCol("user_unique", IntType())
      ,DruidDerFactCol("Delta Percentage", DecType(10, 8), "{delta} * 100 / {count} ")
      )
    )
  }
    .toPublicFact("wikiticker_stats",
      Set(
        PubCol("channel", "Wiki Channel", InNotInEquality),
        PubCol("cityName", "City Name", InNotInEqualityLike),
        PubCol("countryIsoCode", "Country ISO Code", InNotInEqualityLike),
        PubCol("countryName", "Country Name", InNotInEqualityLike),
        PubCol("isAnonymous", "Is Anonymous", InNotInEquality),
        PubCol("isMinor", "Is Minor", InNotInEquality),
        PubCol("isNew", "Is New", InNotInEquality),
        PubCol("isRobot", "Is Robot", InNotInEquality),
        PubCol("isUnpatrolled", "Is Unpatrolled", InNotInEquality),
        PubCol("metroCode", "Metro Code", InNotInEquality),
        PubCol("namespace", "Namespace", InNotInEquality),
        PubCol("page", "Page", InNotInEquality),
        PubCol("regionIsoCode", "Region Iso Code", InNotInEquality),
        PubCol("regionName", "Region Name", InNotInEqualityLike),
        PubCol("user", "User", InNotInEquality)
      ),
      Set(
        PublicFactCol("count", "Total Count", InBetweenEquality),
        PublicFactCol("added", "Added Count", InBetweenEquality),
        PublicFactCol("deleted", "Deleted Count", InBetweenEquality),
        PublicFactCol("delta", "Delta Count", InBetweenEquality),
        PublicFactCol("user_unique", "Unique User Count", InBetweenEquality),
        PublicFactCol("Delta Percentage", "Delta Percentage", InBetweenEquality)
      ),
      Set.empty,
      getMaxDaysWindow, getMaxDaysLookBack
    )

The fact definition is the static object specification for the fact and dimension columns present in the table in the data source; you can think of it as an object image of the table. A DimCol has a base name, a data type, and annotations. Annotations are configurations stating primary key/foreign key relationships, special-character escaping in query generation, or static value mappings, e.g. `StrType(100, (Map("Main" -> "Main Namespace", "User" -> "User Namespace", "Category" -> "Category Namespace", "User Talk"-> "User Talk Namespace"), "Unknown Namespace"))`. Fact definitions can also have derived columns; maha supports the most common arithmetic derived expressions.

Public Fact: The public fact contains the base-name-to-public-name mapping. Public names can be used directly in the request JSON. A public fact is identified by a cube name, e.g. 'wikiticker_stats'. Maha supports versioning on cubes, so you can have multiple versions of the same cube.
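Using the public names registered above, a request against the 'wikiticker_stats' cube could look like the following. This is a sketch: the `selectFields`/`filterExpressions` shape follows maha's reporting request format, and the `Day` filter values are placeholder dates.

```json
{
   "cube": "wikiticker_stats",
   "selectFields": [
      {"field": "Wiki Channel"},
      {"field": "Total Count"}
   ],
   "filterExpressions": [
      {"field": "Day", "operator": "between", "from": "2018-01-01", "to": "2018-01-02"}
   ]
}
```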

Fact/Dimension Registration Factory: Facts and dimensions are registered under derived static class objects of FactRegistrationFactory or DimensionRegistrationFactory. These factory classes are referenced in the maha-service-json-config.

maha-service-config.json

The Maha Service Config JSON contains, in one place, all the configuration needed for launching the maha-apis.

We have created `api-jersey/src/test/resources/maha-service-config.json` as a configuration to start with; it is the maha api configuration for the student and wiki registries.

Debugging maha-service-config.json: For the configuration syntax of this JSON, take a look at the JsonModels/Factories in the service module. When Maha Service loads this configuration, any failures are returned as a list of FailedToConstructFactory / ServiceConfigurationError / JsonParseError.
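As a sketch of how those load failures surface, the configuration can be parsed and validated before wiring up the service. The entry point and result types below are illustrative assumptions based on the service module, not a definitive API reference:

```scala
import com.yahoo.maha.service.MahaServiceConfig
import scala.io.Source

// Parse and validate the service configuration (sketch; exact API may differ)
val jsonString = Source.fromFile("maha-service-config.json").mkString
val mahaServiceResult = MahaServiceConfig.fromJson(jsonString.getBytes("utf-8"))

mahaServiceResult.fold(
  // On failure: the accumulated FailedToConstructFactory /
  // ServiceConfigurationError / JsonParseError instances
  errors => errors.list.foreach(println),
  // On success: a fully constructed service configuration
  config => println("maha service config loaded")
)
```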

Exposing the endpoints with api-jersey

Api-jersey uses the maha-service-config JSON and creates the MahaResource beans. All you need to do is create the following three beans: 'mahaService', 'baseRequest', and 'exceptionHandler'.

<bean id="mahaService" class="com.yahoo.maha.service.example.ExampleMahaService" factory-method="getMahaService"/>
<bean id="baseRequest" class="com.yahoo.maha.service.example.ExampleRequest" factory-method="getRequest"/>
<bean id="exceptionHandler" class="com.yahoo.maha.api.jersey.GenericExceptionMapper" scope="singleton" />
<import resource="classpath:maha-jersey-context.xml" />

Once your application context is ready, you are good to launch the war file on your web server. You can take a look at the test application context we have created for running the local demo and unit tests: `api-jersey/src/test/resources/testapplicationContext.xml`

Launch the maha api demo locally (prerequisites, running the demo, playing with the demo). An example response from the demo's 'student_performance' cube looks like this:

"header": {
    "cube": "student_performance",
    "fields": [{
            "fieldName": "Student ID",
            "fieldType": "DIM"
        },
        {
            "fieldName": "Class ID",
            "fieldType": "DIM"
        },
        {
            "fieldName": "Section ID",
            "fieldType": "DIM"
        },
        {
            "fieldName": "Total Marks",
            "fieldType": "FACT"
        }
    ],
    "maxRows": 200
},
"rows": [
    [213, 200, 100, 125],
    [213, 198, 100, 120]
]

Contributions