May 27, 2015

Streamlined Data Refinery with Pentaho Data Integration 5.4

During the past year, my team has been working on delivering the "last mile" of Pentaho's Streamlined Data Refinery vision. This "last mile" is the final hurdle between the world of ETL and the world of data exploration.


In a typical Pentaho ETL workflow, PDI ingests data from virtually any data source, including both traditional systems and Big Data stores, and then processes, cleanses, and blends the data. These transformations may generate one or more tables containing datasets optimized for analytics and business intelligence. In the past, there was no automated way to make this data immediately available to end users for analysis. The ETL developer had to create or update the data source on the BA Server, use either the Schema Workbench tool or the thin-client modeler to create an analytical model, make it available to end users, and so on, which proved to be a non-trivial, manual process.

The main goal of our efforts over the past year has been to reduce the friction in this process. We want to empower the ETL developer to build and publish the analytics data source model straight from the same ETL tool (PDI). This accelerates time to value in analytics projects and simplifies the process by automating the complex details of model creation and publishing, enabling end users such as analysts and data scientists to perform data exploration and visualization tasks quickly and on demand.

Incremental SDR Capabilities

We provided initial support for these "last mile" features in the 5.2 release, when we introduced two new job entries in PDI:

  • Build Model - Builds a logical model from an output step (Auto-Model)
  • Publish Model - Publishes the logical model to the BA Server (for Analysis and Reporting)

Build and Publish Model 5.2

Then, in the 5.3 release, we introduced the Annotate Stream transformation step to support model annotations. These annotations are used to build more sophisticated models and to enhance the auto-model capabilities. In addition, we added the ability to annotate stream fields as measures and to define aggregation functions. Other improvements in the 5.3 release include annotation support for hierarchized degenerate dimensions and time dimensions, role-based security for published models, and more.

5.4 Release

For the upcoming 5.4 release, we are introducing star schema support as well as the ability to re-use previously defined annotation groups and shared dimension groups.

Star Schema Support 5.4

As you can see from the transformation above, the PDI user can add model annotations, or "hints," to the stream fields as rows of data are loaded into the fact and dimension tables.

Below is an example of how to define dimension attributes using the Shared Dimension step:

Shared Dimension Group

The Annotate Stream step then allows you to link the defined dimension to the Sales fact table, as well as to define other annotations such as Create Measure:

Model Annotation Group

In addition, model annotation groups can now be stored centrally so that they can be reused when generating new models.

Reuse Model Annotation Groups

Once the transformation is configured and the annotations are defined, you can run the job. The **Build Model** job entry will **apply the annotations and augment the auto-generated model** to produce a more complex analytical model. The **Publish Model** job entry will then publish the model data source, including the database connection information, to the configured Pentaho BA Server.
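You don't have to run the job from Spoon, either. As a minimal sketch, it can also be run headlessly with PDI's Kitchen command-line tool, which makes the whole build-and-publish cycle scriptable; the job file path and parameter name below are hypothetical placeholders for your own job:

```shell
# Run the SDR job (Build Model + Publish Model) from the command line
# using Kitchen, which ships with PDI. The .kjb path and the parameter
# name are illustrative; substitute the ones from your own project.
./kitchen.sh -file=/opt/pentaho/jobs/build_and_publish_sales_model.kjb \
             -param:TARGET_TABLE=sales_fact \
             -level=Basic
```

Wiring this command into a scheduler (e.g. cron) means the refinery output and its published model can be refreshed on a regular cadence without any manual steps.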

When the model is published, you can immediately use Pentaho Analyzer in the BA Server to perform analysis on the data:

Pentaho Analyzer

As you can see, this new set of capabilities helps bridge the gap between the blending and orchestration of data and the automatic generation and publishing of the model for end-user exploration and visualization. These incremental improvements bring us closer to a complete end-to-end Streamlined Data Refinery solution.