Category: Apache Ignite

Pentaho PDI with Apache Ignite - Part 2

17/8/2019

This is the second part of the blog post about Pentaho PDI and Apache Ignite, this time with more details. See Part 1 for the introduction.

The first part discussed the general setup and why it can be interesting to use Apache Ignite as an in-memory database for an ETL process: it acts as an in-memory storage layer for your data transformations.

Here again is the screenshot of the completed ETL job:

I am using the geonames.org files as the data source. Geonames offers over 11 million geographical place names with details, available for free.

At the beginning, the tables and structures in Apache Ignite are created using "Execute SQL script" steps. This is what the complete transformation looks like:

create tables

Five tables are created in Ignite. The last step at the bottom creates the output table in MySQL, to persist the results of the transformation job there.

Here is an example - just a normal create table statement.
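As a rough sketch of what such a statement can look like, the snippet below assembles a CREATE TABLE statement in Python. The table name `geonames_staging`, the chosen subset of geonames columns and the column lengths are assumptions for illustration, not the definitions used in the job. The one Ignite-specific point is that a table created through SQL needs a primary key.

```python
# Hypothetical subset of the geonames columns; names and lengths are
# illustrative assumptions, not the schema used in the post.
COLUMNS = [
    ("geonameid", "INT"),
    ("name", "VARCHAR(200)"),
    ("asciiname", "VARCHAR(200)"),
    ("latitude", "DOUBLE"),
    ("longitude", "DOUBLE"),
    ("feature_code", "VARCHAR(10)"),
    ("country_code", "VARCHAR(2)"),
]


def create_table_sql(table, columns, primary_key):
    """Build a plain CREATE TABLE statement with an explicit primary key."""
    column_names = [name for name, _ in columns]
    if primary_key not in column_names:
        raise ValueError(f"primary key {primary_key!r} is not a column")
    body = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{body},\n"
        f"  PRIMARY KEY ({primary_key})\n"
        f")"
    )


ddl = create_table_sql("geonames_staging", COLUMNS, "geonameid")
print(ddl)
```

The printed statement can be pasted into an SQL script step as is.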

To execute the script(s) you need a connection to the Ignite server: select "Generic Database" as the connection type, enter the JDBC URL and the JDBC driver class name, and test the connection.
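The two values for that dialog can be sketched as follows. This assumes Ignite's thin JDBC driver and a server on the local machine; the post does not state which of Ignite's JDBC drivers or which host was used, so treat host, port and the helper name `ignite_thin_url` as illustrative.

```python
# Thin JDBC driver class, assumed here to be the driver in use.
IGNITE_THIN_DRIVER = "org.apache.ignite.IgniteJdbcThinDriver"
# Default port of Ignite's thin client connector.
DEFAULT_THIN_PORT = 10800


def ignite_thin_url(host="127.0.0.1", port=DEFAULT_THIN_PORT, schema=None):
    """Build a thin-driver JDBC URL: jdbc:ignite:thin://host:port[/schema]."""
    url = f"jdbc:ignite:thin://{host}:{port}"
    if schema:
        url = f"{url}/{schema}"
    return url


# Values for the "Generic Database" connection (host is an assumption).
connection_settings = {
    "url": ignite_thin_url(),
    "driver_class": IGNITE_THIN_DRIVER,
}

for field, value in connection_settings.items():
    print(f"{field}: {value}")
```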

define database connection

In the next transformation the source data (from files) is loaded into the Apache Ignite database tables.

load data

The next step handles the transformation of the data. All processing is done in memory in Ignite. The data is read from Ignite, then lookups against other tables join it with information on country and continent as well as on feature codes. Finally, the result is written to another table in Ignite.
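What the lookups do to each row can be sketched in a few lines of Python. This is only a model of the logic, not how PDI implements its lookup steps; the field names and the sample rows are made up for illustration.

```python
def enrich(places, countries, feature_codes):
    """Add country, continent and feature description to each place row.

    A key without a match yields None, much like a lookup step that
    finds no row.
    """
    enriched = []
    for place in places:
        country = countries.get(place["country_code"], {})
        enriched.append({
            **place,
            "country": country.get("name"),
            "continent": country.get("continent"),
            "feature": feature_codes.get(place["feature_code"]),
        })
    return enriched


# Hypothetical sample data standing in for the Ignite tables.
places = [
    {"id": 1, "name": "Berlin", "country_code": "DE", "feature_code": "PPLC"},
    {"id": 2, "name": "Nowhere", "country_code": "ZZ", "feature_code": "XXX"},
]
countries = {"DE": {"name": "Germany", "continent": "EU"}}
feature_codes = {"PPLC": "capital of a political entity"}

rows = enrich(places, countries, feature_codes)
for row in rows:
    print(row)
```

The second sample row shows the case to watch for in a real job: keys without a match come out with empty values rather than being dropped.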

transform data

And finally the transformed data is read from its table in Ignite and written to a MySQL table.

The last step can be to drop all tables from Ignite if they are no longer required. This is controlled by passing a parameter (true/false) to the transformation job. In some cases one will want to keep the data to review the transformation steps and results; in other cases it is not needed anymore, so we can delete it and free the memory it occupies in Ignite.
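The effect of that parameter can be sketched like this. The table names and both helper functions are hypothetical; in the job itself the decision is made by PDI based on the parameter value.

```python
def parse_flag(value):
    """Interpret a job parameter such as "true"/"false" as a boolean."""
    return str(value).strip().lower() == "true"


def cleanup_statements(tables, drop_tables):
    """Return the DROP statements to run, or nothing if the data is kept."""
    if not drop_tables:
        return []
    return [f"DROP TABLE IF EXISTS {table}" for table in tables]


# Hypothetical staging table names.
staging_tables = ["geonames_staging", "country_info", "feature_codes"]

for statement in cleanup_statements(staging_tables, parse_flag("true")):
    print(statement)
```

Using `IF EXISTS` keeps the cleanup harmless when a run failed before all tables were created.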

This might not be so interesting, but I wanted to find out whether a whole processing chain works without issues and whether PDI and Ignite work well together. And they do! It is rather easy to replace an existing connection to a traditional database with one to Apache Ignite. As Ignite supports JDBC and SQL, it does not take much effort to redesign the transformation job. All steps I have tested work out of the box with Ignite: executing DDL statements, querying, deleting, etc.

Issues:
There is one minor issue, though: when PDI reads data from Ignite, all columns of type "String" are reported with a length of "30", regardless of whether they are defined shorter or longer in the database schema. Here is the create table statement of one of the tables:

And this is the table definition in Ignite:

Ignite table definition

But Pentaho PDI extracts it like this:

You can see that all String type fields/columns are defined with a length of "30". But no data is lost: only the reported column size is wrong, and values from Ignite that are longer than 30 characters are retrieved correctly.

I have cross-checked this with a different SQL tool (Squirrel): I created a connection to Ignite using the same JDBC driver and retrieved the table definition details. This is what Squirrel shows:

As Squirrel shows the correct length, I do not think it is a problem with the JDBC driver. It seems to be an issue in Pentaho PDI. So I have opened a Jira ticket for this and hope somebody will have a look at it.

As the size/length of the columns is wrong, one has to change them manually in a "Select Values" step, so that PDI generates the correct DDL statement with the correct lengths, e.g. when the data is output to a table.
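To see which fields need fixing, the declared lengths can be compared with what PDI reports. The sketch below does this for a handful of columns; the column names and declared lengths are assumptions for illustration, and the reported value of 30 mirrors the behaviour described above.

```python
def length_corrections(declared, reported):
    """Return the columns whose reported length differs from the schema."""
    return {
        column: length
        for column, length in declared.items()
        if reported.get(column) != length
    }


# Hypothetical lengths from the CREATE TABLE statement.
declared = {"name": 200, "asciiname": 200, "country_code": 2, "admin1_code": 30}
# PDI reports every String column with length 30.
reported = {column: 30 for column in declared}

corrections = length_corrections(declared, reported)
for column, length in corrections.items():
    print(f"{column}: set length to {length}")
```

Only the columns in the result need an entry in the metadata of the "Select Values" step; a column that happens to be declared with length 30 is already reported correctly.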

Hope this helps to get an overview or get started. It's worth a try especially if you have a good use case for processing data in memory.

Carpe Diem

Pentaho PDI with Apache Ignite - Part 1

17/8/2019

I have recently started to use Apache Ignite. And while I am learning, I wanted to see how Pentaho PDI - Pentaho Data Integration - works together with Ignite.

Apache Ignite is, according to its website, an "In Memory Computing Platform". The project has a lot of traction and offers interesting features; among other things, you can use it as a database and query it using standard SQL.

Apache Ignite is very easy to install: download it from the Apache Ignite website and unzip it to a folder of your choice. The website also offers a lot of documentation if you need help getting started.

Once unzipped, copy the file "ignite-core-<version>.jar" from the Ignite "libs" folder to the Pentaho PDI folder "lib". This jar file contains the Ignite JDBC driver.

Next, you can start Apache Ignite by running ignite.sh in the "bin" folder. Started like this, Ignite runs with its default configuration: all your data is kept in memory and is lost when you shut Ignite down. Of course Ignite also allows you to persist data - you can read in the Ignite documentation how to configure this.

But for this discussion, I want to have my data structures (tables) and data in memory only. The idea is to use Ignite to transform data in memory without using any disk-based storage for intermediately storing results, so that the data can be processed faster.

A common approach to transforming data is to first copy data from the source system to a staging area where it can be processed without interfering with the source system. Any development or repeated processing is then done on the data in the staging area. This makes development and testing easier, and at the same time the source system is not burdened by repeatedly pulling data from it. The next step is typically transforming the data: applying certain logic, formatting, enriching and joining it with other data. The final step is to output the data to a target system - maybe another database, a data warehouse or a file - there are many possible output targets.

Of course it depends on the complexity of the transformation, and there are obviously many ways to do it. If a transformation is complex, it naturally makes sense to break it apart into different units of work, where each part has a certain task in the transformation of the data. It is like in coding, where spaghetti code gets unmanageable over time and is broken apart into classes, methods and functions.

When you have multiple transformation steps, then the question is, how are the intermediate results of each step stored and passed to the next step? One way can be to store the temporary results in a database. That is convenient and Pentaho PDI has a good database integration. Also, there are many tools available to work with databases. When you run the complete transformation and the individual parts store their data in different database tables, then the developer has an easy way to query the data and see how the transformation progressed and transformed the data.

But repeatedly reading and writing data to a database also has limitations. On large data this can get slow, whether reading, writing or deleting data. Creating the right indexes and tuning the database can be a complex task, and there are dependencies on e.g. the disk performance of the database system.

Looking at the transformation process described above, one can see that the transformation is just an intermediate step to process the source data into a desired output format. Once the result is produced, the intermediate data is not necessary anymore - it can be deleted. Maybe this transformation process runs on a daily basis, so the next day a new state of the source system data is retrieved and the transformation processes it into the desired output again.

And if the intermediate data can be deleted, then the question is whether we can leverage an in-memory store to avoid using disks and simply do the processing in memory. This is where I got interested in Apache Ignite. At its core it is a key/value store, but it also provides SQL integration. If you have a lot of transformations that use a database for staging the data and performance is an issue, you would not want to redesign all your transformations. With Apache Ignite you can simply change the place where the transformation writes to and reads from. That is an easy task, and processing the data in memory only should bring a performance boost. And yes, you could of course scale your traditional database as well or tune other parts. Again, there are many ways to do it.

Here is a sample PDI job using Apache Ignite "in the middle" between source system and target system. My source system in this case is a set of CSV files. The target system is a MySQL database. The processing is done using Apache Ignite.

Because Ignite will run in memory - and I have no disk persistence configured in this case - I have chosen to create the required table structures in Ignite at the beginning. Next the data is imported into Ignite and then the data is transformed. Finally the output to MySQL is done. I have an option in this flow to drop the Ignite tables at the end of the process, if that is desired.
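The order of these stages can be sketched as a small Python program. Note that an in-memory SQLite database is used here purely as a stand-in for Ignite so that the sketch runs without a cluster, and a plain list stands in for the MySQL target; the table names and the trim/upper-case transformation are made up for illustration.

```python
import sqlite3


def run_pipeline(source_rows, drop_staging=True):
    """Create tables, load, transform, output, then optionally drop."""
    # In-memory SQLite stands in for Ignite in this sketch.
    staging = sqlite3.connect(":memory:")

    # 1. Create the table structures first, since nothing is persisted.
    staging.execute(
        "CREATE TABLE stage_places (id INTEGER PRIMARY KEY, name TEXT, country_code TEXT)"
    )
    staging.execute(
        "CREATE TABLE result_places (id INTEGER PRIMARY KEY, name TEXT, country_code TEXT)"
    )

    # 2. Load the source data into the staging table.
    staging.executemany("INSERT INTO stage_places VALUES (?, ?, ?)", source_rows)

    # 3. Transform inside the in-memory store (hypothetical clean-up rule).
    staging.execute(
        "INSERT INTO result_places "
        "SELECT id, UPPER(TRIM(name)), country_code FROM stage_places"
    )

    # 4. Read the result for output to the target system.
    output = staging.execute(
        "SELECT id, name, country_code FROM result_places ORDER BY id"
    ).fetchall()

    # 5. Optionally drop the intermediate tables to free memory.
    if drop_staging:
        staging.execute("DROP TABLE stage_places")
        staging.execute("DROP TABLE result_places")
    staging.close()
    return output


result = run_pipeline([(1, " Berlin ", "DE"), (2, "Paris", "FR")])
print(result)
```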

As this post is already long, I stop here. If you are interested in the details, read the upcoming follow-up blog entry. I also have some comments on what does not work so nicely between PDI and Ignite at the moment.

Here is the link to part 2.

Carpe Diem
