Blog Posts

Pentaho PDI with Apache Ignite - Part 1

17/8/2019

I have recently started to use Apache Ignite. And while I am learning, I wanted to see how Pentaho PDI - Pentaho Data Integration - works together with Ignite.

Apache Ignite is, according to its website, an "In Memory Computing Platform". The project has a lot of traction and offers interesting features; among other things, you can use it as a database and query it using standard SQL.

Apache Ignite is very easy to install: download Ignite from here and unzip it to a folder of your choice. On the website of Apache Ignite there is a lot of documentation available, if you need help to get started.

Once unzipped, copy the file "ignite-core-<version>.jar" from the Ignite "libs" folder to the Pentaho PDI folder "lib". This jar file contains the Ignite JDBC driver.

Next, you can start Apache Ignite by running ignite.sh in the "bin" folder. Started like this, Ignite runs in a default configuration. All your data is kept in memory, and if you shut down, all data is lost. Of course Ignite also allows you to persist data - you can read in the Ignite documentation how to configure this.

But for this discussion, I want to have my data structures (tables) and data in memory only. The idea is to use Ignite to transform data in memory without using any disk-based storage for intermediately storing results, so that the data can be processed faster.

A common approach to transforming data is to first copy data from the source system to a staging area where it can be processed without interfering with the source system. Any development or repeated processing is then done on the data in the staging area. This makes development and testing easier, and at the same time the source system is not burdened by repeatedly pulling data from it. The next step is typically transforming the data: applying certain logic, formatting, enriching and joining it with other data. The final step is to output the data to a target system - maybe another database, a data warehouse or a file - there are many possible output targets.

Of course it depends on the complexity of the transformation, and there are obviously many ways to do it. If a transformation is complex, then it naturally makes sense to break it apart into different units of work, where each part has a certain task in the transformation of the data. It is like in coding, where spaghetti code gets unmanageable over time and is broken apart into classes, methods and functions.

When you have multiple transformation steps, then the question is, how are the intermediate results of each step stored and passed to the next step? One way can be to store the temporary results in a database. That is convenient and Pentaho PDI has a good database integration. Also, there are many tools available to work with databases. When you run the complete transformation and the individual parts store their data in different database tables, then the developer has an easy way to query the data and see how the transformation progressed and transformed the data.

But repeatedly reading and writing data to a database also has limitations. On large data this can get slow - on reading, writing or deleting data. Creating the right indexes and tuning the database can be a complex task, and there are dependencies on e.g. the disk performance of the database system.

Looking at the transformation process described above, one can see that the transformation is just an intermediate step to turn the source data into a desired output format. Once the result is produced, the intermediate data is not needed anymore - it can be deleted. Maybe this transformation process runs on a daily basis, so the next day a new state of the source system data is retrieved and the transformation processes it to the desired output again.

And if the intermediate data can be deleted, then the question is whether we can leverage an in-memory store to avoid using disks and simply do the processing in memory. This is where I got interested in Apache Ignite. At its core it is a key/value store, but it also provides SQL integration. If you have a lot of transformations that use a database for staging the data and performance is an issue, you would not want to redesign all your transformations. With Apache Ignite you could simply change the place where the transformation writes to and reads from. That is an easy task, and processing the data in memory only should bring a performance boost. And yes, you could of course scale your traditional database as well or tune other parts. Again, there are many ways to do it.

Here is a sample PDI job using Apache Ignite "in the middle" between source system and target system. My source system in this case are several CSV files. The target system is a MySQL database. The processing is done using Apache Ignite.

Because Ignite will run in memory - and I have no disk persistence configured in this case - I have chosen to create the required table structures in Ignite at the beginning. Next the data is imported into Ignite and then the data is transformed. Finally the output to MySQL is done. I have an option in this flow to drop the Ignite tables at the end of the process, if that is desired.
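To make the flow a bit more tangible outside of PDI, here is a minimal Python sketch of the same sequence: create the table, import, transform, output, optionally drop. Please read it as an illustration only. It uses Python's built-in sqlite3 in-memory database as a stand-in for Ignite, and the table name, the column names and the sample data are made up.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in the post the source is several CSV files.
SOURCE_CSV = """customer,amount
alice,10.5
bob,4.0
alice,2.5
"""


def run_flow(csv_text, drop_at_end=True):
    # ":memory:" keeps everything in RAM; it only stands in for Ignite here.
    db = sqlite3.connect(":memory:")

    # Step 1: create the table structure first, because nothing is persisted.
    db.execute("CREATE TABLE stg_orders (customer TEXT, amount REAL)")

    # Step 2: import the source data into the in-memory table.
    rows = [(r["customer"], float(r["amount"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    db.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

    # Step 3: transform the data in memory with plain SQL.
    result = db.execute(
        "SELECT customer, SUM(amount) FROM stg_orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()

    # Step 4 would write `result` to the target system (MySQL in the post).

    # Step 5: optionally drop the intermediate table at the end.
    if drop_at_end:
        db.execute("DROP TABLE stg_orders")
    db.close()
    return result


print(run_flow(SOURCE_CSV))
```

The point of the sketch is only the order of the steps: because the store is empty on every start, creating the tables has to be part of the flow itself.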

As this post is already long, I will stop here. If you are interested in the details, then read the upcoming follow-up blog entry. I also have some comments on what does not work so nicely between PDI and Ignite at the moment.

Here is the link to part 2.

Carpe Diem

Load CSV file to Redis using Awk

4/8/2019

I have spent some time on refreshing my knowledge about Redis. After the basics, I came to a point where I wanted to load more data into Redis and typing everything in was not a solution.

So I have created a small awk script that can be used to process a CSV file and pipe it to the Redis client command. Apparently this is the fastest way of importing data to Redis. Here is a link: Redis Mass Insertion

Redis has a specific protocol that one has to adhere to, and the awk script does this: it reads the header row - which is mandatory (for the moment) - from the CSV file and determines the field names, then it reads a row from the file, converts it to the Redis protocol and outputs the result. This can then be piped into the Redis client command.

The script sends the data to Redis as hashes: each row gets a unique id, and each field of the row from the CSV file gets a name (from the header) and its value. So the HMSET command is used to create the structure in Redis.

Btw, here is the link to the awk script on Github. And there are other awk scripts that you may find useful.

Here is an example of the awk script to execute:

Before running this make sure that Redis is running and you can connect to it.

The -b flag (in gawk: treat characters as bytes) is used so that special characters such as é, à, etc. are properly processed and sent to Redis.

Now the awk script has some additional variables you can pass to it:

separator: which separator is used in the CSV file to divide the individual columns
rediskey: you can or want to group certain keys together in Redis. By domain or system name maybe. This key will be used as the first part of the unique identifier of the row. If not specified "csvfile" is used.
uidcolumn: defines which column number in each row has the unique identifier of that row. If it is not specified or not present in the file, then simply the row number is used.

In this example, I have chosen "geonames" as the rediskey variable. And if the unique id - taken from column 1 of the input file - is e.g. "123456", then this will end up in Redis with the key "geonames:123456".
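The way the key is put together can be sketched in a few lines of Python. This is not the awk script itself, just the naming rule described above; the function name and its signature are made up for the illustration.

```python
def redis_key(row, row_number, rediskey="csvfile", uidcolumn=None):
    """Build '<rediskey>:<uid>' for one CSV row given as a list of values."""
    uid = None
    # Column numbers are counted from 1, as they are in awk.
    if uidcolumn is not None and 1 <= uidcolumn <= len(row):
        uid = row[uidcolumn - 1]
    # Fall back to the row number if no usable id column is available.
    if not uid:
        uid = str(row_number)
    return f"{rediskey}:{uid}"


print(redis_key(["123456", "some value"], 1, rediskey="geonames", uidcolumn=1))
print(redis_key(["some value"], 7))
```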

Here is an example CSV file with a single data row:

Run through the awk script - without the pipe to Redis - this is what the output looks like. This is how the data is sent to Redis. You can read about the protocol when you follow the link at the beginning of this post.

Basically it specifies the total number of parts the message consists of (10), then the Redis command to execute and all key/value pairs. Before each part, the length of that part is given - e.g. $5 says that the next part is 5 bytes long (the HMSET command name).
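To make the format concrete, here is a minimal Python sketch of the same encoding. It is not the awk script from Github, only an illustration of the protocol; the function names and the sample row (field names and values) are made up. Note that the length prefix counts bytes, not characters, which is why special characters need care.

```python
def to_resp(*parts):
    """Encode one Redis command as an array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]          # number of parts in the message
    for part in parts:
        data = str(part).encode("utf-8")
        out.append(b"$%d\r\n" % len(data))   # length of the part in bytes
        out.append(data + b"\r\n")
    return b"".join(out)


def row_to_hmset(key, header, row):
    """Turn one CSV row into an HMSET command for the given key."""
    args = ["HMSET", key]
    for name, value in zip(header, row):
        args += [name, value]
    return to_resp(*args)


# Hypothetical header and row; "Caf\u00e9" is "Cafe" with an accented e.
header = ["id", "name", "country", "population"]
row = ["2994701", "Caf\u00e9", "FR", "1000"]
message = row_to_hmset("geonames:2994701", header, row)
print(message.decode("utf-8"))
```

With four fields this gives 10 parts: the command, the key and four name/value pairs. The accented name is 4 characters long but is announced as $5, because it takes 5 bytes in UTF-8.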

The row would be inserted into Redis with the key "geonames:2994701" and you can retrieve it from Redis like this:

And this is the response from the Redis client:

I hope you find this useful. I will work on the script to enhance it but you are also welcome to help.

Carpe Diem

Neo4j-Kafka connector: MySQL - Nifi - Kafka - Neo4j

8/4/2019

I finally found some time to test the Neo4j connector to Kafka. Specifically, I am showing here how to use the consumer in Neo4j to consume data from Kafka. I have not found a sample on the web, so I thought I would show one here.

Information about the connector is available here: github.com/neo4j-contrib/neo4j-streams
Some more documentation is here: neo4j-contrib.github.io/neo4j-streams/

My setup is as follows:

MySQL: I have data in a MySQL database containing information about airlines and airports and which airline flies from which origin airport to which destination airport
Nifi: I use Apache Nifi to listen for changes to the MySQL database tables and to send the data to Kafka
Neo4j: is configured to consume data from the three topics created using Nifi

MySQL:
Here are the tables:

and here the first 10 rows from each table:

The id column is an autoincrement value and the last_update is the inserted or last updated timestamp for the records in each table. The airlines_airports table has the information of which airline flies from where to where (to which airport).

Nifi:
Updates to the MySQL tables will result in an update of the last_update column of the relevant record. Nifi will pick up the changed records and send them to Kafka in JSON format.

My Dataflow looks like this:

Three QueryDatabaseTableRecord processors are used to watch for changes to the three MySQL tables. The UpdateAttribute processors are used to simply define the name of the Kafka topic. Finally the records are sent to Kafka using the PublishKafkaRecord processor.
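The idea behind watching a table via its last_update column can be sketched in plain Python. This is not how Nifi is implemented, only the principle of remembering the highest timestamp seen so far and fetching everything newer. It uses sqlite3 in memory instead of MySQL, and the table layout (apart from the id and last_update columns) is an assumption.

```python
import json
import sqlite3


def fetch_changes(db, table, last_seen):
    """Return rows changed after `last_seen` as JSON strings, plus the new mark."""
    cur = db.execute(
        f"SELECT id, name, last_update FROM {table} "
        "WHERE last_update > ? ORDER BY last_update",
        (last_seen,),
    )
    cols = [c[0] for c in cur.description]
    records = [dict(zip(cols, row)) for row in cur]
    if records:
        last_seen = records[-1]["last_update"]  # remember the newest timestamp
    return [json.dumps(r) for r in records], last_seen


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE airlines (id INTEGER PRIMARY KEY, name TEXT, last_update TEXT)")
db.execute("INSERT INTO airlines VALUES (1, 'Swiss', '2019-04-01 10:00:00')")

# First poll: everything is new.
messages, mark = fetch_changes(db, "airlines", "")

# An update bumps last_update, so the next poll picks up the record again.
db.execute("UPDATE airlines SET name = 'SWISS', "
           "last_update = '2019-04-02 09:00:00' WHERE id = 1")
messages2, mark2 = fetch_changes(db, "airlines", mark)
print(messages2)
```

Each JSON string corresponds to one message that would be published to the Kafka topic for that table.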

Neo4j:
I have adjusted the Neo4j configuration as documented (see link at the beginning). First, I have added the Kafka config at the end of the neo4j.conf file:

This is configuration for zookeeper, the Kafka brokers, the consumer group id and some others. After this I added three cypher statements to process data from the three Kafka topics:

The first two cypher statements do a merge on the Airport or Airline based on the ID of the records. If the relevant ids exist they are updated, otherwise created.

The last cypher statement creates the relationship between the airports and the airlines: which airline flies from which airport (origin) to which other airport (destination).

So this is my data pipeline: MySQL has the data, and any updates are made here. The changes are picked up by Nifi, which sends them to the relevant Kafka topic. And because I configured the three cypher statements in the Neo4j config, Neo4j consumes any messages that arrive in the three Kafka topics. So if there are any changes in the MySQL data, they will automatically arrive in Neo4j.
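What "merge" means here can be shown with a tiny in-memory model in Python. It is only meant to illustrate the create-or-update behaviour and has nothing to do with how Neo4j works internally; the function name, the ids and the way a route is stored are all made up.

```python
def merge_node(nodes, label, node_id, **props):
    """Create the node if its id is new, otherwise update its properties."""
    key = (label, node_id)
    created = key not in nodes
    nodes.setdefault(key, {}).update(props)
    return created


nodes = {}       # (label, id) -> properties
routes = set()   # (airline id, origin airport id, destination airport id)

merge_node(nodes, "Airport", 1, code="ZRH")
merge_node(nodes, "Airport", 1, code="ZRH", name="Zurich")  # same id: update only
merge_node(nodes, "Airport", 2, code="GVA")
merge_node(nodes, "Airline", 7, code="LX")

# Relationships behave the same way: merging the same route twice keeps one.
routes.add((7, 1, 2))
routes.add((7, 1, 2))

print(len(nodes), len(routes))
```

Because a repeated message only updates what is already there, the same record can arrive from Kafka more than once without creating duplicates.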

Once the data is available or updated in Neo4j, I can run e.g. a query to see where Swiss (airline code=LX) is flying to from Zurich (airport code=ZRH).

The result would then look like this:

As you can see, configuring Neo4j to use Kafka as a streaming source is straightforward. The developers of the connector have made a good choice to use cypher as the connecting part between Kafka topics and Neo4j. This way, you have the greatest flexibility to handle the data from Kafka using the power of cypher.

Besides Kafka and Neo4j, Apache Nifi is used for the dataflow management. It is a very good tool for dataflows: it is flexible and scalable, has many connectors, and is strong when it comes to schemas (inherit, infer), data provenance and routing the data to various target systems.

Carpe Diem

Pentaho ETL and Metadata Injection

14/2/2019

A short follow up on the last post. Sometimes a picture explains more than 1000 words, so I have visualized the advantages of metadata injection.

In the screenshot below one can see that with different input files - because they differ e.g. in the fields and data types, the separator used or maybe the encoding - the ETL logic is duplicated. Although the same basic logic applies, we need to create multiple transformations because of the file differences.

The result is that if something changes or the transformation is extended, it has to be done in multiple places. And if you get additional input files with yet another structure, then you have even more duplication. All this has a bad influence on agility and also on quality.

Of course, you could make an effort to generalize the logic that all transformations have in common. In this case you would not have multiple (or as many) duplicates. But it makes the overall ETL more complicated, because you have things that are different and things that are common. With a growing number of input files (differences) this also quickly gets complicated or even unmanageable.

If you use metadata injection, then the work to analyze the differences of the input files still has to be done. But the positive aspect is that you can reduce the number of transformations. Instead you define metadata - e.g. in simple CSV files. This approach is much cleaner and simpler. Simplicity is always good when it comes to maintenance and when you share your development work with others, and agility and the overall quality benefit as well.

In this case, when you get additional input files in yet other formats and with different data types, you won't have to touch your ETL logic. You simply define the metadata according to the input file(s) and you are done. So the more different files you have, the more you will benefit from this solution.
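The principle is not specific to PDI. As a rough illustration in Python (not PDI code; the metadata layout, the function name and the sample files are made up), one generic reader can handle differently structured files when the structure is passed in as data:

```python
import csv
import io

# Maps a type name from the metadata to a Python conversion function.
CASTS = {"String": str, "Integer": int, "Number": float}


def read_with_metadata(text, meta):
    """Read CSV text using field names, types and delimiter from `meta`."""
    reader = csv.reader(io.StringIO(text), delimiter=meta["delimiter"])
    next(reader)  # skip the header row; field names come from the metadata
    return [
        {f["name"]: CASTS[f["type"]](value) for f, value in zip(meta["fields"], line)}
        for line in reader
    ]


# Two hypothetical input files with different fields, types and separators.
customers_csv = "id;name\n1;Alice\n2;Bob\n"
customers_meta = {"delimiter": ";", "fields": [
    {"name": "id", "type": "Integer"},
    {"name": "name", "type": "String"},
]}

products_csv = "sku,price\nA-1,9.5\n"
products_meta = {"delimiter": ",", "fields": [
    {"name": "sku", "type": "String"},
    {"name": "price", "type": "Number"},
]}

print(read_with_metadata(customers_csv, customers_meta))
print(read_with_metadata(products_csv, products_meta))
```

A third file format would need a third metadata definition, but no change to the function.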

In Pentaho PDI (ETL) metadata injection is available for many of the steps (plugins). Once you have understood the concept and have done it 2 or 3 times, it will be an easy task to use it instead of hardcoding file structure and duplicating logic. Of course, only if metadata injection makes sense in your use case.

Carpe Diem

Pentaho PDI (ETL) with Neo4j (and Kafka) - Part 2

10/2/2019

Welcome to part two. I have been experimenting quite a lot lately with combining Pentaho PDI (Kettle) - the Pentaho ETL tool - with Neo4j. One obvious thing is that when you have more than a couple of nodes and relationships to create and you do it from e.g. CSV files, then you quickly start to duplicate a lot of things - e.g. reading the file and outputting it to Neo4j.

Metadata Injection:
But there is a way around this duplication: PDI - as the great ETL tool it is - supports metadata injection. So instead of hardcoding field names, data types and various other things, these can be injected into a step at runtime. So when you have different CSV files for different nodes - because they have different attributes and data types - then you can define this metadata (e.g. in a file). When a node is to be created, the relevant metadata is used at runtime to fill the PDI steps, and with the next node exactly the same is done. This avoids the duplication I talked about previously.

Actually many steps in PDI support metadata injection. Not only for fields and data types, but also for filenames, formatting (separators, enclosure), encoding (UTF-8) and much more.

The plan:
What I plan to do is to use metadata injection in PDI to load data about Neo4j nodes from CSV files into Kafka. Then I will have another process that consumes data from the different Kafka topics (one for each Neo4j node) and creates, merges or updates the nodes and also the relationships. This I will show in part 3 of this blog series.

The messages all have an "event_type" attribute. This indicates which type of event (transaction) is applicable: insert, create, merge or delete.

I put it into Kafka for several reasons:

I can wipe my graph, reset my Kafka consumer to the beginning of the topic and "replay" all events that happened, which in turn recreates my graph from scratch.
I can feed Kafka e.g. from a file or do it manually; for the consumer side of the messages nothing will change if I change this end. I could also e.g. hook up a database (with CDC - Change Data Capture) and get the data from there.
Kafka gives me the realtime processing capabilities. Data that arrives in Kafka as events can immediately be consumed and trigger an update of my graph.
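The first reason, replaying events to rebuild the graph, can be illustrated with a small Python sketch. It does not use Kafka or Neo4j; a list stands in for the topic and a dictionary for the graph. The message layout (the "id" and "data" fields) and the decision to treat insert, create and merge alike are assumptions made for the example.

```python
def apply_event(graph, event):
    """Apply one event to the graph, depending on its event_type."""
    key = (event["type"], event["id"])
    if event["event_type"] == "delete":
        graph.pop(key, None)
    else:
        # In this sketch insert, create and merge are all handled as an upsert.
        graph.setdefault(key, {}).update(event["data"])


def replay(events):
    """Rebuild the graph from scratch by applying all events in order."""
    graph = {}
    for event in events:
        apply_event(graph, event)
    return graph


# Hypothetical event log, as it could sit in a topic.
topic = [
    {"event_type": "create", "type": "SourceSystem", "id": 1, "data": {"name": "CRM"}},
    {"event_type": "merge", "type": "SourceSystem", "id": 1, "data": {"owner": "Jane"}},
    {"event_type": "create", "type": "SourceSystem", "id": 2, "data": {"name": "ERP"}},
    {"event_type": "delete", "type": "SourceSystem", "id": 2, "data": {}},
]

graph = replay(topic)
print(graph)
```

Wiping the graph and calling replay(topic) again gives the same result, which is exactly the property I want from keeping the events.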

Use Case:
The graph I will create will store information about source systems, target systems, connectors, servers, clusters, people and much more. It shows how these individual objects are connected to each other and in turn allows me to run queries on it.

So the first step is to read the CSV files for the different nodes and output the data to Kafka - one topic per node type. At the beginning I have defined which nodes I need and their relationships. Then I have created several CSV files which contain a header defining the attributes of the node and then some data.

Here is an example:

The "type" field will be used as the label of the nodes in Neo4j. And the "event_type" is the type of transaction that has to be done. Later the "source_system_id" will be used to create a relation from this node (ITApplicationOwner) to the SourceSystem node.

Then I have a PDI transformation which contains the metadata injection step and several other parts that deliver the metadata to this step. Inside the metadata injection step, the metadata is mapped to the relevant fields in the steps of another (child) transformation.

This is what the transformation looks like. In the subtransformation there are steps to:

read the relevant CSV file
concatenate two fields to construct the key of the Kafka message
format the data to JSON format
depending on the "environment" setting (DEV or PROD) send the result either to the log or to Kafka.
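For readers who do not have PDI at hand, the four steps above can be sketched in Python. This is only an approximation of the logic, not the transformation itself. The sample row, the choice of the two key fields, the "_" separator and the send callable (which stands in for a Kafka producer) are assumptions.

```python
import csv
import io
import json


def process(csv_text, key_fields, environment, send=None, delimiter=";"):
    """Read CSV rows, build key and JSON message, route by environment."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        key = "_".join(row[f] for f in key_fields)   # concatenate two fields
        message = json.dumps(row)                    # format the row as JSON
        if environment == "PROD":
            send(key, message)                       # a `send` callable is required here
        else:
            print(key, message)                      # DEV: only write to the log
        results.append((key, message))
    return results


# Hypothetical input file for the ITApplicationOwner node.
sample = (
    "id;type;event_type;name;source_system_id\n"
    "1;ITApplicationOwner;merge;Jane Doe;42\n"
)

sent = []
process(sample, ("type", "id"), "PROD", send=lambda k, m: sent.append((k, m)))
dev_result = process(sample, ("type", "id"), "DEV")
```

Switching between DEV and PROD only changes where the result goes; the key and the message are built the same way in both cases.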

So the steps that feed the metadata injection step in the middle provide information about which fields are used in the CSV file, which delimiter is used, which encoding and more. For the JSON output step the fields and data types are also required, and e.g. for the step that concatenates the two fields I need to provide the name of the resulting field of the concatenation.

Now the subtransformation:

From the metadata injection step, the "CSV file input" step above gets the definition of the fields in the CSV file and also the data types. Next the two fields are concatenated and then the "environment" setting is evaluated and the flow continues to write to the log or send the message to Kafka.

Let's have a look at the "CSV file input" step:

You can see that - at the bottom - there are no fields and data types defined. Usually - in a transformation without metadata injection - one would have to define the fields that the CSV file is made up of. But here it is empty. These settings are injected into this step at runtime. The same goes for the "Delimiter", "Enclosure" and "File encoding" settings.

For the "ITApplicationOwner" node and CSV file shown further above, I have defined this metadata file:

The first line is the header row. The following lines each define a field and its data type and are injected into the step (for the "Name" and "Type" field) shown in the previous screenprint.

And this is what metadata injection does: in this case, I only have to define the metadata of each CSV file I want to use. At runtime this structure is injected, so that I can have lots of differently formatted files and still need just one flow and logic to process them.

As you can see above, there is also a configuration in the "CSV file input" step for the "Filename" - the file that is to be processed. This does not come from the metadata. Instead I have defined parameters for that. When I run the whole transformation, then before it kicks off, I specify the file (filename) I want to process and this information is inserted appropriately. This way - at runtime - I can dynamically load different files. And of course this could be scripted and scheduled to process many files.

Here is the log info for the processed "ITApplicationOwner" CSV file that will later become a node in Neo4j. Below you can see the key and message (in JSON) that would be sent to Kafka:

And when I run this exact transformation again, but using a different file (here: "SourceSystem") with different metadata, then this comes out in the log:

As you can see, Pentaho PDI and metadata injection are immensely helpful to avoid duplicate work and hardcoding. And as such it is a clear plus for data quality: simpler flows and logic are easier to control and to maintain.

It is a different way of constructing an ETL, but it is far more efficient than creating several flows and logic, e.g. one flow per CSV file and node type. With that approach you end up doing the same things over and over again, just with a little bit of difference in the file structure.

Now there is not so much Neo4j in here today, other than that I have prepared the data to be sent to Kafka. But the next part of this series will also use PDI (with metadata injection) to consume the data from Kafka and then send it to Neo4j to construct a graph.

It will be a somewhat universal process that updates the Neo4j graph based on messages that arrive in Kafka. So a message that is sent to Kafka (from the console or a file or maybe a database) is immediately processed by PDI and updates the graph (database).

As always, send me your comments or questions please. Hope you enjoyed it.

Carpe Diem.
