Blog Archives

Pentaho PDI: Running ETL jobs with the Coordination Server

23/12/2017

The last post presented the first version of the Coordination server, which acts as a coordinator for running ETL jobs.

Why is it useful? When ETL jobs depend on each other, you need a way to keep the second job from starting until the first one has finished; otherwise the results are unreliable. Usually the ETLs are separated by "time" (a manual decision): you run ETL 1, you know it usually takes 15 minutes, so you schedule the second ETL to start 25 minutes after ETL 1. But this is not reliable. If ETL 1 runs longer one day, ETL 2 might start too early. As a result, you have to keep watching all processes to make sure they do not overlap, and you have to do this repeatedly to ensure your quality of service. The Coordination server makes sure that no job runs before the jobs it depends on have finished.

The idea is that the messages sent to the server are triggered by an existing scheduler such as cron. The server only coordinates the execution of the jobs; it does not do the scheduling itself. This takes the complexity of chaining (timing) ETL processes out of scripts, cron or other methods and delegates it to the Coordination server.

By defining the dependencies between jobs, the Coordination server makes sure that a job that depends on another one (or on several) is only executed after that job (or those jobs) has finished.
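To make the dependency rule concrete, here is a minimal Java sketch of such a check. It is not the Coordination server's actual code: the class `DependencyCheck`, its methods and the job ids `job_a` and `job_b` are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Coordination server code: remembers which
// jobs have finished and decides whether a given job may start.
final class DependencyCheck {
    private final Map<String, List<String>> dependsOn = new HashMap<>();
    private final Set<String> finished = new HashSet<>();

    // Register a job together with the ids of the jobs it depends on.
    void define(String jobId, String... dependencies) {
        dependsOn.put(jobId, Arrays.asList(dependencies));
    }

    // Record that a job has completed.
    void markFinished(String jobId) {
        finished.add(jobId);
    }

    // A job may start only if it is defined and all its dependencies have finished.
    boolean canStart(String jobId) {
        List<String> deps = dependsOn.get(jobId);
        return deps != null && finished.containsAll(deps);
    }

    public static void main(String[] args) {
        DependencyCheck check = new DependencyCheck();
        check.define("job_a");           // no dependencies
        check.define("job_b", "job_a");  // job_b has to wait for job_a
        System.out.println(check.canStart("job_b")); // false
        check.markFinished("job_a");
        System.out.println(check.canStart("job_b")); // true
    }
}
```

A real server would additionally have to persist this state and handle failed jobs; the sketch only shows the "do not start before the dependencies are done" rule.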

I have extended the functionality quite a bit in the last few days. One of the bigger changes was the introduction of dynamic date calculations.

Imagine you have an ETL job that uses a parameter "month" to define the month for which data should be processed. Let's further assume that you always run the process for the previous month. So you need a way of calculating the previous month before the job starts. Usually this is done in a shell script that runs the job.

I have included this functionality in the job definition, so that the coordination server calculates the specified dynamic date automatically. In the parameters section of the file, you can now define a dynamic date value like this:

From this, the previous month is calculated. The number after the variable defines the offset for the specified field, calculated from the current date. If you omit the offset (and the colon), the value from the current date is used.

The variables you can use are:

year
month
day
week
hour
minute

This way you do not have to hardcode date-related parameters. Note that all value calculations for dates return an integer.
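As a rough illustration of such a calculation, here is a small Java sketch. The class name, the method and the `field:offset` notation (e.g. `month:-1`) are assumptions made for this example and are not taken from the Coordination server code; week numbers are assumed to be ISO weeks.

```java
import java.time.LocalDateTime;
import java.time.temporal.IsoFields;

// Hypothetical sketch: resolves expressions such as "month:-1" or "year"
// to an integer, relative to a given point in time.
final class DynamicDate {

    static int resolve(String expression, LocalDateTime now) {
        // Split "field:offset"; a missing offset means "no shift".
        String[] parts = expression.split(":", 2);
        String field = parts[0].trim();
        long offset = parts.length == 2 ? Long.parseLong(parts[1].trim()) : 0L;

        switch (field) {
            case "year":   return now.plusYears(offset).getYear();
            case "month":  return now.plusMonths(offset).getMonthValue();
            case "day":    return now.plusDays(offset).getDayOfMonth();
            case "week":   return now.plusWeeks(offset).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
            case "hour":   return now.plusHours(offset).getHour();
            case "minute": return now.plusMinutes(offset).getMinute();
            default:
                throw new IllegalArgumentException("Unknown date field: " + field);
        }
    }

    public static void main(String[] args) {
        // Prints the number of the previous month, based on the current date.
        System.out.println(resolve("month:-1", LocalDateTime.now()));
    }
}
```

Shifting the whole date first and then reading the field handles the roll-over cases: in January, `month:-1` yields 12 rather than 0.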

With dynamic date values, you can avoid hardcoding and the job will always run for the right date. Give it a try; everything is available on my GitHub account.

Carpe Diem

Pentaho PDI: Running chained ETL's - Examples

20/12/2017

Here are some examples so that you have an idea of what I am talking about.

The server is started standalone on a machine and listens on a dedicated port:

Now I send a message to the server from a client:

I currently have jobs with the ids "id_0001" and "id_0002" defined. The JSON definition looks like this:

So I can ask the server, which loaded this file on startup, whether a job has finished:

And one could start a job:

If you look on the server side, you will see this:

You can see that the client activated the ETL job at 23:27:15. But the job is scheduled to start at 23:30, so the server was waiting. Then at 23:30:05 the job was started and at 23:30:28 it finished. The ETL is just creating some random data and outputting it to a CSV file. The exit code of the process was "0", so it ran successfully.

In the JSON file where the jobs are defined, you can see the attributes of a job, which gives you an idea of what I currently process. You can set a dependent job, the log level, the parameters for a job, the interval at which the server checks whether a job can run, etc. The process also creates log files for the job runs.

Hope this helps to understand the direction I am going in. I will do some further development and then run some more realistic examples with many ETL processes to see how well it behaves. I will also check scaling and performance, as well as running multiple servers.

Carpe Diem.

Pentaho PDI: Running chained ETL's

20/12/2017

I have created many ETL processes with Pentaho PDI (aka Kettle) over the last years. In addition to processing data, there are also many reports that depend on the processed data, as well as interfaces producing data for other systems in various formats.

Once you have a nice collection together, there is one obvious problem. How do you schedule the processes to run at the right time?

Some of the ETLs might depend on each other to complete successfully. Or you might want to be sure that you send out the reports only after the data has been processed, and only then. So good timing of the jobs and reports is necessary. But that gets complicated because the processes can have varying runtimes. Maybe they slow down with more data to process, or some other system gets slower in some situations. This ends up in a lot of repeated fine-tuning and takes a lot of attention almost every day: checking logs, runtimes, etc.

You could chain some of the processes together to run one after another. But again, this gets complicated, and I don't think there is a standard way of doing this.

I have these problems myself, and I spend considerable time investigating and fine-tuning the start times of processes. Plus, I have to explain to the business users why a report was empty (the ETL finished later than expected).

This brought me to the idea of creating a Java client/server process that allows you to chain ETLs and reports together. Chaining makes sure that an ETL does not start before the ETL it depends on has finished. The same applies to the reports. A timeout can be defined: if the ETL that is depended on has not finished by then, the waiting ETL will not run anymore. The server part takes care of this.
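The timeout behaviour can be sketched in a few lines of Java. This is only an illustration of the idea, using a `CountDownLatch` from the standard library; the class `DependencyTimeout` and its method names are hypothetical and say nothing about how the server implements it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a waiting ETL is released when the ETL it depends on
// reports completion, or given up on once the timeout has passed.
final class DependencyTimeout {
    // Counted down exactly once, when the upstream ETL has finished.
    private final CountDownLatch upstreamFinished = new CountDownLatch(1);

    void upstreamHasFinished() {
        upstreamFinished.countDown();
    }

    // Blocks until the upstream ETL has finished or the timeout elapses.
    // Returns true if the waiting ETL may run, false if it should be skipped.
    boolean mayRunDownstream(long timeout, TimeUnit unit) {
        try {
            return upstreamFinished.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DependencyTimeout gate = new DependencyTimeout();
        // Simulate an upstream ETL that finishes after 200 ms.
        Thread upstream = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            gate.upstreamHasFinished();
        });
        upstream.start();
        boolean run = gate.mayRunDownstream(2, TimeUnit.SECONDS);
        System.out.println(run ? "starting downstream ETL" : "timeout reached, downstream ETL skipped");
        upstream.join();
    }
}
```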

The client part allows you to send messages to the server, for example to check whether a job has finished, to start a job, to check exit codes or to reload the definitions of the ETLs to run, just to name a few.
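To give an idea of what such a client could look like, here is a minimal Java sketch that sends one line of text to a server and reads one line back. The line-based exchange, the class `MessageClient` and the command-line arguments are assumptions for this example; the actual message format of the Coordination server is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a client: one request line out, one reply line back.
final class MessageClient {

    // Sends the message and returns the first line of the reply
    // (null if the server closes the connection without answering).
    static String send(String host, int port, String message) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(message); // auto-flush is enabled, so the line is sent immediately
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: MessageClient <host> <port> <message>");
            return;
        }
        System.out.println(send(args[0], Integer.parseInt(args[1]), args[2]));
    }
}
```

Because the client is just a small program, it can be called from a crontab entry, a shell script or any other tool that is able to start a process.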

The process is fed by a JSON file defining the jobs, parameters, log level, dependent job(s) and more. I am using JSON at this phase of the project because it is easy to work with during development, but it could be changed to use a database instead.

The process is multi-threaded. It can be initiated by simply sending a message to the running server described above. This message could be sent by a crontab entry at a defined time. The dependencies are then handled by the server, and you have no complicated chaining to do in your crontab or shell scripts.

But you could also send this message from somewhere else, such as a web interface, another program or the command line, for example to check the status of jobs or to gather runtimes and exit codes.

I have an early beta version ready. I will publish the code on GitHub as soon as possible.

Let me know if you are interested. Let me know if you would like to participate or have ideas or problems to solve which I could incorporate.

Carpe Diem

CSV to Avro

5/12/2017

When it comes to data, the CSV format is very popular. You can easily open the CSV file in your favorite editor or you can process it using standard Linux tools or e.g. a scripting language. Many applications support data in CSV format. CSV files might contain a header row, which defines the names of the individual columns of a row.

But there is one problem: there is no information about the data types of the individual columns available. So any system that processes the CSV data down the road will have to find out what the data types are. Some tools try to guess the data types, but manual adjustments are often required.

Ok, you could store the data types of the fields as separate metadata in another file and process that, but there is no standard for that and so tools can't use this metadata out of the box. So this is not an ideal situation.

The Avro format stores the data and the schema defining the data types together in one file. Data is usually stored in binary format and can also be compressed. If you use this format, you have the metadata (schema) and the data together. Many tools nowadays support the Avro format as well, and there are libraries available for C++, Java and C#, among others.

An Avro schema (check out the Avro homepage for details) is written in JSON and looks e.g. like this:

The type is set to "record" and further below is the definition of the individual fields and their types.

To make the conversion from a CSV file to an Avro file easy, I have developed a Java class that simplifies the task and also does error handling. The code is available online on my GitHub account.

To use the program you need a CSV file. Then you need to write a schema that fits the CSV data. Once you have done this, convert the CSV data to Avro format as shown in the code below.

You can use the sample above if you have a CSV file where the columns are in exactly the same sequence as defined in the Avro schema. The code will take the first field in the CSV row and map it to the first field in the Avro schema, then the second field in the CSV row to the second field in the Avro schema, and so on.

But what if you have a CSV file whose columns are in a completely different sequence than the one defined in the Avro schema? Pass the field names, in the sequence in which they appear in the CSV file, to the CsvToAvroGenericWriter. Like this:

You can see that the fields of the two data rows are in a different order than in the example before. This order does not match the order of the fields as they are defined in the Avro schema.

So define the fields of the header in their correct sequence, using the same names for the fields as defined in the Avro schema. Next use the writer.setCsvHeader() method and pass the header fields to it. That's all. The program will now match the fields by their names.

In this case - with the defined header fields - if a field of the CSV row is not found in the Avro schema definition, it is ignored.
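The difference between the two mapping modes described above can be illustrated with plain Java collections. This sketch does not use the Avro library and is not how CsvToAvroGenericWriter is implemented; the class `CsvFieldMapping`, the field names and the sample values are invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two ways a CSV row can be matched to schema fields.
final class CsvFieldMapping {

    // Without a header: the first CSV value goes to the first schema field, and so on.
    static Map<String, String> byPosition(List<String> schemaFields, String[] row) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < schemaFields.size() && i < row.length; i++) {
            record.put(schemaFields.get(i), row[i]);
        }
        return record;
    }

    // With a header: values are matched by name; CSV columns that do not
    // exist in the schema are ignored.
    static Map<String, String> byHeader(List<String> schemaFields, String[] header, String[] row) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < row.length; i++) {
            if (schemaFields.contains(header[i])) {
                record.put(header[i], row[i]);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        List<String> schemaFields = Arrays.asList("id", "name", "city"); // sample schema fields
        String[] header = {"city", "id", "comment", "name"};             // CSV column order
        String[] row = {"Vienna", "1", "n/a", "Alice"};                  // sample data row
        // "comment" is not part of the schema and is dropped.
        System.out.println(byHeader(schemaFields, header, row)); // {city=Vienna, id=1, name=Alice}
    }
}
```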

Easy, isn't it? You can loop over the rows of a CSV file and quickly convert it to Avro format. You have data and schema together, and the file will probably also be considerably smaller.

There is also the possibility to define the Avro schema and automatically generate the equivalent Java code for it.

Like this:

When you do this, you get a Customer.java class. You can re-use it in your various code projects. Then, instead of using the CsvToAvroGenericWriter, you can use the CsvToAvroWriter class. You instantiate it like this:

This way you have your Customer objects available after the data has been written to the Avro file and you can continue to work with them.

Hope you enjoyed it. It is a really small utility to convert your data, but it might be helpful and saves you from writing and inventing the code yourself.

Carpe Diem.
