Pentaho PDI: Running ETL jobs with the Coordination Server

23/12/2017

The last post presented the first version of the Coordination server, which acts as a coordinator for running ETL jobs.

Why is it useful? When ETL jobs depend on each other, you need a way to keep the second job from starting until the first one has finished; otherwise you get unreliable results. Usually the ETLs are separated by time, based on a manual decision. You run ETL 1, you know it usually takes 15 minutes, so you schedule ETL 2 to start 25 minutes after ETL 1. But this is not reliable. If ETL 1 runs longer one day, ETL 2 might start too early. As a result, you have to keep watching all processes to make sure they do not overlap, and you have to do this repeatedly to ensure your quality of service. The Coordination server makes sure that no job runs before the jobs it depends on have finished.

The idea is that the messages sent to the server are triggered by an existing scheduler such as cron. The server only coordinates the execution of the jobs; it does not do the scheduling itself. This takes the complexity of chaining (timing) ETL processes away from scripts, cron or other methods and delegates it to the Coordination server.

Once the dependencies between jobs are defined, the Coordination server makes sure that a job that depends on another one (or on several) is only executed after that job (or those jobs) has finished.
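To make the gating idea concrete, here is a minimal Python sketch. It is not the Coordination server's actual code: the class `Coordinator` and its methods `request` and `finish` are hypothetical names chosen for illustration. It only models the rule described above, that a requested job is held back until every job it depends on has reported completion.

```python
class Coordinator:
    """Toy model of dependency gating (illustration only, hypothetical API)."""

    def __init__(self, dependencies):
        # dependencies: job name -> names of the jobs that must finish first
        self.dependencies = {job: set(deps) for job, deps in dependencies.items()}
        self.finished = set()
        self.waiting = []   # jobs requested but still blocked, in request order
        self.started = []   # jobs released for execution, in start order

    def _ready(self, job):
        # A job is ready when all of its dependencies are in the finished set.
        return self.dependencies.get(job, set()) <= self.finished

    def request(self, job):
        """Called when the scheduler (for example cron) asks for a job to run."""
        if self._ready(job):
            self.started.append(job)
            return True
        self.waiting.append(job)
        return False

    def finish(self, job):
        """Called when a job reports completion; releases jobs that are now unblocked."""
        self.finished.add(job)
        still_waiting = []
        for pending in self.waiting:
            if self._ready(pending):
                self.started.append(pending)
            else:
                still_waiting.append(pending)
        self.waiting = still_waiting


coord = Coordinator({"etl2": {"etl1"}})
coord.request("etl2")   # held back, because etl1 has not finished yet
coord.request("etl1")   # no dependencies, starts immediately
coord.finish("etl1")    # etl2 is released now
```

With this model, cron can trigger both jobs at the same time and the order of execution still follows the dependencies instead of a guessed time gap.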

I have extended the functionality quite a bit over the last few days. One of the bigger changes was the introduction of dynamic date calculations.

Imagine you have an ETL job that uses a parameter "month" to define the month for which data should be processed. Let's further assume that you always run the process for the previous month. So you need a way of calculating the previous month before the job starts. Usually this is done in a shell script that runs the job.

I have included this functionality in the job definition, so that the coordination server calculates the specified dynamic date automatically. In the parameters section of the file, you can now define a dynamic date value like this:

From this, the previous month is calculated. The number after the variable is the offset applied to the specified field, counted from the current date. If you omit the offset (and the colon), the current date is used.

The variables you can use are:

year
month
day
week
hour
minute

This way you do not have to hardcode date-related parameters. Note that all value calculations for dates return an integer.
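As a rough illustration of what such a calculation involves, the following Python sketch resolves a value of the form `field` or `field:offset` to an integer. The exact notation used in the job definition file is not reproduced here, so the `month:-1` style, the function name `resolve_dynamic_date`, and the choice of the ISO week number for "week" are all assumptions made for this example, not the server's actual behaviour.

```python
from datetime import datetime, timedelta

FIELDS = ("year", "month", "day", "week", "hour", "minute")


def resolve_dynamic_date(spec, now=None):
    """Resolve 'field' or 'field:offset' (assumed notation) to an integer."""
    now = now or datetime.now()
    field, _, offset_text = spec.partition(":")
    field = field.strip().lower()
    if field not in FIELDS:
        raise ValueError(f"unknown date field: {field!r}")
    # A missing offset means "use the current value".
    offset = int(offset_text) if offset_text.strip() else 0

    if field == "year":
        return now.year + offset
    if field == "month":
        # Count months continuously so the offset wraps across year boundaries.
        return (now.year * 12 + now.month - 1 + offset) % 12 + 1
    # The remaining fields map directly onto timedelta units.
    unit = {"day": "days", "week": "weeks", "hour": "hours", "minute": "minutes"}[field]
    shifted = now + timedelta(**{unit: offset})
    if field == "week":
        return shifted.isocalendar()[1]  # ISO week number (assumption)
    return getattr(shifted, field)


# Run in mid-January 2018, "previous month" resolves to December.
reference = datetime(2018, 1, 15, 10, 30)
previous_month = resolve_dynamic_date("month:-1", now=reference)  # 12
```

Note that the month case needs its own arithmetic: subtracting one from January has to give 12, not 0, which is exactly the kind of detail that otherwise ends up in the shell script around the job.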

With dynamic date values, the job always runs for the right date without any changes to its parameters. Give it a try; everything is available on my GitHub account.

Carpe Diem
