Once you have a nice collection together, there is one obvious problem. How do you schedule the processes to run at the right time?
Some of the ETLs might depend on others completing successfully. Or you might want to be sure that the reports are sent out only after the data has been processed, and only then. So the jobs and reports need to be timed well. But that gets complicated, because the processes can have varying runtimes. Maybe they slow down with more data to process, or some other system gets slower in certain situations. This ends up in a lot of repeated fine-tuning and takes attention almost every day: checking logs, runtimes, etc.
You could chain some of the processes together to run one after another. But again, this gets complicated, and I don't think there is a standard way of doing it.
I have these problems myself, and I spend considerable time investigating and fine-tuning the start times of processes. Plus I have to explain to the business users why a report was empty (because the ETL finished after the report had already run).
This gave me the idea of creating a Java client/server process that lets you chain ETLs and reports together. Chaining makes sure that an ETL does not start before the ETL it depends on has finished. The same applies to the reports. A timeout can be defined for the case where the ETL being depended on has not finished; once it expires, the waiting ETL will not run at all. The server part takes care of this.
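To make the timeout rule a bit more concrete, here is a minimal sketch of how such a wait could look in Java. It is not the actual project code: the class name DependencyGate, the method awaitUpstream and the rule that only exit code 0 counts as success are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: all names here are illustrative, not taken from the project.
class DependencyGate {

    enum Decision { RUN, SKIP }

    /**
     * Waits until the upstream job reports an exit code or the timeout expires.
     * Returns RUN only if the upstream job finished in time with exit code 0
     * (treating 0 as "success" is an assumption of this sketch).
     */
    static Decision awaitUpstream(CompletableFuture<Integer> upstreamExitCode,
                                  long timeout, TimeUnit unit) {
        try {
            int exitCode = upstreamExitCode.get(timeout, unit);
            return exitCode == 0 ? Decision.RUN : Decision.SKIP;
        } catch (TimeoutException e) {
            // The upstream job did not finish in time: the waiting job is not run.
            return Decision.SKIP;
        } catch (ExecutionException e) {
            // The upstream job failed with an exception instead of an exit code.
            return Decision.SKIP;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and do not start the waiting job.
            Thread.currentThread().interrupt();
            return Decision.SKIP;
        }
    }
}
```

The upstream job would complete the future with its exit code when it ends, and the dependent ETL or report is started only when the gate returns RUN.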
The client part lets you send messages to the server, for example to check whether a job has finished, to start a job, to check exit codes, or to reload the definitions of the ETLs to run, just to name a few.
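As an illustration of what handling such messages on the server side might look like, here is a small sketch of a line-based command handler. The message format (STATUS, START, EXITCODE, RELOAD followed by a job name) and the class name CommandHandler are invented for this example; the real protocol may look quite different.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a text protocol; verbs and replies are assumptions.
class CommandHandler {

    enum State { PENDING, RUNNING, FINISHED }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Map<String, Integer> exitCodes = new ConcurrentHashMap<>();

    /** Makes a job known to the handler. */
    void register(String job) {
        states.put(job, State.PENDING);
    }

    /** Records that a job has ended with the given exit code. */
    void markFinished(String job, int exitCode) {
        exitCodes.put(job, exitCode);
        states.put(job, State.FINISHED);
    }

    /** Handles one message such as "STATUS load_sales" and returns the reply. */
    String handle(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        String verb = parts[0].toUpperCase(Locale.ROOT);
        String job = parts.length > 1 ? parts[1] : "";
        switch (verb) {
            case "STATUS": {
                State state = states.get(job);
                return state == null ? "ERROR unknown job" : "OK " + state;
            }
            case "START": {
                if (!states.containsKey(job)) {
                    return "ERROR unknown job";
                }
                // A real server would launch the ETL process here.
                states.put(job, State.RUNNING);
                return "OK started";
            }
            case "EXITCODE": {
                Integer code = exitCodes.get(job);
                return code == null ? "ERROR no exit code yet" : "OK " + code;
            }
            case "RELOAD":
                // Placeholder: re-reading the job definitions is not shown.
                return "OK reload requested";
            default:
                return "ERROR unknown command";
        }
    }
}
```

Keeping the handler free of any socket code means the same logic could sit behind a TCP connection, a web interface or a command line tool.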
The process is fed by a JSON file defining the jobs, parameters, log level, dependent job(s) and more. I am using JSON at this phase of the project because it is easy to work with during development, but it could be changed to use a database instead.
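To give an idea of what one job entry could carry, here is a sketch of a Java class mirroring such a definition, together with a simple check that every dependency names a job that exists. The field names (name, command, parameters, logLevel, dependsOn, timeoutMinutes) and the sample JSON in the comment are assumptions for illustration, not the actual file format; reading the JSON itself is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical shape of one entry in the JSON file, for example:
//   { "name": "sales_report", "command": "report.sh",
//     "parameters": ["--day", "today"], "logLevel": "INFO",
//     "dependsOn": ["load_sales"], "timeoutMinutes": 60 }
final class JobDefinition {
    final String name;
    final String command;
    final List<String> parameters;
    final String logLevel;
    final List<String> dependsOn;   // names of the jobs this job waits for
    final int timeoutMinutes;       // how long to wait for those jobs

    JobDefinition(String name, String command, List<String> parameters,
                  String logLevel, List<String> dependsOn, int timeoutMinutes) {
        this.name = name;
        this.command = command;
        this.parameters = List.copyOf(parameters);
        this.logLevel = logLevel;
        this.dependsOn = List.copyOf(dependsOn);
        this.timeoutMinutes = timeoutMinutes;
    }

    /** Returns one message per dependency that names a job missing from the list. */
    static List<String> findUnknownDependencies(List<JobDefinition> jobs) {
        Set<String> known = new HashSet<>();
        for (JobDefinition job : jobs) {
            known.add(job.name);
        }
        List<String> problems = new ArrayList<>();
        for (JobDefinition job : jobs) {
            for (String dependency : job.dependsOn) {
                if (!known.contains(dependency)) {
                    problems.add(job.name + " depends on unknown job " + dependency);
                }
            }
        }
        return problems;
    }
}
```

A check like this could run whenever the definitions are loaded or reloaded, so that a typo in a job name is reported right away instead of leaving a job waiting until its timeout.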
The process is multi-threaded, and it can be started by simply sending a message to the running server described above. This message could be sent by a crontab entry at a defined time. From then on, the dependencies are handled by the server, and you have no complicated chaining to do in your crontab or shell scripts.
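The following sketch shows one way a multi-threaded server could run independent jobs in parallel and start a dependent job only after they are done, using the standard CompletableFuture API. The job names and the class ChainRunner are made up, and each "job" here only records its name instead of starting a real process.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: two loads run in parallel, the report waits for both.
class ChainRunner {

    /** Runs the chain once and returns the job names in completion order. */
    static List<String> runChain() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> finished = new CopyOnWriteArrayList<>();
        try {
            // Independent jobs are submitted right away and may run concurrently.
            CompletableFuture<Void> loadA =
                    CompletableFuture.runAsync(() -> finished.add("load_a"), pool);
            CompletableFuture<Void> loadB =
                    CompletableFuture.runAsync(() -> finished.add("load_b"), pool);

            // The report is scheduled only after both loads have completed.
            CompletableFuture<Void> report = CompletableFuture.allOf(loadA, loadB)
                    .thenRunAsync(() -> finished.add("report"), pool);

            report.join();
            return finished;
        } finally {
            pool.shutdown();
        }
    }
}
```

The two loads can finish in either order, but the report is always last, which is exactly the guarantee that fixed start times in a crontab cannot give.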
But you could also send this message from somewhere else: from a web interface, another program, or the command line, for example to check the status of jobs or to gather runtimes and exit codes.
I have an early beta version ready. I will publish the code on GitHub as soon as possible.
Let me know if you are interested. Let me know if you would like to participate or have ideas or problems to solve which I could incorporate.
Carpe Diem