Some years ago I created the datagenerator Java application, and I have now built an Apache NiFi processor that uses it. The datagenerator generates mass data based on:
- word lists - files containing words or expressions for a certain category (e.g. airlines, seasons, weekdays, etc.)
- regular expressions - data generated according to the expression pattern
- purely random values
By combining word lists and date references, you can generate mass data that is not purely random but plausible: values are taken from the word lists, and date-related columns agree with each other.
The basis for the generator is an XML file that defines the individual fields of a row, i.e. how the output rows should be structured. Here is an example of such a rowlayout XML file:
<xml>
<references>
<field type="datetime" id="date1" pattern="yyyy-MM-dd"/>
<field type="datetime" id="time" reference="date1" pattern="HH:mm"/>
<field type="datetime" id="year" reference="date1" pattern="yyyy"/>
<field type="datetime" id="month" reference="date1" pattern="MM"/>
<field type="datetime" id="day" reference="date1" pattern="dd"/>
<field type="datetime" id="weekday" reference="date1" pattern="EEEE"/>
</references>
<row type="delimited" seperator=";">
<field type="category" category="airlines" length="20" />
<field type="category" category="services" length="20" />
<field type="reference" reference="date1" length="10" />
<field type="reference" reference="year" length="4" />
<field type="reference" reference="month" length="2" />
<field type="reference" reference="day" length="2" />
<field type="reference" reference="weekday" length="10" />
<field type="reference" reference="time" length="5" />
<field type="regex" pattern="[0-9]{1,10}" length="10" />
<field type="random" length="10" />
</row>
</xml>
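To illustrate what the references block seems to express, here is a minimal Java sketch that formats one base timestamp with each of the patterns from the layout, so that the date, year, month, day, weekday and time values all describe the same moment. The class and method names (DateReferenceSketch, derive) are hypothetical and not part of the datagenerator API, and the assumption that the patterns follow the usual Java date pattern letters is mine.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; these are not the datagenerator's own classes.
final class DateReferenceSketch {

    // id -> pattern, copied from the <references> block of the layout above.
    static Map<String, String> referencePatterns() {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("date1", "yyyy-MM-dd");
        patterns.put("time", "HH:mm");
        patterns.put("year", "yyyy");
        patterns.put("month", "MM");
        patterns.put("day", "dd");
        patterns.put("weekday", "EEEE");
        return patterns;
    }

    // Formats the same base timestamp with every pattern, so all derived values agree.
    static Map<String, String> derive(LocalDateTime base, Locale locale) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : referencePatterns().entrySet()) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(entry.getValue(), locale);
            values.put(entry.getKey(), formatter.format(base));
        }
        return values;
    }

    public static void main(String[] args) {
        // Fixed timestamp for the demo; a generator would pick this at random.
        LocalDateTime base = LocalDateTime.of(2018, 2, 25, 12, 39);
        derive(base, Locale.GERMAN)
                .forEach((id, value) -> System.out.println(id + " = " + value));
    }
}
```

The locale decides the language of the weekday name. Which locale the datagenerator itself uses is not covered here, so the Locale parameter is only part of this sketch.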
If you define a field of type "category", for example one for the seasons of the year, the field references a category file named e.g. "seasons.category". The file contains the words or expressions belonging to that category, one per line, e.g.:
Spring
Summer
Autumn
Winter
When the datagenerator generates rows of data, it picks a random word or expression from the category file. You can create multiple category files, each containing a list of words for the datagenerator to use.
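The idea of picking a random entry from a category file can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the class CategoryPicker is hypothetical, and skipping blank lines is a choice made for the sketch, not documented behaviour of the datagenerator.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical helper; not part of the datagenerator API.
final class CategoryPicker {
    private final List<String> words;
    private final Random random;

    CategoryPicker(List<String> lines, Random random) {
        // Trim entries and drop blank lines so a trailing newline yields no empty value.
        this.words = lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
        if (words.isEmpty()) {
            throw new IllegalArgumentException("category has no entries");
        }
        this.random = random;
    }

    // Reads a file such as "seasons.category", one word or expression per line (UTF-8).
    static CategoryPicker fromFile(Path file, Random random) throws IOException {
        return new CategoryPicker(Files.readAllLines(file), random);
    }

    // Returns one entry, chosen uniformly at random.
    String next() {
        return words.get(random.nextInt(words.size()));
    }

    public static void main(String[] args) {
        CategoryPicker seasons = new CategoryPicker(
                List.of("Spring", "Summer", "Autumn", "Winter"), new Random());
        for (int i = 0; i < 5; i++) {
            System.out.println(seasons.next());
        }
    }
}
```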
If you define a regular expression pattern, as in the layout above, the datagenerator creates data that matches the given pattern.
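To give a feeling for pattern-driven generation, here is a deliberately small Java sketch. It only understands patterns of the form "[ranges]{min,max}", such as "[0-9]{1,10}" from the layout, and rejects everything else. The class SimpleRegexGenerator is hypothetical. How the datagenerator itself handles regular expressions is not shown here, and it presumably supports far more than this.

```java
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; supports only "[a-b...]{min,max}", e.g. "[0-9]{1,10}".
final class SimpleRegexGenerator {
    private static final Pattern SUPPORTED =
            Pattern.compile("\\[((?:\\w-\\w)+)\\]\\{(\\d+),(\\d+)\\}");

    static String generate(String pattern, Random random) {
        Matcher m = SUPPORTED.matcher(pattern);
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // Expand ranges such as "a-zA-Z" into the list of allowed characters.
        String ranges = m.group(1);
        StringBuilder alphabet = new StringBuilder();
        for (int i = 0; i < ranges.length(); i += 3) {
            char from = ranges.charAt(i);
            char to = ranges.charAt(i + 2);
            if (from > to) {
                throw new IllegalArgumentException("invalid range in: " + pattern);
            }
            for (char c = from; c <= to; c++) {
                alphabet.append(c);
            }
        }
        int min = Integer.parseInt(m.group(2));
        int max = Integer.parseInt(m.group(3));
        if (min > max) {
            throw new IllegalArgumentException("min exceeds max in: " + pattern);
        }
        // Pick a length between min and max (inclusive), then fill it character by character.
        int length = min + random.nextInt(max - min + 1);
        StringBuilder out = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            out.append(alphabet.charAt(random.nextInt(alphabet.length())));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(generate("[0-9]{1,10}", random));
        }
    }
}
```

Because the length is chosen between the two bounds, the values vary in width, which fits the mix of short and long numbers in the generated rows.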
Alternatively, you can define a field of type "random", for which the data is generated purely at random.
Below is a sample NiFi flow. It contains the GenerateData processor, which generates the mass data, and a PutFile processor, which stores the generated data in a file on the filesystem.
When the NiFi flow runs, it continuously creates flow files (depending on the scheduling settings) that contain the generated data, in this case 25 rows. This data can then be routed to files, to Hadoop, to Kafka or to other systems, for example for load or performance testing. Here is an example file generated with the rowlayout XML definition from above:
Air Portugal;Pushback;2018-02-25;2018;02;25;Sonntag;12:39;57;Pvxx4rQRvq
KLM;GPU;2015-03-15;2015;03;15;Sonntag;21:26;7603242;wsSXOEv2Rs
Alitalia;Maintenance Move;2005-07-22;2005;07;22;Freitag;02:38;7227193;z3wmxSX91Q
Swiss;Deicing;2005-10-31;2005;10;31;Montag;20:52;1039441778;jOOg764txX
Helvetica;Deicing;2016-08-20;2016;08;20;Samstag;19:41;451085152;99dK04NDSl
Iberia;GPU;2002-06-13;2002;06;13;Donnerstag;00:01;173;uVt24adfQJ
Thomas Cook;ASU;2009-11-04;2009;11;04;Mittwoch;02:24;30;ekiFFxVq3G
Aeroflot;Pushback;2013-11-09;2013;11;09;Samstag;17:33;68254;ARcPycIqz8
Swiss;GPU;2020-06-24;2020;06;24;Mittwoch;07:22;01401372;2ccrina8gU
KLM;Pushback;2019-06-02;2019;06;02;Sonntag;23:07;4423117;UAbrDGaWOm
Lufthansa;GPU;2013-05-04;2013;05;04;Samstag;03:29;57;pDHEwP0o1O
Gulf Air;Maintenance Move;2007-12-02;2007;12;02;Sonntag;18:10;9371941189;M1DHc5lkm0
Thomas Cook;Pushback;2000-05-27;2000;05;27;Samstag;19:30;03162698;Pw7DZz80Wq
Alitalia;Deicing;2008-12-11;2008;12;11;Donnerstag;12:44;546;EEQeEIBbXV
Alitalia;Deicing;2016-01-29;2016;01;29;Freitag;02:13;94718225;tMlvVwfNjF
Austrian;Pushback;2013-01-21;2013;01;21;Montag;08:47;179199436;Exuj0oSNDL
Austrian;Pushback;2019-08-25;2019;08;25;Sonntag;07:21;053702351;OE888j2gML
Lauda Air;GPU;2015-07-22;2015;07;22;Mittwoch;13:41;65509643;mWwAXpyGK3
Garuda;Pushback;2014-05-07;2014;05;07;Mittwoch;10:14;26918;SI6jBRX6DU
Garuda;Deicing;2014-01-23;2014;01;23;Donnerstag;19:55;3725090362;E8A9abjlgO
Aeroflot;GPU;2008-06-02;2008;06;02;Montag;10:09;98552669;U4KAyMTkyI
Garuda;GPU;2017-05-15;2017;05;15;Montag;00:08;1285886;SGA7Hvokur
Lufthansa;GPU;2018-03-07;2018;03;07;Mittwoch;00:10;8514230082;miUL7Pr4fv
Garuda;Pushback;2015-12-06;2015;12;06;Sonntag;17:03;788;LsPts8eCKx
American;Pushback;2000-04-02;2000;04;02;Sonntag;09:09;3105132180;zD2fjHX54M
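If you want to sanity-check generated rows before feeding them into another system, a small Java sketch like the following may help. It assumes the ten-field, semicolon-separated layout from above. The class GeneratedRowCheck and the checks it performs are my own illustration and not part of the datagenerator or the NiFi processor.

```java
// Hypothetical checker for rows produced with the example layout above.
final class GeneratedRowCheck {
    static final int EXPECTED_FIELDS = 10;

    // Splits on the ";" separator; the limit -1 keeps trailing empty fields.
    static String[] split(String row) {
        String[] fields = row.split(";", -1);
        if (fields.length != EXPECTED_FIELDS) {
            throw new IllegalArgumentException(
                    "expected " + EXPECTED_FIELDS + " fields but got " + fields.length);
        }
        return fields;
    }

    // Field order: airline, service, date1, year, month, day, weekday, time, regex, random.
    static boolean isConsistent(String[] f) {
        boolean dateMatches = f[2].equals(f[3] + "-" + f[4] + "-" + f[5]);
        boolean timeMatches = f[7].matches("[0-9]{2}:[0-9]{2}");
        boolean numberMatches = f[8].matches("[0-9]{1,10}");
        boolean randomLength = f[9].length() == 10;
        return dateMatches && timeMatches && numberMatches && randomLength;
    }

    public static void main(String[] args) {
        String row = "KLM;GPU;2015-03-15;2015;03;15;Sonntag;21:26;7603242;wsSXOEv2Rs";
        System.out.println(isConsistent(split(row)));
    }
}
```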
Give the processor a try. You can download it from my GitHub pages. Simply put the ".nar" file in the "lib" folder of Apache NiFi and restart NiFi. Create a rowlayout file (copy the one above) and one or more category files (e.g. "manufacturers.category"). Then drag the GenerateData processor onto the canvas of the NiFi flow, add the PutFile processor, connect the two, configure them and finally run the flow.
You will also find the documentation for the datagenerator on GitHub. Not all features of the datagenerator Java API are implemented in the NiFi processor yet, but I will add them over time.
I would appreciate your feedback, so that I can improve the processor over time.
Carpe Diem