But there is one problem: a CSV file carries no information about the data types of its individual columns. Any system that processes the CSV data downstream has to work out the data types itself. Some tools try to guess them, but manual adjustments are often required.
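To see why guessing is fragile, here is a minimal sketch of the kind of inference such a tool might do. The class name `ColumnTypeGuesser` and its rules are assumptions made for illustration, not the logic of any particular tool: a column is called `long` if every non-empty value parses as a long, then `double`, then `boolean`, and otherwise `string`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class ColumnTypeGuesser {

    /** Returns "long", "double", "boolean" or "string" for one column's values. */
    static String guessType(List<String> values) {
        boolean allLong = true, allDouble = true, allBoolean = true;
        boolean sawValue = false;
        for (String raw : values) {
            String v = raw.trim();
            if (v.isEmpty()) continue;          // empty cells carry no type information
            sawValue = true;
            if (allLong && !isLong(v)) allLong = false;
            if (allDouble && !isDouble(v)) allDouble = false;
            if (allBoolean && !(v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false"))) {
                allBoolean = false;
            }
        }
        if (!sawValue) return "string";         // nothing to go on, fall back to string
        if (allLong) return "long";
        if (allDouble) return "double";
        if (allBoolean) return "boolean";
        return "string";
    }

    private static boolean isLong(String v) {
        try { Long.parseLong(v); return true; } catch (NumberFormatException e) { return false; }
    }

    private static boolean isDouble(String v) {
        try { Double.parseDouble(v); return true; } catch (NumberFormatException e) { return false; }
    }

    public static void main(String[] args) {
        System.out.println(guessType(Arrays.asList("1", "2", "3")));     // long
        System.out.println(guessType(Arrays.asList("1", "2.5")));        // double
        // Zip codes with leading zeros are also guessed as long, which loses the zeros:
        System.out.println(guessType(Arrays.asList("00123", "04711")));  // long
    }
}
```

The last call shows the typical failure: the guess is formally correct but semantically wrong, and only a human (or an explicit schema) can say that the column should stay a string.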
OK, you could store the data types of the fields as separate metadata in another file and process that, but there is no widely supported standard for this, so tools can't use the metadata out of the box. This is not an ideal situation.
The Avro format stores the data together with the schema that defines its data types in a single file. The data is usually stored in a binary encoding and can also be compressed. If you use this format, you have the metadata (schema) and the data together. Again, many tools nowadays support the Avro format. Libraries are available for C++, Java and C#, among other languages.
An Avro schema (see the Avro homepage for details) is written in JSON syntax and looks, for example, like this:
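A hypothetical example follows. The record name `Person`, the namespace and all field names are assumptions chosen for illustration, not taken from this post. The schema is held in a Java string so that it can be checked without any extra libraries; single quotes are swapped for double quotes only to keep the JSON readable inside Java source.

```java
// Hypothetical schema holder, for illustration only.
class SchemaExample {

    static final String SCHEMA_JSON = String.join("\n",
        "{",
        "  'type': 'record',",
        "  'name': 'Person',",
        "  'namespace': 'com.example',",
        "  'fields': [",
        "    {'name': 'id', 'type': 'long'},",
        "    {'name': 'name', 'type': 'string'},",
        // A union with 'null' makes the field optional; the default must then be null.
        "    {'name': 'birthday', 'type': ['null', 'string'], 'default': null},",
        "    {'name': 'salary', 'type': 'double'}",
        "  ]",
        "}").replace('\'', '"');

    public static void main(String[] args) {
        // Prints the schema as plain JSON, ready to be saved as a .avsc file.
        System.out.println(SCHEMA_JSON);
    }
}
```

With the Avro Java library on the classpath, a string like this can be turned into a schema object via `new Schema.Parser().parse(SCHEMA_JSON)`.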
To make the conversion from a CSV file to an Avro file easy, I have developed a Java class that simplifies the task and also handles errors. The code is available on my GitHub account.
To use the program you need a CSV file and an Avro schema that fits the CSV data. Once you have both, convert the CSV data to Avro format as shown in the code below.
But what if the columns of your CSV file come in a completely different order than the fields defined in the Avro schema? Pass the field names, in the order in which they appear in the CSV file, to the CsvToAvroGenericWriter, like this:
So define the fields of the header in their correct order, using the same field names as in the Avro schema. Next, call the writer.setCsvHeader() method and pass the header fields to it. That's all. The program will now match the fields by name.
In this case, with the header fields defined, a field of the CSV row that is not found in the Avro schema definition is ignored.
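The idea behind matching by name can be sketched in plain Java. This is not the code of CsvToAvroGenericWriter; the class `HeaderMatcher` and its method are hypothetical, and leaving schema fields that are missing from the CSV as `null` is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HeaderMatcher {

    /**
     * Reorders one CSV row into schema field order, matching columns by name.
     * CSV columns that are not in the schema are ignored; schema fields that
     * are absent from the CSV header stay null (an assumption of this sketch).
     */
    static String[] toSchemaOrder(List<String> schemaFields, String[] csvHeader, String[] csvRow) {
        // Look-up table: field name -> position in the schema.
        Map<String, Integer> schemaIndex = new HashMap<>();
        for (int i = 0; i < schemaFields.size(); i++) {
            schemaIndex.put(schemaFields.get(i), i);
        }
        String[] ordered = new String[schemaFields.size()];
        int columns = Math.min(csvHeader.length, csvRow.length);
        for (int col = 0; col < columns; col++) {
            Integer target = schemaIndex.get(csvHeader[col]);
            if (target != null) {               // unknown CSV column: skip it
                ordered[target] = csvRow[col];
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> schemaFields = Arrays.asList("id", "name", "salary");
        String[] header = {"name", "comment", "id", "salary"};
        String[] row = {"Ann", "n/a", "7", "1000.5"};
        // The "comment" column is dropped and the rest is put into schema order.
        System.out.println(Arrays.toString(toSchemaOrder(schemaFields, header, row))); // [7, Ann, 1000.5]
    }
}
```

Building the name-to-position map once and reusing it for every row keeps the per-row cost proportional to the number of CSV columns.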
Easy, isn't it? You can loop over the rows of a CSV file and quickly convert them to Avro format. You then have data and schema together, and the file will probably be considerably smaller as well.
It is also possible to define the Avro schema and automatically generate the equivalent Java code from it.
Like this:
I hope you enjoyed it. It is a really small utility for converting your data, but it might be helpful and saves you from writing and inventing the code yourself.
Carpe Diem.