Just a quick update: I have significantly improved the performance of processing CSV rows. I have also changed the way the nodes are modelled, so the syntax is different and, in the end, also easier to use.
Check out the readme file on my GitHub account for more details. Download the tool and please send me feedback if you do so.
The CSV processor tool is a Java tool that takes a CSV file plus the nodes and relations defined in neo4j and generates the files in the structure that the neo4j-import tool requires. So the tool simply saves you from doing this manually. The cool thing is that the design of the database and the mapping to the CSV file are done within neo4j itself.
Make sure you have downloaded the two jar libraries from my GitHub account: one is the CSV processor tool and the other is the neo4j JDBC driver we need.
Here is a sample to make it clearer how the CSV processor tool works.
As indicated, the design is done in neo4j itself. We will create three nodes and two relationships.
As input data we use a sample CSV file. The first line is the header with the names of the individual columns:
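For the sake of the example, assume a header roughly like this - only "person_id", "born" and "license" are referenced explicitly below, the other column names are illustrative assumptions:

    person_id,name,born,country,car,license
    1,Jane Doe,1985,Switzerland,Tesla,2003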
Start neo4j, create a database and open the browser. We create the Person node first using the following statement:
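A statement along these lines does the job; the property values are the assumed CSV column names from above, the ":int" type tag is explained further down, and the exact original statement may look slightly different:

    // definition node: label "Person" plus the "csvdef" tag for the tool
    CREATE (p:Person:csvdef {
      id_field: "person_id",      // metadata: which CSV column is the unique key
      id: "person_id",            // regular property mapped to the same column
      name: "name",               // node property pointing to the assumed "name" column
      year_of_birth: "born:int"   // points to the "born" column, typed as integer
    })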
We create a node with the label "Person". The additional label "csvdef" is like a tag for the CSV processor tool; only nodes with this label will be used.
After that follow the properties (key/value pairs). The property values point to the field names in the CSV file we use: choose the property key at your discretion, the value is then the name of the CSV column.
The first property is "id_field". It is required by the CSV processor tool and indicates which column of the CSV file is used as the unique key for the node. In this case it points to "person_id" in the CSV file. The "id_field" is just metadata that helps the tool generate the output; it will not be output itself. But note that there must also be a regular property mapped to the same field. This is the "id" property: it is a regular node property and also points to "person_id". Again: one is metadata for the tool and the other is the regular property.
There is also a "name" property. At the end, there is the property "year_of_birth". It points to the field "born" in the CSV file, but there is also an indicator of the data type of the field: string, integer, float and others are possible. The types are documented in the operations manual on the neo4j website.
Let's create the other two nodes:
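Again only a sketch - the labels "Country" and "Car" come from the relationships further down, while the id columns are the assumed CSV column names:

    CREATE (c:Country:csvdef {
      id_field: "country",   // metadata: unique key column (assumed)
      id: "country"
    })

    CREATE (car:Car:csvdef {
      id_field: "car",       // metadata: unique key column (assumed)
      id: "car"
    })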
Nothing special here. Both nodes have the label "csvdef" as mentioned above, and both have the "id_field" indicator (metadata).
Next we create the relationships:
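A sketch of the first one, where the relationship type "LIVES_IN" is just an assumed name:

    MATCH (p:Person:csvdef), (c:Country:csvdef)
    CREATE (p)-[:LIVES_IN]->(c)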
We created a relation between the Person and the Country node to indicate that the person lives in a country.
Then we create the next relationship:
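Again a sketch, with "DRIVES" as an assumed relationship type; the ":int" tag is explained below:

    MATCH (p:Person:csvdef), (car:Car:csvdef)
    CREATE (p)-[:DRIVES {license_since: "license:int"}]->(car)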
Here we created the relation between Person and Car: the person drives the car. Note that "license_since" is a property of the relation and points to the field "license" in the CSV file. It indicates since when the person has had a driver's license. And watch out: we defined the type of the data to be integer (the ":int" tag). The neo4j documentation gives you info on the available types.
At this point you should have a layout similar to this in the neo4j browser:
That is already all we need to prepare.
Next we run the CSV processor tool. Put the CSV file to process in a folder; my CSV file in this case is named "persons.csv". Make sure the jar files you downloaded are in a folder known to Java (on the classpath).
Make sure you replace the hostname, username, password, output folder and delimiter in the statement below with the appropriate values. Then run the tool like this:
The tool will generate the files for the neo4j-import tool. Here is the output of the tool:
The following files have been created: files for the nodes and for the relations. The "persons.csv" file is the file containing the input data.
The file "original_schema.cyp" is special. It contains the create statements for your database. So you can re-create the schema any time, simply by running the statements again (or by importing the file).
The neo4j-import tool will create the complete files and structure of the database for us. It runs offline and bypasses the transactional layer of neo4j, which is why it is so fast.
So we can delete the database we used for the design. The database is located in the data/databases folder of the neo4j installation. Go ahead and delete the relevant folder and files, but be careful to pick the correct one if you have multiple databases: the active database is listed in the "neo4j.conf" file in the "conf" folder. Or, to be safe, simply rename the folder instead of deleting it.
The last step is to import the files we prepared into neo4j. The neo4j-import tool is located in the "bin" folder of the installation. The tool requires a few parameters, as you can see below: the node and relation files, which database to import into, etc. For the database, select the same database that you used for the design and which we just deleted.
Then run the import tool using this command:
It will create quite a lot of output. At the end you will see something like this:
Go to the neo4j browser again. Your data will be there. Here is a screenshot:
As far as I know, when you have a lot of data, this is the fastest way to get it into neo4j.
If you want to, you can wipe the data, import the original schema that we designed and change its layout: more nodes and relationships and also more properties. Or even better, create a new layout for your CSV data with lots of rows and go through the process again.
The CSV processor tool is in beta. It is not feature complete and probably contains bugs. At the moment it will probably not work with very large files; I will design a further way to work with very large data files. I will keep improving and developing the tool and would be grateful for comments or suggestions.
Read more about the neo4j-import tool on the neo4j website.
Hope you enjoyed it.
So I have spent quite a lot of time recently learning neo4j. First the basics: how it works and how to write queries. Good fun, and the basics are easy to learn. And by the way, neo4j has excellent documentation, available online and also for download.
Next I was doing "load csv" to get more data into my database than what I had before, where I had added it manually. Load csv is also very good. If you just return the first lines, without actually storing or manipulating data, you get the rows displayed right there in the browser, so you don't have to open another tool to review the structure and content of your rows. The load csv function gives you a lot of possibilities to convert your flat data to nodes and relationships, and it is useful for developing a schema.
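For example, a quick preview that does not write anything to the database looks roughly like this (the file name is just an example):

    // show the first rows of the CSV file in the browser
    LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
    RETURN row
    LIMIT 5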
The APOC library offers a lot of functions that one can use to query data, export data, work with geo positions and more. I think there are close to 300 functions making the developer's or user's life easier.
While learning, if you look left and right, you find out about additional tools like cypher-shell and the neo4j-import tool. The latter imports data fast because it does not use the transactional layer like load csv or cypher-shell do; it simply creates the files on disk - the files which make up the nodes, relations and whatever else is required for a complete neo4j database.
The CSV processor
As I wanted to try a larger dataset and load csv was too slow, I gave the neo4j-import tool a try. To import CSV data with it, the files need to have a defined structure, so basically you need to do some preparation work before you can import the data. But once you have done that, you get really good speed - much better than with the other solutions. Read the documentation on the neo4j website to get started.
And this was the point where I started to think that it would be good to have a tool that does the preparation work for me. The normal way I design a database is to do it in the neo4j desktop or browser: I manually create some nodes and relations and think about whether that works well for my use case. Once I am convinced that the schema is right, I can load data and revisit whether it is really correct for the larger data set.
As my design starts in neo4j, I thought it would be good if the tool just read what I have designed and created - together with the given data in a CSV file - the structure that is required by the neo4j-import tool. After thinking about it, I was convinced that it would work and started coding a tool in Java.
It is really simple: You design your nodes and relations. You label your nodes and relations and add properties to them as required. The values of the properties are a reference (a pointer) to a column in the CSV file. So if you have a node "Person" which has a property "name" and in the CSV file there is a column "fullname" then the value for the "Person" property "name" is "fullname". When you run the CSV processor tool, it reads the neo4j database schema, maps the keys and values from the schema to the CSV file and prepares the files for the neo4j-import tool - ready to use.
Ok. There is a little bit more to it. For the CSV processor tool to work, you need to provide some more metadata so that the files for the import tool can be created. One is to add a certain label to each node, so that the tool knows which nodes to use. You also have to define which is the key field/property of the node; the neo4j import tool needs unique keys to work with. There is more to say about this, but I will handle this in another blog entry.
You can also add metadata to the node and relation properties to define which type they are (string, integer, float, etc.).
The complete workflow is: design the schema in the neo4j browser, run the CSV processor tool together with the CSV file containing the data, then run the neo4j-import tool to import all data from the files that the CSV processor created. That's it.
So much for the moment. In the next days I will write another blog post with more details, an example and screenshots to visualize the process and make it easy to understand.
The code is on GitHub: https://github.com/uwegeercken. The tool is in beta at the moment. But come back frequently to get the latest updates.
So I came across neo4j and I was curious and started reading about it. The neo4j.com website has very good documentation for the product. This was the first plus. Then I downloaded the neo4j desktop application and it ran out of the box on my Linux laptop with Fedora 27. Next plus.
And then I got quite addicted. Graph databases, their concept and neo4j are really cool. I am a beginner - I have used it only for about two weeks now - but I have spent considerable hours on my first projects.
At the beginning I spent some time entering data and creating relations between nodes manually. The next step was to load data from CSV files. And then I installed the APOC library and created a JDBC connection to a MySQL/MariaDB database.
The main pain point at the moment is the modelling of the nodes and relations. I have a lot of questions in my head about how to model certain use cases and - as I am a beginner - I get confused about what the best way to model would be. So I read more tutorials and documentation and watch some videos.
I started with two projects: the first was to create a graph database for Pentaho ETL jobs. It stores information and relations about the ETL jobs, which databases are accessed, on which server they run and in which data center, which drives the jobs read from or write to, etc.
Here is the screenshot of the graph db schema.
I think this is a good use case. It is about relations between servers, databases, network drives and jobs. It gives an easy overview of the situation of the jobs, and I can query what runs where and what is affected. This is something that was missing. And without complex re-engineering I can extend this schema if I need to: for example the times when the jobs are scheduled to run, information about the software used, OS information, user accounts or anything else that is typically queried.
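As a rough sketch of how such a model can be populated - all labels, relationship types and property values here are only illustrative, the actual schema is the one in the screenshot above:

    MERGE (dc:DataCenter {name: "dc1"})
    MERGE (s:Server {name: "etl-server-01"})
    MERGE (s)-[:LOCATED_IN]->(dc)
    MERGE (db:Database {name: "sales_dwh"})
    MERGE (drv:NetworkDrive {path: "//fileserver/exports"})
    MERGE (j:Job {name: "load_sales_facts"})
    MERGE (j)-[:RUNS_ON]->(s)
    MERGE (j)-[:READS_FROM]->(db)
    MERGE (j)-[:WRITES_TO]->(drv)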
The second project is about flights. As I work for a groundhandling company, flights are the base data of our daily work. Flights come with information on the airline, the aircraft type used, which groundhandler served the flight and, of course, origin and destination, but also other attributes such as delay minutes, delay reasons and the number of passengers.
Here is the database schema screenshot:
I have written scripts which I use to load data about flights from a JDBC datasource and create the nodes and relations using the cypher-shell.
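As a rough idea of what one statement in such a script can look like - the connection string, the query and all labels and relationship types are only illustrative, not the actual schema:

    // read flight rows from the relational source and turn them into nodes and relations
    CALL apoc.load.jdbc(
      'jdbc:mysql://dbserver/flightdb?user=reader&password=secret',
      'SELECT flight_number, flight_date, airline, origin, destination, pax, delay_minutes FROM flights'
    ) YIELD row
    MERGE (a:Airline {code: row.airline})
    MERGE (o:Airport {code: row.origin})
    MERGE (d:Airport {code: row.destination})
    CREATE (f:Flight {number: row.flight_number, date: row.flight_date,
                      passengers: row.pax, delay_minutes: row.delay_minutes})
    CREATE (f)-[:OPERATED_BY]->(a)
    CREATE (f)-[:DEPARTS_FROM]->(o)
    CREATE (f)-[:ARRIVES_AT]->(d)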
I have a lot of questions and a lot to learn. Especially the handling of dates, and specifically time series, is quite difficult; I spent a lot of time thinking about that. The other hard thing is to find out where to share nodes and where not. Cypher queries have a steep learning curve: the basics are clear within minutes, but more complicated queries do create more questions in my head...
I will blog here with more details about the projects as I go forward.
The last post presented the first version of the Coordination server, which acts as a coordinator for running ETL jobs.
Why is it useful? When ETL jobs depend on each other, a way must be found to not start the second job until the first one has finished; otherwise you get unreliable results. Usually the ETLs are separated by "time" (a manual decision): you run ETL 1, you know it usually takes 15 minutes, so you schedule the second ETL to start 25 minutes after ETL 1. But this is not reliable. If ETL 1 runs longer one day, ETL 2 might start too early. The result is that you have to watch all processes to make sure they do not overlap, and you will have to do this repeatedly to ensure your quality of service. The Coordination server makes sure that no job runs before the jobs it depends on have finished.
The idea is that the messages sent to the server are triggered by an existing scheduler such as cron. The server only coordinates the execution of the jobs; it does not do the scheduling itself. But this takes the complexity of chaining (timing) ETL processes away from scripts, cron or other methods and delegates it to the Coordination server.
By defining the dependencies between jobs, you let the Coordination server make sure that a job that depends on another one (or on multiple ones) is only executed after that job (or those jobs) has finished.
I have extended the functionality quite a bit in the last days. One of the bigger changes was to introduce dynamic date calculations.
Imagine you have an ETL job that uses a parameter "month" to define for which month data should be processed. Let's further assume that you always run the process for the previous month. So you need a way of calculating the previous month before the job starts. Usually this is done in a shell script which runs the job.
I have included this functionality in the job definition, so that the Coordination server calculates the specified dynamic date automatically. In the parameters section of the file, you can now define a dynamic date value like this:
From this, the previous month is calculated. The number behind the variable defines the offset for the specified field, calculated from the current date. If you omit the offset number (and the colon), the current date is used.
The variables you can use are:
With dynamic date values you can avoid hardcoding, and the job will always run for the right date. Give it a try - everything is on GitHub on my account.