Pentaho PDI is the ideal tool to feed Neo4j: you can join data from different source, clean it, transform and format it, add data from lookups. The possibilites are nearly unlimited. You have the full power of an ETL tool that supports you with an easy to use GUI to transform the data according to your requirements.
I am also a long-time Pentaho PDI user and as a Neo4j plugin is now available, I though I give it a try. I will do a series of blog entries here to show how PDI and Neo4j can work together.
This first part shows how to load a CSV file in PDI and create node and relationships in Neo4j. Here the Neo4j Output step is used. I am not going into the details of how to work with PDI - there is a lot of documentation available online and also videos.
To follow you will need Pentaho PDI installed. Once you have done that go to https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output/releases/ and download the file: Neo4JOutput-4.0.0-beta1.zip. Unzip the file to the PDI "plugins" folder. Now start PDI (spoon.sh or spoon.bat) and create a new transformation. You should now see the available Neo4j steps.
- id - a unique running number of the airport
- name - the name of the airport
- description - the description of the airport
- country - the name of the country
- iata - three letter IATA code of the airport
- icao - four letter ICAO code of the airport
- location - latitude/longitude of the airport (separated by a comma)
- elevation - elevation of the airport
Here is some sample data:
1;Goroka;Goroka;Papua New Guinea;GKA;AYGA;-6.08169,145.39188;5282
2;Madang;Madang;Papua New Guinea;MAG;AYMD;-5.20708,145.7887;20
3;Mount Hagen;Mount Hagen;Papua New Guinea;HGU;AYMH;-5.82679,144.29586;5388
4;Nadzab;Nadzab;Papua New Guinea;LAE;AYNZ;-6.56983,146.72624;239
5;Port Moresby Jacksons Intl;Port Moresby;Papua New Guinea;POM;AYPY;-9.44338,147.22005;146
6;Wewak Intl;Wewak;Papua New Guinea;WWK;AYWK;-3.58383,143.66919;19
7;Narsarsuaq;Narssarssuaq;Greenland;UAK;BGBW;61.16052,-45.42598;112
8;Nuuk;Godthaab;Greenland;GOH;BGGH;64.19092,-51.67806;283
9;Sondre Stromfjord;Sondrestrom;Greenland;SFJ;BGSF;67.01697,-50.68932;165
10;Thule Air Base;Thule;Greenland;THU;BGTL;76.5312,-68.70316;251
So here is the PDI transformation:
I have selected that indexes shall be created. They are defined based on the selection of the fields that make up the index by marking them as "Primary". In this case the unique "id" column from the CSV file is used.
I have also slected to "use CREATE instead of MERGE". Merge will check if the node exists and if so update it. Create will skip this check and simply create a new node. So create will be faster, but if you want to upsert nodes (create if they don't exist otherwise update), then uncheck this value.
Further down I selected the value of the field "label_airport" to be used. In the "Add Constants" step I have defined these.
The last step is to define the properties of the node. The quickest way is to click on "Get Fields". PDI will fill in all fields that are inputs to this step. Under "Property Name" the name of the property in Neo4j can be defined. I then removed all fields that should not be used, such as the labels. I also removed the "country" field because a separate node will be created for countries. Note that the "id" and "elevation" fields are defined as integer and "latitude" and "longitude" are defined as float. This will allow later to use Neo4j's spatial functions to e.g. calculate a distance between two points.
This is very easy, we add the value for the relationship between the nodes in the "Relationship value" textfield. I added "locatedIn" to indicate that an airport is located in a country.
match(a1:Airport) where a1.icao='EDDH'
match(a2:Airport) where a2.icao='LSZH'
WITH point({ latitude: a1.latitude, longitude: a1.longitude}) AS p1, point({ latitude: a2.latitude, longitude: a2.longitude }) AS p2
RETURN round(distance(p1,p2)/1000) AS dist
- you can rerun the PDI transformation at any time to update Neo4j
- you can extend the PDI transformation to e.g. add more data from other sources or to create more nodes and relationships
- I created the nodes and relationship step by step, but of course this can be done in one go
- there are other Neo4j steps in PDI - I will tackle those in other blog posts
Whenever you have data about relations between objects, then Neo4j might be a good candidate to store this data, as it is very fast in answering complex questions about relations. It answers these questions in an easier and much more sophisticated way than relational databases. The more relations you have or the more complex relations you have the more likely it is that Neo4j will make life easier for you.
And you have seen than Pentaho PDI with the Neo4j steps is easy to use and a powerful combination with Neo4j. Apart from being very flexible and versatile and scalable, PDI also allows you to use metadata injection. A powerful feature when you e.g. have lots of Excel sheets - with different structures - and you don't want to hardcode the structures into the transformation process.
Hope you enjoyed this part.
Carpe Diem