I will use Logstash to model the data and translate it into JSON so that it can be sent to Elasticsearch. On top of that, I will show how to create some simple visualizations with Kibana.
Let's start. Here is a single row from the geonames dataset, in tab-delimited CSV format:
2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 02860 Europe/Andorra 2014-11-05
- geonameid: integer id of the record in the geonames database
- name: name of the geographical point (utf8), varchar(200)
- asciiname: name of the geographical point in plain ascii characters, varchar(200)
- alternatenames: alternate names, comma separated, ascii names
- latitude: latitude in decimal degrees (wgs84)
- longitude: longitude in decimal degrees (wgs84)
- feature class: see http://www.geonames.org/export/codes.html, char(1)
- feature code: see http://www.geonames.org/export/codes.html, varchar(10)
- country code: ISO-3166 2-letter country code, 2 characters
- cc2: alternate country codes, comma separated, ISO-3166 2-letter country codes, 200 characters
- admin1 code: fips code; see file admin1Codes.txt for the display names of this code, varchar(20)
- admin2 code: code for the second administrative division, varchar(80)
- admin3 code: code for the third-level administrative division, varchar(20)
- admin4 code: code for the fourth-level administrative division, varchar(20)
- population: bigint (8 byte int)
- elevation: in meters, integer
- dem: digital elevation model, srtm3 or gtopo30, integer
- timezone: the iana timezone id (see file timeZone.txt), varchar(40)
- modification date: date of last modification in yyyy-MM-dd format
Converting each field to the appropriate data type is necessary so that we can use these fields properly in Elasticsearch, for example for numeric range queries on population or geo queries on the coordinates.
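As a rough sketch, the conversions could be done in the Logstash filter section with the csv filter followed by a mutate/convert. The column names below follow the field list and the output document shown further down; the exact configuration used in this series may differ:

filter {
  csv {
    # The geonames file is tab delimited; the separator below is a literal tab character.
    separator => "	"
    columns => [
      "geonameid", "name", "asciiname", "alternatenames",
      "latitude", "longitude", "feature_class", "feature_code",
      "country_code", "country_code_alternate", "admin1_code", "admin2_code",
      "admin3_code", "admin4_code", "population", "elevation",
      "dem", "timezone", "modification date"
    ]
  }
  mutate {
    # Convert the numeric columns so Elasticsearch does not index them as strings.
    convert => {
      "geonameid"  => "integer"
      "latitude"   => "float"
      "longitude"  => "float"
      "population" => "integer"
      "elevation"  => "integer"
      "dem"        => "integer"
    }
  }
}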
Each row of the CSV file has a unique id (geonameid). Because it is unique, we use it as the document_id in Elasticsearch. This way, the next time we load a new geonames file, any id that already exists in the index is updated in place instead of a new document being inserted.
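A minimal sketch of the matching elasticsearch output block (the host and index name here are assumptions for illustration):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "geonames"
    # Using the unique geonames id as the document id means re-loading
    # a file updates existing documents instead of creating duplicates.
    document_id => "%{geonameid}"
  }
}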
So we have transformed the data from this input:
2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 02860 Europe/Andorra 2014-11-05
into this JSON document:
{
  "@timestamp": "2018-10-02T19:56:12.263Z",
  "feature_name": "peak",
  "country_code": "AD",
  "feature_code": "PK",
  "name": "Pic de Font Blanca",
  "geonameid": 2986043,
  "position": {
    "lon": 1.53335,
    "lat": 42.64991
  },
  "feature": "T.PK",
  "timezone": "Europe/Andorra",
  "feature_class": "T",
  "modification date": "2014-11-05",
  "elevation": null,
  "country_code_alternate": null,
  "population": 0,
  "asciiname": "Pic de Font Blanca"
}
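The position and feature fields in the document above are derived rather than read directly from the CSV. Here is a hedged sketch of how they could be built in the filter section; the field names follow the output shown, the real pipeline may differ, and feature_name would need an additional lookup against the feature codes page:

filter {
  # Nest latitude/longitude under "position" so the field can be
  # mapped as a geo_point in Elasticsearch.
  mutate {
    rename => {
      "latitude"  => "[position][lat]"
      "longitude" => "[position][lon]"
    }
  }
  # Combine the class and code into the "feature" key, e.g. "T.PK".
  mutate {
    add_field => { "feature" => "%{feature_class}.%{feature_code}" }
  }
}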
Have a look at the documentation for Logstash at: https://www.elastic.co/guide/en/logstash/current/index.html
Here are the other parts:
- https://datamelt.weebly.com/blog/elasticsearch-a-practical-example-part-2
- https://datamelt.weebly.com/blog/elasticsearch-a-practical-example-part-3
Carpe Diem