Category: Logstash

Elasticsearch - more details

16/10/2018

I spent more time on understanding the dataset and also on the geonames.org website. It is convenient to search for points and locations and if you sign up for a user account you can add or update the data. A good opportunity to help and enlarge the dataset for the benefit of all.

So meanwhile I have added the following data to my processing chain with Logstash, Elasticsearch and Kibana:

Administrative Devision 1st to 4th Order
Continent
Continent Code

Here is an example taken from Kibana:

And I wanted to have an automated solution, so I wrote a shell script that only expects the path and name of the geonames input file and the rest is done automatically:

remove double quotes from the file. It looks as if there are some unbalanced quotes in the file, so I remove them all together (usind sed)
create two lookup CSV files for the Administrative Devision 3rd and 4th order. The first two are already availabe as files but these ones have to be derived from the data file itself (using awk)
Lookup the continent code and name from the country code of each row of data
finally the data is sent to Elasticsearch

With this being automated, I can quickly reload the data into Elasticsearch and make it available for searching and visualization in Kibana. And as I am using the geonameid which uniquely identifies each row in the dataset, I use Elasticsearchs capability to upsert (update or insert) the data.

The script uses environment variables that are passed to logstash, so that no paths or file names are hardcoded in the logstash pipeline file.

The file are all available on Github.

Carpe Diem

0 Comments

Elasticsearch - a practical Example. Part 2

7/10/2018

0 Comments

In Part 1 we saw how to use Logstash to read a CSV file and prepare the data to send it to Elasticsearch. The data is in JSON fomrat and that's what Elasticsearch expects.

Now we will send JSON formatted data and see how we deal with the schema. If you send data to an Elasticsearch index, the first record that arrives is used to detrmine the schema. So Elasticsearch does a dynamic mapping. But this might not always work 100%. The alternative is to define a schema/mapping manually. We will have a look at both.

The geonames database I use has about 17 million rows. That takes a while to ingest, so I created a separate file, which only contains 10000 records. I will use this with logstash for this example. Also, I am running this on Linux, so if you use Windows it might work a little bit different in terms of commands and paths, etc.

Before we continue, make sure you installed Elasticsearch and also Kibana. Kibana is the frontend for creating visualizations and dashboards. Download the relevant packages for your operating system from the Elastic website, install them and then run both.

I named my Logstash file: geonames_1.yml and it is located in the Logstash config folder. The complete code is listed at the bottom of part 1 of this post. Adjust the path as appropriate for your system. I also changed the file to use the index "geonames_01" in Elasticsearch. To run it with Logstash do this:

Logstash takes a while to startup, and if everything runs well, the data is sent to Elasticsearch. Start your webbrowser and go to http://localhost:5601, if you have a local installation of Kibana. Once started, on the left lower side, click on the "Management" link, then under "Elasticsearch" click on "Index Management". You will get a list of available indexes. The index we just create is also there:

You can see that 10000 documents have been created in the index. Click on the name of the index and on the right side you get a popup with a summary. Click on the tab labeled "Mapping". It will show you a JSON representation of the dynamic mapping that Elasticsearch created when we imported the data.

If you read the first part, then you'll remember, that we made some type conversions. The CSV data comes all as strings and we converted e.g. "elevation" and "population" to number fields. All fields have been properly sent to Elasticsearch in their appropriate types. Even the "modification_date" was detected as a date field.

Now in Logstash we have converted the latitude and longitude position to a float value. And they are like this in the mappings file shown above. To make use of the geo-indexing features of Elasticsearch, the position field needs to defined as type "geo_point" in Elasticsearch.

Once the index mapping has been defined for a field, it can not be changed. So I will go ahead and delete the "geonames_01" index and we will manually add a different schema which properly maps the position field.

The popup where you saw the definition of the mapping for our index has a button labeled "Manage". Click on it and then select "Delete Index" to delete it.

Click on the "Dev Tools" link on the left side of the browser window. Per default you will be on the "Console" tab. In the left part of the console paste the code from below:

The code tells Elasticsearch to create a new index with the given mapping. I have only changed the "position" field as discussed above to the following value:

Elasticsearch will understand that the geo_point type field will have a latitude and a longitude value attached and index it appropriately.

Still in the "Console" tab, click on the green icon to execute the code.

Note: If you go back to the management page you won't find the index there - not yet. As there are no documents in the index, it is not shown.

So let's go back to Logstash and run the config file once again.

After Logstash started and after a short time, you will find the data again in Kibana as discussed above. If you go to the management view and look at the mapping defined for the index, you will see that now the position with latitude and longitude is properly mapped to the "geo_point" type.

That's all for the second part. The third part will concentrate on creating visualizations and a dashboard in Kibana on top of the data we imported.

Here is part 3:

https://datamelt.weebly.com/blog/elasticsearch-a-practical-example-part-3

Carpe Diem

0 Comments

Elasticsearch - a practical Example. Part 1

3/10/2018

0 Comments

I want to show you a practical example here, using the geonames geographical database which covers all countries and contains over eleven million placenames that are available for download free of charge. Have a look here: geonames.org. The data can be freely dowloaded and used. Download the allCountries.zip file and unzip it.

I will use Logstash to translate the data into JSON format and to model the data, so that it can be sent to Elasticsearch. On top of that, I will show how to create some simple visualizations using Kibana.

Let's start. Here is a single row from the geonames dataset - in CSV, tab delimited - format:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 02860 Europe/Andorra 2014-11-05

Below is description for each field:

geonameid : integer id of record in geonames database
name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)
alternatenames: alternatenames, comma separated, ascii names
latitude: latitude in decimal degrees (wgs84)
longitude: longitude in decimal degrees (wgs84)
feature class: see http://www.geonames.org/export/codes.html, char(1)
feature code: see http://www.geonames.org/export/codes.html, varchar(10)
country code: ISO-3166 2-letter country code, 2 characters
cc2: alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code: fipscode. see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code: code for the second administrative division varchar(80)
admin3 code: code for third level administrative division, varchar(20)
admin4 code: code for fourth level administrative division, varchar(20)
population: bigint (8 byte int)
elevation: in meters, integer
dem: digital elevation model, srtm3 or gtopo30. integer
timezone: the iana timezone id (see file timeZone.txt) varchar(40)
modification date: date of last modification in yyyy-MM-dd format

We will now use Logstash to read the file line by line. In Logstash a pipeline is made out of "Input", "Filter" and "Output". The filter part allows basically to transform the data in many different ways. Here is the first part. It defines which file to use and how to use it.

Next we add a paragraph in the "filter" section. Here we define all columns of the geonames file. The separator is a tab character. We also add a field "feature" by concatenating two fields - "feature_class" and "feature_code". The other field we add is "position" it will have a "latitude" and "longitude" field in the JSON structure. This is important to later use the geomapping features of Kibana.

So now we have the columns of the CSV input file defined and have added/constructed new fields. One of the fields was the "feature" field. Another file from the geonames website contains the mapping from this "feature" to real names. E.g. the feature: "L.SNOW" is mapped to the description "snowfield". We use these mappings to do a lookup of the "feature" field to this description for each row that we read.

Same procedure in the next part: we map the "feature_class" from the CSV row to a description. E.g. feature calss "R" is mapped to: "road-railroad".

In the last part of the filter section - "mutate" - we will do some conversions and remove fields, that we don't need. The latitude and longitude of the position field are both float values. The others are integers. We rename one field to give it a more appropriate name and finally we remove the other fields.
The conversions to the appropriate data types are necessary, so that we can properly use these in Elasticsearch.

This is already all that we do in terms of transformation of the data. The final step is to output the data to Elasticsearch. This is the part in the "output" section of the Logstash config file. It defines the Elasticsearch host and port and the name of the index we want to store the data in.
Each row of the CSV file has a unique id (geonameid). Because it is unique, we use it as the document_id of the index in Elasticsearch. This way, the next time we load a new geonames file, when an id already exists in the index, then the value will be updated in the Elasticsearch index instead of inserting a new record.

This is all. Here the full Logstash config file.

The user can now run this config file with Logstash. The CSV file will be read (input), transformed and then each row is transformed into JSON and sent to Elasticsearch.

So we have transformed the data from this input:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 02860 Europe/Andorra 2014-11-05

To this output. As described above, some fields have been removed, others renamed or transformed, such as the position and latitude, longitude fields.

{
    "@timestamp": "2018-10-02T19:56:12.263Z",
    "feature_name": "peak",
    "country_code": "AD",
    "feature_code": "PK",
    "name": "Pic de Font Blanca",
    "geonameid": 2986043,
    "position": {
      "lon": 1.53335,
      "lat": 42.64991
    },
    "feature": "T.PK",
    "timezone": "Europe/Andorra",
    "feature_class": "T",
    "modification date": "2014-11-05",
    "elevation": null,
    "country_code_alternate": null,
    "population": 0,
    "asciiname": "Pic de Font Blanca",
    "modification date": [
      "2014-11-05T00:00:00.000Z"
    ]
}

This was the first part. The next one will show how the data arrives in Elasticsearch and how we can visualize it.

Have a look at the documentation for Logstash at: https://www.elastic.co/guide/en/logstash/current/index.html

Here are the other parts:

Carpe Diem

0 Comments

Elasticsearch - more details

Elasticsearch - a practical Example. Part 2

Elasticsearch - a practical Example. Part 1

Author

Categories

Archives