Blog Archives - datamelt blog

Test Data: Airports

30/1/2016

Just a quick note here: I have put the sample data that I use in my blog posts on GitHub at:

https://github.com/uwegeercken?tab=repositories

Have a look and download the data to reproduce the samples or to use for your own purposes.

Apache Drill - Querying Hadoop (HDFS) - Part 3

23/1/2016

This is the last of three parts. Last time we queried a CSV file in Hadoop HDFS from Drill.

This time we will use the Drill "create table as" statement to create parquet files in HDFS and use those for querying.

First we do a basic query against the CSV file in HDFS:

select
columns[1] as airport_code,
columns[2] as airport_type,
columns[3] as name,
columns[4] as latitude,
columns[5] as longitude,
columns[6] as elevation,
columns[7] as continent,
columns[8] as country,
columns[9] as state,
columns[10] as city
from hdfs.data.`/airports.csv` limit 10

We select the fields and run the query to see if it executes correctly.

[Image]

Now we will slightly change the query to create parquet files. Simply add the first line as shown below:

create table hdfs.data.`/airports_parquet` partition by(continent) as
select
columns[1] as airport_code,
columns[2] as airport_type,
columns[3] as name,
columns[4] as latitude,
columns[5] as longitude,
columns[6] as elevation,
columns[7] as continent,
columns[8] as country,
columns[9] as state,
columns[10] as city
from hdfs.data.`/airports.csv`

When you run the query it creates multiple files in the Hadoop filesystem because we have specified the "partition by" clause. For each continent in the CSV data a separate file is created.
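The "create table as" statement above differs from the plain select only in its first line, so both can be generated from one list of column names. The following is a minimal Python sketch of that idea, not part of the original setup; the helper names build_select and build_ctas are made up for illustration. It also checks that the partition column is part of the select list, which Drill requires.

```python
# Column aliases in the order used in the query above.
# The numbering starts at columns[1], exactly as in the blog's query.
AIRPORT_COLUMNS = [
    "airport_code", "airport_type", "name", "latitude", "longitude",
    "elevation", "continent", "country", "state", "city",
]


def build_select(names, source, first_index=1):
    """Return a select statement that aliases columns[i] for each name."""
    fields = ",\n".join(
        f"columns[{i}] as {name}"
        for i, name in enumerate(names, start=first_index)
    )
    return f"select\n{fields}\nfrom {source}"


def build_ctas(table, partition_column, names, source):
    """Prefix the select with a 'create table ... partition by' line."""
    if partition_column not in names:
        # The partition column has to appear in the select list.
        raise ValueError(f"{partition_column!r} is not a selected column")
    header = f"create table {table} partition by ({partition_column}) as"
    return header + "\n" + build_select(names, source)


if __name__ == "__main__":
    print(build_ctas(
        "hdfs.data.`/airports_parquet`",
        "continent",
        AIRPORT_COLUMNS,
        "hdfs.data.`/airports.csv`",
    ))
```

The printed statement can be pasted into the Drill query page as it is.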

[Image]

Go to the console window and check what has happened in the Hadoop filesystem:

/opt/hadoop/bin/hdfs dfs -ls /

[Image]

The folder "airports_parquet" was created as we used it as the table name in the "create table as" statement. Inside the folder there are the parquet files - one per continent:

[Image]

Now that we have the parquet files created, we can use them to query the airport data. Here is an example:

select *
from hdfs.data.airports_parquet
where continent='NA' and country='US' and airport_type='large_airport'

This query returns the large airports in the US, on the continent of North America:

[Image]

This is it. We have created Parquet files from the CSV file. Parquet generally offers better query performance than CSV and can easily be created from Drill.

This shows how easily data in Hadoop can be queried with Drill. You are now free to do more complex things, e.g. combining this data with data from MongoDB, MySQL, JSON files or other sources.

Apache Drill - Querying Hadoop (HDFS) - Part 2

23/1/2016

The first part of this post was about the basic setup and running Zookeeper, Drill and Hadoop HDFS. In this second part we will add a CSV file to HDFS and query it from Drill.

Before we can query HDFS, we have to define a storage plugin for HDFS. Go to the local Drill web site at http://localhost:8047. Once there click on "Storage" at the top. You will see the page below.

[Image]

At the bottom under "New Storage Plugin" enter hdfs in the textbox and click on "create". You will see a page titled "Configuration". I filled in the information as shown below. You can also copy the configuration of an existing storage plugin (e.g. "dfs") and modify it for HDFS. Basically you only need to change the values for "connection" and (workspace) "location".
Look at the value for "connection" below. That is the hostname and port as defined in the Hadoop core-site.xml file for the property "fs.default.name". Please also note that my workspace below is called "data" and that I have set the root folder of HDFS as its "location".
When you're done click on "Create" and make sure you get a message saying "success" at the bottom of the page. Next click on "Back" to return to the "Storage" page.

[Image]
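Since the configuration is only visible in the screenshot, here is a rough Python sketch that prints a comparable storage plugin configuration as JSON. It is an approximation, not a copy of my actual settings: the connection value hdfs://localhost:9000 is a placeholder that you must replace with the value of "fs.default.name" from your own Hadoop configuration, and the function name hdfs_plugin_config is made up for illustration. The workspace is marked as writable because Drill needs a writable workspace for a "create table as" statement.

```python
import json


def hdfs_plugin_config(connection, workspace="data", location="/"):
    """Build a file-type storage plugin configuration for HDFS."""
    return {
        "type": "file",
        "enabled": True,
        # Placeholder: use the host and port from fs.default.name.
        "connection": connection,
        "workspaces": {
            workspace: {
                "location": location,
                # Needed if you want to write tables into this workspace.
                "writable": True,
                "defaultInputFormat": None,
            }
        },
        "formats": {
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
            "parquet": {"type": "parquet"},
        },
    }


if __name__ == "__main__":
    # Assumed connection value, not taken from the original screenshots.
    print(json.dumps(hdfs_plugin_config("hdfs://localhost:9000"), indent=2))
```

The printed JSON can be pasted into the "Configuration" page of the new storage plugin.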

We are now ready to query the Hadoop filesystem - everything is configured. But we have no files in HDFS yet. So let's copy a CSV file into HDFS. We will use the hdfs command - a script in /opt/hadoop/bin - to do this.
My CSV file - a file containing information about airports - is named "airports.csv" and is located in /tmp. Copy it into the HDFS root folder using the following command:

/opt/hadoop/bin/hdfs dfs -copyFromLocal /tmp/airports.csv /

To see if the file was really copied, use this command:

/opt/hadoop/bin/hdfs dfs -ls /

[Image]

Go back to the Drill web UI and click on "Query" at the top. Enter the following query (note the backticks around the filename) and submit it.

select * from hdfs.data.`airports.csv` limit 10

In this query "hdfs" is the name of the storage plugin and "data" is the name of the workspace. Both come from the storage plugin configuration we created before.

Have a look at the result:

[Image]
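The same query can also be sent to Drill without the web UI, through its REST interface on the same port. Below is a minimal Python sketch under the assumption that Drill listens on http://localhost:8047 as in this setup; the function names table_ref, query_request and run_query are made up for illustration. The first helper shows how the table name is composed from plugin, workspace and file.

```python
import json
import urllib.request


def table_ref(plugin, workspace, path):
    """Compose plugin.workspace.`file`, with backticks around the file."""
    return f"{plugin}.{workspace}.`{path}`"


def query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's /query.json."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047"):
    """Send the query and return the decoded JSON answer from Drill."""
    with urllib.request.urlopen(query_request(sql, base_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    source = table_ref("hdfs", "data", "airports.csv")
    # Requires a running drillbit and the hdfs storage plugin.
    print(run_query(f"select * from {source} limit 10"))
```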

Because we have used "select * ..." Drill returns each row as a single array of fields called "columns". Let's run a slightly different query that displays more nicely and returns only some of the fields:

select
columns[1] as code,
columns[2] as airport_type,
columns[3] as name,
columns[4] as latitude,
columns[5] as longitude,
columns[6] as elevation
from hdfs.data.`/airports.csv`
where columns[2]='large_airport'
limit 10

[Image]
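To make the columns[n] indexing in the query above more tangible, here is a small Python sketch that treats one CSV line the same way: the line is split into a zero-based array and the fields are picked by position. The sample line is invented for illustration and is not taken from the real airports file, and the names as_columns and pick are hypothetical. Note that every value stays a string, which is also how Drill hands back the elements of "columns" until you cast them.

```python
import csv
import io

# Invented sample row with the same layout as assumed in the query:
# index 0 is skipped, index 1 is the code, index 2 the type, and so on.
SAMPLE_LINE = (
    "1,XXXX,large_airport,Example Airport,12.5,-45.25,100,"
    "NA,US,US-XX,Example City"
)

# Alias -> position, mirroring "columns[1] as code" etc.
FIELDS = {
    "code": 1,
    "airport_type": 2,
    "name": 3,
    "latitude": 4,
    "longitude": 5,
    "elevation": 6,
}


def as_columns(line):
    """Split one CSV line into a zero-based list of string fields."""
    return next(csv.reader(io.StringIO(line)))


def pick(columns):
    """Select the aliased fields by position, like the Drill query does."""
    return {alias: columns[index] for alias, index in FIELDS.items()}


if __name__ == "__main__":
    columns = as_columns(SAMPLE_LINE)
    if columns[2] == "large_airport":  # same filter as the where clause
        print(pick(columns))
```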

So we successfully queried the Hadoop filesystem using Drill.

The next and last part looks at how to create Parquet files in HDFS from the CSV file, directly from Drill, by using the "create table as" statement.

Apache Drill - Querying Hadoop (HDFS) - Part 1

23/1/2016

I am new to Apache Hadoop and playing around with it to understand how it works. And I like Apache Drill. So I wanted to run some tests to see how the two work together.

Mainly I am interested in querying HDFS - the Hadoop Distributed Filesystem - from Drill. Furthermore I wanted to understand if I can query CSV and Parquet files in Hadoop.

Here is an overview of my setup and configuration:

Zookeeper:
I am running Zookeeper together with Drill. Zookeeper 3.4.7 is installed in /opt/zookeeper

Drill:
Drill 1.4.0 is installed in /opt/drill and I have one drillbit running together with zookeeper

Hadoop:
Hadoop 2.6.2 is installed in /opt/hadoop. I only run HDFS and none of the other Hadoop components. So I simply run the start-dfs.sh script to run HDFS.
Below are two screenshots of the Hadoop core-site.xml file and hdfs-site.xml file. In the first one you see the fs.default.name property. We use this value later to configure access from Drill to HDFS. The second screenshot shows my HDFS configuration - mainly the location of the namenode and datanode folders.

[Image]

[Image]

Now we run the three components: Zookeeper, Drill and Hadoop:

/opt/zookeeper/bin/zkServer.sh start
/opt/drill/bin/drillbit.sh start
/opt/hadoop/sbin/start-dfs.sh

Below are the commands and the output from the console.

[Image]

You should have the three components running now. Check the relevant log files as appropriate. Now start your web browser and go to http://localhost:8047 to open the Drill web UI.

[Image]
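Besides looking at the log files, a quick way to see whether the services are up is to test if their ports accept connections. The following Python sketch does that for the Drill web UI on port 8047 and for Zookeeper on 2181, which is Zookeeper's default client port and an assumption about this setup. The HDFS port is left out because it depends on your fs.default.name value. The function name port_open is made up for illustration.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat as not running.
        return False


# Port 8047 is the Drill web UI from this post; 2181 is assumed
# because it is the Zookeeper default client port.
SERVICES = {
    "Zookeeper": 2181,
    "Drill web UI": 8047,
}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only tells you that something is listening there, so the log files remain the place to look when a query fails.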

We have Zookeeper, Drill and Hadoop HDFS running and are ready to start querying Hadoop. This will be done in the second part of this post.

uwe geercken, 2014-2020