Here is a simple example of how to query a CSV file. We will go through the complete process of defining the storage, placing the CSV file in the filesystem and querying the file. Here is an excerpt from the file which is named "airports.csv". It is a list of airports and contains around 46000 records.. The individual fields are delimited by comma. The first row is a header row defining the names of the fields. Further down - under "formats" - there is the definition for the CSV format. Note that "extension" is set to "csv" - the file extension. Then there is an entry for "skipFirstLine". It is set to "true" because we have a header line in the CSV file, which we want to skip - otherwise drill treats it as a regular line of data. And then there is an entry for "delimiter" which defines how the individual fields are seperated - in this case by a comma. The query runs and quickly returns the result shown on the screenprint below. We have 46505 rows of data in the file. One line was skipped - the header row, because we defined it like this in the configuration. Let's retrieve some records now the easy way by running another query. We select "*" to indicate that we want all fields of each row. And we specify a limit of 20, so that only 20 rows are returned - we do not want all 46505 rows being returned. Click on "Submit" to run the query. The query will return 20 rows. Drill pages the rows into chunks of 10 records. In the lower right you can go to the next page. In the upper right you can search for text in the result of the query. In the upper left you may define how many records per page are displayed. Now let's do a more sophisticated query. We will name the individual columns correctly and we will cast the columns to the correct type where appropriate. The first column "id" is a number so we cast it to an integer. Columns 4 and 5 are latitude and longitude position of the airport - we cast those to double. Column 6 (elevation) is tricky, because it contains an empty string for some of the records in the CSV file but for the other records it contains a number. I have used a case statement to differentiate between these and to always have an integer as the result using the cast function. Finally I limit the query to again return only 20 rows. Here is the result from the query above. It nicely shows the fields with the correct types and allows us to do further queries on the data. Hope you find this useful. The next blog entry here will use the same example, to convert the CSV file data into to parquet format using drill. This will allow to make more efficient queries on the data.
0 Comments
|
AuthorUwe Geercken Categories
All
Archives
September 2020
|