人人IT網

Introduction to Apache Drill

Date: 2016-12-03 00:59  Source: Internet  Author: Internet
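Overview

Apache Drill is a low-latency distributed engine for interactive queries over very large datasets, including structured, semi-structured, and nested data. It uses ANSI SQL-compatible syntax, supports storage backends such as the local file system, HDFS, HBase, and MongoDB, and reads data formats such as Parquet, JSON, CSV, TSV, and PSV. Inspired by Google's Dremel, Drill targets interactive business-intelligence analysis over petabytes of data on clusters of thousands of nodes.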

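Installation

Drill can be installed on a single machine or on a cluster, and supports Linux, Windows, and Mac OS X. To keep things simple, we set it up on a single Linux machine (CentOS 6.3) for a trial run.

Prepare the installation packages:

Install everything under $WORK (/path/to/work): unpack the JDK and Drill into the java and drill directories respectively, and create symlinks so that a later upgrade only needs the link repointed: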

.
├── drill
│   ├── apache-drill -> apache-drill-0.8.0
│   └── apache-drill-0.8.0
├── init.sh
└── java
    ├── jdk -> jdk1.7.0_75
    └── jdk1.7.0_75
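
The layout above can be reproduced with a few commands. The following is only a sketch under stated assumptions: it works in a scratch directory created by mktemp instead of the real /path/to/work, empty directories stand in for the unpacked archives, and the version X.Y.Z used for the upgrade step is a placeholder, not a real release.

```shell
# Scratch directory standing in for /path/to/work (assumption: nothing real is touched).
WORK="$(mktemp -d)"

# Empty directories stand in for the unpacked JDK and Drill archives.
mkdir -p "$WORK/java/jdk1.7.0_75" "$WORK/drill/apache-drill-0.8.0"

# Relative symlinks give each tool a version-independent path.
ln -s jdk1.7.0_75 "$WORK/java/jdk"
ln -s apache-drill-0.8.0 "$WORK/drill/apache-drill"

# A later upgrade unpacks the new version next to the old one and
# repoints the link. "X.Y.Z" is a placeholder version.
mkdir -p "$WORK/drill/apache-drill-X.Y.Z"
ln -sfn apache-drill-X.Y.Z "$WORK/drill/apache-drill"

# Show where the version-independent path now points.
readlink "$WORK/drill/apache-drill"
```

Because scripts refer only to java/jdk and drill/apache-drill, nothing else has to change when a link is repointed.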

Then add an init.sh script that sets up the Java-related environment variables:

export WORK="/path/to/work"
export JAVA="$WORK/java/jdk/bin/java"
export JAVA_HOME="$WORK/java/jdk"

Starting Drill

On a single machine, running bin/sqlline is all that is needed:

$ cd $WORK
$ . ./init.sh
$ ./drill/apache-drill/bin/sqlline -u jdbc:drill:zk=local
Drill log directory /var/log/drill does not exist or is not writable, defaulting to ...
Apr 06, 2015 12:47:30 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
sqlline version 1.1.6
0: jdbc:drill:zk=local> 

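-u jdbc:drill:zk=local tells sqlline to use the Drill instance on the local machine, so no ZooKeeper needs to be started. In a cluster environment, ZooKeeper must be configured and started, and its address supplied in the connection string. Once sqlline is up, commands can be typed at the 0: jdbc:drill:zk=local> prompt.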
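
For the cluster case, the zk= part of the URL carries the ZooKeeper address list instead of local. The sketch below only assembles and prints such a command. The host names are hypothetical, the port 2181 is assumed to be the ZooKeeper client port in use, and nothing is actually started.

```shell
# Hypothetical ZooKeeper quorum; replace with the real hosts and ports.
ZK_QUORUM="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"

# Same URL shape as the local case, with the quorum in place of "local".
DRILL_URL="jdbc:drill:zk=$ZK_QUORUM"

# Print the command instead of running it, since no cluster is assumed here.
echo "./drill/apache-drill/bin/sqlline -u $DRILL_URL"
```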

Trying It Out

Drill's sample-data directory contains demo data in Parquet format that can be queried:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet` limit 5;
+-------------+------------+-------------+------------+
| N_NATIONKEY |   N_NAME   | N_REGIONKEY | N_COMMENT  |
+-------------+------------+-------------+------------+
| 0           | ALGERIA    | 0           |  haggle. carefully f |
| 1           | ARGENTINA  | 1           | al foxes promise sly |
| 2           | BRAZIL     | 1           | y alongside of the p |
| 3           | CANADA     | 1           | eas hang ironic, sil |
| 4           | EGYPT      | 4           | y above the carefull |
+-------------+------------+-------------+------------+
5 rows selected (0.741 seconds)

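The FROM clause here takes the form dfs.`absolute path to a local file (Parquet, JSON, CSV, and so on)`. As the example shows, anyone familiar with SQL faces almost no learning curve. Parquet files, however, require dedicated tools to view and edit, which is inconvenient; they will be covered separately later. The rest of this article uses the more common CSV and JSON formats.

Create the following test.csv file in $WORK/drill/data: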

1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
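
One way to create the file and sanity-check it before querying is sketched below. The scratch directory from mktemp is an assumption standing in for the real data directory; the check confirms that every row has the same number of comma-separated fields.

```shell
# Scratch directory standing in for the real data directory (assumption).
DATA_DIR="$(mktemp -d)"

# Write the five sample rows exactly as listed above.
cat > "$DATA_DIR/test.csv" <<'EOF'
1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
EOF

# Count the rows and collect the distinct field counts per row.
ROWS="$(wc -l < "$DATA_DIR/test.csv" | tr -d ' ')"
FIELDS="$(awk -F, '{ print NF }' "$DATA_DIR/test.csv" | sort -u)"
echo "rows=$ROWS fields_per_row=$FIELDS"   # prints: rows=5 fields_per_row=6
```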

Then query it:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.csv`;
+------------+
|  columns   |
+------------+
| ["1101","SteveEurich","Steve","Eurich","16","StoreT"] |
| ["1102","MaryPierson","Mary","Pierson","16","StoreT"] |
| ["1103","LeoJones","Leo","Jones","16","StoreTem"] |
| ["1104","NancyBeatty","Nancy","Beatty","16","StoreT"] |
| ["1105","ClaraMcNight","Clara","McNight","16","Store"] |
+------------+
5 rows selected (0.082 seconds)

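The result looks slightly different from the previous one: a CSV file has nowhere to store column names, so Drill exposes each row as a single array named columns. To select specific columns, use columns[n], where n is a zero-based index, for example: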

0: jdbc:drill:zk=local> select columns[0], columns[3] from dfs.`/path/to/work/drill/data/test.csv`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 1101       | Eurich     |
| 1102       | Pierson    |
| 1103       | Jones      |
| 1104       | Beatty     |
| 1105       | McNight    |
+------------+------------+
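
The index mapping can be cross-checked outside Drill with cut. This is only an analogy, not how Drill reads the file: cut numbers fields from 1, so its fields 1 and 4 correspond to columns[0] and columns[3]. The sketch uses a temporary file holding the first two sample rows.

```shell
# Temporary file with the first two rows of test.csv.
CSV_FILE="$(mktemp)"
printf '%s\n' \
  '1101,SteveEurich,Steve,Eurich,16,StoreT' \
  '1102,MaryPierson,Mary,Pierson,16,StoreT' > "$CSV_FILE"

# cut fields are 1-based: -f1,4 picks what Drill calls columns[0] and columns[3].
RESULT="$(cut -d, -f1,4 "$CSV_FILE")"
echo "$RESULT"
# prints:
# 1101,Eurich
# 1102,Pierson
```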

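The CSV format is fairly simple and does not show off Drill's strengths. The more advanced features below are demonstrated with a JSON file, which is closer to Parquet.

Create the following test.json file in $WORK/drill/data: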

{
  "ka1": 1,
  "kb1": 1.1,
  "kc1": "vc11",
  "kd1": [
    {
      "ka2": 10,
      "kb2": 10.1,
      "kc2": "vc1010"
    }
  ]
}
{
  "ka1": 2,
  "kb1": 2.2,
  "kc1": "vc22",
  "kd1": [
    {
      "ka2": 20,
      "kb2": 20.2,
      "kc2": "vc2020"
    }
  ]
}
{
  "ka1": 3,
  "kb1": 3.3,
  "kc1": "vc33",
  "kd1": [
    {
      "ka2": 30,
      "kb2": 30.3,
      "kc2": "vc3030"
    }
  ]
}

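This JSON file contains nested data (each kd1 value is an array of objects), so its structure is considerably more complex than that of the CSV file. Querying nested data is exactly where Drill shines.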

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+------------+------------+
|    ka1     |    kb1     |    kc1     |    kd1     |
+------------+------------+------------+------------+
| 1          | 1.1        | vc11       | [{"ka2":10,"kb2":10.1,"kc2":"vc1010"}] |
| 2          | 2.2        | vc22       | [{"ka2":20,"kb2":20.2,"kc2":"vc2020"}] |
| 3          | 3.3        | vc33       | [{"ka2":30,"kb2":30.3,"kc2":"vc3030"}] |
+------------+------------+------------+------------+
3 rows selected (0.098 seconds)

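select * only expands the first level; deeper levels are shown as raw JSON. We are obviously interested in more than the first level, and the nested fields can be queried freely: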

0: jdbc:drill:zk=local> select sum(ka1), avg(kd1[0].kb2) from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 6          | 20.2       |
+------------+------------+
1 row selected (0.136 seconds)

The second-level nested table can be accessed through kd1[0].
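
The two aggregates in the query above can be verified by hand from the values in test.json: ka1 is 1, 2, 3 and kd1[0].kb2 is 10.1, 20.2, 30.3. A small arithmetic check, with awk standing in for a calculator:

```shell
# ka1 values copied from test.json: 1, 2, 3.
SUM="$(awk 'BEGIN { print 1 + 2 + 3 }')"

# kd1[0].kb2 values copied from test.json: 10.1, 20.2, 30.3.
AVG="$(awk 'BEGIN { printf "%.1f", (10.1 + 20.2 + 30.3) / 3 }')"

echo "sum(ka1)=$SUM avg(kd1[0].kb2)=$AVG"   # prints: sum(ka1)=6 avg(kd1[0].kb2)=20.2
```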

0: jdbc:drill:zk=local> select kc1, kd1[0].kc2 from dfs.`/path/to/work/drill/data/test.json` where kd1[0].kb2 = 10.1 and ka1 = 1;
+------------+------------+
|    kc1     |   EXPR$1   |
+------------+------------+
| vc11       | vc1010     |
+------------+------------+
1 row selected (0.181 seconds)

Creating a view:

0: jdbc:drill:zk=local> create view dfs.tmp.tmpview as select kd1[0].kb2 from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | View 'tmpview' created successfully in 'dfs.tmp' schema |
+------------+------------+
1 row selected (0.055 seconds)

0: jdbc:drill:zk=local> select * from dfs.tmp.tmpview;
+------------+
|   EXPR$0   |
+------------+
| 10.1       |
| 20.2       |
| 30.3       |
+------------+
3 rows selected (0.193 seconds)

The nested second-level table can also be flattened (combining kd1[0] through kd1[n]):

0: jdbc:drill:zk=local> select kddb.kdtable.kc2 from (select flatten(kd1) kdtable from dfs.`/path/to/work/drill/data/test.json`) kddb;
+------------+
|   EXPR$0   |
+------------+
| vc1010     |
| vc2020     |
| vc3030     |
+------------+
3 rows selected (0.083 seconds)

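The details of usage still differ from MySQL, and complex logic over multi-level tables takes careful reading of the official documentation and plenty of practice to master. This was only a quick tour; later articles will look more closely at the syntax-level features.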

 

https://segmentfault.com/a/1190000002652348


From:ITEYE