Querying Data in Druid

Anonymous · 0905


Shared: 2024-06-19


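Druid queries are sent as HTTP REST requests, with the query itself described in a JSON file. The Broker, Historical, and Realtime services can all serve queries and expose the same query interface, but requests are normally sent to a Broker node, which forwards them to Historical or Realtime nodes according to the data source being queried.

Many open-source packages for querying Druid from other languages also exist; see http://druid.io/docs/latest/development/libraries.html.

This article covers Druid's built-in JSON-over-HTTP query method. The data source used is lxw1234, which was loaded into Druid in an earlier article on batch loading; see "Using HadoopDruidIndexer to Load Batch Data into a Druid Cluster - Batch Data Ingestion".

There are three basic kinds of queries: aggregation queries (Aggregation Queries), metadata queries (Metadata Queries), and search queries (Search Queries). The official Druid documentation on queries is at http://druid.io/docs/latest/querying/querying.html.

Executing a query

The address given here is that of the Broker node:

curl -X POST 'http://node2:8092/druid/v2/?pretty' -H 'content-type: application/json' -d @query.json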
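The same request can be issued from a script. The following is a minimal Python sketch using only the standard library; the Broker address is copied from the curl command above, while the function names build_request and run_query are made up for illustration. Building the request is separated from sending it so that the first part can be inspected without a running cluster.

```python
import json
import urllib.request

# Assumption: the Broker address is the one used in the curl command above.
BROKER_URL = "http://node2:8092/druid/v2/?pretty"


def build_request(query: dict, url: str = BROKER_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request whose body is the JSON query."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: dict, url: str = BROKER_URL):
    """Send the query and decode the JSON response (needs a reachable Broker)."""
    with urllib.request.urlopen(build_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# A small query used only to show the request that would be sent.
request = build_request({"queryType": "timeBoundary", "dataSource": "lxw1234"})
print(request.get_method(), request.full_url)
```

Calling run_query with any of the query dictionaries in this article should behave like the curl command, provided the Broker is reachable from the machine running the script.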

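1. Aggregation Queries

An aggregation query aggregates metric data over one or more dimensions according to given rules.

1.1 Timeseries queries

A Timeseries query aggregates over a specified time interval at a specified granularity. The query can also specify filter conditions, the metric columns to aggregate, and so on. A simple Timeseries query configuration file looks like this: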

{
  "queryType": "timeseries",
  "dataSource": "lxw1234",
  "intervals": ["2015-11-15/2015-11-18"],
  "granularity": "day",
  "aggregations": [
    {"type": "longSum", "fieldName": "count", "name": "total_count"}
  ]
}
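To make the meaning of a day-granularity longSum concrete, here is a small Python sketch that mimics it on a few in-memory rows. It is not Druid code: the sample rows and the function name timeseries_long_sum are invented, and for brevity days without rows are simply left out.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical sample rows: (event time in UTC, value of the "count" metric).
rows = [
    (datetime(2015, 11, 15, 3, 0, tzinfo=timezone.utc), 2),
    (datetime(2015, 11, 15, 21, 30, tzinfo=timezone.utc), 1),
    (datetime(2015, 11, 16, 8, 15, tzinfo=timezone.utc), 5),
]


def timeseries_long_sum(rows, start, end):
    """Sum the metric per UTC day for rows with start <= time < end."""
    buckets = defaultdict(int)
    for ts, count in rows:
        if start <= ts < end:
            # Truncate the timestamp to the start of its day (the bucket key).
            day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
            buckets[day] += count
    return [
        {"timestamp": day.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
         "result": {"total_count": total}}
        for day, total in sorted(buckets.items())
    ]


result = timeseries_long_sum(
    rows,
    datetime(2015, 11, 15, tzinfo=timezone.utc),
    datetime(2015, 11, 18, tzinfo=timezone.utc),
)
for row in result:
    print(row)
```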

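3.1 The query option

The query option defines the search rule. There are currently two types. One is insensitive_contains (case-insensitive containment), as used in the search query example. The other is fragment, which matches only when the value of the dimension column contains every value in the given array. For example:

"query": {
  "type": "fragment",
  "values": ["219.146", "132.239"]
}

This matches only when ip contains both "219.146" and "132.239". Note that the key here is values, whereas insensitive_contains uses value.

3.2 The sort option

The sort option defines the order in which the values of the dimension column are sorted in the result. The default is lexicographic (dictionary order); strlen (sort by string length) is also available.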
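The two match rules and the two sort orders can be imitated in a few lines of Python. This is only an illustration of the behaviour described above, not Druid's implementation. Two points are assumptions: case handling for fragment is not modelled, and strlen is assumed to break ties between equal-length values in dictionary order. The sample ip values are invented.

```python
def insensitive_contains(value: str, dimension_value: str) -> bool:
    """True when dimension_value contains value, ignoring case."""
    return value.lower() in dimension_value.lower()


def fragment(values: list, dimension_value: str) -> bool:
    """True when dimension_value contains every fragment in values."""
    return all(v in dimension_value for v in values)


def sort_lexicographic(values: list) -> list:
    """Dictionary order."""
    return sorted(values)


def sort_strlen(values: list) -> list:
    """Shorter strings first; ties in dictionary order (an assumption)."""
    return sorted(values, key=lambda v: (len(v), v))


# Hypothetical dimension values.
ips = ["219.146.132.239", "132.239.1.1", "10.0.0.1", "219.146.7.7"]

fragment_hits = [ip for ip in ips if fragment(["219.146", "132.239"], ip)]
contains_hits = [ip for ip in ips if insensitive_contains("219.146", ip)]

print(fragment_hits)
print(contains_hits)
print(sort_lexicographic(ips))
print(sort_strlen(ips))
```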

4. Select queries

A Select query is simple: it fetches columns according to the configured rules, and it supports paging. A simple Select query configuration file looks like this:

{
  "queryType": "select",
  "dataSource": "lxw1234",
  "dimensions": [],
  "metrics": [],
  "granularity": "all",
  "intervals": ["2015-11-17/2015-11-18"],
  "pagingSpec": {"pagingIdentifiers": {}, "threshold": 10}
}

This query retrieves all fields from the data source lxw1234 within the interval, 10 records per page. The result (excerpt only) is:

[{
  "timestamp": "2015-11-16T16:00:00.000Z",
  "result": {
    "pagingIdentifiers": {
      "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00": 9
    },
    "events": [{
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 0,
      "event": {
        "timestamp": "2015-11-17T00:00:00.000+08:00",
        "cookieid": "03CA0670027E855639F90D",
        "ip": "111.15.94.87",
        "count": 1
      }
    }, {
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 1,
      "event": {
        "timestamp": "2015-11-17T00:00:00.000+08:00",
        "cookieid": "086E5B2A0A322F557ED1BC",
        "ip": "42.91.107.32",
        "count": 1
      }
    }, {
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 2,
      "event": {
...

Here, threshold specifies the number of records per page, and offset shows the index of the record within the segment.

"pagingIdentifiers": {
  "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00": 9
},

records the paging marker: the key is the segment ID and the value is the largest offset on this page. Suppose pagingIdentifiers in the query configuration file is changed as follows:

"pagingSpec": {
  "pagingIdentifiers": {
    "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00": 10
  },
  "threshold": 10
}
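The step from one page to the next can be written as a small helper. The sketch below follows the example in this article only: the returned identifier holds the largest offset on the current page, and the next request uses that offset plus one. The function name next_paging_spec is made up, and other Druid versions may handle paging differently.

```python
# Segment ID taken from the result shown in this article.
SEGMENT_ID = (
    "lxw1234_2015-11-17T00:00:00.000+08:00"
    "_2015-11-18T00:00:00.000+08:00"
    "_2015-11-18T16:53:02.158+08:00"
)


def next_paging_spec(result: dict, threshold: int) -> dict:
    """Build the pagingSpec for the next page from one Select result entry.

    Each returned identifier is the largest offset on the current page,
    so the next page is requested starting at that offset plus one.
    """
    identifiers = result["pagingIdentifiers"]
    return {
        "pagingIdentifiers": {seg: off + 1 for seg, off in identifiers.items()},
        "threshold": threshold,
    }


# Only the part of the first-page result that matters for paging.
first_page = {"pagingIdentifiers": {SEGMENT_ID: 9}, "events": []}
second_page_spec = next_paging_spec(first_page, threshold=10)
print(second_page_spec)
```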

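Running the query again gives:

[{
  "timestamp": "2015-11-16T16:00:00.000Z",
  "result": {
    "pagingIdentifiers": {
      "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00": 19
    },
    "events": [{
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 10,
      "event": {
        "timestamp": "2015-11-17T00:00:00.000+08:00",
        "cookieid": "24910D7A0705F65649FD79",
        "count": 1
      }
    }, {
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 11,
      "event": {
        "timestamp": "2015-11-17T00:00:00.000+08:00",
        "cookieid": "2639277D0617E15635D8B4",
        "ip": "",
        "count": 1
      }
    }, {
      "segmentId": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
      "offset": 12,
...

Clearly, what is shown now is the "second page" of data. Select queries also support the filter and context options, which will be covered later.

The options in the Timeseries query configuration file are:

queryType: the query type, here timeseries
dataSource: the data source
intervals: the time interval to query
granularity: the time granularity of the aggregation
aggregations: the aggregation type, the field, and the name shown in the result

Running the query with the Timeseries configuration file gives the following result:

[{
  "timestamp": "2015-11-14T00:00:00.000Z",
  "result": {"total_count": 816711}
}, {
  "timestamp": "2015-11-15T00:00:00.000Z",
  "result": {"total_count": 7650142}
}, {
  "timestamp": "2015-11-16T00:00:00.000Z",
  "result": {"total_count": 8101597}
}, {
  "timestamp": "2015-11-17T00:00:00.000Z",
  "result": {"total_count": 9126742}
}]

The result is aggregated by day, but there is a problem with the time format. Besides the options above, a Timeseries query can also specify filter, postAggregations, and context, which will be covered in detail later.

Zero-filling

Normally, when a Timeseries query aggregates by day and a given day has no data (it was filtered out), the result shows an aggregate of 0 for that day. With the data above, for example, if 2015-11-15 had no matching data, the result would contain:

{
  "timestamp": "2015-11-15T00:00:00.000Z",
  "result": {"total_count": 0}
}

If such rows are not wanted in the result, the context option can be used to drop them, configured as follows:

"context": {
  "skipEmptyBuckets": "true"
}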
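The effect of skipEmptyBuckets can be illustrated with a short Python sketch. It is a local imitation of the behaviour described above, not Druid code. The totals are the ones from the Timeseries result in this article, with 2015-11-15 assumed to have no matching data, and the function name fill_day_buckets is invented.

```python
from datetime import date, timedelta


def fill_day_buckets(totals, start, end, skip_empty_buckets=False):
    """Return one (day, total) row per day in [start, end).

    totals maps a date to its aggregated total_count. Days missing from
    totals are reported as 0, unless skip_empty_buckets is True, in which
    case they are left out of the result.
    """
    rows = []
    day = start
    while day < end:
        if day in totals:
            rows.append((day.isoformat(), totals[day]))
        elif not skip_empty_buckets:
            rows.append((day.isoformat(), 0))
        day += timedelta(days=1)
    return rows


# 2015-11-15 is deliberately absent: assume the filter removed all of its rows.
totals = {
    date(2015, 11, 14): 816711,
    date(2015, 11, 16): 8101597,
    date(2015, 11, 17): 9126742,
}
start, end = date(2015, 11, 14), date(2015, 11, 18)

zero_filled = fill_day_buckets(totals, start, end)
skipped = fill_day_buckets(totals, start, end, skip_empty_buckets=True)
print(zero_filled)
print(skipped)
```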

1.2 TopN queries

Most readers will be familiar with TopN queries: group by one dimension, sort by the aggregated metric, and take the top N. In Druid, a TopN query is faster than the equivalent GroupBy + Ordering. The implementation is essentially divide and conquer. To get the Top 10, for example, each task node computes its own Top 10 and sends it to the Broker, and the Broker then derives the final Top 10 from the nodes' Top 10 lists.

A simple TopN query configuration file:

{
  "queryType": "topN",
  "dataSource": "lxw1234",
  "granularity": "day",
  "dimension": "cookieid",
  "metric": "total_count",
  "threshold": 3,
  "aggregations": [
    {"type": "longSum", "fieldName": "count", "name": "total_count"}
  ],
  "intervals": ["2015-11-17/2015-11-18"]
}

This query finds the top 3 cookieid values by pv for each day. The result:

[{
  "timestamp": "2015-11-16T00:00:00.000Z",
  "result": [
    {"cookieid": "FB9F477802089A55C34FE1", "total_count": 1064},
    {"cookieid": "A85745780A75D55649F686", "total_count": 930},
    {"cookieid": "7794FA7707FAE3561CC4F2", "total_count": 913}
  ]
}, {
  "timestamp": "2015-11-17T00:00:00.000Z",
  "result": [
    {"cookieid": "07D03F2403FEDB564A89CD", "total_count": 17333},
    {"cookieid": "AA511C1B0C7CC355F8DBD3", "total_count": 15476},
    {"cookieid": "BE9AEF700463AE562DF470", "total_count": 9933}
  ]
}]

TopN queries also support other options such as filter, postAggregations, and context, which will be covered in detail later.
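The divide-and-conquer idea described for TopN can be sketched in Python as follows. This is a simplified model, not Druid's implementation: the node data and the function names local_top_n and broker_top_n are invented. The example also shows a consequence of the approach: because each node ships only its local top N, a merged total can be lower than the true total.

```python
import heapq
from collections import Counter


def local_top_n(counts: dict, n: int) -> list:
    """Top n (key, total) pairs computed on a single node."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def broker_top_n(per_node_counts: list, n: int) -> list:
    """Merge each node's local top n, then take the overall top n."""
    merged = Counter()
    for counts in per_node_counts:
        for key, total in local_top_n(counts, n):
            merged[key] += total
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])


# Hypothetical per-node totals for four cookieid values.
node_a = {"A": 10, "B": 8, "C": 1}
node_b = {"A": 5, "C": 7, "D": 6}

top2 = broker_top_n([node_a, node_b], 2)
print(top2)
```

In this example the true total for A is 15, but node_b's entry for A is outside its local top 2, so the merged result reports 10.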

1.3 GroupBy queries

A GroupBy aggregation query aggregates metrics over multiple dimensions. Druid recommends that any query that can be expressed as a Timeseries or TopN query should not use GroupBy, because GroupBy performs worse. A simple GroupBy query configuration file looks like this:

{
  "queryType": "groupBy",
  "dataSource": "lxw1234",
  "granularity": "day",
  "dimensions": ["cookieid", "ip"],
  "limitSpec": {"type": "default", "limit": 50, "columns": ["cookieid", "ip"]},
  "aggregations": [
    {"type": "longSum", "name": "total_pv", "fieldName": "count"}
  ],
  "intervals": ["2015-11-17/2015-11-19"]
}

This query groups by day, cookieid, and ip, sums pv, and limits the output to 50 rows. The result is:

[{
  "version": "v1",
  "timestamp": "2015-11-16T00:00:00.000Z",
  "event": {
    "total_pv": 4,
    "cookieid": "001714AF0549BB55405121",
    "ip": "175.20.11.38"
  }
}, {
  "version": "v1",
  "timestamp": "2015-11-16T00:00:00.000Z",
  "event": {
    "total_pv": 1,
    "cookieid": "00179C6E02F673564A656C",
    "ip": "110.156.23.0"
  }
}, {
  "version": "v1",
  "timestamp": "2015-11-16T00:00:00.000Z",
  "event": {
    "total_pv": 4,
    "cookieid": "0019E379003F70539467A7",
    "ip": "218.75.70
...

GroupBy queries also support the options filter, postAggregations, having, limitSpec, and context, which will be covered in detail later.
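As a rough picture of what this GroupBy computes, the Python sketch below sums a metric per (day, cookieid, ip) and keeps a limited number of rows. It is a simplification rather than Druid's algorithm: the sample rows and the function name group_by are invented, and ordering by day, then cookieid, then ip is an assumption about how the limitSpec columns are applied.

```python
from collections import defaultdict

# Hypothetical rows: (day, cookieid, ip, count).
rows = [
    ("2015-11-16", "c1", "10.0.0.1", 1),
    ("2015-11-16", "c1", "10.0.0.1", 3),
    ("2015-11-16", "c2", "10.0.0.2", 1),
    ("2015-11-17", "c1", "10.0.0.1", 2),
]


def group_by(rows, limit):
    """Sum count per (day, cookieid, ip) and return the first `limit` groups."""
    totals = defaultdict(int)
    for day, cookieid, ip, count in rows:
        totals[(day, cookieid, ip)] += count
    # Tuple keys sort by day, then cookieid, then ip.
    ordered = sorted(totals.items())
    return [
        {"timestamp": day,
         "event": {"total_pv": pv, "cookieid": cookieid, "ip": ip}}
        for (day, cookieid, ip), pv in ordered[:limit]
    ]


grouped = group_by(rows, limit=2)
for row in grouped:
    print(row)
```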

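2. Metadata Queries

2.1 Time Boundary Queries

A time boundary query returns the earliest and latest timestamps of a data source.

{
  "queryType": "timeBoundary",
  "dataSource": "lxw1234"
}

The result is:

[{
  "timestamp": "2015-11-15T00:00:00.000+08:00",
  "result": {
    "minTime": "2015-11-15T00:00:00.000+08:00",
    "maxTime": "2015-11-18T23:59:59.000+08:00"
  }
}]

There is also a bound option, which specifies whether to return the maximum or the minimum time point. If it is not specified, both are returned.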

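{
  "queryType": "timeBoundary",
  "dataSource": "lxw1234",
  "bound": "maxTime"
}

In this case only the maximum time point is returned:

[{
  "timestamp": "2015-11-18T23:59:59.000+08:00",
  "result": {
    "maxTime": "2015-11-18T23:59:59.000+08:00"
  }
}]

2.2 Segment Metadata Queries

A segment metadata query returns the following information for each Segment:

1. The cardinality of every column in the Segment (null for non-STRING columns);

2. The estimated size of each column (bytes);

3. The time span of the Segment;

4. The column types;

5. The estimated total size of the Segment;

6. The Segment ID.

The query configuration file: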

{
  "queryType": "segmentMetadata",
  "dataSource": "lxw1234",
  "intervals": ["2015-11-15/2015-11-19"]
}

The result (only one Segment is shown):

[{
  "id": "lxw1234_2015-11-17T00:00:00.000+08:00_2015-11-18T00:00:00.000+08:00_2015-11-18T16:53:02.158+08:00",
  "intervals": ["2015-11-17T00:00:00.000+08:00/2015-11-18T00:00:00.000+08:00"],
  "columns": {
    "__time": {
      "type": "LONG",
      "size": 46837800,
      "cardinality": null,
      "errorMessage": null
    },
    "cookieid": {
      "type": "STRING",
      "size": 106261532,
      "cardinality": 1134359,
      "errorMessage": null
    },
    "count": {
      "type": "LONG",
      "size": 37470240,
      "cardinality": null,
      "errorMessage": null
    },
    "ip": {
      "type": "STRING",
      "size": 63478131,
      "cardinality": 735562,
      "errorMessage": null
    }
  },
  "size": 272782823
}]

In addition, there are a few other options, toInclude, merge, and analysisTypes, which are fairly simple; for details see http://druid.io/docs/latest/querying/segmentmetadataquery.html
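A result like this is easy to post-process in a script. The Python sketch below pulls the cardinality of the columns that report one and adds up the per-column sizes. The numbers are copied from the result in this article, and the helper names cardinalities and column_size_total are made up for illustration.

```python
import json

# Column statistics copied from the segmentMetadata result in this article.
segment = json.loads("""
{
  "columns": {
    "__time":   {"size": 46837800,  "cardinality": null},
    "cookieid": {"size": 106261532, "cardinality": 1134359},
    "count":    {"size": 37470240,  "cardinality": null},
    "ip":       {"size": 63478131,  "cardinality": 735562}
  },
  "size": 272782823
}
""")


def cardinalities(segment: dict) -> dict:
    """Columns that report a cardinality (null entries are skipped)."""
    return {
        name: col["cardinality"]
        for name, col in segment["columns"].items()
        if col["cardinality"] is not None
    }


def column_size_total(segment: dict) -> int:
    """Sum of the estimated per-column sizes in bytes."""
    return sum(col["size"] for col in segment["columns"].values())


print(cardinalities(segment))
print(column_size_total(segment), segment["size"])
```

The per-column sizes do not add up exactly to the segment-level size, which is reported as a separate estimate.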

2.3 Data Source Metadata Queries

This query simply returns the time at which data last entered the data source. For example, the query configuration file:

{
  "queryType": "dataSourceMetadata",
  "dataSource": "lxw1234"
}

The result is:

[{
  "timestamp": "2015-11-18T23:59:59.000+08:00",
  "result": {
    "maxIngestedEventTime": "2015-11-18T23:59:59.000+08:00"
  }
}]

3. Search Queries

Search here means searching the values of dimension columns, which is basically similar to filtering (Filter). The following configuration file searches the data in the interval 2015-11-17/2015-11-19 for records whose ip contains "219.146.132.239":

{
  "queryType": "search",
  "dataSource": "lxw1234",
  "granularity": "day",
  "searchDimensions": [
    "ip"
  ],
  "query": {
    "type": "insensitive_contains",
    "value": "219.146.132.239"
  },
  "sort": {
    "type": "lexicographic"
  },
  "intervals": [
    "2015-11-17/2015-11-19"
  ]
}

The result is:

[{
  "timestamp": "2015-11-16T00:00:00.000Z",
  "result": [{
    "dimension": "ip",
    "value": "219.146.132.239"
  }]
}, {
  "timestamp": "2015-11-17T00:00:00.000Z",
  "result": [{
    "dimension": "ip",
    "value": "219.146.132.239"
  }]
}]

Search queries can also be configured with the filter and context options, which will be covered in detail later.
