Spatio-temporal database in practice (with a pivot analysis of New York taxi data) - PostGIS + TimescaleDB = PostgreSQL

Date: 2021-06-29 05:32:28

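Background

In the real world, much business data is time-series in nature: it is written sequentially along the time dimension, and it is queried and aggregated heavily over time ranges.

Examples include business feed data, time-series data produced by the Internet of Things (weather sensors, vehicle trajectories, and so on), and real-time data in the financial industry.

PostgreSQL's UDFs and BRIN (block range index) are well suited to processing time-series data. Two examples:

"On-demand partitioning in PostgreSQL (a schemaless plpgsql implementation of the TimescaleDB plugin's automatic partitioning)"

"PostgreSQL time-series best practices - database design for a securities trading system - Alibaba Cloud RDS PostgreSQL best practices"

In fact, the PostgreSQL ecosystem has produced a dedicated time-series plugin, timescaleDB. Its improvements include SQL optimizer changes (merge append support, which makes aggregation over time slices very efficient), a rotate interface, automatic partitioning, and more.

timescaleDB has also attracted considerable investor attention and has received 50 million US dollars in funding, which indirectly suggests that time-series databases will be very popular with users.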

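Advantages of timescaleDB

First, timescaleDB partitions data into chunks automatically and transparently to the user, so write performance does not degrade even when the data volume becomes very large. (This mainly matters on disks with low IOPS; on disks with good IOPS, plain PG also performs acceptably after large volumes of data have been written.)

Second, timescale improves the SQL optimizer by adding a merge append execution node. When grouping by small time slices, it does not need to HASH or GROUP the entire data range and can instead compute chunk by chunk, which makes such queries very efficient.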
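To make the merge append idea more concrete, here is a minimal Python sketch. It is not TimescaleDB's executor; it only assumes that each chunk can return its rows already sorted by time in descending order (as an index scan would). The names `merge_append` and `top_minutes` and the sample rows are hypothetical. Because the merged stream is ordered, rows of the same minute arrive next to each other, so groups can be emitted one at a time and reading can stop as soon as the LIMIT is satisfied.

```python
import heapq
from itertools import groupby, islice

def merge_append(chunks):
    """Merge per-chunk row streams (each sorted by time, descending) into one ordered stream."""
    return heapq.merge(*chunks, key=lambda row: row[0], reverse=True)

def top_minutes(chunks, limit):
    """Max value per minute for the newest `limit` minutes, reading rows lazily."""
    ordered = merge_append(chunks)
    # Rows are (epoch_seconds, value); in an ordered stream, rows of one minute are adjacent.
    groups = groupby(ordered, key=lambda row: row[0] // 60)
    return [(minute * 60, max(value for _, value in rows))
            for minute, rows in islice(groups, limit)]

# Two hypothetical chunks, each already sorted by time descending.
chunk_a = [(185, 0.9), (130, 0.4), (20, 0.1)]
chunk_b = [(170, 0.7), (65, 0.5), (10, 0.3)]

result = top_minutes([chunk_a, chunk_b], limit=2)
print(result)
```

With `limit=2` only the rows of the two newest minutes (plus one look-ahead row) are pulled from the merged stream; the older rows are never aggregated.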

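Finally, timescale adds a set of APIs that make writing, maintaining, and querying time-series data efficient and easy to manage.

The API is documented at:

/v0.8/api

Deploying timescaleDB

The example uses CentOS 7.x x64.

1. First, install PostgreSQL

See "PostgreSQL on Linux best deployment manual".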

export USE_NAMED_POSIX_SEMAPHORES=1
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256
LIBS=-lpthread CFLAGS="-O3" make world -j 64
LIBS=-lpthread CFLAGS="-O3" make install-world

2. Next, install cmake3

# cmake3 comes from the EPEL repository
yum install -y cmake3
ln -s /usr/bin/cmake3 /usr/bin/cmake

3. Build timescaleDB

git clone /timescale/timescaledb/
cd timescaledb
git checkout release-0.8.0

or

wget /timescale/timescaledb/archive/0.8.0.tar.gz

export PATH=/home/digoal/pgsql10/bin:$PATH
export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH

# Bootstrap the build system
./bootstrap

cd ./build && make

make install

[  2%] Built target sqlupdatefile
[  4%] Built target sqlfile
[100%] Built target timescaledb
Install the project...
-- Install configuration: "Release"
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql
-- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so

4. Configure postgresql.conf so that the timescale library is loaded automatically when the database starts.

vi $PGDATA/postgresql.conf

shared_preload_libraries = 'timescaledb'

pg_ctl restart -m fast

5. Create the extension in each database that needs timescaledb.

psql
psql (10.1)
Type "help" for help.

postgres=# create extension timescaledb ;

6. timescaledb-related parameters

timescaledb.constraint_aware_append
timescaledb.disable_optimizations
timescaledb.optimize_non_hypertables
timescaledb.restoring

postgres=# show timescaledb.constraint_aware_append ;
 timescaledb.constraint_aware_append
-------------------------------------
 on
(1 row)

postgres=# show timescaledb.disable_optimizations ;
 timescaledb.disable_optimizations
-----------------------------------
 off
(1 row)

postgres=# show timescaledb.optimize_non_hypertables ;
 timescaledb.optimize_non_hypertables
--------------------------------------
 off
(1 row)

postgres=# show timescaledb.restoring ;
 timescaledb.restoring
-----------------------
 off
(1 row)

timescaleDB example 1 - pivot analysis of New York taxi data

The first example uses real-life New York City taxicab data:

/v0.8/tutorials/tutorial-hello-nyc

The data is real and comes from

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1. Download the sample data

wget https://timescaledata.blob./datasets/nyc_data.tar.gz

2. Extract it

tar -zxvf nyc_data.tar.gz

3. Create the tables. This step includes the create_hypertable API, which converts an ordinary table into a time-series storage table (hypertable).

psql -f nyc_data.sql

An excerpt from nyc_data.sql:

cat nyc_data.sql

-- Ride data: duration, fare, distance, pickup and dropoff longitude/latitude, time, passenger count, and so on.
CREATE TABLE "rides"(
    vendor_id TEXT,
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    passenger_count NUMERIC,
    trip_distance NUMERIC,
    pickup_longitude NUMERIC,
    pickup_latitude NUMERIC,
    rate_code INTEGER,
    dropoff_longitude NUMERIC,
    dropoff_latitude NUMERIC,
    payment_type INTEGER,
    fare_amount NUMERIC,
    extra NUMERIC,
    mta_tax NUMERIC,
    tip_amount NUMERIC,
    tolls_amount NUMERIC,
    improvement_surcharge NUMERIC,
    total_amount NUMERIC
);

This statement converts rides into a hypertable:

SELECT create_hypertable('rides', 'pickup_datetime', 'payment_type', 2, create_default_indexes=>FALSE);
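The call above partitions rides by time (pickup_datetime) and additionally by a hash of payment_type into 2 partitions. The Python sketch below illustrates how a row could be routed to a chunk under such a scheme. It is an illustration only: the 30-day chunk interval, the alignment to the Unix epoch, the 31-bit hash space, and the use of `zlib.crc32` as a stand-in hash are all assumptions, `chunk_key` and `partition_hash` are hypothetical names, and the sample dates are made up.

```python
import zlib
from datetime import datetime, timedelta

CHUNK_INTERVAL = timedelta(days=30)   # assumption: chunk time interval
HASH_SPACE = 2 ** 31                  # assumption: hash values fall in [0, 2**31)
EPOCH = datetime(1970, 1, 1)          # assumption: intervals are aligned to the Unix epoch

def partition_hash(value):
    """Stand-in hash; NOT TimescaleDB's internal partitioning hash."""
    return zlib.crc32(str(value).encode()) % HASH_SPACE

def chunk_key(pickup_datetime, payment_type, number_partitions=2):
    """Return (time_slot_start, hash_slot) identifying the chunk a row falls into."""
    slots = (pickup_datetime - EPOCH) // CHUNK_INTERVAL      # whole intervals since the epoch
    time_start = EPOCH + slots * CHUNK_INTERVAL
    hash_slot = partition_hash(payment_type) * number_partitions // HASH_SPACE
    return time_start, hash_slot

a = chunk_key(datetime(2016, 1, 1, 0, 5), payment_type=1)
b = chunk_key(datetime(2016, 1, 20, 9, 0), payment_type=1)
c = chunk_key(datetime(2016, 3, 1), payment_type=1)
print(a, b, c)
```

Rows `a` and `b` fall into the same time slot and hash slot, so they would share a chunk; row `c` lands in a later time slot. A query with a time predicate only needs to touch the chunks whose time slot overlaps that predicate.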

Create the indexes

CREATE INDEX ON rides (vendor_id, pickup_datetime desc);
CREATE INDEX ON rides (pickup_datetime desc, vendor_id);
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);
CREATE INDEX ON rides (passenger_count, pickup_datetime desc);

4. Import the test data

psql -c "\COPY rides FROM nyc_data_rides.csv CSV"

COPY 10906858

5. Run some test SQL against rides, which is now a hypertable. Performance is better than with an ordinary PostgreSQL table.

What is the average fare per day for rides with two or more passengers?

-- Average fare amount of rides with 2+ passengers by day
SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)
  FROM rides
  WHERE passenger_count > 1 AND pickup_datetime < '-01-08'
  GROUP BY day ORDER BY day;

        day         |         avg
--------------------+---------------------
 -01-01 00:00:00    | 13.3990821679715529
 -01-02 00:00:00    | 13.0224687415181399
 -01-03 00:00:00    | 13.5382068607068607
 -01-04 00:00:00    | 12.9618895561740149
 -01-05 00:00:00    | 12.6614611935518309
 -01-06 00:00:00    | 12.5775245695086098
 -01-07 00:00:00    | 12.5868802584437019
(7 rows)

6. Some queries are more than 20 times faster

How many rides were there each day?

-- Total number of rides by day for first 5 days
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*)
  FROM rides
  GROUP BY day ORDER BY day
  LIMIT 5;

        day         | count
--------------------+--------
 -01-01 00:00:00    | 345037
 -01-02 00:00:00    | 312831
 -01-03 00:00:00    | 302878
 -01-04 00:00:00    | 316171
 -01-05 00:00:00    | 343251
(5 rows)

timescale adds the merge append execution optimization, so aggregating at small granularity over time slices is very efficient. The larger the data set, the more noticeable the improvement.

For example, TimescaleDB introduces a time-based "merge append" optimization to minimize the number of groups which must be processed to execute the following (given its knowledge that time is already ordered). For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

SELECT date_trunc('minute', time) AS minute, max(usage_user)
  FROM cpu
  WHERE time < '-01-01'
  GROUP BY minute
  ORDER BY minute DESC
  LIMIT 5;

7. Run some timescaleDB-specific functions, such as time_bucket. These also benefit from timescaleDB's built-in acceleration.

Using 5-minute intervals as buckets, output the number of rides in each interval.

-- Number of rides by 5 minute intervals
--   (using the TimescaleDB "time_bucket" function)
SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)
  FROM rides
  WHERE pickup_datetime < '-01-01 02:00'
  GROUP BY five_min ORDER BY five_min;

      five_min       | count
---------------------+-------
 -01-01 00:00:00     |   703
 -01-01 00:05:00     |  1482
 -01-01 00:10:00     |  1959
 -01-01 00:15:00     |  2200
 -01-01 00:20:00     |  2285
 -01-01 00:25:00     |  2291
 -01-01 00:30:00     |  2349
 -01-01 00:35:00     |  2328
 -01-01 00:40:00     |  2440
 -01-01 00:45:00     |  2372
 -01-01 00:50:00     |  2388
 -01-01 00:55:00     |  2473
 -01-01 01:00:00     |  2395
 -01-01 01:05:00     |  2510
 -01-01 01:10:00     |  2412
 -01-01 01:15:00     |  2482
 -01-01 01:20:00     |  2428
 -01-01 01:25:00     |  2433
 -01-01 01:30:00     |  2337
 -01-01 01:35:00     |  2366
 -01-01 01:40:00     |  2325
 -01-01 01:45:00     |  2257
 -01-01 01:50:00     |  2316
 -01-01 01:55:00     |  2250
(24 rows)
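For readers unfamiliar with time_bucket, the following Python sketch mimics what the query above computes: each timestamp is floored to the start of its 5-minute bucket and the rows per bucket are counted. This is not TimescaleDB code; the function here is a hypothetical re-implementation that assumes buckets are aligned to the Unix epoch, and the pickup timestamps are made-up sample data.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumption: buckets are aligned to the Unix epoch

def time_bucket(width, ts):
    """Floor `ts` to the start of the `width`-sized bucket that contains it."""
    return ts - (ts - EPOCH) % width

# Hypothetical pickup times (illustrative values only).
pickups = [
    datetime(2016, 1, 1, 0, 1, 30),
    datetime(2016, 1, 1, 0, 4, 59),
    datetime(2016, 1, 1, 0, 5, 0),
    datetime(2016, 1, 1, 0, 12, 7),
]

# Equivalent of: SELECT time_bucket('5 minute', ts), count(*) ... GROUP BY 1 ORDER BY 1
counts = Counter(time_bucket(timedelta(minutes=5), ts) for ts in pickups)
for bucket, n in sorted(counts.items()):
    print(bucket, n)
```

Unlike date_trunc, the width is an arbitrary interval, so the same function works for 5 minutes, 30 minutes, or 6 hours.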

8. Run some statistical analysis SQL

Number of trips per rate type (JFK, Newark, and so on).

-- Join rides with rates to get more information on rate_code
SELECT rates.description, COUNT(vendor_id) as num_trips
  FROM rides
  JOIN rates on rides.rate_code = rates.rate_code
  WHERE pickup_datetime < '-01-08'
  GROUP BY rates.description
  ORDER BY rates.description;

      description      | num_trips
-----------------------+-----------
 JFK                   |     54832
 Nassau or Westchester |       967
 Newark                |      4126
 group ride            |        17
 negotiated fare       |      7193
 standard rate         |   2266401
(6 rows)

January statistics for JFK and Newark rides (shortest, longest, and average distance, average passenger count, duration, and so on).

-- Analysis of all JFK and EWR rides in Jan
SELECT rates.description, COUNT(vendor_id) as num_trips,
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration,
    AVG(total_amount) as avg_total,
    AVG(tip_amount) as avg_tip,
    MIN(trip_distance) as min_distance,
    AVG(trip_distance) as avg_distance,
    MAX(trip_distance) as max_distance,
    AVG(passenger_count) as avg_passengers
  FROM rides
  JOIN rates on rides.rate_code = rates.rate_code
  WHERE rides.rate_code in (2,3) AND pickup_datetime < '-02-01'
  GROUP BY rates.description
  ORDER BY rates.description;

 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309
(2 rows)

9. Automatic chunking and execution plans

postgres=# \d+ rides
                                              Table "public.rides"
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 vendor_id             | text                        |           |          |         | extended |              |
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |
 passenger_count       | numeric                     |           |          |         | main     |              |
 trip_distance         | numeric                     |           |          |         | main     |              |
 pickup_longitude      | numeric                     |           |          |         | main     |              |
 pickup_latitude       | numeric                     |           |          |         | main     |              |
 rate_code             | integer                     |           |          |         | plain    |              |
 dropoff_longitude     | numeric                     |           |          |         | main     |              |
 dropoff_latitude      | numeric                     |           |          |         | main     |              |
 payment_type          | integer                     |           |          |         | plain    |              |
 fare_amount           | numeric                     |           |          |         | main     |              |
 extra                 | numeric                     |           |          |         | main     |              |
 mta_tax               | numeric                     |           |          |         | main     |              |
 tip_amount            | numeric                     |           |          |         | main     |              |
 tolls_amount          | numeric                     |           |          |         | main     |              |
 improvement_surcharge | numeric                     |           |          |         | main     |              |
 total_amount          | numeric                     |           |          |         | main     |              |
Indexes:
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)
Child tables: _timescaledb_internal._hyper_1_1_chunk,
              _timescaledb_internal._hyper_1_2_chunk,
              _timescaledb_internal._hyper_1_3_chunk,
              _timescaledb_internal._hyper_1_4_chunk

The constraints on one of the chunks are as follows:

Check constraints:
    "constraint_1" CHECK (pickup_datetime >= '-12-31 00:00:00'::timestamp without time zone AND pickup_datetime < '-01-30 00:00:00'::timestamp without time zone)
    "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)
Inherits: rides

-- Peek behind the scenes
postgres=# select count(*) from rides;
  count
----------
 10906858
(1 row)

Time: 376.247 ms

postgres=# explain select count(*) from rides;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=254662.23..254662.24 rows=1 width=8)
   ->  Gather  (cost=254661.71..254662.22 rows=5 width=8)
         Workers Planned: 5
         ->  Partial Aggregate  (cost=253661.71..253661.72 rows=1 width=8)
               ->  Append  (cost=0.00..247468.57 rows=2477258 width=0)
                     ->  Parallel Seq Scan on rides  (cost=0.00..0.00 rows=1 width=0)
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..77989.57 rows=863657 width=0)
                     ->  Parallel Seq Scan on _hyper_1_2_chunk  (cost=0.00..150399.01 rows=1331101 width=0)
                     ->  Parallel Seq Scan on _hyper_1_3_chunk  (cost=0.00..6549.75 rows=112675 width=0)
                     ->  Parallel Seq Scan on _hyper_1_4_chunk  (cost=0.00..12530.24 rows=169824 width=0)
(10 rows)

10. Chunks can also be queried directly

postgres=# select count(*) from _timescaledb_internal._hyper_1_1_chunk;
  count
---------
 3454961
(1 row)

Chunking is completely transparent to the user.

Chunk metadata:

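timescaleDB + PostGIS combined - a spatio-temporal database

By combining the time-series plugin timescaleDB with the spatial plugin PostGIS, PostgreSQL can handle spatio-temporal data well.

1. Create the PostGIS spatial extension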

create extension postgis;

2. Add geometry columns

/docs/manual-2.4/AddGeometryColumn.html

postgres=# SELECT AddGeometryColumn ('public','rides','pickup_geom',2163,'POINT',2);
                   addgeometrycolumn
--------------------------------------------------------
 public.rides.pickup_geom SRID:2163 TYPE:POINT DIMS:2
(1 row)

postgres=# SELECT AddGeometryColumn ('public','rides','dropoff_geom',2163,'POINT',2);
                    addgeometrycolumn
---------------------------------------------------------
 public.rides.dropoff_geom SRID:2163 TYPE:POINT DIMS:2
(1 row)

postgres=# \d+ rides
                                              Table "public.rides"
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 vendor_id             | text                        |           |          |         | extended |              |
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |
 passenger_count       | numeric                     |           |          |         | main     |              |
 trip_distance         | numeric                     |           |          |         | main     |              |
 pickup_longitude      | numeric                     |           |          |         | main     |              |
 pickup_latitude       | numeric                     |           |          |         | main     |              |
 rate_code             | integer                     |           |          |         | plain    |              |
 dropoff_longitude     | numeric                     |           |          |         | main     |              |
 dropoff_latitude      | numeric                     |           |          |         | main     |              |
 payment_type          | integer                     |           |          |         | plain    |              |
 fare_amount           | numeric                     |           |          |         | main     |              |
 extra                 | numeric                     |           |          |         | main     |              |
 mta_tax               | numeric                     |           |          |         | main     |              |
 tip_amount            | numeric                     |           |          |         | main     |              |
 tolls_amount          | numeric                     |           |          |         | main     |              |
 improvement_surcharge | numeric                     |           |          |         | main     |              |
 total_amount          | numeric                     |           |          |         | main     |              |
 pickup_geom           | geometry(Point,2163)        |           |          |         | main     |              |
 dropoff_geom          | geometry(Point,2163)        |           |          |         | main     |              |
Indexes:
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)
Child tables: _timescaledb_internal._hyper_1_1_chunk,
              _timescaledb_internal._hyper_1_2_chunk,
              _timescaledb_internal._hyper_1_3_chunk,
              _timescaledb_internal._hyper_1_4_chunk

3. Populate the geometry columns. (The coordinates are actually stored as two fields, longitude and latitude. Skipping this update is also fine: PG supports expression indexes, so a spatial expression index can be built directly on those two fields.)

-- Generate the geometry points and write to table
--   (Note: These calculations might take a few mins)
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);

vacuum full rides;

4. A spatio-temporal analysis example.

Within 400 meters of (lat, long) (40.7589,-73.9851), how many rides were hailed in each 30-minute interval (measured by pickup location)?

-- Number of rides on New Years Eve originating within
--   400m of Times Square, by 30 min buckets
--   Note: Times Square is at (lat, long) (40.7589,-73.9851)
SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq
  FROM rides
  WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400
    AND pickup_datetime < '-01-01 14:00'
  GROUP BY thirty_min ORDER BY thirty_min;

     thirty_min      | near_times_sq
---------------------+---------------
 -01-01 00:00:00     |            74
 -01-01 00:30:00     |           102
 -01-01 01:00:00     |           120
 -01-01 01:30:00     |            98
 -01-01 02:00:00     |           112
 -01-01 02:30:00     |           109
 -01-01 03:00:00     |           163
 -01-01 03:30:00     |           181
 -01-01 04:00:00     |           214
 -01-01 04:30:00     |           185
 -01-01 05:00:00     |           158
 -01-01 05:30:00     |           113
 -01-01 06:00:00     |           102
 -01-01 06:30:00     |            91
 -01-01 07:00:00     |            88
 -01-01 07:30:00     |            58
 -01-01 08:00:00     |            72
 -01-01 08:30:00     |            94
 -01-01 09:00:00     |           115
 -01-01 09:30:00     |           118
 -01-01 10:00:00     |           135
 -01-01 10:30:00     |           160
 -01-01 11:00:00     |           212
 -01-01 11:30:00     |           229
 -01-01 12:00:00     |           244
 -01-01 12:30:00     |           230
 -01-01 13:00:00     |           235
 -01-01 13:30:00     |           238
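The shape of this query (a distance filter followed by time bucketing) can be sketched in plain Python as below. Note the substitution: instead of projecting points to SRID 2163 and calling ST_Distance as the SQL does, this sketch uses the haversine great-circle formula on raw latitude/longitude, so distances may differ slightly from PostGIS results. The helper names and the four sample rides are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000           # mean Earth radius in metres
TIMES_SQUARE = (40.7589, -73.9851)   # (lat, long) from the query above
EPOCH = datetime(1970, 1, 1)         # assumption: buckets aligned to the Unix epoch

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (stand-in for ST_Distance on projected points)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def rides_near(rides, centre, radius_m, width):
    """Count rides picked up within radius_m of centre, grouped into width-sized time buckets."""
    counts = Counter()
    for ts, lat, lon in rides:
        if haversine_m(lat, lon, *centre) < radius_m:
            counts[ts - (ts - EPOCH) % width] += 1
    return sorted(counts.items())

# Hypothetical sample rides: (pickup time, pickup latitude, pickup longitude).
rides = [
    (datetime(2016, 1, 1, 0, 10), 40.7589, -73.9851),  # at the centre point
    (datetime(2016, 1, 1, 0, 20), 40.7600, -73.9851),  # roughly 120 m north
    (datetime(2016, 1, 1, 0, 40), 40.7589, -73.9851),  # at the centre point, next bucket
    (datetime(2016, 1, 1, 0, 45), 40.7689, -73.9851),  # roughly 1.1 km north, filtered out
]

near = rides_near(rides, TIMES_SQUARE, radius_m=400, width=timedelta(minutes=30))
print(near)
```

In the database the same filter is better served by a spatial index on the geometry column; this sketch scans every row and is meant only to show the logic.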

Example 2 - sensor data and weather data

/v0.8/tutorials/other-sample-datasets

These are not covered in detail here.

Common timescaleDB APIs

/v0.8/api

1. Create a hypertable

create_hypertable()

Required Arguments

Optional Arguments

2. Add a further partitioning dimension

Both hash and interval partitioning are supported

add_dimension()

Required Arguments

Optional Arguments

3. Drop chunks

Drop the chunks older than a given point in time or a given interval

drop_chunks()

Required Arguments

Optional Arguments

4. Set the chunk time interval

set_chunk_time_interval()

Required Arguments

5. Analytic function - first value

first()

Required Arguments

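For example, find the earliest temperature value reported by each sensor.

SELECT device_id, first(temp, time)
  FROM metrics
  GROUP BY device_id;

The same can be achieved with recursive SQL:

"Several applications of recursive SQL in PostgreSQL - how geeks and ordinary people think"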

6. Analytic function - last value

last()

Required Arguments

For example, find the latest temperature value of each sensor within each 5-minute interval over the past day.

SELECT device_id, time_bucket('5 minutes', time) as interval,
    last(temp, time)
  FROM metrics
  WHERE time > now() - interval '1 day'
  GROUP BY device_id, interval
  ORDER BY interval DESC;
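As a rough illustration of what first(value, time) and last(value, time) return, the Python sketch below picks, per device, the value attached to the smallest and the largest timestamp. It mirrors the semantics only, not how TimescaleDB evaluates these aggregates; `first_last_by_device` and the readings are hypothetical.

```python
from collections import defaultdict

def first_last_by_device(readings):
    """readings: iterable of (device_id, time, temp).

    Returns {device_id: (first_temp, last_temp)}, where "first" and "last"
    are decided by the time column, not by the order of arrival.
    """
    groups = defaultdict(list)
    for device_id, time, temp in readings:
        groups[device_id].append((time, temp))

    result = {}
    for device_id, rows in groups.items():
        earliest = min(rows, key=lambda row: row[0])   # like first(temp, time)
        latest = max(rows, key=lambda row: row[0])     # like last(temp, time)
        result[device_id] = (earliest[1], latest[1])
    return result

# Hypothetical readings, deliberately out of time order.
readings = [
    ("dev1", 30, 21.5),
    ("dev1", 10, 20.0),
    ("dev2", 5, 18.2),
    ("dev1", 20, 22.1),
    ("dev2", 50, 19.0),
]

summary = first_last_by_device(readings)
print(summary)
```

Adding a time bucket to the grouping key, as the SQL above does, gives the first or last value per device per interval.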

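The same can be achieved with recursive SQL:

"Several applications of recursive SQL in PostgreSQL - how geeks and ordinary people think"

7. Analytic function - histogram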

histogram()

Required Arguments

For example, divide the battery level range 20 to 60 evenly into 5 buckets. The result is an array of 5+2 values giving the number of records in each bucket; the first and last elements count the records that fall below and above the range.

SELECT device_id, histogram(battery_level, 20, 60, 5)
  FROM readings
  GROUP BY device_id
  LIMIT 10;

 device_id  |          histogram
------------+------------------------------
 demo000000 | {0,0,0,7,215,206,572}
 demo000001 | {0,12,173,112,99,145,459}
 demo000002 | {0,0,187,167,68,229,349}
 demo000003 | {197,209,127,221,106,112,28}
 demo000004 | {0,0,0,0,0,39,961}
 demo000005 | {12,225,171,122,233,80,157}
 demo000006 | {0,78,176,170,8,40,528}
 demo000007 | {0,0,0,126,239,245,390}
 demo000008 | {0,0,311,345,116,228,0}
 demo000009 | {295,92,105,50,8,8,442}
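The bucket layout described above can be sketched in Python as follows. This is a hypothetical re-implementation for illustration, assuming each bucket includes its lower bound and excludes its upper bound, so a value equal to the maximum lands in the final overflow slot; the sample battery levels are made up.

```python
def histogram(values, lo, hi, nbuckets):
    """Return nbuckets + 2 counts: [below lo, bucket 1..nbuckets, at or above hi]."""
    counts = [0] * (nbuckets + 2)
    for v in values:
        if v < lo:
            idx = 0                                   # underflow slot
        elif v >= hi:
            idx = nbuckets + 1                        # overflow slot
        else:
            idx = int((v - lo) * nbuckets / (hi - lo)) + 1
        counts[idx] += 1
    return counts

# Hypothetical battery levels; with lo=20, hi=60 and 5 buckets, each bucket is 8 wide:
# [20,28) [28,36) [36,44) [44,52) [52,60)
levels = [10, 20, 27.9, 28, 45, 59.9, 60, 95]
hist = histogram(levels, 20, 60, 5)
print(hist)
```

The two extra slots make it easy to spot devices whose readings mostly fall outside the expected range, such as demo000004 in the output above.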

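8. Analytic function - time buckets

Similar to date_trunc but more powerful: timestamps can be truncated to an arbitrary interval, which is convenient for users.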

time_bucket()

Required Arguments

Optional Arguments

9. Overview function - hypertable size

hypertable_relation_size_pretty()

SELECT * FROM hypertable_relation_size_pretty('conditions');

 table_size | index_size | toast_size | total_size
------------+------------+------------+------------
 1171 MB    | 1608 MB    | 176 kB     | 2779 MB

10. Overview function - chunk sizes

chunk_relation_size_pretty()

SELECT * FROM chunk_relation_size_pretty('conditions');

                chunk_table                 | table_size | index_size | total_size
--------------------------------------------+------------+------------+------------
 "_timescaledb_internal"."_hyper_1_1_chunk" | 28 MB      | 36 MB      | 64 MB
 "_timescaledb_internal"."_hyper_1_2_chunk" | 57 MB      | 78 MB      | 134 MB
 ...

11. Overview function - index sizes

indexes_relation_size_pretty()

SELECT * FROM indexes_relation_size_pretty('conditions');

             index_name_              | total_size
--------------------------------------+------------
 public.conditions_device_id_time_idx | 1143 MB
 public.conditions_time_idx           | 465 MB