拉链表的创建、查询和回滚
概述
使⽤这种⽅式即可以记录历史,⽽且最⼤程度的节省存储。这⾥简单介绍⼀下这种历史拉链表的更新⽅法。此⽂参考本⽂中假设:
1. 数据仓库中订单历史表的刷新频率为⼀天,当天更新前⼀天的增量数据;2. 如果⼀个订单在⼀天内有多次状态变化,则只会记录最后⼀个状态的历史;3. 订单状态包括三个:创建、⽀付、完成;
4. 创建时间和修改时间只取到天,如果源订单表中没有状态修改时间,那么抽取增量就⽐较⿇烦,需要有个机制来确保能抽取到每天的增量数据;
5. 本⽂中的表和SQL都使⽤Hive的HQL语法;
初始化
假设我们有三天数据,【2015-08-20,2015-08-21,2015-08-22】。
数据流向:(原系统表)orders ==> (增量表)t_ods_orders_inc ==> (拉链表)t_dw_orders_his 建表脚本
-- 源系统中订单表
CREATE TABLE orders (    orderid INT,
createtime STRING,    modifiedtime STRING,    status STRING) stored AS textfile;
-- 订单的增量数据表,按天分区,存放每天的增量数据,保留半年左右CREATE TABLE t_ods_orders_inc (    orderid INT,
createtime STRING,    modifiedtime STRING,    status STRING
) PARTITIONED BY (day STRING)stored AS textfile;
-- 订单的历史数据拉链表
CREATE TABLE t_dw_orders_his (    orderid INT,
createtime STRING,    modifiedtime STRING,    status STRING,
dw_start_date STRING,    dw_end_date STRING) stored AS textfile;
数据初始化
-- 1. 源表orders,假设此表为21⽇状态insert into orders values
(1,'2015-08-18','2015-08-18','创建'),(2,'2015-08-18','2015-08-18','创建'),(3,'2015-08-19','2015-08-21','⽀付'),(4,'2015-08-19','2015-08-21','完成'),(5,'2015-08-19','2015-08-20','⽀付'),(6,'2015-08-20','2015-08-20','创建'),(7,'2015-08-20','2015-08-21','⽀付'),(8,'2015-08-21','2015-08-21','创建');
-- 2. 初始化ODS增量表-21⽇数据(全量初始化,将21号前的累加到此分区)INSERT overwrite TABLE t_ods_orders_inc PARTITION (day = '2015-08-21')SELECT     orderid,    createtime,    modifiedtime,    status
FROM orders
WHERE createtime <= '2015-08-21';
-- 3. 初始化ODS增量表-22⽇数据
insert overwrite table t_ods_orders_inc partition(day='2015-08-22')values
(1,'2015-08-18','2015-08-22','⽀付'),(2,'2015-08-18','2015-08-22','完成'),(6,'2015-08-20','2015-08-22','⽀付'),(8,'2015-08-21','2015-08-22','⽀付'),(9,'2015-08-22','2015-08-22','创建'),(10,'2015-08-22','2015-08-22','⽀付');
-- 4. 初始化ODS增量表-23⽇数据
insert overwrite table t_ods_orders_inc partition(day='2015-08-23')values
(1,'2015-08-18','2015-08-23','完成'),(3,'2015-08-19','2015-08-23','完成'),(5,'2015-08-19','2015-08-23','完成'),(8,'2015-08-21','2015-08-23','完成'),(11,'2015-08-23','2015-08-23','创建'),(12,'2015-08-23','2015-08-23','创建'),(13,'2015-08-23','2015-08-23','⽀付');
拉链表创建
假设
跑数时间 T= '${dt1}'
拉链表有开始⽇期(⽣效⽇期)和结束⽇期(失效⽇期,最新记录此列 = '9999-12-31')
初始化
当21号跑数时,需要全量初始化拉链表,此时,拉链表就是21⽇的切⽚数据
-- 初始化拉链表,假设21号的就是原始数据INSERT overwrite TABLE t_dw_orders_hisSELECT     orderid,    createtime,    modifiedtime,    status,
createtime   AS dw_start_date,    '9999-12-31' AS dw_end_dateFROM t_ods_orders_incWHERE day = '2015-08-21';
增量更新
当22号跑数时,需要把要处理的所有数据分成两部分,处理思路为:
新增,22号增量数据,结束⽇期= '9999-12-31'更新,历史拉链表与增量表进⾏⽐对
当增量表中存在记录,开始⽇期=历史拉链表开始⽇期,结束⽇期= date_add('${dt1}',-1)当增量表中不存在此记录,代表不需要更新
具体操作
创建⼀张临时表保存⽐对的结果数据。
-- 22号增量数据进来后,与21号的状态数据(t_dw_orders_his)⽐对,更新拉链表-- ${dt1} = '2015-08-22'
DROP TABLE IF EXISTS t_dw_orders_his_tmp;CREATE TABLE t_dw_orders_his_tmp AS SELECT  orderid,        createtime,        modifiedtime,        status,
dw_start_date,        dw_end_date FROM (
-- 22号前需更新状态的数据
SELECT  a.orderid,            a.createtime,            a.modifiedtime,            a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date = '9999-12-31'                 THEN date_add('${dt1}',-1) -- 把22号前有效的数据失效⽇期置为dt1的上⼀⽇            ELSE a.dw_end_date END AS dw_end_date     FROM t_dw_orders_his a     left join t_ods_orders_inc b       ON a.orderid = b.orderid     and b.day = '${dt1}'    UNION ALL
-- 22号最新状态数据    SELECT  orderid,            createtime,            modifiedtime,            status,
modifiedtime AS dw_start_date,            '9999-12-31' AS dw_end_date     FROM t_ods_orders_inc     WHERE day = '${dt1}' ) x
ORDER BY orderid,dw_start_date;-- 临时数据正式⼊库
INSERT overwrite TABLE t_dw_orders_hisSELECT * FROM t_dw_orders_his_tmp;
-- 23号数据⼊t_dw_orders_his表请重复上述22号的刷新步骤-- ${dt1} = '2015-08-23'-- ... ...
 
查询和使⽤场景
1. 查询拉链表最新状态数据
select * from t_dw_orders_his where dw_end_date='9999-12-31';
2.查询某⽇所有订单快照
--假设查询22号数据状态 --'${dt1}' = '2021-08-22'select *
from t_dw_orders_his
where dw_start_date<='${dt1}'and dw_end_date>='${dt1}';
回滚⽅法
⽅法
先把拉链表的数据分为三份,分别为 T-N、T、T+N 的数据,T是回滚⽇期。
那么我们假设,在23号发现数据有问题,需要回滚22号的数据,此时拉链表数据有三块。
其中:
对于结束⽇期为21号及之前的数据,【保留】,下图绿⾊对于22⽇有效的数据,【更新】,其数据⼜分两种
⼀种是结束⽇期是22⽇的,把结束⽇期 = '9999-12-31'即可,下图黄⾊
⼀种是22⽇前创建,22⽇后还有效的数据,把结束⽇期 = '9999-12-31',下图蓝⾊对于22⽇后产⽣的数据,【删除】,下图红⾊
所以,拉链表的回滚过程的增删改就如下图所⽰:
具体操作
创建三个临时表,分别保存 T-N、T⽇的数据,最后合到⼀张结果表中。
-- 1. 绿⾊,保留
DROP TABLE t_dw_orders_his_tmp1;CREATE TABLE t_dw_orders_his_tmp1AS
SELECT   orderid,  createtime,  modifiedtime,  status,
  dw_start_date,  dw_end_dateFROM
  t_dw_orders_hisWHERE
  dw_end_date < '2015-08-22';
-- 2. 黄⾊,更新-当⽇⽣效的数据
DROP TABLE t_dw_orders_his_tmp2;CREATE TABLE t_dw_orders_his_tmp2 AS
SELECT     orderid,
  createtime,     modifiedtime,     status,   
  dw_start_date,   
  '9999-12-31' AS dw_end_date FROM
  t_dw_orders_hisWHERE
  dw_end_date = '2015-08-22';
-- 2. 蓝⾊,更新-22号前到22号后还⽣效的数据DROP TABLE t_dw_orders_his_tmp3;CREATE TABLE t_dw_orders_his_tmp3AS
SELECT   orderid,  createtime,  modifiedtime,  status,
  dw_start_date,
  '9999-12-31' dw_end_dateFROM
  t_dw_orders_hisWHERE
  dw_start_date <= '2015-08-22' AND dw_end_date > '2015-08-22';
-- 4. 数据插⼊到新表
CREATE TABLE t_dw_orders_his_newAS
select * from (
SELECT a.* ,'绿⾊,保留,号前的数据' FROM t_dw_orders_his_tmp1 a    UNION ALL
SELECT b.*,'黄⾊,更新-当⽇⽣效的数据' FROM t_dw_orders_his_tmp2 b    UNION ALL
SELECT c.*,'蓝⾊,更新-22号前到22号后还⽣效的数据' FROM t_dw_orders_his_tmp3 c ) a
ORDER BY a.orderid,a.dw_start_date;
总结
拉链表的出现是为了压缩存储和记录数,针对频繁更新的数据会很有效。但是回滚操作不⽅便,查询的时候也必须要指定时间才能正确取数,操作成本⾼,得权衡利弊后再确认是否合适⾃⼰使⽤。