文章导航
大三下思特奇菁英班大数据作业实验报告
思特奇大数据作业实验报告
一、 题目
用HIVE/SPARK 2种SQL方式实现以下2个需求:
1、表拼接
ods_hour 表和 dwd_bd_coordinate 表拼接(拼接条件:两张表的 lng、lat 列相等),生成的数据插入 cdm.dwd_hour 表。
-------------------------------------------------------------------------------
ods_hour + dwd_bd_coordinate => cdm.dwd_hour (拼接条件:lng、lat 列相等)
-------------------------------------------------------------------------------
2、条件查询
查询cdm.dwd_hour每个省的最新日期对应的字段,结果数据插入dwd_hour_latest 表。
例如:select * from dwd_hour_latest
province provincecode daytime
云南省 530000 20190531 <<< 注意这里的值:20190531
海省 630000 20190522
-----------------------------------------------------------------
云南省的最新时间是 2019-05-31,如果新增数据中,云南省的最新时间是 2019-06-03,那么就需要更新云南省的时间为 20190603
province provincecode daytime
云南省 530000 20190603 <<< 注意这里的值:20190603
海省 630000 20190522
二、 思路
1.表拼接
data:image/s3,"s3://crabby-images/51c03/51c03676cd746efc14596cc8f326043bc98208bb" alt="img"
2.条件查询
data:image/s3,"s3://crabby-images/348fa/348fac86ca5cac2fc7d0c25fd728324ac505103f" alt="img"
三、 操作步骤与输出结果
1. Hive
表拼接
(1) 进入hive
data:image/s3,"s3://crabby-images/08ac1/08ac14839a7a17b8aadabc32eed2e1e2340d347a" alt="img"
(2) 创建名为rzh的数据库(之前已创建没有截图,用show databases显示)
data:image/s3,"s3://crabby-images/e70da/e70daf0aba7f8a657de24b78a4ef93a264a3568f" alt="img"
(3) 建立四个表
data:image/s3,"s3://crabby-images/7052e/7052e8b331f2c43d3bdb0da7737133216a056406" alt="img"
(4) 导入数据
之前总是无法导入数据,在群里与老师交流后学到了要在hdfs中先导入文件
报错图:
data:image/s3,"s3://crabby-images/3f296/3f296b81b32dcd18f3f4459a1456863a291f281e" alt="img"
成功导入数据
data:image/s3,"s3://crabby-images/67d1f/67d1f6e1e06c70bbb583cae62ca0185d1be1ba9a" alt="img"
选上10列数据显示
data:image/s3,"s3://crabby-images/da196/da1965ba9eecf7a391bbc237aeab39b627eb826b" alt="img"
另一个表
data:image/s3,"s3://crabby-images/9cdb9/9cdb9fe0ea9662120ad2e7d9dd7c563f04b8b478" alt="img"
选上5列数据显示
data:image/s3,"s3://crabby-images/b3ab1/b3ab1319bfc4628e8c3a508429d6fde9d7760088" alt="img"
(5) 实现表拼接
data:image/s3,"s3://crabby-images/38793/387937b44e249c9bcea260a8bd6c9ee2aa898d2b" alt="img"
查看合并后的表
(6) 去除空数据
data:image/s3,"s3://crabby-images/dbb33/dbb331639a94ecd3c5c48deeff904440981a0842" alt="img"
已无空数据
data:image/s3,"s3://crabby-images/0ea50/0ea50cda0d3750a02bfc0a63825b6e782fe4736c" alt="img"
条件查询
(1) 按省份筛选最新数据
data:image/s3,"s3://crabby-images/12747/1274713f956daa32d58fe05bb35db8feb0fc9986" alt="img"
data:image/s3,"s3://crabby-images/a5c40/a5c40144689a7bd73636a98ca7db021844ede2bf" alt="img"
(2) 在日期表中找到最新数据并覆盖
data:image/s3,"s3://crabby-images/1ec49/1ec49a8aa62fe07a9cade7c028b6d6ec78d13345" alt="img"
data:image/s3,"s3://crabby-images/3941e/3941e63f5ea1b68fbd739dc222b3dae2808693c5" alt="img"
2. Spark
表拼接
(1) 进入spark
data:image/s3,"s3://crabby-images/951ff/951ffc8530fbceacea8787232778da8c1a854c2b" alt="img"
发现原来的表格还在,不用重新创建
data:image/s3,"s3://crabby-images/937e7/937e7dcf156b9d87c7b0d0e2da965a000fa2ad4b" alt="img"
(2) 拼接操作
data:image/s3,"s3://crabby-images/61ba2/61ba27503c4fc3314d63c6522568f3753859e890" alt="img"
data:image/s3,"s3://crabby-images/3409f/3409f693a90e123126ede04e18467bca76ea9a8b" alt="img"
条件查询
(1) 按省份筛选最新数据
data:image/s3,"s3://crabby-images/5639a/5639a01949d2c64149fb5c3b7f37fc49e5cd7416" alt="img"
data:image/s3,"s3://crabby-images/76d6c/76d6cdcdf9a264e7d8c2a97e059e9409df5a12c6" alt="img"
(2) 在日期表中找到最新数据并覆盖
data:image/s3,"s3://crabby-images/d955d/d955d3f03f91a7eeaf8bfac6f104d976bce93d6b" alt="img"
data:image/s3,"s3://crabby-images/64d05/64d05c1a65bb4f313b0cd3bc4b4036a9b5e0c5f6" alt="img"