点击上方
蓝字关注我们
如果你有将源系统 MySQL 表数据全量抽取到 Hive 中作为 ODS 层,不保留历史变化这样的需求,不妨看看这篇文章。
create table T_YYBZB_TGH_BANKINFO(id int(8),bank_id int(8),bank_name varchar(200));insert into T_YYBZB_TGH_BANKINFO (ID, BANK_ID, BANK_NAME)values (11, 11, '工商银行(广州)');
01
创建目标表
create table ods.ods_t_yybzb_tgh_bankinfo_di(id int,bank_id int,bank_name string);
02
编写相关脚本
mysql_username=rootmysql_password=123456mysql_ip=192.168.6.102mysql_port=3306mysql_sid=sourcehadoop102_ip=192.168.6.102hadoop102_port=8020
{"job": {"content": [{"reader": {"name": "mysqlreader","parameter": {"connection": [{"jdbcUrl": ["jdbc:mysql://$ip:$port/$sid"],"querySql": ["select id,bank_id,bank_name from T_YYBZB_TGH_BANKINFO"],}],"password": "$password","username": "$username"}},"writer": {"name": "hdfswriter","parameter": {"defaultFS": "hdfs://$hdfs_ip:$hdfs_port","fileType": "text","path": "/user/hive/warehouse/ods.db/ods_t_yybzb_tgh_bankinfo_di","fileName": "ods_t_yybzb_tgh_bankinfo_di","column": [{"name":"id","type":"int"},{"name":"bank_id","type":"int"},{"name":"bank_name","type":"string"}],"writeMode": "append","fieldDelimiter": "\t","encoding": "utf-8"}}}],"setting": {"speed": {"channel": "1"}}}}
#配置文件路径:config_file=../db_conf/db.conf#源系统名称src_system=mysql#目标系统名称tag_system=hadoop102export in_username=`grep -w ${src_system}_username ${config_file} |awk -F '=' '{print $2}'`export in_password=`grep -w ${src_system}_password ${config_file} |awk -F '=' '{print $2}'`export in_ip=`grep -w ${src_system}_ip ${config_file} |awk -F '=' '{print $2}'`export in_port=`grep -w ${src_system}_port ${config_file} |awk -F '=' '{print $2}'`export in_sid=`grep -w ${src_system}_sid ${config_file} |awk -F '=' '{print $2}'`export in_hdfs_ip=`grep -w ${tag_system}_ip ${config_file} |awk -F '=' '{print $2}'`export in_hdfs_port=`grep -w ${tag_system}_port ${config_file} |awk -F '=' '{print $2}'`pre_day=`date -d -1day +%Y%m%d`pre_day_mon=`date -d -1day +%Y%m`echo ${in_username}echo ${in_password}echo ${in_ip}echo ${in_port}echo ${in_sid}echo ${in_hdfs_ip}echo ${in_hdfs_port}echo ${pre_day}echo ${pre_day_mon}# 全量导入:hive -e "truncate table test_hive.stu;"# nsrun_workgroup.jsonpython ../../datax.py -p"-Dusername=$in_username -Dpassword=$in_password -Dip=$in_ip -Dport=$in_port -Dsid=$in_sid -Dhdfs_ip=$in_hdfs_ip -Dhdfs_port=$in_hdfs_port" ../json_conf/bank_name.json
03
DolphinScheduler配置
sh_startjson_confdb_conf
根据业务主题进行划分
工作流定义-创建工作流
配置数据源
配置sql控件(清空表)
truncate table ods.ods_t_yybzb_tgh_bankinfo_di
配置shell控件(导入数据)
第一步:上线部署
第二步:手动运行
通过工作流进入画布查看子任务运行状态(非常推荐)
通过工作流实例查看运行状态
通过甘特图查看运行状态
{"job": {"content": [{"reader": {"name": "mysqlreader","parameter": {"connection": [{"jdbcUrl": ["jdbc:mysql://${ip}:${port}/${sid}?useSSL=false"],"querySql": ["select id,bank_id,bank_name from T_YYBZB_TGH_BANKINFO"],}],"password": "${password}","username": "${username}"}},"writer": {"name": "hdfswriter","parameter": {"defaultFS": "hdfs://${hdfs_ip}:${hdfs_port}","fileType": "text","path": "/user/hive/warehouse/ods.db/ods_t_yybzb_tgh_bankinfo_di","fileName": "ods_t_yybzb_tgh_bankinfo_di","column": [{"name":"id","type":"int"},{"name":"bank_id","type":"int"},{"name":"bank_name","type":"string"}],"writeMode": "append","fieldDelimiter": "\t","encoding": "utf-8"}}}],"setting": {"speed": {"channel": "1"}}}}
SQL组件清空表
truncate table ods.ods_t_yybzb_tgh_bankinfo_di
DataX组件全量导入数据
第一步:上线部署
第二步:手动运行
SQL组件日志
[INFO] 2021-12-07 16:04:20.697 - [taskAppId=TASK-10-23-85]:[115] - create dir success /tmp/dolphinscheduler/exec/process/4/10/23/85[INFO] 2021-12-07 16:04:20.750 - [taskAppId=TASK-10-23-85]:[112] - sql task params {"postStatements":[],"connParams":"","receiversCc":"","udfs":"","type":"HIVE","title":"","sql":"truncate table ods.ods_t_yybzb_tgh_bankinfo_di","preStatements":[],"sqlType":"1","sendEmail":false,"receivers":"","datasource":7,"displayRows":10,"limit":10000,"showType":"TABLE","localParams":[]}[INFO] 2021-12-07 16:04:20.755 - [taskAppId=TASK-10-23-85]:[128] - Full sql parameters: SqlParameters{type='HIVE', datasource=7, sql='truncate table ods.ods_t_yybzb_tgh_bankinfo_di', sqlType=1, sendEmail=false, displayRows=10, limit=10000, udfs='', showType='TABLE', connParams='', title='', receivers='', receiversCc='', preStatements=[], postStatements=[]}[INFO] 2021-12-07 16:04:20.755 - [taskAppId=TASK-10-23-85]:[129] - sql type : HIVE, datasource : 7, sql : truncate table ods.ods_t_yybzb_tgh_bankinfo_di , localParams : [],udfs : ,showType : TABLE,connParams : , query max result limit : 10000[INFO] 2021-12-07 16:04:20.765 - [taskAppId=TASK-10-23-85]:[549] - after replace sql , preparing : truncate table ods.ods_t_yybzb_tgh_bankinfo_di[INFO] 2021-12-07 16:04:20.765 - [taskAppId=TASK-10-23-85]:[558] - Sql Params are replaced sql , parameters:[INFO] 2021-12-07 16:04:20.767 - [taskAppId=TASK-10-23-85]:[52] - can't find udf function resource[INFO] 2021-12-07 16:04:20.974 - [taskAppId=TASK-10-23-85]:[458] - prepare statement replace sql : org.apache.hive.jdbc.HivePreparedStatement@43ecab0e
DataX组件日志
2021-12-07 16:04:36.123 [job-0] INFO JobContainer - PerfTrace not enable!2021-12-07 16:04:36.123 [job-0] INFO StandAloneJobContainerCommunicator - Total 15 records, 134 bytes | Speed 13B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%2021-12-07 16:04:36.124 [job-0] INFO JobContainer -任务启动时刻 : 2021-12-07 16:04:25任务结束时刻 : 2021-12-07 16:04:36任务总计耗时 : 11s任务平均流量 : 13B/s记录写入速度 : 1rec/s读出记录总数 : 15读写失败总数 : 0
任务实例-操作查看运行日志
具体日志如下
工作流定义-》操作-》下线
04
常见问题
在作业流调试运行报错:无可用master节点
masterserver服务挂调
./stop-all.sh./start-all.sh
Hive本身不创建用户,用户就是linux的用户
./hiveserver2
beeline!connect jdbc:hive2://192.168.6.102:10000Enter username for jdbc:hive2://192.168.6.102:10000: atguiguEnter password for jdbc:hive2://192.168.6.102:10000: ******
# 1、检查hiveserver是否启动netstat -anp |grep 10000
启动hiveserver2
查看海豚日志
current cpu load average 0.35 is too high or available memory 0.14G is too low, under max.cpuload.avg=16.0 and reserved.memory=0.3G[WARN] 2021-12-06 17:17:45.031 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[159] - worker 192.168.6.102:1234 current cpu load average 0.42 is too high or available memory 0.14G is too low[WARN] 2021-12-06 17:17:45.032 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[159] - worker 192.168.6.102:1234 current cpu load average 0.42 is too high or available memory 0.14G is too low[WARN] 2021-12-06 17:17:50.032 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[159] - worker 192.168.6.102:1234 current cpu load average 0.42 is too high or available memory 0.14G is too low[WARN] 2021-12-06 17:17:50.032 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[159] - worker 192.168.6.102:1234 current cpu load average 0.42 is too high or available memory 0.14G is too low[WARN] 2021-12-06 17:17:51.317 org.apache.dolphinscheduler.server.registry.HeartBeatTask:[80] - current cpu load average 0.3 is too high or available memory 0.14G is too low, under max.cpuload.avg=16.0 and reserved.memory=0.3G
解决方式
在jvm里配置启动参数
日志报错没有这样的文件路径,but集群上的datax启动目录是这样的
海豚调度器的datax路径配置错误,找不到该文件
dataX在Hadoop02上部署,没有在103,104节点部署
dolphinscheduler_env.sh
添加如下内容
#JAVA_HOMEexport JAVA_HOME=/opt/module/jdk1.8.0_212export PATH=$PATH:$JAVA_HOME/bin
dolphinscheduler_env.sh
添加如下内容
#JAVA_HOMEexport JAVA_HOME=/opt/module/jdk1.8.0_212export PATH=$PATH:$JAVA_HOME/bin#HADOOP_HOMEexport HADOOP_HOME=/opt/module/hadoop-3.1.3export PATH=$PATH:$HADOOP_HOME/binexport PATH=$PATH:$HADOOP_HOME/sbin#HIVE_HOMEexport HIVE_HOME=/opt/module/hiveexport PATH=$PATH:$HIVE_HOME/bin#KAFKA_HOMEexport KAFKA_HOME=/opt/module/kafkaexport PATH=$PATH:$KAFKA_HOME/bin#HBASE_HOMEexport HBASE_HOME=/opt/module/hbaseexport PATH=$PATH:$HBASE_HOME/bin#FLINK_HOMEexport FLINK_HOME=/opt/module/flink-1.10.1export PATH=$PATH:$FLINK_HOME/bin#SPARK_HOMEexport SPARK_HOME=/opt/module/spark-3.1.2export PATH=$PATH:$SPARK_HOME/binexport PATN=$PATH:$SPARK_HOME/sbin#DATAX_HOMEexport DATAX_HOME=/opt/module/dataxexport PATH=$PATH:$DATAX_HOME/bin
xsync dolphinscheduler_env.sh
java.lang.Exception: DataX无法连接对应的数据库,可能原因是:1) 配置的ip/port/database/jdbc错误,无法连接。2) 配置的username/password错误,鉴权失败。请和DBA确认该数据库的连接信息是否正确。
"jdbcUrl": ["jdbc:mysql://$ip:$port/$sid?useSSL=false"],
05
加入我们
参与 DolphinScheduler 社区有非常多的参与贡献的方式,包括:
贡献第一个PR(文档、代码) 我们也希望是简单的,第一个PR用于熟悉提交的流程和社区协作以及感受社区的友好度。
社区汇总了以下适合新手的问题列表:https://github.com/apache/dolphinscheduler/issues/5689
非新手问题列表:https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22
如何参与贡献链接:https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html
来吧,DolphinScheduler开源社区需要您的参与,为中国开源崛起添砖加瓦吧,哪怕只是小小的一块瓦,汇聚起来的力量也是巨大的。
参与开源可以近距离与各路高手切磋,迅速提升自己的技能,如果您想参与贡献,我们有个贡献者种子孵化群,可以添加社区小助手
微信(Leonard-ds) 手把手教会您( 贡献者不分水平高低,有问必答,关键是有一颗愿意贡献的心 )。添加小助手微信时请说明想参与贡献。
来吧,开源社区非常期待您的参与。
https://dolphinscheduler.apache.org/
您的 Star,是 Apache DolphinScheduler 为爱发电的动力❤️ ~
(Leonard-ds)
☞
WorkflowAsCode 来了,Apache DolphinScheduler 2.0.2 惊喜发布!
☞恭喜 Apache DolphinScheduler 入选可信开源社区共同体预备成员!
☞Apache DolphinScheduler 获评 2021 OSC 最受欢迎项目,白鲸开源获优秀中国开源原生创企奖项!
☞看看又是谁在悄悄做贡献?
☞感谢有你!所有贡献者来领礼物了
☞一文给你整明白多租户在 Apache DolphinScheduler 中的作用
☞开源并不是大牛的专属,普通人也能有属于自己的一亩三分地
☞在 Apache DolphinScheduler 上调试 LDAP 登录,亲测有效!
☞4 亿用户,7W+ 作业调度难题,Bigo 基于 Apache DolphinScheduler 巧化解
点击阅读原文,加入开源!
点个在看你最好看


