- Offline data synchronization: like a freight truck. Periodically (say, once a day) it hauls an entire batch of goods from warehouse A to warehouse B. Each trip moves a large volume at once, but trips are infrequent and the data arrives with a delay.
- Online data synchronization: like a conveyor belt. The moment goods (data) are produced at warehouse A, they go onto the belt and flow continuously to warehouse B with very low latency.
DataX itself is an offline data synchronization framework built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins that slot into the synchronization framework.

- Reader: the data collection module. It reads data from the source and hands it to the Framework.
- Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.
- Framework: connects the Reader and Writer as the data transfer channel between them, and handles the core concerns of buffering, flow control, concurrency, and data conversion.
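The Reader → Framework → Writer pipeline above can be sketched in a few lines of Python. This is only a toy model of the idea, not DataX's actual Java implementation: a bounded queue stands in for the Framework's channel (its buffering and flow control), a producer thread plays the Reader, and a consumer thread plays the Writer.

```python
# Toy sketch of DataX's Framework + plugin idea (not the real implementation).
import queue
import threading

def reader(channel: queue.Queue, records):
    """Reader plugin: collect records from the source and hand them to the Framework."""
    for rec in records:
        channel.put(rec)      # blocks when the channel is full -> flow control
    channel.put(None)         # end-of-data marker

def writer(channel: queue.Queue, sink: list):
    """Writer plugin: keep pulling records from the Framework and write them out."""
    while True:
        rec = channel.get()
        if rec is None:
            break
        sink.append(rec)      # "write" to the destination

channel = queue.Queue(maxsize=4)   # the Framework's transfer channel (bounded buffer)
sink = []
source = [{"id": i, "name": f"user{i}"} for i in range(10)]

t_r = threading.Thread(target=reader, args=(channel, source))
t_w = threading.Thread(target=writer, args=(channel, sink))
t_r.start(); t_w.start()
t_r.join(); t_w.join()
print(len(sink))  # all 10 records arrived at the destination
```

Because the Reader and Writer only ever talk to the channel, either side can be swapped for a different plugin without touching the other, which is exactly what the plugin architecture buys DataX.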
https://github.com/alibaba/DataX
https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz
https://www.oracle.com/cn/java/technologies/downloads/
https://www.python.org/downloads/
plugin is DataX's plugin directory, containing the reader plugins and writer plugins:
The other directories are not covered in detail here.
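To see exactly which plugins your install ships with, you can simply enumerate the subdirectories of plugin/reader and plugin/writer. A small sketch (the "datax" path below is a placeholder for wherever you unpacked the tarball):

```python
# List the reader and writer plugins bundled with a DataX install.
# DATAX_HOME is a placeholder; point it at your unpacked datax directory.
from pathlib import Path

DATAX_HOME = Path("datax")

def list_plugins(kind: str):
    """Return plugin names under plugin/reader or plugin/writer, if present."""
    plugin_dir = DATAX_HOME / "plugin" / kind
    if not plugin_dir.is_dir():
        return []
    return sorted(p.name for p in plugin_dir.iterdir() if p.is_dir())

print("readers:", list_plugins("reader"))
print("writers:", list_plugins("writer"))
```

Each subdirectory name (e.g. mysqlreader, mysqlwriter) is exactly the `name` value you put in a job's reader/writer config.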
python /path/datax.py /path/job.json
python bin/datax.py job/job.json
```
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2025-10-16 10:12:46.279 [main] INFO  MessageSource - JVM TimeZone: GMT+08:00, Locale: zh_CN
2025-10-16 10:12:46.280 [main] INFO  MessageSource - use Locale: zh_CN timeZone: sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]
2025-10-16 10:12:46.296 [main] INFO  VMInfo - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2025-10-16 10:12:46.298 [main] INFO  Engine - the machine info =>
	osInfo:                         Windows 10 amd64 10.0
	jvmInfo:                        Oracle Corporation 21 21.0.8+12-LTS-250
	cpu num:                        20
	totalPhysicalMemory:            -0.00G
	freePhysicalMemory:             -0.00G
	maxFileDescriptorCount:         -1
	currentOpenFileDescriptorCount: -1
	GC Names [G1 Young Generation, G1 Old Generation, G1 Concurrent GC]

	MEMORY_NAME                      | allocation_size | init_size
	CodeHeap 'profiled nmethods'     | 116.31MB        | 2.44MB
	G1 Old Gen                       | 1,024.00MB      | 973.00MB
	G1 Survivor Space                | -0.00MB         | 0.00MB
	CodeHeap 'non-profiled nmethods' | 116.38MB        | 2.44MB
	Compressed Class Space           | 1,024.00MB      | 0.00MB
	Metaspace                        | -0.00MB         | 0.00MB
	G1 Eden Space                    | -0.00MB         | 51.00MB
	CodeHeap 'non-nmethods'          | 7.31MB          | 2.44MB

2025-10-16 10:12:46.331 [main] INFO  Engine -
{"setting":{"speed":{"channel":1},"errorLimit":{"record":0,"percentage":0.02}},"content":[{"reader":{"name":"streamreader","parameter":{"column":[{"value":"DataX","type":"string"},{"value":20250101,"type":"long"},{"value":"2025-01-01 00:00:00","type":"date"},{"value":true,"type":"bool"},{"value":"test","type":"bytes"}],"sliceRecordCount":10}},"writer":{"name":"streamwriter","parameter":{"print":true,"encoding":"UTF-8"}}}]}

2025-10-16 10:12:46.340 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false
2025-10-16 10:12:46.340 [main] INFO  JobContainer - DataX jobContainer starts job.
2025-10-16 10:12:46.341 [main] INFO  JobContainer - Set jobId = 0
2025-10-16 10:12:46.348 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2025-10-16 10:12:46.349 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2025-10-16 10:12:46.349 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2025-10-16 10:12:46.349 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2025-10-16 10:12:46.349 [job-0] INFO  JobContainer - Job set Channel-Number to 1 channels.
2025-10-16 10:12:46.350 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2025-10-16 10:12:46.350 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2025-10-16 10:12:46.363 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2025-10-16 10:12:46.365 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2025-10-16 10:12:46.366 [job-0] INFO  JobContainer - Running by standalone Mode.
2025-10-16 10:12:46.369 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2025-10-16 10:12:46.372 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2025-10-16 10:12:46.372 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2025-10-16 10:12:46.379 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
DataX	20250101	2025-01-01 00:00:00	true	test
2025-10-16 10:12:46.482 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[104]ms
2025-10-16 10:12:46.482 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2025-10-16 10:12:56.384 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 260 bytes | Speed 26B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2025-10-16 10:12:56.385 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2025-10-16 10:12:56.387 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2025-10-16 10:12:56.388 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2025-10-16 10:12:56.389 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2025-10-16 10:12:56.392 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: E:\develop-env\datax\hook
2025-10-16 10:12:56.395 [job-0] INFO  JobContainer -
	[total cpu info] =>
	averageCpu | maxDeltaCpu | minDeltaCpu
	-1.00%     | -1.00%      | -1.00%

	[total gc info] =>
	NAME                | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
	G1 Young Generation | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s
	G1 Old Generation   | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s
	G1 Concurrent GC    | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s

2025-10-16 10:12:56.396 [job-0] INFO  JobContainer - PerfTrace not enable!
2025-10-16 10:12:56.397 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 260 bytes | Speed 26B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2025-10-16 10:12:56.398 [job-0] INFO  JobContainer -
Job start time       : 2025-10-16 10:12:46
Job end time         : 2025-10-16 10:12:56
Total elapsed time   : 10s
Average traffic      : 26B/s
Record write speed   : 1rec/s
Total records read   : 10
Total read/write failures : 0
```
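The closing summary line printed by StandAloneJobContainerCommunicator is easy to scrape if you want to collect job metrics from DataX logs. A small sketch that pulls the record, byte, and error totals out of that line with a regular expression (the sample line is copied from the output above):

```python
# Extract throughput figures from a DataX job-summary log line.
import re

line = ("Total 10 records, 260 bytes | Speed 26B/s, 1 records/s | "
        "Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | "
        "All Task WaitReaderTime 0.000s | Percentage 100.00%")

pattern = re.compile(
    r"Total (?P<records>\d+) records, (?P<bytes>\d+) bytes.*"
    r"Error (?P<err_records>\d+) records"
)
m = pattern.search(line)
stats = {k: int(v) for k, v in m.groupdict().items()}
print(stats)  # {'records': 10, 'bytes': 260, 'err_records': 0}
```

A wrapper script can run this over each job's log and alert when err_records is non-zero or the record count is below the expected total.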
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md
```
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1                # transfer concurrency
            }
        },
        "content": [
            {
                "reader": {                 # reader plugin config
                    "name": "mysqlreader",  # plugin name; the full list is under plugin/reader
                    "parameter": {          # mysqlreader parameters
                        "username": "root", # MySQL username
                        "password": "root", # MySQL password
                        "column": [         # columns to read; here [id, name, age, gender]
                            "id",
                            "name",
                            "age",
                            "gender"
                        ],
                        "splitPk": "id",    # split key: when DataX runs concurrently it shards
                                            # the job on this column; a primary key is recommended
                        "connection": [     # connection info
                            {
                                "table": [  # table name
                                    "users"
                                ],
                                "jdbcUrl": [  # JDBC URL
                                    "jdbc:mysql://127.0.0.1:3306/lyc_test?useSSL=false"
                                ]
                            }
                        ]
                    }
                },
                "writer": {                  # writer plugin config
                    "name": "mysqlwriter",   # plugin name; the full list is under plugin/writer
                    "parameter": {           # mysqlwriter parameters
                        "writeMode": "insert",  # write mode; "insert" inserts directly and
                                                # errors out on id conflicts
                        "username": "root",  # target database username
                        "password": "root",  # target database password
                        "column": [          # columns to write
                            "id",
                            "name",
                            "age",
                            "gender"
                        ],
                        "connection": [      # connection info
                            {
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/lyc_test_copy?useSSL=false",
                                "table": [
                                    "users_copy"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```

Note that the # comments are explanatory only and are not standard JSON; it is safest to remove them from the actual job file.
python bin/datax.py ./mysql2mysql.json
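Since hand-editing annotated JSON is error-prone, one practical pattern is to build the job description in code and dump clean JSON for DataX to consume. A sketch that reproduces the job above (the helper function and its parameters are my own; the connection details are the sample values from the config):

```python
# Build a mysqlreader -> mysqlwriter DataX job file programmatically,
# so no hand-maintained JSON (or stray comments) can break the run.
# mysql_sync_job is a hypothetical helper, not part of DataX.
import json

def mysql_sync_job(src_url, dst_url, src_table, dst_table, columns, user, password):
    """Return a DataX job dict copying src_table to dst_table."""
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": user, "password": password,
                        "column": columns, "splitPk": "id",
                        "connection": [{"table": [src_table], "jdbcUrl": [src_url]}],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": user, "password": password,
                        "column": columns,
                        "connection": [{"jdbcUrl": dst_url, "table": [dst_table]}],
                    },
                },
            }],
        }
    }

job = mysql_sync_job(
    "jdbc:mysql://127.0.0.1:3306/lyc_test?useSSL=false",
    "jdbc:mysql://127.0.0.1:3306/lyc_test_copy?useSSL=false",
    "users", "users_copy", ["id", "name", "age", "gender"],
    "root", "root",
)
with open("mysql2mysql.json", "w", encoding="utf-8") as f:
    json.dump(job, f, indent=2)
# then run: python bin/datax.py ./mysql2mysql.json
```

The same helper can be called in a loop to generate one job file per table when migrating a whole schema.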
https://github.com/alibaba/DataX

