DataX -- configuration resolution

Keywords: Java

DataX is Alibaba's offline synchronization tool for heterogeneous data sources. For a detailed introduction and usage instructions, see the Introduction and Quick Start on the official website. This DataX series walks through the whole operating principle in more detail.

Configuration

DataX configuration parsing involves three files: job.json, core.json and plugin.json. All three are multi-level JSON configurations. Take a JSON document where a.b.c = d: to read it with an ordinary JSON library, we would first fetch the JSON under key a, then the JSON under key b, and finally the value d under key c, which makes the code very cumbersome to write.

DataX provides a Configuration class that flattens JSON so values can be read directly by a dot-separated path. Let's take a look at the following example.

public static String JSON = "{'a': {'b': {'c': 'd'}}}";

public static void main(String[] args) {
    Configuration configuration = Configuration.from(JSON);
    System.out.println(configuration.get("a.b"));
    System.out.println(configuration.get("a.b.c"));
    System.out.println(configuration.get("a.b.d"));
}

The running results are as follows. As you can see, multi-level JSON data can be fetched easily through Configuration. Besides get, there are methods such as merge, typed getters like getString, and getNecessaryValue, which are not covered here.

{"c":"d"}
d
null
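To make the flattening idea concrete, here is a minimal sketch of dot-path resolution over nested maps. This is illustrative only, not DataX's actual Configuration implementation (which additionally supports list indices and typed getters); the class name FlatConfig is made up for this example.

```java
import java.util.Map;

public class FlatConfig {
    private final Map<String, Object> root;

    public FlatConfig(Map<String, Object> root) {
        this.root = root;
    }

    // Resolve a dot-separated path like "a.b.c" against nested maps,
    // returning null if any segment is missing.
    @SuppressWarnings("unchecked")
    public Object get(String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<String, Object>) current).get(key);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Same shape as the JSON {'a': {'b': {'c': 'd'}}} above.
        FlatConfig cfg = new FlatConfig(Map.of("a", Map.of("b", Map.of("c", "d"))));
        System.out.println(cfg.get("a.b"));   // prints {c=d}
        System.out.println(cfg.get("a.b.c")); // prints d
        System.out.println(cfg.get("a.b.d")); // prints null
    }
}
```

The walk simply descends one map level per path segment, which is exactly the repetitive lookup code the real Configuration class saves you from writing.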

job.json

job.json is the job configuration file. Before the task runs, the full path of this file is passed in as a parameter, so its name can be customized.

The main configuration items are job.content.reader, job.content.writer and job.setting.speed. For the reader and writer, you can refer to the resources/plugin_job_template.json file in each corresponding module, or obtain the template directly through the command line (Quick Start has examples of this). The file mainly specifies which reader is used to read data, which writer is used to write data, and the related configuration of that reader and writer.

setting.speed mainly controls the flow rate (throughput), which will be explained in detail later.

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
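Paths into job.json also cross the content array, so path resolution needs to handle list indices such as job.content[0].reader.name. The sketch below extends the dot-path idea with a bracket-index syntax; it is a simplified illustration under that assumption, not DataX's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobPath {
    // Matches a segment with a list index, e.g. "content[0]".
    private static final Pattern INDEXED = Pattern.compile("(\\w+)\\[(\\d+)\\]");

    // Resolve a path like "job.content[0].reader.name" against nested maps and lists.
    @SuppressWarnings("unchecked")
    public static Object get(Map<String, Object> root, String path) {
        Object cur = root;
        for (String part : path.split("\\.")) {
            Matcher m = INDEXED.matcher(part);
            if (m.matches()) {
                // First descend by key, then pick the list element by index.
                cur = ((Map<String, Object>) cur).get(m.group(1));
                cur = ((List<Object>) cur).get(Integer.parseInt(m.group(2)));
            } else {
                cur = ((Map<String, Object>) cur).get(part);
            }
            if (cur == null) {
                return null;
            }
        }
        return cur;
    }

    public static void main(String[] args) {
        // Same shape as the job.json example above.
        Map<String, Object> content = Map.of(
                "reader", Map.of("name", "streamreader", "parameter", Map.of()),
                "writer", Map.of("name", "streamwriter", "parameter", Map.of()));
        Map<String, Object> root = Map.of("job", Map.of(
                "content", List.of(content),
                "setting", Map.of("speed", Map.of("channel", 5))));
        System.out.println(get(root, "job.content[0].reader.name")); // prints streamreader
        System.out.println(get(root, "job.setting.speed.channel"));  // prints 5
    }
}
```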

core.json

Its full path is DATAX_HOME/conf/core.json. It configures global information, such as the number of channels per taskGroup; type conversion is also configured here.

{
    "entry": {
        "jvm": "-Xms1G -Xmx1G",
        "environment": {}
    },
    "common": {
        "column": {
            "datetimeFormat": "yyyy-MM-dd HH:mm:ss",
            "timeFormat": "HH:mm:ss",
            "dateFormat": "yyyy-MM-dd",
            "extraFormats":["yyyyMMdd"],
            "timeZone": "GMT+8",
            "encoding": "utf-8"
        }
    },
    "core": {
        "dataXServer": {
            "address": "http://localhost:7001/api",
            "timeout": 10000,
            "reportDataxLog": false,
            "reportPerfLog": false
        },
        "transport": {
            "channel": {
                "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel",
                "speed": {
                    "byte": -1,
                    "record": -1
                },
                "flowControlInterval": 20,
                "capacity": 512,
                "byteCapacity": 67108864
            },
            "exchanger": {
                "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger",
                "bufferSize": 32
            }
        },
        "container": {
            "job": {
                "reportInterval": 10000
            },
            "taskGroup": {
                "channel": 5
            },
            "trace": {
                "enable": "false"
            }

        },
        "statistics": {
            "collector": {
                "plugin": {
                    "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector",
                    "maxDirtyNumber": 10
                }
            }
        }
    }
}

plugin.json

The full path of plugin.json is DATAX_HOME/plugin/reader/streamreader/plugin.json; streamreader here corresponds to the reader name in the job.json above.

The main contents of this file are name and class, where class is the plug-in class to be instantiated at runtime. Since there is both a reader and a writer, two plugin.json files are loaded.

{
    "name": "streamreader",
    "class": "com.alibaba.datax.plugin.reader.streamreader.StreamReader",
    "description": {
        "useScene": "only for developer test.",
        "mechanism": "use datax framework to transport data from stream.",
        "warn": "Never use it in your real job."
    },
    "developer": "alibaba"
}
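The class field is a fully qualified class name, so a framework can instantiate the plug-in by reflection. The sketch below shows that mechanism in miniature; the Plugin interface and the nested StreamReader stand-in are invented for this example and are not DataX's real plug-in hierarchy.

```java
public class PluginLoader {
    // Stand-in for the real plug-in base types in DataX.
    public interface Plugin {
        String name();
    }

    // Dummy plug-in playing the role of the class named in plugin.json.
    public static class StreamReader implements Plugin {
        @Override
        public String name() {
            return "streamreader";
        }
    }

    // Load and instantiate a plug-in by its fully qualified class name,
    // the way the "class" field of plugin.json is used at runtime.
    public static Plugin load(String className) throws Exception {
        Class<?> clazz = Class.forName(className);
        return (Plugin) clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Plugin p = load("PluginLoader$StreamReader");
        System.out.println(p.name()); // prints streamreader
    }
}
```

In the real framework each plug-in also gets its own classloader so that reader and writer dependencies do not clash, but the Class.forName-then-instantiate step is the core of it.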

After loading, the three files job.json, core.json and plugin.json are combined through the merge method, so the final Configuration holds the merged information of all of them; the plug-ins are then started based on this Configuration.
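The merge step can be pictured as a recursive deep merge of nested maps, where the later configuration wins on conflicting keys. This is only a sketch of the idea; DataX's actual Configuration.merge works on flattened paths and takes a flag controlling conflict behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigMerge {
    // Recursively merge "update" into "base"; on a key conflict,
    // nested maps are merged and scalar values from "update" win.
    @SuppressWarnings("unchecked")
    public static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> update) {
        Map<String, Object> out = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> e : update.entrySet()) {
            Object old = out.get(e.getKey());
            if (old instanceof Map && e.getValue() instanceof Map) {
                out.put(e.getKey(), merge((Map<String, Object>) old, (Map<String, Object>) e.getValue()));
            } else {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Fragments shaped like core.json and job.json above.
        Map<String, Object> core = Map.of("core",
                Map.of("container", Map.of("taskGroup", Map.of("channel", 5))));
        Map<String, Object> job = Map.of("job",
                Map.of("setting", Map.of("speed", Map.of("channel", 5))));
        Map<String, Object> merged = merge(core, job);
        System.out.println(merged.keySet()); // prints [core, job]
    }
}
```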

Posted by eurozaf on Fri, 26 Nov 2021 02:30:07 -0800