Analysis of the DSL & Conf configuration parameters of a federated learning FATE job

Keywords: Python Flask IT

1. Job concept

When using FATE to start a model training task (Job), two configuration files must be provided: the dsl file and the conf file (Task Submit Runtime Conf).
Task startup flowchart:

2. About DSL language

DSL & task submit runtime conf setting V2
In order to make the construction of task models more flexible, FATE uses a self-defined domain specific language (DSL) to describe tasks. In the DSL, various modules (such as data read/write (data_io), feature engineering, regression, classification) can be organized into a directed acyclic graph (DAG), so users can flexibly combine algorithm modules according to their own needs.

In addition, each module has different parameters to be configured, and different parties may have different parameters for the same module. To simplify this situation, for each module FATE saves the parameters of all parties in the same Submit Runtime Conf, and all parties share this configuration file. This guide shows how to create a DSL configuration file; refer to the official website for the V2 configuration reference.

3. DSL configuration description

3.1. Summary

The DSL configuration file adopts JSON format; in fact, the whole configuration file is a JSON object (dict). It is usually used to define the model training plan: it arranges the components implemented in FATE, and the training plan is executed in that order.

3.2. Components

The first level of this dict is "components", which represents each module that will be used in this task. Each independent module is defined under "components". All data needs to be fetched from the data store through the Reader module; note that the Reader module only has output.
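
For example, the Reader component from the full example in section 3.6 defines only an output block and no input:

"reader_0": {
    "module": "Reader",
    "output": {
        "data": ["data"]
    }
}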

3.3. module

Used to specify which module the component uses. The module name should be taken from the FATE ML algorithm list, and it is consistent with the file names under /fate/python/federatedml/conf/setting_conf (excluding the .json suffix).
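
For instance, the dataio_0 component from the example in section 3.6 (trimmed here to the module field) declares "module": "DataIO", which matches the DataIO.json file name under setting_conf:

"dataio_0": {
    "module": "DataIO"
}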

3.4. input

There are two input types: Data and Model.

Data input is divided into four input types:

 1. data: generally used for the data_io module, feature_engineering modules or the evaluation module;
 2. train_data: generally used for the homo_lr, hetero_lr and secure_boost modules. If the train_data field is present, the task will be recognized as a fit task;
 3. validate_data: if the train_data field is present, this field is optional. If you choose to keep this field, the data it points to will be used as the validation set;
 4. test_data: used as prediction data; if provided, the model input must be provided at the same time.

Model input is divided into two types:

 1. model: model input for components of the same type;
 2. isometric_model: specifies model input inherited from an upstream component (see the sketch below).
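
A hedged sketch of both input styles, assuming the module names HeteroFeatureSelection and HeteroLR from the FATE algorithm list; the component names, upstream references and wiring are illustrative only. The first component inherits the model of an upstream feature binning component via isometric_model; the second takes training and validation data:

"hetero_feature_selection_0": {
    "module": "HeteroFeatureSelection",
    "input": {
        "data": {
            "data": ["intersection_0.data"]
        },
        "isometric_model": ["hetero_feature_binning_0.model"]
    },
    "output": {
        "data": ["data"],
        "model": ["model"]
    }
},
"hetero_lr_0": {
    "module": "HeteroLR",
    "input": {
        "data": {
            "train_data": ["data_split_0.train_data"],
            "validate_data": ["data_split_0.validate_data"]
        }
    },
    "output": {
        "data": ["data"],
        "model": ["model"]
    }
}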

3.5. output

Data output is divided into four output types:

1. data: general module data output;
2. train_data: only for the DataSplit module;
3. validate_data: only for the DataSplit module;
4. test_data: only for the DataSplit module; a sketch of a data-splitting component follows below.

Model output is of only one type:

1. model;
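
A hedged sketch of a data-splitting component producing all three split outputs, assuming the module name DataSplit; the component name and upstream reference are illustrative only:

"data_split_0": {
    "module": "DataSplit",
    "input": {
        "data": {
            "data": ["dataio_0.data"]
        }
    },
    "output": {
        "data": ["train_data", "validate_data", "test_data"]
    }
}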

3.6. DSL configuration example

In training mode, users can replace HeteroSecureBoost with other algorithm modules; note that the component name hetero_secureboost_0 should then be changed accordingly.

 {
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "dataio_0": {
            "module": "DataIO",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "dataio_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_secureboost_0": {
            "module": "HeteroSecureBoost",
            "input": {
                "data": {
                    "train_data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "evaluation_0": {
            "module": "Evaluation",
            "input": {
                "data": {
                    "data": [
                        "hetero_secureboost_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        }
    }
}

4. Submit Runtime Conf

In DSL V2, this file is the job running configuration file, which is mainly composed of the following five parts:

  • dsl_version
  • initiator
  • role
  • job_parameters
  • component_parameters

4.1. dsl version:

The configuration version is 1 by default and 2 is recommended;

"dsl_version": "2"

4.2. initiator:

The user needs to define the initiator.

1. Initiator: includes the role and party_id of the task initiator, for example:

"initiator": {
    "role": "guest",
    "party_id": 9999
}

4.3. role:

All participants: contains the information of each participant. In the role field, each element represents a role and the party_ids undertaking that role. The party_ids of each role are given as a list, because a task may involve multiple parties playing the same role. For example:

"role": {
    "guest": [9999], 
    "host": [10000],
    "arbiter": [10000]
}

4.4. job parameters:

Configure the main system parameters during job operation. Parameter application scope policy:

1. Apply to all participants, using the common scope identifier;

2. Apply only to one party, using the role scope identifier; role:party_index locates the specified party, and a directly specified parameter takes precedence over the common parameter.

Example:

"common": {
}

"role": {
  "guest": {
    "0": {
    }
  }
}

The parameters under common are applied to all participants, and the parameters under the role-guest-0 configuration are applied to the participant with subscript 0 of the guest role.
Note that in the current version, applying system operation parameters to only one participant has not been strictly tested, so it is recommended to use common first.
Detailed description of the job parameters:

| Configuration item | Default value | Supported values | Description |
|---|---|---|---|
| job_type | train | train, predict | Task type |
| work_mode | 0 | 0, 1 | 0 represents the single-party standalone version, 1 represents the multi-party distributed version |
| backend | 0 | 0, 1, 2 | 0 represents EGGROLL, 1 represents SPARK plus RabbitMQ, 2 represents SPARK plus Pulsar |
| model_id | - | - | Model id, required for prediction tasks |
| model_version | - | - | Model version, required for prediction tasks |
| task_cores | 4 | positive integer | Total CPU cores requested by the job |
| task_parallelism | 1 | positive integer | Task parallelism |
| computing_partitions | number of CPU cores allocated to the task | positive integer | Number of partitions of the data table during computation |
| eggroll_run | none | processors_per_node, etc. | Configuration parameters related to the EGGROLL computing engine; generally not configured and calculated automatically from task_cores; if configured, the task_cores parameter does not take effect |
| spark_run | none | num-executors, executor-cores, etc. | Configuration parameters related to the SPARK computing engine; generally not configured and calculated automatically from task_cores; if configured, the task_cores parameter does not take effect |
| rabbitmq_run | none | queue, exchange, etc. | Configuration parameters used by RabbitMQ to create queues and exchanges; generally not configured, the system defaults are used |
| pulsar_run | none | producer, consumer, etc. | Configuration parameters used by Pulsar when creating producers and consumers; generally not configured |
| federated_status_collect_type | PUSH | PUSH, PULL | Multi-party job status collection mode; PUSH means each participant actively reports to the initiator, PULL means the initiator periodically pulls from each participant |
| timeout | 259200 (3 days) | positive integer | Task timeout in seconds |
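
For example, a prediction job must fill in model_id and model_version. A hedged sketch of the job_parameters fragment, reusing the illustrative model identifiers from the full example later in this article:

"job_parameters": {
  "common": {
    "job_type": "predict",
    "work_mode": 0,
    "backend": 0,
    "model_id": "guest-9999#host-10000#model",
    "model_version": "202108310831349550536"
  }
}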

The following common parameters are explained in detail:

4.4.1. backend parameter:

1. The three types of engines have certain support dependencies; for example, the Spark computing engine currently only supports HDFS as the intermediate data storage engine;
2. work_mode + backend will automatically generate the corresponding configurations of the three engines (computing, storage, federation) according to these support dependencies;
3. Developers can also implement an adapted engine themselves and configure it in the runtime conf;

There are four reference configurations:

1. use eggroll as backend, with the default CPU allocation policy;
2. use eggroll as backend, directly specifying the CPU and related parameters;
3. use spark plus RabbitMQ as backend, directly specifying the CPU and related parameters;
4. use spark plus Pulsar as backend;

Example (configuration 3: spark plus RabbitMQ; a sketch of the eggroll configurations 1 and 2 follows after it):

"job_parameters": {
  "common": {
    "job_type": "train",
    "work_mode": 1,
    "backend": 1,
    "spark_run": {
      "num-executors": 1,
      "executor-cores": 2
    },
    "task_parallelism": 2,
    "computing_partitions": 8,
    "timeout": 36000,
    "rabbitmq_run": {
      "queue": {
        "durable": true
      },
      "connection": {
        "heartbeat": 10000
      }
    }
  }
}
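
For comparison, hedged sketches of configurations 1 and 2 with the eggroll backend (backend 0): the first relies on the default CPU allocation driven by task_cores, the second specifies the per-node processor count directly via eggroll_run, using the eggroll.session.processors.per.node key that appears in the full example later in this article; all values are illustrative.

Configuration 1 (default CPU allocation):

"job_parameters": {
  "common": {
    "job_type": "train",
    "work_mode": 1,
    "backend": 0,
    "task_cores": 4,
    "task_parallelism": 1,
    "computing_partitions": 4,
    "timeout": 36000
  }
}

Configuration 2 (directly specified CPU):

"job_parameters": {
  "common": {
    "job_type": "train",
    "work_mode": 1,
    "backend": 0,
    "eggroll_run": {
      "eggroll.session.processors.per.node": 4
    },
    "task_parallelism": 2,
    "computing_partitions": 8,
    "timeout": 36000
  }
}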

Refer to the official documentation for details.

4.4.2. Detailed description of resource management

Since version 1.5.0, in order to further manage resources, FATE Flow enables a finer-grained CPU core management policy, removing the previous version's policy of limiting the number of jobs running at the same time.

This covers total resource allocation, job resource calculation and resource scheduling; refer to the official documentation for details.

4.5. component_parameters: component running parameters

Parameter application range policy settings:

1.  Apply to all participants, using the common scope identifier;
2.  Apply only to one party, using the role scope identifier; role:party_index locates the designated party, and directly specified parameters take precedence over the common parameters;
  • Example 1:
"commom": {
}

"role": {
  "guest": {
    "0": {}
  }
  "host":{
    "0": {}
  }
}

The parameters under the common configuration are applied to all participants, and the parameters under the role-guest-0 configuration are applied to the participant with subscript 0 of the guest role. Note that the component running parameters of the current version support both application scope policies;

  • Example 2:
"component_parameters": {
  "common": {
    "intersection_0": {
      "intersect_method": "raw",
      "sync_intersect_ids": true,
      "only_output_key": false
    },
    "hetero_lr_0": {
      "penalty": "L2",
      "optimizer": "rmsprop",
      "alpha": 0.01,
      "max_iter": 3,
      "batch_size": 320,
      "learning_rate": 0.15,
      "init_param": {
        "init_method": "random_uniform"
      }
    }
  },
  "role": {
    "guest": {
      "0": {
        "reader_0": {
          "table": {"name": "breast_hetero_guest", "namespace": "experiment"}
        },
        "dataio_0":{
          "with_label": true,
          "label_name": "y",
          "label_type": "int",
          "output_format": "dense"
        }
      }
    },
    "host": {
      "0": {
        "reader_0": {
          "table": {"name": "breast_hetero_host", "namespace": "experiment"}
        },
        "dataio_0":{
          "with_label": false,
          "output_format": "dense"
        }
      }
    }
  }
}

Example Parameter Description:

  • The component names used in the above example are those defined in the DSL configuration file and correspond to it;
  • The running parameters of the intersection_0 and hetero_lr_0 components are placed in the common scope and applied to all participants;
  • The running parameters of the reader_0 and dataio_0 components are configured per participant, because the input parameters of different participants are usually inconsistent, so these two components are generally set per participant;

4.5.1. Configuration of multiple hosts

For a multi-host task, all host party_ids shall be listed under the role field, and the different configurations of each host shall be listed under its corresponding index, as in the sketch below;
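
A hedged sketch of a two-host configuration; the party ids and table names are illustrative only. Each host appears in the host list of the role field, and its own parameters are placed under its index in component_parameters:

"role": {
  "guest": [9999],
  "host": [10000, 10001],
  "arbiter": [10000]
}

"component_parameters": {
  "role": {
    "host": {
      "0": {
        "reader_0": {
          "table": {"name": "breast_hetero_host_1", "namespace": "experiment"}
        }
      },
      "1": {
        "reader_0": {
          "table": {"name": "breast_hetero_host_2", "namespace": "experiment"}
        }
      }
    }
  }
}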

===================================================================================
The above is the composition of a complete Submit Runtime Conf configuration file. The following is an example of Submit Runtime Conf configuration:

{
    "dsl_version": "2",
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "job_parameters": {
        "job_type": "train",
        "work_mode": 0,
        "backend": 0,
        "computing_engine": "STANDALONE",
        "federation_engine": "STANDALONE",
        "storage_engine": "STANDALONE",
        "engines_address": {
            "computing": {
                "nodes": 1,
                "cores_per_node": 20
            },
            "federation": {
                "nodes": 1,
                "cores_per_node": 20
            },
            "storage": {
                "nodes": 1,
                "cores_per_node": 20
            }
        },
        "federated_mode": "SINGLE",
        "task_parallelism": 1,
        "computing_partitions": 4,
        "federated_status_collect_type": "PULL",
        "model_id": "guest-9999#host-10000#model",
        "model_version": "202108310831349550536",
        "eggroll_run": {
            "eggroll.session.processors.per.node": 4
        },
        "spark_run": {},
        "rabbitmq_run": {},
        "pulsar_run": {},
        "adaptation_parameters": {
            "task_nodes": 1,
            "task_cores_per_node": 4,
            "task_memory_per_node": 0,
            "request_task_cores": 4,
            "if_initiator_baseline": false
        }
    },
    "component_parameters": {
        "role": {
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_hetero_host",
                            "namespace": "experiment"
                        }
                    },
                    "dataio_0": {
                        "with_label": false
                    }
                }
            }
        },
        "common": {
            "dataio_0": {
                "with_label": true
            },
            "hetero_secureboost_0": {
                "task_type": "classification",
                "objective_param": {
                    "objective": "cross_entropy"
                },
                "num_trees": 5,
                "bin_num": 16,
                "encrypt_param": {
                    "method": "iterativeAffine"
                },
                "tree_param": {
                    "max_depth": 3
                }
            },
            "evaluation_0": {
                "eval_type": "binary"
            }
        }
    }
}

5. Principle of the FATE Flow job processing flow

  1. After a job is submitted, FATE Flow obtains the job dsl file and the job config file (Submit Runtime Conf) and stores them in the corresponding fields of the t_job table in the database and in the /fate/jobs/$job_id/ directory;
  2. It parses the job dsl and job config, generates fine-grained parameters from the merged parameters (the three engine configurations corresponding to the backend & work_mode described above), and fills in the default values of the parameters;
  3. It distributes and stores the common configuration to all participants, and generates a job_runtime_on_party_conf according to the actual information of each participant;
  4. When each participant receives a task, it executes based on its own job_runtime_on_party_conf;

The $job_id directory includes the following files:

	1. job_dsl.json
	2. job_runtime_conf.json
	3. local pipeline_dsl.json
	4. train_runtime_conf.json
