Inference General Vendor Integration Guide

In the ByteMLPerf Inference General Perf architecture, the framework and the backend are decoupled: vendors can implement their own backend and participate in the evaluation as a ByteMLPerf backend.

Backend Deployment

Create Backend

  • Create a new folder under the backends/ folder, named after the backend. All required dependencies need to be stored in this directory; for example, the GRAPHCORE backend lives in a directory named GRAPHCORE (for the specific naming rules, refer to the naming conventions below);

  • Add compile_backend_xxx.py and runtime_backend_xxx.py, where xxx is the backend name. For example, the GRAPHCORE backend needs an entry file named compile_backend_graphcore.py, which must contain a class that inherits from CompileBackend;

  • Add xxx.json, which is used to interact with the user. If there is no need to interact with the user, you can provide an empty file and return None directly from get_interact_profile in compile_backend_xxx.py;

  • Add requirements.txt listing the required environment dependencies; the framework will create a new venv for each backend and install the packages declared in its requirements.txt;

We use Graphcore as an example. The backend should contain the following files:

byte_mlperf/backends/GRAPHCORE/
├── compile_backend_graphcore.py
├── runtime_backend_graphcore.py
├── GRAPHCORE.json
└── requirements.txt

Implement CompileBackend API

For the CompileBackend base class, refer to the Compile Backend section above.

In the current version, the APIs that need to be implemented are as follows:

  • pre_optimize() Model pre-optimization interface. Pre-optimizes the model before compilation, e.g. graph sorting, shape fixing, etc. Changing the model structure is allowed, but the backend needs to cache the model after the format change so that the original model can still be loaded and run. If not required, this interface does not need to be implemented.

  • compile() Model compilation interface. Vendors that need compilation can perform model conversion and compilation here. The model format may be changed here, and the compiled product can be loaded and run by the runtime backend, or loaded and run by QS Runtime. In addition to returning the compiled product, compile() also needs to return the compilation precision, the subgraph segmentation information, and the model IO information after compilation (if compilation is not whole-graph and the runtime supports heterogeneous execution);

result = {
    "model": "ResNet50",
    "framework": "Tensorflow",
    "compile_precision": "int16_mix",
    "optimizations": {},
    "instance_count": 1,          // The number of total machines used
    "device_count": 128,          // The number of total cards used
    "input_type": ["INT16"],      // List of strings, upper case only
    "max_batch_size": 64,         // Max batch size allowed to use
    "compile_status": "success",  // Only if all subgraphs were compiled successfully
    "sg_percent": 100,
    "segments": [
        {
            "sg_idx": 0,
            "is_fallback": false,
            "input_tensor_map": {"input:0": [-1, 3, 255, 255]},
            "output_tensor_map": {"pred:0": [-1, 1024]},
            "compiled_model": [
                {
                    "compiled_bs": 1,
                    "compiled_obj": "xxx.obj",
                },
            ],
        },
    ]
}

As shown in the example above, if compilation produces multiple subgraphs, multiple segments need to be returned; if multiple batch sizes are compiled, all of them need to be listed in compiled_model.

Note that the is_fallback field indicates whether the current subgraph falls back to run on the CPU. If it is true, it usually means that the current subgraph is not placed on the accelerator card but falls back to the CPU for execution. Note: if you need to use a dataloader during compile(), refer to the ModelZoo&Dataset section above.

  • get_interact_profile() Load the interactive configuration interface. If the vendor needs the user to provide some additional information, such as compilation configuration, you can load the JSON file you added here and return a list of dicts. The framework will display the content of the profile to the user and is responsible for collecting the user's answers. If no additional information is needed, return None here.
CPU.json
[ { "name": "omp", "note": "Using OMP?", "dialog_type": "Yes/No Dialog", "type": "bool", "default": false, "depends": null }, { "name": "precision", "note": "Precision to compile the model (Example Only)", "dialog_type": "Radiolist Dialog", "options": ["FP8", "FP16"], "type": "str", "default": "FP16", "depends": null }, { "name": "batch", "note": "Batch Size", "dialog_type": "Input Dialog", "type": "str", "default": "4", "depends": null } ]

get_interact_profile receives some workload and model info, so the vendor can also generate options dynamically in this API instead of loading them from the JSON file.

  • get_best_batch_size() Interface for selecting the best batch size configuration. For some accelerator cards, there may be an optimal batch size. This interface can perform a preliminary analysis of the model and return a list of optimal batch sizes; the framework will evaluate the list returned by this interface. A minimal skeleton covering these four interfaces is sketched after this list.
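Putting the four interfaces together, below is a minimal sketch of what a hypothetical compile_backend_graphcore.py could look like. The import path of the CompileBackend base class and the exact method signatures are assumptions based on the descriptions in this guide; adapt them to the base class shipped with the framework.

# Minimal sketch of a hypothetical compile_backend_graphcore.py.
# Assumptions: the CompileBackend base class is importable as below, and
# compile()/get_interact_profile() receive the configs described in the
# Config Info section; adapt both to the framework's actual definitions.
import json
import os

from byte_mlperf.backends import compile_backend


class CompileBackendGRAPHCORE(compile_backend.CompileBackend):
    def __init__(self):
        super(CompileBackendGRAPHCORE, self).__init__()
        self.hardware_type = "GRAPHCORE"

    def pre_optimize(self, configs):
        # Optional: sort the graph, fix shapes, etc. Cache any converted model
        # so that the original model can still be loaded and run.
        return configs

    def compile(self, configs, dataloader=None):
        model_info = configs["model_info"]
        # ... vendor-specific model conversion and compilation happens here ...
        return {
            "model": model_info["name"],
            "framework": model_info["framework"],
            "compile_precision": model_info["model_precision"],
            "input_type": model_info["input_type"].split(","),
            "max_batch_size": model_info["max_batch_size"],
            "compile_status": "success",
            "sg_percent": 100,
            "segments": [],  # fill in subgraph and compiled_model details
        }

    def get_interact_profile(self, configs):
        # Load the vendor JSON shipped with this backend, or return None
        # if no interaction with the user is needed.
        profile_path = os.path.join(os.path.dirname(__file__), "GRAPHCORE.json")
        if os.path.exists(profile_path):
            with open(profile_path) as f:
                return json.load(f)
        return None

    def get_best_batch_size(self):
        # Return a list of preferred batch sizes, or None if there is no preference.
        return None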

Implement RuntimeBackend API

In the current version, the APIs that need to be implemented are as follows:

runtime_backend.py
from typing import Any, Dict


class RuntimeBackend(object):
    def __init__(self):
        self.hardware_type = 'UnKnown'
        self.need_reload = False
        self.need_quant = False

    def version(self) -> str:
        """
        Return runtime backend version details.
        """
        raise NotImplementedError("RuntimeBackend:version")

    def load(self, batch_size) -> str:
        """
        Load the compiled model for the given batch size.
        """
        raise NotImplementedError("RuntimeBackend:load")

    def get_loaded_batch_size(self) -> int:
        """
        Get the currently loaded batch size.
        """
        raise NotImplementedError("RuntimeBackend:get_loaded_batch_size")

    def predict(self, data):
        """
        Run the compiled model and return the model output corresponding to the data.
        """
        raise NotImplementedError("RuntimeBackend:predict")

    def is_qs_mode_supported(self) -> bool:
        """
        Used to check whether QSv2 Runtime is enabled.
        """
        return False

    def generate_qs_config(self) -> Dict[str, Any]:
        """
        Used only when is_qs_mode_supported returns True.
        Generate the QS config file for QSv2 Runtime.
        """
        return None

    def benchmark(self, dataloader):
        """
        Performance testing when QS mode is not enabled.
        """
        raise NotImplementedError("RuntimeBackend:benchmark")
  • load() Load the model for the given batch size. The framework passes the compiled result to the runtime backend; the loaded batch size must match one of the batch sizes returned by compile().

  • predict() Call the compiled product for a single prediction.

  • is_qs_mode_supported() Reports whether the backend is integrated with QS; if it is, the performance test can be run through QS.

  • generate_qs_config() If QS is supported, the framework calls this interface to generate the corresponding QS configuration.

  • benchmark() Before QuickSilver is ready, the framework calls this interface and hands the benchmark over to the runtime backend, which loads the compiled product inside this interface and runs the performance test. A dictionary needs to be returned, containing information such as BS, QPS, AVG Latency, and P99 Latency, as shown below.

"Performance": [ { "BS": 1, "QPS": 2, "AVG_Latency": 1.2, "P99_Latency": 15, }, { "BS": 1, "QPS": 2, "AVG_Latency": 1.2, "P99_Latency": 15, }, ],

Config Info

The config passed to compile contains three parts, and the layout in configs is as follows:

configs = {
    "workload": {...},
    "model_info": {...},
    "interact_info": {...},
}
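As a sketch, a compile() implementation might unpack these three parts as follows; the keys used below are taken from the workload, model info, and interact info examples in this section.

# Sketch of how a compile() implementation might unpack configs
# (method body of a CompileBackend subclass; keys follow the examples below).
def compile(self, configs, dataloader=None):
    workload = configs["workload"]                      # test setup: batch sizes, iterations, ...
    model_info = configs["model_info"]                  # model path, framework, inputs/outputs, ...
    interact_info = configs.get("interact_info", {})    # answers collected via get_interact_profile()

    model_path = model_info["model_path"]
    batch_sizes = workload.get("batch_sizes", [1])
    max_seq_len = interact_info.get("max_seq_len", 512)  # vendor-specific option (see Interact Info)
    # ... vendor-specific compilation using the information above ...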

Workload:

The basic workload information, read from the <model>-<framework>-<precision>.json workload description file:

bert-torch-fp32.json
{ "model": "bert-torch-fp32", //待评估的模型名称,需要与model_zoo名称对齐 "test_perf": true, //是否评估模型性能 "test_accuracy": true, //是否评估模型精度 "test_numeric": true, //精度:是否评估数字误差 "clients": 3, //性能:提交数据的客户端线程数 "iterations": 100, //性能:每个线程提交的迭代次数 "batch_sizes":[1,4,16], //性能:每个线程提交数据时的批处理大小 "fake_data": false, //性能:使用虚假数据提交数据 "data_percent": 50, //精度:用于评估精度的数据集的百分比,[1-100] }

Model Info

Information about the model itself. Examples are as follows:

bert-tf-fp16.json
{ "name": "bert-tf-fp16", //Model name, generally consistent with the json file "model_path": "model_zoo/bert/bert_fp16", //Model relative model_zoo path "framework": "Tensorflow", //training framework "framework_version": "2.4.0", //Training framework version "model_format": "saved_model", "model_precision": "FP32", "inputs": "input_ids:0,input_mask:0,segment_ids:0", "outputs": "logits:0", "input_shape": {"input_ids:0": [1, 384], "input_mask:0": [1, 384], "segment_ids:0": [1, 384]}, "input_type": "FLOAT32", "dataset_name": "squad", "max_batch_size": 64, }

Interact Info

The information that the vendor wants to collect. Examples are as follows:

{ "max_seq_len": 512, "try_dfs": true, "internal_quant_bits_int16": 12 }

These values are collected when the following file is used as the interact file:

CPU.json
{ "name": "max_seq_len", "note" : "模型Padding后的最长序列长度?", "dialog_type": "Input Dialog", "type": "int", "depends": "is_bert", }, { "name": "try_dfs", "note": "是否存在多分支共享输入的Layer?", "dialog_type": "Yes/No Dialog", "type": "bool", "depends": null, }, { "name" : "internal_quant_bits_int16", "note" : "int16时内部量化的位数,可以是12,13或14", "dialog_type": "Radiolist Dialog", "type" : "int", "default": "12", "options":["12", "13", "14"], "depends" : "accuracy_mode" }

Naming Conventions

The naming conventions are explained starting from how the user invokes the framework:

python3 launch.py --task xxx --hardware_type xxx

Workload naming convention

--task specifies the workload description file; the parameter value is the prefix (file name without extension) of the workload description file. For example, to evaluate bert-tf-fp32.json, pass --task bert-tf-fp32.
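Conceptually, the mapping from --task to the workload description file can be sketched as follows; the workloads directory name used here is an assumption, only the prefix rule comes from this guide.

import json
import os


def load_workload(task: str, workload_dir: str = "workloads") -> dict:
    # "--task bert-tf-fp32" resolves to <workload_dir>/bert-tf-fp32.json
    workload_path = os.path.join(workload_dir, task + ".json")
    with open(workload_path) as f:
        return json.load(f)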

Backend naming convention

--hardware_type

  1. The new folder is named after the --hardware_type parameter; the naming convention is uppercase, e.g. GRAPHCORE;
  2. Under the new folder, add the backend entry files runtime_backend_xxx.py/compile_backend_xxx.py, where xxx is lowercase; for GRAPHCORE, the names are runtime_backend_graphcore.py/compile_backend_graphcore.py;
  3. In the backend entry files, the backend main class names must follow RuntimeBackendXXX/CompileBackendXXX, where XXX is the uppercase backend name; for GRAPHCORE, the names are RuntimeBackendGRAPHCORE/CompileBackendGRAPHCORE (see the sketch after this list);
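The following sketch shows how these three rules fit together for a given --hardware_type value; the dynamic-import helper is illustrative only and is not necessarily how the framework actually loads backends.

import importlib


def load_compile_backend(hardware_type: str):
    # GRAPHCORE -> byte_mlperf.backends.GRAPHCORE.compile_backend_graphcore
    module_name = "byte_mlperf.backends.{}.compile_backend_{}".format(
        hardware_type, hardware_type.lower())
    # GRAPHCORE -> CompileBackendGRAPHCORE
    class_name = "CompileBackend" + hardware_type
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()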

Model naming convention

workload:model The newly added model description file needs to be consistent with the model field in the workload description file. For example, for the model defined by the bert-torch-fp32.json workload, the corresponding model description file is model_zoo/bert-torch-fp32.json. The specific fields can be adjusted as needed.

bert-torch-fp32.json
{ "model": "bert-torch-fp32", // Name of the model to be evaluated, needs to align with the model_zoo name "test_perf": true, // Whether to evaluate model performance "test_accuracy": true, // Whether to evaluate model accuracy "test_numeric": true, // Accuracy: Whether to evaluate numerical errors "clients": 3, // Performance: Number of client threads submitting data "iterations": 100, // Performance: How many iterations each thread submits "batchsizes":[1,4,8,16,32,64],// Performance: Batch size when each thread submits data "data_percent": 50, // Accuracy: Percentage of dataset used to evaluate accuracy, [1-100] "compile_only":false // Whether to only compile the model }

Dataset naming convention

model_info: dataset_name As shown in the following model_info content, "dataset_name": "squad" means that squad is the name of the folder under datasets. Therefore, the dataset name needs to be aligned with the dataset_name in the model info.

bert-tf-fp16.json
{ "name": "bert-tf-fp16", // Model name, usually consistent with the json file "model_path": "model_zoo/bert/bert_fp16", // Model path relative to model_zoo "framework": "Tensorflow", // Training framework "framework_version": "2.4.0", // Training framework version "model_format": "saved_model", // Model format "model_precision": "FP32", // Model precision "inputs": "input_ids:0,input_mask:0,segment_ids:0", // Model Inputs Name "outputs": "logits:0", // Model Outputs Name "input_shape": { // Input shape "input_ids:0": [1, 384], "input_mask:0": [1, 384], "segment_ids:0": [1, 384] }, "input_type": "FLOAT32", // Data data type "dataset_name": "squad", // Designated dataset, should align with dataset naming "max_batch_size": 64, // Maximum batch size for the model }

Vendor Integration Test

Vendors can run the following commands to test their own backend after completing the backend integration, where xxx is the newly added backend name. For details, refer to the Naming Conventions section above.

pip install -r requirements.txt
./run.sh --task resnet50-torch-fp32 --hardware_type xxx