Example: building an mlflow tracking server and calling the mlflow API

1 Build mlflow tracking server

1.1 Build MinIO

The purpose of building MinIO is to give mlflow an S3-compatible storage backend for model artifacts. In this example, mlflow's metadata store uses MySQL.

step 1 Install and start the Docker service

If the Docker service is already installed, skip this step; if not:

# Set up the yum repository
yum install -y yum-utils device-mapper-persistent-data lvm2
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install the stable release of Docker
yum install docker-ce
# Start Docker
systemctl start docker && systemctl enable docker
# Check that Docker was installed successfully
docker version

step 2 Pull and start the MinIO container

nohup docker run -p 9000:9000 \
  -e "MINIO_ACCESS_KEY=AKIAIOSFODNN7EXAMPLE" \
  -e "MINIO_SECRET_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  minio/minio server /data &

step 3 Log in to the UI and create a bucket

(1) Visit http://localhost:9000 (replace localhost with the server's IP when accessing remotely) to reach the MinIO UI:

[Screenshot: MinIO login page]

(2) Click the + button in the lower right corner to create a bucket and name it mlflow; it will be used in the mlflow configuration below:

[Screenshots: creating and naming the mlflow bucket in the MinIO UI]
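
The bucket can also be created programmatically instead of through the UI. A minimal boto3 sketch, assuming the same access keys and endpoint as above (boto3 is installed in section 1.2, step 3):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # the MinIO address from step 2
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
s3.create_bucket(Bucket="mlflow")       # the bucket mlflow will use as its artifact root
print(s3.list_buckets()["Buckets"])     # verify the bucket was created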

1.2 Start mlflow server

step 1 Prerequisite: conda is already installed on the server

step 2 Install the MySQL service

# Install MySQL
yum install mysql
yum install mysql-devel
wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum install mysql-server

# Start the MySQL service (the unit installed by mysql-community-server is mysqld)
systemctl start mysqld

# Create the database
mysql -u root -p
# There is no password initially; just press Enter
mysql> CREATE DATABASE mlflow_test;
# Type exit (or Ctrl+D) to leave the mysql shell

step 3 Create a virtual environment and install dependencies

# Create the virtual environment
conda create -n mlflow-1.11.0 python=3.6
conda activate mlflow-1.11.0

# Install mlflow server and its dependencies
pip install mlflow==1.11.0
pip install mysqlclient
pip install boto3
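
Optionally, verify from this environment that the MySQL backend from step 2 is reachable before starting the server (a minimal sketch, assuming the empty root password):

import MySQLdb  # provided by the mysqlclient package installed above

conn = MySQLdb.connect(host="localhost", user="root", passwd="", db="mlflow_test")
print(conn.get_server_info())  # prints the MySQL server version if the connection works
conn.close()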

step 4 Add environment variables

(1) Open ~/.bashrc with vi, append the following lines, then save and exit:

# mlflow server configuration
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE 
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000

(2) Apply the changes:

source ~/.bashrc
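
mlflow's S3 artifact client (boto3) reads these variables from the process environment at run time; a quick sanity check from Python (a minimal sketch):

import os

# after `source ~/.bashrc`, all three should print non-empty values
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "MLFLOW_S3_ENDPOINT_URL"):
    print(var, "=", os.environ.get(var))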

step 5 Start mlflow server

nohup mlflow server --backend-store-uri mysql://root:@localhost/mlflow_test --host 0.0.0.0 -p 5002 --default-artifact-root s3://mlflow &

[Note] --backend-store-uri points at the MySQL service on this server. If your MySQL root account has a password, include it in the URI (e.g. mysql://root:yourpassword@localhost/mlflow_test); in this example no password was set, so it is left empty. --default-artifact-root points at the mlflow bucket created in section 1.1.

# Check the nohup log to confirm the server started correctly
tail -f nohup.out
[2021-06-08 10:33:50 +0800] [51952] [INFO] Worker exiting (pid: 51952)
[2021-06-08 10:33:50 +0800] [51951] [INFO] Parent changed, shutting down: <Worker 51951>
[2021-06-08 10:33:50 +0800] [51951] [INFO] Worker exiting (pid: 51951)
[2021-06-08 11:02:07 +0800] [53500] [INFO] Starting gunicorn 20.1.0
[2021-06-08 11:02:07 +0800] [53500] [INFO] Listening at: http://0.0.0.0:5002 (53500)
[2021-06-08 11:02:07 +0800] [53500] [INFO] Using worker: sync
[2021-06-08 11:02:07 +0800] [53503] [INFO] Booting worker with pid: 53503
[2021-06-08 11:02:07 +0800] [53567] [INFO] Booting worker with pid: 53567
[2021-06-08 11:02:07 +0800] [53568] [INFO] Booting worker with pid: 53568
[2021-06-08 11:02:07 +0800] [53569] [INFO] Booting worker with pid: 53569

Visit http://serverip:5002 (serverip is your server's IP) to view the mlflow server UI:

[Screenshot: mlflow server UI home page]
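
Besides the UI, the server can be smoke-tested from Python with the tracking client (a minimal sketch; replace serverip with your server address):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://serverip:5002")
# a freshly started server should contain at least the "Default" experiment
for exp in client.list_experiments():
    print(exp.experiment_id, exp.name)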

2 Use mlflow API to track training

2.1 Bind the tracking address and register an experiment

[Note] If the training machine and the mlflow server are different machines, install mlflow and boto3 (pip install mlflow, pip install boto3) in the training environment, and set the environment variables from step 4 (1) of section 1.2; note that localhost must be replaced with the address of the MinIO S3 storage.

import mlflow  # import the library
mlflow.set_tracking_uri("http://xxx.xxx.xxx.xxx:5002")  # deployment address of the mlflow server; on the same machine the loopback or default address works
mlflow.set_experiment("beijing-foreign-0608")  # register or bind the experiment name

2.2 Record initial parameters

The mlflow.log_param API records the parameters of a training run; log whatever you want to keep, including hyperparameters, tricks, paths, and so on.

mlflow.log_param("key", value)

For example, my run here records the following parameters:

if __name__ == "__main__":
  
    with mlflow.start_run():     # call mlflow.start_run() first; it corresponds to one run of the experiment
        # log parameters
        mlflow.log_param("project_root", "192.168.64.22:/data1/yolov4-train-foreign/")
        mlflow.log_param("dataset_path", "192.168.64.22:/data5/0_yibiaozhutuxiang/\
            beijinggongfutuxiangshibie/yiwuxuangua-yibiaozhu/")
        mlflow.log_param("ckpt_path", "./data_gen_and_train/ckpt_0608/")
        mlflow.log_param("freeze_batch_size", "8")
        mlflow.log_param("freeze_epoch", "50")
        mlflow.log_param("freeze_learning_rate", "1e-3")
        mlflow.log_param("batch_size", "4")
        mlflow.log_param("total_epoch","250")
        mlflow.log_param("learning_rate", "2e-4")
        mlflow.log_param("tricks", "mosaic, no cosine_lr, no smooth_label")
        mlflow.log_param("input_size", "416")
        
        # start training
        model = Trainer()
        model.train()

Correspondingly, the registered experiment appears on the mlflow server UI. Each entry under the experiment corresponds to one run, which stores all the logged parameters, the run start time, and other information.
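
The run data shown in the UI can also be pulled back programmatically, e.g. to compare runs in a notebook. A minimal sketch using mlflow.search_runs, which returns a pandas DataFrame (the param columns assume the parameters logged above):

import mlflow

mlflow.set_tracking_uri("http://serverip:5002")
exp = mlflow.get_experiment_by_name("beijing-foreign-0608")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
# each row is one run; logged params appear as "params.<key>" columns
print(runs[["run_id", "params.batch_size", "params.learning_rate"]])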

2.3 Record dynamic information

In the function that runs each epoch or step, use mlflow.log_metric() to record dynamic values such as train loss, val loss, acc, and other indicators.

mlflow.log_metric("key", val, step)

For example, my code here records the loss values (this is pseudo-code; just note where the mlflow API is called):

def fit_one_epoch(model, generator, epoch, ...):
    train_loss = 0
    val_loss = 0
    
    # train
    for step_train, batch in enumerate(generator.train):
        ...
        train_loss += model.loss(batch)
        
    # validation
    for step_val, batch in enumerate(generator.val):
        model.eval()
        ...
        val_loss += model.loss(batch)
    
    # save the model
    train_loss = float(train_loss) / (step_train + 1)
    val_loss = float(val_loss) / (step_val + 1)
    torch.save(model.state_dict(), str("model_%d-tloss%.4f-vloss%.4f.pth" \
                                       % (epoch, train_loss, val_loss)))
    
    # log metrics to mlflow
    mlflow.log_metric("train_loss", train_loss, epoch)
    mlflow.log_metric("val_loss", val_loss, epoch)
        

Correspondingly, you can see the recorded metrics on the mlflow server UI page:

[Screenshot: metrics listed on the run page]

You can click a run entry to open its detail page, then click the metric marked with the red box in the figure below to monitor that indicator during training:

[Screenshots: metric detail page and the live chart of the selected metric]

2.4 Record output

Use mlflow.log_artifacts("output") to record outputs and upload them to the S3 model storage server. The function's argument is a local directory path; it can be the directory where model weights, logs, or experiment images are saved.

mlflow.log_artifacts(self.ckpt_path)
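
For a single file there is also mlflow.log_artifact (singular). A short sketch inside a run context (./train.log is a hypothetical file name):

with mlflow.start_run():
    # upload every file under a local directory (the ckpt path logged above)
    mlflow.log_artifacts("./data_gen_and_train/ckpt_0608/")
    # upload one file into a "logs" subfolder of the run's artifact store
    mlflow.log_artifact("./train.log", artifact_path="logs")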

Correspondingly, the output files under that path can be viewed on the mlflow server UI; files in supported formats (images, txt, html, pdf, etc.) can be previewed on the page:

[Screenshot: artifact preview on the run page]

At the same time, you can view the files in the storage backend associated with mlflow; in this case that is the S3-compatible store built on MinIO:

[Screenshot: run artifacts inside the mlflow bucket in the MinIO browser]
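
The same artifacts can also be listed straight from the bucket with boto3 (a sketch reusing the credentials from section 1.1):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # the MinIO address
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
# mlflow stores artifacts under <experiment_id>/<run_id>/artifacts/ in the bucket
for obj in s3.list_objects_v2(Bucket="mlflow").get("Contents", []):
    print(obj["Key"])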