# 08 - MLOps and Engineering Practice: 01. Experiment Tracking with MLflow
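## 1. MLflow Overview

### 1.1 What Is MLflow

MLflow is an open-source platform for managing the machine learning lifecycle, covering experiment tracking, project packaging, model management, and model registration. It was developed by Databricks to address reproducibility and traceability problems in ML development.

Core problems it addresses:

- Recording model parameters, metrics, and code versions is difficult.
- Experiment results are hard to compare and reproduce.
- Model management is disorganized.
- Deployment workflows are complex.

### 1.2 The Four Components

| Component | Function | Core concepts |
| --- | --- | --- |
| Tracking | Records parameters, metrics, and artifacts (models, plots) | runs, experiments, artifacts |
| Projects | Packages code so it can be reproduced | MLproject, conda.yaml |
| Models | Standardizes the model format | model flavor, model loader |
| Model Registry | Manages model versions and stages | registered model, version, stage |

## 2. MLflow Tracking

### 2.1 Managing Experiments and Runs

```python
import mlflow

# Set the Tracking URI (optional; defaults to a local ./mlruns directory)
mlflow.set_tracking_uri("http://localhost:5000")

# Create an experiment (raises an exception if the name already exists)
mlflow.create_experiment(
    "xgboost-experiment",
    artifact_location="/mlflow/artifacts/xgboost",
)

# Select the active experiment (switches the logging context)
mlflow.set_experiment("xgboost-experiment")
```

### 2.2 Logging Parameters, Metrics, and Artifacts

```python
import mlflow
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score

# X_train, X_test, y_train, y_test are assumed to exist already,
# for example from train_test_split on your own dataset.
with mlflow.start_run(run_name="xgboost-v1") as run:
    # Log parameters
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("n_estimators", 100)

    # Train the model
    model = XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=100)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred, average="weighted"))

    # Log artifacts (the files and directory must already exist on disk)
    mlflow.sklearn.log_model(model, "xgboost_model")
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifacts("./reports")  # logs the whole directory
```

### 2.3 Nested Runs

```python
# Nested runs suit cross-validation or hyperparameter search.
# fold_accuracy is assumed to be a sequence of five per-fold scores.
with mlflow.start_run(run_name="parent-run"):
    mlflow.log_param("model_type", "xgboost")
    for fold in range(5):
        with mlflow.start_run(run_name=f"fold-{fold}", nested=True):
            mlflow.log_metric("fold_score", fold_accuracy[fold])
```

### 2.4 Automatic Logging

```python
# TensorFlow / Keras
mlflow.tensorflow.autolog()

# PyTorch (autologging targets PyTorch Lightning training loops)
mlflow.pytorch.autolog()

# Scikit-learn
mlflow.sklearn.autolog()

# XGBoost
mlflow.xgboost.autolog()
```

## 3. MLflow Projects

### 3.1 Project Specification

```yaml
# MLproject
name: housing-price-prediction
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      max_depth: {type: int, default: 6}
      learning_rate: {type: float, default: 0.1}
      n_estimators: {type: int, default: 100}
    command: "python train.py --max_depth {max_depth} --learning_rate {learning_rate} --n_estimators {n_estimators}"
  evaluate:
    parameters:
      model_uri: {type: str}
    command: "python evaluate.py --model_uri {model_uri}"
```

### 3.2 Dependency Management

```yaml
# conda.yaml
name: mlflow-env
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.9
  - scikit-learn=1.0.2
  - xgboost=1.5.0
  - pip
  - pip:
      - mlflow
      - pandas
      - numpy
```

### 3.3 Remote Execution

```bash
# Run the project locally
mlflow run . -P max_depth=8 -P learning_rate=0.05

# Run remotely on Databricks (cluster.json holds the cluster specification)
mlflow run . --backend databricks --backend-config cluster.json

# Run on Kubernetes (also needs a backend config; the file name here is a placeholder)
mlflow run . --backend kubernetes --backend-config kubernetes_config.json
```

## 4. MLflow Models

### 4.1 Standard Model Format

```python
import mlflow
import mlflow.pyfunc

# Save a model and register it in one step
mlflow.sklearn.log_model(
    sk_model=model,
    artifact_path="model",
    registered_model_name="xgboost-regressor",
)


# Save a custom model
class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Called once when the model is loaded; restores the pickled estimator
        import joblib
        self.model = joblib.load(context.artifacts["model_path"])

    def predict(self, context, model_input):
        return self.model.predict(model_input)


mlflow.pyfunc.log_model(
    artifact_path="custom_model",
    python_model=CustomModel(),
    artifacts={"model_path": "model.pkl"},
)
```

### 4.2 Model Flavors

| Flavor | Framework | Use case |
| --- | --- | --- |
| sklearn | Scikit-learn | Classical machine learning |
| pytorch | PyTorch | Deep learning |
| tensorflow | TensorFlow/Keras | Deep learning |
| xgboost | XGBoost | Tree models |
| lightgbm | LightGBM | Tree models |
| pyfunc | Any Python function | Custom models |

### 4.3 Loading Models

A model should be loaded with the flavor it was logged with, or with the generic `pyfunc` flavor.

```python
# Load a Scikit-learn model from a run
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# Load version 1 of a registered model
# (xgboost-regressor was logged with the sklearn flavor in 4.1)
model = mlflow.sklearn.load_model("models:/xgboost-regressor/1")

# Load a custom model
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/custom_model")
```

## 5. MLflow Model Registry

### 5.1 Model Version Management

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register a model
mlflow.register_model(f"runs:/{run_id}/model", "xgboost-classifier")

# Fetch details of the model's versions
client = MlflowClient()
versions = client.search_model_versions("name='xgboost-classifier'")
```

### 5.2 Stage Transitions

Model stages are deprecated as of MLflow 2.9 in favor of model version aliases and tags; the calls below apply to versions that still support stages.

```python
# Model stages: None (default), Staging, Production, Archived

# Move a model version to another stage
client.transition_model_version_stage(
    name="xgboost-classifier",
    version=1,
    stage="Production",
)

# Load the model by stage
model = mlflow.sklearn.load_model("models:/xgboost-classifier/Production")
```

### 5.3 Model Approval Workflow

`create_webhook` is not available on `MlflowClient` in every MLflow release, and the event name shown follows the Databricks registry webhook convention, so check the API of your installed version or platform before relying on it.

```python
# Register an approval webhook
client.create_webhook(
    name="slack-notify",
    url="https://hooks.slack.com/...",
    events=["MODEL_VERSION_TRANSITIONED_STAGE"],
)

# Add a model description
client.update_registered_model(
    name="xgboost-classifier",
    description="XGBoost model for churn prediction",
)
```

## 6. MLflow Deployment

### 6.1 Local Deployment

```bash
# Start the Tracking Server
mlflow ui --host 0.0.0.0 --port 5000

# Use a SQLite backend
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
```

### 6.2 Production Deployment

```yaml
# docker-compose.yml
# Credentials below are demo values; replace them before real use.
version: "3.8"
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
    volumes:
      - postgres_data:/var/lib/postgresql/data
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio_data:/data
  mlflow-server:
    # Assumes a locally built image with mlflow, psycopg2, and boto3 installed
    image: mlflow:latest
    environment:
      # Point the S3 client at MinIO instead of AWS
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadmin
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow@postgres/mlflow
      --default-artifact-root s3://mlflow/artifacts
      --host 0.0.0.0
      --port 5000
    ports:
      - "5000:5000"
    depends_on:
      - postgres
      - minio
volumes:
  postgres_data:
  minio_data:
```

## 7. Hands-On: A Complete MLflow Workflow

```python
# full_pipeline.py
import mlflow
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_score
from xgboost import XGBClassifier

# Configuration
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("iris-classification")

# Load the data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="xgboost-iris") as run:
    # 1. Log parameters
    params = {
        "max_depth": 4,
        "learning_rate": 0.1,
        "n_estimators": 100,
        "eval_metric": "mlogloss",
        # Relevant to the XGBoost 1.5 pinned in conda.yaml; drop on newer releases
        "use_label_encoder": False,
    }
    for key, value in params.items():
        mlflow.log_param(key, value)

    # 2. Log dataset information
    mlflow.log_metric("train_samples", len(X_train))
    mlflow.log_metric("test_samples", len(X_test))

    # 3. Train the model
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)

    # 4. Validate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)

    # 5. Cross-validate
    cv_scores = cross_val_score(model, X, y, cv=5)
    mlflow.log_metric("cv_mean", cv_scores.mean())
    mlflow.log_metric("cv_std", cv_scores.std())

    # 6. Log the classification report (per-class and averaged entries)
    report = classification_report(y_test, y_pred, output_dict=True)
    for class_name, metrics in report.items():
        if isinstance(metrics, dict):
            mlflow.log_metrics({
                f"precision_{class_name}": metrics["precision"],
                f"recall_{class_name}": metrics["recall"],
                f"f1_{class_name}": metrics["f1-score"],
            })

    # 7. Log the model
    mlflow.xgboost.log_model(model, "xgboost_model")

    # 8. Log an extra artifact
    mlflow.log_dict(
        {"feature_names": list(iris.feature_names)}, "feature_names.json"
    )

    # 9. Register the model
    mlflow.register_model(
        f"runs:/{run.info.run_id}/xgboost_model",
        "iris-xgboost-classifier",
    )

    print(f"Run ID: {run.info.run_id}")
    print(f"Accuracy: {accuracy:.4f}")


# Model loading and inference
def predict_new_data(model_uri, data):
    model = mlflow.xgboost.load_model(model_uri)
    return model.predict(data)


# Use the production model. This requires a version that has already been
# moved to the Production stage (see 5.2); a newly registered version starts at None.
production_model = "models:/iris-xgboost-classifier/Production"
new_samples = X_test.head()  # stand-in for new data with the training columns
predictions = predict_new_data(production_model, new_samples)
```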
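The model URIs used when loading models come in two forms: `runs:/<run_id>/<artifact_path>` points at an artifact logged by a run, while `models:/<name>/<version or stage>` points at an entry in the Model Registry. The sketch below uses only the standard library to show how the two forms break down. `ModelRef` and `parse_model_uri` are hypothetical helpers written for illustration, not part of the MLflow API, and MLflow's own URI resolution handles more cases than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRef:
    """Decomposed model URI (hypothetical helper type, not an MLflow class)."""
    scheme: str                          # "runs" or "models"
    name: str                            # run ID or registered model name
    artifact_path: Optional[str] = None  # set only for runs:/ URIs
    version: Optional[int] = None        # set for models:/<name>/<number>
    stage: Optional[str] = None          # set for models:/<name>/<stage>


def parse_model_uri(uri: str) -> ModelRef:
    """Split a runs:/ or models:/ URI into its parts."""
    scheme, sep, rest = uri.partition(":/")
    if not sep or scheme not in ("runs", "models"):
        raise ValueError(f"unsupported model URI: {uri!r}")
    head, _, tail = rest.strip("/").partition("/")
    if not head or not tail:
        raise ValueError(f"incomplete model URI: {uri!r}")
    if scheme == "runs":
        return ModelRef("runs", head, artifact_path=tail)
    if tail.isdigit():
        # A purely numeric suffix is treated as a version number
        return ModelRef("models", head, version=int(tail))
    return ModelRef("models", head, stage=tail)


if __name__ == "__main__":
    for uri in (
        "runs:/abc123/model",
        "models:/xgboost-regressor/1",
        "models:/xgboost-classifier/Production",
    ):
        print(parse_model_uri(uri))
```

Treating a numeric suffix as a version and anything else as a stage mirrors how the examples in this article use the two `models:/` variants; it is a simplification chosen for the sketch.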
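Project parameters are passed to `mlflow run` as repeated `-P name=value` flags, and `-e` selects an entry point such as `evaluate`. When runs are launched from a script, the argument list can be assembled from a dictionary. This is a minimal sketch using only the standard library: `build_mlflow_run_command` is a hypothetical helper that only constructs the argument list and does not invoke MLflow itself.

```python
import shlex
from typing import Dict, List, Optional, Union

ParamValue = Union[int, float, str]


def build_mlflow_run_command(
    project_uri: str,
    params: Dict[str, ParamValue],
    entry_point: Optional[str] = None,
) -> List[str]:
    """Return the argv list for `mlflow run`, one -P flag per parameter."""
    argv = ["mlflow", "run", project_uri]
    if entry_point is not None:
        argv += ["-e", entry_point]
    for name, value in params.items():
        argv += ["-P", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    argv = build_mlflow_run_command(".", {"max_depth": 8, "learning_rate": 0.05})
    # Print a shell-safe rendering of the command
    print(shlex.join(argv))
    # To actually launch the project (requires MLflow and an MLproject file):
    # import subprocess; subprocess.run(argv, check=True)
```

Building a list rather than a single string keeps each argument separate, so values containing spaces or shell metacharacters do not need manual quoting when the list is handed to `subprocess.run`.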