Asked by: 小点点

ModelUploadOp step failing with custom prediction container


I am currently trying to deploy a Vertex pipeline in order to:


  • Train a custom model (from a custom training Python package) and dump the model artifacts (the trained model and the data preprocessor that will be used at prediction time). This step works fine, as I can see new resources being created in the storage bucket.

  • Create the model resource via ModelUploadOp. For some reason this step fails when specifying serving_container_environment_variables and serving_container_ports (see the errors section below). This is somewhat surprising, since both are required by the prediction container, and the environment variables are passed as a dict, as specified in the documentation.

    This step works just fine using the gcloud command:

    gcloud ai models upload \
        --region us-west1 \
        --display-name session_model_latest \
        --container-image-uri gcr.io/and-reporting/pred:latest \
        --container-env-vars="MODEL_BUCKET=ml_session_model" \
        --container-health-route=//health \
        --container-predict-route=//predict \
        --container-ports=5000
    

    Clearly I am getting something wrong with Vertex, and the component documentation is not much help in this case.

    from datetime import datetime
    
    import kfp
    from google.cloud import aiplatform
    from google_cloud_pipeline_components import aiplatform as gcc_aip
    from kfp.v2 import compiler
    
    PIPELINE_ROOT = "gs://ml_model_bucket/pipeline_root"
    
    
    @kfp.dsl.pipeline(name="session-train-deploy", pipeline_root=PIPELINE_ROOT)
    def pipeline():
        training_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
            project="my-project",
            location="us-west1",
            display_name="train_session_model",
            model_display_name="session_model",
            service_account="name@my-project.iam.gserviceaccount.com",
            environment_variables={"MODEL_BUCKET": "ml_session_model"},
            python_module_name="trainer.train",
            staging_bucket="gs://ml_model_bucket/",
            base_output_dir="gs://ml_model_bucket/",
            args=[
                "--gcs-data-path",
                "gs://ml_model_data/2019-Oct_short.csv",
                "--gcs-model-path",
                "gs://ml_model_bucket/model/model.joblib",
                "--gcs-preproc-path",
                "gs://ml_model_bucket/model/preproc.pkl",
            ],
            container_uri="us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
            python_package_gcs_uri="gs://ml_model_bucket/trainer-0.0.1.tar.gz",
            model_serving_container_image_uri="gcr.io/my-project/pred",
            model_serving_container_predict_route="/predict",
            model_serving_container_health_route="/health",
            model_serving_container_ports=[5000],
            model_serving_container_environment_variables={
                "MODEL_BUCKET": "ml_model_bucket/model"
            },
        )
    
        model_upload_op = gcc_aip.ModelUploadOp(
            project="and-reporting",
            location="us-west1",
            display_name="session_model",
            serving_container_image_uri="gcr.io/my-project/pred:latest",
            # When passing the following 2 arguments this step fails...
            serving_container_environment_variables={"MODEL_BUCKET": "ml_model_bucket/model"},
            serving_container_ports=[5000],
            serving_container_predict_route="/predict",
            serving_container_health_route="/health",
        )
        model_upload_op.after(training_op)
    
        endpoint_create_op = gcc_aip.EndpointCreateOp(
            project="my-project",
            location="us-west1",
            display_name="pipeline_endpoint",
        )
    
        model_deploy_op = gcc_aip.ModelDeployOp(
            model=model_upload_op.outputs["model"],
            endpoint=endpoint_create_op.outputs["endpoint"],
            deployed_model_display_name="session_model",
            traffic_split={"0": 100},
            service_account="name@my-project.iam.gserviceaccount.com",
        )
        model_deploy_op.after(endpoint_create_op)
    
    
    if __name__ == "__main__":
        ts = datetime.now().strftime("%Y%m%d%H%M%S")
        compiler.Compiler().compile(pipeline, "custom_train_pipeline.json")
        pipeline_job = aiplatform.PipelineJob(
            display_name="session_train_and_deploy",
            template_path="custom_train_pipeline.json",
            job_id=f"session-custom-pipeline-{ts}",
            enable_caching=True,
        )
        pipeline_job.submit()
    
    
    1. When specifying serving_container_environment_variables and serving_container_ports, the step fails with the following error:
    {'code': 400, 'message': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.\nInvalid value at \'model.container_spec.ports[0]\' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000', 'status': 'INVALID_ARGUMENT', 'details': [{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'model.container_spec.env[0]', 'description': 'Invalid JSON payload received. Unknown name "MODEL_BUCKET" at \'model.container_spec.env[0]\': Cannot find field.'}, {'field': 'model.container_spec.ports[0]', 'description': "Invalid value at 'model.container_spec.ports[0]' (type.googleapis.com/google.cloud.aiplatform.v1.Port), 5000"}]}]}
    

    2. When commenting out serving_container_environment_variables and serving_container_ports, the model resource gets created, but manually deploying it to an endpoint results in a failed deployment with no output logs.
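
    Reading error 1, the v1 API apparently expects model.container_spec.env to be a list of EnvVar objects and model.container_spec.ports to be a list of Port objects, i.e. presumably a request payload shaped roughly like the sketch below (my reading of the error message, not a documented schema):

    container_spec = {
        "imageUri": "gcr.io/my-project/pred:latest",
        # EnvVar objects, i.e. name/value pairs rather than a flat dict
        "env": [{"name": "MODEL_BUCKET", "value": "ml_model_bucket/model"}],
        # Port objects rather than bare integers
        "ports": [{"containerPort": 5000}],
    }

    That shape does not match the component's Dict[str, str] type hint though, so I am not sure how these arguments are meant to be passed.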


  • 1 Answer

    Anonymous user

    After some research I stumbled upon this GitHub issue. The problem is caused by a mismatch between google_cloud_pipeline_components and the Kubernetes API documentation. In this case serving_container_environment_variables is typed as Optional[Dict[str, str]], whereas it should be typed as Optional[List[Dict[str, str]]]. A similar mismatch holds for the serving_container_ports argument. Passing the arguments following the Kubernetes documentation solved the issue:

    model_upload_op = gcc_aip.ModelUploadOp(
        project="my-project",
        location="us-west1",
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        serving_container_environment_variables=[
            {"name": "MODEL_BUCKET", "value": "ml_session_model"}
        ],
        serving_container_ports=[{"containerPort": 5000}],
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )
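
    For comparison, a minimal sketch of the same upload outside the pipeline, assuming the plain google-cloud-aiplatform SDK (aiplatform.Model.upload), which accepts the flat dict and bare integer ports and builds the EnvVar/Port protos itself:

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-west1")

    model = aiplatform.Model.upload(
        display_name="session_model",
        serving_container_image_uri="gcr.io/my-project/pred:latest",
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
        # Unlike ModelUploadOp above, the SDK takes a flat dict and plain ints
        serving_container_environment_variables={"MODEL_BUCKET": "ml_session_model"},
        serving_container_ports=[5000],
    )

    So the mismatch appears to be specific to the pipeline component, which seems to forward these arguments to the API without converting them first.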