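I'm trying to launch a Dataflow flex template. As part of the build and deploy process, I pre-build a custom SDK container image to reduce worker startup time.
I tried the following approach: when I specify sdk_container_image and also provide a requirements.txt file, the Dataflow flex template launches successfully and builds the job graph, but the workers fail to start because they lack permission to install the private packages. Here are my Dockerfiles and gcloud commands:
Flex template Dockerfile: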
FROM gcr.io/dataflow-templates-base/python39-template-launcher-base
# Create working directory
ARG WORKDIR=/flex
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*
COPY ./ ./
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/launch_pipeline.py"
# Install the pipeline dependencies
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir apache-beam[gcp]==2.41.0
RUN pip install --no-cache-dir -r requirements.txt
ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]
Worker Dockerfile:
# Set up image for worker.
FROM apache/beam_python3.9_sdk:2.41.0
WORKDIR /worker
COPY ./requirements.txt ./
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir -r requirements.txt
Building the template:
gcloud dataflow flex-template build $TEMPLATE_LOCATION \
--image "$IMAGE_LOCATION" \
--sdk-language "PYTHON" \
--metadata-file "metadata.json"
Launching the template:
gcloud dataflow flex-template run ddjanke-local-flex \
--template-file-gcs-location=$TEMPLATE_LOCATION \
--project=$PROJECT \
--service-account-email=$EMAIL \
--parameters=[OTHER_ARGS...],sdk_container_image=$WORKER_IMAGE \
--additional-experiments=use_runner_v2
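I solved this yesterday. The problem was that I passed sdk_container_image to Dataflow through the flex template and then also passed it on to PipelineOptions in my code. After I removed sdk_container_image from the options, the pipeline launched in the same job.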
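The pipeline code itself is not shown in the post, so the following is only a minimal sketch of one way to drop sdk_container_image from the arguments before they reach PipelineOptions. The helper name drop_flag and the idea of filtering sys.argv are assumptions for illustration, not the original fix.

```python
from typing import List, Sequence


def drop_flag(argv: Sequence[str], name: str) -> List[str]:
    """Return a copy of argv without `--name=value` or `--name value`.

    Hypothetical helper: it only filters the argument list and does not
    depend on Apache Beam.
    """
    flag = f"--{name}"
    cleaned: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This argument is the value that followed a bare `--name`.
            skip_next = False
            continue
        if arg == flag:
            # Space-separated form: drop the flag and the value after it.
            skip_next = True
            continue
        if arg.startswith(flag + "="):
            # `--name=value` form: drop the single combined argument.
            continue
        cleaned.append(arg)
    return cleaned


# Assumed usage inside launch_pipeline.py (requires apache-beam):
#   import sys
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(drop_flag(sys.argv[1:], "sdk_container_image"))
```

All other arguments pass through unchanged, so the remaining template parameters still reach the pipeline.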