Asked by: 小点点

Uploading from Dataflow to MSSQL works fine with DirectRunner but fails with DataflowRunner (using pyodbc/msodbcsql18)


We are trying to upload data from BigQuery to a Microsoft SQL Server using Dataflow. We use the Dataflow Flex Template environment, and I have installed pyodbc and msodbcsql18.

When running our Dataflow script with the DirectRunner inside the Docker container in an interactive session, we have no problems: it reads from BigQuery, transforms the data, and finally uploads it to the SQL server.

However, when running the same script with the DataflowRunner in an interactive session, we get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 267, in _execute
    response = task()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 340, in <lambda> 
    lambda: self.create_worker().do_instruction(request), request)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 581, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 618, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 996, in process_bundle
    element.data)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 221, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 346, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 348, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 707, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 708, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1200, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1281, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 843, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/usr/local/lib/python3.7/site-packages/mssql.py", line 30, in process
    connection = mssql_client.connect(host, database, userid, password)
  File "/usr/local/lib/python3.7/site-packages/mssql_client.py", line 5, in connect
    connection = pyodbc.connect(connection_string)
pyodbc.Error: ("01000 [while running 'Upload to SQL server in 10 threads-ptransform-22']", "[01000] [unixODBC][Driver Manager]Can't open lib '/opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.1.so.1.1' : file not found (0) (SQLDriverConnect)")

We see the same error when we deploy the image/template to our GCP project. Googling the error message shows that it is a fairly common problem with various solutions. But it works with the DirectRunner and not with the DataflowRunner, so is it because the code runs on different workers (which for some reason don't have libmsodbcsql-18.1.so.1.1 installed) rather than on the image we built with the Dockerfile below? Or does the executing user lack the right file permissions?

Our Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:20220418_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY transfer.py .
COPY mssql_client.py .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/transfer.py"

RUN apt-get update -y
RUN apt-get install -y git

RUN pip install -U apache-beam==2.39.0
RUN pip install -U pyodbc==4.0.34

# Install dependency for apt-key
RUN apt-get install -y gnupg2 
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
# Install Microsoft SQL ODBC (dependency for pyodbc)
RUN curl https://packages.microsoft.com/config/debian/11/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update -y
RUN ACCEPT_EULA=Y apt-get install -y msodbcsql18

mssql_client.py:

# File mssql_client.py
import pyodbc

def connect(host, database, userid, password):
    connection_string = 'DRIVER={/opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.1.so.1.1};SERVER='+host+';DATABASE='+database+';UID='+userid+';PWD='+password
    connection = pyodbc.connect(connection_string)
    return connection

We also tried setting the DRIVER argument to {ODBC Driver 18 for SQL Server}.
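One way to narrow this down (a debugging sketch, not from the original post) is to compare what the DRIVER= keyword can resolve to. On a machine with pyodbc installed, `pyodbc.drivers()` lists every driver name registered in odbcinst.ini, and DRIVER= must either match one of those names or point to a library file that actually exists on the worker. The resolution mechanism itself can be illustrated with the standard library alone; the odbcinst.ini contents below are an assumption based on what the msodbcsql18 apt package typically installs:

```python
import configparser
import io

# Example odbcinst.ini as the msodbcsql18 apt package typically registers it
# (the path and version are assumptions; verify on the actual worker).
ODBCINST_INI = """
[ODBC Driver 18 for SQL Server]
Description=Microsoft ODBC Driver 18 for SQL Server
Driver=/opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.1.so.1.1
UsageCount=1
"""

def resolve_driver(name: str, ini_text: str) -> str:
    """Mimic how the unixODBC driver manager maps DRIVER={name} to a shared library."""
    parser = configparser.ConfigParser()
    parser.read_file(io.StringIO(ini_text))
    return parser[name]["Driver"]

# This is the .so that must exist on the *Beam* worker, not just the launcher image.
lib = resolve_driver("ODBC Driver 18 for SQL Server", ODBCINST_INI)
print(lib)
```

The "file not found" error in the traceback means exactly this lookup succeeded (or the literal path was given) but the .so was missing on the machine that executed the DoFn.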

We do not have a requirements.txt; instead we use setuptools:

# File setup.py 

import setuptools

setuptools.setup(
   name='setup',
   version='1.0.0',
   install_requires=[ 'pyodbc' ],
   packages=setuptools.find_packages(),
)

I also tried adding 'msodbcsql18' to install_requires, but then the job hangs forever.


1 Answer

Anonymous user

I had almost answered my own question already:

But it works with the DirectRunner and not with the DataflowRunner, so is it because the code runs on different workers (which for some reason don't have libmsodbcsql-18.1.so.1.1 installed) rather than on the image we built with the Dockerfile below?

As it turns out, the documentation explicitly states that only the dependencies listed in requirements.txt are available in the Beam container:

As part of your Dockerfile, you must install apache-beam to generate the job graph. Note that the Flex Template container built with the Dockerfile is only used to create the job graph and launch the Dataflow job. Packages installed in the Flex Template container are not available in the Beam container.

To make packages part of the Beam container, you must specify them in the requirements.txt file. Make sure you do not list apache-beam in requirements.txt; the Beam container already includes apache-beam.
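For the pip-installable part of the dependencies, a minimal requirements.txt would look like the following (a sketch; msodbcsql18 itself cannot go here, since it is an apt package rather than a pip package, which is exactly the problem described below):

```
# requirements.txt -- shipped to the Beam workers; do NOT list apache-beam here
pyodbc==4.0.34
```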

Since I need msodbcsql18, which is only available through an apt repository, I looked further into custom Beam containers.

I finally got it working with a custom base container:

FROM apache/beam_python3.7_sdk:2.39.0

RUN apt-get update -y

# Install various dependencies for pyodbc
RUN apt-get install -y gnupg2
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/debian/11/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update -y
RUN ACCEPT_EULA=Y apt-get install -y msodbcsql18
RUN apt-get install -y unixodbc-dev

RUN pip install -U pyodbc==4.0.34

Make sure the base image you use has the same Python and Beam versions as your pipeline; in my case I used Python 3.7 and Beam 2.39 respectively (see the pip install of apache-beam in the Dockerfile in the original post).

Then, once you have built and pushed the image to your Docker repository, add the extra sdk_container_image argument when launching your pipeline with the DataflowRunner:

--runner=DataflowRunner --sdk_container_image=[your-image-name]
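Put together, a full launch invocation might look like this (a sketch; the project, region, bucket, and image name are placeholders, not values from the original post):

```shell
python transfer.py \
    --runner=DataflowRunner \
    --project=your-gcp-project \
    --region=europe-west1 \
    --temp_location=gs://your-bucket/tmp \
    --sdk_container_image=gcr.io/your-gcp-project/beam-mssql:2.39.0
```

With this setup the Flex Template image is only used to launch the job, while the sdk_container_image (with msodbcsql18 baked in) is what actually runs the DoFns on the workers.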