How to handle local file operations across multiple ParDo transforms in Apache Beam / Google Cloud Dataflow

Asked by: 小点点

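I am developing an ETL pipeline for Google Cloud Dataflow that has several branching ParDo transforms, each of which needs a local audio file. The results of the branches are then combined and exported as text.

This started as a Python script that ran on a single machine, and I am trying to adapt it to run in parallel across VM workers using Google Cloud Dataflow.

The extraction step downloads files from a single GCS bucket location and deletes them once the transform has finished, to keep storage under control. This is because the preprocessing module needs local access to the files. It could be redesigned to work on byte streams instead of files by rewriting parts of the preprocessing library. However, some attempts at that did not go well, and I would first like to explore how to handle parallelized local file operations in Apache Beam / Google Cloud Dataflow, to understand the framework better.

In this rough implementation every branch downloads and deletes the files, so a lot of work is done twice. My implementation has 8 branches, so each file is downloaded and deleted 8 times. Is it possible to mount a GCS bucket on each worker instead of downloading the files from the remote location?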

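Or is there another way to make sure workers are passed the correct reference to a file, so that:

a single DownloadFilesDoFn() can download a batch,
then fan out the local file references in a PCollection to all branches,
and a final CleanUpFilesDoFn() can delete them?
How can local file references be parallelized?

If local file operations cannot be avoided, what is the best branching ParDo strategy for Apache Beam / Google Cloud Dataflow?

For simplicity, here is some example code from my existing implementation, reduced to two branches.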

import os
import subprocess

import apache_beam as beam

# Assumption: cwd was not defined in the original snippet; the worker's
# current directory is used as the download location here.
cwd = os.getcwd()

# singleton decorator: caches one instance per set of constructor arguments
def singleton(cls):
  instances = {}
  def getinstance(*args):
      if args not in instances:
          instances[args] = cls(*args)
      return instances[args]
  return getinstance

@singleton
class Predict():
  '''
  Process audio, reads in filename 
  Returns Prediction
  '''
  def __init__(self, model):
    self.model = model

  def process(self, filename):
      # simplified pseudocode: preprocess and inference stand in for the real modules
      audio = preprocess.load(filename=filename)
      prediction = inference(self.model, audio)
      return prediction

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model
    
  def process(self, element):
    # Construct Predict() object singleton per worker
    predict = Predict(self.model)

    # check=True raises if gsutil fails instead of continuing with a missing file
    subprocess.run(['gsutil','cp',element['GCSPath'],'./'], cwd=cwd, shell=False, check=True)
    localfile = os.path.join(cwd, element['GCSPath'].split('/')[-1])

    try:
      res = predict.process(localfile)
    finally:
      # Delete per element: finish_bundle would only know the last file of the bundle
      os.remove(localfile)
    return [{
        'Index': element['Index'], 
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]    


# DoFn to split csv into elements (GCS bucket could be read as a PCollection instead maybe)
class Split(beam.DoFn):
    def process(self, element):
        Index,Title,GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/'+ GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]

A simplified version of the pipeline:

# pipeline_args, known_args, model1 and model2 are defined elsewhere (omitted here)
with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
        )
    # prediction 1 branch
    preds1 = (
          files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
        )
    # prediction 2 branch
    preds2 = (
          files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
        )
    
    # join branches: Flatten merges both outputs into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = (
      joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
        )

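1 answer

Anonymous user

To avoid downloading the files repeatedly, you can put the file contents into the PCollection. Each element then carries its audio bytes, so every branch reads them from the element instead of fetching the file from GCS again.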

class DownloadFilesDoFn(beam.DoFn):
  def __init__(self):
     import re
     # group 1: bucket name, group 2: object name inside the bucket
     self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

  def start_bundle(self):
     # Create the client on the worker, not at pipeline construction time
     import google.cloud.storage
     self.gcs = google.cloud.storage.Client()

  def process(self, element):
     file_match = self.gcs_path_regex.match(element['GCSPath'])
     bucket = self.gcs.get_bucket(file_match.group(1))
     blob = bucket.get_blob(file_match.group(2))
     # Emit a copy: Beam expects a DoFn not to mutate its input element
     yield dict(element, file_contents=blob.download_as_bytes())
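
To make the regular expression in DownloadFilesDoFn concrete, here is a minimal sketch of what ends up in each capture group. The helper name split_gcs_path and the example path are made up for illustration; only the pattern itself comes from the code above.

```python
import re

# Equivalent to the pattern in DownloadFilesDoFn (escaping '/' is optional in Python).
GCS_PATH_REGEX = re.compile(r'gs://([^/]+)/(.*)')


def split_gcs_path(gcs_path):
    """Hypothetical helper: split 'gs://bucket/dir/file' into (bucket, object_name)."""
    match = GCS_PATH_REGEX.match(gcs_path)
    if match is None:
        raise ValueError('Not a gs:// path: %r' % gcs_path)
    return match.group(1), match.group(2)


print(split_gcs_path('gs://mybucket/audio/clip01.wav'))
# ('mybucket', 'audio/clip01.wav')
```

The first group stops at the first slash after the bucket name, so everything after it, including any further slashes, is treated as the object name.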

PredictDoFn then becomes:

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model

  def start_bundle(self):
    # Predict is a singleton, so this reuses the instance already built on the worker
    self.predict = Predict(self.model)
    
  def process(self, element):
    # Predict.process now receives the raw bytes instead of a filename
    res = self.predict.process(element['file_contents'])
    return [{
        'Index': element['Index'], 
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]
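
The question mentions that the preprocessing library needs a real file on disk. If that cannot be changed, one possible way to bridge the gap is to write the bytes to a temporary file just for the duration of the call. The sketch below uses only the standard library; local_copy and load_size are hypothetical names, and load_size merely stands in for a loader that accepts file names only.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def local_copy(file_contents, suffix=''):
    """Write bytes to a temporary file, yield its path, and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(file_contents)
        yield path
    finally:
        # Runs even if the consumer raises, so files do not pile up on the worker.
        os.remove(path)


# Stand-in for a library function that only accepts a file name.
def load_size(filename):
    return os.path.getsize(filename)


with local_copy(b'RIFF....', suffix='.wav') as path:
    print(load_size(path))  # 8
```

Inside PredictDoFn.process the prediction call would then be wrapped as `with local_copy(element['file_contents']) as path: res = self.predict.process(path)`. Because mkstemp gives every call a unique name, branches running on the same worker would not overwrite or delete each other's files.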

And the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
        )
    # prediction 1 branch
    preds1 = (
          files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
        )
    # prediction 2 branch
    preds2 = (
          files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
        )
    
    # join branches: Flatten merges both outputs into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = (
      joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
        )
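
If carrying every file's bytes through the PCollection turns out to be too heavy, a different option is to keep the download, all predictions, and the cleanup inside one processing step, so that a local path never has to be handed to another transform (Beam does not promise that two transforms run on the same worker). The sketch below shows that pattern in plain Python so it runs without Beam or GCS. The names predict_all, fetch, and predictors are assumptions made for this illustration; in a pipeline the function body would sit in the process method of a single DoFn, at the cost of running the models one after another per file.

```python
import os
import tempfile


def predict_all(element, fetch, predictors):
    """Download one file, run every predictor on it, then delete it.

    fetch(gcs_path, local_path) and predictors (name -> callable taking a
    filename) are hypothetical stand-ins for the GCS download and the models.
    """
    suffix = os.path.splitext(element['GCSPath'])[1]
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # fetch() opens the path itself
    try:
        fetch(element['GCSPath'], local_path)
        result = {
            'Index': element['Index'],
            'Title': element['Title'],
            'File': element['GCSPath'],
        }
        for name, predictor in predictors.items():
            result[name + 'Prediction'] = predictor(local_path)
        return result
    finally:
        # One download and one delete per file, however many models run.
        os.remove(local_path)


# Toy stand-ins so the sketch runs without GCS or real models.
def fake_fetch(gcs_path, local_path):
    with open(local_path, 'wb') as handle:
        handle.write(b'12345')


predictors = {
    'model1': lambda filename: os.path.getsize(filename),
    'model2': lambda filename: filename.endswith('.wav'),
}
element = {'Index': 1, 'Title': 'clip', 'GCSPath': 'gs://mybucket/clip.wav'}
print(predict_all(element, fake_fetch, predictors))
# {'Index': 1, 'Title': 'clip', 'File': 'gs://mybucket/clip.wav',
#  'model1Prediction': 5, 'model2Prediction': True}
```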