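I'm developing an ETL pipeline for Google Cloud Dataflow with several branching ParDo
transforms, each of which requires a local audio file. The branch results are then combined and exported as text.
This started as a Python script running on a single machine, which I'm trying to adapt for parallelization across VM workers using GC Dataflow.
The extraction step downloads files from a single GCS bucket location and deletes them once the transform completes, to keep storage under control. This is because the preprocessing module needs local access to the files. It could be re-engineered to handle byte streams instead of files by rewriting some of the preprocessing libraries. However, some attempts at this haven't gone well, and I'd like to first explore how to handle parallelized local file operations in Apache Beam / GC Dataflow to better understand the framework.
In this rough implementation each branch downloads and deletes the files itself, so there is a lot of duplicated work. I have 8 branches, so each file is downloaded and deleted 8 times. Is it possible to mount a GCS bucket on each worker instead of downloading files from the remote?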
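Or is there another way to ensure each worker is passed a correct reference to a file, so that:
- DownloadFilesDoFn() can download a batch,
- the local file references in the PCollection fan out to all the branches, and
- CleanUpFilesDoFn() can delete them?

If local file operations cannot be avoided, what is the best branching ParDo strategy for Apache Beam / GC Dataflow?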
For simplicity, here is some sample code from my existing implementation, with two branches.
import os
import subprocess

import apache_beam as beam

# Working directory for downloaded files (assumption: it was undefined in the original snippet)
cwd = os.getcwd()

# singleton decorator: one instance per distinct set of constructor arguments,
# so each model gets its own Predict object
def singleton(cls):
    instances = {}
    def getinstance(*args):
        if args not in instances:
            instances[args] = cls(*args)
        return instances[args]
    return getinstance

@singleton
class Predict():
    '''
    Process audio, reads in filename
    Returns Prediction
    '''
    def __init__(self, model):
        self.model = model

    def process(self, filename):
        # simplified pseudocode: preprocess and inference are placeholders
        audio = preprocess.load(filename=filename)
        prediction = inference(self.model, audio)
        return prediction

class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.localfiles, self.model = [], model

    def process(self, element):
        # Construct Predict() object singleton per worker
        predict = Predict(self.model)
        # Download the audio file next to the worker process
        subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False, check=True)
        localfile = os.path.join(cwd, element['GCSPath'].split('/')[-1])
        self.localfiles.append(localfile)
        res = predict.process(localfile)
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File' : element['GCSPath'],
            self.model + 'Prediction': res
        }]

    def finish_bundle(self):
        # Remove every file downloaded in this bundle, not only the last one
        for localfile in self.localfiles:
            os.remove(localfile)
        self.localfiles = []

# DoFn to split csv into elements (GCS bucket could be read as a PCollection instead maybe)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
A simplified version of the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )
    # prediction 1 branch
    preds1 = (
        files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    )
    # prediction 2 branch
    preds2 = (
        files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
    )
    # join branches: Flatten merges both outputs into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()
    # output to file
    output = (
        joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
    )
To avoid downloading the files repeatedly, the file contents can be put into the PCollection.
class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

    def start_bundle(self):
        # Create the client on the worker rather than at pipeline construction time
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        # Split gs://bucket/path into bucket name (group 1) and blob name (group 2)
        file_match = self.gcs_path_regex.match(element['GCSPath'])
        bucket = self.gcs.get_bucket(file_match.group(1))
        blob = bucket.get_blob(file_match.group(2))
        # Emit a copy with the bytes attached; Beam input elements should not be mutated
        yield dict(element, file_contents=blob.download_as_bytes())
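To make it clearer what the regular expression in DownloadFilesDoFn captures, here is a minimal standalone sketch of the same parsing step. The helper name `split_gcs_path` is hypothetical (it is not part of Beam or the GCS client), and the pattern is the same one written without the redundant escapes.

```python
import re

# Same pattern as in DownloadFilesDoFn, minus the unnecessary backslashes.
GCS_PATH_REGEX = re.compile(r'gs://([^/]+)/(.*)')


def split_gcs_path(gcs_path):
    """Return (bucket_name, blob_name) for a gs:// URI (hypothetical helper)."""
    match = GCS_PATH_REGEX.match(gcs_path)
    if match is None:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError('Not a GCS path: %r' % (gcs_path,))
    return match.group(1), match.group(2)
```

Calling `split_gcs_path('gs://mybucket/audio/a.wav')` yields the bucket name and the blob name that `get_bucket` and `get_blob` expect. Raising early on a malformed path is a design choice for easier debugging, not something the code above does.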
Then PredictDoFn becomes:
class PredictDoFn(beam.DoFn):
    def __init__(self, model):
        self.model = model

    def start_bundle(self):
        # Reuse the per-worker Predict singleton for this model
        self.predict = Predict(self.model)

    def process(self, element):
        # The file bytes now travel with the element; nothing is downloaded here
        res = self.predict.process(element['file_contents'])
        return [{
            'Index': element['Index'],
            'Title': element['Title'],
            'File' : element['GCSPath'],
            self.model + 'Prediction': res
        }]
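Note that `Predict.process` in the question expects a filename, while the element now carries bytes. If the preprocessing library really needs a path on disk, one option is to write the bytes to a short-lived temporary file inside `process`. The following is only a sketch under that assumption; `local_copy` is a hypothetical helper, not a Beam API.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def local_copy(file_contents, suffix=''):
    """Write bytes to a temporary file, yield its path, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(file_contents)
        yield path
    finally:
        # Runs even if the caller's block raises, so no files are left behind.
        os.remove(path)


# Inside PredictDoFn.process this could be used roughly as follows:
#     suffix = os.path.splitext(element['GCSPath'])[1]
#     with local_copy(element['file_contents'], suffix) as path:
#         res = self.predict.process(path)
```

Each branch still writes its own temporary file, but the GCS download happens once per file, and cleanup is tied to the element rather than to the end of the bundle.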
And the pipeline:
with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )
    # prediction 1 branch
    preds1 = (
        files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    )
    # prediction 2 branch
    preds2 = (
        files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))
    )
    # join branches: Flatten merges both outputs into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()
    # output to file
    output = (
        joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
    )
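The question mentions reworking the preprocessing to consume byte streams instead of files. Many readers accept a file-like object, in which case wrapping `element['file_contents']` in `io.BytesIO` may be all that is needed. The sketch below illustrates the idea with the standard library `wave` module. It assumes WAV input, which the question does not state, and `wav_duration_seconds` is a hypothetical stand-in for the real preprocessing call.

```python
import io
import wave


def wav_duration_seconds(file_contents):
    """Read WAV metadata from in-memory bytes; no local file is created."""
    # wave.open accepts any seekable file-like object, not only a filename.
    with wave.open(io.BytesIO(file_contents), 'rb') as wav:
        return wav.getnframes() / wav.getframerate()
```

Whether this works for the actual preprocessing module depends on whether its loader accepts file-like objects, so it is worth checking before rewriting any library code.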