I am facing Access Denied on multipart uploads to an SSE-KMS encrypted bucket. The code runs in Glue (it would possibly behave the same from other services, but I cannot verify that). I have tried a range of different permissions, even full access, with no effect. The role already has the relevant KMS permissions (kms:Decrypt, kms:Encrypt and kms:GenerateDataKey*) and it worked previously! A Glue security configuration is set (its key is the same one granted to the job).

The PySpark job writes its output through a Glue sink:

output_sink = glueContext.getSink(...)
output_sink.writeFrame(dynamic_frame)
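For context, the full sink call has roughly the shape sketched below; the path and options are illustrative placeholders rather than the job's real values, and the SSE-KMS key itself comes from the security configuration, not from the sink options.

# Illustrative shape of the elided sink call above; all values are placeholders.
output_sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket-name/my-output-folder/",  # placeholder output prefix
    partitionKeys=[],
)
output_sink.setFormat("glueparquet")  # consistent with GlueParquetHadoopWriter in the stacktrace below
output_sink.writeFrame(dynamic_frame)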
The Python shell (PyShell) job writes with pandas:

df = pandas.read_excel(...)
df.to_parquet(output_file_path, compression="snappy", index=False)
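For completeness, this is roughly how the SSE-KMS parameters could be passed explicitly on the PyShell side by going through s3fs; it is only a sketch, assuming s3fs applies s3_additional_kwargs when it initiates the multipart upload, and the key ARN, input file and output path below are placeholders, not something I have confirmed to help.

import pandas
import s3fs

# Attach the SSE-KMS parameters to S3 writes made through this filesystem.
# The key ARN, input file and output path are placeholders.
fs = s3fs.S3FileSystem(
    s3_additional_kwargs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "arn:aws:kms:region:11111:key/kkkkkk",
    }
)

df = pandas.read_excel("input.xlsx")  # placeholder input
with fs.open("s3://my-bucket-name/my-output-folder/output.snappy.parquet", "wb") as f:
    df.to_parquet(f, compression="snappy", index=False)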
What I have already tried, without success:

- s3:* permissions
- kms:* permissions added to the policy of the KMS key used by the job
- s3:* with Resource: "*"
- glue.amazonaws.com added as a service principal to the KMS key
- fs.s3.enableServerSideEncryption and fs.s3.serverSideEncryption.kms.keyId with the corresponding key ARN (see the sketch after this list)
- different awscli, botocore and boto3 versions (pandas=1.1.5 and s3fs=0.4.2 cannot be upgraded any higher because the PyShell job runs Python 3.6.13)
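A minimal sketch of one way those two EMRFS properties can be set from the PySpark job; the key ARN is a placeholder.

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Server-side encryption properties named in the list above; placeholder key ARN.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.enableServerSideEncryption", "true")
hadoop_conf.set("fs.s3.serverSideEncryption.kms.keyId", "arn:aws:kms:region:11111:key/kkkkkk")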
Part of the stacktrace from the PySpark job points to a multipart upload problem:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 172.36.10.82, executor 1): com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ..; S3 Extended Request ID: ..), S3 Extended Request ID: ....
..<cropped entries>..
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:110)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.putObject(AmazonS3LiteClient.java:107)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:174)
at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:208)
at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:423)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:579)
..<cropped entries>..
at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.writeParquetPartitioned(GlueParquetHadoopWriter.scala:163)
at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter$$anonfun$doParquetWrite$2.apply(GlueParquetHadoopWriter.scala:188)
at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter$$anonfun$doParquetWrite$2.apply(GlueParquetHadoopWriter.scala:181)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
..<cropped entries>..
Error message from the PyShell job:
Sending http request: <AWSPreparedRequest stream_output=False, method=PUT,
url=https://my-bucket-name.s3.ca-central-1.amazonaws.com/folder/folder/folder/file-name.snappy.parquet?partNumber=1&uploadId=~uploadId~,
headers={
'User-Agent': b'Botocore/1.12.232 Python/3.6.13 Linux/4.14.238-125.422.amzn1.x86_64',
'Content-MD5': b'Ic4VG7BgETssQJOhSK+E/Q==',
'Expect': b'100-continue',
'X-Amz-Date': b'20220518T163248Z',
'X-Amz-Security-Token': b'~token-data~',
'X-Amz-Content-SHA256': b'UNSIGNED-PAYLOAD',
'Authorization': b'AWS4-HMAC-SHA256 Credential=~credential~, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=~signature~',
'Content-Length': '5421349'
}>
...
Traceback (most recent call last):
File "/tmp/glue-python-scripts-2tscdixy/script.py", line 44, in main
df.to_parquet(output_file_path, compression="snappy", index=False)
File "/glue/lib/installation/pandas/util/_decorators.py", line 199, in wrapper
return func(*args, **kwargs)
File "/glue/lib/installation/pandas/core/frame.py", line 2372, in to_parquet
**kwargs,
File "/glue/lib/installation/pandas/io/parquet.py", line 276, in to_parquet
**kwargs,
File "/glue/lib/installation/pandas/io/parquet.py", line 123, in write
self.api.parquet.write_table(table, path, compression=compression, **kwargs)
File "/glue/lib/installation/pyarrow/parquet.py", line 2034, in write_table
writer.write_table(table, row_group_size=row_group_size)
File "/glue/lib/installation/pyarrow/parquet.py", line 686, in __exit__
self.close()
File "/glue/lib/installation/pyarrow/parquet.py", line 710, in close
self.file_handle.close()
File "pyarrow/io.pxi", line 173, in pyarrow.lib.NativeFile.close
File "/glue/lib/installation/fsspec/spec.py", line 1630, in close
self.flush(force=True)
File "/glue/lib/installation/fsspec/spec.py", line 1501, in flush
if self._upload_chunk(final=force) is not False:
File "/glue/lib/installation/s3fs/core.py", line 1245, in _upload_chunk
raise IOError('Write failed: %r' % exc)
OSError: Write failed: ClientError('An error occurred (AccessDenied) when calling the UploadPart operation: Access Denied',)
The issue persisted for 3 days and today it suddenly disappeared. My only suspicion is that it was caused by an internal AWS error that was fixed yesterday, so all the previously non-working policies started granting the proper permissions.
For anyone looking for clues or a solution, the only potentially useful option I found (though I have not yet confirmed whether it works) is to add a separate policy statement with bucket-level listing permissions, and to add the s3:ListMultipartUploadParts and s3:AbortMultipartUpload actions to the statement for the target folder:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::my-bucket-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetEncryptionConfiguration",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::my-bucket-name/my-output-folder/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:region:11111:key/kkkkkk"
        }
    ]
}
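If this turns out to be the fix, the statements above can be attached to the job role as an inline policy; below is a minimal boto3 sketch, assuming the JSON above is saved locally as policy.json, with the role and policy names as placeholders.

import boto3

# Read the policy document shown above (assumed to be saved as policy.json).
with open("policy.json") as f:
    policy_document = f.read()

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="my-glue-job-role",           # placeholder: the role the Glue job runs as
    PolicyName="s3-kms-multipart-upload",  # placeholder inline policy name
    PolicyDocument=policy_document,
)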