PySpark: read files from S3 and parallelize the file list

I am a newbie to Apache Spark and PySpark. I have a use case where I need to read multiple files from different folders in S3 and then process the file contents in parallel. I have tried several approaches, one of which is shown below. I did not understand how to initialize the S3 client inside the lambda body, and I keep hitting the same error: TypeError: can't pickle thread.lock objects. How can I process the S3 files in parallel and read the body of each object?

Here is the code snippet after editing.

    import boto3

    def f(key):
        s3_client = boto3.client('s3')
        body = s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
        return body

    data_rdd = sc.parallelize(keys_list).map(lambda key: f(key))

1 answer

  • answered 2017-06-17 18:33 Jordan P

    I did not understand how to initialize s3 client inside the lambda body.

    Instead of

    s3_client = boto3.client('s3')
    ..map(lambda key: s3_client.get_object(Bucket="my_bucket", Key=key))
    

    You could try creating the client inside the lambda, so it is constructed on each executor instead of being pickled and shipped from the driver:

    ..map(lambda key: boto3.client('s3').get_object(Bucket="my_bucket", Key=key))
    

    If you need to configure the boto3 client, it might be best to use a named function:

    def f(key):
      import boto3
      s3_client = boto3.client('s3')
      # ... more s3_client config ...
      return s3_client.get_object(Bucket="my_bucket", Key=key)
    
    ..map(f)
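
    Putting it together with the question's keys_list, here is a minimal, untested sketch (assuming a live SparkContext sc, a bucket named "my_bucket", and illustrative keys; fetch_body is just an example name):

    import boto3
    from pyspark import SparkContext

    sc = SparkContext(appName="s3-read")  # or reuse your existing context

    keys_list = ["folder1/a.txt", "folder2/b.txt"]  # example keys

    def fetch_body(key):
        # The client is created inside the function, on the executor,
        # so nothing unpicklable crosses the driver/executor boundary.
        s3_client = boto3.client('s3')
        return s3_client.get_object(Bucket="my_bucket", Key=key)['Body'].read()

    bodies = sc.parallelize(keys_list).map(fetch_body).collect()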