Querying Parquet file nested column scan whole column even when there's predicate pushdown applied to records

I'm playing with parquet format. I have parquet file of events, each consist of timestamp, topic, and tags. the file is sorted by topic and then by timestamp. I run query which can be described like:

select topic from T where topic = 404;

it runs much fast, and returns very few rows. It runs much faster than:

select topic from T;

When I change it to be something like:

select tags from T where topic = 404;

it runs as slow as running

select tags from T;

Analyzing the plan, it seems (when using spark), that the predicate push down is applied, but from the performance I can assume it doesn't apply to the column of tags.

I tested with hive, spark and presto. Is there anything to do about it or any other technology that handles parquet nested arrays better?

  • execution plan in spark:

== Physical Plan ==

*Project [tags#4]

+- *Filter (isnotnull(topic#3L) && (topic#3L = 404))

+- *FileScan parquet [topic#3L,tags#4] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/example-path], PartitionFilters: [], PushedFilters: [IsNotNull(topic), EqualTo(topic,404)], ReadSchema: struct>

Thanks, Roee

1 answer