Filtering

A simple example to demonstrate how to filter a parquet file using a pandas-like expression.

This example uses the parq_tools library to filter a Parquet file based on a specified condition. Pyarrow filtering is not structured like the filtering in pandas, but parq-tools uses custom parser allowing pandas-like expressions to be used.

import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

Create a Parquet file

Create a temporary parquet file for demonstration

def create_parquet_file(file_path: Path):
    # Define the dataset
    data = {
        "x": range(1, 11),  # Index column
        "y": range(11, 21),  # Index column
        "z": range(21, 31),  # Index column
        "a": [f"val{i}" for i in range(1, 11)],  # Supplementary column
        "b": [i * 2 for i in range(1, 11)],  # Supplementary column
        "c": [i % 3 for i in range(1, 11)],  # Supplementary column
    }

    # Create a DataFrame
    df = pa.Table.from_pydict(data)

    # Write the DataFrame to a Parquet file
    pq.write_table(df, file_path)


parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"
create_parquet_file(parquet_file_path)

View the file as a DataFrame

df = pd.read_parquet(parquet_file_path)
df
x y z a b c
0 1 11 21 val1 2 1
1 2 12 22 val2 4 2
2 3 13 23 val3 6 0
3 4 14 24 val4 8 1
4 5 15 25 val5 10 2
5 6 16 26 val6 12 0
6 7 17 27 val7 14 1
7 8 18 28 val8 16 2
8 9 19 29 val9 18 0
9 10 20 30 val10 20 1


Filter with Pandas

We can use pandas directly to load the Parquet file and filter it using a pandas-like expression. First we filter early with read_parquet for efficiency. Additionally, we have manually set the index in this example.

index_cols = ["x", "y", "z"]
df_from_pandas_1: pd.DataFrame = pd.read_parquet(parquet_file_path,
                                                 columns=["x", "y", "z", "a", "c"],
                                                 filters=[("x", ">", 3), ("y", "<=", 15)]).set_index(index_cols)
df_from_pandas_1
a c
x y z
4 14 24 val4 1
5 15 25 val5 2


An alternative but less efficient way is to load all records and then apply a filter

df_from_pandas_2 = pd.read_parquet(parquet_file_path,
                                   columns=["x", "y", "z", "a", "c"]).query("x > 3 and y <= 15").set_index(index_cols)

df_from_pandas_2
a c
x y z
4 14 24 val4 1
5 15 25 val5 2


Compare the two DataFrames to ensure they are equal

pd.testing.assert_frame_equal(df_from_pandas_1, df_from_pandas_2)

Filter with Parq Tools

The parq_tools library provides a way to filter Parquet files that do not fit into memory, using a pandas-like expression. The output is a new Parquet file containing only the filtered records and selected columns. This can be useful in pipelines with large datasets.

from parq_tools import filter_parquet_file

filter_parquet_file(parquet_file_path,
                    output_path=parquet_file_path.with_suffix('.filtered.parquet'),
                    columns=["x", "y", "z", "a", "c"], filter_expression='x > 3 and y <= 15',
                    show_progress=True)
Filtering:   0%|          | 0/10 [00:00<?, ?rows/s]
Filtering:  20%|██        | 2/10 [00:00<00:00, 1908.67rows/s]

Read the filtered Parquet file

df_filtered = pd.read_parquet(parquet_file_path.with_suffix('.filtered.parquet')).set_index(index_cols)
df_filtered
a c
x y z
4 14 24 val4 1
5 15 25 val5 2


Compare the filtered DataFrame with the one from pandas

pd.testing.assert_frame_equal(df_filtered, df_from_pandas_1)

Total running time of the script: (0 minutes 0.060 seconds)

Gallery generated by Sphinx-Gallery