Filtering

A simple example to demonstrate how to filter a parquet file using a pandas-like expression.

This example uses the parq_tools library to filter a Parquet file based on a specified condition. Pyarrow filtering is not structured like the filtering in pandas, but parq-tools uses custom parser allowing pandas-like expressions to be used.

import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

Create a Parquet file

Create a temporary parquet file for demonstration

def create_parquet_file(file_path: Path):
    # Define the dataset
    data = {
        "x": range(1, 11),  # Index column
        "y": range(11, 21),  # Index column
        "z": range(21, 31),  # Index column
        "a": [f"val{i}" for i in range(1, 11)],  # Supplementary column
        "b": [i * 2 for i in range(1, 11)],  # Supplementary column
        "c": [i % 3 for i in range(1, 11)],  # Supplementary column
    }

    # Create a DataFrame
    df = pa.Table.from_pydict(data)

    # Write the DataFrame to a Parquet file
    pq.write_table(df, file_path)


parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"
create_parquet_file(parquet_file_path)

View the file as a DataFrame

df = pd.read_parquet(parquet_file_path)
df

	x	y	z	a	b	c
0	1	11	21	val1	2	1
1	2	12	22	val2	4	2
2	3	13	23	val3	6	0
3	4	14	24	val4	8	1
4	5	15	25	val5	10	2
5	6	16	26	val6	12	0
6	7	17	27	val7	14	1
7	8	18	28	val8	16	2
8	9	19	29	val9	18	0
9	10	20	30	val10	20	1

Filter with Pandas

We can use pandas directly to load the Parquet file and filter it using a pandas-like expression. First we filter early with read_parquet for efficiency. Additionally, we have manually set the index in this example.

index_cols = ["x", "y", "z"]
df_from_pandas_1: pd.DataFrame = pd.read_parquet(parquet_file_path,
                                                 columns=["x", "y", "z", "a", "c"],
                                                 filters=[("x", ">", 3), ("y", "<=", 15)]).set_index(index_cols)
df_from_pandas_1

			a	c
x	y	z
4	14	24	val4	1
5	15	25	val5	2

An alternative but less efficient way is to load all records and then apply a filter

df_from_pandas_2 = pd.read_parquet(parquet_file_path,
                                   columns=["x", "y", "z", "a", "c"]).query("x > 3 and y <= 15").set_index(index_cols)

df_from_pandas_2

			a	c
x	y	z
4	14	24	val4	1
5	15	25	val5	2

Compare the two DataFrames to ensure they are equal

pd.testing.assert_frame_equal(df_from_pandas_1, df_from_pandas_2)

Filter with Parq Tools

The parq_tools library provides a way to filter Parquet files that do not fit into memory, using a pandas-like expression. The output is a new Parquet file containing only the filtered records and selected columns. This can be useful in pipelines with large datasets.

from parq_tools import filter_parquet_file

filter_parquet_file(parquet_file_path,
                    output_path=parquet_file_path.with_suffix('.filtered.parquet'),
                    columns=["x", "y", "z", "a", "c"], filter_expression='x > 3 and y <= 15',
                    show_progress=True)

Filtering:   0%|          | 0/10 [00:00<?, ?rows/s]
Filtering:  20%|██        | 2/10 [00:00<00:00, 1908.67rows/s]

Read the filtered Parquet file

df_filtered = pd.read_parquet(parquet_file_path.with_suffix('.filtered.parquet')).set_index(index_cols)
df_filtered

			a	c
x	y	z
4	14	24	val4	1
5	15	25	val5	2

Compare the filtered DataFrame with the one from pandas

pd.testing.assert_frame_equal(df_filtered, df_from_pandas_1)

Total running time of the script: (0 minutes 0.060 seconds)

Gallery generated by Sphinx-Gallery