.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/01_filtering.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_01_filtering.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_01_filtering.py:


Filtering
=========

A simple example to demonstrate how to filter a Parquet file using a pandas-like expression.

This example uses the `parq_tools` library to filter a Parquet file based on a specified condition.
Pyarrow filters are structured differently from pandas filters, but parq-tools uses a custom parser
that allows pandas-like expressions to be used.

.. GENERATED FROM PYTHON SOURCE LINES 12-20

.. code-block:: Python

    import tempfile

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pathlib import Path

.. GENERATED FROM PYTHON SOURCE LINES 21-25

Create a Parquet file
---------------------

Create a temporary Parquet file for demonstration.

.. GENERATED FROM PYTHON SOURCE LINES 26-49

.. code-block:: Python

    def create_parquet_file(file_path: Path):
        # Define the dataset
        data = {
            "x": range(1, 11),  # Index column
            "y": range(11, 21),  # Index column
            "z": range(21, 31),  # Index column
            "a": [f"val{i}" for i in range(1, 11)],  # Supplementary column
            "b": [i * 2 for i in range(1, 11)],  # Supplementary column
            "c": [i % 3 for i in range(1, 11)],  # Supplementary column
        }

        # Build a pyarrow Table from the dictionary
        table = pa.Table.from_pydict(data)

        # Write the Table to a Parquet file
        pq.write_table(table, file_path)


    parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"
    create_parquet_file(parquet_file_path)

.. GENERATED FROM PYTHON SOURCE LINES 50-51

View the file as a DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 51-54

.. code-block:: Python

    df = pd.read_parquet(parquet_file_path)
    df

+---+----+----+----+-------+----+---+
|   | x  | y  | z  | a     | b  | c |
+===+====+====+====+=======+====+===+
| 0 | 1  | 11 | 21 | val1  | 2  | 1 |
+---+----+----+----+-------+----+---+
| 1 | 2  | 12 | 22 | val2  | 4  | 2 |
+---+----+----+----+-------+----+---+
| 2 | 3  | 13 | 23 | val3  | 6  | 0 |
+---+----+----+----+-------+----+---+
| 3 | 4  | 14 | 24 | val4  | 8  | 1 |
+---+----+----+----+-------+----+---+
| 4 | 5  | 15 | 25 | val5  | 10 | 2 |
+---+----+----+----+-------+----+---+
| 5 | 6  | 16 | 26 | val6  | 12 | 0 |
+---+----+----+----+-------+----+---+
| 6 | 7  | 17 | 27 | val7  | 14 | 1 |
+---+----+----+----+-------+----+---+
| 7 | 8  | 18 | 28 | val8  | 16 | 2 |
+---+----+----+----+-------+----+---+
| 8 | 9  | 19 | 29 | val9  | 18 | 0 |
+---+----+----+----+-------+----+---+
| 9 | 10 | 20 | 30 | val10 | 20 | 1 |
+---+----+----+----+-------+----+---+

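To illustrate the difference between pandas-style and pyarrow-style filtering mentioned in the
introduction, here is a minimal sketch on a small in-memory table. The table, the variable names,
and the condition are chosen for illustration only, and passing an expression to ``Table.filter``
assumes a reasonably recent pyarrow release.

.. code-block:: Python

    import pandas as pd
    import pyarrow as pa
    import pyarrow.compute as pc

    # Small illustrative table (not the example file itself)
    table = pa.table({"x": list(range(1, 11)), "y": list(range(11, 21))})

    # pandas: the condition is a single string expression
    pandas_result = table.to_pandas().query("x > 3 and y <= 15")

    # pyarrow: the condition is built from field objects combined with & and |
    expr = (pc.field("x") > 3) & (pc.field("y") <= 15)
    arrow_result = table.filter(expr)

    print(pandas_result["x"].tolist())
    print(arrow_result.column("x").to_pylist())

Both approaches select the same rows; only the way the condition is written differs.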

.. GENERATED FROM PYTHON SOURCE LINES 55-60

Filter with Pandas
------------------

We can use pandas directly to load the Parquet file and filter it.
First, we filter early by passing ``filters`` to ``read_parquet``, which is more efficient
than filtering after loading. The index is set manually in this example.

.. GENERATED FROM PYTHON SOURCE LINES 60-67

.. code-block:: Python

    index_cols = ["x", "y", "z"]
    df_from_pandas_1: pd.DataFrame = pd.read_parquet(parquet_file_path,
                                                     columns=["x", "y", "z", "a", "c"],
                                                     filters=[("x", ">", 3), ("y", "<=", 15)]).set_index(index_cols)
    df_from_pandas_1

+---+----+----+------+---+
|   |    |    | a    | c |
+---+----+----+------+---+
| x | y  | z  |      |   |
+===+====+====+======+===+
| 4 | 14 | 24 | val4 | 1 |
+---+----+----+------+---+
| 5 | 15 | 25 | val5 | 2 |
+---+----+----+------+---+

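The ``filters`` argument follows pyarrow's list-of-tuples convention: tuples in a single flat list
are combined with AND, while a list of lists expresses OR between the inner AND groups.
A minimal sketch, assuming the pyarrow engine and a hypothetical demo file
``filters_demo.parquet`` created only for this illustration:

.. code-block:: Python

    import tempfile
    from pathlib import Path

    import pandas as pd

    # Hypothetical demo file, separate from the example file
    demo_path = Path(tempfile.mkdtemp()) / "filters_demo.parquet"
    pd.DataFrame({"x": range(1, 11), "y": range(11, 21)}).to_parquet(demo_path, engine="pyarrow")

    # Flat list of tuples: x > 3 AND y <= 15
    both = pd.read_parquet(demo_path, engine="pyarrow",
                           filters=[("x", ">", 3), ("y", "<=", 15)])

    # List of lists: (x < 3) OR (x > 8)
    either = pd.read_parquet(demo_path, engine="pyarrow",
                             filters=[[("x", "<", 3)], [("x", ">", 8)]])

    print(both["x"].tolist())
    print(sorted(either["x"].tolist()))
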

.. GENERATED FROM PYTHON SOURCE LINES 68-69

An alternative, but less efficient, way is to load all records and then apply the filter with ``query``.

.. GENERATED FROM PYTHON SOURCE LINES 69-74

.. code-block:: Python

    df_from_pandas_2 = pd.read_parquet(parquet_file_path,
                                       columns=["x", "y", "z", "a", "c"]).query("x > 3 and y <= 15").set_index(index_cols)
    df_from_pandas_2

+---+----+----+------+---+
|   |    |    | a    | c |
+---+----+----+------+---+
| x | y  | z  |      |   |
+===+====+====+======+===+
| 4 | 14 | 24 | val4 | 1 |
+---+----+----+------+---+
| 5 | 15 | 25 | val5 | 2 |
+---+----+----+------+---+


.. GENERATED FROM PYTHON SOURCE LINES 75-76

Compare the two DataFrames to ensure they are equal.

.. GENERATED FROM PYTHON SOURCE LINES 76-78

.. code-block:: Python

    pd.testing.assert_frame_equal(df_from_pandas_1, df_from_pandas_2)

.. GENERATED FROM PYTHON SOURCE LINES 79-85

Filter with Parq Tools
----------------------

The `parq_tools` library provides a way to filter Parquet files that do not fit into memory,
using a pandas-like expression. The output is a new Parquet file containing only the
filtered records and selected columns. This can be useful in pipelines with large datasets.

.. GENERATED FROM PYTHON SOURCE LINES 85-93

.. code-block:: Python

    from parq_tools import filter_parquet_file

    filter_parquet_file(parquet_file_path,
                        output_path=parquet_file_path.with_suffix('.filtered.parquet'),
                        columns=["x", "y", "z", "a", "c"],
                        filter_expression='x > 3 and y <= 15',
                        show_progress=True)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Filtering: 0%| | 0/10 [00:00

Reading the filtered file back into a DataFrame, ``df_filtered``, with the same index gives:

+---+----+----+------+---+
|   |    |    | a    | c |
+---+----+----+------+---+
| x | y  | z  |      |   |
+===+====+====+======+===+
| 4 | 14 | 24 | val4 | 1 |
+---+----+----+------+---+
| 5 | 15 | 25 | val5 | 2 |
+---+----+----+------+---+

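To give a feel for how a file can be filtered without loading it fully into memory, the sketch
below streams record batches with plain pyarrow, filters each batch, and appends the result to a
new file. This is not the `parq_tools` implementation, only an illustration of the general idea;
the file names, the batch size, and the hard-coded expression are assumptions made for the example.

.. code-block:: Python

    import tempfile
    from pathlib import Path

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Hypothetical input and output files for this sketch
    tmp_dir = Path(tempfile.mkdtemp())
    src = tmp_dir / "batch_demo.parquet"
    dst = tmp_dir / "batch_demo.filtered.parquet"
    pq.write_table(
        pa.table({
            "x": list(range(1, 11)),
            "y": list(range(11, 21)),
            "a": [f"val{i}" for i in range(1, 11)],
        }),
        src,
    )

    columns = ["x", "y", "a"]
    expr = (pc.field("x") > 3) & (pc.field("y") <= 15)

    parquet_file = pq.ParquetFile(src)
    writer = None
    try:
        # Only one batch of rows is held in memory at a time
        for batch in parquet_file.iter_batches(batch_size=4, columns=columns):
            filtered = pa.Table.from_batches([batch]).filter(expr)
            if writer is None:
                # Open the output lazily so its schema matches the selected columns
                writer = pq.ParquetWriter(str(dst), filtered.schema)
            if filtered.num_rows:
                writer.write_table(filtered)
    finally:
        if writer is not None:
            writer.close()

    result = pq.read_table(dst)
    print(result.column("x").to_pylist())

Memory use is bounded by the batch size rather than by the file size, which is the property that
matters for datasets larger than memory.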

.. GENERATED FROM PYTHON SOURCE LINES 99-100

Compare the filtered DataFrame with the one from pandas

.. GENERATED FROM PYTHON SOURCE LINES 100-101

.. code-block:: Python

    pd.testing.assert_frame_equal(df_filtered, df_from_pandas_1)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.060 seconds)


.. _sphx_glr_download_auto_examples_01_filtering.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01_filtering.ipynb <01_filtering.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01_filtering.py <01_filtering.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_