.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/03_tall_concatenation.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_03_tall_concatenation.py: Tall Concatenation ================== This example demonstrates how to concatenate multiple Parquet files along rows (tall concatenation) using the `parq_tools` library. It also shows how to achieve the same result with `pandas.concat` and verifies that the outputs are equivalent. Tall concatenation combines all rows from each file, aligning columns by name. Columns missing in some files are filled with nulls. .. GENERATED FROM PYTHON SOURCE LINES 12-19 .. code-block:: Python import pandas as pd from parq_tools import ParquetConcat from pathlib import Path import tempfile .. GENERATED FROM PYTHON SOURCE LINES 20-24 Create Example Parquet Files ---------------------------- We create three example Parquet files with overlapping and unique columns. .. GENERATED FROM PYTHON SOURCE LINES 24-66 .. code-block:: Python def create_example_parquet(file_path: Path, data: dict): df = pd.DataFrame(data) df.to_parquet(file_path, index=False) temp_dir = Path(tempfile.gettempdir()) / "parquet_tall_concat_example" temp_dir.mkdir(parents=True, exist_ok=True) data1 = { "x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "a": ["A", "B", "C"], "b": [1.0, 2.0, 3.0] } data2 = { "x": [4, 5, 6], "y": [7, 8, 9], "z": [10, 11, 12], "a": ["D", "E", "F"], "b": [3.0, 4.0, 5.0] } data3 = { "x": [7, 8, 9], "y": [10, 11, 12], "z": [13, 14, 15], "a": ["G", "H", "I"], "b": [6.0, 7.0, 8.0], "c": ["J", "K", "L"] } input_files = [ temp_dir / "example_data1.parquet", temp_dir / "example_data2.parquet", temp_dir / "example_data3.parquet" ] create_example_parquet(input_files[0], data1) create_example_parquet(input_files[1], data2) create_example_parquet(input_files[2], data3) .. GENERATED FROM PYTHON SOURCE LINES 67-71 Tall Concatenation with Pandas ------------------------------ This approach loads all data into memory and concatenates along rows. .. GENERATED FROM PYTHON SOURCE LINES 71-78 .. code-block:: Python index_cols = ["x", "y", "z"] dfs = [pd.read_parquet(f) for f in input_files] tall_result_pandas = pd.concat(dfs, axis=0, ignore_index=True).set_index(index_cols) tall_result_pandas .. raw:: html
a b c
x y z
1 4 7 A 1.0 NaN
2 5 8 B 2.0 NaN
3 6 9 C 3.0 NaN
4 7 10 D 3.0 NaN
5 8 11 E 4.0 NaN
6 9 12 F 5.0 NaN
7 10 13 G 6.0 J
8 11 14 H 7.0 K
9 12 15 I 8.0 L


.. GENERATED FROM PYTHON SOURCE LINES 79-83 Tall Concatenation with Parq Tools ---------------------------------- This approach is more efficient for large datasets, as it processes data in chunks. .. GENERATED FROM PYTHON SOURCE LINES 83-91 .. code-block:: Python output_tall = temp_dir / "tall_concatenated.parquet" tall_concat = ParquetConcat(files=input_files, axis=0) tall_concat.concat_to_file(output_path=output_tall) tall_result = pd.read_parquet(output_tall).set_index(index_cols) tall_result .. raw:: html
a b c
x y z
1 4 7 A 1.0 None
2 5 8 B 2.0 None
3 6 9 C 3.0 None
4 7 10 D 3.0 None
5 8 11 E 4.0 None
6 9 12 F 5.0 None
7 10 13 G 6.0 J
8 11 14 H 7.0 K
9 12 15 I 8.0 L


.. GENERATED FROM PYTHON SOURCE LINES 92-96 Compare the Results ------------------- Ensure the outputs from both methods are equivalent. .. GENERATED FROM PYTHON SOURCE LINES 96-99 .. code-block:: Python pd.testing.assert_frame_equal(tall_result, tall_result_pandas) .. rst-class:: sphx-glr-script-out .. code-block:: none /home/runner/work/parq-tools/parq-tools/examples/03_tall_concatenation.py:97: FutureWarning: Mismatched null-like values None and nan found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise. pd.testing.assert_frame_equal(tall_result, tall_result_pandas) .. GENERATED FROM PYTHON SOURCE LINES 100-104 Tall Concatenation with Filters ------------------------------- You can also apply a filter expression during concatenation. .. GENERATED FROM PYTHON SOURCE LINES 104-115 .. code-block:: Python filter_query = "x > 4 and b >= 5.0" output_filtered_tall = temp_dir / "filtered_tall_concatenated.parquet" tall_concat.concat_to_file( output_path=output_filtered_tall, filter_query=filter_query, columns=["x", "y", "z", "a", "b"] ) filtered_tall_result = pd.read_parquet(output_filtered_tall).set_index(index_cols) filtered_tall_result .. raw:: html
a b c
x y z
6 9 12 F 5.0 None
7 10 13 G 6.0 None
8 11 14 H 7.0 None
9 12 15 I 8.0 None


.. GENERATED FROM PYTHON SOURCE LINES 116-120 Concatenate by function ----------------------- You can also concatenate with a function, rather than using the class directly. The same filtering options are available, and the function will handle the concatenation in a memory-efficient way. .. GENERATED FROM PYTHON SOURCE LINES 120-130 .. code-block:: Python # concatenate with the function from parq_tools import concat_parquet_files concat_parquet_files(files=input_files, output_path=output_tall.with_suffix('.by_function.parquet'), axis=0, index_columns=index_cols) # Read the filtered concatenated file filtered_wide_function_result = pd.read_parquet(output_tall.with_suffix('.by_function.parquet')).set_index( index_cols) filtered_wide_function_result .. raw:: html
a b c
x y z
1 4 7 A 1.0 None
2 5 8 B 2.0 None
3 6 9 C 3.0 None
4 7 10 D 3.0 None
5 8 11 E 4.0 None
6 9 12 F 5.0 None
7 10 13 G 6.0 J
8 11 14 H 7.0 K
9 12 15 I 8.0 L


.. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 0.094 seconds) .. _sphx_glr_download_auto_examples_03_tall_concatenation.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 03_tall_concatenation.ipynb <03_tall_concatenation.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 03_tall_concatenation.py <03_tall_concatenation.py>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_