.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/02_wide_concatenation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_02_wide_concatenation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_02_wide_concatenation.py:


Wide Concatenation
==================

This example demonstrates how to concatenate multiple Parquet files column-wise
(a wide concatenation) into a new Parquet file using the ``parq_tools`` library,
and compares the result with ``pandas.concat``.

.. GENERATED FROM PYTHON SOURCE LINES 8-60

.. code-block:: Python


    import pandas as pd
    from parq_tools import ParquetConcat
    from pathlib import Path
    import tempfile

    # Create a temporary directory for the output files
    temp_dir = Path(tempfile.gettempdir()) / "parquet_concat_example"
    temp_dir.mkdir(parents=True, exist_ok=True)

    # Define the input Parquet files
    input_files = [
        temp_dir / "example_data1.parquet",
        temp_dir / "example_data2.parquet",
        temp_dir / "example_data3.parquet"
    ]


    # Create example Parquet files
    def create_example_parquet(file_path: Path, data: dict):
        df = pd.DataFrame(data)
        df.to_parquet(file_path, index=False)


    # Example data for the Parquet files
    data1 = {
        "x": [1, 2, 3],
        "y": [4, 5, 6],
        "z": [7, 8, 9],
        "a": ["A", "B", "C"]
    }
    data2 = {
        "x": [1, 2, 3],
        "y": [4, 5, 6],
        "z": [7, 8, 9],
        "b": [6.0, 7.0, 8.0],
        "c": ["G", "H", "I"]
    }
    data3 = {
        "x": [1, 2, 3],
        "y": [4, 5, 6],
        "z": [7, 8, 9],
        "d": ["J", "K", "L"]
    }

    # Create the Parquet files
    create_example_parquet(input_files[0], data1)
    create_example_parquet(input_files[1], data2)
    create_example_parquet(input_files[2], data3)


.. GENERATED FROM PYTHON SOURCE LINES 61-64

Perform Wide Concatenation with Pandas
--------------------------------------
This approach works as long as the data fits in memory, because pandas loads every file in full.

.. GENERATED FROM PYTHON SOURCE LINES 64-71

.. code-block:: Python


    index_cols = ["x", "y", "z"]
    dfs = [pd.read_parquet(file).set_index(index_cols) for file in input_files]
    wide_result_pandas = pd.concat(dfs, axis=1)
    wide_result_pandas


==  ==  ==  ==  ===  ==  ==
x   y   z   a   b    c   d
==  ==  ==  ==  ===  ==  ==
1   4   7   A   6.0  G   J
2   5   8   B   7.0  H   K
3   6   9   C   8.0  I   L
==  ==  ==  ==  ===  ==  ==
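
The pandas result above relies on ``pd.concat(..., axis=1)`` aligning rows by index. The following minimal sketch uses small in-memory frames (``left`` and ``right`` are illustrative names, not part of the example script) to show what happens when one input lacks a row: by default the union of index labels is kept and the gap is filled with ``NaN``.

```python
import pandas as pd

index_cols = ["x", "y", "z"]

# Two small frames sharing the index columns; `right` has no row for x=3.
left = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "a": ["A", "B", "C"]}
)
right = pd.DataFrame({"x": [1, 2], "y": [4, 5], "z": [7, 8], "b": [6.0, 7.0]})

# axis=1 places the columns side by side and aligns rows on the index.
# The default join="outer" keeps the union of index labels.
aligned = pd.concat(
    [left.set_index(index_cols), right.set_index(index_cols)], axis=1
)

# The row that exists only in `left` is kept; its `b` value is NaN.
missing_b = aligned["b"].isna().tolist()
print(aligned)
```

Passing ``join="inner"`` instead would keep only the rows present in every input.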


.. GENERATED FROM PYTHON SOURCE LINES 72-75

Perform Wide Concatenation with Parq Tools
------------------------------------------
This approach is more efficient for large datasets because it processes the data in chunks rather than loading it all at once.

.. GENERATED FROM PYTHON SOURCE LINES 75-88

.. code-block:: Python


    output_wide = temp_dir / "wide_concatenated.parquet"

    # Initialize the ParquetConcat class for wide concatenation
    wide_concat = ParquetConcat(files=input_files, axis=1, index_columns=index_cols)

    # Perform the concatenation
    wide_concat.concat_to_file(output_path=output_wide)

    # Read the concatenated file
    wide_result = pd.read_parquet(output_wide).set_index(index_cols)
    wide_result


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating index alignment: 0%| | 0/1 [00:00


==  ==  ==  ==  ===  ==  ==
x   y   z   a   b    c   d
==  ==  ==  ==  ===  ==  ==
1   4   7   A   6.0  G   J
2   5   8   B   7.0  H   K
3   6   9   C   8.0  I   L
==  ==  ==  ==  ===  ==  ==
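
The progress bar in the script output reads "Validating index alignment", which suggests the index columns of the inputs are checked before their columns are joined. How ``parq_tools`` performs that check is not shown in this example. As a rough, pandas-only illustration of the idea, the sketch below uses a hypothetical helper, ``indexes_aligned``, that is not part of ``parq_tools``:

```python
import pandas as pd

index_cols = ["x", "y", "z"]


def indexes_aligned(frames, index_cols):
    """Return True if every frame has the same index-column values in the same order.

    Hypothetical helper for illustration only; not part of parq_tools.
    """
    reference = frames[0][index_cols].reset_index(drop=True)
    return all(
        reference.equals(frame[index_cols].reset_index(drop=True))
        for frame in frames[1:]
    )


frame_a = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "a": ["A", "B", "C"]}
)
frame_b = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "d": ["J", "K", "L"]}
)
frame_c = frame_b.iloc[::-1]  # same rows, reversed order

same_order = indexes_aligned([frame_a, frame_b], index_cols)
reversed_order = indexes_aligned([frame_a, frame_c], index_cols)
print(same_order, reversed_order)
```

A check of this kind matters when files are joined positionally in chunks, since rows in different orders would otherwise be paired incorrectly.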


.. GENERATED FROM PYTHON SOURCE LINES 89-90

Compare the results to confirm that both approaches produce the same frame.

.. GENERATED FROM PYTHON SOURCE LINES 90-92

.. code-block:: Python


    pd.testing.assert_frame_equal(wide_result, wide_result_pandas)


.. GENERATED FROM PYTHON SOURCE LINES 93-98

Wide concatenation with filters
-------------------------------
You can also apply filters during the concatenation process.
The filter expression is pandas-like and can include index and non-index columns.
This re-uses the ``ParquetConcat`` object created earlier to filter and concatenate the data.

.. GENERATED FROM PYTHON SOURCE LINES 98-111

.. code-block:: Python


    filter_query = "x > 2 and b > 6"
    output_filtered_wide = temp_dir / "filtered_wide_concatenated.parquet"

    # Perform the concatenation with a filter
    wide_concat.concat_to_file(
        output_path=output_filtered_wide,
        filter_query=filter_query,
        columns=["a", "b", "d"]
    )

    # Read the filtered concatenated file
    filtered_wide_result = pd.read_parquet(output_filtered_wide).set_index(index_cols)
    filtered_wide_result


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating index alignment: 0%| | 0/1 [00:00


==  ==  ==  ==  ===  ==
x   y   z   a   b    d
==  ==  ==  ==  ===  ==
3   6   9   C   8.0  L
==  ==  ==  ==  ===  ==
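
For comparison, the same filter and column selection can be expressed in plain pandas with ``DataFrame.query``, which accepts named index levels as well as column names. This is an in-memory sketch using the example data (``frames``, ``wide`` and ``filtered`` are illustrative names), not a description of how ``parq_tools`` evaluates the expression.

```python
import pandas as pd

index_cols = ["x", "y", "z"]

# The same three inputs as the example, built in memory instead of read from disk.
frames = [
    pd.DataFrame(
        {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "a": ["A", "B", "C"]}
    ),
    pd.DataFrame(
        {
            "x": [1, 2, 3],
            "y": [4, 5, 6],
            "z": [7, 8, 9],
            "b": [6.0, 7.0, 8.0],
            "c": ["G", "H", "I"],
        }
    ),
    pd.DataFrame(
        {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "d": ["J", "K", "L"]}
    ),
]

wide = pd.concat([frame.set_index(index_cols) for frame in frames], axis=1)

# `x` is an index level and `b` is a column; query() resolves both by name.
filtered = wide.query("x > 2 and b > 6")[["a", "b", "d"]]
print(filtered)
```

Unlike the chunked approach, this variant holds the full concatenated frame in memory before filtering.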


.. GENERATED FROM PYTHON SOURCE LINES 112-116

Concatenate by function
-----------------------
You can also call the ``concat_parquet_files`` function instead of using the class directly.
The same filtering options are available, and the function handles the concatenation in a
memory-efficient way.

.. GENERATED FROM PYTHON SOURCE LINES 116-126

.. code-block:: Python


    # Concatenate with the function
    from parq_tools import concat_parquet_files

    concat_parquet_files(files=input_files, output_path=output_wide.with_suffix('.by_function.parquet'), axis=1,
                         index_columns=index_cols)

    # Read the concatenated file (no filter was applied here)
    wide_function_result = pd.read_parquet(output_wide.with_suffix('.by_function.parquet')).set_index(
        index_cols)
    wide_function_result


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Validating index alignment: 0%| | 0/1 [00:00


==  ==  ==  ==  ===  ==  ==
x   y   z   a   b    c   d
==  ==  ==  ==  ===  ==  ==
1   4   7   A   6.0  G   J
2   5   8   B   7.0  H   K
3   6   9   C   8.0  I   L
==  ==  ==  ==  ===  ==  ==


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.070 seconds)


.. _sphx_glr_download_auto_examples_02_wide_concatenation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_wide_concatenation.ipynb <02_wide_concatenation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_wide_concatenation.py <02_wide_concatenation.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_