Tall Concatenation

This example demonstrates how to concatenate multiple Parquet files along rows (tall concatenation) using the parq_tools library. It also shows how to achieve the same result with pandas.concat and verifies that the outputs are equivalent.

Tall concatenation combines all rows from each file, aligning columns by name. Columns missing in some files are filled with nulls.
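
As a minimal illustration of that alignment, independent of parq_tools (the frames below are made up for this sketch):

# Minimal sketch (made-up frames): the second frame has no "b" column,
# so pd.concat aligns on "x" and fills "b" with nulls for those rows.
import pandas as pd

left = pd.DataFrame({"x": [1, 2], "b": [10.0, 20.0]})
right = pd.DataFrame({"x": [3, 4]})
pd.concat([left, right], axis=0, ignore_index=True)
#    x     b
# 0  1  10.0
# 1  2  20.0
# 2  3   NaN
# 3  4   NaN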

import pandas as pd
from parq_tools import ParquetConcat
from pathlib import Path
import tempfile

Create Example Parquet Files

We create three example Parquet files with overlapping and unique columns.

def create_example_parquet(file_path: Path, data: dict):
    df = pd.DataFrame(data)
    df.to_parquet(file_path, index=False)


temp_dir = Path(tempfile.gettempdir()) / "parquet_tall_concat_example"
temp_dir.mkdir(parents=True, exist_ok=True)

data1 = {
    "x": [1, 2, 3],
    "y": [4, 5, 6],
    "z": [7, 8, 9],
    "a": ["A", "B", "C"],
    "b": [1.0, 2.0, 3.0]
}
data2 = {
    "x": [4, 5, 6],
    "y": [7, 8, 9],
    "z": [10, 11, 12],
    "a": ["D", "E", "F"],
    "b": [3.0, 4.0, 5.0]
}
data3 = {
    "x": [7, 8, 9],
    "y": [10, 11, 12],
    "z": [13, 14, 15],
    "a": ["G", "H", "I"],
    "b": [6.0, 7.0, 8.0],
    "c": ["J", "K", "L"]
}

input_files = [
    temp_dir / "example_data1.parquet",
    temp_dir / "example_data2.parquet",
    temp_dir / "example_data3.parquet"
]

create_example_parquet(input_files[0], data1)
create_example_parquet(input_files[1], data2)
create_example_parquet(input_files[2], data3)

Tall Concatenation with Pandas

This approach loads all data into memory and concatenates along rows.

index_cols = ["x", "y", "z"]

dfs = [pd.read_parquet(f) for f in input_files]
tall_result_pandas = pd.concat(dfs, axis=0, ignore_index=True).set_index(index_cols)
tall_result_pandas
           a    b    c
x y  z
1 4  7     A  1.0  NaN
2 5  8     B  2.0  NaN
3 6  9     C  3.0  NaN
4 7  10    D  3.0  NaN
5 8  11    E  4.0  NaN
6 9  12    F  5.0  NaN
7 10 13    G  6.0    J
8 11 14    H  7.0    K
9 12 15    I  8.0    L


Tall Concatenation with Parq Tools

This approach is more efficient for large datasets, as it processes data in chunks.

output_tall = temp_dir / "tall_concatenated.parquet"
tall_concat = ParquetConcat(files=input_files, axis=0)
tall_concat.concat_to_file(output_path=output_tall)

tall_result = pd.read_parquet(output_tall).set_index(index_cols)
tall_result
            a    b     c
x y  z
1 4  7      A  1.0  None
2 5  8      B  2.0  None
3 6  9      C  3.0  None
4 7  10     D  3.0  None
5 8  11     E  4.0  None
6 9  12     F  5.0  None
7 10 13     G  6.0     J
8 11 14     H  7.0     K
9 12 15     I  8.0     L
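
If the concatenated output were too large to read back with pd.read_parquet in one go, it can also be inspected batch by batch. The sketch below uses pyarrow's standard ParquetFile.iter_batches (plain pyarrow, not a parq_tools API):

# Sketch: stream the concatenated file in fixed-size batches instead of
# loading everything at once (standard pyarrow API, not part of parq_tools).
import pyarrow.parquet as pq

pf = pq.ParquetFile(output_tall)
total_rows = 0
for batch in pf.iter_batches(batch_size=1024):
    total_rows += batch.num_rows
total_rows  # 9 for this small example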


Compare the Results

Ensure the outputs from both methods are equivalent. The only difference is how the missing values in column c are represented: parq_tools writes None where pandas uses NaN, which is what triggers the FutureWarning below.

pd.testing.assert_frame_equal(tall_result, tall_result_pandas)
/home/runner/work/parq-tools/parq-tools/examples/03_tall_concatenation.py:97: FutureWarning: Mismatched null-like values None and nan found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
  pd.testing.assert_frame_equal(tall_result, tall_result_pandas)
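
To make the comparison robust to that difference, you can normalise the null representation before comparing; a minimal sketch:

# Replace None with NaN so the comparison does not depend on pandas
# treating the two null-like values as equal.
normalised = tall_result.where(tall_result.notna())
pd.testing.assert_frame_equal(normalised, tall_result_pandas)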

Tall Concatenation with Filters

You can also apply a filter expression during concatenation.

filter_query = "x > 4 and b >= 5.0"
output_filtered_tall = temp_dir / "filtered_tall_concatenated.parquet"
tall_concat.concat_to_file(
    output_path=output_filtered_tall,
    filter_query=filter_query,
    columns=["x", "y", "z", "a", "b"]
)
filtered_tall_result = pd.read_parquet(output_filtered_tall).set_index(index_cols)
filtered_tall_result
            a    b     c
x y  z
6 9  12     F  5.0  None
7 10 13     G  6.0  None
8 11 14     H  7.0  None
9 12 15     I  8.0  None
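
As a quick cross-check, the same predicate can be applied in pandas with DataFrame.query and the selected rows compared. This sketch compares only the (x, y, z) index values, since the column handling of the two outputs differs as shown above:

# Cross-check the filter against pandas: select the same rows with
# DataFrame.query and compare the resulting (x, y, z) index values.
expected_rows = (
    tall_result_pandas.reset_index()
    .query(filter_query)
    .set_index(index_cols)
)
assert filtered_tall_result.index.equals(expected_rows.index)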


Concatenate by Function

You can also concatenate with a function, rather than using the class directly. The same filtering options are available, and the function will handle the concatenation in a memory-efficient way.

# Concatenate with the convenience function
from parq_tools import concat_parquet_files

output_by_function = output_tall.with_suffix('.by_function.parquet')
concat_parquet_files(files=input_files, output_path=output_by_function,
                     axis=0, index_columns=index_cols)

# Read the concatenated file back
tall_function_result = pd.read_parquet(output_by_function).set_index(index_cols)
tall_function_result
            a    b     c
x y  z
1 4  7      A  1.0  None
2 5  8      B  2.0  None
3 6  9      C  3.0  None
4 7  10     D  3.0  None
5 8  11     E  4.0  None
6 9  12     F  5.0  None
7 10 13     G  6.0     J
8 11 14     H  7.0     K
9 12 15     I  8.0     L
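
Since the same filtering options are available here, a filtered run through the function looks like the sketch below. The filter_query and columns keyword names are assumed to match ParquetConcat.concat_to_file; check the parq_tools API reference for the exact signature.

# Sketch only: filter_query and columns are assumed to be forwarded as the
# same options as ParquetConcat.concat_to_file; verify against the docs.
filtered_by_function = output_tall.with_suffix('.filtered_by_function.parquet')
concat_parquet_files(
    files=input_files,
    output_path=filtered_by_function,
    axis=0,
    index_columns=index_cols,
    filter_query=filter_query,
    columns=["x", "y", "z", "a", "b"],
)
pd.read_parquet(filtered_by_function).set_index(index_cols)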


Total running time of the script: (0 minutes 0.094 seconds)
