Index Utilities

A simple example to demonstrate utilities related to index columns.

Note

Index columns as we know them in Pandas do not exist in a native Parquet file. However, if the Parquet file was created by Pandas, metadata is preserved so that the indexes can be restored on the round-trip back to Pandas.

The utilities demonstrated here mimic index operations that one may use in Pandas.

import tempfile
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

from parq_tools import sort_parquet_file, reindex_parquet, validate_index_alignment
from parq_tools.utils.demo_block_model import create_demo_blockmodel

Create a dataset

Create a temporary parquet file for demonstration. This example represents a 3D block model.

parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"

df: pd.DataFrame = create_demo_blockmodel(shape=(3, 3, 3), block_size=(1, 1, 1),
                                          corner=(-0.5, -0.5, -0.5))
df.to_parquet(parquet_file_path)
df
c_order_xyz f_order_zyx depth
x y z
0.0 0.0 0.0 0 0 2.5
1.0 1 9 1.5
2.0 2 18 0.5
1.0 0.0 3 3 2.5
1.0 4 12 1.5
2.0 5 21 0.5
2.0 0.0 6 6 2.5
1.0 7 15 1.5
2.0 8 24 0.5
1.0 0.0 0.0 9 1 2.5
1.0 10 10 1.5
2.0 11 19 0.5
1.0 0.0 12 4 2.5
1.0 13 13 1.5
2.0 14 22 0.5
2.0 0.0 15 7 2.5
1.0 16 16 1.5
2.0 17 25 0.5
2.0 0.0 0.0 18 2 2.5
1.0 19 11 1.5
2.0 20 20 0.5
1.0 0.0 21 5 2.5
1.0 22 14 1.5
2.0 23 23 0.5
2.0 0.0 24 8 2.5
1.0 25 17 1.5
2.0 26 26 0.5


Randomise the order of the DataFrame and persist to Parquet

df = df.sample(frac=1)
df.to_parquet(parquet_file_path)
df_randomised = pd.read_parquet(parquet_file_path)
df_randomised
c_order_xyz f_order_zyx depth
x y z
2.0 0.0 1.0 19 11 1.5
0.0 1.0 1.0 4 12 1.5
2.0 1.0 1.0 22 14 1.5
2.0 23 23 0.5
1.0 2.0 2.0 17 25 0.5
0.0 2.0 11 19 0.5
1.0 1.0 13 13 1.5
0.0 0.0 1.0 1 9 1.5
2.0 2.0 0.0 24 8 2.5
1.0 0.0 1.0 10 10 1.5
1.0 0.0 12 4 2.5
0.0 0.0 2.0 2 18 0.5
2.0 2.0 8 24 0.5
1.0 7 15 1.5
2.0 2.0 1.0 25 17 1.5
0.0 2.0 0.0 6 6 2.5
1.0 0.0 3 3 2.5
2.0 1.0 0.0 21 5 2.5
2.0 2.0 26 26 0.5
0.0 1.0 2.0 5 21 0.5
0.0 0.0 0 0 2.5
1.0 1.0 2.0 14 22 0.5
0.0 0.0 9 1 2.5
2.0 0.0 0.0 18 2 2.5
1.0 2.0 0.0 15 7 2.5
2.0 0.0 2.0 20 20 0.5
1.0 2.0 1.0 16 16 1.5


Sort by the index

We can sort the Parquet file by the index columns, mimicking the behaviour of sorting by index in Pandas.

index_cols = ["x", "y", "z"]
sorted_file_path: Path = parquet_file_path.parent / "sorted_example_data.parquet"
sort_parquet_file(parquet_file_path, output_path=sorted_file_path,
                  columns=index_cols, chunk_size=100_000)

# Read the sorted Parquet file
sorted_df = pd.read_parquet(sorted_file_path)
sorted_df
Sorting parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Sorting parquet file: 100%|██████████| 1/1 [00:00<00:00, 1448.31it/s]
c_order_xyz f_order_zyx depth
x y z
0.0 0.0 0.0 0 0 2.5
1.0 1 9 1.5
2.0 2 18 0.5
1.0 0.0 3 3 2.5
1.0 4 12 1.5
2.0 5 21 0.5
2.0 0.0 6 6 2.5
1.0 7 15 1.5
2.0 8 24 0.5
1.0 0.0 0.0 9 1 2.5
1.0 10 10 1.5
2.0 11 19 0.5
1.0 0.0 12 4 2.5
1.0 13 13 1.5
2.0 14 22 0.5
2.0 0.0 15 7 2.5
1.0 16 16 1.5
2.0 17 25 0.5
2.0 0.0 0.0 18 2 2.5
1.0 19 11 1.5
2.0 20 20 0.5
1.0 0.0 21 5 2.5
1.0 22 14 1.5
2.0 23 23 0.5
2.0 0.0 24 8 2.5
1.0 25 17 1.5
2.0 26 26 0.5
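For comparison, here is the equivalent in-memory operation in pure Pandas (a sketch that does not use parq_tools): shuffling a multi-indexed frame and restoring order with sort_index.

```python
import pandas as pd

# Build a tiny frame with a 2-level index, then shuffle it
idx = pd.MultiIndex.from_product([[0.0, 1.0], [0.0, 1.0]], names=["x", "y"])
df = pd.DataFrame({"v": range(4)}, index=idx).sample(frac=1, random_state=0)

# sort_index restores lexicographic order on the index levels,
# which is what sort_parquet_file achieves on disk without
# loading the whole file into memory
sorted_df = df.sort_index()
print(sorted_df.index.is_monotonic_increasing)  # True
```

The chunk_size argument to sort_parquet_file exists precisely because the on-disk variant processes the file in batches rather than materialising it all at once.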


Reindexing

We can reindex the dataset to change the order of the index columns. This is useful if we want to align one dataset with another prior to concatenation. Reindexing reorders existing records, and adds empty (null-filled) records for any index entries that are absent from the original data.
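The semantics mirror Pandas' own reindex, as this small in-memory sketch (not using parq_tools) shows: rows present in the target index but missing from the data come back as NaN.

```python
import pandas as pd

# A frame indexed by x, missing one of the target index values
df = pd.DataFrame({"v": [1.0, 3.0]}, index=pd.Index([0.0, 2.0], name="x"))

# Reindex to a larger index: existing rows are reordered,
# missing rows are filled with NaN (the "empty records")
target = pd.Index([0.0, 1.0, 2.0], name="x")
out = df.reindex(target)
print(out["v"].tolist())  # [1.0, nan, 3.0]
```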

To demonstrate this, we will create another Parquet file containing an unordered subset of the original records.

unsorted_subset_file_path: Path = parquet_file_path.parent / "unsorted_subset.parquet"
df_randomised.sample(frac=0.5).to_parquet(unsorted_subset_file_path)
df_unsorted_subset: pd.DataFrame = pd.read_parquet(unsorted_subset_file_path)
df_unsorted_subset
c_order_xyz f_order_zyx depth
x y z
0.0 2.0 1.0 7 15 1.5
0.0 0.0 0 0 2.5
1.0 0.0 2.0 11 19 0.5
0.0 2.0 0.0 6 6 2.5
2.0 1.0 1.0 22 14 1.5
0.0 1.0 1.0 4 12 1.5
2.0 5 21 0.5
1.0 0.0 0.0 9 1 2.5
2.0 0.0 2.0 20 20 0.5
0.0 0.0 2.0 2 18 0.5
2.0 1.0 2.0 23 23 0.5
2.0 0.0 24 8 2.5
1.0 1.0 2.0 14 22 0.5
2.0 0.0 1.0 19 11 1.5


Reindex the unsorted subset to match the original index order

reindexed_file_path: Path = parquet_file_path.parent / "reindexed_subset.parquet"
reindex_parquet(unsorted_subset_file_path, output_path=reindexed_file_path,
                new_index=pa.Table.from_pandas(sorted_df.reset_index()[index_cols]))
df_reindexed: pd.DataFrame = pd.read_parquet(reindexed_file_path).set_index(index_cols)
df_reindexed
Reindexing parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Reindexing parquet file: 100%|██████████| 1/1 [00:00<00:00, 723.53it/s]

Sorting parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Sorting parquet file: 100%|██████████| 1/1 [00:00<00:00, 1818.87it/s]
c_order_xyz f_order_zyx depth
x y z
0.0 0.0 0.0 0.0 0.0 2.5
1.0 NaN NaN NaN
2.0 2.0 18.0 0.5
1.0 0.0 NaN NaN NaN
1.0 4.0 12.0 1.5
2.0 5.0 21.0 0.5
2.0 0.0 6.0 6.0 2.5
1.0 7.0 15.0 1.5
2.0 NaN NaN NaN
1.0 0.0 0.0 9.0 1.0 2.5
1.0 NaN NaN NaN
2.0 11.0 19.0 0.5
1.0 0.0 NaN NaN NaN
1.0 NaN NaN NaN
2.0 14.0 22.0 0.5
2.0 0.0 NaN NaN NaN
1.0 NaN NaN NaN
2.0 NaN NaN NaN
2.0 0.0 0.0 NaN NaN NaN
1.0 19.0 11.0 1.5
2.0 20.0 20.0 0.5
1.0 0.0 NaN NaN NaN
1.0 22.0 14.0 1.5
2.0 23.0 23.0 0.5
2.0 0.0 24.0 8.0 2.5
1.0 NaN NaN NaN
2.0 NaN NaN NaN


Validate index alignment

Finally, we can use the validate_index_alignment function to confirm that the sorted file and the reindexed file share an identical index.

datasets: list[ds.Dataset] = [ds.dataset(pf) for pf in [sorted_file_path, reindexed_file_path]]
validate_index_alignment(datasets=datasets, index_columns=index_cols)
Validating index alignment:   0%|          | 0/1 [00:00<?, ?it/s]
Validating index alignment: 100%|██████████| 1/1 [00:00<00:00, 7557.30it/s]
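In pure Pandas, the analogous check is index equality, as in this sketch (not using parq_tools): two frames are aligned when their index values match in both content and order, regardless of their columns.

```python
import pandas as pd

# Two frames over the same index, with different columns
idx = pd.Index([0.0, 1.0, 2.0], name="x")
a = pd.DataFrame({"v": [1, 2, 3]}, index=idx)
b = pd.DataFrame({"w": [9, 8, 7]}, index=idx)

# Aligned: index values and order match exactly
print(a.index.equals(b.index))  # True

# Not aligned: same values, different order
print(a.index.equals(b.sort_index(ascending=False).index))  # False
```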

Total running time of the script: (0 minutes 0.057 seconds)

Gallery generated by Sphinx-Gallery