Index Utilities

A simple example to demonstrate utilities related to index columns.

Note

Index columns as we know them in Pandas do not exist in a native Parquet file. However, if the Parquet file was created using Pandas, metadata is stored in the file so that the indexes can be restored when a round trip back to Pandas is completed.

The utilities demonstrated here are tools that mimic index operations one might use in Pandas.

import tempfile
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

from parq_tools import sort_parquet_file, reindex_parquet, validate_index_alignment
from parq_tools.utils.demo_block_model import create_demo_blockmodel

Create a dataset

Create a temporary Parquet file for demonstration. This example represents a 3D block model.

parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"

df: pd.DataFrame = create_demo_blockmodel(shape=(3, 3, 3), block_size=(1, 1, 1),
                                          corner=(-0.5, -0.5, -0.5))
df.to_parquet(parquet_file_path)
df

			c_order_xyz	f_order_zyx	depth
x	y	z
0.0	0.0	0.0	0	0	2.5
		1.0	1	9	1.5
		2.0	2	18	0.5
	1.0	0.0	3	3	2.5
		1.0	4	12	1.5
		2.0	5	21	0.5
	2.0	0.0	6	6	2.5
		1.0	7	15	1.5
		2.0	8	24	0.5
1.0	0.0	0.0	9	1	2.5
		1.0	10	10	1.5
		2.0	11	19	0.5
	1.0	0.0	12	4	2.5
		1.0	13	13	1.5
		2.0	14	22	0.5
	2.0	0.0	15	7	2.5
		1.0	16	16	1.5
		2.0	17	25	0.5
2.0	0.0	0.0	18	2	2.5
		1.0	19	11	1.5
		2.0	20	20	0.5
	1.0	0.0	21	5	2.5
		1.0	22	14	1.5
		2.0	23	23	0.5
	2.0	0.0	24	8	2.5
		1.0	25	17	1.5
		2.0	26	26	0.5
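Judging from the table above, c_order_xyz and f_order_zyx appear to be the flattened positions of each block in C (row-major) and Fortran (column-major) order. The NumPy sketch below reproduces those values for a 3 x 3 x 3 grid; it is an interpretation of the output, not the code used by create_demo_blockmodel.

```python
import numpy as np

shape = (3, 3, 3)  # number of blocks along x, y, z

# Integer block positions (i, j, k) along x, y, z, listed with x varying slowest.
i, j, k = np.indices(shape).reshape(3, -1)

# C order: the last axis (z) varies fastest.
c_order = np.ravel_multi_index((i, j, k), shape, order="C")
# Fortran order: the first axis (x) varies fastest.
f_order = np.ravel_multi_index((i, j, k), shape, order="F")

for row in list(zip(i, j, k, c_order, f_order))[:4]:
    print(row)
```

With the rows sorted by x, y, z, the C-order position is simply the row number, while the Fortran-order position jumps by 9 for each step in z.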

Randomise the order of the DataFrame and persist to Parquet

df = df.sample(frac=1)
df.to_parquet(parquet_file_path)
df_randomised = pd.read_parquet(parquet_file_path)
df_randomised

			c_order_xyz	f_order_zyx	depth
x	y	z
2.0	0.0	1.0	19	11	1.5
0.0	1.0	1.0	4	12	1.5
2.0	1.0	1.0	22	14	1.5
2.0	1.0	2.0	23	23	0.5
1.0	2.0	2.0	17	25	0.5
	0.0	2.0	11	19	0.5
	1.0	1.0	13	13	1.5
0.0	0.0	1.0	1	9	1.5
2.0	2.0	0.0	24	8	2.5
1.0	0.0	1.0	10	10	1.5
1.0	1.0	0.0	12	4	2.5
0.0	0.0	2.0	2	18	0.5
	2.0	2.0	8	24	0.5
	2.0	1.0	7	15	1.5
2.0	2.0	1.0	25	17	1.5
0.0	2.0	0.0	6	6	2.5
0.0	1.0	0.0	3	3	2.5
2.0	1.0	0.0	21	5	2.5
2.0	2.0	2.0	26	26	0.5
0.0	1.0	2.0	5	21	0.5
0.0	0.0	0.0	0	0	2.5
1.0	1.0	2.0	14	22	0.5
1.0	0.0	0.0	9	1	2.5
2.0	0.0	0.0	18	2	2.5
1.0	2.0	0.0	15	7	2.5
2.0	0.0	2.0	20	20	0.5
1.0	2.0	1.0	16	16	1.5

Sort by the index

We can sort the Parquet file by the index columns to mimic the behavior of sorting by the index in Pandas.

index_cols = ["x", "y", "z"]
sorted_file_path: Path = parquet_file_path.parent / "sorted_example_data.parquet"
sort_parquet_file(parquet_file_path, output_path=sorted_file_path,
                  columns=index_cols, chunk_size=100_000)

# Read the sorted Parquet file
sorted_df = pd.read_parquet(sorted_file_path)
sorted_df

Sorting parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Sorting parquet file: 100%|██████████| 1/1 [00:00<00:00, 1448.31it/s]

			c_order_xyz	f_order_zyx	depth
x	y	z
0.0	0.0	0.0	0	0	2.5
		1.0	1	9	1.5
		2.0	2	18	0.5
	1.0	0.0	3	3	2.5
		1.0	4	12	1.5
		2.0	5	21	0.5
	2.0	0.0	6	6	2.5
		1.0	7	15	1.5
		2.0	8	24	0.5
1.0	0.0	0.0	9	1	2.5
		1.0	10	10	1.5
		2.0	11	19	0.5
	1.0	0.0	12	4	2.5
		1.0	13	13	1.5
		2.0	14	22	0.5
	2.0	0.0	15	7	2.5
		1.0	16	16	1.5
		2.0	17	25	0.5
2.0	0.0	0.0	18	2	2.5
		1.0	19	11	1.5
		2.0	20	20	0.5
	1.0	0.0	21	5	2.5
		1.0	22	14	1.5
		2.0	23	23	0.5
	2.0	0.0	24	8	2.5
		1.0	25	17	1.5
		2.0	26	26	0.5

Reindexing

We can reindex a Parquet file so that its records follow the order of a new index. This is useful for aligning a dataset with another prior to concatenation. Reindexing reorders existing records and adds empty records for any entries in the new index that are missing from the original.

To demonstrate this, we will create another Parquet file with a subset of the original records that are unordered.

unsorted_subset_file_path: Path = parquet_file_path.parent / "unsorted_subset.parquet"
df_randomised.sample(frac=0.5).to_parquet(unsorted_subset_file_path)
df_unsorted_subset: pd.DataFrame = pd.read_parquet(unsorted_subset_file_path)
df_unsorted_subset

			c_order_xyz	f_order_zyx	depth
x	y	z
0.0	2.0	1.0	7	15	1.5
0.0	0.0	0.0	0	0	2.5
1.0	0.0	2.0	11	19	0.5
0.0	2.0	0.0	6	6	2.5
2.0	1.0	1.0	22	14	1.5
0.0	1.0	1.0	4	12	1.5
0.0	1.0	2.0	5	21	0.5
1.0	0.0	0.0	9	1	2.5
2.0	0.0	2.0	20	20	0.5
0.0	0.0	2.0	2	18	0.5
2.0	1.0	2.0	23	23	0.5
2.0	2.0	0.0	24	8	2.5
1.0	1.0	2.0	14	22	0.5
2.0	0.0	1.0	19	11	1.5

Reindex the unsorted subset to match the original index order

reindexed_file_path: Path = parquet_file_path.parent / "reindexed_subset.parquet"
reindex_parquet(unsorted_subset_file_path, output_path=reindexed_file_path,
                new_index=pa.Table.from_pandas(sorted_df.reset_index()[index_cols]))
df_reindexed: pd.DataFrame = pd.read_parquet(reindexed_file_path).set_index(index_cols)
df_reindexed

Reindexing parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Reindexing parquet file: 100%|██████████| 1/1 [00:00<00:00, 723.53it/s]

Sorting parquet file:   0%|          | 0/1 [00:00<?, ?it/s]
Sorting parquet file: 100%|██████████| 1/1 [00:00<00:00, 1818.87it/s]

			c_order_xyz	f_order_zyx	depth
x	y	z
0.0	0.0	0.0	0.0	0.0	2.5
		1.0	NaN	NaN	NaN
		2.0	2.0	18.0	0.5
	1.0	0.0	NaN	NaN	NaN
		1.0	4.0	12.0	1.5
		2.0	5.0	21.0	0.5
	2.0	0.0	6.0	6.0	2.5
		1.0	7.0	15.0	1.5
		2.0	NaN	NaN	NaN
1.0	0.0	0.0	9.0	1.0	2.5
		1.0	NaN	NaN	NaN
		2.0	11.0	19.0	0.5
	1.0	0.0	NaN	NaN	NaN
		1.0	NaN	NaN	NaN
		2.0	14.0	22.0	0.5
	2.0	0.0	NaN	NaN	NaN
		1.0	NaN	NaN	NaN
		2.0	NaN	NaN	NaN
2.0	0.0	0.0	NaN	NaN	NaN
		1.0	19.0	11.0	1.5
		2.0	20.0	20.0	0.5
	1.0	0.0	NaN	NaN	NaN
		1.0	22.0	14.0	1.5
		2.0	23.0	23.0	0.5
	2.0	0.0	24.0	8.0	2.5
		1.0	NaN	NaN	NaN
		2.0	NaN	NaN	NaN
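The result above mirrors what DataFrame.reindex does in Pandas. The sketch below uses a small made-up frame to show the same two effects: records are reordered to follow the new index, and missing entries are filled with NaN, which converts integer columns to float.

```python
import pandas as pd

idx_names = ["x", "y"]
full_index = pd.MultiIndex.from_product([[0.0, 1.0], [0.0, 1.0]], names=idx_names)

# An unordered subset: two of the four index entries, with an integer column.
subset = pd.DataFrame(
    {"value": [3, 0]},
    index=pd.MultiIndex.from_tuples([(1.0, 1.0), (0.0, 0.0)], names=idx_names),
)

# Reorder to the full index; entries absent from the subset become NaN.
reindexed = subset.reindex(full_index)
print(reindexed)
print(reindexed["value"].dtype)
```

This is likely why c_order_xyz and f_order_zyx display as floats in the reindexed table: NumPy integer columns cannot hold missing values, so Pandas promotes them to float64.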

Validate index alignment

Finally, we pass the sorted file and the reindexed file to validate_index_alignment to check that their index columns align.

datasets: list[ds.Dataset] = [ds.dataset(pf) for pf in [sorted_file_path, reindexed_file_path]]
validate_index_alignment(datasets=datasets, index_columns=index_cols)

Validating index alignment:   0%|          | 0/1 [00:00<?, ?it/s]
Validating index alignment: 100%|██████████| 1/1 [00:00<00:00, 7557.30it/s]
