.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/05_index_utilities.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_05_index_utilities.py:

Index Utilities
===============

A simple example to demonstrate utilities related to `index` columns.

.. note::
    Index columns as we know them in Pandas do not exist in a native Parquet file.
    However, if the Parquet file was created using Pandas, metadata is preserved
    so that the indexes are restored on a round-trip back to Pandas.
    The utilities demonstrated here are tools that mimic index operations that one may use in Pandas.

.. GENERATED FROM PYTHON SOURCE LINES 16-26

.. code-block:: Python

    import tempfile

    import pandas as pd
    import pyarrow as pa
    import pyarrow.dataset as ds
    from pathlib import Path

    from parq_tools import sort_parquet_file, reindex_parquet, validate_index_alignment
    from parq_tools.utils.demo_block_model import create_demo_blockmodel

.. GENERATED FROM PYTHON SOURCE LINES 27-31

Create a dataset
----------------

Create a temporary Parquet file for demonstration.
This example represents a 3D block model.

.. GENERATED FROM PYTHON SOURCE LINES 32-40

.. code-block:: Python

    parquet_file_path = Path(tempfile.gettempdir()) / "example_data.parquet"

    df: pd.DataFrame = create_demo_blockmodel(shape=(3, 3, 3), block_size=(1, 1, 1),
                                              corner=(-0.5, -0.5, -0.5))
    df.to_parquet(parquet_file_path)
    df

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                 c_order_xyz  f_order_zyx  depth
    x   y   z
    0.0 0.0 0.0            0            0    2.5
            1.0            1            9    1.5
            2.0            2           18    0.5
        1.0 0.0            3            3    2.5
            1.0            4           12    1.5
            2.0            5           21    0.5
        2.0 0.0            6            6    2.5
            1.0            7           15    1.5
            2.0            8           24    0.5
    1.0 0.0 0.0            9            1    2.5
            1.0           10           10    1.5
            2.0           11           19    0.5
        1.0 0.0           12            4    2.5
            1.0           13           13    1.5
            2.0           14           22    0.5
        2.0 0.0           15            7    2.5
            1.0           16           16    1.5
            2.0           17           25    0.5
    2.0 0.0 0.0           18            2    2.5
            1.0           19           11    1.5
            2.0           20           20    0.5
        1.0 0.0           21            5    2.5
            1.0           22           14    1.5
            2.0           23           23    0.5
        2.0 0.0           24            8    2.5
            1.0           25           17    1.5
            2.0           26           26    0.5
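
The `c_order_xyz` and `f_order_zyx` values shown above are consistent with flattening the 3 x 3 x 3 block grid in C order (z varies fastest) and in Fortran order (x varies fastest). The sketch below reproduces them using only the standard library. It is inferred from the printed values and is an assumption, not the implementation of `create_demo_blockmodel`; the names `shape` and `rows` are illustrative.

.. code-block:: Python

    from itertools import product

    shape = (3, 3, 3)  # number of blocks along x, y and z
    nx, ny, nz = shape

    rows = []
    for x, y, z in product(range(nx), range(ny), range(nz)):
        c_order = (x * ny + y) * nz + z  # C order: z varies fastest
        f_order = x + nx * (y + ny * z)  # Fortran order: x varies fastest
        rows.append((x, y, z, c_order, f_order))

    # Second block of the sorted table: x=0, y=0, z=1
    print(rows[1])  # (0, 0, 1, 1, 9)

Because the two orderings differ, a file sorted on `x`, `y`, `z` has a monotonic `c_order_xyz` column but not a monotonic `f_order_zyx` column.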


.. GENERATED FROM PYTHON SOURCE LINES 41-42

Randomise the order of the DataFrame and persist to Parquet.

.. GENERATED FROM PYTHON SOURCE LINES 42-48

.. code-block:: Python

    df = df.sample(frac=1)
    df.to_parquet(parquet_file_path)
    df_randomised = pd.read_parquet(parquet_file_path)
    df_randomised

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                 c_order_xyz  f_order_zyx  depth
    x   y   z
    2.0 0.0 1.0           19           11    1.5
    0.0 1.0 1.0            4           12    1.5
    2.0 1.0 1.0           22           14    1.5
            2.0           23           23    0.5
    1.0 2.0 2.0           17           25    0.5
        0.0 2.0           11           19    0.5
        1.0 1.0           13           13    1.5
    0.0 0.0 1.0            1            9    1.5
    2.0 2.0 0.0           24            8    2.5
    1.0 0.0 1.0           10           10    1.5
        1.0 0.0           12            4    2.5
    0.0 0.0 2.0            2           18    0.5
        2.0 2.0            8           24    0.5
            1.0            7           15    1.5
    2.0 2.0 1.0           25           17    1.5
    0.0 2.0 0.0            6            6    2.5
        1.0 0.0            3            3    2.5
    2.0 1.0 0.0           21            5    2.5
        2.0 2.0           26           26    0.5
    0.0 1.0 2.0            5           21    0.5
        0.0 0.0            0            0    2.5
    1.0 1.0 2.0           14           22    0.5
        0.0 0.0            9            1    2.5
    2.0 0.0 0.0           18            2    2.5
    1.0 2.0 0.0           15            7    2.5
    2.0 0.0 2.0           20           20    0.5
    1.0 2.0 1.0           16           16    1.5


.. GENERATED FROM PYTHON SOURCE LINES 49-53

Sort by the index
-----------------

We can sort the Parquet file by the index columns, mimicking a sort by index in Pandas.

.. GENERATED FROM PYTHON SOURCE LINES 53-63

.. code-block:: Python

    index_cols = ["x", "y", "z"]
    sorted_file_path: Path = parquet_file_path.parent / "sorted_example_data.parquet"
    sort_parquet_file(parquet_file_path, output_path=sorted_file_path,
                      columns=index_cols, chunk_size=100_000)

    # Read the sorted Parquet file
    sorted_df = pd.read_parquet(sorted_file_path)
    sorted_df

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Sorting parquet file: 0%| | 0/1 [00:00

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                 c_order_xyz  f_order_zyx  depth
    x   y   z
    0.0 0.0 0.0            0            0    2.5
            1.0            1            9    1.5
            2.0            2           18    0.5
        1.0 0.0            3            3    2.5
            1.0            4           12    1.5
            2.0            5           21    0.5
        2.0 0.0            6            6    2.5
            1.0            7           15    1.5
            2.0            8           24    0.5
    1.0 0.0 0.0            9            1    2.5
            1.0           10           10    1.5
            2.0           11           19    0.5
        1.0 0.0           12            4    2.5
            1.0           13           13    1.5
            2.0           14           22    0.5
        2.0 0.0           15            7    2.5
            1.0           16           16    1.5
            2.0           17           25    0.5
    2.0 0.0 0.0           18            2    2.5
            1.0           19           11    1.5
            2.0           20           20    0.5
        1.0 0.0           21            5    2.5
            1.0           22           14    1.5
            2.0           23           23    0.5
        2.0 0.0           24            8    2.5
            1.0           25           17    1.5
            2.0           26           26    0.5
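
For comparison, the in-memory Pandas operation that this step mimics is `DataFrame.sort_index`. The following is a minimal sketch on a hypothetical four-row frame; `small_df` and its `value` column are made up for illustration and are not part of the example above.

.. code-block:: Python

    import pandas as pd

    # Hypothetical miniature block model with a shuffled (x, y, z) MultiIndex.
    index = pd.MultiIndex.from_tuples(
        [(1.0, 0.0, 1.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)],
        names=["x", "y", "z"],
    )
    small_df = pd.DataFrame({"value": [10, 3, 1, 0]}, index=index)

    # Lexicographic sort: by x first, then y, then z.
    small_sorted = small_df.sort_index()
    print(small_sorted)

Pandas sorts the whole frame in memory, whereas `sort_parquet_file` reads from and writes to Parquet files.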


.. GENERATED FROM PYTHON SOURCE LINES 64-71

Reindexing
----------

We can reindex a Parquet file so that its records follow a new index.
This is useful for aligning one dataset with another prior to concatenation.
Reindexing reorders existing records and adds empty records
when the new index contains entries that are missing from the original index.
To demonstrate this, we will create another Parquet file with a subset of the original records that are unordered.

.. GENERATED FROM PYTHON SOURCE LINES 71-77

.. code-block:: Python

    unsorted_subset_file_path: Path = parquet_file_path.parent / "unsorted_subset.parquet"
    df_randomised.sample(frac=0.5).to_parquet(unsorted_subset_file_path)
    df_unsorted_subset: pd.DataFrame = pd.read_parquet(unsorted_subset_file_path)
    df_unsorted_subset

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                 c_order_xyz  f_order_zyx  depth
    x   y   z
    0.0 2.0 1.0            7           15    1.5
        0.0 0.0            0            0    2.5
    1.0 0.0 2.0           11           19    0.5
    0.0 2.0 0.0            6            6    2.5
    2.0 1.0 1.0           22           14    1.5
    0.0 1.0 1.0            4           12    1.5
            2.0            5           21    0.5
    1.0 0.0 0.0            9            1    2.5
    2.0 0.0 2.0           20           20    0.5
    0.0 0.0 2.0            2           18    0.5
    2.0 1.0 2.0           23           23    0.5
        2.0 0.0           24            8    2.5
    1.0 1.0 2.0           14           22    0.5
    2.0 0.0 1.0           19           11    1.5


.. GENERATED FROM PYTHON SOURCE LINES 78-79

Reindex the unsorted subset to match the original index order.

.. GENERATED FROM PYTHON SOURCE LINES 79-86

.. code-block:: Python

    reindexed_file_path: Path = parquet_file_path.parent / "reindexed_subset.parquet"
    reindex_parquet(unsorted_subset_file_path, output_path=reindexed_file_path,
                    new_index=pa.Table.from_pandas(sorted_df.reset_index()[index_cols]))
    df_reindexed: pd.DataFrame = pd.read_parquet(reindexed_file_path).set_index(index_cols)
    df_reindexed

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Reindexing parquet file: 0%| | 0/1 [00:00

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                 c_order_xyz  f_order_zyx  depth
    x   y   z
    0.0 0.0 0.0          0.0          0.0    2.5
            1.0          NaN          NaN    NaN
            2.0          2.0         18.0    0.5
        1.0 0.0          NaN          NaN    NaN
            1.0          4.0         12.0    1.5
            2.0          5.0         21.0    0.5
        2.0 0.0          6.0          6.0    2.5
            1.0          7.0         15.0    1.5
            2.0          NaN          NaN    NaN
    1.0 0.0 0.0          9.0          1.0    2.5
            1.0          NaN          NaN    NaN
            2.0         11.0         19.0    0.5
        1.0 0.0          NaN          NaN    NaN
            1.0          NaN          NaN    NaN
            2.0         14.0         22.0    0.5
        2.0 0.0          NaN          NaN    NaN
            1.0          NaN          NaN    NaN
            2.0          NaN          NaN    NaN
    2.0 0.0 0.0          NaN          NaN    NaN
            1.0         19.0         11.0    1.5
            2.0         20.0         20.0    0.5
        1.0 0.0          NaN          NaN    NaN
            1.0         22.0         14.0    1.5
            2.0         23.0         23.0    0.5
        2.0 0.0         24.0          8.0    2.5
            1.0          NaN          NaN    NaN
            2.0          NaN          NaN    NaN
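
The Pandas counterpart of this step is `DataFrame.reindex`. The sketch below applies it to a hypothetical three-entry index; `full_index`, `subset` and `aligned` are illustrative names and the data is made up. In Pandas, index entries missing from the subset are filled with NaN, and integer columns are upcast to float to hold them, which is the same effect visible in the output above.

.. code-block:: Python

    import pandas as pd

    names = ["x", "y", "z"]

    # Target index: three blocks in sorted order.
    full_index = pd.MultiIndex.from_tuples(
        [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)], names=names
    )

    # Unordered subset that lacks the middle block.
    subset = pd.DataFrame(
        {"c_order_xyz": [2, 0]},
        index=pd.MultiIndex.from_tuples(
            [(0.0, 0.0, 2.0), (0.0, 0.0, 0.0)], names=names
        ),
    )

    # Reorder to the target index; the missing block becomes a NaN row.
    aligned = subset.reindex(full_index)
    print(aligned)

After reindexing, the frame has one row per entry of the target index, so it can be concatenated column-wise with any other frame that shares that index.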


.. GENERATED FROM PYTHON SOURCE LINES 87-90

Validate index alignment
------------------------

We can check that the index columns of the sorted and reindexed files are aligned
using the `validate_index_alignment` function.

.. GENERATED FROM PYTHON SOURCE LINES 90-91

.. code-block:: Python

    datasets: list[ds.Dataset] = [ds.dataset(pf) for pf in [sorted_file_path, reindexed_file_path]]
    validate_index_alignment(datasets=datasets, index_columns=index_cols)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Validating index alignment: 0%| | 0/1 [00:00
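
For two in-memory frames, a comparable check in Pandas is `Index.equals`, which is true only when both indexes hold the same entries in the same order. The exact semantics of `validate_index_alignment` are not described in this example, so treat the correspondence as an assumption. The index values below are hypothetical.

.. code-block:: Python

    import pandas as pd

    names = ["x", "y", "z"]
    a = pd.MultiIndex.from_tuples([(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)], names=names)
    b = pd.MultiIndex.from_tuples([(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)], names=names)
    c = b[::-1]  # same entries, reversed order

    print(a.equals(b))  # True: same entries, same order
    print(a.equals(c))  # False: order differs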