parq_tools.utils.index_utils.dedup_index_parquet
- parq_tools.utils.index_utils.dedup_index_parquet(input_path, output_path, index_columns, chunk_size=100000)
Remove rows with duplicate index-column values from a Parquet file, processing the data in chunks and writing the result to a new file.
- Parameters:
input_path (Path) – Path to the input Parquet file.
output_path (Path) – Path to save the deduplicated Parquet file.
index_columns (List[str]) – Columns to use as the index for deduplication.
chunk_size (int) – Number of rows to process per chunk. Defaults to 100000.
- Return type:
None
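
The idea of chunked deduplication on index columns can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name dedup_rows_in_chunks is hypothetical, rows are represented as dicts instead of Parquet data, and keeping the first occurrence of each index key is an assumption that the documentation above does not confirm.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def dedup_rows_in_chunks(
    rows: Iterable[Dict[str, object]],
    index_columns: List[str],
    chunk_size: int = 100_000,
) -> Iterator[List[Dict[str, object]]]:
    """Yield chunks of rows with duplicate index keys removed (hypothetical helper)."""
    seen = set()  # index keys already emitted, shared across all chunks
    iterator = iter(rows)
    while True:
        # Pull at most chunk_size rows so only one chunk is held at a time.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        unique = []
        for row in chunk:
            key = tuple(row[col] for col in index_columns)
            if key not in seen:  # assumption: keep the first occurrence
                seen.add(key)
                unique.append(row)
        if unique:
            yield unique


rows = [
    {"x": 0, "y": 0, "value": 1.0},
    {"x": 0, "y": 1, "value": 2.0},
    {"x": 0, "y": 0, "value": 3.0},  # duplicate of the first row's (x, y) key
    {"x": 1, "y": 0, "value": 4.0},
]
deduplicated = [
    row
    for chunk in dedup_rows_in_chunks(rows, ["x", "y"], chunk_size=2)
    for row in chunk
]
```

In this sketch the set of seen keys persists across chunks, so a duplicate is dropped even when it falls in a later chunk than its first occurrence. The trade-off is that the set grows with the number of distinct index keys, while chunk_size bounds only the number of rows held at once.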