parq_tools.utils.index_utils.dedup_index_parquet

parq_tools.utils.index_utils.dedup_index_parquet(input_path, output_path, index_columns, chunk_size=100000)

Remove duplicate rows from a Parquet file and write the result to a new file. Rows count as duplicates when they share the same values in the index columns.

Parameters:
  • input_path (Path) – Path to the input Parquet file.

  • output_path (Path) – Path to save the deduplicated Parquet file.

  • index_columns (List[str]) – Columns to use as the index for deduplication.

  • chunk_size (int) – Number of rows to process per chunk.

Return type:

None
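
Example:

The following is a minimal sketch of what chunked deduplication on index columns can look like. It is not the parq_tools implementation. Plain Python dicts stand in for Parquet rows, the helper names iter_chunks and dedup_by_index are hypothetical, and keeping the first occurrence of each index is an assumption, since this page does not say which duplicate is retained.

```python
from typing import Any, Dict, Iterator, List, Set, Tuple

Row = Dict[str, Any]


def iter_chunks(rows: List[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield successive slices of at most chunk_size rows (hypothetical helper)."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def dedup_by_index(rows: List[Row], index_columns: List[str],
                   chunk_size: int = 100000) -> List[Row]:
    """Drop rows whose index-column values were already seen.

    Assumption: the first occurrence of each index is kept.
    """
    seen: Set[Tuple[Any, ...]] = set()  # index keys seen so far, shared across chunks
    kept: List[Row] = []
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            key = tuple(row[col] for col in index_columns)
            if key not in seen:
                seen.add(key)
                kept.append(row)
    return kept


rows = [
    {"x": 0, "y": 0, "value": 1.0},
    {"x": 0, "y": 1, "value": 2.0},
    {"x": 0, "y": 0, "value": 9.0},  # same index (0, 0) as the first row
    {"x": 1, "y": 0, "value": 3.0},
]

# chunk_size=2 places the duplicate in a later chunk than its first occurrence.
result = dedup_by_index(rows, ["x", "y"], chunk_size=2)
print(result)
```

The set of seen keys has to persist across chunks. Otherwise a duplicate that lands in a later chunk than its first occurrence would slip through. Going by the signature above, a call to the documented function would take the form dedup_index_parquet(Path("input.parquet"), Path("output.parquet"), index_columns=["x", "y"]), where the file names and column names are placeholders.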