Welcome to parq-tools’ documentation!

parq-tools

Overview

parq-tools is a collection of utilities for efficiently working with large-scale Parquet datasets. A typical use case is asset-based workflows with large scientific datasets.

If your datasets are not large, you might find the pandas library more convenient.

Features

  • Filtering → Filters large Parquet files efficiently.

  • Concatenation → Combines multiple Parquet files efficiently along rows (axis=0) or columns (axis=1).

  • Tokenized Filtering → Converts pandas-style expressions into efficient PyArrow queries.

  • Profiling Enhancements → Extends ydata-profiling to large files by profiling specific columns incrementally and merging the results.

  • DataFrame Enhancements → Provides a LazyParquetDataFrame class that extends pandas.DataFrame with lazy loading from Parquet files.
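
To make the tokenized filtering idea concrete, here is a minimal sketch. It is not parq-tools' implementation or API: `to_pyarrow_filters` is a hypothetical helper written only for illustration. It assumes expressions are limited to simple comparisons joined by `and`, and it uses Python's standard `ast` module to produce the list-of-tuples filter format that `pyarrow.parquet.read_table` accepts through its `filters` argument.

```python
import ast

# Map Python comparison AST nodes to the operator strings used in
# PyArrow's list-of-tuples filter format.
_OPS = {
    ast.Eq: "==",
    ast.NotEq: "!=",
    ast.Lt: "<",
    ast.LtE: "<=",
    ast.Gt: ">",
    ast.GtE: ">=",
}


def to_pyarrow_filters(expression):
    """Hypothetical helper: turn "a > 1 and b == 'x'" into filter tuples."""
    tree = ast.parse(expression, mode="eval").body

    # A bare comparison is treated as a conjunction with a single term.
    if isinstance(tree, ast.BoolOp) and isinstance(tree.op, ast.And):
        terms = tree.values
    else:
        terms = [tree]

    filters = []
    for term in terms:
        # Reject anything that is not a single "column <op> literal" comparison.
        if not (isinstance(term, ast.Compare) and len(term.ops) == 1):
            raise ValueError("only simple comparisons joined by 'and' are supported")
        column, op, value = term.left, term.ops[0], term.comparators[0]
        if not isinstance(column, ast.Name) or type(op) not in _OPS:
            raise ValueError("unsupported comparison")
        # literal_eval only accepts literals, so arbitrary code is never executed.
        filters.append((column.id, _OPS[type(op)], ast.literal_eval(value)))
    return filters


filters = to_pyarrow_filters("depth > 100 and rock_type == 'granite'")
print(filters)
```

The resulting list could then be passed as `filters=filters` to `pyarrow.parquet.read_table`, so that non-matching rows are dropped while the file is scanned instead of after it has been loaded into memory. The column names `depth` and `rock_type` are made up for the example.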