OMF Profile Block Model
Profiling a dataset is a common task in data analysis. This example demonstrates how to profile an OMF block model and persist the profile report inside the OMF file.
import shutil
import tempfile
from pathlib import Path
import pandas as pd
from omfpandas import OMFPandasReader, OMFPandasWriter
Instantiate
Copy the test OMF file to a temporary location so the original is preserved, then read the block model from the copy with OMFPandasReader.
test_omf_path: Path = Path('./../assets/v2/test_file.omf')
# create a temporary copy to preserve the original file
temp_omf_path: Path = Path(tempfile.gettempdir()) / 'test_file_copy.omf'
shutil.copy(test_omf_path, temp_omf_path)
# Read the block model from the copy and display its first rows
blocks: pd.DataFrame = OMFPandasReader(filepath=temp_omf_path).read_blockmodel(blockmodel_name='vol')
blocks.head()
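The copy made with shutil.copy stays in the temporary directory after the script ends. As a minimal sketch of an alternative, assuming automatic cleanup is wanted, the copy can live inside a tempfile.TemporaryDirectory. The helper name working_copy and the placeholder file are hypothetical and are not part of omfpandas.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def working_copy(source: Path) -> Iterator[Path]:
    """Yield a temporary copy of `source`; the copy is deleted on exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        copy_path = Path(tmp_dir) / source.name
        shutil.copy(source, copy_path)
        yield copy_path


# Hypothetical stand-in for an OMF file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as src_dir:
    original = Path(src_dir) / 'example.omf'
    original.write_bytes(b'placeholder contents')

    with working_copy(original) as copy_path:
        copy_path.write_bytes(b'modified')  # changes touch only the copy
        copy_existed = copy_path.exists()

    # The original is untouched and the copy is gone once the block exits.
    original_unchanged = original.read_bytes() == b'placeholder contents'
    copy_removed = not copy_path.exists()
```

Any reader or writer that needs the file would have to be used inside the with block, since the copy no longer exists afterwards.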
Profile
Create the writer, then write the pandera schema and the profile report into the file. The pandera schema is optional, but it provides a way to describe the attributes in the dataset.
omfpw: OMFPandasWriter = OMFPandasWriter(filepath=temp_omf_path)
omfpw.write_block_model_schema(blockmodel_name='vol', pd_schema_filepath=test_omf_path.with_suffix('.schema.yaml'))
omfpw.profile_blockmodel(blockmodel_name='vol')
Summarize dataset: 100%|██████████| 11/11 [00:00<00:00, 25.02it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:00<00:00, 2.43it/s]
Render HTML: 100%|██████████| 1/1 [00:00<00:00, 5.56it/s]
View the profile report, which benefits from the attribute descriptions from the schema.
omfpw.view_block_model_profile(blockmodel_name='vol')
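A profile report typically summarises each attribute with descriptive statistics, missing-value counts, duplicate detection and correlations. As a rough illustration of those kinds of summaries, and not the report's actual implementation, similar quantities can be computed directly with pandas. The table below is made-up data: only the column name 'random attr' comes from the example, and 'other attr' is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a block model table; values are illustrative only.
blocks = pd.DataFrame({
    'random attr': [0.1, 0.7, np.nan, 0.9, 0.7],
    'other attr': [1.0, 2.0, 3.0, 4.0, 2.0],  # hypothetical second attribute
})

summary = blocks.describe()                    # count, mean, std, min, quartiles, max
missing = blocks.isna().sum()                  # missing values per column
n_duplicates = int(blocks.duplicated().sum())  # rows identical to an earlier row
correlation = blocks.corr()                    # pairwise Pearson correlation
```

The stored report goes further than this, but the sketch shows the sort of per-attribute information to expect when opening it.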
Profile a subset of the block model by passing a query filter string.
omfpw.profile_blockmodel(blockmodel_name='vol', query='`random attr`>0.5')
Summarize dataset: 100%|██████████| 11/11 [00:00<00:00, 38.88it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:00<00:00, 2.59it/s]
Render HTML: 100%|██████████| 1/1 [00:00<00:00, 11.83it/s]
View the profile report of the subset. The dataset tab in the profile report describes the filter applied to the dataset.
omfpw.view_block_model_profile(blockmodel_name='vol', query='`random attr`>0.5')
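The filter string used for the subset resembles the expression syntax of pandas DataFrame.query, where backticks quote a column name that contains a space. Assuming the string is evaluated by pandas in that way (the example does not confirm this), its effect on a small made-up table would be as sketched here.

```python
import pandas as pd

# Made-up values; only the column name comes from the example.
blocks = pd.DataFrame({'random attr': [0.2, 0.5, 0.8, 0.95]})

# Backticks let the expression refer to a column whose name has a space.
subset = blocks.query('`random attr` > 0.5')
```

The comparison is strict, so a block with a value of exactly 0.5 is excluded from the subset.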
Total running time of the script: (0 minutes 3.551 seconds)