DAG with Estimator

Flowsheet can be used to apply an estimator in a process flowsheet. This example demonstrates how to use a DAG to define a flowsheet that applies a lump estimator to a feed stream.

The focus will not be on the model development, but rather on the simulation. The model is a simple RandomForest regressor that predicts the lump mass and composition from the feed stream.

Note

This example uses the estimator extras. ensure you have installed like poetry install -E estimator.

import logging

# This import at the top to guard against the estimator extras not being installed
from elphick.mass_composition.utils.sklearn import PandasPipeline

import pandas as pd
import plotly
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from elphick.mass_composition import MassComposition, Stream
from elphick.mass_composition.dag import DAG
from elphick.mass_composition.datasets.sample_data import iron_ore_met_sample_data
from elphick.mass_composition.flowsheet import Flowsheet
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

Load Data

We load some metallurgical data from a drill program, REF: A072391 Since we are not concerned about the model performance in this example, we’ll convert the categorical feature bulk_hole_no to an integer

df: pd.DataFrame = iron_ore_met_sample_data()

base_components = ['fe', 'p', 'sio2', 'al2o3', 'loi']
cols_x = ['dry_weight_lump_kg'] + [f'head_{comp}' for comp in base_components] + ['bulk_hole_no']
cols_y = ['lump_pct'] + [f'lump_{comp}' for comp in base_components]

df = df.loc[:, cols_x + cols_y].query('lump_pct>0').dropna(how='any')
df = df.rename(columns={'dry_weight_lump_kg': 'head_mass_dry'})
df['bulk_hole_no'] = df['bulk_hole_no'].astype('category').cat.codes

logger.info(df.shape)
df.head()
head_mass_dry head_fe head_p head_sio2 head_al2o3 head_loi bulk_hole_no lump_pct lump_fe lump_p lump_sio2 lump_al2o3 lump_loi
sample_number
30129 0.31 62.94 0.02 3.71 1.51 4.12 0 21.3 64.95 0.014 2.29 0.87 3.83
30131 0.52 64.79 0.02 2.88 1.29 2.80 0 22.4 66.48 0.009 1.67 0.65 2.46
30132 0.41 65.22 0.02 2.64 1.15 2.61 0 16.6 66.78 0.009 1.34 0.47 2.67
30133 0.32 64.67 0.02 2.85 1.17 3.00 0 19.6 66.23 0.011 1.56 0.62 2.74
30134 0.31 65.29 0.02 2.25 0.94 2.93 0 13.9 66.53 0.011 1.41 0.54 2.91


Build a model

X: pd.DataFrame = df[[col for col in df.columns if col not in cols_y]]
y: pd.DataFrame = df[cols_y]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The model needs to be wrapped in a PandasPipeline object to ensure that the column names are preserved.

pipe: PandasPipeline = PandasPipeline.from_pipeline(
    make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)))

pipe
PandasPipeline(steps=[('standardscaler', StandardScaler()),
                      ('randomforestregressor',
                       RandomForestRegressor(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.


Test the model

The model can be called directly to predict the lump percentage and composition from the feed stream. We will pass in a dataframe with the same columns as the training data.

y_pred = pipe.fit(X_train.drop(columns=['head_mass_dry']), y_train).predict(X_test)
logger.info(f'Test score: {pipe.score(X_test, y_test)}')
y_pred.head()
lump_pct lump_fe lump_p lump_sio2 lump_al2o3 lump_loi
sample_number
30838 21.150 62.1259 0.03646 1.5286 0.9012 8.3704
30836 17.511 61.8068 0.04462 1.6825 0.9777 8.5672
30143 20.425 64.9562 0.01953 2.0668 0.6913 4.1378
30158 18.170 58.1892 0.03722 8.7937 0.9085 6.5543
30915 22.386 62.5843 0.03022 1.4852 0.9723 7.6845


Create a Head MassComposition object

Now we will create a MassComposition object and use it to apply the model to the feed stream.

head: MassComposition = MassComposition(data=X_test.drop(columns=['bulk_hole_no']), name='head',
                                        mass_dry_var='head_mass_dry')
lump, fines = head.split_by_estimator(estimator=pipe, name_2='fines',
                                      mass_recovery_column='lump_pct', mass_recovery_max=100,
                                      extra_features=X_test['bulk_hole_no'])
lump.data.to_dataframe().head()
mass_wet mass_dry H2O Fe P SiO2 Al2O3 LOI
sample_number
30838 0.300330 0.300330 0.0 62.1259 0.03646 1.5286 0.9012 8.3704
30836 0.227643 0.227643 0.0 61.8068 0.04462 1.6825 0.9777 8.5672
30143 0.085785 0.085785 0.0 64.9562 0.01953 2.0668 0.6913 4.1378
30158 0.101752 0.101752 0.0 58.1892 0.03722 8.7937 0.9085 6.5543
30915 0.385039 0.385039 0.0 62.5843 0.03022 1.4852 0.9723 7.6845


fines.data.to_dataframe().head()
mass_wet mass_dry H2O Fe P SiO2 Al2O3 LOI
sample_number
30838 1.119670 1.119670 0.0 60.672634 0.053632 1.974256 1.166007 9.092784
30836 1.072357 1.072357 0.0 60.416554 0.051142 2.309856 1.501770 8.752437
30143 0.334215 0.334215 0.0 63.892801 0.032687 2.975628 1.644740 3.424259
30158 0.458248 0.458248 0.0 58.788980 0.040617 8.092612 1.643560 5.669172
30915 1.334961 1.334961 0.0 60.800730 0.042821 2.135598 1.948541 8.155420


Define the DAG

First we define a simple DAG, where the feed stream is split into two streams, lump and fines. The lump estimator requires the usual mass-composition variables plus an addition feature/variable called bulk_hole_no. Since the bulk_hole_no is available in the feed stream, it is immediately accessible to the estimator.

head: MassComposition = MassComposition(data=X_test, name='head',
                                        mass_dry_var='head_mass_dry')

dag = DAG(name='A072391', n_jobs=1)
dag.add_input(name='head')
dag.add_step(name='screen', operation=Stream.split_by_estimator, streams=['head'],
             kwargs={'estimator': pipe, 'name_1': 'lump', 'name_2': 'fines',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100})
dag.add_output(name='lump', stream='lump')
dag.add_output(name='fines', stream='fines')
dag.run(input_streams={'head': head}, progress_bar=True)

fig = Flowsheet.from_dag(dag).plot_network()
fig
Executing nodes:   0%|          | 0/4 [00:00<?, ?node/s]
Executing nodes:   0%|          | 0/4 [00:00<?, ?node/s, Processed node: head]
Executing nodes:  25%|##5       | 1/4 [00:00<00:00,  3.99node/s, Processed node: screen]
Executing nodes:  50%|#####     | 2/4 [00:00<00:00,  7.98node/s, Processed node: screen]
Executing nodes:  50%|#####     | 2/4 [00:00<00:00,  7.98node/s, Processed node: lump]
Executing nodes:  75%|#######5  | 3/4 [00:00<00:00,  7.98node/s, Processed node: fines]
Executing nodes: 100%|##########| 4/4 [00:00<00:00, 15.93node/s, Processed node: fines]


More Complex DAG

This DAG is to test a more complex flowsheet where the estimator may have all the features immediately available in the parent stream.

Note

This example works, but it does so since all attribute (extra) variables are passed all the way around the network in the current design. This is to be changed in the future to allow for more efficient processing. Once attributes are no longer passed, changes will be needed to the DAG to marshall features from other streams in the network (most often the input stream).

dag = DAG(name='A072391', n_jobs=1)
dag.add_input(name='head')
dag.add_step(name='screen', operation=Stream.split_by_estimator, streams=['head'],
             kwargs={'estimator': pipe, 'name_1': 'lump', 'name_2': 'fines',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100})
dag.add_step(name='screen_2', operation=Stream.split_by_estimator, streams=['fines'],
             kwargs={'estimator': pipe, 'name_1': 'lump_2', 'name_2': 'fines_2',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100,
                     'allow_prefix_mismatch': True})
dag.add_output(name='lump', stream='lump_2')
dag.add_output(name='fines', stream='fines_2')
dag.add_output(name='stockpile', stream='lump')
dag.run(input_streams={'head': head}, progress_bar=True)

fs: Flowsheet = Flowsheet.from_dag(dag)

fig = fs.plot_network()
fig
Executing nodes:   0%|          | 0/6 [00:00<?, ?node/s]
Executing nodes:   0%|          | 0/6 [00:00<?, ?node/s, Processed node: head]
Executing nodes:  17%|#6        | 1/6 [00:00<00:01,  3.92node/s, Processed node: screen]
Executing nodes:  33%|###3      | 2/6 [00:00<00:00,  7.83node/s, Processed node: screen]
Executing nodes:  33%|###3      | 2/6 [00:00<00:00,  7.83node/s, Processed node: screen_2]
Executing nodes:  50%|#####     | 3/6 [00:00<00:00,  6.11node/s, Processed node: screen_2]
Executing nodes:  50%|#####     | 3/6 [00:00<00:00,  6.11node/s, Processed node: stockpile]
Executing nodes:  67%|######6   | 4/6 [00:00<00:00,  6.11node/s, Processed node: lump]
Executing nodes:  83%|########3 | 5/6 [00:00<00:00,  6.11node/s, Processed node: fines]
Executing nodes: 100%|##########| 6/6 [00:00<00:00, 12.77node/s, Processed node: fines]


fig = fs.table_plot(plot_type='sankey', sankey_color_var='Fe', sankey_edge_colormap='copper_r', sankey_vmin=52,
                    sankey_vmax=70)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 2.498 seconds)

Gallery generated by Sphinx-Gallery