DAG with Estimator

Flowsheet can be used to apply an estimator in a process flowsheet. This example demonstrates how to use a DAG to define a flowsheet that applies a lump estimator to a feed stream.

The focus will not be on the model development, but rather on the simulation. The model is a simple RandomForest regressor that predicts the lump mass and composition from the feed stream.

Note

This example uses the estimator extras. ensure you have installed like poetry install -E estimator.

import logging

# This import at the top to guard against the estimator extras not being installed
from elphick.mass_composition.utils.sklearn import PandasPipeline

import pandas as pd
import plotly
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from elphick.mass_composition import MassComposition, Stream
from elphick.mass_composition.dag import DAG
from elphick.mass_composition.datasets.sample_data import iron_ore_met_sample_data
from elphick.mass_composition.flowsheet import Flowsheet

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

Load Data

We load some metallurgical data from a drill program, REF: A072391 Since we are not concerned about the model performance in this example, we’ll convert the categorical feature bulk_hole_no to an integer

df: pd.DataFrame = iron_ore_met_sample_data()

base_components = ['fe', 'p', 'sio2', 'al2o3', 'loi']
cols_x = ['dry_weight_lump_kg'] + [f'head_{comp}' for comp in base_components] + ['bulk_hole_no']
cols_y = ['lump_pct'] + [f'lump_{comp}' for comp in base_components]

df = df.loc[:, cols_x + cols_y].query('lump_pct>0').dropna(how='any')
df = df.rename(columns={'dry_weight_lump_kg': 'head_mass_dry'})
df['bulk_hole_no'] = df['bulk_hole_no'].astype('category').cat.codes

logger.info(df.shape)
df.head()

	head_mass_dry	head_fe	head_p	head_sio2	head_al2o3	head_loi	bulk_hole_no	lump_pct	lump_fe	lump_p	lump_sio2	lump_al2o3	lump_loi
sample_number
30129	0.31	62.94	0.02	3.71	1.51	4.12	0	21.3	64.95	0.014	2.29	0.87	3.83
30131	0.52	64.79	0.02	2.88	1.29	2.80	0	22.4	66.48	0.009	1.67	0.65	2.46
30132	0.41	65.22	0.02	2.64	1.15	2.61	0	16.6	66.78	0.009	1.34	0.47	2.67
30133	0.32	64.67	0.02	2.85	1.17	3.00	0	19.6	66.23	0.011	1.56	0.62	2.74
30134	0.31	65.29	0.02	2.25	0.94	2.93	0	13.9	66.53	0.011	1.41	0.54	2.91

Build a model

X: pd.DataFrame = df[[col for col in df.columns if col not in cols_y]]
y: pd.DataFrame = df[cols_y]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The model needs to be wrapped in a PandasPipeline object to ensure that the column names are preserved.

pipe: PandasPipeline = PandasPipeline.from_pipeline(
    make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)))

pipe

PandasPipeline(steps=[('standardscaler', StandardScaler()),
                      ('randomforestregressor',
                       RandomForestRegressor(random_state=42))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Test the model

The model can be called directly to predict the lump percentage and composition from the feed stream. We will pass in a dataframe with the same columns as the training data.

y_pred = pipe.fit(X_train.drop(columns=['head_mass_dry']), y_train).predict(X_test)
logger.info(f'Test score: {pipe.score(X_test, y_test)}')
y_pred.head()

	lump_pct	lump_fe	lump_p	lump_sio2	lump_al2o3	lump_loi
sample_number
30838	21.150	62.1259	0.03646	1.5286	0.9012	8.3704
30836	17.511	61.8068	0.04462	1.6825	0.9777	8.5672
30143	20.425	64.9562	0.01953	2.0668	0.6913	4.1378
30158	18.170	58.1892	0.03722	8.7937	0.9085	6.5543
30915	22.386	62.5843	0.03022	1.4852	0.9723	7.6845

Create a Head MassComposition object

Now we will create a MassComposition object and use it to apply the model to the feed stream.

head: MassComposition = MassComposition(data=X_test.drop(columns=['bulk_hole_no']), name='head',
                                        mass_dry_var='head_mass_dry')
lump, fines = head.split_by_estimator(estimator=pipe, name_2='fines',
                                      mass_recovery_column='lump_pct', mass_recovery_max=100,
                                      extra_features=X_test['bulk_hole_no'])
lump.data.to_dataframe().head()

	mass_wet	mass_dry	H2O	Fe	P	SiO2	Al2O3	LOI
sample_number
30838	0.300330	0.300330	0.0	62.1259	0.03646	1.5286	0.9012	8.3704
30836	0.227643	0.227643	0.0	61.8068	0.04462	1.6825	0.9777	8.5672
30143	0.085785	0.085785	0.0	64.9562	0.01953	2.0668	0.6913	4.1378
30158	0.101752	0.101752	0.0	58.1892	0.03722	8.7937	0.9085	6.5543
30915	0.385039	0.385039	0.0	62.5843	0.03022	1.4852	0.9723	7.6845

fines.data.to_dataframe().head()

	mass_wet	mass_dry	H2O	Fe	P	SiO2	Al2O3	LOI
sample_number
30838	1.119670	1.119670	0.0	60.672634	0.053632	1.974256	1.166007	9.092784
30836	1.072357	1.072357	0.0	60.416554	0.051142	2.309856	1.501770	8.752437
30143	0.334215	0.334215	0.0	63.892801	0.032687	2.975628	1.644740	3.424259
30158	0.458248	0.458248	0.0	58.788980	0.040617	8.092612	1.643560	5.669172
30915	1.334961	1.334961	0.0	60.800730	0.042821	2.135598	1.948541	8.155420

Define the DAG

First we define a simple DAG, where the feed stream is split into two streams, lump and fines. The lump estimator requires the usual mass-composition variables plus an addition feature/variable called bulk_hole_no. Since the bulk_hole_no is available in the feed stream, it is immediately accessible to the estimator.

head: MassComposition = MassComposition(data=X_test, name='head',
                                        mass_dry_var='head_mass_dry')

dag = DAG(name='A072391', n_jobs=1)
dag.add_input(name='head')
dag.add_step(name='screen', operation=Stream.split_by_estimator, streams=['head'],
             kwargs={'estimator': pipe, 'name_1': 'lump', 'name_2': 'fines',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100})
dag.add_output(name='lump', stream='lump')
dag.add_output(name='fines', stream='fines')
dag.run(input_streams={'head': head}, progress_bar=True)

fig = Flowsheet.from_dag(dag).plot_network()
fig

Executing nodes:   0%|          | 0/4 [00:00<?, ?node/s]
Executing nodes:   0%|          | 0/4 [00:00<?, ?node/s, Processed node: head]
Executing nodes:  25%|##5       | 1/4 [00:00<00:00,  3.99node/s, Processed node: screen]
Executing nodes:  50%|#####     | 2/4 [00:00<00:00,  7.98node/s, Processed node: screen]
Executing nodes:  50%|#####     | 2/4 [00:00<00:00,  7.98node/s, Processed node: lump]
Executing nodes:  75%|#######5  | 3/4 [00:00<00:00,  7.98node/s, Processed node: fines]
Executing nodes: 100%|##########| 4/4 [00:00<00:00, 15.93node/s, Processed node: fines]

More Complex DAG

This DAG is to test a more complex flowsheet where the estimator may have all the features immediately available in the parent stream.

Note

This example works, but it does so since all attribute (extra) variables are passed all the way around the network in the current design. This is to be changed in the future to allow for more efficient processing. Once attributes are no longer passed, changes will be needed to the DAG to marshall features from other streams in the network (most often the input stream).

dag = DAG(name='A072391', n_jobs=1)
dag.add_input(name='head')
dag.add_step(name='screen', operation=Stream.split_by_estimator, streams=['head'],
             kwargs={'estimator': pipe, 'name_1': 'lump', 'name_2': 'fines',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100})
dag.add_step(name='screen_2', operation=Stream.split_by_estimator, streams=['fines'],
             kwargs={'estimator': pipe, 'name_1': 'lump_2', 'name_2': 'fines_2',
                     'mass_recovery_column': 'lump_pct', 'mass_recovery_max': 100,
                     'allow_prefix_mismatch': True})
dag.add_output(name='lump', stream='lump_2')
dag.add_output(name='fines', stream='fines_2')
dag.add_output(name='stockpile', stream='lump')
dag.run(input_streams={'head': head}, progress_bar=True)

fs: Flowsheet = Flowsheet.from_dag(dag)

fig = fs.plot_network()
fig

Executing nodes:   0%|          | 0/6 [00:00<?, ?node/s]
Executing nodes:   0%|          | 0/6 [00:00<?, ?node/s, Processed node: head]
Executing nodes:  17%|#6        | 1/6 [00:00<00:01,  3.92node/s, Processed node: screen]
Executing nodes:  33%|###3      | 2/6 [00:00<00:00,  7.83node/s, Processed node: screen]
Executing nodes:  33%|###3      | 2/6 [00:00<00:00,  7.83node/s, Processed node: screen_2]
Executing nodes:  50%|#####     | 3/6 [00:00<00:00,  6.11node/s, Processed node: screen_2]
Executing nodes:  50%|#####     | 3/6 [00:00<00:00,  6.11node/s, Processed node: stockpile]
Executing nodes:  67%|######6   | 4/6 [00:00<00:00,  6.11node/s, Processed node: lump]
Executing nodes:  83%|########3 | 5/6 [00:00<00:00,  6.11node/s, Processed node: fines]
Executing nodes: 100%|##########| 6/6 [00:00<00:00, 12.77node/s, Processed node: fines]

fig = fs.table_plot(plot_type='sankey', sankey_color_var='Fe', sankey_edge_colormap='copper_r', sankey_vmin=52,
                    sankey_vmax=70)
plotly.io.show(fig)

Total running time of the script: ( 0 minutes 2.498 seconds)

Gallery generated by Sphinx-Gallery