Predict¶
Uses the output from preprocessing and feature selection to build, train, and evaluate the model.
- predict.compute_building_weather_errors(df, actual_label, prediction_label)¶
Calculates the absolute difference and squared difference between the predicted and actual energy or costing values, then groups by building type and epw file to compute the means of both the actual and predicted values.
- Parameters:
df – The dataframe being manipulated.
actual_label – The string used to describe the class being predicted (e.g. electricity, gas, …).
prediction_label – The string used to describe the prediction which a model outputs (e.g. predicted_electricity, …).
- Returns:
df: The updated dataframe.
building_errors: The errors for each building type.
climate_errors: The errors for each climate zone.
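The error computation described above can be sketched with pandas as follows. This is a minimal illustration, not the module's implementation: the helper name `sketch_building_weather_errors`, the column names `building_type`, `epw_file`, `abs_error`, and `squared_error`, and the sample values are all assumptions made for the example.

```python
import pandas as pd


def sketch_building_weather_errors(df, actual_label, prediction_label):
    """Hypothetical sketch: per-row errors plus group means."""
    df = df.copy()
    # Absolute and squared differences between prediction and actual value.
    difference = df[prediction_label] - df[actual_label]
    df["abs_error"] = difference.abs()
    df["squared_error"] = difference ** 2

    value_columns = [actual_label, prediction_label, "abs_error", "squared_error"]
    # Mean actual value, predicted value, and errors per building type.
    building_errors = df.groupby("building_type")[value_columns].mean()
    # The same means per weather (epw) file; the column name is an assumption.
    climate_errors = df.groupby("epw_file")[value_columns].mean()
    return df, building_errors, climate_errors


sample = pd.DataFrame(
    {
        "building_type": ["Office", "Office", "School"],
        "epw_file": ["a.epw", "b.epw", "a.epw"],
        "electricity": [10.0, 20.0, 30.0],
        "predicted_electricity": [12.0, 17.0, 30.0],
    }
)
updated, building_errors, climate_errors = sketch_building_weather_errors(
    sample, "electricity", "predicted_electricity"
)
```

Grouping on both the actual and predicted columns keeps the mean values next to the mean errors, so each building type or weather file can be read off in a single row.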
- predict.convert_dataframe_to_annual(df)¶
Converts a dataframe of daily predictions into one with annual predictions (energy only).
- Parameters:
df – The dataframe being transformed.
- Returns:
updated_df: The updated dataframe.
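A daily-to-annual conversion of this kind might look like the sketch below. It assumes that each row carries a `datapoint_id` and that the annual value is the sum of the daily values for that id; the helper name `sketch_convert_to_annual` and the energy column names are hypothetical.

```python
import pandas as pd


def sketch_convert_to_annual(df):
    """Hypothetical sketch: sum the daily values of each datapoint."""
    # All numeric daily values that share a datapoint_id are added together.
    return df.groupby("datapoint_id", as_index=False).sum(numeric_only=True)


daily = pd.DataFrame(
    {
        "datapoint_id": ["a", "a", "b", "b"],
        "electricity": [1.0, 2.0, 3.0, 4.0],
        "predicted_electricity": [1.5, 2.5, 2.0, 4.0],
    }
)
annual = sketch_convert_to_annual(daily)
```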
- predict.create_model_mlp(dense_layers, activation, optimizer, dropout_rate, length, learning_rate, epochs, batch_size, X_train, y_train, X_test, y_test, y_test_complete, scalery, X_validate, y_validate, y_validate_complete, output_path, path_elec, path_gas, val_building_path, process_type, output_nodes)¶
Creates an MLP model from the supplied hyperparameter values, so that a hyperparameter search does not need to be performed every time. It helps to have run the hyperparameter search beforehand to know which hyperparameter values to set.
- Parameters:
dense_layers – the layers of the model architecture, passed as a list with one value per layer, e.g. [8, 20, 30] for a model with 3 layers
activation – activation function to be used, e.g. relu, tanh
optimizer – optimizer to be used when compiling the model, e.g. rmsprop, adam
dropout_rate – dropout rate used to reduce overfitting; the value should be less than 1, e.g. 0.3
length – length of the trainset
learning_rate – determines how quickly the model converges to an optimal loss value; the value should be less than or equal to 0.1, e.g. 0.001
epochs – number of training epochs the model should perform
batch_size – batch size to be used
X_train – X trainset
y_train – y trainset
X_test – X testset
y_test – y testset
y_test_complete – dataframe containing the target variable with corresponding datapointid for the test set
scalery – y scaler used to transform the y values to the original scale
X_validate – X validation set
y_validate – y validation set
y_validate_complete – dataframe containing the target variable with corresponding datapointid for the validation set
output_path – Where the outputs will be placed
path_elec – Filepath of the electricity building file which has been used
path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)
val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).
process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
output_nodes – The number of outputs which the model needs to predict.
- Returns:
metric: evaluation results containing the loss value from the test set prediction.
annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the test set prediction.
output_df: merge of y_pred, y_test, and datapoint_id; the final dataframe showing the model output using the test set.
val_metric: evaluation results containing the loss value from the validation set prediction.
val_annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the validation set prediction.
output_val_df: merge of y_pred, y_validate, and datapoint_id; the final dataframe showing the model output using the validation set.
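To show how hyperparameters such as dense_layers, activation, optimizer, dropout_rate, and learning_rate could shape a network, here is a minimal Keras sketch. It is not the module's code: the helper name `sketch_build_mlp`, the `input_dim` argument, the reading of each dense_layers entry as a unit count, the dropout layer after every dense layer, and the `mse` loss are all assumptions.

```python
from tensorflow import keras


def sketch_build_mlp(dense_layers, activation, optimizer, dropout_rate,
                     learning_rate, input_dim, output_nodes):
    """Hypothetical sketch of an MLP built from explicit hyperparameters."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for units in dense_layers:
        # One dense layer per entry in dense_layers, each followed by dropout.
        model.add(keras.layers.Dense(units, activation=activation))
        model.add(keras.layers.Dropout(dropout_rate))
    # Linear output layer with one node per predicted target.
    model.add(keras.layers.Dense(output_nodes))

    # Map the optimizer name to a Keras optimizer class (assumed subset).
    optimizers = {
        "adam": keras.optimizers.Adam,
        "rmsprop": keras.optimizers.RMSprop,
    }
    model.compile(
        optimizer=optimizers[optimizer](learning_rate=learning_rate),
        loss="mse",
    )
    return model


model = sketch_build_mlp(
    dense_layers=[8, 20, 30], activation="relu", optimizer="adam",
    dropout_rate=0.3, learning_rate=0.001, input_dim=5, output_nodes=2,
)
```

Training would then call `model.fit` with the epochs and batch_size values; that step is left out here because it depends on the prepared train, test, and validation sets.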
- predict.create_model_rf(n_estimators, max_depth, min_samples_split, min_samples_leaf, X_train, y_train, X_test, y_test, y_test_complete, scalery, X_validate, y_validate, y_validate_complete, output_path, path_elec, path_gas, val_building_path, process_type, output_nodes)¶
Creates a random forest model from the supplied hyperparameter values, so that a hyperparameter search does not need to be performed every time. It helps to have run the hyperparameter search beforehand to know which hyperparameter values to set.
- Parameters:
n_estimators – the number of trees in the random forest
max_depth – the maximum depth of the tree.
min_samples_split – The minimum number of samples required to split an internal node.
min_samples_leaf – The minimum number of samples required to be at a leaf node.
X_train – X trainset
y_train – y trainset
X_test – X testset
y_test – y testset
y_test_complete – dataframe containing the target variable with corresponding datapointid for the test set
scalery – y scaler used to transform the y values to the original scale
X_validate – X validation set
y_validate – y validation set
y_validate_complete – dataframe containing the target variable with corresponding datapointid for the validation set
output_path – Where the outputs will be placed
path_elec – Filepath of the electricity building file which has been used
path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)
val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).
process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
output_nodes – The number of outputs which the model needs to predict.
- Returns:
metric: evaluation results containing the loss value from the test set prediction.
annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the test set prediction.
output_df: merge of y_pred, y_test, and datapoint_id; the final dataframe showing the model output using the test set.
val_metric: evaluation results containing the loss value from the validation set prediction.
val_annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the validation set prediction.
output_val_df: merge of y_pred, y_validate, and datapoint_id; the final dataframe showing the model output using the validation set.
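The four random forest hyperparameters listed above share their names with the arguments of scikit-learn's `RandomForestRegressor`. Assuming that estimator is the one being configured (the source does not say so explicitly), a minimal sketch on synthetic data with two outputs could look like this; the data, the hyperparameter values, and the fixed `random_state` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: 40 samples, 3 features, 2 target columns.
rng = np.random.default_rng(0)
X_train = rng.random((40, 3))
y_train = X_train @ np.array([[1.0, 0.5], [2.0, 0.0], [0.0, 3.0]])

model = RandomForestRegressor(
    n_estimators=10,       # number of trees in the forest
    max_depth=5,           # maximum depth of each tree
    min_samples_split=2,   # minimum samples needed to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    random_state=0,        # fixed seed so the sketch is repeatable
)
model.fit(X_train, y_train)

# One row of predictions per input sample, one column per output.
y_pred = model.predict(X_train[:4])
```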
- predict.evaluate(model, X_test, y_test, scalery, X_validate, y_validate, y_test_complete, y_validate_complete, path_elec, path_gas, val_building_path, process_type)¶
The model selected with the best hyperparameters is used to make predictions on the test and validation sets.
- Parameters:
model – model built from training
X_test – X testset
y_test – y testset
scalery – y scaler used to transform the y values to the original scale
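X_validate – X validation set
y_validate – y validation set
y_test_complete – dataframe containing the target variable with corresponding datapoint_id for the test set
y_validate_complete – dataframe containing the target variable with corresponding datapoint_id for the validation set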
path_elec – Filepath of the electricity building file which has been used
path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)
val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).
process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
- Returns:
metric: evaluation results containing the loss value from the test set prediction.
annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the test set prediction.
output_df: merge of y_pred, y_test, and datapoint_id; the final dataframe showing the model output using the test set.
val_metric: evaluation results containing the loss value from the validation set prediction.
val_annual_metric: the predicted values for each datapoint_id are summed to calculate the annual energy consumed, and the loss value is computed from the validation set prediction.
output_val_df: merge of y_pred, y_validate, and datapoint_id; the final dataframe showing the model output using the validation set.
output_df_average_predictions_buildings: The mean energy predictions and actual energy values per building type in the test set.
output_df_average_predictions_climates: The mean energy predictions and actual energy values per climate zone in the test set.
output_val_df_average_predictions_buildings: The mean energy predictions and actual energy values per building type in the validation set.
output_val_df_average_predictions_climates: The mean energy predictions and actual energy values per climate zone in the validation set.
- predict.fit_evaluate(preprocessed_data_file, selected_features_file, selected_model_type, param_search, output_path, random_seed, path_elec, path_gas, val_building_path, process_type, use_updated_model, use_dropout)¶
Loads the output from preprocessing and feature selection, builds the model, then evaluates the model.
- Parameters:
preprocessed_data_file – Location and name of a .json preprocessing file to be used.
selected_features_file – Location and name of a .json feature selection file to be used.
selected_model_type – the type of model to be used. Can either be ‘mlp’ or ‘rf’
param_search – ‘yes’ if hyperparameter tuning should be performed (increases runtime), ‘no’ if the default hyperparameters should be used.
output_path – Where output data should be placed. Note that this value should be empty unless this file is called from a pipeline.
random_seed – Random seed to be used when training. Should not be -1 when used through the CLI.
path_elec – Filepath of the electricity building file which has been used.
path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise).
val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).
process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
use_updated_model – True if the larger model architecture should be used for training. Should be False if a costing model is being trained.
use_dropout – True if the dropout regularization technique should be used (on by default). False if tests are desired without dropout. Note that not using dropout may cause bias to be learned during training.
- Returns:
The results from the model prediction are uploaded to MinIO.
- predict.main(config_file=<typer.models.ArgumentInfo object>, process_type=<typer.models.ArgumentInfo object>, preprocessed_data_file=<typer.models.ArgumentInfo object>, selected_features_file=<typer.models.ArgumentInfo object>, selected_model_type=<typer.models.OptionInfo object>, perform_param_search=<typer.models.OptionInfo object>, output_path=<typer.models.OptionInfo object>, random_seed=<typer.models.OptionInfo object>, path_elec=<typer.models.ArgumentInfo object>, path_gas=<typer.models.OptionInfo object>, val_building_path=<typer.models.OptionInfo object>, use_updated_model=<typer.models.OptionInfo object>, use_dropout=<typer.models.OptionInfo object>)¶
Using all preprocessed data, build and train a Machine Learning model to predict the total energy or costing values. All steps of this process are saved, and the model is evaluated to determine its effectiveness overall and on specific building types and climate zones.
- Parameters:
config_file (str) – Location of the .yml config file (default name is input_config.yml).
process_type (str) – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
preprocessed_data_file (str) – Location and name of a .json preprocessing file to be used.
selected_features_file (str) – Location and name of a .json feature selection file to be used.
selected_model_type (str) – Type of model selected. Can either be ‘mlp’ for Multilayer Perceptron or ‘rf’ for Random Forest.
perform_param_search (str) – ‘yes’ if hyperparameter tuning should be performed (increases runtime), ‘no’ if the default hyperparameters should be used.
output_path (str) – Where output data should be placed. Note that this value should be empty unless this file is called from a pipeline.
random_seed (int) – Random seed to be used when training. Should not be -1 when used through the CLI.
path_elec (str) – Filepath of the electricity building file which has been used.
path_gas (str) – Filepath of the gas building file, if it has been used (pass nothing otherwise).
val_building_path (str) – Filepath of the validation building file, if it has been used (pass nothing otherwise).
use_updated_model (bool) – True if the larger model architecture should be used for training. Should be False if a costing model is being trained.
use_dropout (bool) – True if the dropout regularization technique should be used (on by default). False if tests are desired without dropout. Note that not using dropout may cause bias to be learned during training.
- predict.model_builder(hp)¶
Builds the model used in the hyperparameter search. The search includes activation, regularizers, dropout_rate, learning_rate, and optimizer.
- Parameters:
hp – hyperband object with different hyperparameters to be checked.
- Returns:
The model built from the different hyperparameter combinations.
- predict.predicts_hp(X_train, y_train, X_test, y_test, selected_feature, output_path, random_seed)¶
DECOMMISSIONED: May need updates for multi-outputs. Using the combined set of hyperparameters, the built model is used to make predictions.
- Parameters:
X_train – X train set
y_train – y train set
X_test – X test set
y_test – y test set
selected_feature – selected features that would be used to build the model
output_path – Where the output files should be placed.
random_seed – The random seed to be used
- Returns:
Model built from the set of hyperparameters combined.
- predict.rmse_loss(y_true, y_pred)¶
A customized RMSE score that takes a sum of y_pred and y_true before computing the RMSE.
- Parameters:
y_true – y testset
y_pred – y predicted value from the model
- Returns:
RMSE loss from comparing the y_true and y_pred values.
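The description does not say along which axis the sum is taken. The NumPy sketch below shows one possible reading, in which the output columns of each sample are summed into a single total before the RMSE is computed. Both that interpretation and the helper name `sketch_rmse_loss` are assumptions, not the module's confirmed behaviour.

```python
import numpy as np


def sketch_rmse_loss(y_true, y_pred):
    """Hypothetical sketch: RMSE of the per-sample sums across outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Collapse the output columns of each sample into a single total.
    true_total = y_true.sum(axis=1)
    pred_total = y_pred.sum(axis=1)
    # Root of the mean squared difference between the totals.
    return float(np.sqrt(np.mean((pred_total - true_total) ** 2)))


loss = sketch_rmse_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 5.0], [3.0, 0.0]])
```

Under this reading, errors in different outputs of the same sample can cancel each other out, which differs from an RMSE computed per output column.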
- predict.score(y_test, y_pred)¶
Used to compute the mse, rmse, mae and mape scores.
- Parameters:
y_test – y testset
y_pred – y predicted value from the model
- Returns:
mse, rmse, mae and mape scores from comparing the y_test and y_pred values
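For reference, the four scores can be computed with NumPy as in the sketch below. The helper name `sketch_score` is hypothetical, and the MAPE is returned as a fraction rather than a percentage, which is an assumption about the convention used.

```python
import numpy as np


def sketch_score(y_test, y_pred):
    """Hypothetical sketch returning (mse, rmse, mae, mape)."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_test
    mse = float(np.mean(errors ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(errors)))
    # MAPE as a fraction; it is undefined when y_test contains zeros.
    mape = float(np.mean(np.abs(errors / y_test)))
    return mse, rmse, mae, mape


mse, rmse, mae, mape = sketch_score([2.0, 4.0], [1.0, 6.0])
```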