Predict

Uses the output from preprocessing and feature selection to build, train, and evaluate the model.

predict.compute_building_weather_errors(df, actual_label, prediction_label)

Calculates the absolute difference and squared difference between the predicted and actual energy/costing use, then groups by the building type and epw file to compute the means for both the actual and predicted energy/costing.

Parameters:
  • df – The dataframe being manipulated.

  • actual_label – The string used to describe the class being predicted (e.g. electricity, gas, …).

  • prediction_label – The string used to describe the prediction which a model outputs (e.g. predicted_electricity, …).

Returns:

  • df – The updated dataframe.

  • building_errors – The errors for each building type.

  • climate_errors – The errors for each climate zone.

Return type:

df
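The computation described for compute_building_weather_errors can be illustrated with a minimal pandas sketch. This is not the module's implementation: the helper name `sketch_building_weather_errors`, the grouping columns `building_type` and `epw_file`, and the error column names are all assumptions made for this example.

```python
import pandas as pd


def sketch_building_weather_errors(df, actual_label, prediction_label):
    """Hypothetical stand-in for compute_building_weather_errors."""
    df = df.copy()
    # Per-row absolute and squared differences between prediction and actual.
    diff = df[prediction_label] - df[actual_label]
    df["abs_error"] = diff.abs()
    df["squared_error"] = diff ** 2

    # Means of the actual, predicted, and error columns for each group.
    # The grouping column names are assumed, not taken from the source.
    cols = [actual_label, prediction_label, "abs_error", "squared_error"]
    building_errors = df.groupby("building_type")[cols].mean()
    climate_errors = df.groupby("epw_file")[cols].mean()
    return df, building_errors, climate_errors


data = pd.DataFrame({
    "building_type": ["office", "office", "school"],
    "epw_file": ["a.epw", "b.epw", "a.epw"],
    "electricity": [10.0, 20.0, 30.0],
    "predicted_electricity": [12.0, 17.0, 30.0],
})
updated, building_errors, climate_errors = sketch_building_weather_errors(
    data, "electricity", "predicted_electricity"
)
```

Here `building_errors` holds one row per building type and `climate_errors` one row per epw file, each with the mean actual value, mean predicted value, and mean errors.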

predict.convert_dataframe_to_annual(df)

Converts a dataframe of daily predictions into one with annual predictions (energy only).

Parameters:

df – The dataframe being transformed.

Returns:

The updated dataframe.

Return type:

updated_df
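One plausible way to picture the daily-to-annual conversion is a group-and-sum over a datapoint identifier. The sketch below assumes the column names `datapoint_id` and `energy`, which are illustrative only and may not match the columns the function actually uses.

```python
import pandas as pd

# Daily predictions for two hypothetical datapoints.
daily = pd.DataFrame({
    "datapoint_id": ["a", "a", "a", "b", "b"],
    "energy": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# Sum the daily values of each datapoint to get one annual value per datapoint.
annual = daily.groupby("datapoint_id", as_index=False)["energy"].sum()
```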

predict.create_model_mlp(dense_layers, activation, optimizer, dropout_rate, length, learning_rate, epochs, batch_size, X_train, y_train, X_test, y_test, y_test_complete, scalery, X_validate, y_validate, y_validate_complete, output_path, path_elec, path_gas, val_building_path, process_type, output_nodes)

Creates an MLP model with preset hyperparameter values, so that a hyperparameter search does not have to be performed on every run. It is advisable to have run the hyperparameter search beforehand to know which hyperparameter values to set.

Parameters:
  • dense_layers – the layers of the model architecture, given as a list of layer sizes, e.g. for a model with 3 layers, values will be passed as [8,20,30]

  • activation – activation function to be used, e.g. relu, tanh

  • optimizer – optimizer to be used in compiling the model, e.g. rmsprop, adam

  • dropout_rate – used to help the model avoid overfitting; value should be less than 1, e.g. 0.3

  • length – length of the trainset

  • learning_rate – learning rate determines how fast or how slowly the model will converge to an optimal loss value. Value should be less than or equal to 0.1, e.g. 0.001

  • epochs – number of epochs (passes over the trainset) the model should perform

  • batch_size – batch size to be used

  • X_train – X trainset

  • y_train – y trainset

  • X_test – X testset

  • y_test – y testset

  • y_test_complete – dataframe containing the target variable with corresponding datapointid for the test set

  • scalery – y scaler used to transform the y values to the original scale

  • X_validate – X validation set

  • y_validate – y validation set

  • y_validate_complete – dataframe containing the target variable with corresponding datapointid for the validation set

  • output_path – Where the outputs will be placed

  • path_elec – Filepath of the electricity building file which has been used

  • path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)

  • val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).

  • process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.

  • output_nodes – The number of outputs which the model needs to predict.

Returns:

  • metric – evaluation results containing the loss value from the test set prediction

  • annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the test set prediction

  • output_df – merge of y_pred, y_test, datapoint_id; the final dataframe showing the model output using the test set

  • val_metric – evaluation results containing the loss value from the validation set prediction

  • val_annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the validation set prediction

  • output_val_df – merge of y_pred, y_validate, datapoint_id; the final dataframe showing the model output using the validation set

Return type:

metric

predict.create_model_rf(n_estimators, max_depth, min_samples_split, min_samples_leaf, X_train, y_train, X_test, y_test, y_test_complete, scalery, X_validate, y_validate, y_validate_complete, output_path, path_elec, path_gas, val_building_path, process_type, output_nodes)

Creates a random forest model with preset hyperparameter values, so that a hyperparameter search does not have to be performed on every run. It is advisable to have run the hyperparameter search beforehand to know which hyperparameter values to set.

Parameters:
  • n_estimators – the number of trees in the random forest

  • max_depth – the maximum depth of the tree.

  • min_samples_split – The minimum number of samples required to split an internal node.

  • min_samples_leaf – The minimum number of samples required to be at a leaf node.

  • X_train – X trainset

  • y_train – y trainset

  • X_test – X testset

  • y_test – y testset

  • y_test_complete – dataframe containing the target variable with corresponding datapointid for the test set

  • scalery – y scaler used to transform the y values to the original scale

  • X_validate – X validation set

  • y_validate – y validation set

  • y_validate_complete – dataframe containing the target variable with corresponding datapointid for the validation set

  • output_path – Where the outputs will be placed

  • path_elec – Filepath of the electricity building file which has been used

  • path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)

  • val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).

  • process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.

  • output_nodes – The number of outputs which the model needs to predict.

Returns:

  • metric – evaluation results containing the loss value from the test set prediction

  • annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the test set prediction

  • output_df – merge of y_pred, y_test, datapoint_id; the final dataframe showing the model output using the test set

  • val_metric – evaluation results containing the loss value from the validation set prediction

  • val_annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the validation set prediction

  • output_val_df – merge of y_pred, y_validate, datapoint_id; the final dataframe showing the model output using the validation set

Return type:

metric
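The four hyperparameters of create_model_rf share their names with arguments of scikit-learn's `RandomForestRegressor`. Assuming that is the underlying estimator (the source does not say so), a minimal sketch of fitting a forest with fixed hyperparameters on synthetic data could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real train and test sets.
rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
y_train = X_train.sum(axis=1)
X_test = rng.random((5, 3))

# Fixed hyperparameters, as they might be chosen after an earlier search.
model = RandomForestRegressor(
    n_estimators=20,       # number of trees in the forest
    max_depth=5,           # maximum depth of each tree
    min_samples_split=2,   # minimum samples needed to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    random_state=0,        # fixed seed for reproducibility
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```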

predict.evaluate(model, X_test, y_test, scalery, X_validate, y_validate, y_test_complete, y_validate_complete, path_elec, path_gas, val_building_path, process_type)

Uses the model selected with the best hyperparameters to make predictions on the test and validation sets.

Parameters:
  • model – model built from training

  • X_test – X testset

  • y_test – y testset

  • scalery – y scaler used to transform the y values to the original scale

  • X_validate – X validationset

  • y_validate – y validationset

  • y_test_complete – test dataset

  • y_validate_complete – validation dataset

  • path_elec – Filepath of the electricity building file which has been used

  • path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise)

  • val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).

  • process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.

Returns:

  • metric – evaluation results containing the loss value from the test set prediction

  • annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the test set prediction

  • output_df – merge of y_pred, y_test, datapoint_id; the final dataframe showing the model output using the test set

  • val_metric – evaluation results containing the loss value from the validation set prediction

  • val_annual_metric – predicted value for each datapoint_id is summed to calculate the annual energy consumed and the loss value from the validation set prediction

  • output_val_df – merge of y_pred, y_validate, datapoint_id; the final dataframe showing the model output using the validation set

  • output_df_average_predictions_buildings – The mean energy predictions and actual energy values per building type in the test set

  • output_df_average_predictions_climates – The mean energy predictions and actual energy values per climate zone in the test set

  • output_val_df_average_predictions_buildings – The mean energy predictions and actual energy values per building type in the validation set

  • output_val_df_average_predictions_climates – The mean energy predictions and actual energy values per climate zone in the validation set

Return type:

metric
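The role of scalery, which maps scaled y values back to the original scale, can be sketched as follows. The example assumes a scikit-learn `MinMaxScaler`; the scaler type actually used is not stated in this reference, and the numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Target values in their original units (hypothetical numbers).
y_original = np.array([[100.0], [200.0], [300.0]])

# Fit the y scaler once on the original targets.
scalery = MinMaxScaler().fit(y_original)

# Pretend these are model predictions made in the scaled [0, 1] space.
y_pred_scaled = np.array([[0.25], [0.5]])

# Map the predictions back to the original scale before computing metrics.
y_pred = scalery.inverse_transform(y_pred_scaled)
```

Metrics computed after this step are expressed in the original units rather than in the scaled space.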

predict.fit_evaluate(preprocessed_data_file, selected_features_file, selected_model_type, param_search, output_path, random_seed, path_elec, path_gas, val_building_path, process_type, use_updated_model, use_dropout)

Loads the output from preprocessing and feature selection, builds the model, then evaluates the model.

Parameters:
  • preprocessed_data_file – Location and name of a .json preprocessing file to be used.

  • selected_features_file – Location and name of a .json feature selection file to be used.

  • selected_model_type – the type of model to be used. Can either be ‘mlp’ or ‘rf’

  • param_search – ‘yes’ if hyperparameter tuning should be performed (increases runtime), ‘no’ if the default hyperparameters should be used.

  • output_path – Where output data should be placed. Note that this value should be empty unless this file is called from a pipeline.

  • random_seed – Random seed to be used when training. Should not be -1 when used through the CLI.

  • path_elec – Filepath of the electricity building file which has been used.

  • path_gas – Filepath of the gas building file, if it has been used (pass nothing otherwise).

  • val_building_path – Filepath of the validation building file, if it has been used (pass nothing otherwise).

  • process_type – Either ‘energy’ or ‘costing’ to specify the operations to be performed.

  • use_updated_model – True if the larger model architecture should be used for training. Should be False if a costing model is being trained.

  • use_dropout – True if the regularization technique should be used (on by default). False if tests are desired without dropout. Note that not using dropout may cause bias to be learned during training.

Returns:

The results from the model prediction are uploaded to MinIO.

predict.main(config_file=<typer.models.ArgumentInfo object>, process_type=<typer.models.ArgumentInfo object>, preprocessed_data_file=<typer.models.ArgumentInfo object>, selected_features_file=<typer.models.ArgumentInfo object>, selected_model_type=<typer.models.OptionInfo object>, perform_param_search=<typer.models.OptionInfo object>, output_path=<typer.models.OptionInfo object>, random_seed=<typer.models.OptionInfo object>, path_elec=<typer.models.ArgumentInfo object>, path_gas=<typer.models.OptionInfo object>, val_building_path=<typer.models.OptionInfo object>, use_updated_model=<typer.models.OptionInfo object>, use_dropout=<typer.models.OptionInfo object>)

Using all preprocessed data, build and train a Machine Learning model to predict the total energy or costing values. All steps of this process are saved, and the model is evaluated to determine its effectiveness overall and on specific building types and climate zones.

Parameters:
  • config_file (str) – Location of the .yml config file (default name is input_config.yml).

  • process_type (str) – Either ‘energy’ or ‘costing’ to specify the operations to be performed.

  • preprocessed_data_file (str) – Location and name of a .json preprocessing file to be used.

  • selected_features_file (str) – Location and name of a .json feature selection file to be used.

  • selected_model_type (str) – Type of model selected. Can be either ‘mlp’ for Multilayer Perceptron or ‘rf’ for Random Forest.

  • perform_param_search (str) – ‘yes’ if hyperparameter tuning should be performed (increases runtime), ‘no’ if the default hyperparameters should be used.

  • output_path (str) – Where output data should be placed. Note that this value should be empty unless this file is called from a pipeline.

  • random_seed (int) – Random seed to be used when training. Should not be -1 when used through the CLI.

  • path_elec (str) – Filepath of the electricity building file which has been used.

  • path_gas (str) – Filepath of the gas building file, if it has been used (pass nothing otherwise).

  • val_building_path (str) – Filepath of the validation building file, if it has been used (pass nothing otherwise).

  • use_updated_model (bool) – True if the larger model architecture should be used for training. Should be False if a costing model is being trained.

  • use_dropout (bool) – True if the regularization technique should be used (on by default). False if tests are desired without dropout. Note that not using dropout may cause bias to be learned during training.

predict.model_builder(hp)

Builds the model that is used to search for hyperparameters. The hyperparameter search includes activation, regularizers, dropout_rate, learning_rate, and optimizer.

Parameters:

hp – hyperband object with different hyperparameters to be checked.

Returns:

The model built from the different hyperparameter combinations.

predict.predicts_hp(X_train, y_train, X_test, y_test, selected_feature, output_path, random_seed)

DECOMMISSIONED: May need updates for multi-outputs. Using the combined set of hyperparameters, the model built is used to make predictions.

Parameters:
  • X_train – X train set

  • y_train – y train set

  • X_test – X test set

  • y_test – y test set

  • selected_feature – selected features that would be used to build the model

  • output_path – Where the output files should be placed.

  • random_seed – The random seed to be used

Returns:

Model built from the set of hyperparameters combined.

predict.rmse_loss(y_true, y_pred)

A customized RMSE score that takes a sum of y_pred and of y_true before computing the RMSE.

Parameters:
  • y_true – y testset

  • y_pred – y predicted value from the model

Returns:

RMSE loss from comparing the y_true and y_pred values
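The description of rmse_loss leaves open which axis the sum is taken over. The NumPy sketch below assumes the outputs of each sample are summed (last axis) before the RMSE is computed; both the axis choice and the helper name `sketch_rmse_loss` are assumptions, and the real function may operate on tensors rather than arrays.

```python
import numpy as np


def sketch_rmse_loss(y_true, y_pred):
    """Hypothetical NumPy illustration of a summed-output RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: sum across the outputs of each sample (last axis).
    sum_true = y_true.sum(axis=-1)
    sum_pred = y_pred.sum(axis=-1)
    # Root of the mean squared difference between the summed values.
    return float(np.sqrt(np.mean((sum_pred - sum_true) ** 2)))


# Two samples with two outputs each: the sums are (3, 7) and (4, 6).
loss = sketch_rmse_loss([[1, 2], [3, 4]], [[1, 3], [2, 4]])  # 1.0
```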

predict.score(y_test, y_pred)

Computes the MSE, RMSE, MAE and MAPE scores.

Parameters:
  • y_test – y testset

  • y_pred – y predicted value from the model

Returns:

MSE, RMSE, MAE and MAPE scores from comparing the y_test and y_pred values
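For reference, the four scores can be written out with NumPy as in the sketch below. The helper name `sketch_score` is hypothetical, and the MAPE is expressed here as a fraction rather than a percentage, which may differ from the convention the module uses.

```python
import numpy as np


def sketch_score(y_test, y_pred):
    """Hypothetical illustration of the four regression scores."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_test - y_pred
    mse = float(np.mean(error ** 2))       # mean squared error
    rmse = float(np.sqrt(mse))             # root mean squared error
    mae = float(np.mean(np.abs(error)))    # mean absolute error
    # Mean absolute percentage error as a fraction; undefined if y_test has zeros.
    mape = float(np.mean(np.abs(error / y_test)))
    return mse, rmse, mae, mape


mse, rmse, mae, mape = sketch_score([2.0, 4.0], [1.0, 6.0])
```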