Train the model

If there are new data files that significantly change the types of information the model knows how to predict, you need to retrain the model. If you want to retrain the model but are not dealing with significant changes to the data, perform the data cleaning and splitting steps as needed. The steps of training a model are called inside of train_model_pipeline.py, but each of the individual steps performed is also discussed below.

Training a model

Training a model can be done by calling the train_model_pipeline.py file. This file combines each preprocessing step so that the entire training process can be invoked from one call. The process begins by creating a unique output directory based on the date and time at which the process begins. All outputs will be located within this new directory.

Using the input_config.yml file or the passed command line arguments (which can be viewed by passing the --help argument), each required input will be validated before being used within the program.
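
Assuming the same calling convention as the other scripts described in this guide, the full pipeline can be invoked with the configuration file as its argument:

python train_model_pipeline.py input_config.yml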

The data will first be loaded by preprocessing.py, which loads the building data and the energy data, and merges the two with the weather data (for energy training). For costing, no weather or energy data is loaded. Following additional data cleaning, the data is split into train/test/validation sets, where the validation set is either split from the regular input data or is loaded from an explicitly provided file to be used for validation.

Note

All input data is used ‘as is’; any transformations that require specific calculations should be performed before the data is passed in for processing.

Before training the Machine Learning model, the optimal features will be selected by calling feature_selection.py. This process will use one of several available tools for feature selection to derive a list of features which will be used for training, omitting any feature which will not be useful in the training process. The availability of each feature selection tool will vary depending on whether the total energy/costing is predicted or whether the energy/costing breakdowns are predicted.

There is an option to select between two models: either a Multilayer Perceptron (MLP) or a Random Forest (RF) model. The user can specify the type of model to be trained by specifying ‘mlp’ for Multilayer Perceptron or ‘rf’ for Random Forest in the selected_model_type parameter in input_config.yml.
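
As a minimal sketch of how this selection can be consumed (assuming selected_model_type is a top-level key in input_config.yml, which may differ from the actual layout), the value can be read and validated in Python as follows:

import yaml

# Load the training configuration and check the selected model type.
# The key location is an assumption for illustration.
with open("input_config.yml") as config_file:
    config = yaml.safe_load(config_file)

model_type = config["selected_model_type"]  # 'mlp' or 'rf'
if model_type not in ("mlp", "rf"):
    raise ValueError(f"Unsupported selected_model_type: {model_type}")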

Using the selected features and preprocessed datasets, predict.py will be called to train the model. Hyperparameter tuning can be performed at the cost of additional training time, or the default model architecture can be used. The trained model and test results will be output for future use. In the output .json file, there will be breakdowns of the model’s performance for each building type and climate zone.

Preparation of weather data

EnergyPlus uses weather data in the EPW file format, as defined in the Auxiliary Programs documentation. The data must undergo several processing steps before it can be consumed by the model.

The weather data is received from a GitHub repo.

Weather data can be automatically prepared from the information found in the input building .xlsx files. The prepare_weather.py script can be used to process and save weather data.

Although prepare_weather.py can be used to process the weather files directly, within the training and running pipelines, the weather is processed as part of the data preprocessing step by loading all weather files attached to buildings in the input file(s). The contents of this subsection outline how the weather data is processed.

Note

The weather data is only loaded for energy preprocessing. If there are connection issues when retrieving the weather data from the repository, the code may return an error; try running the program again if this arises.

python prepare_weather.py input_config.yml

Note

The input to prepare_weather.py is the same configuration file that is given to train_model_pipeline.py and to run_model.py. The CLI can also be used to invoke the process without a complete configuration file. The weather files to process must now be passed through the CLI since it is no longer an input in the configuration file.

Outputs are placed into a /weather folder (by default) in storage as parquet files. The names of the output files will match the names of the input weather files, but with a .parquet file extension.

Note

When called within the data preprocessing step, no .parquet file is output. The output is directly used without needing to be loaded again.

What this does

EPW files begin with several lines of data that are irrelevant to the model. For example, everything before the line starting with 1966 can be ignored:

LOCATION,Montreal Int'l,PQ,CAN,WYEC2-B-94792,716270,45.47,-73.75,-5.0,36.0
DESIGN CONDITIONS,1,Climate Design Data 2009 ASHRAE Handbook,,Heating,1,-23.7,-21.1,-30.5,0.2,-23,-27.9,0.3,-20.6,12.9,-5.3,11.5,-7.9,3.9,260,Cooling,7,9.3,30,22.1,28.5,21.1,27.1,20.2,23.2,28.1,22.2,26.6,21.4,25.6,4.9,220,21.6,16.3,26,20.7,15.5,25.2,19.8,14.5,24.2,69.3,28.1,65.5,26.7,62.3,25.6,703,Extremes,11.1,9.7,8.6,27.4,-26.5,32.3,2.9,1.5,-28.6,33.4,-30.4,34.3,-32,35.2,-34.2,36.3
TYPICAL/EXTREME PERIODS,6,Summer - Week Nearest Max Temperature For Period,Extreme,7/13,7/19,Summer - Week Nearest Average Temperature For Period,Typical,6/ 8,6/14,Winter - Week Nearest Min Temperature For Period,Extreme,1/ 6,1/12,Winter - Week Nearest Average Temperature For Period,Typical,2/17,2/23,Autumn - Week Nearest Average Temperature For Period,Typical,10/13,10/19,Spring - Week Nearest Average Temperature For Period,Typical,4/12,4/18
GROUND TEMPERATURES,3,.5,,,,-1.50,-6.19,-7.46,-6.35,-0.03,7.05,13.71,18.53,19.94,17.67,12.21,5.33,2,,,,2.71,-1.68,-3.77,-3.85,-0.51,4.33,9.54,14.01,16.32,15.89,12.81,8.08,4,,,,5.45,2.05,-0.04,-0.69,0.54,3.36,6.87,10.31,12.62,13.17,11.85,9.08
HOLIDAYS/DAYLIGHT SAVINGS,No,0,0,0
COMMENTS 1,WYEC2-Canadian Weather year for Energy Calculations (CWEC) -- WMO#716270
COMMENTS 2, -- Ground temps produced with a standard soil diffusivity of 2.3225760E-03 {m**2/day}
DATA PERIODS,1,1,Data,Sunday, 1/ 1,12/31
1966,1,1,1,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,6.8,4.9,88,100550,0,9999,322,0,0,0,0,0,0,999900,225,7.2,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0
1966,1,1,2,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,8.3,6.2,87,100310,0,9999,330,0,0,0,0,0,0,999900,248,6.7,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0
1966,1,1,3,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,9.2,7.1,87,100170,0,9999,335,0,0,0,0,0,0,999900,248,8.1,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0

Column names for the tabular data can be found in the Auxiliary Programs documentation.

Remove unused columns

The Auxiliary Programs documentation for the EPW file format lists which columns in the file are actually used for predicting building energy use. Columns marked as not used by EnergyPlus are removed from the dataset.

Note

The year column is marked as not being used by EnergyPlus, but is kept for use in creating a datetime index of the data. Only the month and day values are used when merging with building and energy data.
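
A minimal sketch of this step (the column names year, month, day and hour and the helper name clean_epw are illustrative assumptions; the columns to drop are those flagged in the Auxiliary Programs documentation):

import pandas as pd

# Drop the columns flagged as unused and build a datetime index from the
# remaining year/month/day/hour columns. EPW hours run from 1 to 24, so they
# are shifted to 0-23 before assembling the index.
def clean_epw(epw: pd.DataFrame, unused_columns: list[str]) -> pd.DataFrame:
    epw = epw.drop(columns=unused_columns)
    epw.index = pd.to_datetime(
        {"year": epw["year"], "month": epw["month"],
         "day": epw["day"], "hour": epw["hour"] - 1}
    )
    return epw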

Summarize by day

Weather data in the file is provided for every hour for all 365 days of the year. Due to the way the data is combined in later steps, this becomes millions of records that need to be processed despite the fact that there is no intent to predict energy use to the hourly level. To make working with the data more manageable, weather values are summarized to daily intervals.
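
A minimal sketch of the daily summary, assuming the hourly data already carries a datetime index as described above (taking the mean of every column is an assumption; the real code may aggregate different weather variables differently):

import pandas as pd

# Collapse hourly weather records to one row per day.
def summarize_daily(hourly_weather: pd.DataFrame) -> pd.DataFrame:
    return hourly_weather.resample("D").mean()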

Clean and split the data

With the inputs specified through the command line or through input_config.yml, run preprocessing.py to clean and split the data into train, test, and validation sets:

python preprocessing.py input_config.yml

Note

Update the input parameters to point to the appropriate input files. See preprocessing for a full description of each parameter.

Aside from the optional files for electricity and gas, if you have multiple input files to combine, they need to be combined manually before being passed as input.

Note

The system will not work with columns containing a mixture of strings and floats. Columns containing a mix of default values which require conversion and regular numbers must be updated to contain only numbers. All data should be fully prepared before being passed into the program.
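
For example, a column that mixes placeholder strings with numbers can be coerced to numeric before the file is passed in. The column name and the replacement value below are purely illustrative assumptions:

import pandas as pd

# Coerce a mixed string/float column to numbers only. Any value that cannot
# be converted becomes NaN and is replaced with a placeholder of 0.0.
buildings = pd.read_excel("building_input.xlsx")
buildings["window_to_wall_ratio"] = pd.to_numeric(
    buildings["window_to_wall_ratio"], errors="coerce"
).fillna(0.0)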

What this does

During preprocessing, the hourly energy consumption file is transposed such that each datapoint_id has 8760 rows (365 * 24). Hence, for a simulation run containing 5000 datapoint_id values, there would be 5000 * 8760 rows, which is 43.8 million rows. To prevent the preprocessing stage from becoming computationally expensive due to the large number of datapoints created, the transposed hourly energy file is aggregated to daily energy for each datapoint_id. Similarly, the weather information is loaded based on which climate zones are used within the building input data and is then aggregated from hourly to daily so that it can be merged with the daily energy data. In essence, the simulation I/O file(s), the weather data from the specified climate zones, and the hourly energy consumption file(s) are all merged into one dataframe, which is then split for training and testing purposes (training set, testing set, validation set).
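
The sketch below illustrates the shape of that merge (all column names, such as datapoint_id, epw_file and date, are assumptions for illustration; the actual keys are an implementation detail):

import pandas as pd

# Combine building features, daily energy and daily weather into the single
# dataframe that is later split for training and testing.
def merge_for_training(buildings: pd.DataFrame,
                       daily_energy: pd.DataFrame,
                       daily_weather: pd.DataFrame) -> pd.DataFrame:
    # One row per (datapoint_id, date): building features are repeated for
    # each day and joined with that day's weather for the building's climate.
    merged = buildings.merge(daily_energy, on="datapoint_id")
    return merged.merge(daily_weather, on=["epw_file", "date"], how="left")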

Note

There are two versions of the program, one which preprocesses to only include the total energy and costing values, and one which keeps all types of energy and costing to work with later.

For costing, the only preprocessing performed is on the building input files, with the weather and energy preprocessing being skipped.

Note

The total daily energy computed for each datapoint_id is converted to megajoules per square metre (mj_per_m_sq), which is derived by converting the total energy use provided in the simulation I/O file to megajoules and then dividing the result by the building floor area in square metres.
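
As a worked sketch of the conversion (the source unit of gigajoules and the numbers used are assumptions for illustration only):

# Convert a daily total energy value to megajoules per square metre.
total_energy_gj = 1.8   # daily total energy for one datapoint_id, in GJ
floor_area_m2 = 500.0   # building floor area, in square metres
mj_per_m_sq = (total_energy_gj * 1000.0) / floor_area_m2  # = 3.6 MJ/m²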

In the case where a validation set is not provided, the dataset provided is split into a 70% training set, a 20% test set and a 10% validation set. However, when a validation set is provided, the dataset is split only into an 80% training set and a 20% test set. Note that the data splitting is performed using the GroupShuffleSplit function, which ensures that all data instances for a given datapoint_id are included in the same data group; a datapoint_id used for training therefore has no instances in the test set.
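
A sketch of the grouped split, assuming numpy arrays and using datapoint_id as the grouping key (the two-stage 70/20/10 arrangement is illustrative; the exact implementation may differ):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Split so that every row of a given datapoint_id lands in exactly one set.
def split_by_datapoint(X: np.ndarray, y: np.ndarray, groups: np.ndarray):
    # First hold out 30% of the groups, then split that 30% into a 20% test
    # set and a 10% validation set (fractions of the original data).
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
    train_idx, rest_idx = next(outer.split(X, y, groups=groups))

    inner = GroupShuffleSplit(n_splits=1, test_size=1 / 3, random_state=42)
    test_rel, val_rel = next(
        inner.split(X[rest_idx], y[rest_idx], groups=groups[rest_idx])
    )
    return train_idx, rest_idx[test_rel], rest_idx[val_rel]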

Note

The json file created at the end of preprocessing has the following keys:

  • features: This is all the features after data cleaning and preprocessing. Note that these are not the final features used for modelling.

  • y_train: This contains the y_train dataset.

  • X_train: This contains the X_train dataset.

  • X_test: This contains the X_test dataset.

  • y_test: This contains the y_test dataset.

  • y_test_complete: This is similar to the y_test dataset, but has the additional column datapoint_id, which is needed when creating the final output file after predict.py is run.

  • X_validate: This contains the X_validate dataset.

  • y_validate: This contains the y_validate dataset.

  • y_validate_complete: This is similar to the y_validate dataset, but has the additional column datapoint_id, which is needed when creating the final output file after predict.py is run.

Feature selection

After preprocessing the data, the features to be used in building the surrogate model can be selected (with all inputs coming from the command line or from the input_config.yml file):

python feature_selection.py input_config.yml

The parameters to the above script are documented at feature_selection.

To ensure that the most relevant features are used in building the surrogate model, feature selection is performed to search for the optimal features from the preprocessed data.

Note

Feature selection is performed using the X_train and the y_train set.

To determine which features are useful for predicting the energy and costing breakdowns, we use:

  • MultiTaskLassoCV

Feature selection can be performed using any of the following estimator types when predicting the total energy or total costing:

  • Linear Regression

  • XGBRegressor

  • ElasticnetCV

  • LassoCV

Some details on the options are:

  • XGBRegressor and ElasticnetCV are often slow in selecting the features.

  • ElasticnetCV and LassoCV frequently select the same features.

  • Although LassoCV is used as the default estimator for feature selection, any of the other estimator types can be used by specifying the respective estimator type in the estimator_type parameter when performing feature selection (a sketch of the default path follows below).
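
A minimal sketch of the default LassoCV path on the training split (the use of SelectFromModel here is an assumption; the pipeline may apply the estimator differently):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Fit LassoCV on X_train/y_train and keep the features with non-zero weight.
def select_features(X_train, y_train, feature_names):
    selector = SelectFromModel(LassoCV(cv=5)).fit(X_train, y_train)
    mask = selector.get_support()
    return [name for name, keep in zip(feature_names, mask) if keep]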

Note

The json file that is created contains the key “features”, which represents the final features selected for modelling.

Build the surrogate model and predict energy use

The outputs from the preprocessing and feature selection steps are used as input to predict.py to derive a trained model for predicting the energy and costing:

python predict.py input_config.yml

This script will output the trained model as a .h5 file for a Multilayer Perceptron model and as a .joblib file for a Random Forest model, alongside information on the training and testing performed for analysis. The outputs from this step will be used when obtaining predictions with the model.

The parameters to the above script are documented at: predict.

The Machine Learning model which is used for training will vary depending on whether parameter tuning is performed (total outputs only) and on whether the user selects the more complex or the simpler model. When hyperparameter tuning is used (total outputs only), the Keras Hyperband Tuner is configured to optimize the parameters based on the loss. The tuner also uses early stopping by monitoring the loss with a patience of 5. Once the tuner finishes, the optimized model is built and used for training.

If hyperparameter tuning is not performed, a default model which has been evaluated as a baseline for performance is used. There are two options for the default model: one with more hidden nodes and a lower learning rate, and another with fewer hidden nodes and a higher learning rate (a sketch of the smaller option follows the list below).

  • Larger model: This model uses a single layer of 10000 nodes with the ReLU activation function. An optional 10% dropout layer is then applied. The adam optimizer is used with a learning rate of 0.0001.

  • Smaller model: This model uses a single layer of 56 nodes with the ReLU activation function. An optional 10% dropout layer is then applied. The adam optimizer is used with a learning rate of 0.001.
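
A sketch of the smaller default model, assuming a Keras implementation (the layer size, dropout and learning rate follow the description above; the output layer and loss placeholder are illustrative):

import tensorflow as tf

# Single hidden layer of 56 ReLU nodes, optional 10% dropout, Adam at 0.001.
def build_default_mlp(n_features: int, n_outputs: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(56, activation="relu"),
        tf.keras.layers.Dropout(0.10),  # optional in the actual pipeline
        tf.keras.layers.Dense(n_outputs),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="mse",  # the pipeline uses the customized RMSE loss described below
    )
    return model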

A customized RMSE loss function is defined and used for training, which takes a sum of the labels and predictions before computing the RMSE loss.
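
One possible reading of that loss is sketched below: the label vector and the prediction vector are each summed (for example, across the energy or costing components) and the RMSE is computed on those sums. This interpretation is an assumption:

import tensorflow as tf

# Customized RMSE: sum labels and predictions before computing the error.
def summed_rmse(y_true, y_pred):
    true_total = tf.reduce_sum(y_true, axis=-1)
    pred_total = tf.reduce_sum(y_pred, axis=-1)
    return tf.sqrt(tf.reduce_mean(tf.square(true_total - pred_total)))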

The outputs will include breakdowns on the model’s overall performance and the model’s performance for each building type and climate zone. When working with breakdowns of the energy and costing values, these breakdowns will be done for all types of costing and energy.