Train the model¶
If there are new data files that significantly change the types of information the model knows how to predict, you need to retrain the model.
If you want to retrain the model but are not dealing with significant changes in the data, rerun the data cleaning and splitting steps as needed.
The steps of training a model are called inside train_model_pipeline.py, but each of the individual steps performed is also discussed below.
Training a model¶
Training a model can be done by calling the train_model_pipeline.py file. This file combines each preprocessing step so that the entire training process can be invoked from one call. The process begins by creating a unique output directory based on the date and time at which the process begins. All outputs will be located within this new directory.
Using the input_config.yml file or the passed command line arguments (which can be viewed by passing the --help argument), each required input will be validated before being used within the program.
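Assuming the same invocation pattern as the other scripts in this pipeline, the full training process can be started (and the available arguments listed) with:
python train_model_pipeline.py input_config.yml
python train_model_pipeline.py --help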
The data will first be loaded by preprocessing.py, which loads the building data and the energy data and merges them with the weather data (for energy training). For costing, no weather or energy data is loaded.
Following additional data cleaning, the data is split into train/test/validation sets,
where the validation set is either split from the regular input data or
is loaded from an explicitly provided file to be used for validation.
Note
All input data will be used 'as is'; any transformations requiring specific calculations should be performed before the data is passed in for processing.
Before training the Machine Learning model, the optimal features will be selected by calling
feature_selection.py
. This process will use one of several available tools for feature selection
to derive a list of features which will be used for training. This omits any feature which will not be
useful in the training process. The availability of each feature selection tool will vary depending on
whether the total energy/costing is predicted or whether the energy/costing breakdowns are predicted.
There is the option of selecting between two models: either a Multilayer Perceptron (MLP) or a Random Forest (RF) model. The user can specify which type of model should be trained by setting the selected_model_type parameter in input_config.yml to 'mlp' for the Multilayer Perceptron or 'rf' for the Random Forest model.
Using the selected features and preprocessed datasets, predict.py
will be called to train the model.
Hyperparameter tuning can be performed at the cost of additional training time, or the default model architecture can be used. The trained model and test results will be output for future use.
In the output .json file, there will be breakdowns of the model's performance for each building type and climate zone.
Preparation of weather data¶
EnergyPlus uses weather data in the EPW file format, as defined in the Auxiliary Programs documentation. Before it can be consumed by the model, the data needs to undergo several processing steps.
The weather data is received from a GitHub repo.
Weather data can be automatically prepared from the information found in the input building .xlsx files. The prepare_weather.py script can be used to process and save weather data.
Although prepare_weather.py
can be used to process the weather files directly, within the training and
running pipelines, the weather is processed as part of the data preprocessing step by loading all weather
files attached to buildings in the input file(s). The contents of this subsection outline how the weather data
is processed.
Note
The weather data is only loaded for energy preprocessing. If there are connection issues, the code may return an error when retrieving the weather data from the repository; you can try running the program again if this arises.
python prepare_weather.py input_config.yml
Note
The input to prepare_weather.py
is the same configuration file that is given to train_model_pipeline.py
and to run_model.py
. The CLI can also be used to invoke the process without a complete configuration file.
The weather files to process must now be passed through the CLI since it is no longer an input in the
configuration file.
Outputs are placed into a /weather
folder (by default) in storage as parquet files. The name of the weather files
will match the input name found in the YAML file, but with a .parquet
file extension.
Note
When called within the data preprocessing step, no .parquet
file is output. The output is directly used without
needing to be loaded again.
What this does¶
EPW files contain several lines of data that are irrelevant to the model, all located in the first several lines of the file. For example, everything before the line starting with 1966 can be ignored:
LOCATION,Montreal Int'l,PQ,CAN,WYEC2-B-94792,716270,45.47,-73.75,-5.0,36.0
DESIGN CONDITIONS,1,Climate Design Data 2009 ASHRAE Handbook,,Heating,1,-23.7,-21.1,-30.5,0.2,-23,-27.9,0.3,-20.6,12.9,-5.3,11.5,-7.9,3.9,260,Cooling,7,9.3,30,22.1,28.5,21.1,27.1,20.2,23.2,28.1,22.2,26.6,21.4,25.6,4.9,220,21.6,16.3,26,20.7,15.5,25.2,19.8,14.5,24.2,69.3,28.1,65.5,26.7,62.3,25.6,703,Extremes,11.1,9.7,8.6,27.4,-26.5,32.3,2.9,1.5,-28.6,33.4,-30.4,34.3,-32,35.2,-34.2,36.3
TYPICAL/EXTREME PERIODS,6,Summer - Week Nearest Max Temperature For Period,Extreme,7/13,7/19,Summer - Week Nearest Average Temperature For Period,Typical,6/ 8,6/14,Winter - Week Nearest Min Temperature For Period,Extreme,1/ 6,1/12,Winter - Week Nearest Average Temperature For Period,Typical,2/17,2/23,Autumn - Week Nearest Average Temperature For Period,Typical,10/13,10/19,Spring - Week Nearest Average Temperature For Period,Typical,4/12,4/18
GROUND TEMPERATURES,3,.5,,,,-1.50,-6.19,-7.46,-6.35,-0.03,7.05,13.71,18.53,19.94,17.67,12.21,5.33,2,,,,2.71,-1.68,-3.77,-3.85,-0.51,4.33,9.54,14.01,16.32,15.89,12.81,8.08,4,,,,5.45,2.05,-0.04,-0.69,0.54,3.36,6.87,10.31,12.62,13.17,11.85,9.08
HOLIDAYS/DAYLIGHT SAVINGS,No,0,0,0
COMMENTS 1,WYEC2-Canadian Weather year for Energy Calculations (CWEC) -- WMO#716270
COMMENTS 2, -- Ground temps produced with a standard soil diffusivity of 2.3225760E-03 {m**2/day}
DATA PERIODS,1,1,Data,Sunday, 1/ 1,12/31
1966,1,1,1,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,6.8,4.9,88,100550,0,9999,322,0,0,0,0,0,0,999900,225,7.2,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0
1966,1,1,2,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,8.3,6.2,87,100310,0,9999,330,0,0,0,0,0,0,999900,248,6.7,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0
1966,1,1,3,60,A_A__*A_9*M_Q_M_Q_Q_Q_9_____________*_*___*,9.2,7.1,87,100170,0,9999,335,0,0,0,0,0,0,999900,248,8.1,10,10,16.1,3600,0,999999999,0,0.0000,0,88,0.000,0.0,0.0
Column names for the tabular data can be found in the Auxiliary Programs documentation.
Remove unused columns¶
The Auxiliary Programs documentation for the EPW file format lists which columns in the file are actually used for predicting building energy use. Columns marked as not used by EnergyPlus are removed from the dataset.
Note
The year
column is marked as not being used by EnergyPlus, but is kept for use in creating a datetime index
of the data. Only the month and day values are used when merging with building and energy data.
Summarize by day¶
Weather data in the file is provided for every hour for all 365 days of the year. Due to the way the data is combined in later steps, this becomes millions of records that need to be processed despite the fact that there is no intent to predict energy use to the hourly level. To make working with the data more manageable, weather values are summarized to daily intervals.
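As a rough illustration of these steps (the file name is a placeholder, the column subset and the 8 skipped header lines follow the example above, and using a mean as the daily summary is an assumption; prepare_weather.py may aggregate some columns differently):
import pandas as pd

# Illustrative subset of the EPW column names; the full list (and which
# columns EnergyPlus actually uses) is in the Auxiliary Programs documentation.
cols = ["year", "month", "day", "hour", "minute", "flags",
        "dry_bulb_c", "dew_point_c", "rel_humidity"]

# The tabular records start after the header lines (LOCATION ... DATA PERIODS).
weather = pd.read_csv("CAN_PQ_Montreal_CWEC.epw", skiprows=8, header=None)
weather = weather.iloc[:, :len(cols)]
weather.columns = cols

# The year column is kept only to build a date; month and day are what get
# matched against the building and energy data later.
weather["date"] = pd.to_datetime(weather[["year", "month", "day"]])

# Summarize the hourly records to daily values.
daily = weather.groupby("date")[["dry_bulb_c", "dew_point_c", "rel_humidity"]].mean()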
Clean and split the data¶
With the inputs specified through the command line or through input_config.yml, run preprocessing.py to clean and split the data into train, test, and validation sets:
python preprocessing.py input_config.yml
Note
Update the input parameters to point to the appropriate input files.
Check preprocessing
for a full description of each parameter.
Aside from the supported optional files for electricity and gas, if you have multiple input files to combine, they need to be manually combined before being passed as input.
Note
The system will not work with columns containing strings and floats together. Columns that mix default values requiring conversion with regular numbers must be updated to contain only numbers; all data should be fully prepared before being passed into the program (a small pandas check along these lines is sketched below).
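A quick way to spot such columns with pandas before running the pipeline (a sketch only; the input file name is a placeholder):
import pandas as pd

df = pd.read_excel("building_input.xlsx")  # placeholder building input file

# Columns that mix text defaults with numbers load with dtype 'object';
# coercing them to numeric flags the non-numeric entries as NaN.
for col in df.select_dtypes(include="object").columns:
    coerced = pd.to_numeric(df[col], errors="coerce")
    if coerced.notna().any() and coerced.isna().any():
        print(f"Column {col!r} mixes numeric and non-numeric values")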
What this does¶
During preprocessing, the hourly energy consumption file is transposed such that each datapoint_id
has 8760 rows
(365 * 24). Hence, for a simulation run containing 5000 datapoint_id
values, there would be 5000 * 8760 rows which would
be 43.8 million rows. To prevent the preprocessing stage from becoming computationally expensive due to the large number of rows created, the transposed hourly energy file is aggregated to daily energy for each datapoint_id. Similarly, the weather information is loaded based on which climate zones are used within the building input data and is aggregated from hourly to daily, so that it can be merged with the daily energy data.
In essence, the simulation I/O file(s), the weather data from the specified climate zones, and the hourly energy consumption file(s) are all merged into one dataframe, which is then split for training and testing purposes (training set, testing set, validation set).
Note
There are two versions of the program, one which preprocesses to only include the total energy and costing values, and one which keeps all types of energy and costing to work with later.
For costing, the only preprocessing performed is on the building input files, with the weather and energy preprocessing being skipped.
Note
The total daily energy computed for each datapoint_id is converted to megajoules per square meter (mj_per_m_sq), which is derived by converting the total energy used provided in the simulation I/O file to megajoules and then dividing the result by the building floor area in square meters (a small sketch of this conversion follows this note).
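As a small illustration of that conversion, assuming the simulation I/O file reports energy in gigajoules (only the conversion factor changes if the source unit differs):
def to_mj_per_m_sq(total_energy_gj: float, floor_area_m2: float) -> float:
    """Convert a daily total energy value to megajoules per square meter."""
    energy_mj = total_energy_gj * 1000.0  # 1 GJ = 1000 MJ (assumed source unit)
    return energy_mj / floor_area_m2

print(to_mj_per_m_sq(1.2, 500.0))  # 1.2 GJ over 500 m2 -> 2.4 mj_per_m_sq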
In the case where the validation set is not provided, the dataset is split into a 70% training set, a 20% test set, and a 10% validation set. However, when a validation set is provided, the dataset is split only into an 80% training set and a 20% test set. It should be noted that the data splitting is performed using the GroupShuffleSplit function, which ensures that all data instances for a datapoint_id are kept within the same data group; this implies that a datapoint_id used for training will not have any instances in the test set.
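A minimal sketch of that grouping behaviour for the 70/20/10 case, using scikit-learn's GroupShuffleSplit with datapoint_id as the grouping key (the helper shown here is illustrative; the exact calls in preprocessing.py may differ):
from sklearn.model_selection import GroupShuffleSplit

def split_by_datapoint(df, random_state=42):
    """Split the merged dataframe 70/20/10 while keeping each datapoint_id intact."""
    groups = df["datapoint_id"]

    # Hold out 30% of the datapoint_ids; every row of a given datapoint_id
    # stays on the same side of the split.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=random_state)
    train_idx, rest_idx = next(splitter.split(df, groups=groups))
    train, rest = df.iloc[train_idx], df.iloc[rest_idx]

    # Split the held-out 30% into a 20% test set and a 10% validation set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=1 / 3, random_state=random_state)
    test_idx, val_idx = next(splitter.split(rest, groups=groups.iloc[rest_idx]))
    return train, rest.iloc[test_idx], rest.iloc[val_idx]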
Note
The json file created at the end of preprocessing has the following keys (a minimal sketch of loading this file follows the list):
features: All of the features remaining after data cleaning and preprocessing. Note that these are not the final features used for modelling.
y_train: The y_train dataset.
X_train: The X_train dataset.
X_test: The X_test dataset.
y_test: The y_test dataset.
y_test_complete: Similar to the y_test dataset, but with the additional column datapoint_id, which is needed when creating the final output file after predict.py is run.
X_validate: The X_validate dataset.
y_validate: The y_validate dataset.
y_validate_complete: Similar to the y_validate dataset, but with the additional column datapoint_id, which is needed when creating the final output file after predict.py is run.
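A minimal sketch of reading that file back (the file name and the exact serialisation of each value are assumptions; adapt to how preprocessing.py actually writes them):
import json
import pandas as pd

with open("preprocessing_output.json") as f:  # placeholder file name
    datasets = json.load(f)

# Rebuild the training split, assuming the values deserialise cleanly into DataFrames.
X_train = pd.DataFrame(datasets["X_train"])
y_train = pd.DataFrame(datasets["y_train"])
print(datasets["features"])  # all features kept after cleaning and preprocessing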
Feature selection¶
After preprocessing the data, the features to be used in building the surrogate model can be selected
(with all inputs coming from the command line or from the input_config.yml
file):
python feature_selection.py input_config.yml
The parameters to the above script are documented at feature_selection
.
To ensure that the most relevant features are used in building the surrogate model, feature selection is performed to search for the optimal features from the preprocessed data.
Note
Feature selection is performed using the X_train and the y_train set.
To determine which features are useful for predicting the total energy and total costing we use:
MultiTaskLassoCV
Feature selection can be performed using any of the following estimator types when predicting the total energy or total costing:
Linear Regression
XGBRegressor
ElasticNetCV
LassoCV
Some details on the options are:
XGBRegressor and ElasticNetCV are often slow in selecting the features.
ElasticNetCV and LassoCV frequently select the same features.
Although LassoCV is used as the default estimator for feature selection, any of the other estimator types can be used by specifying the respective estimator type in the estimator_type parameter when performing feature selection (an illustrative sketch of the default follows below).
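As an illustrative sketch of the default LassoCV path when predicting a single total value, using the X_train and y_train sets from preprocessing (the helper below is hypothetical; feature_selection.py wraps this logic with additional options):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

def select_features(X_train, y_train):
    """Keep only the features whose LassoCV coefficients are non-zero."""
    selector = SelectFromModel(LassoCV(cv=5, random_state=42))
    selector.fit(X_train, y_train)
    return list(X_train.columns[selector.get_support()])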
Note
The json file that is created contains the key "features", which holds the final features selected for modelling.
Build the surrogate model and predict energy use¶
The outputs from the preprocessing and feature selection steps are used as input to predict.py
to derive a trained model for predicting the energy and costing:
python predict.py input_config.yml
This script will output the trained model as a .h5 file for a Multilayer Perceptron model and as a .joblib file for a Random Forest model, alongside information on the training and testing performed for analysis. The outputs from this step will be used when obtaining predictions with the model.
The parameters to the above script are documented at: predict
.
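As a hedged illustration of loading those saved model files later (the file names are placeholders; the Keras model is loaded without compiling because of the customized loss described below):
import joblib
from tensorflow import keras

# Multilayer Perceptron: saved as an HDF5 (.h5) file.
mlp_model = keras.models.load_model("output/trained_model.h5", compile=False)

# Random Forest: saved as a .joblib file.
rf_model = joblib.load("output/trained_model.joblib")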
The Machine Learning model which is used for training will vary depending on whether parameter tuning is or is not performed (total outputs only) and depending on whether the user selects the more complex or the simpler model. When hyperparameter tuning is used (total outputs only), the Keras Hyperband Tuner is configured to optimize the parameters based on the loss. The tuner also uses early stopping by monitoring the loss with a patience of 5. Once the tuner finishes, the optimized model is built and used for training.
If hyperparameter tuning is not performed, a default model which has been evaluated as a baseline for performance is used. There are two options for the default model: one with more hidden nodes and a lower learning rate, and another with fewer hidden nodes and a larger learning rate (a minimal sketch of the smaller variant follows the list below).
Larger model: This model uses a single layer of 10000 nodes with the ReLU activation function. An optional 10% dropout layer is then applied. The adam optimizer is used with a learning rate of 0.0001.
Smaller model: This model uses a single layer of 56 nodes with the ReLU activation function. An optional 10% dropout layer is then applied. The adam optimizer is used with a learning rate of 0.001.
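A minimal Keras sketch of the smaller default model described above (the output size and the stand-in loss are assumptions; the pipeline trains with the customized RMSE loss described next):
from tensorflow import keras

def build_small_default_model(n_features: int, n_outputs: int = 1, use_dropout: bool = True):
    """Single hidden layer of 56 ReLU nodes with an optional 10% dropout layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    model.add(keras.layers.Dense(56, activation="relu"))
    if use_dropout:
        model.add(keras.layers.Dropout(0.1))
    model.add(keras.layers.Dense(n_outputs))
    # "mse" stands in here; training actually uses the customized RMSE loss.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model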
A customized RMSE loss function is defined and used for training which takes a sum of the labels and predictions before computing the RMSE loss.
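A hedged sketch of a loss along those lines, summing each sample's labels and predictions across the output columns before taking the RMSE (the definition in predict.py may differ in detail):
import tensorflow as tf

def summed_rmse(y_true, y_pred):
    """RMSE computed on the per-sample sums of labels and predictions."""
    true_sum = tf.reduce_sum(tf.cast(y_true, tf.float32), axis=-1)
    pred_sum = tf.reduce_sum(tf.cast(y_pred, tf.float32), axis=-1)
    return tf.sqrt(tf.reduce_mean(tf.square(true_sum - pred_sum)))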
The outputs will include breakdowns on the model’s overall performance and the model’s performance for each building type and climate zone. When working with breakdowns of the energy and costing values, these breakdowns will be done for all types of costing and energy.