Train Model Pipeline

Given all data files, preprocess the data, then train an energy model and a costing model with the preprocessed data.

train_model_pipeline.main(config_file=<typer.models.ArgumentInfo object>, random_seed=<typer.models.OptionInfo object>, hourly_energy_electric_file=<typer.models.OptionInfo object>, building_params_electric_file=<typer.models.OptionInfo object>, val_hourly_energy_file=<typer.models.OptionInfo object>, val_building_params_file=<typer.models.OptionInfo object>, hourly_energy_gas_file=<typer.models.OptionInfo object>, building_params_gas_file=<typer.models.OptionInfo object>, skip_file_preprocessing=<typer.models.OptionInfo object>, delete_preprocessing_file=<typer.models.OptionInfo object>, preprocessed_data_file=<typer.models.OptionInfo object>, estimator_type=<typer.models.OptionInfo object>, skip_feature_selection=<typer.models.OptionInfo object>, selected_features_file=<typer.models.OptionInfo object>, skip_model_training=<typer.models.OptionInfo object>, use_updated_model=<typer.models.OptionInfo object>, use_dropout=<typer.models.OptionInfo object>, selected_model_type=<typer.models.OptionInfo object>)

Run through the entire training pipeline to train two surrogate Machine Learning models, one to predict energy and one to predict costing. The energy model is trained first, then the costing model, and the steps below are repeated for each. All outputs are placed in a folder created inside the specified output path and named with the current datetime, so each run gets a unique folder.

  1. The provided input building files are loaded.
  2. For energy training, the energy files are preprocessed, and the weather data is collected and processed.
  3. The dataset is split into train/test/validation sets.
  4. The best features from the input data are selected for training.
  5. A Machine Learning model is instantiated and trained with the preprocessed data and selected features.

Note that all inputs except the config file are optional. If an empty string is passed for the config .yml file, the arguments can instead be set from the command line.
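
Two parts of the process described above, the datetime-named output folder and the train/test/validation split, can be illustrated with a minimal sketch. This is not the pipeline's own code: the helper names make_run_dir and split_dataset, the folder name format, and the 10% test and validation fractions are all assumptions chosen for illustration.

```python
import random
from datetime import datetime
from pathlib import Path


def make_run_dir(output_path, now=None):
    """Create and return an output folder named after the current datetime.

    The name format used here is an assumption, not the pipeline's actual format.
    """
    now = now or datetime.now()
    run_dir = Path(output_path) / now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def split_dataset(rows, random_seed, test_frac=0.1, val_frac=0.1):
    """Shuffle rows reproducibly and split them into train/test/validation lists.

    The fractions are hypothetical defaults chosen for this sketch.
    """
    shuffled = list(rows)
    # A dedicated Random instance keeps the split repeatable for a given seed.
    random.Random(random_seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, val
```

Passing the same random_seed reproduces the same split, which is the usual reason for exposing a seed as a training input.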

Parameters:
  • config_file (str) – Location of the .yml config file (default name is input_config.yml).

  • random_seed (int) – Random seed to be used when training.

  • hourly_energy_electric_file (str) – Location and name of an electricity energy file to be used if the config file is not used.

  • building_params_electric_file (str) – Location and name of an electricity building parameters file to be used if the config file is not used.

  • val_hourly_energy_file (str) – Location and name of an energy validation file to be used if the config file is not used.

  • val_building_params_file (str) – Location and name of a building parameters validation file to be used if the config file is not used.

  • hourly_energy_gas_file (str) – Location and name of a gas energy file to be used if the config file is not used.

  • building_params_gas_file (str) – Location and name of a gas building parameters file to be used if the config file is not used.

  • skip_file_preprocessing (bool) – True if generation of the .json preprocessing file should be skipped and the preprocessed_data_file input used instead, False if the preprocessing file should be generated.

  • delete_preprocessing_file (bool) – True if the preprocessing output file should be removed after use, False if the preprocessing file should be kept (for analysis or to use in later training runs).

  • preprocessed_data_file (str) – Location and name of a .json preprocessing file to be used if the preprocessing is skipped.

  • estimator_type (str) – The type of feature selection to be performed. The default is ‘lasso’, which is used if nothing is passed. The other options are ‘linear’, ‘elasticnet’, and ‘xgb’.

  • skip_feature_selection (bool) – True if generation of the .json feature selection file should be skipped and the selected_features_file input used instead, False if the feature selection file should be generated.

  • selected_features_file (str) – Location and name of a .json feature selection file to be used if the feature selection is skipped.

  • perform_param_search – ‘yes’ if hyperparameter tuning should be performed (increases runtime), ‘no’ if the default hyperparameters should be used.

  • skip_model_training (bool) – True if the model training should be skipped. Useful if only the preprocessing steps should be performed.

  • use_updated_model (bool) – True if the larger model architecture should be used for energy training.

  • use_dropout (bool) – True if dropout regularization should be used (on by default), False if tests are desired without dropout. Note that not using dropout may cause bias to be learned during training.

  • selected_model_type (str) – Type of model selected. Can be either ‘mlp’ for Multilayer Perceptron or ‘rf’ for Random Forest.

Return type:

None
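
The string options documented above accept only a fixed set of values, and estimator_type falls back to ‘lasso’ when nothing is passed. A minimal sketch of that behaviour follows. The helper name check_options is hypothetical, and treating an empty string as "nothing passed" is an assumption; the pipeline may validate its inputs differently.

```python
# Accepted values as listed in the parameter descriptions.
ESTIMATOR_TYPES = ("lasso", "linear", "elasticnet", "xgb")
MODEL_TYPES = ("mlp", "rf")


def check_options(estimator_type, selected_model_type):
    """Return validated (estimator_type, selected_model_type) values.

    Hypothetical helper for illustration; not part of train_model_pipeline.
    """
    # Assumption: an empty value means "not passed", so the documented
    # default of 'lasso' applies.
    estimator_type = estimator_type or "lasso"
    if estimator_type not in ESTIMATOR_TYPES:
        raise ValueError(f"estimator_type must be one of {ESTIMATOR_TYPES}")
    if selected_model_type not in MODEL_TYPES:
        raise ValueError(f"selected_model_type must be one of {MODEL_TYPES}")
    return estimator_type, selected_model_type
```

For example, check_options("", "rf") resolves to the ‘lasso’ estimator with a Random Forest model, while an unlisted value such as "ridge" raises a ValueError.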