Preprocessing¶
Preprocesses each dataset and splits the data into train, test, and validation sets. Separate preprocessing steps are applied depending on whether energy or costing predictions are being performed.
- preprocessing.categorical_encode(x_train, x_test, x_validate, output_path, ohe_file='')¶
Used to encode the categorical variables contained in x_train, x_test and x_validate. Note that the returned encoded data contains additional columns, one for each unique categorical value in each categorical column.
- Parameters:
x_train – X trainset
x_test – X testset
x_validate – X validation set
output_path – Where the output files should be placed.
ohe_file – Location of an OHE file which has already been saved from training.
- Returns:
X_train_oh: encoded X trainset
X_test_oh: encoded X testset
x_val_oh: encoded X validation set
all_features: all features after encoding
- Return type:
X_train_oh
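The column expansion described above can be sketched with a plain one-hot encoding step. This is a minimal illustration using pandas.get_dummies with hypothetical sample data, not the module's actual code (the module persists its fitted encoder to ohe_file so the same mapping can be reused at prediction time):

```python
import pandas as pd

# Hypothetical training data: "fuel" is the only categorical column.
x_train = pd.DataFrame({"fuel": ["gas", "electric", "gas"],
                        "floors": [1, 2, 3]})

# One-hot encoding replaces "fuel" with one indicator column per unique value,
# so the encoded frame gains as many columns as there are unique categories.
x_train_oh = pd.get_dummies(x_train, columns=["fuel"])

all_features = list(x_train_oh.columns)
# all_features -> ['floors', 'fuel_electric', 'fuel_gas']
```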
- preprocessing.clean_data(df, additional_exemptions=[])¶
Basic cleaning of the data using the following criteria:
dropping any column with more than 10% missing values. Dropping rows containing missing values would eliminate the entire row/datapoint_id; given the number of features we have to work with, it is better to eliminate columns with too many missing values than to eliminate rows.
dropping columns with 1 unique value. Columns with a single unique value are dropped during data cleaning as they have low variance and hence make little or no significant contribution to the accuracy of the model.
- Parameters:
df – dataset to be cleaned
additional_exemptions – list of columns that should not be dropped, in addition to any exempted within this function
- Returns:
cleaned dataframe
- Return type:
df
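The two cleaning criteria above can be sketched as follows. This is an illustrative reimplementation with hypothetical sample data, not the module's actual code:

```python
import numpy as np
import pandas as pd

def clean_sketch(df, additional_exemptions=()):
    """Illustrative sketch: drop columns with >10% missing values, then constants."""
    keep = [c for c in df.columns
            if c in additional_exemptions or df[c].isna().mean() <= 0.10]
    df = df[keep]
    # Columns with a single unique value have zero variance and are dropped.
    keep = [c for c in df.columns
            if c in additional_exemptions or df[c].nunique(dropna=False) > 1]
    return df[keep]

df = pd.DataFrame({"a": [1, 2, 3, 4],                    # kept
                   "b": [5, 5, 5, 5],                    # constant -> dropped
                   "c": [1.0, np.nan, np.nan, np.nan]})  # 75% missing -> dropped
cleaned = clean_sketch(df)
# cleaned.columns -> ['a']
```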
- preprocessing.create_costing_dataset(energy_daily_df, val_df, costing_df, costing_df_val, valsplit, random_seed)¶
Used to split the dataset by datapoint_id into train, test and validation sets for the costing training process.
- Parameters:
energy_daily_df – the merged dataframe for simulation I/O, weather, and hourly energy file.
val_df – the merged dataframe for simulation I/O, weather, and hourly energy file validation set. Where there is no validation set, its value is null
costing_df – The dataframe containing costing values.
costing_df_val – The dataframe containing costing values for the validation set.
valsplit – flag to indicate if there is a dataframe for the validation set. Accepted values are “yes” or “no”
random_seed – random seed to be passed for when splitting the data
- Returns:
X_train: X trainset
y_train: y trainset
X_test: X testset
y_test_complete: Dataframe containing the target variable with corresponding datapoint_id
X_validate: X validation set
y_validate: y validation set
y_validate_complete: Dataframe containing the target variable with corresponding datapoint_id for the validation set
- Return type:
X_train
- preprocessing.create_dataset(energy_daily_df, val_df, valsplit, random_seed)¶
Used to split the dataset by datapoint_id into train, test and validation sets for the energy training process.
- Parameters:
energy_daily_df – the merged dataframe for simulation I/O, weather, and hourly energy file.
val_df – the merged dataframe for simulation I/O, weather, and hourly energy file validation set. Where there is no validation set, its value is null
valsplit – flag to indicate if there is a dataframe for the validation set. Accepted values are “yes” or “no”
random_seed – random seed to be passed for when splitting the data
- Returns:
X_train: X trainset
y_train: y trainset
X_test: X testset
y_test_complete: Dataframe containing the target variable with corresponding datapoint_id
X_validate: X validation set
y_validate: y validation set
y_validate_complete: Dataframe containing the target variable with corresponding datapoint_id for the validation set
- Return type:
X_train
- preprocessing.groupsplit(X, y, valsplit, random_seed=42)¶
Used to split the dataset by datapoint_id into train and test sets. The data is split so that all rows for each datapoint_id occur entirely within a single split. Note that where there is a validation set, the data is split with 80% for training and 20% for the test set. Otherwise, the test set is split further, with 60% kept as the test set and 40% used as the validation set.
- Parameters:
X – data excluding the target_variable
y – target variable with datapoint_id
valsplit – flag to indicate if there is a dataframe for the validation set. Accepted values are “yes” or “no”
random_seed – random seed to be passed for when splitting the data
- Returns:
X_train: X trainset
y_train: y trainset
X_test: X testset
y_test_complete: Dataframe containing the target variable with corresponding datapoint_id
- Return type:
X_train
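The group-wise behaviour, with every row of a datapoint_id landing in exactly one split, can be sketched as below. The split logic and sample data are illustrative assumptions, not the module's actual implementation:

```python
import pandas as pd

def group_split_sketch(X, test_frac=0.2, random_seed=42):
    # Shuffle the unique ids, then assign whole ids to the test set so no
    # datapoint_id straddles the train/test boundary.
    ids = X["datapoint_id"].drop_duplicates().sample(frac=1, random_state=random_seed)
    n_test = max(1, int(len(ids) * test_frac))
    test_ids = set(ids.iloc[:n_test])
    test_mask = X["datapoint_id"].isin(test_ids)
    return X[~test_mask], X[test_mask]

# Hypothetical data: two rows per datapoint_id (e.g. one per day).
X = pd.DataFrame({"datapoint_id": list("aabbccddee"),
                  "feature": range(10)})
X_train, X_test = group_split_sketch(X)

# No datapoint_id appears in both splits.
assert set(X_train["datapoint_id"]).isdisjoint(X_test["datapoint_id"])
```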
- preprocessing.is_df_col_mixed_type(col)¶
Returns True if the col has multiple dtypes and False otherwise.
- Parameters:
col – dataframe column to check
- Return type:
bool
- Returns:
True if the col has multiple dtypes and False otherwise
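A minimal sketch of such a check (illustrative, not the module's implementation):

```python
import pandas as pd

def is_col_mixed_type_sketch(col):
    # A column is mixed-type when its values map to more than one Python type.
    return col.map(type).nunique() > 1

assert is_col_mixed_type_sketch(pd.Series([1, "two", 3.0]))  # int, str, float
assert not is_col_mixed_type_sketch(pd.Series([1, 2, 3]))    # all integers
```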
- preprocessing.main(config_file=<typer.models.ArgumentInfo object>, process_type=<typer.models.ArgumentInfo object>, hourly_energy_electric_file=<typer.models.OptionInfo object>, building_params_electric_file=<typer.models.OptionInfo object>, val_hourly_energy_file=<typer.models.OptionInfo object>, val_building_params_file=<typer.models.OptionInfo object>, hourly_energy_gas_file=<typer.models.OptionInfo object>, building_params_gas_file=<typer.models.OptionInfo object>, output_path=<typer.models.OptionInfo object>, preprocess_only_for_predictions=<typer.models.OptionInfo object>, random_seed=<typer.models.OptionInfo object>, building_params_folder=<typer.models.OptionInfo object>, start_date=<typer.models.OptionInfo object>, end_date=<typer.models.OptionInfo object>, ohe_file=<typer.models.OptionInfo object>, cleaned_columns_file=<typer.models.OptionInfo object>)¶
Entry point that runs the preprocessing for either energy or costing predictions: the building, energy, and weather files are loaded, cleaned, encoded, and split before the resulting datasets are written to the output path.
- Parameters:
config_file (str) – Location of the .yml config file (default name is input_config.yml).
process_type (str) – Either ‘energy’ or ‘costing’ to specify the operations to be performed.
hourly_energy_electric_file (str) – Location and name of an electricity energy file to be used if the config file is not used.
building_params_electric_file (str) – Location and name of an electricity building parameters file to be used if the config file is not used.
val_hourly_energy_file (str) – Location and name of an energy validation file to be used if the config file is not used.
val_building_params_file (str) – Location and name of a building parameters validation file to be used if the config file is not used.
hourly_energy_gas_file (str) – Location and name of a gas energy file to be used if the config file is not used.
building_params_gas_file (str) – Location and name of a gas building parameters file to be used if the config file is not used.
output_path (str) – Where output data should be placed. Note that this value should be empty unless this file is called from a pipeline.
preprocess_only_for_predictions (bool) – True if the data to be preprocessed is to be used for prediction, not for training.
random_seed (int) – The random seed to be used when splitting the data.
building_params_folder (str) – The folder location containing all building parameter files which will have predictions made on them by the provided model. Only used if preprocess_only_for_predictions is True.
start_date (str) – The start date from which weather data is attached to the building data. Expects the input to be in the form Month_number-Day_number. Only used if preprocess_only_for_predictions is True.
end_date (str) – The end date up to which weather data is attached to the building data. Expects the input to be in the form Month_number-Day_number. Only used if preprocess_only_for_predictions is True.
ohe_file (str) – Location and name of an ohe.pkl OneHotEncoder file which was generated in the root of a training output folder. To be used when generating a dataset to obtain predictions for.
cleaned_columns_file (str) – Location and name of a cleaned_columns.json file which was generated in the root of a training output folder. To be used when generating a dataset to obtain predictions for.
- preprocessing.process_building_files(path_elec, path_gas, clean_dataframe=True)¶
Used to read the building simulation I/O file.
- Parameters:
path_elec – file path where data is to be read. This is a mandatory parameter and in the case where only one simulation I/O file is provided, the path to this file should be indicated here.
path_gas – Path to the gas output file. This is optional; if there is no gas output file to be loaded, then a value of path_gas='' should be used.
clean_dataframe – True if the loaded data should be cleaned, False if cleaning should be skipped
- Returns:
btap_df: Dataframe containing the clean building parameters file
floor_sq: the square footage of the building
epw_keys: Dictionary containing all unique weather keys
- Return type:
btap_df
- preprocessing.process_building_files_batch(directory, start_date, end_date, for_energy=True)¶
Given a directory of .xlsx building files, process and clean each file, combining them into one dataframe with entries for every day within the provided timespan (only if the for_energy argument is true; otherwise the files are loaded normally). Each row will be assigned a custom identifier following the format name_of_file/index_number_in_file. The square footage of the building will be added in a column ‘square_foot’, which can be removed if unused.
- Parameters:
directory (str) – Directory containing one or more .xlsx building files (with no other .xlsx files present).
start_date (str) – Starting date of the specified timespan (in the form Month_number-Day_number).
end_date (str) – Ending date of the specified timespan (in the form Month_number-Day_number).
for_energy (bool) – If true, adjust the loaded buildings to have rows for each day in the provided date span; otherwise just load individual building rows.
- Returns:
buildings_df: Dataframe containing the clean building parameters files
epw_keys: Dictionary containing all unique weather keys
- Return type:
buildings_df
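The identifier scheme name_of_file/index_number_in_file can be sketched as below. The file names and columns are hypothetical stand-ins for loaded .xlsx data, and whether the real implementation keeps the .xlsx extension in the name is not specified here:

```python
import pandas as pd

frames = []
for name in ["bldg_a.xlsx", "bldg_b.xlsx"]:  # hypothetical file names
    # Stand-in for a cleaned building file with two rows.
    df = pd.DataFrame({"square_foot": [100.0, 250.0]})
    # Custom identifier: name_of_file/index_number_in_file
    df["datapoint_id"] = [f"{name}/{i}" for i in range(len(df))]
    frames.append(df)

buildings_df = pd.concat(frames, ignore_index=True)
# buildings_df["datapoint_id"] -> ['bldg_a.xlsx/0', 'bldg_a.xlsx/1',
#                                  'bldg_b.xlsx/0', 'bldg_b.xlsx/1']
```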
- preprocessing.process_costing_building_files(path_elec, path_gas, clean_dataframe=True, cleaned_columns_list=[])¶
Used to read the building simulation I/O file in preparation for costing predictions/training.
- Parameters:
path_elec – file path where data is to be read. This is a mandatory parameter and in the case where only one simulation I/O file is provided, the path to this file should be indicated here.
path_gas – Path to the gas output file. This is optional; if there is no gas output file to be loaded, then a value of path_gas='' should be used.
clean_dataframe – True if the loaded data should be cleaned, False if cleaning should be skipped
cleaned_columns_list – A list of columns to filter if skipping the cleaning (to handle nulls in different columns appropriately)
- Returns:
btap_df: Dataframe containing the clean building parameters file
costing_df: Dataframe containing the costing output values
- Return type:
btap_df
- preprocessing.process_hourly_energy(path_elec, path_gas, floor_sq)¶
Used to read the hourly energy file(s).
- Parameters:
path_elec – file path where the electric hourly energy consumed file is to be read. This is a mandatory parameter and in the case where only one hourly energy output file is provided, the path to this file should be indicated here.
path_gas – Path to the gas output file. This is optional; if there is no gas output file to be loaded, then a value of path_gas='' should be used.
floor_sq – the square foot of the building
- Returns:
Dataframe containing the clean and transposed hourly energy file.
- Return type:
energy_df
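The "transposed" shape of the returned dataframe can be illustrated with a melt from one-column-per-hour to one-row-per-reading. The column names and values here are hypothetical, not the real file layout:

```python
import pandas as pd

# Hypothetical wide layout: one energy column per hour, one row per datapoint.
raw = pd.DataFrame({"datapoint_id": ["a", "b"],
                    "energy_h1": [1.0, 2.0],
                    "energy_h2": [3.0, 4.0]})

# Transpose to a long layout: one row per (datapoint_id, hour) reading.
energy_df = raw.melt(id_vars="datapoint_id", var_name="hour", value_name="energy")
# energy_df has 2 datapoints x 2 hours = 4 rows.
```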
- preprocessing.read_weather(epw_keys)¶
Used to retrieve all weather data for the specified epw_keys.
- Parameters:
epw_keys (list) – Weather file keys to be loaded.
- Returns:
Dataframe containing the clean weather data.
- Return type:
btap_df