1 Introduction
Energy-efficient control of Heating, Ventilation and Air Conditioning (HVAC) systems is an important aspect of building operations because these systems account for the major share of energy consumed by buildings. Most large office buildings, which are significant energy consumers, are structures with complex internal energy flow dynamics and complex interactions with their environment. Therefore, building energy management is a difficult problem. Traditional building energy control systems are based on heuristic rules to control the parameters of the building’s HVAC systems. However, analysis of historical data shows that such rule-based heuristic control is inefficient because the rules are based on simplified assumptions about weather and building operating conditions.
Recently, there has been considerable research on smart buildings with smart controllers that sense the building state and environmental conditions and adjust the HVAC parameters to optimize building energy consumption [36]. Model Predictive Control (MPC) methods have been successfully deployed for smart control [22], but traditional MPC methods require accurate models to achieve good performance. Developing such models for large buildings may be an intractable problem [39]. Recently, data-driven MPC based on random forest methods has been used to solve demand-response problems for moderate-size buildings [39], but it is not clear how such methods scale up to continuous control of large buildings.

Reinforcement Learning (RL) methods have recently gained traction for controlling energy consumption and comfort in smart buildings because they provide several advantages. Unlike MPC methods for robust receding horizon control [43], they have the ability to learn a locally optimal control policy without simulating the system dynamics over long time horizons. Instead, RL methods use concepts from Dynamic Programming to select the optimal actions. A number of reinforcement learning controllers for buildings have been proposed, where the building behavior under different environmental conditions is learned from historical data [28]. These approaches are classified as data-driven or Deep Reinforcement Learning approaches [21].

However, current data-driven approaches for RL do not take into account the non-stationary behaviors of the building and its environment. Building operations and the environments in which they operate are continually changing, often in unpredictable ways. In such situations, the Deep RL controller performance degrades because the data that was used to train the controller becomes ‘stale’. The solution to this problem is to detect changes in the building operations and its environment, and relearn the controller using data that is more relevant to the current situation. This paper proposes such an approach, where we relearn the controller at periodic intervals to maintain its relevance, and thus its performance.
The rest of the paper is organized as follows. Section 2 presents a brief review of some of the current approaches in model-based and data-driven reinforcement learning, and the concept of non-stationarity in MDPs. Section 4 formally introduces the RL problem for non-stationary systems that we tackle in this paper. Section 6 then develops our data-driven modeling as well as the reinforcement learning schemes for ‘optimal’ building energy management. Section 7 discusses our experimental results, and the Conclusions section presents our conclusions and directions for future work.
2 Literature Review
Traditional methods for developing RL controllers of systems have relied on accurate dynamic models of the system (model-based approaches) or on data-driven approaches. We briefly review model-based and data-driven approaches to RL control, and then introduce the notion of non-stationary systems, where traditional methods for RL policy learning are not effective.
2.1 Reinforcement Learning with Model Based Simulators
Typical physics-based models of building energy consumption use conservation of energy and mass to construct thermodynamic equations that describe system behavior. [43] applied Deep Q-Learning methods [24] to optimize the energy consumption and ensure temperature comfort in a building simulated using EnergyPlus [3], a whole-building energy simulation program. [26] obtained cooling energy savings on an EnergyPlus-simulated model of a data center using a natural policy gradient based algorithm called TRPO [34]. Similarly, [19] used an off-policy algorithm called DDPG [21] to obtain cooling energy savings in an EnergyPlus simulation of a data center. To deal with sample inefficiency in on-policy learning, [9] developed an event-triggered RL approach, where the control action changes when the system crosses a boundary function in the state space. They used a one-room EnergyPlus thermal model to demonstrate their approach.
2.2 Reinforcement Learning with Data Driven Approaches
The examples above describe RL approaches applied to simple building architectures. As discussed, creating a model-based simulator for large, complex buildings can be quite difficult [31, 14]. Alternatively, more realistic approaches for RL applied to large buildings rely on historical data from the building to learn data-driven models, or directly use the data as experiences from which a policy is learned. [27] developed simulators from data-driven models and then used them for finite-horizon control. [29] used Support Vector Regression to develop a building energy consumption model, and then used stochastic gradient methods to optimize energy consumption. [2] used value-based neural networks to learn the thermodynamics model of a building. The energy models were then optimized using Q-learning [40]. Subsequently, [28] used a DDPG [21] approach with a sampling buffer to develop a policy function that minimized energy consumption without sacrificing comfort. Another recent approach that has successfully applied deep RL to data-driven building energy optimization is [25].
2.3 Non-Stationary MDPs
The data-driven approaches presented in Section 2.2 do not address the non-stationarity of large buildings. Non-stationary behaviors can be attributed to multiple sources. For example, weather patterns, though seasonal, can change abruptly in unexpected ways. Similarly, conditions in a building can change quickly, e.g., when a large number of people enter the building for an event, or when components of the HVAC system degrade or fail, e.g., stuck valves or failed pumps. When such situations occur, an RL controller trained on past experiences cannot adapt to the unexpected changes in the system and environment, and, therefore, performs suboptimally. Some work [23, 42, 37] has been proposed to address non-stationarity in the environment by improving the value function under the worst-case conditions [10] of the non-stationarity.
Other approaches try to minimize a regret function instead of finding the optimal policy for non-stationary MDPs. The regret function measures the sum of missed rewards when we compare the state value from a start state between the current best policy and the target policy in hindsight, i.e., it tells us what actions would have been appropriate after the episode ends. This regret is then optimized to get better actions. [7] applied this approach to context-driven MDPs (each context may represent a different non-stationary behavior) to find the piecewise stationary optimal policies for each context. They proposed a clustering algorithm to find a set of contexts. [11, 6] also minimize the regret, based on an average reward formulation instead of a state value function. [30] proposed a non-stationary MDP control method under a model-free setting by using a context detection method proposed in [38]. These approaches assume knowledge of a set of possible environment models beforehand, which may not be possible in real systems. Moreover, they are model-based, i.e., they assume the MDP models are available. Therefore, they cannot be applied in a model-free setting.
To address non-stationarity issues in complex buildings, we extend previous research in this domain to make the following contributions to data-driven modeling and RL-based control of buildings:

We retrain the dynamic behavior models of the building and its environment at regular intervals to ensure that the models respond to the distributional shifts in the system behavior, and, therefore, provide an accurate representation of the behavior.

By not relearning the building and its environment model from scratch, we ensure the repeated training is not time consuming. This also has the benefit of the model not being susceptible to the catastrophic forgetting [15] of the past behavior which is common in neural networks used for online training and relearning.

We relearn the policy function, i.e., the HVAC controller, every time the dynamic model of the system is relearned, so that it adapts to the current conditions in the building.
In the rest of this paper, we develop the relearning algorithms, and demonstrate the benefits of this incremental relearning approach on the controller efficiency.
3 Optimal Control with Reinforcement Learning
Reinforcement learning (RL) represents a class of machine learning methods for solving optimal control problems, where an agent learns by continually interacting with an environment [40]. In brief, the agent observes the state of the environment, takes an action based on this state/observation, and notes the reward it receives for the state-action pair. The agent’s ultimate goal is to compute a policy, i.e., a mapping from the environment states to the actions, that maximizes the expected sum of rewards. RL has been cast as a stochastic optimization method for solving Markov Decision Processes (MDPs) when the MDP is not known. We define the RL problem more formally below.
Definition 3.1 (Markov Decision Process).
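A Markov decision process is defined by a four-tuple $M = \{S, A, T, R\}$, where $S$ represents the set of possible states in the environment and $A$ is the action space. The transition function $T: S \times A \times S \to [0, 1]$ defines the probability of reaching state $s'$ at $t+1$ given that action $a \in A$ was chosen in state $s \in S$ at decision epoch $t$, $T(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$. The reward function $R: S \times A \to \mathbb{R}$ estimates the immediate reward obtained from choosing action $a$ in state $s$.

The objective of the agent is to find a policy $\pi^{*}$ that maximizes the accumulated discounted rewards it receives over the future. The optimization criterion is the following:
$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in S \tag{1}$$
where $V^{\pi}$ is called the value function and it is defined as
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_t = \pi(s_t)\right] \tag{2}$$
where $\gamma$ is called the discount factor, and it determines the weight assigned to future rewards. In other words, the weight associated with future rewards decays with time.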
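As a small illustration of the discounted sum inside the value function, the sketch below computes the discounted return of one finite reward sequence. The function name `discounted_return`, the reward values, and the choice $\gamma = 0.9$ are illustrative assumptions; the value function itself is an expectation of this quantity over trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards collected over three decision epochs.
example_rewards = [1.0, 1.0, 1.0]
print(discounted_return(example_rewards, gamma=0.9))
```

With a smaller $\gamma$ the later rewards contribute less to the total, which is the decay described above.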
An optimal deterministic Markovian policy satisfying Equation 1 exists if the following conditions are satisfied


$T$ and $R$ do not change over time.
If an MDP satisfies the second condition, it is called a stationary MDP. However, most real-world systems undergo changes that cause their dynamic model, represented by the transition function $T$, to change over time [4]. In other words, these systems exhibit non-stationary behaviors. Non-stationary behaviors may happen because the components of a system degrade, and/or the environment in which a system operates changes, causing the models that govern the system behavior to change over time. In the case of large buildings, the weather conditions can change abruptly, or changes in occupancy or faults in building components can cause unexpected and unanticipated changes in the system’s behavior model. In other words, $T$ is no longer invariant, but may change over time. Therefore, a more realistic model of the interactions between an agent and its environment is defined by a non-stationary MDP (NMDP) [32].
Definition 3.2 (NonStationary Markov Decision Process).
A non-stationary Markov decision process is defined by a 5-tuple $M = \{S, D, A, (T_t)_{t \in D}, (R_t)_{t \in D}\}$. $S$ represents the set of possible states that the environment can reach at decision epoch $t$. $D = \{1, \dots, N\}$ is the set of decision epochs with $N \le +\infty$. $A$ is the action space. $T_t$ and $R_t$ represent the transition function and the reward function at decision epoch $t$, respectively.
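In the most general case, the optimal policy for an NMDP is also non-stationary. The value of state $s$ at decision epoch $t$ within an infinite-horizon NMDP is defined for a stochastic policy $\pi$ as follows:
$$V^{\pi}_{t}(s) = \mathbb{E}\left[\sum_{i=t}^{\infty} \gamma^{\,i-t} R_i(s_i, a_i) \,\middle|\, s_t = s,\ a_i \sim \pi_i(\cdot \mid s_i),\ s_{i+1} \sim T_i(\cdot \mid s_i, a_i)\right] \tag{3}$$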
Learning optimal policies from non-stationary MDPs is particularly difficult for non-episodic tasks when the agent is unable to explore the time axis at will. However, real systems do not change arbitrarily fast over time. Hence, we can assume that changes occur slowly over time. This assumption is known as the regularity hypothesis, and it can be formalized by using the notion of Lipschitz Continuity (LC) applied to the transition and reward functions of a non-stationary MDP [18]. This results in the definition of a Lipschitz Continuous NMDP (LC-NMDP).
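Definition 3.3 (($L_T$, $L_R$)-LC-NMDP).
An ($L_T$, $L_R$)-LC-NMDP is an NMDP whose transition and reward functions are respectively $L_T$-LC and $L_R$-LC w.r.t. time, i.e.,
$$W_1\big(T_t(\cdot \mid s, a),\, T_{\hat{t}}(\cdot \mid s, a)\big) \le L_T\, |t - \hat{t}|, \quad \forall (t, \hat{t}, s, a)$$
and
$$\big|R_t(s, a) - R_{\hat{t}}(s, a)\big| \le L_R\, |t - \hat{t}|, \quad \forall (t, \hat{t}, s, a)$$
where $W_1$ represents the Wasserstein distance and it is used to quantify the distance between two distributions.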
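To make the Lipschitz condition on the transition function concrete, the following sketch checks it empirically for a one-dimensional next-state variable. It is only an illustration under stated assumptions: the next-state distribution at each epoch is represented by an equal-size sample, for which the 1-Wasserstein distance reduces to the mean absolute difference of the sorted samples. The names `wasserstein_1d` and `satisfies_lipschitz`, the sample values, and the constants tried are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


def satisfies_lipschitz(samples_by_epoch, lipschitz_const):
    """Check W1(T_t, T_t') <= L * |t - t'| for every pair of epochs."""
    epochs = sorted(samples_by_epoch)
    for idx, t in enumerate(epochs):
        for t_other in epochs[idx + 1:]:
            dist = wasserstein_1d(samples_by_epoch[t], samples_by_epoch[t_other])
            if dist > lipschitz_const * abs(t - t_other):
                return False
    return True


# Hypothetical next-state samples (e.g. a temperature) at three epochs,
# drifting upward by 0.5 per epoch.
samples = {
    0: [20.0, 21.0, 22.0],
    1: [20.5, 21.5, 22.5],
    2: [21.0, 22.0, 23.0],
}
print(satisfies_lipschitz(samples, lipschitz_const=0.5))
print(satisfies_lipschitz(samples, lipschitz_const=0.25))
```

A slow drift passes the check for a sufficiently large constant, whereas an abrupt jump between adjacent epochs would violate it.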
Although learning from the true NMDP is generally not possible because the agent does not have access to the true NMDP model, it is possible to learn a quasi-optimal policy from interacting with temporal slices of the NMDP, assuming the LC property. This means that the agent can learn using a stationary MDP of the environment at time $t$. Therefore, the trajectory generated by an LC-NMDP is assumed to be generated by a sequence of stationary MDPs. In the next section, we present a continual learning approach for optimal control of non-stationary processes based on this idea.
4 Continual Learning Approach for Optimal Control of NonStationary Systems
The proposed approach has two main steps: an initial offline learning process followed by a continual learning process. Figure 1 presents the proposed approach organized in the following steps, which are annotated with their step numbers in the figure:

Step 1. Data collection. Typically this represents historical data that may be available about system operations. In our work, we start with a data set containing information on past weather conditions and the building’s energy-related variables. This data set may be representative of one or more operating conditions of the non-stationary system, in our case, the building.

Step 2. Deriving a dynamic model of the environment. In our case, this is the building energy consumption model, given relevant building and weather parameters.

A state transition model is defined in terms of state variables (inputs and outputs) and the dynamics of the system are learned from the data set.

The reward function used to train the agent is defined.


Step 3. Learning an initial policy. A policy is learned offline by interacting with the environment model derived in the previous step.

Step 4. Deployment. The learned policy is deployed online, i.e., in the real environment, and experiences from these interactions are collected.

Step 5. Relearning. In general, the relearning module would be invoked based on some predefined performance parameters, for example, when average accumulated reward value over small intervals of time is monotonically decreasing. When this happens:

The transition model of the environment is updated based on the recent experiences collected from the interaction with the up-to-date policy.

The current policy is retrained offline, much like Step 3, by interacting with the environment now using the updated transition model of the system.

We will demonstrate that this method works if the regularity hypothesis is satisfied, i.e., the environment changes occur after sufficiently long intervals to allow the offline relearning step (Step 5) to be effectively applied. In this work, we also assume that the reward function, $R$, is stationary, and does not have to be rederived (or relearned) when episodic non-stationary changes occur in the system.
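The performance-based trigger mentioned in Step 5 can be sketched as follows. This is a minimal illustration, not the criterion used in our experiments: it flags relearning when the average reward over the most recent consecutive intervals is strictly decreasing. The function names, the interval length, and the number of intervals inspected are hypothetical parameters.

```python
def interval_means(rewards, interval):
    """Average reward over consecutive, non-overlapping intervals."""
    n_full = len(rewards) // interval
    return [
        sum(rewards[k * interval:(k + 1) * interval]) / interval
        for k in range(n_full)
    ]


def should_relearn(rewards, interval, n_intervals):
    """True when the last n_intervals interval means strictly decrease."""
    means = interval_means(rewards, interval)
    if len(means) < n_intervals:
        return False  # not enough history to judge a trend
    recent = means[-n_intervals:]
    return all(a > b for a, b in zip(recent, recent[1:]))


# Hypothetical reward log from the deployed policy.
reward_log = [5.0, 5.0, 4.0, 4.0, 3.0, 3.0]
print(should_relearn(reward_log, interval=2, n_intervals=3))
```

When the trigger fires, the two sub-steps of Step 5 would run in order: update the transition model from recent experiences, then retrain the policy against the updated model.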
Another point to note is that our algorithm uses a two-step offline process to learn a new policy: (1) learn the dynamic (transition) model of the system from recent experiences; and (2) relearn the policy function using the new transition model of the system. This approach addresses two important problems: (1) policy learning happens offline, so additional safety checks and verification methods can be applied to the learned policy before deployment, which is an important consideration for safety-critical systems; and (2) the relearning process can use an appropriate mix of past experiences and recent experiences to relearn the environment model and the corresponding policy. Thus, it addresses the catastrophic forgetting problem discussed earlier. This approach also provides a compromise between off-policy and on-policy learning in RL by addressing, to some extent, the sample inefficiency problem.
We use a Long Short-Term Memory (LSTM) neural network to model the dynamics of the system and the Proximal Policy Optimization (PPO) algorithm to train the control policy. PPO is one of the best-known reinforcement learning algorithms for learning an optimal control law in short periods of time. Next, we describe our approach to modeling the dynamic environment using LSTMs, and the reinforcement learning algorithm for learning and relearning the building controllers (i.e., the policy functions).
4.1 Long Short-Term Memory Networks for Modeling Dynamic Systems
Despite their known success in machine learning tasks such as image classification, deep learning approaches for energy consumption prediction have not been sufficiently explored [1]. In recent work, recurrent neural networks (RNNs) have demonstrated their effectiveness for load forecasting when compared against standard Multi-Layer Perceptron (MLP) architectures [17, 33]. Among the variety of RNN architectures, Long Short-Term Memory (LSTM) networks have the flexibility for modeling complex dynamic relationships and the capability to overcome the so-called vanishing/exploding gradient problem associated with training recurrent networks [8]. Moreover, LSTMs can capture arbitrary long-term dependencies, which are likely in the context of energy forecasting tasks for large, complex buildings. The architecture of an LSTM model is represented in Figure 2. It captures nonlinear long-term dependencies among the variables based on the following equations:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{4}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{5}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{6}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{7}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{8}$$
$$h_t = o_t \odot \tanh(c_t) \tag{9}$$
where $x_t$, $h_t$, and $c_t$ represent the input variables, hidden state and memory cell state vectors, respectively; $\odot$ stands for element-wise multiplication; and $\sigma$ and $\tanh$ are the sigmoid and tanh activation functions.
The adaptive update of values in the input and forget gates ($i_t$, $f_t$) provides LSTMs the ability to remember and forget patterns (Equation 8) over time. The information accumulated in the memory cell is transferred to the hidden state, scaled by the output gate ($o_t$). Therefore, training this network consists of learning the input-output relationships for energy forecasting by adjusting the eight weight matrices and the bias vectors.
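For readers who prefer code, one LSTM time step can be sketched in plain Python as below. This is a didactic sketch of the standard LSTM cell, not our training implementation: the gate names, the `params` layout (one `(W, U, b)` triple per gate, giving eight weight matrices and four bias vectors), and the toy dimensions are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step; params maps a gate name to its (W, U, b) triple."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate
    # Memory cell: keep part of the old cell, add part of the candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: squashed cell contents scaled by the output gate.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy cell with 2 inputs and 1 hidden unit; all weights are zero, so every
# gate equals 0.5 and the candidate equals 0.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "c": zero}
h, c = lstm_step(params, x=[0.3, -0.1], h_prev=[0.0], c_prev=[1.0])
print(h, c)
```

In the toy call the forget gate halves the previous cell value, which shows how the gates control what is remembered from one step to the next.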
4.2 Proximal Policy Optimization
The Proximal Policy Optimization (PPO) algorithm [35] has its roots in the Natural Policy Gradient method [13], whose goal was to address common issues encountered in the application of policy gradients. Policy gradient methods [41] represent better approaches to creating optimal policies, especially when compared to value-based reinforcement learning techniques. Value-based methods suffer from convergence issues when used with function approximators (neural networks). Policy gradient methods also have issues with high variance, which have been addressed by Actor-Critic methods [16]. However, choosing the best step size for policy updates was the single biggest issue, and it was addressed in [12]. PPO replaces the log of the action probability in the policy gradient equation
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\right]$$
with the probability ratio $r_t(\theta) = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ inspired by [12]. Here, the current parameterized control policy is denoted by $\pi_{\theta}$, and $\hat{A}_t$ denotes the advantage of taking a particular action compared to the average of all other actions in state $s_t$. According to the authors of PPO, this addresses the issue of the step size only partially, as the values of this probability ratio still need to be limited. They therefore modify the objective function further to provide a Clipped Surrogate Objective function,
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t\big)\right] \tag{10}$$
where $\epsilon$ is the clipping parameter.
The best policy is found by maximizing the above objective. The objective has several interesting properties that make PPO easy to implement and fast to converge during each optimization step. The clipping ensures that the policy does not update too much in a given direction when the advantages are positive. Also, when the advantages are negative, the clipping makes sure that the probability of choosing those actions is not decreased too much. In other words, it strikes a balance between exploration and exploitation with monotonic policy improvement by using the probability ratio.
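The effect of the clipping can be seen in a few lines of code. The sketch below evaluates the clipped surrogate objective for a batch of probability ratios and advantage estimates; the function name `clipped_surrogate` and the default clipping parameter of 0.2 are assumptions made for illustration, and a real implementation would compute the ratios from the policy network and differentiate through them.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Positive advantage: a large ratio earns no more than (1 + eps) * A.
print(clipped_surrogate([1.5], [1.0]))
# Negative advantage: a small ratio is counted as (1 - eps) * A.
print(clipped_surrogate([0.5], [-1.0]))
```

In both cases the objective stops improving once the ratio leaves the interval $[1 - \epsilon, 1 + \epsilon]$ in the favorable direction, so the optimizer has no incentive to push the policy further in a single update.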
Experiments run on the MuJoCo platform show that the PPO algorithm outperforms many other state-of-the-art reinforcement learning algorithms [5]. This motivates our use of this algorithm in our relearning approach.
The PPO algorithm implements a parameterized policy $\pi_{\theta}$ using a neural network whose input is the state vector $s_t$ and whose output is the mean $\mu$ of the best possible action in that state. The policy network is trained using the clipped objective function (see Equation 10) to obtain the best controller policy. A second neural network, called the value network, $V(s_t)$, keeps track of the values associated with the states under this policy. This is subsequently used to estimate the advantage $\hat{A}_t$ of action $a_t$ in state $s_t$. Its input is also $s_t$, and its output is a scalar value indicating the average return from that state when policy $\pi_{\theta}$ is followed. This network is trained using the TD error [40].
5 Problem Formulation for The Building Environment
We start with a description of our building environment and formulate the solution of the energy optimization problem by using our continuous RL approach. This section presents the dynamic datadriven model of building energy consumption and the reward function we employ to derive our control policy.
5.1 System Description
The system under consideration is a large three-storey building on our university campus. It has a collection of individual office spaces, classrooms, halls, a gymnasium, a student lounge, and a small cafeteria. The building climate is controlled by a combination of Air Handling Units (AHU) and Variable Refrigerant Flow (VRF) systems [28]. The configuration of the HVAC system is shown in Figure 3.
The AHU brings in fresh air from the outside and adjusts the air’s temperature and humidity before releasing it into the building. Typically, the desired humidity level in the building is set to %, and the desired temperature values are set by the occupants. Typically, the air is released into the building at a neutral temperature (usually or ). The VRF units in the different zones further heat or cool the air according to the respective temperature setpoint (defined by the occupants’ preferences).
The AHU has two operating modes depending on the outside wet bulb temperature. When the wet bulb temperature is above , only the cooling and the reheat coils operate. The AHU dehumidifies the air using the cooling coil to reduce the air temperature to , thus causing condensation of the excess moisture, and then heats it back up to a specific value that was originally determined by a rule-based controller (either or ). When the wet bulb temperature is below (implying the humidity of the outside air is below %), only the preheat coil operates to heat the incoming cold air to a predefined setpoint. The discharge temperature (the reheating or preheating setpoint, depending on the operating mode) will be defined by our RL controller. Setting this setpoint appropriately reduces the work that must be done by the VRF units and prevents the building from becoming too cold during cooler weather.
5.2 Problem Formulation
The goal of our RL controller is to determine the discharge air temperature setpoint of the AHU that minimizes the total heating and cooling energy consumed by the building without sacrificing comfort. We formulate the RL problem by specifying the state space, the action space, the reward function, and the transition function for our building environment.
5.2.1 State Space
The overall energy consumption of our building depends on how the AHU operates but also on exogenous factors such as the weather variability and the building occupancy. The evolution of the weather does not depend on the state of the building. Therefore, the control problem we are trying to solve must be framed as a nonstationary and Exogenous State MDP. The latter can be formalized as follows
Definition 5.1 (Exogenous State Markov Decision Process).
An Exogenous State Markov decision process is a Markov Decision Process whose transition function satisfies the following property
where the state space of the MDP is divided into two subspaces such that and .
The above definition can be easily extended to the non-stationary case by considering the time dependency of the transition functions. The condition described above can be interpreted as follows: there is a subset of state variables whose change is independent of the actions taken by the agent. For our building, the exogenous state variables are: (1) Outside Air Temperature (oat), (2) Outside Air Relative Humidity (orh), (3) Wet Bulb Temperature (wbt), (4) Solar Irradiance (sol), and (5) Average Building Temperature Preference Set Point (avgstpt). The remaining, non-exogenous variables are: (6) AHU Supply Air Temperature (sat), (7) Heating Energy for the Entire Building, and (8) Cooling Energy for the Entire Building. Since building occupancy is not measured at this moment, we cannot incorporate that variable into our state space.
5.2.2 Action Space
The action space of the MDP in each epoch is the change in the neutral discharge temperature setpoint. As discussed before, the wet bulb temperature determines the AHU operating mode. The valves and actuators that operate the HVAC system have a certain latency in their operation. This means that our controller must not arbitrarily change the discharge temperature setpoint. We therefore adopted a safer approach where the action space is defined as a continuous variable that represents the change with respect to the previous setpoint. This means that at every output instant (in the present problem we have set the output to every minutes), the controller can change at most the discharge temperature setpoint by this amount.
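The incremental action described above can be sketched as a bounded update of the previous setpoint. The function name `next_setpoint`, the maximum step `max_delta`, and the numeric values are hypothetical and chosen only to illustrate the mechanism.

```python
def next_setpoint(prev_setpoint, action, max_delta):
    """Apply the agent's requested change, limited to +/- max_delta."""
    # Clip the requested change so the actuators are never asked to move
    # the discharge temperature setpoint by more than max_delta at once.
    delta = max(-max_delta, min(action, max_delta))
    return prev_setpoint + delta


# Hypothetical values: the agent asks for a +5 change but only +2 is allowed.
print(next_setpoint(prev_setpoint=70.0, action=5.0, max_delta=2.0))
```

Limiting the per-step change in this way keeps consecutive setpoints close to each other regardless of what the policy network outputs.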
5.3 Transition Model
Taking into consideration that the state and action space of the building are continuous, the transition function will comprise 3 components.
First, the transition function of the exogenous state variables (oat, orh, wbt, sol, and avgstpt) is not explicitly modeled. Their next state is determined by looking up the weather database forecast for the next time step. These variables are available at minute intervals through a Metasys portal of our building; solar irradiance is available from external data sources. There are no humidity or occupancy sensors inside the building; therefore, we did not consider these quantities as part of the exogenous state variables.
The supply air temperature and the heating and cooling energies are the nonexogenous variables. The change in the supply air temperature sat is a function of the current temperature and the setpoint selected by the agent.
(11) 
Here, the controller action will determine what the new setpoint will be and subsequently the supply air temperature will approximate that value. We do not create a transition function for this variable since we obtain its value from a sensor installed in the AHU.
Lastly, the heating and cooling energy variables( and ) are determined by the transition functions
(12) 
(13) 
where . As discussed in the last section, we train stacked LSTMs to derive nonlinear approximators for these functions. LSTMs can help keep track of the state of the system since they allow modeling continuous systems with slow dynamics. The heating and cooling energy estimated by the LSTMs will be used as a part of the reward function as discussed next.
5.4 Reward Function
The reward function includes two components: (1) the total energy savings for the building, expressed as heating and cooling energy savings, and (2) the comfort level achieved. The reward signal at time instant $t$ is given by
(14) 
where defines the importance we give to each term. We considered in this work.
is defined in terms of the energy savings achieved with respect to the rulebased controller previously implemented in the building, i.e. we reward the RL controller when its actions result in energy savings calculated as the difference between the total heating and cooling energy under the RBC controller actions and the RL controller actions. is defined as follows
where the components of this equation are

: The total energy used to heat the air at the heating or preheating coil as well as the VRF system at timeinstant based on the heating set point at the AHU assigned by the RL controller.

: The total energy used to heat the air at the heating or preheating coil as well as the VRF system at timeinstant based on the heating set point at the AHU assigned by the Rule Based Controller(RBC).

: The onoff state of the heating valve at timeinstant based on the heating set point at the AHU assigned by the RL controller.

: The onoff state of the heating valve at timeinstant based on the heating set point at the AHU assigned by the Rule Based Controller(RBC).

: The total energy used to cool the air at the cooling coil as well as the VRF system at timeinstant based on the set point at the AHU assigned by the RL controller.

: The total energy used to cool the air at the cooling coil as well as the VRF system at timeinstant based on the set point at the AHU assigned by the Rule Based Controller(RBC).
Here by Rule Based Controller setpoint, we refer to the historical set point data that is obtained from the past data on which we shall do our comparison.
The heating and the cooling energy are calculated as a function of the exogenous state variables and , as discussed in the previous subsection. Additionally, we model the behavior of the valve that manipulates the steam flow in the coil of the heating system. This valve shuts off under certain conditions, such that the heating energy consumption drops sharply to 0. This hybrid on-off behavior cannot be modeled with an LSTM; thus, we model the valve behavior independently as an on-off switch that decides when to consider the predictions made by the LSTM (only during the on state). Note that the valve states under both the RL controller and the RBC are predicted using a binary classifier.
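The way the valve state gates the heating prediction, and how the gated energies enter the savings comparison, can be sketched as follows. This is an illustrative reading of the description above rather than the exact reward expression: the function names, the argument order, and the numeric values are assumptions, and the predicted energies would come from the LSTM models while the valve states would come from the binary classifier.

```python
def gated_energy(valve_on, predicted_energy):
    """Use the regression output only while the valve is predicted on."""
    return predicted_energy if valve_on else 0.0


def energy_savings(heat_rbc, valve_rbc, cool_rbc, heat_rl, valve_rl, cool_rl):
    """Total RBC energy minus total RL energy for one time instant."""
    total_rbc = gated_energy(valve_rbc, heat_rbc) + cool_rbc
    total_rl = gated_energy(valve_rl, heat_rl) + cool_rl
    return total_rbc - total_rl


# Hypothetical predictions for one time instant (arbitrary energy units).
print(energy_savings(heat_rbc=10.0, valve_rbc=True, cool_rbc=5.0,
                     heat_rl=8.0, valve_rl=True, cool_rl=4.0))
```

A positive value means the RL controller's setpoint used less energy than the historical rule-based setpoint at that instant.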
The reward for comfort is measured by how close the supply air temperature is to the Average Building Temperature Preference setpoint (avgstpt). Let
The comfort term allows the RL controller to explore in the vicinity of the average building temperature preference to optimize energy. The 1 added to the denominator in case 1 makes the reward bounded.
The individual reward components are formulated such that a preferred action would provide positive feedback while a negative feedback implies actions which are not preferred. The overall reward is nonsparse so the RL agent would have sufficient heuristic information for moving towards an optimal policy.
6 Implementation Details
In this section, we describe the implementation of the proposed approach for the optimal control of the system described in the previous section.
6.1 Data Collection and Processing
This process is part of Step 1 in Figure 1. The data was collected over a period of 20 months (July ’18 to Feb ’20) from the building we were simulating, using the BACnet system, which logs sensor data for all the variables relevant to our study. These include the weather variables, the building set points, and the energy values, collected at 5-minute aggregations. We first cleaned the data by removing statistical outliers using a 2-standard-deviation criterion. Next, we aggregated the variables at half-hour intervals: variables such as temperature and humidity were averaged, and variables such as energy were summed over the interval. Then we scaled the data to a fixed interval so that we can learn the different data-driven models and the controller policy. To perform the offline learning as well as the subsequent relearning, we sampled the data in windows of 3 months (for training) and 1 week (for evaluation).
6.2 Definition of the environment
The environment has to implement the functions described in Section 5.4, as they are used to calculate the energy and the valve state.
6.2.1 Heating Energy model
This process is part of Step 2 in Figure 1. The heating energy model is used to calculate the heating energy consumed in state which results from the action taken in state . The model for heating energy is trained using the sequence of variables comprising the states over the last 3 hours, i.e., 6 samples at 30-minute intervals. The output of the heating energy model is the total historical heating energy over the next 30-minute interval.
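The input sequences described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `make_windows` and the list-of-rows data layout are assumptions, and only the lookback of 6 samples and the next-interval target come from the text.

```python
from typing import List, Sequence, Tuple


def make_windows(
    states: Sequence[Sequence[float]],
    energy: Sequence[float],
    lookback: int = 6,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Pair each run of `lookback` consecutive states with the next interval's energy.

    states[i] holds the state variables for half-hour interval i (hypothetical layout).
    energy[i] holds the energy summed over interval i.
    """
    if len(states) != len(energy):
        raise ValueError("states and energy must have the same length")
    inputs: List[List[List[float]]] = []
    targets: List[float] = []
    for end in range(lookback, len(states)):
        # Window covers intervals end-lookback .. end-1 (3 hours for lookback=6).
        inputs.append([list(row) for row in states[end - lookback:end]])
        # Target is the energy over the interval that follows the window.
        targets.append(energy[end])
    return inputs, targets
```

With 8 half-hour samples and the default lookback, the function yields 2 training pairs, one for each interval that has a full 3-hour history behind it.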
The heating coils for the building operate in a hybrid mode in which the heating valve is shut off at times, so the heating energy goes to zero at those instants. This abrupt change cannot be modeled by a smooth LSTM model. We therefore decided to train our model on contiguous sections where the heating coils were operating. During the evaluation phase, the valve model predicts the on/off state of the heating coils, and we predict the energy consumption only for those instances when the valve model determines the heating coils to be switched on.
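The gating of the energy prediction by the valve prediction can be sketched as below. This is only an illustration under stated assumptions: the function name `gated_heating_energy`, the use of a predicted "on" probability, and the 0.5 decision threshold are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def gated_heating_energy(
    valve_on_prob: Sequence[float],
    lstm_energy: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold for the binary classifier
) -> List[float]:
    """Use the LSTM energy prediction only when the valve is predicted to be on."""
    if len(valve_on_prob) != len(lstm_energy):
        raise ValueError("inputs must have the same length")
    return [
        energy if prob >= threshold else 0.0  # valve off: heating energy is 0
        for prob, energy in zip(valve_on_prob, lstm_energy)
    ]
```

Keeping the switch outside the regression model means the LSTM only ever has to fit the smooth "on" regime, which matches the training on contiguous operating sections described above.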
The model for is constructed by stacking 6 fully connected feed-forward network (FFN) layers of 16 units each, followed by 2 LSTM layers with 4 units each. The activation for each layer is ReLU. The FFN layers generate rich features from the input data, and the LSTM layers learn the time-based correlation. The learning rate is initially 0.001 and is changed according to a linear schedule to ensure faster improvement at the beginning followed by gradual improvements near the optimum, so that we do not oscillate around the optimum. Mean Square Error on validation data is used to terminate training. The model parameters were found by hyperparameter tuning via Bayesian Optimization on a Ray Tune [20] cluster.
6.2.2 Valve State model
This process is also a part of Step 2 in Figure 1. The valve model is used to classify whether the system is switched on or off or, equivalently, whether the heating energy is positive or 0. The input to this model is the same as for the Heating Energy model. The output is the valve (heating coil) on-off state at the next time instant.
The model for is constructed by stacking 4 fully connected feed-forward layers of 16 units each, followed by 2 LSTM layers with 8 units each. The activation for each layer is ReLU. The learning rate, validation data, and model parameters are chosen in the same way as before. The loss used in this case is the binary cross-entropy loss, since this is a two-class prediction problem.
6.2.3 Cooling Energy model
This process is also part of Step 2 in Figure 1. The cooling energy model is used to calculate the cooling energy consumed in state when the action is taken in state . The input to this model is the same as for the Heating Energy model. The output of the model is the total historical cooling energy over the next 30-minute interval.
The model for is constructed by stacking 6 fully connected feed-forward layers of 16 units each, followed by 2 LSTM layers with 8 units each. The activation for each layer is ReLU. The learning rate, validation data, and model parameters are chosen in the same way as for the Heating Energy model.
Once the processes in Step 2 are completed, we construct the data-driven simulated environment . It receives the control action from the PPO controller and steps from its current state to the next state . To calculate , the weather values for the next state are obtained by a simple time-based lookup from the "Weather Data" database. The supply air temperature for the next state is obtained from the "State Transition Model" using equation 11. The reward is calculated using equation 14. Every time the environment is called with an action, it performs this entire process and returns the next state , the reward back to the RL controller, along with some additional information on the current episode.
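The environment step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `DataDrivenEnv`, the dictionary state layout, the key `supply_air_temp`, and the three injected callables are all hypothetical stand-ins for the weather lookup, the state transition model (equation 11), and the reward (equation 14).

```python
from typing import Callable, Dict, Tuple

State = Dict[str, float]


class DataDrivenEnv:
    """Steps a simulated building state forward using injected data-driven models."""

    def __init__(
        self,
        weather_lookup: Callable[[int], State],          # time index -> weather values
        transition_model: Callable[[State, float], float],  # (state, action) -> next supply air temp
        reward_fn: Callable[[State, float, State], float],  # (state, action, next_state) -> reward
        initial_state: State,
        episode_length: int,
    ) -> None:
        self.weather_lookup = weather_lookup
        self.transition_model = transition_model
        self.reward_fn = reward_fn
        self.state = dict(initial_state)
        self.episode_length = episode_length
        self.t = 0

    def step(self, action: float) -> Tuple[State, float, bool, dict]:
        next_t = self.t + 1
        # Exogenous weather variables come from a time-based lookup.
        next_state = dict(self.weather_lookup(next_t))
        # The controlled variable comes from the learned transition model.
        next_state["supply_air_temp"] = self.transition_model(self.state, action)
        reward = self.reward_fn(self.state, action, next_state)
        self.state, self.t = next_state, next_t
        done = self.t >= self.episode_length
        return next_state, reward, done, {"t": self.t}
```

Passing the models in as callables keeps the environment unchanged when the energy and valve models are retrained; only the callables need to be swapped.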
6.3 PPO Controller
This process is part of Step 3 in Figure 1. As discussed previously in section 4.2, the controller learns two neural networks using the feedback it receives from the environment in response to its action . This action is generated by sampling from the distribution which are the outputs of the policy network as shown in the figure. After sampling responses from the environment a number of times, the experiences collected under the current controller parameters are used to update the controller network by optimizing in equation 10 and the value network by TD Learning. We repeat this training process until the optimization has converged to a local optimum.
The Policy Network architecture consists of two fully connected feed-forward layers with 64 units each. The Value Network structure is identical to that of the Policy Network. The networks are trained on-policy with a learning rate of . In each iteration, the networks were trained over 1e6 steps through the environment, which corresponded to approximately 10 episodes per iteration.
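For readers unfamiliar with PPO, the clipped surrogate objective from the PPO paper [35] can be sketched as below. We assume, without confirmation from this section, that equation 10 has this standard form. The function name `clipped_surrogate` and the clip range of 0.2 are illustrative assumptions, not values reported here.

```python
import math
from typing import Sequence


def clipped_surrogate(
    new_logp: Sequence[float],    # log-probabilities of the actions under the updated policy
    old_logp: Sequence[float],    # log-probabilities under the policy that collected the data
    advantages: Sequence[float],  # advantage estimates for the same actions
    clip_eps: float = 0.2,        # assumed clip range
) -> float:
    """Average clipped surrogate objective (to be maximized)."""
    if not (len(new_logp) == len(old_logp) == len(advantages)) or not advantages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # probability ratio between new and old policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping limits how far a single update can move the policy away from the one that generated the experiences, which is what makes repeated updates on the same batch of environment samples stable.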
6.4 Evaluating the energy models, valve models, and the PPO controller
This corresponds to Step 4 in Figure 1. Once the energy models, the valve state model, and the controller training have converged, we evaluate them on held-out test data for 1 week. The energy models are evaluated using the Coefficient of Variation of the Root Mean Square Error (CVRMSE)
where and represent the true and the predicted value of the energy, respectively.
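As a reference, one common form of the CVRMSE is the root mean square error divided by the mean of the true values, expressed as a percentage. The sketch below uses that form as an assumption; some formulations divide the squared error sum by the sample count minus the number of model parameters rather than by the sample count. The function name `cvrmse` is hypothetical.

```python
import math
from typing import Sequence


def cvrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """CVRMSE in percent: 100 * RMSE / mean(y_true)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    if mean_true == 0:
        raise ValueError("mean of true values is zero; CVRMSE is undefined")
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / mean_true
```

Because the error is normalized by the mean consumption, the metric allows comparison across weeks with very different energy levels.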
The valve model is evaluated based on its ROC-AUC, as the on-off dataset was found to be imbalanced. The controller policy is evaluated by comparing the energy savings for the cooling energy and the heating energy, as well as how close the controller setpoint for the AHU supply air temperature is to the building average setpoint avgstpt.
6.5 Relearning Schedule
Steps 4 and 5 in Figure 1 are repeated by moving the data collection window forward by 1 week. We observed that a large overlap in training data between successive iterations helps the model retain previous information and gradually adapt to the changing data.
From the second iteration onward, we do not train the data-driven LSTM models (i.e. ) from scratch. Instead, we use the pretrained models from the previous iteration to start learning on the new data. For the energy models and the valve model, we no longer train the FFN layers and only retrain the head layers comprising the LSTMs. The FFN layers learn the representation of the input data, and this learning is likely to stay the same for different data. The LSTM layers, on the other hand, model the trend in the data, which must be relearned due to the distributional shift. Our results show that this training approach saves time with virtually no loss in model performance. We also adapt the pretrained controller policy according to the changes in the system. This continual learning approach saves us time during repeated retraining and allows the data-driven models and the controller to adapt to the non-stationarity of the environment.
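The sliding schedule can be sketched as follows. This is an illustration under stated assumptions: the function name `relearning_windows` is hypothetical, and the 3-month training window is approximated as 91 days; the 1-week evaluation window and 1-week step come from the text.

```python
from datetime import date, timedelta
from typing import List, Tuple


def relearning_windows(
    start: date,
    end: date,
    train_days: int = 91,  # assumed approximation of "3 months"
    eval_days: int = 7,
    step_days: int = 7,
) -> List[Tuple[date, date, date]]:
    """Return (train_start, train_end, eval_end) triples.

    Training covers [train_start, train_end) and evaluation covers [train_end, eval_end).
    """
    windows = []
    train_start = start
    while True:
        train_end = train_start + timedelta(days=train_days)
        eval_end = train_end + timedelta(days=eval_days)
        if eval_end > end:
            break  # not enough data left for a full evaluation week
        windows.append((train_start, train_end, eval_end))
        train_start += timedelta(days=step_days)
    return windows
```

Under these assumptions, consecutive training windows share 84 of their 91 days, which is the large overlap referred to above.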
7 Results
In this section we present the performance of our energy models, valve model, and the RL controller over multiple weeks.
7.1 Relearning Results for Heating Energy Model
Figure 4 shows the heating energy prediction on a subset of the data from October 7th to 23rd. We selected this time period because it makes the effects of the non-stationarity in the data visible. We compare the prediction of a fixed model, which is not updated after October 7th, with a model that is retrained by including the new week’s data from the 7th to the 13th. The figure demonstrates the necessity of relearning the heating energy model at regular intervals. After October 12th, the AHU switches from using the reheating coil to the preheating coil due to colder weather, as indicated by the wet bulb temperature. This causes the heating energy consumption to change abruptly. The model that is not updated after October 7th cannot learn this behavior and keeps predicting the earlier pattern. The predictions of the weekly relearning model also start degrading, but once it is relearned using the data from October 7th to the 13th, it captures the changed behavior quickly from the small section of similar data in its training set. The overall CVRMSE for the relearning energy model is shown in Figure 5. For the majority of the weeks, the CVRMSE is below which is accepted according to ASHRAE guidelines for energy prediction at half-hour intervals.
7.2 Relearning Results for Cooling Energy Model
Figure 6 shows the cooling energy predictions over a span of two weeks. We also include the energy prediction from a fixed model. Starting from April 25th, the predictions of both the fixed and the relearning cooling energy models start degrading: they follow an increasing trend while the actual trend is downward. This behavior is expected when learning on non-stationary data. However, the relearning cooling energy model is retrained at the end of the week, on April 26th, using the data from April 19th to April 26th. Its predictions therefore tend to be better over the next week than those of the fixed model, whose predictions degrade as the week progresses. The overall CVRMSE for the relearning energy model is shown in Figure 7. For all the weeks, the CVRMSE is below which is accepted according to ASHRAE guidelines for energy prediction at half-hour intervals.
7.3 Prediction of the Heating Valve status
7.4 Training Episode Reward
We trained the PPO controller on the environment every week to adjust to the shift in the data. The cumulative reward metric from equation J is used to assess the improvement in controller performance over the number of weeks. We observed that even though the controller achieves good results after training on a couple of weeks of data, it keeps improving as the weeks progress. The cumulative reward metric is plotted in Figure 10. The occasional drops in the average reward are due to changing environment conditions as training progresses.
7.5 Cooling Energy Performance
We compared the cooling energy performance of both the adaptive reinforcement learning controller and a static reinforcement learning controller against a rule-based controller. A plot comparing the cooling energy consumed over part of the evaluation period is shown in figure 11. We display this part of the timeline because it illustrates why relearning is important. When we calculate the energy savings for each RL controller, the static RL controller had slightly higher cooling energy savings because its last version was trained during warmer weather and it tends to keep the building cooler. But when the outside temperature drops, the static controller action does not heat the system much, so the VRF systems start heating the building, which consumes more energy. The cooling energy savings over the period shown in figure 11 was for the Adaptive Controller and for the Static controller. The average weekly cooling energy savings over the entire evaluation period of 31 weeks was or kBTUs for the Adaptive Controller versus or kBTUs for the Non-Adaptive/Static Controller.
7.6 Heating Energy Performance
Similarly, we compared the heating energy performance of an adaptive and a static controller over the same timeline, as shown in figure 12. This plot shows the severe overcooling that can occur in the building when the controller is not updated regularly. Due to the lower action setpoint of the static controller, the total heating energy consumption of the building goes up over the entire period of cool weather. The heating energy savings over the period shown in figure 11 was for the Adaptive Controller while the Static controller increased the energy consumption by . The average weekly heating energy savings over the entire evaluation period of 31 weeks was or kBTUs for the Adaptive Controller whereas the Non-Adaptive/Static Controller increased the energy consumption by or kBTUs.
The sum of the heating and cooling energy consumption under the historical rule-based controller, the adaptive controller, and the non-adaptive controller is shown in figure 13. The adaptive controller consistently saves more energy than the non-adaptive controller. Overall, the adaptive controller saved 300.72 kBTUs each week on average, whereas the static controller saved only 30.03 kBTUs.
7.7 Control Actions
Here we show why the overall energy consumption of the building went up when we used a static controller. We plot the Discharge/Supply Air Temperature setpoint resulting from the actions of both the adaptive and the static controller, along with the outside air temperature and relative humidity, in Figure 14. On October 12th, the outside temperature goes down and both the adaptive and the static controller fail to improve the building comfort conditions. After October 13th, the adaptive controller is retrained on the last week’s data, where it encounters environment states with lower outside air temperatures, and it subsequently adapts to those conditions. For the remainder of the time period analyzed, the adaptive controller keeps the Supply Air Temperature setpoint closer to the comfort conditions required by the occupants.
Conclusions
We demonstrated the effectiveness of including retraining in a data-driven reinforcement learning framework.
It may be argued that our reward only measures improvement against a baseline rule-based controller. However, we can only compare against controllers that select reasonable actions within the distribution of the data on which the data-driven models were trained. If we were to learn our reinforcement learning controller without any comparison during training, the exploratory behavior of reinforcement learning methods might find even better control actions. But because we are using data-driven models, it is highly likely that the actions chosen by the controller would lead the data-driven models to extrapolate and introduce out-of-distribution error. By comparing against a rule-based controller and constraining actions from veering too far from the current actions, we may leave some savings unrealized, but we ensure that the data-driven models used in the environment do not lead us to spurious results by extrapolating.
References
 [1] (2018) A review of datadriven building energy consumption prediction studies. Renewable and Sustainable Energy Reviews 81, pp. 1192–1205. Cited by: §4.1.
 [2] (201606) Experimental analysis of datadriven control for a building heating system. Sustainable Energy, Grids and Networks 6, pp. 81–90. External Links: Document, 1507.03638, ISSN 23524677 Cited by: §2.2.
 [3] (2000) EnergyPlus: energy simulation program. ASHRAE Journal 42, pp. 49–56. Cited by: §2.1.
 [4] (2019) Challenges of realworld reinforcement learning. CoRR abs/1904.12901. External Links: Link, 1904.12901 Cited by: §3.
 [5] (2019) Implementation matters in deep rl: a case study on ppo and trpo. In International Conference on Learning Representations, Cited by: §4.2.

 [6] (201905) Variational Regret Bounds for Reinforcement Learning. 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019. External Links: 1905.05857, Link Cited by: §2.3.
 [7] (201502) Contextual Markov Decision Processes. arXiv preprint arXiv:1502.02259. External Links: 1502.02259, Link Cited by: §2.3.
 [8] (1997) Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
 [9] (202001) Datadriven control of microclimate in buildings; an eventtriggered reinforcement learning approach. arXiv preprint arXiv:2001.10505. External Links: 2001.10505, Link Cited by: §2.1.
 [10] (200505) Robust dynamic programming. Vol. 30, INFORMS. External Links: Document, ISSN 0364765X Cited by: §2.3.
 [11] (2010) Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: §2.3.
 [12] (2002) Approximately optimal approximate reinforcement learning. In ICML, Vol. 2, pp. 267–274. Cited by: §4.2.
 [13] (2002) A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538. Cited by: §4.2.
 [14] (201112) Difficulties and limitations in performance simulation of a double skin façade with EnergyPlus. Energy and Buildings 43 (12), pp. 3635–3645. External Links: Document, ISSN 03787788 Cited by: §2.2.
 [15] (201703) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114 (13), pp. 3521–3526. External Links: Document, 1612.00796, ISSN 10916490 Cited by: 2nd item.
 [16] (2000) Actorcritic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §4.2.
 [17] (2019) ShortTerm Residential Load Forecasting Based on LSTM Recurrent Neural Network. IEEE Transactions on Smart Grid 10 (1), pp. 841–851. Cited by: §4.1.
 [18] (2019) NonStationary Markov Decision Processes a WorstCase Approach using ModelBased Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 7214–7223. Cited by: §3.
 [19] (201907) Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning. IEEE Transactions on Cybernetics 50 (5), pp. 2002–2013. External Links: Document, 1709.05077, ISSN 21682267 Cited by: §2.1.
 [20] (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118. Cited by: §6.2.1.
 [21] (201609) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016  Conference Track Proceedings, External Links: 1509.02971 Cited by: §1, §2.1, §2.2.
 [22] (2014) Handling model uncertainty in model predictive control for energy efficient buildings. Energy and Buildings 77, pp. 377–392. Cited by: §1.
 [23] (201804) Learning Robust Options. Technical report External Links: Link Cited by: §2.3.
 [24] (201502) Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: Document, ISSN 14764687 Cited by: §2.1.
 [25] (2018) Online building energy optimization using deep reinforcement learning. IEEE transactions on smart grid 10 (4), pp. 3698–3708. Cited by: §2.2.
 [26] (201810) Reinforcement Learning Testbed for PowerConsumption Optimization. In Communications in Computer and Information Science, Vol. 946, pp. 45–59. External Links: 1808.10427, ISBN 9789811328527, ISSN 18650929 Cited by: §2.1.
 [27] (201805) Deep Reinforcement Learning for Optimal Control of Space Heating. arXiv preprint arXiv:1805.03777. External Links: 1805.03777, Link Cited by: §2.2.
 [28] (201906) Online energy management in commercial buildings using deep reinforcement learning. In Proceedings  2019 IEEE International Conference on Smart Computing, SMARTCOMP 2019, pp. 249–257. External Links: Document, ISBN 9781728116891 Cited by: §1, §2.2, §5.1.
 [29] (2018) Data driven methods for energy reduction in large buildings. In 2018 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 131–138. Cited by: §2.2.
 [30] (2019) Reinforcement learning in nonstationary environments. CoRR abs/1905.03970. External Links: Link, 1905.03970 Cited by: §2.3.
 [31] (201308) Difficulties and issues in simulation of a highrise office building. pp. . Cited by: §2.2.
 [32] (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc.. Cited by: §3.
 [33] (201802) Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks. Applied Energy 212, pp. 372–385. External Links: ISSN 03062619 Cited by: §4.1.
 [34] (201502) Trust Region Policy Optimization. 32nd International Conference on Machine Learning, ICML 2015 3, pp. 1889–1897. External Links: 1502.05477, Link Cited by: §2.1.
 [35] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.2.
 [36] (2014) A review on optimized control systems for building energy and comfort management of smart sustainable buildings. Renewable and Sustainable Energy Reviews 34, pp. 409–429. Cited by: §1.

 [37] (201703) Deep Robust Kalman Filter. arXiv preprint arXiv:1703.02310. External Links: 1703.02310, Link Cited by: §2.3.
 [38] (2019) Change point detection for compositional multivariate data. arXiv preprint arXiv:1901.04935. Cited by: §2.3.
 [39] (2018) Datadriven model predictive control using random forests for building energy optimization and climate control. Applied energy 226, pp. 1252–1272. Cited by: §1.
 [40] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.2, §3, §4.2.
 [41] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §4.2.
 [42] (2014) Scaling up robust mdps using function approximation. In International Conference on Machine Learning, pp. 181–189. Cited by: §2.3.
 [43] (2017) Deep reinforcement learning for building hvac control. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6. Cited by: §1, §2.1.