On Conant and Ashby's "Good Regulator Theorem"
This is a short note on Conant and Ashby’s “Good Regulator Theorem” ^{1} and especially on its relevance to AI, and more specifically to reinforcement learning (RL). The good regulator theorem says that an optimal controller of a system must be a model of that system. This is usually taken to be a strong argument for the necessity of model-based techniques in AI and ML, which is exactly why I wanted to take the time to think a little bit about this theorem and its relevance to AI. As a disclaimer, let me say that I haven’t read the original paper; I will rely on the tutorial-style report by Daniel Scholten ^{2}, which I recommend to anyone interested in the good regulator theorem (there is also another report by Scholten^{3}, but that one deals more with the philosophical aspects).
I must say, at the outset, that the good regulator theorem is much less significant and much less relevant to AI than I had imagined. This is mostly because the notion of “model” in the good regulator theorem is quite different from what is usually meant by a model in AI, and especially in RL. Before I make this point clearer, let me briefly summarize what the good regulator theorem says. Note this will be a rather high-level summary; please take a look at the primer^{2} by Scholten if you’d like the details.
Here is the setup for the theorem. You have a system that can be in different states, \(s_i \in S\). The model (regulator) in this case is a mapping from system states to regulator states (i.e., actions in the RL setting), \(r_i \in R\). The model is represented with a conditional probability distribution \(P(r \vert s)\). These regulator states (actions) are applied to the system, which transitions it to some final state, \(z_i \in Z\). Note this is essentially a single-step MDP.^{4} The objective in the good regulator theorem is to find the model \(P(r \vert s)\) that minimizes the entropy of the final state distribution (i.e., after applying a single action). Note this is slightly unusual in the sense that you don’t care about which state you end up in, as long as you always end up in the same one. What the theorem then shows is that the simplest optimal model \(P(r \vert s)\) is a deterministic mapping \(s \to r\) (i.e., for each state, one action gets all the probability).^{5} In essence, the theorem says that if two system states require different \(r_i\) (actions) to transition to the same final state, the model (regulator) should map them to different actions. This is almost trivial.
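To make the setup concrete, here is a minimal sketch on a made-up toy instance (the states, actions, and the outcome table `psi` are all illustrative, not from the original paper). It enumerates every deterministic regulator \(s \to r\) and keeps the one minimizing the entropy of the final-state distribution:

```python
import itertools
import math
from collections import Counter

# Hypothetical toy instance: 3 system states, 2 regulator states (actions),
# and a deterministic outcome table psi: (s, r) -> final state z.
S = [0, 1, 2]
R = [0, 1]
psi = {
    (0, 0): "A", (0, 1): "B",
    (1, 0): "B", (1, 1): "A",
    (2, 0): "A", (2, 1): "B",
}

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

best = None
# Enumerate all deterministic regulators (mappings s -> r).
for choice in itertools.product(R, repeat=len(S)):
    policy = dict(zip(S, choice))
    # Assume a uniform distribution over initial system states.
    outcomes = Counter(psi[(s, policy[s])] for s in S)
    h = entropy(outcomes)
    if best is None or h < best[0]:
        best = (h, policy)

print(best)  # an optimal deterministic regulator reaches entropy 0 here
```

Note how the optimal regulator must pick different actions in states 0 and 1, exactly because those states need different actions to reach the same final state.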
Note the “model” here is in fact what we usually call a policy in RL. So the good regulator theorem doesn’t really say anything about the “model” in the sense used in model-based RL. All the theorem says is that (1) the optimal policy doesn’t map a state to two actions that lead to different final states, and (2) the optimal policy doesn’t map two states that require different actions (to reach the final state) to the same action.
Also, as a side note, the objective here is quite peculiar: minimizing the entropy of the final state distribution. As I mentioned, this does not care about which final state you reach. However, there might be a useless action that trivially takes you to the same final state all the time, e.g., closing your eyes. So an optimal model (policy) could be to always close your eyes. Obviously, we need to modify the objective so it cares about the states we end up in. We can simply define a goal state and maximize the probability of ending up in it. I think the theorem should still apply in this case.
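The modified objective is a one-line change on the same (again, purely illustrative) toy instance: instead of minimizing entropy, score each deterministic regulator by the probability that it lands in a designated goal state:

```python
import itertools

# Same hypothetical toy instance as the setup; now we fix a goal state and
# maximize the probability of ending up there, rather than minimizing entropy.
S = [0, 1, 2]
R = [0, 1]
psi = {
    (0, 0): "A", (0, 1): "B",
    (1, 0): "B", (1, 1): "A",
    (2, 0): "A", (2, 1): "B",
}
GOAL = "A"

best = None
for choice in itertools.product(R, repeat=len(S)):
    policy = dict(zip(S, choice))
    # Probability of reaching GOAL under a uniform initial-state distribution.
    p_goal = sum(psi[(s, policy[s])] == GOAL for s in S) / len(S)
    if best is None or p_goal > best[0]:
        best = (p_goal, policy)

print(best)  # (1.0, {0: 0, 1: 1, 2: 0})
```

A “close your eyes” action that always leads to some dull state would still win under the entropy objective, but scores zero here unless that state happens to be the goal.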
From the perspective of RL, the good regulator theorem is really not that relevant. It certainly doesn’t say anything about the question of model-free vs. model-based RL. In fact, model-free methods essentially solve the same problem (in a multi-step setting) as the good regulator theorem. So model-free RL is model-based from the perspective of the good regulator theorem.
“Model” in model-based RL means something rather different; it maps every possible state-action pair to a next-state distribution. In the good regulator theorem, however, the “model” is essentially a policy that maps states to optimal actions, and this “model” doesn’t know what the next state would be for an arbitrary state-action pair. The closest thing in RL to the “model” of the good regulator theorem is probably the work on latent MDPs/bisimulation metrics/MDP homomorphisms^{6}^{7}. These learn a new abstract MDP that is behaviorally equivalent to the original MDP. If we were to construct this abstract MDP solely based on which optimal action is needed in each state, we would learn a model that is essentially equivalent to the “model” in the good regulator theorem. However, this is not much different from what model-free RL techniques already do.
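The distinction can be made concrete with the same illustrative toy instance: the two objects below are both called a “model”, but one answers counterfactual questions and the other does not.

```python
# Two different objects that both get called a "model" (toy, illustrative data).

# Model-based RL "model": predicts the next state for *every* state-action
# pair, so it can answer "what if I took action r in state s?" off-policy.
transition_model = {
    (0, 0): "A", (0, 1): "B",
    (1, 0): "B", (1, 1): "A",
    (2, 0): "A", (2, 1): "B",
}

# Good-regulator "model": just a policy, one action per state. It records
# nothing about the consequences of the actions it does not take.
regulator = {0: 0, 1: 1, 2: 0}

# The policy covers only 3 of the 6 state-action pairs the transition
# model describes; the other 3 are simply absent from it.
on_policy_pairs = {(s, r) for s, r in regulator.items()}
print(len(transition_model) - len(on_policy_pairs))  # 3 pairs the policy says nothing about
```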
Bibliography

Roger C. Conant and W. Ross Ashby, “Every Good Regulator of a System Must Be a Model of That System,” International Journal of Systems Science 1 (1970), 89–97.

Scholten, Daniel L. n.d. “A Primer For Conant & Ashby’s ‘Good-Regulator Theorem.’”

———. n.d. “Every Good Key Must Be A Model Of The Lock It Opens (The Conant & Ashby Theorem Revisited).”

We used different variables for the initial state and the final state of the system, but this doesn’t change the fundamental picture.

Note there might be equivalent optimal models that are not as simple, in the sense that for a given s the regulator (model) picks different actions with some probability; but even in this case, all these actions must transition the initial state to the same final state.

Gelada, Carles, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. 2019. “DeepMDP: Learning Continuous Latent Space Models for Representation Learning.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1906.02736.

Pol, Elise van der, Thomas Kipf, Frans A. Oliehoek, and Max Welling. 2020. “Plannable Approximations to MDP Homomorphisms: Equivariance under Actions.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2002.11963.