TY - JOUR
T1 - Invariant Policy Learning
T2 - A Causal Perspective
AU - Saengkyongam, Sorawit
AU - Thams, Nikolaj
AU - Peters, Jonas
AU - Pfister, Niklas
N1 - Publisher Copyright:
IEEE
PY - 2023
Y1 - 2023
AB - Contextual bandit and reinforcement learning algorithms have been used successfully in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static, in the sense that they do not change across environments. In many real-world systems, however, the mechanisms are subject to shifts across environments, which may invalidate the static-environment assumption. In this paper, we take a step toward tackling the problem of environmental shifts within the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is relevant only if unobserved variables are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions.
KW - Causality
KW - contextual bandits
KW - distributional shift
KW - off-policy learning
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85147223594&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2022.3232363
DO - 10.1109/TPAMI.2022.3232363
M3 - Journal article
C2 - 37018267
AN - SCOPUS:85147223594
VL - 45
SP - 8606
EP - 8620
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
SN - 0162-8828
IS - 7
ER -