Unsupervised Skill Discovery in Deep Reinforcement Learning
Google AI: DADS - Unsupervised Off-Policy Reinforcement Learning for Skill Discovery
A. Sharma et al., AI residents at Google Research and Google Brain, published an interesting approach to the difficult problem of specifying a well-designed, task-specific reward function. Their approach is unsupervised, which avoids manually labeling “goal” states or adding potentially costly instrumentation, e.g. in the form of sensors.
Usually, an agent in a supervised reinforcement learning environment uses an extrinsic reward function designed for a specific problem area. In contrast, in unsupervised reinforcement learning, an agent uses an intrinsic reward function, e.g. procedures mimicking behaviors like curiosity, ‘to generate its own training signals for acquiring a probably broad set of task-agnostic behaviors’ [Sharma et al. 2020].
Essentially, this removes the effort of designing an extrinsic, task-specific reward function, and the learned behaviors can generalize to other tasks as well. Although learning agent-environment interactions without a guiding reward signal is difficult, solving this problem in an unsupervised fashion could prove extremely rewarding for many domains outside of classic anthropomatics.
Sharma et al.’s work comprises two recent research papers from Google Research and Google Brain: Dynamics-Aware Unsupervised Discovery of Skills, and the more recent Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning.
In their foundational work, Dynamics-Aware Unsupervised Discovery of Skills, they introduce “predictability” as an optimization objective for discovering new skills. This makes it possible to learn a dynamics model of the environment, which in turn enables the use of planning algorithms. They demonstrate the practicability of the approach in a simulated robotics setup. In their follow-up work, Sharma et al. improve the sample efficiency and demonstrate the practicability of DADS in a real-world scenario, using an off-policy variant to train D’Kitty from ROBEL.
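To illustrate why a dynamics model enables planning, here is a minimal sketch. It rests on loud assumptions: `toy_skill_dynamics` is a hypothetical hand-written stand-in for a learned model, and `plan_skill` is a simple random-shooting planner swapped in for illustration, not the planner used in the papers.

```python
import numpy as np


def toy_skill_dynamics(state, skill):
    """Hypothetical stand-in for a learned skill-dynamics model (predicted next state)."""
    return state + 0.5 * skill


def plan_skill(state, goal, num_candidates=256, horizon=5, seed=0):
    """Random-shooting planner: sample skills, roll each out in the model, keep the best."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, state.shape[0]))
    # Hold each candidate skill fixed for `horizon` model steps.
    predicted = np.repeat(state[None, :], num_candidates, axis=0)
    for _ in range(horizon):
        predicted = toy_skill_dynamics(predicted, candidates)
    # Score each candidate by the distance between its predicted final state and the goal.
    costs = np.linalg.norm(predicted - goal, axis=-1)
    return candidates[np.argmin(costs)]


state = np.zeros(2)
goal = np.array([2.0, -1.0])
best_skill = plan_skill(state, goal)

# Executing the chosen skill for one step in the toy model moves the state toward the goal.
next_state = toy_skill_dynamics(state, best_skill)
```

The point of the sketch is that no task reward is needed during skill learning: once a model predicts what each skill does, a goal can be supplied afterwards and a planner can search over skills instead of low-level actions.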
DADS is based on the design of an intrinsic reward function which encourages curiosity-like behavior for discovering a skill set that is both “predictable” and “diverse”. As the environment gives no reward, optimizing skills with respect to diversity allows the agent to discover many potentially useful behaviors.
They use a second neural network (the skill-dynamics network) to predict whether a skill is associated with a predictable change in the environment. Better prediction performance of the skill-dynamics network indicates more predictable environmental state changes, and therefore a more predictable skill.
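The interplay of predictability and diversity can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: `toy_skill_dynamics` is a hypothetical stand-in for the learned network, the Gaussian likelihood with fixed standard deviation is an assumption, and the reward form (log-likelihood of the transition under the current skill minus the log of its average likelihood under skills sampled from the prior) follows the general shape described in the DADS paper.

```python
import numpy as np


def gaussian_log_prob(x, mean, std=1.0):
    """Log-density of an isotropic Gaussian, summed over state dimensions."""
    var = std ** 2
    return np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var)), axis=-1)


def toy_skill_dynamics(state, skill):
    """Hypothetical stand-in for the learned network: predicts the mean next state.

    Here each skill simply shifts the state by the skill vector.
    """
    return state + skill


def dads_style_reward(state, skill, next_state, prior_skills, std=1.0):
    """log q(s'|s,z) - log((1/L) * sum_i q(s'|s,z_i)), with z_i drawn from the prior."""
    log_q = gaussian_log_prob(next_state, toy_skill_dynamics(state, skill), std)
    # Log-densities of the same transition under L alternative skills.
    alt_means = toy_skill_dynamics(state[None, :], prior_skills)
    alt_log_q = gaussian_log_prob(next_state[None, :], alt_means, std)
    # Numerically stable log-mean-exp over the alternative skills.
    m = np.max(alt_log_q)
    log_marginal = m + np.log(np.mean(np.exp(alt_log_q - m)))
    return log_q - log_marginal


rng = np.random.default_rng(0)
state = np.zeros(2)
skill = np.array([1.0, 0.0])
prior_skills = rng.uniform(-1.0, 1.0, size=(64, 2))

# A transition that matches the skill's prediction vs. one that contradicts it.
reward_predictable = dads_style_reward(state, skill, state + skill, prior_skills)
reward_unpredictable = dads_style_reward(state, skill, state - skill, prior_skills)
```

In this toy setup the matching transition receives the higher reward. The first term rewards transitions the model predicts well for the chosen skill (predictability), while the subtracted term penalizes transitions that other skills would explain equally well (diversity).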
Essentially, Sharma et al.’s approach could open up interesting possibilities in less complex areas such as online retail, lead management and customer experience, if adapted appropriately to the specific problem domain.
If you are interested in their current research on unsupervised reinforcement learning, have a look at the following resources.