Unsupervised Skill Discovery in Deep Reinforcement Learning

Google AI: DADS - Unsupervised Off-Policy Reinforcement Learning for Skill Discovery

The mothership All.In Data republished this article on their blog on June 29th. Thank you 😄

AI residents A. Sharma et al. from Google Research and Google Brain published an interesting approach to a hard problem: specifying a well-designed, task-specific reward function. Their method works in an unsupervised manner, which avoids manually labeling “goal” states or introducing potentially costly instrumentation, e.g. in the form of sensors.

Usually, an agent in a supervised reinforcement learning environment uses an extrinsic reward function designed for a specific problem area. In contrast, in unsupervised reinforcement learning, an agent uses an intrinsic reward function, e.g. procedures mimicking behaviors like curiosity, ‘to generate its own training signals for acquiring a probably broad set of task-agnostic behaviors’ - [Sharma et al. 2020].

Essentially, this removes the effort of designing an extrinsic, task-specific reward function, and the learned behaviors can generalize to other tasks as well. Although learning agent-environment interactions without a guiding reward signal is difficult, solving this problem in an unsupervised fashion could prove extremely rewarding for many domains outside of classic anthropomatics.

Sharma et al.’s work includes two current research papers, Dynamics-Aware Unsupervised Discovery of Skills, and the more recent Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning, from Google Research and Google Brain.

‘The behavior on the left is random and unpredictable, while the behavior on the right demonstrates systematic motion with predictable changes in the environment. Our goal is to learn potentially useful behaviors such as those on the right, without engineered reward functions.’ - ai.googleblog.com

In their foundational work, Dynamics-Aware Unsupervised Discovery of Skills, they introduce “predictability” as an optimization objective for discovering new skills. This makes it possible to learn a dynamics model of the environment, which in turn enables the use of planning algorithms. They demonstrate the practicability of the approach in a simulated robotics setup. In their follow-up work, Sharma et al. improve the sample efficiency and demonstrate the practicability of DADS in a real-world scenario, using an off-policy variant to train D’Kitty from ROBEL.
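To make the link between a dynamics model and planning more concrete, here is a minimal sketch. It is not the planner from the paper: the function predict_next_state is a hypothetical stand-in for a learned model, a skill is simplified to a movement direction (an angle), and the planner just picks the candidate skill whose predicted rollout ends closest to a goal.

```python
import math


def predict_next_state(state, skill):
    """Hypothetical stand-in for a learned dynamics model.

    Assumption for illustration: a skill is an angle, and following it
    moves the 2-D state by a fixed step in that direction.
    """
    step = 0.1
    return (state[0] + step * math.cos(skill), state[1] + step * math.sin(skill))


def rollout(state, skill, horizon):
    """Predict where the agent ends up after holding one skill for `horizon` steps."""
    for _ in range(horizon):
        state = predict_next_state(state, skill)
    return state


def plan_skill(state, goal, candidate_skills, horizon=5):
    """Pick the candidate skill whose predicted end state is closest to the goal."""
    def cost(skill):
        return math.dist(rollout(state, skill, horizon), goal)

    return min(candidate_skills, key=cost)


# Eight candidate directions, 45 degrees apart.
candidates = [k * math.pi / 4 for k in range(8)]
best = plan_skill((0.0, 0.0), (1.0, 1.0), candidates)
print(f"chosen direction: {math.degrees(best):.0f} degrees")
```

The point of the sketch is only that no task reward is needed while learning the skills; the goal enters at planning time, through the cost computed on the model's predictions.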

DADS is based on an intrinsic reward function that encourages curiosity-like behavior for discovering a skill set that is both “predictable” and “diverse”. As the environment does not provide any reward, optimizing skills with respect to diversity allows the agent to discover many potentially useful behaviors.
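A rough sketch of how such a reward can combine both goals: a transition is rewarded when it is likely under the current skill (predictable) and unlikely under other skills drawn from the prior (diverse). This follows the general form log q(s'|s,z) minus the log of the average of q(s'|s,z_i) over sampled skills z_i, as I understand it from the DADS paper. The toy Gaussian model and all function names below are assumptions for illustration, not the authors' code.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log density of an isotropic Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def skill_dynamics_log_prob(next_state, state, skill, std=0.1):
    """Hypothetical skill-dynamics model: skill z shifts the state by z."""
    mean = [s + z for s, z in zip(state, skill)]
    return gaussian_log_prob(next_state, mean, std)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def intrinsic_reward(state, skill, next_state, prior_skills):
    """High when next_state is likely under `skill` but unlikely under other skills."""
    log_q = skill_dynamics_log_prob(next_state, state, skill)
    log_marginal = log_mean_exp(
        [skill_dynamics_log_prob(next_state, state, z) for z in prior_skills]
    )
    return log_q - log_marginal


# Four toy skills standing in for samples from the skill prior.
prior_skills = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
state, next_state = (0.0, 0.0), (0.1, 0.0)

matching = intrinsic_reward(state, (0.1, 0.0), next_state, prior_skills)
mismatched = intrinsic_reward(state, (-0.1, 0.0), next_state, prior_skills)
print(matching > 0 > mismatched)
```

In this toy setup the reward is positive when the observed transition matches what the active skill predicts and negative when another skill would have explained it better.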

They use a second neural network (the skill-dynamics network) to predict whether a skill is associated with a predictable change in the environment. Better prediction performance of the skill-dynamics network indicates more predictable environmental state changes and therefore a more predictable skill.
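The idea that prediction performance measures predictability can be illustrated without a neural network. In the sketch below, a per-skill Gaussian fitted by maximum likelihood replaces the skill-dynamics network, and the skill names and transition data are made up. The average log-likelihood of the observed state changes then serves as a crude predictability score (evaluated on the fitting data, for brevity).

```python
import math
import random


def fit_gaussian(displacements):
    """Maximum-likelihood mean and standard deviation of 1-D state changes."""
    n = len(displacements)
    mean = sum(displacements) / n
    var = sum((d - mean) ** 2 for d in displacements) / n
    return mean, math.sqrt(var) + 1e-6  # small floor avoids log(0)


def mean_log_likelihood(displacements, mean, std):
    """Average Gaussian log density of the observed state changes."""
    return sum(
        -0.5 * ((d - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for d in displacements
    ) / len(displacements)


rng = random.Random(0)

# Made-up state changes for two hypothetical skills.
transitions = {
    "steady_walk": [0.5 + rng.gauss(0.0, 0.01) for _ in range(200)],
    "flailing": [rng.gauss(0.0, 1.0) for _ in range(200)],
}

predictability = {}
for skill, deltas in transitions.items():
    mean, std = fit_gaussian(deltas)
    predictability[skill] = mean_log_likelihood(deltas, mean, std)

for skill, score in predictability.items():
    print(f"{skill}: {score:.2f}")
```

The skill with tightly clustered state changes receives the higher score, which is the property the skill-dynamics network is meant to capture in the real system.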

‘Schematic of the DADS model.’ - ai.googleblog.com

Essentially, Sharma et al.’s approach could lead to interesting possibilities in the less complex areas of online retail, lead management and customer experience, if adapted to the specific problem domain.

If you are interested in their current research on unsupervised reinforcement learning, have a look at the following resources.

Google AI Blog Post

Dynamics-Aware Unsupervised Discovery of Skills

Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning

Github Repo: DADS

Kind regards,

Henrik Hain

Henrik Hain
Senior Data Scientist / Data Engineer

My (research) interests revolve around the practical and theoretical aspects of software engineering, (self-) learning systems and algorithms, especially (deep) reinforcement learning, spatio-temporal event detection, and computer vision approaches.