Invited Speakers

2023 Spring Series - Ongoing

Alessandro Lazaric (Meta AI)

Understanding Incremental Unsupervised Exploration for Goal-based RL

One of the key features of intelligent beings is the capacity to explore and discover an unknown environment and to progressively learn how to control it. This process is not driven by an explicit reward and it may unfold in a completely unsupervised way. In this talk I will propose a formalization of unsupervised discovery and exploration as the process of incrementally learning policies that reach goals of increasing difficulty. The resulting goal-based policy then allows the agent to solve any goal-reaching task at downstream time with no additional learning or planning. I will illustrate algorithmic principles, theoretical guarantees, and preliminary empirical results that could lay the foundations for designing agents that can efficiently learn in open-ended environments.


On unsupervised exploration:

Adaptive Multi-Goal Exploration; AISTATS 2022

Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching; ICLR 2022

A Provably Efficient Sample Collection Strategy for Reinforcement Learning; NeurIPS 2021

Improved Sample Complexity for Incremental Autonomous Exploration in MDPs; NeurIPS 2020

On exploration for goal-based RL:

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret; NeurIPS 2021

Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model; A

About the Speaker

Alessandro is a research scientist at Meta AI (FAIR), where he has been since 2017. Prior to working at Meta, he completed a PhD at Politecnico di Milano and worked as a researcher at INRIA Lille. His main research topic is reinforcement learning, with extensive contributions on both the theoretical and algorithmic aspects of RL. In the last ten years he has studied the exploration-exploitation dilemma in both the multi-armed bandit and reinforcement learning frameworks, notably on the problems of regret minimization, best-arm identification, pure exploration, and hierarchical RL.

2022 Winter Series

Elise van der Pol (Microsoft Research)

Symmetries in Single and Multi-agent Learning and AI4Science

Recording - WiP

In this talk, I will discuss our work on symmetry and structure in single and multi agent reinforcement learning. I will first discuss MDP Homomorphic Networks (NeurIPS 2020), a class of networks that ties transformations of observations to transformations of decisions. Such symmetries are ubiquitous in deep reinforcement learning, but often ignored in earlier approaches. Enforcing this prior knowledge into policy and value networks allows us to reduce the size of the solution space, a necessity in problems with large numbers of possible observations. I will showcase the benefits of our approach on agents in virtual environments. Building on the foundations of MDP Homomorphic Networks, I will also discuss our recent multi-agent works, Multi-Agent MDP Homomorphic Networks (ICLR 2022) and Equivariant Networks for Zero-Shot Coordination (NeurIPS 2022), which consider symmetries in multi-agent systems. This forms a basis for my vision for reinforcement learning for complex virtual environments, as well as for problems with intractable search spaces. Finally, I will briefly discuss AI4Science.

About the Speaker

Elise van der Pol is a Senior Researcher at Microsoft Research AI4Science Amsterdam, working on reinforcement learning and deep learning for molecular simulation. Additionally, she works on symmetry, structure, and equivariance in single and multi-agent reinforcement learning and machine learning.

Before joining MSR, she did a PhD in the Amsterdam Machine Learning Lab, working with Max Welling (UvA), Frans Oliehoek (TU Delft), and Herke van Hoof (UvA). During her PhD, she spent time in DeepMind’s multi-agent team. Elise was an invited speaker at the BeneRL 2022 workshop and the Self-Supervision for Reinforcement Learning workshop at ICLR 2021. She was also a co-organizer of the workshop on Ecological/Data-Centric Reinforcement Learning at NeurIPS 2021.

12th October 2022

Alhussein Fawzi (DeepMind)

Discovering faster matrix multiplication algorithms with reinforcement learning

Recording - WiP

Improving the efficiency of algorithms for fundamental computational tasks such as matrix multiplication can have widespread impact, as it affects the overall speed of a large number of computations. The automatic discovery of algorithms using machine learning offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. In this talk I'll present AlphaTensor, our reinforcement learning agent based on AlphaZero for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor's algorithm improves on Strassen's two-level algorithm for the first time since its discovery 50 years ago. I'll present our problem formulation as a single-player game, the key ingredients that enable tackling such difficult mathematical problems using reinforcement learning, and the flexibility of the AlphaTensor framework.
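
For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of Strassen's classical scheme for 2 × 2 matrices, which uses 7 scalar multiplications instead of the 8 used by the schoolbook method. Applied recursively to a 4 × 4 matrix split into 2 × 2 blocks, it gives the "two-level" algorithm with 7 × 7 = 49 multiplications. The function names are illustrative and not taken from the talk; this is not AlphaTensor's algorithm.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using 7 scalar multiplications (Strassen's scheme)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # The seven products; each line performs exactly one multiplication.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products using additions and subtractions only.
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


def naive_2x2(A, B):
    """Schoolbook 2x2 product (8 scalar multiplications), used as a reference."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B) == naive_2x2(A, B))
```

The saving matters because the entries can themselves be matrix blocks, so each multiplication avoided at the top level removes an entire recursive sub-problem.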

About the Speaker

Alhussein Fawzi is a Research Scientist in the Science team at DeepMind, where he leads the algorithmic discovery efforts. He is broadly interested in using machine learning to unlock new scientific discoveries. He obtained his PhD in machine learning and computer vision from EPFL in 2016.

12th October 2022

Tim Rocktäschel (DeepMind)

The NetHack Learning Environment and Its Challenges for Open-Ended Learning

Recording - WiP

Progress in Reinforcement Learning (RL) methods goes hand-in-hand with the development of challenging environments that test the limits of current approaches. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both these things. Moreover, research in RL has predominantly focused on environments that can be approached by tabula rasa learning, i.e., without agents requiring transfer of any domain or world knowledge outside of the simulated environment. I will talk about the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for research based on the popular single-player terminal-based rogue-like game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as auto-curriculum learning, exploration, planning, skill acquisition, goal-driven learning, novelty search, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience.

About the Speaker

Tim is the Open-Endedness Team Lead at DeepMind, an Associate Professor at the Centre for Artificial Intelligence in the Department of Computer Science at University College London (UCL), and a Scholar of the European Laboratory for Learning and Intelligent Systems (ELLIS). Prior to that, he was a Manager and Research Scientist at Facebook AI Research (FAIR) London, a Postdoctoral Researcher in Reinforcement Learning at the University of Oxford, a Junior Research Fellow in Computer Science at Jesus College, and a Stipendiary Lecturer in Computer Science at Hertford College. Tim obtained his Ph.D. from UCL under the supervision of Sebastian Riedel, and he was awarded a Microsoft Research Ph.D. Scholarship in 2013 and a Google Ph.D. Fellowship in 2017. 

12th October 2022

Felix Hill (DeepMind)

How knowing language can make general AI systems smarter

Recording - WiP

Having and using language makes humans as a species better learners and better able to solve hard problems. I'll present three studies that demonstrate how this is also the case for artificial models of general intelligence. In the first, I show that agents with access to visual and linguistic semantic knowledge explore their environment more effectively than non-linguistic agents, enabling them to learn more about the world around them. In the second, I demonstrate how an agent embodied in a simulated 3D world can be enhanced by learning from explanations -- answers to the question "why?" expressed in language. Agents that learn from explanations solve harder cognitive challenges than those trained from reinforcement learning alone, and can also better learn to make interventions in order to uncover the causal structure of their world. Finally, I'll present evidence that the skewed and bursty distribution of natural language may explain how large language models can be prompted to rapidly acquire new skills or behaviours. Together with other recent literature, this suggests that modelling language may make a neural network better able to acquire new cognitive capacities quickly, even when those capacities are not necessarily explicitly linguistic. 

About the Speaker

Felix is a Research Scientist at DeepMind where he leads a team focusing on the relationship between natural language and general intelligence. His work combines insights from Cognitive Science, Neuroscience and Linguistics in working towards scientifically and practically useful models of human cognition and behaviour. 

14th September 2022

Sam Devlin (Microsoft Research)

Towards Ad-Hoc Teamwork for Improved Player Experience

Recording - WiP

Collaborative multi-agent reinforcement learning research often makes two key assumptions: (1) we have control of all agents on the team; and (2) maximising team reward is all you need. However, to enable human-AI collaboration, we need to break both of these assumptions. In this talk I will formalise the problem of ad-hoc teamwork and present our proposed approach to meta-learn policies robust to a given set of possible future collaborators. I will then talk about recent work on modelling human play, showing that reward maximisation may not be sufficient when trying to entertain billions of players worldwide.

About the Speaker

Sam is a Principal Researcher in the Deep Reinforcement Learning for Games group at Microsoft Research Cambridge. He received his PhD on multi-agent reinforcement learning in 2013 from the University of York; was a postdoc from 2013 to 2015, working on game analytics; and then was on the faculty from 2016 until joining Microsoft in 2018. He has published more than 60 papers on reinforcement learning and game AI in leading academic venues and presents regularly at games industry events including Develop and the Game Developers Conference (GDC).

2022 Spring Series

16th June 2022

David Abel (DeepMind)

On the Expressivity of Markov Reward

Recording - WiP

Reward is the driving force for reinforcement-learning agents. In this talk, I will present our recent NeurIPS paper that explores the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be of interest: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while Markov reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. I conclude by summarizing recent follow up work that studies alternatives for enriching the expressivity of reward.

About the Speaker

David Abel is a Research Scientist at DeepMind in London. Before that, he completed his Ph.D. in computer science and Master's in philosophy at Brown University, advised by Michael Littman (CS) and Josh Schechter (philosophy).

26th May 2022

Roberta Raileanu (Meta AI Research)

From Solving Narrow Tasks to Learning General Skills

Recording - WiP

In the past few years, deep reinforcement learning (RL) has led to impressive achievements in games and robotics. However, current state-of-the-art RL methods struggle to generalize to scenarios they haven’t experienced during training. In this talk, I will show how we can leverage diverse data and a minimal set of inductive biases to generalize to new task instances. First, I will discuss how we can use data augmentation to learn policies which are invariant to task-irrelevant changes in the observations. Then, I will show how we can generalize to new task instances with unseen states and layouts by decoupling the representation of the policy and value function. And finally, I will briefly describe how we can quickly adapt to new dynamics by learning a value function for a family of behaviors and environments.

About the Speaker

Roberta is a research scientist at Meta AI / FAIR. Previously, she did her PhD in computer science at NYU, advised by Rob Fergus. Her research focuses on designing machine learning algorithms that can make robust sequential decisions in complex environments. In particular, Roberta works in the area of deep reinforcement learning, with a focus on generalization, adaptation, continual, and open-ended learning. During her PhD, she spent time as an intern at DeepMind, Microsoft Research, and Facebook AI Research. Roberta also holds a B.A. in Astrophysics from Princeton University, where she worked on theoretical cosmology and supernovae simulations.

21st April 2022

Haitham Ammar (Huawei / UCL)

High-Dimensional Black-Box Optimisation in Small Data Regimes

Recording - WiP

Many problems in science and engineering can be viewed as instances of black-box optimisation over high-dimensional (structured) input spaces. Applications are ubiquitous, including arithmetic expression formation from formal grammars and property-guided molecule generation, to name a few. Machine learning (ML) has shown promising results in many such problems, sometimes leading to state-of-the-art results. Despite those successes, modern ML techniques are data-hungry, requiring hundreds of thousands if not millions of labelled data points. Unfortunately, many real-world applications do not enjoy such a luxury -- it is challenging to acquire millions of wet-lab experiments when designing new molecules.

This talk will elaborate on novel techniques we developed for high-dimensional Bayesian optimisation (BO), capable of efficiently resolving such data bottlenecks. Our methods combine ideas from deep metric learning with BO to enable sample-efficient low-dimensional surrogate optimisation. We provide theoretical guarantees demonstrating vanishing regret with respect to the true high-dimensional optimisation problem. Furthermore, in a set of experiments, we confirm the effectiveness of our techniques in reducing sample sizes by acquiring state-of-the-art logP molecule values utilising only 1% of the labels compared to previous SOTA.

About the Speaker

Haitham leads the reinforcement learning team at Huawei Technologies Research & Development UK and is an Honorary Lecturer at UCL. Prior to Huawei, Haitham led the reinforcement learning and tuneable AI team at, where he contributed to their technology in finance and logistics. Prior to joining, Haitham was an Assistant Professor in the Computer Science Department at the American University of Beirut (AUB). Before joining the AUB, Haitham was a postdoctoral research associate in the Department of Operational Research and Financial Engineering (ORFE) at Princeton University. Prior to Princeton, he conducted research in lifelong machine learning while employed as a postdoctoral researcher at the University of Pennsylvania. As a former member of the General Robotics Automation Sensing and Perception (GRASP) lab, he also contributed to the application of machine learning to robotics. His primary research interests lie in the field of statistical machine learning and artificial intelligence, focusing on Bayesian optimisation, probabilistic modelling and reinforcement learning. He is also interested in learning using massive amounts of data over extended time horizons – a property common to "Big-Data" problems. His research also spans different areas of control theory and nonlinear dynamical systems, as well as social networks and distributed optimization.

17th March 2022

Angeliki Lazaridou (DeepMind)

Towards Multi-agent Emergent Communication as a Building Block of Human-centric AI


The ability to cooperate through language is a defining feature of humans. As the perceptual, motor and planning capabilities of deep artificial networks increase, researchers are studying whether they can also develop a shared language to interact. In this talk, I will highlight recent advances in this field but also common headaches (or perhaps limitations) with respect to experimental setup and evaluation of emergent communication. Towards making multi-agent communication a building block of human-centric AI, and by drawing from my own recent work, I will discuss approaches to making emergent communication relevant for human-agent communication in natural language.

About the Speaker

Angeliki Lazaridou is a Staff Research Scientist at DeepMind. She obtained her PhD from the University of Trento, where she worked on predictive grounded language learning. At DeepMind, she has worked on interactive methods for language learning that rely on multi-agent communication as a means of alleviating the use of supervised language data. More recently, she has focused on understanding (and improving) the temporal generalization of language models.

17th February 2022

Jakob Foerster (University of Oxford)

Zero-Shot Coordination and Off-Belief Learning


There has been a large body of work studying how agents can learn communication protocols in decentralized settings, using their actions to communicate information. Surprisingly little work has studied how this can be prevented, yet this is a crucial prerequisite from a human-AI coordination and AI-safety point of view.

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy pi_1 that is optimized assuming past actions were taken by a given, fixed policy, pi_0, but assuming that future actions will be taken by pi_1. When pi_0 is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior.

OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC).

OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy setting and the benchmark human-AI & ZSC problem Hanabi.

About the Speaker

Jakob Foerster started as an Associate Professor at the Department of Engineering Science at the University of Oxford in the fall of 2021. During his PhD at Oxford he helped bring deep multi-agent reinforcement learning to the forefront of AI research and interned at Google Brain, OpenAI, and DeepMind. After his PhD he worked as a research scientist at Facebook AI Research in California, where he continued doing foundational work. He was the lead organizer of the first Emergent Communication workshop at NeurIPS in 2017, which he has helped organize ever since, and he was awarded a prestigious CIFAR AI chair in 2019.

His past work addresses how AI agents can learn to cooperate and communicate with other agents. Most recently, he has been developing and addressing the zero-shot coordination problem setting, a crucial step towards human-AI coordination.

His work has been cited over 5000 times, with an h-index of 29 (Google Scholar page).