As part of my Machine Learning course, we have to work on a group project. We decided to pursue a relatively uncharted area of Reinforcement Learning. Even though none of us is an expert in this field, we want to learn our way through the next three months, and hopefully you will gain some insight from it too. Here’s the proposal:

# Introduction

We propose to evaluate the efficacy of a novel Transfer Learning for Reinforcement Learning algorithm that autonomously learns inter-task mappings between a pair of tasks set in distinct domains, each with a continuous state space and a discrete action space, without the aid of any human-coded task alignment.

Solving this problem could accelerate training, raise asymptotic performance, and/or increase the cumulative reward of RL agents. This is especially attractive in high-cost environments that cannot be accurately simulated, and for agents engaged in curriculum learning, lifelong learning, or imitation learning. For example, an agent trained to drive a car should be able to transfer some skills to flying a plane.

Reinforcement Learning (RL) is a machine learning technique for solving sequential decision-making problems, which are commonly modeled as Markov decision processes (MDPs). An intelligent agent uses RL to perceive its environment and make informed decisions based on immediate and future rewards. When humans start a task they have never seen before, they never start from scratch. They use prior experience, both their own and that of others, to get a head start on the task at hand. Most current RL agents are very specific and have to begin tabula rasa when asked to do another task. We want an agent to behave more like a human and use what it has learned in other domains to perform a new task. We are focusing on cross-task knowledge transfer in MDPs with continuous or discrete state spaces, discrete action spaces, and deterministic transition functions.

# Review of Related Work

RL, in general, is covered in the canonical introductory text by Sutton and Barto [1]. Though they do not cover transfer learning in any detail, they motivate RL and describe the dimensions of possible RL domains and their corresponding learning algorithms.

The problem of transfer learning (TL) in RL has many different formulations, usually depending on the known differences between the two tasks of interest. Most early TL for RL work prescribed a solution to one particular type of formulation, though over time the restrictions on task similarity have become looser. Taylor’s 2009 survey paper [2] has a thorough review of early TL for RL research.

Most inter-task mapping has been aided by some human-coded task alignment. Gupta and Devin [3] devise a cross-domain transfer learning problem that relies on implicit task alignment. Their construction has two different agents operating in distinct state and action spaces, though the transfer solution relies on the a priori knowledge that the agents are each accomplishing the same pair of tasks with the same pair of skills, and also assumes that the distributions of states visited by the optimal policies of each domain are similar. Much transfer learning research to date imposes a similar implicit a priori task alignment, whereas we aim to construct an algorithm that has no prior knowledge of how related the two tasks are or of how any of their structures may be analogous.

The first generalized automated inter-task mappings were generated by Taylor [4] with the MASTER algorithm. It requires learning models of the source and target tasks, then comparing the modeled transition dynamics between all state-action pairs in each task to find the best linear projection of the source task into the target task. The MASTER algorithm is computationally intensive and limited to small discrete state and action spaces.

Ammar [5] automates the learning of inter-task mappings by projecting source and target transition vectors (s, a, s’) into a latent vector space, learned using sparse coding trained on the source task vector space, and then aligning the transition vector instances based on their Euclidean distance in this latent space. Creating this common projection space and mapping instances across tasks allows Ammar to transfer experience of optimal policy instances from the source task to the target task, thereby generating an initial policy for the target task from transferred source task knowledge. This algorithm does not consider the reward achieved by steps in each task when determining correspondence, and instead relies solely on the transition dynamics of each domain. If the tasks have significantly different reward structures, this algorithm may therefore result in negative transfer. The procedure is also sensitive to the breadth of the samples used in the correspondence learning step.
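
The alignment step can be illustrated with a minimal sketch. It assumes the latent codes have already been computed by some projection (the sparse coding itself is not shown), and the function name `align_by_latent_distance` and the toy arrays are made up for illustration rather than taken from [5]:

```python
import numpy as np

def align_by_latent_distance(z_source, z_target):
    """For each target latent code, find the closest source latent code.

    z_source: (n_source, d) latent codes of source transitions (s, a, s')
    z_target: (n_target, d) latent codes of target transitions (s, a, s')
    Returns the index of the nearest source code and the distance to it.
    """
    # Pairwise Euclidean distances, shape (n_target, n_source).
    diffs = z_target[:, None, :] - z_source[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = dists.argmin(axis=1)
    return nearest, dists[np.arange(len(z_target)), nearest]

# Toy latent codes (hypothetical values, for illustration only).
z_source = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z_target = np.array([[0.9, 1.2], [4.0, 6.0]])
nearest, distance = align_by_latent_distance(z_source, z_target)
```

Note that nothing in this matching looks at rewards, which is the limitation discussed above.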

Ammar [6] also constructs a generally autonomous inter-task mapping procedure using three-way Restricted Boltzmann Machines, called Transfer RBM (TrRBM). With one layer each for the transition vectors of the source and target tasks, plus one layer of latent variables, he optimizes the three-way weight tensor by maximizing the conditional posterior probabilities of one layer given the other two, allowing for instance transfer as in [5]. This algorithm is also sensitive to the underlying transition distribution of each task environment and does not consider the correspondence of the tasks’ reward functions.

# Viable Project Plan

We want to use a feedforward neural network to map (s) -> (s) and (s, a) -> (a) from target task to source task in order to facilitate knowledge transfer. The network would be trained by backpropagating the discrepancy between the reward expected in the source task for the target->source mapped (s, a) tuple and the actual Bellman backup reward observed in the target task. If both tasks use model-based learning, an additional loss function could be the negative log-likelihood of the observed target task transition under the source transition model, given the target->source mapped tuple. Using these loss functions, we would determine a correspondence between tasks based on reward structure as well as environment transition dynamics. An initial sketch of the process of learning the mapping is displayed in the figure below.
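
One possible reading of the reward-matching loss, written as a forward-pass sketch in NumPy. Everything named here is an assumption for illustration and not a settled design: `init_mlp`, `mapping_loss`, the stand-in `q_source`, the layer sizes, the soft action mapping through a softmax, and the squared-error form of the loss. In practice the gradients would come from an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in, n_hidden, n_out):
    """Random weights for a one-hidden-layer feedforward network."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def mlp(params, x):
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mapping_loss(state_net, action_net, q_source, s_t, a_t_onehot, backup_t):
    """Mean squared gap between mapped source value and target Bellman backup."""
    s_src = mlp(state_net, s_t)                                   # (s) -> (s)
    logits = mlp(action_net, np.concatenate([s_t, a_t_onehot], axis=1))
    p_src = softmax(logits)                                       # (s, a) -> soft (a)
    q_src = q_source(s_src)                                       # (batch, n_source_actions)
    expected_q = (p_src * q_src).sum(axis=1)
    return np.mean((expected_q - backup_t) ** 2)

# Toy shapes (assumed): target state dim 4 with 2 actions, source state dim 2 with 3 actions.
state_net = init_mlp(4, 8, 2)
action_net = init_mlp(4 + 2, 8, 3)
W_q = rng.normal(size=(2, 3))
q_source = lambda s: s @ W_q           # stand-in for a learned source Q-function

s_t = rng.normal(size=(5, 4))
a_t = np.eye(2)[rng.integers(0, 2, size=5)]
backup_t = rng.normal(size=5)          # stand-in for r + gamma * max_a' Q_target(s', a')
loss = mapping_loss(state_net, action_net, q_source, s_t, a_t, backup_t)
```

The softmax keeps the discrete source action differentiable with respect to the mapping weights; a hard argmax would block the gradient.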

Once this mapping is learned, we hope to transfer knowledge (in terms of a policy, value functions, and/or transition dynamics) from source task to target task, though how this will be accomplished is still unknown and likely depends on the RL algorithm used in each task. We are aware that the target->source mapping is non-injective, and so will map knowledge (e.g., the Q value) of one (s, a) source task tuple back to potentially many (s, a) target task tuples. If we cannot do this in the directed model setting, we would have to look at undirected graphical models, which we are trying to avoid due to the computational complexity of the sampling methods required.
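
The consequence of a non-injective mapping can be seen in a tabular toy example. The state and action labels, the `target_to_source` dictionary, and the Q values below are all hypothetical; the proposal itself targets continuous state spaces, where a function approximator would replace the lookup tables:

```python
from collections import defaultdict

# Hypothetical learned mapping: target (state, action) -> source (state, action).
target_to_source = {
    ("t0", "left"): ("s0", "push"),
    ("t1", "left"): ("s0", "push"),   # two target tuples share one source tuple
    ("t1", "right"): ("s1", "pull"),
}
q_source = {("s0", "push"): 1.5, ("s1", "pull"): -0.5}

# Initialise the target Q-table by looking up each mapped source value.
q_target_init = {sa: q_source[target_to_source[sa]] for sa in target_to_source}

# Invert the mapping to see which target tuples each source tuple informs.
preimage = defaultdict(list)
for target_sa, source_sa in target_to_source.items():
    preimage[source_sa].append(target_sa)
```

Here `("t0", "left")` and `("t1", "left")` both inherit the value 1.5, so any distinction between them has to be learned within the target task.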

Experiments would include reproducing Taylor’s MASTER algorithm and Ammar’s TrRBM algorithm as baselines. This would involve applying their algorithms in OpenAI Gym to the Inverted Pendulum, Cart Pole and Mountain Car environments. We would then implement our novel transfer algorithm and compare our results using the five metrics described in Taylor’s survey [2], as well as computational and sample complexity.
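
The five metrics in Taylor and Stone's survey are jumpstart, asymptotic performance, total reward, transfer ratio, and time to threshold. A minimal sketch of how they might be computed from two learning curves follows. The function name `transfer_metrics` and the toy curves are assumptions, and using the last point of each curve as the asymptote is a simplification:

```python
import numpy as np

def transfer_metrics(with_transfer, without_transfer, threshold):
    """Compare per-episode reward curves of equal length."""
    w = np.asarray(with_transfer, dtype=float)
    wo = np.asarray(without_transfer, dtype=float)

    def time_to_threshold(curve):
        # First episode index at which the curve reaches the threshold, if any.
        hits = np.nonzero(curve >= threshold)[0]
        return int(hits[0]) if hits.size else None

    area_w, area_wo = w.sum(), wo.sum()
    return {
        "jumpstart": w[0] - wo[0],                       # initial performance gap
        "asymptotic_gain": w[-1] - wo[-1],               # final performance gap
        "total_reward_gain": area_w - area_wo,           # gap in area under the curves
        "transfer_ratio": (area_w - area_wo) / area_wo,  # undefined if area_wo == 0
        "time_to_threshold": (time_to_threshold(w), time_to_threshold(wo)),
    }

# Made-up learning curves, for illustration only.
metrics = transfer_metrics([2, 4, 6, 8], [0, 2, 4, 6], threshold=6)
```

In real experiments the curves would be averaged over many runs before these quantities are computed.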

# Nice to Haves

If time and resources permit, we would like to implement Ammar’s Sparse Coding algorithm for inter-task mappings and compare its performance to our approach as another baseline, though he reports that his TrRBM model is more efficient.

It would be interesting to test all these algorithms in new experimental environments, such as a Mountain Car/Cart Pole hybrid and a Mountain Car/Pong hybrid, and across a more varied set of task pairs.

Ideally, we would use Bayesian methods to determine the weight given to source task advice when combining transferred knowledge with normal within-target-task RL learning. These weights could depend on the relative degrees of uncertainty of the inter-task mapping and the within-target-task policy, which could be estimated from the magnitude of a sample of recent loss gradients throughout training.
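
As a rough illustration of the weighting idea, the sketch below uses a simple inverse-uncertainty heuristic in place of a full Bayesian treatment. The function names `advice_weight` and `combine`, and the choice of mean squared gradient as the uncertainty proxy, are assumptions made for this example:

```python
import numpy as np

def advice_weight(mapping_grads, target_grads, eps=1e-8):
    """Weight in [0, 1] for source advice; larger when the mapping looks more certain."""
    u_map = np.mean(np.square(mapping_grads)) + eps   # uncertainty proxy for the mapping
    u_tgt = np.mean(np.square(target_grads)) + eps    # uncertainty proxy for the target policy
    return (1.0 / u_map) / (1.0 / u_map + 1.0 / u_tgt)

def combine(q_advice, q_target, weight):
    """Blend transferred and within-target value estimates."""
    return weight * q_advice + (1.0 - weight) * q_target

# Hypothetical recent gradient samples: the mapping has smaller gradients here.
weight = advice_weight(np.array([0.1, 0.1]), np.array([0.3, 0.3]))
blended = combine(q_advice=2.0, q_target=0.0, weight=weight)
```

With these numbers the mapping's proxy uncertainty is one ninth of the target policy's, so the advice receives a weight of about 0.9.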

It may also be possible to leverage this mapping architecture to significantly accelerate lifelong multi-task transfer learning, using the idea of learning coherent mappings within a clique of tasks.

# References

[1] Sutton, R. and Barto, A. (2012). Reinforcement Learning: An Introduction. Cambridge, Massachusetts: The MIT Press.

[2] Taylor, M. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research.

[3] Gupta, A. et al. (2017). Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning. ICLR.

[4] Taylor, M., Kuhlmann, G. and Stone, P. (2008). Autonomous Transfer for Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems (AAMAS).

[5] Ammar, H., Taylor, M., Tuyls, K., Driessens, K. and Weiss, G. (2012). Reinforcement Learning Transfer via Sparse Coding. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems (AAMAS).

[6] Ammar, H., Mocanu, D., Driessens, K., Tuyls, K. and Weiss, G. (2013). Automatically Mapped Transfer Between Reinforcement Learning Tasks via Three-Way Restricted Boltzmann Machines. In Proceedings of the European Conference on Machine Learning (ECML).