Limitations of Experience Replay
While instrumental in reviving research on deep reinforcement learning, experience replay suffers from a number of shortcomings. First, it needs to store a large amount of data (up to two million transition tuples in recent work on medium-sized environments), which cannot be expected to scale well as the dimensionality of the input space grows. Second, it is not clear how long a memory should be retained and how often a given transition should be learned from as learning progresses. The original algorithm simply employed a circular buffer and sampled uniformly from it, while \citet{schaul2015prioritized} retain transitions based on their absolute temporal-difference error. A similar heuristic is used by \citet{liu2017effects} to dynamically adjust the size of the buffer. Third, while \citet{Mnih2015} and \citet{Marblestone_2016} claim that ER models the consolidation of day-time experiences during sleep in the hippocampus, it seems highly improbable that biological mechanisms of memory formation could actually store and replay accurate pixel-level representations of reality. I hypothesize that such a mechanism, if it exists, takes advantage of the predictive power of the neocortex and consolidates learning from episodic memories in a lower-dimensional representation space.
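For reference, the contrast between the two retention and sampling schemes mentioned above can be captured in a few lines. The sketch below is illustrative only: class names, buffer capacity, and the prioritization exponent are placeholder choices rather than values from the cited work; the second buffer samples proportionally to the absolute temporal-difference error in the spirit of \citet{schaul2015prioritized}.

\begin{verbatim}
import numpy as np

class UniformReplayBuffer:
    """Circular buffer with uniform sampling, as in the original setup."""
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.storage = []     # (s, a, r, s_next, done) tuples
        self.position = 0     # next slot to overwrite once the buffer is full

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition  # overwrite oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        idx = np.random.randint(0, len(self.storage), size=batch_size)
        return [self.storage[i] for i in idx]

class PrioritizedReplayBuffer(UniformReplayBuffer):
    """Proportional prioritization by |TD error| (illustrative variant)."""
    def __init__(self, capacity=1_000_000, alpha=0.6, eps=1e-6):
        super().__init__(capacity)
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.alpha, self.eps = alpha, eps

    def add(self, transition, td_error=1.0):
        # priority is set for the slot about to be written
        self.priorities[self.position] = (abs(td_error) + self.eps) ** self.alpha
        super().add(transition)

    def sample(self, batch_size):
        p = self.priorities[: len(self.storage)]
        p = p / p.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=p)
        return [self.storage[i] for i in idx]
\end{verbatim}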
Proposal
Experience replay appears to be a crude but highly effective mechanism that facilitates learning from a data distribution approximating the real distribution of state-action pairs. My proposal is to learn a shallow model of the environment in latent space and use it to generate transitions, or even short trajectories, conditioned on actions and drawn from different segments of a game, and to use these imagined experiences to learn the agent's policy. As opposed to model-based approaches used for planning, such as \citet{Oh2015ActionConditionalVP} and \citet{Leibfried2016ADL}, we are not interested in multi-step predictive coherence over long horizons, but in the ability to generate a large amount of locally coherent data from many locations within the duration of an episode.
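To make the intended data flow concrete, the following sketch shows how one-step imagined transitions could be produced from latent state codes and handed to a standard policy-learning update. All module names, dimensions, and the choice of a simple feed-forward dynamics model are assumptions made for illustration, not a description of an existing implementation; the encoder producing the latent codes is assumed to exist elsewhere.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, n_actions = 32, 4   # illustrative sizes

# Shallow latent dynamics model: (z, a) -> (z', r)
transition_model = nn.Sequential(
    nn.Linear(latent_dim + n_actions, 128), nn.ReLU(),
    nn.Linear(128, latent_dim + 1),
)

def imagine_transitions(z_batch):
    """One-step imagined transitions from a batch of latent states.

    Because z_batch can be drawn from anywhere in past episodes, the
    imagined data covers many locations of a game without long rollouts."""
    actions = torch.randint(0, n_actions, (z_batch.size(0),))
    a_onehot = F.one_hot(actions, n_actions).float()
    out = transition_model(torch.cat([z_batch, a_onehot], dim=-1))
    z_next, reward = out[:, :-1], out[:, -1]
    return z_batch, actions, reward, z_next  # imagined (s, a, r, s') in latent space
\end{verbatim}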
The approach taken will be to evaluate conditional models shown to be effective on high-dimensional data, such as those of \citet{sohn2015learning} and \citet{mirza2014conditional}, and to adapt them to the action-conditional, online setting natural to reinforcement learning.
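As a starting point, a conditional VAE in the spirit of \citet{sohn2015learning} could be conditioned on the current latent state and the action, so that sampling from it yields imagined next states for replay. The sketch below is a minimal adaptation under those assumptions; layer sizes, dimensions, and the mean-squared reconstruction loss are placeholder choices.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionalVAE(nn.Module):
    """Minimal CVAE sketch operating on latent state codes, not raw pixels."""
    def __init__(self, state_dim=32, n_actions=4, z_dim=16, hidden=128):
        super().__init__()
        cond_dim = state_dim + n_actions
        self.enc = nn.Sequential(nn.Linear(state_dim + cond_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a_onehot, s_next):
        cond = torch.cat([s, a_onehot], dim=-1)
        h = self.enc(torch.cat([s_next, cond], dim=-1))        # q(z | s', s, a)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        s_next_hat = self.dec(torch.cat([z, cond], dim=-1))      # p(s' | z, s, a)
        recon = F.mse_loss(s_next_hat, s_next)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    def sample(self, s, a_onehot):
        """Generate an imagined next state for replay, conditioned on (s, a)."""
        cond = torch.cat([s, a_onehot], dim=-1)
        z = torch.randn(s.size(0), self.enc_mu.out_features, device=s.device)
        return self.dec(torch.cat([z, cond], dim=-1))
\end{verbatim}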
Furthermore, the proposal aims not only to replace a non-parametric component of modern algorithms with a more expressive and potentially more performant one, but also to bridge the gap between strictly reactive methods, based on the prediction of an instantaneous action, and model-based methods traditionally used for planning and efficient exploration of the environment. It is hoped that, for example, by learning a joint latent space for both the reactive policy and the world model, we can develop powerful new methods closer to their biological counterparts.
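One possible instantiation of such a joint latent space is a single encoder whose output feeds both a reactive policy head and a latent dynamics head. The skeleton below only illustrates that structural idea; the specific architecture and sizes are assumptions, not part of the proposal itself.

\begin{verbatim}
import torch
import torch.nn as nn

class JointAgent(nn.Module):
    """Shared encoder feeding a reactive policy head and a world-model head."""
    def __init__(self, obs_dim=84 * 84, latent_dim=32, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.policy_head = nn.Linear(latent_dim, n_actions)              # reactive policy
        self.model_head = nn.Linear(latent_dim + n_actions, latent_dim)  # latent dynamics

    def forward(self, obs, a_onehot=None):
        z = self.encoder(obs)
        q_values = self.policy_head(z)
        z_next = None
        if a_onehot is not None:
            z_next = self.model_head(torch.cat([z, a_onehot], dim=-1))
        return q_values, z_next
\end{verbatim}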