\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[ngerman,english]{babel}
\begin{document}
\title{Curriculum Learning}
\author[1]{Clément}%
\affil[1]{Cogitai}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
\section*{Motivation}
{\label{175117}}
In our current setup, a Goal Configuration (GC) is a classifier learned
from positive and negative examples provided by the user. It learns a
metric that minimizes (resp. maximizes) the distance to the goal for
positive (resp. negative) examples, as described in
these~\href{https://drive.google.com/drive/u/1/folders/0BwcB7wj_CBS8ODZzcHZELTdOQ1E}{notes
on GC classifier by Varun}.
Similarly, an Initiation Set (IS) is also learned from positive and
negative examples. Given the current state, a GC (resp. an~IS)
classifier is~able to compute the probability that this state belongs to
the GC (resp. the IS). A GC is associated with a reward function defined
as the gain in distance to the goal resulting from each of the agent's
actions.~The objective is then to learn a policy maximizing the
discounted~cumulative reward when achieving a GC from any state in the
initiation set.
In the current architecture illustrated in
Fig.~{\ref{778079}}, the IS is associated with
an~\emph{Initiation State Tracker} informing the~\emph{Option
Scheduler}~when the~\emph{Current State}~has a high probability of
belonging to the IS, i.e. when the agent can start practicing the GC for
improving the associated~\emph{Policy}. Similarly, a GC is associated
with a~\emph{Goal State Tracker}~informing the~\emph{Option Scheduler}
when the~\emph{Current State} has a high probability of belonging to
the GC, i.e. when the agent has to stop executing the associated policy.
However, not all initial states are of equal value for improving the
policy given the current state of knowledge of the agent. The goal is
usually easier to reach from initial states which are closer to it (in
terms of the metric learned by the GC). Moreover, the policy learned
from these closer initial states is likely to be useful when starting
from states that are further away from the goal. The aim of this
document is to propose a method for estimating which initial states are
currently relevant for improving a given policy according to the
previous experience of the agent. This is represented in
Fig.~{\ref{778079}}~by the box
labelled~\(\theta\) in the~\emph{Initiation State Tracker},
corresponding to a threshold on the initial state distance to the goal
which is adapted after each episode according to the outcome of the
episode (a~\emph{success}~if the agent was able to reach the goal within
the maximal number of time steps allowed, a~\emph{failure}
otherwise).
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/arch/arch}
\caption{{Architecture overview.
{\label{778079}}%
}}
\end{center}
\end{figure}
\par\null
\section*{Notations}
{\label{835185}}
This document will use the following notations for the concepts defined
in the previous section:
\begin{tabular}{l|l}
{$G$} & The set of available GCs \\
$I(g)$ & The set of initial states for a GC $g\in G$ \\
$p_g(s)$ & The probability that state $s$ belongs to GC $g$ \\
$d_g(s)$ & The distance to GC $g$ (as inferred by the classifier) of state $s$ \\
$p_{I(g)}(s)$ & The probability that state $s$ belongs to the initiation set $I(g)$ \\
\end{tabular}
\section*{Modes}
{\label{374425}}
Two modes are available:
\begin{itemize}
\tightlist
\item
\textbf{Dedicated practice}, where a unique GC is selected by the
user. The agent then learns a policy by being teleported to a
random~state~\( s\in I(g)\)~at the start of each episode.
\item
\textbf{Autonomous practice}, where the robot executes an exploration
routine for visiting places of interest (e.g. object and room
locations). For each available GC~\(g \in G\), an initiation
state tracker continuously returns the probability that the current
state~\(s\) belongs to the initiation
  set~\(I(g)\), i.e. returns~\(p_{I(g)}(s)\). At each time
  step, the probability of starting an episode for a given GC depends on
  all the~\(\{p_{I(g)}(s) \mid g \in G\}\), as well as the probability of
  continuing the exploration routine (a hyperparameter decaying with time).
\end{itemize}
The aim of the curriculum learning feature described below is to
modulate the probabilities~\(p_{I(g)}\) according to the current
level of competence of the agent, allowing the rejection of initial
states which are expected to be too difficult to be useful for
improving the policy.
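As an illustration, the episode-start decision in the autonomous practice mode could be sketched as follows (a minimal sketch only; the function and variable names are hypothetical and do not refer to the actual implementation):

```python
import random

def choose_episode(p_init, p_continue):
    """Pick a GC to practice, or None to continue the exploration routine.

    p_init: dict mapping each available GC g to p_I(g)(s), the probability
            that the current state s belongs to the initiation set I(g).
    p_continue: probability weight of continuing the exploration routine
                (a hyperparameter decaying with time).
    """
    weights = dict(p_init)
    weights[None] = p_continue  # None stands for the exploration routine
    total = sum(weights.values())
    r = random.uniform(0.0, total)
    acc = 0.0
    for option, weight in weights.items():
        acc += weight
        if r <= acc:
            return option
    return None  # numerical fallback
```

Each option is drawn proportionally to its weight, so a GC whose initiation set the current state clearly belongs to is practiced more often, while a decaying \(p_{continue}\) gradually shifts time from exploration to practice.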
\section*{Introducing curriculum
learning}
{\label{126604}}
The positive examples of initial states provided by the user usually do
not require the same level of experience for achieving the goal. Some of
them can be very close to the goal (easy to achieve), others much
further away (difficult, or sometimes impossible, to achieve). In the
current setup described in the previous section, the various levels of
difficulty (in terms of distance to the goal) are not taken into account
when deciding to start an episode: in the dedicated practice mode, an
initial state is selected simply by drawing uniformly from the set of
positive examples; in the autonomous practice mode, the drawing depends
on~\(\{p_{I(g)}(s), \forall g \in G\}\) and therefore does not take the
goal distance into account.
The aim of curriculum learning is to adaptively increase or reduce the
difficulty according to the agent competence to reach a given GC. For
this aim, a multiplier~\(m\) is applied on the initial
state probability~\(p_{I(g)}(s)\):
\(p_{I(g)}(s) \leftarrow m p_{I(g)}(s)\).
The multiplier~\(m\) is a function of the initial distance
to the goal~\(d_g(s)\), as well as a
threshold~\(\theta\)~representing the current level of
competence in achieving the current GC~\(g\):
\(m = f(\theta, d_g(s))\)~.
The function~\(f\) has the role of filtering out initial
states which are too far from the goal compared to the current
threshold. Three functions have been considered:
\begin{itemize}
\tightlist
\item
  Decreasing linear:~\(f(\theta, d) = \max\left(0, \frac{\theta - d}{\theta}\right)\) (filtering out states
  where~\(d > \theta\) and favoring low distances to the goal.)
\item
  Gaussian:~\(f(\theta, d) = e^{-\frac{(d - \theta)^2}{2\sigma^2}}\) (favoring states which are close to the
  threshold, parameterized by a standard
  deviation~\(\sigma\).)
\item
Step:~\(f(\theta, d) = \begin{cases} 1, & \text{if}\ d < \theta \\ 0, & \text{otherwise} \end{cases}\) (keeping equal interest for all states
with~\(d < \theta\))
\end{itemize}
The results presented below use the step function, which has the
advantage of selecting a wide range of initial states (all those
with~\(d < \theta\)), thus favoring generalization of the learned
policy and avoiding catastrophic forgetting.
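For concreteness, the three candidate functions can be written as follows (a sketch; the default value of \(\sigma\) is illustrative only, and note the negative exponent in the Gaussian so that it peaks at \(d = \theta\)):

```python
import math

def f_linear(theta, d):
    # Decreasing linear: zero for d > theta, highest for small distances.
    return max(0.0, (theta - d) / theta)

def f_gaussian(theta, d, sigma=0.1):
    # Gaussian centered on the threshold: favors states close to theta.
    return math.exp(-((d - theta) ** 2) / (2.0 * sigma ** 2))

def f_step(theta, d):
    # Step: equal interest for every state below the threshold.
    return 1.0 if d < theta else 0.0
```

The step function, used in the results below, keeps all states with \(d < \theta\) equally likely to be selected.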
\section*{Threshold adaptation}
{\label{703616}}
After each episode of a given GC, the associated threshold is adapted
according to the result of the episode. We use a method adapted from
the~\href{https://en.wikipedia.org/wiki/Elo_rating_system}{Elo rating
system} used for calculating the relative skill levels of players in
competitor-versus-competitor games such as chess.
At the start of an episode, the expected outcome~\(\hat{e}\) of
the episode is computed as a function of the start distance to the
goal~\(d_g(s)\) and the current threshold~\(\theta\):
\(\hat{e}(d, \theta) = \frac{1}{1 + 10^{\alpha (d - \theta)}}\), where~\(\alpha\) is a hyperparameter
indicating how much the difference~\(d - \theta\) affects the
expected outcome \(\hat{e}\) (\(\alpha=2\) in the
experiments below).
\(\hat{e}(d, \theta)\)~estimates the probability that the episode will be a
success, i.e. that the agent will be able to reach the goal within the
maximum number of time steps allowed, by starting from
distance~\(d\), and given the current
threshold~\(\theta\)~representing the level of competence of
the agent for the current GC.
The threshold value is then updated according to the actual
outcome~\(e\) of the episode, where~\(e=1\)
if the episode was a success (as defined above), or~\(e=0\)
if it was a failure:
\(\theta \leftarrow \theta + k(e-\hat{e})\), where~\(k\)~is a learning rate
hyperparameter (set to 0.1 in the experiments below).
This update rule makes the threshold increase (resp. decrease) when the
outcome of the episode is higher (resp. lower) than expected,
proportionally to the difference~\(e - \hat{e}\).
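Putting the expected-outcome formula and the update rule together (a sketch using the hyperparameter values quoted above; the clamp to a minimum threshold of 0.25 follows the value mentioned in the results section and is otherwise an assumption):

```python
def expected_outcome(d, theta, alpha=2.0):
    """Elo-style probability that an episode starting at distance d
    succeeds, given the current competence threshold theta."""
    return 1.0 / (1.0 + 10.0 ** (alpha * (d - theta)))

def update_threshold(theta, d, success, alpha=2.0, k=0.1, theta_min=0.25):
    """Move theta up or down according to the observed outcome e
    (e = 1 for a success, 0 for a failure)."""
    e = 1.0 if success else 0.0
    e_hat = expected_outcome(d, theta, alpha)
    return max(theta_min, theta + k * (e - e_hat))
```

For example, succeeding from \(d = \theta\) (expected outcome 0.5) raises the threshold by \(k \times 0.5 = 0.05\), while an expected success from a very small distance barely moves it.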
\section*{Results}
{\label{258374}}
This section describes the results obtained in the dedicated practice
mode. For evaluation purposes, only one GC is provided, for approaching
the blue ball. Once the GC has been learned from the positive and
negative examples (see Fig.~{\ref{625945}}), we provide
a dozen initial states spanning a wide range of distances to the goal
(see
Fig.~{\ref{412325}},~{\ref{558756}}
and~{\ref{595642}}). The threshold is initially set to
1. This means that, at the beginning of the experiment, all the provided
initial states can be selected, which is equivalent to running the
experiment without curriculum learning. The threshold is then updated
after each episode as described in the previous section.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/near-to-blue-ball-gc/near-to-blue-ball-gc}
\caption{{A positive example of the GC~\emph{near\_to\_blue\_ball}.
{\label{625945}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/easy/easy}
\caption{{An easy initial state (goal distance around 0.2 as shown in the
bottom-left plot of the bottom-right window).
{\label{412325}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/medium/medium}
\caption{{An initial state with medium difficulty (goal distance around 0.5).
{\label{558756}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/hard/hard}
\caption{{A difficult initial state (goal distance around 1).
{\label{595642}}%
}}
\end{center}
\end{figure}
\par\null
Fig.~{\ref{562312}} shows the result of a dedicated
practice session of 90 episodes. Without curriculum (top plot) we
observe that, while initial states are chosen uniformly, those with a
high initial distance to the goal are rarely successful. With the
curriculum activated (bottom plot), we observe three different phases
self-organizing out of the threshold adaptation. In a first phase (from
the 1st to the 10th episode approx.), the agent experiences many failures
and the threshold decreases accordingly (down to 0.25, which is the
minimum allowed). In a second phase (from the 10th to the 25th episode
approx.), the agent selects initial states with a low distance to the
goal (i.e. below threshold). Note that, from time to time, it can select
states with a distance higher than the threshold: this happens when no
states below the threshold were proposed during 10 consecutive trials
(in that case, the next proposed state is automatically selected to
avoid a blocking situation). Finally, in a third phase (from the 25th to
the last episode), the agent experiences more successes due to the
learning that occurred during the second phase, and the threshold
increases accordingly, allowing the selection of initial states at
increasing distances to the goal. Although we observe more successes with the
curriculum activated, it is not completely clear from
Fig.~{\ref{562312}} that the activation of the
curriculum improves the learning speed of the agent.~
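The acceptance rule just described (step-function filter plus the anti-blocking fallback after 10 consecutive rejections) can be sketched as follows (the function name is hypothetical):

```python
def accept_initial_state(d, theta, rejections, patience=10):
    """Accept a proposed initial state when its goal distance d is below
    the threshold theta, or unconditionally after `patience` consecutive
    rejections (to avoid a blocking situation).
    Returns (accepted, updated consecutive-rejection count)."""
    if d < theta or rejections >= patience:
        return True, 0
    return False, rejections + 1
```

Resetting the counter on every acceptance ensures the fallback only fires after an uninterrupted run of rejected proposals.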
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/curr-nocurr-ep-reached/curr-nocurr-ep-reached}
\caption{{Curriculum learning allows an adaptive modulation of the task
difficulty. Both plots show the evolution of the threshold, the initial
goal distance and the average reward across episodes when learning how
to reach a single GC (\emph{near\_to\_blue\_ball}).~Top: Without
curriculum (\(\theta = 1\) for all episodes). Bottom: With
curriculum (\(\theta\) initialized at 1, then updated after each
episode as described in the previous section). The X-axis corresponds to
the episode number and the Y-axis to the three measured values (the
average reward is divided by 10 to improve readability). Vertical
grey lines indicate the successful episodes where the agent was able to
achieve the goal within the maximum number of time steps allowed (here
100). Horizontal grey lines indicate the initial goal distance of those
successful episodes.
{\label{562312}}%
}}
\end{center}
\end{figure}
\par\null
Fig.~{\ref{528901}} shows the results of the same
experiment as in Fig.~{\ref{562312}} but considering
time on the X-axis as the total number of time steps, instead of the
number of episodes (an episode can run during 100 time steps maximum).
That way we can compare the two conditions on the same number of time
steps (``true'' time). Here we observe that within the same period of
4000 time steps, the agent is able to reach the goal from a wider range
of initial distances when the curriculum is activated. This is partly
due to the fact that it does not waste time trying to reach the goal
from initial states which are at very high distances, where it has
little or no chance of success.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/curr-nocurr-truetime-reached/curr-nocurr-truetime-reached}
\caption{{Curriculum learning increases the learning speed in terms of the total
number of time steps. This figure uses the same data and follows the
same convention as in Fig.~{\ref{562312}}. However, the
X-axis now shows the total number of time steps since the start of the
experiment (instead of the number of episodes).
{\label{528901}}%
}}
\end{center}
\end{figure}
\par\null
Finally, Fig.~{\ref{409857}} shows the distribution of
episode rewards according to the initial distance to the goal. We
observe that with the curriculum activated (bottom plot), the agent is
able to achieve better average performance for starting distances
between 0.2 and 0.8, as compared with no curriculum (top plot).
Moreover, the agent does not waste time in~attempting to achieve the
goal from distances that are too high (higher than 0.8), where it has
little or no chance of success.~\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/dist-reward-hist/dist-reward-hist}
\caption{{Curriculum learning allows the agent to reach the goal from higher
distances. Top: Without curriculum. Bottom: with curriculum. In each
plot, each cell color represents the number of episodes where the agent
obtained the reward indicated on the Y-axis while starting from a state
at the distance to the goal indicated on the X-axis.
{\label{409857}}%
}}
\end{center}
\end{figure}
As in Fig.~{\ref{528901}}, we now compute the histogram
on the same experiment data, but considering an equal number of time
steps in the two conditions (Fig.~{\ref{570069}}).
Since there are many more failures without the curriculum (see
Fig.~{\ref{562312}}), this reduces the number of
considered episodes in the condition without curriculum (top plot). The
effect previously mentioned is stronger: in the condition without
curriculum, the agent is actually quite bad at reaching goals at a
distance between 0.4 and 0.6 and does not have time to select initial
states at a distance between 0.6 and 0.8 (of which there are only a few).
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/dist-reward-hist-truetime/dist-reward-hist-truetime}
\caption{{Same as the previous figure, but the number of considered episodes in
the top plot (without curriculum) is reduced to represent the same
number of time steps as in the bottom plot (with curriculum). Be aware
of the different scaling in the colorbars.
{\label{570069}}%
}}
\end{center}
\end{figure}
\par\null
\section*{Conclusion}
{\label{606802}}
This preliminary version of the curriculum learning feature shows that
it can improve the learning speed, in particular when some initial
states provided by the user are too far from the goal. A sort of
developmental sequence, first reducing the task complexity and then
increasing it progressively according to the agent's experience,
self-organizes out of this mechanism. However, both further analysis
and improvements of the curriculum mechanism are required to fully
validate the utility of this feature.
A possible drawback of this approach (and of all approaches constraining
the experience of the agent) is an overfitting of the policy for the
current states of interest. This was the case when using a Gaussian
function for computing the multiplier (see
section~{\ref{126604}}), although the corresponding
results are not shown in this document. The use of a step function
helps in this respect, ensuring that the agent will be trained on all
initial states with a distance below the threshold.
Future lines of research (all considering the autonomous practice
mode) include:
\begin{itemize}
\tightlist
\item
  Using the cumulative reward instead of the episode success/failure.
\item
Introducing a goal selection mechanism (current proposal uses learning
progress estimation).
\item
Actively moving to initial states where progress is expected.
\item
Extending the time horizon, i.e. an optimal planning of states to
visit which will maximally improve learning on average.
\item
  Improving the default behavior to ensure the visit of all
  potentially interesting states. Information-theoretic methods based on
the concept of Empowerment are interesting candidates (see
\href{https://arxiv.org/abs/1310.1863}{an overview}).
\end{itemize}
\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}