Prezi Scale challenge

Problem description

Prezi uses virtual machines (VMs) to complete three different kind of jobs:

  • export

  • url

  • default

These jobs have pre-determined processing times, which are distributed according to some general distribution \(G\) (see figures). Each job type has a corresponding type of VM. VM running costs are charged in hourly increments. This can be represented using a cost function \(C(t)\) where \(t\) is the running time of a server in seconds

\[C(t) = \left\lceil \frac{t}{3600}\right\rceil\]

VMs can be turned on and off, but it takes 2 minutes (120 seconds) for a virtual machine to start. This time is charged.

The objective is to minimize the total running cost of the servers, while ensuring that all jobs are handled within 5 seconds. If a job waits for more than 5 seconds, a penalty is incurred. A job is not allowed to wait for more than 2 minutes. The holding cost of a job \(H(t)\) is defined

\[H(t) = \max\left(\left\{ 0, \frac{t - 5}{40} \right\} \right)\]

Two datasets are given, each spanning a period of 48 hours, starting at 10pm GMT (so 2/5pm on the West/East coast). They are of the form timestamp job-duration unique-id queue , for example 1378072800 2.133 9ea24acd-cee3-4248-bf43-fb33e7e9ac4b url with 475858 and 669642 jobs in the respective datasets.


Arrival rate

We can assume our job arrivals to be governed by a Poisson process with parameter \(\lambda\). However, our arrival times differ depending on the time of day (e.g. less jobs at night). The arrivals can then be modelled as an inhomogeneous Poisson process \(\lambda = f(t)\), where we need to determine \(f(t)\) from the data given.

Some days are busier than others, so we normalize the given data (dividing the number of arrivals at each time step by the number of arrivals in 24 hours) and write \(\lambda = R_i f(t)\) where \(R_i\) is the relative business of the day.

The easiest approach is to assume \(f(t)\) is piecewise linear. For a given segment we can then simply use the fact that the maximum likelihood estimator (MLE) for a homogeneous Poisson population of \(n\) samples is its average

\[\widehat{\lambda} = \frac{1}{n}\sum_{i=1}^n k_i.\]

Perhaps it is a good idea to use maximum likelihood estimation to fit the estimates from the 4 (or 3, if we want a hold-out set) to a normal distribution \(\lambda\sim\mathcal{N}\left(a,b\right)\) so that we can get a confidence interval for the value of \(\lambda\) at each time of day. There are also papers out there which talk about more advanced methods of fitting an inhomogeneous Poisson process. This one has a nice overview (section 2). We also need to pick a timeframe over which to average (i.e. how long are the linear segments). If you don’t average enough you get very noisy data (see this and this graph for the extreme case where there is no averaging at all), and if you average too much you might miss trends (if you average over the entire 24 hours you just get a straight line).

Another method (but complicated one) is in (Shen 2008). Not sure whether it will work as well with only 4 realizations though.

Our final algorithm could use the \(f(t)\) we calculated together with information on how busy the day has been so far to estimate how many servers we need e.g. 2 minutes from now. Perhaps this forecasting would have to be combined with simple extrapolation in order to deal with unusual cases (for example the dark blue line in the figure below). Also, see this paper, which does exactly this (forecast inhomogeneous Poisson processes using the information from past days as well as the current day).