Hypotheses
The literature \cite{Feinberg2010, Wechsung2014} suggests that social presence during a complex cognitive task such as this one should impair performance. Accordingly, our hypotheses were the following:
Protocol & Data Collection
As outlined previously, although we planned to run four conditions (alone, human presence, NAO presence, Pepper presence), we first ran the two baseline conditions: alone and human observer. Fifteen participants were recruited for the alone condition and 16 for the human condition.
The experimental setup was similar to that shown in Figure~\ref{fig:setup}, with two differences: when present, the human observer sat at the table facing the participant, and the tablets were replaced with laptops so that participants could enter their answers on a keyboard. For each participant, we recorded how many additions were attempted, the total gain (i.e., the number of correct answers), and the time taken to calculate each addition.
Results
\label{sec:study2-results}
Across the data (31 participants for a total of 633 additions), the average time to dismiss the debug dialogue was 1185\,ms and the average time to provide an answer was 9980\,ms. Based on these values, we conservatively classify a round as cheating when the participant took more than 0.8 seconds to dismiss the spurious debug dialogue, took less than 5 seconds to calculate the sum, and provided a correct answer. This yields 147 cheating rounds (23.2% of all rounds).
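The classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record type \texttt{Round} and its field names (\texttt{dismiss\_ms}, \texttt{answer\_ms}, \texttt{correct}) are hypothetical and do not reflect the actual logging format of the study; only the two thresholds are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable

# Thresholds from the text: more than 0.8 s to dismiss the dialogue,
# less than 5 s to answer.
DISMISS_THRESHOLD_MS = 800
ANSWER_THRESHOLD_MS = 5000


@dataclass
class Round:
    """One addition round (hypothetical record layout)."""
    dismiss_ms: float  # time to dismiss the spurious debug dialogue
    answer_ms: float   # time to provide the answer
    correct: bool      # whether the submitted answer was correct


def is_cheating(r: Round) -> bool:
    """A round counts as cheating only if all three criteria hold."""
    return (
        r.dismiss_ms > DISMISS_THRESHOLD_MS
        and r.answer_ms < ANSWER_THRESHOLD_MS
        and r.correct
    )


def cheating_rate(rounds: Iterable[Round]) -> float:
    """Fraction of rounds classified as cheating (0.0 for an empty input)."""
    rounds = list(rounds)
    if not rounds:
        return 0.0
    return sum(is_cheating(r) for r in rounds) / len(rounds)
```

Requiring all three criteria at once is what makes the rule conservative: a fast correct answer with a quickly dismissed dialogue, or a slow dismissal followed by a wrong answer, is not counted.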
Looking at these results per condition, we find 77 cheating rounds out of 316 in the human condition (24.4%) and 70 cheating rounds out of 317 in the alone condition (22.1%). TBD: T-Test. This result shows that (1) participants do cheat relatively often, but (2) the presence of a human observer does not significantly affect their cheating behaviour, providing no support for H1.
In terms of performance, participants in the human presence condition gave 28 wrong answers out of 239 rounds with no cheating (11.7%), while participants in the alone condition gave 25 wrong answers out of 247 (10.1%). TBD: again, T-test. Again, there is no significant performance difference between the two conditions, providing no support for H2. Therefore, neither of our hypotheses is supported. Given the absence of any effect between the human and alone conditions, we did not pursue the study with robots.
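As a rough check on the round-level counts reported above, the two conditions can be compared with a pooled two-proportion $z$-test. The sketch below is an assumption on our part and not the planned analysis: it treats every round as an independent observation, although rounds are nested within participants, so it is no substitute for the per-participant tests marked as TBD. The function name \texttt{two\_proportion\_z\_test} is illustrative; only the counts come from the text.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumes independent observations, which is only an approximation here
    because several rounds come from the same participant.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Cheating rounds: human condition (77/316) vs. alone condition (70/317).
cheating_z, cheating_p = two_proportion_z_test(77, 316, 70, 317)

# Wrong answers among non-cheating rounds: human (28/239) vs. alone (25/247).
errors_z, errors_p = two_proportion_z_test(28, 239, 25, 247)

print(f"cheating: z = {cheating_z:.2f}, p = {cheating_p:.3f}")
print(f"errors:   z = {errors_z:.2f}, p = {errors_p:.3f}")
```

A per-participant analysis (for instance, comparing each participant's cheating rate across the two groups) would respect the nesting of rounds within participants and is the more appropriate basis for the final claim.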