Figure 1. Experiment structure. a. Structure
of a single learning block, consisting of one 20 s acquisition trial,
followed by 6 test trials. During acquisition trials, participants
either actively explored (agent condition) or passively observed
(observer condition) the relationships between movement directions of a
cursor and 8 different sound stimuli. In test trials, participants were
tested on their memory of the associations. b. Structure of a
contingency block. Each contingency block consisted of 7 learning
blocks. The first three were considered the “early learning stage”,
and the last three were considered the “late learning stage”. c. Structure of the experiment, consisting of 14 contingency blocks: 7 belonged to the agent condition and 7 to the observer condition.
To make the cursor move in a “gaze-like” style in the observer condition, it was computer-animated using the participant’s own eye movements recorded during the acquisition trials of the preceding agent contingency block. If the experiment started with the observer condition, we used the eye movement recordings from the training block, which always involved active exploration. To make the replayed eye movements less recognisable to the participant, we randomized the order of the previously recorded trials across the learning blocks.
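As an illustration of this replay logic, the shuffling can be sketched as follows (a minimal sketch in Python with assumed names, not the actual experiment code, which was implemented in MATLAB):

    import random

    def build_observer_traces(recorded_traces):
        # recorded_traces: one 20 s gaze trace per acquisition trial of the
        # preceding agent contingency block (or of the training block)
        traces = list(recorded_traces)
        random.shuffle(traces)  # randomize trial order across learning blocks
        return traces           # traces[i] animates the cursor in block i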
Training
Before starting the experiment, participants underwent two stages of training. The first was a “free training” session, whose purpose was to adjust the eye tracker, let participants familiarise themselves with the equipment, and teach them how to use the gaze-controlled cursor. Participants sat facing a screen, their head position stabilized for eye tracking by a chin and forehead rest placed 60 cm from the screen, and they wore a pair of headphones connected to the experiment computer.
Participants were then instructed to move their gaze across the screen and “explore” the sounds they could trigger by moving the cursor (for details, see section “Visual stimulation and gaze-controlled sound generation”). During the free training, the experimenter ensured that the participant understood how to use the gaze-controlled cursor and was familiar with the experiment structure. The duration of the free training was variable but typically around 5 minutes.
The subsequent “structured
training” followed the same pattern as an agent experimental block, but
with only 3 instead of 6 test
trials.
Visual stimulation and gaze-controlled sound generation
Before the start of the free training and before every agent experimental block, the eye tracker was calibrated by collecting fixation samples from known target points in order to map raw eye data to the participant’s gaze position (standard built-in EyeLink calibration procedure). After successful calibration, the experiment screen
appeared: a grid of 9 red squares over a black background. Each red
square’s side had a visual angle of 5° 18’ 0.99”, with gaps of 1° 28’
0.39” between squares. The center of each red square was marked by a
small black square with a side length of 0° 49’ 0.11”. The gaze
position of the participant appeared on the screen as a white dot
(radius = 0° 19’ 0.64”). A fixation on a square was registered when the gaze rested within 0° 29’ 0.47” of the square’s edges.
The distance between the chin and forehead rest and the screen was 60
cm, as suggested by the EyeLink 1000 user manual, which translates to an
eye-screen distance of about 70 cm.
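For concreteness, these visual angles can be converted to on-screen sizes using the approximately 70 cm eye-screen distance; the sketch below is a Python illustration of the standard geometry, not part of the original setup code:

    import math

    def angle_to_cm(deg, minutes, seconds, eye_screen_cm=70.0):
        # size subtended on screen by a visual angle at the given distance
        theta = math.radians(deg + minutes / 60 + seconds / 3600)
        return 2 * eye_screen_cm * math.tan(theta / 2)

    square_side = angle_to_cm(5, 18, 0.99)  # red square side, about 6.5 cm
    gap = angle_to_cm(1, 28, 0.39)          # gap between squares, about 1.8 cm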
During the free training, the structured training and the agent
experimental condition, participants were able to generate sounds by
moving their gaze from one square on the screen to another, adjacent
square. The possible movement directions that could trigger a sound
were: vertical up and down, horizontal left and right, and diagonal
up-right, up-left, down-right, and down-left. To trigger a sound, the participant had to move their gaze from one square to another and fixate the target square for 750 ms. If the fixation was interrupted before this 750 ms delay period ended, no sound was played.
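This trigger logic amounts to a simple dwell-time rule. The following is a minimal sketch (in Python; the original implementation was in MATLAB, and play_sound and sound_for_direction are hypothetical helpers):

    DWELL_MS = 750  # required fixation duration before a sound is played

    def movement_direction(src, dst):
        # direction between adjacent grid squares, or None if not adjacent;
        # squares are (row, col) indices into the 3 x 3 grid
        if src is None or dst is None:
            return None
        names = {(-1, 0): "up", (1, 0): "down", (0, -1): "left",
                 (0, 1): "right", (-1, 1): "up-right", (-1, -1): "up-left",
                 (1, 1): "down-right", (1, -1): "down-left"}
        return names.get((dst[0] - src[0], dst[1] - src[1]))

    def on_gaze_sample(square, now_ms, state):
        if square != state["current"]:
            # gaze left the square before 750 ms elapsed: restart the dwell
            # timer; an interrupted fixation plays no sound
            state.update(current=square, onset=now_ms, played=False)
        elif square is not None and not state["played"] \
                and now_ms - state["onset"] >= DWELL_MS:
            # completed fixation: derive the movement direction from the
            # previously fixated square and play the associated sound
            direction = movement_direction(state["last"], square)
            if direction is not None:  # only adjacent squares count
                play_sound(sound_for_direction[direction])  # hypothetical
            state["last"] = square
            state["played"] = True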
Sound stimuli
Sound stimuli were synthesized speech sounds created with the Google text-to-speech API in Python, set to a male Spanish speaker, with a sampling rate of 16000 Hz. The sound stimuli were then manually manipulated in Praat (Boersma, 2002) using the Vocal Toolkit to have the same duration and a flat pitch. Sounds were normalized and resampled to 96000 Hz. Each sound was a 500 ms consonant-vowel (CV) syllable delivered at 70 dB, formed by a random combination of one of 8 pitches, one of 8 consonants, and one of 5 vowels. The pitch (in Hz) was 90, 120, 150, 180, 210, 240, 270 or 300; the consonant was [f], [g], [l], [m], [p], [r], [s] or [t]; the vowel was [a], [e], [i], [o] or [u]. For each participant, 14 sets of 8 different sounds were generated. In each contingency block, the 8 sounds were randomly paired with the 8 possible movement directions.
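The combinatorics of this design can be illustrated as follows (a sketch using only the values reported above; synthesis and Praat processing are omitted, and the direction labels are assumed names):

    import random

    PITCHES_HZ = [90, 120, 150, 180, 210, 240, 270, 300]
    CONSONANTS = ["f", "g", "l", "m", "p", "r", "s", "t"]
    VOWELS = ["a", "e", "i", "o", "u"]
    DIRECTIONS = ["up", "down", "left", "right",
                  "up-right", "up-left", "down-right", "down-left"]

    def make_sound_sets(n_sets=14, set_size=8):
        # each sound is a random pitch/consonant/vowel combination
        pool = [(p, c, v) for p in PITCHES_HZ
                for c in CONSONANTS for v in VOWELS]
        return [random.sample(pool, set_size) for _ in range(n_sets)]

    def pair_with_directions(sound_set):
        # one contingency: the 8 sounds randomly assigned to 8 directions
        return dict(zip(DIRECTIONS, random.sample(sound_set, len(sound_set))))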
Apparatus
Visual stimuli were presented on a monitor driven by an ATI Radeon HD 2400 graphics card, and auditory stimuli via Sennheiser KD380 PRO noise-cancelling headphones. A MIDI keyboard, the Korg nanoPAD2, was used to record participants’ responses; this device was chosen because its key presses do not produce any sound. Stimulus presentation and response recording were controlled using MATLAB R2017a (The MathWorks Inc.), the Psychophysics Toolbox extension (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997), and the EyeLink add-in toolbox for eye tracker control.
EEG was recorded using Curry 8 Neuroscan software and a Neuroscan
SynAmps RT amplifier (NeuroScan, Compumedics, Charlotte, NC, USA).
Continuous DC recordings were acquired using Ag/AgCl electrodes attached
to a nylon cap (Quick-Cap; Compumedics, Charlotte, NC, USA) at 64
standard locations following the 10% extension of the international
10-20 system (Chatrian, Lettich, & Nelson, 1985; Oostenveld &
Praamstra, 2001). Additional electrodes were placed on the tip of the nose (online reference) and above and below the left eye (vertical electrooculogram, VEOG). Two further electrodes, referenced to the common reference, were placed next to the outer canthi of both eyes (horizontal electrooculogram, HEOG). The ground electrode was located at AFz. Impedances were required to remain below 10 kΩ throughout the recording session, and data were sampled at 500 Hz.
Horizontal and vertical gaze positions of the left eye were recorded using the EyeLink 1000 desktop mount (SR Research) at a sampling rate of 1000 Hz.
Behavioural data analysis
We analysed the percentage of correct responses (%Correct) to the question of whether the movement-sound pair presented in a test trial was congruent (“Did they match?”). Missing responses were counted as incorrect. Test trials presenting unseen movement-sound pairs were excluded from the analysis to avoid forced guessing. After this exclusion, we calculated each participant’s %Correct per learning block, distinguishing between associations acquired in the agent and in the observer condition. We performed a repeated-measures ANOVA with the factors agency (agent/observer) and learning block (seven levels). During the initial stages of learning, participants were expected to perform poorly on the memory task because of their limited exposure to the associations; during late stages, they were expected to be proficient.
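A minimal analysis sketch in Python with pandas and statsmodels (the column names are assumptions; the original analysis pipeline is not specified here):

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    def percent_correct(trials: pd.DataFrame) -> pd.DataFrame:
        # trials: one row per test trial with columns subject, agency
        # ('agent'/'observer'), block (1-7), seen_pair (bool), and correct
        # (bool, with missing responses already coded as False)
        kept = trials[trials["seen_pair"]]  # exclude unseen pairs
        return (kept.groupby(["subject", "agency", "block"])["correct"]
                    .mean().mul(100).rename("pcorrect").reset_index())

    def rm_anova(pc: pd.DataFrame):
        # repeated-measures ANOVA: agency (2 levels) x learning block (7)
        return AnovaRM(pc, depvar="pcorrect", subject="subject",
                       within=["agency", "block"]).fit()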
EEG data analysis
Preprocessing
EEG data were preprocessed using EEGLAB (Delorme & Makeig, 2004). After a high-pass filter was applied to the data (0.5 Hz high-pass, Kaiser window, Kaiser β 5.653, filter order 1812), the continuous recording of each participant was inspected, and non-stereotypical artefacts were manually rejected. Then, eye movement artefacts were removed from the data using Independent Component Analysis (SOBI algorithm): independent components representing eye movements were rejected based on visual inspection, and the remaining components were projected back into electrode space. A low-pass filter was applied (30 Hz low-pass, Kaiser window, Kaiser β 5.653, filter order 1812), and malfunctioning electrodes were interpolated (spherical interpolation). A −100 ms to 500 ms epoch was defined around each sound in both acquisition and test trials (−100 to 0 ms baseline correction). Epochs with a signal change exceeding 75 μV were rejected as remaining artefacts. Participant-level averages were calculated for each event of interest, as well as grand averages across all participants. We obtained ERPs for acquisition sounds in the agent and observer conditions, as well as for the early (blocks 1 to 3) and late (blocks 5 to 7) learning stages. For test sounds, we calculated averaged ERPs for sounds acquired in the agent versus the observer condition, for early versus late learning stages, and for congruent versus incongruent test sounds (relative to the movement-sound associations learned in acquisition trials). The mean number of trials per subject-level average was 361 (standard deviation: 185).
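For a code-level view, the pipeline translates approximately into MNE-Python as sketched below. This is only an approximate stand-in: the authors used EEGLAB with Kaiser-window FIR filters and the SOBI algorithm, for which the calls below substitute MNE’s default FIR design and extended-infomax ICA.

    import mne

    def preprocess(raw: mne.io.Raw, events, event_id):
        raw.filter(l_freq=0.5, h_freq=None)      # 0.5 Hz high-pass
        # manual rejection of non-stereotypical artefacts is not shown
        ica = mne.preprocessing.ICA(method="infomax",
                                    fit_params=dict(extended=True))
        ica.fit(raw)
        ica.exclude = []                         # ocular component indices,
                                                 # chosen by visual inspection
        ica.apply(raw)
        raw.filter(l_freq=None, h_freq=30.0)     # 30 Hz low-pass
        raw.interpolate_bads()                   # spherical interpolation
        epochs = mne.Epochs(raw, events, event_id,
                            tmin=-0.1, tmax=0.5,
                            baseline=(None, 0),      # -100 to 0 ms baseline
                            reject=dict(eeg=75e-6))  # 75 uV peak-to-peak
        return {cond: epochs[cond].average() for cond in event_id}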
Statistical analyses
For both acquisition and test sounds, statistical comparisons were conducted to assess the effects of agency and learning stage and their interaction. For test sounds, we additionally analysed the effect of congruency and its interactions with the factors agency and learning stage.
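As a sketch of how such factorial comparisons could look in code (the time windows, electrodes, and exact tests are not specified in this section and are therefore assumptions):

    from statsmodels.stats.anova import AnovaRM

    def erp_anova(amps):
        # amps: one row per subject and condition with columns subject,
        # agency, stage, congruency, and amplitude (mean ERP amplitude in
        # an assumed time window and electrode selection)
        return AnovaRM(amps, depvar="amplitude", subject="subject",
                       within=["agency", "stage", "congruency"]).fit()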