\section{Introduction}
Biometric authentication has significantly advanced over the last decade,
with exceptional results in the speaker verification task. Specifically,
speaker verification was well established by the probabilistic methodology
proposed by Reynolds \cite{Reynolds_1995} and has
since been improved by more sophisticated approaches [2, 3, 4, 5], which are
mainly based on algorithms from the areas of machine learning and linear
algebra. These methodologies are regularly evaluated in well-known scientific
challenges and evaluations such as the NIST SRE (Speaker Recognition
Evaluation) [6, 7], the AVSpoof database evaluations [2], and the RedDots challenge [3].
Although a number of different approaches and architectures have been proposed
for speaker verification, three major methods can be identified: the
GMM-UBM [4], the HMM-UBM [5], and the recently proposed i-vectors [6] approach.
In the GMM-UBM method, a Gaussian Mixture Model (GMM) is used to train a
Universal Background Model (UBM) on recordings from a large number of speakers;
the UBM is then adapted to the target speaker's enrolment recordings, usually
through mean-only adaptation. In the HMM-UBM method, a similar methodology is
adopted, with the difference that the UBM and the target-speaker model are
built with hidden Markov models and are thus able to capture temporal
information. In the i-vectors approach, the means of the Gaussian components of
the GMM-UBM model are concatenated into a super-vector, a descriptor of the
whole voice input, which is then split into channel and speaker components
using joint factor analysis. The i-vectors approach achieves state-of-the-art
performance when a significant amount of training data is available [7].
Despite these gains, results can be further improved with the use of
fusion [8]. However, an important consideration for a successful biometric
system is its inherent robustness against spoofing attacks [9]. We distinguish
inherent robustness, which derives from the quality of the speaker recognition
itself, from robustness achieved through specific countermeasures [10]. This
paper focuses on the inherent robustness of a specific architecture.
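The mean-only adaptation step of the GMM-UBM method can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: the synthetic 2-D features, the component count, and the relevance factor are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical data: background features pooled from many speakers and a
# small enrolment set from the target speaker (2-D features for clarity).
background = rng.normal(size=(2000, 2))
enrolment = rng.normal(loc=1.5, size=(100, 2))

# 1) Train the UBM on the pooled background data.
ubm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
ubm.fit(background)

# 2) Mean-only MAP adaptation: softly align enrolment frames to the UBM
#    components, then shift each component mean toward the enrolment
#    statistics; the relevance factor r controls how far the mean moves.
r = 16.0
post = ubm.predict_proba(enrolment)                  # (frames, components)
n_k = post.sum(axis=0)                               # soft frame counts
e_k = post.T @ enrolment / np.maximum(n_k, 1e-10)[:, None]
alpha = (n_k / (n_k + r))[:, None]                   # adaptation coefficients
adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_

# 3) Score a trial as the average log-likelihood ratio between the
#    adapted speaker model and the UBM (weights and covariances shared).
speaker = GaussianMixture(n_components=4, covariance_type="diag")
speaker.weights_ = ubm.weights_
speaker.means_ = adapted_means
speaker.covariances_ = ubm.covariances_
speaker.precisions_cholesky_ = ubm.precisions_cholesky_
llr = speaker.score(enrolment) - ubm.score(enrolment)
```

Since only the means are adapted, the speaker model stays close to the UBM in regions not covered by enrolment data, which is what makes the log-likelihood-ratio score well behaved with little enrolment material.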
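The super-vector construction and the low-dimensional factorisation behind i-vectors can likewise be illustrated with a toy linear model. All dimensions and the matrix T below are illustrative assumptions, and the noise-free least-squares point estimate stands in for the full posterior-based i-vector extraction:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D, R = 8, 4, 3   # toy sizes: GMM components, feature dim, i-vector dim

# Stacking the C adapted component means yields a fixed-length
# super-vector that describes the whole utterance.
adapted_means = rng.normal(size=(C, D))
supervector = adapted_means.reshape(-1)        # length C * D

# In the i-vector model the utterance super-vector M is written as
# M = m + T w, with m the UBM mean super-vector, T a (trained)
# total-variability matrix, and w the low-dimensional i-vector.
m = rng.normal(size=(C * D,))
T = rng.normal(size=(C * D, R))
w_true = rng.normal(size=(R,))
M = m + T @ w_true

# A least-squares point estimate of w recovers the i-vector exactly in
# this noise-free toy setting.
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
```

The point of the factorisation is that w is a compact, fixed-length representation of the utterance on which channel and speaker variability can then be separated.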
Despite more than 30 years of research effort in the area of speaker
verification, voice biometric protection against spoofing still leaves room
for improvement, since most speaker verification approaches are vulnerable to
spoofing attacks. There are four major types of spoofing attacks, namely
impersonation, audio replay, speech synthesis, and voice conversion [9].
Comparisons of spectrograms between genuine and impersonated voice samples
have shown that the formants do not quite match each other [11]. A weakness of
speech synthesis and voice conversion spoofing attacks is that their
phase [12] and prosody (F0 statistics) [13] information is often not
speech-like. In audio replay
attacks, countermeasures have been proposed based on uncharacteristic
similarity between recorded inputs [14] and on dissimilarity caused by the
specific environmental characteristics of replay attacks [15]. However, when
the recording and replay acoustic environment conditions are similar,
detecting an audio replay spoofing attack is difficult, and voice biometric
applications thus become vulnerable to such attacks [9]. This paper evaluates
the robustness of a specific architecture to replay attacks.
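The similarity-based replay countermeasure mentioned above can be sketched as follows: an access attempt whose spectrum is uncharacteristically close to a previously recorded attempt is flagged as a likely replay. The fingerprint, threshold, and synthetic signals are illustrative assumptions, not the method of [14]:

```python
import numpy as np

def spectral_fingerprint(signal: np.ndarray, n_fft: int = 256) -> np.ndarray:
    """Average magnitude spectrum of fixed-size frames: a crude fingerprint."""
    usable = len(signal) // n_fft * n_fft
    frames = signal[:usable].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def is_suspicious_replay(new_input, stored_inputs, threshold=0.99):
    """Flag an input uncharacteristically similar to a past access attempt."""
    f_new = spectral_fingerprint(new_input)
    return any(
        np.corrcoef(f_new, spectral_fingerprint(past))[0, 1] > threshold
        for past in stored_inputs
    )

rng = np.random.default_rng(2)
genuine = rng.normal(size=8000)                   # a stored genuine attempt
replay = genuine + 0.01 * rng.normal(size=8000)   # near-identical re-capture
fresh = rng.normal(size=8000)                     # an unrelated new attempt
```

The sketch also makes the stated limitation concrete: it only detects replays of inputs the system has already seen, and says nothing about a replay recorded in conditions similar to the enrolment environment.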