Introduction

Biometric authentication has advanced significantly over the last decade, with exceptional results in the speaker verification task. Speaker verification was established on a probabilistic footing by Reynolds \cite{Reynolds_1995} and has since been improved by more sophisticated approaches [2, 3, 4, 5], mainly based on algorithms from the areas of machine learning and linear algebra. These methodologies are regularly benchmarked in well-known scientific challenges and evaluations such as the NIST SRE (Speaker Recognition Evaluation) [6, 7], the AVSpoof database evaluations [2], and the RedDots challenge [3].
Although a number of different approaches and architectures have been proposed for speaker verification, three major methods can be identified: GMM-UBM [4], HMM-UBM [5], and the more recently proposed i-vectors [6]. In the GMM-UBM method, a Gaussian Mixture Model (GMM) trained on recordings from a large number of speakers serves as a Universal Background Model (UBM), which is then adapted to the target speaker's enrolment recordings, usually via mean-only MAP adaptation. The HMM-UBM method follows a similar methodology, with the difference that the UBM and the target speaker model are hidden Markov models and are thus able to capture temporal information. In the i-vectors approach, the means of the Gaussian components of the GMM-UBM model are concatenated into a supervector, a descriptor of the whole voice input, which joint factor analysis then splits into channel and speaker components. The i-vectors approach achieves state-of-the-art performance in cases where a significant amount of training data is available [7], and these results can be further improved through fusion [8]. However, an important consideration for a successful biometric system is inherent robustness against spoofing attacks [9]. We distinguish inherent robustness, which derives from the quality of the speaker recognition itself, from robustness achieved through specific countermeasures [10]. This paper focuses on the inherent robustness of a specific architecture.
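To make the GMM-UBM adaptation step concrete, the following is a minimal sketch, not the implementation used in the works cited above, of mean-only MAP adaptation followed by supervector extraction. Function and parameter names are illustrative assumptions; the relevance factor r = 16 is a commonly used value, not one taken from this paper.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covars, ubm_weights, features, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM-UBM (sketch).

    ubm_means: (C, D) component means; ubm_covars: (C, D) diagonal covariances;
    ubm_weights: (C,) mixture weights; features: (N, D) enrolment frames.
    Returns the (C, D) adapted means for the target speaker.
    """
    # Per-frame, per-component Gaussian log-likelihoods.
    diff = features[:, None, :] - ubm_means[None, :, :]                    # (N, C, D)
    log_prob = -0.5 * np.sum(diff**2 / ubm_covars
                             + np.log(2 * np.pi * ubm_covars), axis=2)    # (N, C)
    log_prob += np.log(ubm_weights)
    # Posterior responsibilities (softmax over components).
    gamma = np.exp(log_prob - log_prob.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                             # (N, C)
    # Zeroth- and first-order Baum-Welch statistics.
    n_c = gamma.sum(axis=0)                                               # (C,)
    f_c = gamma.T @ features                                              # (C, D)
    # Data-dependent adaptation coefficient; interpolate toward the data.
    alpha = (n_c / (n_c + r))[:, None]
    posterior_mean = f_c / np.maximum(n_c, 1e-10)[:, None]
    return alpha * posterior_mean + (1.0 - alpha) * ubm_means

# The supervector used by the i-vector front end is simply the
# concatenation of the adapted component means:
#   supervector = adapted_means.ravel()
```

Components that are well supported by the enrolment data (large n_c) move toward the speaker's statistics, while rarely visited components stay close to the UBM, which is what makes the adaptation robust with short enrolment recordings.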
Despite more than 30 years of research effort in the area of speaker verification, voice biometric protection against spoofing still leaves room for improvement, since most speaker verification approaches remain vulnerable to spoofing attacks. There are four major types of spoofing attack: impersonation, audio replay, speech synthesis, and voice conversion [9]. Comparisons of spectrograms between genuine and impersonated voice samples have shown that the formants do not closely match [11]. Speech synthesis and voice conversion attacks tend to betray themselves through phase [12] and prosody (F0 statistics) [13] information that is often not speech-like. For audio replay attacks, countermeasures based on uncharacteristically high similarity between recorded inputs [14] and on dissimilarity caused by the specific environmental characteristics of a replayed recording [15] have been proposed. However, when the recording and replay acoustic environment conditions are similar, detecting an audio replay attack is difficult, and voice biometric applications thus remain vulnerable [9]. This paper evaluates the robustness of a specific architecture to replay attacks.
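As an illustration of the similarity-based replay countermeasure idea mentioned above, the following is a minimal sketch, not the method of [14], that flags a trial whose signal is suspiciously close to a previously seen recording. The function names and the threshold of 0.95 are assumptions chosen for illustration.

```python
import numpy as np

def max_normalized_xcorr(a, b):
    """Peak normalized cross-correlation between two equal-length 1-D signals."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return np.max(np.correlate(a, b, mode="full"))

def is_probable_replay(trial, previous_recordings, threshold=0.95):
    """Flag a trial that is near-identical to any stored recording.

    A genuine new utterance is never near-identical to a previous one,
    whereas a replayed recording of an earlier trial typically is.
    """
    return any(max_normalized_xcorr(trial, p) > threshold
               for p in previous_recordings)
```

Searching over all lags with `mode="full"` makes the check robust to a replayed signal being time-shifted relative to the stored copy; for two independent utterances the peak correlation stays far below the threshold.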