Synthetic Voice Spoofing Detection Based On Online Hard Example Mining
Chenlei Hu, Ruohua Zhou
The automatic speaker verification spoofing (ASVspoof) challenge series
has been crucial in raising awareness of spoofing and in driving the
development of countermeasures. Although recent ASVspoof 2019 results
show that most attacks can be identified reliably, detection
performance remains poor for some attacks. This paper applies the
Online Hard Example Mining (OHEM) algorithm to the detection of unknown
voice spoofing attacks. OHEM is used to overcome the imbalance between
simple and hard samples in the dataset. The presented system achieves
an equal error rate (EER) of 0.77% on the evaluation set of the
ASVspoof 2019 Challenge logical access scenario.
Introduction: Automatic speaker verification (ASV) systems,
which identify speakers by voice, are now widely used in engineering
applications [1]. However, with recent advances in artificial
intelligence and hardware technology, ASV systems are threatened by
various voice spoofing attacks. Attackers target ASV systems and deceive
human listeners by creating spoofed voices, such as mimicry, replay, and
synthetic voice attacks (involving text-to-speech (TTS) and voice
conversion (VC)), to impersonate real users [2]. Reliable anti-spoofing
systems are required to reduce or eliminate the risk of fraud to ASV
systems and human users.
Most spoofing countermeasure systems comprise a front end and a back end
[3]. Traditional front ends use digital signal processing algorithms,
such as linear frequency cepstral coefficients (LFCC) [4] and the
Constant-Q transform (CQT) [5], to extract acoustic features. Some
studies also indicate that more discriminative acoustic features can be
extracted for anti-spoofing tasks by making front-end feature extraction
part of the trained model. For example, a DNN can determine the center
frequencies of the filters in the filter bank [6]. As for the back ends,
many studies have adopted convolutional neural networks and loss
functions [7] developed for face verification and image classification
tasks. For example, Lavrentyeva et al. [8] utilized a novel
convolutional network, LCNN, for replay and synthetic speech
identification. Li et al. [9] applied a new variant of the residual
network to replay and synthetic spoof detection.
Previous systems generalize poorly to unseen spoofing attacks at
evaluation time [10]. This paper employs OHEM to develop an
anti-spoofing system that can detect unknown synthetic voice spoofing
attacks. OHEM is a commonly used sampling approach that screens the
training examples, selecting those that make training of the
anti-spoofing system more efficient. This work focuses on combining the
OHEM algorithm with the anti-spoofing model.
The rest of this paper is organized as follows. The application of the
OHEM algorithm to synthetic speech spoofing detection is described in
Section 2. The details of the experimental design and results are given
in Section 3. Section 4 concludes the paper.
Methods: For network training, the objective function directly
defines the latent mapping that the network should fit. The quality of
the samples that contribute to the objective function therefore
determines how well the network achieves its goal. In a limited training
set, we observe that negative samples outnumber positive ones, which can
bias the loss function. Moreover, most negative samples are easily
classified, so simple samples greatly outnumber hard samples, and the
loss of a simple sample is close to zero, much lower than that of a hard
one.
Therefore, an online hard example mining (OHEM) anti-spoofing model is
proposed, which employs the OHEM strategy to tune the objective function
and selectively search for informative hard negative samples, thereby
improving the training efficiency significantly. Motivated by [11],
the proposed OHEM strategy excludes non-informative samples from the
loss during training to alleviate the imbalance between simple and hard
samples. The proposed OHEM comprises three main steps in each training
iteration. First, the number of hard negative examples to keep is
selected. This number is set relative to the batch size as N/4, where N
represents the number of samples in the mini-batch. Then, the N training
samples are sorted according to their prediction scores. Finally, our
loss function computes only the top N/4 sample losses and discards the
rest. In this way, simple negative samples are ignored and training
focuses only on the hard samples.
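The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary cross-entropy loss, ranks samples by their per-sample loss (a common way to realize sorting by prediction score), and uses hypothetical names (`bce_loss`, `ohem_loss`, `keep_ratio`) and toy scores. Labels are assumed to be 1 for bona fide and 0 for spoofed speech, and each score is the predicted probability of bona fide.

```python
import math


def bce_loss(prob_bonafide, label):
    """Binary cross-entropy for one sample (assumed loss, not from the paper)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_bonafide, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def ohem_loss(probs, labels, keep_ratio=0.25):
    """Average the loss over only the hardest keep_ratio share of the batch."""
    n = len(probs)
    # Step 1: number of hard examples to keep (N/4 by default).
    k = max(1, int(n * keep_ratio))
    # Step 2: rank the N samples, hardest (largest loss) first.
    losses = [bce_loss(p, y) for p, y in zip(probs, labels)]
    ranked = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Step 3: keep the top k losses and discard the rest.
    hardest = ranked[:k]
    return sum(losses[i] for i in hardest) / k, sorted(hardest)


# Toy mini-batch of N = 8 samples: six are classified easily, two are hard.
probs = [0.01, 0.02, 0.98, 0.40, 0.03, 0.99, 0.70, 0.05]
labels = [0, 0, 1, 1, 0, 1, 0, 0]

loss, kept = ohem_loss(probs, labels)
plain_mean = sum(bce_loss(p, y) for p, y in zip(probs, labels)) / len(probs)
print(kept, round(loss, 4), round(plain_mean, 4))
```

In this toy batch, only the two samples with the largest losses contribute to the objective, whereas the plain mean is dominated by the six near-zero losses of the simple samples. In a deep learning framework, the same selection would typically be applied to the unreduced per-sample loss tensor before back-propagation.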
Research process: This section describes the ASVspoof 2019 LA
dataset, the front-end features, and the back-end network models used in
the experiments. In addition, the results of four sets of experiments
and of the fusion system are given.
Datasets
All experiments were performed using the ASVspoof 2019 Logical Access
(LA) database. Table 1 gives a detailed breakdown of its subsets. The
training and development sets share the same six attacks, which comprise
two VC and four TTS algorithms. The evaluation set contains 11 unknown
attacks (A07-A15, A17, A18), including combinations of different TTS and
VC attacks. The evaluation set also includes two attacks (A16, A19) that
use the same algorithms as two of the attacks (A04, A06) in the training
set but were trained with different data [10].
Table 1: ASVspoof 2019 LA Datasets