Synthetic Voice Spoofing Detection Based On Online Hard Example Mining
Chenlei Hu, Ruohua Zhou
The automatic speaker verification spoofing (ASVspoof) challenge series has been crucial in raising awareness of spoofing and driving the development of countermeasures. Although recent results on ASVspoof 2019 show that most attacks can be identified reliably, detection performance remains poor for some attacks. This paper applies the Online Hard Example Mining (OHEM) algorithm to the detection of unknown voice spoofing attacks. OHEM is used to overcome the imbalance between simple and hard samples in the dataset. The presented system achieves an equal error rate (EER) of 0.77% on the evaluation set of the ASVspoof 2019 Challenge logical access scenario.
Introduction: Automatic speaker verification (ASV) systems, which identify speakers by voice, are now widely used in engineering applications [1]. However, with recent advances in artificial intelligence and hardware technology, ASV systems are threatened by various voice spoofing attacks. Attackers target ASV systems and deceive humans by creating spoofed voices through mimicry, replay, and synthetic voice attacks (text-to-speech (TTS) and voice conversion (VC)) to impersonate real users [2]. Reliable anti-spoofing systems are required to reduce or eliminate the risk of fraud against ASV systems and human users.
Most spoofing countermeasure systems comprise a front end and a back end [3]. Traditional front ends use digital signal processing algorithms, such as linear frequency cepstral coefficients (LFCC) [4] and the constant-Q transform (CQT) [5], to extract acoustic features. In addition, some studies indicate that more discriminative acoustic features can be extracted for anti-spoofing tasks by making front-end feature extraction part of the trained model. For example, a DNN can determine the center frequencies of the filters in the filter bank [6]. For the back end, many studies have adopted convolutional neural networks and loss functions [7] originally used for face verification and image classification tasks. For example, Lavrentyeva et al. [8] used a novel convolutional network, LCNN, for replay and synthetic speech detection. Li et al. [9] applied a new variant of the residual network to replay and synthetic spoof detection.
Previous systems generalize poorly to unseen spoofing attacks at evaluation time [10]. This paper employs OHEM to develop an anti-spoofing system that distinguishes unknown synthetic voice spoofing attacks. OHEM is a commonly used sample-screening approach that selects training examples to improve the efficiency of the anti-spoofing system. This work focuses on combining the OHEM algorithm with the anti-spoofing model.
This paper is organized as follows. Section 2 describes the application of the OHEM algorithm to synthetic speech spoofing detection. Section 3 gives the details of the experimental design and the results. Section 4 concludes the paper.
Methods: In network training, the objective function directly defines the latent mapping that the network should fit. The quality of the samples that contribute to the objective function therefore determines how well the network achieves its goal. We observe that, in a limited training set, the number of negative samples is larger than that of positive ones, which can bias the loss function. Moreover, most negative samples are easily classified, so simple samples far outnumber hard samples, and the loss of a simple sample is much lower than that of a hard one (close to zero).
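The dilution effect described above can be illustrated with a small numerical sketch. The batch composition (90 simple and 10 hard samples) and the predicted probabilities (0.99 and 0.30) are hypothetical values chosen only for illustration, and cross-entropy is assumed as the per-sample loss; none of these numbers come from the experiments in this paper.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy loss of one sample, given the probability
    the model assigns to that sample's true class."""
    return -math.log(p_true)


# Hypothetical mini-batch (illustrative values, not from the paper):
# 90 simple samples predicted with p = 0.99, 10 hard samples with p = 0.30.
easy_losses = [cross_entropy(0.99)] * 90
hard_losses = [cross_entropy(0.30)] * 10
all_losses = easy_losses + hard_losses

# Averaging over the whole batch: the many near-zero losses pull the mean down.
mean_all = sum(all_losses) / len(all_losses)

# Averaging over the hard samples only: the training signal stays large.
mean_hard = sum(hard_losses) / len(hard_losses)

print(f"mean loss over all samples:  {mean_all:.4f}")
print(f"mean loss over hard samples: {mean_hard:.4f}")
```

With these assumed values, the mean over the whole batch is several times smaller than the mean over the hard samples alone, which is the imbalance that motivates restricting the loss to hard samples.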
Therefore, an online hard example mining (OHEM) anti-spoofing model is proposed, which uses the OHEM strategy to adjust the objective function and selectively search for informative hard negative samples to improve the training efficiency significantly. Motivated by [11], the proposed OHEM strategy excludes non-informative samples from the loss during training to alleviate the imbalance between simple and hard samples. The proposed OHEM comprises three main steps in each training iteration. First, the number of hard examples to keep is chosen; it is set to N/4, where N is the number of samples in the mini-batch. Then, the N training samples are sorted according to their prediction scores. Finally, the loss function is computed over only the top N/4 samples, and the rest are discarded. In this way, simple negative samples are ignored and training focuses on the hard samples.
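The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-sample loss is assumed to be cross-entropy, samples are ranked by that loss (which gives the same ordering as ranking by the predicted probability of the true class), and the function name `ohem_loss` and parameter `keep_ratio` are hypothetical.

```python
import math


def ohem_loss(probs_true, keep_ratio=0.25):
    """Mean cross-entropy over the hardest samples of a mini-batch.

    probs_true: probability the model assigns to each sample's true class
                (assumed input format for this sketch).
    keep_ratio: fraction of the batch kept; 0.25 corresponds to N/4.
    Returns (loss, kept_indices).
    """
    n = len(probs_true)
    # Step 1: number of hard examples to keep (N/4, at least one).
    k = max(1, int(n * keep_ratio))
    # Per-sample cross-entropy; a low true-class probability gives a high loss.
    losses = [-math.log(p) for p in probs_true]
    # Step 2: sort the N samples from hardest to easiest.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    # Step 3: average only the top-k losses and discard the rest.
    kept = order[:k]
    loss = sum(losses[i] for i in kept) / k
    return loss, kept


# Usage with a hypothetical batch of N = 8 samples: only the two hardest count.
batch_probs = [0.99, 0.95, 0.90, 0.20, 0.98, 0.60, 0.97, 0.40]
loss, kept = ohem_loss(batch_probs)
print(kept, round(loss, 4))
```

In a gradient-based framework the same selection would be applied to the unreduced per-sample loss tensor, so that the discarded samples contribute no gradient in that iteration.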
Research process: This section describes the ASVspoof 2019 LA dataset, the front-end features, and the back-end network models used in the experiments. In addition, four sets of experimental results and the results of the fusion system are given.
Datasets
All experiments were performed on the ASVspoof 2019 Logical Access (LA) database. Table 1 details its subsets. The training and development sets share the same six attacks, comprising two VC and four TTS algorithms. The evaluation set contains 11 unknown attacks (A07-A15, A17, A18), including combinations of different TTS and VC attacks. It also includes two attacks (A16, A19) that use the same algorithms as two of the training-set attacks (A04, A06) but were trained with different data [10].
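For bookkeeping when per-attack results are reported, the attack partition described above can be written down as a small lookup. This is only an illustrative sketch: the identifiers A01-A06 for the training and development attacks are assumed from the standard ASVspoof 2019 numbering, the pairing of A16 with A04 and A19 with A06 follows the order in which they are listed above, and the helper name `eval_attack_kind` is hypothetical.

```python
# Attacks shared by the training and development sets
# (IDs A01-A06 assumed from the standard ASVspoof 2019 numbering).
TRAIN_DEV_ATTACKS = [f"A{i:02d}" for i in range(1, 7)]

# Attacks in the evaluation set: A07-A19.
EVAL_ATTACKS = [f"A{i:02d}" for i in range(7, 20)]

# Evaluation attacks that reuse a training-set algorithm with different data.
KNOWN_ALGORITHM = {"A16": "A04", "A19": "A06"}


def eval_attack_kind(attack_id: str) -> str:
    """Label an evaluation-set attack as 'unknown' or 'known-algorithm'."""
    if attack_id not in EVAL_ATTACKS:
        raise ValueError(f"{attack_id} is not an evaluation-set attack")
    return "known-algorithm" if attack_id in KNOWN_ALGORITHM else "unknown"


unknown_attacks = [a for a in EVAL_ATTACKS if eval_attack_kind(a) == "unknown"]
print(len(unknown_attacks), unknown_attacks)
```

Grouping evaluation trials with such a lookup makes it easy to report performance on the 11 unknown attacks separately from the two attacks whose algorithms also appear in training.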
Table 1: ASVspoof 2019 LA Datasets