Feature selection
For the hand-crafted features, 60-dimensional LFCC features were extracted from the raw audio with the code provided by the official ASVspoof 2019 competition, using a 20 ms analysis window and a 10 ms window shift. For the models that operate on raw audio, every utterance was fixed to a length of about 4 seconds so that all inputs have the same size.
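The length normalization and framing described above can be sketched as follows. This is a minimal illustration, not the official ASVspoof 2019 code: the 16 kHz sampling rate, the padding of short utterances by repetition, and the helper names `fix_length` and `num_frames` are assumptions made for the example.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000                 # assumption: 16 kHz audio
TARGET_SECONDS = 4.0                 # "about 4 seconds" in the text
TARGET_LEN = int(SAMPLE_RATE * TARGET_SECONDS)  # 64000 samples under the assumption


def fix_length(samples: Sequence[float], target_len: int = TARGET_LEN) -> List[float]:
    """Return exactly target_len samples.

    Long utterances are truncated. Short utterances are repeated until
    they reach the target length (an assumed padding strategy).
    """
    if len(samples) == 0:
        raise ValueError("cannot fix the length of an empty utterance")
    if len(samples) >= target_len:
        return list(samples[:target_len])
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


def num_frames(n_samples: int, win_ms: float = 20.0, hop_ms: float = 10.0,
               sample_rate: int = SAMPLE_RATE) -> int:
    """Count full analysis frames for a given window and shift, without edge padding."""
    win = int(sample_rate * win_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop
```

Under these assumptions, a 20 ms window corresponds to 320 samples and a 10 ms shift to 160 samples. An actual feature extractor may pad the signal edges and therefore produce a slightly different number of frames.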
Model selection
The experiments involve two groups of models. The first is based on residual network variants [12] and comprises Resnet-18, Resnet-50, and SE-res2net [9]. The second is a new variant, Raw-res2net, built on the rawnet2 [8] architecture. Residual networks have been applied to anti-spoofing with excellent results. Table 2 details the proposed architecture. The input is the raw audio, which first passes through a SincNet layer and then through the residual structure. In addition, the original residual block is replaced with a 1D Res2net block. Finally, a GRU is applied before the fully connected layer.
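The hierarchical connectivity that distinguishes a Res2net block from a plain residual block can be sketched in a few lines. The example follows the general Res2Net formulation (y_1 = x_1, y_2 = K_2(x_2), y_i = K_i(x_i + y_{i-1}) for i > 2) and is not the exact configuration of Table 2. It leaves out the surrounding convolutions, BN, the SE layer, and the skip connection. The names `res2net_mix`, `split_channels`, and `add` are hypothetical, and the per-group filters are passed in as plain callables standing in for 1D convolutions.

```python
from typing import Callable, List, Sequence

Signal = List[List[float]]  # layout: channels x time


def split_channels(x: Signal, scale: int) -> List[Signal]:
    """Split the channel axis into `scale` equally sized groups."""
    if scale < 1 or len(x) % scale != 0:
        raise ValueError("channel count must be divisible by scale")
    width = len(x) // scale
    return [x[i * width:(i + 1) * width] for i in range(scale)]


def add(a: Signal, b: Signal) -> Signal:
    """Element-wise sum of two signals with identical shape."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def res2net_mix(x: Signal, filters: Sequence[Callable[[Signal], Signal]]) -> Signal:
    """Apply the Res2Net hierarchical connections.

    `filters` holds K_2 ... K_s, so the scale s equals len(filters) + 1.
    Each filter must preserve the shape of its input.
    """
    groups = split_channels(x, len(filters) + 1)
    outputs = [groups[0]]              # y_1 = x_1 (passed through unchanged)
    prev = None
    for group, k in zip(groups[1:], filters):
        # y_2 = K_2(x_2); later groups also receive the previous output
        inp = group if prev is None else add(group, prev)
        prev = k(inp)
        outputs.append(prev)
    # Concatenate the group outputs back along the channel axis
    return [row for grp in outputs for row in grp]
```

Because each group receives the output of the previous one, later groups see a progressively larger receptive field, which is the motivation for using this block in place of the original residual block.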
Table 2: The architecture of Raw-res2net. BN refers to batch normalization, Cons denotes the convolution block of the Res2net architecture followed by the BN and LeakyReLU operations, and SELayer stands for the squeeze-and-excitation block.