Feature selection
For the hand-crafted features, the code released for the official ASVspoof
2019 challenge was used to extract 60-dimensional LFCC features from the
raw voice data, using a 20 ms window and a 10 ms window shift. For the
raw-audio input, each utterance was fixed to a length of about 4 seconds
so that all utterances have the same size.
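The length normalization and the framing arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing code used in the experiments: the 16 kHz sampling rate, the repeat-padding strategy for short utterances, and the helper names `fix_length` and `num_frames` are all assumptions introduced here, and the frame count uses a simple no-padding convention that the LFCC extractor may not share.

```python
SAMPLE_RATE = 16000             # assumption: sampling rate in Hz
TARGET_LEN = 4 * SAMPLE_RATE    # about 4 seconds of audio, in samples


def fix_length(samples, target_len):
    """Truncate long utterances; repeat short ones until target_len is reached.

    Repeat-padding is an assumption; zero-padding would be an equally
    plausible way to fix the utterance length.
    """
    if not samples:
        raise ValueError("cannot pad an empty utterance")
    if len(samples) >= target_len:
        return list(samples[:target_len])
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


def num_frames(num_samples, sample_rate, win_ms=20, hop_ms=10):
    """Number of full analysis frames when no padding is applied."""
    win = sample_rate * win_ms // 1000   # window length in samples
    hop = sample_rate * hop_ms // 1000   # window shift in samples
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


print(fix_length([0.1, 0.2, 0.3], 8))
print(num_frames(TARGET_LEN, SAMPLE_RATE))  # 399 frames under these assumptions
```

Under these assumptions a 20 ms window spans 320 samples and a 10 ms shift spans 160 samples, so a 4-second utterance yields a 399 x 60 LFCC matrix.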
Model selection
The experiments involve two families of models. The first is based on
residual network variants [12], comprising ResNet-18, ResNet-50, and
SE-Res2Net [9]. The second is a new variant built on the RawNet2 [8]
architecture. Residual networks have been applied to anti-spoofing with
excellent results. Table 2 shows the details of the new variant, which
takes the raw audio as input. The input first passes through a SincNet
layer, followed by the residual structure, in which the original
residual block is replaced with a 1D Res2Net block. Finally, a GRU is
applied before the fully connected layer.
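To make the 1D Res2Net block concrete, the following NumPy sketch traces its data flow on a single feature map of shape (channels, frames). It is an illustration under assumptions, not the architecture of Table 2: batch normalization and the 1x1 convolutions that usually surround the block are omitted, the weights are random, the squeeze-and-excitation gate is applied before the residual addition as in common SE-Res2Net designs, and all function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def conv1d_same(x, weight):
    """1D convolution with zero 'same' padding.

    x: (C_in, T); weight: (C_out, C_in, K) with odd K; returns (C_out, T).
    """
    c_out, _, k = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for j in range(k):
        out += weight[:, :, j] @ xp[:, j:j + t]
    return out


def se_layer(x, w1, w2):
    """Squeeze-and-excitation: rescale each channel by a learned gate."""
    s = x.mean(axis=1)                      # squeeze over time: (C,)
    h = np.maximum(w1 @ s, 0.0)             # bottleneck with ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # sigmoid gate: (C,)
    return x * g[:, None]


def res2net_block_1d(x, weights, se_weights=None):
    """Hierarchical residual block over `len(weights) + 1` channel groups."""
    groups = np.split(x, len(weights) + 1, axis=0)
    outputs = [groups[0]]                   # first group passes through unchanged
    prev = None
    for xi, w in zip(groups[1:], weights):
        inp = xi if prev is None else xi + prev   # reuse the previous group's output
        prev = leaky_relu(conv1d_same(inp, w))
        outputs.append(prev)
    y = np.concatenate(outputs, axis=0)
    if se_weights is not None:
        y = se_layer(y, *se_weights)
    return y + x                            # residual connection


rng = np.random.default_rng(0)
channels, frames, scale, kernel = 8, 16, 4, 3
width = channels // scale
x = rng.standard_normal((channels, frames))
weights = [0.1 * rng.standard_normal((width, width, kernel)) for _ in range(scale - 1)]
se_weights = (0.1 * rng.standard_normal((channels // 2, channels)),
              0.1 * rng.standard_normal((channels, channels // 2)))
y = res2net_block_1d(x, weights, se_weights)
print(y.shape)  # (8, 16)
```

Each channel group after the first receives the output of the previous group, so later groups see a progressively larger receptive field within one block; this is the multi-scale behaviour that motivates replacing the plain residual block.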
Table 2: The architecture of Raw-res2net. BN refers to batch
normalization. Cons comprises the block convolution of the Res2Net
architecture and the BN and LeakyReLU operations, and SELayer stands for
the squeeze-and-excitation block.