
Sparse processing methods for detecting acoustic/phonetic events in continuous speech

A. Kovács1,2, M. Coath3, K. Mády4, S. L. Denham3, I. Winkler1
1 Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, MTA, Hungary
2 Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, Hungary
3 Cognition Institute and School of Psychology, University of Plymouth, United Kingdom
4 Research Institute for Linguistics, MTA, Hungary

Abstract — Information in continuous communication sounds such as speech is not evenly distributed but rather seems to be concentrated at perceptually critical points in the signal (Furui 1986). Here we compare three methods for identifying salient events in continuous speech: a) a model of transient responses in the subcortical auditory system (SKV; Coath and Denham 2007), b) an information-theoretic model based on Bayesian surprise (WOW; Baldi and Itti 2010), and c) Stevens' landmark detector (LM; Stevens 2002), which is based on the notion that information-bearing landmarks can be best understood in terms of the articulatory gestures used to generate them and that speech perception can be regarded as gesture recognition. The methods were compared with each other and against speech segmentation by linguistic experts in a fully crossed design of three languages (English, German, and Hungarian) and speaker gender. We found that 1) SKV produced the best match to the expert segmentation, followed by LM and WOW; 2) WOW detected a subset of the events found by SKV; 3) although largely overlapping, WOW and LM detected slightly different sets of events; 4) SKV and WOW matched the expert annotation better for male than for female speakers; and finally, 5) the algorithms performed more similarly to the expert segmentation for Hungarian than for English and German. The results support the hypothesis that onset transients in speech (detected by SKV) are possibly the most important segmentation cues. As SKV is a biologically realistic algorithm, the current results may help the modeling of speech segmentation in the human brain. Combining the three algorithms allows the development of fast automatic speech segmentation that requires relatively few computational resources and performs similarly across different languages.

Keywords — Speech processing, auditory transients, spectro-temporal responses, acoustic landmarks, Bayesian surprise, salient event detection, speech segmentation

INTRODUCTION

For phonetic analysis, the continuous speech signal is segmented and categorized into language-specific phonemic and/or subphonemic classes (Roach et al., 1990). Segment boundaries are often marked by non-stationarities, i.e., abrupt spectro-temporal changes in the signal (Furui 1986). The human auditory system is highly sensitive to abrupt spectro-temporal changes, which elicit synchronized responses from large neuronal populations (termed [auditory] event-related brain potentials; see e.g., Luck, 2005; Picton, 2010). Thus it is possible that the human brain utilizes non-stationarities ("acoustic events") for speech segmentation (Sanders et al., 2002; Cunillera et al., 2006). If such acoustic events mark the onset of articulatory gestures (as suggested by Stevens, 2002), the notion that the brain utilizes them for segmenting speech provides an information-processing basis for motor theories of speech perception (Liberman et al., 1967, 1985). Several algorithms can be employed for detecting non-stationarities.
Here we compare the detection of acoustic speech events 1) using algorithms based on three different principles and 2) with speech segmentation by an expert linguist, the latter labeling acoustic and phonological events at the segmental and subsegmental levels. In phonetics, speech segmentation is the process of taking the phonetic transcription of speech and determining the timing of the phonemes. Manual segmentation of speech can therefore be viewed as a version of the splitting problem known from computational complexity theory. In contrast to phonetic theories of speech segmentation, there is as yet no consistent account of how the brain transforms acoustic signals into representations that convey meaning, although there is increasing consensus that the relevant aspects of speech sounds may be enhanced at the auditory periphery (Dunlap et al., 2013). Abrupt spectrotemporal changes may serve as triggers evoking these enhancement processes.

The use of onsets and offsets as a means of sound segmentation has been explored using convolutions between the bandpass-filtered sound and an asymmetric kernel (Smith, 1995; Fishbach et al., 2001). The model of transient responses used here (Coath and Denham, 2005) is similar in spirit and is based on the asymmetry (skewness) of the energy distribution within variable-size, frequency-dependent time windows. This representation is referred to as the auditory transient method, or SKewness in Variable time (SKV). The SKV representation has no free parameters apart from the number of samples over which the skewness is calculated and the minimum window size allowed. It generates responses that are in agreement with physiological data (Coath and Denham, 2005). Coath and Denham's (2005) method is based on the notion that the spectrotemporal receptive field (STRF) of neurons in the auditory cortex can be analyzed in terms of the time and frequency characteristics of the stimulus set (Chi et al., 1999). It has been shown (Coath, Brader et al., 2005; Coath and Denham, 2005) that the summed response of an ensemble of STRFs, which is comparable to the summed SKV response, can be used to indicate the onset of salient events in continuous speech. From a pragmatic point of view, this method is robust against short-term signal variations and relatively insensitive to DC offsets. However, it is not clear whether it can also signal the phonemic segmentation points or only the more salient syllabic boundaries. Therefore, we tested the SKV representation to explore its phonemic and subphonemic segmentation properties.

An alternative approach to defining salient events in an acoustic signal is statistical 'surprise' (WOW; Itti and Baldi, 2009). Surprise is based on the difference between the recipient's expectation, calculated from the history of the signal, and the actual input. It is an information-theoretic concept measuring how an observer is affected by a new piece of incoming data, based on the difference between their prior and posterior beliefs: the more unexpected the stimulus, the more information it carries (Itti and Baldi, 2009). Modern theories of perception assume that information processing in the brain is essentially predictive (e.g., Friston, 2005; Gregory, 1980), and specifically, auditory novelty detection has been explained by predictive processes (Garrido et al., 2009; Wacongne et al., 2011; Winkler, 2007).
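For the reader's convenience, the formal definition from Itti and Baldi (2009) can be stated compactly: the surprise carried by data D is the Kullback-Leibler divergence between the observer's posterior and prior beliefs over the space of models M:

```latex
S(D, \mathcal{M}) \;=\; \mathrm{KL}\!\left(P(M \mid D)\,\|\,P(M)\right)
            \;=\; \int_{\mathcal{M}} P(M \mid D)\,\log \frac{P(M \mid D)}{P(M)}\, dM
```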
Bayesian surprise has been used to model the detection of salient events in the visual domain (Mundhenk et al., 2009; Chen et al., 2015), but it has not yet been widely explored in the auditory modality. One exception is the study of Kayser and colleagues (2005), in which Bayesian surprise was used to generate saliency maps predicting which of the test stimuli attract attention. Bayesian surprise has also been employed in auditory salient event detection applications (Corchs et al., 2013; Schauerte and Stiefelhagen, 2013). Further, some computational models of auditory stream segregation (Bregman, 1990) are based on Bayesian principles (e.g., Barniv and Nelken, 2015; for a review, see Szabó et al., 2016).

On phonological grounds, potential segment boundaries (termed acoustic 'landmarks') can be identified by detecting distinctive spectrotemporal patterns (Stevens, 2000, 2002). Landmark analysis is similar to speech segmentation and windowing techniques. However, whereas these other two techniques use boundaries to cut the signal into segments, the landmark algorithm detects the time-points in the continuous speech signal where changes across all frequency bands are most salient. Theoretically, an opening landmark is always followed by a closing one (onset and offset), but when articulation blurs together the end of one phoneme and the start of another, some starting or ending landmarks cannot be distinguished. These markers are strongly related to the details of articulation. Therefore, different articulatory events and processes are assigned to different types of landmarks. For example, an abrupt increase in the amplitude level over a broad range of frequencies above 3 kHz indicates the onset of a burst. Likewise, an abrupt decrease in the same frequency band indicates sonorancy (i.e., an interval during which the oral cavity is relatively unconstricted). Vowel landmarks represent local energy maxima characterized by harmonic power. These landmark patterns are identified by comparing coarse and fine spectral detail (Liu, 1996); a toy sketch of the burst cue is given at the end of this section. It is argued that when listening to human speech, one focuses on the landmarks to locate the underlying distinctive features, which serve as markers for finding the corresponding word representation in the lexicon (Stevens, 2002). Although acoustic landmarks are widely used in speech training aids, speech modification, and phoneme recognition (Boyce et al., 2011, 2012, 2013), they have not yet been compared with other speech event detection algorithms.

The outputs of the above three algorithms were compared with each other as well as with the phonemic- and subphonemic-level segmentation carried out by trained linguists. The latter served as the ground truth in the comparisons. Subsegmental-level annotations were also included in the event set. This was necessary because some phonemes cannot be described as a single event, with important differences clearly manifesting at the subsegmental level (e.g., plosives such as /t/, /k/, and /p/ comprise two phases, a stop of the airflow involving the lips, teeth, or palate and a sudden release of air, with a burst in between). These kinds of subsegmental events are also captured by most of the above-mentioned event-detector algorithms. Therefore, for the comparison, both the phonemic and subphonemic levels of segmentation were consulted whenever there was a difference between the orthographic transcription and the phoneme realization.
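As a toy illustration of the burst cue just described (and emphatically not of the SpeechMark implementation), one could flag frames where the summed level above 3 kHz rises abruptly; the 9 dB rise threshold below is an arbitrary assumption for illustration only:

```python
import numpy as np

def toy_burst_landmarks(spectrogram, freqs, frame_rate, thresh_db=9.0):
    """Toy sketch of one landmark cue: an abrupt amplitude rise across a
    broad range of frequencies above 3 kHz suggests a burst onset (+b).

    spectrogram : (n_freqs, n_frames) power spectrogram (NumPy array)
    freqs       : (n_freqs,) center frequency of each row in Hz
    frame_rate  : frames per second of the spectrogram
    """
    high = spectrogram[np.asarray(freqs) > 3000.0]
    # Band-summed level in dB, frame by frame (small offset avoids log(0)).
    level_db = 10.0 * np.log10(high.sum(axis=0) + 1e-12)
    rise = np.diff(level_db)
    # Frames where the high-frequency level jumps abruptly.
    onset_frames = np.where(rise > thresh_db)[0] + 1
    return onset_frames / frame_rate  # putative burst landmark times in s
```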
In summary, the purpose of this study was to assess a) the commonalities and differences between salient events detected on the basis of the three principles and b) which of the three acoustic event detection principles produces the best match to the phonological segmentation performed by an expert linguist. Further, we hypothesized that since these algorithms label basic phonological-acoustical events, the results would be qualitatively similar across different languages. In order to test this hypothesis, we included spoken sentences from three different languages in our test corpus.

METHODS

2.1. The auditory transient (SKV) representation

Speech signals were first processed using a simple cochlear model consisting of a bank of 128 Gammatone filters (Slaney, 1994) with center frequencies (CF) ranging from 100 to 8000 Hz, arranged evenly on an equivalent rectangular bandwidth scale (Glasberg and Moore, 1990). This is a common approximation of the human peripheral auditory filters used in psychoacoustics (Darling, 1991). The SKV representation was calculated separately within each frequency band as the skewness of the energy distribution within overlapping time windows, the length of which varied with the CF of the band: each window was eight times the period of the CF, with a minimum length of 2.5 milliseconds (ms) at high frequencies. These parameters were chosen according to Wiegrebe (2001), who describes these values as minimally necessary for pitch extraction. The overlap between the windows was set to 10% (Coath and Denham, 2005). Skewness, the standardized third central moment of a distribution, measures the asymmetry of the activation pattern within each time window. The result of this process is a spectro-temporal map of activation level changes, separately for each frequency band. For detecting putative speech 'events', the SKV responses were summed across all frequency bands; this is referred to as the summed SKV (Fig. 1).

Figure 1: Normalized summed onset (positive) SKV representation (top panel) of a sentence together with onset SKV responses calculated for each frequency band (bottom panel, spectrogram). Time is represented on the x axis, summed SKV (top panel) and frequency (bottom panel) on the y axis.

In order to avoid time-consuming computations, artificial neural networks were used to implement the SKV calculations (Kovács et al., 2015). The neural network used the output of the simple cochlear model as input, with the target output value being the conventionally calculated SKV. A two-layer feed-forward neural network with sigmoid hidden neurons and linear output neurons was trained using the Levenberg-Marquardt backpropagation learning algorithm. The resulting network was used to generate the SKV output in the rest of the study. Only the onset SKV events were included in the comparisons. For training, sentences from four Hungarian speakers (2 male and 2 female, one sentence each) were used from the speech corpus described in the Materials section, with no overlap between the sentences used for training and those used for testing the three methods.

2.2. Bayesian surprise (WOW)

Surprise was calculated from the simple cochlear model's output using a formal Bayesian definition (Itti and Baldi, 2009) with the Bayesian Surprise Matlab Toolkit (iLab at USC, 2004-2007) implementation. The time-varying surprise value was assessed separately for each frequency band and then summed across bands (Fig. 2). The decay factor in the surprise calculation was set to 0.7, which has been described as a good general value (Itti and Baldi, 2005, 2006).
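The following Python sketch illustrates the per-band skewness computation described in Section 2.1. It assumes a precomputed Gammatone filterbank output; squaring the band signal to obtain energy and the final interpolation step, which brings bands with different hop sizes onto a common time base before summing, are our assumptions rather than parts of the published method.

```python
import numpy as np
from scipy.stats import skew

def skv_representation(bands, cfs, fs, n_periods=8, min_win_s=0.0025, overlap=0.1):
    """Hedged sketch of the SKV computation (Section 2.1).

    bands : (n_bands, n_samples) array of Gammatone filterbank outputs
    cfs   : center frequency of each band in Hz
    fs    : sampling rate in Hz
    """
    per_band = []
    for band, cf in zip(bands, cfs):
        # Window length: eight periods of the CF, but at least 2.5 ms.
        win = max(int(round(n_periods * fs / cf)), int(round(min_win_s * fs)))
        hop = max(int(round(win * (1.0 - overlap))), 1)  # 10% window overlap
        energy = band ** 2  # band output squared to obtain energy (our reading)
        # Skewness of the energy distribution within each window.
        vals = np.array([skew(energy[s:s + win])
                         for s in range(0, len(energy) - win + 1, hop)])
        per_band.append(vals)
    # Sum across bands after resampling each band's skewness trace to a
    # common length (bands have different hop sizes) -- our assumption.
    n = min(len(v) for v in per_band)
    grid = np.linspace(0.0, 1.0, n)
    resampled = [np.interp(grid, np.linspace(0.0, 1.0, len(v)), v) for v in per_band]
    summed = np.sum(resampled, axis=0)
    return per_band, summed
```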
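The study itself used the Bayesian Surprise Matlab Toolkit. Purely as a language-agnostic illustration of the idea (and not of the toolkit's actual conjugate-prior models described by Itti and Baldi, 2009), the sketch below maintains a decaying running belief per frequency band, with the decay factor 0.7 mentioned above, and scores each new frame by the KL divergence between posterior and prior; the Gaussian belief model is our simplification.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL divergence KL( N(mu_p, var_p) || N(mu_q, var_q) ), elementwise."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def wow_surprise(bands, decay=0.7, var_floor=1e-6):
    """Hedged sketch of per-band running surprise, summed across bands.

    bands : (n_bands, n_frames) array of non-negative band energies
    """
    n_bands, n_frames = bands.shape
    mu = bands[:, 0].copy()          # prior mean per band
    var = np.full(n_bands, 1.0)      # prior variance per band
    total = np.zeros(n_frames)
    for t in range(1, n_frames):
        x = bands[:, t]
        # Posterior after seeing x: decayed blend of prior and the new datum.
        mu_post = decay * mu + (1.0 - decay) * x
        var_post = np.maximum(decay * var + (1.0 - decay) * (x - mu) ** 2, var_floor)
        # Surprise = KL(posterior || prior), summed across frequency bands.
        total[t] = np.sum(gaussian_kl(mu_post, var_post, mu, np.maximum(var, var_floor)))
        mu, var = mu_post, var_post
    return total
```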
Comparing Figs. 1 and 2, one can see that the SKV and WOW methods identify approximately the same speech events, but WOW produces a sparser output.

Figure 2: Normalized summed WOW representation (top panel) of a sentence together with WOW responses calculated for each frequency band (bottom panel, spectrogram). Time is represented on the x axis, summed WOW (top panel) and frequency (bottom panel) on the y axis. The WOW values across the different frequency bands are shown in greyscale, with darker shades representing higher WOW values.

2.3. Acoustic landmarks (LM)

The landmark annotation software used in this work was developed by the SpeechMark Team (www.speechmrk.com; Boyce et al., 2012). We used the same landmark groups as Boyce and colleagues (2010) and Liu (1995) for our comparisons with the other methods: (1) based on the harmonic spectrum, glottis (g) marks a time point at which voicing begins (+g) or ends (-g); (2) burst (b) marks the frication onset of affricate/stop bursts (+b) and the time where the aspiration or frication ends due to a stop closure (-b); (3) syllabicity (s) marks sonorant consonantal releases (+s) and closures (-s). The software produced two further landmark types: vowels (v) and frication (f). Frication-type landmarks are included in the burst category; therefore, frication landmarks were not separately utilized in the comparisons. The software marked the middle of the vowel waveform (v); because this is not comparable with the other event detection algorithms, these landmarks were not utilized either. The output of the landmark analysis includes the time and the type (onset/offset; b/g/s) of the detected landmarks. Only the onset (+) type of landmarks was included in comparisons with the other two methods.

2.4. Expert speech segmentation

Annotation was conducted manually by two trained linguists based on the spectrogram and the oscillogram of the speech using the Praat software (Boersma and Weenink, 2013). At the phonemic level, only the phonemes were indicated, including those that, as is typical for fluent speech, were left unpronounced by the speaker (examples of deleted but indicated phonemes are /l/ in Hungarian and /r/ in English). At this level of annotation, phonemes consisting of more than one acoustical event constitute a single item. Thus, diphthongs and affricates, which typically consist of two elements, were not separated into two events, and plosives were not separated into a closure and a release phase. In addition, neither allophonic differences nor co-articulation were indicated. To improve on the phoneme labeling, a second-level subphonemic annotation was added. This included idiosyncratic features, such as deleted sounds or the replacement of a <vowel, /r/> sequence with a single r-colored vowel. This two-level segmentation led to marking not only the phonemes but all the main phonetic events in the corpora.

2.5. Materials

Ten different sentences selected from three corpora of different languages (Hungarian, German, and English) were used for the main tests. Each sentence was pronounced separately by four (2 male and 2 female) native speakers. Testing on material from three different languages served to assess language-dependent features of the three algorithms. All of the sentences were approximately the same length, and there was no major difference between the amounts of speech material selected from the three language corpora. The Hungarian sentences were extracted from a folk tale retold by four untrained native Hungarian speakers.
The utterances were recorded in a sound-attenuated room with a Behringer C-1 large-diaphragm condenser cardioid microphone. The microphone signal was amplified and digitized by an Alesis iO2 Express audio interface connected to a computer via USB. The sound files were created using Audacity (version 2.0.5) with a sampling rate of 44100 Hz and 16-bit depth; they were later resampled to 16000 Hz. The German speech material was part of the Berlin Database of Emotional Speech (Burkhardt et al., 2005), which contains different speakers uttering the same sentences in different emotional states. The sentences were recorded by native German-speaking actors and were available in 16000 Hz, 16 bit, mono format. We selected emotionally neutral sentences for the current study. The English speech material consisted of the second 10 sentences from Stevens' Lexical Access From Features (LAFF) database (Stevens, 2002), a collection of 110 sentences recorded by non-professional native speakers of American or Canadian English. The database was initially recorded to audiotape, later low-pass filtered at 7500 Hz and digitized at 16 bits with a sampling rate of 16000 Hz.

For training the artificial neural networks for the SKV calculation, four additional sentences were selected from the Hungarian folk tale material described above. These sentences were distinct from those used in the main tests. For computing thresholds and window lengths (see Sections 2.7 and 2.8), further additional sentences were selected, one sentence for each speaker and corpus.

2.6. Comparison between the outputs of the different methods

In order to compare the different event detection methods and the manual segmentation, the outputs of these methods were turned into binary vectors of event (1) vs. non-event (0) samples at 200 Hz. Whereas the acoustic landmarks and the manual segmentation directly produced discrete information (event type) together with its timing, events needed to be extracted from the continuous output of the SKV and WOW methods. The threshold values for these methods were set to maximize their approximation of the expert manual segmentation, which was regarded as the ground truth (see Section 2.7). Values exceeding the threshold level were turned into ones, the rest into zeros. From the thresholded vectors, an event vector was created, which marked only the first above-threshold value (the onset of a continuous segment) as a speech event, without any event-type marker (see Figure 3).

Figure 3: Summed, normalized SKV function plotted above the spectrogram of a Hungarian sentence ("Szépen beszélt, csodaszépen"). Vertical thick black lines indicate the speech events extracted with the 0.1 (10%) threshold. Time is represented on the x axis, frequency for the spectrogram (bottom panel) and the normalized SKV (top panel) on the y axis.

From the event vectors, various metrics were defined according to binary matching between the vectors. Because one cannot assume that the various transient event detectors find the same speech events at exactly the same time, matching was done within tolerance windows centered on the time of the events in one of the vectors (termed the "base method" and specified for each comparison). The length of the tolerance window was optimized for each comparison to achieve a maximal match with minimal false matches (see Section 2.8).
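A minimal sketch of this thresholding and onset-extraction step might look as follows, assuming the detector output has already been normalized to the 0-1 interval and resampled to 200 Hz:

```python
import numpy as np

def extract_events(signal, threshold=0.1):
    """Turn a normalized, continuous detector output into a binary event
    vector marking only the first above-threshold sample of each
    continuous above-threshold segment (Section 2.6)."""
    above = np.asarray(signal) >= threshold
    # An event is the onset of a run of above-threshold samples.
    onsets = above & ~np.concatenate(([False], above[:-1]))
    return onsets.astype(int)
```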
After matching between two event vectors, the following scores were initially calculated:

True positives (TP): the number of time points at which both methods detected a speech event (i.e., an event in the base method was accompanied by an event within the tolerance window of the comparison method);

False negatives (FN): the number of events detected by the base method with no corresponding event within the tolerance window of the comparison method;

False positives (FP): the number of time points with an event detected by the comparison method which did not fall into the tolerance window of any of the events of the base method.

From these scores, the following metrics were derived according to Powers (2011):

Precision: P = TP/(TP+FP); precision is primarily sensitive to the number of false positives;

Recall: R = TP/(TP+FN); recall is primarily sensitive to the number of false negatives;

F-score: F = 2*(P*R)/(P+R), the harmonic mean of precision and recall; it helps to assess whether a given method maximizes both recall and precision or only one of them.

2.7. Optimizing the threshold levels

For transforming the outputs of the SKV and WOW algorithms into binary event vectors, they needed to be thresholded (as mentioned in Section 2.6). To this end, the outputs of these methods were first normalized to the 0-1 interval based on a set of Hungarian, English, and German test sentences. The principle of finding the threshold value was to maximize the F-score, calculated separately for each method, in relation to the expert segmentation data. Because the F-score balances precision and recall, it can be used to optimize these methods in terms of detecting the segment boundaries marked by the experts. During the optimization, the F-score was calculated with different threshold values from 0 to 1 in 0.1 increments on a set of Hungarian, English, and German sentences (see Section 2.5 for the description of the material used for these calculations). The threshold value that produced the maximal F-score was selected, separately, for the SKV and the WOW method. The tolerance window was set to ±50 ms for optimizing the thresholds. Figure 4 shows the F-score values with respect to the expert annotations, separately for the two methods and for male and female speakers (collapsed across the three languages). The best threshold value (in terms of the F-scores) was 10% for both methods and both speaker genders; thus this value was selected for the subsequent analyses.

Figure 4: F-score values (y axis) for the SKV (left panel) and the WOW (right panel) method at different threshold levels (x axis) based on Hungarian, English, and German test sentences (collapsed), for male (continuous lines) and female speakers (dashed lines). In all four cases, the 10% threshold produced the highest F-score values, as indicated by black diamonds.

2.8. Optimal tolerance window

Figure 5 shows the dependence of event matching (as measured by TP) on the width of the tolerance window (from 0 to ±50 ms in 5-ms steps), separately for the three possible pairwise comparisons between the three automatic event detection methods. In comparisons between SKV and the other two methods, tolerance windows were centered on the events detected by the SKV algorithm (base), because SKV detected the largest number of events. For the comparison between WOW and LM, WOW was regarded as the base method. We took as the point of convergence the window length at which the TP growth rate dropped below 5%.
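The following sketch implements the matching and the derived metrics literally as defined above; whether an event may participate in more than one match is not specified in the text, so the simple any-within-window test used here is an assumption:

```python
import numpy as np

def match_events(base_times, comp_times, tol=0.025):
    """Tolerance-window matching (Section 2.6); tol = 0.025 s corresponds
    to the +/-25 ms window selected in Section 2.8.

    TP: base events with at least one comparison event in their window;
    FN: base events with none;
    FP: comparison events falling into no base event's window.
    """
    base = np.asarray(base_times, dtype=float)
    comp = np.asarray(comp_times, dtype=float)
    hit = np.array([np.any(np.abs(comp - b) <= tol) for b in base])
    covered = np.array([np.any(np.abs(base - c) <= tol) for c in comp])
    tp = int(hit.sum())
    fn = int((~hit).sum())
    fp = int((~covered).sum())
    return tp, fp, fn

def precision_recall_f(tp, fp, fn):
    """P, R, and F-score as defined above, guarding against division by zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

The threshold optimization of Section 2.7 then reduces to evaluating precision_recall_f over thresholds 0.0, 0.1, ..., 1.0 against the expert event vector and keeping the threshold with the maximal F.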
In all comparisons, TP rises sharply between 0 and ±5 ms window length, followed by small further increases up to ±25 ms window length. The convergence points are shown in Figure 5 with filled geometric shapes. In two of the three comparisons the convergence point was ±25 ms, whereas in the third (SKV-WOW) it was ±20 ms. This suggests that the latencies of the same events detected by the different algorithms typically fall within a ±25 ms time window. Therefore, this value was used for event matching in the rest of the analyses.

Figure 5: The number of matching events (TP; y axis) as a function of the tolerance window length (x axis), collapsed across the three languages, for pairwise comparisons between the three automatic event detection methods (SKV-WOW: triangles; SKV-LM: squares; WOW-LM: circles). The points of convergence are labeled with larger filled shapes.

2.9. Statistical testing

Statistical testing was performed by ANOVAs on the F-score values between pairs of methods. For this analysis, the F-scores were calculated separately for each sentence, which were regarded as samples. Statistical tests were performed using the STATISTICA software. The nominal alpha level was 0.05; however, p-values were only regarded as significant below 0.01 (this stricter-than-usual criterion was selected because of the high number of samples). Greenhouse-Geisser correction was employed when the sphericity assumption was violated; the ε correction factor and the partial η² effect size are reported for each significant effect. Post-hoc tests were performed with Tukey's HSD.

When comparing the similarity between pairs of algorithms, a three-way mixed-design ANOVA was conducted on the F-score values with the dependent factors of Gender (male vs. female; treated as dependent because the speakers of both genders pronounced the same sentences) and Comparison (SKV-WOW vs. WOW-SKV vs. SKV-LM vs. LM-SKV vs. WOW-LM vs. LM-WOW; the reversed comparisons were necessary because the similarity matrix is not symmetric), and the independent factor of Language (Hungarian vs. English vs. German). When comparing how well the three methods approximated the hand annotation, a three-way mixed-design ANOVA was conducted on the F-score values with the dependent factors of Gender (male vs. female) and Algorithm (Segmentation-SKV vs. Segmentation-WOW vs. Segmentation-LM), and the independent factor of Language (Hungarian vs. English vs. German).

RESULTS

Figure 6 shows a Hungarian sample sentence with the events detected by the different methods marked on the Praat display. The display suggests that the events defined by the WOW algorithm largely coincide with those defined by SKV. Many of the landmarks are also covered by one or both of the other algorithms.

Figure 6: Screen capture of Praat with a short Hungarian sentence analyzed ("Hanem egyszer ő is csak meghalt"). Below the mono waveform, the spectrogram and the events detected by the different methods are presented in different tiers: (1) orthographic: orthographic transcription, (2) phonemic: hand-annotated segmentation, (3) SKV: SKV events, (4) LM: LM events, and (5) WOW: WOW events. The number of marked events is given in parentheses below the tier name on the right side.

Comparison across the three event detection methods

The SKV, WOW, and LM methods were compared using the measures defined in Section 2.6. Overall, within the whole sample, WOW detected the fewest events (1234), LM more (1868), and SKV the most (2721).
Figure 7 and Table 1 show the comparisons of the three methods by the F-score (tables for precision and recall are provided in Appendix I, Tables 3 and 4). Statistical analysis revealed a main effect of Algorithm (F(5,580) = 51.653; p < 0.001, ε = 0.380, ηp² = 0.308). Post-hoc tests revealed that this effect was due to the WOW-LM and LM-WOW similarity measures differing from the similarities between all other pairs, but not from each other (p < 0.001 for all). This result shows that similarity was lowest between the WOW and LM algorithms. In contrast, SKV found a set of events similar to those of both the WOW and the LM method, as indicated by the lack of significant differences among the SKV-WOW, SKV-LM, WOW-SKV, and LM-SKV F-scores (p > 0.785, at least). There were no significant main effects or interactions involving the Gender and Language factors.

Figure 7: Mean F-score similarity measures for comparisons in both directions (because the comparison is not symmetric) between the SKV and WOW, SKV and LM, and LM and WOW pairs of algorithms, separately for the three languages (rows) and for male (black bars) and female (hollow bars) speakers. The standard deviation is marked on top of each bar.

Table 1: Comparison between the SKV, WOW, and LM algorithms on the F-score measure (mean with standard deviation in brackets). Each column is subdivided to indicate the values for the male (M) and female (F) speakers separately.

F SCORE
                                          Comparison method
Base              SKV (M)          SKV (F)          WOW (M)          WOW (F)          LM (M)           LM (F)
SKV   HUN         -                -                0.6695 (0.1368)  0.6718 (0.1438)  0.6794 (0.0831)  0.656 (0.1155)
      ENG         -                -                0.6973 (0.1331)  0.7332 (0.1327)  0.6664 (0.0989)  0.7651 (0.0762)
      GER         -                -                0.7127 (0.0985)  0.7020 (0.092)   0.6978 (0.093)   0.6447 (0.0654)
WOW   HUN         0.6695 (0.1369)  0.6718 (0.1438)  -                -                0.576 (0.1237)   0.5542 (0.1363)
      ENG         0.6973 (0.1331)  0.7332 (0.1327)  -                -                0.6106 (0.1521)  0.6455 (0.0967)
      GER         0.7127 (0.0985)  0.7020 (0.0920)  -                -                0.6030 (0.1216)  0.5745 (0.0795)
LM    HUN         0.6789 (0.0823)  0.6646 (0.1114)  0.5741 (0.1227)  0.564 (0.1486)   -                -
      ENG         0.6803 (0.1014)  0.7752 (0.0811)  0.6194 (0.1560)  0.6455 (0.0967)  -                -
      GER         0.7056 (0.1)     0.6502 (0.0628)  0.6074 (0.1239)  0.581 (0.0849)   -                -

Looking at event matches (TP) between the three algorithms (Figure 8 and Table 2; the corresponding comparisons for the FP and FN measures are given in Appendix I, Tables 1 and 2), WOW and LM each appear to cover subsets of the SKV events. This is shown by the asymmetry of the matching data: most of the WOW and LM events are also detected by SKV, whereas the reverse is not true (Figure 8, left and middle columns). This means that the SKV algorithm produces the most diverse event set. WOW detects a subset of the LM events (Figure 8, right side), most of which are also detected by SKV, since almost all WOW events are also SKV events (Figure 8, left side).

Figure 8: Mean percentage of matching events (with respect to all events detected by the comparison algorithm, marked under the bar) between the SKV, WOW, and LM algorithms. The three languages are shown in separate rows. Black bars show the percentage of the events detected by the second labeled method for which a matching event was found by the first method (i.e., the second method serves as the base, the first as the comparison method; see Section 2.6). Hollow bars show the results of the reversed comparison (the first labeled method serves as the base and the second as the comparison method).

Table 2: Comparison between the SKV, WOW, and LM algorithms on the True Positive (TP) measure (mean with standard deviation in brackets).
Each column is subdivided to indicate the values for the male (M) and female (F) speakers separately.

TRUE POSITIVE
                                          Comparison method
Base              SKV (M)          SKV (F)          WOW (M)          WOW (F)          LM (M)           LM (F)
SKV   HUN         -                -                9.7 (3.5851)     9.1 (4.3923)     11.65 (3.4378)   10.95 (3.8454)
      ENG         -                -                8.55 (3.2843)    8.6 (4.2103)     9.15 (3.1834)    10.35 (3.3916)
      GER         -                -                11.05 (3.7763)   8.95 (3.6487)    12.8 (4.3842)    10.7 (3.7989)
WOW   HUN         9.7 (3.5851)     9.1 (4.3923)     -                -                7.9 (3.3230)     7.55 (2.9465)
      ENG         8.55 (3.2843)    8.6 (4.2103)     -                -                7.05 (3.2359)    7.5 (3.2687)
      GER         11.05 (3.7763)   8.9 (3.6487)     -                -                9.05 (3.1368)    7.85 (2.9069)
LM    HUN         11.65 (3.4378)   11.1 (3.8648)    7.9 (3.3230)     7.65 (2.9784)    -                -
      ENG         9.35 (3.265)     10.5 (3.5019)    7.15 (3.265)     7.5 (3.2687)     -                -
      GER         13 (4.6792)      10.8 (3.8471)    9.15 (3.3289)    7.95 (3.0171)    -                -

The events most consistently marked by all three algorithms were the following: voiceless plosive-vowel transitions, voiced plosive-vowel transitions, the time of the plosion, and nasals. The comparisons between the algorithms revealed the following pattern of differences in event detection (the first algorithm in each comparison is the base method, the second the comparison method).

1) SKV-WOW comparison: The events missed by WOW are those where the energy change does not differ much between frequency bands (for example, in /i/-/h/ transitions). In the reverse comparison, there are only very few WOW events missed by the SKV algorithm. Some vowel-voiceless fricative transitions produce weak SKV peaks and thus could only be detected if the threshold were lowered.

2) SKV-LM comparison: There are a few events that appear in the LM but not in the SKV output. These are typically the sentence-initial events in LM and the energy maxima of voiced phonemes. In addition, LM tends to mark some events multiple times (e.g., those that are both glottis and burst type). In contrast, some nasal-vowel transitions, and vowel-nasal transitions when the vowel is unstressed, are detected by SKV but not by LM.

3) LM-WOW comparison: Most WOW events are also found by LM. The few missing ones are typically the same as those listed in the WOW-SKV comparisons (e.g., vowel-voiceless fricative transitions), and sometimes the intensity peak in the /r/ phoneme. Similarly to the LM events missed by SKV, the events missed by WOW are often due to LM labeling some speech events multiple times.

In summary, the SKV and WOW algorithms identify approximately the same speech events, with WOW labeling only the events carrying a higher amount of change/surprise, thus detecting a subset of the SKV events. The LM-SKV and LM-WOW comparisons suggest that LM might also detect a subset of the SKV events, which is, however, partly distinct from the WOW event set.

Comparison of the event detection methods with expert segmentation

Expert segmentation events served as the base method, because the hand-annotated segmentation was regarded as the ultimate ground truth in the linguistic sense. First we checked how well the events common to all three methods compare against the expert annotation (Figure 9). The F-score values (Figure 9, right column), along with the precision and recall values, are relatively low and do not differ much across the three languages. This is likely the consequence of the relatively small number of events common to all three methods compared with the wealth of events marked by the hand annotation.
Figure 9: Performance measures, precision (left), recall (middle), and F-score (right), for the events common to the three automatic event detection methods compared with the expert annotation, separately for the three languages (rows).

Figure 10 and Table 3 show the F-scores between the expert annotation and the three event detection methods, separately for the different methods (columns), the three languages (rows), and male and female speakers (bar shading). Tables for precision and recall are provided in Appendix II, Tables 3 and 4.

Figure 10: Comparison of the SKV, WOW, and LM algorithms (columns) against the expert-segmented ground truth in terms of mean F-score values for the Hungarian, English, and German languages (rows), for male (grey) and female (black) speakers. Error bars show the standard deviation.

Table 3: Mean F-score values with standard deviation in brackets, calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male and female speakers (row divisions).

F SCORE
Expert ground truth       SKV              WOW              LM
HUN   M                   0.6044 (0.0566)  0.4863 (0.1056)  0.4909 (0.0914)
      F                   0.5067 (0.0743)  0.4154 (0.1166)  0.4787 (0.0837)
ENG   M                   0.5303 (0.0748)  0.4338 (0.091)   0.41 (0.086)
      F                   0.4952 (0.0687)  0.3888 (0.131)   0.4249 (0.0703)
GER   M                   0.5028 (0.0806)  0.4046 (0.1241)  0.4228 (0.0637)
      F                   0.4326 (0.093)   0.3313 (0.1147)  0.4138 (0.0777)

A significant interaction was found between Gender and Comparison (F(2,232) = 7.2; p < 0.01, ε = 0.959, ηp² = 0.058). This interaction was caused by the SKV and WOW algorithms finding more expert-marked events for male than for female speakers (p < 0.01 for both). The most ground-truth events were found by the SKV algorithm for male speakers, significantly more than by any of the other algorithms irrespective of the speaker's gender (p < 0.001 for all). In contrast, the fewest events were found by the WOW algorithm for female speakers, significantly fewer than by any of the other algorithms irrespective of the speaker's gender (p < 0.01 for all). There were also significant main effects of Comparison (F(2,232) = 58.979; p < 0.001, ε = 0.959, ηp² = 0.337), Gender (F(1,116) = 12.812; p < 0.001, ηp² = 0.099), and Language (F(2,116) = 13.937; p < 0.001, ηp² = 0.193). The main effect of Comparison was due to all three algorithms performing significantly differently from each other: SKV > LM > WOW. The Gender effect was due to more expert-marked events being detected for male than for female speakers (p < 0.001). As for the Language effect, post-hoc tests revealed that the three algorithms were significantly more similar to the expert segmentation for Hungarian than for the other two languages (p < 0.01 for both comparisons), whereas the performance for English and German did not significantly differ (p > 0.136).

For analyzing how the different algorithms differ from the expert segmentation, we also examined the patterns of the TP, FP, and FN scores (see Figure 11 and Table 4 for the TP comparisons; the FP and FN comparisons are shown in Appendix II, Tables 1 and 2).

Figure 11: Comparison of the SKV (left), WOW (middle), and LM (right) algorithms against the expert-segmented ground truth in terms of true positive (TP) values for Hungarian, English, and German (rows) and for male (grey) and female (black) speakers. Error bars show the standard deviation.
Table 4: Mean True Positive (TP) values with standard deviation in brackets, calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male and female speakers (row divisions).

TRUE POSITIVE
Expert ground truth       SKV              WOW              LM
HUN   M                   13.55 (3.1535)   9.25 (3.1602)    10.65 (3.0310)
      F                   11.25 (4.0507)   8.05 (3.6631)    10.4 (2.8172)
ENG   M                   10.05 (2.6052)   7.4 (2.9806)     7.55 (2.7237)
      F                   9.3 (2.9218)     6.75 (3.5964)    8.05 (2.781)
GER   M                   13.9 (5.1391)    9.7 (4.1562)     11.3 (3.6288)
      F                   11.25 (4.3634)   7.35 (2.6413)    10.9 (3.6548)

Looking at the events not detected by the three algorithms, it appears that the /n/-/b/ and /m/-/b/ transitions are typically missed by all three. In addition, there are types of events missed by only some of the algorithms. The SKV algorithm does not find the start of the sentence when the first phoneme is a vowel. SKV sometimes finds false events in the middle of vowels at the energy maximum points and misses some vowel-vowel and nasal-vowel transitions. Voiced-/s/, /S/, /f/ transitions are marked by SKV with less accurate timing than by WOW. However, SKV can find transitions between two voiced phonemes (whether vowels or consonants), which is unique compared with the other two algorithms. Further, for transitions containing unvoiced plosives (/p/, /t/, /k/), the two successive plosions are both marked by SKV.

In addition to the events missed by SKV, WOW does not find transitions between two voiced phonemes, typically missing the aforementioned /n/-/b/, /m/-/b/, and /n/-/m/ transitions as well as transitions in which one of the phonemes is a vowel and the other a voiced consonant (specifically nasals, for example /a/-/m/). Further, transitions involving /r/ are never marked, vowel-vowel transitions are not properly detected, and only a few of the fricative-to-vowel transitions are found. On the other hand, WOW marks many sentence and phrase onsets and the voiced-/s/, /S/, /f/ transitions.

The LM algorithm is not sensitive to voiced transitions, particularly the aforementioned voiced-fricative transitions (voiced-/s/, /S/, /f/). In contrast to the SKV and WOW algorithms, LM mostly finds the vowel-nasal transitions but misses the nasal-vowel ones. Transitions related to the phoneme /r/ are not always detected. However, LM is sensitive to the /k/-/t/ transitions; it correctly marks not just the plosion of the consonant but also the start of the silent period before the stop, and it is the only algorithm that can reliably find both phonemes of the transition. LM also detects the start of the sentence.

The WOW and SKV algorithms cannot be fully compared to the expert segmentation in the sense that these algorithms were not designed for segmentation purposes; rather, they find acoustically salient points in speech, which do not cover all possible segment boundaries. Having said that, there are some differences between the results achieved by the three approaches. SKV and WOW cannot find the vowel-nasal and voiced consonant-voiced consonant transitions; additionally, WOW cannot detect fricative-vowel transitions. LM cannot find voiced-fricative transitions, which are detected by the two other algorithms. Nasal-vowel transitions are mostly missed by LM, but vowel-nasal transitions are marked, in contrast to the SKV and WOW algorithms. Again, we note that the events detected by the WOW algorithm are mostly also covered by SKV.
DISCUSSION

Detecting salient stimulus events has been widely used in studying human auditory, visual, somatosensory, and multimodal perception and action selection (Ke et al., 2007; Ellingsen, 2008; Hohwy et al., 2008; Itti and Baldi, 2009; Rapantzikos et al., 2011; Ostwald et al., 2012; Avila and Martínez, 2014), with Bayesian surprise (WOW) receiving more attention than skewness in variable time (SKV). Acoustic landmarks (LM; Stevens, 2000, 2002) are of course specific to speech processing, having been applied to stress detection, the measurement of speech intelligibility in dysphonia, speech analysis in Parkinson's disease, etc. (Fell et al., 2015; Boyce et al., 2013). Here we compared the event-detection performance of these three algorithms with each other and with expert-annotated phoneme segmentation (the ground truth in terms of linguistic expertise). In a fully crossed design, we tested the three algorithms on three different languages and on male and female speakers. The main findings are the following: 1) although none of the algorithms matched the expert segmentation closely, SKV produced the best match of the three, followed by LM and WOW; 2) WOW detected a subset of the events found by SKV; 3) although largely overlapping, WOW and LM detected slightly different sets of events, with SKV covering the WOW events more completely than the LM events; 4) SKV and WOW matched the expert annotation better for male than for female speakers; and 5) the algorithms performed more similarly to the expert segmentation for Hungarian than for English and German.

It is somewhat surprising that the SKV algorithm, which was not specifically developed for speech segmentation, outperformed the LM algorithm, which utilizes some specific speech features. (Note, however, that for the sake of comparing across the three algorithms, the results shown here are based on the phoneme onsets only, thus reducing the match with the expert-segmented boundaries for LM.) The superiority of the SKV algorithm suggests that abrupt spectral changes are essential cues in speech segmentation. This ties together the linguistic approach to speech segmentation and the segmentation performed by the human brain, as SKV is the only one of the three algorithms that is plausible in terms of neural implementation. Indeed, when constructing their biologically plausible model, Coath and Denham (2007) argued that within-channel transient-sensitive processing on multiple frequency-related time scales is related to the goal of efficient coding of naturalistic, behaviorally relevant stimuli. The summed SKV gives phasic peaks that can be regarded as events or short time windows representing changes in overall energy or spectral content, a dynamic acoustic feature to which the human brain is known to be sensitive (for a review, see Näätänen and Picton, 1987).

In a recent study, Khalighinejad and colleagues (2017) recorded event-related brain potentials (ERPs; electric brain responses time-locked to stimulus events) to phoneme boundaries (the authors termed these responses phoneme-related potentials: PRPs). They found that the ERP responses to different phoneme categories were organized by phonetic features: specifically, the manner of articulation proved to be the dominant feature, followed by the place of articulation. These results thus suggest that the phonetic organization of speech sounds does exist in the human brain and that it is observable with EEG. However, the SKV algorithm (at least the present variant) does not categorize phonetic events.
Therefore, if categorization is needed, the LM method may be employed in combination with SKV. Stevens' landmarks can be categorized into different event types, providing more similarity to the expert annotation. The burst, glottis, and syllabicity types of landmarks can be matched to SKV events and can thus be considered SKV subcategories. WOW events form another subset of the SKV events. Since the WOW events reflect abrupt and unpredictable changes in the speech signal, they could serve as a phrase-sensitive event subset within the SKV events. Events covered by neither the WOW nor the LM algorithm can be identified as quasi-stationary SKV events (Kovács et al., 2015). Further, the finding that sentences starting with vowels are detected better by the LM and WOW algorithms than by SKV can be utilized to identify such events. In addition, the detection of plosive-plosive transitions is unique to the LM algorithm, which could also be utilized to improve the performance of automatic event detection by combining the three algorithms. Further improvements could come from utilizing the offset markers of LM, as these sometimes provide a better match to the linguistic segment boundary. Finally, for a better match with expert segmentations, individual threshold values would need to be established for the SKV and WOW calculations. The segment boundaries defined by such a combined event set could, for example, be used as triggers in ERP studies utilizing the brain's ability to detect salient events.

In contrast to the expected good match between LM and the expert segmentation in English (for which LM was developed), the match was actually higher in German and Hungarian. Since for English we used sentences from the database published by the developers of LM themselves (Stevens, 2002), this finding suggests that the features utilized by LM are truly language-independent and that LM needs no tailoring to any specific language.

Finally, the algorithms' performance with respect to the expert segmentation was higher for male than for female speakers. This may be due to differences between the male and female vocal tract. Wavelength, frequency, and pitch are determined by how quickly the vocal folds open and close, which is strongly related to their size. The glottal wavelength is longer for male speakers, resulting in a lower pitch; this source signal is then shaped by the vocal tract, which is also longer in males. As a result, the structure of a voiced phoneme (fundamental frequency and formants) is clearer for male than for female speakers (Pépiot, 2013). This is also reflected in the shimmer difference between male and female speakers.

In summary, we have shown that algorithms detecting salient events, especially skewness in variable time (SKV), which is based on abrupt spectral changes, provide a reasonable low-cost and fast means for automatic speech segmentation (i.e., approximately as good as the acoustic landmark-based algorithm specifically developed for finding phonetic boundaries). Combining the three algorithms tested (SKV, WOW, LM) may further improve the match to expert segmentation. Importantly, SKV is an algorithm that may be implemented in the human brain. This allows one to test and model how the brain segments speech.

ACKNOWLEDGMENTS

This work was funded by the Hungarian Academy of Sciences' Lendület grant (LP2012/036).

REFERENCES

Aertsen, A. M. H. J., & Johannesma, P. I. M. (1981).
The spectro-temporal receptive field: A functional characteristic of auditory neurons. Biological Cybernetics, 42(2), 133–143. doi:10.1007/BF00336731

Avila, L., & Martínez, E. (2014). Behavior monitoring under uncertainty using Bayesian surprise and optimal action selection. Expert Systems with Applications, 41(14). doi:10.1016/j.eswa.2014.04.031

Baldi, P., & Itti, L. (2010). Of bits and wows: A Bayesian theory of surprise with applications to attention. Neural Networks, 23(5), 649–666. doi:10.1016/j.neunet.2009.12.007

Barniv, D., & Nelken, I. (2015). Auditory streaming as an online classification process with evidence accumulation. PLoS ONE, 10, e0144788. doi:10.1371/journal.pone.0144788

Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.

Boersma, P., & Weenink, D. (2013). Praat: doing phonetics by computer [Computer program]. Version 6.0.19, retrieved 2 June 2016 from http://www.praat.org/

Boyce, S., Fell, H., MacAuslan, J., & Wilde, L. (2010). A platform for automated acoustic analysis for assistive technology. In Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies (pp. 37–43).

Boyce, S., Fell, H., & MacAuslan, J. (2012). SpeechMark: Landmark detection tool for speech analysis. INTERSPEECH 2012, 1894–1897.

Boyce, S., Fell, H., & MacAuslan, J. (2011). Automated tools for identifying syllabic landmark clusters that reflect changes in articulation. MAVEBA 2011, 63–66.

Boyce, S., Speights, M., Ishikawa, K., & MacAuslan, J. (2013). SpeechMark acoustic landmark tool: Application to voice pathology. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 2672–2674).

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. Ninth European Conference on Speech Communication and Technology, 2005, 3–6.

Chen, Y., Song, M., Xue, L., Chen, X., & Wang, M. (2015). An audio-visual human attention analysis approach to abrupt change detection in videos. Signal Processing, 110, 143–154. doi:10.1016/j.sigpro.2014.08.006

Christiansen, T. U., & Henrichsen, P. J. (2011). Objective evaluation of consonant-vowel pairs produced by native speakers of Danish. Proceedings of Forum Acusticum 2011 (ISBN: 978-84-694-1520-7).

Coath, M., Brader, J. M., Fusi, S., & Denham, S. L. (2005). Multiple views of the response of an ensemble of spectro-temporal features support concurrent classification of utterance, prosody, sex and speaker identity. Network, 16(2-3), 285–300.

Coath, M., & Denham, S. L. (2005). Robust sound classification through the representation of similarity using response fields derived from stimuli during early experience. Biological Cybernetics, 93(1), 22–30. doi:10.1007/s00422-005-0560-4

Coath, M., & Denham, S. L. (2007). The role of transients in auditory processing. BioSystems, 89(1-3), 182–189. doi:10.1016/j.biosystems.2006.04.016

Cunillera, T., Toro, J. M., Sebastián-Gallés, N., & Rodríguez-Fornells, A. (2006). The effects of stress and statistical cues on continuous speech segmentation: An event-related brain potential study. Brain Research, 1123(1), 168–178.

Darling, A. M.
(1991). Properties and implementation of the gammatone filter: a tutorial. In Speech Hearing and Language, Work in Progress (University College London, Department of Phonetics and Linguistics), pp. 43–61.

Drullman, R. (1995). Temporal envelope and fine structure cues for speech intelligibility. The Journal of the Acoustical Society of America, 97, 585–592. doi:10.1121/1.413112

Dunlap, A. G., Lin, F., & Liu, R. (2013). Auditory processing for contrast enhancement of salient communication vocalizations. In Proceedings of Meetings on Acoustics, 19, 010025.

Ellingsen, K. (2008). Salient event-detection in video surveillance scenarios. In Proceedings of the 1st ACM Workshop on Analysis and Retrieval of Events/Actions and Workflows in Video Streams (AREA '08), 57. doi:10.1145/1463542.1463552

Fishbach, A., Nelken, I., & Yeshurun, Y. (2001). Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients. Journal of Neurophysiology, 85(6), 2303–2323.

Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360, 815–836.

Fu, Q. J., Zeng, F. G., Shannon, R. V., & Soli, S. D. (1998). Importance of tonal envelope cues in Chinese speech recognition. The Journal of the Acoustical Society of America, 104(1), 505–510. doi:10.1121/1.423251

Garrido, M. I., Kilner, J. M., Stephan, K. E., & Friston, K. J. (2009). The mismatch negativity: A review of underlying mechanisms. Clinical Neurophysiology, 120, 453–463.

Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1-2), 103–138. doi:10.1016/0378-5955(90)90170-T

Gregory, R. L. (1980). Perceptions as hypotheses. Philosophical Transactions of the Royal Society B: Biological Sciences, 290, 181–197.

Heil, P. (1997). Auditory cortical onset responses revisited. II. Response strength. Journal of Neurophysiology, 77(5), 2642–2660.

Itti, L., & Baldi, P. (2009). Bayesian surprise attracts human attention. Vision Research, 49(10), 1295–1306. doi:10.1016/j.visres.2008.09.007

Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50. doi:10.1111/j.1469-8137.1912.tb05611.x

Jepsen, M. L., & Dau, T. (2011). Confusion of Danish consonants in white noise. In Speech Perception and Auditory Disorders (Vol. 60, pp. 143–150).

Ke, Y., Sukthankar, R., & Hebert, M. (2007). Event detection in crowded videos. In Proceedings of the IEEE International Conference on Computer Vision. doi:10.1109/ICCV.2007.4409011

Khalighinejad, B., da Silva, G. C., & Mesgarani, N. (2017). Dynamic encoding of acoustic features in neural responses to continuous speech. Journal of Neuroscience, 37(8), 2176–2185.

Kocsis, Z., Winkler, I., Szalárdy, O., & Bendixen, A. (2014). Effects of multiple congruent cues on concurrent sound segregation during passive and active listening: An event-related potential (ERP) study. Biological Psychology, 100(1), 20–33. doi:10.1016/j.biopsycho.2014.04.005

Kovács, A., Kiss, G., Vicsi, K., Winkler, I., & Coath, M. (2015). Comparison of skewness-based salient event detector algorithms in speech. 6th IEEE Conference on Cognitive Infocommunications (CogInfoCom 2015), Győr, Hungary, ISBN 978-1-4673-8128-4, pp. 285–290.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461. doi:10.1037/h0020279

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36.
doi:10.1016/0010-0277(85)90021-6

Liu, S. A. (1996). Landmark detection for distinctive feature-based speech recognition. The Journal of the Acoustical Society of America, 100. doi:10.1121/1.416983

Liu, S. A. (1995). Landmark detection for distinctive feature-based speech recognition. Doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.

Luck, S. J. (2005). An Introduction to the Event-Related Potential Technique. Cambridge, MA: MIT Press.

Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA - Protein Structure, 405(2), 442–451. doi:10.1016/0005-2795(75)90109-9

Mundhenk, T. N., Einhäuser, W., & Itti, L. (2009). Automatic computation of an image's statistical surprise predicts performance of human observers on a natural image detection task. Vision Research, 49(13), 1620–1637.

Ostwald, D., Spitzer, B., Guggenmos, M., Schmidt, T. T., Kiebel, S. J., & Blankenburg, F. (2012). Evidence for neural encoding of Bayesian surprise in human somatosensation. NeuroImage, 62(1), 177–188. doi:10.1016/j.neuroimage.2012.04.050

Pépiot, E. (2013). Voice, speech and gender: male-female acoustic differences and cross-language variation in English and French speakers. XVèmes Rencontres Jeunes Chercheurs de l'ED 268, Jun 2012, Paris, France.

Phillips, D. P., Hall, S. E., & Boehnke, S. E. (2002). Central auditory onset responses, and temporal asymmetries in auditory perception. Hearing Research, 167(1-2), 192–205. doi:10.1016/S0378-5955(02)00393-3

Picton, T. W. (2010). Human Auditory Evoked Potentials. San Diego: Plural Publishing.

Powers, D. M. W. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1), 37–63.

Roach, H., Dew, A., & Rowlands, P. (1990). Phonetic analysis and the automatic segmentation and labeling of speech sounds. Journal of the International Phonetic Association, 20, 15–21.

Sanders, L. D., Newport, E. L., & Neville, H. J. (2002). Segmenting nonsense: An event-related potential index of perceived onsets in continuous speech. Nature Neuroscience, 5(7), 700–703. doi:10.1038/nn873

Schiel, F. (1999). Automatic phonetic transcription of non-prompted speech. In Proceedings of the ICPhS 1999, San Francisco, August 1999, pp. 607–610.

Shannon, R. V., Zeng, F. G., & Wygonski, J. (1998). Speech recognition with altered spectral distribution of envelope cues. The Journal of the Acoustical Society of America, 104(4), 2467–2476. doi:10.1121/1.423774

Sheik, S., Coath, M., Indiveri, G., Denham, S. L., Wennekers, T., & Chicca, E. (2012). Emergent auditory feature tuning in a real-time neuromorphic VLSI system. Frontiers in Neuroscience, 6(FEB). doi:10.3389/fnins.2012.00017

Slaney, M. (1993). Auditory toolbox. Apple Computer Company: Apple Technical Report, 45, 1–41.

Smith, L. S. (1995). Onset-based sound segmentation. Advances in Neural Information Processing Systems (NIPS), 8, 729–735.

Speights, M., Boyce, S., MacAuslan, J., & Fell, H. (2015). Measurement of child speech complexity using acoustic landmark detection. Journal of the Acoustical Society of America, 137, 2301. doi:10.1121/1.4920400

Stevens, K. N., et al. (1992). Implementation of a model for lexical access based on features. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).
Stevens, K. N. (2000). Diverse acoustic cues at consonantal landmarks. Phonetica, 57(2-4), 139–151.
Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111(4), 1872–1891. doi:10.1121/1.1458026
Szabó, B. T., Denham, S. L., & Winkler, I. (2016). Computational models of auditory scene analysis: A review. Frontiers in Neuroscience, 10:524. doi:10.3389/fnins.2016.00524
Wacongne, C., Labyt, E., van Wassenhove, V., Bekinschtein, T., Naccache, L., & Dehaene, S. (2011). Evidence for a hierarchy of predictions and prediction errors in human cortex. Proceedings of the National Academy of Sciences of the USA, 108, 20754–20759.
Wiegrebe, L. (2001). Searching for the time constant of neural pitch extraction. Journal of the Acoustical Society of America, 109(3), 1082–1091.
Winkler, I. (2007). Interpreting the mismatch negativity (MMN). Journal of Psychophysiology, 21, 147–163.

APPENDICES

Appendix I.

Table 1: Comparison between the SKV, WOW, and LM algorithms on the False Positive (FP) measure (mean, with standard deviation in brackets). Rows give the base method, which serves as the reference; columns give the method compared against it, divided to separately indicate the values for the male (M) and female (F) speakers.

FALSE POSITIVE COMPARISON
BASE          SKV M           SKV F           WOW M           WOW F           LM M            LM F
SKV  HUN      -               -               0.9 (1.2096)    1.2 (1.2814)    4.4 (2.0365)    5.4 (2.4366)
SKV  ENG      -               -               1 (1.2978)      0.7 (0.9234)    3.35 (1.4244)   3.2 (1.4364)
SKV  GER      -               -               1.2 (1.105)     0.7 (0.9234)    4.75 (2.633)    6.7 (2.4516)
WOW  HUN      8.25 (4.0636)   7.5 (5.3852)    -               -               8.15 (2.8704)   8.8 (4.0471)
WOW  ENG      5.8 (2.4836)    4.85 (2.4767)   -               -               5.4 (1.9595)    6.05 (2.3725)
WOW  GER      7.9 (4.734)     6.85 (4.2584)   -               -               8.5 (4.6396)    9.55 (3.8726)
LM   HUN      6.3 (2.2965)    5.5 (3.8862)    2.7 (1.3803)    2.7 (2.5976)    -               -
LM   ENG      5 (1.9467)      2.9 (1.7313)    2.4 (1.667)     1.8 (1.8806)    -               -
LM   GER      5.9 (3.12)      5 (3.6992)      3.1 (1.944)     1.7 (1.4179)    -               -

Table 2: Comparison between the SKV, WOW, and LM algorithms on the False Negative (FN) measure (mean, with standard deviation in brackets). Rows give the base method, which serves as the reference; columns give the method compared against it, divided to separately indicate the values for the male (M) and female (F) speakers.

FALSE NEGATIVE COMPARISON
BASE          SKV M           SKV F           WOW M           WOW F           LM M            LM F
SKV  HUN      -               -               8.25 (4.0636)   7.5 (5.3852)    6.3 (2.3193)    5.7 (3.9749)
SKV  ENG      -               -               5.8 (2.4836)    4.85 (2.4767)   5.2 (1.9894)    3.1 (1.8035)
SKV  GER      -               -               7.9 (4.734)     6.8 (4.2584)    6.15 (3.3131)   5.1 (3.7403)
WOW  HUN      0.9 (1.2096)    1.2 (1.2814)    -               -               2.7 (1.4546)    2.8 (2.5874)
WOW  ENG      1 (1.2978)      0.7 (0.9234)    -               -               2.5 (1.7014)    1.8 (1.886)
WOW  GER      1.2 (1.105)     0.7 (0.9234)    -               -               3.2 (1.9358)    1.8 (1.5079)
LM   HUN      4.4 (1.9841)    5.25 (2.3592)   8.15 (2.8149)   8.7 (4.1688)    -               -
LM   ENG      3.15 (1.4244)   3.05 (1.4681)   5.35 (2.0072)   6.05 (2.3725)   -               -
LM   GER      4.55 (2.7043)   6.6 (2.4149)    8.4 (4.5584)    9.45 (3.8997)   -               -

Table 3: Comparison between the SKV, WOW, and LM algorithms on the precision measure (mean, with standard deviation in brackets). Rows give the base method, which serves as the reference; columns give the method compared against it, divided to separately indicate the values for the male (M) and female (F) speakers.
PRECISION COMPARISON
BASE          SKV M            SKV F            WOW M            WOW F            LM M             LM F
SKV  HUN      -                -                0.9285 (0.0885)  0.9003 (0.0940)  0.7255 (0.1094)  0.6612 (0.1517)
SKV  ENG      -                -                0.9129 (0.1113)  0.9387 (0.0769)  0.7270 (0.1006)  0.7633 (0.0847)
SKV  GER      -                -                0.9124 (0.0756)  0.9288 (0.0901)  0.7293 (0.1243)  0.612 (0.0842)
WOW  HUN      0.5456 (0.1662)  0.5687 (0.1931)  -                -                0.4841 (0.1425)  0.4749 (0.1892)
WOW  ENG      0.5906 (0.1728)  0.6257 (0.1816)  -                -                0.5452 (0.1656)  0.5477 (0.1454)
WOW  GER      0.6056 (0.1533)  0.5818 (0.1332)  -                -                0.5342 (0.1772)  0.4542 (0.1038)
LM   HUN      0.6483 (0.1056)  0.4749 (0.1492)  0.7438 (0.1214)  0.7875 (0.1463)  -                -
LM   ENG      0.6461 (0.1396)  0.7848 (0.1081)  0.7522 (0.2017)  0.8459 (0.1256)  -                -
LM   GER      0.6964 (0.1342)  0.7059 (0.1152)  0.7521 (0.1128)  0.8407 (0.1213)  -                -

Table 4: Comparison between the SKV, WOW, and LM algorithms on the recall measure (mean, with standard deviation in brackets). Rows give the base method, which serves as the reference; columns give the method compared against it, divided to separately indicate the values for the male (M) and female (F) speakers.

RECALL COMPARISON
BASE          SKV M            SKV F            WOW M            WOW F            LM M             LM F
SKV  HUN      -                -                0.5456 (0.1662)  0.5686 (0.4836)  0.6486 (0.1051)  0.6886 (0.1515)
SKV  ENG      -                -                0.5905 (0.1728)  0.6257 (0.1816)  0.6329 (0.1361)  0.7752 (0.1085)
SKV  GER      -                -                0.6056 (0.1533)  0.5818 (0.1332)  0.6893 (0.1326)  0.7002 (0.1186)
WOW  HUN      0.9285 (0.0885)  0.9003 (0.0940)  -                -                0.7457 (0.1200)  0.776 (0.1395)
WOW  ENG      0.9128 (0.1113)  0.9387 (0.0769)  -                -                0.7430 (0.2032)  0.8459 (0.1256)
WOW  GER      0.9124 (0.0756)  0.9288 (0.0901)  -                -                0.7465 (0.1083)  0.8329 (0.1257)
LM   HUN      0.7246 (0.1054)  0.6697 (0.1485)  0.4823 (0.1407)  0.4836 (0.1973)  -                -
LM   ENG      0.7425 (0.1061)  0.7739 (0.0942)  0.5538 (0.1724)  0.5477 (0.1454)  -                -
LM   GER      0.7378 (0.1332)  0.6173 (0.0833)  0.5379 (0.1777)  0.4598 (0.1091)  -                -

Appendix II.

Table 1: Mean False Positive (FP) values (standard deviation in brackets) calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male (M) and female (F) speakers (row divisions).

FALSE POSITIVE COMPARISON
            SKV              WOW               LM
HUN  M      4.4 (2.2572)     1.35 (1.5313)     5.4 (1.875)
     F      5.4 (3.3935)     2.3 (2.3418)      5.95 (2.2118)
ENG  M      4.3 (1.4903)     2.15 (1.4609)     4.95 (1.6694)
     F      4.15 (2.3005)    2.55 (1.7313)     5.5 (2.0647)
GER  M      5.05 (2.5438)    2.55 (1.572)      6.25 (2.3592)
     F      4.55 (2.9285)    2.3 (1.8382)      6.5 (2.5026)

Table 2: Mean False Negative (FN) values (standard deviation in brackets) calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male (M) and female (F) speakers (row divisions).

FALSE NEGATIVE COMPARISON
            SKV              WOW               LM
HUN  M      13.45 (3.4864)   17.75 (5.2703)    16.35 (4.44)
     F      15.7 (3.1473)    18.9 (4.8764)     16.55 (4.5477)
ENG  M      13.5 (3.7487)    16.15 (3.376)     16 (3.5244)
     F      14.6 (3.3309)    17.15 (3.9772)    15.85 (3.4683)
GER  M      22.35 (8.3368)   26.55 (10.9375)   24.95 (9.1045)
     F      25 (8.9443)      28.9 (11.4059)    25.35 (9.6533)

Table 3: Mean precision values (standard deviation in brackets) calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male (M) and female (F) speakers (row divisions).
PRECISION COMPARISON
            SKV              WOW              LM
HUN  M      0.7648 (0.0906)  0.8937 (0.0944)  0.6581 (0.0914)
     F      0.7068 (0.1495)  0.8107 (0.172)   0.6359 (0.1042)
ENG  M      0.7002 (0.0807)  0.7866 (0.1193)  0.596 (0.112)
     F      0.7014 (0.1148)  0.7261 (0.1834)  0.5953 (0.1088)
GER  M      0.736 (0.0892)   0.7859 (0.1337)  0.6443 (0.0822)
     F      0.7238 (0.1112)  0.7784 (0.1397)  0.6297 (0.0688)

Table 4: Mean recall values (standard deviation in brackets) calculated for the three methods (columns) against the expert segmentation as ground truth. The results are shown separately for the three languages (rows) and for male (M) and female (F) speakers (row divisions).

RECALL COMPARISON
            SKV              WOW              LM
HUN  M      0.5037 (0.0558)  0.3444 (0.0997)  0.3941 (0.0882)
     F      0.4076 (0.0808)  0.2936 (0.1053)  0.3868 (0.0792)
ENG  M      0.4291 (0.0737)  0.307 (0.0803)   0.3159 (0.075)
     F      0.3856 (0.0566)  0.2744 (0.1114)  0.3322 (0.0574)
GER  M      0.3878 (0.0804)  0.2809 (0.1072)  0.3181 (0.0602)
     F      0.3145 (0.0856)  0.2165 (0.0917)  0.3111 (0.0735)
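Note on the measures. The precision and recall values in the tables above follow the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), where a true positive (TP) is a detected event that can be paired with an event in the reference (base-method or expert) segmentation. The sketch below illustrates one way such TP/FP/FN counts can be obtained from two lists of event times; it is not the code used in this study, and the 20 ms tolerance window, the greedy one-to-one matching, and all function and variable names are illustrative assumptions.

```python
# Illustrative sketch only -- not the implementation used in this study.
# Assumed (not from the paper): the 20 ms tolerance, the greedy matching,
# and all names below.

def match_events(reference, detected, tol=0.02):
    """Count TP/FP/FN between two sorted lists of event times (seconds).

    A detected event is a true positive if an as-yet-unmatched reference
    event lies within +/- tol of it; otherwise it is a false positive.
    Reference events left unmatched are false negatives.
    """
    used = [False] * len(reference)
    tp = 0
    for t in detected:
        # find the nearest unmatched reference event within the tolerance
        best_i, best_d = None, tol
        for i, r in enumerate(reference):
            d = abs(r - t)
            if not used[i] and d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            used[best_i] = True
            tp += 1
    fp = len(detected) - tp   # detections with no nearby reference event
    fn = used.count(False)    # reference events missed by the detector
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Toy example: expert boundaries vs. detector output (times in seconds).
expert = [0.12, 0.31, 0.55, 0.83, 1.02]
detector = [0.13, 0.54, 0.85, 1.20]
tp, fp, fn = match_events(expert, detector)
print(tp, fp, fn, precision_recall(tp, fp, fn))  # 3 TP, 1 FP, 2 FN
```

Averaging such per-utterance counts and ratios over the recordings of each language and speaker gender would yield tables of the form shown above.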
Corresponding author: Annamaria Kovacs ([email protected])
