Gunshot recognition using low level features in the time domain

This paper explores the use of rarely used time-domain features for the task of gunshot recognition. A set of 11 features derived from the temporal characteristics (waveform) of signals is calculated from a mixed dataset of gunshots and non-gunshots. The features leverage the impulsive nature of gunshots and their dissimilarity to other, especially more stationary, signals. The paper describes the feature extraction, the distribution of the features and their recognition performance on a selected audio dataset. A subset of the features achieves promising results in comparison with more frequently used spectral-domain features. This makes them a valuable addition to those features, especially for impulsive sound recognition.


I. INTRODUCTION
The aim of this paper is to explore some less frequently used features, some of which, to our knowledge, have not been explored before. Most authors dealing with sound recognition mainly use features derived from a spectrum (such as LPC [1][2] and MFCC [3], spectral energy and MPEG-7 descriptors [4] and spectrogram features [5]), while temporal features (e.g. zero-crossing rate) are rare. Some of the more frequently used features, along with the task for which they are used (such as simple detection, localization, classification etc.) and a reference to the formula for their calculation, are listed in Table 1. In our pursuit of temporal features, we assume that gunshots have an unusual waveform but are similar enough to each other. Some of the features might be susceptible to noise; however, at this point we are working with recordings without added noise. Preliminary results show that some of the presented features are usable. If further research proves their viability, an investigation of the effects of various types of noise (white noise, impulsive noise, natural noise, etc.) and possible mitigation of these effects will follow.
The rest of the paper is organized as follows. Section 2 briefly describes the audio dataset we are working with and the preprocessing of input audio. Section 3 defines the proposed features and provides some explanatory figures. Section 4 summarizes the results from a statistical viewpoint, provides detection results using a simple neural network and compares them to other papers using different features.

II. DATASET AND PREPROCESSING
This work uses audio recordings from multiple sources. Gunshots are taken from the Free Firearm Library [8], from which we filtered out far-away recordings (using the energy criterion described at the end of this paragraph) and subsequently created two subsets for testing. The first includes only recordings of an AK-47, which is the most frequent weapon in the library (374 occurrences). The second, larger subset includes all weapons (1483 occurrences). Non-gunshot sounds come from a subset of the Urban Audio Dataset [9], recordings of public spaces from the Airborne website [10], a small number of recordings from Freesound [11], and our own recordings. The non-gunshot set contains 9147 segments. During the selection process, all segments with energy below 0.1 (defined as the sum of squared samples) were omitted in order to exclude silent segments. The recordings contain no added noise beyond the noise present during recording, as this paper does not investigate the influence of noise on feature performance.
All audio frames have a length of 11 ms (486 samples at a sampling frequency of 44.1 kHz) and have been normalized so that the maximum absolute value is 1. The frame length was selected in accordance with our previous research [12].
Matlab 2016b was used for all calculations and audio processing.
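The preprocessing described in this section can be sketched as follows. The original computations were done in Matlab; this Python version is only an illustration. It assumes non-overlapping frames and assumes the energy criterion is applied before normalization, neither of which is stated in the text. The function names are hypothetical.

```python
def frame_energy(frame):
    """Energy of a frame, defined as the sum of squared samples."""
    return sum(x * x for x in frame)


def peak_normalize(frame):
    """Scale a frame so that its maximum absolute value is 1.

    An all-zero frame is returned unchanged to avoid division by zero.
    """
    peak = max(abs(x) for x in frame)
    if peak == 0:
        return list(frame)
    return [x / peak for x in frame]


def split_frames(signal, frame_len=486):
    """Cut a signal into non-overlapping frames (assumption: no overlap).

    A trailing partial frame is dropped.
    """
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def preprocess(signal, frame_len=486, min_energy=0.1):
    """Drop near-silent frames, then peak-normalize the remaining ones.

    Assumption: the energy threshold is checked on the raw frame,
    i.e. before normalization.
    """
    kept = []
    for frame in split_frames(signal, frame_len):
        if frame_energy(frame) < min_energy:
            continue  # silent or far-away segment
        kept.append(peak_normalize(frame))
    return kept
```

With the default arguments, a frame is 486 samples long (about 11 ms at 44.1 kHz) and any frame whose sum of squared samples is below 0.1 is discarded.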

III. INVESTIGATED FEATURES
Gunshots are usually distinctly N-shaped; however, the most dominant extreme is positive for some gunshots and negative for others. Fig. 1 and 2 depict both cases. The first investigated features are the relative positions of the zero-crossings before (RP-) and after (RP+) the most dominant peak, as well as the distance between these zero-crossings (ZDist). These features are depicted in Fig. 3. Fig. 4 presents a histogram of zero-crossing positions and Fig. 5 shows a histogram of distances between these positions for an AK-47.
Further features, some of which are illustrated in Fig. 6, are defined as follows. PDist is the time distance between the maximum and minimum values; it can take positive as well as negative values, as can be seen in Fig. 10, because the minimum sometimes precedes the maximum. PlDist is the distance between these two points in two dimensions. Ang is the angle between a horizontal line and the line connecting the maximum and minimum points; to calculate the angle, the horizontal axis is measured in seconds. Area is the area of the triangle delimited by the two highest peaks and a minimum; it is illustrated in Fig. 8 and Fig. 11. The last four features, shown in Fig. 9, are the coefficients (A and B in equation 1) of an exponential fit to the positive local extremes (Ap, Bp) and to the negative local extremes (An, Bn).
Table 2 shows the mean values and standard deviations of the individual features extracted from AK-47 recordings, as well as the values extracted from all gunshots. The mean value is denoted by µ, the standard deviation by σ, and an overall measure |µ|/σ is introduced (calculated as the ratio between the absolute mean value and the standard deviation).
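A minimal Python sketch of the zero-crossing and peak-geometry features (RP-, RP+, ZDist, PDist, PlDist, Ang) is given here for illustration. Several details are assumptions, since the text does not fix them: the zero-crossing positions are measured in seconds relative to the dominant peak, PDist is taken as the time of the minimum minus the time of the maximum, and Ang is returned in radians with the sign convention of `atan2`. Area and the exponential-fit coefficients are not covered. All function names are hypothetical.

```python
import math


def dominant_peak(frame):
    """Index of the sample with the largest absolute value."""
    return max(range(len(frame)), key=lambda i: abs(frame[i]))


def zero_crossing_before(frame, peak):
    """Index of the nearest sample before the peak with opposite sign (or zero)."""
    for i in range(peak - 1, -1, -1):
        if frame[i] * frame[peak] <= 0:
            return i
    return None  # no crossing inside the frame


def zero_crossing_after(frame, peak):
    """Index of the nearest sample after the peak with opposite sign (or zero)."""
    for i in range(peak + 1, len(frame)):
        if frame[i] * frame[peak] <= 0:
            return i
    return None  # no crossing inside the frame


def waveform_features(frame, fs=44100):
    """Compute six of the time-domain features for one normalized frame."""
    peak = dominant_peak(frame)
    before = zero_crossing_before(frame, peak)
    after = zero_crossing_after(frame, peak)

    i_max = max(range(len(frame)), key=lambda i: frame[i])
    i_min = min(range(len(frame)), key=lambda i: frame[i])
    dt = (i_min - i_max) / fs          # assumed sign: positive if max comes first
    dy = frame[i_min] - frame[i_max]   # amplitude difference (never positive)

    return {
        # Assumption: positions are relative to the dominant peak, in seconds.
        "RP-": None if before is None else (peak - before) / fs,
        "RP+": None if after is None else (after - peak) / fs,
        "ZDist": None if before is None or after is None else (after - before) / fs,
        "PDist": dt,
        "PlDist": math.hypot(dt, dy),  # Euclidean distance in the time-amplitude plane
        "Ang": math.atan2(dy, dt),     # angle to the horizontal axis, in radians
    }
```

Because the time axis is in seconds and the amplitude is normalized to 1, PlDist and Ang are dominated by the amplitude difference for an 11 ms frame; a different axis scaling would change both values.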

IV. RESULTS
To compare the performance of the proposed features, they were extracted from both gunshot categories (AK-47-only and all weapons) and from non-gunshot signals, and subsequently fed to a neural network for recognition. This paper uses the default Matlab implementation of a neural network with one hidden layer of 10 neurons. Since individual features by themselves (dimensionality 1), or even combinations of two features, were not able to recognize gunshots, we compared recognition performance as the number of features increased. In order to achieve the best results with the fewest possible features, we considered ordering them according to several metrics. Firstly, features were ordered from expected best to expected worst according to the |µ|/σ measure presented in Tab. 2 (assuming the greater, the better). The second ordering was based on the mutual information between feature values and class labels. Mutual information was calculated using the Matlab function kernelmi [13], and greater mutual information meant the feature was expected to perform better at recognition (for now, we disregarded the effects of mutual information between individual features in the ordering). The last criterion was based on two-sample t-tests, as calculated by the Matlab function ttest2, comparing the distributions of two datasets. Ordering was done according to the p-value returned by the function, which indicates how compatible the data are with the null hypothesis, i.e. that both samples come from distributions with equal means. The compared distributions were those of features from the gunshot categories against those of features from the non-gunshot category. It follows that the smaller the p-value, the more dissimilar the sets should be and the better suited the feature is for recognition. Although all criteria produced similar results, better results were obtained after slight manual rearrangement of the proposed orders. The final order we settled on is given in the next paragraph, just below the figures.
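Two of the ordering criteria above can be sketched in Python as follows. This is an illustration, not the Matlab code used in the paper. It ranks features by |µ|/σ and by the absolute value of a pooled-variance two-sample t statistic. Ranking by |t| instead of by p-value is an assumption made to stay within the standard library: for fixed sample sizes, a larger |t| corresponds to a smaller p-value, so the resulting order is the same. The mutual-information criterion is not covered, and the function names are hypothetical.

```python
import math
import statistics


def mean_to_std_ratio(values):
    """The |mu|/sigma measure: absolute mean divided by sample standard deviation."""
    return abs(statistics.mean(values)) / statistics.stdev(values)


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def rank_features(gunshot, other):
    """Return two orderings of feature names, expected best first.

    gunshot, other: dicts mapping a feature name to a list of values
    for the gunshot class and the non-gunshot class respectively.
    """
    by_ratio = sorted(
        gunshot,
        key=lambda name: mean_to_std_ratio(gunshot[name]),
        reverse=True,
    )
    by_t = sorted(
        gunshot,
        key=lambda name: abs(pooled_t_statistic(gunshot[name], other[name])),
        reverse=True,
    )
    return by_ratio, by_t
```

Note that |µ|/σ uses only the gunshot values, whereas the t statistic compares the two classes, so the two orderings need not agree.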
Features that are not mentioned are deemed unfit for recognition. Fig. 12 shows the performance (recall (2) and precision (3)) of different numbers of features for all gunshots; the subsequent figure, Fig. 13, represents only AK-47 assault rifle recognition (whose features were sorted in the same manner as for "all gunshots"). For better illustration, Fig. 14 shows the ROC curve for the 7 best features when using all gunshots (which achieved the best results in terms of F-score). The achieved results indicate that although other features (such as LPC or MFCC) might be more effective, there are still many possibilities for feature extraction in the time domain. We conclude that the first 7 features (8 in the case of AK-only) might be useful for the recognition of gunshots and possibly other impulsive signals. These features are: RP- and RP+ (zero-crossings before and after the most dominant peak), the Bn and Bp coefficients of the envelope approximation, PlDist (distance between maximum and minimum), Ang (angle between a horizontal line and the line connecting maximum and minimum), and possibly the distance between zero-crossings (ZDist). The results achieved using the first 7 features (8 in the case of AK-only) are summarized in Table 3. Formulae for the metrics used in Tab. 3 are the following:
$$\mathrm{Recall} = \frac{tp}{tp + fn} \tag{2}$$
$$\mathrm{Precision} = \frac{tp}{tp + fp} \tag{3}$$
where tp are true positives (gunshots classified as gunshots), fn are false negatives (misclassified gunshots) and fp are false positives (non-gunshots flagged as gunshots).
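The metrics described above can be computed from the tp, fn and fp counts as in the following Python sketch. The F-score is assumed here to be the usual F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly. The function names are hypothetical.

```python
def recall(tp, fn):
    """Fraction of gunshots that were classified as gunshots."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """Fraction of segments flagged as gunshots that really were gunshots."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_score(tp, fp, fn):
    """F1 score (assumption): harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def confusion_counts(y_true, y_pred):
    """Count tp, fp and fn from boolean labels (True means gunshot)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn
```

True negatives are not needed for any of these three metrics, which is convenient here because the non-gunshot set is much larger than the gunshot set.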

A. Comparison with other works
This subsection presents the results achieved by other, similar works. It is important to note that the aims of these papers vary: some focus on gunshot detection, others on specific weapon recognition, and yet others have multiple objectives such as general (and/or multiple) event recognition/classification and scene classification. When there are multiple objectives, the one with the closest conditions and best results was chosen. Tab. 4 presents the results achieved by the previously mentioned papers. TPR means "True Positive Rate" (i.e. recall), FPR means "False Positive Rate" and ACC stands for accuracy. Most papers are performance-oriented and thus frequently use high feature dimensionality. The focus of this paper is to explore the usability of new features, so it does not use optimal combinations of existing features. This explains the comparatively worse performance, but also suggests that some of the presented features have the potential to boost recognition performance when used in combination with conventional features. Further research into their compatibility with conventional features is required, but their different nature (as they are not derived from a spectrum) might hint at low mutual information between the two groups.