Employing a Variational Auto-Encoder to Detect Unknown Sounds for Hearing-Impaired People

Authors: Sarafaslanyan A.Kh., Cheprakov V.V., Suvorov D.A., Mozgovoy M.V., Volkov A.V.  
Published: 16.02.2019  
Published in issue: #1(124)/2019  
DOI: 10.18698/0236-3933-2019-1-35-49

Category: Instrument Engineering, Metrology, Information-Measuring Instruments and Systems | Chapter: Instrumentation and Methods to Transform Images and Sound  
Keywords: variational autoencoder, deep learning, sound recognition, digital signal processing, detection, learning

The paper presents a system for detecting unknown sounds for hearing-impaired people, built on a variational autoencoder. We define the architecture of the variational autoencoder, in which both the encoder and the decoder consist of fully connected layers. We describe how the dataset was created and split into training, test and unknown-sound detection subsets. We then describe the training method and the mathematics behind it, including the Adam stochastic optimization method and the variational lower bound used as the loss function. In our tests the system produced no false negative detections for unknown sounds, and the false positive probability was 14 %, which we consider acceptable in practice. We describe the technology used to implement the system and the device intended to house it, and we consider possible ways of further improving the system.
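
As an illustration of a variational lower bound used as a loss function, the following minimal sketch computes the negative lower bound for a single sample in its standard form (reconstruction term plus KL divergence to a standard normal prior). The unit-variance Gaussian decoder, the diagonal Gaussian encoder and the function names are assumptions made for this example; they are not taken from the paper.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def negative_elbo(x, x_recon, mu, log_var):
    """Negative variational lower bound for one sample.

    Assumption: the decoder is a unit-variance Gaussian, so the
    reconstruction term reduces to half the squared error
    (additive constants are dropped).
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)


# Hypothetical values: a 2-dimensional input and a 1-dimensional latent code.
loss = negative_elbo(x=[1.0, 0.0], x_recon=[0.0, 0.0], mu=[1.0], log_var=[0.0])
```

In training, a quantity of this kind would be averaged over a mini-batch and minimized with a stochastic optimizer such as Adam.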

This work was supported by the Innovation Promotion Foundation (grant no. 168GRNTIS5/35848).

