Audio Content Based Geotagging in Multimedia

Anurag Kumar; Benjamin Elizalde; Bhiksha Raj

INTERSPEECH 2017


Abstract

In this paper we propose methods to extract geographically relevant information from a multimedia recording using its audio content. Our method is based primarily on the observation that an urban acoustic environment consists of a variety of sounds. Hence, location information can be inferred from the composition of sound events/classes present in the audio. More specifically, we adopt matrix factorization techniques to obtain the semantic content of a recording in terms of different sound classes. We use semi-NMF to perform audio semantic content analysis on MFCC features. This semantic information is then combined to identify the location of the recording. We show that this semantic-content-based geotagging can perform significantly better than state-of-the-art methods.
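The semi-NMF step can be illustrated with a minimal NumPy sketch. It assumes MFCC frames are stacked as the columns of a mixed-sign matrix X, factored as X ≈ F Gᵀ with bases F unconstrained and activations G nonnegative, and it uses the multiplicative updates of Ding, Li and Jordan for semi-NMF. The function names (semi_nmf, update_activations, class_composition), the idea of one basis block per sound class, and the per-class activation share used as a "composition" summary are hypothetical illustrations, not the authors' exact pipeline; the location classifier that would consume these features is not shown.

```python
import numpy as np

def _pos(A):
    # Positive part of A: (|A| + A) / 2
    return (np.abs(A) + A) / 2.0

def _neg(A):
    # Negative part of A: (|A| - A) / 2, so that A = _pos(A) - _neg(A)
    return (np.abs(A) - A) / 2.0

def update_activations(X, F, G, eps=1e-9):
    """One semi-NMF multiplicative update of the nonnegative activations G,
    holding the mixed-sign bases F fixed (Ding, Li and Jordan style)."""
    XtF = X.T @ F
    FtF = F.T @ F
    num = _pos(XtF) + G @ _neg(FtF)
    den = _neg(XtF) + G @ _pos(FtF) + eps  # eps guards against division by zero
    return G * np.sqrt(num / den)

def semi_nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Hypothetical helper: factor X (d x n, e.g. d MFCCs by n frames) as
    X ~ F @ G.T with F (d x k) of any sign and G (n x k) nonnegative."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], k)) + eps
    for _ in range(n_iter):
        # Least-squares update of F given G: F = X G (G^T G)^+
        F = X @ G @ np.linalg.pinv(G.T @ G)
        G = update_activations(X, F, G, eps)
    return F, G

def class_composition(X, bases, n_iter=200, eps=1e-9):
    """Hypothetical summary: with per-class bases held fixed, estimate
    activations for recording X and return each class's share of the
    total activation mass (an assumed 'semantic content' vector)."""
    names = list(bases)
    F = np.hstack([bases[c] for c in names])
    G = np.ones((X.shape[1], F.shape[1]))
    for _ in range(n_iter):
        G = update_activations(X, F, G, eps)
    scores, start = [], 0
    for c in names:
        k_c = bases[c].shape[1]
        scores.append(G[:, start:start + k_c].sum())  # mass on this class's block
        start += k_c
    scores = np.array(scores)
    return dict(zip(names, scores / scores.sum()))
```

Mixed-sign bases are what make semi-NMF a natural fit here: MFCCs take negative values, so standard NMF does not apply directly, while keeping G nonnegative still lets the activations be read as how much each sound class contributes to a recording.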




Keywords

matrix factorization, audio classification, semantic content, MFCC feature, sound event classification, sound event, audio semantic content, semi-supervised NMF, geographic tagging
