Big data for sound recognition part 1: deep learning

NXP Software is akin to a spider in the middle of a web. At every stage we listen to customer requirements and focus on the user experience. We consistently invest in innovation, working with only the best partners to achieve this.

NXP is also taking the lead in R&D – and we have been making a lot of progress. To get the most out of this R&D work, NXP Software has been working in close partnership with Tampere University.

When it comes to audio, NXP Software understands the user experience, which is why we recognize the value of context data for the end user.

We are now publishing three articles on audio context data, because we believe this topic is about expertise, partnerships and customer experience. Each of the three deserves its own dedicated attention.

The first part of our blog trilogy – ‘Big data for sound recognition’ – looks in detail at collecting audio data and “deep learning”. Read on to learn how NXP Software is moving forward in this area, and make sure you don’t miss part two, “Brand new research into audio context recognition”, and part three!

Big data for sound recognition

Improving automatic sound recognition by training algorithms with robust data

For a phone to recognize a specific sound – someone’s voice, an alarm bell, a door closing – you need to create a model based on its unique characteristics, much like an acoustic fingerprint. Typically, these models are created by analyzing recordings of the sound with highly advanced acoustic algorithms. At first this might seem simple enough: record a sound, analyze it, and build a suitable model that can be used to recognize the sound again.
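
To make that first step concrete, here is a minimal sketch of how a single recording can be summarized as a compact feature vector – a crude “acoustic fingerprint” – using the open-source librosa library. The file name and parameters are placeholders, and this is an illustration rather than NXP Software’s actual algorithms.

```python
# Minimal sketch: summarize one recording as a fixed-length feature vector.
# The file name is a placeholder; librosa is an open-source audio library.
import numpy as np
import librosa

# Load the recording and compute MFCCs, a common compact description of a
# sound's spectral shape over time.
y, sr = librosa.load("door_closing.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20, n_frames)

# The mean and spread of each coefficient over time give one fixed-length
# "fingerprint" per clip, suitable as input to a classifier.
fingerprint = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(fingerprint.shape)   # (40,)
```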

Data for deep learning

However, in reality this is far more challenging. The models need to be robust enough to recognize the same sound, or class of sound (a collision, an alarm event, etc.), no matter what the conditions were when it was made. This requires two things: a wide variety of data and a well-crafted ‘deep learning’ algorithm, i.e. one that can learn by itself without being explicitly programmed.
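
As a rough illustration of what such a learning algorithm can look like, the sketch below trains a small neural network (using the open-source Keras API) to map fingerprint vectors like the one above to sound classes. The layer sizes, class count and stand-in data are assumptions made for the example; they are not NXP Software’s actual deep learning models.

```python
# Hypothetical sketch: a small neural network that learns to map per-clip
# feature vectors (e.g. 40-dimensional fingerprints) to sound classes.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5   # e.g. alarm, collision, speech, door, other (illustrative)

# Stand-in data: in practice these would be fingerprints and labels drawn
# from a large, varied audio database.
x_train = np.random.rand(1000, 40).astype("float32")
y_train = np.random.randint(0, NUM_CLASSES, size=1000)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The network improves its own internal representation from examples rather
# than from hand-written rules.
model.fit(x_train, y_train, validation_split=0.2, epochs=10, batch_size=32)
```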

To make a robust and accurate model of a sound, the acoustic algorithms must be tested against a wide array of different sound samples. The samples must cover as many variations of the sound and its surrounding environment as possible.

Data for our algorithms

With our leading expertise on acoustic algorithms, NXP Software is well equipped to develop superior audio models. We just need a huge amount of data! In collaboration with universities, we are extending our existing audio database by collecting audio recordings of sounds from all around the world.

Together with these universities, we have developed a dedicated testing infrastructure for audio sensing solutions. The LifeVibes 6 Sense ecosystem includes a data cloud and related mobile apps. This provides the scalability needed to meet all customer requirements, from low power to high connectivity – two key issues for wearables.

In addition to the audio data, 6 Sense captures data from all the other sensors in the phone including pressure, temperature and motion. Then, by combining our audio know-how with the information provided by the other sensors, we can create meaningful solutions.
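
To give a simple, hypothetical flavor of what combining audio with other sensors can mean, the snippet below fuses the output of an audio classifier with motion and pressure readings using hand-written rules. All names and thresholds here are invented for illustration; this is not the 6 Sense implementation.

```python
# Hypothetical illustration: combine an audio classifier's output with other
# phone sensors to infer a richer context. Names and thresholds are invented.

def infer_context(audio_probs, motion_magnitude_g, pressure_hpa):
    """Very simple rule-based fusion of audio and sensor cues."""
    top_sound = max(audio_probs, key=audio_probs.get)

    # The same sound means something different depending on the other sensors:
    # an alarm heard while the user is moving fast is not the same situation
    # as an alarm heard while the phone lies still on a desk.
    if top_sound == "alarm" and motion_magnitude_g > 1.5:
        return "alarm while the user is on the move"
    if top_sound == "traffic" and pressure_hpa < 950:
        return "traffic noise at altitude"
    return f"{top_sound} in a stationary setting"

print(infer_context({"alarm": 0.8, "speech": 0.2},
                    motion_magnitude_g=2.1, pressure_hpa=1010))
```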

Machine learning

To ensure the data we collect is robust, we follow best practices in seven areas (a minimal metadata sketch follows this list):

  • Space – e.g. office recordings are made in different offices under different conditions (quiet, voices, printer noise, etc.), and in different locations around the world
  • Time – recordings made at different times of day for training and testing, with recordings of different contexts made at different hours and on different days
  • System – using different phones makes the solution robust against channel diversity and microphone type
  • User – we make recordings with multiple users, because one person’s recording habits (time, location, etc.) can limit diversity
  • Semantics – people have different interpretations of a class, e.g. an alarm event can be a siren to one person, a ringtone to another, a buzzer, a bell, and so on
  • Acoustic nature – audio of the same class can sound different depending on the source, e.g. the sound of a collision on wood, metal, ceramic, etc.
  • Quality – clean recordings reveal a signal’s acoustic properties, while noisy recordings are closer to reality: both are necessary to build a robust model
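
One simple way to keep track of all seven areas is to attach metadata to every recording in the database. The sketch below shows what such a record could look like; the field names are illustrative assumptions, not the actual LifeVibes 6 Sense schema.

```python
# Illustrative metadata record covering the seven data-collection axes above.
from dataclasses import dataclass

@dataclass
class RecordingMetadata:
    label: str            # semantic class, e.g. "alarm", "collision"
    space: str            # where it was recorded, e.g. "open-plan office"
    timestamp: str        # when it was recorded (ISO 8601)
    system: str           # capturing device and microphone
    user_id: str          # who made the recording
    acoustic_source: str  # acoustic nature, e.g. "collision on wood"
    quality: str          # "clean" or "noisy"

example = RecordingMetadata(
    label="alarm",
    space="open-plan office, Tampere",
    timestamp="2015-06-12T09:30:00",
    system="handset A, bottom microphone",
    user_id="user-042",
    acoustic_source="electronic buzzer",
    quality="noisy",
)
print(example)
```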

More data, more accuracy

Covering all these permutations means collecting a huge amount of data in an extensive database. This allows the algorithms to be tested against many different sounds, so that they can learn how to distinguish between sounds and identify similar ones. This intelligent training of the algorithms enables us to make very accurate acoustic models.
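
One way to see whether the extra data really buys robustness is to hold out entire groups of recordings – for example everything captured at one location or on one device – and test on them. The sketch below does this with scikit-learn; the stand-in data and the simple classifier are assumptions for illustration, standing in for a real audio database and model.

```python
# Sketch: test robustness by holding out whole recording locations from
# training. Features, labels and location IDs are generated as stand-ins.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((300, 40))                  # per-clip feature vectors
y = rng.integers(0, 3, size=300)           # class labels
groups = rng.integers(0, 5, size=300)      # recording-location IDs

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("mean accuracy on unseen locations:", np.mean(scores))
```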

The 6 Sense ecosystem allows customers not only to evaluate our solutions through the cloud, but also to upload their own specific datasets. By drawing on this large and growing database of information, we can create audio sensing models tailored to specific devices and/or applications.

Incorporating LifeVibes Sensing solutions into their devices allows phone manufacturers to address specific consumer needs and application developers to create new apps capable of identifying and utilizing any class of sounds. The robust, highly accurate sound models also enable plenty of personalization, allowing users to tailor their mobile devices to their unique personality and requirements.

Additional information

The AudioSense product page and our Video section have more information about audio context and audio sensing.

Next article in this series

Part 2: Brand new research into audio context recognition
