
Google AI details the ML behind Pixel 4’s Recorder app

Earlier this month, the Pixel 4’s excellent Recorder app became available for older Google phones. The company today detailed the machine learning powering the entirely on-device transcription tool.

A post on the Google AI blog today starts by explaining the rationale for developing Recorder: speech is a dominant information medium, but current ways of capturing and organizing it are insufficient. The company hopes to make “ideas and conversations even more easily accessible and searchable.”

Over the past two decades, Google has made information widely accessible through search — from textual information, photos, and videos, to maps and jobs. But much of the world’s information is conveyed through speech. Yet even though many people use audio recording devices to capture important information in conversations, interviews, lectures, and more, it can be very difficult to later parse through hours of recordings to identify and extract information of interest.

There are three parts to Recorder. Transcription leverages an automatic speech recognition model based on the all-neural on-device system that debuted in Gboard earlier this year. Since March, Android’s keyboard has offered a “Faster voice typing” option that can be downloaded to work offline and transcribes character by character.
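For a rough sense of what character-by-character transcription looks like, here is a minimal CTC-style greedy decoding sketch in Python. It is a generic stand-in rather than Google’s actual model: the character set, the blank symbol, and the per-frame logits are all assumed for illustration.

```python
import numpy as np

# Hypothetical character set; the real on-device model uses its own vocabulary.
BLANK = 0
CHARS = ["_", " ", "a", "b", "c", "d", "e", "h", "l", "o", "r", "w"]

def greedy_ctc_decode(frame_logits):
    """Collapse per-frame character predictions into text (CTC-style).

    frame_logits: array of shape (num_frames, len(CHARS)) with one score
    per character; index 0 is the blank symbol.
    """
    text = []
    prev = BLANK
    for frame in frame_logits:
        idx = int(np.argmax(frame))          # best character for this frame
        if idx != BLANK and idx != prev:     # skip blanks and repeated frames
            text.append(CHARS[idx])
        prev = idx
    return "".join(text)

# Example with random scores, for illustration only.
logits = np.random.rand(20, len(CHARS))
print(greedy_ctc_decode(logits))
```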

For Recorder, Google optimized the model for long sessions that can span hours, while also “mapping words to timestamps as computed by the speech recognition model.” This indexing allows users to click on a word in the transcript to listen to the corresponding audio.
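A minimal sketch of how such a word-to-timestamp index could work, assuming the recognizer reports a start and end time for each word. The WordSpan type, the millisecond units, and the positional index are illustrative, not Recorder’s actual data model.

```python
from dataclasses import dataclass

@dataclass
class WordSpan:
    word: str
    start_ms: int   # timestamp reported by the recognizer
    end_ms: int

def build_index(word_spans):
    """Map each transcript word position to its audio timestamp."""
    return {i: span for i, span in enumerate(word_spans)}

def seek_for_word(index, word_position):
    """Return the playback offset (in ms) for a tapped transcript word."""
    return index[word_position].start_ms

# Usage: tapping the second word jumps playback to 450 ms.
spans = [WordSpan("hello", 120, 430), WordSpan("world", 450, 800)]
index = build_index(spans)
assert seek_for_word(index, 1) == 450
```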

The next aspect is how best to present this information. Text is handy, but Google argues that visual search based on specific moments and sounds is “more useful.”

To enable this, Recorder additionally represents audio visually as a colored waveform where each color is associated with a different sound category. Each bar covers 50 milliseconds of audio and is colored by the dominant sound during that period. This is done by combining research into using CNNs to classify audio sounds (e.g., identifying a dog barking or a musical instrument playing) with previously published datasets for audio event detection to classify apparent sound events in individual audio frames.
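As a rough illustration of the coloring step, the sketch below assumes a classifier has already produced per-frame category probabilities and simply picks the dominant category for each 50-millisecond bar. The category names and hex colors are made up for the example and are not Recorder’s actual palette.

```python
import numpy as np

# Hypothetical sound categories and colors for illustration.
CATEGORIES = ["speech", "music", "dog", "applause"]
COLORS = {"speech": "#4285F4", "music": "#EA4335",
          "dog": "#FBBC04", "applause": "#34A853"}

def color_waveform(frame_probs, frames_per_bar):
    """Assign each 50 ms bar the color of its dominant sound category.

    frame_probs: (num_frames, num_categories) class probabilities, e.g.
    from an audio-event classifier run on short audio frames.
    frames_per_bar: how many classifier frames fall inside one 50 ms bar.
    """
    bars = []
    for start in range(0, len(frame_probs), frames_per_bar):
        chunk = frame_probs[start:start + frames_per_bar]
        dominant = CATEGORIES[int(np.argmax(chunk.sum(axis=0)))]
        bars.append(COLORS[dominant])
    return bars

# Example: 4 classifier frames per 50 ms bar, random scores for illustration.
probs = np.random.rand(8, len(CATEGORIES))
print(color_waveform(probs, frames_per_bar=4))  # two color codes
```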

Lastly, Google suggests three tags that “represent the most memorable content” once a recording is complete. These suggestions can be used to compose a title so people don’t default to naming recordings by date and time.

To be able to suggest these tags immediately when the recording ends, Recorder analyzes the content of the recording as it is being transcribed.
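A toy sketch of that idea: accumulate term counts while transcript chunks arrive, then return the top three terms the moment the recording stops. This is a simple frequency heuristic for illustration only; Google’s post does not describe its tagging model in these terms.

```python
from collections import Counter

STOPWORDS = {"the", "a", "and", "to", "of", "in", "is", "it", "we"}

class TagSuggester:
    """Accumulate term counts while transcription is in progress so the
    top tags are ready as soon as the recording ends."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, transcript_chunk):
        words = [w.strip(".,!?").lower() for w in transcript_chunk.split()]
        self.counts.update(w for w in words if w and w not in STOPWORDS)

    def suggest(self, k=3):
        return [word for word, _ in self.counts.most_common(k)]

# Usage: feed chunks as they are transcribed, then ask for three tags.
suggester = TagSuggester()
suggester.observe("the quarterly budget review covers the marketing budget")
suggester.observe("budget approvals and marketing headcount")
print(suggester.suggest())  # e.g. ['budget', 'marketing', ...]
```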
