Wake word detection is a common problem for voice assistants. Home Assistant declared 2023 the Year of Voice and developed many of the basic features necessary for a fully open-source, local assistant. Its wake word detection used openWakeWord: ESPHome satellite devices stream all voice audio to a central server, which runs openWakeWord to detect the wake words. While this works well, it adds latency to the Assist pipeline. Training new custom wake words in openWakeWord is easy, but, unfortunately, it relies on a speech embedding model that is too slow to run on ESP32 devices.

In the last decade, many researchers have studied the keyword spotting problem. “Hello Edge: Keyword Spotting on Microcontrollers” is an early paper focusing on models that can run well on edge devices. Google Research published “Streaming keyword spotting on mobile devices” along with source code that makes it easy to train models that predict the wake word in a streaming fashion; i.e., the model outputs a probability for each new slice of processed audio. This approach saves a lot of computation, because the model does not reprocess older audio slices for every inference.
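To make the streaming idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the architecture from the paper or from any real model: `embed`, `StreamingDetector`, and `non_streaming_scores` are hypothetical names, the "expensive" per-slice computation is just a mean absolute amplitude, and the weights are made up. The streaming version caches each slice's result and computes it exactly once, while the non-streaming version recomputes the whole window for every new slice.

```python
from collections import deque
import math


def embed(slice_samples):
    """Stand-in for the expensive per-slice computation (hypothetical feature)."""
    return sum(abs(s) for s in slice_samples) / len(slice_samples)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class StreamingDetector:
    """Toy streaming model: keeps the last k slice features as internal state."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias
        # Fixed-size buffer of cached slice features, zero-filled at start.
        self.cache = deque([0.0] * len(self.weights), maxlen=len(self.weights))
        self.embed_calls = 0

    def step(self, slice_samples):
        """Consume one new audio slice and return a wake word probability."""
        self.cache.append(embed(slice_samples))  # only the new slice is processed
        self.embed_calls += 1
        z = sum(w * x for w, x in zip(self.weights, self.cache)) + self.bias
        return sigmoid(z)


def non_streaming_scores(slices, weights, bias=0.0):
    """Same model, but the whole window is reprocessed for every inference."""
    k = len(weights)
    scores, embed_calls = [], 0
    for i in range(len(slices)):
        window = slices[max(0, i - k + 1): i + 1]
        feats = [embed(s) for s in window]  # older slices are embedded again
        embed_calls += len(window)
        feats = [0.0] * (k - len(feats)) + feats  # zero-pad the first windows
        z = sum(w * x for w, x in zip(weights, feats)) + bias
        scores.append(sigmoid(z))
    return scores, embed_calls


# Made-up audio slices and weights, purely for illustration.
slices = [[0.0, 0.1], [0.2, -0.2], [0.9, -0.8], [0.7, 0.6], [0.1, 0.0], [0.0, 0.0]]
weights = [0.5, 1.0, 2.0]

detector = StreamingDetector(weights, bias=-2.0)
streaming = [detector.step(s) for s in slices]
batch, batch_calls = non_streaming_scores(slices, weights, bias=-2.0)
```

Both versions produce the same probability for every slice, but the streaming detector calls `embed` once per slice, while the non-streaming loop calls it once per slice per window. In a real network the cached state would be intermediate layer activations rather than a single number.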

I wrote microWakeWord to develop streaming wake word models suitable for ESP32-S3 devices. ESPHome’s February 2024 release will add support for microWakeWord models. I heavily adapted the project from Google Research’s code. It currently uses an Inception-based model that is fast enough to run on an ESP32-S3 chip while keeping high accuracy for real-world usage. We generate wake word samples using the Piper sample generator and then augment them using openWakeWord’s utilities. We use additional data sources for negative samples.

I am still in the early phase of developing this! Currently, microWakeWord trains usable models. However, manual tweaking is necessary to produce a model with low false-accept and false-reject rates. The project can convert a folder of audio files into spectrogram features for training, but it does not apply any augmentations to them. I have uploaded two models: “Hey Jarvis” and “Okay Nabu.” I will rework the “Hey Jarvis” model to handle background noise better. The newer “Okay Nabu” model performs very well even with loud music playing, thanks to various model design improvements and a better training process. I encourage you to follow the project on GitHub to see future improvements!
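For readers unfamiliar with the two error rates mentioned above, the following sketch shows how they trade off against the detection threshold. It is an illustration only: `error_rates` is a hypothetical helper that is not part of microWakeWord, and the scores and labels are made up rather than taken from any real model.

```python
def error_rates(scores, labels, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold.

    scores: model probabilities, one per clip.
    labels: 1 if the clip contains the wake word, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    # A false reject is a wake word clip scored below the threshold.
    false_rejects = sum(1 for s in positives if s < threshold)
    # A false accept is a non-wake-word clip scored at or above the threshold.
    false_accepts = sum(1 for s in negatives if s >= threshold)
    return false_accepts / len(negatives), false_rejects / len(positives)


# Made-up evaluation data: three wake word clips, five negative clips.
scores = [0.95, 0.80, 0.40, 0.10, 0.30, 0.70, 0.05, 0.20]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    far, frr = error_rates(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  false accepts={far:.2f}  false rejects={frr:.2f}")
```

Raising the threshold lowers the false-accept rate at the cost of more false rejects, which is part of why some tuning is needed to find a usable operating point.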