A new update on Apple’s machine learning blog explores the company’s approach to speaker recognition for detecting “Hey, Siri”. It’s obviously fairly technical, but I found this bit interesting, as it describes how they measure how reliably the key phrase activates Siri:
The overall goal of speaker recognition (SR) is to ascertain the identity of a person using his or her voice. We are interested in “who is speaking,” as opposed to the problem of speech recognition, which aims to ascertain “what was spoken.” SR performed using a phrase known a priori, such as “Hey Siri,” is often referred to as text-dependent SR; otherwise, the problem is known as text-independent SR.
We measure the performance of a speaker recognition system as a combination of an Imposter Accept (IA) rate and a False Reject (FR) rate. It is important, however, to distinguish (and equate) these values from those used to measure the quality of a key-phrase trigger system. For both the key-phrase trigger system and the speaker recognition system, a False Reject (or Miss) is observed when the target user says “Hey Siri” and his or her device does not wake up. This sort of error tends to occur more often in acoustically noisy environments, such as in a moving car or on a bustling sidewalk. We report FR’s as a fraction of the total number of true “Hey Siri” instances spoken by the target user. For the key-phrase trigger system, a False Accept (or False Alarm, FA) is observed when the device wakes up to a non-“Hey Siri” phrase, such as “are you serious” or “in Syria today.” Typically, FA’s are measured on a per-hour basis.
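To make the two metrics concrete, here’s a small sketch of how they could be computed. This is my own illustration with made-up numbers, not Apple’s code; the key point from the passage is that False Rejects are reported as a fraction of true “Hey Siri” utterances, while False Accepts are reported per hour of listening.

```python
def false_reject_rate(missed_triggers, true_utterances):
    """Fraction of genuine 'Hey Siri' utterances that failed to wake the device."""
    return missed_triggers / true_utterances

def false_accepts_per_hour(false_wakes, hours_listened):
    """False Accepts (False Alarms) are reported per hour of audio listened to."""
    return false_wakes / hours_listened

# Hypothetical counts, for illustration only.
fr = false_reject_rate(missed_triggers=3, true_utterances=200)
fa = false_accepts_per_hour(false_wakes=2, hours_listened=100)
print(f"FR rate: {fr:.1%}")       # prints "FR rate: 1.5%"
print(f"FA rate: {fa} per hour")  # prints "FA rate: 0.02 per hour"
```

The asymmetry in units makes sense: a False Reject can only happen when the user actually says the phrase, but a False Accept can happen at any moment the device is listening.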
I’ve been extremely impressed by the performance of “Hey, Siri” over the last couple of years. Not only does it reliably wake my device, it also does not wake my girlfriend’s, and vice versa when she says “Hey, Siri”.
What Siri does after that leaves much to be desired, of course.