If you find voice assistants frustratingly dumb, you're hardly alone. The much-hyped promise of AI-driven vocal convenience quickly falls through the cracks of robotic pedantry.
A smart AI that has to come back again (and sometimes again) to ask for extra input to execute your request can seem especially dumb when, for example, it doesn't get that the most likely repair shop you're asking about is not just any one of them but the one you're parked outside of right now.

Researchers at the Human-Computer Interaction Institute at Carnegie Mellon University, working with Gierad Laput, a machine learning engineer at Apple, have devised a demo software add-on for voice assistants that lets smartphone users boost the savvy of an on-device AI by giving it a helping hand (or rather a helping head).

The prototype system makes simultaneous use of a smartphone's front and rear cameras to locate the user's head in physical space, and more specifically within the immediate surroundings, which are parsed to identify objects in the vicinity using computer vision. The user can then use their head as a pointer, directing their gaze at whatever they're talking about (i.e. 'that garage') and wordlessly filling in contextual gaps in the AI's understanding in a way the researchers contend is more natural.

So, instead of needing to talk like a robot in order to tap the utility of a voice AI, you can sound a bit more, well, human. Asking stuff like 'Siri, when does that Starbucks close?' Or, in a retail setting, 'are there other color options for that sofa?', or asking for an instant price comparison between 'this chair and that one'. Or for a lamp to be added to your wish-list.

In a home/office scenario, the system could also let the user remotely control a variety of devices within their field of vision, without needing to be hyper-specific about it. Instead they could just look towards the smart TV or thermostat and speak the required volume/temperature adjustment.

The team has put together a demo video (below) showing the prototype, which they've called WorldGaze, in action.
“We use the iPhone’s front-facing camera to track the head in 3D, including its direction vector. Because the geometry of the front and back cameras are known, we can raycast the head vector into the world as seen by the rear-facing camera,” they explain in the video. “This allows the user to intuitively define an object or region of interest using the head gaze. Voice assistants can then use this contextual information to make enquiries that are more precise and natural.”

In a research paper presenting the prototype they also suggest it could be used to “help to socialize mobile AR experiences, currently typified by people walking down the street looking down at their devices”.

Asked to expand on this, CMU researcher Chris Harrison told TechCrunch: “People are always walking and looking down at their phones, which isn’t very social. They aren’t engaging with other people, or even looking at the beautiful world around them. With something like WorldGaze, people can look out into the world, but still ask questions to their smartphone. If I’m walking down the street, I can inquire and listen about restaurant reviews or add things to my shopping list without having to look down at my phone. But the phone still has all the smarts. I don’t have to buy something extra or special.”

In the paper they note there is a large body of research related to tracking users’ gaze for interactive purposes, but a key aim of their work here was to develop “a functional, real-time prototype, constraining ourselves to hardware found on commodity smartphones”. (Although the rear camera’s field of view is one potential limitation they discuss, including suggesting a partial workaround for any hardware that falls short.)
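To make the raycasting step concrete, here is a minimal sketch of the geometric idea the researchers describe: re-express the head's gaze direction from the front camera's frame into the rear camera's frame, project it onto the rear image, and see which detected object it lands on. The paper does not publish code, so every name here, the simple pinhole camera model, and the assumption that the two cameras are related by a 180-degree rotation about the device's vertical axis are all illustrative simplifications, not WorldGaze's actual implementation.

```python
import numpy as np

def head_ray_to_rear_camera(head_dir_front: np.ndarray) -> np.ndarray:
    """Re-express the head's gaze direction (front-camera coordinates)
    in the rear camera's frame. The two cameras face opposite directions,
    so to first order this is a 180-degree rotation about the device's
    vertical (y) axis — a simplification; real calibration data would
    also account for the offset between the camera centres."""
    R = np.array([[-1.0, 0.0,  0.0],
                  [ 0.0, 1.0,  0.0],
                  [ 0.0, 0.0, -1.0]])
    d = R @ head_dir_front
    return d / np.linalg.norm(d)

def project_to_image(direction, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a ray through the rear camera centre onto
    the rear image plane; intrinsics are made-up placeholder values."""
    x, y, z = direction
    if z <= 0:          # ray points behind the rear camera
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def gazed_object(pixel, detections):
    """Pick the detected object — (label, (x0, y0, x1, y1)) bounding
    boxes from a vision model — that contains the gaze pixel."""
    if pixel is None:
        return None
    px, py = pixel
    for label, (x0, y0, x1, y1) in detections:
        if x0 <= px <= x1 and y0 <= py <= y1:
            return label
    return None

# Example: a slight leftward head turn lands on a detected "garage".
detections = [("garage", (200, 100, 400, 380)),
              ("Starbucks", (450, 100, 620, 380))]
ray = head_ray_to_rear_camera(np.array([0.1, 0.0, -1.0]))
print(gazed_object(project_to_image(ray), detections))  # prints "garage"
```

The final label is exactly the contextual slot the voice assistant needs: "that garage" resolves to a specific detection rather than a map search.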
“Although WorldGaze could be launched as a standalone application, we believe it is more likely for WorldGaze to be integrated as a background service that wakes upon a voice assistant trigger (e.g., “Hey Siri”),” they also write. “Although opening both cameras and performing computer vision processing is energy consumptive, the duty cycle would be so low as to not significantly impact battery life of today’s smartphones. It may even be that only a single frame is needed from both cameras, after which they can turn back off (WorldGaze startup time is 7 sec). Using bench equipment, we estimated power consumption at ~0.1 mWh per inquiry.”
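A quick back-of-the-envelope check supports the authors' claim that the duty cycle is negligible. The ~0.1 mWh per-inquiry figure is theirs; the ~11 Wh battery capacity is an assumed value for a recent iPhone, not something from the paper:

```python
battery_wh = 11.0    # assumed capacity of a recent iPhone battery (Wh)
inquiry_mwh = 0.1    # per-inquiry energy cost reported by the authors (mWh)

inquiries_per_charge = battery_wh * 1000 / inquiry_mwh
print(round(inquiries_per_charge))  # prints 110000
```

On the order of a hundred thousand inquiries per full charge, which is why the authors argue battery impact would be insignificant in practice.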