If you find voice assistants frustratingly dumb, you're hardly alone. The much-hyped promise of AI-driven vocal convenience quickly falls through the cracks of robotic pedantry.
A smart AI that has to come back again (and sometimes again) to ask for extra input to execute your request can seem especially dumb when, for example, it doesn't get that the most likely repair shop you're asking about is not just any one of them but the one you're parked outside of right now.

Researchers at the Human-Computer Interaction Institute at Carnegie Mellon University, working with Gierad Laput, a machine learning engineer at Apple, have devised a demo software add-on for voice assistants that lets smartphone users boost the savvy of an on-device AI by giving it a helping hand (or rather a helping head).

The prototype system makes simultaneous use of a smartphone's front and rear cameras to locate the user's head in physical space, and more specifically within the immediate surroundings, which are parsed to identify objects in the vicinity using computer vision. The user can then use their head as a pointer, directing their gaze at whatever they're talking about (i.e. 'that garage') and wordlessly filling in contextual gaps in the AI's understanding in a way the researchers contend is more natural.

So, instead of needing to talk like a robot in order to tap the utility of a voice AI, you can sound a bit more, well, human. Asking stuff like 'Siri, when does that Starbucks close?' Or, in a retail setting, 'are there other color options for that sofa?', or asking for an instant price comparison between 'this chair and that one'. Or for a lamp to be added to your wish-list.

In a home/office scenario, the system could also let the user remotely control a variety of devices within their field of vision, without needing to be hyper-specific about it. Instead they could just look towards the smart TV or thermostat and speak the required volume/temperature adjustment.

The team has put together a demo video (below) showing the prototype, which they've called WorldGaze, in action.
“We use the iPhone’s front-facing camera to track the head in 3D, including its direction vector. Because the geometry of the front and back cameras are known, we can raycast the head vector into the world as seen by the rear-facing camera,” they explain in the video. “This allows the user to intuitively define an object or region of interest using the head gaze. Voice assistants can then use this contextual information to make enquiries that are more precise and natural.”

In a research paper presenting the prototype they also suggest it could be used to “help to socialize mobile AR experiences, currently typified by people walking down the street looking down at their devices”.

Asked to expand on this, CMU researcher Chris Harrison told TechCrunch: “People are always walking and looking down at their phones, which isn’t very social. They aren’t engaging with other people, or even looking at the beautiful world around them. With something like WorldGaze, people can look out into the world, but still ask questions to their smartphone. If I’m walking down the street, I can inquire and listen about restaurant reviews or add things to my shopping list without having to look down at my phone. But the phone still has all the smarts. I don’t have to buy something extra or special.”

In the paper they note there is a large body of research related to tracking users’ gaze for interactive purposes, but a key aim of their work here was to develop “a functional, real-time prototype, constraining ourselves to hardware found on commodity smartphones”. (Although the rear camera’s field of view is one potential limitation they discuss, including suggesting a partial workaround for any hardware that falls short.)
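To make the raycasting step concrete, here is a minimal sketch of the geometric idea the researchers describe: re-express the head's gaze direction from the front camera's frame into the rear camera's frame, project it onto the rear image, and see which detected object it lands on. The paper does not publish code, so every name here, the simple pinhole camera model, and the assumption that the two cameras are related by a 180-degree rotation about the device's vertical axis are all illustrative simplifications, not WorldGaze's actual implementation.

```python
import numpy as np

def head_ray_to_rear_camera(head_dir_front: np.ndarray) -> np.ndarray:
    """Re-express the head's gaze direction (front-camera coordinates)
    in the rear camera's frame. The two cameras face opposite directions,
    so to first order this is a 180-degree rotation about the device's
    vertical (y) axis — a simplification; real calibration data would
    also account for the offset between the camera centres."""
    R = np.array([[-1.0, 0.0,  0.0],
                  [ 0.0, 1.0,  0.0],
                  [ 0.0, 0.0, -1.0]])
    d = R @ head_dir_front
    return d / np.linalg.norm(d)

def project_to_image(direction, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a ray through the rear camera centre onto
    the rear image plane; intrinsics are made-up placeholder values."""
    x, y, z = direction
    if z <= 0:          # ray points behind the rear camera
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def gazed_object(pixel, detections):
    """Pick the detected object — (label, (x0, y0, x1, y1)) bounding
    boxes from a vision model — that contains the gaze pixel."""
    if pixel is None:
        return None
    px, py = pixel
    for label, (x0, y0, x1, y1) in detections:
        if x0 <= px <= x1 and y0 <= py <= y1:
            return label
    return None

# Example: a slight leftward head turn lands on a detected "garage".
detections = [("garage", (200, 100, 400, 380)),
              ("Starbucks", (450, 100, 620, 380))]
ray = head_ray_to_rear_camera(np.array([0.1, 0.0, -1.0]))
print(gazed_object(project_to_image(ray), detections))  # prints "garage"
```

The final label is exactly the contextual slot the voice assistant needs: "that garage" resolves to a specific detection rather than a map search.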
“Although WorldGaze could be launched as a standalone application, we believe it is more likely for WorldGaze to be integrated as a background service that wakes upon a voice assistant trigger (e.g., “Hey Siri”),” they also write. “Although opening both cameras and performing computer vision processing is energy consumptive, the duty cycle would be so low as to not significantly impact battery life of today’s smartphones. It may even be that only a single frame is needed from both cameras, after which they can turn back off (WorldGaze startup time is 7 sec). Using bench equipment, we estimated power consumption at ~0.1 mWh per inquiry.”
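A quick back-of-the-envelope check supports the authors' claim that the duty cycle is negligible. The ~0.1 mWh per-inquiry figure is theirs; the ~11 Wh battery capacity is an assumed value for a recent iPhone, not something from the paper:

```python
battery_wh = 11.0    # assumed capacity of a recent iPhone battery (Wh)
inquiry_mwh = 0.1    # per-inquiry energy cost reported by the authors (mWh)

inquiries_per_charge = battery_wh * 1000 / inquiry_mwh
print(round(inquiries_per_charge))  # prints 110000
```

On the order of a hundred thousand inquiries per full charge, which is why the authors argue battery impact would be insignificant in practice.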