This post is a response to a post by Chris Maury entitled Is the next shift in computer interaction really gestural? Probably not. I encourage you to go read that post. Chris claims an intrinsic advantage for audio-based interfaces, a class of interface I'm involuntarily, exquisitely familiar with: being blind, I use a screen reader for between 14 and 18 hours a day. As a result of that usage over the past decade or so, and of some of my doctoral research, I've included some thoughts below.
First, the phrasing:
Phrasing is important. When I use the word interface, I think of what is presented to the user: the layer upon which the interaction takes place. We might want to switch to the word modality, as that describes the way users interact with an interface, be it speech, gesture, touch, or something else.
So, why choose?
The real issue is asking the right question. Instead of asking what the next modality might be, we should ask whether that is even an appropriate way of viewing the problem. I strongly believe there should not be a single prevalent modality in computing; rather, we should do whatever it takes to facilitate truly multimodal interaction. Multimodal environments date back to the 1980s, with Richard Bolt's rather impressive "Put That There" demonstration at SIGGRAPH in 1980 (YouTube Video), and they have been worked on, evolved, and improved upon ever since. I believe we need to stop trying to limit and restrict interfaces to a single modality of input or output and instead accept that no single interface is appropriate for all people. Speech might seem sexy now, but part of that is novelty, and part of it is our response to the needs that speech solves so well and that existing interfaces do not.
All Modalities Have Problems
The problem with presuming that a single modality is the future of computer interaction is that it overlooks the failures of that specific modality and discounts the great advantages that can be realized from multimodal input. For example, try using Siri or Google Now at a bus stop as a bus passes by. Try listening to an audiobook as a blind passenger on a bus or train, where you need to keep an ear out for the next stop or a schedule change. Try using speech dictation on a mobile device, without earbuds, while it is playing music. All of these things are possible with gestural or other interfaces. There are many cases that go the other way as well, where speech wins the comparison: general text entry, one-handed access, and the like are all easier with speech than with gestural input. And there is even a third class of tasks for which keyboard, mouse, or joystick modalities beat both gesture and speech, and so on.
So, I ask again: why choose? Sharon Oviatt has some great work in this space showing that by using multiple modalities, one can offset the errors of one modality with the strengths of another, ending up with an interface that is more accurate than the error rates of its constituent parts would lead you to believe. I think that's a fantastic insight, and a subtle one that is easily overlooked. I submit that we need to get away from hardcoding for speech, gesture, touch, WIMP (windows, icons, menus, pointer), or the variety of other interfaces such as BCI (brain-computer interface, anyone?), and instead focus on the interface itself. If we can start moving towards true semantic behaviors, and realize the dreams of the IUI (intelligent user interface) community, namely truly responsive, adaptive, and intelligent interfaces, then we won't ever need to have this discussion again. Users could then employ whatever mix of multimodal inputs and outputs works best for them, and the abstraction layer could accommodate, account for, and even adapt itself to that usage without ever having to be touched by a developer or designer.
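To make the abstraction-layer idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `IntentBus` name, the intent strings, the handler signature): the point is simply that the application registers handlers against semantic intents, and any modality, speech, gesture, keyboard, or something not yet invented, can emit the same intent.

```python
class IntentBus:
    """Routes semantic intents to handlers, regardless of source modality."""

    def __init__(self):
        self._handlers = {}

    def on(self, intent, handler):
        # The application binds behavior to an intent, never to a modality.
        self._handlers.setdefault(intent, []).append(handler)

    def emit(self, intent, modality, **slots):
        # Any input modality can emit the intent once it has resolved
        # its own references (e.g. speech resolving "that" and "there").
        return [h(modality=modality, **slots)
                for h in self._handlers.get(intent, [])]


bus = IntentBus()
bus.on("move_item", lambda modality, item, target:
       f"moved {item} to {target} (via {modality})")

# "Put that there" spoken aloud, a drag gesture, or a keyboard shortcut
# all reduce to the same semantic intent:
print(bus.emit("move_item", "speech", item="chair", target="corner"))
print(bus.emit("move_item", "gesture", item="chair", target="corner"))
```

A new modality then plugs in by emitting existing intents, with no changes to application code, which is the "never touched by a developer or designer" property described above, at least in miniature.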
This all might sound farfetched, but visit any well-built website on a mobile device, and then think about how you would have reacted to such responsive, fluid design in 1995, 2000, or even 2005. Why limit ourselves to the next great thing, and then the next, and then the next, when we can address the problem holistically and, most importantly, for all users, regardless of functional limitation? Let's make this a reality today, not 20 years from now after countless iterations through many more potential interfaces of the future. That way, when those awesome interfaces do arrive, they can simply be plugged in, making our existing computing systems even more wondrous.
So, what do you think? Am I just crazy, dreaming of a future in which modalities are as exchangeable and plug-and-play as keyboards and mice are today? Or does the above miss the point altogether, and there's an even greater problem to be solved? I'd love to hear from you in the comments below. Also, thanks to Chris for writing his great post and starting the discussion on this topic.