Is the Next Shift in Computer Interaction Limited to a Single Modality?

by Sina on January 28, 2013

This post is a response to a post by Chris Maury entitled Is the next shift in computer interaction really gestural? Probably not. I encourage you to go read that post. Chris claims an intrinsic advantage for audio-based interfaces, a class of interface I’m involuntarily and exquisitely familiar with as a result of being blind and using, for instance, a screen reader for between 14 and 18 hours a day. Drawing on that usage over the past decade or so, and on some of my doctoral research, I’ve included some thoughts below.

First, the phrasing:

Phrasing is important. When I use the word interface, I think of what is presented to the user. It is the layer upon which the interaction takes place. We might want to switch to using modality, as that dictates the way the users interact with this interface, be it speech, gestural, touch, or other.

So, why choose?

The real issue is asking the right question. Instead of asking what the next modality might be, we should wonder whether this is even an appropriate way of viewing the problem. I strongly believe there should not be a single prevalent modality in computing; rather, we should do whatever it takes to facilitate truly multimodal interaction. Multimodal environments date back to the 1980s with Richard Bolt’s rather impressive presentation at the CHI conference in 1984 (YouTube Video) called “Put That There” and have been worked on, evolved, and improved upon ever since. I believe that we need to stop trying to limit and restrict interfaces to a single modality of input or output and instead accept that no interface is appropriate for all people. Speech might seem sexy now, but part of that is novelty, and part of it is our response to the needs that speech meets so well and that existing interfaces do not.

All Modalities Have Some Problem

The problem with presuming that a single modality is the future of computer interaction is that it overlooks the failures of a specific modality and discounts the great advantages that can be realized from multimodal input. For example, try using Siri or Google Now at a bus stop when a bus is passing by. Try listening to an audiobook as a blind passenger on a bus or train where you need to keep an ear out for the next stop or rerouting of schedules. Try using a mobile device without ear buds to do speech dictation while music is playing through it. All of these things are possible with gestural or other interfaces, and there are many cases that go the other way as well, where speech wins the comparison: general text entry, one-handed access, and so on are all easier with speech than with gestural input. Furthermore, there’s even a third class in which keyboard, mouse, or joystick modalities are better than either gestural or speech, and so on.

The Solution?

So, I ask again, why choose? Sharon Oviatt has some great work in this space as well, where she shows that by using multiple modalities, one can reduce the error of one modality with the benefits of the other, in essence ending up with an interface that is more accurate than the sum of the errors of its constituent parts would lead you to believe. I think that’s a fantastic insight, and possibly a subtle one that can be easily overlooked. I submit that what we need to start doing is to get away from hardcoding speech or gestural or touch or WIMP (window, icon, menu, pointer), or the variety of other interfaces such as BCI (brain computer interface, anyone?), and instead focus on the interface itself. If we can simply start moving towards true semantic behaviors, and towards realizing the dreams of the IUI (intelligent user interface) community, i.e. truly responsive, adaptive, and intelligent interfaces, then we won’t need to ever have this discussion again. We can then simply allow users to use whichever of a myriad of multimodal inputs and outputs work best for them, and the abstraction layer present can accommodate, account for, and even adapt itself to such usage without having to ever be touched by a developer or designer.
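
To make the error-reduction point a bit more concrete, here is a minimal sketch in Python. It is not Oviatt's actual method; it is a naive fusion that multiplies the confidence scores of two recognizers and renormalizes. The function name fuse, the intent labels, and all of the probabilities are made up for illustration.

```python
def fuse(speech, gesture, floor=1e-6):
    """Naive late fusion: multiply per-intent scores, then renormalize.

    `speech` and `gesture` are hypothetical {intent: probability} dicts
    from two independent recognizers. `floor` is an assumed small score
    for intents that one recognizer did not report at all.
    """
    intents = set(speech) | set(gesture)
    scores = {i: speech.get(i, floor) * gesture.get(i, floor) for i in intents}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}


# Made-up recognizer outputs, e.g. on a noisy bus with a shaky hand.
# The user actually wants "open_map".
speech = {"open_mail": 0.45, "open_map": 0.40, "open_music": 0.15}
gesture = {"open_music": 0.45, "open_map": 0.40, "open_mail": 0.15}

fused = fuse(speech, gesture)

speech_best = max(speech, key=speech.get)    # speech alone picks open_mail
gesture_best = max(gesture, key=gesture.get)  # gesture alone picks open_music
fused_best = max(fused, key=fused.get)        # together they pick open_map

print(speech_best, gesture_best, fused_best)
```

In this toy case each modality gets it wrong on its own, but they are wrong in different ways, so the one interpretation both find plausible wins once the scores are combined. Real systems are far more involved, but the intuition is the same.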

This all might sound farfetched, but visit any good and proper website on a mobile device, and then think about your possible reactions to such responsive/fluid design in 1995, 2000, and even 2005. Why limit ourselves to the next great thing, and then the next, and finally the next, when we can simply address the problem holistically, and most importantly, for all users, regardless of functional limitation? Let’s make this a reality today, not 20 years from now after countless iterations through many more potential interfaces of the future. This way, when said awesome interfaces come out, they can just be plugged in and make our existing computing systems even more wondrous.

Your Thoughts

So, what do you think? Am I just crazy, dreaming of a future in which modalities to interfaces are as exchangeable and plug/play as keyboards/mice are today, or do you think the above misses the point altogether, and there’s an even greater problem to be solved? I’d love to hear from you in the comments below. Also, thanks to Chris for writing his great post and starting the discussion on this topic.


{ 3 comments… read them below or add one }

Eric Strodthoff January 30, 2013 at 4:22 pm

Hi Sina,

Why choose, indeed!

As an educator, an employee of a world leading education company, a parent, a life coach, and a continuously learning human being, I agree 100% with the multiple modality concept you describe above. We learn best when experiencing our environment. And we experience our environment most fully when presented with many different sensory stimuli. Therefore, the most effective learning is accomplished through the fullest experiences. Multiple modalities not only reduce/compensate for error potential, they encourage and increase active engagement and participation for all users. And let’s not forget, the more involved you feel, the more fun you can have while living and learning at the same time.

Computer interaction is increasingly becoming a part of not only our entertainment choices (Wii, Xbox Kinect) but also our educational paradigm (Smart Boards, iPads, etc.). As a result, we need to think about how we interact with these and yet-to-be-seen devices in a holistic manner, as you suggest. We’ve come a long way from Pong; now let’s update our educational models to make learning more fun too!



CMaury February 1, 2013 at 12:43 pm

The point on error rates is a good one, though I do not think that adding another modality is the only answer. First, there are situations where multiple modalities aren’t possible: driving a car or walking down the street for example.

And second, there are other ways to reduce error rates. On a noisy bus or in a noisy cafe, one can use a noise canceling microphone. For those who need their hearing unobstructed, you can use bone conduction like in Aftershokz or Google’s Glass. (Another promising answer is connected hearing aids: Apple will be partnering with major manufacturers to release “made for iPhone” devices later this year.)

Another possible argument against an audio only interface is the social acceptability of talking to your phone in public, without your phone to your ear. I’m not really sure if there is a good answer to this, though I’m excited for the possibilities of subvocalization, a pseudo mind reading tech that is maybe 5-10 years out. (see

The reason I argue that audio-based interfaces are the next wave in HCI is not to diminish the power of multimodal interfaces, which, as Eric said in the comments, are powerful tools in education and otherwise, but rather it’s a response to

Computing is moving beyond mobile touch devices to wearable electronics. Google Glass and the Pebble smart watch are just the first to make it to market. In this ever smaller and more mobile paradigm, what is the modality that is most efficient? I believe that an audio interface is the best bet given existing (and soon to be existing) technology, and that the weaknesses mentioned above are just hurdles to be overcome.


Rich Caloggero February 1, 2013 at 6:16 pm

What I think Chris is missing in his article is the fact that a true speech interface is a combination of accurate and intelligent speech recognition, along with intelligent and efficient voice output.

When you’re looking for a restaurant, you don’t want to have your computer spit out a web page full of navigation, ads, and some list of names and related info which may or may not correspond to restaurants. Current screen reader technology is not what I’d call an intelligent voice output solution; it’s pretty dumb actually, and works only because users have little other choice. I think domain will greatly influence voice output. A screen reader has to be able to give some sort of output regardless of the domain, but the output is often inefficient to examine and sometimes not even relevant. If you have a limited domain (as in the answer to a specific question like find me the closest Italian restaurant or list all the links in the current document), voice output can be fairly efficient.

