“Speech and Audio in Window Systems: When Will They Happen?” Moderated by Barry Arons, Chris Schmandt, Michael Hawley, Lester Ludwig and Polle Zellweger

Conference:

SIGGRAPH 1989

Type(s):

Panels

Entry Number: 10

Title:

Speech and Audio in Window Systems: When Will They Happen?

Moderator(s):

Additional Information:

Transcript of the welcoming speech:
Good afternoon. Boy, I can’t see anything out there. I assume you all can see me; that’s why these lights are here. My name is Chris Schmandt from the Media Lab at MIT. I’m co-chairing this panel with Barry Arons, who is sitting over here. It’s actually quite a pleasure to co-chair this panel with Barry. We’ve been working together off and on for more years than I care to remember.

This panel has a long ridiculous name. Basically it’s about audio and window systems and workstations. I’m wearing two hats here. I’m going to spend a minute or two introducing the panel and then I’m going to spend some time talking about my own segment of the panel.

We’re going to try to be a panel as opposed to a series of five mini-papers that never get published. In other words, we’re going to try to keep our presentations relatively short, then segue into a series of prepared questions that the panelists are going to answer amongst themselves. Then we’ll open the floor up for questions.

In some ways this is a very incestuous crew. We’ve all known each other for quite a while. We have different slants and we’re actually going to try to focus on those slants a little bit. So if we disagree with each other, that doesn’t necessarily mean we really hate each other. We’re all friends.

Where this panel is coming from is a surge of interest in audio, and multimedia, in general, in computer workstations. The Macintosh has had audio for quite a while — you may or may not choose to call that a workstation. The NeXT computer sort of surprised people by having fairly powerful DSP and audio in and out. You’ll get a demo of that later if you haven’t seen it. The Sun SPARCStation has come out with some primitive digital record and playback capabilities.

On the other hand, there’s been interest in voice in computer workstations for years and years, and what we’ve seen so far is that voice really hasn’t had very much success. There have been a number of products that have come and gone. What has become popular has been centralized service — specifically voice mail. Voice mail is tied in more to a PBX — and the interface is more like a telephone than it is a mouse and window system, in the computer workstation interface.

Obviously, window systems are here to stay. We’re not suggesting that audio is going to replace the graphical paradigm, but rather have to interact with it.

On the other hand, everybody has a telephone. People had telephones on their desks before they had workstations, and we talk all the time at work. Voice really is a fundamental component of the way we talk, the way we interact with each other.

What we’re seeing in terms of the technologies showing up in these workstations is higher bit rate coding. Gone are the days of unintelligible low bit rate linear predictive coding or something like that — except for specialized applications.

Speech recognition is here, but it’s in its infancy. Text-to-speech — it’s around, it’s difficult to understand. You can learn to understand it.

Telephony is obviously part of this set-up if we’re dealing with audio. We don’t know whether it’s going to be analogue or digital. Is it going to be plain old telephone or is it going to be ISDN?

Those are some of the issues that we’re going to be talking about in this session. As I say, we’re going to try to keep each of the speakers to a relatively short period — and now I can put on my other hat. (puts toy plastic headset on — laughter)

Some people ask me whether speech recognition is a toy or not. Yes, it is. It’s sort of a fun toy. Speech technologies are in general fun. I was originally hoping to be able to play this out to the audience. But I don’t think it’s going to work well enough. This is actually a kid’s toy — $50 at Toys R Us. Speaker Independent Isolated Word Speech Recognizer — “yes”, “no”, “true”, and “false”. It will take you on tours about dinosaurs and things like that.

From my point of view, the key for what we can do with voice has to do with understanding its advantages and disadvantages and the concomitant user interface requirements leading us to design reasonable applications for it.

Voice has some advantages. It’s very useful when your hands and eyes are busy; you’re looking at a screen, you have your fingers on the mouse. Sometimes it’s intuitive; we learn to talk at a very early age. People talk to their computers even if the computers don’t have speech recognition. (laughter) Usually it’s expletives — especially with UNIX. (laughter) Voice really dominates human-to-human communication. No matter what we’re doing with E-Mail and FAX, the bottom line is we just still have to spend a certain amount of time physically speaking to each other.

Telephones are everywhere. If I can turn an ordinary pay phone into a computer terminal, suddenly I have access from all over the place.

From my own work, this suggests a heavy focus on telecommunications. The kinds of systems that I’m building are really designed to use voice in a communications kind of environment. On the other hand, there are many, many disadvantages of voice. It’s very slow. 200 words per minute, 150-250 words per minute. That’s less than a 300 baud modem, and who uses those any more?
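[Editorial note: the comparison between speaking rate and a 300 baud modem can be checked with a little arithmetic. The sketch below is not part of the talk. It assumes an average of six characters per word (five letters plus a space), ten bits per character on an asynchronous serial line (one start bit, eight data bits, one stop bit), and a modem that carries one bit per symbol, so 300 baud is about 300 bit/s. The names are illustrative only.]

```python
# Assumptions (not from the talk): average word length and serial framing.
CHARS_PER_WORD = 6        # 5 letters plus one space, a common rule of thumb
BITS_PER_CHAR = 10        # 1 start bit + 8 data bits + 1 stop bit
MODEM_BITS_PER_SEC = 300  # a 300 baud modem carrying 1 bit per symbol


def speech_bits_per_sec(words_per_minute: float) -> float:
    """Equivalent text data rate of speech at the given speaking rate."""
    chars_per_sec = words_per_minute * CHARS_PER_WORD / 60
    return chars_per_sec * BITS_PER_CHAR


# Compare the speaking rates quoted in the talk against the modem.
for wpm in (150, 200, 250):
    rate = speech_bits_per_sec(wpm)
    share = rate / MODEM_BITS_PER_SEC
    print(f"{wpm} wpm ~ {rate:.0f} bit/s ({share:.0%} of the modem)")
```

[Under these assumptions even the fastest quoted speaking rate, 250 words per minute, comes to roughly 250 bit/s of equivalent text, which is consistent with the speaker's claim.]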

Speech is serial. You have to listen to things in sequence. It’s a time varying signal by definition. And it requires attention. You have to listen to what’s going on, as opposed to simply scrolling it by and stopping it occasionally.

My way of characterizing this is to say that speech is “bulky”. Yes, it takes up space on the file system, but most importantly you can’t “grep” it, you can’t do keyword searches on it. It’s hard to file, it’s just hard to get any kind of handle on it. It takes time.

Finally, speech broadcasts. If my workstation is talking to me and you’re sitting in my office, you’re going to hear what it says, which is very different from if it appears as text. In fact, if it appears as text, and I’m sitting in front of the screen with these kinds of tiny bit map fonts that we tend to use, I’m probably not even going to be able to read it — much less you.

This has some user interface implications. One is that it suggests that we would like, where possible, to have graphical access to sounds. I’m going to show a video in just a second, showing you an interface to audio built under the X Window System, designed to give you some kind of a graphical context, so you can mouse around and perhaps use some visual cues to keep track of where you are in the sound. If you could roll the first piece of one-inch, please.

This is a sound widget.

ACM Digital Library Publication:

Speech and Audio in Window Systems: When Will They Happen?

Overview Page:

SIGGRAPH 1989: Panels