Visual Software Agent: A Realistic Face-to-Face Style Interface connected with WWW/Netscape
Hiroshi Dohi and Mitsuru Ishizuka
Dept. of Information and Communication Engineering,
School of Engineering, University of Tokyo
7-3-1, Hongo, Bunkyo-ku, Tokyo 113, JAPAN
This paper describes the Visual Software Agent (VSA) interface connected with a voice-controlled Netscape. VSA is an Internet-based interface agent with rocking realistic face and speech dialog function. A user asks the realistic anthropomorphic agent in a display some questions and some tasks, as if he/she usually asks his/her colleagues with a visual phone. It embodies a human-like face-to-face style communication environment, and therefore it will reduce the mental barrier for communication lying between a human and a computer. A prototype system of VSA is a development platform toward intelligent multimodal interface systems. Besides the texture-mapped realistic face, it equips a camera, a speech recognizer, a speech synthesizer, and some sensors; corresponding with eyes, ears, a mouth, and some sense organs, respectively. The system is also connected with World-Wide Web (WWW) / the Netscape navigator. As a result, it works just like a voice-controlled Netscape talking with a guide agent. A user navigates on the Internet by simple speech dialog with VSA besides an usual mouse interface in the Netscape. Both the voice-control operation through VSA and the mouse operation have the same priority. The user can choose the suitable operation method in any time in accordance with the various situations.
Figure 1: Visual Software Agent
(A user asks the realistic anthropomorphic agent in a display
some questions and some tasks just like using a visual phone.)
Today's pervasive interface style is direct manipulation. A mouse is a fairy
good device, but all of users can't manipulate it satisfactory. Some people
dislike a double clicking operation. In other words, current GUI interface
may be unusable for some people unfamiliar with computer operations and
physically handicapped persons. A good interface should be so simple and
so natural that a user doesn't give it second thought. Thus, software agents
assisting users personally become important[Laurel, 1990].
The word "software agent" generally means a software that performs intelligent tasks autonomously like human agents. Many of software agents have no figure and no face. Some people call this type of software as a "filter," not an "agent."
Some research groups have developed software agent interface systems with own faces and figures. The GUIDES system[Oren et al., 1990]presents some "Guides" icons. Users select the favorite one of icons as a guide, and assume characterization behind the iconic guide figures. The Maxims system[Lashkari et al., 1994][Maes, 1994]selects E-mails based on the user's preferences, and makes personalized suggestions to the user. It has a caricature face, and it works as an indicator which communicates its internal state to users via facial expressions.
The TOSBURG-II system[Takebayashi, 1994]has the agent for a communication, animated hamburger-shop clerk. It equips powerful speech recognizer, and users can order him with natural language. The DECface system[Waters and Levergood, 1993]has a texture-mapped face with lip-synchronization. The Talkman system[Takeuchi and Nagao, 1993]has a face based on Waters', and makes multimodal human-computer conversation as a "social agent" [Nagao and Takeuchi, 1994].
We have designed and implemented an anthropomorphic interface agent with rocking realistic face and speech dialog function[Dohi and Ishizuka, 1993]as a communication partner. We call this anthropomorphic agent as the Visual Software Agent, or simply VSA. We connect our VSA with the World-Wide Web(WWW)[Berners-Lee et al., 1994], as an Internet-based vast information database. VSA make use of the Netscape navigator and the NCSA Mosaic as a hyper-text viewer. Because of this connection, VSA can handle vast and various types information stored in the Internet. A user can walk around the Internet world by talking with VSA, besides a usual mouse interface on the Netscape.
Speech dialog is our main communication channel. The advantage of a voice-controlled interface is;
However, who wants to talk with a metal box in order to control a computer?
In the movie titled "2001: A Space Odyssey," captain Bowman talks and commands
by natural language with the computer with a lip-reader, HAL-9000, however
he never be aware in where HAL is.
Suppose that you are alone in your room and speak to yourself. You may feel uneasy whenever you speak to without a communication partner. It may be reluctant psychologically for a user to mutter to a simple metal box of computer. An anthropomorphic agent with a realistic face will reduce this mental barrier for communication lying between a human and a computer. It embodies our common "face-to-face" style communication environment.
Figure 1 shows the example of the Visual Software Agent interface.
A user puts a headset on (or uses a microphone,) and asks VSA in a display some questions and some tasks just like using a visual phone.
For the speech communication with a user, we have attached some commercial speech products.
The speech recognizer is the "Phonetic Engine 200" (Speech Systems Inc.) with Japanese speech-recognition function as an addition. It can recognize speaker-independent and continuous speech.
Also, two speech synthesizers are attached to the workstation. These are the "DECtalk" (Digital Equipment Corp.) for English and the "Shaberin-bo" (NTT data) for Japanese.
VSA has a realistic face.
For generating the realistic facial image of the VSA, a real human photograph is texture-mapped on a 3D wireframe model of a face.
Figure 2: 3D wireframe face model
(It has only a front half, and a full face photograph is texture-mapped on it.)
Figure 2 shows a 3D wireframe face model.
It has only a front half. It consists of about five hundred triangle surface patches.
We use one full face photograph for texture-mapping. It is very easy to match the real texture image with the 3D face model. We have developed a VSA editor, then you have only to click three points on the displayed texture image; that is, the right eye, the left eye, and the center of the mouth. Arrangements are all. The system calculates the suitable position and the scaling factor of the 3D face model. It can also move each vertex for the precise matching, if necessary. Then it generates a texture-mapped realistic face automatically.
The deformation of the wire frame model makes a change of facial expressions, including winks (open/close eyes) and speaks (change mouth shapes synchronized with a speech synthesizer.) The generated VSA face is also always rocking naturally; these features give intimate feeling that can not be obtained from static images.
The advantages of the texture-mapped 3D face compared with the animated figure are;
Although the texture mapping technique requires high computational cost, recent computer graphics technology makes it possible to generate a vivid realistic facial image and change its facial expression in real-time.
For rocking realistic facial image synthesis in real-time, we have developed so far three modules. All modules have the same interface, so that we can choose one as we need.
The prototype of the original parallel transputers board (called "ViSA")
can generate a texture-mapped facial image 10-15 frames/sec on four transputer
In the motion J-PEG method, we make a facial image database. It has more than 7,000 J-PEG compressed images with the combination of various face directions and mouth shapes. Each original image is 640 480 pixels with 24 bits depth. The system arranges suitable compressed images picked up from the image database according to the speech sentence and the movement of the user, then it decompress the image sequence in real-time continuously with the J-PEG decompression board. Using this technique, we can generate the rocking realistic face in real-time even on a personal computer.
We have also implemented a real-time facial image synthesis module on the OpenGL graphics library. A high-performance graphics board with a hardware texture mapper can generate a rocking realistic face with various facial expressions 30 frames/sec.
We connect our VSA with WWW as an Internet-based vast information database.
We have already implemented the VSA system connected with NCSA Mosaic[Dohi and Ishizuka, 1996]. Recently, we have developed a new VSA system connected with Netscape Navigator, instead of Mosaic. Both systems have the same interfaces and functions, although the implementation method are different.
In August 1996, Netscape Communications Corporation and IBM announced the development of a speech-enabled version of Netscape Navigator. It is based on Netscape Navigator 2.02, and runs only on IBM OS/2 Warp 4.
In our VSA system, it is independent of the version of Netscape Navigator, since we don't make any modification on Netscape. The latest VSA system connects with Netscape Navigator 3.01, and runs on a UNIX workstation. It can handle multi-language encodings including Japanese.
VSA makes use of Netscape Navigator as the part of the system, a hyper-media viewer. Because of this connection, VSA can handle information described in Hyper-Text Markup Language(HTML). In other words, VSA can make use of vast information stored in the Internet. As a result, all hyper-media data in the standardized HTML format can be used for VSA as well as WWW.
A user can walk around the Internet world using speech dialog, in addition to a usual mouse interface on Netscape. Both the speech dialog interface through VSA and the mouse interface have the same priority. All mouse controls are any time available, independent of the use of voice control.
Figure 3: VSA system connected with WWW/Netscape
Figure 3 shows an example of the screen image of our VSA system connected
with Netscape Navigator. There are three windows. The right one is a Netscape
window. The upper left window shows VSA with a rocking realistic face. In
the lower left window, it displays the anchor list of the right Netscape
window. It is used for the voice-control of Netscape. Each entry of this
list is a triplet of the index number, the anchor string, and the URL.
< Index_number, Anchor_string, URL>
The anchor list is automatically extracted whenever a new page is opened.
Figure 4: Configuration of VSA-Netscape Interface
(Six processes collaborate through TCP/IP and shared memories
on different workstations.)
Figure 4 illustrates the configuration of our VSA-Netscape interface.
There are three parts; the Interface manager, the VSA server, and the Hypertext viewer. Each function is modularized and implemented as an independent process. It is easy to add and replace new function processes. In the Netscape version's basic implementation, each part consists of two processes. These six processes can collaborate through TCP/IP and shared memories on different workstations.
The Interface manager converts a speech sentence into an associated URL (Uniform Resource Locator), and controls Netscape and the VSA server. The VSA server controls both the operations of speech synthesis and realistic facial image synthesis with simple loose lip-synchronization.
In an earlier Mosaic version, we modified the source code(60K lines) of Mosaic software to extract the anchor list. However in the Netscape version, its source code is not available. Instead, we use a proxy technique to obtain necessary data including the anchor list.
The interface manager allows multiple TCP/IP connections. There are various types of sensors, and can exchange sensor data with the Interface manager through TCP/IP; a moving camera tracking a user for eye contacts, a ultra sonic telemeter, an infrared sensor, a thermometer etc. Using this TCP/IP connections, it will also embody VSA-VSA communications.
Our current VSA system can answer to user's questions; i.e. date, time, room temperature, arrival of personal e-mails, and weather forecast etc. And also, it can recognize four types of speech commands for the Internet navigation.
The following examples illustrates the voice-control operation of Netscape.
User: Please contact with "the University of Tokyo" server.
VSA: Yes. Just a moment please.
(connect with the server, then present an anchor list)
User: Number 0, please.
(anchor_index 0: Japanese_version_)
(open "Japanese" page, and present new anchor list)
User: Please show me c????s . . .
(the score value is under a threshold; it failed speech recognition)
User: Please show me "campus map"
(open "campus map" page, and present new anchor list)
Example 1. A simple dialog example.
In this simple dialog example, a user speaks to VSA, and then VSA replies
with voice as well as changing pages in the Netscape window, if necessary.
The speech recognizer returns both a sentence recognized and its score for decoding of the complete utterance. The score value is between 0 to 999 (best). If the score value is under a threshold, the VSA system discards the returned sentence and prompts the user to speak the sentence again.
In a speech dialog system, the delay time of responses is important factor
for making a user comfortable.
When you want to know a weather forecast, you have only to talk with VSA, "Please show me a weather forecast". Then, a relevant page is opened in the Netscape window. However, it may not be necessarily comfortable for you because we are sometimes kept waiting long time to get information because of a traffic jam of the network. A user's request invokes downloading information from the Internet, not displaying information.
Figure 5: Sub agent
(sub agents autonomously gather information in advance and VSA responds
promptly to the user's question with voice and images, if necessary.)
In the VSA system, some sub agents autonomously gather information into
the local database, and VSA responds promptly to the user's question with
voice (and images, if necessary.) They are, for example, a weather forecast,
stock prices, and relevant news topics which include specific keywords.
The sub agent accesses periodically specific WWW servers, then picks necessary
information out from the WWW page, and updates the local database entry
Many research groups have tried to develop a "general purpose" WWW page analyzing system. However the current state of WWW pages are too chaotic and too confused. They are a jumble of wheat and tares. Therefore we choose a simple strategy. A user advises the system authoritative WWW pages in advance. The WWW page is desirable that the contents is updated frequently, but it keeps the page layout.
A user may get html files beforehand using prefetch and caching techniques. In this case, however, the user need to search and pick up necessary information from the page.
In the VSA system with sub agents, it tells the user by voice (and image) only what he/she wants to know. In case of weather forecast, the weather page in general includes information all over the country. When a user simply asks the weather forecast, the VSA system judges the user wants to know the weather forecast of the local district. Then VSA replies it promptly with voice since the local database has the latest information of local weather forecast. And, when asked about the weather forecast in another place, VSA accesses the WWW server as usual, and opened its weather WWW page.
There are some papers about the effect of the human face on a multimodal
interface [Walker et al., 1994][Takeuchi and Naito, 1995][Koda and Maes,
1996]. According to the observations about our guests' interests, we confirm
a rocking realistic face attracts strongly the user's attention to the display.
It is the interface style close to daily face-to-face communication with
speech dialog. For computer experts, however, it may be an annoying interface
because the performance of any current speech dialog systems is insufficient
yet for free talk. They can type a keyboard and click a mouse without any
difficulties, and are well aware that the speech input can't be always recognized
correctly. Then, our current target users are persons unfamiliar with computer
operations, physically handicapped persons, and all users in situations
that the mouse interface is not suitable; e.g., a speaker standing in front
of a wall-type room-wide screen on a stage and a person walking on a street.
In multimodal interface systems, some modalities should be complementary each other. It is desirable for users to be able to choose the most suitable interface modalities depend on the time, the place, and the occasion. We don't intend to fully replace the mouse interface with the speech dialog interface. If all anchor strings have unique names each other, the speech dialog functions has an advantage, because the user need not to move a mouse cursor onto another anchor. However, the speech operation can't necessarily point out any location on a click-able map exactly.
We have developed a prototype of the Visual Software Agent system connected
with WWW and Netscape navigator. VSA offers a new multimodal interface with
a rocking realistic facial image and speech dialog function. It gives users
friendliness feeling, and it reduces the mental barrier for speech communication
lying between a human and a computer.
VSA delivers a user's speech request to WWW and Netscape. Thus, a user can walk around the Internet by talking with the visual anthropomorphic agent. It may broaden the usage style of WWW information and functions. The interaction between the agent and a user becomes more smooth and natural, and the agent can navigate the user in the vast Internet information space.
[Berners-Lee et al., 1994] T. Berners-Lee, R. Cailliau, A. Luotonen, H.F. Nielsen, and A. Secret.
"The World-Wide Web". Commun. ACM, 37(8):76-82, 1994.
[Dohi and Ishizuka, 1993] H. Dohi and M. Ishizuka. "Realtime Synthesis of a Realistic Anthropomorphous Agent toward Advanced Human-Computer Interaction". In G. Salvendy and M.J. Smith, editors, Human-Computer Interaction: Software and Hardware Interfaces, pages 152-157. Elsevier, 1993.
[Dohi and Ishizuka, 1996] H. Dohi and M. Ishizuka. "A Visual Software Agent: An Internet-Based Interface Agent with Rocking Realistic Face and Speech Dialog Function". Working Notes of AAAI-96 Workshop on Internet-Based Information Systems, pages 35-40, Aug. 1996.
[Koda and Maes, 1996] T. Koda and P. Maes. "Agents with Faces: The Effects of Personification of Agents". HCI'96, Aug. 1996.
[Lashkari et al., 1994] Y. Lashkari, M. Metral, and P. Maes. "Collaborative Interface Agents". Proc. 12th National Conference on Artificial Intelligence (AAAI-94), 1:444-449, Jul. 1994.
[Laurel, 1990]B. Laurel. "Interface Agents: Metaphors with Character". In B. Laurel, editor, The Art of Human-Computer Interface Design, pages 355-365. Addison-Wesley Publishing Company, Inc., 1990.
[Maes, 1994]P. Maes. "Agents that Reduce Work and Information Overload". Commun. ACM, 37(7):31-40, July 1994.
[Nagao and Takeuchi, 1994]K. Nagao and A. Takeuchi. "Speech Dialogue with Facial Displays: Multimodal Human-Computer Conversation". 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), pages 102-109, 1994.
[Oren et al., 1990]T. Oren, G. Salomon, K. Kreitman, and A. Don. "Guides: Characterizing the Interface". In B. Laurel, editor, The Art of Human-Computer Interface Design, pages 367-381. Addison-Wesley Publishing Company, Inc., 1990.
[Takebayashi, 1994]Y. Takebayashi. "Spontaneous Speech Dialogue System TOSBURG II -Towards the User-Centered Multimodal Interface-". IEICE Trans. Information and Systems, J77-D-II(8):1417-1428, Aug. 1994.
[Takeuchi and Nagao, 1993]A. Takeuchi and K. Nagao. "Communicative Facial Displays as a New Conversational Modality". ACM/IFIP INTERCHI'93 (INTERACT'93 and CHI'93), pages 187-193, 1993.
[Takeuchi and Naito, 1995] A. Takeuchi and T. Naito. "Situated Facial Displays: Towards Social Interaction". Proc. CHI'95 Human Factors in Computing Systems, pages 450-454, 1995.
[Walker et al., 1994]J.H. Walker, L. Sproull, and R. Subramani. "Using a Human Face in an Interface". Proc. CHI'94 Human Factors in Computing Systems, pages 85-91, 1994.
[Waters and Levergood, 1993] K. Waters and T.M. Levergood. "DECface: An Automatic Lip-Synchronization Algorithm for Synthetic Faces". Technical Report CRL 93/4, Cambridge Research Center, Digital Equipment Corporation, Sep. 1993.