It is well established that visual information dominates auditory spatial information in the perception of object location . Howard and Templeton  defined this capture of the auditory signal by the visual signal as the ventriloquism effect. Visual stimuli that were appropriate to the auditory stimuli (e.g. a detailed hand puppet paired with speech) were more effective in capturing the auditory stimuli than those that were inappropriate . However, neutral or non-meaningful stimuli, such as spots of light and tones, have also been shown to interact and produce a ventriloquism effect similar to that seen with more complex and meaningful stimuli [4,5]. Other studies have coarsely defined the spatial and temporal parameters that influence the illusion, finding that large spatial separations are less effective than small ones in creating the illusion [3,6] and that the illusion is virtually abolished at temporal disparities of 300 ms . However, these studies used complex auditory (speech) and visual (hand puppet) stimuli and did not define these spatial and temporal dependencies in detail.
The influence of auditory stimuli on visual perception has also been described, particularly for moving visual stimuli [7,8]. Such interactions between visual and auditory information suggest that they should also be observable at the neuronal level. Neurons that respond to both auditory and visual stimuli are found in several regions of the mammalian brain, for example the superior colliculus [9,10], the parietal lobe , the superior temporal sulcus  and the frontal lobes [12,13]. Although the parietal lobe is critical for the spatial perception of auditory and visual stimuli, patients with unilateral parietal lobe damage still experience capture effects . Moreover, in a patient with bilateral parietal lobe lesions the capture effect was reversed, with auditory stimuli capturing the spatial location of visual stimuli . These data indicate that the parietal lobes are not necessary for capture effects to occur, and suggest that other cortical areas are also involved.
Recent imaging studies in human subjects, using speech stimuli presented while subjects viewed either a person speaking or a static face, have indicated that unimodal areas can show increased activity when the auditory and visual stimuli are congruent [16,17]. It is therefore currently unclear, and indeed controversial, how unimodal and multimodal cortical areas participate in the integration of auditory and visual information that leads to the perception of objects and events. To fully investigate the neural mechanisms of multimodal integration, more basic (i.e. non-language-dependent) stimuli that can be used in both humans and experimental animals must be employed. The ventriloquism illusion is a simple paradigm in which the perceived spatial location of auditory and visual stimuli can be altered, and it is likely to provide a powerful tool for addressing this question. In this study, the temporal and spatial disparities that give rise to the ventriloquism effect were defined in human subjects using auditory and visual stimuli that can ultimately be used in future studies in both humans and experimental animals.
MATERIALS AND METHODS
Nine subjects (four males and five females, aged 25–37 years, mean 29.6) with normal hearing and normal or corrected-to-normal vision participated in these experiments with informed consent. All procedures conformed to the Declaration of Helsinki and were approved by the UC Davis Human Use Committee. Experiments were performed in a darkened, double-walled sound booth (I.A.C.) with inner dimensions of 2.4 × 3.0 × 2.0 m, lined with echo-attenuating foam (Sonex). Tucker-Davis Technologies (TDT) hardware and software controlled visual and auditory stimulus presentation and data collection. Auditory stimuli were 200 ms tone or noise stimuli (∼4 ms linear on/off ramps) presented at 65 dB SPL. Stimuli were presented from speakers placed 146 cm from the center of the subject's interaural axis, giving a travel time from speaker to subject of ∼5 ms. The visual stimulus was a red LED, 0.125° in diameter, positioned at the center of the speaker directly in front of the subject. A similar green LED placed 12° above this location served as the fixation point.
Subjects were seated in a chair, with their heads lightly restrained by a headband attached to the chair so that they faced the center speaker. In Experiment 1, subjects fixated the fixation LED and moved a switch to the center position to initiate each trial. After a random interval of 500–1500 ms, 200 ms auditory and visual stimuli were presented from the center location (Fig. 1a). The difference between the onsets of the visual and auditory stimuli (Δt) was varied from −250 ms (visual onset before auditory onset) to +250 ms in 50 ms increments. Subjects were instructed to move the switch to the right if they perceived the two stimuli to begin and end at the same time (same), and to the left if they perceived that the two stimuli did not begin and/or end at the same time (different); they were further instructed to ignore whether the stimuli appeared to originate from the same location. Same responses were scored 1.0 and different responses 0. Each Δt was presented on 20 randomly interleaved trials in each session.
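The Experiment 1 design (eleven Δt values, 20 randomly interleaved trials each, same = 1.0 and different = 0) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' TDT control software: the helper names `make_exp1_session` and `mean_same_score` are hypothetical, and Python's `random.shuffle` merely stands in for whatever randomization was actually used.

```python
import random
from collections import defaultdict

# Onset asynchronies from the Methods: -250 to +250 ms in 50 ms steps.
# Negative values mean the visual stimulus starts before the auditory one.
DELTA_TS_MS = list(range(-250, 251, 50))
TRIALS_PER_DELTA_T = 20


def make_exp1_session(seed=None):
    """Hypothetical helper: 20 trials per Δt, randomly interleaved."""
    trials = [dt for dt in DELTA_TS_MS for _ in range(TRIALS_PER_DELTA_T)]
    random.Random(seed).shuffle(trials)
    return trials


def mean_same_score(trials, responses):
    """Score 'same' as 1.0 and 'different' as 0, then average within each Δt."""
    if len(trials) != len(responses):
        raise ValueError("need exactly one response per trial")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for dt, response in zip(trials, responses):
        totals[dt] += 1.0 if response == "same" else 0.0
        counts[dt] += 1
    return {dt: totals[dt] / counts[dt] for dt in sorted(counts)}


if __name__ == "__main__":
    session = make_exp1_session(seed=1)
    print(len(session))  # 11 Δt values x 20 trials = 220
```

Averaging the 0/1 scores per Δt gives the proportion of "same" responses, which is the quantity plotted against Δt in Fig. 2.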
Experiment 2 required subjects to decide whether the two stimuli were presented at the same location, regardless of the temporal disparity. The stimuli were the same as in Experiment 1, except that the auditory stimulus could be presented from −12° to +12° in 4° increments along the horizontal meridian. Δt values were restricted to 0, 50, 100, 150 and 250 ms, providing a range of Δt values while keeping the number of trials in each session low to prevent subject fatigue. Subjects were required to respond same if they perceived that the sound and light occurred at the same spatial location, regardless of the relative timing of the two stimuli. In a control condition, the visual stimulus was presented before, throughout and after the auditory stimulus (Fig. 1b) to control for auditory-visual interactions that do not depend on stimulus timing. Subjects performed 20 randomly interleaved trials for each combination of auditory location and timing condition (control or Δt) across five sessions (four trials per stimulus per session). The data were scored in the same manner as in Experiment 1.
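The Experiment 2 trial counts follow directly from the factorial design: seven auditory locations crossed with six timing conditions (five Δt values plus control), four trials per combination per session, over five sessions. The sketch below is hypothetical (`make_exp2_schedule` and the `"control"` label are our names) and assumes every combination appears in every session, as "four trials/stimulus/session" implies; it treats Δt as a non-negative delay because the text does not say which stimulus led in Experiment 2.

```python
import itertools
import random
from collections import Counter

# Auditory locations: -12 to +12 degrees in 4 degree steps; 0 is the LED location.
LOCATIONS_DEG = list(range(-12, 13, 4))
# Timing conditions: the five Δt values (ms) plus the control condition, in which
# the LED is on before, throughout and after the sound.
TIMING_CONDITIONS = [0, 50, 100, 150, 250, "control"]
TRIALS_PER_CONDITION_PER_SESSION = 4
N_SESSIONS = 5


def make_exp2_session(rng):
    """Every location x timing combination, 4 trials each, randomly interleaved."""
    trials = [
        (location, timing)
        for location, timing in itertools.product(LOCATIONS_DEG, TIMING_CONDITIONS)
        for _ in range(TRIALS_PER_CONDITION_PER_SESSION)
    ]
    rng.shuffle(trials)
    return trials


def make_exp2_schedule(seed=None):
    """Five sessions, giving 20 trials per stimulus overall."""
    rng = random.Random(seed)
    return [make_exp2_session(rng) for _ in range(N_SESSIONS)]


if __name__ == "__main__":
    schedule = make_exp2_schedule(seed=1)
    per_stimulus = Counter(trial for session in schedule for trial in session)
    print(len(schedule[0]), set(per_stimulus.values()))  # 168 {20}
```

Each session therefore has 7 × 6 × 4 = 168 trials, which is the "low trial number per session" trade-off the Methods describe.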
RESULTS
In Experiment 1, all nine subjects were tested with broadband noise, and three were also tested with a 1 kHz tone (Fig. 2). There was no significant difference between the results for noise and 1 kHz tone stimuli (ANOVA; p > 0.05). Across subjects, there was a consistent and significant asymmetry in the responses: mean scores for negative Δt values (visual stimulus leading) were consistently higher than for positive Δt values (Fig. 2), and the two sides of the curve appeared shifted relative to each other by −50 ms. Tukey analysis confirmed this impression: there was no significant difference between Δt values of 0 and −50 ms, 50 and −100 ms, 100 and −150 ms, or 150 and −200 ms. This asymmetry was no longer apparent at Δt values ≥ 200 ms, where subjects routinely perceived the two stimuli as occurring at different times.
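The −50 ms shift amounts to comparing each non-negative Δt with the negative Δt that lies 50 ms further out. The sketch below only builds those pairs and the raw differences in mean score; it is descriptive and is not the Tukey test the authors ran (a proper post hoc test could use, for example, `scipy.stats.tukey_hsd` in recent SciPy releases). The function names are hypothetical, and the example scores in any usage are toy values, not data from this study.

```python
def shifted_pairs(max_positive_ms=150, step_ms=50, shift_ms=-50):
    """Pair each non-negative Δt with the negative Δt it matches after the
    curve is shifted by shift_ms: (0, -50), (50, -100), ... as in the text."""
    return [(dt, -dt + shift_ms) for dt in range(0, max_positive_ms + 1, step_ms)]


def pair_differences(mean_scores, pairs):
    """Raw difference in mean 'same' score for each pair (first minus second).
    Descriptive only: this is not a Tukey test."""
    return {pair: mean_scores[pair[0]] - mean_scores[pair[1]] for pair in pairs}


if __name__ == "__main__":
    print(shifted_pairs())  # [(0, -50), (50, -100), (100, -150), (150, -200)]
```

If the shift account holds, these differences should be close to zero, which is what the non-significant Tukey comparisons indicate.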
Experiment 2 was designed to determine whether the spatial and temporal disparities between the visual and auditory stimuli interact in producing the ventriloquism effect. Subjects were tested with 1 kHz tone stimuli, since this stimulus is more difficult to localize than noise and would therefore yield spatial disparities at which localization performance in the control condition was below, near and above threshold . A three-way factorial analysis of variance showed significant effects of speaker location (df = 6, 6958; F = 849.2; p < 0.001), Δt (df = 5, 6958; F = 29.0; p < 0.001), and their interaction (df = 30, 6958; F = 5.6; p < 0.001).
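The numerator degrees of freedom reported above can be checked against the design: each main effect has (levels − 1) df and the interaction has the product. This check assumes, as our reading rather than a statement in the text, that the Δt factor has six levels (the five Δt values plus the control condition); the error df of 6958 depends on trial counts and on the full model, and is not reconstructed here. `anova_effect_dfs` is a hypothetical helper.

```python
def anova_effect_dfs(n_levels_a, n_levels_b):
    """Numerator degrees of freedom for two crossed factors and their interaction."""
    df_a = n_levels_a - 1
    df_b = n_levels_b - 1
    return {"main_a": df_a, "main_b": df_b, "interaction": df_a * df_b}


# 7 speaker locations (-12 to +12 deg in 4 deg steps); 6 timing levels, assuming
# the Δt factor comprises the five Δt values plus the control condition.
print(anova_effect_dfs(7, 6))  # {'main_a': 6, 'main_b': 5, 'interaction': 30}
```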
Post hoc tests showed that subjects reported the auditory and visual stimuli as originating from different locations on 99.3%, 86%, 32% and 10% of trials at spatial disparities of 12°, 8°, 4° and 0°, respectively. Stimuli with temporal disparities of 0 or 50 ms were significantly more likely to be perceived as originating from the same location than were stimuli with Δt values of 150 or 250 ms. At a Δt of 100 ms, subjects were more likely to perceive the stimuli as occurring at the same place than on control trials, but not significantly more so than at temporal disparities of 150 or 250 ms.
The interactions between spatial and temporal disparities are shown for all subjects in Fig. 3. At the two extreme locations (±12°; Fig. 3a), subjects nearly always perceived the two stimuli at different locations, at all Δt values. The effects of temporal disparity were less consistent at an 8° separation. For the right speaker location (+8°), subjects' ability to distinguish the difference in location was affected only at a Δt of 100 ms. However, at the left speaker location (−8°), subjects were significantly more likely to perceive the sound as originating from the same location as the light at Δt values of 0 and 50 ms. At a 4° spatial disparity, a Δt of 50 ms was just as effective at capturing the auditory signal as a Δt of 0 ms (p > 0.05). Subjects were significantly more likely to perceive the sound as coming from the same location as the light at Δt values of 0 and 50 ms than at any of the other temporal disparities (p < 0.05). This effect disappeared at Δt values > 50 ms, which did not differ from the control condition.
At 0° spatial disparity, subjects correctly perceived the auditory and visual stimuli as originating from the same location on 100% of trials at a Δt of 0 ms (open diamonds in Fig. 3a). However, this perception declined gradually with increasing Δt, even though the two stimuli were physically co-located. Across subjects, performance was significantly degraded at a Δt of 250 ms and in the control condition relative to a 0 ms temporal disparity (p < 0.01).
These results suggest that the effect of temporal disparity was greatest at spatial disparities that were near or below threshold for auditory localization. To test this explicitly, the difference between the percentage correct on control trials and on trials at each temporal disparity was divided by the percentage correct on control trials, and this performance index was plotted as a function of the percentage correct on control trials. If the effect is more pronounced when the auditory stimuli are difficult to localize (small spatial disparities), the performance index should be greater than when they are easy to localize (large spatial disparities). Further, the index should be greater at short than at long temporal disparities. The results of this analysis are shown in Fig. 3b. Across subjects, visual capture of the auditory stimulus was greater when the auditory stimulus was difficult to localize (low percentage correct on control trials) and at shorter temporal disparities.
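Written out, the performance index for one spatial disparity and one Δt is PI = (P_control − P_Δt) / P_control, where P is percentage correct. A minimal sketch of that calculation follows; the function name and argument names are hypothetical, and the values in the usage line are toy numbers for illustration, not data from this study.

```python
def performance_index(pct_correct_control, pct_correct_dt):
    """(control - Δt) / control.

    pct_correct_control: percent correct on control trials at one spatial disparity.
    pct_correct_dt: percent correct at the same disparity for one Δt.
    Larger values mean more of the control-trial accuracy was lost, i.e. more
    visual capture of the auditory stimulus.
    """
    if pct_correct_control <= 0:
        raise ValueError("control percent correct must be positive")
    return (pct_correct_control - pct_correct_dt) / pct_correct_control


if __name__ == "__main__":
    # Toy example: accuracy halves relative to control, so the index is 0.5.
    print(performance_index(80.0, 40.0))  # 0.5
```

Normalizing by the control score is what allows conditions with very different baseline accuracy to be compared on one axis, as in Fig. 3b.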
This finding was further supported by dividing the threshold for each temporal disparity, calculated as the spatial location corresponding to 50% correct responses , by the threshold measured in the control condition. A ratio of 1.0 indicates no difference, and ratios > 1.0 indicate that the visual stimulus decreased spatial acuity for the auditory stimulus. This ratio was > 1.0 for all temporal disparities (0 ms: 1.53 ± 0.83; 50 ms: 1.45 ± 0.68; 100 ms: 1.40 ± 0.59; 150 ms: 1.15 ± 0.45; 250 ms: 1.13 ± 0.61), but the thresholds were significantly larger only for temporal disparities of 0, 50 and 100 ms (paired t-test; p < 0.001), not for temporal disparities of 150 or 250 ms (p > 0.05).
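The text defines threshold as the spatial location at 50% correct but does not say how values between tested disparities were obtained. The sketch below assumes simple linear interpolation between neighbouring tested disparities (the original analysis may have fitted a function instead) and then forms the threshold ratio. Both function names are hypothetical, and any inputs passed to them here are toy values.

```python
def threshold_at_50(disparities_deg, pct_correct):
    """Spatial disparity at which percent correct first reaches 50%, by linear
    interpolation between neighbouring tested disparities (assumed method).

    disparities_deg must be sorted in ascending order.
    """
    points = list(zip(disparities_deg, pct_correct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # y0 < 50 <= y1 guarantees y1 > y0, so the division is safe.
        if y0 < 50.0 <= y1:
            return x0 + (50.0 - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("percent correct does not cross 50% within the tested range")


def threshold_ratio(threshold_dt, threshold_control):
    """Ratio > 1.0: the visual stimulus raised the auditory threshold (lower acuity)."""
    return threshold_dt / threshold_control


if __name__ == "__main__":
    # Toy curve: 40% correct at 4 deg and 60% at 8 deg gives a 6 deg threshold.
    t = threshold_at_50([0, 4, 8], [0.0, 40.0, 60.0])
    print(t, threshold_ratio(t, 4.0))  # 6.0 1.5
```

Per-subject ratios computed this way could then be compared with a paired t-test, as the text describes.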
DISCUSSION
The results of this study define the temporal and spatial dependency of the ventriloquism effect using basic auditory and visual stimuli, extending previous observations [4–6]. We found that temporal disparity was less effective in disrupting the illusion when the visual stimulus was presented before the auditory stimulus, that the illusion was more easily created at spatial disparities where the ability to localize the auditory stimulus was poor, and that the visual stimulus could functionally decrease auditory spatial acuity at temporal disparities of 0, 50 and 100 ms but not at 150 and 250 ms.
An interesting aspect of the ventriloquism illusion is that the two sensory modalities can interact across a range of both space and time, indicating that both parameters are important in the neuronal processing of extrapersonal space. In primates, first-spike latencies are longer in primary visual cortex  than in primary auditory cortex [20,21]. However, first-spike latencies of bimodal or multimodal neurons in the superior temporal sulcus and the parietal lobes are similar for auditory and visual stimuli [11–13]. In Experiment 1, there was no difference in the perception that the auditory and visual stimuli occurred at the same time for Δt values of 0 and −50 ms (visual stimulus leading). However, this effect was asymmetric: auditory stimuli presented before visual stimuli were less effective in creating the illusion. This finding is consistent with naturally occurring temporal disparities, since light travels much faster than sound. At a distance of 15 m, the sound arrives on the order of 50 ms after the light; such distances are common in everyday environments, yet this disparity is not perceived.
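The 15 m figure can be checked with a back-of-the-envelope calculation. The sketch assumes a speed of sound of 343 m/s (dry air at about 20 °C), which the text does not state, and the helper names are hypothetical. It gives a lag of about 44 ms at 15 m, consistent with "on the order of 50 ms", and inverts the relation to show that a 50 ms lag corresponds to a source roughly 17 m away.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 degrees C
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def audio_visual_lag_ms(distance_m):
    """How much later the sound arrives than the light from a source distance_m away."""
    seconds_per_metre = 1.0 / SPEED_OF_SOUND_M_PER_S - 1.0 / SPEED_OF_LIGHT_M_PER_S
    return 1000.0 * distance_m * seconds_per_metre


def distance_for_lag_m(lag_ms):
    """Inverse of audio_visual_lag_ms: source distance that produces lag_ms."""
    seconds_per_metre = 1.0 / SPEED_OF_SOUND_M_PER_S - 1.0 / SPEED_OF_LIGHT_M_PER_S
    return (lag_ms / 1000.0) / seconds_per_metre


if __name__ == "__main__":
    print(f"{audio_visual_lag_ms(15.0):.1f} ms")  # 43.7 ms
    print(f"{distance_for_lag_m(50.0):.2f} m")    # 17.15 m
```

The light travel time is included for completeness but is negligible (about 50 ns at 15 m), so the lag is effectively distance divided by the speed of sound.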
The spatial disparity between the visual and auditory stimuli similarly influenced how these stimuli were perceived. Across Δt values, the effect was greatest at spatial disparities at which the auditory stimulus was poorly localized on control trials. Thus, there is a close relationship between the spatial acuity of the auditory system and the ability of the visual stimulus to capture the spatial location of the auditory stimulus. This suggests that the spatial information processed in unimodal cortical areas remains intact, and that under the illusion conditions this information is inappropriately integrated in regions where neurons respond to both auditory and visual stimuli [9–13,16]. Given the interconnections between cortical areas, top-down or parallel influences, for example attention , could also affect spatial processing in unimodal cortical areas, which may explain changes in the activity of these areas during audio-visual tasks . The ventriloquism paradigm should prove very useful, in human imaging studies as well as in single-neuron neurophysiological studies in behaving animals, for elucidating the contributions of different cortical and subcortical areas to the binding of multiple attributes of real-world stimuli.
These experiments demonstrate that visual capture is asymmetrically affected by the temporal disparity between visual and auditory stimuli, with visual stimuli presented before auditory stimuli being more effective in creating the illusion. The spatial dependency of the ventriloquism effect was consistent with the spatial acuity of the auditory system. Small temporal disparities were effective in eliciting the illusion at large spatial disparities, and large temporal disparities could also disrupt spatial localization in the absence of spatial disparities. These findings suggest that the integration of the neuronal spatial representations of auditory and visual stimuli has both temporal and spatial dependencies that lead to the perception of real-world objects in extrapersonal space. These dependencies should provide powerful tools for elucidating the neural correlates of the perception of multimodal stimuli.
The authors would like to thank L.A. Krubitzer, M.L. Phan, T.K. Su and T.M. Woods for helpful comments on previous versions of this manuscript. Funding provided by NIH grant DC02371, the Klingenstein Foundation, and the Sloan Foundation.
REFERENCES
1. Welch RB, Warren DH. Psychol Bull 1980 88, 638–667.
2. Howard IP, Templeton WB. Human Spatial Orientation. New York: Wiley; 1966.
3. Thurlow WR, Jack CE. Percept Mot Skills 1973 36, 1171–1184.
4. Bermant RI, Welch RB. Percept Mot Skills 1976 43, 487–493.
5. Radeau M. Perception 1985 14, 571–577.
6. Jack CE, Thurlow WR. Percept Mot Skills 1973 37, 967–979.
7. Sekuler R, Sekuler AB, Lau R. Nature 1997 385, 308.
8. Connah D, Meyer G, Wuerger SJ. Physiology 1999 54P, 520.
9. Meredith MA, Stein BE. J Neurophysiol 1986 56, 640–662.
10. Meredith MA, Nemitz JW, Stein BE. J Neurosci 1987 7, 3215–3229.
11. Mazzoni P, Bracewell RM, Barash S, et al. J Neurophysiol 1996 75, 1233–1241.
12. Benevento LA, Fallon J, Davis BJ, et al. Exp Neurol 1977 57, 849–872.
13. Bruce C, Desimone R, Gross CG. J Neurophysiol 1981 46, 369–384.
14. Soroker N, Calamaro N, Myslobodsky MS. J Clin Exp Neuropsychol 1995 17, 243–255.
15. Phan ML, Schendel KL, Recanzone GH, et al. J Cogn Neurosci 2000 12, 583–600.
16. Calvert GA, Bullmore ET, Brammer MJ, et al. Science 1997 276, 593–596.
17. Calvert GA, Brammer MJ, Bullmore ET, et al. Neuroreport 1999 10, 2619–2623.
18. Recanzone GH, Makhamra SD, Guard DC. J Acoust Soc Am 1998 103, 1085–1097.
19. Maunsell JH, Gibson JR. J Neurophysiol 1992 68, 1333–1344.
20. Pfingst BE, O'Connor TA. J Neurophysiol 1981 45, 16–35.
21. Recanzone GH, Guard DC, Phan ML. J Neurophysiol 2000 83, 2315–2331.
22. Driver J, Spence C. Trends Cogn Sci 1998 2, 254–262.