B. Kapralos1,3, M. Jenkin1,3, E. Milios2,3 and J. Tsotsos1,3
1Department of Computer Science, York University, North York, Canada M3J 1P3
2Department of Computer Science, Dalhousie University, Halifax, Nova Scotia, B3H 1W5
3Centre for Vision Research, York University, North York, Canada M3J 1P3
{billk, jenkin, tsotsos}@cs.yorku.ca, eem@cs.dal.ca
Video teleconferencing has found a wide range of applications, from facilitating business meetings to aiding in remote medical diagnoses. Various commercial teleconferencing systems exist, including basic static systems for use by two participants (one at each end of the connection) [16][37][57]. There are also systems intended for multiple speakers (i.e. as in a conference setting), but these systems typically focus on a single user and provide limited, if any, automatic speaker tracking technologies [38][48][56].
Existing systems suffer from a number of limitations. Essentially, they provide a limited number of static or manually tracked views. As a consequence, in a multiple speaker setting, a speaker must either move into the camera's view or the camera must be manually commanded to track the speaker. Furthermore, in addition to video, teleconferencing systems must be able to capture and transfer audio (e.g. the speaker's voice). As a result, in a multiple speaker setting, the teleconferencing system must be able to localize a speaker in the audio domain as well. However, with the multiple speaker systems currently available, audio is not focused on the speaker [38][48][56]. Although sound localization systems are available, most rely on extensive audio arrays [6][7][14][44], which require expensive and specialized equipment, are computationally expensive to operate and cannot easily be moved to different locations. Furthermore, little research has investigated the combination of audio and video cues for the detection and tracking of humans. Several systems do exist; however, they are limited by the restrictions they impose. For example, in [17], the person must be roughly in the same plane as the two microphones comprising the audio localization system in order for the person's speech to be correctly localized.
This research investigates the development of a teleconferencing system integrating both audio and visual cues. The goal of this work is an affordable, low maintenance and portable teleconferencing system capable of locating and tracking speakers in a multiple speaker setting in both the audio and video domains.
Figure 1 below illustrates the Eyes 'n Ears hardware set-up. Although not shown, a personal computer (Power Mac G3) is used to control the system.
Figure 1: Eyes 'N Ears Sensor
The following sections describe the audio and video components in greater detail.
Typical camera lenses capture only a narrow field of view. To increase the visual field of the sensor, Eyes 'n Ears utilizes Cyclovision's ParaCamera optical system [1][3][36]. As shown in figure 2 below, the ParaCamera captures the entire hemisphere from a single viewpoint, thereby providing multiple dynamic views. Once the hemispherical view has been obtained, it may be un-warped to produce a panoramic view (figure 3). From this panorama, perspective views of any size corresponding to different portions of the scene may be extracted easily (figures 4a, 4b).
The ability of the ParaCamera to capture an image of the entire hemisphere from a single viewpoint has made it attractive for many applications, including surveillance [4], autonomous robot navigation [61], virtual reality, telepresence [59] and pipe inspection [2].
Figure 3: Panoramic View
Figure 4a: Perspective View
Figure 4b: Perspective View
The video system is primarily responsible for locating the position of all participants of the teleconference. In particular, the video system will locate all human faces present within the ParaCamera's view and determine their coordinates in the "real world". The area of human face detection and tracking is rather large and well investigated. A wide variety of systems exist, including systems which locate human faces by detecting the blinking of one's eyes [] or the movement of the lips during speech [11], by tracking active contours with snakes [32], by eigenvector techniques [21][40], or by using correlation methods to detect certain features such as a pair of eyes or nostrils [49][52]. However, most of the methods currently available are far too computationally expensive and require many assumptions in order to work properly. In addition, the low resolution of the ParaCamera makes it unsuitable for accurate face recognition [20] using many of the methods listed above or other methods used with traditional cameras. For example, the ParaCamera cannot detect the blinking of one's eyes.
A good economical detection/tracking system must be able to locate humans quickly and reliably in the presence of noise and other objects in the environment. In addition, it must run fast and efficiently (e.g. run in real-time), and operate using inexpensive camera equipment [5].
The color of an object may be used as an identifying feature, which is local to the object and largely independent of the view and resolution. As a result, color information may be used to detect objects from differing viewpoints [54]. Furthermore, there are various fast and simple color based human detection and tracking systems available (see [8][22][30][33][41][53][55][60]).
Due to the considerations listed above, and considering that the video system is only one part of the Eyes 'N Ears system (e.g. there is also an audio system requiring computational processing), face detection in the video domain is performed using color information with a method similar to Jones et al. [30].
Jones et al. construct a statistical color model consisting of close to one billion pixels from images obtained via random crawls on the Internet. The pixels were manually classified as skin or non-skin and an RGB value histogram for each class was generated, thereby providing both skin and non-skin models [30]. Given skin and non-skin histograms, the probability that a pixel in the Hue, Saturation space is skin color may be determined using Bayes' rule:
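$$P(\mathrm{skin} \mid HS) = \frac{P(HS \mid \mathrm{skin})\,P(\mathrm{skin})}{P(HS \mid \mathrm{skin})\,P(\mathrm{skin}) + P(HS \mid \neg\mathrm{skin})\,\bigl(1 - P(\mathrm{skin})\bigr)} \qquad (1)$$
where
$$P(HS \mid \mathrm{skin}) = \frac{s[HS]}{T_s}, \qquad P(HS \mid \neg\mathrm{skin}) = \frac{n[HS]}{T_n},$$
$P(\mathrm{skin})$: probability of a pixel being skin
$T_s$: total count contained in the skin histogram
$T_n$: total count contained in the non-skin histogram
$s[HS]$: the pixel count contained in bin HS of the skin histogram
$n[HS]$: the pixel count contained in bin HS of the non-skin histogram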
Once the probability is known, the pixel may be classified as skin by comparing its value to a pre-defined threshold (Δ).
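The classification rule above can be sketched in a few lines. This is a minimal illustration rather than the system's actual code: the function names are hypothetical, the prior P(skin) is assumed to be the fraction of labelled pixels that are skin (the text does not state how it was chosen), and the default threshold of 0.5 is a placeholder.

```python
def skin_probability(s_count, n_count, t_skin, t_nonskin):
    """Posterior P(skin | HS) for one Hue-Saturation bin via Bayes' rule."""
    p_hs_given_skin = s_count / t_skin        # s[HS] / Ts
    p_hs_given_nonskin = n_count / t_nonskin  # n[HS] / Tn
    # Assumption: prior taken as the share of labelled pixels that are skin.
    p_skin = t_skin / (t_skin + t_nonskin)
    numerator = p_hs_given_skin * p_skin
    denominator = numerator + p_hs_given_nonskin * (1.0 - p_skin)
    # A bin never seen in either histogram carries no evidence of skin.
    return numerator / denominator if denominator > 0 else 0.0


def is_skin(s_count, n_count, t_skin, t_nonskin, threshold=0.5):
    """Classify a pixel as skin when its posterior reaches the threshold."""
    return skin_probability(s_count, n_count, t_skin, t_nonskin) >= threshold
```

With this particular prior the posterior reduces to s[HS] / (s[HS] + n[HS]), so a bin holding 30 skin and 10 non-skin pixels yields 0.75 regardless of the histogram totals.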
The RGB values used to construct the histograms, as well as the skin and non-skin histograms, are freely available [30]. Many researchers have used their model and data to successfully segment human skin present in an image [22]; however, this model performed rather poorly with our ParaCamera setup. Since Jones' method, as described above, is simple to understand, easy to implement and does not require extensive computational processing, new skin and non-skin histograms were obtained by manually classifying pixels from images obtained with the ParaCamera as either skin or non-skin. However, rather than constructing the histograms using the RGB values, the RGB values of each pixel were converted to Hue, Saturation and Value (HSV) values [15]. Histograms were then constructed using Hue and Saturation values only, thereby ignoring Value. Since most changes in lighting conditions correspond to changes in Value, ignoring Value drastically reduces the negative effects introduced by changes in lighting conditions. In addition, various others have investigated the detection of human skin color in the HSV space and have found that human skin, regardless of race or color, forms a tight cluster in HSV color space []. Finally, constructing the histograms using only Hue and Saturation greatly reduces the storage required for each histogram.
Figure 5 below illustrates the Hue and Saturation histograms of both the skin (red/yellow plot) and non-skin (blue plot) pixels. The skin histogram clearly indicates the tight cluster formed by human skin color in HSV space, whereas the histogram of the non-skin pixels illustrates the scattering of the non-skin pixel colors throughout a much larger region.
Due to the tight cluster formed by the skin pixels in HSV space, the skin pixel classifier described above performs well and is capable of detecting human skin rather accurately.
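The histogram construction described above might look as follows. This is a sketch only: the bin count of 32 per axis and the function names are assumptions, since the text does not give the histogram resolution, and the conversion relies on Python's standard `colorsys` module rather than whatever routine the authors used.

```python
import colorsys

BINS = 32  # hypothetical number of bins per axis; not specified in the text


def hs_bin(r, g, b, bins=BINS):
    """Map an 8-bit RGB pixel to its (hue, saturation) bin, discarding Value."""
    h, s, _value = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Clamp so that h == 1.0 or s == 1.0 falls in the last bin.
    return (min(int(h * bins), bins - 1), min(int(s * bins), bins - 1))


def build_histogram(pixels, bins=BINS):
    """Count manually labelled pixels (skin or non-skin) per (hue, saturation) bin."""
    histogram = {}
    for r, g, b in pixels:
        key = hs_bin(r, g, b, bins)
        histogram[key] = histogram.get(key, 0) + 1
    return histogram
```

Because Value is dropped, a pixel and a darker copy of it fall into the same bin: `hs_bin(200, 120, 80)` and `hs_bin(100, 60, 40)` both map to bin `(1, 19)`, which illustrates the robustness to lighting changes the text describes.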
Figure 5: Normalized Histograms for Skin and Non-Skin Pixels
Figure 6 below is an image obtained with the ParaCamera and figure 7 was obtained after classifying each of the pixels as either skin or non-skin (skin color pixels have been colored red). As figure 7 illustrates, the human skin present in figure 6 has been successfully detected as skin.
Figure 6: Image Obtained with the ParaCamera
Figure 7: Skin Segmentation Applied to the Image of Figure 6
Regions of skin in the ParaCamera image which are spatially close are then grouped together to form a cluster. Assuming the participants are not very close to each other, each cluster of skin regions corresponds to a person. Finally, the image coordinates of each skin region are converted to world coordinates using a method similar to [20] and provided to the audio system as a potential direction to a sound source.
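One simple way to group spatially close skin regions is to link any two regions whose centroids lie within a distance limit and take the connected groups as clusters. The text does not say which grouping rule was used, so the sketch below is an assumption; `cluster_regions` and `max_dist` are hypothetical names.

```python
def cluster_regions(centroids, max_dist):
    """Group region centroids (x, y) that are chained together by gaps <= max_dist.

    Returns a sorted list of clusters, each a sorted list of region indices.
    """
    unvisited = set(range(len(centroids)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            xi, yi = centroids[frontier.pop()]
            near = [j for j in unvisited
                    if (centroids[j][0] - xi) ** 2 + (centroids[j][1] - yi) ** 2
                    <= max_dist ** 2]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                frontier.append(j)
        clusters.append(sorted(cluster))
    return sorted(clusters)
```

For example, `cluster_regions([(0, 0), (3, 0), (6, 0), (50, 50)], 4)` groups the first three regions into one cluster and leaves the distant fourth region on its own.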
The video system provides the location estimate of all potential teleconference participants. With this initial position estimate, the audio system uses beam-forming [10][28][29] to detect all active speakers and focus on their speech, thereby greatly reducing any other sounds or noises present in the environment. Furthermore, by detecting sound within a small region containing a human face, the audio system confirms the presence of a person there and also "fine-tunes" the initial position provided by the video system, thereby providing a more accurate position estimate. Once a speaker is localized in both domains, they may be tracked in both the audio and video domains.
Figure 8 illustrates the array configuration used for this work. Four omni-directional microphones are mounted in a (small) static pyramidal shape about the base of the ParaCamera, providing an economical and portable acoustic array capable of being focused in any direction in 3-space [18][19].
Figure 8: Microphone Array and Coordinate System
Microphone 1 is chosen as the reference point for the array as well as the origin for the coordinate system used in this work. The positions of the other three microphones, the skin regions and any sound sources are defined relative to it. The following sections describe the techniques used to focus the array's signal capturing ability on some particular region in 3-space.
Focusing the Array (Delay and Sum Beam-Forming)
Beam-forming allows the array to focus on (listen to) some particular direction. Any signals propagating in this direction are allowed to 'pass' through, whereas signals propagating in other directions are attenuated [29]. As described below in greater detail, focusing the array is accomplished entirely in software. Thus, neither the array nor any of its sensors has to be physically moved; they remain stationary over time.
Consider a sound source located at some position in space. Furthermore, consider an array of m microphones, each at a unique location $x_m$ and each in the path of the propagating waves emitted by this sound source. The time taken for the propagating wave to reach each microphone will differ and is proportional to the distance between the microphone and the source. These time differences cause the signal received by each microphone to be "out-of-phase" with respect to the array origin. Knowing the location of the source and each of the array sensors, as well as the speed of sound, the time taken for the propagating wave to reach each sensor may be calculated. Beam-forming takes advantage of these differences between the times of arrival of a sound at each sensor [45].
Appropriately delaying the signal received by each microphone ensures the signals are in phase with respect to the array origin. This reinforces the propagating sound signal while attenuating noise or signals propagating from other directions, leading to a maximized beam-formed output signal.
In many applications, the signal of interest may fall within a small frequency region. By restricting the signals being sampled by each microphone to this region, the noise and unwanted signals received by the microphones may be decreased. This is the approach taken by the Filter Delay and Sum Beam-former. As shown in figure 9 below, the signal received by each of the microphones is filtered prior to calculating the delays.
Figure 9: Filter Delay and Sum Beam-forming
The majority of human speech falls within the 200 - 4000 Hz frequency region [42]. Since speech is the signal of interest for this application, the filter delay and sum beam-former was used as opposed to the delay and sum beam-former. The signal received by each of the microphones was filtered to allow only signals within the 200 - 4000 Hz frequency range to pass.
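The band-limiting step can be illustrated with a small FIR band-pass filter. The text does not describe the filter that was actually used, so everything below is an assumption made for illustration: a Hamming-windowed sinc design, 101 taps, a 16 kHz sampling rate in the usage note, and the function names themselves.

```python
import math


def lowpass_tap(cutoff_hz, fs, k):
    """Ideal low-pass impulse response sample at offset k from the centre tap."""
    x = 2.0 * cutoff_hz / fs
    return x if k == 0 else math.sin(math.pi * x * k) / (math.pi * k)


def bandpass_fir(low_hz, high_hz, fs, taps=101):
    """Band-pass coefficients: difference of two low-pass filters, Hamming windowed."""
    centre = (taps - 1) // 2
    coeffs = []
    for n in range(taps):
        k = n - centre
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (taps - 1))
        ideal = lowpass_tap(high_hz, fs, k) - lowpass_tap(low_hz, fs, k)
        coeffs.append(ideal * window)
    return coeffs


def fir_filter(samples, coeffs):
    """Convolve samples with coeffs, keeping only fully overlapped outputs."""
    n_taps = len(coeffs)
    return [sum(coeffs[j] * samples[i - j] for j in range(n_taps))
            for i in range(n_taps - 1, len(samples))]
```

Calling `fir_filter(mic_samples, bandpass_fir(200, 4000, 16000))` on each microphone signal before the delays are applied would pass a 1 kHz tone almost unchanged while strongly attenuating a 7 kHz tone. With so few taps the 200 Hz edge is only loosely enforced; a longer filter would sharpen it at a higher computational cost.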
Far vs. Near Field
Since the position of each potential speaker is known, prior to beam-forming to a particular location, the error of assuming a far field source is calculated. When the error is below some pre-defined threshold, a far field source is assumed; otherwise the source is assumed to be in the near field.
Calculating the Far Field Assumption Error
Consider the angle $\varepsilon_m$ between $r_0$ (the vector from the sound source to $x_1$) and $r_m$ (the vector between the sound source and the mth microphone). As the distance between the source and the array reference increases, $\varepsilon_m$ will decrease. This angle may be used as an indication of the error involved in assuming a far field source [29]. Since the coordinates of both the source and each of the array microphones are known, $\varepsilon_m$ can easily be determined.
Sound sources producing values of $\varepsilon_m$ which are less than 1° are considered to be in the far field. Otherwise, the source will be considered to be in the near field.
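The far field test can be sketched as follows. The sketch follows the rule stated earlier in this section, that a far field source is assumed when the error falls below the threshold, so small angles mean far field. The function names are hypothetical and the reference microphone is placed at the origin, as in the coordinate system described above.

```python
import math


def far_field_error_deg(source, mic, reference=(0.0, 0.0, 0.0)):
    """Angle in degrees between the source-to-reference and source-to-mic vectors."""
    r0 = [a - b for a, b in zip(reference, source)]
    rm = [a - b for a, b in zip(mic, source)]
    dot = sum(a * b for a, b in zip(r0, rm))
    norm = math.sqrt(sum(a * a for a in r0)) * math.sqrt(sum(a * a for a in rm))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def is_far_field(source, mics, threshold_deg=1.0):
    """Treat the source as far field when every microphone's error is below the threshold."""
    return all(far_field_error_deg(source, m) < threshold_deg for m in mics)
```

With microphones 10 cm from the reference, a source 10 m away produces an angle of roughly 0.57° and is treated as far field, while a source 1 m away produces roughly 5.7° and is treated as near field.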
With a far field source, the direction of propagation between the sound source and each of the microphones of the array is the same. As a result, the time delay is proportional to the length of the line obtained by projecting the microphone onto a line from the sound source to the array reference ($x_1$). The length of this line is equivalent to the difference in distance the sound must travel to reach the array origin and the mth microphone. This distance divided by the velocity of sound in air ($v_{sound}$) gives the desired delay.
Given the source's direction of propagation $\beta_0$ and the position of each microphone relative to the array origin ($x_m$), simple geometry may be used to directly determine the time difference ($\Delta_m$).
$$\Delta_m = \frac{\beta_0 \cdot x_m}{v_{sound}}$$
where $\beta_0$ is the unit vector denoting the direction of propagation relative to the array's origin, and the speed of sound is assumed to be constant at 345 m/s.
When a source is in the near field, the direction of propagation from the source to each microphone varies. As shown below, the required delay is related to the distance between the source and the array reference and the distance between the source and the mth microphone.
$$\Delta_m = \frac{|r_m| - |r_0|}{v_{sound}}$$
where $|r_0|$ represents the distance between the sound source and the array reference and $|r_m|$ represents the distance between the source and each of the microphones (for m = 2, 3, 4).
As in the far-field case, delays which do not correspond to a correct source location (e.g. the beam-former is not directly focused on the sound source) will result in a reduction of the energy level of the beam-formed signal.
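The two delay computations can be compared side by side. This sketch assumes the delay is the extra travel time to microphone m relative to the reference microphone at the origin: the projection of the microphone position onto the propagation direction in the far field, and the difference of the two source distances in the near field. The sign convention and the function names are assumptions, not taken from the original equations.

```python
import math

V_SOUND = 345.0  # m/s, the constant speed of sound assumed in the text


def far_field_delay(direction, mic_pos, v_sound=V_SOUND):
    """Delay from projecting the microphone position onto the propagation direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    return sum((c / norm) * x for c, x in zip(direction, mic_pos)) / v_sound


def near_field_delay(source, mic_pos, reference=(0.0, 0.0, 0.0), v_sound=V_SOUND):
    """Delay from the difference between source-to-mic and source-to-reference distances."""
    path_difference = math.dist(source, mic_pos) - math.dist(source, reference)
    return path_difference / v_sound
```

For a microphone 0.345 m along the propagation direction, `far_field_delay((1.0, 0.0, 0.0), (0.345, 0.0, 0.0))` gives 1 ms, and `near_field_delay` gives practically the same value once the source is placed 1 km away on the same axis. The near field expression therefore converges to the far field one as the source recedes.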
Processing the Signal
The beam-formed signal is obtained by summing the appropriately delayed signals of each microphone.
![]()
The signal of microphone 1 is not delayed since the signals of the remaining microphones are delayed relative to it.
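A minimal delay-and-sum sketch is given below. It assumes each delay is the lag with which a microphone hears the source relative to microphone 1, rounds delays to whole samples, and simply discards samples that cannot be aligned. The function name and these simplifications are illustrative, not the authors' implementation.

```python
def delay_and_sum(signals, delays_s, fs):
    """Sum microphone signals after compensating each one's arrival lag.

    signals[0] is the reference microphone, whose delay is 0.
    delays_s[m] is the lag in seconds of microphone m relative to the reference.
    """
    shifts = [round(d * fs) for d in delays_s]  # whole-sample approximation
    lowest, highest = min(shifts), max(shifts)
    n = min(len(s) for s in signals)
    out = []
    for t in range(max(0, -lowest), n - max(0, highest)):
        # Sample t + shift of a lagging microphone lines up with sample t of the reference.
        out.append(sum(sig[t + k] for sig, k in zip(signals, shifts)))
    return out
```

If a second microphone receives the same pulse three samples late, steering with the correct 3 ms delay at a 1 kHz sampling rate doubles the pulse, whereas steering with zero delay leaves two smaller, misaligned copies. This is the reinforcement effect the beam-former relies on.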
Signal energy Esignal is computed using the average absolute magnitude measure.
![]()
Signal variance vsignal is calculated as follows [19]. The sample window is divided into sub-windows, and the variance of each sub-window is then computed. Finally, vsignal is determined by taking the average sub-window variance.

The variance measure is a strong indication of the presence of a sound source, as the variance of many sound sources, including human speech, varies considerably relative to the variation of the background noise. Considering the variance measure has produced far more accurate results than using the signal magnitude alone. During system initialization, the variance of the background noise is measured for a period of 100 ms and a threshold value is computed.
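The two measures described above can be sketched directly from their verbal definitions. The use of the population variance, non-overlapping sub-windows and the function names are assumptions, since the original equations are not reproduced here.

```python
def signal_energy(samples):
    """Average absolute magnitude of the samples."""
    return sum(abs(x) for x in samples) / len(samples)


def signal_variance(samples, sub_len):
    """Average of the variances of consecutive, non-overlapping sub-windows."""
    variances = []
    for start in range(0, len(samples) - sub_len + 1, sub_len):
        window = samples[start:start + sub_len]
        mean = sum(window) / len(window)
        # Assumption: population variance (divide by the window length).
        variances.append(sum((x - mean) ** 2 for x in window) / len(window))
    return sum(variances) / len(variances)
```

For instance, `signal_energy([1, -1, 2, -2])` is 1.5, and `signal_variance([1, 3, 1, 3, 0, 0, 0, 0], 4)` averages a sub-window variance of 1.0 with one of 0.0 to give 0.5.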
Appropriately delaying the signal received by each microphone ensures the signals are in phase with respect to the array origin. This reinforces the propagating sound signal while attenuating noise or signals propagating from other directions, leading to a maximized beam-formed output [29]. It has been observed, however, that the magnitude of the signal received from each of the four microphones may vary considerably between consecutive measurements even without any change in the background noise or in the sound source. Rather than relying on the absolute magnitude of the beam-formed signal as a measure of the source location, the beam-formed signal sbeam is compared to the average signal of the microphones savg.
![]()
sdif, the difference between savg and sbeam, will be maximized when the beam-former is focused at the location corresponding to the sound source. Active speakers are detected by considering each cluster of skin regions separately. The beam-former is focused on each region of the cluster using the techniques described above to locate the region corresponding to the greatest value of sdif. Provided the value of sdif and the variance of sbeam corresponding to this region are above some pre-defined threshold values, this region is considered to be the head of a speaker.
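The decision rule for one cluster can be sketched as below. The sketch assumes the sdif value and the variance of sbeam have already been measured for every skin region in the cluster; the tuple layout, the function name and the threshold parameters are hypothetical.

```python
def find_speaker_region(measurements, sdif_threshold, variance_threshold):
    """Pick the region of one cluster most likely to be a speaker's head.

    measurements: list of (region_id, s_dif, variance) tuples, one per skin region.
    Returns the region id, or None if no region passes both thresholds.
    """
    if not measurements:
        return None
    # The beam-former response is strongest for the region with the greatest s_dif.
    region_id, s_dif, variance = max(measurements, key=lambda m: m[1])
    if s_dif > sdif_threshold and variance > variance_threshold:
        return region_id
    return None
```

A cluster measured as `[("hand", 0.2, 0.5), ("face", 0.9, 0.8)]` with both thresholds at 0.4 would return `"face"`. If that region's variance fell below the threshold, the cluster would be treated as containing no active speaker.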
[1] Baker, S. and Shree K. Nayar. (1999). "A Theory of Single Viewpoint Catadioptric Image Formation". International Journal of Computer Vision. 1999.
[2] Basu, A. and D. Southwell. (1995). "Omnidirectional Sensors for Pipe Inspection". Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. 4, pp. 3107-3112.
[3] Baker, S. and Shree K. Nayar. (1998). "A Theory of Catadioptric Image Formation". Proceedings of the 6th International Conference on Computer Vision. Bombay, India. January 1998. pp. 35-42.
[4] Boult, T. E., R. Michaels, X. Gao, P. Lewis, C. Power, W. Yin and A. Erkan. "Frame-Rate Omnidirectional Surveillance & Tracking of Camouflaged and Occluded Targets". http://www.eecs.lehigh.edu/~tboult/TRACK/LOTS.html
[5] Bradski, R. Gary. (1998). "Computer Vision Face Tracking for Use in a Perceptual User Interface". http://developer.intel.com.
[6] Brandstein, M., J. Adcock and H. Silverman. (1995). "A Practical Time-Delay Estimator for Localizing Speech Sources with a Microphone Array". Computer, Speech, and Language, 9:153-169, April 1995.
[7] Brandstein, M., J. Adcock and H. Silverman. (1995). "A Closed-Form Method for Finding Source Locations from Microphone-Array Time-Delay Estimates". In Proceedings of ICASSP95, pp. 3019-3022. IEEE, 1995.
[8] Chai, Douglas and King N. Ngan. (1999). "Face Segmentation Using Skin-Color Map in Videophone Applications". IEEE Transactions on Circuits and Systems for Video Technology. Vol. 9, No. 4, June 1999.
[9] Claes, T. and D. Van Compernolle. (1996). "SNR-Normalisation for Robust Speech Recognition". In Proc. ICASSP, Vol. I, pp. 331-334, Atlanta, Georgia, U.S.A., May 7-10 1996.
[10] Compernolle, Dirk Van and Stefaan Van Gerven. (1995). "Beam-forming with Microphone Arrays". Final proceedings of the COST229 Application of DSP to Telecommunications. pp. 107-131.
[11] Crowley, James L. and Francois Berard. (1997). "Multi-modal Tracking of Faces for Video Communication". IEEE Conference on Computer Vision and Pattern Recognition. Puerto Rico, 1997.
[12] Cutler, Ross and Larry Davis. (2000). "Look Who's Talking: Speaker Detection using Video and Audio Correlation". IEEE International Conference on Multimedia and Expo 2000. New York, New York, USA.
[13] Datum, Michael, Francesco Palmieri and Andrew Moiseff. (1996). "An Artificial Neural Network for Sound Localization Using Binaural Cues". Journal of the Acoustical Society of America. Vol. 100, No. 1. July 1996.
[14] Flanagan, J. L., J. D. Johnston, R. Zahn and G. W. Elko. (1985). "Computer Steered Microphone Arrays for Sound Transduction in Large Rooms". Journal of the Acoustical Society of America. Vol. 78, No. 5. November 1985.
[15] Foley, James D., Andries Van Dam, Steven K. Feiner and John F. Hughes. (1996). "Computer Graphics Principles and Practice". Addison-Wesley Publishing Company. USA.
[16] Freed, Les. (2000). "Microsoft NetMeeting 3.0". PC Magazine. May 25 2000.
[17] Goodridge, Steven George. (1997). "Multimedia Sensor Fusion for Intelligent Camera Control and Human-Computer Interaction". Doctor of Philosophy thesis in Electrical Engineering. Raleigh NC.
[18] Guentchev, Kamen Yankiv. (1997). "Learning Based Three Dimensional Sound Localization Using a Compact Non-Coplanar Array of Microphones". Master of Science Thesis. Department of Computer Science, Michigan State University.
[19] Guentchev, K. Y. and John J. Wong. (1998). "Learning-Based Three Dimensional Sound Localization Using a Compact Non-Coplanar Array of Microphones". American Association for Artificial Intelligence.
[20] Gutchess, D., Anil K. Jain and Sei-Wang Chen. "Automatic Surveillance Using Omnidirectional and Active Cameras".
[21] Huang, Jie, Noboru Ohnishi and Noboru Sugie. (1998). "Spatial Localization of Sound Source: Azimuth and Elevation Estimation". IEEE Instrumentation and Measurement Technology Conference. St. Paul, Minnesota, USA. May 18-21, 1998.
[22] Herpers, R., G. Verghese, K. Derpanis, D. Topalovic and J. K. Tsotsos. (1999). "Detection and Tracking of Faces in Real Environments". Proceedings of the International Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems. Corfu, Greece. September 26-29 1999.
[23] Huang, Jie, Noboru Ohnishi and Noboru Sugie. (1998). "A Biomimetic System for Localization and Separation of Multiple Sound Sources". IEEE Transactions on Instrumentation and Measurement. Vol. 44, No. 3, June 1995.
[24] Huang, Jie, Noboru Ohnishi and Noboru Sugie. (1996). "Building Ears for Robot". International Symposium on Artificial Life and Robotics. B-Con Plaza, Beppu, Oita, Japan. February 18-20, 1996.
[25] Huang, Jie, Tadawute Supaongprapa, Ikutaka Terakura, Fuming Wang, Noboru Ohnishi and Noboru Sugie. (1999). "A Model-based Sound Localization System and its Application to Robot Navigation". Robotics and Autonomous Systems. Vol. 27. pp. 199-209.
[26] Irie, Robert Eiichi. (1993). "Robust Sound Localization: An Application of an Auditory Perception System for a Humanoid Robot". Master of Science Thesis. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[27] Ishiguro, Hiroshi, Kim C. Ng, Richard Capella and Mohan M. Trivedi. (1999). 2nd Int. Conference on 3D Digital Imaging & Modeling. Ottawa, Canada, Oct. 1999.
[28] Jastrzembski, Mike and Jan Tai. (1995). "Directional Microphone Array". Final Course Project for Digital Communications and Signal Processing Systems Design - Course 18-551. Department of Electrical and Computer Engineering, Carnegie Mellon University.
[29] Johnson, Don H. and Don E. Dudgeon. (1993). "Array Signal Processing: Concepts and Techniques". Prentice Hall, Englewood Cliffs, New Jersey, USA.
[30] Jones, J. Michael and James M. Rehg. (1998). "Statistical Color Models with Applications to Skin Detection". Cambridge Research Laboratory Technical Report Series. CRL 98/11.
[31] Kapur, J. P. (1997). "Face Detection in Color Images". EE499 Design Project. University of Washington Department of Computer Engineering.
[32] Kass, M., A. Witkin and D. Terzopoulos. (1987). "Snakes: Active Contour Models". International Journal of Computer Vision. 1 (2), pp. 133-144.
[33] McKenna, S., S. Gong and Y. Raja. (1998). "Modeling Facial Colour and Identity with Gaussian Mixtures". Pattern Recognition. Vol. 31, No. 12, pp. 1883-1892.
[34] Mori, Renato, ed. (1998). "Spoken Dialogues with Computers". Academic Press Limited. London, United Kingdom.
[35] Mumolo, Enzo, Massimiliano Nolich and Gianni Vercelli. (2000). "Algorithms and Architectures for Acoustic Localization Based on Microphone Array in Service Robot". Proceedings of the IEEE International Conference on Robotics & Automation. San Francisco, California, USA. April 2000.
[36] Nayar, S. K. (1997). "Catadioptric Omnidirectional Camera". Proceedings IEEE Conference on Computer Vision and Pattern Recognition. pp. 482-488.
[37] Panasonic KXC-AP150 Video Communication Terminal with Detachable, Hand Held Color Camera Unit. http://www.prodcat.panasonic.com/shop/product.asp?sku=KXC-AP150&CategoryID=303
[38] Panasonic KXC-M7800 Vision Pro Series 7800 Video Teleconferencing System. http://www.prodcat.panasonic.com/shop/product.asp?sku=KXC-M7800&CategoryID=303
[39] Parham Aarabi. "Multi-Sense Artificial Awareness". Master of Science Thesis. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada. June 1999.
[40] Pentland, Alex, Baback Moghaddam and Thad Starner. (1994). "View Based and Modular Eigenspaces for Face Recognition". M.I.T Media Laboratory, Perceptual Computing Section, Technical Report No. 245.
[41] Pingali, G., Gamze Tunali and Ingrid Carlbom. (1999). "Audio Visual Tracking for Natural Interactivity". ACM Multimedia '99. October 1999. Orlando, Florida, USA.
[42] Rabiner, L. R. and M. R. Sambur. (1975). "An Algorithm for Determining the Endpoints of Isolated Utterances". The Bell System Technical Journal. Vol. 54, No. 2, February 1975.
[43] Rabiner, L. R. and Biing-Hwang Juang. (1993). "Fundamentals of Speech Recognition". Prentice Hall, Englewood Cliffs, New Jersey, USA.
[44] Rabinkin, D. (1996). "A DSP Implementation of Source Location Using Microphone Arrays". In 131st meeting of the Acoustical Society of America. Indiana, USA. May 15 1996.
[45] Rabinkin, D. (1994). "Digital Hardware and Control for a Beam-Forming Microphone Array". Master of Science Thesis. Rutgers, The State University of New Jersey, Department of Electrical Engineering. New Brunswick, New Jersey, USA. January 1994.
[46] Reid, L. Greg. "Active Binaural Sound Localization: Techniques, Experiments and Comparisons". Master of Science Thesis. York University, Department of Computer Science. April 28, 1999.
[47] Renomeron, James Richard. (1997). "Spatially Selective Sound Capture for Teleconferencing Systems". Master of Science Thesis, Department of Electrical and Computer Engineering, Rutgers, The State University, New Brunswick, New Jersey, USA. October 1997.
[48] RSI Systems Incorporated. Eris Visual Communications Systems. http://www.squarenet.com/rsi/compbase.htm
[49] Scassellati, Brian. (1998). "Eye Finding via Face Detection for a Foveated, Active Vision System". American Association for Artificial Intelligence 1998.
[50] Shree K. Nayar. (1997). "Omnidirectional Video Camera". Proceedings of the DARPA Image Understanding Workshop. New Orleans, USA. May 1997.
[51] Stiefelhagen, Rainer, Jie Yang and Alex Waibel. (1999). "Modeling Focus of Attention for Meeting Indexing". ACM Multimedia '99.
[52] Stiefelhagen, Rainer and Jie Yang. (1997). Proceedings of the International Conference on Acoustics, Speech and Signal Processing: ICASSP'97, Munich, Germany, April 1997.
[53] Storring, M., Hans J. Anderson and Erik Granum. (1999). "Skin Color Detection Under Changing Lighting Conditions". Seventh Symposium on Intelligent Robotics Systems. Coimbra, Portugal. July 20-23 1999. pp. 187-195.
[54] Swain, J. Michael and Dana H. Ballard. (1991). "Color Indexing". International Journal of Computer Vision. Volume 7, pp. 11-32.
[55] Terrillon, J. C., M. David and S. Akamatsu. (1998). "Automatic Detection of Human Faces in Natural Scene Images by Use of a Skin Color Model and of Invariant Moments". Proceedings of the Third International Conference on Automatic Face and Gesture Recognition. Nara, Japan. April 1998. pp. 112-117.
[56] TTINewgen Aethra VTC228-234 Le Pleiadi. http://www.ttinewgen.com/products/aethra_vtc.htm
[57] US Robotics Company. Teleconferencing Systems: ConferenceLink CS 1000 and CS1000X and ConferenceLink CS1050.
[58] West, James R. (1998). "Five Channel Panning Laws: An Analytical and Experimental Comparison". Master of Science in Music Engineering Technology Thesis. Faculty of Music. Coral Gables, Florida.
[59] Yagi, Yasushi. (1999). "Omnidirectional Sensing and its Applications". IEICE Trans. Inf. & Syst. Volume E82-D (3). March 1999.
[60] Yang, J. and Alex Waibel. (1996). "A Real Time Face Tracker". Proceedings of the WACV '96. Sarasota, Florida, USA.
[61] Zheng, J. Y. and S. Tsuji. (1992). "Panoramic Representation for Route Recognition by a Mobile Robot". International Journal of Computer Vision. 9 (1). pp. 55-76.
[62] Zotkin, Dimitry, Ramani Duraiswami, Vasanth Philomin and Larry S. Davis. (2000). "Smart VideoConferencing". Proceedings of the International Conference on Multimedia and Expo. New York City, New York. August 2000.