Selective Tuning and Saliency

 

Saliency, saliency maps, and gaze decisions based on a saliency map are the most common manifestations of visual attention in computer vision systems, so it is important to explain what role saliency plays within the Selective Tuning (ST) model.



Saliency Maps and their Role in ST


It would not be entirely incorrect to think, especially from the older papers, that ST uses a hierarchical version of a data-driven saliency map: that the top layer of the processing network is in fact the same as the saliency map used by others, with a winner-take-all algorithm making next-fixation decisions. But this conclusion would misinterpret the fact that the demonstrations provided are of a generic nature, without visual representations. The early demonstrations of ST (Tsotsos et al. 1995) were intended to show how the attentional mechanism performs and were never intended to make any comment about the representations over which the algorithm operates. In fact, the whole point was to show that the algorithm performs correctly regardless of representation. To be sure, a kind of saliency was implied, and top-down task tuning was present there (and in Tsotsos 1990) before it appeared in other models.


Importantly, what happens after the top-level winner-take-all decision is made shows how significantly ST diverges from saliency map models (see details on how ST is distinct). A saliency map model chooses a point, or perhaps some arbitrary region about that point of maximum conspicuity. ST instead uses a recurrent localization process to find the stimulus that led to that maximum response. In doing so, ST also suppresses the surround of that stimulus to allow for better subsequent analysis. The winner-take-all process is defined over an arbitrary representation, and proofs are provided for its properties: convergence under all inputs, time to convergence, and how to set convergence time parameters.
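The contrast between picking a top-level maximum and tracing it back to its source can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the pyramid uses 2x2 max pooling, a plain maximum stands in for ST's iterative winner-take-all process (whose convergence properties are what the proofs address), and surround suppression is reduced to zeroing nearby inputs. The function names build_pyramid, localize, and suppress_surround are hypothetical and do not come from any ST implementation.

```python
def build_pyramid(image, levels):
    """Return [input, layer 1, ..., top]; each unit is the max of a 2x2 block below."""
    pyramid = [image]
    for _ in range(levels):
        below = pyramid[-1]
        rows, cols = len(below) // 2, len(below[0]) // 2
        above = [[max(below[2 * r][2 * c], below[2 * r][2 * c + 1],
                      below[2 * r + 1][2 * c], below[2 * r + 1][2 * c + 1])
                  for c in range(cols)] for r in range(rows)]
        pyramid.append(above)
    return pyramid


def strongest(grid, candidates):
    """Stand-in for winner-take-all: the candidate unit with the largest response."""
    return max(candidates, key=lambda rc: grid[rc[0]][rc[1]])


def localize(pyramid):
    """Pick the top-level winner, then trace it down to the input that caused it."""
    top = pyramid[-1]
    r, c = strongest(top, [(i, j) for i in range(len(top))
                           for j in range(len(top[0]))])
    for layer in reversed(pyramid[:-1]):
        # Only the units feeding the current winner compete at the next layer down.
        children = [(2 * r + dr, 2 * c + dc) for dr in (0, 1) for dc in (0, 1)]
        r, c = strongest(layer, children)
    return r, c


def suppress_surround(image, winner, radius):
    """Copy the input, zeroing everything within `radius` of the winner except it."""
    wr, wc = winner
    out = [row[:] for row in image]
    for r in range(len(image)):
        for c in range(len(image[0])):
            near = max(abs(r - wr), abs(c - wc)) <= radius
            if near and (r, c) != (wr, wc):
                out[r][c] = 0.0
    return out
```

A saliency map model would stop at the top-level winner; the sketch continues downward so that the selected item is an input location rather than a coarse region, and only then suppresses its surround.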


As shown in more recent papers (e.g., Tsotsos et al. 2005), but first described in Tsotsos 1990 (Figure 8, p. 440), ST can operate on multiple networks. It can do so on the strength of the generic algorithms defined earlier. Saliency for ST is a local characteristic, distributed because each kind of neural representation has its own 'meaning' for saliency. One cannot define a global default algorithm for saliency and expect it to suffice for all visual computations.


This is not to say that the saliency map has no role in ST or in attention. It has a very specific role, one that fits its central use in computer vision systems, namely fixation change decisions. The ST and Overt Attention page shows how ST, which was originally intended to show how covert attention might function, is extended to accommodate fixation control. The basic idea was present in Tsotsos et al. 1995 (pp. 535-538); it was fleshed out by Zaharescu (2004). In ST the saliency map plays exactly the same role one sees in Itti et al. 1998 and its derivatives, namely computation of image conspicuity and selection of the point of maximum conspicuity as the next fixation. The differences are three. First, the saliency map is embedded into the overall ST model and is not stand-alone. Second, whereas the saliency map models all focus only on overt fixation, ST provides for both overt and covert attentional fixations and decides which is best for each attentional shift. Third, the embedding includes a fixation history maintenance process that goes beyond inhibition of return, because when the eyes move the scene changes and this must be accommodated. The standard saliency map models operate on a fixed image, separated from a real scene that has no boundaries.
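To make the third difference concrete, here is a minimal Python sketch of why a fixation history has to do more than mark inhibited pixels in a fixed image. It is not the mechanism of Zaharescu (2004); the class FixationHistory, its methods, and the use of integer (x, y) offsets from the image centre are all assumptions made for illustration. The point it shows is that attended locations are stored in scene coordinates and must be re-expressed in image coordinates every time the gaze moves.

```python
class FixationHistory:
    """Remembers attended locations in scene coordinates (hypothetical sketch)."""

    def __init__(self):
        self.gaze = (0, 0)    # scene coordinates of the current image centre
        self.visited = []     # scene coordinates of everything attended so far

    def to_scene(self, image_xy):
        return (image_xy[0] + self.gaze[0], image_xy[1] + self.gaze[1])

    def to_image(self, scene_xy):
        return (scene_xy[0] - self.gaze[0], scene_xy[1] - self.gaze[1])

    def record(self, image_xy):
        """Covert shift: remember the location, leave the gaze where it is."""
        self.visited.append(self.to_scene(image_xy))

    def shift_gaze(self, image_xy):
        """Overt shift: the chosen location becomes the new image centre."""
        self.record(image_xy)
        self.gaze = self.to_scene(image_xy)

    def inhibited(self, half_width, half_height):
        """Previously attended locations that fall inside the current view,
        expressed in current image coordinates."""
        out = []
        for scene_xy in self.visited:
            x, y = self.to_image(scene_xy)
            if abs(x) <= half_width and abs(y) <= half_height:
                out.append((x, y))
        return out
```

With a fixed image, the history would never need remapping; here a location attended before a gaze shift reappears at a different image position afterwards, or drops out of view entirely while remaining in the history.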


Saliency as Information Maximization


Neil Bruce, in the course of his PhD thesis (Bruce 2008), proposed that saliency be considered in the context of information theory. The proposed definition is distinguished from previous reports on this front and is demonstrated to be a natural, principled definition for salient visual content. Specifically, the proposal, termed Attention by Information Maximization (AIM), seeks to select the visual content that is most informative, in a formal sense, in the context of a specific scene, and is put forth in a form that is amenable to more general definitions of context. Efficacy in predicting human gaze patterns is demonstrated, and the proposal is shown to outperform existing models in the prediction of fixation points. With regard to biological plausibility, an important consideration is the extent to which the model's behavior agrees with the psychophysics and neurophysiology literature. To this end, it is shown that AIM accounts for an unprecedented range of classic psychophysics results, including some subtle and counterintuitive ones, and that it may be achieved via a neural implementation consistent with observations concerning surround modulation in the cortex. More general modeling considerations are also addressed, including compatibility with descriptions of how attention as a whole is achieved, and constraints on possible architectures for attentional selection in light of recent psychophysics and neural imaging results. See the series of papers by Neil Bruce cited below for details.
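One way to read "most informative in a formal sense in the context of a specific scene" is Shannon self-information: a feature response that is rare within the scene carries more bits than a common one. The Python sketch below illustrates that reading only. It assumes the feature responses are already computed and treated as independent, and it estimates each likelihood with a simple histogram over the scene; the function name self_information_map and the bins parameter are hypothetical and are not taken from the released AIM code, which works with learned ICA bases.

```python
import math
from collections import Counter


def self_information_map(features, bins=8):
    """features[k][i] is the response of feature k at location i.

    Returns, for each location, the sum over features of -log2 p(response),
    where p is a histogram estimate taken over all locations of the same scene.
    """
    n = len(features[0])
    info = [0.0] * n
    for responses in features:
        lo, hi = min(responses), max(responses)
        # A constant feature has zero range; fall back to a single usable bin.
        width = (hi - lo) / bins or 1.0
        idx = [min(int((v - lo) / width), bins - 1) for v in responses]
        counts = Counter(idx)
        for i, b in enumerate(idx):
            # Rare responses (small counts) contribute many bits.
            info[i] += -math.log2(counts[b] / n)
    return info
```

Because the likelihoods are estimated from the scene itself, the same response can be salient in one image and unremarkable in another, which is the sense in which the definition depends on context.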


Fixation data and code are also available. The code, written in MATLAB, is at www.cse.yorku.ca/~neil/infosaliency.tar and includes a variety of learned ICA bases. Note that the code expects a relatively low-resolution image because the receptive fields are small; for a larger, high-resolution image, you may wish to try larger receptive fields. If you have any questions about the code, feel free to ask. To use it within MATLAB, you should be able to do something along the lines of: info = AIM('21.jpg',0.5); where the second parameter is a rescaling factor. Note that all of these bases should give better performance than the *very* small 7x7 filters used in the original NIPS paper.


The eye tracking data may be found at www.cse.yorku.ca/~neil/eyetrackingdata.zip. In addition to the raw data, this includes a binary map for each image indicating which pixel locations were fixated.


Correspondence is best addressed to bruce@cs.umanitoba.ca



References


  1. Bruce, N.D.B., Saliency, Attention and Visual Search: An Information Theoretic Approach, PhD Thesis, Dept. of Computer Science and Engineering, York University, Canada, 2008.

  2. Bruce, N., Tsotsos, J.K., Spatiotemporal Saliency: Towards a Hierarchical Representation of Visual Saliency, 5th Int. Workshop on Attention in Cognitive Systems, Santorini Greece, May 12, 2008.

  3. Bruce, N., Loach, D., Tsotsos, J.K., Visual Correlates Of Fixation Selection: A Look At The Spatial Frequency Domain, IEEE Int. Conf. on Image Processing, San Antonio, Sept. 16-19, 2007.

  4. Bruce, N., Tsotsos, J.K., Saliency Based on Information Maximization, NIPS 2005, Vancouver, BC.

  5. Bruce, N., Tsotsos, J.K., Attention based on Information Maximization, Workshop From Computational Cognitive Neuroscience to Computer Vision, Bielefeld, March 21, 2007.

  6. Bruce, N., Tsotsos, J.K., An Information Theoretic Model of Saliency and Visual Search, L. Paletta and E. Rome (Eds.): WAPCV 2007, LNAI 4840, pp. 171–183, 2007.

  7. Tsotsos, J.K., Analyzing Vision at the Complexity Level, Behavioral and Brain Sciences 13(3), pp. 423-445, 1990.

  8. Tsotsos, J.K., Culhane, S., Wai, W., Lai, Y., Davis, N., Nuflo, F., Modeling visual attention via selective tuning, Artificial Intelligence 78(1-2), pp. 507-547, 1995.

  9. Tsotsos, J.K., Liu, Y., Martinez-Trujillo, J., Pomplun, M., Simine, E., Zhou, K., Attending to Visual Motion, Computer Vision and Image Understanding 100(1-2), pp. 3-40, Oct. 2005.

  10. Zaharescu, A., A Neurally-based Model of Active Visual Search, MSc Thesis, Dept. of Computer Science and Engineering, York University, Canada, July 2004.