Computational Foundations for Attentive Processes


Although attention is a human ability we all intuitively think we understand, the computational foundations for attentive processes in the brain or in computer systems are not quite as obvious. Notions such as capacity limits pervade the attention literature but remain vague. Through mathematical proofs, it is possible to derive the necessity of attentive processes, and through algorithmic approximations and processing optimizations it is possible to discover realizations given either biological or computational resource constraints. Perhaps the most important conclusion is that the brain is not solving the generic perception problem, and by extension, the generic cognition problem. Rather, the generic problem is re-shaped through approximations so that it becomes solvable by the amount of processing power available in the brain. 

The human cognitive ability to attend has been widely researched in cognitive and perceptual psychology, neurophysiology and in computational systems. Regardless of discipline, the core issue has been identified to be Information Reduction. Humans, and many other animals as well, are faced with immense amounts of information and are limited in their ability to process all of it by the size of their brains. It is not simply that there is too much information; the problem is that each component of each stimulus can be matched to many different objects and scenes in memory, resulting in a combinatorial explosion of potential interpretations, as is caricatured in Figure 1.



Figure 1.  The classic “Dalmatian sniffing at leaves” picture (due to Richard Gregory) is sufficiently complex as to activate an enormous number of possible interpretations. Each small piece of it is similar, to varying degrees, to many other possible objects and scenes. The combinatorial explosion of possibilities that results is what any system – brain or machine – must effectively deal with in order to successfully perceive and act on the world.


Capacity Limits and Computational Complexity
One of the most frustrating things about studying attention is that research is so often accompanied by vague discussions of capacity limits, bottlenecks, and resource limits. How can these notions be made more concrete?  The area of computer science known as Computational Complexity studies the cost of solving problems, in terms of time, memory and processing power, as a function of problem size. It thus provides the necessary theoretical foundation on which to base an answer to the capacity question.

With respect to neurobiology, many have considered complexity constraints in the past, but mostly with coarse counting arguments (for example, Feldman & Ballard 1982 or Uhr 1987). All roads lead to the same conclusion: the brain cannot fully process all stimuli in parallel in the observed response times. But this amounts to little more than restating that there is a capacity limit; it does not constrain a solution, and arguing in this manner brings us no closer to knowing what exactly the brain does to solve the problem. This paper takes the position that a more formal analysis of vision, at the appropriate level of abstraction, will help to reveal quantitative constraints on visual architectures and processing. First, however, it is important to address the applicability of this analysis to the neurobiology of the brain.


Human Perception/Cognition and Computation
Can cognition, intelligence, perception or vision be expressed in computational form? Surely, the discovery of explanatory mechanisms that yield observed performance is central to all experimental studies. In a very real sense, such explanations are presented as algorithms (step-by-step procedures for achieving some result), and algorithms are a central component of the language of computation.  But could the brain actually process signals in an algorithmic manner? This non-trivial issue is important because if it could be proved that human brain processes cannot be modeled computationally (and this is irrespective of current computer hardware), then modeling efforts are futile. The jury is still out on this point, but a large, perhaps overwhelming, body of support has accumulated, beginning with a critical theoretical notion: decidability. A proof of decidability is sufficient to guarantee that a problem can be modeled computationally (Davis, 1958, 1965). Decidability should thus not be confused with tractability. A tractable problem is one for which enough resources can be found and enough time can be allocated so that the problem can be solved in practice. An intractable problem may still be decidable; for an undecidable problem, no algorithm exists at all, so the question of tractability does not even arise. Intractable problems are those that have exponential complexity in space and/or time, that is, the mathematical function that relates processing time/space to the size of the input is exponential in that input size. If the input is large, then its role as an exponent leads to the need for enormous amounts of processing resources. There are several classes of such problems with differing characteristics; NP-Complete is one of those classes.

To show that vision is decidable, it must first be formulated as a Decision Problem. A problem can be formulated as a decision problem if what we wish to know is, for each element of a countably infinite set A, whether or not that element belongs to a certain set B which is a proper subset of A. Such a problem is decidable if there exists a Turing Machine which computes ‘yes’ or ‘no’ for each element of A in answer to the decision question. A Turing Machine is a hypothetical computing device consisting of a finite state control, a read-write head and a two-way infinite sequence of labelled tape squares.  A program provides input to the machine and is executed by the finite state control; the computations it specifies read and write symbols on the squares of the tape.

If we look at the perceptual task definitions provided by Macmillan and Creelman (2005), we see that all psychophysical judgments are of one stimulus relative to another - the basic process is comparison. The most basic task is discrimination, the ability to tell two stimuli apart. This certainly looks very much like the decision problem defined above. Macmillan and Creelman define recognition as a discrimination task in which a subject is required to say to which of two classes a given stimulus belongs.  Strongly related to this, the visual match problem defined by Tsotsos (1989) - does an image contain an instance of a given target? - is an instance of Yasuhara’s Comparing Turing Machine (1971) and thus decidable.
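To make the decision-problem form concrete, the following Python sketch casts recognition as a two-alternative decision in the spirit described above. The stimulus encoding (a list of numbers) and the classification rule (nearest prototype by summed absolute difference) are illustrative assumptions rather than Macmillan and Creelman's formulation; the point is only that the procedure returns a definite answer for every input.

    # Recognition as a two-class decision: every stimulus receives a definite answer.
    # The encoding and the nearest-prototype rule are illustrative assumptions only.
    def recognize(stimulus, prototype_a, prototype_b):
        """Return 'A' or 'B' for any stimulus: a decision procedure."""
        dist_a = sum(abs(s - p) for s, p in zip(stimulus, prototype_a))
        dist_b = sum(abs(s - p) for s, p in zip(stimulus, prototype_b))
        return 'A' if dist_a <= dist_b else 'B'

    print(recognize([0.9, 0.1, 0.8], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]))  # -> 'A'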

This, however, is not a proof that all of human vision can be modeled computationally. If no sub-problem of vision could be found to be decidable, it might be that perception as a whole is undecidable and thus cannot be computationally modeled. But what if some other vision sub-problems are undecidable? Even if some other aspect of vision is determined to be undecidable, this does not mean that all of vision is also undecidable or that other aspects of perception cannot be modeled computationally. Hilbert’s 10th problem in mathematics and the Halting Problem for Turing Machines are two examples of famous undecidable problems. The former does not imply that mathematics is impossible, and the latter does not mean that computers are. Most domains seem to feature both decidable and undecidable sub-problems, and these co-exist with no insurmountable difficulty.

Once a problem is known to be decidable, it may in fact be an instance of one of the many known NP-Complete problems, and proof methods exist for demonstrating this. Such a proof shows that, as far as is known, for any possible algorithm or implementation the time complexity function of a solution is exponential in the size of the input. The analysis is asymptotic; that is, the problem is intractable only as the input size becomes very large.  Small values of an exponent (i.e., small input sizes) are often quite realizable.

Some might think that since the brain’s neurons operate in parallel, parallelization of the computations makes the overall process tractable. NP-Complete problems are strongly conjectured not to be parallelizable, and showing that a problem is NP-Complete is very strong evidence that it cannot be efficiently parallelized in general. It is important to note that the problem may still be parallelizable on some of the possible inputs, and it might be that all sufficiently small instances are parallelizable. NP-Completeness is a sure sign that the problem may be hard, but it does not rule out efficient algorithms for the particular search problems the brain must actually compute. As a result we face two questions: can we find those algorithms, and is the brain perhaps not even solving the original problem? We approach these questions in the context of vision (since 40% or more of the cortex is devoted to vision, it is a reasonable starting point).
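A back-of-the-envelope calculation illustrates why parallel hardware alone does not rescue an exponential search: spreading 2^n candidate interpretations across k processing units still leaves on the order of 2^n / k steps, which remains exponential in n. The numbers in the sketch below (a hypothetical count of 10^10 parallel units) are arbitrary and chosen only to show the scale of the gap.

    # Dividing exponential work across many parallel units still leaves exponential work.
    # The unit count is a hypothetical figure for illustration, not a neural estimate.
    n_units = 10 ** 10
    for n in (40, 80, 160):
        serial_steps = 2 ** n
        parallel_steps = serial_steps // n_units
        print(f"n={n}: serial {serial_steps:.1e}, with {n_units:.0e} units {parallel_steps:.1e}")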


The Computational Complexity of Vision
What is the generic vision problem?  For our purposes we will use the following: Given a sequence of images, for each pixel determine whether it belongs to some particular object or other spatial construct, localize all objects in space, detect and localize all events in time, determine the identity of all the objects and events in the sequence, determine relationships among objects and relate all objects and events to the available world knowledge. This section will briefly summarize the steps of an argument that concludes that the generic vision problem as defined here is intractable. If a vision system must search through the subsets of all image locations (pixels), comparing each subset to each element of memory, then without any task guidance or knowledge of the characteristics of the subset it seeks, it cannot know which subsets are more likely than others. As a result, it is the powerset of all locations that gives the number of image subsets to examine: 2^P of them for P locations, an exponential function. Thus, purely feed-forward, unconstrained visual processing seems to have an inherent exponential nature.
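The counting argument can be made concrete with a small brute-force sketch. In the Python fragment below, purely data-directed matching enumerates the powerset of image locations and compares each subset to every stored prototype; the image representation (one measurement per location) and the similarity test (a simple threshold) are placeholder assumptions, chosen only to expose the 2^P outer loop.

    # Purely data-directed (unbounded) matching: the outer loop ranges over the
    # powerset of pixel locations (2^P - 1 non-empty subsets), the inner loop over
    # the N prototypes in memory. The similarity test itself is a placeholder.
    from itertools import chain, combinations

    def powerset(locations):
        return chain.from_iterable(
            combinations(locations, k) for k in range(1, len(locations) + 1))

    def unbounded_match(image, prototypes, tolerance=0.1):
        """image: dict location -> measurement; prototypes: list of such dicts."""
        for subset in powerset(list(image)):          # 2^P - 1 candidate groupings
            for proto in prototypes:                  # N stored prototypes
                if all(abs(image[loc] - proto.get(loc, 0.0)) <= tolerance
                       for loc in subset):
                    return subset, proto              # an interpretation was found
        return None

    # Even a tiny 5 x 5 image (P = 25) yields 2^25 - 1, over 33 million, subsets.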

Two abstract problems can be defined to make this more concrete. The important variables that play a role in the definition are: the number of image locations (P), the number of object/event prototypes in memory (N) and the number of measurements made at each image location (M). The first problem is Unbounded Visual Match. This was intended to model recognition where no task guidance to optimize search is permitted. It corresponds to recognition with all top-down connections in the visual processing hierarchy removed or disabled. In other words, this is pure data-directed vision. More precisely, it determines if there exists a subset of pixels in an image that matches a particular target. The second problem is Bounded Visual Match.  This is recognition with knowledge of a target and task in advance, such that the knowledge is used to optimize the process. The basic theorems, proved in (Tsotsos 1989, 1990a) are:

Theorem 1: Unbounded (Bottom-Up) Visual Matching (UVM) is NP-Complete, with time complexity an exponential function of P (the worst-case time complexity is O(N·2^(PM))).

Theorem 2: Bounded (Task-Directed) Visual Match (BVM) has time complexity linear in P (the worst-case time complexity is O(N·P·2^M)).

The results have important implications. They first tell us that the pure data-directed approach to vision (and in fact to perception in any sensory modality) is computationally intractable in the general case. They also tell us that bounded visual search takes time linearly dependent on the size of the input, something that has been observed in a huge number of experiments. Even small amounts of task guidance can turn an exponential problem into one with linear time complexity. 
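For contrast with the brute-force sketch above, a deliberately simplified sketch of task-directed matching follows. Because the target is given in advance, the search merely slides it across the P image locations, so the work grows linearly with P for a fixed target size. The array representation and the tolerance test are assumptions for illustration and omit the feature-selection details of the formal BVM problem.

    # Task-directed (bounded) matching: with the target known, search reduces to
    # checking roughly P candidate placements, i.e. time linear in the image size.
    import numpy as np

    def bounded_match(image: np.ndarray, target: np.ndarray, tol: float = 0.1) -> bool:
        ih, iw = image.shape
        th, tw = target.shape
        for r in range(ih - th + 1):                  # ~P candidate placements
            for c in range(iw - tw + 1):
                window = image[r:r + th, c:c + tw]
                if np.max(np.abs(window - target)) <= tol:
                    return True                       # target found: 'yes'
        return False                                  # 'no'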

However, perhaps biological vision systems are designed around average-case or best-case assumptions rather than the worst-case assumptions on which the above theorems are based; it may be that expected-case analysis more correctly reflects biological systems. Parodi et al. (1998) addressed this criticism by developing an algorithm for generating random instances of polyhedral scenes and examining the median-case complexity of labelling those scenes. The key results echo the worst-case results: the unbounded median-case time complexity is exponential in the number of polyhedral junctions, while the bounded median-case time complexity is linear in the number of junctions. As a result, the objection that worst-case analysis is not appropriate is defused. The key conclusion remains: the application of task knowledge guides the processing and converts an exponential complexity problem into a linear one. This is one of the main forms of attentive processing.


Complexity Constrains The Architecture for Visual Processing 
How can one deal with an intractable problem in practice? NP-Completeness eliminates the possibility of an algorithm that is at once fully general, optimal and efficient. Garey & Johnson (1979) provide a number of guidelines. The relevant guideline for this analysis is: use natural parameters to guide the search for approximation algorithms, algorithms that solve the problem but not optimally, providing a solution only to within some specified error tolerance. One way of doing this is to reduce the exponential effect of the largest valued parameters. The process was presented in (Tsotsos 1987, 1988, 1990a, 1990b, 1991).

The natural parameters of the computation for visual search are N, P and M, with worst-case time complexity O(N·2^(PM)). N is a large number, but any reduction in it leads only to linear improvement. P is also a large number, and a reduction in P can lead to exponential improvement, as can a reduction in M; M, however, is not so large a number. We can conclude that the best improvement would come from reductions in P. What would such changes look like? Here are several that are straightforward yet sufficient (but not necessary):
	
1. Hierarchical organization takes search of model space from O(N) to O(log_2 N).
2. Search within a pyramidal representation of the image (a layered representation, each layer with decreasing spatial resolution and with bidirectional connections between adjacent layers) operates in a top-down fashion, beginning with a smaller more abstract image, and is then refined locally thus reducing P. 
3. Spatio-temporally localized receptive fields reduce the number of possible receptive fields from 2^P to O(P^1.5) (this assumes contiguous receptive fields of all possible sizes centered at all locations in the image array). 
4. Spatial selectivity can further reduce the P1.5 term if one selects the receptive field that is to be processed. This is not a selection of location only, but rather a selection of a local region and its size at a particular location. 
5. Feature selectivity can further reduce the M term, that is, which subset of all possible features actually is represented in the image or is important for the task at hand.
6. Object selectivity can further reduce the N term, reflecting task-specific information.

After applying the first three constraints, O(P^1.5·2^M·log_2 N) is the worst-case time complexity.  The next three reduce P, N and M all to 1 to bring the expression down to perhaps its lowest value. These last three are all manifestations of attentive processing.
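The arithmetic behind these reductions can be illustrated with a short calculation. The parameter values below are hypothetical, chosen only to show the orders of magnitude involved; the expressions are the worst-case counts quoted above.

    # Worked arithmetic for the reductions above; P, M and N are illustrative
    # values, not physiological estimates.
    import math

    P, M, N = 64, 8, 10000     # image locations, measurements per location, prototypes

    unconstrained = N * 2 ** (P * M)                       # O(N * 2^(PM))
    after_first_three = P ** 1.5 * 2 ** M * math.log2(N)   # O(P^1.5 * 2^M * log_2 N)

    print(f"2^P location subsets:       {2 ** P:.2e}")
    print(f"O(P^1.5) receptive fields:  {int(P ** 1.5)}")
    print(f"unconstrained matching:     {unconstrained:.2e}")
    print(f"after constraints 1-3:      {after_first_three:.2e}")
    # Constraints 4-6 (attentional selection) then drive P, M and N toward 1,
    # leaving only a handful of comparisons.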

But how do these actions affect the generic perception problem described above? Hierarchical organization and logical map separation do not affect the nature of the vision problem. However, the other mechanisms have the following effects:
• Pyramidal abstraction affects the problem through the loss of location information and signal combination.
• Spatio-temporally localized receptive fields force the system to look at features across a receptive field instead of finer grain combinations and thus arbitrary combinations of locations are disallowed.
• Attentional selection further limits what is processed in the location, feature and object domains. The knowledge that task-related attentive processes employ to accomplish the selection can be as generic as statistical regularities.
As a result of these, the generic problem as defined earlier has been altered. Unfortunately it is not easy to formally characterize this altered problem. But it is clear that certain aspects of the problem are lost with these approximations.


Extending to Cognition and Action
Much of human cognition is action-based, and actions need sensory input for their guidance. The kind of visual search described above seems ubiquitous within intelligent behavior. Simple behavior may be conceptualized as:
• localize and recognize a stimulus (with or without a pre-specified target)
• link the stimulus to an applicable action
• decide among all possible actions
• generate actuator commands.
If sensory perception is an integral component of intelligence, then sensory search problems are integral sub-problems of intelligent behavior. 

The first step is exactly the UVM problem. The second step can be solved by table look-up, the table size being the number of possible stimuli times the number of possible behaviors, with table entries coding the strength of applicability, perhaps dependent on stimulus characteristics (thus, each stimulus may be associated with more than one behavior). The third step can be done by choosing the largest applicability strength from among all the entries, and the fourth would involve a function call or another table look-up.
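The table-based steps can be sketched in a few lines of Python. The stimuli, behaviors and numeric strengths below are invented placeholders; the sketch only shows the look-up and maximum-selection structure described above, with step 1 (localizing and recognizing the stimulus) assumed to have been done by the visual search process.

    # Steps 2-4 as table look-up and selection; all entries are invented placeholders.
    APPLICABILITY = {                       # (stimulus, behavior) -> applicability strength
        ("red_light", "brake"): 0.9,
        ("red_light", "honk"): 0.1,
        ("green_light", "accelerate"): 0.8,
    }
    MOTOR_COMMANDS = {                      # behavior -> actuator command
        "brake": "apply_brake_pressure",
        "honk": "sound_horn",
        "accelerate": "open_throttle",
    }

    def act(recognized_stimulus):
        # Step 2: look up all behaviors applicable to the recognized stimulus.
        candidates = {b: s for (stim, b), s in APPLICABILITY.items()
                      if stim == recognized_stimulus}
        if not candidates:
            return None
        # Step 3: decide by choosing the largest applicability strength.
        behavior = max(candidates, key=candidates.get)
        # Step 4: generate the actuator command by a final look-up.
        return MOTOR_COMMANDS[behavior]

    print(act("red_light"))                 # -> apply_brake_pressure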

The problems are formalized in Tsotsos (1995) with proofs for the following:

Theorem 3: Unbounded Stimulus-Behavior Search is NP-hard.

Theorem 4: Bounded Stimulus-Behavior Search has time complexity linear in the number of image locations. 

Extensions to time-varying input, active camera systems, and to the sensor planning problem in visual search have all been presented and mirror the above results (Tsotsos 1992, Ye & Tsotsos 1996). The generic, strictly data-directed versions of these problems all exhibit inherent exponential time complexity and the addition of attentive, task-specific guidance is sufficient to reduce this complexity to linear or better.


Conclusions
We have shown that visual search, and by extension any other intelligent behavior that requires sensory perception as an integral component, has in its generic form an exponential time complexity whose exponents are too large for it to be tractably solved in any realization. By re-shaping the problem – through optimizations and approximations – it becomes solvable with very low complexity. Those optimizations and approximations are all different kinds of attention. 


References 

Davis, M. (1958). Computability and Unsolvability, New York: McGraw-Hill.
Davis, M. (1965). The Undecidable, Hewlett, NY: Raven Press. 
Feldman, J., Ballard, D. (1982). Connectionist models and their properties, Cognitive Science 6, p. 205-254.
Garey, M., Johnson, D. (1979).  Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco: Freeman. 
Macmillan, N.A., Creelman, C.D. (2005). Detection Theory: A User's Guide, Routledge.
Parodi, P., Lancewicki, R., Vijh, A., Tsotsos, J.K. (1998). Empirically-Derived Estimates of the Complexity of Labelling Line Drawings of Polyhedral Scenes, Artificial Intelligence 105,  p. 47 - 75.
Tsotsos, J.K. (1987). A `Complexity Level' Analysis of Vision, Proceedings of Int. Conference on Computer Vision: Human and Machine Vision Workshop, London, England.
Tsotsos, J.K. (1988). A `complexity level' analysis of immediate vision, International Journal of Computer Vision 1-4, p. 303 - 320.
Tsotsos, J.K.  (1989). The Complexity of Perceptual Search Tasks, Proc. IJCAI, p. 1571 - 1577, Detroit.
Tsotsos, J.K. (1990a). A Complexity Level Analysis of Vision, Behavioral and Brain Sciences 13-3, p. 423 - 455.
Tsotsos, J.K. (1990b). A Little Complexity Analysis Goes a Long Way, Behavioral and Brain Sciences 13-3, p. 459 - 469.
Tsotsos, J.K. (1991). Is Complexity Analysis Appropriate for Analyzing Biological Systems?, Behavioral and Brain Sciences 14-4, p. 770-773.
Tsotsos, J.K. (1992). On the relative complexity of active vs passive visual search, International Journal of Computer Vision 7-2, p. 127 - 141.
Tsotsos, J.K. (1995). On Behaviorist Intelligence and the Scaling Problem,   Artificial Intelligence 75, p. 135 – 160.
Uhr, L. (1987). Highly Parallel, Hierarchical, Recognition Cone Perceptual Structures, in Parallel Computer Vision, ed. L. Uhr, Academic Press.
Yasuhara, A. (1971). Recursive Function Theory and Logic, New York: Academic Press.
Ye, Y., Tsotsos, J.K. (1996). 3D Sensor Planning: Its Formulation and Complexity, International Symposium on Artificial Intelligence and Mathematics, January 3-5, Fort Lauderdale, Florida.