All that recent media coverage of NASA's Mars rover demonstrated the progress that has been made in building what are known as "autonomous robots". Because it takes about 11 minutes for a signal to travel from Mars to Earth, it was not possible for an Earth-based engineer to control the exact path of the rover. It had to be programmed to make decisions for itself. The earthbound controllers used a computer mouse to tell the rover where to head: they examined the picture supplied by the camera on the mother craft and displayed on the computer screen, picked a likely target, and simply clicked on the appropriate spot. The rover then had to figure out for itself the exact route to follow. It used a laser vision system to scan ahead for obstacles, and its onboard computers tried to find a way around anything detected in its path. It was slow going, with a top speed of about half a meter a minute.
Future, more ambitious explorations of distant planets will surely require robots with far greater autonomy than the present generation. The greatest limitation is almost certainly our inability to equip a computer control system with anything like proper vision, as enjoyed by practically all the animal kingdom. After decades of effort to develop computer vision, the most obvious lesson we have learned is that it is a very hard challenge and we still have a long way to go. In fact, if by "computer" you mean a traditional digital computer, it might be totally impossible to endow it with anything like the capacity of sight possessed by humans and many other living creatures.
The problem is, seeing is not simply taking in visual information the way we take in, say, food. The scenes we see are in fact created by our minds.
Though it is fashionable to talk about the mind as a computer, a better description is that it is a pattern former/recognizer. Our brains do this so reliably and so systematically that it makes perfect sense to talk about the patterns created by our minds as if they were really "out there in the world". We notice our ability to create images on those occasions when people report seeing something that we know is really just an accident, such as the face of Jesus in the pattern of rocks on a snowy mountainside. But seeing patterns in the environment is a basic ability we use all the time.
The problem facing the computer scientist trying to equip a computer with vision is how to simulate this pattern creation/recognition ability using digital technology. Giving a robot "eyes" is no problem: simply fit the device with one or more cameras. Using two cameras and a bit of mathematics, you can ensure that the device has stereoscopic vision, giving it a sense of depth, as with the Mars rover. But what then?
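The "bit of mathematics" behind stereoscopic depth can be sketched in a few lines. This is a minimal illustration that assumes two identical cameras mounted side by side and pointing in the same direction (the idealized textbook setup); the function name and the numbers are hypothetical and are not taken from the rover's actual software. The idea is that a nearby object appears to shift more between the two pictures than a distant one does.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Distance to a point seen by two parallel cameras (idealized pinhole model).

    focal_length_px: camera focal length, measured in pixels
    baseline_m:      distance between the two cameras, in meters
    disparity_px:    how far the point shifts between the two pictures, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical numbers: 500-pixel focal length, cameras 0.1 m apart.
near = depth_from_disparity(500.0, 0.1, 25.0)  # large shift: the point is close
far = depth_from_disparity(500.0, 0.1, 5.0)    # small shift: the point is distant
print(near, far)
```

The hard part, which this sketch skips entirely, is deciding which pixel in the left picture corresponds to which pixel in the right one.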
The camera provides the computer with a massive array of numbers (pixels), representing the light intensity (and the wavelength, if the system is supposed to have color vision) at each point in the field of view. Using some fairly sophisticated mathematics, you can program the computer to pick out in that array things like straight lines and nicely shaped curves. But how do you decide what they represent?
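To make "picking out lines in an array of numbers" concrete, here is a deliberately crude sketch. It assumes a tiny grayscale picture stored as a list of rows and simply flags the places where brightness jumps sharply from one pixel to its right-hand neighbor. Real systems use far more sophisticated filters; the function name and the threshold are illustrative only.

```python
def edge_pixels(image, threshold):
    """Return (row, column) positions where brightness jumps by more than
    `threshold` between a pixel and its right-hand neighbor."""
    edges = []
    for r, row in enumerate(image):
        for c in range(len(row) - 1):
            if abs(row[c + 1] - row[c]) > threshold:
                edges.append((r, c))
    return edges


# A dark region (10) next to a bright region (200): a vertical boundary.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edges = edge_pixels(image, threshold=50)
print(edges)
```

Every flagged position falls in the same column, which is how a straight vertical line shows up in the numbers. Nothing in the output says whether that line is the edge of a rock, a shadow, or the horizon.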
The current approach to the problem of pattern recognition is to provide the computer with a large stock of known patterns and program it to try to find the best match. The difficulty is deciding what "best match" means. In what sense is one huge array of numbers "close" to another? That question lands squarely in the lap not of the computer scientists but of the mathematicians.
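A small experiment shows why the choice of "closeness" matters. The sketch below uses the most naive measure available, the sum of squared pixel-by-pixel differences, on made-up one-dimensional "pictures". It is an assumption chosen for illustration, not the measure any particular research group uses.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length pixel arrays."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(query, references):
    """Name of the reference pattern with the smallest squared distance to query."""
    return min(references, key=lambda name: squared_distance(query, references[name]))


references = {
    "bar": [0, 0, 9, 9, 0, 0, 0, 0],
    "blank": [0, 0, 0, 0, 0, 0, 0, 0],
}

# The same bar, moved two pixels to the right.
shifted_bar = [0, 0, 0, 0, 9, 9, 0, 0]
print(best_match(shifted_bar, references))
```

Pixel by pixel, the shifted bar is nearer to the blank picture than to the bar it obviously resembles, so the naive measure picks "blank". Any useful notion of distance has to be more forgiving about where things are.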
Some of the most promising work in this area of late has been carried out by mathematicians at Brown University in Rhode Island. One of their main projects has been to find ways to help radiologists interpret brain scans to diagnose illness. The computer is provided with a database of reference scans, and then tries to find the one closest to that of the patient. Brown mathematicians have developed a technique called brain warping that examines the changes that have to be made to the patient's scan to make it identical to one of the reference scans. The fewer the changes, the better the match. Since the computerized scans consist of massive arrays of numbers, this is a difficult challenge. The most successful approach found so far uses mathematical objects called Lie groups, which originated in nineteenth-century work on the symmetries of differential equations and later became a standard tool in physics.
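The idea of measuring a match by the amount of change required can be illustrated with a toy one-dimensional analogue. In this sketch the only "warp" allowed is sliding the whole scan left or right, which is vastly simpler than the deformations used on real brain scans; all names here are hypothetical and the code is not the Brown group's method.

```python
def shift(signal, k):
    """Slide a 1-D signal k pixels to the right (left if k is negative), padding with zeros."""
    n = len(signal)
    return [signal[i - k] if 0 <= i - k < n else 0 for i in range(n)]


def warp_cost(scan, reference, max_shift=3):
    """Size of the smallest shift that makes scan identical to reference,
    or None if no shift up to max_shift does."""
    for size in range(max_shift + 1):
        for k in ((0,) if size == 0 else (size, -size)):
            if shift(scan, k) == reference:
                return size
    return None


def closest_reference(scan, references, max_shift=3):
    """Name of the reference reachable with the least warping, or None."""
    costs = {name: warp_cost(scan, ref, max_shift) for name, ref in references.items()}
    reachable = {name: cost for name, cost in costs.items() if cost is not None}
    return min(reachable, key=reachable.get) if reachable else None


references = {
    "bar": [0, 0, 9, 9, 0, 0, 0, 0],
    "blank": [0, 0, 0, 0, 0, 0, 0, 0],
}
scan = [0, 0, 0, 0, 9, 9, 0, 0]
print(closest_reference(scan, references))
```

Here a two-pixel slide turns the scan into the "bar" reference, while no allowed slide turns it into the blank one, so "bar" is the match. Shifts of this kind form about the simplest example of a group of transformations, which gives a hint of why group theory enters the picture.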
One of the leaders in developing the mathematics of pattern recognition is the Fields Medal-winning mathematician David Mumford, who recently left his position at Harvard University to join the group at Brown. For Mumford, the challenge is to find mathematical ways to describe exactly what it is that we recognize when we identify a picture or a scene. For example, we can recognize the face of a friend we have not seen for many years, even though many pixel-scale details of the face have changed dramatically. (At the scale of a pixel-array, a wrinkle in the skin can be like the Grand Canyon.) What is more, we can recognize our friend from different angles, under different lighting conditions, and at different distances.
Mumford adopts a probabilistic approach, using techniques of statistical mechanics. The important thing about an image, he says, is not what you actually see but how it compares with what you expect to see. This leads to the probabilistic methods of Bayesian statistics -- the same kinds of reasoning that are used to determine guilt based on DNA evidence: What is the likelihood that the given pixel array is such-and-such an array in the database?
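The Bayesian reasoning described here can be sketched with a two-candidate example. The probabilities below are invented purely for illustration and the function is a generic statement of Bayes' rule, not Mumford's model. The prior captures what you expect to see; the likelihood captures how well each candidate explains the pixels actually observed.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: probability of each candidate given the observation.

    priors:      P(candidate) before looking at the pixels
    likelihoods: P(observed pixels | candidate)
    """
    unnormalized = {name: priors[name] * likelihoods[name] for name in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the observation is impossible under every candidate")
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical numbers: you are expecting your friend at the door (strong prior),
# but the blurry image fits a stranger's face somewhat better (likelihood).
priors = {"friend": 0.9, "stranger": 0.1}
likelihoods = {"friend": 0.2, "stranger": 0.6}
result = posterior(priors, likelihoods)
print(result)
```

With these made-up numbers the posterior still favors the friend by three to one: expectation outweighs a modestly better pixel fit, which is the point of comparing what you see with what you expect to see.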
A fundamental question that Mumford has to address is: What exactly do we mean by an image (as opposed to a pixel array)? He approaches this problem in much the same way that Euclid approached geometry: Try to formulate simple axioms from which everything else follows. Euclid's axioms allow us to say precisely what is a triangle, what is a circle, et cetera. Mumford's axioms are intended to give us a way to say what we mean by an image. (Not a particular image; rather the general concept of an image.)
Scale invariance and decomposability are two of Mumford's axioms for images. Scale invariance says that the probability of a particular match being the "right" one should not change when you alter the scale. (Your friend's face is the same however far away it is.) Decomposability says that any image can be split up into a collection of smaller images, each of which is somehow simpler or less cluttered than the original one, but still an image. (Your friend's face consists of eyes, nose, mouth, et cetera.)
A third of Mumford's axioms is his "blue sky hypothesis": Any image will likely contain regions with no objects in them. That is not to say that parts of the field of view are devoid of objects. Rather, when we view a scene, we concentrate on certain parts and aspects of it -- our image is selective. (We see our friend's face, but not what surrounds it.) According to Mumford, the blue sky regions are as important a part of an image -- part of what makes it an image (i.e., something created by our minds when our eyes gaze out at the world) -- as are the parts we identify and name.
Only time will tell how far work such as Mumford's will get us in the quest to explore other planets. Though the mathematics is impressive, it may well be that building robots that can make their way safely around completely alien territory is simply not achievable using mathematics and digital technology.
- Keith Devlin