Information is not knowledge. Those four words contain one of the most vexing paradoxes of the 21st century: We are accumulating data at exponential rates, but our understanding of what is being stored is creeping along. Yet somewhere in the stupefying volume of unanalyzed facts may lurk the next epochal discovery. In fact, one emerging frontier of discovery is called data exploration, for which researchers are developing techniques combining mathematics, information science, and software design to probe and extract patterns from vast collections of information.

One of the strangest and most promising methods seeks to uncover concealed patterns in large chunks of data by using topology, the mathematical study of shapes. Although researchers have long plotted their data onto graphs to identify significant connections between different variables, and 3-D bar charts are a dismally familiar feature of PowerPoint presentations everywhere, some patterns do not reveal themselves in two or three dimensions and become apparent only in higher dimensions—which ordinary mortals find impossible to visualize. (Just try to imagine the tesseract, or four-dimensional hypercube, which connects 16 corners instead of an ordinary cube’s eight.)
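
The corner count in that parenthetical follows from a simple rule: a hypercube in n dimensions has 2^n corners, one for each way of choosing 0 or 1 along every axis. As a minimal sketch (the function name `hypercube` is ours and purely illustrative), the following Python enumerates the corners and also counts the edges, which join corners that differ along exactly one axis:

```python
from itertools import product

def hypercube(n):
    """Return (corners, edges) of the unit hypercube in n dimensions."""
    # Every corner is a tuple of n coordinates, each either 0 or 1.
    corners = list(product((0, 1), repeat=n))
    # Two corners share an edge when they differ in exactly one coordinate.
    edges = [
        (a, b)
        for i, a in enumerate(corners)
        for b in corners[i + 1:]
        if sum(x != y for x, y in zip(a, b)) == 1
    ]
    return corners, edges

for n, name in [(3, "cube"), (4, "tesseract")]:
    corners, edges = hypercube(n)
    print(f"{name}: {len(corners)} corners, {len(edges)} edges")
# cube: 8 corners, 12 edges
# tesseract: 16 corners, 32 edges
```

The list of coordinates is easy to write down even though the shape itself cannot be pictured, which is the gap that computational approaches to high-dimensional data try to exploit.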

TOWER OF INFORMATION




Exabyte: 1,000,000,000,000,000,000 bytes

5 EB = All words ever spoken by humans

Petabyte: 1,000,000,000,000,000 bytes

200 PB = All material ever printed

Terabyte: 1,000,000,000,000 bytes

1 TB = 50,000 trees made into paper and printed

Gigabyte: 1,000,000,000 bytes

1 GB = Pickup truck full of books

Megabyte: 1,000,000 bytes

1 MB = Floppy disk

Kilobyte: 1,000 bytes

1 KB = Low-resolution photo

Topology provides mathematical ways of describing shapes in multiple dimensions. At Stanford University, a number of researchers, including Gunnar Carlsson, Vin de Silva, and Tigran Ishkhanov, are investigating ways to use high-dimensional shape detection to determine whether a data set contains hidden patterns.

“Algebraic topology has long existed on its own as a part of pure mathematics,” says Carlsson, a mathematician. “But about 15 years ago, I thought to myself: What is it about topology that’s attractive to people and might be useful?” He was particularly excited by a classic analysis of diabetes research in the 1970s. The data spanned 145 patients with diabetes and included five measurements for each patient—four metabolic and one weight related. Another way to look at it, Carlsson says, “is that each patient was represented in five dimensions.”

The researchers used a traditional statistical technique called projection pursuit to plot the data and depict it in three dimensions. When they did so, it became clear that there were two very distinct types of the disease, as suspected. Each cluster of patient data formed a separate shape in the data plot, and the two did not overlap. In 1979 the National Diabetes Data Group formally recognized the distinction between type 1 and type 2 diabetes.

“What we want to do now is automate that kind of recognition and make it precise—that is, computational,” Carlsson says. “And we want to expand it to high dimensions. This could be important for treating other kinds of diseases or revealing other sorts of important connections in large data sets.”

To test their system, Carlsson and his fellow mathematicians collaborate with neuroscientists who implant electrodes in the visual cortex of macaque monkeys and then monitor how the monkeys respond to different visual stimuli. Those data record which neurons fire at which times. But they don’t answer large questions: How is the brain processing the imagery? Does it process different patterns in different ways? And could you design an artificial system to simulate optical processing in the nervous system?

That’s where Carlsson’s team comes in, applying his multidimensional analysis to patterns in the data array. “We’re beginning to see how to represent that topologically in, say, 100 dimensions. And it turns out that, depending on the collection of visual stimuli the monkey sees, the data form characteristic shapes, such as a torus, which is like a doughnut or a loop.” Eventually, Carlsson says, this sort of analysis may help scientists understand how the brain works. “Neuroscience by itself does not know how complex families of images are encoded. Topological analysis could provide those critical insights.”
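
To give a rough feel for what detecting a shape in data can mean computationally, here is a minimal Python sketch. It is not the Stanford team's software, and the article does not say which construction they use. We assume a standard textbook one, the Vietoris-Rips complex: join any two points closer than a chosen scale, fill in every triangle whose three sides are present, and then count connected pieces (b0) and independent loops (b1) with linear algebra over the integers mod 2. The function names and the synthetic five-dimensional points are our own illustrative choices.

```python
import math
from itertools import combinations

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are stored as int bitmasks."""
    rows, rank = list(rows), 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit serves as the pivot column
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(points, scale):
    """Return (b0, b1) of the Vietoris-Rips complex built at the given scale."""
    n = len(points)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= scale]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]
    # Boundary of an edge = its two endpoints; of a triangle = its three edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[p] for p in combinations(t, 2)) for t in triangles]
    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return n - r1, len(edges) - r1 - r2

# Synthetic data (an assumption for illustration): 12 points on a circle,
# padded with zeros so that each point lives in five dimensions.
loop = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12),
         0.0, 0.0, 0.0) for k in range(12)]
# Two tight, well-separated groups of four points each, also in five dimensions.
clusters = ([(0.1 * k, 0.0, 0.0, 0.0, 0.0) for k in range(4)] +
            [(5 + 0.1 * k, 0.0, 0.0, 0.0, 0.0) for k in range(4)])

print(betti_numbers(loop, 0.6))      # (1, 1): one piece, one loop
print(betti_numbers(clusters, 0.6))  # (2, 0): two pieces, no loop
print(betti_numbers(loop, 2.1))      # (1, 0): at a coarse scale the loop fills in
```

The last line shows why the choice of scale matters: the same points look like a loop at one scale and like a solid blob at another. Practical topological data analysis typically tracks such features across a whole range of scales rather than relying on a single hand-picked value.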

The National Archives and Records Administration preserves all White House records and about 2% of all other federal records. It expects to receive 347 PB of electronic data by 2022.

 

Such insights might allow scientists to approach problems that are maddeningly difficult, such as designing computerized systems to make real-world visual sense out of the input from a couple of video cameras. “Suppose you need to build a robot whose job is to clean up in a bar,” Carlsson says. “The robot needs to distinguish between glasses, which go in the dishwasher, and bottles, which go into a recycling bin. But you don’t want to have to use a database of all possible glass and bottle shapes seen from all possible angles. You’d like qualitative cues that tell you the difference between a bottle and a glass.”

Topological analysis may do the job. “We don’t know yet exactly how sensitive the technique is. But the thing it has going for it is that it’s a single methodology that you can apply in a lot of situations.”

DIALOGUE

LET 2,000 PETABYTES BLOOM

DREW BADEN, a physicist at the University of Maryland, is helping design systems to extract, process, and store data from one of the greatest information-deciphering challenges in human history. The task is unfolding at the Large Hadron Collider, a giant particle accelerator under construction near Geneva. The data will come from an instrument called the Compact Muon Solenoid detector, or CMS, one of five such detectors set around the LHC’s 16.8-mile underground ring that will monitor the outcome of subatomic particle collisions occurring about 40 million times a second. The collisions are expected to create new, never-before-known particles and other phenomena.

Crunching these numbers will be a formidable task: The collider will have 7 times the energy of the world’s most powerful existing collider, the Tevatron at Fermilab in Illinois, and 16 times as many collisions per second. The detector team plans to process only one-fourth of 1 percent of the events. Even then, the data will add up to 2,000 petabytes (one petabyte is 1 million gigabytes) per year. After analyzing the data, the system will make a permanent record of the outcome from about one in every 1,000 collisions.

How much data is that? If the 29 million books and other printed materials in the Library of Congress were stored digitally, they would amount to about 15,000 gigabytes. So the detector will process about 130,000 Libraries of Congress every year and save about 130 of them. Nothing like that has ever been attempted.
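
The comparison can be checked with a few lines of arithmetic. This short Python sketch uses only the figures quoted above; the variable names are ours, and it assumes, as the comparison implies, that the one-in-1,000 retention rate applies to the volume of data processed.

```python
GB_PER_PB = 1_000_000              # one petabyte is 1 million gigabytes
processed_gb = 2_000 * GB_PER_PB   # 2,000 petabytes examined per year
library_gb = 15_000                # Library of Congress estimate quoted above
kept_fraction = 1 / 1_000          # about one collision in 1,000 is recorded

libraries_processed = processed_gb / library_gb
libraries_saved = libraries_processed * kept_fraction

print(round(libraries_processed))  # 133333, which the text rounds to 130,000
print(round(libraries_saved))      # 133, which the text rounds to 130
```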

What’s the biggest challenge in coping with the volume of data?

B: The 2,000 petabytes a year that we’ll examine flow on high-speed massively parallel networks. Different parts of the data from each event flow via different paths, and they must be reassembled correctly to be analyzed. This reassembly is crucial if we are to avoid turning CMS into a 12,500-ton, $1 billion random-number generator!

But only some of the data will reveal new physics. How do you find that tiny amount?

B: Since events from something new will be the very rarest, we will use statistical analysis to separate them from the huge numbers of events from known physics. So imagine trying to find a few particular words from a particular volume in a library 130 times larger than the Library of Congress. Our biggest challenge will be in applying techniques to catalog, index, and manage all of this data. We want to avoid the situation where we are drowning in data and thirsting for statistics.