Contrasting views of the Lagoon nebula. Top: Infrared observations from the Paranal Observatory in Chile cut through dust and gas to reveal a crisp view of baby stars within. Bottom: A similar view in visible light appears opaque.

ESO, VVV

For Kirk Borne, the information revolution began 11 years ago while he was working at NASA’s National Space Science Data Center in Greenbelt, Maryland. At a conference, another astronomer asked him if the center could archive a terabyte of data that had been collected from the MACHO sky survey, a project designed to study mysterious cosmic bodies that emit very little light or other radiation. Nowadays, plenty of desktop computers can store a terabyte on a hard drive. But when Borne ran the request up the flagpole, his boss almost choked. “That’s impossible!” he told Borne. “Don’t you realize that the entire data set NASA has collected over the past 45 years is one terabyte?”

“That’s when the lightbulb went off,” says Borne, who is now an associate professor of computational and data sciences at George Mason University. “That single experiment had produced as much data as the previous 15,000 experiments. I realized then that we needed to do something not only to make all that data available to scientists but also to enable scientific discovery from all that information.”

The tools of astronomy have changed drastically over just the past generation, and our picture of the universe has changed with them. Gone are the days of photographic plates that recorded the sky snapshot by painstaking snapshot. Today more than a dozen observatories on Earth and in space let researchers eyeball vast swaths of the universe in multiple wavelengths, from radio waves to gamma rays. And with the advent of digital detectors, computers have replaced darkrooms. These new capabilities provide a much more meaningful way to understand our place in the cosmos, but they have also unleashed a baffling torrent of data. Amazing discoveries might be in sight, yet hidden within all the information.

Since 2000, the $85 million Sloan Digital Sky Survey at the Apache Point Observatory in New Mexico has imaged more than one-third of the night sky, capturing information on more than 930,000 galaxies and 120,000 quasars. Computational analysis of Sloan’s prodigious data set has uncovered evidence of some of the earliest known astronomical objects, determined that most large galaxies harbor supermassive black holes, and even mapped out the three-dimensional structure of the local universe. “Before Sloan, individual researchers or small groups dominated astronomy,” says Robert Brunner, an astronomy professor at the University of Illinois at Urbana-Champaign. “You’d go to a telescope, get your data, and analyze it. Then Sloan came along, and suddenly there was this huge data set designed for one thing, but people were using it for all kinds of other interesting things. So you have this sea change in astronomy that allows people who aren’t affiliated with a project to ask entirely new questions.”

A new generation of sky surveys promises to catalog literally billions and billions of astronomical objects. Trouble is, there are not enough graduate students in the known universe to classify all of them. When the Large Synoptic Survey Telescope (LSST) in Cerro Pachón, Chile, aims its 3.2-billion-pixel digital camera (the world’s largest) at the night sky in 2019, it will capture an area 49 times as large as the moon in each 15-second exposure, 2,000 times a night. Those snapshots will be stitched together over a decade to eventually form a motion picture of half the visible sky. The LSST, producing 30 terabytes of data nightly, will become the centerpiece of what some experts have dubbed the age of petascale astronomy: that’s 10^15 bits (what Borne jokingly calls “a tonabytes”).
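To get a feel for those numbers, here is a rough back-of-the-envelope calculation in Python. It uses only the figures quoted above (30 terabytes a night, 2,000 exposures a night, a ten-year survey). The assumption that the telescope observes all 365 nights of every year is ours, not the LSST team's, and it makes the total an optimistic upper bound. The function names are purely illustrative.

```python
# Rough estimate of LSST data volume from the figures quoted in the article.
TB_PER_NIGHT = 30            # from the article
EXPOSURES_PER_NIGHT = 2000   # from the article
SURVEY_YEARS = 10            # "over a decade"
NIGHTS_PER_YEAR = 365        # ASSUMPTION: no nights lost to weather or maintenance


def survey_volume_tb(tb_per_night=TB_PER_NIGHT,
                     years=SURVEY_YEARS,
                     nights_per_year=NIGHTS_PER_YEAR):
    """Total terabytes collected over the whole survey."""
    return tb_per_night * years * nights_per_year


def gb_per_exposure(tb_per_night=TB_PER_NIGHT,
                    exposures=EXPOSURES_PER_NIGHT):
    """Average gigabytes per exposure, using decimal units (1 TB = 1000 GB)."""
    return tb_per_night * 1000 / exposures


if __name__ == "__main__":
    total_tb = survey_volume_tb()
    print(f"Survey total: {total_tb} TB (about {total_tb / 1000} PB)")
    print(f"Per exposure: {gb_per_exposure()} GB")
```

Under these assumptions the survey would pile up roughly 110 petabytes, about 15 gigabytes for every 15-second exposure, which is why "petascale" is the operative word.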

Mosaic view of the center of the Milky Way, composed from 1,200 images taken over the course of 200 hours by the Very Large Telescope in Cerro Paranal, Chile.

ESO/S.Guisard

The data deluge is already overwhelming astronomers, who in the past endured fierce competition to get just a little observing time at a major observatory. “For the first time in history, we cannot examine all our data,” says George Djorgovski, an astronomy professor and codirector of the Center for Advanced Computing Research at Caltech. “It’s not just data volume. It’s also the quality and complexity. A major sky survey might detect millions or even billions of objects, and for each object we might measure thousands of attributes in a thousand dimensions. You can get a data-mining package off the shelf, but if you want to deal with a billion data vectors in a thousand dimensions, you’re out of luck even if you own the world’s biggest supercomputer. The challenge is to develop a new scientific methodology for the 21st century.”

The backbone of that methodology is the data-crunching technique known as informatics. It has already transformed medicine, allowing biologists to sequence the DNA of thousands of organisms and look for genetic clues to health and disease. Astronomers hope informatics will do the same for them. The basic idea is to use computers to extract meaning from raw data too complex for the human brain to comprehend. Algorithms can scour terabytes of data in seconds, highlighting patterns and anomalies, visualizing key information, and even “learning” on the job.

In a sense, informatics merely enables astronomers to do what they have always done, just a lot more quickly and accurately. For example, data mining is useful for classifying and clustering information, two critical techniques in an astronomer’s tool kit. Is an object a star or a galaxy? If it is a galaxy, is it spiral or elliptical? If it is elliptical, is it round or flat? Not so many years ago, such questions were addressed by eyeballing photographic plates. Classification is not a big deal when you are working with hundreds of extrasolar planets or thousands of supernovas, but it becomes hugely complicated when you are trying to make sense of billions of objects.
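The chain of questions above (star or galaxy, spiral or elliptical, round or flat) maps naturally onto a cascade of automated checks. The Python sketch below is a toy illustration only. The three measurements it relies on (`is_extended`, `has_spiral_arms`, `axis_ratio`) and the 0.8 roundness cutoff are hypothetical stand-ins we made up for the example, and real survey pipelines typically learn such rules from data rather than hard-coding them.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative cutoff; an axis ratio at or above this counts as "round".
ROUND_THRESHOLD = 0.8


@dataclass
class Source:
    """A detected object with three hypothetical measurements."""
    is_extended: bool      # True if the image is wider than a point of light
    has_spiral_arms: bool  # True if spiral structure was detected
    axis_ratio: float      # minor axis / major axis; 1.0 is perfectly round


def classify(src: Source) -> str:
    """Walk the same question tree an astronomer would ask by eye."""
    if not src.is_extended:
        return "star"
    if src.has_spiral_arms:
        return "spiral galaxy"
    if src.axis_ratio >= ROUND_THRESHOLD:
        return "round elliptical galaxy"
    return "flat elliptical galaxy"


if __name__ == "__main__":
    catalog = [
        Source(is_extended=False, has_spiral_arms=False, axis_ratio=1.0),
        Source(is_extended=True, has_spiral_arms=True, axis_ratio=0.6),
        Source(is_extended=True, has_spiral_arms=False, axis_ratio=0.9),
        Source(is_extended=True, has_spiral_arms=False, axis_ratio=0.4),
    ]
    for src in catalog:
        print(classify(src))
```

A loop like this handles four objects or four billion with the same code; the hard part, as the astronomers quoted here point out, is choosing measurements and rules that hold up at that scale.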

Research scientist Matthew Graham of the Center for Advanced Computing Research at Caltech recalls trying to identify a few hundred quasars in 1996 for his doctoral thesis on large-scale structures in the distant universe. He did it the old-fashioned way—with pencil and paper and laborious trial and error. When the LSST is completed, it will be far simpler to assemble a data set of millions of quasars.