For Kirk Borne, the information revolution began 11 years ago while he was working at NASA’s National Space Science Data Center in Greenbelt, Maryland. At a conference, another astronomer asked him if the center could archive a terabyte of data that had been collected from the MACHO sky survey, a project designed to study mysterious cosmic bodies that emit very little light or other radiation. Nowadays, plenty of desktop computers can store a terabyte on a hard drive. But when Borne ran the request up the flagpole, his boss almost choked. “That’s impossible!” he told Borne. “Don’t you realize that the entire data set NASA has collected over the past 45 years is one terabyte?”
“That’s when the lightbulb went off,” says Borne, who is now an associate professor of computational and data sciences at George Mason University. “That single experiment had produced as much data as the previous 15,000 experiments. I realized then that we needed to do something not only to make all that data available to scientists but also to enable scientific discovery from all that information.”
The tools of astronomy have changed drastically over just the past generation, and our picture of the universe has changed with them. Gone are the days of photographic plates that recorded the sky snapshot by painstaking snapshot. Today more than a dozen observatories on Earth and in space let researchers eyeball vast swaths of the universe in multiple wavelengths, from radio waves to gamma rays. And with the advent of digital detectors, computers have replaced darkrooms. These new capabilities provide a much more meaningful way to understand our place in the cosmos, but they have also unleashed a baffling torrent of data. Amazing discoveries might be in sight, yet hidden within all the information.
Since 2000, the $85 million Sloan Digital Sky Survey at the Apache Point Observatory in New Mexico has imaged more than one-third of the night sky, capturing information on more than 930,000 galaxies and 120,000 quasars. Computational analysis of Sloan’s prodigious data set has uncovered evidence of some of the earliest known astronomical objects, determined that most large galaxies harbor supermassive black holes, and even mapped out the three-dimensional structure of the local universe. “Before Sloan, individual researchers or small groups dominated astronomy,” says Robert Brunner, an astronomy professor at the University of Illinois at Urbana-Champaign. “You’d go to a telescope, get your data, and analyze it. Then Sloan came along, and suddenly there was this huge data set designed for one thing, but people were using it for all kinds of other interesting things. So you have this sea change in astronomy that allows people who aren’t affiliated with a project to ask entirely new questions.”
A new generation of sky surveys promises to catalog literally billions and billions of astronomical objects. Trouble is, there are not enough graduate students in the known universe to classify all of them. When the Large Synoptic Survey Telescope (LSST) in Cerro Pachón, Chile, aims its 3.2-billion-pixel digital camera (the world's largest) at the night sky in 2019, it will capture an area 49 times as large as the moon in each 15-second exposure, 2,000 times a night. Those snapshots will be stitched together over a decade to eventually form a motion picture of half the visible sky. The LSST, producing 30 terabytes of data nightly, will become the centerpiece of what some experts have dubbed the age of petascale astronomy—that's 10¹⁵ bytes (what Borne jokingly calls "a tonabytes").
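Those numbers hang together, as a quick back-of-the-envelope calculation shows. The sketch below, in Python, uses only the figures quoted above plus one flagged assumption: a storage depth of roughly two bytes per pixel, which is typical of raw astronomical images rather than an LSST specification.

    # Back-of-the-envelope check of the LSST data rate, using the figures
    # quoted above. The bytes-per-pixel value is an assumption (raw CCD
    # images typically store ~2 bytes per pixel), not an LSST specification.
    pixels_per_exposure = 3.2e9      # 3.2-billion-pixel camera
    bytes_per_pixel = 2              # ASSUMPTION: ~16-bit raw pixels
    exposures_per_night = 2000       # 2,000 exposures a night

    raw_bytes_per_night = pixels_per_exposure * bytes_per_pixel * exposures_per_night
    print(f"Raw pixels alone: {raw_bytes_per_night / 1e12:.1f} TB per night")
    # -> ~12.8 TB of raw pixels; with calibration frames and processed data
    #    products, the quoted ~30 TB nightly total is the right order of magnitude.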
The data deluge is already overwhelming astronomers, who in the past endured fierce competition to get just a little observing time at a major observatory. “For the first time in history, we cannot examine all our data,” says George Djorgovski, an astronomy professor and codirector of the Center for Advanced Computing Research at Caltech. “It’s not just data volume. It’s also the quality and complexity. A major sky survey might detect millions or even billions of objects, and for each object we might measure thousands of attributes in a thousand dimensions. You can get a data-mining package off the shelf, but if you want to deal with a billion data vectors in a thousand dimensions, you’re out of luck even if you own the world’s biggest supercomputer. The challenge is to develop a new scientific methodology for the 21st century.”
The backbone of that methodology is the data-crunching technique known as informatics. It has already transformed medicine, allowing biologists to sequence the DNA of thousands of organisms and look for genetic clues to health and disease. Astronomers hope informatics will do the same for them. The basic idea is to use computers to extract meaning from raw data too complex for the human brain to comprehend. Algorithms can scour terabytes of data in seconds, highlighting patterns and anomalies, visualizing key information, and even “learning” on the job.
In a sense, informatics merely enables astronomers to do what they have always done, just a lot more quickly and accurately. For example, data mining is useful for classifying and clustering information, two critical techniques in an astronomer’s tool kit. Is an object a star or a galaxy? If it is a galaxy, is it spiral or elliptical? If it is elliptical, is it round or flat? Not so many years ago, such questions were addressed by eyeballing photographic plates. Classification is not a big deal when you are working with hundreds of extrasolar planets or thousands of supernovas, but it becomes hugely complicated when you are trying to make sense of billions of objects.
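To make the idea concrete, here is a minimal sketch of machine-learned classification, the sort of approach described above, using the scikit-learn library. Everything in it (the feature names, the data, the toy labeling rule) is a hypothetical stand-in for real survey measurements; no actual survey runs this exact pipeline.

    # Minimal sketch of survey-object classification with scikit-learn.
    # The features and labels are made up for illustration; a real pipeline
    # would use measured survey attributes for millions of objects.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(42)

    # Hypothetical training set: each row is (brightness, color, elongation)
    # for an object a human has already classified (0 = star, 1 = galaxy).
    X_train = rng.normal(size=(1000, 3))
    y_train = (X_train[:, 2] > 0).astype(int)  # toy rule standing in for real labels

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # Classify a batch of new, unlabeled "survey" objects in one call.
    X_survey = rng.normal(size=(5, 3))
    print(clf.predict(X_survey))        # predicted classes
    print(clf.predict_proba(X_survey))  # confidence in each class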
Research scientist Matthew Graham of the Center for Advanced Computing Research at Caltech recalls trying to identify a few hundred quasars in 1996 for his doctoral thesis on large-scale structures in the distant universe. He did it the old-fashioned way—with pencil and paper and laborious trial and error. When the LSST is completed, it will be far simpler to assemble a data set of millions of quasars.
Setting algorithms loose on larger samples not only makes it easier to recognize patterns but also speeds the identification of outliers. “These days, one in a million objects is a serendipitous discovery,” Graham says. “You just happened to have the telescope pointed at the right place at the right time.” This is often the case in the search for “high-redshift” quasars, extremely distant and luminous objects powered by supermassive black holes. Right now, finding them is largely a matter of luck. With computers powering through a billion objects, astronomers can search more methodically for such extreme quasars—or for any other type of unusual object. This approach is not only faster but more accurate. The ability to say with statistical certainty that something is out of the ordinary allows astronomers to focus on the exceptions that prove the rule.
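A hedged sketch of what such a methodical search can look like in code: score every object in a catalog and flag the statistically unusual ones for human follow-up. This version uses scikit-learn's IsolationForest, one standard anomaly detector among many, on synthetic data.

    # Sketch of a methodical outlier search: score every object and pull out
    # the statistically unusual ones, rather than waiting for a lucky pointing.
    # Synthetic data stand in for measured attributes (colors, variability, etc.).
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    ordinary = rng.normal(0.0, 1.0, size=(100_000, 4))   # the bulk population
    oddballs = rng.normal(6.0, 1.0, size=(10, 4))        # a few extreme objects
    catalog = np.vstack([ordinary, oddballs])

    detector = IsolationForest(contamination=1e-4, random_state=0)
    flags = detector.fit_predict(catalog)                # -1 marks an outlier

    candidates = np.where(flags == -1)[0]
    print(f"{len(candidates)} objects flagged for human follow-up")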
On the flip side, informatics is a remarkable tool for collecting statistics on the norm and using the tools of probability to figure out what the universe is like as a whole. For instance, astronomers have traditionally estimated the distances to remote galaxies using a spectrometer, which divides light from an object into its constituent wavelengths. But for every spectrum produced by Sloan, there were about 100 objects without spectra, only images. So Brunner put astroinformatics to work: He developed an algorithm that allows astronomers to estimate an object’s distance just by analyzing imagery, giving them a much bigger data set for studying the 3-D structure of the universe. “This will be really important with LSST,” he says, “because we won’t be able to get spectra for 99 percent of the objects.”
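The technique Brunner describes is known in the field as photometric redshift estimation. The sketch below illustrates the general idea on entirely synthetic data: treat the minority of objects with spectra as a training set, learn the mapping from image-derived colors to redshift, and apply it to everything else. It is a toy version of the concept, not Brunner's actual algorithm.

    # Sketch of photometric redshift estimation: learn the mapping from
    # image colors to redshift using the minority of objects that have
    # spectra, then apply it to the majority that have images only.
    # All data here are synthetic placeholders.
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)

    # Training set: colors measured from images + redshifts from spectra.
    colors_train = rng.uniform(0, 2, size=(5000, 4))
    z_train = colors_train.sum(axis=1) / 4 + rng.normal(0, 0.02, 5000)  # toy relation

    model = KNeighborsRegressor(n_neighbors=10)
    model.fit(colors_train, z_train)

    # The ~99 percent of objects with images only: estimate redshift from colors.
    colors_photometric = rng.uniform(0, 2, size=(3, 4))
    print(model.predict(colors_photometric))  # estimated redshifts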
The interdisciplinary marriage between computer science and astronomy has not been fully embraced by either family yet, but that is changing. Last May brought a watershed moment, the debut of the Virtual Astronomical Observatory. This international network, 10 years in the making, allows astronomers to use the Internet to assemble data from dozens of telescopes. Then, in June, Caltech hosted the first international conference on “astroinformatics.” Astronomers are used to working at the limits of human imagination, but even they have a hard time envisioning the kinds of insights they will be able to pull out of the bounteous new databases. “We’ve built the roads,” Djorgovski says. “Now we need some Ferraris to drive on them.”
ZOONIVERSE
Data to the People
In 2007, Oxford doctoral candidate Kevin Schawinski, exhausted from classifying 50,000 galaxies in one week, decided to solicit help from the robust community of amateur astronomers, using a technique known as crowdsourcing. The resulting project, Galaxy Zoo, allowed volunteers to classify images from the Sloan Digital Sky Survey on their home computers.
Within 24 hours of its debut, the site was generating 70,000 classifications an hour. An upgraded Galaxy Zoo 2, launched two years later, collected 60 million classifications from tens of thousands of users in 14 months. On the back end, a statistical process called “cleaning clicks” searched for and eliminated the inevitable bogus and mistaken classifications.
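The article does not spell out how "cleaning clicks" worked, but one simple version of the idea is a weighted consensus: estimate each volunteer's reliability from past agreement with the crowd, then take a weighted vote. The sketch below is a guess at that flavor of approach, run on made-up votes; the real Galaxy Zoo procedure was more sophisticated.

    # Hypothetical sketch of click-cleaning by weighted consensus. Each
    # volunteer gets a reliability weight based on how often their past
    # votes matched the crowd; a galaxy's class is the weighted majority.
    from collections import defaultdict

    # votes[galaxy_id] -> list of (volunteer_id, label) pairs (made-up data)
    votes = {
        "gal_1": [("ann", "spiral"), ("bob", "spiral"), ("spammer", "elliptical")],
        "gal_2": [("ann", "elliptical"), ("bob", "elliptical"), ("spammer", "spiral")],
    }
    reliability = {"ann": 0.95, "bob": 0.90, "spammer": 0.05}  # from past agreement

    def consensus(ballots):
        tally = defaultdict(float)
        for volunteer, label in ballots:
            tally[label] += reliability.get(volunteer, 0.5)  # unknown users get 0.5
        return max(tally, key=tally.get)

    for galaxy, ballots in votes.items():
        print(galaxy, "->", consensus(ballots))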
The interface was so intuitive that even Galaxy Zoo participant Matthew Graham's 6-year-old could grasp it. "She thought it was a game," he says. But Galaxy Zoo is much more than a toy. It has produced two dozen scientific papers and identified several previously unknown objects, most notably Hanny's Voorwerp, a peculiar intergalactic blob named after the Dutch schoolteacher who spotted it, and a class of hyperactive galaxies dubbed the Green Peas. "Nonexperts end up discovering weird things because they don't know not to ask, 'Hey, what's that over there in the corner?' " says Lucy Fortson, an associate professor at the University of Minnesota and project manager for the Citizen Science Alliance.
Galaxy Zoo has since morphed into the larger Zooniverse, which oversees more than 380,000 volunteers engaged in a variety of astronomical projects. Moon Zoo is attempting to count every crater on the moon. Its volunteers have so far classified more than 1.7 million images from NASA's Lunar Reconnaissance Orbiter. The Milky Way Project scours infrared data from the Spitzer Space Telescope for evidence of gas clouds: Participants use their computers to draw circles on cloud "bubbles" thought to result from shock waves stirred up by extremely bright young stars. Planet Hunters, meanwhile, puts citizen scientists to work analyzing readings from NASA's Kepler space telescope, designed to find Earth-like planets orbiting other stars. Equally if not more important, scientists are using the classifications made by Zooniverse participants to develop more accurate machine-learning algorithms so that computers will be able to do this kind of work in the future. See for yourself: zooniverse.org
Telescopes Without Borders
To learn as much as possible about distant objects, astronomers observe them with telescopes that “see” in various wavelengths. Unfortunately, the resulting data sets are archived in many locations all over the world, which makes them difficult to access; most are also inherently incompatible, so merging them requires a lot of painstaking labor. About 10 years ago, a group of astronomers started talking about creating a unified, global virtual observatory. Like the Internet, the virtual observatory is more a framework than a physical thing—a research environment linking data from a wide array of telescopes and archives and providing the tools to study them.
In the United States, an experimental version (the National Virtual Observatory) launched in 2002, but the lack of good data-analyzing tools made it difficult to use. “There was no science involved, just plumbing,” says Caltech astronomer George Djorgovski, a member of the virtual observatory’s science advisory council. “People who wanted to do science, myself included, got impatient and went to work on their own projects. No results to show, nobody wants to use it. Nobody wants to use it, no results to show.” The prospects for virtual astronomy improved dramatically last May when NASA and the National Science Foundation kicked in funding of $27.5 million over five years to finally bring the Virtual Astronomical Observatory (VAO) online and continue to develop tools for sharing data with astronomers worldwide.
The VAO will not produce breakthroughs on its own, but it will make them possible. Kirk Borne likens it to HTTP, the protocol used to surf the Internet: "The Internet changed the world. But HTTP made it possible." See for yourself: usvao.org
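In practice, tapping a virtual-observatory service from a desktop can look like the sketch below, which uses the community pyvo library to send a query in ADQL (the VO's standard SQL dialect) over TAP, one of the VO's standard protocols. The service URL and table name are placeholders, not a real endpoint.

    # Sketch of querying a virtual-observatory service with the pyvo library.
    # TAP (Table Access Protocol) is one of the standard VO protocols; the
    # service URL and table name below are placeholders, not real endpoints.
    import pyvo

    service = pyvo.dal.TAPService("https://example.org/tap")  # placeholder URL

    # ADQL, the VO's SQL dialect: grab bright objects in a small patch of sky.
    query = """
        SELECT ra, dec, mag
        FROM survey.objects
        WHERE mag < 18
          AND 1 = CONTAINS(POINT('ICRS', ra, dec),
                           CIRCLE('ICRS', 180.0, 0.0, 0.5))
    """
    results = service.search(query)
    print(results.to_table())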
Smile: The Universe in 1 Trillion Dazzling Pixels
Early this year astronomers with the Sloan Digital Sky Survey released the largest color image of the universe ever made, a trillion-pixel set of paired portraits that covers one-third of the night sky. It includes roughly a quarter of a billion galaxies and about the same number of stars within our home galaxy, the Milky Way. One of the portraits, a brownish image dubbed the "orange spider" by one team member, covers the Milky Way's southern hemisphere. Each point in the image represents multiple galaxies.
A dive into the image's densely packed imagery reveals astonishing detail. One callout highlights M33, the Triangulum Galaxy, which at 3 million light-years away is one of our closest galactic neighbors. Zooming in shows M33's spiral form. A further zoom brings into view green, spidery NGC 604, one of the largest nebulas in M33 and home to more than 200 newly formed stars. "Astronomers can use the data we drew on to create this image as a kind of guidepost," New York University astronomer Michael Blanton says. And so they are: In the first two weeks after the Sloan team made the map available online, researchers queried the data about 60,000 times.
SLOAN DIGITAL SKY SURVEY
Greatest Mapmaker in the Universe
The Sloan Digital Sky Survey (SDSS), launched in 2000, heralded the modern age of big-picture astronomy. For years, scientists who needed a global sense of what was out there relied on one dominant set of photographs—the Palomar Observatory Sky Survey—created in the 1950s. The Sloan Telescope (located at the Apache Point Observatory in New Mexico) retraced much of the Palomar Survey but replaced photographic plates with digital imagery that could be updated and analyzed electronically, anywhere. “Sloan was the single biggest player in converting people to embrace this approach,” says Caltech astronomer George Djorgovski. “Sky surveys became respectable not only because they brought in so much data but because the content of the data was so high that it enabled so many people to do science.”
Sloan scientists have made some spectacular discoveries. In 2000 the project's researchers spotted the most distant quasar ever observed. But independent astronomers have authored the vast majority of the 2,000-plus scientific papers based on SDSS; they simply use Sloan public data as the basis of their research. In one dramatic example, astronomers at Cambridge University discovered the "Field of Streams," a spray of stars stretching nearly one-quarter of the way across the sky. The streams appear to be shreds of small galaxies that were cannibalized by the Milky Way.
Data mining and other tools of informatics have been particularly helpful in extracting useful information from basic brightness measurements. Such data were thought to be of secondary importance when Sloan began but actually enabled astronomers to identify 100 times as many objects as expected. University of Illinois astronomer Robert Brunner is still reveling in the Sloan’s expanded view of the universe: “Our techniques allow us to start inquiring into the relationship between dark matter and supermassive black holes and how they influence galaxy formation and evolution.” See for yourself: sdss.org
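For readers who do want to see for themselves, one common route into the public Sloan data is the astroquery package's SDSS module, which accepts SQL against the survey's catalog tables. The query below is illustrative: PhotoObj is one of the survey's main photometric tables, and type code 3 marks objects photometrically classified as galaxies.

    # Sketch of pulling public Sloan data with astroquery's SDSS module.
    # The SQL is illustrative; PhotoObj is one of the survey's main
    # photometric catalog tables.
    from astroquery.sdss import SDSS

    sql = """
        SELECT TOP 10 ra, dec, u, g, r, i, z
        FROM PhotoObj
        WHERE type = 3      -- 3 = galaxy in SDSS photometric classification
          AND r BETWEEN 15 AND 17
    """
    table = SDSS.query_sql(sql)
    print(table)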
LARGE SYNOPTIC SURVEY TELESCOPE
Movie Camera to the Stars
The Large Synoptic Survey Telescope (LSST), being built atop Cerro Pachón in Chile, is a $450 million megaproject that will truly cement the relationship between astronomy and informatics. It is designed to probe dark energy and dark matter, take a thorough inventory of the solar system, map the Milky Way in unprecedented detail, and generally watch for anything that changes or moves in the sky.
Armed with an 8.4-meter (27-foot) optical telescope and a 3,200-megapixel camera—the world’s largest—the LSST will record as much data in a couple of nights as the Sloan Survey did in eight years. “For the first time, we’re going to have more astronomical objects cataloged in a coherent survey than there are people on Earth,” says Simon Krughoff, a member of the LSST data management team. (For those keeping score at home, experts project 20 billion objects.)
The numbers are so big and daunting that the LSST is the first astronomical project ever to formally incorporate informatics into its design architecture. “I made the case that we needed a group focused on data mining, machine learning, and visualization research to involve not just astronomers but also computer scientists and statisticians,” says Kirk Borne, who chairs the informatics and statistics team. The LSST will image the entire visible sky so rigorously that it will produce, in effect, a 10-year-long feature film of the universe. This should lead to tremendous advances in time-domain astronomy: studying fast-changing phenomena as they occur—black holes being born, supernovas exploding—as well as locating potentially Earth-threatening asteroids and mapping the little-understood population of objects orbiting out beyond Neptune. See for yourself: lsst.org