As his friends flocked to social networks like Facebook and MySpace, Alessandro Acquisti, an associate professor of information technology at Carnegie Mellon University, worried about the downside of all this online sharing. “The personal information is not particularly sensitive, but what happens when you combine those pieces together?” he asks. “You can come up with something that is much more sensitive than the individual pieces.”

Acquisti tested his idea in a study, reported earlier this year in Proceedings of the National Academy of Sciences. He took seemingly innocuous pieces of personal data that many people put online (birthplace and date of birth, both frequently posted on social networking sites) and combined them with information from the Death Master File, a public database from the U.S. Social Security Administration. With a little clever analysis, he found he could determine, in as few as 1,000 tries, someone’s Social Security number 8.5 percent of the time. Data thieves could easily do the same thing: They could keep hitting the log-on page of a bank account until they got one right, then go on a spending spree. With an automated program, making thousands of attempts is no trouble at all.

The problem, Acquisti found, is that the way Social Security numbers are assigned is predictable. Typically the first three digits of a Social Security number, the “area number,” are based on the zip code of the person’s birthplace; the next two, the “group number,” are assigned in a predetermined order within a particular area-number group; and the final four, the “serial number,” are assigned consecutively within each group number. When Acquisti plotted the birth information and corresponding Social Security numbers from the Death Master File on a graph, he found that the set of possible IDs that could be assigned to a person with a given date and place of birth fell within a restricted range, making it fairly simple to sift through all of the possibilities.
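
To make the arithmetic behind that "restricted range" concrete, here is a minimal Python sketch of the three-part structure described above. It is an illustration only, not Acquisti's method: the function names, the placeholder number, and the narrowed window sizes are all assumptions chosen to show the order of magnitude involved.

```python
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # first three digits ("area number")
    group: str   # next two digits ("group number")
    serial: str  # final four digits ("serial number")


def split_ssn(ssn: str) -> SSNParts:
    """Split a nine-digit number into its area, group, and serial parts."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected nine digits, with or without hyphens")
    return SSNParts(digits[:3], digits[3:5], digits[5:])


def candidate_count(areas: int, groups: int, serials: int) -> int:
    """Count the combinations left when each part is limited to a few options."""
    return areas * groups * serials


# Placeholder number, used only to show the split.
parts = split_ssn("123-45-6789")

# Upper bound with no knowledge at all: every nine-digit combination.
# (Not every combination is actually issued, so the real space is smaller.)
unconstrained = candidate_count(1000, 100, 10_000)

# Hypothetical narrowing: birthplace pins down one area number, birth date
# suggests two likely group numbers and a window of 500 serial numbers.
narrowed = candidate_count(1, 2, 500)

print(parts)
print(f"{unconstrained:,} candidates with no information")
print(f"{narrowed:,} candidates after narrowing")
```

With these made-up window sizes the search space drops from a billion combinations to a thousand, which is the scale at which an automated guessing program becomes practical.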

To check the accuracy of his guesses, Acquisti used a list of students who had posted their birth information on a social network and whose Social Security numbers were matched anonymously by the university they attended. His system worked, which is yet another reason why you should never use your Social Security number as a password for sensitive transactions.

Welcome to the unnerving world of data mining, the fine art (some might say black art) of extracting important or sensitive pieces from the growing cloud of information that surrounds almost all of us. Since data persist essentially forever online—just check out the Internet Archive Wayback Machine, the repository of almost everything that ever appeared on the Internet—some bit of seemingly harmless information that you post today could easily come back to haunt you years from now.

Fortunately, the main practitioners of data mining these days are not criminals. Interpreting the clustering of data is now a big business, a potent force in politics, and a powerful tool of government (although plenty of people may object to finding their data scrutinized by those folks, too). Data-driven targeting of potential voters played an enormous role in the election of Barack Obama; directed marketing has led to record growth for companies like 1-800-Flowers.

These activities are sure to increase as our data clouds expand. In the near future, face-recognition software will scrutinize online photos to identify “anonymous” individuals; software programs will secretly scan e-mail on government networks; implantable medical devices may even transmit your health data directly to your doctor. With all this information floating around, privacy advocates warn, it is inevitable that some of it, somehow, will wind up in the wrong hands—or at the very least in places where you did not intend it to go.

Before the rise of the Internet, we had safety in numbers: the daunting numbers of scattered, hard-to-access databases that contained our sensitive details. It took shoe leather to put all that information together. Governments and companies kept large personal and demographic databases, but there was no way to instantly shuttle the information from one place to another. That changed when scientists began linking computers to each other—primarily with the creation of the Arpanet, forerunner of the Internet—in the 1960s. As a result, information was no longer confined to individual computers. It could be transmitted to any computer connected to the network anywhere.