Devlin's Angle

February 2006

Cracking the Code

April is Mathematics Awareness Month for 2006. The theme this year is Mathematics and Internet Security. (See http://mathaware.org for details.) Accordingly, this month's column takes a look at a recent development in that field.
Chances are, you've never heard of Xiaoyun Wang. Unless you are a cryptographer, that is. In which case, the mere mention of her name is likely to have sent a shiver down your spine. For Wang is one of the world's most formidable code breakers. Over the past few years, she has exposed vulnerabilities in some of the most secure digital signature systems. Fortunately, she is one of the "good guys" in the never ending struggle to keep the Internet secure, approaching the problem as an academic research project (she is currently at Tsinghua University in Beijing), publishing her methods and results, and announcing her successes at international conferences on cryptography. But if she can crack supposedly secure password systems, so can someone else, perhaps someone with more sinister motives.

Anyone who uses a computer or a bank ATM is familiar with the need to enter a password. But what is to prevent a person with criminal intent from simply monitoring the electronic signal emanating from your computer or the ATM you are using and capturing both your login name and your password and then using them to gain illegal access to your account? The answer is that the message that actually gets sent is scrambled. The most common way of scrambling identification is by means of what is called a "hash function". Servers generally do not store your raw password, rather its scrambled version, or "hashed value". To determine whether a login attempt is legitimate, the receiving server compares the incoming hashed login information with the hashed version of your login details stored in its database.

To make this system work in practice, the hashing function, H, has to have two fairly obvious properties:

  1. For any input string x, it should be easy to compute H(x).
  2. Given any hash value y, it should be computationally infeasible to find an x such that H(x) = y.
("Computationally infeasible" means it would take the fastest computers more than (say) a human lifetime to carry out the procedure to completion.)

By requirement 2, even if a hacker gained access to the stored login information, he or she would not be able to obtain your password (though without additional controls they would of course be able to access your account on that machine, since it's the hashed version that the receiving server uses for authorization.)

In practice, the folks who design hash functions usually demand an additional uniformity feature that facilitates efficient storage of the hashed values of identification information and makes for a faster and easier database-lookup procedure to determine identity:

  1. All values produced by H have the same bit- length.
Because of this third condition, in theory there will be many different input strings that produce the same output; in the parlance of the hashing community, there will be "collisions", distinct input strings x and y such that H(x) = H(y). Because access to secure sites is determined (at the site) by examining the incoming hashed login data, one possible weakness of the system is that illegal access to an account does not require that the intruder obtains the account holder's login identity and password; it is sufficient to find some input data that generates the same hashed value, i.e. to find an input that collides with the legitimate data. In designing an algorithm for a hash function, it is therefore clearly important to make sure that this is extremely unlikely to occur. That gives a fourth requirement:
  1. It is a practical impossibility (i.e., it is computationally infeasible) to find a string y that collides with a given string x, i.e., for which H(x) = H(y).
Typically, hash functions work by combining (in some systematic way) the bits of the input string (eg. your login details) with other bits chosen at random, and performing some complex, iterative distillation process that reduces the resulting string down to one of a fixed length (predetermined for the system).

There are dozens of different hash functions in use. The two most widely used ones are MD5 ("Message Digest algorithm 5"), developed by Ronald Rivest at MIT in 1991 as one of a series of hash algorithms he designed, and SHA-1 ("Secure Hash Algorithm 1") developed by the National Security Agency in 1995. (Rivest is well known as one of the three inventors of the widely used RSA cryptographic system - described on the 2006 Mathematics Awareness Month website at http://mathaware.org.)

MD5 produces a hash value of 128 bits, and it would take on average 2^64 guesses to find a collision. SHA-1 generates a hash string of length 160 bits, and it would require an average of 2^80 guesses to find a collision. In theory, both methods would seem to offer a high degree of security - provided that the only feasible way to find a collision is by trial-and- error.

Enter Wang, who seems to exhibit a powerful combination of computer programming skill and a brilliant pattern recognition ability. At the Crypto 04 conference in Santa Barbara, California, Wang astonished the attendants with her announcement that she had found a way to find a collision for MD5 in just 2^37 inputs, a staggering reduction in problem size that made MD5 highly vulnerable.

Wang's approach was to input to the algorithm two strings that differ by just a few bits and look closely at what happens to them, step-by- step, as the algorithm operates on them. This led her to develop a "feel" for the kinds of strings that will result in a collision, allowing her to gradually narrow down the possibilities, resulting eventually in her developing a procedure to generate a collision. Others working in the field remark that her ability to intuit which of the many possible paths to follow, coupled with her tenacity, is remarkable. Commenting to the magazine New Scientist, which covered the story in its 17 December, 2005 issue, Charanjit Jutia, a cryptographer at IBM's Watson Research Center in Yorktown Heights, New York, described the challenge of cracking a hash function like SHA-1 as being "like a giant puzzle." Referring to Wang, he added, "Most people get tired and give up. She did not"

In 1997, Wang used her method to successfully find collisions for SHA-0 (SHA-1's predecessor) in 2^58 computations instead of the believed minimum of 2^80. Then, working with colleagues Dengguo Feng, Xuejia Lai, and Hongbo Yu, she developed and a variation of the same approach with MD5 that she used successfully to crack the system.

Following the announcement at Crypto 04, Wang and Yu teamed up with Yiqun Lisa Yin, now an independent security consultant based in Greenwich, Connecticut, and started work on the crown jewel of current hash functions, SHA-1. This proved a much harder nut to crack, but to the general dismay (and admiration) of the computer security community, at the annual RSA security conference in San Francisco in February last year, they were able to announce that they had developed an algorithm that could generate two SHA-1 colliding files in just 2^69 steps.

Unlike MD5, Wang and her colleagues have not (yet) cracked SHA-1, they have just produced a method that could crack it in far fewer steps than was previously believed possible. That number 2^69 is still sufficiently high to provide some degree of confidence in the system's security - for now. So too is the even lower number of 2^63 steps that Wang and other collaborators managed to achieve in the months following the February 2005 announcement. But many in the cryptographic community now believe that the writing is on the wall, and that, as a result of Wang's work, advances in computing speed and power will rapidly render useless all the hashing algorithms currently in use. Not today - the experts assure us that our ATM transactions are secure for now. But soon. Commenting on the development to New Scientist, Burt Kaliski, the head of RSA Laboratories in Bedford, Massachusetts, declared "This is a crisis for the research community." Mark Zimmerman, a cryptographer with ICSA Labs in Mechanicsburg, Pennsylvania, put it rather more colorfully. "It's not Armageddon," he said, "but it's a good kick in the pants."

I will just add that, what I think may be of particular interest to readers of MAA Online is that the entire story shows that, even in the Moore's Law driven world of massively large computations, the remarkable pattern-recognition capacity and "numerical intuitions" of the human brain, can still play a major - indeed devastating - role. Especially when combined with another valuable human trait: tenacity.


For further details on this story, see the article "Busted! A Crisis in Cryptography," by Celeste Biever, in the 17 December, 2005 issue of New Scientist, available online to magazine subscribers at http://www.newscientist.com/home.ns. See also the highly informative account by the leading security consultant Bruce Schneier online at http://www.schneier.com/blog/archives/2005/02/ sha1_broken.html.


Devlin's Angle is updated at the beginning of each month.
Mathematician Keith Devlin (email: [email protected]) is the Executive Director of the Center for the Study of Language and Information at Stanford University and The Math Guy on NPR's Weekend Edition. Devlin's newest book, THE MATH INSTINCT: Why You're a Mathematical Genius (along with Lobsters, Birds, Cats, and Dogs) was published recently by Thunder's Mouth Press.