My topic for one class was the use of DNA profiling in law enforcement and criminal court cases. I explained to the class how the accuracy of DNA profiling as a means to identify people (usually crime suspects, who as a result of the identification become criminal defendants) depends crucially on a key statistic: the probability that a DNA profile match between (say) a defendant and a DNA sample taken from a crime scene was a result of an accidental match, rather than because the crime-scene sample came from the defendant. I then displayed on the screen a probability figure that prosecutors typically give in court:
1/15,000,000,000,000,000 or 1 in 15 quadrillion.
The class spontaneously erupted in laughter. Not just one or two students, but, as far as I could tell, all of them. For what this class of 30 or so experienced and numerically sophisticated scientists, technologists, engineers, and others realized at once was that such a figure is total nonsense. Nothing in life ever comes remotely close to such a degree of accuracy. In most professions where numerical precision is important, including laboratory science, 1 in 10,000 is often difficult to achieve.
I told the class that I was not trying to pull a fast one on them, and that I had read sworn testimonies to courts from scientists for the prosecution in serious criminal cases who insisted that such figures were both accurate and reliable. In the class discussion that followed, the general consensus seemed to be that anyone who makes such a statement in court is either numerically naive or else is deliberately trying to deceive a (possibly numerically unsophisticated) judge and jury.
Most regular readers will by now have a pretty good idea where a figure such as 1 in 15 quadrillion will have come from, and what it actually represents.
(Readers who saw last month's column will also have realized that I am returning to the same topic I discussed then, namely DNA profiling. To avoid repetition, I will assume from here on that anyone reading this column has (recently) read last month's offering, and I will assume familiarity with everything I discussed there.)
A DNA profile is essentially a read-out of an individual's genetic material taken at a number of distinct loci, chosen because scientists believe that they exhibit considerable variation among the population. The variation at a chosen locus will typically be such that the probability of two randomly chosen individuals having the same read-out at that locus is 1 in 10 or less.
If the read-outs at different loci are independent (in the usual sense of probability theory), then you can employ the product rule to compute the probability that two randomly selected individuals have profiles that match on a given set of loci, say 5 loci, 10 loci, or 13 loci, the number of loci that is currently used by the FBI in their CODIS database of 3,000,000 or so profiles from convicted individuals.
Taking the 1/10 figure I used above as an example for the probability of a random match at a single locus, using the product rule gives a figure of
1 in 10,000,000,000,000 or 1 in 10 trillion for the probability of a match on 13 loci, a figure which, my conservative 1-in-10 example statistic notwithstanding, is still laughably well beyond the accuracy of human science and technology.
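For readers who like to check the arithmetic, here is a minimal sketch of that product rule calculation. It assumes, purely for illustration, the same 1-in-10 match probability at every locus and full independence across loci; the function name is my own invention, not anything used in forensic practice.

```python
from fractions import Fraction


def product_rule_match_probability(per_locus_probability, num_loci):
    """Probability that all loci match, ASSUMING the loci are independent
    and share the same single-locus match probability (both assumptions
    are illustrative only)."""
    return per_locus_probability ** num_loci


# Illustrative 1-in-10 single-locus figure taken from the text.
p_single = Fraction(1, 10)

for loci in (5, 10, 13):
    p = product_rule_match_probability(p_single, loci)
    print(f"{loci} loci: 1 in {p.denominator:,}")
```

Using exact fractions rather than floating point keeps the denominators exact, which makes it easy to see how quickly each extra locus multiplies the figure by ten.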
As often happens when the computation of probabilities is concerned, such unsophisticated use of the product rule rapidly takes theory well beyond the bounds of reality. I would not state such a ludicrous figure in a court of law, and nor, I am fairly sure, would any of my Stanford adult education class. And personally, as a mathematician, I find it a disgraceful state of affairs that the courts allow it. They may as well admit alchemy and astrology.
A figure such as 1 in 10 trillion is so far off the reliability scale of science and technology that it says nothing about the likelihood that an innocent individual is in court accused of a crime. It answers one question and one question only. Namely, the theoretical mathematical question, "If the probability of a single event is 1/10, and you run 13 independent trials, what is the probability that the event occurs in all 13 trials?"
So what figure should be presented in court? Actually, let me rephrase that. After my previous column appeared, I received a number of emails from lawyers, and some of them supported the view of the Supreme Court of California that the courts themselves should decide that matter - a state of affairs that I would be happy to concur with if they would seek the advice of professional statisticians in doing so, something that at present they seem reluctant to do. So, I'll rephrase my question as, what figure should be presented in court if the aim is to provide the judge and jury with the best numerical measure of the actual probability that the defendant's DNA profile match is accidental?
As I explained last month, that is a difficult question to answer, and the answer you get depends in significant part on how the defendant was first identified as a suspect. But my focus this month is not on the initial identification procedure, but on another matter that worries me, which I merely touched upon in passing last month. To wit:
Since naive application of the product rule leads to mumbo jumbo answers, just how do you calculate the probability of a random match on a specified set of loci?
Not by mathematics, or at least not by mathematics alone as is presently the case, that's for sure. (I believe that one of the duties of a professional mathematician is to stand up and say when our powerful toolbox is not appropriate, and this in my view is one of those moments.)
If mathematics were to be used to compute a meaningful and reliable random match probability, then the first thing that would need to be done is to look very, very, very closely at that assumption of independence across the loci.
As far as I can tell (and remember that this is way outside my domain of expertise), very little is known about this. This is particularly worrying because, given the way the product rule works (and now I'm back in my domain), in particular the speed with which it starts to produce absurdly large answers, in order to compute a reliable profile match probability starting with match probabilities at individual loci, you would need extremely accurate (I would guess unachievably accurate) numerical information on the degrees of dependence.
So what should be done? To me, the answer is obvious. Instead of using mathematics, determine the various random match probabilities empirically.
As far as I am aware, to date there has been only one attempt to do this, and the results obtained were both startling and worrying. A study of the Arizona CODIS database carried out in 2005 showed that approximately 1 in every 228 profiles in the database matched another profile in the database at nine or more loci, that approximately 1 in every 1,489 profiles matched at 10 loci, 1 in 16,374 profiles matched at 11 loci, and 1 in 32,747 matched at 12 loci.
How big a population does it take to produce so many matches that appear to contradict so dramatically the astronomical, theoretical figures given by the naive application of the product rule? The Arizona database contained at the time a mere 65,493 entries. Scary isn't it?
It is not much of a leap to estimate that the FBI's national CODIS database of 3,000,000 entries will contain not just one but several pairs that match on all 13 loci, contrary (and how!) to the prediction made by proponents of the currently much touted RMP that you can expect a single match only when you have on the order of 15 quadrillion profiles.
Of course, to produce reliable data that will serve the cause of justice, such a study would have to be done with care. For example, it would be important to eliminate (or at least flag) cases where the same individual has two or more profiles in the same database, listed under multiple identities - something that one can imagine happening in a database of convicted criminals. It would also be important to flag close relatives, who have a greatly increased likelihood of sharing large sections of DNA. When that is done, it may turn out that, even when stripped of those ludicrously astronomical numbers, DNA profiling as currently practiced remains a highly accurate means of identification with low probability of leading to a false conviction. (I sure hope so.)
But, given the extremely high stakes, and the amount of court time currently taken up arguing about numbers that scientists can (and should) determine accurately, once and for all, it seems to me that such a study needs to be carried out as a matter of some urgency.
With 3 million entries, the FBI CODIS database would seem to be easily large enough to yield statistically reliable results. In fact, given that any competent computer science undergraduate could probably write a program to perform such a study in rather less than a single afternoon, with the results available after a few minutes (seconds?) of computing time, I am amazed that this has not been done already.
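To give a flavor of what such a program might look like, here is a minimal sketch. The data layout is entirely hypothetical: I assume each profile is simply a tuple of 13 read-outs, one per locus, which is a simplification of how real profiles are recorded. The function names and the toy data are mine. Note also that this sketch counts matching pairs of profiles, whereas the Arizona figures quoted above are stated per profile.

```python
from collections import Counter
from itertools import combinations


def count_matching_loci(profile_a, profile_b):
    """Number of loci at which two profiles have the same read-out."""
    return sum(1 for a, b in zip(profile_a, profile_b) if a == b)


def tally_pairwise_matches(profiles, min_loci=9):
    """Compare every pair of profiles and tally how many pairs match
    at exactly k loci, for each k >= min_loci."""
    tally = Counter()
    for profile_a, profile_b in combinations(profiles, 2):
        matched = count_matching_loci(profile_a, profile_b)
        if matched >= min_loci:
            tally[matched] += 1
    return tally


# Toy, made-up data: four hypothetical 13-locus profiles.
toy_profiles = [
    ("a",) * 13,
    ("a",) * 12 + ("b",),
    ("a",) * 9 + ("c",) * 4,
    ("z",) * 13,
]

toy_tally = tally_pairwise_matches(toy_profiles)
for loci_matched in sorted(toy_tally, reverse=True):
    print(f"{toy_tally[loci_matched]} pair(s) match at {loci_matched} loci")
```

This brute-force version compares every pair, so the work grows with the square of the database size; for a database of millions of profiles a cleverer indexing scheme would likely be needed, and the duplicate-identity and close-relative flags discussed above would have to be added before the tallies meant anything.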