In his recent post, my colleague Greg Muscarella noted that predictive coding and advances in review technologies are all the rage within the eDiscovery community. While recent case law has been the primary focus of many of our earlier blog posts, I want to explore one overlooked area – the challenges in measuring and communicating the accuracy of your predictive coding machine’s coding decisions.
Predictive coding usually involves a series of steps leading to the production of a final set of responsive documents. These steps include identifying a small set of training documents, having human reviewers code that subset to train the machine, and then evaluating the system. There are several variations of training algorithms used in predictive coding, and vendors frequently tout their approach as superior to the others. Regardless of the methodology, one fact remains the same – there is a need to measure and communicate the accuracy of the outcome clearly and cost effectively.
An analogy from the automotive industry is instructive. Every car manufacturer may have different parts and design specifications for manufacturing cars, but there are a few well-defined standards for measuring a car’s quality. For example, J.D. Power and Associates has standards for measuring and rating automobile quality. These standards allow customers to compare the quality of cars from multiple manufacturers without requiring the customer to learn how each manufacturer built the car. Similarly, in eDiscovery, software providers often devise their own unique predictive coding algorithms and methodologies for training their systems and implementing a prediction workflow. However, to be successful, the industry must develop a common standard for measuring the accuracy of predictions. For example, if workflow A claims to have produced 50K responsive documents and workflow B produces 60K responsive documents, there should be a common and objective standard for determining which workflow performed better.
The metrics for measuring the effectiveness of predictive coding are well established. Arguably, the most important measurement in eDiscovery is recall, which assesses the risk of missing responsive documents – i.e., false negatives. To balance this, precision assesses the cost of reviewing false positives. A combined measure, the F-measure, is often used to report a balance between the two. However, to make these measurements, one needs to be able to compare the machine’s predictions against a human review. Human review of the entire corpus is typically not a viable option due to time and cost constraints; hence, we resort to human review of a small document sample, which can be compared to the computer’s predictions. This set of documents is often called a control set.
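All three metrics can be computed directly from a control-set comparison. Here is a minimal Python sketch; the document counts are hypothetical, chosen only to illustrate the arithmetic:

```python
def accuracy_metrics(true_positives, false_positives, false_negatives):
    """Compute recall, precision, and F-measure (F1) from a
    comparison of machine predictions against a human-coded control set."""
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Hypothetical control set: 100 truly responsive documents, of which the
# machine found 80, while also flagging 20 non-responsive documents.
r, p, f = accuracy_metrics(true_positives=80, false_positives=20, false_negatives=20)
print(f"recall={r:.2f}, precision={p:.2f}, F1={f:.2f}")  # recall=0.80, precision=0.80, F1=0.80
```

Note that recall and precision pull in opposite directions: reviewing the machine's borderline predictions more permissively raises recall but lowers precision, which is why the balanced F-measure is commonly reported alongside both.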
There are many ways to generate a control set, but the most common method is to select a random sample. Using sampling to measure the quality of predictions is based on the same theory used in quality control testing. Sampling theory and the statistical measures related to sampling have been studied extensively across a range of industries. For example, the product manufacturing industry utilizes acceptance sampling to determine whether to accept or reject shipments of manufactured goods. Rather than testing 100% of a batch of units, a random sample of the batch is tested, and a decision about the entire batch is reached from the sample test results. Acceptance sampling was originally developed during World War II to test bullets. The purpose was to eliminate blanks so fewer soldiers would get killed in battle. Although the importance of proper sampling is obvious, the process is often confusing, which can lead to critical errors and serious legal ramifications in the context of eDiscovery.
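A single-sampling acceptance plan can be sketched in a few lines of Python. The batch composition, sample size, and acceptance number below are hypothetical, chosen for illustration rather than taken from any published sampling plan:

```python
import random

def accept_batch(batch, sample_size, acceptance_number):
    """Single-sampling plan: inspect a random sample from the batch and
    accept the whole batch only if the number of defectives found in the
    sample does not exceed the plan's acceptance number."""
    sample = random.sample(batch, sample_size)
    defects = sum(sample)  # each unit is 1 (defective) or 0 (good)
    return defects <= acceptance_number

# Hypothetical batch of 10,000 units, 2% defective; inspect 125 units
# and accept if at most 3 defectives turn up.
batch = [1] * 200 + [0] * 9800
print(accept_batch(batch, sample_size=125, acceptance_number=3))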
When only a small sample from a population is examined, there is a risk of measurement error if the sample size is too small. To understand why this occurs, it’s important to understand sampling basics. In general, sampling from a very large population involves a likelihood estimate – the likelihood that the full population will exhibit the same measurement found in the sample. For example, exit polling in elections relies on the reasonable likelihood that results from a sample of voters will predict how the full population voted, which in turn can be used to project which candidate will win. In general, this approach has worked. However, in some well-publicized cases, the methodology has failed spectacularly. When it fails, the failures have been studied intensely so these errors can be avoided in the future. The rationale behind some of these findings is articulated in more detail below.
The likelihood that sampling results will hold true for the full population is measured using a combination of “margin of error” and “confidence level.” As an example, an exit poll may indicate that the winner in an election leads by a certain percentage with a margin of error of +/- 3%. The confidence level, when not explicitly stated, is usually 95%. This is interpreted as having 95% confidence that the winner’s true lead is within +/- 3% of the reported percentage. It also means there is a 5% (one-in-twenty) chance that the true lead falls outside that +/- 3% range. The number of samples selected controls the margin of error and confidence level. Assuming that samples are randomly selected, common statistical formulas and tables will tell you the number of samples needed to achieve a certain margin of error at a particular confidence level.
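The standard large-population formula behind those tables is n = z²·p(1−p)/e², where e is the margin of error and z is the score for the chosen confidence level. A short Python sketch, using the conservative worst-case assumption p = 0.5 and z = 1.96 for 95% confidence:

```python
import math

def required_sample_size(margin_of_error, z=1.96, p=0.5):
    """Samples needed for a given margin of error in a very large
    population: n = z^2 * p(1-p) / e^2.

    p = 0.5 is the conservative (worst-case) choice when the true
    proportion is unknown; z = 1.96 corresponds to 95% confidence.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(required_sample_size(0.03))  # ±3% at 95% confidence -> 1068
print(required_sample_size(0.05))  # ±5% at 95% confidence -> 385
```

This is why pollsters so often sample roughly a thousand people: about 1,068 random samples are enough for a +/- 3% margin of error at 95% confidence, no matter how large the underlying population is.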
However, one commonly overlooked factor is that sample size calculations depend on the prevalence of the measured items within the population. When the yield (the ratio of responsive documents to the total number of documents) is low, the margin of error on the estimate for responsive documents is high if the sample size is too small. To understand this concept using a real-world example, suppose we wish to measure the median height of the U.S. population of 320 million. One can select a random sample of 2,399 individuals, determine the height of everyone in the sample, and find the median height. If the median height from the sample is 6’ 3”, one can estimate that the median height of the U.S. population is also 6’ 3”. Note that a different random sample of 2,399 individuals may produce a median height of 6’ 1”, and yet another, a median height of 6’ 4”. Thus, one could state that the median height is 6’ 3” with a margin of error of two inches. If we then attempt to estimate the number of individuals in the U.S. population who are 7’ 1” tall, one may reach a very different margin of error.
For example, assume that of the 2,399 sampled heights, only one is 7’ 1”. Based on this finding, one might conclude that 0.0417% of the U.S. population is 7’ 1” tall, and that there are precisely 133,389 such individuals. If the actual number of 7’ 1” people (as determined by an exhaustive census) is 5,200, the margin of error is a staggering 2,400%. Imagine the very likely scenario of not capturing that one 7’ 1” individual when selecting a sample of 2,399 – one would incorrectly conclude that there isn’t a single 7’ 1” person in the U.S. population, and the margin of error would be 100%! Alternatively, if the sample included two 7’ 1” people, one would estimate that the population includes 266,778 people of that height, an error of roughly 5,000%. Again, it is the very low “yield” – the small number of people within the population meeting the criterion (7’ 1”) – that causes the margin of error to widen dramatically.
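A quick Monte Carlo simulation shows just how unstable this extrapolation is at such a low prevalence. The numbers below are the article's hypotheticals (320 million people, 5,200 of them 7' 1", samples of 2,399), not real census data:

```python
import random

random.seed(42)

POPULATION = 320_000_000
RARE_COUNT = 5_200       # hypothetical count of 7'1" individuals
SAMPLE_SIZE = 2_399
TRIALS = 2_000

p = RARE_COUNT / POPULATION  # true prevalence, about 0.0016%

# In each trial, draw a random sample and extrapolate the number of
# rare individuals observed in it to the full population.
estimates = []
for _ in range(TRIALS):
    hits = sum(random.random() < p for _ in range(SAMPLE_SIZE))
    estimates.append(hits * POPULATION / SAMPLE_SIZE)

zero_fraction = sum(e == 0 for e in estimates) / TRIALS
print(f"trials concluding no such people exist: {zero_fraction:.0%}")
print(f"largest extrapolated estimate: {max(estimates):,.0f}")
```

With an expected hit count of only about 0.04 per sample, the vast majority of trials see zero rare individuals and wrongly conclude none exist, while any trial that happens to catch one or two extrapolates to an estimate tens of times larger than the truth.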
The math behind this approach is complicated, but the basic point is simple. If an initial measurement indicates a low yield within a population, then deeper analysis is required to determine a proper sample size and avoid an unreasonable margin of error. In the context of eDiscovery, selecting a large enough sample size is critical to obtaining an acceptable margin of error. Since increasing the sample size means more documents must be reviewed, it is important to consider methods for reducing the cost of reviewing these additional documents. One method is to rely on internal knowledge about a particular case so that large portions of the population that are clearly not responsive can be identified and set aside. For example, suppose a one-million-document corpus contains 20,000 responsive documents – a 2% yield. If 600,000 of the one million documents are set aside as not responsive based on culling strategies, the yield in the remaining corpus jumps to approximately 5% (i.e., 20,000 responsive documents out of 400,000 documents), and the number of samples needed for a 10% margin of error is only 6,878. Other strategies include judgmental sampling, stratified sampling, and sampling the non-responsive population. However, many of these may lack the statistical validity that random sampling provides and may not be appropriate for communicating a standardized set of results.
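To see how the required sample size depends on yield, the large-population formula can be recast with a *relative* margin of error (e.g., estimating the yield to within 10% of its true value). The sketch below uses the textbook formula with no finite-population correction, so the figure it produces for a 5% yield (about 7,300) differs somewhat from the 6,878 cited above, which presumably reflects a specific correction or tool:

```python
import math

def sample_size_for_relative_moe(yield_rate, relative_moe, z=1.96):
    """Samples needed so the yield estimate falls within
    +/- relative_moe of the true yield (z = 1.96 for 95% confidence).

    The absolute tolerance is e = yield_rate * relative_moe, so
    n = z^2 * p(1-p) / e^2 grows roughly as 1/p for small yields.
    """
    e = yield_rate * relative_moe
    return math.ceil(z**2 * yield_rate * (1 - yield_rate) / e**2)

for y in (0.20, 0.05, 0.02):
    print(f"yield {y:.0%}: {sample_size_for_relative_moe(y, 0.10):,} samples")
```

Running this shows the cost of low yield directly: at a 20% yield about 1,537 samples suffice, at 5% about 7,300 are needed, and at 2% nearly 19,000 – which is exactly why culling that raises the yield also shrinks the control set a defensible measurement requires.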
As predictive coding moves from a nascent technology with a few early adopters to more mainstream acceptance, it is critical to establish sound measurement methods to ensure the performance levels represented are actually being achieved. Equally important, practitioners must understand whether or not the predictive coding systems they use apply these statistical methodologies properly, to ensure that inaccurate representations are not made to the court and opposing parties. Understanding established procedures for sampling and the various statistical theories at play is a key aspect of defensibility. I’ve touched on one aspect – the impact of low yield on measurement and its importance with respect to predictive coding and eDiscovery in general. Stay tuned for the next post in our predictive coding series, which will discuss the metrics used for measuring accuracy.