In part 2 of our predictive coding blog series, we highlighted some of the challenges in measuring and communicating the accuracy of computer predictions. But what exactly do we mean when we refer to accuracy? In this post, I will cover the various metrics used to assess the accuracy of predictive coding.
The most intuitive method for measuring the accuracy of predictions is to simply calculate the percentage of documents the software predicted correctly. If 80 out of 100 documents are correctly predicted, the accuracy should be 80%. This approach is one of the standard methods used in many other disciplines. For example, a test score in school is often calculated by taking the number of questions answered correctly, dividing that by the total number of questions on the test, then multiplying the resulting number by 100 to get a percentage value. Wouldn’t it make sense to apply the same method for measuring the accuracy of predictive coding? Surprisingly, the answer is actually, “no.”
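This approach is problematic because in eDiscovery the goal is not to determine the number of all documents tagged correctly, but rather the number of responsive documents tagged correctly. Let’s assume there are 50,000 documents in a case and each document has been reviewed by both a human and the computer, resulting in the human-computer comparison chart shown below.
Row #1: human responsive, computer responsive: 2,000 documents
Row #2: human responsive, computer non-responsive: 6,000 documents
Row #3: human non-responsive, computer non-responsive: 40,000 documents
Row #4: human non-responsive, computer responsive: 2,000 documents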
Based on this chart, we can see that out of 50,000 total documents, the software predicted 42,000 documents (sum of row #1 and #3) correctly and therefore its accuracy is 84% (42,000/50,000).
However, analyzing the chart closely reveals a very different picture. The results of the human review show that there are 8,000 total responsive documents (sum of row #1 and #2), but the software found only 2,000 of those (row #1). This means the computer found only 25% (2,000/8,000) of the truly responsive documents. This is called Recall.
Also, of the 4,000 documents that the computer predicted as responsive (the sum of row #1 and #4), only 2,000 are actually responsive (row #1), meaning the computer is right only 50% of the time when it predicts a document to be responsive. This is called Precision.
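To make the arithmetic concrete, here is a minimal Python sketch that recomputes the three percentages discussed so far. The counts for rows #1 and #3 are quoted directly in the text; the counts for rows #2 and #4 are inferred by subtraction from the totals given above (8,000 minus 2,000 and 4,000 minus 2,000). The function and parameter names are illustrative only and do not come from any particular predictive coding product.

```python
def review_metrics(both_responsive, missed, both_non_responsive, false_alarms):
    """Return accuracy, recall and precision for the responsive class.

    both_responsive     -- row #1: human and software both say responsive
    missed              -- row #2: human says responsive, software does not
    both_non_responsive -- row #3: human and software both say non-responsive
    false_alarms        -- row #4: software says responsive, human does not
    """
    total = both_responsive + missed + both_non_responsive + false_alarms
    return {
        "accuracy": (both_responsive + both_non_responsive) / total,
        "recall": both_responsive / (both_responsive + missed),
        "precision": both_responsive / (both_responsive + false_alarms),
    }


# Counts for the 50,000-document example (rows #2 and #4 inferred by subtraction).
metrics = review_metrics(
    both_responsive=2_000,
    missed=6_000,
    both_non_responsive=40_000,
    false_alarms=2_000,
)

for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
# accuracy: 84%
# recall: 25%
# precision: 50%
```

The same four counts produce all three numbers, which is why a single "percentage correct" figure can hide a low Recall.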
So, why are Recall and Precision so low (only 25% and 50%, respectively) when computer predictions are correct for 84% of the documents? That’s because the software did very well on non-responsive documents. Based on the human review, there are 42,000 non-responsive documents (sum of row #3 and #4), of which the software correctly identified 40,000, meaning the computer recognized about 95% (40,000/42,000) of the truly non-responsive documents. Because non-responsive documents far outnumber responsive ones in this case, that strong result dominates the overall figure: the software found only 25% of the responsive documents but about 95% of the non-responsive ones, and the combined result across all documents is 84%.
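The 40,000/42,000 figure is measured against the documents that the human reviewers marked non-responsive. A related but different question is how often the software is right when it predicts non-responsive. The sketch below computes both, again using counts inferred from the totals in the text; the variable names are illustrative assumptions.

```python
# Counts from the 50,000-document example (row #2 inferred as 8,000 - 2,000).
missed_responsive = 6_000        # row #2: human responsive, software non-responsive
both_non_responsive = 40_000     # row #3: human and software both non-responsive
false_alarms = 2_000             # row #4: human non-responsive, software responsive

# Share of truly non-responsive documents that the software identified.
human_non_responsive = both_non_responsive + false_alarms          # 42,000
non_responsive_found = both_non_responsive / human_non_responsive

# Share of the software's non-responsive predictions that were correct.
predicted_non_responsive = both_non_responsive + missed_responsive  # 46,000
non_responsive_correct = both_non_responsive / predicted_non_responsive

print(f"Truly non-responsive documents identified: {non_responsive_found:.1%}")
print(f"Non-responsive predictions that were right: {non_responsive_correct:.1%}")
# Truly non-responsive documents identified: 95.2%
# Non-responsive predictions that were right: 87.0%
```

Either way, the software performs far better on non-responsive documents than on responsive ones, and that is what lifts the overall percentage.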
In eDiscovery, parties are required to take reasonable steps to find documents. The example above illustrates that the “percentage of correct predictions across all documents” metric may paint an inaccurate view of the number of responsive documents found or missed by the software. This is especially true when most of the documents in a case are non-responsive, which is the most common scenario in eDiscovery. Therefore, Recall and Precision, which accurately track the number of responsive documents found and missed, are better metrics for measuring accuracy of predictions, since they measure what the eDiscovery process is seeking to achieve.
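A quick way to see why overall accuracy misleads when most documents are non-responsive is to imagine a hypothetical "software" that tags every document non-responsive. The sketch below builds that scenario using the example's 8,000 responsive and 42,000 non-responsive documents; the do-nothing classifier is an illustration, not something described in the scenario above.

```python
# Human review: 8,000 responsive (True) and 42,000 non-responsive (False) documents.
human_labels = [True] * 8_000 + [False] * 42_000

# Hypothetical software that never predicts "responsive".
software_labels = [False] * len(human_labels)

# Fraction of all documents where human and software agree.
agreements = sum(h == s for h, s in zip(human_labels, software_labels))
accuracy = agreements / len(human_labels)

# Fraction of truly responsive documents that the software found.
responsive_found = sum(h and s for h, s in zip(human_labels, software_labels))
recall = responsive_found / sum(human_labels)

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall:   {recall:.0%}")
# Accuracy: 84%
# Recall:   0%
```

This do-nothing approach scores the same 84% as the software in the example while finding no responsive documents at all, which is exactly the failure that Recall exposes.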
However, measuring and tracking both metrics independently can be cumbersome in many situations, especially if the end goal is to achieve higher accuracy on both measures overall. A single metric called F-measure, which combines Precision and Recall by taking their harmonic mean, can be used instead. Because the harmonic mean is pulled toward the lower of the two values, a high F-measure requires both Precision and Recall to be high, and a low value on either one drags the F-measure down.
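As a rough illustration, the harmonic mean of Precision and Recall can be computed as 2 x Precision x Recall / (Precision + Recall). The sketch below applies that formula to the 50% Precision and 25% Recall from the earlier example; the function name and the choice to return zero when both inputs are zero are assumptions made for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions, 0 to 1)."""
    if precision + recall == 0:
        # Assumption for this sketch: treat the undefined 0/0 case as zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall from the 50,000-document example.
example_f = f_measure(precision=0.50, recall=0.25)
print(f"F-measure: {example_f:.1%}")
# F-measure: 33.3%

# A lopsided result: very high Precision cannot make up for very low Recall.
lopsided_f = f_measure(precision=0.90, recall=0.10)
print(f"Lopsided F-measure: {lopsided_f:.1%}")
# Lopsided F-measure: 18.0%
```

Note that a simple average of 90% and 10% would be 50%, while the F-measure is only 18%, reflecting how strongly it penalizes an imbalance between the two.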
These three metrics (Precision, Recall and F-measure) are the most widely accepted standards for measuring the accuracy of computer predictions. As a result, users of predictive coding are looking for solutions that provide a way to measure prediction accuracy using all three metrics. The most advanced solutions have built-in measurement workflows and tracking mechanisms.
There is no standard for Recall, Precision or F-measure percentage. It is up to the parties involved in eDiscovery to determine a “reasonable” percentage based on the time, cost and risk trade-offs. A higher percentage means higher accuracy, but it also means higher eDiscovery costs, as the software will likely require more training. For high-risk matters, 80%, 90% or even higher Recall may be required, but for lower-risk matters, 70% or even 60% may be acceptable. It should be noted that academic studies analyzing the effectiveness of linear review show widely varying review quality. One study that compared the accuracy of manual review with technology-assisted review shows that manual review achieved, on average, 59.3% Recall compared with an average Recall of 76.7% for technology-assisted review such as predictive coding.