Ask the average litigation attorney about the difference between predictive coding technology and other, more traditional litigation tools and you are likely to receive a wide range of responses. The fact that “predictive coding” goes by many names, including “computer-assisted review” (CAR) and “technology-assisted review” (TAR), illustrates a fundamental problem: what is predictive coding, and how is it different from other tools in the litigator’s technology toolbelt™?
Predictive coding is a type of machine-learning technology that enables a computer to “predict” how documents should be classified by relying on input (or “training”) from human reviewers. The technology is exciting for organizations attempting to manage skyrocketing eDiscovery costs because the ability to expedite the document review process and find key documents faster has the potential to save organizations thousands of hours. In a profession where the cost of reviewing a single gigabyte of data has been estimated to be around $18,000, compressing days, weeks, or even months of tedious document review into more reasonable time frames means massive savings for the many organizations struggling to keep litigation expenditures in check.
Unfortunately, widespread adoption of predictive coding technology has been relatively slow due to confusion about how predictive coding differs from other types of CAR or TAR tools that have been available for years. Predictive coding, unlike other tools that automatically extract patterns and identify relationships between documents with minimal human intervention, requires a deeper level of human interaction. That interaction involves significant reliance on humans to train and fine-tune the system through an iterative, hands-on process. Some common TAR tools used in eDiscovery that do not include this same level of interaction are described below:
- Keyword search: Involves inputting a word or words into a computer which then retrieves documents within the collection containing the same words. Also known as Boolean searching, keyword search tools typically include enhanced capabilities to identify word combinations and derivatives of root words among other things.
- Concept search: Involves the use of linguistic and statistical algorithms to determine whether a document is responsive to a particular search query. This technology typically analyzes variables such as the proximity and frequency of words as they appear in relationship to a keyword search. The technology can retrieve more documents than keyword searches because conceptually related documents are identified, whether or not those documents contain the original keyword search terms.
- Discussion threading: Utilizes algorithms to dynamically link together related documents (most commonly e-mail messages) into chronological threads that reveal entire discussions. This simplifies the process of identifying participants to a conversation and understanding the substance of the conversation.
- Clustering: Involves the use of algorithms to automatically organize a large collection of documents into different topical categories based on similarities between documents. Reviewing documents organized categorically can help increase the speed and efficiency of document review.
- Find similar: Enables the automated retrieval of other documents related to a particular document of interest. Reviewing similar documents together accelerates the review process, provides full context for the document under review, and ensures greater coding consistency.
- Near-duplicate identification: Allows reviewers to easily identify, view, and code near-duplicate e-mails, attachments, and loose files. Some systems can highlight differences between near-duplicate documents to help simplify document review.
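As a rough illustration of how one of the tools above might work, the sketch below scores near-duplicates by comparing overlapping word pairs (Jaccard similarity). The function names, the sample texts, and the 0.7 cutoff are assumptions made for this example; commercial tools do not necessarily use this method.

```python
def shingles(text, k=2):
    """Return the set of overlapping k-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of word sequences common to both texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a, b, cutoff=0.7):
    """Hypothetical rule: flag a pair whose similarity meets the cutoff."""
    return jaccard(a, b) >= cutoff

draft_1 = "please review the attached pricing draft today"
draft_2 = "please review the attached pricing draft tomorrow"
unrelated = "lunch menu for friday"

print(is_near_duplicate(draft_1, draft_2))
print(is_near_duplicate(draft_1, unrelated))
```

A reviewer could code the first pair together and leave the unrelated message for separate review.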
Unlike the TAR tools listed above, predictive coding technology relies on humans to review a small fraction of the overall document population, which ultimately results in a fraction of the review costs. The process entails feeding decisions about how to classify a small number of case documents, called a training set, into a computer system. The computer then relies on the human training decisions to generate a model that is used to predict how the remaining documents should be classified. The information generated by the model can be used to rank, analyze, and review the documents quickly and efficiently. Although documents can be coded with multiple designations that relate to various issues in the case during eDiscovery, predictive coding technology is often used simply to segregate responsive and privileged documents from non-responsive documents in order to expedite and simplify the document review process.
Training the predictive coding system is an iterative process that requires attorneys and their legal teams to evaluate the accuracy of the computer’s document prediction scores at each stage. A prediction score is simply a percentage value assigned to each document that is used to rank all the documents by degree of responsiveness. If the accuracy of the computer-generated predictions is insufficient, additional training documents can be selected and reviewed to help improve the system’s performance. Multiple training sets are commonly reviewed and coded until the desired performance levels are achieved. Once the desired performance levels are achieved, informed decisions can be made about which documents to produce.
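The iterative workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product works: the word-weight model is a toy stand-in for a vendor's actual classifier, and `train`, `score`, `accuracy`, `iterative_training`, the sample documents, and the 90 percent target are all hypothetical.

```python
import math
from collections import Counter

def train(labeled):
    """Build per-word log-odds weights from (text, is_responsive) pairs."""
    pos, neg = Counter(), Counter()
    for text, responsive in labeled:
        (pos if responsive else neg).update(set(text.lower().split()))
    n_pos = sum(1 for _, responsive in labeled if responsive)
    n_neg = len(labeled) - n_pos
    # Add-one smoothing keeps unseen word/class pairs from producing log(0).
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }

def score(model, text):
    """Return a prediction score between 0 and 1 for one document."""
    total = sum(model.get(w, 0.0) for w in set(text.lower().split()))
    return 1.0 / (1.0 + math.exp(-total))

def accuracy(model, control):
    """Fraction of human-coded control documents the model classifies correctly."""
    hits = sum((score(model, text) >= 0.5) == responsive for text, responsive in control)
    return hits / len(control)

def iterative_training(batches, control, target=0.9):
    """Add one reviewed training set at a time until accuracy meets the target."""
    labeled, model, acc, rounds = [], {}, 0.0, 0
    for batch in batches:
        labeled.extend(batch)
        model = train(labeled)
        acc = accuracy(model, control)
        rounds += 1
        if acc >= target:
            break
    return model, acc, rounds

# Hypothetical training sets coded by reviewers (True = responsive).
batches = [
    [("pricing agreement with competitor", True), ("lunch menu for friday", False)],
    [("competitor pricing call notes", True), ("office party friday", False)],
]
# Hypothetical control set used to measure the model after each round.
control = [
    ("call about competitor pricing", True),
    ("friday lunch plans", False),
    ("notes for the call", True),
]

model, acc, rounds = iterative_training(batches, control)
print(rounds, acc)
```

In this toy run the first training set is not enough to classify every control document correctly, so a second set is reviewed before the target is met, which mirrors the multiple rounds of training described above.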
For example, suppose the legal team’s analysis of the computer’s predictions reveals that, within a population of 1 million documents, only those with prediction scores of 70 percent or higher appear to be responsive, and that 300,000 documents fall in that range. The team may elect to produce only those 300,000 documents to the requesting party. The financial consequences of this approach are significant because the majority of the documents (here, the remaining 700,000) can be excluded from expensive manual review by humans. The simple rule of thumb in eDiscovery is that the fewer documents requiring human review, the more money saved, since document review is typically the most expensive facet of eDiscovery.
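The arithmetic behind that example is simple enough to check directly. The sketch below assumes, as the example implies, that 300,000 of the 1 million documents score at or above the 70 percent cutoff; `review_cut` and the sample scores are hypothetical names and data used only for illustration.

```python
def review_cut(scores, threshold=0.70):
    """Split document ids into those at or above the cutoff and those below it."""
    keep = [doc_id for doc_id, s in scores.items() if s >= threshold]
    skip = [doc_id for doc_id, s in scores.items() if s < threshold]
    return keep, skip

# A handful of hypothetical prediction scores, keyed by document id.
sample_scores = {"DOC-1": 0.93, "DOC-2": 0.70, "DOC-3": 0.41, "DOC-4": 0.08}
keep, skip = review_cut(sample_scores)

# The population-level numbers from the example.
total_docs = 1_000_000
produced = 300_000
excluded = total_docs - produced
share_excluded = excluded / total_docs

print(keep, skip)
print(excluded, share_excluded)
```

Under these assumptions, 700,000 documents, or 70 percent of the population, never reach manual review.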
Hype and confusion surrounding the promise of predictive coding technology have led some to believe that the technology renders other TAR tools obsolete. To the contrary, predictive coding technology should be viewed as one of many different types of tools in the litigator’s technology toolbelt™ that often can and should be used together. Choosing which of these tools to use and how to use them depends on the case and requires balancing factors such as discovery deadlines, cost, and complexity. Many believe, however, that the choice about which tools should be used for a particular matter should be left to the producing party as long as the tools are used properly and in a manner that is “just” for both parties as mandated by Rule 1 of the Federal Rules of Civil Procedure.
The notion that parties should be able to choose which tools they use during discovery recently garnered support within the Seventh Circuit. In Kleen Products, LLC, et al. v. Packaging Corporation of America, et al., Judge Nolan was asked to consider plaintiffs’ claim that the defendants should be required to supplement their use of keyword searching tools with more advanced tools in order to better comply with their duty to produce documents. Plaintiffs’ argument hinged largely on the assumption that using more advanced tools would result in a more thorough document production. In response to this argument, Judge Nolan referenced the Sedona Best Practices Recommendations & Principles for Addressing Electronic Document Production during a hearing between the parties to suggest that the carpenter (end user) is best equipped to select the appropriate tool during discovery. Sedona Principle 6 states that:
“[r]esponding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”
Even though the parties in Kleen Products ultimately postponed further discussion about whether tools like predictive coding technology should be used when possible during discovery, the issue remains important because it is likely to resurface again and again as predictive coding momentum continues to grow. Some will argue that parties who fail to leverage modern technology tools like predictive coding are attempting to game the legal system to avoid thorough document productions. In some instances, that argument could be valid, but it should not be a foregone conclusion.
Although there will likely come a day when predictive coding technology is the status quo for managing large-scale document review, that day has not yet arrived. Predictive coding technology is a type of machine learning technology that has been used in other disciplines for decades. However, predictive coding tools are still very new to the field of law. As a result, most predictive coding tools lack transparency because they provide little if any information about the underlying statistical methodologies they apply. The issue is important because the misapplication of statistics could have a dramatic effect on the thoroughness of document productions. Unfortunately, these nuanced issues are sometimes misunderstood or overlooked by predictive coding proponents, a problem that could ultimately result in unfairness to requesting parties and stall broader adoption of otherwise promising technology.
Further complicating matters is the fact that several solution providers have introduced new predictive coding tools in recent months to try to capture market share. In the long term, competition is good for consumers and the industry as a whole. In the short term, however, most of these tools are largely untested and vary in quality and ease of use, thereby adding to the confusion facing would-be consumers. The unfortunate end result is that many lawyers are shying away from using predictive coding technology until the pros and cons of various technology solutions and their providers are better understood. Market confusion is often one of the biggest stumbling blocks to faster adoption of technology that could save organizations millions, and the current predictive coding landscape is a testament to this fact.
Eliminating much of the current confusion through education is the precise goal of Symantec’s Predictive Coding for Dummies book. The book addresses everything from predictive coding case law and defensible workflows to key factors that should be considered when evaluating different predictive coding tools. The book strives to provide attorneys and legal staff accustomed to using traditional TAR tools like keyword searching with a baseline understanding of a new technological approach that many find confusing. We believe providing the industry with this basic level of understanding will help ensure that predictive coding technology and related best practices standards will evolve in a manner that is fair to both parties, ultimately expediting rather than slowing broader adoption of this promising new technology. To learn more, download a free copy of Predictive Coding for Dummies and feel free to share your feedback and comments below.