Predictive Coding 101 & the Litigator’s Toolbelt

Wednesday, December 5th, 2012

Query your average litigation attorney about the difference between predictive coding technology and other, more traditional litigation tools and you are likely to receive a wide range of responses. The fact that “predictive coding” goes by many names, including “computer-assisted review” (CAR) and “technology-assisted review” (TAR), illustrates a fundamental problem: what is predictive coding, and how is it different from other tools in the litigator’s technology toolbelt™?

 Predictive coding is a type of machine-learning technology that enables a computer to “predict” how documents should be classified by relying on input (or “training”) from human reviewers. The technology is exciting for organizations attempting to manage skyrocketing eDiscovery costs because the ability to expedite the document review process and find key documents faster has the potential to save organizations thousands of hours of time. In a profession where the cost of reviewing a single gigabyte of data has been estimated to be around $18,000, narrowing days, weeks, or even months of tedious document review into more reasonable time frames means massive savings for thousands of organizations struggling to keep litigation expenditures in check.

 Unfortunately, widespread adoption of predictive coding technology has been relatively slow due to confusion about how predictive coding differs from other types of CAR or TAR tools that have been available for years. Predictive coding, unlike other tools that automatically extract patterns and identify relationships between documents with minimal human intervention, requires a deeper level of human interaction. That interaction involves significant reliance on humans to train and fine-tune the system through an iterative, hands-on process. Some common TAR tools used in eDiscovery that do not include this same level of interaction are described below:

  •  Keyword search: Involves inputting a word or words into a computer which then retrieves documents within the collection containing the same words. Also known as Boolean searching, keyword search tools typically include enhanced capabilities to identify word combinations and derivatives of root words among other things.
  •  Concept search: Involves the use of linguistic and statistical algorithms to determine whether a document is responsive to a particular search query. This technology typically analyzes variables such as the proximity and frequency of words as they appear in relationship to a keyword search. The technology can retrieve more documents than keyword searches because conceptually related documents are identified, whether or not those documents contain the original keyword search terms.
  •  Discussion threading: Utilizes algorithms to dynamically link together related documents (most commonly e-mail messages) into chronological threads that reveal entire discussions. This simplifies the process of identifying participants to a conversation and understanding the substance of the conversation.
  •  Clustering: Involves the use of algorithms to automatically organize a large collection of documents into different topical categories based on similarities between documents. Reviewing documents organized categorically can help increase the speed and efficiency of document review. 
  •  Find similar: Enables the automated retrieval of other documents related to a particular document of interest. Reviewing similar documents together accelerates the review process, provides full context for the document under review, and ensures greater coding consistency.
  •  Near-duplicate identification: Allows reviewers to easily identify, view, and code near-duplicate e-mails, attachments, and loose files. Some systems can highlight differences between near-duplicate documents to help simplify document review.
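The near-duplicate idea in the last bullet can be sketched in a few lines of code. This is a hypothetical toy, not any vendor's algorithm; commercial tools use more sophisticated hashing, but the underlying principle, measuring textual overlap between two documents, is the same. Here documents are broken into overlapping word "shingles" and compared with Jaccard similarity:

```python
def shingles(text, k=3):
    """Break a document into overlapping k-word 'shingles'."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: shared shingles over total distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Invented sample documents for illustration.
doc1 = "Please review the attached Q3 sales forecast before Friday"
doc2 = "Please review the attached Q3 sales forecast before Monday"
doc3 = "Lunch menu for the cafeteria is attached"

sim_near = jaccard(shingles(doc1), shingles(doc2))
sim_far = jaccard(shingles(doc1), shingles(doc3))
print(f"doc1 vs doc2: {sim_near:.2f}")  # high score: near-duplicates
print(f"doc1 vs doc3: {sim_far:.2f}")   # low score: unrelated
```

A review platform would batch documents whose similarity exceeds a chosen threshold so a reviewer can code them together.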

Unlike the TAR tools listed above, predictive coding technology relies on humans to review a small fraction of the overall document population, which ultimately results in a fraction of the review costs. The process entails feeding decisions about how to classify a small number of case documents called a training set into a computer system. The computer then relies on the human training decisions to generate a model that is used to predict how the remaining documents should be classified. The information generated by the model can be used to rank, analyze, and review the documents quickly and efficiently. Although documents can be coded with multiple designations that relate to various issues in the case during eDiscovery, many times predictive coding technology is simply used to segregate responsive and privileged documents from non-responsive documents in order to expedite and simplify the document review process.

 Training the predictive coding system is an iterative process that requires attorneys and their legal teams to evaluate the accuracy of the computer’s document prediction scores at each stage. A prediction score is simply a percentage value assigned to each document that is used to rank all the documents by degree of responsiveness. If the accuracy of the computer-generated predictions is insufficient, additional training documents can be selected and reviewed to help improve the system’s performance. Multiple training sets are commonly reviewed and coded until the desired performance levels are achieved. Once the desired performance levels are achieved, informed decisions can be made about which documents to produce.
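The train-then-score loop described above can be illustrated with a deliberately simplified scorer. Everything here is a hypothetical sketch: the word-counting "model," the sample documents, and the 0–100 score are invented for illustration, and commercial predictive coding tools use far more sophisticated statistical classifiers. The shape of the workflow, however, matches the text: attorneys code a training set, the system scores the remaining documents, and additional training rounds follow if performance is insufficient:

```python
from collections import Counter

def train(labeled_docs):
    """Build word-frequency profiles from attorney coding decisions."""
    responsive, nonresponsive = Counter(), Counter()
    for text, is_responsive in labeled_docs:
        (responsive if is_responsive else nonresponsive).update(text.lower().split())
    return responsive, nonresponsive

def prediction_score(text, model):
    """Return a 0-100 'responsiveness' prediction score for one document."""
    responsive, nonresponsive = model
    words = text.lower().split()
    votes = sum(1 for w in words if responsive[w] > nonresponsive[w])
    return round(100 * votes / max(len(words), 1))

# Round 1: a small attorney-coded training set (invented examples).
training_set = [
    ("merger price negotiation term sheet", True),
    ("fantasy football league standings", False),
]
model = train(training_set)

uncoded = "draft term sheet for the merger negotiation"
print(prediction_score(uncoded, model))  # high score suggests responsiveness

# Round 2: if accuracy is insufficient, code more documents and retrain.
training_set.append(("holiday party menu options", False))
model = train(training_set)
```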

 For example, if the legal team’s analysis of the computer’s predictions reveals that within a population of 1 million documents, only those with prediction scores in the 70 percent range and higher appear to be responsive, the team may elect to produce only those 300,000 documents to the requesting party. The financial consequences of this approach are significant because a majority of the documents can be excluded from expensive manual review by humans. The simple rule of thumb in eDiscovery is that the fewer documents requiring human review, the more money saved since document review is typically the most expensive facet of eDiscovery.
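The economics of that example reduce to simple arithmetic. The per-document review cost below is purely illustrative, but the structure of the calculation holds for any real figures:

```python
# Hypothetical numbers from the example above: 1 million documents,
# of which 300,000 score at or above the 70 percent threshold.
total_docs = 1_000_000
docs_above_threshold = 300_000
cost_per_doc_reviewed = 1.00  # illustrative per-document review cost

full_manual_review = total_docs * cost_per_doc_reviewed
predictive_coding_review = docs_above_threshold * cost_per_doc_reviewed
savings = full_manual_review - predictive_coding_review

print(f"Documents excluded from manual review: {total_docs - docs_above_threshold:,}")
print(f"Illustrative review savings: ${savings:,.0f}")
```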

Hype and confusion surrounding the promise of predictive coding technology has led some to believe that the technology renders other TAR tools obsolete. To the contrary, predictive coding technology should be viewed as one of many different types of tools in the litigator’s technology toolbelt™ that often can and should be used together. Choosing which of these tools to use and how to use them depends on the case and requires balancing factors such as discovery deadlines, cost, and complexity. Many believe, however, that the choice about which tools should be used for a particular matter should be left to the producing party, as long as the tools are used properly and in a manner that is “just” for both parties as mandated by Rule 1 of the Federal Rules of Civil Procedure.

The notion that parties should be able to choose which tools they use during discovery recently garnered support in the Seventh Circuit. In Kleen Products, LLC, et al. v. Packaging Corporation of America, et al., Judge Nolan was faced with exploring plaintiffs’ claim that the defendants should be required to supplement their use of keyword searching tools with more advanced tools in order to better comply with their duty to produce documents. Plaintiffs’ argument hinged largely on the assumption that using more advanced tools would result in a more thorough document production. In response to this argument, Judge Nolan referenced the Sedona Best Practices Recommendations & Principles for Addressing Electronic Document Production during a hearing between the parties to suggest that the carpenter (end user) is best equipped to select the appropriate tool during discovery. Sedona Principle 6 states that:

“[r]esponding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”

Even though the parties in Kleen Products ultimately postponed further discussion about whether tools like predictive coding technology should be used when possible during discovery, the issue remains important because it is likely to resurface again and again as predictive coding momentum continues to grow. Some will argue that parties who fail to leverage modern technology tools like predictive coding are attempting to game the legal system to avoid thorough document productions.  In some instances, that argument could be valid, but it should not be a foregone conclusion.

Although there will likely come a day when predictive coding technology is the status quo for managing large-scale document review, that day has not yet arrived. Predictive coding is a type of machine learning technology that has been used in other disciplines for decades. However, predictive coding tools are still very new to the field of law. As a result, most predictive coding tools lack transparency because they provide little, if any, information about the underlying statistical methodologies they apply. The issue is important because the misapplication of statistics could have a dramatic effect on the thoroughness of document productions. Unfortunately, these nuanced issues are sometimes misunderstood or overlooked by predictive coding proponents, a problem that could ultimately result in unfairness to requesting parties and stall broader adoption of otherwise promising technology.

Further complicating matters is the fact that several solution providers have introduced new predictive coding tools in recent months to try to capture market share. In the long term, competition is good for consumers and the industry as a whole. In the short term, however, most of these tools are largely untested and vary in quality and ease of use, thereby adding more confusion for would-be consumers. The unfortunate end result is that many lawyers are shying away from using predictive coding technology until the pros and cons of the various technology solutions and their providers are better understood. Market confusion is often one of the biggest stumbling blocks to faster adoption of technology that could save organizations millions, and the current predictive coding landscape is a testament to this fact.

Eliminating much of the current confusion through education is the precise goal of Symantec’s Predictive Coding for Dummies book. The book addresses everything from predictive coding case law and defensible workflows to key factors that should be considered when evaluating different predictive coding tools. The book strives to provide attorneys and legal staff accustomed to using traditional TAR tools like keyword searching with a baseline understanding of a new technological approach that many find confusing. We believe providing the industry with this basic level of understanding will help ensure that predictive coding technology and related best practice standards evolve in a manner that is fair to both parties, ultimately expediting rather than slowing broader adoption of this promising new technology. To learn more, download a free copy of Predictive Coding for Dummies and feel free to share your feedback and comments below.

Q&A With Predictive Coding Guru, Maura R. Grossman, Esq.

Tuesday, November 13th, 2012

Can you tell us a little about your practice and your interest in predictive coding?

After a prior career as a clinical psychologist, I joined Wachtell Lipton as a litigator in 1999, and in 2007, when I was promoted to counsel, my practice shifted exclusively to advising lawyers and clients on legal, technical, and strategic issues involving electronic discovery and information management, both domestically and abroad.

I became interested in technology-assisted review (“TAR”) in the 2007/2008 time frame, when I sought to address the fact that Wachtell Lipton had few associates to devote to document review, and contract attorney review was costly, time-consuming, and generally of poor quality.  At about the same time, I crossed paths with Jason R. Baron and got involved in the TREC Legal Track.

What are a few of the biggest predictive coding myths?

There are so many, it’s hard to limit myself to only a few!  Here are my nominations for the top ten, in no particular order:

Myth #1:  TAR is the same thing as clustering, concept search, “find similar,” or any number of other early case assessment tools.
Myth #2:  Seed or training sets must always be random.
Myth #3:  Seed or training sets must always be selected and reviewed by senior partners.
Myth #4:  Thousands of documents must be reviewed as a prerequisite to employing TAR, therefore, it is not suitable for smaller matters.
Myth #5:  TAR is more susceptible to reviewer error than the “traditional approach.”
Myth #6:  One should cull with keywords prior to employing TAR.
Myth #7:  TAR does not work for short documents, spreadsheets, foreign language documents, or OCR’d documents.
Myth #8:  TAR finds “easy” documents at the expense of “hot” documents.
Myth #9:  If one adds new custodians to the collection, one must always retrain the system.
Myth #10:  Small changes to the seed or training set can cause large changes in the outcome, for example, documents that were previously tagged as highly relevant can become non-relevant. 

The bottom line is that your readers should challenge commonly held (and promoted) assumptions that lack empirical support.

Are all predictive coding tools the same?  If not, then what should legal departments look for when selecting a predictive coding tool?

Not at all, and neither are all manual reviews.  It is important to ask service providers the right questions to understand what you are getting.  For example, some TAR tools employ supervised or active machine learning, which require the construction of a “training set” of documents to teach the classifier to distinguish between responsive and non-responsive documents.  Supervised learning methods are generally more static, while active learning methods involve more interaction with the tool and more iteration.  Knowledge engineering approaches (a.k.a. “rule-based” methods) involve the construction of linguistic and other models that replicate the way that humans think about complex problems.  Both approaches can be effective when properly employed and validated.  At this time, only active machine learning and rule-based approaches have been shown to be effective for technology-assisted review.  Service providers should be prepared to tell their clients what is “under the hood.”

What is the number one mistake practitioners should avoid when using these tools?

Not employing proper validation protocols, which are essential to a defensible process.  There is widespread misunderstanding of statistics and what they can and cannot tell us.  For example, many service providers report that their tools achieve 99% accuracy.  Accuracy is the fraction of documents that are correctly coded by a search or review effort.  While accuracy is commonly advanced as evidence of an effective search or review effort, it can be misleading because it is heavily influenced by prevalence, or the number of responsive documents in the collection.  Consider, for example, a document collection containing one million documents, of which ten thousand (or 1%) are relevant.  A search or review effort that identified 100% of the documents as non-relevant, and therefore, found none of the relevant documents, would have 99% accuracy, belying the failure of that search or review effort to identify a single relevant document.
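Ms. Grossman's accuracy-versus-prevalence point can be verified with a few lines of arithmetic, using the same figures from her example:

```python
collection_size = 1_000_000
relevant = 10_000  # 1% prevalence: only 10,000 relevant documents

# A "review" that simply marks every document non-relevant:
true_negatives = collection_size - relevant  # non-relevant docs, correctly coded
true_positives = 0                           # relevant docs found: none

accuracy = (true_negatives + true_positives) / collection_size
recall = true_positives / relevant

print(f"accuracy: {accuracy:.0%}")  # 99%, despite finding nothing
print(f"recall:   {recall:.0%}")    # 0%: not one relevant document identified
```

This is why recall (the fraction of relevant documents actually found) is a far more informative validation metric than accuracy in low-prevalence collections.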

What do you see as the key issues that will confront practitioners who wish to use predictive coding in the near-term?

There are several issues that will be played out in the courts and in practice over the next few years.  They include:  (1) How does one know if the proposed TAR tool will work (or did work) as advertised?; (2) Must seed or training sets be disclosed, and why?; (3) Must documents coded as non-relevant be disclosed, and why?; (4) Should TAR be held to a higher standard of validation than manual review?; and (5) What cost and effort is justified for the purposes of validation?  How does one ensure that the cost of validation does not obliterate the savings achieved by using TAR?

What have you been up to lately?

In an effort to bring order to chaos by introducing a common framework and set of definitions for use by the bar, bench, and vendor community, Gordon V. Cormack and I recently prepared a glossary on technology-assisted review that is available for free download at:  http://cormack.uwaterloo.ca/targlossary.  We hope that your readers will send us their comments on our definitions and additional terms for inclusion in the next version of the glossary.

Maura R. Grossman, counsel at Wachtell, Lipton, Rosen & Katz, is a well-known e-discovery lawyer and recognized expert in technology-assisted review.  Her work was cited in the landmark 2012 case, Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012).

Where There’s Smoke There’s Fire: Powering eDiscovery with Data Loss Prevention

Monday, November 12th, 2012

New technologies are being repurposed for Early Case Assessment (ECA) in this ever-changing global economy chock-full of intellectual property theft and cybertheft. These increasingly hot issues are now compelling lawyers to become savvier about the technologies they use to identify IP theft and related issues in eDiscovery. One of the more useful, but often overlooked, tools in this regard is Data Loss Prevention (DLP) technology. Traditionally a data breach and security tool, DLP has emerged as yet another tool in the Litigator’s Tool Belt™ that can be applied in eDiscovery.

DLP technology utilizes Vector Machine Learning (VML) to detect intellectual property, such as product designs, source code and trademarked language that are deemed proprietary and confidential. This technology eliminates the need for developing laborious keyword-based policies or fingerprinting documents. While a corporation can certainly customize these policies, there are off the shelf materials that make the technology easy to deploy.
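At a very high level, this kind of content-based detection can be sketched as a similarity check against exemplar confidential material. The actual VML implementation is proprietary and far more sophisticated; the word-count cosine similarity, exemplar text, and threshold below are all invented for illustration:

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a word-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical exemplar of protected IP the policy is "trained" on.
exemplar = vectorize("aramid fiber polymer spin process temperature profile")

def flag_outbound(message, threshold=0.5):
    """Flag an outbound message that resembles the protected exemplar."""
    return cosine_similarity(vectorize(message), exemplar) >= threshold

print(flag_outbound("attached aramid fiber spin process temperature notes"))
print(flag_outbound("see you at the team lunch on friday"))
```

Because detection keys on the content itself rather than on a fixed keyword list, paraphrased or partially copied material can still trip the policy.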

An exemplary use case that spotlights how DLP could have been deployed in the eDiscovery context is the case of E.I. Du Pont de Nemours v. Kolon Industries. In DuPont, a jury issued a $919 million verdict after finding that the defendant manufacturer stole critical elements of the formula for Kevlar, a closely guarded and highly profitable DuPont trade secret. Despite the measures that were taken to protect the trade secret, a former DuPont consultant successfully copied key information relating to Kevlar onto a CD that was later disseminated to the manufacturer’s executives. All of this came to light in the recently unsealed criminal indictments the U.S. Department of Justice obtained against the manufacturer and several of its executives.

Perhaps all of this could have been avoided had a DLP tool been deployed. A properly implemented DLP solution in the DuPont case might have detected the misappropriation that occurred and perhaps prompted an internal investigation. At the very least, DLP could possibly have mitigated the harmful effects of the trade secret theft. DLP technology could potentially have detected the departure/copying of proprietary information and any other suspicious behavior regarding sensitive IP.

As the DuPont case teaches, DLP can be utilized to detect IP theft and data breaches. In addition, it can act as an early case assessment (ECA) tool for lawyers in both civil and criminal actions. With data breaches, where there is smoke (breach) there is generally fire (litigation). A DLP incident report can be used as a basis for an investigation, and essentially reverse engineer the ECA process with hard evidence underlying the data breach. Thus, instead of beginning an investigation with a hunch or tangential lead, DLP gives hard facts to lawyers, and ultimately serves as a roadmap for effective legal hold implementation for the communications of custodians. Instead of discovering data breaches during the discovery process, DLP allows lawyers to start with this information, making the entire matter more efficient and targeted.

From an information governance point of view, DLP also has a relationship with the proactive left side of the Electronic Discovery Reference Model (EDRM). DLP technology can also be repurposed as Data Classification Services (DCS) for automated document retention. The combined policy and technology of DCS/DLP work in harmony to accomplish appropriate document retention as well as breach prevention and notification. It follows that similar identifiers exist in both the DCS and DLP policy consoles, and that these indicators enable the technology to make intelligent decisions.

Given this backdrop, it behooves both firm lawyers and corporate counsel to consider getting up to speed on the capabilities of DLP tools. The benefits DLP offers in eDiscovery are too important to be ignored.

5 questions with Ralph Losey about the New Electronic Discovery Best Practices (EDBP) Model for Attorneys

Tuesday, November 6th, 2012

The eDiscovery world is atwitter with two new developments – one is Judge Laster’s opinion in the EORHB case where he required both parties to use predictive coding. The other is the new EDBP model, created by Ralph Losey (and team) to “provide a model of best practices for use by law firms and corporate law departments.” Ralph was kind enough to answer a few questions for eDiscovery 2.0:

1. While perhaps not fair, I’ve already heard the EDBP referred to as the “new EDRM.” If busy folks could only read one paragraph on the distinction, could you set them straight?

“EDRM, the Electronic Discovery Reference Model, covers the whole gamut of an e-discovery project. The model provides a well-established, nine-step workflow that helps beginners understand e-discovery. EDBP, Electronic Discovery Best Practices, is focused solely on the activities of lawyers. The EDBP identifies a ten-step workflow for the rendition of legal services in e-discovery. Moreover, EDBP.com attempts to capture and record what lawyers specializing in the field now consider the best practices for each of these activities.”

“By the way, although I have a copyright on these diagrams, anyone may freely use the diagrams. We encourage that. We are also open to suggestions for best practices from any practicing lawyer. We anticipate that this will be a constantly evolving model and collection of best practices.”

2. Given the lawyer-centric focus, what void are you attempting to fill with the EDBP?

“I was convinced by my friend Jason Baron of the need for standards in the world of e-discovery. It is too much of a wild west out there now, and we need guidance. But as a private lawyer I am also cognizant of the dangers of creating minimum standards for lawyers that could be used as a basis for malpractice suits. It is not an appropriate thing for any private group to do. It is a judicial matter that will arise out of case law and competition. So after a lot of thought we realized that minimum standards should only be articulated for the non-legal-practice part of e-discovery; in other words, standards should be created for vendors only and their non-legal activities. The focus for lawyers should be on establishing best practices, not minimum standards. I created a graphic using the analogy of a full tank of gas to visualize this point and explained it in my blog post Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas?”


“This continuum of competence applies not only to the legal service of Computer Assisted Review (CAR), aka Technology Assisted Review (TAR), but to all legal services. The goal of EDBP is to help lawyers avoid negligence by staying far away from minimum standards and focusing instead on the ideals, the best practices.”


3. The EDBP has ten steps. While assuredly unfair, what step contains the most controversy/novelty compared to business as usual in the current e-Discovery world?

“None really. That’s the beauty of it. The EDBP just documents what attorneys already do. The only thing controversial about it, if you want to call it that, is that it established another frame of reference for e-discovery in addition to the EDRM. It does not replace EDRM. It supplements it. Most lawyers specializing in the field will get EDBP right away.”


“I suppose you could say giving Cooperation its very own key place in a lawyer’s work flow might be somewhat controversial, but there is no denying that the rules, and best practices, require lawyers to talk to each other and at least try to cooperate. Failing that, all the judges and experts I have heard suggest that you should initiate early motion practice and not wait until the end. There seems to be widespread consensus in the e-discovery community on the key role of cooperative dialogues with opposing counsel and the court, so I do not think it is really controversial, but may still be news to the larger legal community. In fact, all of these best practices may not be well-known to the average Joe Litigator, which just shows the strong need for an educational resource like EDBP.”

4. Why not use “information governance” instead of “litigation readiness” on the far left hand side of the EDBP?

“There is far more to getting a client ready for litigation than helping them with their information governance. Plus, remember, this is not a workflow for vendors or management or records managers. It is not a model for an entire e-discovery team. This is a workflow only for what lawyers do.”

5. Given your recent, polarizing article urging law firms to get out of the eDiscovery business, how does the EDBP model either help or hinder that exhortation?

“This article was part of my attempt to clarify the line between legal e-discovery services and non-legal e-discovery services. EDBP is a part of that effort because it is only concerned with the law. It does not include non-legal services. As a practicing lawyer my core competency is legal advice, not processing ESI and software. Many lawyers agree with me on this, so I don’t think my article was polarizing so much as exposing, kind of like the young kid who pointed out that the emperor had no clothes.”

“The professionals in law firm lit-support departments will eventually calm down when they realize no jobs are lost in this kind of outsourcing, and it all stays in the country. The work just moves from law firms, which also do some e-discovery, to businesses, most of which do only e-discovery. I predict that when this kind of outsourcing catches on, it will be common for the vendor with the outsourcing contract to hire as many of the law firm’s lit-support professionals as possible.”

“My emperor’s-no-clothes exposé applies to the vendor side of the equation too. Vendors, like law firms, should stick to their core competence and stay away from providing legal advice. The unauthorized practice of law (UPL) is a serious matter; in most states it is a crime. Many vendors may well be competent to provide legal services, but they do not have a license to do so, not to mention their lack of malpractice insurance.”

“I am trying to help the justice system by clarifying and illuminating the line between law and business. It has become way too blurred to the detriment of both. Much of this fault lies on the lawyer-side as many seem quite content to unethically delegate their legal duties to non-lawyers, rather than learn this new area of law. I am all for the team approach. I have been advocating it for years in e-DiscoveryTeam.com. But each member of the team should know their strengths and limitations and act accordingly. We all have different positions to play on the team. We cannot all be quarterbacks.”

6. [Bonus Question] “EDBP” doesn’t just roll off the tongue. Given your prolific creativity (I seem to recall hamsters on a trapeze at one point in time), did you spend any cycles on a more mellifluous name for the new model?

“There are not many four-letter dot-com domain names out there for purchase, and none for free, and I did not want to settle for dot-net like EDRM did. I am proud, and a tad poorer, to have purchased what I think is a very good four-letter domain name, EDBP.com. After a few years EDBP will flow off your tongue too; after all, it has an internal rhyme – ED BP. Just add a slight pause to the name, ED … BP, and it flows pretty well, thank you.”

Thanks Ralph.  We look forward to seeing how this new model gains traction. Best of luck.

New Gartner Report Spotlights Significance of Email Archiving for Defensible Deletion

Thursday, November 1st, 2012

Gartner recently released a report that spotlights the importance of using email archiving as part of an organization’s defensible deletion strategy. The report – Best Practices for Using Email Archiving to Eliminate PST and Mailbox Quota Headaches (Alan Dayley, September 21, 2012) – specifically focuses on the information retention and eDiscovery challenges associated with email storage on Microsoft Exchange and how email archiving software can help address these issues. As Gartner makes clear in its report, an archiving solution can provide genuine opportunities to reduce the costs and risks of email hoarding.

The Problem: PST Files

The primary challenge that many organizations are experiencing with Microsoft Exchange email is the unchecked growth of messages stored in personal storage table (PST) files. Used to bypass storage quotas on Exchange, PST files are problematic because they increase the costs and risks of eDiscovery while circumventing information retention policies.

That the unrestrained growth of PST files could create problems downstream for organizations should come as no surprise. Various court decisions have addressed this issue, with the DuPont v. Kolon Industries litigation foremost among them. In the DuPont case, a $919 million verdict and 20-year product injunction largely stemmed from the defendant’s inability to prevent the destruction of thousands of pages of email formerly stored in PST files. That spoliation resulted in an adverse inference instruction to the jury and the ensuing verdict against the defendant.

The Solution: Eradicate PSTs with the Help of Archiving Software and Retention Policies

To address the PST problem, Gartner suggests following a three-step process to help manage and then eradicate PSTs from the organization. This includes educating end users regarding both the perils of PSTs and the ease of access to email through archiving software. It also involves disabling the creation of new PSTs, a process that should ultimately culminate with the elimination of existing PSTs.

In connection with this process, Gartner suggests deployment of archiving software with a “PST management tool” to facilitate the eradication process. With the assistance of the archiving tool, existing PSTs can be discovered and migrated into the archive’s central data repository. Once there, email retention policies can begin to expire stale, useless and even harmful messages that were formerly outside the company’s information retention framework.
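The expiration step might be sketched as follows. The classification names and retention periods below are invented for illustration; an actual policy would come out of the stakeholder process described next:

```python
from datetime import date, timedelta

# Hypothetical retention periods by message classification, in days.
RETENTION_POLICY = {
    "business-record": 7 * 365,
    "routine": 3 * 365,
    "transient": 90,
}

def is_expired(message_date, classification, today):
    """True when an archived message has outlived its retention period."""
    period = timedelta(days=RETENTION_POLICY[classification])
    return today - message_date > period

today = date(2012, 11, 1)
archive = [
    (date(2005, 3, 14), "routine"),         # well past 3 years -> expire
    (date(2012, 1, 9), "transient"),        # past 90 days      -> expire
    (date(2011, 6, 2), "business-record"),  # within 7 years    -> retain
]

expired = [m for m in archive if is_expired(m[0], m[1], today)]
print(len(expired))  # messages eligible for defensible deletion
```

In practice the archive would also check each message against active legal holds before expiring anything.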

With respect to the development of retention policies, organizations should consider engaging in a cooperative internal process involving IT, compliance, legal and business units. These key stakeholders must be engaged and collaborate if workable policies are to be created. The actual retention periods should take into account the types of email generated and received by an organization, along with the enterprise’s business, industry and litigation profile.

To ensure successful implementation of such retention policies and also address the problem of PSTs, an organization should explore whether an on-premises or cloud archiving solution is a better fit for its environment. While each method has its advantages, Gartner advises organizations to consider whether certain key features are included with a particular offering:

  • Email classification. The archiving tool should allow your organization to classify and tag emails in accordance with your retention policy definitions, including user-selected, user/group, or keyword tagging.
  • User access to archived email. The tool must also give end users appropriate and user-friendly access to their archived email, thus eliminating concerns over their inability to manage their email storage with PSTs.
  • Legal and information discovery capabilities. The search, indexing, and e-discovery capabilities of the archiving tool should also match your needs or enable integration into corporate e-discovery systems.
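To make the keyword-tagging idea above concrete, here is a minimal, hypothetical sketch in Python. The rule names, retention categories, and function are invented for illustration only and do not reflect any particular archiving product:

```python
# Hypothetical keyword-tagging rules mapping messages to retention categories.
# Rules are checked in order; an empty keyword tuple acts as the fallback.
RETENTION_RULES = [
    ("legal-hold", ("subpoena", "litigation", "hold")),
    ("finance-7yr", ("invoice", "purchase order", "expense")),
    ("default-2yr", ()),  # catch-all retention category
]

def classify_email(subject, body):
    """Return the first retention tag whose keywords appear in the message.

    A toy sketch of the keyword-tagging approach described above; a real
    archiving tool would also support user-selected and group-based tags.
    """
    text = f"{subject} {body}".lower()
    for tag, keywords in RETENTION_RULES:
        if not keywords or any(k in text for k in keywords):
            return tag
    return "default-2yr"
```

In practice, tags like these are what allow the archive's expiry engine to apply different retention periods to different classes of email, rather than treating the mail store as one undifferentiated mass.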

While perhaps not a panacea for the storage and eDiscovery problems associated with email, on-premises or cloud archiving software should provide various benefits to organizations. Indeed, such technologies have the potential to help organizations store, manage and discover their email efficiently, cost effectively and in a defensible manner. Where properly deployed and fully implemented, organizations should be able to reduce the nettlesome costs and risks connected with email.

Judicial Activism Taken to New Heights in Latest EORHB (Hooters) Predictive Coding Case

Monday, October 29th, 2012

Ralph Losey, an attorney for Jackson Lewis, reported last week that a Delaware judge took matters into his own hands by proactively requiring both parties to show cause as to why they should not use predictive coding technology to manage electronic discovery. Predictive coding advocates around the globe will eagerly trumpet Judge Laster’s move as another judicial stamp of approval for predictive coding, much the same way proponents lauded Judge Peck’s order in Da Silva Moore, et al. v. Publicis Groupe, et al. In Da Silva Moore, Judge Peck stated that computer-assisted review is “acceptable in appropriate cases.” In stark contrast to Da Silva Moore, the parties in EORHB, Inc., et al. v. HOA Holdings, LLC not only never agreed to use predictive coding technology, but there is no indication they ever initiated the discussion with one another, let alone with Judge Laster. In addition to attempting to dictate the technology tool to be used, Judge Laster also directed the parties to use the same vendor. Apparently, Judge Laster not only has the looks of Agent 007, he shares James Bond’s bold demeanor as well.

Although many proponents of predictive coding technology will see Judge Laster’s approach as an important step toward broader acceptance of predictive coding, the directive may sound alarm bells for others. The approach contradicts the apparent judicial philosophy applied in Kleen Products, LLC, et al. v. Packaging Corporation of America, et al., a case from the Northern District of Illinois also addressing the use of predictive coding technology. During one of many hearings between the parties in Kleen, Judge Nan Nolan stated that “the defendant under Sedona 6 has the right to pick the [eDiscovery] method.” Judge Nolan’s statement is a nod to Principle 6 of the Sedona Best Practices Recommendations & Principles for Addressing Electronic Document Production, which states:

“[r]esponding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”

Many attorneys shudder at the notion that the judiciary should choose (or at least strongly urge) the specific technology tools parties must use during discovery. The concern is based largely on the belief that many judges lack familiarity with the wide range of eDiscovery technology tools that exist today. For example, keyword search, concept search, and email threading represent only a few of the many tools in the litigator’s tool belt that can be used in conjunction with predictive coding to accelerate document review and analysis. The current challenge is that predictive coding technology is relatively new to the legal industry and considerably more complex than some of the older tools in the litigator’s tool belt. Not surprisingly, this complexity, combined with an onslaught of new entrants to the predictive coding market, has generated significant confusion about how to use predictive coding tools properly.

Current market confusion is precisely what Judge Laster and the parties in EORHB must overcome in order to successfully advance the adoption of predictive coding tools within the legal community. Key to the success of this mission is the recognition that predictive coding pitfalls are not always easy to identify, let alone avoid. However, if these pitfalls are properly identified and navigated, then Judge Laster’s mission may be possible.

Identifying pitfalls is challenging because industry momentum has led many to erroneously assume that all predictive coding tools work the same way. The momentum has been driven by the potential for organizations to save millions in document review costs with predictive coding technology. As a result, vendors are racing to market at breakneck speed to offer their own brand of predictive coding technology. Those without their own solutions are rapidly forming partnerships with those who have offerings so they too can capitalize on the predictive coding financial bonanza that many believe is around the corner. This rush to market has left the legal and academic communities with little time to build consensus about the best way to properly vet a wide range of new technology offerings.  More specifically, the predictive coding craze has fostered an environment where there is often a lack of scrutiny related to individual predictive coding tools.

The harsh reality is that not all predictive coding tools are created equal. For example, some providers erroneously call their solution “predictive coding technology” when the solution they offer is merely a type of clustering and/or concept searching technology that has been commonly used for over a decade. Even among predictive coding tools that are perceived as legitimate, pricing varies so widely that using some tools may not even be economically feasible considering the value of the case at hand. Some solution providers charge a premium to use their predictive coding tools and require additional expenditures in the form of consulting fees, while other tools are integrated within easy-to-use eDiscovery platforms at no additional cost.

If the court and parties decide that using predictive coding technology in EORHB makes economic sense, they must understand the importance of statistics and transparency to ensure a level playing field. The widespread belief that all predictive coding technologies surpass the accuracy of human review is a pervasive misperception that continues to drive confusion in the industry. The assumption is false not only because these tools must be used correctly to yield reliable results, but also because the underlying statistical methodology applied by the tools must be sound for the tools to work properly and exceed the accuracy of human review. (See Predictive Coding for Dummies for a more comprehensive explanation of predictive coding and statistics.)

The underlying statistical methodology utilized by most tools today is almost always unclear, which should automatically raise red flags for Judge Laster. In fact, this lack of transparency has led many to characterize most predictive coding tools as “black box” technologies, meaning that inadequate information about how the tools apply statistics makes it difficult to trust the results. There are differing schools of thought about the proper application of statistics in predictive coding that have largely been ignored to date. Hopefully, Judge Laster and the parties will use the present case as an opportunity to clarify some of this confusion so that the adoption of predictive coding technology within the legal community is accelerated in a way that involves sufficient scrutiny of the processes and tools used.
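As an illustration of the kind of statistical scrutiny at issue, the sketch below shows one common-sense validation approach, assuming a simple random sample of documents is reviewed by humans and compared against a tool’s responsiveness predictions. The function name and the 95% margin-of-error formula for a sampled proportion are illustrative only, not any vendor’s actual methodology:

```python
import math
import random

def sample_validation(predictions, truth, sample_size=400, seed=7):
    """Estimate precision and recall of a predictive coding run by
    reviewing a simple random sample of documents (hypothetical sketch).

    predictions: list of bools, the tool's responsive/non-responsive calls
    truth:       list of bools, the human reviewers' calls on sampled docs
    """
    rng = random.Random(seed)
    ids = rng.sample(range(len(predictions)), sample_size)
    tp = fp = fn = 0
    for i in ids:
        if predictions[i] and truth[i]:
            tp += 1            # tool and human agree: responsive
        elif predictions[i]:
            fp += 1            # tool said responsive, human disagreed
        elif truth[i]:
            fn += 1            # tool missed a responsive document
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Worst-case 95% margin of error for a proportion from n sampled docs
    moe = 1.96 * math.sqrt(0.25 / sample_size)
    return precision, recall, moe
```

A sample of 400 documents yields a margin of error of roughly plus or minus 5 points, which is why transparency about sample sizes and sampling method matters: without it, a reported recall figure is difficult to credit.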

Judge Laster and the parties in EORHB are presented with a unique opportunity to address many important issues related to the use of predictive coding technology that are often misunderstood and overlooked. Hopefully, the parties will use predictive coding technology and engage in a dialogue that highlights the importance of selecting the right predictive coding tool, using that tool correctly, and applying statistics properly. If the court and the parties shed light on these three areas, Judge Laster’s predictive coding mission may be possible.

Many Practitioners “Dazed and Confused” over Electronic Discovery Definitions

Wednesday, October 24th, 2012

The song “Dazed and Confused,” by legendary rock band Led Zeppelin, has a great stanza:

 Been Dazed and Confused for so long it’s not true.

Wanted a woman, never bargained for you.

Lots of people talk and few of them know, soul of a woman was created below.

As I recently surveyed the definitions for “eDiscovery,” it occurred to me that lots of folks talk as if they know the definition, but few likely appreciate many of the subtle nuances. And, if you forced them to, many wouldn’t be able to write a concise eDiscovery definition.

The first, obvious place to look for an eDiscovery north star is the EDRM, which was originally responsible for creating the lingua franca for the entire industry.

EDRM (Electronic Discovery definition)

  • “Discovery documents produced in electronic formats rather than hardcopy. The production may be contained on hard drives, tapes, CDs, DVDs, external hard drives, etc. Once received, these documents are converted to .tif format. It is during the conversion process that metadata can be extracted.
  • A process that includes electronic documents and email into a collection of ‘discoverable’ documents for litigation. Usually involves both software and a process that searches and indexes files on hard drives or other electronic media. Extracts metadata automatically for use as an index. May include conversion of electronic documents to an image format as if the document had been printed out and then scanned.
  • The discovery of electronic documents and data including e-mail, Web pages, word processing files, computer databases, and virtually anything that is stored on a computer. Technically, documents and data are ‘electronic’ if they exist in a medium that can only be read through the use of computers. Such media include cache memory, magnetic disks (such as computer hard drives or floppy disks), optical disks (such as DVDs or CDs), and magnetic tapes. 
  • The process of finding, identifying, locating, retrieving, and reviewing potentially relevant data in designated computer systems.”

Gartner, the large IT analyst firm, proffers a different version.

Gartner (E-discovery definition)

“E-discovery is the identification, preservation, collection, preparation, review and production of electronically stored information associated with legal and government proceedings. The e-discovery market is not unified or simple — significant differences exist among vendors and service providers regarding technologies, specialized markets, overall functionality and service offerings. Content and records management, information access and search, and e-mail archiving and retention technologies provide key foundations to the e-discovery function. More and more enterprises are looking to insource at least part of the e-discovery function, especially records management, identification, preservation and collection of electronic files. E-discovery technology can be provided as a stand-alone application, embedded in other applications or services, or accessed as a hosted offering.”

The Sedona Conference, which is the leading think tank on all things eDiscovery, has the following definition:

Sedona (Electronic Discovery/Discovery definition)

“Electronic Discovery (“E-Discovery”): The process of identifying, preserving, collecting, preparing, reviewing, and producing electronically stored information (“ESI”) in the context of the legal process. See Discovery.”

“Discovery: Discovery is the process of identifying, locating, securing, and producing information and materials for the purpose of obtaining evidence for utilization in the legal process. The term is also used to describe the process of reviewing all materials that may be potentially relevant to the issues at hand and/or that may need to be disclosed to other parties, and of evaluating evidence to prove or disprove facts, theories, or allegations. There are several ways to conduct discovery, the most common of which are interrogatories, requests for production of documents, and depositions.”

Looking at these in concert, a few things come into focus, aside from the vexingly diverse naming conventions. First, the EDRM definition focuses (as some might expect) on the tactics and practice of eDiscovery. This is a useful starting place, but it omits other elements, like the overall market dynamics, which are discussed (again, not surprisingly) by Gartner. Gartner likewise addresses how eDiscovery is accomplished, referencing the need for software and the escalating trend of taking eDiscovery tools in-house. Sedona (coming from a legal theory perspective) relies heavily on the legal definition of “discovery,” properly referencing its context in the legal process, a fact sometimes lost on practitioners who have expanded eDiscovery into other non-legal avenues.

These definitions are fine in the abstract, but even collectively they nevertheless fail to take into account several key points. First, as eDiscovery is quickly subsumed into the larger information governance umbrella, it’s important to stress the historically reactive nature of eDiscovery. This reactive posture can be nicely contrasted with the upstream concepts of information management and governance, which significantly impact the downstream, reactive elements.

Next, it’s important to recognize the costs and risks inherent in the eDiscovery process. Whether due to spoliation sanctions or simply the expense of eDiscovery itself (which can easily run $1.5M per matter), the potential impact to the organization can’t be ignored. Without a true grasp of these organizational costs and risks, entities can’t properly begin to deploy either reactive or proactive solutions, since they won’t have enough data for comprehensive ROI calculations. Finally, eDiscovery as a term has started to experience scope creep. What used to be firmly tethered to the legal discovery process has recently expanded into use cases where the process is deployed in a number of similar (but non-legal) scenarios such as internal investigations, governmental inquiries, FOIA requests, FCPA matters, etc.

These additional aspects are critical for developing a comprehensive understanding of eDiscovery. And, while a comprehensive definition isn’t the final end game to this complex challenge, it’s certainly a better starting place than being “dazed and confused” about the nuances of eDiscovery.  Eliminating unnecessary confusion early in the game is ultimately essential to promoting and not hindering long term initiatives.

Federal Directive Hits Two Birds (RIM and eDiscovery) with One Stone

Thursday, October 18th, 2012

The eagerly awaited Directive from the Office of Management and Budget (OMB) and the National Archives and Records Administration (NARA) was released at the end of August. In an attempt to go behind the scenes, we’ve asked the Project Management Office (PMO) and the Chief Records Officer at NARA to respond to a few key questions.

We know that the Presidential Mandate was the impetus for the agency self-assessments that were submitted to NARA. Now that NARA and the OMB have distilled those reports, what are the biggest challenges going forward for the government regarding recordkeeping, information governance and eDiscovery?

“In each of those areas, the biggest challenge that can be identified is the rapid emergence and deployment of technology. Technology has changed the way Federal agencies carry out their missions and create the records required to document that activity. It has also changed the dynamics in records management. In the past, agencies would maintain central file rooms where records were stored and managed. Now, with distributed computing networks, records are likely to be in a multitude of electronic formats, on a variety of servers, and exist as multiple copies. Records management practices need to move forward to solve that challenge. If done right, good records management (especially of electronic records) can also be of great help in providing a solid foundation for applying best practices in other areas, including in eDiscovery, FOIA, as well as in all aspects of information governance.”    

What is the biggest action item from the Directive for agencies to take away?

“The Directive creates a framework for records management in the 21st century that emphasizes the primacy of electronic information and directs agencies to begin transforming their current processes to identify and capture electronic records. One milestone is that by 2016, agencies must be managing their email in an electronically accessible format (with tools that make this possible, not printing out emails to paper). Agencies should begin planning for the transition, where appropriate, from paper-based records management processes to those that preserve records in an electronic format.

The Directive also calls on agencies to designate a Senior Agency Official (SAO) for Records Management by November 15, 2012. The SAO is intended to raise the profile of records management in an agency to ensure that each agency commits the resources necessary to carry out the rest of the goals in the Directive. A meeting of SAOs is to be held at the National Archives with the Archivist of the United States convening the meeting by the end of this year. Details about that meeting will be distributed by NARA soon.”

Does the Directive holistically address information governance for the agencies, or is it likely that agencies will continue to deploy different technology even within their own departments?

“In general, as long as agencies are properly managing their records, it does not matter what technologies they are using. However, one of the drivers behind the issuance of the Memorandum and the Directive was identifying ways in which agencies can reduce costs while still meeting all of their records management requirements. The Directive specifies actions (see A3, A4, A5, and B2) in which NARA and agencies can work together to identify effective solutions that can be shared.”

Finally, although FOIA requests have increased and the backlog has decreased, how will litigation and FOIA intersect over, say, the next five years? We know from the retracted decision in NDLON that metadata still remains an issue for the government…are we getting to a point where records created electronically will be able to be produced electronically as a matter of course for FOIA litigation/requests?

“In general, an important feature of the Directive is that the Federal government’s record information – most of which is in electronic format – stays in electronic format. Therefore, all of the inherent benefits will remain as well – i.e., metadata being retained, easier and speedier searches to locate records, and efficiencies in compilation, reproduction, transmission, and reduction in the cost of producing the requested information. This all would be expected to have an impact in improving the ability of federal agencies to respond to FOIA requests by producing records in electronic formats.”

Fun Fact: Is NARA really saving every tweet produced?

“Actually, the Library of Congress is the agency that is preserving Twitter. NARA is interested in only preserving those tweets that a) were made or received in the course of government business and b) appraised to have permanent value. We talked about this on our Records Express blog.”

“We think President Barack Obama said it best when he made the following comment on November 28, 2011:

“The current federal records management system is based on an outdated approach involving paper and filing cabinets. Today’s action will move the process into the digital age so the American public can have access to clear and accurate information about the decisions and actions of the Federal Government.” Paul Wester, Chief Records Officer at the National Archives, has stated that this Directive is very exciting for the Federal records management community: “In our lifetime none of us has experienced the attention to the challenges that we encounter every day in managing our records management programs like we are seeing now. These are very exciting times to be a records manager in the Federal government. Full implementation of the Directive by the end of this decade will take a lot of hard work, but the government will be better off for doing this and we will be better able to serve the public.”

Special thanks to NARA for the ongoing dialogue that is key to transparent government and the effective practice of eDiscovery, Freedom Of Information Act requests, records management and thought leadership in the government sector. Stay tuned as we continue to cover these crucial issues for the government as they wrestle with important information governance challenges. 


Defensible Deletion: The Cornerstone of Intelligent Information Governance

Tuesday, October 16th, 2012

The struggle to stay above the rising tide of information is a constant battle for organizations. Not only are the costs and logistics associated with data storage more troubling than ever, but so are the potential legal consequences. Indeed, the news headlines are constantly filled with horror stories of jury verdicts, court judgments and unreasonable settlements involving organizations that failed to effectively address their data stockpiles.

While there are no quick or easy solutions to these problems, an increasingly popular method for dealing with them is an organizational strategy referred to as defensible deletion. A defensible deletion strategy can encompass many things, but at its core it is a comprehensive approach that companies implement to reduce the storage costs and legal risks associated with the retention of electronically stored information (ESI). Organizations that have implemented such a strategy have been successful in avoiding court sanctions while at the same time eliminating ESI that has little or no business value.

The first step to implementing a defensible deletion strategy is for organizations to ensure that they have a top-down plan for addressing data retention. This typically requires that their information governance principals – legal and IT – are cooperating with each other. These departments must also work jointly with records managers and business units to decide what data must be kept and for what length of time. All such stakeholders in information retention must be engaged and collaborate if the organization is to create a workable defensible deletion strategy.

Cooperation between legal and IT naturally leads the organization to establish records retention policies, which carry out the key players’ decisions on data preservation. Such policies should address the particular needs of an organization while balancing them against litigation requirements. Not only will that enable a company to reduce its costs by decreasing data proliferation, it will minimize a company’s litigation risks by allowing it to limit the amount of potentially relevant information available for current and follow-on litigation.

In like manner, legal should work with IT to develop a process for how the organization will address document preservation during litigation. This will likely involve the designation of officials who are responsible for issuing a timely and comprehensive litigation hold to custodians and data sources. This will ultimately help an organization avoid the mistakes that often plague document management during litigation.

The Role of Technology in Defensible Deletion

In the digital age, an essential aspect of a defensible deletion strategy is technology. Indeed, without innovations such as archiving software and automated legal hold acknowledgements, it will be difficult for an organization to achieve its defensible deletion objectives.

On the information management side of defensible deletion, archiving software can help enforce organization retention policies and thereby reduce data volume and related storage costs. This can be accomplished with classification tools, which intelligently analyze and tag data content as it is ingested into the archive. By so doing, organizations may retain information that is significant or that otherwise must be kept for business, legal or regulatory purposes – and nothing else.

An archiving solution can also reduce costs through efficient data storage. By expiring data in accordance with organization retention policies and by using single instance storage to eliminate ESI duplicates, archiving software frees up space on company servers for the retention of other materials and ultimately leads to decreased storage costs. Moreover, it also lessens litigation risks as it removes data available for future litigation.
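As a rough illustration of how single instance storage eliminates ESI duplicates, consider this toy sketch, in which identical message bodies hash to the same digest and the payload is stored only once. The class and method names are hypothetical, not drawn from any archiving product:

```python
import hashlib

class SingleInstanceStore:
    """Toy illustration of single instance storage: identical message
    bodies are stored once and shared by reference (hypothetical sketch)."""

    def __init__(self):
        self._blobs = {}  # content hash -> message body (stored once)
        self._refs = {}   # message id -> content hash (cheap pointer)

    def ingest(self, msg_id, body):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._blobs.setdefault(digest, body)  # keep payload only once
        self._refs[msg_id] = digest

    def stored_bytes(self):
        """Bytes actually consumed by unique message bodies."""
        return sum(len(b) for b in self._blobs.values())
```

When the same attachment-laden message is sent to dozens of custodians, only one copy consumes disk space, which is where much of the storage savings described above comes from.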

On the eDiscovery side of defensible deletion, an eDiscovery platform with the latest in legal hold technology is often essential for enabling a workable litigation hold process. Effective platforms enable automated legal hold acknowledgements from custodians across multiple cases. This allows organizations to confidently place data on hold through a single user action and eliminates concerns that ESI may slip through the proverbial cracks of manual hold practices.
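The acknowledgement-tracking idea can be sketched in a few lines. This is a hypothetical illustration of the workflow, not any vendor’s implementation: one action places all custodians on hold for a matter, and the system then tracks who has and has not acknowledged it.

```python
class LegalHoldTracker:
    """Toy sketch of automated legal hold tracking: a single action
    places custodians on hold; acknowledgements are recorded per matter."""

    def __init__(self):
        self._holds = {}  # matter -> {custodian: acknowledged?}

    def issue_hold(self, matter, custodians):
        """One user action notifies every custodian on the matter."""
        self._holds[matter] = {c: False for c in custodians}

    def acknowledge(self, matter, custodian):
        self._holds[matter][custodian] = True

    def outstanding(self, matter):
        """Custodians who have not yet acknowledged the hold, so
        follow-up reminders can be targeted rather than manual."""
        return sorted(c for c, ack in self._holds[matter].items() if not ack)
```

The point of automating this bookkeeping is exactly the one made above: no custodian slips through the cracks between hold issuance and acknowledgement.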

Organizations are experiencing every day the costly mistakes of delaying implementation of a defensible deletion program. This trend can be reversed through a common sense defensible deletion strategy which, when powered by effective, enabling technologies, can help organizations decrease the costs and risks associated with the information explosion.

How to Keep “Big Data” From Turning into “Bad Data,” Resulting in eDiscovery and Information Governance Risks

Wednesday, October 10th, 2012

In a recent Inside Counsel article, I explored the tension between big data and the potentially competing notion of information governance by looking at the 5 Vs of Big Data…

“The Five Vs” of Big Data 

1.  Volume: Volume, not surprisingly, is the hallmark of the big data concept. Since data creation doubles every 18 months, we’ve rapidly moved from a gigabyte world to a universe where terabytes and exabytes rule the day.  In fact, according to a 2011 report from the McKinsey Global Institute, numerous U.S. companies now have more data stored than the U.S. Library of Congress, which has more than 285 terabytes of data (as of early this year). And to complicate matters, this trend is escalating exponentially with no reasonable expectation of abating. 

2. Velocity: According to the analyst firm Gartner, velocity can be thought of in terms of “streams of data, structured record creation, and availability for access and delivery.” In practical terms, this means organizations must constantly address a torrential flow of data into and out of their information management systems. Take Twitter, for example, where it’s possible to see more than 400 million tweets per day. As with the first V, data velocity isn’t slowing down anytime soon, either.

3. Variety: Perhaps more vexing than both the volume and velocity issues, the Variety element of big data increases complexity exponentially as organizations must account for data sources/types that are moving in different vectors. Just to name a few variants, most organizations routinely must wrestle with structured data (databases), unstructured data (loose files/documents), email, video, static images, audio files, transactional data, social media, cloud content and more.

4. Value: A more novel big data concept, value hasn’t typically been part of the standard definition. Here, the critical inquiry is whether the retained information is valuable either individually or in combination with other data elements capable of rendering patterns and insights. Given the rampant existence of spam, non-business data (like fantasy football emails) and duplicative content, it’s easy to see that just because data may have the other three Vs, it isn’t inherently valuable from a big data perspective.

5. Veracity: Particularly in an information governance era, it’s vital that the big data elements have the requisite level of veracity (or integrity). In other words, specific controls must be put in place to ensure that the integrity of the data is not impugned. Otherwise, any subsequent usage (particularly for a legal or regulatory proceeding, like e-discovery) may be unnecessarily compromised.”

“Many organizations sadly aren’t cognizant of the lurking tensions between the rapid acceleration of big data initiatives and other competing corporate concerns around important constructs like information governance. Latent information risk is a byproduct of keeping too much data, given the resulting exposure to e-discovery costs/sanctions, potential security breaches and regulatory investigations. As evidence of this potential information liability, it costs only $0.20 a day to manage 1GB of storage. Yet, according to a recent RAND survey, it costs $18,000 to review that same gigabyte for e-discovery purposes.”
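Using the figures quoted above, a quick back-of-envelope calculation shows the scale of the gap. The helper below is hypothetical and assumes a single full review pass over the data:

```python
def latent_risk(gb, storage_cost_per_gb_day=0.20,
                review_cost_per_gb=18_000, days=365):
    """Back-of-envelope comparison using the figures cited above:
    one year of storage versus one pass of e-discovery review."""
    storage = gb * storage_cost_per_gb_day * days  # annual storage cost
    review = gb * review_cost_per_gb               # one-time review cost
    return storage, review, review / storage
```

For a single gigabyte, a year of storage runs about $73 while one review pass runs $18,000, meaning reviewing the data costs roughly 250 times what storing it for a year does. That ratio is the latent information risk in a nutshell: the cheap decision to keep data defers a far larger bill to the day it must be reviewed.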

For more on this topic, click here.