Archive for the ‘transparency’ Category

2012: Year of the Dragon – and Predictive Coding. Will the eDiscovery Landscape Be Forever Changed?

Monday, January 23rd, 2012

2012 is the Year of the Dragon – which is fitting, since no other Chinese Zodiac sign better represents the promise, challenge, and evolution of predictive coding technology.  The few who have embraced predictive coding exemplify the Dragon’s symbolic traits: they are unafraid of challenges and willing to take risks.  Taking risks typically isn’t in a lawyer’s DNA, which might explain why predictive coding technology has seen lackluster adoption among lawyers despite the hype.  This post explores the promise of predictive coding technology, explains why it has not been widely adopted in eDiscovery, and argues that 2012 is likely to be remembered as the year of predictive coding.

What is predictive coding?

Predictive coding refers to machine learning technology that can be used to automatically predict how documents should be classified based on limited human input.  In litigation, predictive coding technology can be used to rank and then “code” or “tag” electronic documents based on criteria such as “relevance” and “privilege” so organizations can reduce the amount of time and money spent on traditional page-by-page attorney document review during discovery.

Generally, the technology works by ranking documents so that the most important ones are prioritized for review.  In addition to helping attorneys find important documents faster, this ranking can, in certain situations, even eliminate the need to review the lowest-ranked documents.  Additionally, since computers don’t get tired or daydream, many believe computers can even predict document relevance better than their human counterparts.
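To make the ranking idea concrete, here is a minimal sketch in Python.  It is not how any particular product works: the four-document seed set, the function names, and the smoothed log-odds term weighting are all assumptions chosen for illustration.  The point is only that a handful of human coding decisions can be turned into a score that orders the unreviewed documents.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """Learn a smoothed log-odds weight per term from (text, is_relevant) pairs."""
    rel, non = Counter(), Counter()
    for text, is_relevant in labeled_docs:
        (rel if is_relevant else non).update(tokenize(text))
    vocab = set(rel) | set(non)
    rel_total = sum(rel.values()) + len(vocab)
    non_total = sum(non.values()) + len(vocab)
    return {
        t: math.log((rel[t] + 1) / rel_total) - math.log((non[t] + 1) / non_total)
        for t in vocab
    }

def score(weights, text):
    # Terms never seen during training contribute nothing to the score.
    return sum(weights.get(t, 0.0) for t in tokenize(text))

def rank(weights, docs):
    """Return documents ordered from most to least likely relevant."""
    return sorted(docs, key=lambda d: score(weights, d), reverse=True)

# Hypothetical seed set coded by a senior attorney.
seed = [
    ("pricing agreement with competitor", True),
    ("competitor pricing call notes", True),
    ("lunch menu for friday", False),
    ("office party friday", False),
]
weights = train(seed)

unreviewed = [
    "friday lunch plans",
    "notes on competitor pricing",
    "parking garage closed",
]
ranked = rank(weights, unreviewed)
for doc in ranked:
    print(f"{score(weights, doc):+.2f}  {doc}")
```

In this toy run the document about competitor pricing rises to the top and the lunch email sinks to the bottom, which is the behavior that lets reviewers start with the likeliest candidates.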

Why hasn’t predictive coding gone mainstream yet?

Given the promise of faster and less expensive document review, combined with higher accuracy rates, many are perplexed as to why predictive coding technology hasn’t been widely adopted in eDiscovery.  The answer really boils down to one simple concept – a lack of transparency.

Difficult to Use

First, early predictive coding tools attempt to apply a complicated new technological approach to a document review process that has traditionally been very simple.  Instead of relying on attorneys to read each and every document to determine relevance, the success of today’s predictive coding technology typically depends on review decisions input into a computer by one or more experienced senior attorneys.  The process commonly involves a complex series of steps that include sampling, testing, reviewing, and measuring results in order to fine tune an algorithm that will eventually be used to predict the relevancy of the remaining documents.

The problem with early predictive coding technologies is that the majority of these complex steps are done in a ‘black box’.  In other words, the methodology and results are not always clear, which increases the risk of human error and makes the integrity of the electronic discovery process difficult to defend.  For example, the methodology for selecting a statistically valid sample is not always intuitive to the end user.  This fundamental problem could result in improper sampling techniques that taint the accuracy of the entire process.  Similarly, the process must often be repeated several times in order to improve accuracy rates.  Even if accuracy is improved, it may be difficult or impossible to explain how accuracy thresholds were determined or why coding decisions were applied to some documents and not others.

Accuracy Concerns

Early predictive coding tools also tend to lack transparency in the way the technology evaluates the language contained in each document.  Instead of evaluating both the text and metadata fields within a document, some technologies actually ignore document metadata.  This omission means a privileged email sent by a client to her attorney, Larry Lawyer, might be overlooked by the computer if the name “Larry Lawyer” is only part of the “recipient” metadata field of the document and isn’t part of the document text.  The obvious risk is that this situation could lead to privilege waiver if it is inadvertently produced to the opposing party.
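The Larry Lawyer scenario is easy to reproduce in a few lines of Python.  The sketch below is purely illustrative: the dictionary layout, the function names, and the sender “Claire Client” are assumptions, and a real privilege screen would be far more involved than a name match.

```python
def searchable_text(doc, include_metadata):
    """Build the text a tool would analyze, with or without metadata fields."""
    parts = [doc["body"]]
    if include_metadata:
        parts.extend(str(value) for value in doc["metadata"].values())
    return " ".join(parts).lower()

def mentions_counsel(doc, counsel_names, include_metadata):
    text = searchable_text(doc, include_metadata)
    return any(name.lower() in text for name in counsel_names)

# Hypothetical email: the attorney appears only in the recipient field.
email = {
    "body": "Please advise on the attached draft before Friday.",
    "metadata": {"sender": "Claire Client", "recipient": "Larry Lawyer"},
}
counsel = ["Larry Lawyer"]

text_only = mentions_counsel(email, counsel, include_metadata=False)
with_metadata = mentions_counsel(email, counsel, include_metadata=True)
print(text_only, with_metadata)
```

Looking at the body alone misses the attorney entirely; adding the metadata fields catches him.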

Another practical concern is that some technologies do not allow reviewers to distinguish between relevant and non-relevant language within individual documents.  For example, early predictive coding technologies are not intelligent enough to know that only the second paragraph on page 95 of a 100-page document contains relevant language.  The inability to discern what language led to the determination that the document is relevant could skew results when the computer tries to identify other documents with the same characteristics, making it more likely that the computer will retrieve an over-inclusive set containing many irrelevant documents.  This problem is sometimes referred to as ‘excessive recall,’ though it is more accurately a loss of precision, and it matters because every irrelevant document retrieved adds to the volume requiring manual review, which directly increases eDiscovery cost.
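For readers who want the arithmetic behind this trade-off, precision is the share of retrieved documents that are actually relevant, and recall is the share of relevant documents that were retrieved.  The short Python sketch below uses made-up document IDs to show an over-inclusive result set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the truly relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical matter: 4 relevant documents, but the tool pulls back 10.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Here every relevant document is found, so recall is perfect, but six of the ten retrieved documents are noise that a reviewer still has to read and pay for.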

Waiver & Defensibility

Perhaps the biggest concern with early predictive coding technology is the risk of waiver and concerns about defensibility.  Notably, there have been no known judicial decisions that specifically address the defensibility of these new technology tools even though some in the judiciary, including U.S. Magistrate Judge Andrew Peck, have opined that this kind of technology should be used in certain cases.

The problem is that today’s predictive coding tools are difficult to use, too complicated for the average attorney, and opaque in the way they work.  All these limitations increase the risk of human error.  Introducing human error increases the risk of overlooking important documents or unwittingly producing privileged documents.  Similarly, it is difficult to defend a technological process that isn’t always clear in an era where many lawyers are still uncomfortable with keyword searches.  In short, using black box technology that is difficult to use and understand is perceived as risky, and many attorneys have taken a wait-and-see approach because they are unwilling to be the guinea pig.

Why is 2012 likely to be the year of predictive coding?

The word transparency may seem like a vague term, but it is the critical element missing from today’s predictive coding technology offerings.  2012 is likely to be the year of predictive coding because improvements in transparency will shine a light into the black box of predictive coding technology in a way that hasn’t been possible until now.  In simple terms, increasing transparency will simplify the user experience and improve accuracy, which will reduce longstanding concerns about defensibility and privilege waiver.

Ease of Use

First, transparent predictive coding technology will help minimize the risk of human error by incorporating an intuitive user interface into a complicated solution.  New interfaces will include easy-to-use workflow management consoles to guide the reviewer through a step-by-step process for selecting, reviewing, and testing data samples in a way that minimizes guesswork and confusion.  Automating the sampling and testing process minimizes the risk of human error, which in turn decreases the risk of waiver or discovery sanctions that could result if documents are improperly coded.  Similarly, automated reporting capabilities make it easier for producing parties to evaluate and understand how key decisions were made throughout the process, thereby making it easier for them to defend the reasonableness of their approach.

Intuitive reports also help the producing party measure and evaluate confidence levels throughout the testing process until appropriate confidence levels are achieved.  Since confidence levels can actually be measured as a percentage, attorneys and judges are in a position to negotiate and debate the desired level of confidence for a production set rather than relying exclusively on the representations or decisions of a single party.  This added transparency allows the type of cooperation between parties called for in The Sedona Conference Cooperation Proclamation and gives judges an objective tool for evaluating each party’s behavior.
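As a rough illustration of how such a percentage might be computed, the Python sketch below uses the textbook normal approximation for a proportion.  This is an assumption on my part rather than a description of any vendor’s method; it ignores finite-population corrections, and the sample counts are invented.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided z score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def sample_size(confidence, margin, p=0.5):
    """Documents to sample for a given margin of error (p=0.5 is the worst case)."""
    z = z_value(confidence)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def proportion_interval(hits, n, confidence):
    """Approximate confidence interval for the true rate, given hits in a sample of n."""
    p = hits / n
    half_width = z_value(confidence) * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# How many documents must be sampled for 95% confidence, +/- 5%?
n = sample_size(confidence=0.95, margin=0.05)

# Hypothetical check: 30 relevant documents found in a random sample of 400
# drawn from the set the tool proposed to leave unreviewed.
low, high = proportion_interval(hits=30, n=400, confidence=0.95)
print(n, f"{low:.1%} to {high:.1%}")
```

Numbers like these give the parties something concrete to negotiate: they can argue over whether 95% confidence and a five-point margin are good enough, rather than over whose word to trust.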

Accuracy & Efficiency

2012 is also likely to be the year of transparent predictive coding technology because technical limitations that have impacted the accuracy and efficiency of earlier tools will be addressed.  For example, new technology will analyze both document text and metadata to avoid the risk that responsive or privileged documents are overlooked.  Similarly, smart tagging features will enable reviewers to highlight specific language in documents to determine a document’s relevance or non-relevance so that coding predictions will be more accurate and fewer non-relevant documents will be recalled for review.

Conclusion - Transparency Provides Defensibility

The bottom line is that predictive coding technology has not enjoyed widespread adoption in the eDiscovery process due to concerns about simplicity and accuracy that breed larger concerns about defensibility.  Defending the use of black box technology that is difficult to use and understand is a risk that many attorneys simply are not willing to take, and these concerns have deterred widespread adoption of early predictive coding technology tools.  In 2012, next generation transparent predictive coding technology will usher in a new era of computer-assisted document review that is easy to use, more accurate, and easier to defend. Given these exciting technological advancements, I predict that 2012 will not only be the year of the dragon, it will also be the year of predictive coding.

Key eDiscovery Considerations for Selecting a Cloud Service Provider

Tuesday, October 25th, 2011

The data explosion that has burdened organizations across the globe for the past decade has become increasingly expensive to manage.  Many experts point to storage as the most obvious culprit for higher information governance costs.  There are, however, other factors driving those costs.  For example, demands for electronically stored information in legal and regulatory proceedings have significantly increased expenses surrounding data management.  Those demands have forced organizations to meet the high expectations that courts and regulatory bodies have for how they address their information or face the consequences.

Those consequences include sanctions and regulatory fines for groups that fail to account for how they store, manage and discover their information.  The $919 million verdict rendered in the E.I. du Pont de Nemours v. Kolon Industries case is paradigmatic of this trend.  That verdict was inextricably intertwined with the court’s instruction to the jury that executives and employees for defendant Kolon Industries deleted key evidence after the company’s preservation duty was triggered.

Going to Cloud Services for Data Archiving and eDiscovery

These rising data costs – and the risks they pose – are driving organizations to explore new technologies and methods for managing their data.  The latest alternative to traditional on-premise solutions involves leveraging cloud-based services.

The hype surrounding the cloud has generally focused on the opportunity for cheap and unlimited storage.  While cost-effective data storage is important, that factor alone should not be determinative for selecting a cloud service provider.  Organizations must have the actual – not theoretical – ability to retrieve their data and do so in real time.  Otherwise, they may not be able to satisfy legal or regulatory requests, let alone the day-to-day demands of their operations.

In an analogous context, courts have traditionally compelled paper document productions even though the requested materials may be buried in a messy warehouse.  In one such case from this year, a U.S. district court in New York ordered a company to turn over decades-old records that were commingled with other materials in poorly labeled, shrink-wrapped boxes.  The court reasoned that disorganized record-keeping should not excuse an organization from producing relevant information.  See Brooks v. Macy’s (S.D.N.Y. May 6, 2011).

The rationale from the Brooks case is equally applicable to cloud-based services.  Cloud-based data must be intelligently organized so that companies can retrieve data in a timely fashion for business and legal purposes.  Otherwise, the savings achieved through cheap storage will be negated by the resulting legal quagmire.

Paring Back Superfluous and Duplicative Information

To facilitate the data retrieval process, the right cloud service provider should have the capacity to implement and observe applicable company retention policies.  An effective retention policy will generally help a company retain information that must be kept for business, legal or regulatory purposes – and nothing else.  The service provider should enable automated retention rules to ensure that information is kept only for a designated time period.  This will allow data to be expired once it reaches the end of that period.  And by expiring that data, the company will limit the amount of potentially relevant information available for follow-on litigation.
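An automated retention rule can be as simple as a lookup table and a date comparison.  The Python sketch below is a hypothetical illustration only: the categories, the retention periods, and the item layout are invented, and a real system would also need to suspend expiry for anything subject to a legal hold.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per record category.
RETENTION_DAYS = {"email": 365 * 3, "contract": 365 * 7}

def is_expired(item, today):
    period = RETENTION_DAYS.get(item["category"])
    if period is None:
        return False  # unknown category: keep rather than guess
    return today > item["created"] + timedelta(days=period)

def expire(items, today):
    """Split items into those still retained and those past their retention period."""
    kept = [i for i in items if not is_expired(i, today)]
    expired = [i for i in items if is_expired(i, today)]
    return kept, expired

items = [
    {"id": "a", "category": "email", "created": date(2007, 1, 15)},
    {"id": "b", "category": "email", "created": date(2011, 6, 1)},
    {"id": "c", "category": "contract", "created": date(2007, 1, 15)},
]
kept, expired = expire(items, today=date(2011, 10, 25))
print([i["id"] for i in kept], [i["id"] for i in expired])
```

Only the 2007 email has outlived its three-year period, so it alone is expired; the contract from the same day stays because its category carries a longer period.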

The pool of information can also be decreased through single instance storage.  This deduplication technology eliminates redundant data by preserving only a master copy of each document placed into the cloud.  This will reduce the amount of data that needs to be identified, collected and reviewed as part of the electronic discovery process.  For while unlimited data storage may seem ideal now, reviewing unlimited amounts of data will quickly become a logistical and costly nightmare.
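One common way to implement single instance storage is to key each document by a hash of its content, so identical copies collapse into one stored object.  The Python sketch below assumes that approach; the class name and the file paths are made up for illustration.

```python
import hashlib

class SingleInstanceStore:
    """Keeps one master copy per distinct content, however many names point to it."""

    def __init__(self):
        self.blobs = {}   # content digest -> content, stored once
        self.index = {}   # document name -> content digest

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # store only if not already present
        self.index[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.index[name]]

store = SingleInstanceStore()
store.put("alice/inbox/report.doc", b"Q3 forecast")
store.put("bob/inbox/report.doc", b"Q3 forecast")   # duplicate attachment
store.put("bob/inbox/memo.doc", b"Reorg memo")

print(len(store.index), "documents,", len(store.blobs), "stored copies")
```

Three documents are tracked but only two copies are stored, and a reviewer would only need to look at two.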

Tools to Facilitate Discovery

A cloud service provider should ideally have eDiscovery functionality.  At a minimum, the service provider should be able to deploy legal holds to prevent users or automated policies from overwriting and destroying data.  Advanced search capabilities should also be included within the cloud-based service to reduce the amount of data that must be analyzed and then reviewed.  Moreover, the provider should support compatible load formats for export to third party review software.

Another key discovery issue is whether the cloud service provider can establish a clear audit trail for transmissions of company data.  Since information could be modified in transit by the routine operation of a service provider’s computer systems, an audit trail is necessary to prove that company documents and their metadata were not affected or otherwise compromised during transmission.  Without this assurance, a company may not be able to demonstrate the authenticity of its data before a tribunal or comply with key regulations.
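One plausible building block for such an audit trail is a cryptographic fingerprint taken over both the document and its metadata before transmission and checked again afterward.  The Python sketch below illustrates the idea under that assumption; the log format, function names, and sample values are hypothetical.

```python
import hashlib

def fingerprint(content, metadata):
    """SHA-256 over the content plus metadata fields in a fixed (sorted) order."""
    h = hashlib.sha256()
    h.update(content)
    for key in sorted(metadata):
        h.update(f"\n{key}={metadata[key]}".encode("utf-8"))
    return h.hexdigest()

def record_transfer(log, doc_id, content, metadata):
    log.append({"doc_id": doc_id, "sha256": fingerprint(content, metadata)})

def verify_transfer(log, doc_id, content, metadata):
    """True only if the latest logged fingerprint matches what was received."""
    expected = [entry["sha256"] for entry in log if entry["doc_id"] == doc_id]
    return bool(expected) and expected[-1] == fingerprint(content, metadata)

audit_log = []
meta = {"author": "jdoe", "modified": "2011-10-01T09:30:00Z"}
record_transfer(audit_log, "doc-1", b"contract text", meta)

unchanged = verify_transfer(audit_log, "doc-1", b"contract text", meta)
altered = verify_transfer(
    audit_log, "doc-1", b"contract text", {**meta, "modified": "2011-10-02T00:00:00Z"}
)
print(unchanged, altered)
```

Note that the second check fails even though the document body is identical: a changed modification date alone is enough to break the match, which is exactly the kind of metadata drift an audit trail is meant to expose.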

A cloud service provider that can quickly retrieve and efficiently discover data has the potential to help organizations address their legal and regulatory demands in a cost-effective manner.  Such a provider may be just the solution for organizations that are looking to properly address their runaway information governance costs.

The Business Strategy Behind Clearwell’s Transparent Concept Search

Monday, January 31st, 2011

Last fall, when Transparent Concept Search was still in development, we showed an early version of it to a group of our customers. Their excitement was palpable, and they spent most of our session together comparing notes about the varied ways they would use it. But at the end of the discussion, one of them asked the question that was on everyone’s mind: “How much will you charge for it?” Or, as someone else immediately put it, “I get charged $200/GB for plain vanilla concept search, so how much of a premium do you think you will get for this?”

Our answer surprised them: there’s no charge. Transparent Concept Search is included in Clearwell for free. Here’s why doing that makes sense:

There are two business strategies in the technology industry which are proven to work. One is to be the low-cost provider and compete on price. These companies, such as Chinese PC manufacturers, do not spend anything on R&D or marketing. Instead, they ruthlessly squeeze out cost savings and pass them on to their customers. The other proven strategy is to be the innovation leader, whereby you continually delight customers by giving them more and more functionality at the existing price. Players following this strategy are never the cheapest, since they charge a little extra to fund new product development. For example, the iPhone is by no means the cheapest smart phone, but its price did not go up when, with the iPhone 4, Apple added video, a front-facing camera, better battery life, and a retina display.

It is worth noting that either strategy can work, and companies sometimes move between the two, although making that transition is incredibly hard. Staying in the PC industry, Dell started as the low-cost provider, but has more recently tried to move up the value chain by investing more in the design of its products. The results, so far, have been mixed.

At Clearwell, our strategy is to be the innovation leader in e-discovery software. We tackle really hard technical problems, solve them in innovative ways, and then seek to delight our users by providing them with breakout, new capabilities at no incremental cost. Transparent Concept Search is a perfect example of this.

Rather than just integrate with concept analysis plug-ins, as pretty much every review platform does, we asked ourselves: if we were to create concept search from scratch specifically for e-discovery, what would we build? As part of that process, we tapped into the latest academic research in semantic analysis coming out of UCLA, University of Pittsburgh, and other universities, and discovered that it offers a solution to the biggest single problem users have with concept search: the heavy computational burden traditional approaches require. By using a variation of the semantic space model which is explained in that new research rather than, say, latent semantic indexing, we can deliver concept searching to much larger legal matters.
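For readers unfamiliar with the general idea of a semantic space, the toy Python sketch below represents each word by the words it co-occurs with and treats words with similar neighbors as conceptually related.  To be clear, this is a generic textbook-style illustration and an assumption on my part; it is neither Clearwell’s model nor latent semantic indexing, and the sample documents are invented.

```python
import math
from collections import defaultdict

def cooccurrence_vectors(docs):
    """Map each word to counts of the other words appearing in the same document."""
    vectors = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        words = set(doc.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_terms(vectors, term, top_n=3):
    """Words whose co-occurrence profile is most similar to that of `term`."""
    base = vectors.get(term, {})
    scores = [(cosine(base, vec), w) for w, vec in vectors.items() if w != term]
    return [w for s, w in sorted(scores, reverse=True)[:top_n] if s > 0]

docs = [
    "invoice payment overdue",
    "bill payment overdue",
    "invoice payment received",
    "bill payment received",
    "picnic saturday park",
    "picnic sunday park",
]
vectors = cooccurrence_vectors(docs)
print(related_terms(vectors, "invoice", top_n=1))
```

In this sample “invoice” and “bill” never appear in the same document, yet they come out as closely related because they keep the same company.  That is the intuition behind searching by concept rather than by exact keyword.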

Beyond the core technology, we also wanted to change the user experience, by bringing the same level of visibility and control that our users enjoy in keyword search to this domain. Our goal is to enable users to balance both precision and recall in a way that was not previously possible. The result – Transparent Concept Search – is completely seamless within Clearwell in a way that simply cannot be matched by concept search plug-ins to a review platform, which are essentially two separate products from two separate vendors. In summary, it’s a vastly superior user experience – at no incremental cost.

This is the first of many things you will see from us this year. Our team could not be more excited about the new products and ideas that we have in the pipeline.

Concept Search in E-Discovery: From Concept to Reality

Sunday, January 30th, 2011

For years, concept search in electronic discovery has been like concept cars at auto shows: Cool. Slick. The thing that everyone is talking about.

But not ready to move to the assembly line and be put into production.

Like a concept car, concept search has been based on a lot of good ideas and shown a lot of promise. However, it has failed to move beyond a few edge use cases and reach mass adoption in the e-discovery market.  Why is this the case?

It’s not been because it’s an unproven idea or that the basic technology hasn’t been available. In fact, the core algorithm that underlies most existing concept search technologies has actually been around since 1988, when latent semantic analysis (LSA) was first patented by a team from Bell Labs. Over the last 20 years, dozens if not hundreds of companies have sprung up to apply concept search to the broad area of enterprise search and to e-discovery in particular.

To understand why concept search has never taken off, it’s always interesting to look for parallels, and the parallel du jour is social networking. Readers of David Kirkpatrick’s excellent book The Facebook Effect and (perhaps to a lesser, more fictionalized extent) viewers of the movie The Social Network understand that Facebook was far from the first social networking site (remember MySpace? You won’t admit it, but I know you do). But, despite being several years late to the party, Facebook somehow took the core of the social networking idea and presented it to users in a way that really allowed it to “cross the chasm” to the mainstream market.

In introducing Transparent Concept Search, Clearwell plans to help conceptual search cross that same chasm in e-discovery.  In talking to customers over the last couple of years, we have found that there are unmet customer needs with existing concept search products that, once addressed, will really allow its use in e-discovery to flourish – and not just in a way that makes concept search marginally more useful, but, a la Facebook, makes it orders of magnitude more useful.

What are these unmet customer needs?

Ease of use: Historically, concept search has been relatively easy to use in the strictest sense of the word – you type in some terms that represent your concept, and you get a set of search results back, along with some related terms and/or clusters of related documents. Simple, right? The issue is that in most cases that’s not what the user really wants to do. Because concept search is inherently “fuzzy”, users want to be able to refine their concept based on the feedback that they got from their initial search. Concept search, just like keyword search, is an iterative process, and prior-generation technologies have not allowed for that form of iteration. In contrast, Clearwell’s Transparent Concept Search allows concepts to be defined and refined in a way that is intuitive, visual, and (don’t take my word for it, but try it for yourself) fun.

Precision: Traditional concept search increased recall when compared to just keyword search, but it came at the cost of precision. The refinement process facilitated by Clearwell’s Transparent Concept Search addresses this issue by allowing intelligent human input to guide the concept search process. You get the best of both the recall and precision worlds with vastly diminished time and effort.

Defensibility: Even more important than ease of use and precision is defensibility. Defensibility, for those new to the term, isn’t so much about whether the algorithms are known and can be understood. They are, and they aren’t that complicated. Rather, defensibility is about reasonableness: was the concept search a reasonable way of determining which documents are responsive? Without the ability to define your concept in an interactive manner, we believe that the answer has historically been “no”, making concept search nice in theory but unusable in actual legal practice. Transparent Concept Search promises to change that. The end result is a more defensible search process that yields both greater recall and greater precision, enabling users to more quickly analyze case facts, rapidly identify key documents that may have been missed, eliminate irrelevant documents, and prioritize the most relevant documents for review. Clearwell also provides a reporting and auditing feature to document your search, allowing you to improve defensibility by “proving up” what was done.

Low cost: Finally, never underestimate the value of “free” in helping meet the ever-important unmet need of cost predictability and control. Historically, vendors have charged price premiums (often substantial) for concept search. Try to charge a premium in e-discovery for something that doesn’t fully meet the customer use case and isn’t defensible, and it’s a recipe for low adoption. However, provide a highly usable, effective, and defensible capability as part of the core functionality of today’s leading e-discovery platform, and it starts to look very attractive indeed.

Hopefully you can tell that we’re incredibly excited about the promise that this technology holds for the market, and this initial version is really just the beginning. Want to see it for yourself? Check out the video below, visit our web site or, if you are in New York this week, please visit us at LegalTech New York – we would love to see you.

Defensible E-Discovery a Hot Topic at the Masters Conference

Thursday, October 29th, 2009

Recently, I moderated a panel at the Masters Conference with John Loveland, Sonya Thornton, and Bruce Markowitz entitled: How Defensible is Your E-Discovery Process? (Click here to read a summary of the panel.) It was well attended, and I think that the draw (aside from the esteemed panel) was that this topic remains very vexing for most practitioners.

Initially, we started at ground zero with the notion that defensibility is in most instances equated with the “reasonableness” standard, which is pervasive across many areas of the EDRM spectrum… from preservation to production.  Instances include:

  • Preservation — “[a]s soon as a potential claim is . . . identified, a party is under a duty to preserve evidence which it knows, or reasonably should know, is relevant to the future litigation.”
  • FRE 502(b) – the disclosure does not operate as a waiver in a Federal or State proceeding if . . . (2) the holder of the privilege or protection took reasonable steps to prevent disclosure;
  • General Privilege Waiver — In SEC v. Badian, 2009 WL 222783 (S.D.N.Y. Jan. 26, 2009), “there is no basis … to conclude that there were precautions [to prevent the disclosure], let alone whether they were reasonable.”
  • FRCP 37(e) — Absent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information lost as a result of the routine, good-faith operation of an electronic information system.

While the foregoing isn’t exhaustive, it does highlight the persistent nature of the reasonableness standard as practitioners seek a defensibility sanctuary.  The good news is that the law doesn’t require perfection, and there are a number of ways to obtain reasonable defensibility:

  • Demonstrable acceptance by the opposition – here the notion is that collaboration with the opposition allows the parties to comfortably move ahead with their discovery process, and even if the protocol is not objectively reasonable, the parties’ consent to it will in most instances carry an imprimatur of reasonableness.
  • Auditing / process transparency.  Similar to the first bullet, auditing the process and giving the opposition visibility into the process steps will often make it hard for them to lodge successful downstream challenges.
  • Adherence to Local Rules (See 7th Circuit Pilot Program) or judicial order.  Another avenue that can provide some degree of safety is compliance with a discovery protocol mandated by local rules, although that compliance may ultimately be challenged.
  • Statistical confidence intervals / sampling – the use of statistics as a way to bolster process defensibility is starting to come to maturity, and in the future I think that detailed precision, recall and other statistical indicators will play a large role in e-discovery defensibility.

None of these steps is guaranteed to get you off the hook when a rabid opposing party calls foul, but using them in a “belt and suspenders” fashion will certainly help buttress any discovery process.

For more illumination on the topic, please see the following video of my interview with John Loveland, who waxes poetic about discovery defensibility.