Posts Tagged ‘transparency’

2012: Year of the Dragon – and Predictive Coding. Will the eDiscovery Landscape Be Forever Changed?

Monday, January 23rd, 2012

2012 is the Year of the Dragon – which is fitting, since no other Chinese Zodiac sign represents the promise, challenge, and evolution of predictive coding technology more than the Dragon.  The few who have embraced predictive coding technology exemplify symbolic traits of the Dragon that include being unafraid of challenges and willing to take risks.  In the legal profession, taking risks typically isn’t in a lawyer’s DNA, which might explain why predictive coding technology has seen lackluster adoption among lawyers despite the hype.  This post explores the promise of predictive coding technology, why it has not been widely adopted in eDiscovery, and why 2012 is likely to be remembered as the year of predictive coding.

What is predictive coding?

Predictive coding refers to machine learning technology that can be used to automatically predict how documents should be classified based on limited human input.  In litigation, predictive coding technology can be used to rank and then “code” or “tag” electronic documents based on criteria such as “relevance” and “privilege” so organizations can reduce the amount of time and money spent on traditional page-by-page attorney document review during discovery.

Generally, the technology works by prioritizing the most important documents for review by ranking them.  In addition to helping attorneys find important documents faster, this prioritization and ranking of documents can even eliminate the need to review documents with the lowest rankings in certain situations.  Additionally, since computers don’t get tired or daydream, many believe computers can even predict document relevance better than their human counterparts.
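To make the ranking idea concrete, here is a minimal sketch of how a handful of attorney coding decisions could be turned into a relevance ranking.  It assumes a simple smoothed log-odds word weighting, which is only one of many possible approaches and is not a description of any particular vendor’s algorithm.  The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def train(seed_docs):
    """Learn a weight per word from (text, is_relevant) pairs coded by a reviewer."""
    rel, non = Counter(), Counter()
    for text, is_relevant in seed_docs:
        (rel if is_relevant else non).update(text.lower().split())
    vocab = set(rel) | set(non)
    rel_total = sum(rel.values()) + len(vocab)
    non_total = sum(non.values()) + len(vocab)
    # Smoothed log-odds: positive means the word leans "relevant".
    return {
        w: math.log((rel[w] + 1) / rel_total) - math.log((non[w] + 1) / non_total)
        for w in vocab
    }

def score(weights, text):
    """Sum the weights of known words; unseen words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())

def rank(weights, docs):
    """Order documents from most to least likely relevant."""
    return sorted(docs, key=lambda d: score(weights, d), reverse=True)

# Hypothetical seed set coded by a senior attorney.
seed = [
    ("pricing agreement with competitor", True),
    ("discuss competitor pricing before the bid", True),
    ("lunch menu for friday", False),
    ("office party on friday", False),
]
unreviewed = [
    "friday lunch plans",
    "call about competitor pricing",
    "parking garage notice",
]

weights = train(seed)
ranked = rank(weights, unreviewed)
for doc in ranked:
    print(round(score(weights, doc), 2), doc)
```

In this toy set the pricing email rises to the top and the lunch email sinks to the bottom, which is the kind of ordering that lets reviewers start with the likeliest documents.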

Why hasn’t predictive coding gone mainstream yet?

Given the promise of faster and less expensive document review, combined with higher accuracy rates, many are perplexed as to why predictive coding technology hasn’t been widely adopted in eDiscovery.  The answer really boils down to one simple concept – a lack of transparency.

Difficult to Use

First, early predictive coding tools attempt to apply a complicated new technological approach to a document review process that has traditionally been very simple.  Instead of relying on attorneys to read each and every document to determine relevance, the success of today’s predictive coding technology typically depends on review decisions input into a computer by one or more experienced senior attorneys.  The process commonly involves a complex series of steps that include sampling, testing, reviewing, and measuring results in order to fine-tune an algorithm that will eventually be used to predict the relevancy of the remaining documents.

The problem with early predictive coding technologies is that the majority of these complex steps are done in a ‘black box’.  In other words, the methodology and results are not always clear, which increases the risk of human error and makes the integrity of the electronic discovery process difficult to defend.  For example, the methodology for selecting a statistically valid sample is not always intuitive to the end user.  This fundamental problem could result in improper sampling techniques that could taint the accuracy of the entire process.  Similarly, the process must often be repeated several times in order to improve accuracy rates.  Even if accuracy is improved, it may be difficult or impossible to explain how accuracy thresholds were determined or to explain why coding decisions were applied to some documents and not others.

Accuracy Concerns

Early predictive coding tools also tend to lack transparency in the way the technology evaluates the language contained in each document.  Instead of evaluating both the text and metadata fields within a document, some technologies actually ignore document metadata.  This omission means a privileged email sent by a client to her attorney, Larry Lawyer, might be overlooked by the computer if the name “Larry Lawyer” is only part of the “recipient” metadata field of the document and isn’t part of the document text.  The obvious risk is that this situation could lead to privilege waiver if it is inadvertently produced to the opposing party.
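The Larry Lawyer scenario is easy to reproduce in a few lines.  The sketch below assumes a simplified email represented as a dictionary with separate metadata and text fields; the sender name “Carol Client” and the `mentions` helper are made up for illustration.

```python
# Hypothetical email: the attorney's name appears only in the recipient field.
email = {
    "metadata": {"from": "Carol Client", "to": "Larry Lawyer", "subject": "Question"},
    "text": "Can we talk tomorrow about the contract dispute?",
}

def mentions(doc, name, include_metadata):
    """Return True if the name appears in the fields being analyzed."""
    fields = [doc["text"]]
    if include_metadata:
        fields.extend(doc["metadata"].values())
    return any(name.lower() in field.lower() for field in fields)

text_only = mentions(email, "Larry Lawyer", include_metadata=False)
text_and_metadata = mentions(email, "Larry Lawyer", include_metadata=True)
print(text_only, text_and_metadata)
```

A tool that looks only at the body text never sees the attorney’s name, while one that also reads the metadata flags the email for privilege review.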

Another practical concern is that some technologies do not allow reviewers to make a distinction between relevant and non-relevant language contained within individual documents.  For example, early predictive coding technologies are not intelligent enough to know that only the second paragraph on page 95 of a 100-page document contains relevant language.  The inability to discern what language led to the determination that the document is relevant could skew results when the computer tries to identify other documents with the same characteristics.  This lack of precision increases the likelihood that the computer will retrieve an over-inclusive set of documents, many of them irrelevant.  This over-retrieval is sometimes loosely called ‘excessive recall,’ though strictly speaking it is a precision problem, and it matters because every extra document retrieved requires manual review, which directly increases eDiscovery cost.
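For readers who want to see the two measures side by side, here is a minimal sketch of the standard precision and recall calculations.  The document IDs are invented, and the function name is only illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical matter: 4 relevant documents, but the tool retrieves 10.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {f"doc{i}" for i in range(1, 11)}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 1.0
```

Here every relevant document was found (perfect recall), yet six of the ten retrieved documents are noise that someone still has to review.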

Waiver & Defensibility

Perhaps the biggest concern with early predictive coding technology is the risk of waiver and concerns about defensibility.  Notably, there have been no known judicial decisions that specifically address the defensibility of these new technology tools even though some in the judiciary, including U.S. Magistrate Judge Andrew Peck, have opined that this kind of technology should be used in certain cases.

The problem is that today’s predictive coding tools are difficult to use and complicated for the average attorney, and the way they work simply isn’t transparent.  All these limitations increase the risk of human error.  Introducing human error increases the risk of overlooking important documents or unwittingly producing privileged documents.  Similarly, it is difficult to defend a technological process that isn’t always clear in an era where many lawyers are still uncomfortable with keyword searches.  In short, using black box technology that is difficult to use and understand is perceived as risky, and many attorneys have taken a wait-and-see approach because they are unwilling to be the guinea pig.

Why is 2012 likely to be the year of predictive coding?

The word transparency may seem like a vague term, but it is the critical element missing from today’s predictive coding technology offerings.  2012 is likely to be the year of predictive coding because improvements in transparency will shine a light into the black box of predictive coding technology that hasn’t existed until now.  In simple terms, increasing transparency will simplify the user experience and improve accuracy which will reduce longstanding concerns about defensibility and privilege waiver.

Ease of Use

First, transparent predictive coding technology will help minimize the risk of human error by incorporating an intuitive user interface into a complicated solution.  New interfaces will include easy-to-use workflow management consoles to guide the reviewer through a step-by-step process for selecting, reviewing, and testing data samples in a way that minimizes guesswork and confusion.  By automating the sampling and testing process, the risk of human error can be minimized which decreases the risk of waiver or discovery sanctions that could result if documents are improperly coded.  Similarly, automated reporting capabilities make it easier for producing parties to evaluate and understand how key decisions were made throughout the process, thereby making it easier for them to defend the reasonableness of their approach.

Intuitive reports also help the producing party measure and evaluate confidence levels throughout the testing process until appropriate confidence levels are achieved.  Since confidence levels can actually be measured as a percentage, attorneys and judges are in a position to negotiate and debate the desired level of confidence for a production set rather than relying exclusively on the representations or decisions of a single party.  This added transparency allows the type of cooperation between parties called for in the Sedona Cooperation Proclamation and gives judges an objective tool for evaluating each party’s behavior.
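As a rough illustration of how a confidence level can be expressed as a number the parties can negotiate, the sketch below uses the textbook normal approximation for a sample proportion.  It assumes a simple random sample from a large document population; real tools may use different statistical methods, and the function names here are hypothetical.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided z-score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def sample_size(confidence, margin, p=0.5):
    """Documents to sample for a given margin of error (p=0.5 is the worst case)."""
    z = z_value(confidence)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def interval(hits, n, confidence):
    """Approximate confidence interval for the true proportion of hits."""
    p = hits / n
    half_width = z_value(confidence) * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

n = sample_size(0.95, 0.05)
print(n)  # 385

# Suppose reviewers find 30 relevant documents in a sample of 385.
low, high = interval(30, 385, 0.95)
print(round(low, 3), round(high, 3))
```

With figures like these on the table, the parties can argue about whether a 95% confidence level and a 5% margin are good enough, rather than arguing about whether to trust the tool at all.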

Accuracy & Efficiency

2012 is also likely to be the year of transparent predictive coding technology because technical limitations that have impacted the accuracy and efficiency of earlier tools will be addressed.  For example, new technology will analyze both document text and metadata to avoid the risk that responsive or privileged documents are overlooked.  Similarly, smart tagging features will enable reviewers to highlight specific language in documents to determine a document’s relevance or non-relevance so that coding predictions will be more accurate and fewer non-relevant documents will be recalled for review.

Conclusion - Transparency Provides Defensibility

The bottom line is that predictive coding technology has not enjoyed widespread adoption in the eDiscovery process due to concerns about simplicity and accuracy that breed larger concerns about defensibility.  Defending the use of black box technology that is difficult to use and understand is a risk that many attorneys simply are not willing to take, and these concerns have deterred widespread adoption of early predictive coding technology tools.  In 2012, next generation transparent predictive coding technology will usher in a new era of computer-assisted document review that is easy to use, more accurate, and easier to defend. Given these exciting technological advancements, I predict that 2012 will not only be the year of the dragon, it will also be the year of predictive coding.

Apple, Code Name K48 and E-Discovery

Wednesday, June 22nd, 2011

According to a complaint filed by the U.S. government, the FBI secretly recorded an employee at one of Apple’s suppliers passing confidential information about the soon-to-be-released Apple iPad in an October 2009 telephone conversation.  The recording, along with other evidence, led to the arrest of the employee and others on charges of wire fraud and conspiracy to commit securities fraud on December 16, 2010 as part of a major insider-trading investigation.  In the conversation, a director for Flextronics named Walter Shimoon is heard saying:

“they [Apple] have a code name for something new … It’s … It’s totally … It’s a new category altogether… It doesn’t have a camera, what I figured out. So I speculated that it’s probably a reader. … Something like that. Um, let me tell you, it’s a very secretive program … It’s called K, K48. That’s the internal name. So, you can get, at Apple you can get fired for saying K48.”

Four months later, the first Apple iPad, code named K48, was unveiled to the public.  For more on the case background, see the press release issued by the U.S. Attorney’s Office on December 16, 2010.

The case is interesting from an eDiscovery standpoint because it highlights challenges related to finding critical evidence as part of an investigation or lawsuit when people are intentionally using code words to hide information.  Finding or overlooking important documents that have been disguised can make or break your case, so determining whether or not key players are using code words is an important part of a thorough investigation.  Equally important to the investigation is segregating relevant and irrelevant documents quickly before key evidence is lost or destroyed without being required to conduct a painstaking page by page review of each document.

How Does Technology Help?

The good news is that even though technology innovation has resulted in massive data growth requiring the review and analysis of more documentary evidence during lawsuits and investigations, advances in eDiscovery technology have also made sifting through this information faster and easier.  In other words, technology can help solve the data growth problem technology created.

One of the newest advances is the use of “transparent concept search” technology to find important electronic files in lieu of basic “keyword” or “traditional” concept searching technology.  In many situations investigators or lawyers simply aren’t aware code words are being used to hide activity, so critical evidence is often overlooked.  For example, in the present case assume the investigator is unaware that “K48” is the internal code name used for the first iPad.  A simple keyword search for the term “iPad” may not retrieve critical documents about the “iPad” because the code name K48 is being used to disguise the product name.  If this is the only search methodology used, information could easily be overlooked during the investigation due to the limitations of simple keyword search technology.

On the other hand, running the same search using a traditional concept searching tool is likely to retrieve documents containing the word “iPad” as well as other conceptually related documents.  The problem is that the user has no ability to control the breadth of the search using traditional concept searching technology.  That means even though a traditional concept search for the term “iPad” is likely to include documents containing the terms “K48” and “iPad,” it is also likely to retrieve a large number of irrelevant documents containing terms like “iPod,” “iTouch” and “iTunes” that may appear to be conceptually related to the search term “iPad.”  The problem may seem trivial initially, but when investigators are required to read hundreds or thousands of irrelevant documents about the iPod, iTouch or iTunes in an effort to find relevant documents about the iPad, the time and cost of the investigation can skyrocket.

Next Generation Transparent Concept Search Technology

To solve this problem, next generation transparent concept search technology takes traditional concept searching a step further by empowering investigators to reap the advantages of traditional concept searching while actually reducing instead of increasing e-discovery expenses.  The secret is that transparent concept searching technology significantly reduces the time and expense resulting from over-inclusive document retrieval by allowing users to eliminate documents containing concepts that are not relevant to the intended search.  This is accomplished by providing a transparent view of concepts related to a search so that users can actually visualize and select (or deselect) the range of concepts to be included in a search before the search is executed.

For example, using transparent concept search technology to search for the term “iPad” would reveal conceptually related terms like “K48” just like traditional concept searching.  However, a transparent concept search would also provide a list of all concepts related to the keyword “iPad” prior to the search, such as “K48,” “iPod,” “iTouch,” “Shimoon,” “iTunes,” etc.  Prior to executing the search, the user could de-select irrelevant concepts and limit the search to “iPad”, “Shimoon”, “internal” and “K48” to make sure only the most relevant documents are retrieved. (See Figure 1).  In addition to decreasing the cost associated with segregating relevant and irrelevant documents, the transparent approach to concept searching results in strategic advantages for investigators and legal teams because the most relevant evidence is found quickly so cases can be assessed faster, with more accuracy, and before evidence disappears.

Figure 1: Transparent concept search reveals all concepts related to the keyword “iPad” so users can not only identify key documents they may have otherwise overlooked, but they can also select which concepts (“internal” “K48” “Shimoon”) to include in the search so only the most relevant documents are retrieved.
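The select-and-deselect workflow can be sketched in a few lines of code.  This is a toy model only: the related-concept table is hard-coded rather than learned from the data, the documents are invented, and the function names are hypothetical rather than part of any product.

```python
# Hypothetical concept table; a real tool would derive this from the documents.
RELATED = {"ipad": ["k48", "ipod", "itouch", "shimoon", "itunes", "internal"]}

def related_concepts(term):
    """Show the user every concept the search would include before it runs."""
    return [term.lower()] + RELATED.get(term.lower(), [])

def search(docs, concepts):
    """Return IDs of documents containing at least one selected concept."""
    wanted = {c.lower() for c in concepts}
    return [doc_id for doc_id, text in docs.items()
            if wanted & set(text.lower().split())]

docs = {
    "d1": "shimoon said k48 is a secretive program",
    "d2": "new ipod playlist synced with itunes",
    "d3": "ipad launch schedule",
    "d4": "itouch battery complaint",
    "d5": "quarterly cafeteria budget",
}

everything = search(docs, related_concepts("iPad"))
selected = search(docs, ["iPad", "K48", "Shimoon", "internal"])
print(everything)  # ['d1', 'd2', 'd3', 'd4']
print(selected)    # ['d1', 'd3']
```

The unfiltered search behaves like a traditional concept search and drags in the iPod and iTouch documents, while the user-trimmed concept list keeps the K48 document and drops the noise.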

Conclusion

Not knowing what to search for as part of eDiscovery or investigations is often the biggest organizational challenge that basic keyword and traditional concept search technology has not been able to solve.  Next generation transparent concept search technology overcomes the inherent limitations of basic keyword and traditional concept searching technology by empowering users to uncover, assess, and review evidence faster and with more accuracy, thereby giving litigators or investigators new strategic advantages on every case.

The Business Strategy Behind Clearwell’s Transparent Concept Search

Monday, January 31st, 2011

Last fall, when Transparent Concept Search was still in development, we showed an early version of it to a group of our customers. Their excitement was palpable, and they spent most of our session together comparing notes about the varied ways they will use it. But at the end of the discussion, one of them asked the question which was on everyone’s mind: “how much will you charge for it?”, or as someone else immediately said “I get charged $200/GB for plain vanilla concept search, so how much of a premium do you think you will get for this?”

Our answer surprised them: there’s no charge. Transparent Concept Search is included in Clearwell for free. Here’s why doing that makes sense:

There are two business strategies in the technology industry which are proven to work. One is to be the low-cost provider and compete on price. These companies, such as Chinese PC manufacturers, do not spend anything on R&D or marketing. Instead, they ruthlessly squeeze out cost savings and pass them on to their customers. The other proven strategy is to be the innovation-leader, whereby you continually delight customers by giving them more and more functionality at the existing price. Players following this strategy are never the cheapest, since they charge a little extra to fund new product development. For example, iPhone is by no means the cheapest smart phone, but its price did not go up when, with the iPhone 4, Apple added video, a forward-facing camera, better battery life, and a retina display.

It is worth noting that either strategy can work, and companies sometimes move between the two, although making that transition is incredibly hard. Staying in the PC industry, Dell started as the low cost provider, but has more recently tried to move up the value chain by investing more in the design of its products. The results, so far, have been mixed.

At Clearwell, our strategy is to be the innovation leader in e-discovery software. We tackle really hard technical problems, solve them in innovative ways, and then seek to delight our users by providing them with breakout, new capabilities at no incremental cost. Transparent Concept Search is a perfect example of this.

Rather than just integrate with concept analysis plug-ins, as pretty much every review platform does, we asked ourselves: if we were to create concept search from scratch specifically for e-discovery, what would we build? As part of that process, we tapped into the latest academic research in semantic analysis coming out of UCLA, University of Pittsburgh, and other universities, and discovered that it offers a solution to the biggest single problem users have with concept search: the heavy computational burden traditional approaches require. By using a variation of the semantic space model which is explained in that new research rather than, say, latent semantic indexing, we can deliver concept searching to much larger legal matters.

Beyond the core technology, we also wanted to change the user experience, by bringing the same level of visibility and control that our users enjoy in keyword search to this domain. Our goal is to enable users to balance both precision and recall in a way that was not previously possible. The result – Transparent Concept Search – is completely seamless within Clearwell in a way that simply cannot be matched by concept search plug-ins to a review platform, which are essentially two separate products from two separate vendors. In summary, it’s a vastly superior user experience – at no incremental cost.

This is the first of many things you will see from us this year. Our team could not be more excited about the new products and ideas that we have in the pipeline.

Defensible E-Discovery a Hot Topic at the Masters Conference

Thursday, October 29th, 2009

Recently, I moderated a panel at the Masters Conference with John Loveland, Sonya Thornton, and Bruce Markowitz entitled: How Defensible is Your E-Discovery Process? (Click here to read a summary of the panel.) It was well attended, and I think that the draw (aside from the esteemed panel) was that this topic still remains very vexing for most practitioners.

Initially, we started at ground zero with the notion that defensibility is in most instances equated with the “reasonableness” standard, which is pervasive across many areas of the EDRM spectrum… from preservation to production.  Instances include:

  • Preservation — “[a]s soon as a potential claim is . . . identified, a party is under a duty to preserve evidence which it knows, or reasonably should know, is relevant to the future litigation.”
  • FRE 502(b) – the disclosure does not operate as a waiver in a Federal or State proceeding if, among other requirements, (2) the holder of the privilege or protection took reasonable steps to prevent disclosure;
  • General Privilege Waiver — In SEC v. Badian, 2009 WL 222783 (S.D.N.Y. Jan. 26, 2009), “there is no basis … to conclude that there were precautions [to prevent the disclosure], let alone whether they were reasonable.”
  • FRCP 37(e) — Absent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information lost as a result of the routine, good-faith operation of an electronic information system.

While the foregoing isn’t exhaustive it does highlight the persistent nature of the reasonableness standard as practitioners seek a defensibility sanctuary.  The good news is that the law doesn’t require perfection and there are also a number of ways to obtain reasonable defensibility:

  • Demonstrable acceptance by the opposition – here the notion is that collaboration with the opposition allows the parties to comfortably move ahead with their discovery process and even if it’s not objectively reasonable, the parties’ consent to the protocol will in most instances carry an imprimatur of reasonableness.
  • Auditing / process transparency.  Similar to the first bullet, auditing the process and giving the opposition visibility into the process steps will often make it hard for them to lodge successful downstream challenges.
  • Adherence to Local Rules (See 7th Circuit Pilot Program) or judicial order.  Another avenue that can provide some degree of safety is compliance with a discovery protocol mandated by local rules, although that compliance may ultimately be challenged.
  • Statistical confidence intervals / sampling – the use of statistics as a way to bolster process defensibility is starting to come to maturity and in the future I think that detailed precision, recall and other statistical indicators will play a large role in e-discovery defensibility.

None of these steps is guaranteed to get you off the hook when a rabid opposing party calls foul, but using them in a “belt and suspenders” fashion will certainly help buttress any discovery process.

For more illumination on the topic please see the following video of my interview with John Loveland, who’s waxing poetic about discovery defensibility.

A Gross Inability to Craft Electronic Discovery Searches

Thursday, April 9th, 2009

The bashing of our judicial system seems to have reached a fever pitch.  Groups like the American College of Trial Lawyers (“ACTL”) have proclaimed in a recent report that while the “civil justice system is not broken, it is in serious need of repair.”  The blame game seems to have judges and attorneys alike pointing fingers.  The Fellows of the ACTL (perhaps not surprisingly) seem to pin some of the blame on the judiciary:

“Judges should have a more active role at the beginning of a case in designing the scope of discovery and the direction and timing of the case all the way to trial. Where abuses occur, judges are perceived not to enforce the rules effectively.”

Groups like the Sedona Conference chalk up many of the ills to the failure to cooperate, so much so that they’ve orchestrated a cooperation proclamation – which has picked up enough support by the bench to have garnered several cites in the case law (see e.g., Mancia).

The bench, for its part, seems to put some of the onus on litigators and their reluctance to get with the times.  William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co., 2009 WL 724954 (S.D.N.Y. Mar. 19, 2009) is the latest example of such a proclamation.  In this construction defect case, Judge Peck (a Sedona devotee) issues what he hopes will be a “wake-up” call to the bar about the need for “careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or ‘keywords’ to be used to produce emails or other electronically stored information (‘ESI’).”  In Gross, the court had to mediate an e-discovery dispute in which the requesting party propounded a blatantly over-inclusive search request.  Unfortunately, the responding entity was a non-party, and it simply buried its head in the sand.  This left the Court in the “uncomfortable position” of having to craft a “keyword search methodology for the parties, without adequate information from the parties (and Hill)” in order to facilitate a resolution.

Judge Peck’s exasperation with these antics was palpable.  Summing up the problem by citing Judge Grimm and Victor Stanley he stated: “This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails.”  He further noted: “[w]hile this message has appeared in several cases from outside this Circuit, it appears that the message has not reached many members of our Bar.”

After noting both Sedona and Judge Facciola (of O’Keefe and Equity Analytics fame) Peck’s opinion reached a crescendo:

“Electronic discovery requires cooperation between opposing counsel and transparency in all aspects of preservation and production of ESI. Moreover, where counsel are using keyword searches for retrieval of ESI, they at a minimum must carefully craft the appropriate keywords, with input from the ESI’s custodians as to the words and abbreviations they use, and the proposed methodology must be quality control tested to assure accuracy in retrieval and elimination of ‘false positives.’ It is time that the Bar-even those lawyers who did not come of age in the computer era-understand this.”

While it’s easy to see whom Peck blames in this brouhaha, it takes (at least) two to tango, meaning that litigants on both sides of the “v” must move beyond the typical “seat of the pants” electronic discovery wrangling.  And judges need to be savvy enough to spot the issues and help (or force) the parties into such an enlightened, cooperative state.  Nothing short will get the job done.
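The kind of quality control testing Judge Peck calls for can be approximated with a simple per-keyword report on a hand-reviewed sample.  The sketch below is only an illustration: the keywords, the sample documents, and the `keyword_report` helper are all hypothetical and are not drawn from the Gross case.

```python
def keyword_report(keywords, sample):
    """For each keyword, count hits and false positives in a reviewed sample.

    sample: list of (text, is_relevant) pairs coded by a human reviewer.
    """
    report = {}
    for kw in keywords:
        hits = [is_relevant for text, is_relevant in sample
                if kw.lower() in text.lower().split()]
        report[kw] = {
            "hits": len(hits),
            "false_positives": sum(1 for is_relevant in hits if not is_relevant),
        }
    return report

# Hypothetical reviewed sample from a construction defect matter.
sample = [
    ("change order for the roof defect", True),
    ("roof defect inspection report", True),
    ("order pizza for the team", False),
    ("purchase order for office chairs", False),
    ("holiday schedule", False),
]

report = keyword_report(["defect", "order"], sample)
for kw, stats in report.items():
    print(kw, stats)
```

In this sample “defect” hits only relevant documents, while two of the three hits for “order” are false positives, which is exactly the sort of finding that should prompt counsel to narrow a term before agreeing to it.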

Concept Search Versus Keyword Search in Electronic Discovery

Wednesday, November 12th, 2008

In my last post, I started a discussion on the myths surrounding concept search.  The first myth I dispelled was the “concept search is concept search” myth.  The myth is that there is an agreed upon definition of concept search.  In actuality, when people in electronic discovery use the term concept search, they don’t always mean the same thing.  Frequently they are not actually talking about concept search technology at all and are actually talking about concept or content categorization technology, which is very different.  The second myth that needs dispelling is that concept search is better than keyword search.

The thinking behind this myth goes something like this:

Keyword search has a lot of problems.  It is prone to being over-inclusive, i.e., finding some non-relevant documents, and under-inclusive, i.e., not finding some relevant documents.  Concept search technologies are new and interesting and using these technologies you can find documents that keyword search can’t find.  Therefore, concept search must be better than keyword search.

Let’s examine this thinking.  The first two statements are accurate.  Keyword search is not perfect and can produce over- and under-inclusive results.  And concept search and content categorization technologies can both help identify documents that keyword search technologies might not find.  However, the conclusion that concept search is better than keyword search is not valid and doesn’t follow from these two statements.  Why?

In order to answer this question, we first need to go back to the difference between concept search and content categorization. Because these are different technologies, we really need to separately compare concept search versus keyword search and content categorization versus keyword search.  Let’s start with content categorization and keyword search.

The issue with this comparison is that keyword search and content categorization do different things.  Keyword search can be used in many ways in e-discovery.  The two most common are: (1) analysis or case assessment: finding the hot documents and understanding the matter by determining who knew what, when, how and why, etc., and (2) culling: removing non-responsive documents and/or identifying potentially privileged documents in order to reduce a large, starting set of documents to a smaller set before review.

Content categorization, on the other hand, has historically been used within the review phase of e-discovery.  Categorization can help reviewers to better understand the documents they are reviewing and thus potentially increase the speed of review.  Practitioners with whom I have worked also find that categorization can be useful during analysis by helping to understand a matter and identify potentially important keywords.

However, content categorization has not been used as part of culling.  First, culling needs to be transparent.  You need to be able to get agreement with or at least explain to the opposing side and the court exactly how you have culled the data set.  If you cull based on categories of documents that have been generated by a proprietary, black-box algorithm, it’s going to be difficult to gain agreement on or explain your culling methodology.  This is why the typical method of culling is still to use keyword search and either agree on the set of search terms with the opposing side or to use e-discovery search best practices to perform keyword searches on your own.

Second, content categorization has its own issues when it comes to being over- and under-inclusive.  There is no guarantee that your group of documents that have been categorized as being related to, for example, a company’s hiring policies include all of the documents in your matter related to hiring policies or that they do not include some documents that may not really be related to hiring policies.  Content categorization, like keyword search and virtually every information retrieval technology, is not perfect.

So what about concept search technology?  Surely, concept search technology is better than old, boring keyword search.  Well, actually it’s not that clear-cut.  The problem with concept search technology is that while it might find more relevant documents than plain keyword search, it will also likely find more false positives.  Imagine searching for documents containing “terminate” in an employment matter and your concept search technology automatically searching for “fire”, “dismiss”, etc. as well.  You’ll find more documents related to the termination of employees, but you’ll also find a lot more non-relevant documents concerning house fires, the fire department, etc.
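The “terminate” example can be shown with a toy comparison.  This sketch assumes a hard-coded synonym list standing in for whatever a real concept engine would infer, and the four sample documents are invented.

```python
# Hypothetical expansion list; a real concept engine would infer related terms.
SYNONYMS = {"terminate": ["fire", "dismiss"]}

def matches(text, terms):
    """True if any of the terms appears as a whole word in the text."""
    words = set(text.lower().split())
    return any(term in words for term in terms)

def keyword_search(docs, term):
    return [i for i, text in enumerate(docs) if matches(text, [term])]

def concept_search(docs, term):
    terms = [term] + SYNONYMS.get(term, [])
    return [i for i, text in enumerate(docs) if matches(text, terms)]

docs = [
    "we will terminate the contractor on friday",   # relevant
    "hr decided to fire the manager",               # relevant, missed by keyword
    "the fire department inspected the warehouse",  # not relevant
    "court may dismiss the parking ticket",         # not relevant
]

keyword_hits = keyword_search(docs, "terminate")
concept_hits = concept_search(docs, "terminate")
print(keyword_hits)  # [0]
print(concept_hits)  # [0, 1, 2, 3]
```

The expanded search picks up the second relevant document that the keyword missed, but it also doubles the review pile with two documents that have nothing to do with employment.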

So concept search can help address the under-inclusive problem with keyword search (though it won’t solve it) and can be helpful during analysis.  But it can often increase the over-inclusive problem.  In addition, today’s concept search technologies share the transparency problem with content categorization.  These technologies have largely been designed as “black boxes”, which, as I have discussed in the past, makes sense for enterprise search but not for e-discovery search, and, as a result, they could also be difficult to explain and defend.  For these reasons, concept search technology isn’t used very much in e-discovery today.  In order for its use to become widespread, it will need to become more transparent.  But that’s a topic for another day.

The bottom line here is that despite all the hype, concept search and content categorization technologies do not solve all the challenges of e-discovery search.  Both of these technologies can be very useful and the technology behind them is always improving.  However, as most of the experienced practitioners I work with already know, these technologies are generally better thought of as supplements to keyword search, not replacements.  The important question is not whether to use one technology over the other but which technology is best suited to your objectives and how best to use all the available technologies to achieve the desired goal.

What’s Different About E-Discovery Search?

Monday, May 5th, 2008

In his latest article, Craig Ball argues that lawyers “need to learn more about the science of search.” Craig says that at least part of the reason for this is that searching in e-discovery is challenging and different from the searching to which lawyers are accustomed.

“Lawyers believe themselves adept at keyword search in e-discovery because they’ve mastered keyword search in online legal research. The correlation is superficial at best. Unlike the crazy quilt of ESI, the language of reported cases is precise, consistent and structured. Misspellings are rare. Legal research is Disneyland. E-discovery is Baghdad.”

I had a conversation on a similar litigation discovery topic with Ron Friedman last month after my last post where he made a similar argument about lawyers needing to learn e-discovery search tools.1

I think Craig and Ron make excellent points. E-discovery search using litigation support software is different, and it’s important for lawyers, investigators, litigation support professionals and other practitioners to understand how. The natural questions that arise from their arguments are: what is different about e-discovery search? How is it different from other familiar searches, such as web search and legal research search? The answers are important because they can help guide e-discovery experts on how to train lawyers and even guide attorneys during litigation discovery review. They are also important for developing e-discovery best practices and e-discovery search software.

I think the first step in answering these questions is to agree on the definition of e-discovery search, or better said, the types of e-discovery search, since there are several. To address this appropriately would take at least another full litigation discovery post or a paper. As a result, I will leave the detailed discussion of these matters to another time, but for this discussion I will focus on searches used to identify potentially relevant documents for purposes of matter assessment (i.e., understanding the nature of the case: who did what, where, when and why) and for document production to the opposing party.

I have observed five major characteristics of e-discovery search that as a whole differentiate it from other searches. I would be interested to hear additional views on what is different about e-discovery search, so please comment on this post.

Recall
First, the cost of missing a relevant document, or low recall, can be very high in e-discovery. Missing a document that you should have produced could result in sanctions and adversely impact the case outcome. Missing key documents could also affect your legal strategy causing you to make sub-optimal decisions. Missing relevant documents can be costly in other searches as well. For example, in legal research, not identifying case law that is critical to your case could also have a detrimental impact on your legal strategy. However, low recall is on average costlier and more likely in e-discovery. In contrast to e-discovery and legal searchers, web search users are typically not very concerned with missing relevant documents. For the most part, they are interested in the most relevant documents, not all of the relevant documents. This is why Google rarely actually provides all the results for a search (you can try this yourself by paging to the end).

Precision
Second, the cost of returning false positives, otherwise known as low precision, in e-discovery searches is high. The results of e-discovery searches, including false positives, are typically produced and reviewed by humans at costs as high as several dollars per document. On the other hand, false positives have a minimal cost in web search because users either won’t see them if they are ranked low or will ignore them after minimal review. False positives can be costly during legal research in certain scenarios, such as when the stakes and nature of the case are such that many search results need to be exhaustively reviewed, but typically the costs are lower.
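For readers who want to see recall and precision as numbers, here is a minimal sketch. The document ids, the set of relevant documents, and the per-document review cost are invented for illustration; in a real matter the full set of relevant documents is not known in advance and is usually estimated by sampling.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one search.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def false_positive_review_cost(retrieved, relevant, cost_per_doc):
    """Money spent reviewing retrieved documents that were not relevant."""
    return len(retrieved - relevant) * cost_per_doc


# Hypothetical matter: documents 1-10 are relevant.
relevant = set(range(1, 11))

# Hypothetical search: returns 8 documents, 6 of them relevant.
retrieved = {1, 2, 3, 4, 5, 6, 11, 12}

precision, recall = precision_recall(retrieved, relevant)

# Assumed review cost of 2.00 per document, purely for illustration.
wasted = false_positive_review_cost(retrieved, relevant, cost_per_doc=2.0)
```

In this made-up case precision is 0.75 and recall is 0.6: four relevant documents were missed, and two non-relevant documents would still have to be reviewed by a human.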

Varied Language
Third, documents searched using litigation support software during e-discovery often include personal emails and files and frequently use varied language including jargon, slang, abbreviations, technical terminology, misspellings, and machine-created junk. This is Craig’s “Baghdad” point. In contrast, as Craig points out, documents searched during legal research, such as opinions, motions, etc. are typically well-structured documents with no misspellings, relatively consistent language etc. Even web sites are generally “cleaner” than typical e-discovery documents.

Complexity
Fourth, users are often looking for different information when performing searches during discovery. E-Discovery searches are often aimed at comprehensively understanding “who did what, when, where and why” in a matter where the people involved may be trying to hide this information and where there may be no single “starting point”. As a result, e-discovery searchers often adopt strategies that involve large numbers of queries, and will follow the evidence and iteratively refine their searches for combinations of topics, people, places, etc. Legal searches can also be fairly complex, but as with other differences this is one of degree. These searches typically don’t involve hundreds of queries and terms, are often more narrowly defined and have a “starting point”. Web searches tend to be even simpler. Most are one or two words.

Transparency
Finally, e-discovery search is part of a legal process. The searches themselves are subject to negotiation with and review by opposing counsel and the court. This process can also take place over long time frames. As such, there is a great need for transparency in the development and execution of e-discovery searches. It is also important for e-discovery searchers to develop a defensible audit trail to prove what searches were run and what results were produced when. This is not the case in web search or legal research.
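One way to picture an audit trail is as a log entry written every time a search is run. The sketch below is an assumption about what such an entry might contain (who ran it, when, the exact query, and which documents came back); the field names and the `log_search` function are hypothetical and not taken from any particular e-discovery product.

```python
import hashlib
import json
from datetime import datetime, timezone


def log_search(audit_log, query, result_ids, run_by):
    """Append a record of one search to the audit log and return it."""
    sorted_ids = sorted(result_ids)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_by": run_by,
        "query": query,
        "result_count": len(sorted_ids),
        "result_ids": sorted_ids,
        # A fingerprint of the result list makes later changes detectable.
        "fingerprint": hashlib.sha256(
            json.dumps(sorted_ids).encode("utf-8")
        ).hexdigest(),
    }
    audit_log.append(entry)
    return entry


audit_log = []
entry = log_search(
    audit_log,
    query="terminate OR dismiss",
    result_ids={"d2", "d1"},
    run_by="reviewer-1",
)
```

Because each entry records the query text and a fingerprint of the results, the same search can be re-run months later and compared against what was logged at the time.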

These differences have a number of implications for e-discovery search best practices, training, software and more. I will discuss these in more detail in future posts. However, I think these differences make clear why Craig and Ron are right to suggest that people who are new to e-discovery can benefit from specialized training and tools. Similarly, for those of us who are deeply involved in e-discovery, I believe these differences show that there is still a lot of work to be done in developing best practices and software to make it easier for lawyers and other users to perform e-discovery searches effectively.

1 Ron also wrote another interesting post on this topic which can be found at PrismLegal.com.