Posts Tagged ‘sampling’

The “Artful” E-Discovery Dodger

Monday, October 13th, 2008

E-Discovery search has become a hot topic of late (in blogs and in the news), and I think it’s pretty clear that the unwashed (attorney) masses still don’t really grok the importance of using a defensible search protocol.  Neither do they seem to understand the enhanced scrutiny that’s being applied by the judiciary.

Kipperman v. Onex Corp., 2008 WL 4372005 (N.D. Ga. Sept. 19, 2008) is another in what will assuredly be a long string of cases that demonstrate how easy it is for litigators to get wrapped around the axel of e-discovery search.  In Kipperman, the defendant (Onex) presented several motions to the court, including attempts to obtain relief from the need to produce email identified after searching several backup tapes.

During a previous hearing the court ordered Onex to search all the mailboxes on two tapes, as well as on an additional tape selected by Plaintiff. The court determined that despite Onex’s objections and representations, the backup tapes were “producing meaningful discoverable information.”  The court was nevertheless sympathetic to Onex’s burden and therefore weighed in with some guidance:

“The court did suggest, … , that Plaintiff be more artful with its search terms and that Plaintiff utilize a list of the people, provided by Defendants, to review whether all mailboxes needed to be searched.”

The court also gave Onex the chance to narrow the search terms.  Unfortunately, they didn’t seize the opportunity to provide a narrower list or a refinement of their search terms.  “As such, they agreed to search and restore all the mailboxes with the search terms provided by Plaintiff.”

Not surprisingly, Onex then sought relief from having to review and produce all of the results from the search because the “broad search terms resulted in thousands and thousands of irrelevant hits.”  For example, the search terms included the word “republic” which used to elicit emails regarding Republic Builders Products, one of the companies involved in this matter.

“Defendants claim that the search captured thousands of irrelevant pages due to one occurrence of the word ‘republic’ often related to Onex business interests having nothing to do with Magnatrax in the ‘Republic of France,’ ‘Republic of Ireland,’ and ‘Czech Republic’.”

Again the court reaffirmed their sympathy with Onex’s burden and yet denied the requested relief, in large part because Onex was warned about not being more “artful”:

“[T]he court is not unsympathetic to the massive amount of discovery involved in this matter, the considerable burden of working with it, and the overproduction that often comes with e-mail production. Therefore, the court gave Defendants numerous tools by which to reduce the burden of e-mail discovery, including an opportunity to limit Plaintiff’s search terms and an opportunity to provide a list by which the number of peoples and the number of boxes being searched could be reduced. Defendants did not take advantage of these opportunities. Defendants must now lie in the bed that they have made. Thus, Defendants’ objections on the basis of relevancy and volume are DENIED.” (emphasis added).

Needless to say, Kipperman is probably not all that atypical.  Attorneys everywhere have historically used blunt e-discovery search instruments and haven’t often run afoul of the judiciary.  Now, post Victor Stanley, et al, the playing field has changed dramatically.  It’s important to leverage best practices (from Sedona and others), craft a defensible search strategy, sample the results and “show your work.”  Missteps along the way, especially ones that the court has tried to help the parties avoid won’t be met with much tolerance

Judge Grimm, Victor Stanley, And The Problem Of “Black-Box” E-Discovery Search

Friday, August 22nd, 2008

Judge Paul Grimm’s recent opinion in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008) provides valuable guidance on one of the most important issues in e-discovery: how to conduct keyword searches in a defensible manner given that keyword searches are prone to produce over- and under-inclusive results.  The ruling suggests one of two approaches: either producing parties should adopt a “collaborative” approach to conducting keyword searches, whereby each party agrees on a search methodology; or, they should use a “best practices” approach, such as the one suggested by Sedona, where the producing party tests, samples, and iteratively refines searches so that they can demonstrate they have taken reasonable measures to reduce over- and under-inclusive results.

While the guidance is clear, following the guidance in practice is very difficult.  The primary reason for this is that the search technology being used in e-discovery today is not up to the task.  Specifically, today’s search technology suffers from three problems:

  1. The over- and under-inclusive tradeoff. Many technologies have been developed to address the tendency of keyword searches to miss relevant documents and produce under-inclusive results.  Wildcard and stemming technology has been developed in order to address the issue of finding common word variations in specified keywords.  Concept search has been designed to find documents containing words with similar meanings to the keywords in a search.  And fuzzy search technologies have been put in place to find misspellings of words. However, all of these suffer from the same problem: they produce too many non-relevant or “false positive” documents thus driving up the cost of review. For example, if someone runs the wildcard search “divers*”, then he or she not only gets the desired documents containing “diverse” and “diversity”, but also gets a large number of false positive documents containing “diversion”, “diversification”, and so on.  In the case of concept and fuzzy search, the problem is so great that these technologies to date have rarely been used in e-discovery.
  2. Too expensive to test, sample and refine searches. Today’s search technologies are largely designed to run one search at a time, not the dozens of searches that are typical in e-discovery. As a result, anyone trying to follow the best practices of testing, sampling, and refining each search will find themselves missing deadlines and running over budget because it takes so long. This also makes collaboration with the opposing party close to impossible, since there’s little time to iterate on – and agree upon - a set of keyword searches.
  3. Manual documentation. It’s not enough for producing parties to use best practices, they have to document them so that they can “show their work” to the court. Currently, documenting the search refinement process is mostly manual, with the result that it is either done inadequately or not at all.

The reason why the search technology used for e-discovery has these problems is surprisingly simple: it’s because the technology was not designed for e-discovery in the first place. Rather, it was built for enterprise search, and was only later repurposed towards e-discovery.

The “Black Box” Of Enterprise Search

The core issue is that enterprise search technology has been designed to be a “black box”. Users enter a single search query into one end, and get results at the other, with no visibility into what happens in between. Going back to our previous example, when a user searches for “divers*” intending to find documents related to “diversity” or “diverse”, enterprise search engines give the user no visibility into the crucial step of query expansion and how it expands the search query into relevant and non-relevant terms like “diversion” and “diversification”. As a result, the user has no ability to minimize the false positives.

In the same vein, when a user enters multiple queries into a “black box” enterprise search engine, all of the queries run as a single search, and the user has no visibility into which results are associated with which query. For example, a user that searches for “hiring OR interview” will get the results for the combination of the queries “hiring” and “interview”. He or she won’t know that only 5 of documents contained “hiring” while 100 documents contained “interview.”  This limitation makes analyzing, sampling and refining searches costly and time consuming.

That’s not say that enterprise search products like Autonomy or Endeca are flawed. Far from it.  Their “black box” design works exceedingly well for the simple and quick queries that people want to run across the enterprise for general business purposes. If a sales manager is looking for a single proposal for her meeting the following day, then she doesn’t care how the search was performed or if it’s over-inclusive.  She’s only interested in the first page of relevant results, and for that use case enterprise search engines do a great job.

But e-discovery is a whole different world.  In e-discovery, users typically must review every single document in the search results, not just the most relevant ones.  As a result, over-inclusive searches can dramatically increase the costs of downstream production and review.  And under-inclusive searches raise the issue of defensibility.  Finally, e-discovery users have to run a lot of search queries and understand which documents are associated with each of those queries.

So, going back to the original problem, if current search technologies cannot help lawyers and litigation support professionals follow Judge Grimm’s guidance and address the “well-known limitations” of keyword search, what can? That will be the subject of my next post.

Five E-Discovery Questions with Craig Ball

Tuesday, August 12th, 2008

cball1.gifIn the spirit of the popular New York Times magazine feature, with this post we inaugurate what we hope to be a long-running series of interviews with e-discovery luminaries to get their take on emerging ideas and trends (and hopefully have some fun as well).

Today’s questionee is e-discovery and forensics expert (and popular Law Technology News columnist) Craig Ball.  Craig’s combination of wit and insight speaks for itself, so let’s just get right to the questions.

1) The cases that are on everyone’s mind are O’Keefe/Lundin and Victor Stanley. What’s the practical impact of these rulings to the e-discovery practitioner?

Certainly these decisions have captured my enthusiastic attention.  Lawyers now have to devote greater care and thought to electronic search, and wake to the empirical evidence establishing the shocking shortfalls of keyword search in unstructured ESI collections.  The days of “let’s try these search terms and see what happens” are numbered.  Queries that will be run across mushrooming collections must pass muster in terms of noisiness, ambiguity, potential for misspelling, affinity to stemming, synonyms, slang, acronyms, IM-speak and other criteria unfamiliar to a profession that prides itself on precise expression.  Lawyers need to embrace concepts of “precision,” “recall” and “sampling” with the same fervor we once brought to the Statute of Frauds and the Rule Against Perpetuities.

Currently, lawyers on both the north and south sides of the docket are the unjust beneficiaries of slipshod search.  Requesting parties benefit from the economic leverage attendant to costly-yet-unavailing fishing expeditions while counsel for producing parties mint obscene pyramidal profits reviewing mountains of electrochaff.  Despite all the vitriol, rarely does either side’s counsel set out to exploit flawed searches.  It’s mostly blissful ignorance at work, coupled with little incentive to fix what’s broken.  Accordingly, Judges like Facciola and Grimm are picking up the baton and running with it.  It’ll be a long, tough race—and not every jurist will head for the tape—but I applaud those who’ve left the blocks!

Search demands nuance, discipline and scientific method.  Prepare to routinely test queries against sample collections, as soon that practice will be as commonplace as DNA testing in paternity cases.

2) What can e-discovery technology providers do to help?

At the risk of appearing ungracious, I can’t help but note that vendors eat at the same gluttonous table as lawyers, and vendor marketing is often so much snake oil.  Until the EDD vendor community takes a longer view of the market, stops building businesses for acquisition and starts building them to last, I don’t think they can be of much help.  The industry should stop pretending their processes and software are “proprietary” and touting their secret sauces.  Instead, how about delivering consistent, predictable service and pricing delivered by experienced, reliable and unflinchingly honest, genuinely knowledgeable personnel who welcome the chance to help lawyers understand this stuff.  If employees stayed around more than six months, that would be nice, too.

3) You recently participated in a new track at LegalTech West called FutureTech.  For those who missed it or the follow-up podcasts, what’s an emerging e-discovery trend that you think might take people by surprise?

Several come to mind.  Mediated meet-and-confer, for example.  The cost of a failed EDD effort can dwarf the amount in controversy, so it makes sense to turn to neutral, technically adept intermediaries to help resolve nettlesome questions, of scope, search, forms of production and cost sharing.  Folks just behave better when company comes.  I also foresee divergence between discovery and the other traditional phases of litigation.  We may see entirely different teams handle discovery in a zealous but non-confrontational manner, leaving the scorched earth stuff to others.

Another development that will sneak up on most lawyers is the growing marginalization of text.  As natural interfaces emerge—where you will talk or gesture to your computers—and as communication gets more real time and visual, words will manifest conduct less frequently.  Take YouTube.  I don’t get it—to me, it’s silly and boring—but it’s rich and exciting to my kids…and text is tertiary.

Something else that will change is where we look for evidence.  If you were pursuing discovery against a teenager, where would you go to locate their most revealing ESI?   Social networking (virtualized storage)?   Cell phones and laptops (portable devices)?   Gaming devices (alternate platforms)?  In ten years, don’t imagine they won’t favor and extend the tools they grew up with.

Data is the ultimate portable commodity, so it’s odd we don’t take our computing environments with us. We will. If desktop machines survive, they will be little more than screens with network connectivity temporarily hosting the virtual identities we carry in our pockets or store online. Local hard drives will be an increasingly irrelevant place to search for files as EDD turns to personal storage devices and online storage.

Other trends lawyers may not foresee: People will retain much more data as there will be little incentive and less time to make it go away. “Cheaper to keep her” will be how most of us deal with data.  Location data will be routinely tracked by many devices with GPS functionality on and about our person, so this will become a new and useful evidence stream.  Virtual machines will be used as forms of production.  Local storage will give way to cloud storage.  Hey, I could do this one all day!

4) You have an extensive background in both e-discovery and computer forensics. Do you see a convergence, or will they remain largely separate worlds from a process and technology perspective?

I see convergence already.  “Forensically sound” practices are creeping into EDD harvest and traditionally rigid approaches to disk forensics are being challenged by the practical realities of immense volume and mission-critical operations.   We see the growth of “live” forensics, hash values displacing Bates numbers and operating systems allowing more and more deleted information to be easily resurrected.

The tools and techniques of each discipline are also converging.  But there will remain a distinction between the two flowing from the unique ability of a skilled forensics examiner to distill the bits and bytes into a compelling tale of human strength or frailty.  It’s painfully easy to misread the significance of digital footprints.  There’s a component of science and art to computer forensics that will insure its distinction and growth.

We face convergent challenges, too.  In both forensics and EDD, the lure of lucre pulls in people who really ought to be doing something less harmful.  Lives, liberty, fortunes, and careers hinge on some computer forensic examinations; yet, some schools and tool sellers promote the notion that you can learn what you need to know over a long weekend.  Just as many copy shops decided they were e-discovery experts one dark night, a lot of poorly trained, incurious and careless forensic examiners are popping up all over.  I’m frankly appalled by some of what I see out there.   Where I hope we ultimately converge is a high standard of professionalism and proven expertise.

5) Finally, the question on the mind of every loyal “Ball in Your Court” reader: Which court is it — basketball, tennis, or volleyball?

I’ve never been much for team sports, but if I have to choose, I opt for the one played on the beach by fit, bikini-clad women.  I may be a hopeless nerd, but I’m not stupid.