Judge Grimm, Victor Stanley, And The Problem Of “Black-Box” E-Discovery Search

by Will Uppington on August 22nd, 2008

Judge Paul Grimm’s recent opinion in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008) provides valuable guidance on one of the most important issues in e-discovery: how to conduct keyword searches in a defensible manner, given that keyword searches are prone to produce over- and under-inclusive results. The ruling suggests one of two approaches. Producing parties should either adopt a “collaborative” approach, in which the parties agree on a search methodology, or use a “best practices” approach, such as the one suggested by the Sedona Conference, in which the producing party tests, samples, and iteratively refines its searches so that it can demonstrate it has taken reasonable measures to reduce over- and under-inclusive results.

While the guidance is clear, following the guidance in practice is very difficult.  The primary reason for this is that the search technology being used in e-discovery today is not up to the task.  Specifically, today’s search technology suffers from three problems:

  1. The over- and under-inclusive tradeoff. Many technologies have been developed to address the tendency of keyword searches to miss relevant documents and produce under-inclusive results. Wildcard and stemming technologies find common variations of the specified keywords. Concept search finds documents containing words with meanings similar to the keywords. And fuzzy search finds misspellings. However, all of these suffer from the same problem: they return too many non-relevant, or “false positive,” documents, which drives up the cost of review. For example, someone who runs the wildcard search “divers*” gets not only the desired documents containing “diverse” and “diversity” but also a large number of false positives containing “diversion”, “diversification”, and so on. In the case of concept and fuzzy search, the problem is so great that these technologies have rarely been used in e-discovery to date.
  2. Too expensive to test, sample and refine searches. Today’s search technologies are largely designed to run one search at a time, not the dozens of searches that are typical in e-discovery. As a result, anyone trying to follow the best practices of testing, sampling, and refining each search will find themselves missing deadlines and running over budget because it takes so long. This also makes collaboration with the opposing party close to impossible, since there’s little time to iterate on – and agree upon – a set of keyword searches.
  3. Manual documentation. It’s not enough for producing parties to use best practices; they also have to document them so that they can “show their work” to the court. Currently, documenting the search refinement process is mostly manual, with the result that it is done either inadequately or not at all.
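
To make the second and third problems concrete, here is a minimal sketch of what one test-sample-document iteration could look like if it were automated. Everything in it is an assumption made for illustration: the `SearchLog` class, the `sample_hits` helper, the 1,000 stand-in document IDs, and the reviewer's result of 20 relevant documents out of 50 are hypothetical and do not describe any particular e-discovery product.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SearchLog:
    """Hypothetical audit trail: one entry per tested search iteration."""
    entries: list = field(default_factory=list)

    def record(self, query, hits, sample_size, relevant_in_sample):
        # Estimated precision = share of the reviewed sample judged relevant.
        precision = relevant_in_sample / sample_size if sample_size else 0.0
        self.entries.append({
            "query": query,
            "hits": hits,
            "sample_size": sample_size,
            "relevant_in_sample": relevant_in_sample,
            "estimated_precision": round(precision, 2),
        })


def sample_hits(hit_ids, sample_size, seed=0):
    """Draw a reproducible random sample of hit document IDs for manual review."""
    rng = random.Random(seed)
    k = min(sample_size, len(hit_ids))
    return rng.sample(sorted(hit_ids), k)


log = SearchLog()
hits = list(range(1000))        # stand-in for the document IDs a search returned
sample = sample_hits(hits, 50)  # documents a reviewer would code by hand

# Assumed outcome of the manual review: 20 of the 50 sampled documents were relevant.
log.record("divers*", hits=len(hits), sample_size=len(sample), relevant_in_sample=20)
```

The fixed seed is a deliberate choice: it lets the same sample be drawn again later, which matters if the sampling itself has to be explained to the court. Each log entry then records what was searched, how many documents it returned, and how the sample fared.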

The reason the search technology used for e-discovery has these problems is surprisingly simple: it was not designed for e-discovery in the first place. Rather, it was built for enterprise search and only later repurposed for e-discovery.

The “Black Box” Of Enterprise Search

The core issue is that enterprise search technology has been designed to be a “black box”. Users enter a single search query at one end and get results at the other, with no visibility into what happens in between. Going back to our previous example, when a user searches for “divers*” intending to find documents related to “diversity” or “diverse”, the enterprise search engine gives the user no visibility into the crucial step of query expansion, in which it expands the query into both relevant terms and non-relevant ones like “diversion” and “diversification”. As a result, the user has no ability to minimize the false positives.
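
As a rough illustration of the visibility that is missing, the sketch below expands “divers*” against a tiny made-up corpus, reports how many documents each expanded term matches, and then reruns the search with the unwanted terms excluded. The corpus, the function names `expand_wildcard` and `search_terms`, and the simple word tokenizer are all assumptions for the example, not a description of how any real search engine works.

```python
import re
from collections import Counter

# Hypothetical toy corpus; document IDs and text are invented for illustration.
DOCS = {
    1: "Our diversity initiative starts in May",
    2: "We need a more diverse candidate pool",
    3: "The diversion of funds was flagged by audit",
    4: "Portfolio diversification reduces risk",
    5: "Diversity training is scheduled for Friday",
}


def tokenize(text):
    """Split text into a set of lowercase alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand_wildcard(pattern, docs):
    """Map each term matched by a trailing-* pattern to its document count."""
    prefix = pattern.rstrip("*").lower()
    counts = Counter()
    for text in docs.values():
        for word in tokenize(text):
            if word.startswith(prefix):
                counts[word] += 1
    return dict(counts)


def search_terms(terms, docs):
    """Return sorted IDs of documents containing at least one of the whole words."""
    wanted = set(terms)
    return sorted(doc_id for doc_id, text in docs.items() if wanted & tokenize(text))


# Step 1: expose the expansion instead of hiding it.
expansion = expand_wildcard("divers*", DOCS)

# Step 2: the user drops the terms that are clearly off-topic and reruns the search.
keep = [term for term in expansion if term not in {"diversion", "diversification"}]
refined_hits = search_terms(keep, DOCS)
```

With the expansion visible, the two off-topic terms can be dropped before anyone has to review the documents that contain them.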

In the same vein, when a user enters multiple queries into a “black box” enterprise search engine, all of the queries run as a single search, and the user has no visibility into which results are associated with which query. For example, a user who searches for “hiring OR interview” will get the combined results of the queries “hiring” and “interview”. He or she won’t know that only 5 documents contained “hiring” while 100 documents contained “interview.” This limitation makes analyzing, sampling, and refining searches costly and time-consuming.
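
The kind of per-query breakdown described above can be sketched in a few lines. This is only an illustration under stated assumptions: the five-document corpus is invented, and `hits_for` and `per_query_report` are hypothetical helper names rather than features of any existing product.

```python
# Hypothetical toy corpus; document IDs and text are invented for illustration.
DOCS = {
    1: "hiring freeze announced for Q3",
    2: "interview schedule for Monday",
    3: "please confirm the interview room",
    4: "interview feedback on the hiring panel",
    5: "lunch menu for Friday",
}


def hits_for(term, docs):
    """Return the set of document IDs whose text contains the term as a whole word."""
    return {doc_id for doc_id, text in docs.items() if term in text.lower().split()}


def per_query_report(terms, docs):
    """Run each term separately and report total and unique hits per term."""
    per_term = {term: hits_for(term, docs) for term in terms}
    combined = set().union(*per_term.values())
    report = {}
    for term, ids in per_term.items():
        # Documents matched by any of the other terms in the OR query.
        others = set().union(*(v for k, v in per_term.items() if k != term))
        report[term] = {"hits": len(ids), "unique_hits": len(ids - others)}
    return combined, report


combined, report = per_query_report(["hiring", "interview"], DOCS)
```

Here `combined` is what a black-box “hiring OR interview” search would return, while `report` adds the part the user never sees: how many documents each term contributed, and how many of those no other term would have found.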

That’s not to say that enterprise search products like Autonomy or Endeca are flawed. Far from it. Their “black box” design works exceedingly well for the simple and quick queries that people want to run across the enterprise for general business purposes. If a sales manager is looking for a single proposal for her meeting the following day, then she doesn’t care how the search was performed or whether it’s over-inclusive. She’s only interested in the first page of relevant results, and for that use case enterprise search engines do a great job.

But e-discovery is a whole different world.  In e-discovery, users typically must review every single document in the search results, not just the most relevant ones.  As a result, over-inclusive searches can dramatically increase the costs of downstream production and review.  And under-inclusive searches raise the issue of defensibility.  Finally, e-discovery users have to run a lot of search queries and understand which documents are associated with each of those queries.

So, going back to the original problem, if current search technologies cannot help lawyers and litigation support professionals follow Judge Grimm’s guidance and address the “well-known limitations” of keyword search, what can? That will be the subject of my next post.


Responses to “Judge Grimm, Victor Stanley, And The Problem Of “Black-Box” E-Discovery Search”

  1. Daniel Tunkelang Says:

    Will, great post! But, as Endeca’s Chief Scientist, I’d like to correct a misunderstanding. Unlike most search engines, and actually in contrast to most of the approaches described in the information retrieval literature, we emphasize the transparency of query processing. We do so because our emphasis is on supporting exploratory search, where we expect that a user will progressively refine a query as he or she learns more about the available information and even about his or her information needs.

    As Nicholas Belkin and other library and information science researchers have demonstrated through user studies, transparency is key to supporting effective interaction. If you’d like to learn more about this research, you might start with this classic paper: http://home.cc.gatech.edu/nance/uploads/5/p205-koenemann.pdf

    In the spirit of transparency, I encourage you to check out my blog, The Noisy Channel: http://thenoisychannel.blogspot.com/

    While I can’t reveal all of Endeca’s secrets there, I do try to communicate the vision of transparent, interactive information retrieval that drives everything we do.

  2. Will Uppington Says:

    Daniel, thanks for your comment and for the additional information on Endeca’s approach to search. I have known that Endeca provides some visibility into search results ever since I first used your product in 2004. It’s good to see that you are emphasizing visibility in query expansion as well. I agree with you that enterprise search can also benefit from moving away from a black box approach to a more transparent approach. As the research you point to demonstrates, transparency in search is valuable in helping to improve the relevancy of results by, for instance, suggesting additional keywords to the user.

    However, critical differences remain. For example, in e-discovery, transparency is also valuable in reducing false positives. This is much less of an issue in enterprise search because false positives are simply not as costly as they are in e-discovery. In addition, there is less of a need to provide visibility into multiple queries run as a single search. Enterprise search users still typically want to run one query at a time and not the dozens of queries that e-discovery searchers must run.

    Finally, there are still limits as to how much transparency enterprise users will want or use. Many of these users are unwilling to spend much additional time to reduce over- and under-inclusive results; it depends on how important their search is and what their alternative is for finding the information. When the alternative is to call someone who knows, their patience can be limited. E-discovery users, on the other hand, are much more interested in improving their results because the payback is so significant. It’s not every day that you can save thousands of dollars by simply improving the results of a keyword search.


