Archive for the ‘ediscovery’ Category

Why Transparent Search In E-Discovery Is The Answer To Victor Stanley

Tuesday, August 26th, 2008

In my last post, I discussed how the “black box” design of enterprise search engines makes it challenging to defensibly use keyword search in e-discovery and follow Judge Grimm’s guidance in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008).  In Victor Stanley, Judge Grimm notes that because keyword search technology is prone to producing over- and under-inclusive results, attorneys using keyword search should adopt one of two approaches: either collaborate with the opposing party to agree on keyword search methodology, or utilize best practices that demonstrate they have taken reasonable measures to reduce over- and under-inclusiveness.  However, the black box search technologies that are used in e-discovery today make following this guidance difficult.  They can’t reduce under-inclusiveness without increasing over-inclusiveness.  And they make it expensive to utilize collaborative or best practices methodologies including testing, sampling, refining and documenting searches.  All of which begs an obvious question: what can be done to improve search for e-discovery?

In my opinion, the answer is simple: e-discovery search needs to become more transparent.  Instead of being forced to feed one search query at a time into a “black box” search engine and then getting results  with no idea how those results were generated, lawyers and litigation support professionals need technology that provides them with greater visibility into the search process. They need to understand how the results were obtained, so they can reduce both the over- and under-inclusiveness of keyword search, and easily follow Judge Grimm’s advice to improve the defensibility of their search methodology.

A transparent search solution should have four key elements:

  1. Transparent query expansion. Query expansion is the process by which search engines take the query that the user submitted and expand or convert it into a new and improved form.  Wildcard, stemming, concept and fuzzy searches all follow this query expansion process.  For example, the search “divers*” would be expanded to search for all the words that start with “divers” in the data set, such as “diverse,” “diversity,” “diversion,” “diversification,” etc.  In transparent search, query expansion would be exposed to users, allowing them to include or exclude expanded keywords. To continue with the previous example, a user who is searching for documents related to diversity would then have the ability to exclude false positive expanded terms, such as “divers,” “diversion,” and “diversification,” from the search.  Making query expansion transparent can significantly reduce the over-inclusiveness of keyword search.  It also makes it practical to use technologies, such as concept and fuzzy search, that have not been used to date because of their complexity and tendency to produce massively over-inclusive results.
  2. Multiple query support. When a search contains multiple keyword queries, such as “hiring” and “interview,” transparent search should provide visibility into the results for each individual query as well as the combination of all the queries. For example, with the search “hiring OR interview,” users should have separate visibility into the results for “hiring” and “interview” as well as “hiring OR interview.”  They should know that out of the 100 documents that match “hiring OR interview,” only 5 match “interview” and 95 match “hiring.”  This kind of visibility is critical if you want to either collaborate or follow search testing, sampling, and refinement best practices when there are a large number of queries.
  3. Rapid sampling. Transparent search should support the ability to rapidly sample the results from all of the individual queries, such as “hiring” and “interview”, contained within a search. It should also be easy to take a random sample of non-matching documents in order to assess whether one or more searches have identified as many of the relevant documents as possible.  As Judge Grimm states in Victor Stanley when assessing keyword searches used to find privileged documents, “The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be in order to arrive at a comfort level that the categories are neither over-inclusive nor under-inclusive.”
  4. Automated documentation. Transparent search technology needs to document all aspects of the search process including (but not limited to) any keyword that has been excluded during transparent query expansion, the combined results of a search containing multiple individual queries, and the results for each of the individual queries within that search.  Automatically documenting the search methodology used and the results obtained is critical so that users can “show their work” if their search methodology is ever called into question.
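
To make these elements concrete, here is a minimal Python sketch of transparent query expansion, sampling, and automated documentation working together. It is only an illustration under stated assumptions: the toy corpus, the function names (`expand`, `search`, `transparent_search`) and the shape of the audit record are all hypothetical, and a real e-discovery product would work against a proper index rather than a dictionary of strings.

```python
import fnmatch
import json
import random

# Hypothetical toy corpus (document id -> text); not taken from any real matter.
DOCS = {
    1: "our diversity hiring program starts in may",
    2: "a diverse panel will interview candidates",
    3: "traffic diversion planned near the plant",
    4: "portfolio diversification memo for the board",
    5: "the divers recovered the equipment",
    6: "quarterly sales figures attached",
}


def tokenize(text):
    """Split text into lowercase words (deliberately simplistic)."""
    return text.lower().split()


def expand(pattern, docs):
    """Return every distinct word in the corpus that matches the wildcard pattern."""
    vocabulary = {word for text in docs.values() for word in tokenize(text)}
    return sorted(word for word in vocabulary if fnmatch.fnmatchcase(word, pattern))


def search(terms, docs):
    """Return the sorted ids of documents containing at least one of the terms."""
    wanted = set(terms)
    return sorted(doc_id for doc_id, text in docs.items() if wanted & set(tokenize(text)))


def transparent_search(pattern, excluded, docs, sample_size, seed):
    """Expand a wildcard, drop excluded terms, search, sample, and document it all."""
    expanded = expand(pattern, docs)
    kept = [term for term in expanded if term not in excluded]
    hits = search(kept, docs)
    non_hits = sorted(set(docs) - set(hits))
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return {
        "query": pattern,
        "expanded_terms": expanded,
        "excluded_terms": sorted(set(excluded) & set(expanded)),
        "searched_terms": kept,
        "hits": hits,
        "hit_sample": sorted(rng.sample(hits, min(sample_size, len(hits)))),
        "non_hit_sample": sorted(rng.sample(non_hits, min(sample_size, len(non_hits)))),
    }


record = transparent_search(
    "divers*",
    excluded={"divers", "diversion", "diversification"},
    docs=DOCS,
    sample_size=2,
    seed=0,
)
# The record doubles as the documentation of how the results were obtained.
print(json.dumps(record, indent=2))
```

In this sketch the user sees all five expanded terms, excludes three, and ends up searching only for “diverse” and “diversity.” The samples of matching and non-matching documents are what a reviewer would inspect to judge over- and under-inclusiveness, and the returned record is the kind of output that lets users “show their work.”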

Benefits of Transparent Search

By addressing the main technology challenges of keyword search, transparent search provides significant benefits to attorneys and litigation support professionals using search for e-discovery. First, parties that adopt transparent search can improve the defensibility of their e-discovery search practices. By enabling iterative testing, sampling and refinement, transparent search allows users to adopt the approaches recommended by Judge Grimm when it was previously impractical to do so.  At the end of the day, this means less risk.

Second, the use of transparent search can substantially reduce downstream production and review costs by removing false positives. For example, it is not uncommon for certain wildcard searches to generate results where 20-40% of the included documents are false positives that can be removed by transparent query expansion.  This can result in thousands of dollars of savings on a single search query.

Finally, transparent search can dramatically reduce the time and cost required to complete the search and culling stage of e-discovery. Currently, it can take hundreds of hours to run a significant number of searches one at a time, document the results of each search, and sample and refine each individual query. With transparent search, running multiple queries and documenting each of the individual results takes minutes. Sampling each of the individual queries takes seconds.

When it comes to e-discovery search, it’s important to recognize that there are no “silver bullets.”  Search will remain an imperfect science with the possibility of over- and under-inclusive results.  But equally, there is no doubt that search remains the best solution for reducing the vast quantities of electronic information that are a part of every e-discovery process down to a reasonable level for human review. While attorneys and litigation support professionals can’t completely remove the imperfections of keyword search, they can, with transparent search, take action to minimize the impact of these imperfections and defensibly meet the requirements of new case law.  In doing so, they will be able to turn their attention to where it should be: the substance of the case.

Judge Grimm, Victor Stanley, And The Problem Of “Black-Box” E-Discovery Search

Friday, August 22nd, 2008

Judge Paul Grimm’s recent opinion in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008) provides valuable guidance on one of the most important issues in e-discovery: how to conduct keyword searches in a defensible manner given that keyword searches are prone to produce over- and under-inclusive results.  The ruling suggests one of two approaches: either producing parties should adopt a “collaborative” approach to conducting keyword searches, whereby each party agrees on a search methodology; or, they should use a “best practices” approach, such as the one suggested by Sedona, where the producing party tests, samples, and iteratively refines searches so that they can demonstrate they have taken reasonable measures to reduce over- and under-inclusive results.

While the guidance is clear, following the guidance in practice is very difficult.  The primary reason for this is that the search technology being used in e-discovery today is not up to the task.  Specifically, today’s search technology suffers from three problems:

  1. The over- and under-inclusive tradeoff. Many technologies have been developed to address the tendency of keyword searches to miss relevant documents and produce under-inclusive results.  Wildcard and stemming technology has been developed in order to address the issue of finding common word variations in specified keywords.  Concept search has been designed to find documents containing words with similar meanings to the keywords in a search.  And fuzzy search technologies have been put in place to find misspellings of words. However, all of these suffer from the same problem: they produce too many non-relevant or “false positive” documents thus driving up the cost of review. For example, if someone runs the wildcard search “divers*”, then he or she not only gets the desired documents containing “diverse” and “diversity”, but also gets a large number of false positive documents containing “diversion”, “diversification”, and so on.  In the case of concept and fuzzy search, the problem is so great that these technologies to date have rarely been used in e-discovery.
  2. Too expensive to test, sample and refine searches. Today’s search technologies are largely designed to run one search at a time, not the dozens of searches that are typical in e-discovery. As a result, anyone trying to follow the best practices of testing, sampling, and refining each search will find themselves missing deadlines and running over budget because it takes so long. This also makes collaboration with the opposing party close to impossible, since there’s little time to iterate on, and agree upon, a set of keyword searches.
  3. Manual documentation. It’s not enough for producing parties to use best practices; they also have to document them so that they can “show their work” to the court. Currently, documenting the search refinement process is mostly manual, with the result that it is either done inadequately or not at all.

The reason why the search technology used for e-discovery has these problems is surprisingly simple: it’s because the technology was not designed for e-discovery in the first place. Rather, it was built for enterprise search, and was only later repurposed towards e-discovery.

The “Black Box” Of Enterprise Search

The core issue is that enterprise search technology has been designed to be a “black box”. Users enter a single search query into one end, and get results at the other, with no visibility into what happens in between. Going back to our previous example, when a user searches for “divers*” intending to find documents related to “diversity” or “diverse”, enterprise search engines give the user no visibility into the crucial step of query expansion and how it expands the search query into relevant and non-relevant terms like “diversion” and “diversification”. As a result, the user has no ability to minimize the false positives.

In the same vein, when a user enters multiple queries into a “black box” enterprise search engine, all of the queries run as a single search, and the user has no visibility into which results are associated with which query. For example, a user who searches for “hiring OR interview” will get the results for the combination of the queries “hiring” and “interview”. He or she won’t know that only 5 documents contained “hiring” while 100 documents contained “interview.”  This limitation makes analyzing, sampling and refining searches costly and time consuming.
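
The missing visibility is easy to picture in code. The short Python sketch below runs an OR search and, instead of returning only the combined result, also reports which documents each individual query matched. The corpus and the function names (`matching_ids`, `or_search_with_breakdown`) are hypothetical and chosen purely for illustration; they do not describe how any particular search product works.

```python
# Hypothetical toy corpus (document id -> text); not taken from the post.
DOCS = {
    1: "hiring freeze announced for the quarter",
    2: "interview schedule for monday",
    3: "the hiring manager will interview the candidate",
    4: "lunch menu for friday",
    5: "new hiring targets approved",
}


def matching_ids(term, docs):
    """Ids of documents whose text contains the term as a whole word."""
    return {doc_id for doc_id, text in docs.items() if term in text.lower().split()}


def or_search_with_breakdown(terms, docs):
    """Run an OR search and report hits per individual query as well as combined."""
    per_query = {term: matching_ids(term, docs) for term in terms}
    combined = set().union(*per_query.values())
    # Documents matched by more than one query explain why the per-query
    # counts can add up to more than the combined count.
    overlap = {d for d in combined if sum(d in ids for ids in per_query.values()) > 1}
    return {
        "combined": sorted(combined),
        "per_query": {term: sorted(ids) for term, ids in per_query.items()},
        "overlap": sorted(overlap),
    }


report = or_search_with_breakdown(["hiring", "interview"], DOCS)
print("hiring OR interview:", len(report["combined"]), "documents")
for term, ids in report["per_query"].items():
    print(f"  {term}: {len(ids)} documents {ids}")
print("  matched by both:", report["overlap"])
```

A black box would hand back only the combined list. With the breakdown, a user can see at a glance which query is pulling in most of the documents and sample that query on its own.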

That’s not to say that enterprise search products like Autonomy or Endeca are flawed. Far from it.  Their “black box” design works exceedingly well for the simple and quick queries that people want to run across the enterprise for general business purposes. If a sales manager is looking for a single proposal for her meeting the following day, then she doesn’t care how the search was performed or if it’s over-inclusive.  She’s only interested in the first page of relevant results, and for that use case enterprise search engines do a great job.

But e-discovery is a whole different world.  In e-discovery, users typically must review every single document in the search results, not just the most relevant ones.  As a result, over-inclusive searches can dramatically increase the costs of downstream production and review.  And under-inclusive searches raise the issue of defensibility.  Finally, e-discovery users have to run a lot of search queries and understand which documents are associated with each of those queries.

So, going back to the original problem, if current search technologies cannot help lawyers and litigation support professionals follow Judge Grimm’s guidance and address the “well-known limitations” of keyword search, what can? That will be the subject of my next post.

What Is FRCP Compliance?

Wednesday, August 20th, 2008

There have been several recent press releases from enterprise software companies proclaiming FRCP “compliance,” which certainly sounds appealing.  But, the use of that term begs the question:  how does a search technology (or methodology) become FRCP “compliant” and is that goal even possible?

IBM launched the first salvo:

“The software will allow companies to move from scattered, point-solution approaches to a disciplined approach that controls electronic information, helps support Federal Rules of Civil Procedure (FRCP) compliance,…”

And, Autonomy quickly followed suit:

“The Autonomy pan-enterprise search platform automates the retrieval, processing, and management of all information throughout a global organization irrespective of languages, operating systems, and file types, avoiding non-FRCP compliant search techniques.”

I’m more than tolerant of both puffery and marketing-speak (though woe to those who forward such releases to Monica Bay), but this notion of “FRCP compliance” seems to take advantage of an already bombarded buying public, who have likely grown weary of FRCP articles, CLEs, and maybe even blog posts.  Nevertheless, it seems useful to really tease out what the FRCP means and does not mean in relationship to e-discovery and enterprise search.

So, in an attempt to debunk this “compliance” myth, I thought I’d devote this blog post to demystifying some of the inaccurate notions about the FRCP.

Federal First

Initially, it’s important to note that the Rules only apply to litigation within the United States Federal court system.  State court litigation, international lawsuits, arbitrations and administrative actions (just to name a few) aren’t under the aegis of the Rules.  While it’s true that certain state courts (Minnesota for example) have selectively adopted the new discovery provisions, most have not.  So, the first step is to check your venue.  Then, assuming the Rules do apply because your organization is in Federal litigation, the impact, while still not crystal clear, does take on more definition.

Relevancy Filters

As a starting place, the discovery process (as part of litigation) is fundamentally limited by Rule 26 to information (electronic and otherwise) that is “relevant” to the case at hand (i.e., “relevant to the claim or defense of any party”).  This distinction is critical because for the most part it prevents the responding party from having to cast a company wide net for all data, a task envisioned by many content management systems.   Certainly, the ability of certain systems to access all user created data is valuable when searching for relevant data, but there are many ways to skin that cat.

No Express Retention or Preservation Duties

Legions of articles proclaim that the amended Rules create wholly new duties to retain information in general, as well as infusing new duties to preserve electronic data once litigation is anticipated.  In fact, however, the new Rules expressly disavow creating truly new retention or preservation duties.  While it is undoubtedly a good practice to have a retention policy, given the welter of statutes and regulations that do create retention duties, the Rules do not mandate that a company create one ahead of litigation.

What is true, however, is that the new Rules have powerful implications for preservation once litigation is likely because of the requirements to understand, negotiate and produce relevant information early in the litigation process.  Under the new Rules, it is critical to be able to identify and retain potentially relevant data once litigation is filed (or is “reasonably likely”).  And yet, the burden of placing a legal “hold” on data, while often significant, certainly can be achieved without a formal document retention/deletion policy.  Again, the litigation “trigger” is key.

“Records” Aren’t the Focus

Continuing on this theme, but in a slightly different vector, there are differing opinions about the impact that the Rules have on “business records.”  This issue is nebulous since it is easy to confuse potentially relevant data corresponding to litigation with “business records,” which are often used in two different contexts.  Initially, there is the “business records” exception to the hearsay rule, which is quite specific and affects the admissibility of evidence in court.

The second, broader definition applies to organizations as they attempt to define a records management program to meet the numerous state, local and Federal mandates.  Commonly, as part of this complex initiative, companies will create records retention programs that specifically define official “records,” unofficial “records,” “non-records,” as well as specific retention periods for certain types of records.  Once the company’s records protocol is put into place there may be some downstream nexus with the Rules, but it won’t manifest itself until Federal court litigation arises, as described above.   The most common intersection occurs when a records retention policy prescribes a deletion event that contradicts the legal “hold” requirements for a record that is likely to be relevant to litigation.

In sum, the foregoing describes the role the FRCP plays in Federal court litigation.  It should be clear that the important, yet relatively narrow, use cases do not include any general compliance mandate in the absence of specific litigation.  I think it’s important to separate myth from reality when it comes to understanding how and when the revised Rules really do come into play.  Failure to do so can create an unpleasant scenario where your organization will either under- or over-prepare for these important litigation guidelines.

The Sleeping Giant Awakes? IBM Announces eDiscovery Manager

Thursday, August 14th, 2008

On August 5, IBM announced eDiscovery Manager, which it says “enables organizations to better control the eDiscovery process by bringing key eDiscovery tasks in house. This helps clients more easily manage electronically stored information; provide earlier insights into collected evidence; and prioritize downstream evidence review, analysis and production.”

Taken at face value, this is potentially very significant. IBM is the world’s second-largest software company and its Lotus Notes/Domino email system is used by approximately one-third of corporate America. So I decided to dig a little deeper to understand exactly what IBM’s new product can do, and which customers it can best serve.

Product Capabilities

The first and most important thing to understand about eDiscovery Manager is that, before you can use it, you must first buy and install IBM’s unstructured data stack. This comes in two forms: you can either deploy IBM Content Manager and IBM CommonStore; or, you can choose Filenet P8 and Filenet Email Manager. Either way, the deployment time is months and typically involves an army of consultants.

For data in IBM’s content management solutions, eDiscovery Manager enables users to search and export. There is no review functionality, no tagging, and no analysis. The limitations in functionality stem from eDiscovery Manager not really being a new product; rather it’s a rewrite of an old product (eMS or email search) with a new AJAX-based user interface.

Target Customers

The best customers for eDiscovery Manager are those enterprises which have large amounts of data in Filenet P8 / Email Manager or IBM Content Manager / CommonStore. For those enterprises, it will be a useful tool, which IT departments will use to identify and collect data, just as they use utilities like ExMerge for Microsoft Exchange and Robocopy for file shares. Most companies will then choose to process, review and analyze data from all these different repositories with an e-discovery solution.

To my mind, what’s more significant than the announcement of eDiscovery Manager is the fact that IBM is waking up to the opportunity in e-discovery. There’s no doubting the company’s reach and technical prowess, and it will be interesting to watch what future products (e.g., “IBM eDiscovery Review”?) are in the works.

Socha-Gelbmann Survey For 2008 Highlights Shifting Landscape In E-Discovery Software

Thursday, July 24th, 2008

Yesterday, George Socha and Tom Gelbmann published summary results for their 2008 EDD survey. George and Tom gathered self-reported data from 85 e-discovery service providers and 40 e-discovery software companies. To help vendors resist the temptation to “exaggerate” their accomplishments, they then cross-referenced the responses against independent surveys submitted by 29 law firms and 19 corporations, and applied a healthy dose of their own good judgment. The outcome, which they will publish in full next month, is a great snapshot of the industry, and probably the most objective ranking of e-discovery vendors that you can find.

By comparing this year’s results to the 2007 survey, you get a sense for how much has changed in the e-discovery world over the past 12 months:

Top E-Discovery Software Companies

[Figure: ranking chart of the top e-discovery software companies]

Note: arrows show change to rankings from last year’s Socha-Gelbmann Survey

Autonomy and Clearwell move up to the Top 5, overtaking Attenex and CT Summation which slip back to the second tier. There are also 3 new names ranked 6 through 10 (Epiq, iConect and Symantec) who displace Cataphora, Doculex, ISYS, and Oracle, none of whom even make it into the top 15. In other words, 70% of the rankings have changed since last year.

If a litigation support manager were to focus only on the Top 5 in making her e-discovery software decision, she would have a choice of some very different solutions. Autonomy positions itself as a high-end (expensive) platform for corporations, while Lexis offers a comprehensive toolset for law firms. Guidance and Clearwell are complementary in that both provide best-of-breed solutions for parts of the EDRM model: Guidance is the leader in collection and preservation, while Clearwell is the leader in processing, analysis and review. Finally, FTI takes a services-based approach which centers around RingTail, its hosted review application.

Looking lower down the list, there were some other interesting results, primarily around which companies were NOT ranked. Kazeon made it into the third tier (ranked 11-15) whereas StoredIQ, its main competitor, did not. Nor did Recommind break into the rankings, despite making a major push into e-discovery from knowledge management over the past year. But the most striking absentees are PSS Systems and Exterro, which have pioneered litigation hold management for Fortune 100 companies. I can only guess that they cover too much of a niche market to warrant inclusion in an industry-wide report.

Top E-Discovery Service Providers

In contrast to the world of software, e-discovery services saw much less movement in this year’s rankings:

[Figure: ranking chart of the top e-discovery service providers]

Note: arrows show change to rankings from last year’s Socha-Gelbmann Survey

There was only one change to the top 5: Fios moved up, displacing Guidance which plummeted 10-20 places down to a 16-25 ranking. In addition, there were two new players in the top 10, Epiq and Huron, who edged out Electronic Evidence Discovery and Ernst & Young.

Conclusion

Changes to the software rankings reflect broader changes in the e-discovery market. As e-discovery has moved in-house, corporations have become a major driver of purchase decisions that were previously left to law firms. Many software companies, such as Attenex, have struggled to make this transition, while others, such as Clearwell, have capitalized on it. There has been no such change in the service provider world and, as a result, the rankings are relatively stable.

It will be interesting to see what happens next year. Every other software space is dominated by a small number of players, like Oracle for databases or VMware for virtualization. If the same is true for e-discovery, then we can expect many fewer changes to the software rankings in future surveys as the leaders pull away from the pack.