Posts Tagged ‘TREC’

Q&A With Predictive Coding Guru, Maura R. Grossman, Esq.

Tuesday, November 13th, 2012

Can you tell us a little about your practice and your interest in predictive coding?

After a prior career as a clinical psychologist, I joined Wachtell Lipton as a litigator in 1999, and in 2007, when I was promoted to counsel, my practice shifted exclusively to advising lawyers and clients on legal, technical, and strategic issues involving electronic discovery and information management, both domestically and abroad.

I became interested in technology-assisted review (“TAR”) in the 2007/2008 time frame, when I sought to address the fact that Wachtell Lipton had few associates to devote to document review, and contract attorney review was costly, time-consuming, and generally of poor quality.  At about the same time, I crossed paths with Jason R. Baron and got involved in the TREC Legal Track.

What are a few of the biggest predictive coding myths?

There are so many, it’s hard to limit myself to only a few!  Here are my nominations for the top ten, in no particular order:

Myth #1:  TAR is the same thing as clustering, concept search, “find similar,” or any number of other early case assessment tools.
Myth #2:  Seed or training sets must always be random.
Myth #3:  Seed or training sets must always be selected and reviewed by senior partners.
Myth #4:  Thousands of documents must be reviewed as a prerequisite to employing TAR; therefore, it is not suitable for smaller matters.
Myth #5:  TAR is more susceptible to reviewer error than the “traditional approach.”
Myth #6:  One should cull with keywords prior to employing TAR.
Myth #7:  TAR does not work for short documents, spreadsheets, foreign language documents, or OCR’d documents.
Myth #8:  TAR finds “easy” documents at the expense of “hot” documents.
Myth #9:  If one adds new custodians to the collection, one must always retrain the system.
Myth #10:  Small changes to the seed or training set can cause large changes in the outcome, for example, documents that were previously tagged as highly relevant can become non-relevant. 

The bottom line is that your readers should challenge commonly held (and promoted) assumptions that lack empirical support.

Are all predictive coding tools the same?  If not, then what should legal departments look for when selecting a predictive coding tool?

Not at all, and neither are all manual reviews.  It is important to ask service providers the right questions to understand what you are getting.  For example, some TAR tools employ supervised or active machine learning, which require the construction of a “training set” of documents to teach the classifier to distinguish between responsive and non-responsive documents.  Supervised learning methods are generally more static, while active learning methods involve more interaction with the tool and more iteration.  Knowledge engineering approaches (a.k.a. “rule-based” methods) involve the construction of linguistic and other models that replicate the way that humans think about complex problems.  Both approaches can be effective when properly employed and validated.  At this time, only active machine learning and rule-based approaches have been shown to be effective for technology-assisted review.  Service providers should be prepared to tell their clients what is “under the hood.”
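
To make the supervised/active distinction concrete, here is a deliberately toy sketch in Python (not any vendor’s actual product): a naive word-count “classifier” with a one-shot supervised training step, plus an active-learning step that asks the reviewer to label the document the model is least certain about and then retrains. Real TAR tools use far more sophisticated learners, but the workflow has this general shape.

```python
def train(labeled):
    """Supervised step: tally each word's responsive/non-responsive counts."""
    counts = {}
    for text, responsive in labeled:
        for word in set(text.lower().split()):
            pos, neg = counts.get(word, (0, 0))
            counts[word] = (pos + int(responsive), neg + int(not responsive))
    return counts

def score(model, text):
    """Crude responsiveness score: responsive-word hits minus non-responsive hits."""
    return sum(
        model.get(word, (0, 0))[0] - model.get(word, (0, 0))[1]
        for word in set(text.lower().split())
    )

def active_learning_round(model, unlabeled, labeled, reviewer):
    """Active step: have the reviewer label the least-certain document
    (score nearest zero), then retrain on the enlarged training set."""
    doc = min(unlabeled, key=lambda d: abs(score(model, d)))
    unlabeled.remove(doc)
    labeled.append((doc, reviewer(doc)))
    return train(labeled)

# Illustrative training set (hypothetical documents):
labeled = [("prepay commodity transaction memo", True),
           ("cafeteria lunch menu", False)]
model = train(labeled)
print(score(model, "prepay transaction status"))  # 2
```

Each active-learning round pulls the most ambiguous document, has it coded, and folds the new judgment back into the model, which is why active approaches tend to involve more interaction and iteration than one-shot supervised training.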

What is the number one mistake practitioners should avoid when using these tools?

Not employing proper validation protocols, which are essential to a defensible process.  There is widespread misunderstanding of statistics and what they can and cannot tell us.  For example, many service providers report that their tools achieve 99% accuracy.  Accuracy is the fraction of documents that are correctly coded by a search or review effort.  While accuracy is commonly advanced as evidence of an effective search or review effort, it can be misleading because it is heavily influenced by prevalence, or the proportion of responsive documents in the collection.  Consider, for example, a document collection containing one million documents, of which ten thousand (or 1%) are relevant.  A search or review effort that identified 100% of the documents as non-relevant, and therefore found none of the relevant documents, would have 99% accuracy, belying the failure of that search or review effort to identify a single relevant document.
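
A quick back-of-the-envelope calculation makes the point. This sketch reproduces the hypothetical collection above: a “review” that marks every document non-relevant scores 99% accuracy while finding nothing.

```python
# Hypothetical collection from the text: 1,000,000 documents, 1% relevant.
total_docs = 1_000_000
relevant_docs = 10_000

# Marking every document non-relevant gets all 990,000 non-relevant
# documents right and all 10,000 relevant ones wrong.
true_negatives = total_docs - relevant_docs
true_positives = 0

accuracy = (true_positives + true_negatives) / total_docs
recall = true_positives / relevant_docs

print(f"Accuracy: {accuracy:.0%}")  # 99%
print(f"Recall:   {recall:.0%}")    # 0% -- not one relevant document found
```

This is why recall and precision, not raw accuracy, are the measures to ask service providers about.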

What do you see as the key issues that will confront practitioners who wish to use predictive coding in the near-term?

There are several issues that will be played out in the courts and in practice over the next few years.  They include:  (1) How does one know if the proposed TAR tool will work (or did work) as advertised?; (2) Must seed or training sets be disclosed, and why?; (3) Must documents coded as non-relevant be disclosed, and why?; (4) Should TAR be held to a higher standard of validation than manual review?; and (5) What cost and effort is justified for the purposes of validation?  How does one ensure that the cost of validation does not obliterate the savings achieved by using TAR?

What have you been up to lately?

In an effort to bring order to chaos by introducing a common framework and set of definitions for use by the bar, bench, and vendor community, Gordon V. Cormack and I recently prepared a glossary on technology-assisted review that is available for free download at:  http://cormack.uwaterloo.ca/targlossary.  We hope that your readers will send us their comments on our definitions and additional terms for inclusion in the next version of the glossary.

Maura R. Grossman, counsel at Wachtell, Lipton, Rosen & Katz, is a well-known e-discovery lawyer and recognized expert in technology-assisted review.  Her work was cited in the landmark 2012 case, Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012).

Patents and Innovation in Electronic Discovery

Monday, June 13th, 2011

In the world of technology we live in, a huge amount of benefit is created when people apply well-known techniques to solve problems and create value for the broader community. Such techniques are often the result of painstakingly long and laborious research, driven primarily by academic institutions, with private industry either funding that research directly or incorporating it into its own work. When the industry as a whole recognizes a methodology, it gains popular usage.

In information retrieval, searching and retrieving relevant content from unstructured text has been a vexing problem, and we’ve had decades of the brightest minds applying their collective intelligence and the rigors of peer review to validate and establish the most effective ways to solve a retrieval problem. Research forums such as TREC, SIGIR, and other information retrieval conferences provide venues for advancing the state of the art. So, when Recommind announced that it had been issued a patent on Predictive Coding, I took notice, especially since it touches a nerve with those who believe research should be openly shared.

The patent lists six claims that describe a workflow whereby humans review and code a sample of documents, and the coding decisions applied to that sample are projected onto the larger collection of documents. Anyone with even the slightest exposure to information retrieval research will recognize this as a very common interactive relevance feedback mechanism. Relevance feedback as a way to perform information retrieval has been studied for well over forty years, going back at least to J.J. Rocchio’s 1968 paper, Relevance Feedback in Information Retrieval. It falls under a category of methods broadly known as machine learning.
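
For readers curious what Rocchio-style relevance feedback actually looks like, here is a minimal sketch in Python. Documents and the query are represented as plain word-weight dictionaries; the alpha/beta/gamma weights are conventional illustrative values (actual implementations vary), and the relevant/non-relevant lists are assumed non-empty.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Nudge the query vector toward reviewer-marked relevant documents
    and away from non-relevant ones (classic Rocchio update)."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)
    updated = {}
    for t in terms:
        rel_avg = sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        non_avg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
        updated[t] = alpha * query.get(t, 0.0) + beta * rel_avg - gamma * non_avg
    return updated

# Hypothetical feedback round: one relevant and one non-relevant document.
q = rocchio({"prepay": 1.0},
            relevant=[{"prepay": 1.0, "commodity": 1.0}],
            non_relevant=[{"lunch": 1.0}])
```

After the update, terms from relevant documents (like “commodity”) gain positive weight and terms from non-relevant documents are pushed negative, which is the essence of the feedback loop the patent claims describe.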

Any supervised machine learning system involves creating a training sample and using that sample to project onto a larger population. The fact that one could claim patentable ideas on something so widely known and used is puzzling.  Any workflow that employs machine learning would include the steps of creating an initial control set, coding it by human review, and applying the learned tags to a larger population.  In fact, the Wikipedia article Learning to rank describes precisely the workflow claimed in the patent, and as part of our participation in the TREC Legal Track 2009, Clearwell submitted a paper describing iterative sampling-based evaluation and automatic expansion of an initial query.  In that paper, we describe exactly the workflow postulated by the six claims of the patent.

In terms of other prior art that would potentially invalidate the patent, the list is long. Let’s start with text classification. Text classification using Support Vector Machines (SVM) was first published by Thorsten Joachims in 1998, and elaborated in his book Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, published in The Springer International Series in Engineering and Computer Science.  Joachims, now a well-recognized Professor of Computer Science at Cornell University, produced work that is widely cited as seminal in the area of machine learning and text classification. Interestingly, this work was cited by the Patent Examiner as prior art, but the inventors missed listing it. Nevertheless, that work, along with further work by academics such as Leopold and Kindermann, has established the use of Support Vector Machines as a useful technique for machine learning. To claim that its use in automatically coding documents is novel is, in my opinion, hollow.

Another technology mentioned in passing is Latent Semantic Indexing (LSI). LSI was proposed as a retrieval technique by Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. in their paper, Indexing by Latent Semantic Analysis, in the Journal of the American Society for Information Science, 41(6):391-407, 1990. The use of LSI for semantic analysis, concept searching, and text classification is also very widespread, and once again, it seems ridiculous to claim that it is something novel or innovative.

Next, let’s examine the use of sampling to validate the initial control set. Sampling to validate a control set of documents is such a widely known technique that most e-discovery productions employ it. In fact, the Sedona Commentary on Achieving Quality and the EDRM Search Guide recommend the use of sampling to validate automated searches. Furthermore, several e-discovery opinions, such as Judge Grimm’s opinion in Victor Stanley [Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008)], suggest that any technique that reduces the universe of documents produced must employ sampling for validation.

In short, we think the claims issued in the patent and the associated workflow are so commonly used that the workflow is neither novel nor non-obvious to a trained practitioner, and there is enough prior art on each of the individual technologies to warrant a re-examination and eventual invalidation of the patent. In any event, it is fairly easy for anyone to pick up existing prior art and devise a similar workflow that achieves the same or a better outcome, and any attempt to enforce the patent will likely be challenged.

But there is an even bigger issue at stake here beyond the status of Recommind’s patent: namely, shouldn’t the e-discovery vendor community continue to work, as it has for years, toward what is in the best interest of the legal community and, more broadly, the justice system? Recommind’s thinly veiled threats about requiring industry participants to license their technology are an affront to those who have invested years developing the technology and practicing the approach in real-world e-discovery cases. Spend a few minutes trolling (no pun intended) around on archive.org and you’ll see that early predictive coding companies like H5 were practicing machine learning and predictive workflows in e-discovery over two years before Recommind announced their first version of Axcelerate.

Wouldn’t a better outcome be for corporations and law firms to benefit from the innovation that comes from free competition in the marketplace, while still honoring the sort of novel, non-obvious innovation that warrants patent protection? Legitimate patents that encourage and protect an organization’s investments are fine, but process patents that attempt to claim a workflow are bad for business. With such an approach, the full promise of automated document review (which, as any truly honest vendor should admit, still has much more room to grow and develop) can be realized in a way that both provides vendors with the fair and just economic rewards they deserve and helps the legal system become radically more efficient.

IBM’s Watson: Can It Be Used for E-Discovery?

Thursday, May 12th, 2011

As the buzz around Watson and its foray into human-like (actually super-human) performance subsides, it may be time to take stock of what all the fuss was about. After all, we’re used to computers doing better than humans at many things, and we even take their superior store of knowledge for granted. And, on the surface, we get answers to questions on pretty much anything from a simple Google or Bing search. So, what really is the big deal, and is it even relevant in the context of electronic discovery?

For those not clued in, Watson is the brainchild of a four-year effort by 20-25 researchers at IBM to build a computing engine that could compete at champion level on the popular quiz show Jeopardy. Although it blundered on a couple of answers, it competed very well, winning by a wide margin. Several industry experts who followed the project and watched the show have lauded this as an accomplishment on the same scale as, or even greater than, IBM’s Deep Blue beating chess grandmaster Garry Kasparov in 1997. So, let’s examine whether this is indeed worthy of the accolades it has received.

Behind Watson is an impressive piece of hardware – a cluster of 90 IBM Power 750 nodes with a combined 16 TB of memory and 2,880 Power7 processor cores, delivering a staggering 80 teraflops of peak performance.  All the hardware is highly interconnected and able to work on problems in parallel, while still marching to a final result in three seconds or less – just fast enough to beat the human buzzer. Highlights of the computing infrastructure from IBM’s Dr. James Fan indicate that the three-second timeframe meant the entire corpus of 200 million pages had to be loaded into memory. With several processors simultaneously working on pieces of the problem, the workload also places very high I/O demands on the system. The hardware supports a multi-processing OS, with virtualization, in a workload-optimized system. The software drives the hardware using thousands of dense threads, with each thread of execution processing a large chunk of work with minimal context switching. Given the large number of cores, each thread is optimally allocated to a core. Branded as DeepQA, the software executes a series of complex algorithms in order to solve a very specific problem: winning at Jeopardy.

First, the Jeopardy game provides categories of clues. Some categories help in understanding the clue, while others are simply misleading to a computer. Next, the clue is revealed, and one needs to determine what the clue is really asking, since many clues do not ask for a factoid with a direct question, but rather are compositions of multiple sub-clues, each related to the others through some linguistic, semantic, syntactic, temporal, or other connection. Decomposing clues and figuring out the relationships is a challenge even for humans. Finally, after understanding the clue, one has to home in on an answer with some level of confidence, within a three-second window, and must activate the buzzer ahead of the other competitors. Beyond individual clues, one also has to devise an overall game strategy: selecting the next category, selecting a clue within that category, and deciding how much to wager on Daily Doubles and Final Jeopardy. Overall, the game is a complex amalgamation of knowledge, language analysis, gaming strategy, and speed of recall.

The software architecture of the DeepQA system is documented in a paper published in AI Magazine. The team built several components to address each area of the problem, with many independent algorithms in each component.  There are lots of complicated technical details, but the final outcome is a human-like response.

A question anyone who examines its inner workings will ask is whether the system is really natural language processing, statistical language analysis, machine learning, or some ad-hoc program that doesn’t fit any traditional area of analytics. It appears to be a combination of several techniques, which may mirror exactly how humans go about solving these clues. We seem to have a large collection of knowledge, initially unconnected, but the category, the clue, and the hypothesis all appear to generate word and concept associations, and a fuzzy evaluation of confidence measures converges into the confidence with which a competitor answers a question. It is the replication of these processes by algorithms that makes Watson a truly astounding achievement.

Given the success of DeepQA’s performance, a natural question is whether it has any practical value for helping us solve day-to-day problems. More specifically, can it cope with information overload and the e-discovery challenges posed by that mass of information?  Its use within the e-discovery context has been explored by several authors, most notably Robert C. Weber of IBM and Nick Brestoff in recent Law.com articles. Their analysis is based on the ability to explore vast volumes of knowledge. But what DeepQA really tackled is something more significant – the inherent ambiguity in human spoken and written communication. Our natural instinct is to employ subtle nuances, indirect references, implicit assumptions, and incomplete sentences. We tend to leverage prior and surrounding context in most of our communications. That is just the natural way of communicating, and it is actually very effective: we assume that re-establishing context is redundant and unproductive, as it makes communication repetitive. By not employing a rigid structure in how we write, we are able to carry on concise exchanges that span a large volume of information.

If the last two decades are any indicator, the nature of communication is getting less formal, with emails, instant messages, tweets, and blog posts replacing well-crafted formal letters and memos. Forcing individuals to communicate in rigid, unambiguous text so that computers can process it easily would mean a huge change in how people communicate, and any plan that contemplates such a change in behavior across billions of people is simply not going to happen. This means the burden on automated analysis using computing algorithms is even greater. It is what makes the discovery of relevant content in the context of e-discovery a very hard problem – one worthy of the sort of technological prowess employed by the DeepQA team.

Given that our appetite for producing information is ever-increasing, while its discoverability is getting harder, taking the work of DeepQA and adapting it to e-discovery needs has the potential to significantly improve how we tackle the search, review, and analytical aspects of e-discovery.  DeepQA took on an easily articulated goal: answering at least 60% of the clues with 85% precision in order to reach champion levels. That was sufficient to win the game. Note that there was never an attempt to get 100% of the clues with 100% confidence. In the realm of e-discovery, we would be looking at taking a very general production request, such as TREC 2009 Topic 201 – “All documents or communications that describe, discuss, refer to, report on, or relate to the Company’s engagement in structured commodity transactions known as prepay transactions.” – and using just such a simple articulation of the request to produce relevant documents. The core algorithms of machine learning, multiple scoring methods, and management of relevance and confidence levels, along with traditional information retrieval methods, form the ingredients of the new frontier of automated e-discovery. Beyond e-discovery, the application of DeepQA’s algorithms to business analytics also has significant potential, where fact- and evidence-based decision making using unstructured data is likely the norm. DeepQA’s very public Jeopardy challenge has shown that the ingredients needed to enable such problem solving are well within the realm of possibility.

Reinventing Review in Electronic Discovery

Tuesday, December 28th, 2010

In a recent workshop that I attended, I had the privilege of sharing thoughts on the latest electronic discovery trends with other experts in the market. Especially interesting to me was a discussion of the provocatively titled paper, The Demise of Linear Review, by Bennett Borden of Williams Mullen. The paper, citing data from several studies and drawing parallels to similar anachronisms of the past, makes excellent arguments for rethinking how legal review is performed in e-discovery.

When linear review is mentioned, the first mental picture one conjures up is boredom. It has generally been associated with the mental state that results from repetitive and monotonous tasks with very little variation. To get a sense of how badly boredom can affect performance, one only needs to draw upon the many studies of boredom in the workplace, especially in jobs such as mechanical assembly in the 1920s and telephone switchboard operation in the 1950s. In fact, the Pentagon-sponsored study, Implications for the design of jobs with variable requirements, from the Navy Personnel Research and Development Center, presents an excellent treatise on contributors to workplace fatigue, stress, monotony, and distorted perception of time. This is best illustrated in their paper:

Mechanical assembly, inspection and monitoring, and continuous manual control are the principal kinds of tasks most frequently studied by researchers investigating the relationship between performance and presumed boredom. On the most repetitive tasks, degradation of performance has typically been found within 30 minutes (Fox & Embry, 1975; Saito, Kishida, Endo, & Saito, 1972). The early studies of the British Industrial Fatigue Board (Wyatt & Fraser, 1929) concluded that the worker’s experience of boredom could be identified by a characteristic output curve on mechanical assembly jobs. The magnitude of boredom was inversely related to output and was usually marked by a sharp decrement in the middle of a work period.

How does this apply to linear review? A linear review is most often performed using a review application or tool, simulating a person reading and classifying a pile of documents. The reviewer is asked to read each document and apply a review code based on their judgment. While it appears easy, it can be one of the most stressful, boring, and thankless jobs for a well-educated, well-trained knowledge worker. Even with technology and software advances, a reviewer is required to read documents in relatively constrained workflows. Just scrolling through page after page of a document and comprehending its meaning and intent in the context of the production request can be stressful. To add to this, reviewers are often measured for productivity based on the number of documents or pages they review per day or per hour. In cases where large numbers of reviewers are involved, there are very direct comparisons of review rates. Finally, the review effort is judged for quality without consideration for the very elements that impact quality. Imagine a workplace task where every action taken by a knowledge worker is monitored and evaluated to the minutest detail.

Given this, it is no wonder that study after study has found that a straight plough-through linear review produces less than desirable results. A useful way to measure the effectiveness of a review exercise is to submit the same collection of documents to multiple reviewers and assess their level of agreement on the classification of the reviewed documents into specific categories. One such study, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, finds that the level of agreement among human reviewers was only in the 70% range, even when agreement is limited to positive determinations. As noted in the study, previous TREC inter-assessor agreement notes, as well as other studies on this subject such as Barnett et al., 2009, show a similar and consistent result. Especially noteworthy from TREC is the fact that only 9 of the 40 topics studied had an agreement level higher than 70%, while, remarkably, four topics had no agreement at all. Some of the disagreement is due to the fact that most documents fall on varying levels of responsiveness which cannot easily be judged with a binary yes/no decision (i.e., the “where do you draw the relevance line” problem). However, a significant source of variability is simply the boredom and fatigue that come with the repetitiveness of the task.
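
One simple way to quantify agreement on positive determinations, similar in spirit to the “overlap” measure used in such studies, is the Jaccard ratio: the number of documents both reviewers coded responsive, divided by the number either reviewer coded responsive. A minimal sketch, using hypothetical document IDs:

```python
def positive_overlap(reviewer_a, reviewer_b):
    """Agreement on positive calls: |A & B| / |A | B| over the sets of
    document IDs each reviewer coded responsive."""
    a, b = set(reviewer_a), set(reviewer_b)
    if not (a | b):
        return 1.0  # neither coded anything responsive: trivially in agreement
    return len(a & b) / len(a | b)

# Two reviewers who agree on 7 of the 10 documents they collectively flagged:
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 7, 9, 10}
print(positive_overlap(a, b))  # 0.7
```

An overlap around 0.7 corresponds to the “70% range” of agreement the study reports, which looks respectable until one remembers that each disagreement is a document one reviewer would have produced and the other withheld.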

A further observation on reviewer effectiveness is available from the TREC 2009 Overview Report, which studied the appeals and adjudication process of that year’s Interactive Task. This study offers an excellent opportunity to assess the effectiveness of the initial review and the subsequent appeals and adjudication process. As noted in the study, the Interactive Task involves an initial run submission from participating teams, which is sampled and reviewed by human assessors. Upon receiving their initial assessments, participating teams are allowed to appeal those judgments. Given the teams’ incentive to improve upon the initial results, they are motivated to appeal as many documents as they can, with each appeal containing a justification for re-classification. As noted in the study, the success rates of appeals were very high, with 84% to 97% of appealed assessments being reversed. Such reversals were across the board and directly proportional to the number of appeals, suggesting that even the assessments that were not appealed could be suspect. Another notable aspect is that the appeals process requires a convincing justification from the appealing team, in the form of a snippet of the document, a document summary, or a portion of the document highlighted for adjudication. This in itself biases the review, making it easier for the topic assessor to get a clearer sense of the document when adjudicating the appeal. This is also borne out by the aforementioned Computer Classification vs. Manual Review study, in which the senior litigator with knowledge of the matter was able to offer the best adjudications.

Given that linear review is flawed, what are the remedies? As noted in Bennett’s paper, intelligent use of newer technologies along with a review workflow that leverages them can offer gains that are demonstrated in other industries. Let’s examine a few of them.

Response Variation

Response variation is a strategy for coping with boredom by building variety into the task itself. In mechanical assembly lines, response variation is added through innovative floor and task layouts, such as the Cellular Layout. On some tasks, response variation may involve only simple alternation behaviors, such as reversing the order in which subtasks are performed; on others, the variety may take more subtle forms, reflected in an inconsistency of response times. In the context of linear review, it can help to organize review batches so that review teams alternate among classifying documents for responsiveness, privilege, confidentiality, and so on. Another interesting approach is to mix the review documents but ask that each batch be reviewed for a specific target classification.

Free-Form Exploration

Combining aspects of early case assessment and linear review is one form of exploration that is known to offer both a satisfying experience and effective results. While performing linear review, the ability to suspend the document being reviewed and jump to other similar documents and topics gives the reviewer a cognitive stimulus that improves knowledge acquisition. Doing so offers an opportunity for the reviewer to learn facts of the case that would normally be difficult to obtain, and to approach the knowledge level of a senior litigator on the case. After all, we depend on knowledge of the matter to guide reviewers, so attempts to increase their knowledge of the case can only be helpful. Moreover, during free-form exploration a reviewer may stumble upon an otherwise difficult-to-obtain case fact, and the sheer joy of finding something valuable is itself rewarding.

Expanding the Work Product

Besides simply judging the review disposition of a document, generating higher-value output such as document summaries, critical snippets, and document metadata that contribute to the assessment can both reduce the boredom of the current reviewer and contribute valuable insights to other reviewers. As noted earlier, such aids can be immensely helpful to the review process.

Review Technologies

Of course, fundamentally changing linear review with specific technologies that radically change the review workflow is an approach worth considering. Even with such aids, it must be remembered that human judgment is still needed, and the process must incorporate both increasing reviewers’ knowledge and supporting their ability to apply judgment. We will examine these technologies in an upcoming post.

2009 TREC Legal Track Sheds Light on Search Efficacy in Electronic Discovery

Tuesday, July 27th, 2010

In one of my previous posts, I discussed the value and importance of TREC to the legal community. Clearwell Systems has been a TREC participant for the last two years, and believes in working with the rest of the participants to advance the collective knowledge of electronic discovery-related information retrieval methodologies. TREC’s work is conducted in the context of annual workshops and is organized into specific tracks. For legal professionals, the TREC Legal Track is the most relevant, and the track organizers have just released the much-awaited overview of the 2009 workshop. Below, I summarize the key results from the study and their broader implications.

The overview paper is now available and covers the design of the two tasks within the track – the Interactive Task and the Batch Task. The Interactive Task is very relevant for the legal community, since it is designed specifically to analyze the task of producing specific records in response to a “discovery request.” As noted in the paper, 15 teams participated, including 10 commercial teams, up from three teams in 2008. The 2009 study was also the first to use an email collection (based on the Enron emails released by FERC).

The Interactive Task involves a “mock complaint” and seven different topics, with each topic described in the form of a general information request. Several teams participated by choosing one or more topics and submitting responsive documents for each.  These were then assessed using a mathematically sound sampling and estimation methodology, and effectiveness metrics were computed for each team.

The critical summary measure is F1, the harmonic mean of precision (which penalizes false positives) and recall (which penalizes false negatives). Overall, the highest F1 measure achieved on each of six of the seven topics was very good, with values ranging from 0.614 to 0.840. As an example, an F1 measure of 0.840 was achieved with a recall of 0.778 and a precision of 0.912. This implies that the information request was satisfied with very few false positives (8.8% of the retrieved documents were non-relevant) and relatively few false negatives (22.2% of the relevant documents were missed). High precision means your reviewers will review fewer irrelevant documents, reducing review workload and review costs.  High recall ensures that very few documents are missed, so your case teams can be confident that all the facts of the case are examined.
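
For readers who want to check the arithmetic, F1 is computed as the harmonic mean of precision and recall; plugging in the cited run’s numbers reproduces the reported 0.840.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Best run cited in the text: precision 0.912, recall 0.778.
print(round(f1_score(0.912, 0.778), 3))  # 0.84
```

Because it is a harmonic mean, F1 punishes imbalance: a run with perfect precision but poor recall (or vice versa) cannot score well, which is exactly why it is preferred over raw accuracy as a summary measure.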

It's always important to look not only at the results but also at the costs incurred in achieving them. These can be broken into the costs each team incurred and the costs incurred by the assessors and topic authorities. Unfortunately, the study did not track the resources each team expended, so that remains a possible improvement for a future study. For a view of the second cost, the tabulation of team interactions with topic authorities (Figure 3 of the overview paper) is helpful. In this study, the topic authority plays the role of a case expert. The numbers show that for some topics, a highly acceptable F1 (over 0.75) was achieved with as little as 100 minutes of interaction, well below the 600 minutes allocated to each team. This indicates that the teams were able to understand the topics and construct meaningful searches with a very reasonable amount of a case expert's involvement.

The other interesting conclusion is that there is value in selecting a corpus containing attachments. The study measured "documents per family" ratios and found that attachments were disproportionately responsive: for the responsive set, the ratio was a significantly higher 4.8 (i.e., responsive document families contained, on average, one message and 3.8 attachments), while for the entire population the ratio was 2.2. This suggests that using the Enron corpus, which contains attachments, was a very good decision.
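As a sketch of how such a ratio is computed, divide the total number of documents (the message plus its attachments) across a set of families by the number of families. The family data below is invented purely for illustration; only the arithmetic mirrors the study:

```python
# Each family is (message_count, attachment_count); values are hypothetical.
responsive_families = [(1, 4), (1, 3), (1, 5), (1, 3)]

def docs_per_family(families):
    """Average number of documents (message + attachments) per family."""
    total_docs = sum(msgs + atts for msgs, atts in families)
    return total_docs / len(families)

print(docs_per_family(responsive_families))  # → 4.75
```

A ratio near 4.8 for responsive families versus 2.2 overall means responsive material clusters in attachment-heavy families, which is exactly why a corpus without attachments would understate the difficulty of real productions.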

Of course, the most revealing and controversial finding concerns the assessment and adjudication phase of the project. As noted in section 2.3.4 of the overview paper, the success rate of appeals was significant, ranging from 82% to 97%. In other words, the initial sample assessments were reversed in an astonishingly large number of cases. One could argue that the appealed documents were carefully selected, but that argument is weakened by the varying numbers of appeals across participating teams and by the high success rates even for the teams with larger numbers of submissions. As noted in the paper, the teams that invested more resources in the appeals phase benefited proportionately in the improvement of their final precision and recall numbers. I know that constructing appeals can consume a lot of resources since, in addition to the normal information retrieval task, you must provide a convincing argument for reversing an initial judgment. This becomes very much a review exercise, not unlike the traditional manual review that the broader legal industry has been struggling with. For example, our own appeals budget was limited, forcing us to sample the appealed documents and submit only a few. A consequence is that un-appealed assessments stand as issued, even though the high success rate of appeals suggests many of them would also have been reversed. In the final analysis, section 2.4.2 illustrates a salient indicator of success: the teams that had positive and productive interactions with the topic authority had the greatest success in initial assessments as well as in appeals, and those that leveraged this for the greatest number of appeals reported the greatest F1.

The 2009 study saw a significant increase in participation from commercial teams. My own observation is that, unlike academic teams, commercial teams tend to evaluate their participation in TREC through the narrow prism of short-term return on investment. While there is value in contributing to the community, I am sure each team is asked to justify the benefits of participation to its management. Some would argue that the full benefit is not realized because of the restrictions placed on disseminating results to the broader community, especially in marketing. I am sure every commercial participant would like to promote its performance and highlight how its technology and methodology were superior. Since such direct comparisons are not permitted, the ability to market one's results is severely curtailed. The potential for comparative analysis could be a powerful motivator for all participating teams to invest more in the exercise, with the final outcome that the community benefits.

As I noted in my previous post, the legal e-discovery profession needs an independent authority that can challenge vendor claims and provide objective validation of one of the most complex areas of e-discovery: search and information retrieval. TREC has stepped in and served that need very effectively. And this has been deservedly noticed by the people who matter: judges in cases involving electronic discovery, who express opinions on "reasonableness" with respect to cost-shifting, adverse inference, motions to dismiss, and other rulings.

A study of such magnitude is bound to have certain flaws, and these are documented in section 2.5. Leaving those shortcomings aside, the TREC Legal Track effort is immensely useful for both participants and consumers of legal technologies and services. The value such studies offer the community is well captured in the companion report, the Economic Impact Assessment of NIST's TREC Program. As the TREC coordinators roll out the new 2010 Legal Track tasks, it is clear that continued improvements in both design and execution will make the track even more attractive to all participants. Clearwell Systems is committed to the overall goals of TREC and intends to continue its involvement in the TREC 2010 Legal Track.


Better Search for E-Discovery

Tuesday, March 11th, 2008

I spend a lot of time researching and developing new search functionality, and working with enterprises and law firms to use this functionality to improve their e-discovery outcomes. To this end, I have followed the excellent research performed as part of the TREC Legal Track. I also recently attended an informative Sedona Conference webinar on "Search and Information Retrieval", which contained a section on Information Retrieval (IR) lessons for e-discovery presented by Ellen Voorhees of NIST.

As I described some of this research to a colleague of mine, he asked me, "So, what's the so what? Based on your work with customers and this research, what's the most important step our customers can take to improve the way they search in e-discovery matters?" My answer was a little surprising even to me. While good cases can be made for looking at concept search and newer, more automated ways of performing content analysis, I believe the most important step customers can take is simply to get their "experts" iteratively searching the data in a matter as early as possible. Let me explain.

When I look at Ellen's presentation and the findings from the TREC Legal Track 2006 research overview, three findings stand out to me:

  1. If you want to get more effective results as measured by “recall” (i.e., how many of the relevant documents did you find?) and “precision” (i.e., how many of the documents you found were relevant versus false positives), then the best way to achieve this is to write a better search query.
  2. One of the best ways to get better search queries is to commit human resources to improving them, by putting a “human-in-the-loop” while performing searches.
  3. The more expert the human, the better results you are going to get.1
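The findings above can be illustrated with a toy "human-in-the-loop" loop. Everything here, the corpus, the ground truth, and the queries, is invented for illustration; in practice, the expert judges sampled results by eye and refines the query accordingly:

```python
# A tiny hypothetical corpus of document texts, keyed by document ID.
corpus = {
    1: "memo on the fraud investigation",
    2: "lunch menu for friday",
    3: "notes from the fraud audit meeting",
    4: "quarterly audit results",
}
relevant = {1, 3, 4}  # ground truth, normally unknown until review

def search(terms):
    """Return IDs of documents containing any of the given keywords."""
    return {doc_id for doc_id, text in corpus.items()
            if any(t in text for t in terms)}

def recall_precision(hits):
    tp = len(hits & relevant)  # true positives
    return (tp / len(relevant), (tp / len(hits)) if hits else 0.0)

# Iteration 1: the expert's initial keyword guess misses document 4.
r1, p1 = recall_precision(search(["fraud"]))
# Iteration 2: after reviewing the hits, the expert broadens the query.
r2, p2 = recall_precision(search(["fraud", "audit"]))
assert r2 >= r1  # the refined query recovers more relevant documents
```

The point of the research findings is not the mechanics of the loop but who runs it: when the same expert reviews results and refines the next query, each iteration is informed by what the previous one surfaced.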

In other words, what Ellen and the other researchers have found is that you get better results when the same person runs searches, evaluates the results, refines the queries, and tries again. And the more expert that person is, the better the results will be.

Now, you may be thinking that this sounds like common sense, and I would completely agree with you. However, while this advice is clearly common sense to you and me, in my experience it is not always followed in our industry. Instead, it's all too common that at the beginning of a matter someone comes up with a set of keyword queries, someone else runs those queries, other people perform a detailed review of the results, and finally the "expert" or attorney at the end of this process reviews the most important documents and/or a summary written by someone else. At that point, new queries may be developed based on the results of the review, and the process starts over again.

What's the problem with this approach? While it can ultimately be effective, it can be exceedingly costly and time-consuming. Instead, getting your "expert", whether that is inside counsel, outside counsel, a subject matter expert, a litigation support professional, or a hired investigator, to interact with the data directly will allow you to find the most important information faster, enabling you to make critical legal decisions sooner and to dramatically reduce the cost and risk associated with e-discovery.

So why don't more people follow the common-sense advice of getting an expert in front of the data, experimenting with queries, interacting with the results, and developing better queries? In my view, the single biggest reason is that the technology used to perform searches in e-discovery has simply not been easy enough for legal experts to use. As a result, these experts have become accustomed to developing queries without using the technology themselves, rather than iteratively interacting with the data over a short period of time.

But that’s changing. In the past few years, several intuitive e-discovery solutions have come to market that enable non-technical lawyers to run their own queries. More and more law firms and enterprises are leveraging these solutions to move to “human-in-the-loop” searching. The results are striking: better early case assessment, much shorter turnaround times, lower costs, and more accurate results.

1 This is my simplified interpretation of the findings of the TREC Legal Track. What was found was that an expert manual searcher performed well relative to non-expert manual runs. Baron, J., Lewis, D., and Oard, D., "TREC-2006 Legal Track Overview." The TREC research also contains other findings not covered in this post, and I recommend reading the full document so that readers can draw their own conclusions.