Posts Tagged ‘sampling’

Breaking News: Pippins Court Affirms Need for Cooperation and Proportionality in eDiscovery

Tuesday, February 7th, 2012

The long-awaited order regarding the preservation of thousands of computer hard drives in Pippins v. KPMG was finally issued last week. In a sharply worded decision, the Pippins court overruled KPMG’s objections to the magistrate’s preservation order and denied its motion for a protective order. The firm must now preserve the hard drives of certain former and departing employees unless it can reach an agreement with the plaintiffs on a methodology for sampling data from a select number of those hard drives.

Though it is easy to get caught up in the opinion’s rhetoric (“[i]t smacks of chutzpah (no definition required) to argue that the Magistrate failed to balance the costs and benefits of preservation . . .”), the Pippins case confirms the importance of both cooperation and proportionality in eDiscovery. With respect to cooperation, the court emphasized that parties should take reasonable positions in discovery so as to reach mutually agreeable results. The order also stressed the importance of communicating with the court to clarify discovery obligations. In that regard, the court faulted the parties and the magistrate for not seeking the court’s clarification with respect to its prior order staying discovery. The court explained that the discovery stay – which KPMG had understood to prevent any sampling of the hard drives – could have been partially lifted to allow for sampling. And this, in turn, could have obviated the costs and delays associated with the motion practice on this matter.

Regarding proportionality, the court confirmed the importance of this doctrine in determining the scope of preservation. Indeed, the court declared that proportionality is typically “determinative” of a motion for protective order. Nevertheless, the court could not engage in a proportionality analysis – weighing the benefits of preserving the hard drives against its burdens – as the defendant had not yet produced any evidence from the hard drives to evaluate the nature of the evidence. Only after the evidence from a sampling of hard drives had been produced and evaluated could such a determination be made.

The Pippins case demonstrates that courts have raised their expectations for how litigants will engage in eDiscovery. Staking out unreasonable positions in the name of zealous advocacy stands in stark contrast to the clear trend that discovery should comply with the cost cutting mandate of Federal Rule 1. Cooperation and proportionality are two of the principal touchstones for effectuating that mandate.

The Top Ten “What NOT to Do” List for LegalTech New York 2012

Thursday, January 26th, 2012

As we approach LegalTech New York next week, oft referred to as the Super Bowl of legal technology events, there are any number of helpful blogs and articles telling new attendees what to expect, where to go, what to say, what to do. Undoubtedly, there’s some utility to this approach, but since we’ll be in New York, I think it’s appropriate to take a more skeptical approach and proffer a list of what *NOT* to do at LTNY.

  1. DON’T get caught up in Buzzword Bingo. There are already dozens of sources attempting to prognosticate what the most popular buzzwords will be at this year’s show.  Leading candidates include “predictive coding,” “technology assisted review,” “information governance,” “big data” and even the pedestrian sounding “sampling.” And, while these terms will undoubtedly be on booths and broadcast repeatedly from the Hilton elevator, it doesn’t mean an attendee should merely parrot these without a deeper dive.  Here, the key is to go behind the green curtain to see what vendors, panelists and tweet-ers actually mean by these buzzwords, since it’s often surprising to see how the devil really is in the details.
  2. DON’T get a coffee at the Hilton Starbucks. Yes, we all love our morning coffee, but there’s no need to wait in the Justin Bieber-esque queue at the in-hotel Starbucks. There are approximately 49 locations in a ½ mile radius, including one right across the street. There’s also the vendor giving out free coffee on the second floor, so save yourself 30 minutes of needless line waiting.
  3. DON’T ride the Hilton elevator. For those staying or taking meetings at the Hilton, the elevator lines can be excessively long.  Once you finally get on, you’ll wish they’d been even longer as you then find yourself subjected to the brainwashing of vendor announcements while you make multiple stops on your way to your desired floor. Either take the stairs or, if that’s not possible, try to minimize the trips to keep your sanity. Or, plan B – bring your iPod.
  4. DON’T talk to booth models. It’s tempting to gravitate to the most attractive person at a given vendor’s booth, but they’re often hired professionals designed to get you in for the all-important “badge scan.” Instead, focus on  the person who looks like they’ve been in the same company-branded oxford for 48 hours, because they probably have. While perhaps less aesthetically pleasing, they’ll certainly know more about the product and that’s why you’re there after all, isn’t it?
  5. DON’T pass out your resume on the show floor. While certainly a great networking opportunity, LTNY isn’t the place to blatantly tout your professional wares, at least if you want to keep your nascent job search on the down low. And, if you want to have more private meetings, you’ll need to do better than “hiding out” at the Warwick across the street. For more clandestine purposes, think about the Bronx.
  6. DON’T take tchotchkes without hearing the spiel. There are certain tchotchke hounds out there who roam around LTNY collecting “gifts” for the kids back at home. While I won’t frown on this behavior per se, it’s only courteous to actually listen to the pitch (as a quid pro quo) before you ask for the swag. Anything less is uncivilized.
  7. DON’T get over-served at the B-Discovery Party. After a long day on the show floor you’re probably ready to let loose with some of the eDiscovery practitioners you haven’t seen in a year.  But, in this era of flip cams and instant tweeting, letting your hair down too much can be career limiting. If you haven’t done Jägermeister shots since college, LTNY probably isn’t a good time to resume that dubious practice.
  8. DON’T forget to take your badge off (please!). Yes, it’s cool to let everyone know you’re attending the premier legal technology event of the year, but once you leave the show floor random New Yorkers will heckle you for sporting your badge after hours – particularly the baristas at Starbucks. Plus, if you’ve broken any of the other admonitions above, at least you’ll be more anonymous.
  9. DON’T forget to bring a heavy coat, mittens and scarf. Last year there was the infamous ice storm that stranded folks for days (me included). Even if the weather isn’t that severe this year, anyone from warmer climates will need to bundle up, particularly because it’s easy to unintentionally get caught outside for extended amounts of time – waiting for a cab in the Hilton queue, eating at Symantec’s free food cart, walking to a meeting at a “nearby” hotel that’s “just a block or so away.” Keep in mind those cross town blocks are longer than they appear on a map.
  10. DON’T forget to learn something. Without hyperbole, LTNY has the world’s greatest collection of legal/technology minds in one place for 3 days.  Most folks, even the vaunted panelists, judges and industry luminaries, are actually quite accessible. So, at a minimum, attend sessions, ask questions and interact with your peers. Try to ignore the bright lights and signs on the floor and make sure to take some useful information back to your firm, company or governmental agency. You’ll undoubtedly have fun (and maybe a Jägermeister shot, too) along the way.

2012: Year of the Dragon – and Predictive Coding. Will the eDiscovery Landscape Be Forever Changed?

Monday, January 23rd, 2012

2012 is the Year of the Dragon – which is fitting, since no other Chinese Zodiac sign represents the promise, challenge, and evolution of predictive coding technology more than the Dragon.  The few who have embraced predictive coding technology exemplify symbolic traits of the Dragon that include being unafraid of challenges and willing to take risks.  In the legal profession, taking risks typically isn’t in a lawyer’s DNA, which might explain why predictive coding technology has seen lackluster adoption among lawyers despite the hype.  This post explores the promise of predictive coding technology, why it has not been widely adopted in eDiscovery, and why 2012 is likely to be remembered as the year of predictive coding.

What is predictive coding?

Predictive coding refers to machine learning technology that can be used to automatically predict how documents should be classified based on limited human input.  In litigation, predictive coding technology can be used to rank and then “code” or “tag” electronic documents based on criteria such as “relevance” and “privilege” so organizations can reduce the amount of time and money spent on traditional page by page attorney document review during discovery.

Generally, the technology works by prioritizing the most important documents for review by ranking them.  In addition to helping attorneys find important documents faster, this prioritization and ranking of documents can even eliminate the need to review documents with the lowest rankings in certain situations. Additionally, since computers don’t get tired or day dream, many believe computers can even predict document relevance better than their human counterparts.

Why hasn’t predictive coding gone mainstream yet?

Given the promise of faster and less expensive document review, combined with higher accuracy rates, many are perplexed as to why predictive coding technology hasn’t been widely adopted in eDiscovery.  The answer really boils down to one simple concept – a lack of transparency.

Difficult to Use

First, early predictive coding tools attempt to apply a complicated new technological approach to a document review process that has traditionally been very simple.  Instead of relying on attorneys to read each and every document to determine relevance, the success of today’s predictive coding technology typically depends on review decisions input into a computer by one or more experienced senior attorneys.  The process commonly involves a complex series of steps that include sampling, testing, reviewing, and measuring results in order to fine tune an algorithm that will eventually be used to predict the relevancy of the remaining documents.

The problem with early predictive coding technologies is that the majority of these complex steps are done in a ‘black box’.  In other words, the methodology and results are not always clear, which increases the risk of human error and makes the integrity of the electronic discovery process difficult to defend.  For example, the methodology for selecting a statistically valid sample is not always intuitive to the end user.  This fundamental problem could result in improper sampling techniques that could taint the accuracy of the entire process.  Similarly, the process must often be repeated several times in order to improve accuracy rates.  Even if accuracy is improved, it may be difficult or impossible to explain how accuracy thresholds were determined or to explain why coding decisions were applied to some documents and not others.

Accuracy Concerns

Early predictive coding tools also tend to lack transparency in the way the technology evaluates the language contained in each document.  Instead of evaluating both the text and metadata fields within a document, some technologies actually ignore document metadata.  This omission means a privileged email sent by a client to her attorney, Larry Lawyer, might be overlooked by the computer if the name “Larry Lawyer” is only part of the “recipient” metadata field of the document and isn’t part of the document text.  The obvious risk is that this situation could lead to privilege waiver if it is inadvertently produced to the opposing party.

Another practical concern is that some technologies do not allow reviewers to make a distinction between relevant and non-relevant language contained within individual documents.  For example, early predictive coding technologies are not intelligent enough to know that only the second paragraph on page 95 of a 100-page document contains relevant language.  The inability to discern what language led to the determination that the document is relevant could skew results when the computer tries to identify other documents with the same characteristics.  This lack of precision increases the likelihood that the computer will retrieve an over-inclusive set containing many irrelevant documents.  This problem is sometimes loosely called ‘excessive recall,’ though strictly it is a loss of precision, and it matters because every irrelevant document retrieved adds to the number requiring manual review, which directly increases eDiscovery cost.
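The two measures at issue here can be made concrete with a small sketch. The document IDs and counts below are invented purely for illustration; the point is that an over-inclusive result set shows up as low precision even when recall is high.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for a retrieved set against the truly relevant set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: documents 1-100 are the relevant ones.
relevant = set(range(1, 101))
# Hypothetical result set: 90 relevant documents plus 810 irrelevant ones.
retrieved = set(range(1, 91)) | set(range(1001, 1811))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.10 recall=0.90
```

In this made-up case the tool finds 90% of what matters, but nine out of every ten documents sent to reviewers are irrelevant, which is where the review cost comes from.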

Waiver & Defensibility

Perhaps the biggest concern with early predictive coding technology is the risk of waiver and concerns about defensibility.  Notably, there have been no known judicial decisions that specifically address the defensibility of these new technology tools even though some in the judiciary, including U.S. Magistrate Judge Andrew Peck, have opined that this kind of technology should be used in certain cases.

The problem is that today’s predictive coding tools are difficult to use, complicated for the average attorney, and the way they work simply isn’t transparent.  All these limitations increase the risk of human error.  Introducing human error increases the risk of overlooking important documents or unwittingly producing privileged documents.  Similarly, it is difficult to defend a technological process that isn’t always clear in an era where many lawyers are still uncomfortable with keyword searches.  In short, using black box technology that is difficult to use and understand is perceived as risky, and many attorneys have taken a wait-and-see approach because they are unwilling to be the guinea pig.

Why is 2012 likely to be the year of predictive coding?

The word transparency may seem like a vague term, but it is the critical element missing from today’s predictive coding technology offerings.  2012 is likely to be the year of predictive coding because improvements in transparency will shine a light into the black box of predictive coding technology that hasn’t existed until now.  In simple terms, increasing transparency will simplify the user experience and improve accuracy which will reduce longstanding concerns about defensibility and privilege waiver.

Ease of Use

First, transparent predictive coding technology will help minimize the risk of human error by incorporating an intuitive user interface into a complicated solution.  New interfaces will include easy-to-use workflow management consoles to guide the reviewer through a step-by-step process for selecting, reviewing, and testing data samples in a way that minimizes guesswork and confusion.  By automating the sampling and testing process, the risk of human error can be minimized which decreases the risk of waiver or discovery sanctions that could result if documents are improperly coded.  Similarly, automated reporting capabilities make it easier for producing parties to evaluate and understand how key decisions were made throughout the process, thereby making it easier for them to defend the reasonableness of their approach.

Intuitive reports also help the producing party measure and evaluate confidence levels throughout the testing process until appropriate confidence levels are achieved.  Since confidence levels can actually be measured as a percentage, attorneys and judges are in a position to negotiate and debate the desired level of confidence for a production set rather than relying exclusively on the representations or decisions of a single party.  This added transparency allows the type of cooperation between parties called for in the Sedona Cooperation Proclamation and gives judges an objective tool for evaluating each party’s behavior.

Accuracy & Efficiency

2012 is also likely to be the year of transparent predictive coding technology because technical limitations that have impacted the accuracy and efficiency of earlier tools will be addressed.  For example, new technology will analyze both document text and metadata to avoid the risk that responsive or privileged documents are overlooked.  Similarly, smart tagging features will enable reviewers to highlight specific language in documents to determine a document’s relevance or non-relevance so that coding predictions will be more accurate and fewer non-relevant documents will be recalled for review.

Conclusion - Transparency Provides Defensibility

The bottom line is that predictive coding technology has not enjoyed widespread adoption in the eDiscovery process due to concerns about simplicity and accuracy that breed larger concerns about defensibility.  Defending the use of black box technology that is difficult to use and understand is a risk that many attorneys simply are not willing to take, and these concerns have deterred widespread adoption of early predictive coding technology tools.  In 2012, next generation transparent predictive coding technology will usher in a new era of computer-assisted document review that is easy to use, more accurate, and easier to defend. Given these exciting technological advancements, I predict that 2012 will not only be the year of the dragon, it will also be the year of predictive coding.

How Do You Sample Electronically Stored Information (ESI) in E-Discovery?

Wednesday, February 9th, 2011

When confronted with an almost impossible data analysis problem, a tried and true technique for solving it is sampling. The mathematics behind sampling has been studied for many years, and sampling has been put into practice for well over seventy years in many fields, from predicting election results to assessing the quality of electric bulbs. Why not do the same for certifying your ESI productions, while also addressing defensibility and reasonableness?

Sampling as a way to assess quality is something the Electronic Discovery Reference Model (EDRM) Search Group authors covered in detail, with a strategy, in the comprehensive EDRM Search Guide (see Section 9.5 and Appendix 2). And, while much of that work has yet to hit the mainstream litigation scene as a general practice, I was pleasantly surprised to see it receive attention from a fellow blogger and litigator, Nick Brestoff, who highlighted it in a very thoughtfully crafted article on Law.com titled A Strategy to Sample All the ESI You Need. I commend his article for helping the community understand the practical difficulties in getting a certifiable result that attorneys can stand behind. And it is highly likely that the current practice is to certify electronic discovery without a real measure of validity behind it.

That leads us back to the mechanics of sampling, the math behind it, and its defensibility. As the EDRM Search Guide notes, meaningful sampling can only be done by the one who has the data, i.e., the producing party. While Rule 26(a) of the Federal Rules of Civil Procedure (FRCP) lists required disclosures, and Rule 26(g) sets out signing and certification guidelines, there is no agreed-upon way to specify sampling parameters or to report the results of sampling. It is in this context that Nick Brestoff’s article is significant – it explores practical ways in which the producing party can shift the sampling mechanics to the requesting party. I do think, however, that there is a logistical problem with this – most litigators will balk at producing largely irrelevant and non-responsive items to the other side.

Perhaps the real need is for the requesting party to specify in the Rule 26(f) meet and confer that the production be certified for completeness by also including a statement on sampling and its results. A simple request such as, “Sample the data for 98% confidence level and 2% error rate, and report the number of responsive documents” could be sufficient. The producing side can then perform random sampling per the goals of that request, selecting 13,526 documents (based on the sampling table of the EDRM Search Guide). This allows the attorneys representing the producing party to certify and sign off on an agreed-upon target.
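For readers who want to see where a number like this comes from, the standard sample-size formula for estimating a proportion, n = z²·p(1−p)/e², gets very close. This is only a sketch and not the EDRM Search Guide's own table or method: it assumes the worst-case proportion p = 0.5, a z-score rounded to three decimals, and a large population. Under those assumptions the 13,526 figure is reproduced when the "2% error rate" is read as a total interval width of 2% (a margin of plus or minus 1%); read as plus or minus 2%, the same formula gives a much smaller sample.

```python
import math
from statistics import NormalDist

def sample_size(confidence: float, margin: float, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion in a large population."""
    # Two-sided z-score, rounded to 3 decimals (an assumption, to mimic printed tables).
    z = round(NormalDist().inv_cdf(1 - (1 - confidence) / 2), 3)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(0.98, 0.01))  # 13526  (98% confidence, +/-1% margin)
print(sample_size(0.98, 0.02))  # 3382   (98% confidence, +/-2% margin)
```

Either way, the parties would need to agree up front on which reading of "error rate" they intend, since the two differ by a factor of four in review effort.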

In addition to the EDRM Search Guide, The Sedona Conference Working Group Commentary, Achieving Quality in the E-Discovery Process, is an indispensable resource for understanding the role of sampling. This paper discusses at length several sampling methods and their applicability for various purposes, including certifying that the results meet certain quality criteria. In addition, a number of electronic discovery cases have mentioned sampling as a way of overcoming the explosion of data volumes. A primary application of sampling is for evaluating proportionality claims, something that has moved from a simple assertion into an informed argument, with specificity on proving cost burden. Let’s examine a few.

Referring to the well-known Zubulake v. UBS Warburg, F.R.D. 280, the court in Makrakis v. Demelis, No. 09-706-C, 2010 WL 3004337 (July 13, 2010) ordered the producing party to essentially sample just a small number of backup tapes, at the expense of the requesting party. The decision is also remarkable for shifting the cost of processing and reviewing the sample, however small, to the requesting party. Such measures, while reducing the overall costs of e-discovery, place a greater burden of sample selection on the requesting party, forcing them to apply the reasonableness evaluation.

In Barrera v. Boughton, 2010 WL 3926070 (D. Conn. Sept. 30, 2010), the court ruled that a phased approach to ESI discovery is appropriate and quotes an earlier case, S.E.C v. Collins & Aikman Corp, 256 F.R.D. 403, 418 (S.D.N.Y. 2009), that “[t]he concept of sampling to test both the cost and the yield is now part of the mainstream approach to electronic discovery.” The sampling recommendation in this instance was both a reduction of number of custodians from forty to three, as well as a significant reduction in the date range for the search. What was initially a $60,000 ESI search and discovery effort was reduced drastically to under $13,000.

Similarly, sampling is suggested in both M. Adams & Assoc., L.L.C. v. Fujitsu Ltd., No. 1:05-CV-64, 2010 WL 1901776, and Mt. Hawley Ins. Co. v. Felman Prod., Inc. as a way to run a small set of search terms against a smaller number of custodians so as to get a sense of the larger electronic discovery costs. Clearone Communications v. Chiang offers another example of sampling by the use of Boolean logic to combine more common search terms, thereby avoiding over-inclusiveness.

Per the Sedona commentary definitions, this type of sampling is referred to as “judgmental sampling,” wherein the practitioner has a general sense of which custodians and date ranges are most likely to offer the greatest yield. As judgmental sampling becomes more widely adopted as a way of controlling costs, electronic discovery sampling can embrace the benefits of statistical sampling as well. It is a natural next step, as even with the narrow criteria of judgmental sampling, the cost of review can be high. One advantage of statistical sampling is that it yields quantifiable measures of error and confidence intervals, while judgmental sampling has no such formal measurement. Again, if the requesting party wishes to ensure a level of completeness and quality and if the producing party needs a basis for certifying their productions, statistical sampling can be a powerful aid.
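As a rough illustration of what a quantifiable measure of error looks like, the sketch below draws a simple random sample from a hypothetical collection and reports the estimated share of responsive documents with a normal-approximation confidence interval. Everything specific here is an assumption made for the example: the 100,000-document collection, the 12% responsiveness rate, the sample size of 2,000, and the `responsive` function standing in for an attorney's review decision.

```python
import math
import random
from statistics import NormalDist

def estimate_proportion(population, is_responsive, n, confidence=0.95, seed=0):
    """Estimate the responsive fraction from a simple random sample of size n."""
    sample = random.Random(seed).sample(population, n)
    p_hat = sum(1 for doc in sample if is_responsive(doc)) / n
    # Normal-approximation interval; reasonable when n is large and p_hat is not near 0 or 1.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical collection: 100,000 document IDs, of which 12% are responsive.
docs = list(range(100_000))
responsive = lambda doc_id: doc_id % 100 < 12  # stand-in for human review

p_hat, low, high = estimate_proportion(docs, responsive, n=2_000)
print(f"estimated responsive rate: {p_hat:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A judgmental sample of hand-picked custodians cannot support a statement like this, because the documents were not selected at random from the whole collection.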

As the Electronic Discovery World Zurns

Wednesday, July 29th, 2009

Judge Grimm’s Victor Stanley case was lauded by many as one of the most significant electronic discovery cases of 2008, mainly for its bold proclamation that e-discovery search is a much more complex and technical discipline than has been typically understood by litigators.

“[F]or lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.”

Despite legions of articles and blogs on the topic, at least certain portions of the bench haven’t taken heed.  In the case In re: Zurn Pex Plumbing Products Liability Litigation, 2009 U.S. Dist. LEXIS 47636 (June 5, 2009) (hereinafter “Zurn”), U.S. District Judge Ann Montgomery receives points for understanding some basic e-discovery tenets around recall and precision, but then mysteriously goes where “angels fear to tread” by suggesting her own search terms.

Examining the case facts in more detail: Zurn is a class action products liability case where discovery was bifurcated (as is often the case – see Spieker v. Quest Cherokee) to first cover the class “certification” component.  Initially, the Magistrate partially closed the door on broader ESI discovery, stating that “while ESI may prove to be relevant to the first stage of discovery, we cannot meaningfully make that prediction now, and require the parties to engage in what could be vastly more expensive, and yet utterly futile, discovery.”  However, the Magistrate didn’t shut the door entirely, suggesting that “should the parties uncover voids in the information disclosed in hard copy form, they are . . . at liberty to press for further discovery including electronically stored information.”

Despite complying with Sedona’s Cooperation Proclamation (“The parties have worked amicably throughout the discovery process”), opposing counsel still came to loggerheads when plaintiff found “voids” in the initial paper productions via third party discovery.  The plaintiff brought a motion to compel ESI discovery and the defendant objected, stating two primary arguments: (1) the Magistrate earlier ruled out ESI discovery and (2) if they had to perform ESI discovery it would be unduly burdensome/expensive.

Judge Montgomery summarily rejected the first argument, but was concerned about the burden surrounding the proposed ESI discovery.  Here, the calculations get a bit confusing, but plaintiff’s request would have resulted in 361 gigabytes of ESI from employee email sources, as well as shared “J” and “K” drives.  The defendant multiplied the gigabyte number by 75,000 pages per gigabyte, which would have required “approximately seventeen weeks and cost $ 1,150,000, exclusive of vendor collection and processing costs, to review and process the data.”  Assuming a rather modest $1,000 per gigabyte for processing and hosting costs, defendants could’ve added another $400,000 for the project.
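The arithmetic behind these figures can be checked directly. The snippet below uses only numbers quoted in this post (361 GB, 75,000 pages per gigabyte, and the assumed $1,000 per gigabyte for processing and hosting); the variable names are illustrative.

```python
gigabytes = 361
pages_per_gb = 75_000
processing_cost_per_gb = 1_000  # dollars per GB, the rate assumed in the post

total_pages = gigabytes * pages_per_gb
processing_cost = gigabytes * processing_cost_per_gb

print(f"{total_pages:,} pages")            # 27,075,000 pages
print(f"${processing_cost:,} processing")  # $361,000 processing
```

The page count lines up with the “27 million pages” the court refers to, and the processing estimate is in the same range as the rounded $400,000 figure.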

Ultimately, the court was not persuaded by the supporting affidavits, nor the attorney’s representations about the resulting burden:

“It is unclear whether Zurn’s cost and time numbers are based on a review of 27 million pages of documents, the 3.6 million pages of documents limited to the J Drive and custodians’ emails, or a smaller sample of document pages likely to be flagged as a result of a search for certain relevant terms proposed by Plaintiffs. The affidavit of Ms. Freestone, an attorney and not an expert on document search and retrieval, is not compelling evidence that the search will be as burdensome as Zurn avers.”

The 361 gigabytes apparently resulted from “hits” corresponding to plaintiff’s 26 search terms.  The court correctly identified that those terms had precision issues (“many of Plaintiffs’ proposed search terms will likely produce a large number of ‘hits’ that have limited relevance in the case.”)

Unfortunately, in an effort to increase the search precision, the Judge did not take heed of Judge Grimm’s warning and surprisingly took matters into her own hands: “the Court will limit the search to the following fourteen terms based on the likelihood that they will  produce relevant documents without including a vast number of documents that are likely irrelevant to the litigation.”  Here is the Judge’s list of keywords:

(1) AADFW,
(2) Corrosion,
(3) Corrosive,
(4) Corrosive Water,
(5) Crack,
(6) De-zinc,
(7) Dezincification,
(8) DZR,
(9) Fail,
(10) IMR,
(11) Leak,
(12) MES,
(13) SCC,
(14) Stress corrosion cracking

Without looking at the underlying data, it’s clear from the outset that Judge Montgomery didn’t craft a good search strategy (as Judge Grimm might have predicted).  For example, terms 2, 3, 4 and 14 could’ve been captured by a single stemmed search using the term “corros*.” Without such a stemmed search approach, the terms would probably have been run singly in the proposed protocol, meaning that each one would’ve had tremendous duplication, thereby resulting in wasted attorney review time and processing costs.
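To illustrate the overlap, here is a small sketch that runs a stemmed pattern against the court's fourteen terms. The regular expression is only a stand-in for the wildcard query “corros*”; actual review platforms have their own wildcard and stemming syntax, which is not shown here.

```python
import re

court_terms = [
    "AADFW", "Corrosion", "Corrosive", "Corrosive Water", "Crack",
    "De-zinc", "Dezincification", "DZR", "Fail", "IMR",
    "Leak", "MES", "SCC", "Stress corrosion cracking",
]

# Regex approximation of the wildcard "corros*": any word beginning with "corros".
stemmed = re.compile(r"\bcorros\w*", re.IGNORECASE)

# Collect the 1-based positions of the terms the single stemmed query would cover.
covered = [i for i, term in enumerate(court_terms, start=1) if stemmed.search(term)]
print(covered)  # [2, 3, 4, 14]
```

Any document hit by “Corrosive Water” is necessarily hit by “Corrosive” as well, so running the four terms separately mostly returns the same documents several times over.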

Judge Montgomery did recognize the potential error of her ways and gave the parties an out:

“The parties may decide on a different set of fourteen terms if they choose to do so. Additionally, if the search, as ordered by the Court, proves to be overly burdensome or costly, Zurn may renew its objection by presenting the Court with specific information including evidence from computer experts on applying the search terms, the number of documents identified, and the cost and time burdens of vetting documents.”

This “specific evidence” language seems to track notions from Sedona’s search best practices protocol, which prescribes sampling and iterative search term refinement.  What is surprising is that, knowing this, she would nevertheless blindly proffer the 14 term search strategy.  Instead, she should’ve quoted Victor Stanley and required the parties to come up with a data driven approach that met requisite precision and recall metrics.

The Electronic Discovery Sheriff Is Back In Town

Thursday, January 29th, 2009

As Tiger Woods is to golf, the Honorable Shira A. Scheindlin is to electronic discovery.  She has unquestionably been the most dominant/visible/outspoken jurist in the electronic discovery realm over the past decade, penning, amongst others, the Zubulake opinion, which is commonly referred to as the gold standard in electronic discovery.

But, like Woods, who recently took a sabbatical to mend his surgically repaired knee, Judge Scheindlin has recently been eclipsed by several other notable electronic discovery jurists, namely Judge Grimm (of Victor Stanley and Mancia fame) and Judge Facciola (aka “the Italian Stallion”), both of whom made numerous “best of the year” electronic discovery case law lists.

With Securities and Exchange Commission v. Collins & Aikman Corp., 2009 WL 94311 (S.D.N.Y., Jan. 13, 2009) Judge Scheindlin serves notice that the sheriff is back in town.  She not only tackles a number of thorny electronic discovery topics, but ambitiously takes on the US government in the process.  It’s a fairly lengthy opinion, well worth the read, so I’ll just excerpt a few of the notable takeaways.

As a bit of background…  the Collins case centered around a securities fraud complaint brought by the SEC against the Collins & Aikman Corp. and its former CEO David A. Stockman.  The crux of the dispute surrounded questions concerning the government’s discovery obligations in civil discovery (versus in a purely SEC investigation per se).

There were four distinct but interrelated disputes, namely:

“(1) Whether identifying responsive documents that have been organized by the producing party invades the protection accorded to attorney work-product and how a government agency-acting in its investigative capacity-must respond to a request for the production of documents. (2) Whether a government agency may unilaterally restrict the scope of its search based on an assertion of an “undue burden” on limited public resources. (3) How much information the Government must disclose in order to allow an adversary-and the court-to assess an objection based on the deliberative process privilege. (4) Whether a government agency may unilaterally exclude its own e-mail from document production on the ground that most-but not all-will be privileged.”

Addressing the work product claims, the court found against the government, again reinforcing several recent opinions about electronic discovery search:

“The SEC contends that Stockman can search through the ten million pages and find substantially the same documents identified by the SEC without impinging on the thought processes of the SEC attorneys. Indeed-at significant expense and delay-Stockman could search the document databases using appropriate search terms, but the inaccuracy of such searches is by now relatively well known.  A page-by-page manual review of ten million pages of records is strikingly expensive in both monetary and human terms and constitutes “undue hardship” by any definition.” [Citing George L. Paul and Jason R. Baron’s article, Information Inflation: Can the Legal System Adapt?]

After losing the first battle, the SEC argued that even if the compilations were not protected as work product, it could produce the “complete, unfiltered, and unorganized investigatory file” since this was how the documents were “maintained in the usual course of its business.”  This second attempt was similarly unpersuasive, as Judge Scheindlin held that the “usual course of business” exemption did not apply:

“[C]onducting an investigation-which is by its very nature not routine or repetitive-cannot fall within the scope of the “usual course of business.” While the SEC routinely collects and maintains regulatory submissions such as 10-K reports, in its investigative capacity the agency conducts tailored probes of a company or an industry, requiring the gathering of records from diverse sources. Many if not most of the 1.7 million documents in the SEC production here were likely collected in the agency’s investigatory role. Thus it is no surprise that the complete collection is maintained as it was collected-in large disorderly databases. The documents can only be provided in a useful manner if the agency organizes or labels them to correspond to each demand.”

Next, Judge Scheindlin addressed the SEC’s decision to “unilaterally” limit its search to “centralized compilations” which ultimately “turned up nothing.”  She found that the SEC’s “blanket refusal to negotiate a workable search protocol” was “patently unreasonable” citing both Mancia and the Sedona Conference’s Cooperation Proclamation:

“Rule 26(f) requires the parties to hold a conference and prepare a discovery plan. … Had this been accomplished, the Court might not now be required to intervene in this particular dispute. I also draw the parties’ attention to the recently issued Sedona Conference Cooperation Proclamation, which urges parties to work in a cooperative rather than an adversarial manner to resolve discovery issues in order to stem the ‘rising monetary costs’ of discovery disputes.”

As the coup de grâce, Judge Scheindlin addressed and rejected out of hand the SEC’s most untenable claim: that it would not produce e-mail “generated or received by the Commission itself” because “nearly all responsive e-mails will be privileged, protected, or non-substantive.”

“Because e-mails are inherently searchable, the SEC’s blanket refusal to produce any in-coming or outgoing e-mails is unacceptable. Without even an attempt to negotiate search terms that would weed out privileged, protected, or irrelevant e-mails, the SEC cannot reasonably assert that a routine aspect of modern discovery-search and review of a party’s e-mail-is beyond its capability. Essentially, the SEC’s position is that the cost of such a search is simply too high, but it has made no effort to document the cost or the likelihood that it would produce relevant, nonprivileged material. The concept of sampling to test both the cost and the yield is now part of the mainstream approach to electronic discovery.”
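
The court’s point about sampling “to test both the cost and the yield” can be made concrete with a rough sketch. Nothing below comes from the opinion: the function name, the per-document review cost and the `is_responsive` callback (a stand-in for a human reviewer’s call on each sampled document) are all assumptions for illustration, and the margin of error uses a simple normal approximation.

```python
import math
import random


def estimate_yield(population_size, sample_size, is_responsive, cost_per_doc, seed=0):
    """Review a simple random sample and extrapolate yield and cost.

    is_responsive is a hypothetical callback standing in for a reviewer's
    decision on one document id; cost_per_doc is an assumed review cost.
    """
    rng = random.Random(seed)
    sample_ids = rng.sample(range(population_size), sample_size)
    hits = sum(1 for doc_id in sample_ids if is_responsive(doc_id))
    rate = hits / sample_size
    # Normal-approximation 95% margin of error on the sampled rate.
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return {
        "sample_hits": hits,
        "yield_rate": rate,
        "margin_of_error": margin,
        "estimated_responsive": round(rate * population_size),
        "sample_review_cost": sample_size * cost_per_doc,
        "full_review_cost": population_size * cost_per_doc,
    }


# Illustrative run: pretend every tenth e-mail is responsive.
result = estimate_yield(100_000, 1_000, lambda doc_id: doc_id % 10 == 0, cost_per_doc=1.50)
```

A party armed with numbers like these can document both the likely cost of a full search and the likely return, which is exactly what the SEC was faulted for not doing.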

At the end of the day, the Collins opinion seems to make the statement that Judge Scheindlin is back with a vengeance, and she’s serving notice that the government isn’t above the law:

“Like any ordinary litigant, the Government must abide by the Federal Rules of Civil Procedure.”

Besides knocking the government down a peg, Judge Scheindlin throws her judicial weight behind a number of important but nascent trends, including the Sedona Cooperation Proclamation, the related need to meet & confer, the use of sampling and the challenges of electronic discovery search. While none of these notions are groundbreaking, her substantial backing means increasing clarity for lawyers and litigation support practitioners everywhere.  And, that’s certainly welcome.

The “Artful” E-Discovery Dodger

Monday, October 13th, 2008

E-Discovery search has become a hot topic of late (in blogs and in the news), and I think it’s pretty clear that the unwashed (attorney) masses still don’t really grok the importance of using a defensible search protocol.  Neither do they seem to understand the enhanced scrutiny that’s being applied by the judiciary.

Kipperman v. Onex Corp., 2008 WL 4372005 (N.D. Ga. Sept. 19, 2008) is another in what will assuredly be a long string of cases that demonstrate how easy it is for litigators to get wrapped around the axle of e-discovery search.  In Kipperman, the defendant (Onex) presented several motions to the court, including attempts to obtain relief from the need to produce email identified after searching several backup tapes.

During a previous hearing the court ordered Onex to search all the mailboxes on two tapes, as well as on an additional tape selected by Plaintiff. The court determined that despite Onex’s objections and representations, the backup tapes were “producing meaningful discoverable information.”  The court was nevertheless sympathetic to Onex’s burden and therefore weighed in with some guidance:

“The court did suggest, … , that Plaintiff be more artful with its search terms and that Plaintiff utilize a list of the people, provided by Defendants, to review whether all mailboxes needed to be searched.”

The court also gave Onex the chance to narrow the search terms.  Unfortunately, they didn’t seize the opportunity to provide a narrower list or a refinement of their search terms.  “As such, they agreed to search and restore all the mailboxes with the search terms provided by Plaintiff.”

Not surprisingly, Onex then sought relief from having to review and produce all of the results from the search because the “broad search terms resulted in thousands and thousands of irrelevant hits.”  For example, the search terms included the word “republic,” which was used to elicit emails regarding Republic Builders Products, one of the companies involved in this matter.

“Defendants claim that the search captured thousands of irrelevant pages due to one occurrence of the word ‘republic’ often related to Onex business interests having nothing to do with Magnatrax in the ‘Republic of France,’ ‘Republic of Ireland,’ and ‘Czech Republic’.”
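
The “republic” problem is easy to reproduce. The sketch below is only an illustration: the four sample e-mails are invented, and the narrower phrase pattern is one hypothetical refinement, not the terms actually at issue in Kipperman.

```python
import re

# Invented sample e-mails, for illustration only.
emails = [
    "Quarterly numbers for Republic Builders Products attached.",
    "Travel plans for the Czech Republic office visit.",
    "Tax filing in the Republic of Ireland is due Friday.",
    "Republic Builders Products board call moved to 3pm.",
]

# The broad term matches any occurrence of the word.
broad = re.compile(r"\brepublic\b", re.IGNORECASE)
# A narrower, hypothetical refinement: require the company name phrase.
narrow = re.compile(r"\brepublic\s+builders\b", re.IGNORECASE)


def hits(pattern, docs):
    """Return the documents in which the pattern occurs at least once."""
    return [doc for doc in docs if pattern.search(doc)]


broad_hits = hits(broad, emails)
narrow_hits = hits(narrow, emails)
print(len(broad_hits), "broad hits;", len(narrow_hits), "narrow hits")
```

The trade-off cuts both ways: a phrase this narrow would miss an e-mail that refers to the company simply as “Republic,” which is why refined terms should be tested against a sample before anyone commits to them.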

Again the court reaffirmed its sympathy with Onex’s burden and yet denied the requested relief, in large part because Onex had been warned to be more “artful”:

“[T]he court is not unsympathetic to the massive amount of discovery involved in this matter, the considerable burden of working with it, and the overproduction that often comes with e-mail production. Therefore, the court gave Defendants numerous tools by which to reduce the burden of e-mail discovery, including an opportunity to limit Plaintiff’s search terms and an opportunity to provide a list by which the number of peoples and the number of boxes being searched could be reduced. Defendants did not take advantage of these opportunities. Defendants must now lie in the bed that they have made. Thus, Defendants’ objections on the basis of relevancy and volume are DENIED.” (emphasis added).

Needless to say, Kipperman is probably not all that atypical.  Attorneys everywhere have historically used blunt e-discovery search instruments and haven’t often run afoul of the judiciary.  Now, post Victor Stanley, et al, the playing field has changed dramatically.  It’s important to leverage best practices (from Sedona and others), craft a defensible search strategy, sample the results and “show your work.”  Missteps along the way, especially ones that the court has tried to help the parties avoid, won’t be met with much tolerance.

Judge Grimm, Victor Stanley, And The Problem Of “Black-Box” E-Discovery Search

Friday, August 22nd, 2008

Judge Paul Grimm’s recent opinion in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008) provides valuable guidance on one of the most important issues in e-discovery: how to conduct keyword searches in a defensible manner given that keyword searches are prone to produce over- and under-inclusive results.  The ruling suggests one of two approaches: either producing parties should adopt a “collaborative” approach to conducting keyword searches, whereby each party agrees on a search methodology; or, they should use a “best practices” approach, such as the one suggested by Sedona, where the producing party tests, samples, and iteratively refines searches so that they can demonstrate they have taken reasonable measures to reduce over- and under-inclusive results.

While the guidance is clear, following the guidance in practice is very difficult.  The primary reason for this is that the search technology being used in e-discovery today is not up to the task.  Specifically, today’s search technology suffers from three problems:

  1. The over- and under-inclusive tradeoff. Many technologies have been developed to address the tendency of keyword searches to miss relevant documents and produce under-inclusive results.  Wildcard and stemming technology has been developed in order to address the issue of finding common word variations in specified keywords.  Concept search has been designed to find documents containing words with similar meanings to the keywords in a search.  And fuzzy search technologies have been put in place to find misspellings of words. However, all of these suffer from the same problem: they produce too many non-relevant or “false positive” documents thus driving up the cost of review. For example, if someone runs the wildcard search “divers*”, then he or she not only gets the desired documents containing “diverse” and “diversity”, but also gets a large number of false positive documents containing “diversion”, “diversification”, and so on.  In the case of concept and fuzzy search, the problem is so great that these technologies to date have rarely been used in e-discovery.
  2. Too expensive to test, sample and refine searches. Today’s search technologies are largely designed to run one search at a time, not the dozens of searches that are typical in e-discovery. As a result, anyone trying to follow the best practices of testing, sampling, and refining each search will find themselves missing deadlines and running over budget because it takes so long. This also makes collaboration with the opposing party close to impossible, since there’s little time to iterate on – and agree upon – a set of keyword searches.
  3. Manual documentation. It’s not enough for producing parties to use best practices; they have to document them so that they can “show their work” to the court. Currently, documenting the search refinement process is mostly manual, with the result that it is either done inadequately or not at all.
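
As a rough idea of what non-manual documentation (the third problem above) could look like, here is a minimal sketch that records each search iteration in a structured log. The record fields, the `SearchIteration` name and the sample figures are all assumptions made for illustration, not a description of any particular product.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SearchIteration:
    """One documented round of the test, sample and refine loop (hypothetical schema)."""
    run_date: str
    query: str
    hit_count: int
    sample_size: int
    sample_responsive: int
    note: str

    @property
    def sample_precision(self):
        # Share of the sampled hits that a reviewer marked responsive.
        return self.sample_responsive / self.sample_size if self.sample_size else 0.0


def to_log(iterations):
    """Serialize the iterations to JSON so the history can be produced later."""
    rows = []
    for iteration in iterations:
        row = asdict(iteration)
        row["sample_precision"] = round(iteration.sample_precision, 3)
        rows.append(row)
    return json.dumps(rows, indent=2)


# Made-up figures for two rounds of refinement.
history = [
    SearchIteration("2008-08-01", "divers*", 1200, 100, 35, "too many 'diversion' hits"),
    SearchIteration("2008-08-04", "diverse OR diversity", 410, 100, 82, "narrowed expansion"),
]
print(to_log(history))
```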

The reason why the search technology used for e-discovery has these problems is surprisingly simple: it’s because the technology was not designed for e-discovery in the first place. Rather, it was built for enterprise search, and was only later repurposed towards e-discovery.

The “Black Box” Of Enterprise Search

The core issue is that enterprise search technology has been designed to be a “black box”. Users enter a single search query into one end, and get results at the other, with no visibility into what happens in between. Going back to our previous example, when a user searches for “divers*” intending to find documents related to “diversity” or “diverse”, enterprise search engines give the user no visibility into the crucial step of query expansion and how it expands the search query into relevant and non-relevant terms like “diversion” and “diversification”. As a result, the user has no ability to minimize the false positives.
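
To see what visibility into query expansion might mean, consider this minimal sketch. It assumes a toy vocabulary of indexed terms and uses Python’s standard `fnmatch` module to expand the wildcard; real search engines expand queries in their own ways, so treat this purely as an illustration.

```python
import fnmatch

# Toy vocabulary standing in for the terms found in an index (invented).
index_terms = ["diverse", "diversity", "diversion", "diversification", "diver", "driver"]


def expand_wildcard(pattern, vocabulary):
    """List every indexed term that the wildcard pattern would match."""
    return sorted(term for term in vocabulary if fnmatch.fnmatchcase(term, pattern))


expanded = expand_wildcard("divers*", index_terms)

# With the expansion exposed, the user can keep only the intended variants.
wanted = {"diverse", "diversity"}
kept = [term for term in expanded if term in wanted]
dropped = [term for term in expanded if term not in wanted]

print("expanded:", expanded)
print("kept:", kept)
print("dropped:", dropped)
```

Once the expansion is listed, dropping “diversion” and “diversification” is a one-line decision rather than a surprise discovered during review.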

In the same vein, when a user enters multiple queries into a “black box” enterprise search engine, all of the queries run as a single search, and the user has no visibility into which results are associated with which query. For example, a user who searches for “hiring OR interview” will get the results for the combination of the queries “hiring” and “interview”. He or she won’t know that only 5 documents contained “hiring” while 100 documents contained “interview.”  This limitation makes analyzing, sampling and refining searches costly and time consuming.
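
The per-query breakdown that a “black box” hides is simple to picture. In this sketch the three documents and the helper name `per_term_hits` are invented for illustration; the point is only that each term’s hits are tracked separately before being combined.

```python
import re

# Invented documents keyed by id.
documents = {
    1: "Notes from the interview with the candidate.",
    2: "Hiring freeze announced for Q3.",
    3: "Interview schedule for next week.",
}


def per_term_hits(terms, docs):
    """Map each search term to the set of document ids that contain it."""
    report = {term: set() for term in terms}
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for term in terms:
            if term in words:
                report[term].add(doc_id)
    return report


report = per_term_hits(["hiring", "interview"], documents)
# The OR query is just the union, but the per-term counts remain available.
combined = set().union(*report.values())

for term, ids in report.items():
    print(term, len(ids))
print("combined", len(combined))
```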

That’s not to say that enterprise search products like Autonomy or Endeca are flawed. Far from it.  Their “black box” design works exceedingly well for the simple and quick queries that people want to run across the enterprise for general business purposes. If a sales manager is looking for a single proposal for her meeting the following day, then she doesn’t care how the search was performed or if it’s over-inclusive.  She’s only interested in the first page of relevant results, and for that use case enterprise search engines do a great job.

But e-discovery is a whole different world.  In e-discovery, users typically must review every single document in the search results, not just the most relevant ones.  As a result, over-inclusive searches can dramatically increase the costs of downstream production and review.  And under-inclusive searches raise the issue of defensibility.  Finally, e-discovery users have to run a lot of search queries and understand which documents are associated with each of those queries.

So, going back to the original problem, if current search technologies cannot help lawyers and litigation support professionals follow Judge Grimm’s guidance and address the “well-known limitations” of keyword search, what can? That will be the subject of my next post.

Five E-Discovery Questions with Craig Ball

Tuesday, August 12th, 2008

In the spirit of the popular New York Times magazine feature, with this post we inaugurate what we hope to be a long-running series of interviews with e-discovery luminaries to get their take on emerging ideas and trends (and hopefully have some fun as well).

Today’s questionee is e-discovery and forensics expert (and popular Law Technology News columnist) Craig Ball.  Craig’s combination of wit and insight speaks for itself, so let’s just get right to the questions.

1) The cases that are on everyone’s mind are O’Keefe/Lundin and Victor Stanley. What’s the practical impact of these rulings to the e-discovery practitioner?

Certainly these decisions have captured my enthusiastic attention.  Lawyers now have to devote greater care and thought to electronic search, and wake to the empirical evidence establishing the shocking shortfalls of keyword search in unstructured ESI collections.  The days of “let’s try these search terms and see what happens” are numbered.  Queries that will be run across mushrooming collections must pass muster in terms of noisiness, ambiguity, potential for misspelling, affinity to stemming, synonyms, slang, acronyms, IM-speak and other criteria unfamiliar to a profession that prides itself on precise expression.  Lawyers need to embrace concepts of “precision,” “recall” and “sampling” with the same fervor we once brought to the Statute of Frauds and the Rule Against Perpetuities.

Currently, lawyers on both the north and south sides of the docket are the unjust beneficiaries of slipshod search.  Requesting parties benefit from the economic leverage attendant to costly-yet-unavailing fishing expeditions while counsel for producing parties mint obscene pyramidal profits reviewing mountains of electrochaff.  Despite all the vitriol, rarely does either side’s counsel set out to exploit flawed searches.  It’s mostly blissful ignorance at work, coupled with little incentive to fix what’s broken.  Accordingly, Judges like Facciola and Grimm are picking up the baton and running with it.  It’ll be a long, tough race—and not every jurist will head for the tape—but I applaud those who’ve left the blocks!

Search demands nuance, discipline and scientific method.  Prepare to routinely test queries against sample collections, as soon that practice will be as commonplace as DNA testing in paternity cases.
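
(Editor’s aside, not part of Craig’s answer: for readers new to “precision” and “recall,” the two measures can be computed from a reviewed sample as in the sketch below. The document ids are made up, and in practice the “relevant” set would come from human review of a sample rather than being known in advance.)

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a search.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ids: the query returned four documents, five are truly relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(f"precision={precision:.2f} recall={recall:.2f}")
```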

2) What can e-discovery technology providers do to help?

At the risk of appearing ungracious, I can’t help but note that vendors eat at the same gluttonous table as lawyers, and vendor marketing is often so much snake oil.  Until the EDD vendor community takes a longer view of the market, stops building businesses for acquisition and starts building them to last, I don’t think they can be of much help.  The industry should stop pretending their processes and software are “proprietary” and touting their secret sauces.  Instead, how about delivering consistent, predictable service and pricing delivered by experienced, reliable and unflinchingly honest, genuinely knowledgeable personnel who welcome the chance to help lawyers understand this stuff.  If employees stayed around more than six months, that would be nice, too.

3) You recently participated in a new track at LegalTech West called FutureTech.  For those who missed it or the follow-up podcasts, what’s an emerging e-discovery trend that you think might take people by surprise?

Several come to mind.  Mediated meet-and-confer, for example.  The cost of a failed EDD effort can dwarf the amount in controversy, so it makes sense to turn to neutral, technically adept intermediaries to help resolve nettlesome questions of scope, search, forms of production and cost sharing.  Folks just behave better when company comes.  I also foresee divergence between discovery and the other traditional phases of litigation.  We may see entirely different teams handle discovery in a zealous but non-confrontational manner, leaving the scorched earth stuff to others.

Another development that will sneak up on most lawyers is the growing marginalization of text.  As natural interfaces emerge—where you will talk or gesture to your computers—and as communication gets more real time and visual, words will manifest conduct less frequently.  Take YouTube.  I don’t get it—to me, it’s silly and boring—but it’s rich and exciting to my kids…and text is tertiary.

Something else that will change is where we look for evidence.  If you were pursuing discovery against a teenager, where would you go to locate their most revealing ESI?   Social networking (virtualized storage)?   Cell phones and laptops (portable devices)?   Gaming devices (alternate platforms)?  In ten years, don’t imagine they won’t favor and extend the tools they grew up with.

Data is the ultimate portable commodity, so it’s odd we don’t take our computing environments with us. We will. If desktop machines survive, they will be little more than screens with network connectivity temporarily hosting the virtual identities we carry in our pockets or store online. Local hard drives will be an increasingly irrelevant place to search for files as EDD turns to personal storage devices and online storage.

Other trends lawyers may not foresee: People will retain much more data as there will be little incentive and less time to make it go away. “Cheaper to keep her” will be how most of us deal with data.  Location data will be routinely tracked by many devices with GPS functionality on and about our person, so this will become a new and useful evidence stream.  Virtual machines will be used as forms of production.  Local storage will give way to cloud storage.  Hey, I could do this one all day!

4) You have an extensive background in both e-discovery and computer forensics. Do you see a convergence, or will they remain largely separate worlds from a process and technology perspective?

I see convergence already.  “Forensically sound” practices are creeping into EDD harvest and traditionally rigid approaches to disk forensics are being challenged by the practical realities of immense volume and mission-critical operations.   We see the growth of “live” forensics, hash values displacing Bates numbers and operating systems allowing more and more deleted information to be easily resurrected.

The tools and techniques of each discipline are also converging.  But there will remain a distinction between the two flowing from the unique ability of a skilled forensics examiner to distill the bits and bytes into a compelling tale of human strength or frailty.  It’s painfully easy to misread the significance of digital footprints.  There’s a component of science and art to computer forensics that will ensure its distinction and growth.

We face convergent challenges, too.  In both forensics and EDD, the lure of lucre pulls in people who really ought to be doing something less harmful.  Lives, liberty, fortunes, and careers hinge on some computer forensic examinations; yet, some schools and tool sellers promote the notion that you can learn what you need to know over a long weekend.  Just as many copy shops decided they were e-discovery experts one dark night, a lot of poorly trained, incurious and careless forensic examiners are popping up all over.  I’m frankly appalled by some of what I see out there.   Where I hope we ultimately converge is a high standard of professionalism and proven expertise.
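
(Editor’s aside, not part of Craig’s answer: the idea of “hash values displacing Bates numbers” rests on the fact that a cryptographic hash identifies a file by its content rather than by a stamped sequence number. A minimal sketch using Python’s standard `hashlib` follows; the choice of SHA-256 and the sample byte strings are assumptions for illustration.)

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Identical content yields the identical identifier, wherever the copy lives.
a = content_hash(b"Quarterly report, final version")
b = content_hash(b"Quarterly report, final version")
# Any change to the content, even one character, yields a different identifier.
c = content_hash(b"Quarterly report, final version.")

print(a == b, a == c)
```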

5) Finally, the question on the mind of every loyal “Ball in Your Court” reader: Which court is it — basketball, tennis, or volleyball?

I’ve never been much for team sports, but if I have to choose, I opt for the one played on the beach by fit, bikini-clad women.  I may be a hopeless nerd, but I’m not stupid.