Posts Tagged ‘IBM’

IBM’s Watson: Can It Be Used for E-Discovery?

Thursday, May 12th, 2011

As the buzz around Watson and its foray into human-like (actually super-human) performance subsides, it may be time to take stock of what all the fuss was about. After all, we’re all used to computers doing better than humans at many things, and we take their superior store of knowledge for granted. And, on the surface, we get answers to questions on pretty much anything from a simple Google or Bing search. So, what really is the big deal, and is it even relevant in the context of electronic discovery?

For those not clued in, Watson is the brainchild of a four-year effort by 20-25 researchers at IBM to build a computing engine that could compete at champion level on the popular quiz show Jeopardy. Although it blundered on a couple of answers, it competed very well, winning by a wide margin. Several industry experts who followed the project and watched the show have lauded this as an accomplishment on the same scale as, or even greater than, IBM’s Deep Blue beating chess world champion Garry Kasparov in 1997. So, let’s examine whether it is indeed worthy of the accolades it has received.

Behind Watson is an impressive piece of hardware – 90 IBM Power 750 nodes with a combined 16TB of memory and 2,880 Power7 processor cores, delivering a staggering 80 teraflops of peak performance. All the hardware is highly interconnected, with the ability to work on problems in parallel while still converging on a final result in three seconds or less – just fast enough to beat the human buzzer. Some highlights of the computing infrastructure from IBM’s hardware architect, Dr. James Fan, indicate that the three-second timeframe meant the entire corpus of 200 million pages had to be loaded into memory. With several processors simultaneously working on pieces of the problem, the I/O requirements are also very high. The hardware supports a multi-processing OS, with virtualization, in a workload-optimized system. The software drives the hardware using thousands of dense threads, each thread of execution processing a large chunk of work with minimal context switching; given the large number of cores, each thread is optimally allocated to a core. Branded as DeepQA, the software executes a series of complex algorithms in order to solve a very specific problem: winning at Jeopardy.
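A quick back-of-envelope check of those published figures (a sketch I put together, not anything from IBM) shows what each node contributes:

```python
# Back-of-envelope check of Watson's published hardware figures:
# 90 Power 750 nodes, 2,880 Power7 cores, 16 TB RAM, 80 TFLOPS peak.
nodes = 90
cores = 2880
memory_tb = 16
peak_tflops = 80

cores_per_node = cores / nodes                    # 4 sockets x 8 cores = 32
memory_gb_per_node = memory_tb * 1024 / nodes     # RAM available per node
gflops_per_core = peak_tflops * 1000 / cores      # peak throughput per core

print(f"{cores_per_node:.0f} cores/node, "
      f"~{memory_gb_per_node:.0f} GB RAM/node, "
      f"~{gflops_per_core:.1f} GFLOPS/core")
```

The per-node numbers make the in-memory design plausible: roughly 180 GB of RAM per node is ample headroom for each node's slice of a 200-million-page corpus.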

First, the Jeopardy game provides categories of clues. Some categories help in understanding the clue, while others are simply misleading to a computer. Next, the clue is revealed, and one must determine what it is really asking, since many clues do not request a factoid with a direct question but are instead compositions of multiple sub-clues, each related to the others by some linguistic, semantic, syntactic, temporal, or other connection. Decomposing the clues and figuring out these relationships is a challenge even for humans. Finally, after understanding the clue, one has to home in on an answer with some level of confidence, within a three-second window, and must activate the buzzer ahead of the other competitors. Beyond individual clues, one also has to devise an overall game strategy: which category to select next, which clue within that category, and how much to wager on Daily Doubles and Final Jeopardy. Overall, the game is a complex amalgamation of knowledge, language analysis, gaming strategy, and speed of recall.

The software architecture of the DeepQA system is documented in a paper published in AI Magazine. The team built several components to address each area of the problem, with many independent algorithms in each component.  There are lots of complicated technical details, but the final outcome is a human-like response.

A question anyone who examines its inner workings will have is whether the system is really natural language processing, or statistical language analysis, or machine learning, or some ad-hoc program that doesn’t fit any traditional area of analytics. It appears to be a combination of several techniques, which may mirror exactly how humans go about solving these clues. We seem to hold a large collection of knowledge, initially unconnected, but the category, the clue, and the hypothesis all appear to generate word and concept associations, and a fuzzy evaluation of confidence measures converges into the confidence with which a competitor answers a question. It is the replication of these processes by algorithms that makes this a truly astounding achievement.
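The idea of many independent scorers converging into a single confidence can be sketched very simply. This is an illustrative toy, not DeepQA’s actual algorithm; the scorer names, weights, and bias below are all invented:

```python
import math

# Toy sketch of evidence merging: several independent scorers each rate a
# candidate answer, and a weighted logistic combination squashes their
# outputs into a single confidence in [0, 1]. Names/weights are invented.

def combined_confidence(scores, weights, bias=-2.0):
    z = bias + sum(weights[name] * s for name, s in scores.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic squash to a confidence

weights = {"type_match": 1.5, "passage_support": 2.0, "temporal_fit": 0.8}

# One candidate answer, scored by each evidence dimension
candidate = {"type_match": 0.9, "passage_support": 0.8, "temporal_fit": 0.6}
conf = combined_confidence(candidate, weights)
print(f"confidence = {conf:.2f}")  # buzz only if this clears a threshold
```

In the real system the weights would be learned from training data rather than hand-set, but the shape of the computation — many weak signals merged into one calibrated number — is the same.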

Given the success of DeepQA’s performance, a natural question is whether it has any practical value for helping us solve day-to-day problems. More specifically, can it cope with the information overload and the e-discovery challenges posed by that mass of information? Its use in an e-discovery context has been explored by several authors, most notably Robert C. Weber of IBM and Nick Brestoff in recent Law.com articles. Their analysis is based on the ability to explore vast volumes of knowledge. But what DeepQA tackled is really something more significant – the inherent ambiguity in human spoken and written communication. Our natural instinct is to employ subtle nuances, indirect references, implicit assumptions, and incomplete sentences, and to leverage prior and surrounding context in most of our communications. It’s just the natural way of communicating, and it is actually very effective: we treat re-establishing context as redundant, unproductive and unnecessary because it makes communication repetitive. By not imposing a rigid structure on how we write, we achieve concise exchanges that span a large volume of information.

If the last two decades are an indicator, the nature of communication is getting less formal, with emails, instant messages, tweets, and blog posts replacing well-crafted formal letters and memos. Forcing individuals to communicate in rigid, unambiguous text so that computers could process it easily would mean a huge change in how people communicate, and any action that contemplates such a change in behavior across billions of people is simply not going to occur. This means the burden on automated analysis using computing algorithms is even greater. It is what makes the discovery of relevant content in the context of e-discovery a very hard problem, one worthy of the sort of technological prowess employed by the DeepQA team.

Given that our appetite for producing information is ever-increasing while its discoverability is getting harder, taking the work of DeepQA and adapting it to e-discovery has the potential to significantly improve how we tackle the search, review and analytical aspects of e-discovery. DeepQA set an easily articulated goal: answer at least 60% of the clues with 85% precision in order to reach champion level. That was sufficient to win the game; note that there was never an attempt to answer 100% of the clues with 100% confidence. In the realm of e-discovery, the equivalent would be to take a very general production request, such as TREC 2009 Topic 201 (“All documents or communications that describe, discuss, refer to, report on, or relate to the Company’s engagement in structured commodity transactions known as prepay transactions.”), and use just such a simple articulation of the request to produce relevant documents. The core algorithms of machine learning, multiple scoring methods, and the management of relevance and confidence levels, along with traditional information retrieval methods, form the ingredients of the new frontier of automated e-discovery. Beyond e-discovery, applying DeepQA’s algorithms to business analytics also has significant potential, where fact- and evidence-based decision making over unstructured data is likely to become the norm. DeepQA’s very public Jeopardy challenge has shown that the ingredients needed to enable such problem solving are well within the realm of possibility.
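The 60%-at-85%-precision idea translates directly to document production: attempt only what the system is confident about, and measure precision over what was attempted and coverage over the whole set. A minimal sketch, with made-up scores and relevance labels:

```python
# Sketch of DeepQA-style confidence thresholding applied to document
# production: "answer" (produce) only items whose confidence clears a
# threshold, then report coverage and precision. Data below is invented.

def attempt_at_threshold(scored_docs, threshold):
    attempted = [(conf, relevant) for conf, relevant in scored_docs
                 if conf >= threshold]
    coverage = len(attempted) / len(scored_docs)   # share of docs attempted
    correct = sum(1 for _, relevant in attempted if relevant)
    precision = correct / len(attempted) if attempted else 0.0
    return coverage, precision

# (confidence, actually_relevant) pairs from a hypothetical classifier
scored = [(0.95, True), (0.90, True), (0.85, True), (0.80, False),
          (0.75, True), (0.60, True), (0.55, False), (0.40, False),
          (0.30, False), (0.20, False)]

coverage, precision = attempt_at_threshold(scored, threshold=0.7)
print(f"attempted {coverage:.0%} of docs at {precision:.0%} precision")
```

Raising the threshold trades coverage for precision, which is exactly the lever a review team would tune against its own accuracy target.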

Cutting Through The Confusion: A Buyer’s Guide To Electronic Discovery Software

Sunday, April 19th, 2009

Over the past 4 years, I have had hundreds of conversations with corporate counsel and “legal IT”, meaning technical folks charged with supporting the legal team. More and more of them are looking to lower their costs by bringing e-discovery in-house. But as they work through that process, there’s one question that consistently comes up, even today – namely, “When [insert name of software company] says they “do” e-discovery, what exactly does that mean?”

There has been progress towards answering this question, thanks mainly to the analyst community. George Socha and Tom Gelbmann’s EDRM framework has been immensely helpful in breaking down electronic discovery into its component steps. Other analysts, like Debra Logan at Gartner, were quick to embrace the framework, prompting every software provider to follow suit. As a result, there is today a common language that everyone uses to describe the e-discovery process.

The Electronic Discovery Reference Model (EDRM) breaks down the e-discovery process into a series of steps. Companies looking to buy e-discovery software to lower costs typically map different software products to each of these steps, to make sure that they cover the entire process.

But having a universally-agreed framework is only half the answer. To eliminate customer confusion, there also needs to be agreement on how different software products fit into the framework. This is especially important since there is no single, end-to-end solution for e-discovery which covers all aspects of EDRM. So customers are forced to think about how different software solutions fit together. And that is where things begin to fall apart.

Many software vendors feel it is advantageous to claim that they do everything, even though they do not. Customers are rightly suspicious of those claims, and so press vendors to provide more detailed information – hence the question, “when you say you do e-discovery, what exactly does that mean?”

In light of that, how can litigation support teams, corporate counsel, or legal IT people figure out which e-discovery solution best meets their needs? From observing this decision-making process hundreds of times, I have found 3 simple steps are incredibly helpful.

Step 1: Read the analyst reports

Two reports in particular make for required reading. One is Gartner’s MarketScope Report, which is available for free at certain sites; the other is the 451 Group’s recent e-discovery report, which is summarized in a publicly available presentation. The helpful thing about the 451 Group’s report is that it tells you which software companies do which parts of the EDRM process. You do have to buy the report to get the full picture (it’s well worth it!), but the publicly available presentation will give you a flavor of their analysis, and I have drawn from that presentation in the figure below:

Analyst firms like the 451 Group map software vendors to the EDRM framework according to what they actually do, which is often different from what software vendors claim they do.

The 451 Group’s analysis highlights several important points. First, it shows that there is no single end-to-end solution. Even the products of giants like EMC (SourceOne), HP (IAP), and IBM (CommonStore) only solve one piece of the puzzle, information management. Second, it shows that customers have choices at each stage of the EDRM process. For example, to solve the problem of identification, collection, and preservation of electronic information, customers can choose from solutions as diverse as Guidance EnCase (forensic collection), Index Engines (back-up tapes) and Mimosa NearPoint (email archive). Third, it provides an independent assessment of what vendors do, as opposed to what they may claim. For example, Kazeon claims analysis and review capabilities, whereas the report shows its product does identification, collection, and preservation; Recommind claims its Axcelerate eDiscovery and MindServer products do processing, whereas the report finds that they do not.
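This kind of analyst mapping is, at bottom, just a table of vendors against EDRM stages, and it can be useful to keep your own version of it while shortlisting. The sketch below records only the example assignments named in this post (it is not the 451 Group’s full data):

```python
# Vendor-to-EDRM-stage coverage map, populated only with the examples
# discussed above. Each vendor maps to the stages it actually covers.
EDRM_STAGES = ["information management", "identification", "collection",
               "preservation", "processing", "review", "analysis",
               "production"]

vendor_coverage = {
    "EMC SourceOne":    {"information management"},
    "HP IAP":           {"information management"},
    "IBM CommonStore":  {"information management"},
    "Guidance EnCase":  {"identification", "collection", "preservation"},
    "Index Engines":    {"identification", "collection", "preservation"},
    "Kazeon":           {"identification", "collection", "preservation"},
}

def vendors_for(stage):
    """List the vendors whose products cover a given EDRM stage."""
    return sorted(v for v, stages in vendor_coverage.items() if stage in stages)

print(vendors_for("information management"))
```

An empty result for a stage (e.g. processing, in this small sample) is exactly the gap a buyer needs to fill with another product.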

Step 2: Evaluate the products prior to purchase

Just as anyone would test-drive a car prior to purchase, it’s critical to test-drive e-discovery software. Any vendor should be willing to provide their software free of charge for an evaluation on-premise. The most effective evaluations are when the customer uses the product themselves, either on a live case or test data. This is far preferable to just sending the data to the vendor who then loads it into their system, as in that scenario there are too many opportunities for the vendor to hide their product’s shortcomings.

Step 3: Check references carefully

The trick with references is to insist on relevant references. It’s not good enough for the vendor to dredge up some random person who says nice things, or even a credible, knowledgeable person who is using the product in a completely different way. For example, if a company is happy with Autonomy’s IDOL for enterprise search, that does not tell you much about what Autonomy might be like for e-discovery. What really counts are references from customers who are using the product for the same application you are.

All this can sound like a lot of work, but I have seen people go through the process in as little as a month, and be much happier for it. A little work up front can save a lot of time (and heart-ache!) later on.

What Is FRCP Compliance?

Wednesday, August 20th, 2008

There have been several recent press releases from enterprise software companies proclaiming FRCP “compliance,” which certainly sounds appealing. But the use of that term raises the question: how does a litigation support search technology (or methodology) become FRCP “compliant,” and is that goal even possible?

IBM launched the first salvo:

“The software will allow companies to move from scattered, point-solution approaches to a disciplined approach that controls electronic information, helps support Federal Rules of Civil Procedure (FRCP) compliance,…”


And, Autonomy quickly followed suit:

“The Autonomy pan-enterprise search platform automates the retrieval, processing, and management of all information throughout a global organization irrespective of languages, operating systems, and file types, avoiding non-FRCP compliant search techniques.”

I’m more than tolerant of both puffery and marketing-speak (though woe to those who forward such releases to Monica Bay), but this notion of “FRCP compliance” seems to take advantage of an already bombarded buying public, who have likely grown weary of FRCP articles, CLEs, and maybe even blog posts. Nevertheless, it seems useful to really tease out what the FRCP does and does not mean in relation to e-discovery and enterprise search.

So, in an attempt to debunk this “compliance” myth, I thought I’d devote this blog post to demystifying some of the inaccurate notions about the FRCP and electronic discovery.

Federal First

Initially, it’s important to note that the Rules only apply to litigation within the United States Federal court system.  State court litigation, international lawsuits, arbitrations and administrative actions (just to name a few) aren’t under the aegis of the Rules.  While it’s true that certain state courts (Minnesota for example) have selectively adopted the new discovery provisions, most have not.  So, the first step is to check your venue.  Then, assuming the Rules do apply because your organization is in Federal litigation, the impact, while still not crystal clear, does take on more definition.

Relevancy Filters

As a starting place, the discovery process (as part of litigation) is fundamentally limited by Rule 26 to information (electronic and otherwise) that is “relevant” to the case at hand (i.e., “relevant to the claim or defense of any party”). This distinction is critical because, for the most part, it prevents the responding party from having to cast a company-wide net for all data, a task envisioned by many content management systems. Certainly, the ability of certain litigation support software systems to access all user-created data is valuable when searching for relevant data, but there are many ways to skin that cat.

No Express Retention or Preservation Duties

Legions of articles proclaim that the amended Rules create wholly new duties to retain information in general, as well as new duties to preserve electronic data once litigation is anticipated. In fact, however, the new Rules expressly disavow creating truly new retention or preservation duties. While it is undoubtedly good practice to have a retention policy, given the welter of statutes and regulations that do create retention duties, the Rules do not mandate that a company create one ahead of litigation.

What is true, however, is that the new Rules have powerful implications for preservation once litigation is likely because of the requirements to understand, negotiate and produce relevant information early in the litigation process.  Under the new Rules, it is critical to be able to identify and retain potentially relevant data once litigation is filed (or is “reasonably likely”).  And yet, the burden of placing a legal “hold” on data, while often significant, certainly can be achieved without a formal document retention/deletion policy.  Again, the litigation “trigger” is key.

“Records” Aren’t the Focus

Continuing on this theme, but in a slightly different direction, there are differing opinions about the impact the Rules have on “business records.” The issue is nebulous because, during litigation discovery, it is easy to confuse potentially relevant data with “business records,” a term used in two different contexts. First, there is the “business records” exception to the hearsay rule, which is quite specific and affects the admissibility of evidence in court.

The second, broader definition applies to organizations as they attempt to define a records management program to meet the numerous state, local and Federal mandates.  Commonly, as part of this complex initiative, companies will create records retention programs that specifically define official “records,” unofficial “records,” “non-records,” as well as specific retention periods for certain types of records.  Once the company’s records protocol is put into place there may be some downstream nexus with the Rules, but it won’t manifest itself until Federal court litigation arises, as described above.   The most common intersection occurs when a records retention policy prescribes a deletion event that contradicts the legal “hold” requirements for a record that is likely to be relevant to litigation.
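That intersection — a scheduled deletion that must yield to a legal hold — can be sketched as a simple check run before any retention-driven deletion. All record and field names here are hypothetical, purely to illustrate the rule:

```python
from datetime import date, timedelta

# Sketch of the retention-vs-hold conflict described above: a record past
# its retention period may be deleted ONLY if no active legal hold covers
# it. Record structure and field names are invented for illustration.

def may_delete(record, holds, today):
    past_retention = today - record["created"] > record["retention"]
    on_hold = any(record["id"] in h["record_ids"] for h in holds)
    return past_retention and not on_hold  # the hold always trumps the schedule

record = {"id": "doc-42", "created": date(2001, 1, 1),
          "retention": timedelta(days=365 * 3)}
holds = [{"matter": "Smith v. Acme", "record_ids": {"doc-42"}}]

print(may_delete(record, holds, today=date(2008, 8, 20)))  # False: on hold
print(may_delete(record, [], today=date(2008, 8, 20)))     # True: past retention
```

The point of the sketch is the ordering: the hold check sits inside the deletion path itself, so a routine retention sweep can never out-race a litigation trigger.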

In sum, the foregoing describes the role the FRCP plays in Federal court litigation.  It should be clear that the important, yet relatively narrow, use cases do not include any general compliance mandate in the absence of specific litigation.  I think it’s important to separate myth from reality when it comes to understanding how and when the revised Rules really do come into play.  Failure to do so can create an unpleasant scenario where your organization will either under- or over-prepare for these important litigation guidelines.

The Sleeping Giant Awakes? IBM Announces eDiscovery Manager

Thursday, August 14th, 2008

On August 5, IBM announced eDiscovery Manager, which it says “enables organizations to better control the eDiscovery process by bringing key eDiscovery tasks in house. This helps clients more easily manage electronically stored information; provide earlier insights into collected evidence; and prioritize downstream evidence review, analysis and production.”

Taken at face value, this is potentially very significant. IBM is the world’s second-largest software company and its Lotus Notes/Domino email system is used by approximately one-third of corporate America. So I decided to dig a little deeper to understand exactly what IBM’s new litigation discovery product can do, and which customers it can best serve.

Product Capabilities

The first and most important thing to understand about eDiscovery Manager is that, before you can use it, you must first buy and install IBM’s unstructured data stack. This comes in two forms: you can either deploy IBM Content Manager and IBM CommonStore, or you can choose Filenet P8 and Filenet Email Manager. Either way, the deployment time is months, and it typically involves an army of consultants.

For data in IBM’s content management solutions, eDiscovery Manager enables users to search and export. There is no review functionality, no tagging, and no analysis. The limitations in functionality stem from eDiscovery Manager not really being a new product; rather it’s a rewrite of an old product (eMS or email search) with a new AJAX-based user interface.

Target Customers

The best customers for eDiscovery Manager are those enterprises with large amounts of data in Filenet P8 / Email Manager or IBM Content Manager / CommonStore. For those enterprises, it will be a useful tool, which IT departments will use to identify and collect data, just as they use utilities like ExMerge for Microsoft Exchange and Robocopy for file shares. Most companies will then choose to process, review, and analyze data from all these repositories with a dedicated e-discovery solution.

To my mind, what’s more significant than the announcement of eDiscovery Manager is the fact that IBM is waking up to the opportunity in e-discovery. There’s no doubting the company’s reach and technical prowess, and it will be interesting to watch what future products (e.g., “IBM eDiscovery Review”?) are in the works.