As the buzz around Watson and its foray into human-like (actually super-human) performance subsides, it may be time to take stock of what all the fuss was about. After all, we’re all used to computers doing better than humans in many things and even take its superior store of knowledge for granted. And, on the surface, we get answers to questions on pretty much anything from a simple Google or Bing search. So, what really is the big deal and is it even relevant in the context of electronic discovery?
For those not clued in on this, Watson is a brainchild of a four-year effort from 20-25 researchers at IBM, to build a computing engine that would successfully compete at champions-level at the popular quiz show, Jeopardy. Although it blundered on a couple of answers, it competed very well, with a wide margin of victory. Several industry experts that learned of it and watched the show have lauded this as an accomplishment at the same scale or even better than the IBM Deep Blue beating Chess Grand Champion, Gary Kasparov, in 1997. So, let’s examine if this is indeed worthy of the accolades it has gotten.
Behind Watson is an impressive piece of hardware – a series of 90 IBM Power 750 nodes, adding to 16TB of memory and 2,880 Power7 processor cores delivering a staggering 80 teraflops of peak performance. All the hardware is highly inter-connected with ability to work on problems in parallel, but still marching to a final result in three seconds or less – just fast enough to beat the human buzzer. Some highlights of the computing infrastructure from the hardware architect, Dr. James Fan, at IBM indicate that the three-second timeframe meant the entire corpus of 200 million pages was loaded into memory. Also, with several processors simultaneously working on pieces of the problem, they place very high I/O requirements. The hardware supports a multi-processing OS, with virtualization, in a workload optimized system. The software drives the hardware using thousands of dense threads, with each thread of execution processing a large chunk of work with minimal context switch. Also, given the large number of cores, each thread is optimally allocated to a core. Branded as DeepQA, the software executes a series of complex algorithms in order to solve a very specific problem: winning on Jeopardy.
First, the Jeopardy game provides categories of clues. Some categories help in understanding the clue, while others are simply misleading to a computer. Next, the clue is revealed and one needs to determine what the clue is really asking, since many clues do not ask for a factoid with a direct question, but rather is a composition of multiple sub-clues, each related to another in some linguistic, semantic, syntactic, temporal or other forms of connection. The decomposition of clues and figuring the relationships is a challenge even for humans. Finally, after one understands the clue, you then have to hone in on an answer with some level of confidence, within a three-second window, and must activate the answer buzzer ahead of the rest of the competitors. Besides individual clues, one has to also devise an overall game strategy for selecting the next category, selecting a clue within that category, how much to wager on the Double Jeopardy and the Final Jeopardy. Overall, the game is a complex amalgamation of knowledge, language analysis, gaming strategy and speed of recall of answers.
The software architecture of the DeepQA system is documented in a paper published in AI Magazine. The team built several components to address each area of the problem, with many independent algorithms in each component. There are lots of complicated technical details, but the final outcome is a human-like response.
A question on that anyone who examines its inner workings has is whether the system is really natural language processing, or statistical language analysis, or machine learning or some sort of ad-hoc program, which doesn’t fit any traditional area of analytics. It does appear to be an combination of several techniques, which may mirror exactly how humans go about solving these clues. We seem to have a large collection of knowledge, initially unconnected but the category, the clue, the hypothesis all appear to generate word and concept associations and a fuzzy evaluation of confidence measures which all converge into a confidence with which a competitor answers a question. It is the replication of these processes by algorithms that makes it truly an astounding achievement.
Given the success of DeepQA’s performance, a natural question is whether it has any practical value for helping us solve day-to-day problems. More specifically, can it cope with the information overload and the challenges of e-discovery posed by that mass of information? Use within e-discovery context has been explored by several authors, most notably, Robert C. Weber of IBM and Nick Brestoff in recent Law.com articles. Their analysis is based on the ability to explore vast volumes of knowledge. But really, what DeepQA tackled is something more significant – the inherent ambiguity in human spoken and written communication. Our natural instincts are to employ subtle nuances, indirect references, implicit assumptions, and incomplete sentences. We tend to leverage prior and surrounding context in most of our communications. It’s just the natural way of communications, since doing so is actually very effective. We assume establishing context is redundant, unproductive and unnecessary as it tends to make communication repetitive. By not employing a rigid structure in how we write, we are able to bring to bear concise exchanges that span a large volume of information.
If the last two decades is an indicator, the nature of communication is getting less formal, with emails, instant messages, tweets, and blog posts replacing well-crafted formal letters and memos. And, forcing individuals to communicate using rigid, unambiguous text in order for it to be processed by computers easily would mean a huge change in behavior in how people communicate. Any action that contemplates such a change in behavior across billions of people is simply not going to occur. What this means is that the burden for automated analysis using computing algorithms is even greater. This is what makes the discovery of relevant content in the context of e-discovery a very hard problem, one that is worthy of the sort of technological prowess employed by DeepQA team.
Given that our appetite for producing information is ever-increasing, while its discoverability is getting harder, taking the work of DeepQA and adapting it to solve e-discovery needs has the potential to make significant improvements in how we tackle the search, review and analytical aspects of e-discovery. DeepQA took an easily articulated goal of answering at least 60% of the clues with 85% precision in order to reach champion levels. That was sufficient to win the game. Note that there was never an attempt to get 100% of all clues, with 100% confidence. In the realm of e-discovery, we would be looking at taking a very general production request such as the TREC 2009 Topic 201 “All documents or communications that describe, discuss, refer to, report on, or relate to the Company’s engagement in structured commodity transactions known as prepay transactions.” and use just such a simple articulation of the request to produce relevant documents. It is the core algorithms of machine learning, multiple scoring methods, managing relevance and confidence levels along with traditional information retrieval methods that form the ingredients of the new frontier of automated e-discovery. Beyond e-discovery, application of DeepQA algorithms for business analytics also has significant potential, where fact and evidence-based decision making using unstructured data is likely the norm. DeepQA’s very public Jeopardy challenge has shown that the ingredients needed for enabling such problem solving is well within the realm of possibility.