Why Should I Trust Predictive Coding?

Predictive coding, or categorization, is software that categorizes documents based upon human review of a smaller sample of those documents.  It is a science that has arisen out of the legal profession's need to cull irrelevant evidence from voluminous collections and find the needles in the haystack.  Now that people walk around with supercomputers in their pockets, we document more and more of our personal and work lives in email, chat, video, social media, and many other electronic forms, much of which is discoverable in litigation.  The sheer volume of evidence is obstructing the resolution of cases, as delays and fees mount while attorneys work to gather the most important information for their clients.  As e-discovery vendors, we seek out, recommend, and support the right tools to collect, process, cull, search, analyze, and review evidence.  When a client is faced with the challenge of reviewing hundreds of thousands of documents on a short timeline, a limited budget, and only a couple of attorneys available to review, we want to help, and we often recommend predictive coding.

Although some clients are too polite to just come out and say it, they often wonder, “why should I trust predictive coding?”  The fact is that trust is not required.  Instead, we focus on showing you the validation of the results.  You cannot trust any categorization technology to perfectly categorize documents without some method of checking how it has been trained, and trust is better placed in your own sense of reason and in a reputable vendor who helps you validate the results.  At this point in time, even the most clever and well-funded computer scientists in the world have only just begun to invent computer code that can teach itself anything.  In 2012, 16,000 computers were linked together into a quasi-neural network and unleashed on the internet to teach themselves something – anything – and this proto-brain spent its time figuring out how to identify cat videos.  We are quite a long way from teaching computers how to review documents with no guidance, which has turned out to be a surprisingly difficult task.  The most powerful computer you can buy has less self-teaching ability than a garden slug.

Just because computers can’t teach themselves all of the nuances of your case without your input doesn’t mean that predictive coding – more generally known as “categorization” – is not an extremely useful tool.  Categorization is actually one of the most powerful and accurate tools available today for culling irrelevant evidence and finding interesting documents.

Haystack ID has used categorization technology on many cases and saved clients countless hours of review.  Two categorization engines on the market are kCura’s Relativity Assisted Review, which can be added to any Relativity workspace where documents are being hosted for review, and OrcaTec, which can be run as a standalone review platform or as a Relativity add-on.  Although the technology varies, both systems analyze the text of the client’s specific documents and calculate the similarity of documents based upon the words that are used together in them.

Using the kCura method, once the initial analysis, or indexing, is complete, the system can calculate almost instantly the relative similarity of one document to another based upon the language used in each document and the words that tend to appear together across the client’s collection.  A reviewer, or a small group of reviewers who stay in close communication about the case’s subject matter, reviews a random sample of the data, which trains the categorization engine.  A categorization round is then run that marks the rest of the database as responsive or nonresponsive based upon its similarity to the documents actually reviewed.  Each round is validated by a new random sample of the categorized documents, and the system tracks many metrics, including how often reviewers agree with the system’s categorization.  Rounds continue until disagreement (also known as the “overturn rate”) stabilizes at an acceptably low level.

The OrcaTec method is somewhat different.  In their preferred process, a single expert attorney reviews batches of 100 documents, and the accuracy of each round – specifically Recall and Precision – is estimated.  As rounds continue, these measures of production quality quickly stabilize.
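To make these metrics concrete, here is a minimal Python sketch of how Recall, Precision, and the overturn rate are computed from a validation sample.  The function name and data layout are illustrative assumptions of mine, not code from either product:

```python
def validation_metrics(sample):
    """Compute review-quality metrics from a validation sample.

    `sample` is a list of (predicted, actual) booleans: the system's
    responsive call on a document, and the human reviewer's call.
    """
    tp = sum(1 for pred, actual in sample if pred and actual)      # correctly flagged responsive
    fp = sum(1 for pred, actual in sample if pred and not actual)  # flagged but not responsive
    fn = sum(1 for pred, actual in sample if not pred and actual)  # responsive docs missed
    overturns = sum(1 for pred, actual in sample if pred != actual)

    recall = tp / (tp + fn) if tp + fn else 0.0     # share of responsive docs found
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of flagged docs truly responsive
    overturn_rate = overturns / len(sample)         # reviewer disagreement with the system
    return recall, precision, overturn_rate

# Hypothetical validation round of 200 documents: 80 correct responsive
# calls, 10 false positives, 5 missed responsive docs, 105 correct
# nonresponsive calls.
sample = ([(True, True)] * 80 + [(True, False)] * 10 +
          [(False, True)] * 5 + [(False, False)] * 105)
recall, precision, overturn_rate = validation_metrics(sample)
```

When the overturn rate holds steady at a low value across rounds – here 15 overturns out of 200, or 7.5% – the team can decide whether that quality level is acceptable or whether further training rounds are warranted.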

The size of the random sample is calculated based upon a formula rooted in statistical sampling science.  It is the same formula that tells us that a sample of, say, 1,000 people polled is representative of a population of 1,000,000, within a certain margin of error and at a certain confidence level.  The beauty of sample size equations is that sample sizes do not increase in direct proportion to the project population; they flatten out as the population grows.  You can test this yourself with any of the free sample size calculators available online.  The number of documents to be sampled can be easily calculated by Relativity based upon what confidence level and what margin of error you, your client, or even the opposing side requires.  I should not attempt to describe the mathematical equations, because I do not have a math degree, but again, we don’t have to trust that it works; we validate that it works.  Validation combines information about the accuracy of the results, as checked by a human authority, with information about whether the sample size is sufficient to support a reasonable prediction that any other random sample would meet the same quality standards.
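For the curious, the flattening effect is easy to see in code.  This sketch uses the standard Cochran sample-size formula with a finite population correction – the same math behind public sample size calculators, not any vendor’s internal implementation:

```python
import math

# Common z-scores for two-sided confidence levels.
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size(population, confidence=0.95, margin=0.03, p=0.5):
    """Documents to sample so the estimate falls within `margin` of the
    true rate at the given confidence level.  p=0.5 is the most
    conservative assumption about the underlying responsiveness rate."""
    z = Z[confidence]
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / population)       # finite population correction
    return math.ceil(n)

# Sample sizes flatten out as the collection grows:
sample_size(10_000)     # 965 documents
sample_size(100_000)    # 1056 documents
sample_size(1_000_000)  # 1066 documents
```

A hundredfold increase in the document population raises the required sample by only about a hundred documents, which is why sampling stays practical even on very large collections.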

If the time comes that an attorney must defend the process before the courts, statistical sampling is rock-solid science, and certainly a reasonable alternative to reviewing millions of documents, only to find that half of them are not responsive at all to the document request.

Speak with us about whether predictive coding would be helpful on your case.

In the meantime, see how long it takes you to find the cat in the haystack.