Predictive Coding: Confusion among Seed Sets, Random Statistical Sampling, and Training

By Jason Glass & Alex Wall, Esq.

We read a blog post the other day, written by a senior-level executive at a predictive coding service provider, about the latest (mis)information circulating in the industry. Specifically, the post focused on predictive coding and on the use of statistically significant random sampling to create a valid seed set. His argument is that not all machine learning algorithms need a minimum number of documents in order to learn, and that some systems can learn from a single exemplar document.

“Auto–Define Random Sampling” 

Let me ask you: as the managing partner on a matter, the one who ultimately needs to sign off on the discovery order, would you feel comfortable having an entire database coded based on a single document? We didn’t think so. The blogger and the predictive coding company, however, have confused random sampling with training. In that regard, there is nothing wrong with using whatever examples or training methods you want for predictive coding. However, in any project, we recommend at least one quality control round in which the system’s results are tested against the attorney’s expectations. If a predictive coding service provider doesn’t recommend that, then in our opinion they are being irresponsible and will most likely end up in court defending themselves, with the attorneys throwing up their hands and wondering what the hell happened.
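To make the quality control round concrete, here is a minimal Python sketch of the idea. It is our own illustration under stated assumptions, not any vendor's implementation: the names `qc_round`, `predictions`, and `attorney_review` are hypothetical, `predictions` is assumed to map each document ID to the system's responsive/not-responsive call, and `attorney_review` stands in for an attorney coding the sampled documents by hand.

```python
import random


def qc_round(predictions, attorney_review, sample_size, seed=0):
    """Compare system calls with attorney calls on a simple random sample.

    predictions     -- dict mapping document ID -> True (responsive) / False
    attorney_review -- hypothetical callable: document ID -> True / False
    sample_size     -- number of documents the attorney will review
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample_ids = rng.sample(sorted(predictions), sample_size)

    tp = fp = fn = tn = 0
    for doc_id in sample_ids:
        predicted = predictions[doc_id]
        expected = attorney_review(doc_id)
        if predicted and expected:
            tp += 1  # system and attorney both say responsive
        elif predicted and not expected:
            fp += 1  # system says responsive, attorney disagrees
        elif not predicted and expected:
            fn += 1  # system missed a responsive document
        else:
            tn += 1  # both say not responsive

    return {
        "sampled": sample_size,
        "agreement": (tp + tn) / sample_size,
        # Precision and recall are undefined when their denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

The figures that come back are estimates from the sample, not facts about the whole collection. The point is simply that the comparison is run on documents drawn at random, no matter how the system was trained.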

Thai Food and Statistical Sampling Is for Suckers

The blogger counters statistical sampling in general by asking “why a client would tolerate the waste associated with blindly stumbling through the data in the name of scientific validity?” and adding, “Next time you are hungry for Thai food, why don’t you just tell Yelp! you want food. You wouldn’t want to bias the process would you?” Touché. Funny analogy, and I’ll try to stick with it. If I am looking for a Thai restaurant, sure, I will type it into Yelp! What happens next? You get a whole bunch of results, right? From there I need to choose a location. I will also want to read some diner reviews and ratings of specific restaurants before making a final decision. Ambiance is going to be important too. I’m looking for a romantic spot for my wife and me, so it can’t be casual. We have three kids, and these days we rarely get the chance to get out of the house. Oh, and I love Chicken Massaman. So I have to pick a Thai restaurant that has really good Chicken Massaman (shout out to Thai Thani in Swampscott, MA. Best Mai Tais and Massaman). Voilà! Ladies and gentlemen, I think we just described statistical sampling and validating our results.

The point here is that some predictive coding engines are more transparent than others in how they work. As a software-agnostic company that has tested many of the platforms on the market today, we are conscious of that. With that in mind, make sure the platform you are considering is transparent in how it works and, most importantly, can be validated by a defensible process. Let’s also take a moment to understand what “defensible” means. To us, it means that our client understands the process thoroughly and that counsel is prepared to stand up and defend it as having been reasonable and intelligent. We would never suggest that our attorneys blindly trust any technology without validating a sample of the results, and in insisting upon quality control we are looking out for our clients.
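How large should that validation sample be? One common textbook approach, which we offer here only as a rough sketch and not as a legal standard, is the sample size formula for estimating a proportion: n0 = z² · p(1 − p) / e², followed by a finite population correction. The function name `validation_sample_size` and its parameters are our own hypothetical choices. The default `p=0.5` is the most conservative assumption about how often the system and the attorney disagree.

```python
import math
from statistics import NormalDist


def validation_sample_size(margin_of_error, confidence=0.95,
                           p=0.5, population=None):
    """Documents to sample in order to estimate a proportion.

    margin_of_error -- half-width of the interval, e.g. 0.02 for +/- 2%
    confidence      -- confidence level, e.g. 0.95
    p               -- assumed proportion; 0.5 gives the largest sample
    population      -- total documents in the collection, if known
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite population correction: smaller collections need fewer docs.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)
```

For instance, `validation_sample_size(0.05)` asks how many documents are needed for a margin of error of plus or minus 5% at 95% confidence. Passing `population=` reduces that number for a smaller collection. Whatever figures you choose, write them down, because a documented sampling plan is part of what makes the process defensible.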