Blind De-Duplication

Why HaystackID?

One of the most important factors in choosing a litigation support vendor is stability. Case matters can be in litigation for many months, if not years, and client firms place a lot of trust in a vendor when they begin the processing for a new matter. Should anything occur which would necessitate switching from one vendor to another for processing, many issues can arise.

Moving from one vendor to another, or splitting a particular case between multiple vendors, can be problematic for many reasons, with the most glaring issues stemming from de-duplication inconsistencies. It is virtually impossible to use integrated de-duplication tools to accurately test for duplicative records when a given file might be a duplicate of a file that was processed by another vendor.

Background

HaystackID asked the client for a list of the MD5 hashes of the files in question. An MD5 hash is an alphanumeric key computed from a file's contents, and for practical purposes it is unique to a given document. Armed with the MD5 hashes, HaystackID had just enough information about the files to determine whether the records were duplicative without needing to be provided the files themselves.
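The hash-matching idea can be illustrated with a minimal Python sketch. It is not HaystackID's utility; the function names, the in-memory document dictionary, and the control numbers are hypothetical and chosen only for illustration. It assumes the other party supplies MD5 hashes as hexadecimal strings, one per file.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a file's raw bytes as lowercase hex."""
    return hashlib.md5(data).hexdigest()


def find_blind_duplicates(local_docs: dict[str, bytes], external_hashes) -> set[str]:
    """Return IDs of local documents whose MD5 appears in the external list.

    local_docs maps a hypothetical control number to the file's bytes.
    external_hashes is the list received from the other party; the files
    behind those hashes are never needed.
    """
    # Normalize case and whitespace so "5D41..." and "5d41..." compare equal.
    wanted = {h.strip().lower() for h in external_hashes if h.strip()}
    return {doc_id for doc_id, content in local_docs.items()
            if md5_hex(content) in wanted}


# Example: only DOC-1 matches a hash supplied by the other party.
local = {"DOC-1": b"hello", "DOC-2": b"world"}
supplied = [md5_hex(b"hello").upper(), md5_hex(b"never seen locally")]
matches = find_blind_duplicates(local, supplied)
```

Because only hashes change hands, the comparison works even when the underlying files cannot be shared.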

Using a stock HaystackID utility, the project management team converted the MD5 hashes into source data for a tag list. To avoid partial-family tagging, the team examined each document family and cleared every tag in any family that was only partially tagged. The reasoning was that if some documents within a family did not match as duplicates, the original documents in question did not come from the same source as that family, and thus the family should not be tagged.
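The family rule described above can be sketched as follows. This is an illustrative assumption about how such a check might look, not the actual HaystackID script; `family_of`, `tagged`, and the document IDs are hypothetical names.

```python
from collections import defaultdict


def clear_partial_families(family_of: dict[str, str], tagged: set[str]) -> set[str]:
    """Keep duplicate tags only for families in which every member is tagged.

    family_of maps each document ID to its family ID (for example, an email
    and its attachments share one family ID). tagged holds the document IDs
    that matched a supplied hash.
    """
    members = defaultdict(set)
    for doc_id, family_id in family_of.items():
        members[family_id].add(doc_id)

    kept = set()
    for docs in members.values():
        # A partially tagged family is cleared entirely; a fully tagged
        # family keeps all of its tags.
        if docs <= tagged:
            kept |= docs
    return kept


# Family A is fully tagged and survives; family B is partial and is cleared.
families = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
result = clear_partial_families(families, {"A1", "A2", "B1"})
```

Treating the family as the unit of comparison keeps an email from being tagged as a duplicate while one of its attachments is not.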

Challenges

HaystackID has been asked to take over many projects at various stages of processing and has devised a defensible methodology for accurately de-duplicating records for its clients, regardless of how many vendors a matter may have been distributed to. In one such situation, the HaystackID project management team was tasked with finding documents that were duplicates of documents HaystackID did not have in its system, had never seen before, and could not be provided with because of restrictive stipulations regarding the distribution of the documents.

With this custom methodology and a common utility, HaystackID was able to examine a case involving several million documents for duplicates against tens of thousands of files that the project management team was unable to examine for themselves.

Solution

The client was able to determine which files existed in both data sets and which were only in one, allowing them to perform an accurate de-duplication of records spread across multiple vendors.
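As a rough illustration of that result, comparing two hash lists reduces to set operations. The sketch below is an assumption about how the outcome could be represented, with hypothetical names and made-up placeholder values standing in for real MD5 hashes.

```python
def compare_hash_sets(ours, theirs) -> dict[str, set[str]]:
    """Split two collections of MD5 hex strings into shared and unique groups."""
    a = {h.strip().lower() for h in ours if h.strip()}
    b = {h.strip().lower() for h in theirs if h.strip()}
    return {
        "in_both": a & b,        # duplicates across the two data sets
        "only_ours": a - b,      # present only in the first data set
        "only_theirs": b - a,    # present only in the second data set
    }


# Placeholder strings stand in for real 32-character MD5 values.
report = compare_hash_sets(["aa11", "bb22", "cc33"], ["BB22", "dd44"])
```

The union of the three groups corresponds to a master list of unique documents across both data sets.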

While other vendors would have demanded access to the entire data set to perform an accurate de-duplication, or refused to take on the task, HaystackID used innovative custom scripting with a common utility to avoid lengthy and pointless processing times and to provide a comprehensive master list of unique documents in the record set. This saved the client precious time, needless aggravation, and unnecessary expense.
