Getting rid of déjà vu

August 3, 2009

The bad news about electronically stored information is that there’s so much of it.  The good news is that it can easily be deduped.  And the really good news is that full-scale deduping can get rid of a lot more than you might have guessed.

In the August 2009 issue of Law Technology News, Anne Kershaw and Joe Howie report on a study they conducted in May by surveying 18 e-discovery vendors.  Confining the scope strictly to pure de-duping (as opposed to near-duplicate detection, e-mail threading, etc.), they found that deduping within a single custodian reduced the number of documents by an average of 21.4 percent; performed across multiple custodians, the average reduction nearly doubled, to 38.1 percent. 

Yet the vendors indicated that while they all offered cross-custodian deduping, only 52 percent of the projects got it; in the remainder, their clients opted for either single-custodian deduping (41 percent) or none at all (7 percent). 

Until a few years ago, for many e-discovery vendors, the machine burden of deduping across custodians was much greater than doing so within one custodian’s collection.  Some vendors charged nothing for deduping within custodian but charged extra if done across custodians, to compensate for the extra machine time and effort. 
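To make the within-versus-across distinction concrete, here is a minimal sketch in Python of hash-based exact deduping, run per custodian and then globally.  The custodians and documents are invented for illustration, and this is not any vendor's actual implementation — just the basic idea:

```python
import hashlib

def fingerprint(text):
    """Exact-duplicate fingerprint: hash of the normalized document text."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def dedupe_within_custodians(collections):
    """Dedupe each custodian's collection independently; a document shared
    across custodians still survives once per custodian who held it."""
    surviving = 0
    for docs in collections.values():
        surviving += len({fingerprint(d) for d in docs})
    return surviving

def dedupe_across_custodians(collections):
    """Dedupe globally: each unique document survives exactly once."""
    seen = set()
    for docs in collections.values():
        seen.update(fingerprint(d) for d in docs)
    return len(seen)

# Hypothetical collections: the same memo sits in all three custodians' files.
collections = {
    "Al":      ["quarterly memo", "al's notes"],
    "Barbara": ["quarterly memo", "barbara's draft"],
    "Charlie": ["quarterly memo", "quarterly memo"],  # Charlie kept two copies
}
```

Within-custodian deduping leaves five documents for review here (the memo survives once for Al, once for Barbara, and once for Charlie); cross-custodian deduping leaves three.  Scale that up and you get the kind of gap the survey found.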

Also, in the then-common linear review paradigm (each custodian’s data kept together and reviewed as a unit), deduping within custodian only was supported by the prima facie plausible argument that “it’s a more accurate picture” of the data to know who had what, even if it did mean that the same document was going to show up multiple times in different custodians’ collections.  The mere fact of it being in Al’s collection as well as Barbara’s and Charlie’s was somehow considered sufficient differentiation to justify keeping all three. 

[Cartoon caption: Without de-duping across all custodians, you need a huge number of reviewers]

Deduping technology is now much better, so cross-custodian deduping no longer grinds the system to a near halt.  On top of which, as this article points out, if you need a report as to what other custodians also had a particular document, just about any vendor or hosting platform can generate this. 
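That “who else had it” report is cheap to produce once the hashes exist.  A hypothetical sketch, with invented data and no relation to any hosting platform’s real API:

```python
from collections import defaultdict
import hashlib

def custodian_report(collections):
    """For each unique document (keyed by content hash), list every
    custodian who held a copy -- the cross-custodian report described
    in the article."""
    holders = defaultdict(list)
    for custodian, docs in collections.items():
        for text in docs:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if custodian not in holders[digest]:
                holders[digest].append(custodian)
    return dict(holders)

# Hypothetical example: one memo held by both custodians, one draft by one.
report = custodian_report({"Al": ["memo"], "Barbara": ["memo", "draft"]})
```

So the “more accurate picture” argument evaporates: you review each unique document once, and the report preserves the record of who had what.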

Articles such as this one by Anne and Joe, and those by other consultants, should reassure lawyers that deduping across the entire database is not just all right, it’s practically incumbent upon them.  As these authors state, with the concurrence of several judges they consulted:  “Lawyers who fail to check for duplicates across multiple custodians, instead removing only duplicates from within the records of individual custodians, end up reviewing at least 20% more records on average. Whether or not their document review bills are ever audited, these lawyers are not meeting their ethical obligations to both clients and the justice system.”


Wrong for the wrong reasons

May 15, 2009

In a post dated April 27, 2009, the Technolawyer blog tells a document review horror story that should never have happened, but not for the reasons the players think.


A big West Coast law firm defending a medical devices case found itself overwhelmed in a large document review that mushroomed into something much larger than anticipated.  The firm assigned more reviewers, including an inexperienced younger associate named Marc.  Sadly, Marc failed to flag as privileged a document that clearly was.  Even worse, Marc was undersupervised because of ridiculous internal firm politics.   The cartoon below might be Marc arriving at work.  Get the picture?

The document Marc failed to flag as privileged of course got produced.  (The documents in this part of the review appear to have been paper-source, because they are described as having been OCR’d, and some had marginal handwritten notes.)

“The document in question was a chart of notable events in the history of the litigation prepared by in-house counsel. In addition to its fundamentally privileged content, it contained the attorney’s marginalia — the sort of thing that most of us scrawl on a document when we are certain that it will never fall into the hands of, say, the plaintiff’s attorney.

“The document was so clearly privileged… that each of the eight other reviewers assigned to the case had recognized and tagged its duplicates as such. Marc, however, decided that the document should be produced.”  [STOP RIGHT HERE.  HOW DID NINE COPIES OF THE SAME DOCUMENT MAKE IT INTO THE REVIEW STREAM SEPARATELY?]  “And so it made its way, unnoticed, into the batch of documents (which numbered in the tens of thousands) produced for opposing counsel….”

The blog quotes a firm partner explaining how the reviewer missed this: 

“An experienced reviewer would have recognized that the document was, without a doubt, privileged,” the partner said. “But there was no name on it, and Marc didn’t know to look at the OCR coding[i], which would have told him that it was authored by an in-house attorney. Moreover, he didn’t realize that it was a duplicate of documents that had been tagged as privileged by other people. Maybe the OCR coding failed because of the marginalia; maybe he just didn’t have the experience to de-duplicate [INTERRUPTING AGAIN:  IT SHOULD NOT BE THE REVIEWER’S RESPONSIBILITY TO DE-DUPLICATE!]. Either way, he made a bad call.”

According to the Technolawyer posting, the firm partner said the lessons to be learned from this are: 

  • supervise the reviewers,
  • immediately claw back privileged documents (and if necessary fight about it later), rather than pretend nothing went wrong, and
  • “not only be aware of duplicates, but remain mindful of the limitations of even the best eDiscovery tools. OCR is not a perfect technology.”

Here’s where I have a big problem — not with Technolawyer, but with the Big Law Firm.  Unless the variation in OCR quality was right off the Richter scale, there is no excuse for nine versions of the same document, even those with handwritten marginal notes, to have gone into review separately.  None. 

Any e-discovery consultant or vendor with even moderate sophistication knows about software that performs near-duplicate detection.  One of the best-known is Equivio. 

Near-duplicate detection software will catch different variations of what is essentially the same e-mail or electronic document, just different revisions.  It will catch the same document both in its Word format and in PDF format, clearly an instance where the hash value would be completely dissimilar.  And it is very commonly used to catch multiple copies of the same paper document that inevitably come out slightly different when OCR’d.  I’ve known litigation support vendors who have used it for this purpose for several years now, and their clients appreciate its benefits. 

Near-dupe detection software can be calibrated to group documents together based on a percentage degree of similarity.  If you have a batch with wide variability in OCR quality, you’d set the percentage lower than if you’re confident the OCR quality is consistently high. 
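For the curious, here is one simple way that kind of calibration can work: break each document into overlapping character shingles, measure Jaccard similarity between shingle sets, and group documents that clear an adjustable threshold.  This is an illustrative sketch with invented names and toy data — not how Equivio or any particular product actually works:

```python
def shingles(text, k=5):
    """Overlapping character k-grams of the whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def similarity(a, b, k=5):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def group_near_dupes(docs, threshold=0.8):
    """Greedy grouping: each doc joins the first group whose exemplar it
    matches at or above the threshold.  A lower threshold tolerates
    noisier OCR; a higher one demands near-identical text."""
    groups = []
    for d in docs:
        for g in groups:
            if similarity(d, g[0]) >= threshold:
                g.append(d)
                break
        else:
            groups.append([d])
    return groups

# Toy OCR scenario: the second copy came back with "brown" read as "brovvn".
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brovvn fox jumps over the lazy dog"
doc3 = "completely different agreement about widgets"
groups = group_near_dupes([doc1, doc2, doc3], threshold=0.7)
```

At a 0.7 threshold the two OCR variants land in one group and the unrelated document stands alone; at a stricter threshold the OCR noise would split them apart — which is exactly why the setting needs to reflect your confidence in the OCR quality.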

I don’t know if near-duplicate detection was used in this case, or whether it was considered but a good reason existed not to use it.  From the way this story is told, it does not sound like it was used. 

So, Big Law Firm,  you shouldn’t be so quick to blame the smart-ass young associate.  This document shouldn’t have gotten to him in the first place.  It should have been bundled together with its other eight near-duplicates, and reviewed by someone with more seniority.   The cost of near-dupe detection is a lot less than the cost of reviewing the same document nine times.  Even without the error, your client should have fired you for that alone.   (This paragraph assumes near-duplicate detection was not used or considered.  If it was, never mind. )



[i] “OCR coding” appears as written in the Technolawyer blog, which in turn is directly quoting the partner.  I am not certain of the term’s meaning; in my lexicon, a document is either OCR’d or it is coded.