
The Data Dump: what to do when you’ve received too much data?, Part 3

<< Read Part 2: The Data Dump: what to do when you’ve received too much data?

As you have read in my prior posts, you can use workflows and technology to find a needle in a haystack, or to find out which needles your evidentiary haystack contains. In some matters you are still learning the context of your case; once you learn it, you can use it as a guiding force to rapidly develop newly discovered areas of inquiry. Using the tools I outline below, you can leverage technology to make short work of your document dump while creating a robust evidentiary set.

Building on the methods outlined in my prior posts, once you have taken a look at the keyword hit count and selected the documents that are relevant to your search, you can feed that search into a concept-clustering analytics tool. Concept clustering is more advanced than simple Boolean searching. It looks at words and concepts that are frequently seen together and reports on what they are. As an example, a set of documents might contain documents about football goalposts, as well as documents containing discussions where people are in an argument and one accuses the other of moving the goalposts. Concept clustering tools work to separate the two sets of goalpost-related documents so that you only have to look at those that are related to the subject at hand. This is something a keyword search cannot do. Because the tool sorts the documents without being told what to look for, this approach is known as unsupervised learning.
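To make the goalposts example concrete, here is a minimal sketch in Python. It is a toy illustration, not the method used by any particular review platform: commercial concept-clustering tools rely on more sophisticated mathematics. The sample documents, the stopword list, the `cluster` function, and the 0.4 similarity threshold are all assumptions made up for this example. Each document is reduced to a bag of words and joins the group whose documents share enough vocabulary with it.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real tool would use a much larger one.
STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and",
             "this", "they", "during", "before"}

def tokenize(text):
    """Lowercase the text and keep only non-stopword alphabetic tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.4):
    """Greedy clustering: a document joins the most similar existing
    cluster if the similarity meets the threshold (an assumed value),
    otherwise it starts a new cluster. Returns lists of document indexes."""
    vectors = [Counter(tokenize(d)) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = max(cosine(vec, vectors[j]) for j in members)
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

# Every document hits the keyword "goalposts", but in two different senses.
DOCS = [
    "The kicker hit the goalposts during the football game",
    "Stop moving the goalposts in this contract negotiation",
    "Football stadium crew repainted the goalposts before the game",
    "They keep moving the goalposts on the contract deadline",
]

groups = cluster(DOCS)
print(groups)
```

A keyword search for "goalposts" would return all four documents. The sketch separates the football documents from the negotiation documents because of the other words that appear alongside the shared keyword.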

Once you have found a document or group of documents that is responsive, analytics gives you further options to speed your process. Your review platform should have a categorization (predictive coding) feature that allows you to find documents related to your issues. Teaching the system by creating these categories and identifying what belongs in them is known as supervised learning or Technology Assisted Review (“TAR”). This process uses much the same mathematical logic as concept clustering, but instead of letting the documents be sorted automatically by topic, you teach the system what you are looking for and it responds with more of the same. Once you are ready to set the system to its task, it takes the documents that you have selected or created as a “seed set” and finds other documents that are similar to them by determining how alike the documents are. TAR can help create a key set of priority documents for you to review by categorizing documents you haven’t seen as similar to ones that you have.[1] TAR has more extensive uses in production, but it is still a valuable triage tool. If your TAR process reveals a new hot document, you can feed that exemplar back into your tool and start the process again using what you’ve found.
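The seed-set idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific TAR product works: real tools typically train more sophisticated models. The function name `rank_by_seed_set`, the stopword list, and the sample documents are hypothetical. The sketch builds a word profile from the documents you have marked responsive and ranks unreviewed documents by how closely they resemble that profile.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "for", "of", "to", "with"}

def vectorize(text):
    """Turn text into a word-count vector, ignoring stopwords."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower())
                   if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_seed_set(seed_docs, unreviewed_docs):
    """Score each unreviewed document against the combined seed profile
    and return (index, score) pairs, most similar first."""
    profile = Counter()
    for doc in seed_docs:
        profile.update(vectorize(doc))
    scored = [(i, cosine(profile, vectorize(doc)))
              for i, doc in enumerate(unreviewed_docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Documents a reviewer has already marked responsive (the "seed set").
seeds = [
    "Pricing agreement with distributor rebate schedule",
    "Rebate schedule revised for distributor pricing",
]
# Documents nobody has looked at yet.
unreviewed = [
    "Lunch menu for the Friday team outing",
    "Updated distributor rebate pricing attached",
    "Parking garage closed on Friday",
]

ranked = rank_by_seed_set(seeds, unreviewed)
for index, score in ranked:
    print(f"{score:.2f}  {unreviewed[index]}")
```

The rebate email rises to the top of the review queue and the two unrelated emails score zero. If review of the top-ranked documents turns up a new hot document, adding it to `seeds` and re-running the ranking mirrors the iterative loop described above.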

Once your TAR tool has finished marking documents, it is time for you to review the set of documents it identified. There is no “easy button” and no substitute for looking at the documents when you have to make witness kits and select evidence for trial.

If, after all of this, you still have not found what you thought you should, the next step in the process is to run a word frequency hit count. This is different from a keyword hit count: it tells you the frequency of every word in the production, not just the terms you searched for. That gives you the ability to quickly pull out keywords to search for that may not have been apparent when you created the initial search terms. The more you know about your case, the easier it will be to do this. You should already have some sense, through interrogatories, depositions, and interviews, of what you should be looking for. What you may not know are codenames, project names, and employee shorthand. To the extent that you can determine these words by talking to opposing counsel or witnesses, you should. To the extent that you can’t, a word frequency hit count report is a good first step in identifying the terms that commonly appear. This may also reveal terms that you hadn’t thought of and that your interviews and depositions hadn’t revealed.
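A word frequency report of this kind is simple to sketch in Python. The example below is illustrative only: the sample documents, the ignore list, and the codename "Bluebird" are invented, and a review platform would run the same idea across the full production. It counts every word, skips common words you choose to ignore, and lists what remains by frequency so that unfamiliar terms stand out.

```python
import re
from collections import Counter

def word_frequencies(docs, ignore=frozenset()):
    """Count every word across all documents, skipping ignored words."""
    counts = Counter()
    for text in docs:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in ignore)
    return counts

# Invented sample production; "Bluebird" stands in for an unknown codename.
production = [
    "Status of Bluebird: the Bluebird budget is approved",
    "Please keep Bluebird off the distribution list",
    "The budget review is Friday",
]
# Hypothetical list of common words that carry no investigative value.
common_words = {"the", "of", "is", "please"}

report = word_frequencies(production, ignore=common_words)
for word, count in report.most_common(5):
    print(f"{count:>3}  {word}")
```

In this sample, "bluebird" tops the report even though it was never one of the search terms, which is exactly the kind of lead you can then run as a new keyword search.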


The entire search process is designed to be iterative and repeatable, and each piece of the process has its own strengths. At this stage, you should have a significant set of relevant documents and a sense of your case. As you learn more about the case by deposition, interrogatory, or through your review, you can follow these steps again and again to reveal more relevant evidence. There is no requirement that you follow the entire workflow, either. If a deposition reveals a hot document that you hadn’t seen before, you can feed it into your analytics tool to determine what similar documents are in your set. If a whole new area of inquiry arises, you can create a new category, feed it a new seed set, and allow the algorithm to do its work. If that happens, you may need to revisit which documents you loaded on your review platform and which you did not – a new issue may need new documents. Remember what I said in my last post – you can only find what’s available to be found, so if you’ve chosen to only load part of a production, this may be the time to revisit.

The workflows outlined in the last three blog posts were designed to help you based on what you already know. If you know a lot and need to find a little, manually reviewing based on predefined criteria will likely be faster and easier than using robust tools. When you know some things about your case but need to expand what you know, clustering results around what you already know lets you expand your issues. When you find a new area that requires a deeper dive, leveraging technology solutions like concept clustering and TAR may be the fastest and easiest way to come to grips with the deluge. With these tools in your belt, you have maximal flexibility to review a massive document production quickly, efficiently, and cost-effectively.

[1] If you are not sure that TAR is right for you, see “3 Reasons You’re Afraid of Technology Assisted Review…and Shouldn’t Be” which will give you a better overview of Predictive Coding and will help address some commonly held questions and concerns about it.

Other Articles in this Series:

The Data Dump: what to do when you’ve received too much data? Part 1
The Data Dump: what to do when you’ve received too much data? Part 2
Jonathan Swerdloff
Jonathan Swerdloff is Director of Global Client Services and eDiscovery at Scott+Scott Attorneys at Law LLP. Prior to this role, he was an expert Consultant at Driven, Inc.