Predictive Coding Legaltech


Predictive Coding 2.0 Making E-Discovery More Efficient and Cost Effective

John Tredennick Jeremy Pickens Jim Eidelman

How Many Do I Have to Check?

1. You have a bag with 1 million M&Ms.
2. It contains mostly brown M&Ms.
3. You cannot see into the bag.
4. You have a scoop that will pull out 100 M&Ms at a time.
5. Your hope is that there are no red M&Ms in the bag.
6. You pull out a scoop and they are all brown.

How many scoops do you need to review to be confident there are no red M&Ms?

Let’s Take a Poll

How many scoops?

1? 2? 3? 5? 10? 20?

100? 500? 1,000?

How Confident Do You Need to Be?

How many errors can you tolerate?

Does 95% work?

At a 95% confidence level and a 5% margin of error: 384 M&Ms
At a 99% confidence level and a 1% margin of error: 459 M&Ms

§ Five out of a hundred?
§ One out of a hundred?
§ One percent of a million = 10,000

How about 99%?

At a 100% confidence level and 0% margin of error: 1,000,000 M&Ms
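For anyone who wants to check the arithmetic, here is a minimal sketch in Python (ours, not part of the deck). The 384 figure matches the standard worst-case sample size for estimating a proportion at 95% confidence with a 5% margin of error; the 459 figure corresponds to a discovery-sampling question: how many M&Ms you must check to be 99% confident of spotting at least one red if reds make up 1% or more of the bag. Function names are our own.

```python
import math

def proportion_sample_size(z: float, margin: float, population: int) -> int:
    """Worst-case (p = 0.5) sample size for estimating a proportion,
    using the normal approximation with finite population correction."""
    n0 = z * z * 0.25 / (margin * margin)
    return round(n0 / (1 + (n0 - 1) / population))

def discovery_sample_size(confidence: float, prevalence: float) -> int:
    """Draws needed to see at least one 'red M&M' with the given
    confidence, if reds are at least `prevalence` of the bag."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))

BAG = 1_000_000
print(proportion_sample_size(1.96, 0.05, BAG))   # ~384 (95% confidence, 5% margin)
print(discovery_sample_size(0.99, 0.01))         # 459  (99% confident of spotting a 1% rate)
```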

Predictive Coding

Does it Work?

What Have the Courts Said?


“Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”

Magistrate Judge Andrew Peck

Predictive Coding 1.0

1. Assemble your corpus.
2. Assemble a seed set of documents.
3. Review the seed set.
4. Apply machine learning and automatically tag the remainder of the corpus.
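The deck does not name a specific learning algorithm, so as an illustration only, here is a minimal sketch of the 1.0 workflow with TF-IDF features and logistic regression standing in for the vendor's model; the document texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Steps 1-3: the corpus, plus a small attorney-reviewed seed set (hypothetical texts).
corpus = ["please coordinate hamburger buns for the company picnic",
          "let's get together to discuss pricing of our components"]
seed_docs = ["pricing strategy for our components", "company picnic logistics"]
seed_labels = [1, 0]   # 1 = responsive, 0 = non-responsive, from human review

# Step 4: learn from the seed set, then tag the remainder of the corpus automatically.
vectorizer = TfidfVectorizer().fit(seed_docs + corpus)
model = LogisticRegression().fit(vectorizer.transform(seed_docs), seed_labels)
responsive_probability = model.predict_proba(vectorizer.transform(corpus))[:, 1]
print(list(zip(corpus, responsive_probability.round(2))))
```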

Predictive Coding 1.0

§ Tremendous gains in review effectiveness
§ Substantial cost savings
§ It works. Often quite well…

…when the corpus is complete.

533 matters, with nearly 36,000 uploads across those matters:

§ 67.5 uploads per case, on average
§ 166.3 days of loading per case, on average
§ This pattern is collection driven, not the result of loading limits.

Roughly 67 uploads spread across roughly 166 days.

In which upload and on which day do your responsive documents show up?

Terms that do not appear in early uploads begin appearing in later ones.

Machine-Assisted Decision Making

Upload timeline of a 6 TB case. When should machine-assisted decision making (e.g., early case assessment) begin?

Is it here?

Or here?

Example: Responsive Early, Junk Later

To: bob@company.com, alice@company.com

From: charles@company.com

Subject: Company Picnic

Bob, would you coordinate with Alice and make sure we have enough hamburger buns for the company picnic? Please try and find them at a reasonable price.

Responsive early → Junk later

Example: Junk Early, Responsive Later

To: bob@company.com, alice@privatemail.com

From: charles@company.com

Subject: Get Together

Let’s get together at 7pm at the Sports Bar to discuss pricing of our components. The Broncos are playing and I really want to watch Tebow.

Junk early → Responsive later

Problems With Predictive Coding 1.0

The corpus is almost never complete
§ Continuous collection and rolling uploads
§ When does "Early Case Assessment" begin?

Changing Issues
§ Responsiveness is "bursty"

Shifting Concept Relationships
§ Due both to the growing corpus and to changing issues
§ Exploration is extremely limited

Our Approach

Predictive Coding 2.0 must be able to deal with dynamic change and flux. We have developed a flexible analytics framework based on bipartite graphs. It is aware of changes in both the corpus and the coding, enabling smart review and adaptive related-concept suggestion as information pours in.

Goal: Continuous Case Assessment

Our Approach

Avoid the lock-in that arises from poor decisions made early in the matter, when corpus (collection) and coding information is still incomplete.

What Is Underneath?

A full bipartite graph of the documents and features (e.g. words, phrases, dates) that comprise those documents

[Diagram: bipartite graph linking documents to terms]
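As a rough sketch of such a structure (ours, not Catalyst's implementation), the graph can be kept as two adjacency maps linking documents to their features, so that rolling uploads simply add edges:

```python
from collections import defaultdict

class BipartiteIndex:
    """Toy document-feature bipartite graph: an edge connects a document
    to every feature (word, phrase, date) it contains."""

    def __init__(self):
        self.doc_to_terms = defaultdict(set)
        self.term_to_docs = defaultdict(set)

    def add_document(self, doc_id: str, features: list[str]) -> None:
        # Rolling uploads only add nodes and edges; nothing is rebuilt.
        for feature in features:
            self.doc_to_terms[doc_id].add(feature)
            self.term_to_docs[feature].add(doc_id)

index = BipartiteIndex()
index.add_document("doc-001", ["company", "picnic", "hamburger", "buns"])
index.add_document("doc-002", ["pricing", "components", "sports", "bar"])
print(index.term_to_docs["pricing"])   # {'doc-002'}
```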

Feedback: Immediate and Continuous

Continuous feedback supports better decision making and better predictive coding. It adapts to both:

§ New arrival of coding information
§ New arrival of documents and terms


Predictive Coding 2.0

Feedback – and improvement – is iterative, continuous, amplified.

[Chart: % of documents examined manually]

The more you review, the less you have to review

Term relationships change over time. With continuous improvement, decisions can be revised and refined as the matter proceeds.
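To make the loop concrete, here is a minimal continuous-ranking sketch (our illustration, not the product's algorithm). Every pass re-fits on whatever has been coded so far, which is how both new coding calls and newly uploaded documents feed back into the next ranking.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_unreviewed(coded_texts, coded_labels, unreviewed_texts):
    """Order unreviewed documents by predicted responsiveness, using everything
    coded so far. Assumes both responsive and non-responsive examples exist."""
    vectorizer = TfidfVectorizer().fit(coded_texts + unreviewed_texts)
    model = LogisticRegression().fit(vectorizer.transform(coded_texts), coded_labels)
    scores = model.decision_function(vectorizer.transform(unreviewed_texts))
    return sorted(zip(unreviewed_texts, scores), key=lambda pair: -pair[1])

# Each review pass: attorneys code the top-ranked batch, new uploads join
# unreviewed_texts, and the next call to rank_unreviewed() sees both.
```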

Better Decisions As Understanding Improves

Time uncovers new relationships


Looking at Concepts Over Time

Start with the key term "fuel."

Related terms at 20% of the review:
lube, piping, battery, mounted, redundant, batteries, compartments, mixture, airflow, ansi, ventilation, chargers, stainless, rotor, bleed, accessory, plenum, detector

And at 65%:
fuels, fob, purity ethane, petrochemicals, fin, paraxylene, cif, phy, fwd, swopt, brent partials, brg, locswap, benzene, diff, spd, liquids, opt

Related Terms Through Coding Filters

[Diagram: bipartite graph of documents and terms, with documents coded Responsive or Non-Responsive]

Example data: a TREC collection with many identified topics.

Putting Related Concepts to Work

The whole corpus, and two topics within it:

Topic 203: …whether the Company had met, or could, would, or might meet its financial forecasts, models, projections, or plans…

Topic 205: …analyses, evaluations, projections, plans, and reports on the volume(s) or geographic location(s) of energy loads.

Model in the Whole Collection
Look at the keyword "model". Scope: the whole collection.

Term          Score
modeling       1000
equation        864
stochastic      706
variables       677
parameters      518
probability     365
simulation      337
assumption      325
returns         251
curves          211

Model in Topic 203 (meeting financial forecasts)
Look at the keyword "model". Scope: Topic 203.

Term           Score
flows           1000
assumptions      913
gains            872
shares           864
liquidity        486
fluctuations     374
analysts         285
cents            254
whitewing        237
handles          166

Model in Topic 205 (analyzing energy volumes)
Look at the keyword "model". Scope: Topic 205.

Term           Score
bids            1000
congestion       611
loads            455
constraints      354
clearing         292
zonal            194
signals          192
procure          190
dispatch         152
csc              120

Model in Comparison

Whole Corpus    Topic 203       Topic 205
modeling        flows           bids
equation        assumptions     congestion
stochastic      gains           loads
variables       shares          constraints
parameters      liquidity       clearing
probability     fluctuations    zonal
simulation      analysts        signals
assumption      cents           procure
returns         whitewing       dispatch
curves          handles         csc

Now, imagine this with batches and coding changes over time!

Note: Our system can accept any combination of coding and metadata filters to dynamically assess your data
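As a simplified illustration of scope-aware related terms (our sketch; the deck does not describe Catalyst's actual scoring), one can restrict the bipartite graph to documents that pass the active coding or metadata filters, count co-occurrence with the keyword, and scale the top term to 1000 as the slides do. The BipartiteIndex referenced here is the toy structure sketched earlier, and the Topic 203 usage at the end is hypothetical.

```python
from collections import Counter

def related_terms(index, keyword, doc_filter=lambda doc_id: True, top_n=10):
    """Rank terms co-occurring with `keyword` among documents that pass
    `doc_filter` (a coding call, a topic, a date range, etc.)."""
    in_scope = [d for d in index.term_to_docs[keyword] if doc_filter(d)]
    counts = Counter()
    for doc_id in in_scope:
        for term in index.doc_to_terms[doc_id]:
            if term != keyword:
                counts[term] += 1
    top = counts.most_common(top_n)
    if not top:
        return []
    peak = top[0][1]
    # Scale so the strongest related term scores 1000, as on the slides.
    return [(term, round(1000 * count / peak)) for term, count in top]

# Hypothetical usage: related terms for "model" within documents coded to Topic 203.
# related_terms(index, "model", doc_filter=lambda d: d in topic_203_doc_ids)
```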

Summary

Incomplete Collections

Changing Coding Calls

Havoc for Machine Coding

Predictive Coding 2.0

Problem: The corpus is almost never complete
Answer: Review algorithms that are iterative and continuous

Problem: Changing issues
Answer: Review algorithms that are adaptive and continuous

Problem: Shifting concept relationships
Answer: Concept relationships that are calculated dynamically, on the fly, and coding-aware

Continuous Case Assessment

Analytics Consulting

§ Analytics consulting and predictive ranking for nearly 4 years
§ How it started, before "Predictive Coding" became popular:

“Can’t you predict what documents are probably relevant based on your review so far?” – Judge, SDNY

§ Predictive Ranking: iterative search techniques + algorithms
§ Then off-the-shelf Predictive Coding 1.0 technologies
§ Catalyst's research is exciting! We apply the research to real-world scenarios.

Applying Bipartite Analytics…

Smart Review with the Bipartite Analytics Technology

Advantages:

§ Accurate
§ Dynamic
§ Flexible
§ "Just in Time" suggestions

Smart Review Scenarios

1. "What happened?" – examples: FCPA investigation, conspiracy ECA
2. Typical large-scale litigation with lots of ESI – e.g., a class action lawsuit
3. Highly complex litigation with multiple issues – e.g., patent and unfair competition claims

Scenario 1 – What happened?

Goal: Rapidly determine the facts and resolve the matter if possible.

Applying the technology: A small number of knowledgeable attorneys drill into the documents using a fusion of advanced search features and flexible predictive coding.

§  Faster location of valuable “veins” of information due to search filters

§  Rapid learning and application of that learning through flexible, “just in time” predictive coding 2.0.

§  “Choose your own adventure”

Scenario 2 – Large Scale Litigation

Goal: Minimize cost through learning across a large document set, increase quality with focused review, and maximize protection of privilege and trade secrets.

Applying the technology:

§ Prioritized review based on rapid, continuous learning
§ Large-scale defensible culling
§ More accurate ranking of "potentially privileged" documents

Scenario 3 – Highly Complex Litigation

Goal: Review and produce with multiple and changing issues.

Applying the technology:
§ Rapid learning across multiple topics
§ Leverage the ability to adjust as topics change
§ Review quality improves because of focus
§ Explore otherwise hidden subjects with Concept Explorer
§ Leverage learning across narrow, focused lines of inquiry (e.g., emails between two people in a narrow time window)
§ Protect privileged documents

Predictive Coding 2.0 Making E-Discovery More Efficient and Cost Effective

John Tredennick Jeremy Pickens Jim Eidelman