News
2024-09-12: I’ve added timelines generated by both Claude Sonnet and GPT 4o1, and while both are impressive, GPT 4o1 is incredibly impressive. It’s very resistant to a user forcing their own reasoning onto the model, but the output is good enough that I’m not sure it will matter.
Introduction
Hey there, and welcome to the EMC-2 Sample Dataset, a little project I’ve been cooking up to scratch an itch that’s been bugging me (and probably you) for well over a decade.
I’ve been in the litigation and eDiscovery trenches since passing the bar, and in all that time, I’ve never come across a public dataset that really hits the spot for testing, demoing, or just playing around with eDiscovery and digital forensics tools. Sure, we’ve all used the Enron set, but let’s be real - it’s showing its age, and we’re all a bit tired of it. (Pour one out for Andy Zipper.)
So, I decided to create something new. Something that tells a story. Something that’s manageable, yet complex enough to be interesting. And yeah, something that’s actually kind of fun to work with. Welcome to the Ellingson Mineral Company (EMC-2) dataset.
Why EMC-2?
Look, we all know the Enron dataset. It’s been the workhorse of eDiscovery for years. But here’s the thing - our tools have evolved, our needs have changed, and frankly, we deserve better source material. Here’s why EMC-2 might just be the refresh you’ve been waiting for:
- It tells a story: Unlike the Enron emails, which can feel like reading someone else’s grocery lists out of context, EMC-2 has a narrative. You can actually follow what’s going on, which makes it great for testing tools that need to piece together a coherent timeline or spot key events.
- It’s manageable: At around 200 documents, it’s big enough to be meaningful but small enough that you can actually review it yourself. Perfect for comparing your work to what your shiny new AI tool spits out.
- It’s got range: Emails, sure, but also chat logs, voicemails, and more. Because modern investigations aren’t just about emails anymore.
- It’s synthetic: No ethical quandaries about using real people’s private data. Analyze away without worry.
- It’s got easter eggs: There might be a prize in it for you if you dig deep enough. Just saying.
Dataset Contents
Alright, let’s break down what you’re getting in this digital goodie bag:
| Custodians | Document Types |
|---|---|
| Hal Benson (CEO) | Emails (with attachments, because who doesn’t love attachments?) |
| Dr. Sarah Chen | Internal memos and reports |
| Eugene Belford | Chat logs |
| Dade Murphy | Voicemail transcripts |
| Margo Wallace | Press releases |
| Kate Libby | News articles |
| Richard Gill | Technical specs |
| Jennifer Mack | Financial spreadsheets |
| jur1st | Code snippets |
| | System logs |
The whole shebang spans from about October 15 to October 25, 1995. Why 1995? Let’s just say I might have been inspired by a certain movie. But don’t worry, you don’t need to know the film to appreciate the dataset. It’s a self-contained story of corporate shenanigans, cyber threats, and the digital equivalent of a “whodunit.”
How to Use This Data
The envisioned use is to take the data, feed it through the data analysis tool of your choice, and then play with it. I’ve processed the full set without processing errors, although a couple of files may appear corrupt or strange.
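If you’d rather poke at the set with a script before loading it into a platform, here’s a minimal sketch using nothing but the Python standard library. The paths are an assumption based on the repo layout shown in the Contents tree below; adjust to match wherever you unpack the set.

```python
import mailbox
from pathlib import Path

# Walk every custodian's mbox and print a quick inventory.
# The "Custodians" path is assumed from the Contents tree in this README.
for mbox_path in sorted(Path("Custodians").rglob("*.mbox")):
    mb = mailbox.mbox(mbox_path)
    print(f"\n== {mbox_path} ({len(mb)} messages) ==")
    for msg in mb:
        print(f"{msg['Date']} | {msg['From']} -> {msg['To']} | {msg['Subject']}")
```

From there, point whatever review or analytics tool you actually use at the same files and compare notes.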
What Kind of Things are Inside the Data?
At the most basic level, this might be one of the most bizarre expressions of what I suppose is categorized as fan fiction, told through the most soul-crushing medium imaginable…someone else’s email. Think of it as a short story told in a non-traditional medium if nothing else, and you don’t need to know a darn thing about the movie to find it a quick, compelling, and dense narrative to pull facts and details from.
There are references and easter eggs that were intentional and unintentional. There’s also something of a golden ticket that’s stored inside here. If you find it, you let me know, and you may be immortalized on a wall of fame. First come, first served. In-person turn-ins might receive prizes and certainly will receive a high five.
Limitations and Known Issues
- This is a small set of documents (at this point), so it’s probably not suited for extrapolating a predictive model at scale. If it is, someone’s welcome to correct me, but making something statistically defensible is not the intention.
- Native files are currently limited to plain text, though that also remains the state of the art for most of the legal-specific tools I’ve looked at so far.
- CSV and TXT documents presently carry no time-of-day metadata, which limits the set’s ability to sort chronologically within a given day.
- While the set was seeded with things from the movie, it is not, and isn’t intended to be, some kind of detailed stand-in. You get broad arcs, main characters, motivations, that kind of thing.
Technical Details
The dataset was implemented using a combination of human creativity and AI assistance:
- I built initial taxonomies of the different elements of the universe of the seminal 1995 film, Hackers, at first as kind of a gag. Then the results started coming out, and things escalated to sciencey things quickly.
- AI models were used to expand on these outlines, generating consistent and varied content.
- Human review and editing ensured quality, coherence, and adherence to the storyline.
- Technical details (e.g., code snippets, log files) were double-checked for accuracy along the way.
This hybrid approach allowed us to create a dataset that combines the creativity and nuance of human-generated content with the scale and consistency made possible by AI.
Dataset Overview
The Ellingson Mineral Company Synthetic Dataset comprises a rich tapestry of documents that tell the story of a corporate cyber attack and its aftermath. Here’s an overview of what you’ll find:
Timeline
The main events of the dataset span from approximately October 15 to October 25, 1995, with some documents providing background context from earlier dates.
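If you’d like to reconstruct that chronology yourself before any tool touches the data, here’s a minimal sketch that merges every custodian’s mbox into one timeline by parsing Date headers. Standard-library Python; the Custodians path is assumed from the Contents tree below.

```python
import mailbox
from email.utils import parsedate_to_datetime
from pathlib import Path

# Merge all custodian mboxes into a single chronological event list.
events = []
for mbox_path in sorted(Path("Custodians").rglob("*.mbox")):
    for msg in mailbox.mbox(mbox_path):
        try:
            when = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # skip messages with a missing or mangled Date header
        events.append((when, mbox_path.parent.name, str(msg["Subject"])))

for when, custodian, subject in sorted(events, key=lambda e: e[0]):
    print(f"{when:%Y-%m-%d %H:%M} | {custodian:<20} | {subject}")
```

If the headers mix offset-aware and offset-naive dates, normalize them before sorting; the sample message shown below carries an explicit UTC offset.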
Key Themes
- Corporate espionage and insider threats
- Cybersecurity and hacking
- Financial fraud
- Legal and ethical dilemmas
- Media relations and crisis management
Sample Document: Email from The Plague
```
┌────────────────────────────────────────────────────────────────────────┐
│Message 1/1                                                             │
│Date: Tue, 10 Oct 1995 15:20:10 -0400                                   │
│From: eugene.belford@offshore-secure.net                                │
│To: margo.wallace@ellingson.com                                         │
│Subject: It's Time                                                      │
│                                                                        │
│M,                                                                      │
│                                                                        │
│I've submitted my resignation. Two weeks from now, we'll be sipping     │
│mai tais on a beach while Ellingson burns. Make sure everything is in   │
│place on your end.                                                      │
│                                                                        │
│Remember, if anyone gets suspicious, you know nothing. As far as        │
│you're concerned, I'm leaving for a better opportunity elsewhere.       │
│                                                                        │
│- E                                                                     │
│                                                                        │
│                                                                        │
│? Help        < MsgIndex  P PrevMsg  - PrevPage    D Delete    R Reply  │
│O OTHER CMDS  > ViewAttch N NextMsg  Spc NextPage  U Undelete  F Forward│
└────────────────────────────────────────────────────────────────────────┘
```
This email exemplifies the type of incriminating evidence hidden within the dataset, challenging users to uncover key information amidst a sea of routine corporate communications.
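In that same spirit, here’s a minimal keyword sweep across every custodian’s mbox, again standard-library Python. The terms are purely illustrative stand-ins for whatever your own theory of the case suggests.

```python
import mailbox
from pathlib import Path

# Flag any message whose plain-text parts contain a term of interest.
# TERMS is illustrative only; substitute your own search terms.
TERMS = ("resignation", "mai tais", "burns")

for mbox_path in sorted(Path("Custodians").rglob("*.mbox")):
    for msg in mailbox.mbox(mbox_path):
        # Messages may be multipart, so collect every text/plain part.
        chunks = [
            part.get_payload(decode=True) or b""
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        text = b"\n".join(chunks).decode("utf-8", errors="replace").lower()
        if any(term in text for term in TERMS):
            print(f"{mbox_path.name}: {msg['Subject']} ({msg['Date']})")
```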
Technical Specifications
The Ellingson Mineral Company Synthetic Dataset is designed to be compatible with a wide range of eDiscovery and analysis tools. Here are the key technical details:
File Formats
- Emails: mbox files split per custodian
- Attachments: .txt, .csv
- Standalone documents: .txt, .csv
Metadata
Each document includes standard metadata fields, and the emails are known to thread properly; a minimal export sketch follows the list:
- Date Created
- Date Modified
- Author
- Recipients (for emails)
- Subject (for emails)
- File Size
- File Type
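As a sketch of how those fields come out in practice, this dumps the email-side metadata to a simple load-file-style CSV. The column names are my own invention; map them to whatever your review tool expects on import.

```python
import csv
import mailbox
from pathlib import Path

# Export per-message metadata to a flat CSV for import elsewhere.
with open("email_metadata.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Custodian", "Date", "Author", "Recipients", "Subject"])
    for mbox_path in sorted(Path("Custodians").rglob("*.mbox")):
        for msg in mailbox.mbox(mbox_path):
            writer.writerow([
                mbox_path.parent.name,  # custodian folder, e.g. "Belford, Eugene"
                msg["Date"],
                msg["From"],
                msg["To"],
                msg["Subject"],
            ])
```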
Dataset Size
- Total Documents: Approximately 200
- Total Size: Approximately 200 KB. Yeah, you read that correctly. Can we go back to using Pine for email now?
Contents
```
.
├── CITATION.MD
├── Custodians
│ ├── Belford, Eugene
│ │ └── ebelford.mbox
│ ├── Benson, Hal
│ │ └── hbenson.mbox
│ ├── Chen, Sarah
│ │ └── schen.mbox
│ ├── Gill, Richard
│ │ └── rgill.mbox
│ ├── Libby, Kate
│ │ └── klibby.mbox
│ ├── Mack, Jennifer
│ │ └── jmack.mbox
│ ├── Murphy, Dade
│ │ └── dmurphy.mbox
│ ├── Rivera, Alex
│ │ └── Rivera, Alex.mbox
│ ├── US-DOJ
│ │ ├── DOJ_EmailFile.mbox
│ │ └── edocs.zip
│ ├── Wallace, Margo
│ │ └── mwallace.mbox
│ └── jur1st
│ └── jur1st.mbox
├── LICENSE.md
├── README.md
└── edocs_unzipped_metadata_bonked
├── Da_Vinci_virus_analys.txt
├── HCaul_Cover_Letter.txt
├── HCaul_skytel_19951022.csv
├── HCaul_skytel_19951022.txt
├── budget.csv
├── da_vinci_vir_core.c.txt
├── da_vinci_virus_core.c.txt
├── diff.txt
├── gibson_security_patch.txt
├── hacker_chatlog.txt
├── irc_htp.txt
├── irc_htp2.txt
├── memo_10101995.txt
├── nytimes_article_ellingson_hack.txt
├── plague_voicemail_transcript.txt
├── talk_log.txt
├── talk_log2.txt
├── traffic_report.txt
├── ts_phone_log.txt
├── virus_rpt_cyber.txt
└── vm_0827.txt
14 directories, 36 files
```
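One note on that tree: the loose documents ship pre-extracted in edocs_unzipped_metadata_bonked, and the directory name hints at why; a plain unzip resets every modified time to the moment of extraction. If you re-extract edocs.zip yourself, a sketch like this restores each entry’s stored timestamp (standard-library Python; the destination name is just an example):

```python
import os
import time
import zipfile
from pathlib import Path

archive = Path("Custodians/US-DOJ/edocs.zip")
dest = Path("edocs_reextracted")  # example name; use whatever you like

with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)
    # Re-apply each entry's stored timestamp, which extractall() discards.
    for info in zf.infolist():
        mtime = time.mktime(info.date_time + (0, 0, -1))
        os.utime(dest / info.filename, (mtime, mtime))
```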
Changes
2024-09-06 - Post-release changes in advance of the Stanford Hackathon
- Added some additional short-message data that should parse to RSMF, given some creativity and research into what you’ve stumbled across.
- Brought dates within the data set into alignment to ensure the timeline of events is cohesive.
- Manually set the MAC times for all of the loose files to ensure a cohesive timeline.
- Added additional context to certain threads. The timeline of events, and certainly the overall document count, feel good, but the next step will be injecting targeted flags and events into this set.
- Fixed a handful of parsing errors in the file attachments. Post-deduplication counts now land around 204 to 205, depending on how aggressively suspicious a system is of C code living in a plain text file. If AV pops on extraction of that file you’ll know, because the agent will see only zero bytes when it rolls around to pick it up. *cough* Better logging from certain platforms would be swell.
- This is an opinion change on my part…I had said that this wasn’t a good test set for benchmarking performance. After testing and debugging the set a few times, I’ll say it’s a fantastic set if you need to see how quickly a processing system will churn through a tiny data set.
Contact Info
If you enjoy this or find it useful, let me know by dropping me a line over email (john at john-benson dot com) and telling me what you do with it. Heck, if you want to take the concept and run with it on your own, that’s fine too. If nothing else, this proves there’s no darn reason we need to keep searching and summarizing Andy Fastow’s email, or scratching our heads at Jeb(!)’s use of “FUP” as an abbreviation for follow-up. To paraphrase the last best hope we had at solving this problem…please clap? 👏