Large Ingestion Performance

Hi,

We have a relatively large forensic project and wanted to ask for advice about Autopsy case size and performance.

Our goal is to analyze the contents of a large file server (logical files) to identify instances of approx. 20 keywords and 1 regex. We’re using embedded file extraction, email parsing, and keyword searching.

The contents are ~3.5TB in >1M files. Thousands of the files are MS Outlook PST files, some up to ~50GB, and many contain tens or hundreds of thousands of email messages.

Ideally we’d like to complete ingestion and keyword searching within a few weeks.

A couple questions about Autopsy’s performance/limits:

  1. Is there a practical limit to Autopsy case size in order for it to likely complete ingestion within a few weeks?

  2. If Autopsy is interrupted during ingestion (e.g. out of disk space, system reboot, etc.), is there a way to resume ingestion at the point where it stopped rather than starting over?

  3. Is there a way to get a list of the file names that have been ingested so far while it’s running?

  4. Any other ideas or advice about a project like this?

Thanks!

I’m just posting to follow as I’d be interested in this as well.

I don’t know about Autopsy limits, but if you would like to try IPED, an open source (GNU GPL v3) forensic tool (GitHub: sepinf-inc/IPED), here are the answers regarding it:

  1. 2^31 (~2 billion) files in a case, although the main UI table tab can list up to ~135 million files at once (you probably won’t hit that limit when doing keyword searches)
  2. You can use the --continue cmd line option
  3. Just open the UI, apply the actual files filter and export file properties
  4. Check the Performance Tips page of the sepinf-inc/IPED wiki on GitHub

Hope this helps with handling your huge case.

The only tip I can suggest, based on previous experience, is to ingest with one or two modules at a time. Don’t run them all at once, and restart Autopsy in between runs. But this also depends on your system.

EDIT:
After thinking through this again, I wanted to add a bit more data:

I regularly ingest large data sets, both images and logical files. I am usually in single user mode. It seems to be a good practice, in the case of large data sets, to add the data source without running any ingest modules or with only running a few.

I recently ingested a large number of MBOX (Thunderbird) files. I made the mistake of running too many modules at once and also started viewing the case mid-ingest. This caused Autopsy to freeze up, and I had to restart the ingest.

So, in summary, a large dataset like that could take over 24 hours, depending on the system, the drive speed(s), amount of RAM, processing power, etc. But if analyzed in a “gradient approach” it makes it a bit easier.

I hope I didn’t write too much and that this helps.

For one of my cases, I ingested a 1 TB image and, if I remember correctly, it took roughly 24 hours. I was using a multi-user setup and was not running every ingest module. While the ingest was running, I had a secondary host that also had Autopsy on it, which I used to view the case for preliminary analysis.
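
To put the numbers in this thread side by side, here is a rough back-of-envelope sketch in Python. It assumes ingest time scales linearly with data size, which is probably optimistic for a PST-heavy data set with embedded file extraction, and it takes "a few weeks" to mean 21 days. Both are assumptions, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units


def required_throughput_mb_s(total_tb: float, days: float) -> float:
    """Average MB/s needed to get through total_tb within the given days."""
    return (total_tb * MB_PER_TB) / (days * SECONDS_PER_DAY)


def linear_estimate_hours(total_tb: float, ref_tb: float, ref_hours: float) -> float:
    """Scale a reference run (ref_tb in ref_hours) linearly to total_tb.

    Linear scaling is an assumption; real ingest time depends on file
    types, enabled modules, disk speed, RAM, and CPU.
    """
    return total_tb / ref_tb * ref_hours


if __name__ == "__main__":
    # Figures quoted in the thread: 3.5 TB target, 1 TB in ~24 hours.
    target_tb = 3.5
    deadline_days = 21  # assumed reading of "a few weeks"

    hours = linear_estimate_hours(target_tb, ref_tb=1.0, ref_hours=24.0)
    needed = required_throughput_mb_s(target_tb, deadline_days)
    reference = required_throughput_mb_s(1.0, 1.0)  # rate of the 1 TB run

    print(f"Linear estimate: {hours:.0f} hours ({hours / 24:.1f} days)")
    print(f"Needed to meet deadline: {needed:.2f} MB/s")
    print(f"Reference run rate: {reference:.2f} MB/s")
```

If the reference rate comes out well above the rate needed for the deadline, there is some headroom for slower modules and restarts. A short trial ingest on a representative sample of the PST files would give a better reference figure than a disk image from a different case.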