Google Takeout MBOX

I’m trying to run the Email Parser Ingest Module against a Google Takeout MBOX file. I’ve unzipped the takeout and added the directory in which the MBOX is located as a data source in Autopsy. The ingest module detects one file and completes almost immediately; no email messages are parsed out. Running the Keyword search with email checked does detect email addresses, FWIW.

I’ve tried two different MBOX Takeout files, two different OSes (Windows and Linux), and two versions of Autopsy (4.14.0 and 4.15.0).

Has anyone processed a Google Takeout MBOX file successfully?

Hello,

Yes, as part of the dataset contained within the Autopsy Training course, there is a Google Takeout mailbox file, and running the Email Parser Ingest module against that worked flawlessly.

Thanks for the quick reply.

I created a sample MBOX file with Thunderbird and parsed it successfully. The Google Takeouts, however, continue to produce zero parse results. The only differences I can see are the file sizes (the Takeouts are big at >7GB) and the From lines in the files (Thunderbird writes “From -” with a dash, while the Takeouts do not have a dash).

My current test, still underway, is to a) import the Takeout into Thunderbird then b) export from Thunderbird and c) re-attempt the ingest into Autopsy. Update: test failed.

Latest info: The 7GB Takeout file has 107M rows. I started carving the file into pieces and processing them. Pieces of 100K, 1M, and 10M rows ingest and parse out some of the emails but not all (percentages vary). When I tried half the file (53M rows), there was no parsing output at all. I applied the same approach to my second Takeout, with the same results.

In short, I’m seeing flaky behavior out of the Email Parser based on the size of the MBOX file being ingested. Please advise.
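For anyone comparing counts, a rough way to check how many messages a piece should contain is to count its separator lines. The sketch below is only an assumption about what counts as a message boundary (a line starting with "From " at the start of the file or right after a blank line); the function name `count_mbox_messages` is made up for illustration and has nothing to do with Autopsy's own parser.

```python
def count_mbox_messages(path):
    """Count lines that look like mbox separators ("From " after a blank line).

    Streams the file line by line, so it should cope with multi-GB files.
    """
    count = 0
    prev_blank = True  # treat the start of the file as a valid boundary
    with open(path, "rb") as f:
        for line in f:
            if prev_blank and line.startswith(b"From "):
                count += 1
            # Accept both LF and CRLF blank lines.
            prev_blank = line in (b"\n", b"\r\n")
    return count

# Example (hypothetical file name):
# print(count_mbox_messages("takeout_piece_01.mbox"))
```

If this number is far above what the ingest module reports for the same piece, that at least confirms the messages are in the file and are being dropped during parsing.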

Any luck on this? I’m experiencing the same issues with mbox files that are even smaller than 7GB.

The issue affects any mbox file over 2 GB in size. It lies in one of the underlying files that Autopsy uses to parse mbox files. The workaround, until a fix can be put in place in the code, is to break the mbox file up into smaller chunks. You can do this with a hex editor using the following instructions.

  1. Open the file in the hex editor.
  2. Go to some offset in the file. I suggest offset 1048576000 as a starting point, as that is roughly half of a 2 GB file.
  3. Search from this point for the text "From " (without the double quotes; they are there only so that you include the trailing space). The "From " text should be preceded by one or more pairs of the hex values 0D 0A, which tells you that you have a valid ending point.
  4. Select from the start of the current chunk (the beginning of the file for the first chunk) up to the byte just before the F of "From ", copy the selection to a new file, and save that file.
  5. Do this for the remainder of the file until all chunks of the file are under 2 GB. You can then process them in Autopsy without running into the issue.

I have tested this and it does work and does not lose any records. If you have any issues with doing this, let me know.

Kind regards.

Mark
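The hex editor steps above could also be scripted. What follows is a minimal sketch, not something from the Autopsy project: the function name `split_mbox`, the output naming scheme, and the 1 GB default target are all assumptions chosen for illustration. It cuts only at a line that starts with "From " and follows a blank line, so each chunk begins on a message boundary and the chunks concatenate back to the original file.

```python
from pathlib import Path

def split_mbox(src, out_dir, target_bytes=1_000_000_000):
    """Split an mbox into chunk files, cutting only at "From " separator lines.

    A new chunk starts at the first separator seen once the current chunk
    has reached target_bytes, so chunks may exceed the target by up to one
    message. Keep the target well below 2 GB for that reason.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    written = 0
    prev_blank = True  # treat the start of the file as a valid boundary
    with src.open("rb") as f:
        for line in f:
            at_boundary = prev_blank and line.startswith(b"From ")
            if out is None or (at_boundary and written >= target_bytes):
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}.part{len(chunks) + 1:03d}.mbox"
                chunks.append(path)
                out = path.open("wb")
                written = 0
            out.write(line)
            written += len(line)
            # Accept both LF and CRLF blank lines.
            prev_blank = line in (b"\n", b"\r\n")
    if out is not None:
        out.close()
    return chunks

# Example (hypothetical paths):
# for part in split_mbox("All mail Including Spam and Trash.mbox", "chunks"):
#     print(part, part.stat().st_size)
```

The file is read in binary mode and written back byte for byte, so line endings and encodings are left untouched. It is still worth checking the chunk sizes before adding the output directory as a data source.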


Hi all. New to the group.

I’m having similar issues getting Autopsy to ingest a Google SW and its mbox files.

It won’t parse out the emails inside the mbox, and I’ve checked that the size is under 2GB. Opening it in Thunderbird shows that there are Google Hangouts chats and definitely lots of emails. I’ve tried a number of different modules without any luck. I happened to get some more Google SW mboxes and have similar issues, where Autopsy shows thousands fewer emails than Thunderbird does for the same mbox.

Any tips or help greatly appreciated!

The 2GB limit has been fixed, so that should not be an issue anymore. Would you be able to share any of the mboxes you have identified as problematic with support? That would be a great help to them in debugging the issue.

I can’t share the MBOX I’m having issues with, but I think what’s unique to it compared to other Google MBOXes is that it contains Google chat logs, which might be tripping up Autopsy.

Some of the timestamps show dates in the 1970s.
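A date in 1970 often means a missing or unreadable timestamp ended up as 0, which is the start of the Unix epoch. Whether that is what happens here is only a guess. One way to check the mbox itself, independent of Autopsy, is to list the messages that have no usable Date header. This is a rough sketch using Python's standard `mailbox` module; the function name `suspicious_dates` is made up for illustration.

```python
import mailbox
from email.utils import parsedate_to_datetime

def suspicious_dates(mbox_path):
    """Yield (index, raw Date header, reason) for messages lacking a usable date."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for i, msg in enumerate(box):
            raw = msg["Date"]
            if raw is None:
                yield i, None, "missing Date header"
                continue
            try:
                when = parsedate_to_datetime(str(raw))
            except (TypeError, ValueError):
                # Older Python versions raise TypeError, newer ones ValueError.
                yield i, str(raw), "unparsable Date header"
                continue
            if when.year <= 1970:
                yield i, str(raw), "epoch-era date"
    finally:
        box.close()

# Example (hypothetical file name):
# for index, raw, reason in suspicious_dates("chats.mbox"):
#     print(index, reason, raw)
```

If the flagged messages turn out to be the chat logs, that would be a useful detail to pass along to support even if the mbox itself can't be shared.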