Performance testing and tuning for Autopsy 4.18

@Eugene_Livis, do you know what this means?

For this round of testing I disabled periodic searching.

But when I look at the logs for that ingest, I see the following:

2021-04-06 21:52:33.4 org.sleuthkit.autopsy.keywordsearch.IngestSearchRunner startJob
INFO: Adding job 0
2021-04-06 21:52:33.401 org.sleuthkit.autopsy.keywordsearch.IngestSearchRunner startJob
INFO: Resetting periodic search time out to default value
2021-04-06 21:53:28.683 org.sleuthkit.autopsy.keywordsearch.Ingester indexText

I checked the logs from an earlier test where I disabled periodic search and I see the same message there. This would explain why changing the periodic search option never improved the indexing time.

Do you know why the periodic search timeout would be changed back to the default, and how I can stop that from happening?

@honor_the_data I have looked at the code - the log message is a bit misleading. If the user has specified a KWS update frequency then that value will be used; if not, then the default is used:
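Roughly, the logic is the following (a paraphrased sketch of what I just described, not the actual Autopsy source - the names here are illustrative):

// Illustrative sketch only - not the real Autopsy code.
static long periodicSearchIntervalMs(Integer userSpecifiedMinutes) {
    final long DEFAULT_MINUTES = 5;                 // default KWS update frequency
    long minutes = (userSpecifiedMinutes != null)
            ? userSpecifiedMinutes                  // a user-specified frequency wins...
            : DEFAULT_MINUTES;                      // ...otherwise fall back to the default
    // The "Resetting periodic search time out to default value" message is logged at job
    // start in either case, which is why you see it even with periodic searching disabled.
    return minutes * 60L * 1000L;                   // interval in milliseconds
}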

I think the reason the periodic search option never made an improvement on indexing time is that for small data sources like the one you’re using the search is probably very quick, so you may not see an impact from periodic searching unless you use a larger data source. However, if you are not going to examine the data until ingest is complete, then you should definitely disable the periodic search.

Ok. Now that I’m using a much larger image I will try again and change the periodic search back to the default 5 minutes to see what impact it makes.

Using the larger image from a Magnet CTF, where the E01 chunks add up to 27.4 GB but the original volume size was 61,440 MB (60 GB), I ran two tests.

One test ran with no periodic searching and the other used the default 5-minute periodic searching. All other variables (memory, ingest threads, etc.) were kept the same.

No periodic searching

  • Duration - 1:34:14
  • The “Resetting periodic search time out to default value” log was present but there were no logs indicating that periodic searches were actually run.

5 min periodic searching

  • Duration - 1:35:41
  • The “Resetting periodic search time out to default value” log was present, along with “Starting periodic searches” messages every 5 minutes.

So… turning on 5-minute periodic searches lengthened the ingest by 1 minute 27 seconds (87 seconds on a 1 h 34 m baseline, about 1.5%).
Caveat: my keyword list only had two terms.

@honor_the_data Yep, that’s roughly what I would expect for what is still a rather small image and only two (exact match?) search terms. I expect those searches to run fairly quickly (seconds, not minutes).

Thanks @Eugene_Livis, this is very interesting. One question I have when tuning memory settings: is the Solr JVM part of the Maximum JVM memory, or a separate JVM? I have 17 GB of RAM in my machine (I realise this is probably going to have to increase with a 3 TB dataset to perform a KWS, or possibly move to a Solr cluster)… Originally I had my Max JVM Memory set to 8 GB and my Max Solr JVM Memory at the default of 512 MB (running Autopsy 4.18). I was toying with the option of increasing the Solr memory to see if I can improve the speed of the KWS.

One question I have when tuning memory settings: is the Solr JVM part of the Maximum JVM memory, or a separate JVM?

@mckw99 Those are two unrelated settings. The Maximum JVM memory is the (max) Java heap size that is allocated to the Autopsy process. In single user mode, Autopsy starts a separate Solr process to do indexing. The Solr JVM setting is the (max) amount of Java heap that will be allocated to the Solr process.
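Put differently, each setting becomes the maximum Java heap (-Xmx) of its own process. The lines below are purely illustrative (Autopsy manages the actual launch commands, which look different), using the values from your post:

java -Xmx8g ...    <- Autopsy process ("Maximum JVM Memory" = 8 GB)
java -Xmx512m ...  <- separate Solr server process ("Maximum Solr JVM Memory" = 512 MB default)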

Originally I had my Max JVM Memory set to 8GB

That is a reasonable default and should work, though when possible I usually set this to 10-14GB.

my Max Solr JVM Memory at the default of 512 MB (running Autopsy 4.18)

This is definitely not enough for large (multi-TB) data sources. If you’re ingesting TBs, I’d use at least 4 GB, better 8 GB. In fact, I have analyzed the thread dump from your other thread (Slow ingest dump - version 4.18 - Keyword search) and I am quite confident that the issue is caused by this setting. I think the Solr server has run out of memory and basically stopped working (at least it is no longer able to index new documents), which drastically slows down ingest, because we re-try indexing several times. If I’m correct, you will see a lot of “Unable to send document batch to Solr. All re-try attempts failed!” errors if you open the Autopsy debug logs.
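(For context, the re-try behavior works roughly like this - a simplified sketch of what I described, not the actual Autopsy code:)

// Simplified illustration of the re-try behavior described above; not the real Autopsy code.
static boolean indexBatchWithRetries(Runnable sendBatchToSolr, java.util.logging.Logger logger) {
    final int MAX_ATTEMPTS = 3;        // Autopsy re-tries "several times"; the exact count may differ
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        try {
            sendBatchToSolr.run();     // fails if the Solr server is out of memory / unresponsive
            return true;               // batch indexed successfully
        } catch (RuntimeException ex) {
            // each failed attempt adds delay, which is what drags the overall ingest speed down
        }
    }
    logger.severe("Unable to send document batch to Solr. All re-try attempts failed!");
    return false;
}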


I would also expect to see some error message notifications in the bottom right corner of the Autopsy UI. Please let me know if you don’t see any error notifications there.

Thanks @Eugene_Livis! I didn’t see any of these errors in the log files, although unfortunately I had to delete the case to restart it with greater memory settings before I read your post. Note to self: save the log files in future! I’ve restarted the ingest with memory settings of JVM: 10 GB and Solr JVM of 8 GB (although I am concerned about the overall memory left in the machine for the OS with these settings (16 GB total)). I will let you know how it goes, and thanks again for your reply, much appreciated.
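Doing the maths on that concern - the heap alone, before the OS and JVM overhead are even counted:

10 GB (Autopsy max heap) + 8 GB (Solr max heap) = 18 GB requested > 16 GB physical RAM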

@mckw99 How did it go?

@Eugene_Livis - thanks for asking :slight_smile: I tweaked the memory settings without upping the ‘physical’ memory in the host VM running Autopsy (16 GB RAM, Windows Server 2019, only running Autopsy). Firstly I upped it to JVM: 10 GB and Solr JVM of 8 GB with all 8 cores - this killed Autopsy - it hung at around 20% and I had to terminate the Autopsy/Java processes (and reboot the machine for good measure). I suspected the memory settings had killed the OS - but I’m no expert on these things! The current ingest I am running is based on JVM: 10 GB and Solr JVM of 4 GB. It started 2021-04-14 09:03 and is 54% through at 2021-04-20 10:00 (3.3 TB) - 6 days so far. I am seeing periodic slowdowns, but have set the ingest not to run any periodic searches; I am not sure if these errors in the logs suggest it is trying to run a search? Also, where is the Tika log located, for further investigation?

2021-04-14 15:27:35.48 org.sleuthkit.autopsy.keywordsearch.IngestSearchRunner startJob
INFO: Resetting periodic search time out to default value
2021-04-14 15:28:24.804 org.sleuthkit.autopsy.textextractors.TikaTextExtractor getReader
WARNING: Error with file [id=269594] XXXXXXXXXX.doc, see Tika log for details…

I am seeing a few errors relating to files not existing, but I would expect that as it is a live file system:
WARNING: Error reading from file with objId = 4351355
org.sleuthkit.datamodel.TskCoreException: Error reading local file, it does not exist at local path: S:\XXXXXXXXX.docx

I would be interested in your thoughts. At the moment I am planning on letting it run (hopefully it will complete within another 6 days) and will run the KWS after the ingest has completed. If it goes to plan, that will be a 3.3 TB data set that has taken 12 days (a guess/estimate at the moment) to complete with memory settings of 16 GB RAM, JVM: 10 GB and Solr JVM of 4 GB. Do you think this is a reasonable speed, or slow, in your experience? The ingest snapshot suggests it is doing about 10 files per second. I have another data set of 6 TB to search, and I am not sure I am willing to wait 24 days for it to complete - I may have to look at a clustered Solr system for the larger dataset…

I am seeing periodic slowdowns, but have set the ingest not to run any periodic searches; I am not sure if these errors in the logs suggest it is trying to run a search?

@mckw99 Short answer - the periodic searches are disabled. That “Resetting periodic search time out to default value” log message is a bit misleading; we should change it. I have looked at the code, and what it means is that we are setting the “user specified default value”. Unless you must run KWS before the ingest completes, you should definitely disable the periodic searches - they start to take more and more time as the size of the index grows.

Where is the Tika log located for further investigation?

I wouldn’t worry about the Tika errors. Tika is a text extraction tool that we use to extract text out of files; sometimes it is unable to do so. But to answer your question, the Tika log (and other logs, including Solr’s) are located in the “C:\Users\USER_NAME\AppData\Roaming\autopsy\var\log” directory.
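If you want to quickly check whether you are hitting the Solr indexing errors I mentioned earlier, something along these lines from a Windows command prompt should do it (substitute your own user name; the search string is the error message from my earlier reply):

findstr /s /c:"All re-try attempts failed" "C:\Users\USER_NAME\AppData\Roaming\autopsy\var\log\*"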

If it goes to plan, that will be a 3.3 TB data set that has taken 12 days (a guess/estimate at the moment) to complete with memory settings of 16 GB RAM, JVM: 10 GB and Solr JVM of 4 GB. Do you think this is a reasonable speed, or slow, in your experience?

Ugh, there really isn’t such a thing as an “expected ingest speed”. Everything very, very heavily depends on what kind of data is being ingested, and on what kind of system. I honestly haven’t tried to ingest 3+ TB into a single-user case, so I can’t give you an educated answer as to whether the performance you’re seeing is slow or reasonable. Overall, with only 16 GB of RAM and only 4 GB allocated to Solr, I think that’s the biggest bottleneck.

I have another data set of 6 TB to search, and I am not sure I am willing to wait 24 days for it to complete - I may have to look at a clustered Solr system for the larger dataset…

I would not wait 24 days either, nor would I expect that you can linearly extrapolate from 12 days to 24 days. At least with a single Solr server, the indexing speed definitely slows down greatly as the size of the index increases. So I would definitely recommend creating an Autopsy multi-user cluster. Even if you have a single Solr node, you may get a performance gain simply because Solr will be on a dedicated machine (as opposed to fighting for CPU, RAM, and disk access with Autopsy on your local machine) and running with much greater hardware resources than 4 GB of RAM. If you can find the hardware resources to run several Solr servers - that will make a huge difference! You can also run ingest on several machines in parallel (if you have multiple data sources), which will obviously also increase the ingest speed. For a single-user case, 6 TB is a significant chunk of data though, so if you find that on your system the ingest slows down seriously at some point, then you may want to consider splitting the input data sources into several Autopsy cases (e.g. 3 TB in case1 and another 3 TB in case2). You’ll have to examine the cases separately, which is obviously inconvenient, but the ingest will complete much faster.

I am seeing a few errors relating to files not existing, but I would expect that as it is a live file system

@mckw99 I just want to check: are you running Autopsy on the same drive that you are trying to analyze? That can definitely lead to some problems.

Hi @Eugene_Livis, no, the dataset I am analysing is on a mapped drive to a network share (so the network card is always going to be a bottleneck). Autopsy is installed on the system drive C: and the log files/case are on another, separate drive. Not ideal, but it’s what I’ve got to work with. Unfortunately, this afternoon I managed to accidentally log off the machine instead of locking it :woman_facepalming:. 7 days of ingest and I accidentally killed the process. I reopened the case and restarted the ingest - will it resume from where it was terminated or (as I suspect) restart the ingest from the beginning? Again, thanks for your advice, much appreciated.

Unfortunately, this afternoon I managed to accidentally log off the machine instead of locking it :woman_facepalming:. 7 days of ingest and I accidentally killed the process.

@mckw99 ugh, sorry…

and the log files/case are on another, separate drive

This will slow things down a bit further because the case database is located in the case directory.

I reopened the case and restarted the ingest - will it resume from where it was terminated or (as I suspect) restart the ingest from the beginning?

No, unfortunately the ingest will be restarted from the beginning. So I would definitely create a new case, because as it stands right now you are adding to the existing index and case database, which already contain 7 days’ worth of data, and you will also see duplicates for all of the previously processed data.

Ok, so it took 6.5 days to complete - but it got there! Actually quicker than I thought it would. The issue I am having now is that my keyword search bar seems to have disappeared - it’s the same no matter what case I look at. I am sure it used to be where the ? is…

Am I missing something incredibly simple?

Well that’s alarming :scream: :scream:! Please let us know if you figure out how to restore the keyword search bar.

The issue I am having now is that my keyword search bar seems to have disappeared

@mckw99 Wow, that’s really odd - I have never seen anything like that. I assume you have restarted Autopsy and that didn’t help? The first thing I would do is try to reset your Autopsy user profile by deleting (or let’s try renaming first) the “C:\Users\elivis\AppData\Roaming\autopsy” directory. Rename the “autopsy” directory to “autopsy_old”. This way Autopsy will start completely “clean”, basically as a brand new install. All your cases and processed data will NOT be affected, but you will have to reconfigure all of the Autopsy settings. If that doesn’t fix it, then you should try deleting the “autopsy” folder again, followed by uninstalling and reinstalling Autopsy.
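For example, from a command prompt with Autopsy closed (USER_NAME here is just a placeholder for your own profile directory):

ren "C:\Users\USER_NAME\AppData\Roaming\autopsy" autopsy_old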

@Eugene_Livis Tried all the usual suspects - rebooted, deleted the Autopsy profile, uninstalled Autopsy, re-installed. I ended up having to delete my user profile on the machine and that sorted it :roll_eyes: As far as the 3 TB search goes - it ran much faster this time, maybe because Autopsy was installed on the same drive as the case files… I’ll start my 6 TB search next week and see how it goes… Thanks for the help.

Ended up having to delete my user profile on the machine and that sorted it

@mckw99 That’s weird, but I’m glad that it’s fixed.

As far as the 3 TB search goes - it ran much faster this time, maybe because Autopsy was installed on the same drive as the case files

That will definitely greatly improve performance. It will be even better if you have SSD drives.

So I thought I would chance my arm and go with the 6 TB dataset (after finally getting the 3 TB one to go so well). Unfortunately it has slowed right down to 5 files a second on the progress snapshot. I think this one might just be too much for the hardware and the single-user setup I am running, but I have attached a link to the thread dump FYI: Dropbox - Ingest thread dump 04.txt - Simplify your life