The TL;DR: Does anyone have tuning advice for significantly improving indexing and keyword search performance?
Long version: For a while I’ve wanted to switch to Autopsy as my primary forensic suite for searching and reviewing images, because I think it has the most sensible and useful UI of any of the commercial products I have access to (EnCase, Axiom, X-Ways).
I’ve done some fairly extensive testing on the new Autopsy 4.18, ingesting the same e01 image repeatedly while varying the JVM memory, Solr JVM memory, and periodic update frequency settings to see how they affect processing times.
Here are the conditions of my testing:
-The data source for each test was the device1_laptop.e01 (≈3.5 GB) image from last year’s free COVID-19 Autopsy training
-All tests were run on the same system
-The system was not doing any other significant workload during any of the tests
-The configurations changed to see how they affected performance were: JVM memory, Solr JVM memory, and the periodic update frequency
-Autopsy was restarted whenever memory settings were changed
Start time and end time for the ingest jobs were determined using the timestamps associated with the “org.sleuthkit.autopsy.ingest.IngestManager startIngestJob” and “org.sleuthkit.autopsy.ingest.IngestManager FinishIngestJob” records in the autopsy.log.0 log file.
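In case it is useful to anyone, here is a rough sketch of how the duration can be pulled out of the log. The timestamp format and the exact record text are assumptions based on what my autopsy.log.0 looks like, so adjust the patterns if your log layout differs.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: compute the ingest duration from autopsy.log.0.
// Assumes each record of interest starts with a "yyyy-MM-dd HH:mm:ss" timestamp
// on the same line as the "IngestManager startIngestJob" / "FinishIngestJob" text;
// adjust the markers and format if your log layout differs.
public class IngestDuration {
    private static final Pattern TIMESTAMP = Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws IOException {
        LocalDateTime start = null;
        LocalDateTime finish = null;
        for (String line : Files.readAllLines(Paths.get(args[0]))) {  // args[0] = path to autopsy.log.0
            Matcher m = TIMESTAMP.matcher(line);
            if (!m.find()) {
                continue;
            }
            if (line.contains("IngestManager startIngestJob")) {
                start = LocalDateTime.parse(m.group(1), FMT);
            } else if (line.contains("IngestManager FinishIngestJob")) {
                finish = LocalDateTime.parse(m.group(1), FMT);
            }
        }
        if (start != null && finish != null) {
            Duration d = Duration.between(start, finish);
            System.out.printf("Ingest duration: %d:%02d:%02d%n",
                    d.toHours(), d.toMinutesPart(), d.toSecondsPart());
        }
    }
}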
My major takeaways:
-Autopsy 4.18 completed an ingest in roughly half the time it took 4.17 to do the same thing. This is excellent!
-Changing the memory settings and the periodic update settings did not make a noticeable impact.
-Autopsy 4.18 took nearly three times as long as Axiom 4.11 to do the same task. That is not a big deal for a 3.5 GB image, but it might be a deal breaker for a 1 TB image.
Does anyone have tuning advice for significantly improving indexing and keyword search performance?
Sorry, I don’t have any advice (noob to this), but since I am currently searching a 3.3 TB data set and have a 6 TB data set to go, I’d be interested in any answers to your question!
It is pretty fast; in fact, it indexed a 500 GB e01 image with carving and other options enabled in under 3 hours.
It has a facility to keep temp data, the image, and the case data on separate drives.
I keep temp data on a 512 GB M.2 NVMe drive, the image on a 2 TB RAID 0 array of two 1 TB SATA SSDs, and the case data on a RAID 5 array.
Its interface is slightly different: case creation has to be done from the command line, but for review it has a passable Java GUI. It is not as streamlined as Autopsy, but it gets the job done.
I am using it as a second open-source forensic tool for data and artifact validation. My primary tool is still Autopsy, but I use the other tool exclusively if I have to do an indexed search.
While working on integrating Solr 8 into Autopsy 4.18 I have run some profiling tests. Short answer: if you are going to work with large images (TBs) and KWS performance is important, your best bet is to use a network Solr server. We call this “Multi-User” (MU) mode, as opposed to “Single-User” (SU) mode, which is the default Autopsy mode. The instructions on how to install a MU Solr server are located here:
The instructions are extensive but the process is honestly very easy.
Some notes:
I find that a single Solr server works well up to about 1 TB of input data; after that, performance starts to slow down. It doesn’t drop off a cliff, but it keeps degrading as you add more data.
A single MU Solr server will probably not perform any better than a SU Autopsy case. However, in MU mode you can add additional Solr servers and create a Solr cluster. See “Adding More Solr Nodes” in the above documentation. That is where the performance gains come from, especially for large input data sources. Apache Solr documentation calls this “SolrCloud” mode, and each Solr server is called a “shard”. The more Solr servers/shards you have, the better performance you will get on large data sets. On our test and production clusters, we are using 4-6 Solr servers to handle data sets of up to 10 TB. That seems to be the upper limit; after that, you are much better off breaking your Autopsy case into multiple cases, thus creating a separate Solr index for each case.
In my experience, a 3-node SolrCloud indexes data roughly twice as fast as a single Solr node, and a 6-node SolrCloud indexes data almost twice as fast as a 3-node SolrCloud. Beyond that I did not see much performance gain. These are all very rough figures that are heavily dependent on network throughput, machine resources, disk access speeds, and the type of data being indexed.
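For the curious, this is roughly what “adding shards” means at the Solr level. In MU mode Autopsy creates the case collection for you, so you normally never do this by hand; the sketch below just calls the standard Solr Collections API to show where the shard count comes in. The host, collection name, and shard/replica counts are placeholder values for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: create a sharded collection via the standard Solr Collections API.
// Autopsy creates the case collection itself in MU mode; this only illustrates
// how the shard count maps onto the SolrCloud cluster. Host, collection name,
// and counts are placeholders. Depending on your setup you may also need to
// point the CREATE call at a configset (collection.configName parameter).
public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        String url = "http://solr1.example.local:8983/solr/admin/collections"
                + "?action=CREATE&name=case_demo&numShards=3&replicationFactor=1";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON status returned by Solr
    }
}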
Exact match searches are MUCH faster than substring or regex searches.
Regex searches tend to use a lot of RAM on the Solr server.
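To make the cost difference concrete, here is a sketch of the three query shapes at the Solr level. These are not the literal queries Autopsy issues internally; the host, port, core name, field name, and search terms are all placeholders. The point is that an exact/phrase match is a direct term lookup, a leading-wildcard substring search has to scan many indexed terms, and a regular-expression query is evaluated against the indexed terms, which is what eats RAM on the Solr server.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: exact vs. substring vs. regex keyword queries against a Solr core.
// Host, port, core, field, and terms are placeholders; Autopsy builds its own
// queries internally. The QTime value in each JSON response shows how long
// Solr spent on the query.
public class QueryShapes {
    public static void main(String[] args) throws Exception {
        String[] queries = {
            "text:\"invoice\"",         // exact/phrase match: direct term lookup, cheapest
            "text:*voice*",             // substring: leading wildcard scans many terms
            "text:/inv[o0]ice[0-9]+/"   // regex: evaluated against indexed terms, RAM-hungry
        };
        HttpClient client = HttpClient.newHttpClient();
        for (String q : queries) {
            String url = "http://localhost:8983/solr/my_core/select?rows=0&q="
                    + URLEncoder.encode(q, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(q + " -> " + response.body());
        }
    }
}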
I find that indexing/searching of unallocated space really slows everything down because it is mostly binary or garbled data.
If you are not going to look at the search results until ingest is over then you should disable the periodic searches. They will start taking longer as your input data grows.
Hope this helps. I’ll be glad to answer any other questions.
Thanks for the very detailed response @Eugene_Livis .
A multi-node MU Solr deployment is definitely an option, but at this point I was still doing an apples-to-apples comparison between Autopsy and Axiom to see if I can justify the effort for my team to start adopting Autopsy. I’m hoping I can get Autopsy indexing to be more competitive with Axiom when both tools index the same image, run 100% locally, and read from and write to the same disks. These tests were not run at the same time, BTW, to avoid them fighting over resources.
I also ran tests with the periodic searching disabled, or changed to 20 minutes, but this did not result in any improvement.
I’m hoping that I can get Autopsy indexing to be more competitive with Axiom indexing, when indexing the same image
@honor_the_data I obviously don’t know the details of your tests (and you have probably thought of this already), nor have I used Axiom to know anything about its performance, but it is important to make sure that both tools are doing the same work. Some questions that immediately jump to mind are:
What Autopsy ingest modules are you enabling during your test? If you are profiling just KWS indexing, then the only ingest modules you should enable are “File Type ID” and “Keyword Search” with all built-in searches (URL, IP, email, etc) disabled.
You should probably use 6 ingest threads, or at least 4. You can configure that in Tools->Options->Ingest->Settings tab.
I don’t know if Axiom is indexing the unallocated space by default or not. If not, then you should disable processing of unallocated space in Autopsy as well, as that definitely takes significant time:
Having done extensive testing/profiling with Solr, I found that there are a lot of tuning parameters. I would recommend using a much larger image for your profiling/testing. There are periodic actions that the indexing server performs under the covers, for example merging index data. If one test concludes before such a computationally intensive periodic event occurs and another test completes after it, the results will be misleading (you will hit these periodic events when processing real data sets, so you have to account for them). I find that larger images (50-100 GB?) provide a much better gauge of performance.
Indexing performance is also very non-linear with respect to the size of the input data. In my experience, the speed at which the first 5 GB are indexed tells you very little about the speed at which data is being indexed once you have already indexed 1 TB. So again, I would recommend much larger test images.
Again, I do not know anything about Axiom’s indexing performance, but these are some of the “gotchas” that I have discovered in my testing.
What Autopsy ingest modules are you enabling during your test?- The only enabled ingest module for Autopsy is Keyword Search. When comparing with Axiom, both tools ran a keyword search using the same two keywords, and both included unallocated space.
You should probably use 6 ingest threads, or at least 4.- Autopsy is recommending a maximum of 4 for my machine. I’ll try 6 to see what happens.
I don’t know if Axiom is indexing the unallocated space by default or not.- This is not default in Axiom but I made sure to turn it on to match the ingest configuration of Autopsy.
I would recommend using a much larger image for your profiling/testing.- I can give this a shot. The reason I’ve been testing with such a small image is that in the past I’ve tried indexing images in the size range you mention and they would run for very long amounts of time, like over 24 hours. Autopsy 4.18 is much faster though, so I can check whether testing with a larger image is viable.
@honor_the_data I will also point out that if you have a team of examiners, then setting up a Multi-User “cluster” is definitely the best way to go in terms of performance, optimal use of computer resources, and user experience. You will only have to set up one cluster for the entire lab, not one per examiner. All ingest processing will then be done on the cluster and can be done on multiple machines in parallel 24/7. All of the examiners will be able to connect to the cluster and open/examine all cases that have been processed. Cases will not be local to the examiner’s machine. Multiple examiners can open and modify the same case at the same time. And you will be able to distribute Solr indexing/searching across multiple Solr servers, greatly speeding up performance.
We have clients that have labs with multiple examiners. A cluster is definitely the best way to go in that scenario. There are very detailed instructions on how to configure an Autopsy cluster:
That would be my dream scenario, and it is in fact why I’m doing this testing. Unfortunately, my dilemma is that if I cannot demonstrate that Autopsy can do equivalent ingest operations in times that are competitive with the other tools we have available to us, e.g. Axiom, the team is not going to use Autopsy no matter how convenient the collaboration is.
On that note, I increased the ingest threads from 4 to 6. The total ingest time did drop, but not by a huge amount: it went from 0:36:50 to 0:33:14. Axiom was 0:14:44 for the same image, keywords, unallocated space, etc.
I will look for a larger image in the 50-100 GB range like you suggested and then run similar tests.
I found a larger image from a Magnet CTF. The e01 chunks add up to 27.4 GB but the original volume size was 61,440 MB. This should be a better test.
For this round of testing I disabled periodic searching:
But when I look at the logs for that ingest, I see the following:
2021-04-06 21:52:33.4 org.sleuthkit.autopsy.keywordsearch.IngestSearchRunner startJob
INFO: Adding job 0
2021-04-06 21:52:33.401 org.sleuthkit.autopsy.keywordsearch.IngestSearchRunner startJob
INFO: Resetting periodic search time out to default value
2021-04-06 21:53:28.683 org.sleuthkit.autopsy.keywordsearch.Ingester indexText
I checked the logs from an earlier test where I disabled periodic search and saw the same log entry. This would explain why changing the periodic search option never improved the indexing time.
Do you know why the periodic search time would be changed back to the default, and how I can stop that from happening?
@honor_the_data I have looked at the code; the log message is a bit misleading. If the user has specified a KWS update frequency, then that value will be used. If not, then the default is used:
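In other words, the precedence is roughly as in the tiny sketch below. This is a paraphrase of the behavior described above, not the actual Autopsy source, and the 5-minute default is the value mentioned later in this thread.

// Paraphrase of the described behavior, not the actual Autopsy code: the
// "Resetting periodic search time out to default value" message is logged,
// but a user-configured KWS update frequency still takes precedence.
public class PeriodicSearchInterval {
    static final long DEFAULT_MINUTES = 5;

    static long effectiveIntervalMinutes(Long userConfiguredMinutes) {
        if (userConfiguredMinutes != null) {
            return userConfiguredMinutes;  // user-specified update frequency wins
        }
        return DEFAULT_MINUTES;            // otherwise fall back to the default
    }

    public static void main(String[] args) {
        System.out.println(effectiveIntervalMinutes(null));  // 5 (default)
        System.out.println(effectiveIntervalMinutes(20L));   // 20 (user setting)
    }
}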
I think the reason the periodic search option never made an improvement to indexing time is that for small data sources like the one you’re using, the searches are probably very quick, so you may not see the impact of periodic searching unless you use a larger data source. However, if you are not going to examine the data until ingest is complete, then you should definitely disable the periodic search.
Using the larger image from the Magnet CTF, where the e01 chunks add up to 27.4 GB but the original volume size was 61,440 MB, I ran two tests.
One test had no periodic searching and the other used the default 5-minute periodic searching. All other variables (memory, ingest threads, etc.) were kept the same.
No periodic searching
-Duration: 1:34:14
-The “Resetting periodic search time out to default value” log entry was present, but there were no log entries indicating that periodic searches were actually run.
5 min periodic searching
-Duration: 1:35:41
-The “Resetting periodic search time out to default value” log entry was present, along with “Starting periodic searches” entries every 5 minutes.
So… turning on 5-minute periodic searches lengthened the total time by 1 minute 27 seconds (1.5%).
Caveat: my keyword list only had two terms.
@honor_the_data Yep, that’s roughly what I would expect for what is still a rather small image and only two (exact match?) search terms. I expect those searches to run fairly quickly (seconds, not minutes).
Thanks @Eugene_Livis, this is very interesting. One question I have when tuning memory settings: is the Solr JVM part of the Maximum JVM Memory, or a separate JVM? I have 17 GB of RAM in my machine (I realise this will probably have to increase to perform a KWS over a 3 TB dataset, or I may have to move to a Solr cluster). Originally I had my Max JVM Memory set to 8 GB and my Max Solr JVM Memory at the default of 512k (running Autopsy 4.18). I was toying with the option of increasing the Solr memory to see if I can improve the speed of the KWS.
One question I have, when tuning memory settings, is the Solr JVM part of the Maximum JVM memory or a separate JVM?
@mckw99 Those are two unrelated settings. The Maximum JVM memory is the (max) Java heap size that is allocated to the Autopsy process. In single user mode, Autopsy starts a separate Solr process to do indexing. The Solr JVM setting is the (max) amount of Java heap that will be allocated to the Solr process.
Originally I had my Max JVM Memory set to 8GB
That is a reasonable default and should work, though when possible I usually set this to 10-14GB.
my Max Solr JVM Memory at the default of 512k (running Autopsy 4.18)
This is definitely not enough for large (multi-TB) data sources. If you’re ingesting TBs, I’d use at least 4 GB, preferably 8 GB. In fact, I have analyzed the thread dump from your other thread (Slow ingest dump - version 4.18 - Keyword search) and I am quite confident that the issue is caused by this setting. I think the Solr server has run out of memory and has basically stopped working (at least it is no longer able to index new documents), which drastically slows down ingest because we re-try indexing several times. If I’m correct, you will see a lot of “Unable to send document batch to Solr. All re-try attempts failed!” errors if you open the Autopsy debug logs:
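If it helps, here is a small sketch for scanning the logs for that error. The log directory is passed in as an argument, since its location depends on your case and installation, and the error text is taken verbatim from the message quoted above.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: recursively scan a directory of Autopsy logs for the batch-indexing
// failure described above. Pass the log directory as the first argument; the
// error text is the message quoted in this thread.
public class FindSolrBatchErrors {
    private static final String ERROR_TEXT = "Unable to send document batch to Solr";

    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get(args[0]);
        try (Stream<Path> paths = Files.walk(logDir)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().startsWith("autopsy.log"))
                 .forEach(FindSolrBatchErrors::scan);
        }
    }

    private static void scan(Path logFile) {
        try (Stream<String> lines = Files.lines(logFile)) {
            lines.filter(l -> l.contains(ERROR_TEXT))
                 .forEach(l -> System.out.println(logFile + ": " + l));
        } catch (IOException | UncheckedIOException e) {
            System.err.println("Could not read " + logFile + ": " + e.getMessage());
        }
    }
}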
I would also expect to see some error message notifications in the bottom right corner of the Autopsy UI. Let me know please if you don’t see any error notifications there.
Thanks @Eugene_Livis! I didn’t see any of these errors in the log files, although unfortunately I had to delete the case to restart it with larger memory settings before I read your post. Note to self: save the log files in future! I’ve restarted the ingest with memory settings of JVM: 10 GB and Solr JVM: 8 GB (although I am concerned about the overall memory left in the machine for the OS with these settings, 16 GB total). I will let you know how it goes, and thanks again for your reply, much appreciated.