Performance testing and tuning for Autopsy 4.18

@honor_the_data, @mckw99, @Athomas Sorry for the delayed response.

While integrating Solr 8 into Autopsy 4.18, I ran some profiling tests. Short answer: if you are going to work with large images (TBs) and keyword search (KWS) performance is important, your best bet is to use a network Solr server. We call this “Multi-User” (MU) mode, as opposed to “Single-User” (SU) mode, which is the default Autopsy mode. The instructions for installing a MU Solr server are located here:

https://sleuthkit.org/autopsy/docs/user-docs/4.18.0/install_solr_page.html

The instructions are extensive but the process is honestly very easy.

Some notes:

  • I find that a single Solr server works well up to about 1TB; beyond that, performance starts to slow down. It doesn’t “drop off a cliff”, but it keeps degrading as you add more data.
  • A single MU Solr server will probably not perform any better than a SU Autopsy case. However, in MU mode you can add more Solr servers and create a Solr cluster. See “Adding More Solr Nodes” in the above documentation. That is where the performance gains come from, especially for large input data sources. Apache Solr documentation calls this “SolrCloud” mode and each Solr server is called a “shard”. The more Solr servers/shards you have, the better performance you will get for large data sets. On our test and production clusters, we use 4-6 Solr servers to handle data sets of up to 10TB. That seems to be the upper limit. Beyond that, you are much better off breaking your Autopsy case into multiple cases, thus creating a separate Solr index for each case.
  • In my experience, a 3-node SolrCloud indexes data roughly twice as fast as a single Solr node. A 6-node SolrCloud indexes data almost twice as fast as a 3-node SolrCloud. Beyond that I did not see much performance gain. These are all very rough figures that depend heavily on network throughput, machine resources, disk access speeds, and the type of data being indexed.
  • Exact match searches are MUCH faster than substring or regex searches.
  • Regex searches tend to use a lot of RAM on the Solr server.
  • I find that indexing/searching unallocated space really slows everything down because it is mostly binary or garbage data.
  • If you are not going to look at the search results until ingest is over, then you should disable the periodic searches. They will take longer and longer as your input data grows.
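
To make the sizing notes above easier to apply, here is a minimal Python sketch that just encodes those rules of thumb. Everything in it is illustrative: the function names (`suggest_setup`, `estimated_hours`), the constant names, and the hard cut-offs are my assumptions, and the speedup numbers are the same very rough figures from the notes (the 6-node value is an optimistic upper bound, since it was only "almost" twice the 3-node speed). It is not part of Autopsy or Solr.

```python
# Ballpark thresholds taken from the notes above; they are not hard limits.
SINGLE_SERVER_LIMIT_TB = 1.0   # a single Solr server works well up to ~1TB
CLUSTER_LIMIT_TB = 10.0        # 4-6 servers handled data sets up to ~10TB

# Very rough indexing speedup relative to one Solr node.
# 6 nodes was "almost" twice as fast as 3 nodes, so 4.0 is an upper bound.
ROUGH_SPEEDUP = {1: 1.0, 3: 2.0, 6: 4.0}


def suggest_setup(data_tb: float) -> str:
    """Return a rule-of-thumb Solr setup for a data source of data_tb terabytes."""
    if data_tb <= 0:
        raise ValueError("data_tb must be positive")
    if data_tb <= SINGLE_SERVER_LIMIT_TB:
        return "single Solr server"
    if data_tb <= CLUSTER_LIMIT_TB:
        return "SolrCloud with 4-6 servers"
    return "split into multiple cases"


def estimated_hours(single_node_hours: float, nodes: int) -> float:
    """Scale a measured single-node indexing time by the rough speedup."""
    if nodes not in ROUGH_SPEEDUP:
        raise ValueError("only 1, 3, and 6 nodes have rough figures")
    return single_node_hours / ROUGH_SPEEDUP[nodes]


if __name__ == "__main__":
    for size_tb in (0.5, 4, 12):
        print(f"{size_tb} TB: {suggest_setup(size_tb)}")
    print(f"40h on one node, 3 nodes: about {estimated_hours(40, 3)}h")
```

Treat the output as a starting point only; actual numbers will vary with network throughput, hardware, disk speed, and the kind of data being indexed.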

Hope this helps. I’ll be glad to answer any other questions.
