How to get keyword search results with less ingest time

I would appreciate it if someone could help me understand how to configure ingest and other settings in order to get better keyword search results in less ingest time.

I have several EWF (E01) images that I have to examine using keyword searches. I have to make sure searches are conducted on deleted files, and on any container files such as *.pst, *.zip, etc., too.

The following snapshots show the status of ingest for one image (as of 2nd Aug). After many hours, I'm not sure to what extent the keyword search has completed.

The following snapshots were taken on 5th Aug.

I would appreciate any advice on the following:

  1. What is the status of the ingest process?
  2. How can I optimize the ingest process?
  3. Approximately how long will it take to complete the keyword search?
  4. Any other thoughts that would make this exercise successful?

I'm using an old laptop with an Intel i5-2430M 2.4 GHz (4 CPUs) and 8 GB RAM, running Windows 7.
Autopsy version: 4.11.0

There is nothing in the screen shots to indicate that the keyword search module is engaged in indexing text in one of the three file ingest threads in any of the snapshots. However, it looks like the recent activity and email parser modules are somehow “stuck.” If I am reading the little numbers in the Elapsed Time columns correctly, recent activity analysis has been going on for ~175 hours and the email parser has been analyzing archive.pst for something like ~128 hours. You might want to check the logs or send them to me here at Basis Technology, if that is an option. I am the Autopsy Team Lead.

Note that indexing text for keyword search appears in the ingest snapshot, but actual searching of text for keywords does not. Searching happens in a different thread pool. It will have its own progress bar in the lower right hand corner, but that’s it. Perhaps the progress bar for keyword search is scrolled out of sight in the screenshot?

If you are certain that keyword search is also proceeding slowly, I would note that keyword search is a RAM intensive process. As you already know, I think, 8GB of RAM is not a lot. I certainly would not try to do two ingest jobs simultaneously, as you are doing here.

One thing you could do is define a file filter to limit the files that are processed in an ingest job. You can do so through Tools => Options => Ingest tab => File Filters tab, and then select the filter when you kick off an ingest job. However, there are some limitations on the filters that can be defined. Filtering on extension is easy; filtering on deleted files is unfortunately not currently supported.

How long any ingest job will take, and keyword search in particular, is highly dependent on the input and the resources of the computer. I don’t have a way to predict how long your jobs will take.

1 Like

Thank you so much for the valuable input!

File filtering is a good option, which I have just started using as you suggested.

I'm wondering: if the ingest is filtered by file extensions, will it consider/filter the recoverable deleted files too, or will it disregard them?

Could you please tell me how to end the ingest process safely, so that the ingest work completed so far is saved? Or is it saved automatically?
Then I can restart ingest on a better machine, or else start reviewing the files with the keyword hits found so far.
I would also like to know which ingest modules are essential for recovering deleted files (I'm interested in any user-generated documents, emails, etc.).

Recovery of deleted files is done automatically by the SleuthKit when Autopsy runs it.

You can, however, choose to ignore orphan files in FAT file systems, where orphan files are defined in this context as deleted files that still have file metadata in the file system, but that cannot be accessed from the root directory. Note that FAT file system orphan files are not ignored by default, and recovering them can take some time. The setting is not a module setting, it is a data source processor setting on the “Select Data Source” page of the “Add Data Source” wizard, if you select “Disk Image or VM File” on the “Select Type of Data Source to Add” page of the wizard. You might want to think about this if you are examining FAT file systems.

When you apply an extension based file filter to an ingest job, all files with the specified extension(s) will be passed through the filter and included in the analysis, whether deleted or not. Note that the conditions (name, folder name, modified time) in a file filter rule are AND’ed together to determine which files satisfy the rule, and the rules are OR’ed together to determine which files pass through the filter.
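As a rough illustration of that AND/OR combination, here is a sketch in Python. This is not Autopsy's actual filter implementation; the file attributes and example rules are made up for demonstration.

```python
# Sketch of Autopsy-style file filter semantics (illustrative only):
# conditions within a rule are AND'ed, and rules are OR'ed together.

def rule_matches(file_attrs, conditions):
    """A file satisfies a rule only if EVERY condition holds (AND)."""
    return all(cond(file_attrs) for cond in conditions)

def filter_passes(file_attrs, rules):
    """A file passes the filter if ANY rule matches (OR)."""
    return any(rule_matches(file_attrs, conds) for conds in rules)

# Example filter: pass .pst files anywhere, OR any file whose folder
# name contains "documents". A deleted file with a matching extension
# would pass just like an allocated one.
rules = [
    [lambda f: f["name"].lower().endswith(".pst")],    # rule 1: extension
    [lambda f: "documents" in f["folder"].lower()],    # rule 2: folder name
]

print(filter_passes({"name": "archive.pst", "folder": "/mail"}, rules))     # True
print(filter_passes({"name": "notes.txt", "folder": "/Documents"}, rules))  # True
print(filter_passes({"name": "notes.txt", "folder": "/tmp"}, rules))        # False
```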

If you cancel an ingest job, whatever has been written into the case database and the text index up to that point is what you get - the partial results are not deleted.

To cancel an ingest job, use the progress bars in the lower right hand corner of the main application window. There may be multiple tasks to cancel, including a separate keyword search task. There will likely be a slight to modest delay while the current tasks finish what they were doing. This is because you cannot force-kill threads of execution in Java, and Autopsy is primarily a Java application; you can only notify tasks running in threads that they should terminate as soon as it is safe to do so.

It is conceivable that a "stuck" analysis task may not respond to a cancellation request. If this happens, your only recourse is to kill the process that owns the thread, i.e., shut down Autopsy. That is unfortunate, but not the end of the world if you are just trying to cancel an ingest job. It is also possible, in rare circumstances, for the Autopsy process itself to become hung and for you to be unable to shut it down without resorting to the Task Manager. Note that this is easy to do on Windows 10; in older versions of the Task Manager you will need to separately kill off the Solr server process spawned by the Autopsy process. Unfortunately, if it comes to this, the Solr server process will be identified only as "Java™ Platform SE Binary," and some discernment is required. Of course, in such an extremity, it may be an option to simply reboot the machine.
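The cooperative-cancellation pattern described above can be sketched in Python with `threading.Event` (Java's mechanism differs in API, but the concept is the same: you cannot force-kill a thread, only request that it stop):

```python
# Cooperative cancellation: the task checks a flag at safe points and
# stops cleanly when it sees the flag set. A "stuck" task is one that
# never reaches such a check, so it never honors the request.
import threading
import time

cancel_requested = threading.Event()

def ingest_task():
    for _ in range(1000):
        if cancel_requested.is_set():   # safe point: honor the request
            print("task: stopping cleanly")
            return
        time.sleep(0.01)                # simulated unit of work

t = threading.Thread(target=ingest_task)
t.start()
time.sleep(0.05)
cancel_requested.set()                  # "cancel" is a polite request...
t.join()                                # ...so there is a short delay here
```

The `join()` delay at the end is the same "slight to modest delay" you see in the UI after clicking cancel.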

1 Like

Thank you so much. I stopped ingest and then applied filters and re-ran the ingest.

I hope this time it’ll work perfectly.

I have a few questions with regard to keyword searches.

  1. After the ingest is 100% complete, can I add keywords? Do I then have to rerun ingest, perhaps with only the keyword search module enabled? Will this be enough to get complete keyword hit results?

  2. If the ingest operation did not complete 100%, but ran halfway and was then stopped/cancelled, and I afterwards add more/new keywords and run the ingest with all relevant modules, will the keyword search cover the part of the ingest that was already completed? I'm trying to verify that the keyword results are complete.

I would highly appreciate it if you could clarify. Thank you so much!

You can do keyword searching without running an ingest job. In the upper right hand area of the main application window there are two drop downs that allow you to do searching with keyword lists or individual search terms. The search results will be presented in a separate window and any keyword hit results (artifacts) will be added to the case database. All text that has been added to the text index at the time of the search will be searched.


If you choose to run another ingest job with just the keyword search module enabled, likewise all of the text in the text index will be searched for whatever search terms are in the keyword lists you have selected to be used for the ingest job.

1 Like

I tried a keyword search using the drop-down option (Keyword Lists),
but I'm getting an error when it is saving:
Unable to close connection to case database
Caused by: An SQLException was provoked by the following failure:
java.util.ConcurrentModificationException

An extract of the error log file is attached FYI.
Error log extract

Appreciate if you can give me a suggestion to solve this.

On the slowness topic, I think Richard was correct that the suspicious things are Recent Activity and Email.

The only other thing that should be looked at is that we recently (though not yet released) changed how we parse PST and MBOX files so that we do not keep all of them in memory. The code in 4.11 and 4.12 will load all messages into memory and then add them to the database. This is obviously not good on memory for big PST files.

How big is your archive.PST file? It could be that the module is taking up all of the memory and making everything else run really slow.

You could test this by disabling the Email module and see if everything runs faster.
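To illustrate the in-memory vs. streaming difference described above, here is a sketch in Python. This is purely illustrative and is not Autopsy's actual email module code; `save_to_database` is a made-up stand-in.

```python
# Two ways to process a stream of parsed email messages. With a
# multi-gigabyte PST, the first approach holds every message in RAM
# at once; the second handles one message at a time in constant memory.

saved = []

def save_to_database(msg):
    saved.append(msg)   # stand-in for an insert into the case database

def parse_all_in_memory(messages):
    parsed = [m.upper() for m in messages]   # everything held at once
    for m in parsed:
        save_to_database(m)

def parse_streaming(messages):
    for m in messages:                       # one message at a time
        save_to_database(m.upper())

parse_streaming(["hello", "world"])
print(saved)   # ['HELLO', 'WORLD']
```

Both produce the same database contents; only the peak memory use differs, which is why a huge archive.pst can starve everything else of RAM under the old approach.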

You have lots of errors (lower right corner, red circle). What do they say?

The log file talks about database corruption. Are you out of disk space?
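If you want a quick programmatic check of free disk space, Python's standard library can report it (the path below is illustrative; on Windows you would point it at the drive holding the case):

```python
# Report free space on the volume containing the given path.
import shutil

usage = shutil.disk_usage("/")          # e.g. "C:\\" on Windows
free_gib = usage.free / (1024 ** 3)
print(f"free: {free_gib:.1f} GiB")
```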

1 Like

It's been a long time, but I noticed the attached log excerpt and took a look at it. It appears that something is going very wrong with attempts to access the case database. There are many, many stack traces such as this:

2019-08-17 13:38:46.819 org.sleuthkit.autopsy.datamodel.KeywordHits$KeywordResults update
WARNING: SQL Exception occurred:
org.sqlite.SQLiteException: [SQLITE_CORRUPT] The database disk image is malformed (database disk image is malformed)
org.sqlite.core.DB.newSQLException(DB.java:941)
org.sqlite.core.DB.newSQLException(DB.java:953)
org.sqlite.core.DB.throwex(DB.java:918)
org.sqlite.jdbc3.JDBC3ResultSet.next(JDBC3ResultSet.java:84)
ETC.

These errors are happening at the SQLite JDBC level and indicate that the database is somehow corrupt, if the error messages are reliable.
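One way to confirm whether the case database file really is corrupt is SQLite's built-in integrity check. A minimal sketch in Python, assuming you have the path to the .db file (the `:memory:` demo database here is just for illustration):

```python
# Run SQLite's own integrity check against a database file.
# It returns the single row "ok" for a healthy database, or a list
# of problems for a corrupt one.
import sqlite3

def check_integrity(db_path):
    con = sqlite3.connect(db_path)
    try:
        result = con.execute("PRAGMA integrity_check;").fetchone()[0]
    finally:
        con.close()
    return result

# Demo on a fresh in-memory database, which is trivially healthy:
print(check_integrity(":memory:"))   # ok
```

Run it against a copy of the case database, never the original, while Autopsy is closed.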