This documentation is for Dovecot v2.x, see wiki1 for v1.x documentation.

Squat Full Text Search Indexing

NOTE: The Squat code is quite slow for large mailboxes. There are also a few bugs that are unlikely to be fixed. In v2.1+ it's recommended to use fts-lucene instead. There's also a "New Squat" redesign/reimplementation going on, which will likely end up having a new name.

The main difference between Squat indexes and the others is that Squat provides support for substring searches, while pretty much all other FTS indexes support only matching from the beginning of words. By strictly reading the IMAP RFC it requires substring matching, so to optimize regular TEXT and BODY searches you must use Squat with Dovecot v2.0. The other indexes are used only for Dovecot's non-standard X-TEXT-FAST and X-BODY-FAST searches. However, almost all other commonly used IMAP servers no longer care about this requirement, so Dovecot v2.1 also no longer makes this distinction. In Dovecot v2.1 TEXT and BODY searches are done using any indexes available.

Some statistics with Core2 duo 6600, 2 GB memory (partial=4 full=4):

Internal workings

Squat works by building a trie of all 1..4 character combinations of all words in messages. For example given a word "world" it indexes "worl", "orld", "rld", "ld" and "d". Matching message UIDs are stored to each trie node, so Squat supports giving definitive answers to any searches with a length of 1..4 characters. For longer search words the word is looked up in 4 letter pieces and the results are merged. The resulting list is a list of messages where the word *may* exist. This means that those messages are still opened and the word is searched from them the slow way.

Squat also supports indexing more characters from the beginning of words. For example if 5 first characters are indexed, the word "character" would be cut to "chara". If a user then searches for "chara", Dovecot does:

  1. Search for messages having "char" (reply example: 1,5,10)
  2. Search for messages having "hara" (reply example: 3,5,10)
  3. Store the intersection of the above searches to "maybe UIDs" (example: 5,10)
  4. Search for full word "chara" and store it as "definite UIDs" (reply example: 5)
  5. Remove definite UIDs (5) from maybe UIDs (5,10 -> 10)

In the above example Dovecot would know that message UID 5 has the word "chara", but it still has to check if UID 10 has it as a substring (e.g. in word "achara") by reading the message contents. If the user's search words match mostly non-substrings (which is common), using long enough full word indexing can improve the search performance a lot, especially when the word matches a lot of messages.

The Squat name comes from Cyrus IMAP which implements slightly similar Squat indexes ("Search QUery Answer Tool"). Dovecot's implementation and file format however is completely different. The main visible difference is that Dovecot allows updating the index incrementally instead of requiring to re-read the entire mailbox to build it. Cyrus Squat also doesn't support 1..3 characters long searches.

Configuration

fts_squat setting can be used to change Squat options:

mail_plugins = $mail_plugins fts fts_squat

plugin {
  fts = squat
  fts_squat = partial=4 full=10
}

Using both Squat and non-Squat (v2.0-only)

It's possible to use both Squat and non-Squat indices at the same time, so that TEXT/BODY are looked up from Squat indexes and X-TEXT-FAST/X-BODY-FAST are looked up from the non-Squat index. This of course means that indices will have to be built and updated for both, so it might not be that good idea.

protocol imap {
..
  mail_plugins = fts fts_squat fts_solr
}
...
plugin {
  fts = squat solr
}

Plugins/FTS/Squat (last edited 2014-01-27 10:47:23 by TimoSirainen)