Mbox Mailbox Format
Usually UNIX systems are configured by default to deliver mails to /var/mail/username or /var/spool/mail/username mboxes. In IMAP world these files are called INBOX mailboxes. IMAP protocol supports multiple mailboxes however, so there needs to be a place for them as well. Typically they're stored in ~/mail/ or ~/Mail/ directories.
The mbox file contains all the messages of a single mailbox. Because of this, the mbox format is typically thought of as a slow format. However with Dovecot's indexing this isn't true. Only expunging messages from the beginning of a large mbox file is slow with Dovecot, most other operations should be fast. Also because all the mails are in a single file, searching is much faster than with maildir.
Modifications to mbox may require moving data around within the file, so interruptions (eg. power failures) can cause the mbox to break more or less badly. Although Dovecot tries to minimize the damage by moving the data in a way that data should never get lost (only duplicated), mboxes still aren't recommended to be used for important data.
Locking is a mess with mboxes. There are multiple different ways to lock a mbox, and software often uses incompatible locking. There are at least four different ways to lock a mbox:
dotlock: mailboxname.lock file created by almost all software when writing to mboxes. This grants the writer an exclusive lock over the mbox, so it's usually not used while reading the mbox so that other processes can also read it at the same time. So while using a dotlock typically prevents actual mailbox corruption, it doesn't protect against read errors if mailbox is modified while a process is reading.
flock: flock() system call is quite commonly used for both read and write locking. The read lock allows multiple processes to obtain a read lock for the mbox, so it works well for reading as well. The one downside to it is that it doesn't work if mailboxes are stored in NFS.
fcntl: Very similar to flock, also commonly used by software. In some systems this fcntl() system call is compatible with flock(), but in other systems it's not, so you shouldn't rely on it. fcntl works with NFS if you're using lockd daemon in both NFS server and client.
lockf: POSIX lockf() locking. Because it allows creating only exclusive locks, it's somewhat useless so Dovecot doesn't support it. With Linux lockf() is internally compatible with fcntl() locks, but again you shouldn't rely on this.
Another problem with dotlocks is that if the mailboxes exist in /var/mail/, the user may not have write access to the directory, so the dotlock file can't be created. There are a couple of ways to work around this:
Give a mail group write access to the directory and then make sure that all software requiring access to the directory runs with the group's privileges. This may mean making the binary itself setgid-mail, or using a separate dotlock helper program which is setgid-mail. With Dovecot this can be done by setting mail_extra_groups = mail.
Set sticky bit to the directory (chmod +t /var/mail). This makes it somewhat safe to use, because users can't delete each others mailboxes, but they can still create new files (the dotlock files). The downside to this is that users can create whatever files they wish in there, such as a mbox for newly created user who hadn't yet received mail.
If multiple lock methods are used, which is usually the case since dotlocks aren't typically used for read locking, the order in which the locking is done is important. Consider if two programs were running at the same time, both use dotlock and fcntl locking but in different order:
- Program A: fcntl locks the mbox
- Program B at the same time: dotlocks the mbox
- Program A continues: tries to dotlock the mbox, but since it's already dotlocked by B, it starts waiting
- Program B continues: tries to fcntl lock the, but since it's already fcntl locked by A, it starts waiting
Now both of them are waiting for each others locks. Finally after a couple of minutes they time out and fail the operation.
Locking used by different software
- Debian: Debian's policy specifies that all software should use "fcntl and then dotlock" locking.
procmail: procmail -v 2>&1|grep Lock
mutt: mutt -v|grep -i lock
Dovecot: mbox_read_locks and mbox_write_locks settings.
When listing mailboxes, Dovecot simply assumes that all files it sees are mboxes and all directories mean that they contain sub-mailboxes. There are two special cases however which aren't listed:
.subscriptions file contains IMAP's mailbox subscriptions.
.imap/ directory contains Dovecot's index files.
Because it's not possible to have a file which is also a directory, it's not possible to create a mailbox and child mailboxes under it.
Dovecot uses C-Client (ie. UW-IMAP, Pine) compatible headers in mbox messages to store metadata. These headers are:
- X-IMAPbase: Contains UIDVALIDITY, last used UID and list of used keywords
- X-IMAP: Same as X-IMAPbase but also specifies that the message is a "pseudo message"
- X-UID: Message's allocated UID
- Status: R (\Seen) and O (non-\Recent) flags
- X-Status: A (\Answered), F (\Flagged), T (\Draft) and D (\Deleted) flags
- X-Keywords: Message's keywords
- Content-Length: Length of the message body in bytes
Whenever any of these headers exist, Dovecot treats them as its own private metadata. It does sanity checks for them, so the headers may also be modified or removed completely. None of these headers are sent to IMAP/POP3 clients when they read the mail. Preferrably your LDA should strip all these headers before writing the mail to the mbox.
Only the first message contains the X-IMAP or X-IMAPbase header. The difference is that when all the messages are deleted from mbox file, a "pseudo message" is written to the mbox which contains X-IMAP header. This is the "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA" message which you hate seeing when using non-C-client and non-Dovecot software. This is however important to prevent abuse, otherwise the first mail which is received could contain faked X-IMAPbase header which could cause trouble.
If message contains X-Keywords header, it contains a space-separated list of keywords for the mail. Since the same header can come from the mail's sender, only the keywords are listed in X-IMAP header are used.
The UID for a new message is calculated from "last used UID" in X-IMAP header + 1. This is done always, so fake X-UID headers don't really matter. This is also why the pseudo message is important. Otherwise the UIDs could easily grow over 231 which some clients start treating as negative numbers, which then cause all kinds of problems. Also when 232 is exceeded, Dovecot will also start having some problems.
Content-Length is used as long as another valid mail starts after that many bytes. Because the byte count must be exact, it's quite unlikely that abusing it can cause messages to be skipped (or rather appended to the previous message's body).
Status and X-Status headers are trusted completely, so it's pretty good idea to filter them in LDA if possible.
Dovecot's Speed Optimizations
Updating messages' flags and keywords can be a slow operation since you may have to insert a new header (Status, X-Status, X-Keywords) or at least insert data in the header's value. Some mbox MUAs do this simply by rewriting all of the mbox after the inserted data. If the mbox is large, this can be very slow. Dovecot optimizes this by always leaving some space characters after some of its internal headers. It can use this space to move only minimal amount of data necessary to get the necessary data inserted. Also if data is removed, it just grows these spaces areas.
mbox_lazy_writes setting works by adding and/or updating Dovecot's metadata headers only after closing the mailbox or when messages are expunged from the mailbox. C-Client works the same way. The upside of this is that it reduces writes because multiple flag updates to same message can be grouped, and sometimes the writes don't have to be done at all if the whole message is expunged. The downside is that other processes don't notice the changes immediately (but other Dovecot processes do notice because the changes are in index files).
mbox_dirty_syncs setting tries to avoid re-reading the mbox every time something changes. Whenever the mbox changes (ie. timestamp or size), it first checks if the mailbox's size changed. If it didn't, it most likely meant that only message flags were changed so it does a full mbox read to find it. If the mailbox shrinked, it means that mails were expunged and again Dovecot does a full sync. Usually however the only thing besides Dovecot that modifies the mbox is the LDA which appends new mails to the mbox. So if the mbox size was grown, Dovecot first checks if the last known message is still where it was last time. If it is, Dovecot reads only the newly added messages and goes into a "dirty mode". As long as Dovecot is in dirty mode, it can't be certain that mails are where it expects them to be, so whenever accessing some mail, it first verifies that it really is the correct mail by finding its X-UID header. If the X-UID header is different, it fallbacks to a full sync to find the mail's correct position. The dirty mode goes away after a full sync. If mbox_lazy_writes was enabled and the mail didn't yet have X-UID header, Dovecot uses MD5 sum of a couple of headers to compare the mails.
mbox_very_dirty_syncs does the same as mbox_dirty_syncs, but the dirty state is kept also when opening the mailbox. Normally opening the mailbox does a full sync if it had been changed outside Dovecot.
In mboxes a new mail always begins with a "From " line, commonly referred to as From_-line. To avoid confusion, lines beginning with "From " in message bodies are usually prefixed with '>' character while the message is being written to in mbox.
Dovecot doesn't currently do this escaping however. Instead it prevents this confusion by adding Content-Length headers so it knows later where the next message begins. Dovecot doesn't either remove the '>' characters before sending the data to clients. Both of these will probably be implemented later.
There are a few minor variants of this format:
mboxo is the name of original mbox format originated with Unix System V. Messages are stored in a single file, with each message beginning with a line containing "From SENDER DATE". If "From" occurs at the beginning of a line anywhere in the email, it is escaped with a greater-than sign (>From).
mboxrd was named for Raul Dhesi in June 1995, though several people came up with the same idea around the same time. An issue with the mboxo format was that if the text ">From" appeared in the body of an email (such as from a reply quote), it was not possible to distinguish this from the mailbox format's ">From". mboxrd fixes this by always quoting ">From" lines as well, so readers can just remove the first ">" character. This format is used by qmail.
mboxcl format was originated with Unix System V Release 4 mail tools. It adds a Content-Length field which indicates the number of bytes in the message. This is used to determine message boundaries. It still quotes "From" as the original mboxo format does (and not as mboxrd does it).
mboxcl2 is like mboxcl but does away with the "From" quoting.
MMDF (Multi-channel Memorandum Distribution Facility mailbox format) was originated with the MMDF daemon. The format surrounds each message with lines containing four control-A's. This eliminates the need to escape From: lines.
Dovecot currently uses mboxcl2 format internally, but it's planned to move to combination of mboxrd and mboxcl.
[http://www.qmail.org/man/man5/mbox.html Qmail mbox]