lib/tdb/docs/mutex.txt

   1 Tdb is a hashtable database with multiple concurrent writer and external
   2 record lock support. For speed reasons, wherever possible tdb uses a shared
   3 memory mapped area for data access. In its currently released form, it uses
   4 fcntl byte-range locks to coordinate access to the data itself.
   5
   6 The tdb data is organized as a hashtable. Hash collisions are dealt with by
   7 forming a linked list of records that share a hash value. The individual
   8 linked lists are protected across processes with 1-byte fcntl locks on the
   9 starting pointer of the linked list representing a hash value.
  10
  11 The external locking API of tdb allows one to lock individual records. Instead of
  12 really locking individual records, the tdb API locks a complete linked list
  13 with a fcntl lock.
  14
  15 The external locking API of tdb also allows one to lock the complete database, and
  16 ctdb uses this facility to freeze databases during a recovery. While the
  17 so-called allrecord lock is held, all linked lists and all individual records
  18 are frozen alltogether. Tdb achieves this by locking the complete file range
  19 with a single fcntl lock. Individual 1-byte locks for the linked lists
  20 conflict with this. Access to records is prevented by the one large fnctl byte
  21 range lock.
  22
  23 Fcntl locks have been chosen for tdb for two reasons: First they are portable
  24 across all current unixes. Secondly they provide auto-cleanup. If a process
  25 dies while holding a fcntl lock, the lock is given up as if it was explicitly
  26 unlocked. Thus fcntl locks provide a very robust locking scheme, if a process
  27 dies for any reason the database will not stay blocked until reboot. This
  28 robustness is very important for long-running services, a reboot is not an
  29 option for most users of tdb.
  30
  31 Unfortunately, during stress testing, fcntl locks have turned out to be a major
  32 problem for performance. The particular problem that was seen happens when
  33 ctdb on a busy server does a recovery. A recovery means that ctdb has to
  34 freeze all tdb databases for some time, usually a few seconds. This is done
  35 with the allrecord lock. During the recovery phase on a busy server many smbd
  36 processes try to access the tdb file with blocking fcntl calls. The specific
  37 test in question easily reproduces 7,000 processes piling up waiting for
  38 1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
  39 allrecord lock, covering the whole file range. All 7,000 processes waiting for
  40 1-byte fcntl locks are woken up, trying to acquire their lock. The special
  41 implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
  42 all fcntl lock operations with a single system-wide spinlock. If 7,000 process
  43 waiting for the allrecord lock to become released this leads to a thundering
  44 herd condition, all CPUs are spinning on that single spinlock.
  45
  46 Functionally the kernel is fine, eventually the thundering herd slows down and
  47 every process correctly gets his share and locking range, but the performance
  48 of the system while the herd is active is worse than expected.
  49
  50 The thundering herd is only the worst case scenario for fcntl lock use. The
  51 single spinlock for fcntl operations is also a performance penalty for normal
  52 operations. In the cluster case, every read and write SMB request has to do
  53 two fcntl calls to provide correct SMB mandatory locks. The single spinlock
  54 is one source of serialization for the SMB read/write requests, limiting the
  55 parallelism that can be achieved in a multi-core system.
  56
  57 While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
  58 locks to be a problem on Solaris as well. Ira pointed out that there is a
  59 potential alternative locking mechanism that might be more scalable: Process
  60 shared robust mutexes, as defined by Posix 2008 for example via
  61
  62 http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
  63 http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html
  64
  65 Pthread mutexes provide one of the core mechanisms in posix threads to protect
  66 in-process data structures from concurrent access by multiple threads. In the
  67 Linux implementation, a pthread_mutex_t is represented by a data structure in
  68 user space that requires no kernel calls in the uncontended case for locking
  69 and unlocking. Locking and unlocking in the uncontended case is implemented
  70 purely in user space with atomic CPU instructions and thus are very fast.
  71
  72 The setpshared functions indicate to the kernel that the mutex is about to be
  73 shared between processes in a common shared memory area.
  74
  75 The process shared posix mutexes have the potential to replace fcntl locking
  76 to coordinate mmap access for tdbs. However, they are missing the criticial
  77 auto-cleanup property that fcntl provides when a process dies. A process that
  78 dies hard while holding a shared mutex has no chance to clean up the protected
  79 data structures and unlock the shared mutex. Thus with a pure process shared
  80 mutex the mutex will remain locked forever until the data structures are
  81 re-initialized from scratch.
  82
  83 With the robust mutexes defined by Posix the process shared mutexes have been
  84 extended with a limited auto-cleanup property. If a mutex has been declared
  85 robust, when a process exits while holding that mutex, the next process trying
  86 to lock the mutex will get the special error message EOWNERDEAD. This informs
  87 the caller that the data structures the mutex protects are potentially corrupt
  88 and need to be cleaned up.
  89
  90 The error message EOWNERDEAD when trying to lock a mutex is an extension over
  91 the fcntl functionality. A process that does a blocking fcntl lock call is not
  92 informed about whether the lock was explicitly freed by a process still alive
  93 or due to an unplanned process exit. At the time of this writing (February
  94 2013), at least Linux and OpenSolaris also implement the robustness feature of
  95 process-shared mutexes.
  96
  97 Converting the tdb locking mechanism from fcntl to mutexes has to take care of
  98 both types of locks that are used on tdb files.
  99
 100 The easy part is to use mutexes to replace the 1-byte linked list locks
 101 covering the individual hashes. Those can be represented by a mutex each.
 102
 103 Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
 104 lock spanning all hash list locks simultaneously. This basic functionality is
 105 not easily possible with mutexes. A mutex carries 1 bit of information, a
 106 fcntl lock can carry an arbitrary amount of information.
 107
 108 In order to support the allrecord lock, we have an allrecord_lock variable
 109 protected by an allrecord_mutex. The coordination between the allrecord lock
 110 and the chainlocks works like this:
 111
 112 - Getting a chain lock works like this:
 113
 114   1. get chain mutex
 115   2. return success if allrecord_lock is F_UNLCK (not locked)
 116   3. return success if allrecord_lock is F_RDLCK (locked readonly)
 117      and we only need a read lock.
 118   4. release chain mutex
 119   5. wait for allrecord_mutex
 120   6. unlock allrecord_mutex
 121   7. goto 1.
 122
 123 - Getting the allrecord lock:
 124
 125   1. get the allrecord mutex
 126   2. return error if allrecord_lock is not F_UNLCK (it's locked)
 127   3. set allrecord_lock to the desired value.
 128   4. in a loop: lock(blocking) / unlock each chain mutex.
 129   5. return success.
 130
 131 - allrecord lock upgrade:
 132
 133   1. check we already have the allrecord lock with F_RDLCK.
 134   3. set allrecord_lock to F_WRLCK
 135   4. in a loop: lock(blocking) / unlock each chain mutex.
 136   5. return success.