doc/design/red-black

   1 Copyright (C) 2004  Internet Systems Consortium, Inc. ("ISC")
   2 Copyright (C) 1999-2001  Internet Software Consortium.
   3 See COPYRIGHT in the source root or http://isc.org/copyright.html for terms.
   4
   5 $Id: red-black,v 1.9 2004/03/05 05:04:46 marka Exp $
   6
   7                  Red-Black Tree Implementation Notes
   8
   9 OVERVIEW
  10
  11 BIND9's basic name storage mechanism is to use a modified form of
  12 balanced binary tree known as a red-black tree.  Red-black trees
  13 provide for relatively efficient storage, retrieval and removal of
  14 data while maintaining the lexical order of all stored keys, a
  15 necessary function for DNS security.
  16
  17 DESCRIPTION
  18
  19 A red-black tree is a balanced binary tree named for the coloring that
  20 is done in the tree, identifying each node as either red or black.
  21 There are two simple rules for maintaining the color of nodes:
  22   (1) A red node has only black children.
  23   (2) The path from the root to any leaf node always includes the
  24       same number of black nodes.
  25
  26 Whenever a key is added or removed, adjustments are made to adhere to
  27 those two rules.  These adjustments are relatively cheap to make but
  28 maintain the balance of the tree, thus making for efficient addition,
  29 lookup and deletion operations, all of which are O(log N).  The color
  30 of a node is not relevant to external users of the tree; it is needed
  31 only to maintain the balance of the tree.
  32
  33 For more information on basic red-black trees, see _Introduction to
  34 Algorithms_, Cormen, Leiserson, and Rivest, MIT Press / McGraw Hill,
  35 1990, ISBN 0-262-03141-8, chapter 14.
  36
  37 In BIND9, the red-black tree implementation uses DNS names as keys,
  38 and can store arbitrary data with each key value.  "name" and "key"
  39 are used interchangably in this document.
  40
  41 The basic red-black tree algorithm is further adapted for use in BIND9
  42 to incorporate the notion of hierarchy, creating a tree of red-black
  43 trees.  Where there is more than one name with a common suffix, all
  44 names with that suffix are stored in their own red-black tree, with a
  45 down pointer from the suffix locating the subtree.
  46
  47 For example, consider storing the following names:
  48    a       x.d.e.f     o.w.y.d.e.f
  49    b       z.d.e.f     p.w.y.d.e.f
  50    c       g.h         q.w.y.d.e.f
  51
  52 No matter which order the keys were added, this would result in a tree
  53 that can be visualized as:
  54
  55                                 b
  56                               /   \
  57                              a    d.e.f
  58                                    /|\
  59                                   c | g.h
  60                                     |
  61                                    w.y
  62                                    /|\
  63                                   x | z
  64                                     |
  65                                     p
  66                                    / \
  67                                   o   q
  68
  69 This tree shows that when there is no key for a particular label, and
  70 when there is only one known label for its immediate subordinate, then
  71 multiple labels can appear in a single node, such as at d.e.f and g.h.
  72 It also demonstrates that there can be more nodes in the tree of trees
  73 than there are actual keys (which degrades the O(log N) performance
  74 marginally); the nodes at d.e.f and w.y do not represent keys.
  75
  76 As an aside, remember that when ordering DNS names, labels are
  77 examined from the right, therefore w.y sorts after x and before z.
  78
  79 A split can occur not only on a regular label boundary, but also
  80 between any two bits in an EDNS bitstring label.  The common-suffix
  81 rules will be applied to keep as many bits together as possible.
  82
  83 In the current implementation of the tree of trees, a node is
  84 considered to "formally" exist only if it has data associated with
  85 it.  So if the above tree then had the key d.e.f added to it, the
  86 operation would succeed rather than getting an "already exists"
  87 error.
  88
  89 Along the same lines, if a key is added with a name which is a proper
  90 superdomain of the name stored in an existing node, the operation will
  91 succeed by splitting the existing node into one node that is the key
  92 and another node that is the remaining parts of the name.  Adding e.f
  93 to the above tree results in the top level red-black tree having a
  94 node named e.f where the current d.e.f is, and a down pointer from
  95 d.e.f to a "tree" of a single node named d.  The down pointer from d
  96 would be kept to the level which has x, w.y, and z.
  97
  98 A similar split of d.e.f would occur if the name k.e.f were added.
  99 The top level tree would have the node e.f with a down pointer to a
 100 level that had both d and k, and d would continue to have its down
 101 pointer to the x, w.y and z level.
 102
 103 It is guaranteed when splitting that external references to the node
 104 that is split will remain valid --- in the previous examples, anything
 105 that was pointing to the node that was d.e.f will still point to the
 106 node that is now just d.
 107
 108 When deleting keys, nodes can be rejoined.  If both of p.w.y.d.e.f and
 109 q.w.y.d.e.f were removed from the example tree, the node named w.y
 110 would become o.w.y.  Unlike splitting, it is _not_ guaranteed that
 111 external references remain consistent; sometimes they will, sometimes
 112 they won't.  Also, note that deletion is not perfectly symmetric with
 113 addition.  If you "undo" the last addition with a deletion of the same
 114 key then the tree of trees is not guaranteed to have exactly the same
 115 structure as it had prior to the addition.  Sometimes, but not always.
 116
 117 Rejoining does not happen if it would violate any of the rules that
 118 cause a split.  o would not be rejoined with w.y if w.y had data
 119 associated with the key; o would remain as a single node on its own
 120 level.  This emphasizes the rule that a node is considered to formally
 121 exist only if data is associated with it, because even if w.y.d.e.f
 122 had been explicitly added as a key but with no data, then o would
 123 still be merged with the w.y node when p and q were deleted.
 124
 125 Searching for a node generally returns one of three possible results:
 126 either the key is found, a superdomain (partial match) of the key is
 127 found, or no part of the key is found.  The first and last are rather
 128 obvious, and the second result basically means that a hierarchically
 129 enclosing name is found; e.g, searching for bb.rc.vix.com turned up
 130 rc.vix.com, but not the full name.
 131
 132 No locking is done within the RBT library.  @@@
 133
 134 CHAINS
 135
 136 @@@
 137
 138 When a partial match is made, level_matches is set while the chain
 139 points to the partial match node that was found.  Then the chain is
 140 adjusted to point to the DNSSEC predecessor node, which might not even
 141 be under the same top level domain as the name that was searched for.
 142 For example, consider a database that had only the names vix.com and
 143 isc.org.  A search for uu.net would leave the chain pointed to
 144 vix.com, the DNSSEC predecessor.  Though this might first appear to
 145 cause level_matches to be bogus because the chain has been unwound and
 146 sent down another path, note that the partial match node will always
 147 be in the chain of the predecessor, too --- and often the partial
 148 match node will be the predecessor itself.  In the vix.com/isc.org
 149 example, the search for uu.net finds a partial match at ".", which is
 150 of course also in the path to the vix.com predecessor.  A search for
 151 www.isc.org would find that isc.org is both the partial match and the
 152 predecessor.
 153
 154 EXTERNAL PROGRAMMATIC DETAILS
 155
 156 This section details the functions used to interact with the BIND9
 157 red-black tree library, or RBT for short.
 158
 159 A source file that will be using RBT will usually need to include
 160 <dns/rbt.h>.  This header file automatically includes <isc/result.h),
 161 <isc/mem.h>, <dns/types.h>, and <dns/name.h>.
 162
 163 The rbt.h file has more complete descriptions of each of the functions
 164 named here, including what is required for each argument, what each
 165 function ensures (and might not ensure) will occur, and the full range
 166 of possible results for each call.  Note well: if a function returns a
 167 dns_result_t rather than void, it definitely means there is something
 168 that can go possibly wrong in the function and it should be checked by
 169 the caller.
 170
 171 A new tree of trees must be initialized using:
 172
 173   dns_result_t dns_rbt_create(isc_mem_t *mctx, void (*deleter)(void *, void *),
 174                               void *deleter_arg, dns_rbt_t **rbtp);
 175
 176 The memory context, mctx, must be a non-null pointer that was
 177 initialized with isc_mem_create().  The deleter argument, if non-null,
 178 should point to a function that is responsible for cleaning up any
 179 memory associated with the data pointer of a node when the node is
 180 deleted.  It is passed the deleted node's data pointer as its first
 181 argument and deleter_arg as its second argument.
 182
 183 After initializing an RBT manager, to add keys to the tree, use:
 184
 185   dns_result_t dns_rbt_addname(dns_rbt_t *rbt, dns_name_t *name, void *data);
 186
 187 The name _must_ be an absolute name.  It is not required that the data
 188 pointer be non-null, but it is recommended that it point to something,
 189 even just invalid memory, because of the various searching and
 190 deletion issues described in the previous section.  The RBT code will
 191 not attempt to dereference the pointer.
 192
 193 To find a key in the tree, use:
 194
 195   dns_result_t dns_rbt_findname(dns_rbt_t *rbt, dns_name_t *name, void **data);
 196
 197 The data parameter must not be NULL, but *data must be NULL.  The
 198 result will be either DNS_R_SUCCESS, DNS_R_PARTIALMATCH or
 199 DNS_R_NOTFOUND.  In the first case, an exact match was found for the
 200 name and there was an associate data pointer, which is returned via
 201 the data parameter.  A partial match results when the name has not
 202 been found but a superdomain name, with data, does exist; then the
 203 data for that name is returned in the data parameter.  If no data is
 204 found for the name or a superdomain, *data will remain NULL.
 205
 206
 207 INTERNAL PROGRAMMATIC DETAILS
 208
 209 This section is mainly relevant to the RBT DB implementation.  It is
 210 highly recommended that programmers using the RBT library stick to the
 211 functions named in the previous section.
 212
 213 The dns_rbt_addname and dns_rbt_findname functions named in the
 214 previous section are wrappers around dns_rbt_addnode and
 215 dns_rbt_findnode.  The *node functions for the most part do not
 216 particularly care whether a node has an associated data pointer or
 217 not, whereas the *name functions do.  The one exception to this is
 218 that when a PARTIALMATCH is returned for a search, the indicated node
 219 is the deepest match that has data, rather than just the deepest
 220 match.  Even that behavior is selectable, however, using the boolean
 221 empty_data_ok argument to dns_rbt_findnode.
 222
 223 Each node in the tree of trees is represented by the following structure:
 224
 225   typedef struct dns_rbtnode {
 226           struct dns_rbtnode *left;
 227           struct dns_rbtnode *right;
 228           struct dns_rbtnode *down;
 229           /*
 230            * The following bitfields add up to a total bitwidth of 32.
 231            * The range of values necessary for each item is indicated,
 232            * but in the case of "attributes" the field is wider to accomodate
 233            * possible future expansion.  "offsetlen" could be one bit
 234            * narrower by always adjusting its value by 1 to find the real
 235            * offsetlen, but doing so does not gain anything (except perhaps
 236            * another bit for "attributes", which doesn't yet need any more).
 237            */
 238           unsigned int color:1;      /* range is 0..1 */
 239           unsigned int attributes:6; /* range is 0..2 */
 240           unsigned int namelen:8;    /* range is 1..255 */
 241           unsigned int offsetlen:8;  /* range is 1..128 */
 242           unsigned int padbytes:9;   /* range is 0..380 */
 243           /*
 244            * These values are used in the RBT DB implementation.  The
 245            * appropriate node lock must be held before accessing them.
 246            */
 247           void *data;
 248           unsigned int dirty:1;
 249           unsigned int locknum:DNS_RBT_LOCKLENGTH;
 250           unsigned int references:DNS_RBT_REFLENGTH;
 251   } dns_rbtnode_t;
 252
 253 @@@