=============
dm-log-writes
=============

This target takes two devices: one to pass all IO to normally, and one to log
all of the write operations to.  It is intended for file system developers who
want to verify the integrity of metadata or data as the file system is written
to.  A log_write_entry is written for every WRITE request, and the target can
take arbitrary data from userspace to insert into the log.  The data in each
WRITE request is copied into the log so that the replay happens exactly as it
happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the
log in a way that correlates to what is on disk rather than what is in cache,
which in turn makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures.  Consider the following example (W means write, C means complete)::

        W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

        W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

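The ordering rules above can be modelled in a few lines of bash.  This is only
a toy sketch for illustration, not the kernel implementation; the function name
simulate_log and the event tokens (borrowed from the example above) are
assumptions made for this sketch::

  # Toy model of the log ordering rules; not the kernel code.
  # Events: Wn = write n issued, Cn = write n completed,
  #         Wflush = flush issued, Cflush = flush completed.
  simulate_log() {
      local pending=() spliced=() log=() ev
      for ev in "$@"; do
          case "$ev" in
              Wflush)
                  # Flush issued: splice only the writes completed so far.
                  spliced=("${pending[@]}")
                  pending=()
                  ;;
              Cflush)
                  # Flush completed: log the spliced writes, then the flush.
                  log+=("${spliced[@]}" flush)
                  spliced=()
                  ;;
              C*)
                  # Write completed: queue it until the next flush is issued.
                  pending+=("W${ev#C}")
                  ;;
              W*)
                  # Write issued: nothing is logged yet.
                  ;;
          esac
      done
      echo "logged: ${log[*]}"
      echo "pending: ${pending[*]}"
  }

  simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush

For the sequence above this prints "logged: W3 W2 flush" followed by
"pending: W1".  W1 completed after the flush was issued, so it stays queued
until the next flush.
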
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise the log
would contain all the DISCARD requests first, then the WRITE requests, and then
the FLUSH request.  Consider the following example::

        WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

        DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

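   When the target is loaded with dmsetup, the constructor arguments follow
   the start sector and length (in 512-byte sectors) in the table line.  A
   minimal sketch of building such a line; the helper name log_writes_table
   and the sample values are hypothetical::

     # Print a device-mapper table line for a log-writes target.
     # $1 = length in 512-byte sectors, $2 = dev_path, $3 = log_dev_path
     log_writes_table() {
         local sectors=$1 dev=$2 log_dev=$3
         printf '0 %s log-writes %s %s\n' "$sectors" "$dev" "$log_dev"
     }

     # Sample values only; on a real system the length would typically
     # come from: blockdev --getsz <dev_path>
     log_writes_table 2097152 /dev/sdb /dev/sdc

   With the sample values this prints
   "0 2097152 log-writes /dev/sdb /dev/sdc", which could then be passed to
   dmsetup create via --table.
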
ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

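    The two fields can be pulled out of a status line with a short shell
    function.  This sketch assumes the line has the usual dmsetup layout of
    start sector, length and target name followed by the fields above; the
    function name parse_log_writes_status and the sample numbers are
    hypothetical::

      # Read one status line on stdin and print the log-writes fields.
      parse_log_writes_status() {
          local start length target entries highest
          read -r start length target entries highest
          # Reject lines that belong to a different target type.
          [ "$target" = "log-writes" ] || return 1
          echo "entries=$entries highest_sector=$highest"
      }

      # Sample line with made-up values.  Against a live device this would
      # be fed from: dmsetup status log
      echo "0 2097152 log-writes 1024 409600" | parse_log_writes_status

    With the sample line this prints "entries=1024 highest_sector=409600".
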
iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you're fsck'ing something reasonable.  You would do something
        like this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, running the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request and then run the fsck
command.  If the fsck passes, it replays to the next FUA, and so on, until the
replay is completed or the fsck command exits abnormally.