Documentation/filesystems/vfs.txt

   1
   2               Overview of the Linux Virtual File System
   3
   4         Original author: Richard Gooch <rgooch@atnf.csiro.au>
   5
   6                   Last updated on August 25, 2005
   7
   8   Copyright (C) 1999 Richard Gooch
   9   Copyright (C) 2005 Pekka Enberg
  10
  11   This file is released under the GPLv2.
  12
  13
  14 What is it?
  15 ===========
  16
  17 The Virtual File System (otherwise known as the Virtual Filesystem
  18 Switch) is the software layer in the kernel that provides the
  19 filesystem interface to userspace programs. It also provides an
  20 abstraction within the kernel which allows different filesystem
  21 implementations to coexist.
  22
  23
  24 A Quick Look At How It Works
  25 ============================
  26
  27 In this section I'll briefly describe how things work, before
  28 launching into the details. I'll start by describing what happens
  29 when user programs open and manipulate files, and then look at the
  30 other side: how a filesystem is registered and subsequently
  31 mounted.
  32
  33
  34 Opening a File
  35 --------------
  36
  37 The VFS implements the open(2), stat(2), chmod(2) and similar system
  38 calls. The pathname argument is used by the VFS to search through the
  39 directory entry cache (dentry cache or "dcache"). This provides a very
  40 fast look-up mechanism to translate a pathname (filename) into a
  41 specific dentry.
  42
  43 An individual dentry usually has a pointer to an inode. Inodes are the
  44 things that live on disc drives, and can be regular files (you know:
  45 those things that you write data into), directories, FIFOs and other
  46 beasts. Dentries live in RAM and are never saved to disc: they exist
  47 only for performance. Inodes live on disc and are copied into memory
  48 when required. Later any changes are written back to disc. The inode
  49 that lives in RAM is a VFS inode, and it is this which the dentry
  50 points to. A single inode can be pointed to by multiple dentries
  51 (think about hardlinks).
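
The "several dentries, one inode" relationship can be observed from
userspace. The following is a minimal sketch, not kernel code: the
helper name and the scratch file names are invented for illustration,
and only open(2), link(2), stat(2) and unlink(2) are real interfaces.
It assumes the directory passed in is writable and lives on a
filesystem that supports hard links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any kernel or libc API): create a
 * scratch file in `dir`, hard-link it under a second name, and report
 * whether both names resolve to the same inode.
 * Returns 1 if they do, 0 if they do not, -1 on any system call error.
 */
int hardlink_shares_inode(const char *dir)
{
        char a[512], b[512];
        struct stat sa, sb;
        int fd, result = -1;

        /* The pid keeps the scratch names unique per process. */
        snprintf(a, sizeof(a), "%s/vfs-demo-%ld-a", dir, (long)getpid());
        snprintf(b, sizeof(b), "%s/vfs-demo-%ld-b", dir, (long)getpid());

        fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0)
                return -1;
        close(fd);

        /* A second directory entry (name) for the same on-disc inode. */
        if (link(a, b) == 0) {
                if (stat(a, &sa) == 0 && stat(b, &sb) == 0)
                        result = (sa.st_dev == sb.st_dev &&
                                  sa.st_ino == sb.st_ino &&
                                  sa.st_nlink == 2);
                unlink(b);
        }
        unlink(a);
        return result;
}
```

Both names report the same device and inode number, and the link count
of 2 is the on-disc record that two directory entries point at it.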
  52
  53 The dcache is meant to be a view into your entire filespace. Unlike
  54 Linus, most of us losers can't fit enough dentries into RAM to cover
  55 all of our filespace, so the dcache has bits missing. In order to
  56 resolve your pathname into a dentry, the VFS may have to resort to
  57 creating dentries along the way, and then loading the inode. This is
  58 done by looking up the inode.
  59
  60 Looking up an inode (usually read from disc) requires that the VFS
  61 call the lookup() method of the parent directory inode. This method
  62 is installed by the specific filesystem implementation that the inode
  63 lives in. There will be more on this later.
  64
  65 Once the VFS has the required dentry (and hence the inode), we can do
  66 all those boring things like open(2) the file, or stat(2) it to peek
  67 at the inode data. The stat(2) operation is fairly simple: once the
  68 VFS has the dentry, it peeks at the inode data and passes some of it
  69 back to userspace.
  70
  71 Opening a file requires another operation: allocation of a file
  72 structure (this is the kernel-side implementation of file
  73 descriptors). The freshly allocated file structure is initialized with
  74 a pointer to the dentry and a set of file operation member functions.
  75 These are taken from the inode data. The open() file method is then
  76 called so the specific filesystem implementation can do its work. You
  77 can see that this is another switch performed by the VFS.
  78
  79 The file structure is placed into the file descriptor table for the
  80 process.
  81
  82 Reading, writing and closing files (and other assorted VFS operations)
  83 are done by using the userspace file descriptor to grab the appropriate
  84 file structure, and then calling the required file structure method
  85 function to do whatever is required.
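
The path from a file descriptor to a filesystem method can be modelled
in ordinary userspace C. This is only a toy under stated assumptions:
every toy_ and memfs_ name is invented here, the descriptor table has
a fixed size, and none of the reference counting or locking of the
real VFS is shown.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_MAX_FDS 8

struct toy_file;

/* Per-filesystem method table, loosely modelled on struct file_operations. */
struct toy_file_operations {
        ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Kernel-side object behind a file descriptor. */
struct toy_file {
        const struct toy_file_operations *f_op;
        const char *data;       /* stands in for the file contents */
        size_t pos;             /* current file position */
};

/* Per-process descriptor table: index = fd, value = file structure. */
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* One "filesystem" implementation of read(): copy from a string. */
ssize_t memfs_read(struct toy_file *f, char *buf, size_t len)
{
        size_t left = strlen(f->data) - f->pos;

        if (len > left)
                len = left;
        memcpy(buf, f->data + f->pos, len);
        f->pos += len;
        return (ssize_t)len;
}

const struct toy_file_operations memfs_fops = { .read = memfs_read };

/* Place a file structure in the first free slot; return the fd or -1. */
int toy_fd_install(struct toy_file *f)
{
        int fd;

        for (fd = 0; fd < TOY_MAX_FDS; fd++) {
                if (!toy_fd_table[fd]) {
                        toy_fd_table[fd] = f;
                        return fd;
                }
        }
        return -1;
}

/* The "switch": map fd -> file structure -> filesystem method. */
ssize_t toy_read(int fd, char *buf, size_t len)
{
        struct toy_file *f;

        if (fd < 0 || fd >= TOY_MAX_FDS || !(f = toy_fd_table[fd]))
                return -1;      /* bad descriptor */
        if (!f->f_op || !f->f_op->read)
                return -1;      /* this file cannot be read */
        return f->f_op->read(f, buf, len);
}
```

The caller of toy_read() never names the filesystem; the choice was
made once, when f_op was set at open time.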
  86
  87 For as long as the file is open, it keeps the dentry "open" (in use),
  88 which in turn means that the VFS inode is still in use.
  89
  90 All VFS system calls (open(2), stat(2), read(2), write(2),
  91 chmod(2) and so on) are called from a process context. You should
  92 assume that these calls are made without any kernel locks being
  93 held. This means that several processes may be executing the same piece of
  94 filesystem or driver code at the same time, on different
  95 processors. You should ensure that access to shared resources is
  96 protected by appropriate locks.
  97
  98
  99 Registering and Mounting a Filesystem
 100 -------------------------------------
 101
 102 If you want to support a new kind of filesystem in the kernel, all you
 103 need to do is call register_filesystem(). You pass a structure
 104 describing the filesystem implementation (struct file_system_type)
 105 which is then added to an internal table of supported filesystems. You
 106 can do:
 107
 108 % cat /proc/filesystems
 109
 110 to see what filesystems are currently available on your system.
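
The "internal table of supported filesystems" can be pictured as a
singly linked list threaded through the "next" member. The following
userspace sketch is a model only: the toy_ names are invented, the
structure keeps just two members, and the error codes are this
sketch's choice rather than a statement about the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for struct file_system_type: only name and next. */
struct toy_fs_type {
        const char *name;
        struct toy_fs_type *next;       /* must be NULL before registering */
};

/* Head of the singly linked table of registered filesystem types. */
static struct toy_fs_type *toy_file_systems;

/* Return the address of the link that holds (or would hold) `name`. */
struct toy_fs_type **toy_find_filesystem(const char *name)
{
        struct toy_fs_type **p;

        for (p = &toy_file_systems; *p; p = &(*p)->next)
                if (strcmp((*p)->name, name) == 0)
                        break;
        return p;
}

/* Append a type to the table; refuse duplicates and linked entries. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
        struct toy_fs_type **p;

        if (fs->next)
                return -EBUSY;
        p = toy_find_filesystem(fs->name);
        if (*p)
                return -EBUSY;
        *p = fs;
        return 0;
}

/* Unlink a type from the table; -EINVAL if it was never registered. */
int toy_unregister_filesystem(struct toy_fs_type *fs)
{
        struct toy_fs_type **p;

        for (p = &toy_file_systems; *p; p = &(*p)->next) {
                if (*p == fs) {
                        *p = fs->next;
                        fs->next = NULL;
                        return 0;
                }
        }
        return -EINVAL;
}
```

Walking the same list and printing each name is, in spirit, what
reading /proc/filesystems shows.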
 111
 112 When a request is made to mount a block device onto a directory in
 113 your filespace the VFS will call the appropriate method for the
 114 specific filesystem. The dentry for the mount point will then be
 115 updated to point to the root inode for the new filesystem.
 116
 117 It's now time to look at things in more detail.
 118
 119
 120 struct file_system_type
 121 =======================
 122
 123 This describes the filesystem. As of kernel 2.6.13, the following
 124 members are defined:
 125
 126 struct file_system_type {
 127         const char *name;
 128         int fs_flags;
 129         struct super_block *(*get_sb) (struct file_system_type *, int,
 130                                        const char *, void *);
 131         void (*kill_sb) (struct super_block *);
 132         struct module *owner;
 133         struct file_system_type * next;
 134         struct list_head fs_supers;
 135 };
 136
 137   name: the name of the filesystem type, such as "ext2", "iso9660",
 138         "msdos" and so on
 139
 140   fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
 141
 142   get_sb: the method to call when a new instance of this
 143         filesystem should be mounted
 144
 145   kill_sb: the method to call when an instance of this filesystem
 146         should be unmounted
 147
 148   owner: for internal VFS use: you should initialize this to THIS_MODULE in
 149         most cases.
 150
 151   next: for internal VFS use: you should initialize this to NULL
 152
 153 The get_sb() method has the following arguments:
 154
 155   struct file_system_type *fs_type: describes the filesystem, partly
 156         initialized by the specific filesystem code
 158
 159   int flags: mount flags
 160
 161   const char *dev_name: the device name we are mounting.
 162
 163   void *data: arbitrary mount options, usually comes as an ASCII
 164         string
 167
 168 The get_sb() method must determine if the block device specified
 169 by dev_name contains a filesystem of the type the method
 170 supports. On success the method returns the superblock pointer; on
 171 failure it returns an error pointer (see ERR_PTR()).
 172
 173 The most interesting member of the superblock structure that the
 174 get_sb() method fills in is the "s_op" field. This is a pointer to
 175 a "struct super_operations" which describes the next level of the
 176 filesystem implementation.
 177
 178 Usually, a filesystem uses one of the generic get_sb()
 179 implementations and provides a fill_super() method instead. The
 180 generic methods are:
 181
 182   get_sb_bdev: mount a filesystem residing on a block device
 183
 184   get_sb_nodev: mount a filesystem that is not backed by a device
 185
 186   get_sb_single: mount a filesystem which shares the instance between
 187         all mounts
 188
 189 A fill_super() method implementation has the following arguments:
 190
 191   struct super_block *sb: the superblock structure. The method fill_super()
 192         must initialize this properly.
 193
 194   void *data: arbitrary mount options, usually comes as an ASCII
 195         string
 196
 197   int silent: whether or not to be silent on error
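
To make the role of "data" and "silent" concrete, here is a userspace
sketch of a fill_super()-style routine. It is a toy under stated
assumptions: struct toy_super_block, the "ro" and "blocksize=" option
names and the default block size are all invented for illustration,
and a real fill_super() does far more (reading the on-disc superblock,
setting s_op, allocating the root dentry).

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the parts of struct super_block a filesystem fills in. */
struct toy_super_block {
        unsigned long s_blocksize;
        int s_readonly;
};

/*
 * Toy fill_super(): `data` is a comma-separated ASCII option string such
 * as "ro,blocksize=1024" (both option names are invented for this sketch).
 * Returns 0 on success or -EINVAL on an unknown or malformed option; it
 * only prints a diagnostic when `silent` is zero.
 */
int toy_fill_super(struct toy_super_block *sb, void *data, int silent)
{
        const char *p = data;

        sb->s_blocksize = 4096;         /* defaults */
        sb->s_readonly = 0;

        while (p && *p) {
                size_t len = strcspn(p, ",");   /* length of this option */

                if (len == 2 && strncmp(p, "ro", 2) == 0) {
                        sb->s_readonly = 1;
                } else if (len > 10 && strncmp(p, "blocksize=", 10) == 0) {
                        char *end;
                        unsigned long v = strtoul(p + 10, &end, 10);

                        /* Require digits only, up to the end of the option. */
                        if (!isdigit((unsigned char)p[10]) ||
                            end != p + len || v == 0)
                                goto bad;
                        sb->s_blocksize = v;
                } else if (len != 0) {
                        goto bad;
                }
                p += len;
                if (*p == ',')
                        p++;
        }
        return 0;
bad:
        if (!silent)
                fprintf(stderr, "toyfs: bad mount option near \"%s\"\n", p);
        return -EINVAL;
}
```

Passing silent=1 suppresses the message but not the error return,
which is the distinction the "silent" argument exists to make.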
 198
 199
 200 struct super_operations
 201 =======================
 202
 203 This describes how the VFS can manipulate the superblock of your
 204 filesystem. As of kernel 2.6.13, the following members are defined:
 205
 206 struct super_operations {
 207         struct inode *(*alloc_inode)(struct super_block *sb);
 208         void (*destroy_inode)(struct inode *);
 209
 210         void (*read_inode) (struct inode *);
 211
 212         void (*dirty_inode) (struct inode *);
 213         int (*write_inode) (struct inode *, int);
 214         void (*put_inode) (struct inode *);
 215         void (*drop_inode) (struct inode *);
 216         void (*delete_inode) (struct inode *);
 217         void (*put_super) (struct super_block *);
 218         void (*write_super) (struct super_block *);
 219         int (*sync_fs)(struct super_block *sb, int wait);
 220         void (*write_super_lockfs) (struct super_block *);
 221         void (*unlockfs) (struct super_block *);
 222         int (*statfs) (struct super_block *, struct kstatfs *);
 223         int (*remount_fs) (struct super_block *, int *, char *);
 224         void (*clear_inode) (struct inode *);
 225         void (*umount_begin) (struct super_block *);
 226
 227         void (*sync_inodes) (struct super_block *sb,
 228                                 struct writeback_control *wbc);
 229         int (*show_options)(struct seq_file *, struct vfsmount *);
 230
 231         ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
 232         ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 233 };
 234
 235 All methods are called without any locks being held, unless otherwise
 236 noted. This means that most methods can block safely. All methods are
 237 only called from a process context (i.e. not from an interrupt handler
 238 or bottom half).
 239
 240   alloc_inode: this method is called by alloc_inode() to allocate memory
 241         for struct inode and initialize it.
 242
 243   destroy_inode: this method is called by destroy_inode() to release
 244         resources allocated for struct inode.
 245
 246   read_inode: this method is called to read a specific inode from the
 247         mounted filesystem.  The i_ino member in the struct inode is
 248         initialized by the VFS to indicate which inode to read. Other
 249         members are filled in by this method.
 250
 251         You can set this to NULL and use iget5_locked() instead of iget()
 252         to read inodes.  This is necessary for filesystems for which the
 253         inode number is not sufficient to identify an inode.
 254
 255   dirty_inode: this method is called by the VFS to mark an inode dirty.
 256
 257   write_inode: this method is called when the VFS needs to write an
 258         inode to disc.  The second parameter indicates whether the write
 259         should be synchronous or not; not all filesystems check this flag.
 260
 261   put_inode: called when the VFS inode is removed from the inode
 262         cache.
 263
 264   drop_inode: called when the last access to the inode is dropped,
 265         with the inode_lock spinlock held.
 266
 267         This method should be either NULL (normal UNIX filesystem
 268         semantics) or "generic_delete_inode" (for filesystems that do not
 269         want to cache inodes - causing "delete_inode" to always be
 270         called regardless of the value of i_nlink)
 271
 272         The "generic_delete_inode()" behavior is equivalent to the
 273         old practice of using "force_delete" in the put_inode() case,
 274         but does not have the races that the "force_delete()" approach
 275         had.
 276
 277   delete_inode: called when the VFS wants to delete an inode
 278
 279   put_super: called when the VFS wishes to free the superblock
 280         (i.e. unmount). This is called with the superblock lock held
 281
 282   write_super: called when the VFS superblock needs to be written to
 283         disc. This method is optional
 284
 285   sync_fs: called when VFS is writing out all dirty data associated with
 286         a superblock. The second parameter indicates whether the method
 287         should wait until the write out has been completed. Optional.
 288
 289   write_super_lockfs: called when VFS is locking a filesystem and forcing
 290         it into a consistent state.  This function is currently used by the
 291         Logical Volume Manager (LVM).
 292
 293   unlockfs: called when VFS is unlocking a filesystem and making it writable
 294         again.
 295
 296   statfs: called when the VFS needs to get filesystem statistics. This
 297         is called with the kernel lock held
 298
 299   remount_fs: called when the filesystem is remounted. This is called
 300         with the kernel lock held
 301
 302   clear_inode: called when the VFS clears the inode. Optional
 303
 304   umount_begin: called when the VFS is unmounting a filesystem.
 305
 306   sync_inodes: called when the VFS is writing out dirty data associated with
 307         a superblock.
 308
 309   show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
 310
 311   quota_read: called by the VFS to read from filesystem quota file.
 312
 313   quota_write: called by the VFS to write to filesystem quota file.
 314
 315 The read_inode() method is responsible for filling in the "i_op"
 316 field. This is a pointer to a "struct inode_operations" which
 317 describes the methods that can be performed on individual inodes.
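
The drop_inode rule described in the list above (NULL for normal UNIX
semantics, or a "generic_delete_inode"-style method to always delete)
can be sketched as a userspace model. All toy_ names are invented, the
inode carries only the fields needed for the demonstration, and the
inode_lock spinlock is left out entirely.

```c
#include <stddef.h>

struct toy_inode;

struct toy_super_operations {
        void (*drop_inode)(struct toy_inode *);
};

struct toy_inode {
        const struct toy_super_operations *s_op;
        unsigned int i_count;   /* in-memory references */
        unsigned int i_nlink;   /* on-disc link count */
        int deleted;            /* set once delete_inode would have run */
        int cached;             /* set if kept in the inode cache */
};

/* Stand-in for generic_delete_inode(): always get rid of the inode. */
void toy_generic_delete_inode(struct toy_inode *inode)
{
        inode->deleted = 1;
        inode->cached = 0;
}

/* Default policy used when drop_inode is NULL (normal UNIX semantics). */
void toy_generic_drop_inode(struct toy_inode *inode)
{
        if (inode->i_nlink == 0)
                toy_generic_delete_inode(inode);
        else
                inode->cached = 1;      /* keep it around for reuse */
}

/* Drop one reference; on the last one, apply the filesystem's policy. */
void toy_iput(struct toy_inode *inode)
{
        void (*drop)(struct toy_inode *) = toy_generic_drop_inode;

        if (--inode->i_count != 0)
                return;
        if (inode->s_op && inode->s_op->drop_inode)
                drop = inode->s_op->drop_inode;
        drop(inode);
}
```

A filesystem that sets drop_inode to the delete routine never caches
an inode after its last user goes away, whatever i_nlink says.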
 318
 319
 320 struct inode_operations
 321 =======================
 322
 323 This describes how the VFS can manipulate an inode in your
 324 filesystem. As of kernel 2.6.13, the following members are defined:
 325
 326 struct inode_operations {
 327         int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 328         struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
 329         int (*link) (struct dentry *,struct inode *,struct dentry *);
 330         int (*unlink) (struct inode *,struct dentry *);
 331         int (*symlink) (struct inode *,struct dentry *,const char *);
 332         int (*mkdir) (struct inode *,struct dentry *,int);
 333         int (*rmdir) (struct inode *,struct dentry *);
 334         int (*mknod) (struct inode *,struct dentry *,int,dev_t);
 335         int (*rename) (struct inode *, struct dentry *,
 336                         struct inode *, struct dentry *);
 337         int (*readlink) (struct dentry *, char __user *,int);
 338         void * (*follow_link) (struct dentry *, struct nameidata *);
 339         void (*put_link) (struct dentry *, struct nameidata *, void *);
 340         void (*truncate) (struct inode *);
 341         int (*permission) (struct inode *, int, struct nameidata *);
 342         int (*setattr) (struct dentry *, struct iattr *);
 343         int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
 344         int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
 345         ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
 346         ssize_t (*listxattr) (struct dentry *, char *, size_t);
 347         int (*removexattr) (struct dentry *, const char *);
 348 };
 349
 350 Again, all methods are called without any locks being held, unless
 351 otherwise noted.
 352
 353   create: called by the open(2) and creat(2) system calls. Only
 354         required if you want to support regular files. The dentry you
 355         get should not have an inode (i.e. it should be a negative
 356         dentry). Here you will probably call d_instantiate() with the
 357         dentry and the newly created inode
 358
 359   lookup: called when the VFS needs to look up an inode in a parent
 360         directory. The name to look for is found in the dentry. This
 361         method must call d_add() to insert the found inode into the
 362         dentry. The "i_count" field in the inode structure should be
 363         incremented. If the named inode does not exist a NULL inode
 364         should be inserted into the dentry (this is called a negative
 365         dentry). Returning an error code from this routine must only
 366         be done on a real error, otherwise creating inodes with system
 367         calls like creat(2), mknod(2), mkdir(2) and so on will fail.
 368         If you wish to overload the dentry methods then you should
 369         initialise the "d_op" field in the dentry; this is a pointer
 370         to a struct "dentry_operations".
 371         This method is called with the directory inode semaphore held
 372
 373   link: called by the link(2) system call. Only required if you want
 374         to support hard links. You will probably need to call
 375         d_instantiate() just as you would in the create() method
 376
 377   unlink: called by the unlink(2) system call. Only required if you
 378         want to support deleting inodes
 379
 380   symlink: called by the symlink(2) system call. Only required if you
 381         want to support symlinks. You will probably need to call
 382         d_instantiate() just as you would in the create() method
 383
 384   mkdir: called by the mkdir(2) system call. Only required if you want
 385         to support creating subdirectories. You will probably need to
 386         call d_instantiate() just as you would in the create() method
 387
 388   rmdir: called by the rmdir(2) system call. Only required if you want
 389         to support deleting subdirectories
 390
 391   mknod: called by the mknod(2) system call to create a device (char,
 392         block) inode or a named pipe (FIFO) or socket. Only required
 393         if you want to support creating these types of inodes. You
 394         will probably need to call d_instantiate() just as you would
 395         in the create() method
 396
 397   readlink: called by the readlink(2) system call. Only required if
 398         you want to support reading symbolic links
 399
 400   follow_link: called by the VFS to follow a symbolic link to the
 401         inode it points to.  Only required if you want to support
 402         symbolic links.  This function returns a void pointer cookie
 403         that is passed to put_link().
 404
 405   put_link: called by the VFS to release resources allocated by
 406         follow_link().  The cookie returned by follow_link() is passed to
 407         this function as the last parameter.  It is used by filesystems
 408         such as NFS where the page cache is not stable (i.e. the page that was
 409         installed when the symbolic link walk started might not be in the
 410         page cache at the end of the walk).
 411
 412   truncate: called by the VFS to change the size of a file.  The i_size
 413         field of the inode is set to the desired size by the VFS before
 414         this function is called.  This function is called by the truncate(2)
 415         system call and related functionality.
 416
 417   permission: called by the VFS to check for access rights on a POSIX-like
 418         filesystem.
 419
 420   setattr: called by the VFS to set attributes for a file.  This function is
 421         called by chmod(2) and related system calls.
 422
 423   getattr: called by the VFS to get attributes of a file.  This function is
 424         called by stat(2) and related system calls.
 425
 426   setxattr: called by the VFS to set an extended attribute for a file.
 427         An extended attribute is a name:value pair associated with an inode.
 428         This function is called by the setxattr(2) system call.
 429
 430   getxattr: called by the VFS to retrieve the value of an extended attribute
 431         name.  This function is called by the getxattr(2) system call.
 432
 433   listxattr: called by the VFS to list all extended attributes for a given
 434         file.  This function is called by the listxattr(2) system call.
 435
 436   removexattr: called by the VFS to remove an extended attribute from a file.
 437         This function is called by the removexattr(2) system call.
 438
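The contract of the lookup() method, in particular the difference
between "name not found" and "real error", is easy to get wrong, so
here is a userspace model of it. Everything is invented for the
sketch: the toy_ types, the fixed directory table and the io_error
switch. Unlike the real method, which returns a struct dentry pointer,
this toy returns a plain int to keep the example short.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_inode {
        unsigned long i_ino;
        unsigned int i_count;
};

struct toy_dentry {
        const char *d_name;             /* name to look for */
        struct toy_inode *d_inode;      /* NULL means negative dentry */
        int d_hashed;                   /* set once d_add() has run */
};

/* Toy directory: a fixed table of names and their inodes. */
struct toy_dir_entry {
        const char *name;
        struct toy_inode *inode;
};

struct toy_dir {
        const struct toy_dir_entry *entries;
        size_t nr_entries;
        int io_error;           /* simulate a failing disc read */
};

/* Stand-in for d_add(): hash the dentry and attach the inode (or NULL). */
void toy_d_add(struct toy_dentry *dentry, struct toy_inode *inode)
{
        dentry->d_inode = inode;
        dentry->d_hashed = 1;
}

/*
 * Toy lookup(): returns 0 whether or not the name exists, and a negative
 * errno only for a real error.
 */
int toy_lookup(const struct toy_dir *dir, struct toy_dentry *dentry)
{
        struct toy_inode *inode = NULL;
        size_t i;

        if (dir->io_error)
                return -EIO;    /* real error: report it */

        for (i = 0; i < dir->nr_entries; i++) {
                if (strcmp(dir->entries[i].name, dentry->d_name) == 0) {
                        inode = dir->entries[i].inode;
                        inode->i_count++;       /* dentry now pins the inode */
                        break;
                }
        }
        toy_d_add(dentry, inode);       /* NULL inode: negative dentry */
        return 0;
}
```

A missing name still succeeds and leaves a hashed negative dentry
behind, which is what lets a later create() or mkdir() reuse it.
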
 439
 440 struct address_space_operations
 441 ===============================
 442
 443 This describes how the VFS can manipulate the mapping of a file to the page cache in
 444 your filesystem. As of kernel 2.6.13, the following members are defined:
 445
 446 struct address_space_operations {
 447         int (*writepage)(struct page *page, struct writeback_control *wbc);
 448         int (*readpage)(struct file *, struct page *);
 449         int (*sync_page)(struct page *);
 450         int (*writepages)(struct address_space *, struct writeback_control *);
 451         int (*set_page_dirty)(struct page *page);
 452         int (*readpages)(struct file *filp, struct address_space *mapping,
 453                         struct list_head *pages, unsigned nr_pages);
 454         int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 455         int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
 456         sector_t (*bmap)(struct address_space *, sector_t);
 457         int (*invalidatepage) (struct page *, unsigned long);
 458         int (*releasepage) (struct page *, int);
 459         ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 460                         loff_t offset, unsigned long nr_segs);
 461         struct page* (*get_xip_page)(struct address_space *, sector_t,
 462                         int);
 463 };
 464
 465   writepage: called by the VM to write a dirty page to backing store.
 466
 467   readpage: called by the VM to read a page from backing store.
 468
 469   sync_page: called by the VM to notify the backing store to perform all
 470         queued I/O operations for a page. I/O operations for other pages
 471         associated with this address_space object may also be performed.
 472
 473   writepages: called by the VM to write out pages associated with the
 474         address_space object.
 475
 476   set_page_dirty: called by the VM to set a page dirty.
 477
 478   readpages: called by the VM to read pages associated with the address_space
 479         object.
 480
 481   prepare_write: called by the generic write path in VM to set up a write
 482         request for a page.
 483
 484   commit_write: called by the generic write path in VM to write page to
 485         its backing store.
 486
 487   bmap: called by the VFS to map a logical block offset within an object
 488         to a physical block number. This method is used by the legacy FIBMAP
 489         ioctl. Other uses are discouraged.
 490
 491   invalidatepage: called by the VM on truncate to disassociate a page from its
 492         address_space mapping.
 493
 494   releasepage: called by the VFS to release filesystem specific metadata from
 495         a page.
 496
 497   direct_IO: called by the VM for direct I/O writes and reads.
 498
 499   get_xip_page: called by the VM to translate a block number to a page.
 500         The page is valid until the corresponding filesystem is unmounted.
 501         Filesystems that want to use execute-in-place (XIP) need to implement
 502         it.  An example implementation can be found in fs/ext2/xip.c.
 503
 504
 505 struct file_operations
 506 ======================
 507
 508 This describes how the VFS can manipulate an open file. As of kernel
 509 2.6.13, the following members are defined:
 510
 511 struct file_operations {
 512         loff_t (*llseek) (struct file *, loff_t, int);
 513         ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
 514         ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 515         ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 516         ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
 517         int (*readdir) (struct file *, void *, filldir_t);
 518         unsigned int (*poll) (struct file *, struct poll_table_struct *);
 519         int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 520         long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 521         long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 522         int (*mmap) (struct file *, struct vm_area_struct *);
 523         int (*open) (struct inode *, struct file *);
 524         int (*flush) (struct file *);
 525         int (*release) (struct inode *, struct file *);
 526         int (*fsync) (struct file *, struct dentry *, int datasync);
 527         int (*aio_fsync) (struct kiocb *, int datasync);
 528         int (*fasync) (int, struct file *, int);
 529         int (*lock) (struct file *, int, struct file_lock *);
 530         ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
 531         ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
 532         ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
 533         ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
 534         unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 535         int (*check_flags)(int);
 536         int (*dir_notify)(struct file *filp, unsigned long arg);
 537         int (*flock) (struct file *, int, struct file_lock *);
 538 };
 539
 540 Again, all methods are called without any locks being held, unless
 541 otherwise noted.
 542
 543   llseek: called when the VFS needs to move the file position index
 544
 545   read: called by read(2) and related system calls
 546
 547   aio_read: called by io_submit(2) and other asynchronous I/O operations
 548
 549   write: called by write(2) and related system calls
 550
 551   aio_write: called by io_submit(2) and other asynchronous I/O operations
 552
 553   readdir: called when the VFS needs to read the directory contents
 554
 555   poll: called by the VFS when a process wants to check if there is
 556         activity on this file and (optionally) go to sleep until there
 557         is activity. Called by the select(2) and poll(2) system calls
 558
 559   ioctl: called by the ioctl(2) system call
 560
 561   unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
 562         require the BKL should use this method instead of the ioctl() above.
 563
 564   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
 565          are used on 64 bit kernels.
 566
 567   mmap: called by the mmap(2) system call
 568
 569   open: called by the VFS when an inode should be opened. When the VFS
 570         opens a file, it creates a new "struct file". It then calls the
 571         open method for the newly allocated file structure. You might
 572         think that the open method really belongs in
 573         "struct inode_operations", and you may be right. I think it's
 574         done the way it is because it makes filesystems simpler to
 575         implement. The open() method is a good place to initialize the
 576         "private_data" member in the file structure if you want to point
 577         to a device structure
 578
 579   flush: called by the close(2) system call to flush a file
 580
 581   release: called when the last reference to an open file is closed
 582
 583   fsync: called by the fsync(2) system call
 584
 585   fasync: called by the fcntl(2) system call when asynchronous
 586         (non-blocking) mode is enabled for a file
 587
 588   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
 589         commands
 590
 591   readv: called by the readv(2) system call
 592
 593   writev: called by the writev(2) system call
 594
 595   sendfile: called by the sendfile(2) system call
 596
 597   get_unmapped_area: called by the mmap(2) system call
 598
 599   check_flags: called by the fcntl(2) system call for F_SETFL command
 600
 601   dir_notify: called by the fcntl(2) system call for F_NOTIFY command
 602
 603   flock: called by the flock(2) system call
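
The relationship between ioctl() and unlocked_ioctl() in the list
above can be sketched as a dispatch routine. This is a userspace model
with invented toy_ names: the Big Kernel Lock is reduced to a counter,
the inode argument of the real ioctl() method is dropped, and the
-ENOTTY fallback is this sketch's assumption.

```c
#include <errno.h>
#include <stddef.h>

struct toy_file;

struct toy_file_operations {
        int (*ioctl)(struct toy_file *, unsigned int cmd, unsigned long arg);
        long (*unlocked_ioctl)(struct toy_file *, unsigned int cmd,
                               unsigned long arg);
};

struct toy_file {
        const struct toy_file_operations *f_op;
};

/* Counts how often the (simulated) Big Kernel Lock was taken. */
int toy_bkl_acquisitions;

static void toy_lock_kernel(void)   { toy_bkl_acquisitions++; }
static void toy_unlock_kernel(void) { }

/* Prefer unlocked_ioctl(); fall back to ioctl() under the BKL. */
long toy_do_ioctl(struct toy_file *f, unsigned int cmd, unsigned long arg)
{
        long ret = -ENOTTY;     /* no handler at all */

        if (!f->f_op)
                return ret;
        if (f->f_op->unlocked_ioctl) {
                ret = f->f_op->unlocked_ioctl(f, cmd, arg);
        } else if (f->f_op->ioctl) {
                toy_lock_kernel();
                ret = f->f_op->ioctl(f, cmd, arg);
                toy_unlock_kernel();
        }
        return ret;
}

/* Two sample handlers that just echo the command number. */
int toy_legacy_ioctl(struct toy_file *f, unsigned int cmd, unsigned long arg)
{
        (void)f;
        (void)arg;
        return (int)cmd;
}

long toy_new_ioctl(struct toy_file *f, unsigned int cmd, unsigned long arg)
{
        (void)f;
        (void)arg;
        return (long)cmd;
}
```

Only the legacy path touches the lock counter, which is why a
filesystem that does its own locking gains from the newer method.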
 604
 605 Note that the file operations are implemented by the specific
 606 filesystem in which the inode resides. When opening a device node
 607 (character or block special) most filesystems will call special
 608 support routines in the VFS which will locate the required device
 609 driver information. These support routines replace the filesystem file
 610 operations with those for the device driver, and then proceed to call
 611 the new open() method for the file. This is how opening a device file
 612 in the filesystem eventually ends up calling the device driver open()
 613 method.
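
The hand-over from filesystem to device driver described above can be
modelled in userspace as follows. This is a sketch, not the kernel's
code: all toy_ names are invented, the driver is found through a
pointer stored directly in the toy inode rather than by device number,
and the error value is a placeholder.

```c
#include <stddef.h>

struct toy_inode;
struct toy_file;

struct toy_file_operations {
        int (*open)(struct toy_inode *, struct toy_file *);
};

struct toy_inode {
        const struct toy_file_operations *i_fop;        /* from the filesystem */
        const struct toy_file_operations *driver_fops;  /* from the driver */
};

struct toy_file {
        const struct toy_file_operations *f_op;
        void *private_data;
};

/* Some per-device state for the driver to hang off private_data. */
static int toy_driver_state = 7;

/* Driver open(): a good place to set private_data, as the text notes. */
int toy_driver_open(struct toy_inode *inode, struct toy_file *file)
{
        (void)inode;
        file->private_data = &toy_driver_state;
        return 0;
}

const struct toy_file_operations toy_driver_fops = { .open = toy_driver_open };

/* Support routine: swap in the driver's methods, then call its open(). */
int toy_chrdev_open(struct toy_inode *inode, struct toy_file *file)
{
        if (!inode->driver_fops)
                return -1;      /* no driver registered for this device */
        file->f_op = inode->driver_fops;
        if (file->f_op->open)
                return file->f_op->open(inode, file);
        return 0;
}

/* What the filesystem installs in i_fop for character device nodes. */
const struct toy_file_operations toy_def_chr_fops = { .open = toy_chrdev_open };

/* Toy VFS open path: copy methods from the inode, then call open(). */
int toy_vfs_open(struct toy_inode *inode, struct toy_file *file)
{
        file->f_op = inode->i_fop;
        file->private_data = NULL;
        if (file->f_op && file->f_op->open)
                return file->f_op->open(inode, file);
        return 0;
}
```

After toy_vfs_open() returns, every later method call on the file goes
straight to the driver; the filesystem is no longer involved.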
 614
 615
 616 Directory Entry Cache (dcache)
 617 ==============================
 618
 619
 620 struct dentry_operations
 621 ------------------------
 622
 623 This describes how a filesystem can overload the standard dentry
 624 operations. Dentries and the dcache are the domain of the VFS and the
 625 individual filesystem implementations. Device drivers have no business
 626 here. These methods may be set to NULL, as they are either optional or
 627 the VFS uses a default. As of kernel 2.6.13, the following members are
 628 defined:
 629
 630 struct dentry_operations {
 631         int (*d_revalidate)(struct dentry *, struct nameidata *);
 632         int (*d_hash) (struct dentry *, struct qstr *);
 633         int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
 634         int (*d_delete)(struct dentry *);
 635         void (*d_release)(struct dentry *);
 636         void (*d_iput)(struct dentry *, struct inode *);
 637 };
 638
 639   d_revalidate: called when the VFS needs to revalidate a dentry. This
 640         is called whenever a name look-up finds a dentry in the
 641         dcache. Most filesystems leave this as NULL, because all their
 642         dentries in the dcache are valid
 643
 644   d_hash: called when the VFS adds a dentry to the hash table
 645
 646   d_compare: called when a dentry should be compared with another
 647
 648   d_delete: called when the last reference to a dentry is
 649         deleted. This means no-one is using the dentry; however, it is
 650         still valid and in the dcache
 651
 652   d_release: called when a dentry is really deallocated
 653
 654   d_iput: called when a dentry loses its inode (just prior to its
 655         being deallocated). The default when this is NULL is that the
 656         VFS calls iput(). If you define this method, you must call
 657         iput() yourself
 658
 659 Each dentry has a pointer to its parent dentry, as well as a hash list
 660 of child dentries. Child dentries are basically like files in a
 661 directory.
 662
 663
 664 Directory Entry Cache APIs
 665 --------------------------
 666
 667 There are a number of functions defined which permit a filesystem to
 668 manipulate dentries:
 669
 670   dget: open a new handle for an existing dentry (this just increments
 671         the usage count)
 672
 673   dput: close a handle for a dentry (decrements the usage count). If
 674         the usage count drops to 0, the "d_delete" method is called
 675         and the dentry is placed on the unused list if the dentry is
 676         still in its parent's hash list. Putting the dentry on the
 677         unused list just means that if the system needs some RAM, it
 678         goes through the unused list of dentries and deallocates them.
 679         If the dentry has already been unhashed when the usage count
 680         drops to 0, the dentry is deallocated after the "d_delete"
 681         method is called
 682
 683   d_drop: this unhashes a dentry from its parent's hash list. A
 684         subsequent call to dput() will deallocate the dentry if its
 685         usage count drops to 0
 686
 687   d_delete: delete a dentry. If there are no other open references to
 688         the dentry then the dentry is turned into a negative dentry
 689         (the d_iput() method is called). If there are other
 690         references, then d_drop() is called instead
 691
 692   d_add: add a dentry to its parent's hash list and then call
 693         d_instantiate()
 694
 695   d_instantiate: add a dentry to the alias hash list for the inode and
 696         update the "d_inode" member. The "i_count" member in the
 697         inode structure should be set/incremented. If the inode
 698         pointer is NULL, the dentry is called a "negative
 699         dentry". This function is commonly called when an inode is
 700         created for an existing negative dentry
 701
 702   d_lookup: look up a dentry given its parent and path name component.
 703         It looks up the child of that given name from the dcache
 704         hash table. If it is found, the reference count is incremented
 705         and the dentry is returned. The caller must use dput()
 706         to release the dentry when it finishes using it.
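
The interplay of dget, dput and d_drop described above can be sketched
as a small userspace model. It is only a sketch under simplifying
assumptions: there is no locking, nothing is really freed, and the
"state" and "d_delete_calls" fields are hypothetical additions that
let the outcome be observed.

```c
#include <assert.h>
#include <stdbool.h>

enum dentry_state { DENTRY_IN_USE, DENTRY_UNUSED_LIST, DENTRY_FREED };

struct dentry {
	int d_count;		/* usage count */
	bool hashed;		/* still in the parent's hash list? */
	int d_delete_calls;	/* how often the "d_delete" method ran */
	enum dentry_state state;
};

/* dget: open a new handle (this just increments the usage count). */
static struct dentry *dget(struct dentry *dentry)
{
	dentry->d_count++;
	dentry->state = DENTRY_IN_USE;
	return dentry;
}

/* d_drop: unhash the dentry from its parent's hash list. */
static void d_drop(struct dentry *dentry)
{
	dentry->hashed = false;
}

/* dput: close a handle; act only when the usage count drops to 0. */
static void dput(struct dentry *dentry)
{
	assert(dentry->d_count > 0);
	if (--dentry->d_count > 0)
		return;
	dentry->d_delete_calls++;	/* the "d_delete" method is called */
	/* Hashed dentries are kept for reuse, unhashed ones go away. */
	dentry->state = dentry->hashed ? DENTRY_UNUSED_LIST : DENTRY_FREED;
}
```

In this model a dentry on the unused list can be picked up again with
dget(), whereas one that was unhashed by d_drop() is gone after the
final dput().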
 707
 708
 709 RCU-based dcache locking model
 710 ------------------------------
 711
 712 On many workloads, the most common operation on dcache is
 713 to look up a dentry, given a parent dentry and the name
 714 of the child. Typically, for every open(), stat() etc.,
 715 the dentry corresponding to the pathname will be looked
 716 up by walking the tree starting with the first component
 717 of the pathname and using that dentry along with the next
 718 component to look up the next level and so on. Since it
 719 is a frequent operation for workloads like multiuser
 720 environments and web servers, it is important to optimize
 721 this path.
 722
 723 Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
 724 for every component during path look-up. From 2.5.10 onwards, the
 725 fast-walk algorithm changed this by holding the dcache_lock
 726 at the beginning and walking as many cached path component
 727 dentries as possible. This significantly decreases the number
 728 of acquisitions of dcache_lock. However it also increases the
 729 lock hold time significantly and affects performance on large
 730 SMP machines. Since the 2.5.62 kernel, dcache has been using
 731 a new locking model that uses RCU to make dcache look-up
 732 lock-free.
 733
 734 The current dcache locking model is not very different from the previous
 735 dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
 736 protected the hash chain, d_child, d_alias, d_lru lists as well
 737 as d_inode and several other things like mount look-up. RCU-based
 738 changes affect only the way the hash chain is protected. For everything
 739 else the dcache_lock must be taken for both traversing as well as
 740 updating. The hash chain updates too take the dcache_lock.
 741 The significant change is the way d_lookup traverses the hash chain:
 742 it doesn't acquire the dcache_lock for this and relies on RCU to
 743 ensure that the dentry has not been *freed*.
 744
 745
 746 Dcache locking details
 747 ----------------------
 748
 749 For many multi-user workloads, open() and stat() on files are
 750 very frequently occurring operations. Both involve walking
 751 of path names to find the dentry corresponding to the
 752 concerned file. In the 2.4 kernel, dcache_lock was held
 753 during look-up of each path component. Contention and
 754 cache-line bouncing of this global lock caused significant
 755 scalability problems. With the introduction of RCU
 756 in the Linux kernel, this was worked around by making
 757 the look-up of path components during path walking lock-free.
 758
 759
 760 Safe lock-free look-up of dcache hash table
 761 ===========================================
 762
 763 Dcache is a complex data structure with the hash table entries
 764 also linked together in other lists. In the 2.4 kernel, dcache_lock
 765 protected all the lists. We applied RCU only on hash chain
 766 walking. The rest of the lists are still protected by dcache_lock.
 767 Some of the important changes are:
 768
 769 1. Deletion from the hash chain is done using the hlist_del_rcu() macro,
 770    which doesn't reset the next pointer of the deleted dentry; this
 771    allows us to walk safely lock-free while a deletion is happening.
 772
 773 2. Insertion of a dentry into the hash table is done using
 774    hlist_add_head_rcu() which takes care of ordering the writes -
 775    the writes to the dentry must be visible before the dentry
 776    is inserted. This works in conjunction with hlist_for_each_rcu()
 777    while walking the hash chain. The only requirement is that
 778    all initialization to the dentry must be done before hlist_add_head_rcu()
 779    since we don't have dcache_lock protection while traversing
 780    the hash chain. This isn't different from the existing code.
 781
 782 3. A dentry looked up without holding dcache_lock cannot be
 783    returned for walking if it is unhashed. It then may have a NULL
 784    d_inode or other bogosity since RCU doesn't protect the other
 785    fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
 786    indicate unhashed dentries and use this in conjunction with a
 787    per-dentry lock (d_lock). Once looked up without the dcache_lock,
 788    we acquire the per-dentry lock (d_lock) and check if the
 789    dentry is unhashed. If so, the look-up fails. If not, the
 790    reference count of the dentry is increased and the dentry is returned.
 791
 792 4. Once a dentry is looked up, it must be ensured that it doesn't go
 793    away during the path walk for that component. In pre-2.5.10 code,
 794    this was done by holding a reference to the dentry. dcache_rcu does
 795    the same. In some sense, dcache_rcu path walking looks like
 796    the pre-2.5.10 version.
 797
 798 5. All dentry hash chain updates must take the dcache_lock as well as
 799    the per-dentry lock in that order. dput() does this to ensure
 800    that a dentry that has just been looked up on another CPU
 801    doesn't get deleted before dget() can be done on it.
 802
 803 6. There are several ways to do reference counting of RCU protected
 804    objects. One such example is in the ipv4 route cache where
 805    deferred freeing (using call_rcu()) is done as soon as
 806    the reference count goes to zero. This cannot be done in
 807    the case of dentries because tearing down of dentries
 808    requires blocking (dentry_iput()) which isn't supported from
 809    RCU callbacks. Instead, tearing down of dentries happens
 810    synchronously in dput(), but actual freeing happens later
 811    when the RCU grace period is over. This allows safe lock-free
 812    walking of the hash chains, but a matched dentry may have
 813    been partially torn down. The checking of the DCACHE_UNHASHED
 814    flag with d_lock held detects such dentries and prevents
 815    them from being returned from look-up.
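
Points 1 and 3 above can be pictured with a single-threaded userspace
model. It is a sketch only: there is no real RCU and no real spinlock
here, the d_lock field is a plain integer stand-in, the value given to
DCACHE_UNHASHED is arbitrary, and model_unhash() and model_lookup()
are hypothetical names rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DCACHE_UNHASHED 0x0010	/* flag value is arbitrary in this model */

struct dentry {
	struct dentry *next;	/* hash chain link, NULL-terminated */
	const char *name;
	unsigned int d_flags;	/* protected by d_lock */
	int d_count;
	int d_lock;		/* stand-in for the per-dentry spinlock */
};

static void lock_dentry(struct dentry *d)
{
	assert(!d->d_lock);
	d->d_lock = 1;
}

static void unlock_dentry(struct dentry *d)
{
	assert(d->d_lock);
	d->d_lock = 0;
}

/*
 * Model of RCU-style removal: mark the dentry unhashed and unlink it from
 * the chain, but leave its own next pointer alone so that a reader which
 * is currently standing on it can still reach the rest of the chain.
 */
static void model_unhash(struct dentry **head, struct dentry *victim)
{
	struct dentry **pp;

	lock_dentry(victim);
	victim->d_flags |= DCACHE_UNHASHED;
	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;	/* victim->next stays intact */
			break;
		}
	}
	unlock_dentry(victim);
}

/* Walk the chain without a global lock; validate the match under d_lock. */
static struct dentry *model_lookup(struct dentry *start, const char *name)
{
	struct dentry *d;

	for (d = start; d; d = d->next) {
		if (strcmp(d->name, name) != 0)
			continue;
		lock_dentry(d);
		if (d->d_flags & DCACHE_UNHASHED) {
			unlock_dentry(d);
			return NULL;	/* unhashed: the look-up fails */
		}
		d->d_count++;		/* take a reference before returning */
		unlock_dentry(d);
		return d;
	}
	return NULL;
}
```

Passing an already unhashed dentry as "start" imitates a reader that
was in the middle of the chain when the deletion happened: it can
still walk on to later entries, but it is never handed the unhashed
dentry itself.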
 816
 817
 818 Maintaining POSIX rename semantics
 819 ==================================
 820
 821 Since look-up of dentries is lock-free, it can race against
 822 a concurrent rename operation. For example, during rename
 823 of file A to B, look-up of either A or B must succeed.
 824 So, if look-up of B happens after A has been removed from the
 825 hash chain but not added to the new hash chain, it may fail.
 826 Also, a comparison while the name is being written concurrently
 827 by a rename may result in false positive matches violating
 828 rename semantics. Issues related to races with rename are
 829 handled as described below:
 830
 831 1. Look-up can be done in two ways - d_lookup() which is safe
 832    from simultaneous renames and __d_lookup() which is not.
 833    If __d_lookup() fails, it must be followed up by a d_lookup()
 834    to correctly determine whether a dentry is in the hash table
 835    or not. d_lookup() protects look-ups using a sequence
 836    lock (rename_lock).
 837
 838 2. The name associated with a dentry (d_name) may be changed if
 839    a rename is allowed to happen simultaneously. To keep memcmp()
 840    in __d_lookup() from going out of bounds due to a rename, and to
 841    avoid false positive comparisons, the name comparison is done while
 842    holding the per-dentry lock. This prevents concurrent renames during
 843    this operation.
 844
 845 3. Hash table walking during look-up may move to a different bucket as
 846    the current dentry is moved to a different bucket due to rename.
 847    But we use hlists in the dcache hash table and they are null-terminated.
 848    So, even if a dentry moves to a different bucket, hash chain
 849    walk will terminate. [with a list_head list, it may not since
 850    termination is when the list_head in the original bucket is reached].
 851    Since we redo the d_parent check and compare name while holding
 852    d_lock, lock-free look-up will not race against d_move().
 853
 854 4. There can be a theoretical race when a dentry keeps coming back
 855    to the original bucket due to double moves. Due to this, look-up may
 856    consider that it has never moved and can end up in an infinite loop.
 857    But this is not any worse than the theoretical livelocks we already
 858    have in the kernel.
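
The retry pattern from point 1 can be sketched in userspace as follows.
This is a single-threaded model under stated assumptions, not the
kernel implementation: the rename_seq counter stands in for the
sequence count of rename_lock, a racing rename is simulated from
inside the lock-free look-up, and all names starting with model_ are
hypothetical.

```c
#include <assert.h>

struct model_dcache {
	unsigned int rename_seq;	/* stand-in for rename_lock's sequence */
	int hashed;			/* is the dentry on a hash chain? */
	int rename_pending;		/* simulate one rename racing with us */
	int attempts;			/* number of lock-free look-ups made */
};

/* Writer side: a rename takes the dentry off its old chain ... */
static void model_rename_begin(struct model_dcache *dc)
{
	dc->rename_seq++;
	dc->hashed = 0;
}

/* ... and later puts it on the new one. */
static void model_rename_end(struct model_dcache *dc)
{
	dc->hashed = 1;
	dc->rename_seq++;
}

/* Lock-free look-up: may miss a dentry that is in the middle of a rename. */
static int model_lockfree_lookup(struct model_dcache *dc)
{
	int found;

	dc->attempts++;
	if (dc->rename_pending) {
		/* Pretend a rename runs concurrently with this look-up. */
		model_rename_begin(dc);
		found = dc->hashed;	/* 0: between unhash and rehash */
		model_rename_end(dc);
		dc->rename_pending = 0;
		return found;
	}
	return dc->hashed;
}

/* Rename-safe look-up: retry a miss if a rename ran in the meantime. */
static int model_safe_lookup(struct model_dcache *dc)
{
	unsigned int seq;
	int found;

	do {
		seq = dc->rename_seq;		/* like read_seqbegin() */
		found = model_lockfree_lookup(dc);
		if (found)
			break;
	} while (seq != dc->rename_seq);	/* like read_seqretry() */
	return found;
}
```

A miss is only trusted when the sequence count did not change while
the look-up ran. A hit needs no retry, which keeps the common case
cheap.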
 859
 860
 861 Important guidelines for filesystem developers related to dcache_rcu
 862 ====================================================================
 863
 864 1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
 865    don't change. Only the dcache internal implementation changes. However
 866    filesystems *must not* delete from the dentry hash chains directly
 867    using the list macros as was allowed earlier. They must use dcache
 868    APIs like d_drop() or __d_drop() depending on the situation.
 869
 870 2. d_flags is now protected by a per-dentry lock (d_lock). All
 871    access to d_flags must be protected by it.
 872
 873 3. For a hashed dentry, checking of d_count needs to be protected
 874    by d_lock.
 875
 876
 877 Papers and other documentation on dcache locking
 878 ================================================
 879
 880 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
 881
 882 2. http://lse.sourceforge.net/locking/dcache/dcache.html