Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index c1d913944ad8b011b9e6de29816d44e42df11a84..f70ebcdfe592dbb5ab91f0a60561d3885d3a0aff 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -587,7 +587,7 @@ leading to the following situation:
  
         (Q == &B) and (D == 2) ????
  
-Whilst this may seem like a failure of coherency or causality maintenance, it
+While this may seem like a failure of coherency or causality maintenance, it
  isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
  Alpha).
  
@@ -1937,21 +1937,6 @@ There are some more advanced barrier functions:
       information on consistent memory.
  
  
-MMIO WRITE BARRIER
-------------------
-
-The Linux kernel also has a special barrier for use with memory-mapped I/O
-writes:
-
-       mmiowb();
-
-This is a variation on the mandatory write barrier that causes writes to weakly
-ordered I/O regions to be partially ordered.  Its effects may go beyond the
-CPU->Hardware interface and actually affect the hardware at some level.
-
-See the subsection "Acquires vs I/O accesses" for more information.
-
-
  ===============================
  IMPLICIT KERNEL MEMORY BARRIERS
  ===============================
@@ -2008,7 +1993,7 @@ for each construct.  These operations all imply certain barriers:
  
       Certain locking variants of the ACQUIRE operation may fail, either due to
       being unable to get the lock immediately, or due to receiving an unblocked
-     signal whilst asleep waiting for the lock to become available.  Failed
+     signal while asleep waiting for the lock to become available.  Failed
       locks do not imply any sort of barrier.
  
  [!] Note: one of the consequences of lock ACQUIREs and RELEASEs being only
@@ -2317,75 +2302,6 @@ But it won't see any of:
         *E, *F or *G following RELEASE Q
  
  
-
-ACQUIRES VS I/O ACCESSES
-------------------------
-
-Under certain circumstances (especially involving NUMA), I/O accesses within
-two spinlocked sections on two different CPUs may be seen as interleaved by the
-PCI bridge, because the PCI bridge does not necessarily participate in the
-cache-coherence protocol, and is therefore incapable of issuing the required
-read memory barriers.
-
-For example:
-
-       CPU 1                           CPU 2
-       =============================== ===============================
-       spin_lock(Q)
-       writel(0, ADDR)
-       writel(1, DATA);
-       spin_unlock(Q);
-                                       spin_lock(Q);
-                                       writel(4, ADDR);
-                                       writel(5, DATA);
-                                       spin_unlock(Q);
-
-may be seen by the PCI bridge as follows:
-
-       STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
-
-which would probably cause the hardware to malfunction.
-
-
-What is necessary here is to intervene with an mmiowb() before dropping the
-spinlock, for example:
-
-       CPU 1                           CPU 2
-       =============================== ===============================
-       spin_lock(Q)
-       writel(0, ADDR)
-       writel(1, DATA);
-       mmiowb();
-       spin_unlock(Q);
-                                       spin_lock(Q);
-                                       writel(4, ADDR);
-                                       writel(5, DATA);
-                                       mmiowb();
-                                       spin_unlock(Q);
-
-this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
-before either of the stores issued on CPU 2.
-
-
-Furthermore, following a store by a load from the same device obviates the need
-for the mmiowb(), because the load forces the store to complete before the load
-is performed:
-
-       CPU 1                           CPU 2
-       =============================== ===============================
-       spin_lock(Q)
-       writel(0, ADDR)
-       a = readl(DATA);
-       spin_unlock(Q);
-                                       spin_lock(Q);
-                                       writel(4, ADDR);
-                                       b = readl(DATA);
-                                       spin_unlock(Q);
-
-
-See Documentation/driver-api/device-io.rst for more information.
-
-
  =================================
  WHERE ARE MEMORY BARRIERS NEEDED?
  =================================
@@ -2508,7 +2424,7 @@ CPU, that CPU's dependency ordering logic will take care of everything else.
  ATOMIC OPERATIONS
  -----------------
  
-Whilst they are technically interprocessor interaction considerations, atomic
+While they are technically interprocessor interaction considerations, atomic
  operations are noted specially as some of them imply full memory barriers and
  some don't, but they're very heavily relied on as a group throughout the
  kernel.
@@ -2531,17 +2447,10 @@ the device to malfunction.
  
  Inside of the Linux kernel, I/O should be done through the appropriate accessor
  routines - such as inb() or writel() - which know how to make such accesses
-appropriately sequential.  Whilst this, for the most part, renders the explicit
-use of memory barriers unnecessary, there are a couple of situations where they
-might be needed:
-
- (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
-     so for _all_ general drivers locks should be used and mmiowb() must be
-     issued prior to unlocking the critical section.
-
- (2) If the accessor functions are used to refer to an I/O memory window with
-     relaxed memory access properties, then _mandatory_ memory barriers are
-     required to enforce ordering.
+appropriately sequential.  While this, for the most part, renders the explicit
+use of memory barriers unnecessary, if the accessor functions are used to refer
+to an I/O memory window with relaxed memory access properties, then _mandatory_
+memory barriers are required to enforce ordering.
  
  See Documentation/driver-api/device-io.rst for more information.
  
@@ -2555,7 +2464,7 @@ access the device.
  
  This may be alleviated - at least in part - by disabling local interrupts (a
  form of locking), such that the critical operations are all contained within
-the interrupt-disabled section in the driver.  Whilst the driver's interrupt
+the interrupt-disabled section in the driver.  While the driver's interrupt
  routine is executing, the driver's core may not run on the same CPU, and its
  interrupt is not permitted to happen again until the current interrupt has been
  handled, thus the interrupt handler does not need to lock against that.
@@ -2586,8 +2495,7 @@ explicit barriers are used.
  
  Normally this won't be a problem because the I/O accesses done inside such
  sections will include synchronous load operations on strictly ordered I/O
-registers that form implicit I/O barriers.  If this isn't sufficient then an
-mmiowb() may need to be used explicitly.
+registers that form implicit I/O barriers.
  
  
  A similar situation may occur between an interrupt routine and two routines
@@ -2599,71 +2507,114 @@ likely, then interrupt-disabling locks should be used to guarantee ordering.
  KERNEL I/O BARRIER EFFECTS
  ==========================
  
-When accessing I/O memory, drivers should use the appropriate accessor
-functions:
-
- (*) inX(), outX():
-
-     These are intended to talk to I/O space rather than memory space, but
-     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
-     do indeed have special I/O space access cycles and instructions, but many
-     CPUs don't have such a concept.
-
-     The PCI bus, amongst others, defines an I/O space concept which - on such
-     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
-     space.  However, it may also be mapped as a virtual I/O space in the CPU's
-     memory map, particularly on those CPUs that don't support alternate I/O
-     spaces.
-
-     Accesses to this space may be fully synchronous (as on i386), but
-     intermediary bridges (such as the PCI host bridge) may not fully honour
-     that.
-
-     They are guaranteed to be fully ordered with respect to each other.
-
-     They are not guaranteed to be fully ordered with respect to other types of
-     memory and I/O operation.
+Interfacing with peripherals via I/O accesses is deeply architecture and device
+specific. Therefore, drivers which are inherently non-portable may rely on
+specific behaviours of their target systems in order to achieve synchronization
+in the most lightweight manner possible. For drivers intending to be portable
+between multiple architectures and bus implementations, the kernel offers a
+series of accessor functions that provide various degrees of ordering
+guarantees:
  
   (*) readX(), writeX():
  
-     Whether these are guaranteed to be fully ordered and uncombined with
-     respect to each other on the issuing CPU depends on the characteristics
-     defined for the memory window through which they're accessing.  On later
-     i386 architecture machines, for example, this is controlled by way of the
-     MTRR registers.
+       The readX() and writeX() MMIO accessors take a pointer to the
+       peripheral being accessed as an __iomem * parameter. For pointers
+       mapped with the default I/O attributes (e.g. those returned by
+       ioremap()), the ordering guarantees are as follows:
+
+       1. All readX() and writeX() accesses to the same peripheral are ordered
+          with respect to each other. This ensures that MMIO register accesses
+          by the same CPU thread to a particular device will arrive in program
+          order.
+
+       2. A writeX() issued by a CPU thread holding a spinlock is ordered
+          before a writeX() to the same peripheral from another CPU thread
+          issued after a later acquisition of the same spinlock. This ensures
+          that MMIO register writes to a particular device issued while holding
+          a spinlock will arrive in an order consistent with acquisitions of
+          the lock.
+
+       3. A writeX() by a CPU thread to the peripheral will first wait for the
+          completion of all prior writes to memory either issued by, or
+          propagated to, the same thread. This ensures that writes by the CPU
+          to an outbound DMA buffer allocated by dma_alloc_coherent() will be
+          visible to a DMA engine when the CPU writes to its MMIO control
+          register to trigger the transfer.
+
+       4. A readX() by a CPU thread from the peripheral will complete before
+          any subsequent reads from memory by the same thread can begin. This
+          ensures that reads by the CPU from an incoming DMA buffer allocated
+          by dma_alloc_coherent() will not see stale data after reading from
+          the DMA engine's MMIO status register to establish that the DMA
+          transfer has completed.
+
+       5. A readX() by a CPU thread from the peripheral will complete before
+          any subsequent delay() loop can begin execution on the same thread.
+          This ensures that two MMIO register writes by the CPU to a peripheral
+          will arrive at least 1us apart if the first write is immediately read
+          back with readX() and udelay(1) is called prior to the second
+          writeX():
+
+               writel(42, DEVICE_REGISTER_0); // Arrives at the device...
+               readl(DEVICE_REGISTER_0);
+               udelay(1);
+               writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
+
+       The ordering properties of __iomem pointers obtained with non-default
+       attributes (e.g. those returned by ioremap_wc()) are specific to the
+       underlying architecture and therefore the guarantees listed above cannot
+       generally be relied upon for accesses to these types of mappings.
+
+ (*) readX_relaxed(), writeX_relaxed():
+
+       These are similar to readX() and writeX(), but provide weaker memory
+       ordering guarantees. Specifically, they do not guarantee ordering with
+       respect to locking, normal memory accesses or delay() loops (i.e.
+       bullets 2-5 above) but they are still guaranteed to be ordered with
+       respect to other accesses from the same CPU thread to the same
+       peripheral when operating on __iomem pointers mapped with the default
+       I/O attributes.
+
+ (*) readsX(), writesX():
+
+       The readsX() and writesX() MMIO accessors are designed for accessing
+       register-based, memory-mapped FIFOs residing on peripherals that are not
+       capable of performing DMA. Consequently, they provide only the ordering
+       guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
  
-     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
-     provided they're not accessing a prefetchable device.
+ (*) inX(), outX():
  
-     However, intermediary hardware (such as a PCI bridge) may indulge in
-     deferral if it so wishes; to flush a store, a load from the same location
-     is preferred[*], but a load from the same device or from configuration
-     space should suffice for PCI.
+       The inX() and outX() accessors are intended to access legacy port-mapped
+       I/O peripherals, which may require special instructions on some
+       architectures (notably x86). The port number of the peripheral being
+       accessed is passed as an argument.
  
-     [*] NOTE! attempting to load from the same location as was written to may
-        cause a malfunction - consider the 16550 Rx/Tx serial registers for
-        example.
+       Since many CPU architectures ultimately access these peripherals via an
+       internal virtual memory mapping, the portable ordering guarantees
+       provided by inX() and outX() are the same as those provided by readX()
+       and writeX() respectively when accessing a mapping with the default I/O
+       attributes.
  
-     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
-     force stores to be ordered.
+       Device drivers may expect outX() to emit a non-posted write transaction
+       that waits for a completion response from the I/O peripheral before
+       returning. This is not guaranteed by all architectures and is therefore
+       not part of the portable ordering semantics.
  
-     Please refer to the PCI specification for more information on interactions
-     between PCI transactions.
+ (*) insX(), outsX():
  
- (*) readX_relaxed(), writeX_relaxed()
+       As above, the insX() and outsX() accessors provide the same ordering
+       guarantees as readsX() and writesX() respectively when accessing a
+       mapping with the default I/O attributes.
  
-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees.  Specifically, they do not guarantee ordering with
-     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
-     ordering with respect to LOCK or UNLOCK operations.  If the latter is
-     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
-     the same peripheral are guaranteed to be ordered with respect to each
-     other.
+ (*) ioreadX(), iowriteX():
  
- (*) ioreadX(), iowriteX()
+       These will perform appropriately for the type of access they're actually
+       doing, be it inX()/outX() or readX()/writeX().
  
-     These will perform appropriately for the type of access they're actually
-     doing, be it inX()/outX() or readX()/writeX().
+With the exception of the string accessors (insX(), outsX(), readsX() and
+writesX()), all of the above assume that the underlying peripheral is
+little-endian and will therefore perform byte-swapping operations on big-endian
+architectures.
  
  
  ========================================
@@ -2763,7 +2714,7 @@ CACHE COHERENCY
  
  Life isn't quite as simple as it may appear above, however: for while the
  caches are expected to be coherent, there's no guarantee that that coherency
-will be ordered.  This means that whilst changes made on one CPU will
+will be ordered.  This means that while changes made on one CPU will
  eventually become visible on all CPUs, there's no guarantee that they will
  become apparent in the same order on those other CPUs.
  
@@ -2799,7 +2750,7 @@ Imagine the system has the following properties:
   (*) an even-numbered cache line may be in cache B, cache D or it may still be
       resident in memory;
  
- (*) whilst the CPU core is interrogating one cache, the other cache may be
+ (*) while the CPU core is interrogating one cache, the other cache may be
       making use of the bus to access the rest of the system - perhaps to
       displace a dirty cacheline or to do a speculative load;
  
@@ -2835,7 +2786,7 @@ now imagine that the second CPU wants to read those values:
                         x = *q;
  
  The above pair of reads may then fail to happen in the expected order, as the
-cacheline holding p may get updated in one of the second CPU's caches whilst
+cacheline holding p may get updated in one of the second CPU's caches while
  the update to the cacheline holding v is delayed in the other of the second
  CPU's caches by some other cache event:
  
@@ -2855,7 +2806,7 @@ CPU's caches by some other cache event:
                         <C:unbusy>
                         <C:commit v=2>
  
-Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
+Basically, while both cachelines will be updated on CPU 2 eventually, there's
  no guarantee that, without intervention, the order of update will be the same
  as that committed on CPU 1.
  
@@ -2885,7 +2836,7 @@ coherency queue before processing any further requests:
  
  This sort of problem can be encountered on DEC Alpha processors as they have a
  split cache that improves performance by making better use of the data bus.
-Whilst most CPUs do imply a data dependency barrier on the read when a memory
+While most CPUs do imply a data dependency barrier on the read when a memory
  access depends on a read, not all do, so it may not be relied on.
  
  Other CPUs may also have split caches, but must coordinate between the various
@@ -2974,7 +2925,7 @@ assumption doesn't hold because:
       thus cutting down on transaction setup costs (memory and PCI devices may
       both be able to do this); and
  
- (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
+ (*) the CPU's data cache may affect the ordering, and while cache-coherency
       mechanisms may alleviate this - once the store has actually hit the cache
       - there's no guarantee that the coherency management will be propagated in
       order to other CPUs.
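
Note on the readX()/writeX() hunk above: guarantee 5 in the new text describes
a write, read-back, delay, write sequence. The following is a minimal userspace
sketch of that call sequence, not kernel code. It assumes hypothetical stand-ins
mock_writel(), mock_readl() and mock_udelay() for writel(), readl() and
udelay(), plus a two-entry fake register file. The mocks only record the order
of calls so the sequence can be inspected. They do not model bus posting or
real timing, so the sketch shows the shape of the pattern rather than proving
the hardware guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace of the accesses a driver would issue. */
enum op_kind { OP_WRITE, OP_READ, OP_DELAY };

struct op {
	enum op_kind kind;
	unsigned int reg;	/* register index; unused for OP_DELAY */
	uint32_t val;		/* value written or read, or delay in us */
};

#define TRACE_MAX 16
static struct op trace[TRACE_MAX];
static size_t trace_len;
static uint32_t fake_regs[2];	/* stand-in for two device registers */

static void record(enum op_kind kind, unsigned int reg, uint32_t val)
{
	if (trace_len < TRACE_MAX)
		trace[trace_len++] = (struct op){ kind, reg, val };
}

/* Stand-in for writel(): store the value and log the access. */
static void mock_writel(uint32_t val, unsigned int reg)
{
	fake_regs[reg] = val;
	record(OP_WRITE, reg, val);
}

/* Stand-in for readl(): log the access and return the stored value. */
static uint32_t mock_readl(unsigned int reg)
{
	record(OP_READ, reg, fake_regs[reg]);
	return fake_regs[reg];
}

/* Stand-in for udelay(): only logs the requested delay. */
static void mock_udelay(uint32_t usecs)
{
	record(OP_DELAY, 0, usecs);
}

/*
 * The sequence from guarantee 5: the read-back of register 0 sits between
 * the first write and the delay, and the second write follows the delay.
 */
void spaced_writes(void)
{
	mock_writel(42, 0);
	(void)mock_readl(0);
	mock_udelay(1);
	mock_writel(42, 1);
}
```

Calling spaced_writes() once leaves four entries in trace[], in the order
write, read, delay, write. In a real driver the read-back is what allows the
delay to be counted from the arrival of the first write at the device.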