git.samba.org - sfrench/cifs-2.6.git/log

Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

- fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

- fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

- fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING

Merge tag 'iommu-fixes-v5.3-rc4' of git://git./linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

- A couple more fixes for the Intel VT-d driver for bugs introduced
   during the recent conversion of this driver to use IOMMU core default
   domains.

- Fix for common dma-iommu code to make sure MSI mappings happen in the
   correct domain for a device.

- Fix a corner case in the handling of sg-lists in dma-iommu code that
   might cause dma_length to be truncated.

- Mark a switch as fall-through in arm-smmu code.

* tag 'iommu-fixes-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix possible use-after-free of private domain
  iommu/vt-d: Detach domain before using a private one
  iommu/dma: Handle SG length overflow better
  iommu/vt-d: Correctly check format of page table in debugfs
  iommu/vt-d: Detach domain when move device out of group
  iommu/arm-smmu: Mark expected switch fall-through
  iommu/dma: Handle MSI mappings separately

Merge branch 'akpm' (patches from Andrew)

Merge misc VM fixes from Andrew Morton:
"A bunch of hotfixes, all affecting mm/.

  The two-patch series from Andrea may be controversial. This restores
  patches which were reverted in Dec 2018 due to a regression report [*].

  After extensive discussion it is evident that the problems which these
  patches solved were significantly more serious than the problems they
  introduced. I am told that major distros are already carrying these
  two patches for this reason"

[*] See

      https://lore.kernel.org/lkml/alpine.DEB.2.21.1812061343240.144733@chino.kir.corp.google.com/
      https://lore.kernel.org/lkml/alpine.DEB.2.21.1812031545560.161134@chino.kir.corp.google.com/

  for the google-specific issues brought up by David Rijentes. And as
  Andrew says:

    "I'm unaware of anyone else who will be adversely affected by this,
     and google already carries over a thousand kernel patches - another
     won't kill them.

     There has been sporadic discussion about fixing these things for
     real but it's clear that nobody apart from David is particularly
     motivated"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS
  mm, vmscan: do not special-case slab reclaim when watermarks are boosted
  Revert "mm, thp: restore node-local hugepage allocations"
  Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
  include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used
  seq_file: fix problem when seeking mid-record
  mm: workingset: fix vmstat counters for shadow nodes
  mm/usercopy: use memory range to be accessed for wraparound check
  mm: kmemleak: disable early logging in case of error
  mm/vmalloc.c: fix percpu free VM area search criteria
  mm/memcontrol.c: fix use after free in mem_cgroup_iter()
  mm/z3fold.c: fix z3fold_destroy_pool() race condition
  mm/z3fold.c: fix z3fold_destroy_pool() ordering
  mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind
  mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified
  mm/hmm: fix bad subpage pointer in try_to_unmap_one
  mm/hmm: fix ZONE_DEVICE anon page mapping reuse
  mm: document zone device struct page field usage

hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS

Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
page migration and page fault.

If a hugetlb page can not be allocated to satisfy a page fault, the task
is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
mutex exists to prevent two tasks from trying to instantiate the same
page.  This protects against the situation where there is only one
hugetlb page, and both tasks would try to allocate.  Without the mutex,
one would fail and SIGBUS even though the other fault would be
successful.

There is a similar race between hugetlb page migration and fault.
Migration code will allocate a page for the target of the migration.  It
will then unmap the original page from all page tables.  It does this
unmap by first clearing the pte and then writing a migration entry.  The
page table lock is held for the duration of this clear and write
operation.  However, the beginnings of the hugetlb page fault code
optimistically checks the pte without taking the page table lock.  If
clear (as it can be during the migration unmap operation), a hugetlb
page allocation is attempted to satisfy the fault.  Note that the page
which will eventually satisfy this fault was already allocated by the
migration code.  However, the allocation within the fault path could
fail which would result in the task incorrectly being sent SIGBUS.

Ideally, we could take the hugetlb fault mutex in the migration code
when modifying the page tables.  However, locks must be taken in the
order of hugetlb fault mutex, page lock, page table lock.  This would
require significant rework of the migration code.  Instead, the issue is
addressed in the hugetlb fault code.  After failing to allocate a huge
page, take the page table lock and check for huge_pte_none before
returning an error.  This is the same check that must be made further in
the code even if page allocation is successful.

Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, vmscan: do not special-case slab reclaim when watermarks are boosted

Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation
event occurs").

The report is extensive:

  https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/

and it's worth recording the most relevant parts (colorful language and
typos included).

When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.

The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.

The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.

The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.

So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").

The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event".  The boosting was not intended
to target THP specifically and triggers even if THP is disabled.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations so
boosting remains active for long periods of time.

As high-order allocations might use compaction and compaction cannot
move slab pages the decision was made in the commit to special-case
kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
reclaiming slab does not directly help compaction.

As Dave notes, this decision means that slab can be artificially
protected for long periods of time and messes up the balance with slab
and page caches.

Removing the special casing can still indirectly help avoid
fragmentation by avoiding fragmentation-causing events due to slab
allocation as pages from a slab pageblock will have some slab objects
freed.  Furthermore, with the special casing, reclaim behaviour is
unpredictable as kswapd sometimes examines slab and sometimes does not
in a manner that is tricky to tune or analyse.

This patch removes the special casing.  The downside is that this is not
a universal performance win.  Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when slab
reclaim is restored to its original behaviour.  Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive.  The primary upside is that slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion), behaves consistently regardless of the fragmentation state of
the system and properly obeys VM sysctls.

A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.

This is not an exact replication of Dave's setup.  The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines.  The parameters mean the first sample
reported by fs_mark is using 50% of RAM which will barely be throttled
and look like a big outlier.  Dave used fake NUMA to have multiple
kswapd instances which I didn't replicate.  Finally, the number of
iterations differ from Dave's test as the target disk was not large
enough.  While not identical, it should be representative.

  fsmark
                                     5.3.0-rc3              5.3.0-rc3
                                       vanilla          shrinker-v1r1
  Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
  1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
  2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
  3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
  Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
  Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
  Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
  Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
  Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
  Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
  CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
  BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
  BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
  BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
  BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
  BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
  BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)

                     5.3.0-rc3   5.3.0-rc3
                       vanillashrinker-v1r1
  Duration User         501.82      497.29
  Duration System      4401.44     4424.08
  Duration Elapsed     8124.76     8358.05

This is showing a slight skew for the max result representing a large
outlier for the 1st, 2nd and 3rd quartile are similar indicating that
the bulk of the results show little difference.  Note that an earlier
version of the fsmark configuration showed a regression but that
included more samples taken while memory was still filling.

Note that the elapsed time is higher.  Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing
fsmark with multiple thread counts.  Without the patch, many of these
objects would be memory resident which is part of what the patch is
addressing.

There are other important observations that justify the patch.

1. With the vanilla kernel, the number of dirty pages in the system is
   very low for much of the test. With this patch, dirty pages is
   generally kept at 10% which matches vm.dirty_background_ratio which
   is normal expected historical behaviour.

2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
   0.95 for much of the test i.e. Slab is being left alone and
   dominating memory consumption. With the patch applied, the ratio
   varies between 0.35 and 0.45 with the bulk of the measured ratios
   roughly half way between those values. This is a different balance to
   what Dave reported but it was at least consistent.

3. Slabs are scanned throughout the entire test with the patch applied.
   The vanille kernel has periods with no scan activity and then
   relatively massive spikes.

4. Without the patch, kswapd scan rates are very variable. With the
   patch, the scan rates remain quite steady.

4. Overall vmstats are closer to normal expectations

                                5.3.0-rc3      5.3.0-rc3
                                  vanilla  shrinker-v1r1
    Ops Direct pages scanned             99388.00      328410.00
    Ops Kswapd pages scanned          45382917.00    33451026.00
    Ops Kswapd pages reclaimed        30869570.00    25239655.00
    Ops Direct pages reclaimed           74131.00        5830.00
    Ops Kswapd efficiency %                 68.02          75.45
    Ops Kswapd velocity                   5585.75        4002.25
    Ops Page reclaim immediate         1179721.00      430927.00
    Ops Slabs scanned                 62367361.00    73581394.00
    Ops Direct inode steals               2103.00        1002.00
    Ops Kswapd inode steals             570180.00     5183206.00

o Vanilla kernel is hitting direct reclaim more frequently,
  not very much in absolute terms but the fact the patch
  reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
  dirty pages are being encountered at the tail of the LRU.
  This is generally bad and means in this case that the LRU
  is not long enough for dirty pages to be cleaned by the
  background flush in time. This is much reduced by the
  patch.
o With the patch, kswapd is reclaiming 10 times more slab
  pages than with the vanilla kernel. This is indicative
  of the watermark boosting over-protecting slab

A more complete set of tests were run that were part of the basis for
introducing boosting and while there are some differences, they are well
within tolerances.

Bottom line, the special casing kswapd to avoid slab behaviour is
unpredictable and can lead to abnormal results for normal workloads.

This patch restores the expected behaviour that slab and page cache is
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages.  It also means that if there are workloads that
favour the preservation of slab over pagecache that it can be tuned via
vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
the parameter when boosting is active.

Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "mm, thp: restore node-local hugepage allocations"

This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
hugepage allocations").

commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window.  Now we understood the regression was a false
positive and was caused by a significant increase in fairness during a
swap trashing benchmark.  So it's safe to re-apply the fix and continue
improving the code from there.  The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload.  The
removal of __GFP_THISNODE increased fairness.

__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_node =
3.

Any workload who could have benefited from __GFP_THISNODE has now to
enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.

The performance characteristic of memory depends on the hardware
details.  The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default.  The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).

  D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
  0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A

D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e.  inter socket memory),
etc...

For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:

  D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
  0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A

NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.

After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:

  D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

  D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we shall aim for in the long term (i.e.  the first one at
the top).

After this commit is applied, we can introduce a new allocator multi
order API and to replace those two alloc_pages_vmas calls in the page
fault path, with a single multi order call:

        unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
        page = alloc_pages_multi_order(..., &order);
        if (!page)
         goto out;
        if (!(order & (1 << 0))) {
         VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
         /* THP fault */
        } else {
         VM_WARN_ON(order != 1 << 0);
         /* 4k fallback */
        }

The page allocator logic has to be altered so that when it fails on any
zone with order 9, it has to try again with a order 0 before falling
back to the next zone in the zonelist.

After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""

Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot.  We can safely re-apply them now that we had time
to analyze the problem.

The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.

The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further.  The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.

This patch (of 2):

This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").

Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden.  There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.

This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.

[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used

A compiler throws a warning on an arm64 system since commit 9849a5697d3d
("arch, mm: convert all architectures to use 5level-fixup.h"),

  mm/kasan/init.c: In function 'kasan_free_p4d':
  mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable]
   p4d_t *p4d;
          ^~~

because p4d_none() in "5level-fixup.h" is compiled away while it is a
static inline function in "pgtable-nopud.h".

However, if converted p4d_none() to a static inline there, powerpc would
be unhappy as it reads those in assembler language in
"arch/powerpc/include/asm/book3s/64/pgtable.h", so it needs to skip
assembly include for the static inline C function.

While at it, converted a few similar functions to be consistent with the
ones in "pgtable-nopud.h".

Link: http://lkml.kernel.org/r/20190806232917.881-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

seq_file: fix problem when seeking mid-record

If you use lseek or similar (e.g.  pread) to access a location in a
seq_file file that is within a record, rather than at a record boundary,
then the first read will return the remainder of the record, and the
second read will return the whole of that same record (instead of the
next record).  When seeking to a record boundary, the next record is
correctly returned.

This bug was introduced by a recent patch (identified below).  Before
that patch, seq_read() would increment m->index when the last of the
buffer was returned (m->count == 0).  After that patch, we rely on
->next to increment m->index after filling the buffer - but there was
one place where that didn't happen.

Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/
Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Sergei Turchanov <turchanov@farpost.com>
Tested-by: Sergei Turchanov <turchanov@farpost.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Markus Elfring <Markus.Elfring@web.de>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: workingset: fix vmstat counters for shadow nodes

Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
        virt_to_page(xa_node)->mem_cgroup

Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page().  Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer.  That was a case since the introduction of these
counters by the commit 68d48e6a2df5 ("mm: workingset: add vmstat counter
for shadow nodes").

Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/usercopy: use memory range to be accessed for wraparound check

Currently, when checking to see if accessing n bytes starting at address
"ptr" will cause a wraparound in the memory addresses, the check in
check_bogus_address() adds an extra byte, which is incorrect, as the
range of addresses that will be accessed is [ptr, ptr + (n - 1)].

This can lead to incorrectly detecting a wraparound in the memory
address, when trying to read 4 KB from memory that is mapped to the the
last possible page in the virtual address space, when in fact, accessing
that range of memory would not cause a wraparound to occur.

Use the memory range that will actually be accessed when considering if
accessing a certain amount of bytes will cause the memory address to
wrap around.

Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Trilok Soni <tsoni@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: kmemleak: disable early logging in case of error

If an error occurs during kmemleak_init() (e.g. kmem cache cannot be
created), kmemleak is disabled but kmemleak_early_log remains enabled.
Subsequently, when the .init.text section is freed, the log_early()
function no longer exists. To avoid a page fault in such scenario,
ensure that kmemleak_disable() also disables early logging.

Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/vmalloc.c: fix percpu free VM area search criteria

Recent changes to the vmalloc code by commit 68ad4a330433
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures.  These, in turn, can result
in panic()s in the slub code.  One such possible panic was reported by
Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is,

RIP: 0033:0x7f46f7441b9b
Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
uses two lists (vmap_area_list & free_vmap_area_list) to track the used
and free VM areas in VMALLOC space.  And pcpu_get_vm_areas(offsets[],
sizes[], nr_vms, align) function is used for allocating congruent VM
areas for percpu memory allocator.  In order to not conflict with
VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
VMALLOC space.  So the search for free vm_area for the given requirement
starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for free vm_area in
pcpu_get_vm_areas() involves following two main steps.

Step 1:
    Find a aligned "base" adress near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through number of requested vm_areas and check,
        Step 2.1:
           if (base < VMALLOC_START)
              1. fail with error
        Step 2.2:
           // end is offsets[area] + sizes[area]
           if (base + end > va->vm_end)
               1. Move the base downwards and repeat Step 2
        Step 2.3:
           if (base + start < va->vm_start)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

        Step 2.3:
           if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

Above change is the root cause of spurious percpu memory allocation
failures.  For example, consider a case where a relatively large vm_area
(~ 30 TB) was ignored in free vm_area search because it did not pass the
base + end < vm->vm_end boundary check.  Ignoring such large free
vm_area's would lead to not finding free vm_area within boundary of
VMALLOC_start to VMALLOC_END which in turn leads to allocation failures.

So modify the search algorithm to include Step 2.2.

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memcontrol.c: fix use after free in mem_cgroup_iter()

This patch is sent to report an use after free in mem_cgroup_iter()
after merging commit be2657752e9e ("mm: memcg: fix use after free in
mem_cgroup_iter()").

I work with android kernel tree (4.9 & 4.14), and commit be2657752e9e
("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
to the trees.  However, I can still observe use after free issues
addressed in the commit be2657752e9e.  (on low-end devices, a few times
this month)

backtrace:
        css_tryget <- crash here
        mem_cgroup_iter
        shrink_node
        shrink_zones
        do_try_to_free_pages
        try_to_free_pages
        __perform_reclaim
        __alloc_pages_direct_reclaim
        __alloc_pages_slowpath
        __alloc_pages_nodemask

To debug, I poisoned mem_cgroup before freeing it:

  static void __mem_cgroup_free(struct mem_cgroup *memcg)
        for_each_node(node)
        free_mem_cgroup_per_node_info(memcg, node);
        free_percpu(memcg->stat);
  +     /* poison memcg before freeing it */
  +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
        kfree(memcg);
  }

The coredump shows the position=0xdbbc2a00 is freed.

  (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
  $13 = {position = 0xdbbc2a00, generation = 0x2efd}

  0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
  0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
  0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
  0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
  0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
  0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
  0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
  0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
  0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
  0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
  0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
  0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
  0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
  0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878

In the reclaim path, try_to_free_pages() does not setup
sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
shrink_node().

In mem_cgroup_iter(), root is set to root_mem_cgroup because
sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().

        try_to_free_pages
         struct scan_control sc = {...}, target_mem_cgroup is 0x0;
        do_try_to_free_pages
        shrink_zones
        shrink_node
         mem_cgroup *root = sc->target_mem_cgroup;
         memcg = mem_cgroup_iter(root, NULL, &reclaim);
        mem_cgroup_iter()
         if (!root)
         root = root_mem_cgroup;
         ...

         css = css_next_descendant_pre(css, &root->css);
         memcg = mem_cgroup_from_css(css);
         cmpxchg(&iter->position, pos, memcg);

My device uses memcg non-hierarchical mode.  When we release a memcg:
invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
If non-hierarchical mode is used, invalidate_reclaim_iterators() never
reaches root_mem_cgroup.

  static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  {
        struct mem_cgroup *memcg = dead_memcg;

        for (; memcg; memcg = parent_mem_cgroup(memcg)
        ...
  }

So the use after free scenario looks like:

  CPU1 CPU2

  try_to_free_pages
  do_try_to_free_pages
  shrink_zones
  shrink_node
  mem_cgroup_iter()
      if (!root)
       root = root_mem_cgroup;
      ...
      css = css_next_descendant_pre(css, &root->css);
      memcg = mem_cgroup_from_css(css);
      cmpxchg(&iter->position, pos, memcg);

         invalidate_reclaim_iterators(memcg);
         ...
         __mem_cgroup_free()
         kfree(memcg);

  try_to_free_pages
  do_try_to_free_pages
  shrink_zones
  shrink_node
  mem_cgroup_iter()
      if (!root)
       root = root_mem_cgroup;
      ...
      mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
      iter = &mz->iter[reclaim->priority];
      pos = READ_ONCE(iter->position);
      css_tryget(&pos->css) <- use after free

To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
in invalidate_reclaim_iterators().

[cai@lca.pw: fix -Wparentheses compilation warning]
Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/z3fold.c: fix z3fold_destroy_pool() race condition

The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.

Calling z3fold_deregister_migration() before the workqueues are drained
means that there can be allocated pages referencing a freed inode,
causing any thread in compaction to be able to trip over the bad pointer
in PageMovable().

Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jonathan Adams <jwadams@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/z3fold.c: fix z3fold_destroy_pool() ordering

The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.

If there is work queued on pool->compact_workqueue when it is called,
z3fold_destroy_pool() will do:

   z3fold_destroy_pool()
     destroy_workqueue(pool->release_wq)
     destroy_workqueue(pool->compact_wq)
       drain_workqueue(pool->compact_wq)
         do_compact_page(zhdr)
           kref_put(&zhdr->refcount)
             __release_z3fold_page(zhdr, ...)
               queue_work_on(pool->release_wq, &pool->work) *BOOM*

So compact_wq needs to be destroyed before release_wq.

Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact race")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jonathan Adams <jwadams@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind

When running syzkaller internally, we ran into the below bug on 4.9.x
kernel:

  kernel BUG at mm/huge_memory.c:2124!
  invalid opcode: 0000 [#1] SMP KASAN
  CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
  task: ffff880067b34900 task.stack: ffff880068998000
  RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
  Call Trace:
    split_huge_page include/linux/huge_mm.h:100 [inline]
    queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:90 [inline]
    walk_pgd_range mm/pagewalk.c:116 [inline]
    __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
    walk_page_range+0x154/0x370 mm/pagewalk.c:285
    queue_pages_range+0x115/0x150 mm/mempolicy.c:694
    do_mbind mm/mempolicy.c:1241 [inline]
    SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
    SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
    do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
    entry_SYSCALL_64_after_swapgs+0x5d/0xdb
  Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
  RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
   RSP <ffff88006899f980>

with the below test:

  uint64_t r[1] = {0xffffffffffffffff};

  int main(void)
  {
        syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
                                intptr_t res = 0;
        res = syscall(__NR_socket, 0x11, 3, 0x300);
        if (res != -1)
                r[0] = res;
        *(uint32_t*)0x20000040 = 0x10000;
        *(uint32_t*)0x20000044 = 1;
        *(uint32_t*)0x20000048 = 0xc520;
        *(uint32_t*)0x2000004c = 1;
        syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
        syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
        *(uint64_t*)0x20000340 = 2;
        syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
        return 0;
  }

Actually the test does:

  mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
  socket(AF_PACKET, SOCK_RAW, 768)        = 3
  setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
  mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
  mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0

The setsockopt() would allocate compound pages (16 pages in this test)
for packet tx ring, then the mmap() would call packet_mmap() to map the
pages into the user address space specified by the mmap() call.

When calling mbind(), it would scan the vma to queue the pages for
migration to the new node.  It would split any huge page since 4.9
doesn't support THP migration, however, the packet tx ring compound
pages are not THP and even not movable.  So, the above bug is triggered.

However, the later kernel is not hit by this issue due to commit
d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
which just removes the PageSwapBacked check for a different reason.

But, there is a deeper issue.  According to the semantic of mbind(), it
should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
MPOL_MF_STRICT was also specified, but the kernel was unable to move all
existing pages in the range.  The tx ring of the packet socket is
definitely not movable, however, mbind() returns success for this case.

Although the most socket file associates with non-movable pages, but XDP
may have movable pages from gup.  So, it sounds not fine to just check
the underlying file type of vma in vma_migratable().

Change migrate_page_add() to check if the page is movable or not, if it
is unmovable, just return -EIO.  But do not abort pte walk immediately,
since there may be pages off LRU temporarily.  We should migrate other
pages if MPOL_MF_MOVE* is specified.  Set has_unmovable flag if some
paged could not be not moved, then return -EIO for mbind() eventually.

With this change the above test would return -EIO as expected.

[yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified

When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
try best to migrate misplaced pages, if some of the pages could not be
migrated, then return -EIO.

There are three different sub-cases:
1. vma is not migratable
2. vma is migratable, but there are unmovable pages
3. vma is migratable, pages are movable, but migrate_pages() fails

If #1 happens, kernel would just abort immediately, then return -EIO,
after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified").

If #3 happens, kernel would set policy and migrate pages with
best-effort, but won't rollback the migrated pages and reset the policy
back.

Before that commit, they behaves in the same way. It'd better to keep
their behavior consistent. But, rolling back the migrated pages and
resetting the policy back sounds not feasible, so just make #1 behave as
same as #3.

Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt
rollback, determine which exact page(s) are violating the policy, etc.

Make queue_pages_range() return 1 to indicate there are unmovable pages
or vma is not migratable.

The #2 is not handled correctly in the current kernel, the following
patch will fix it.

[yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/hmm: fix bad subpage pointer in try_to_unmap_one

When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased. This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.

However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:

BUG: unable to handle page fault for address: ffffea1fffffffc8

Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".

[rcampbell@nvidia.com: add comment]
Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/hmm: fix ZONE_DEVICE anon page mapping reuse

When a ZONE_DEVICE private page is freed, the page->mapping field can be
set.  If this page is reused as an anonymous page, the previous value
can prevent the page from being inserted into the CPU's anon rmap table.
For example, when migrating a pte_none() page to device memory:

  migrate_vma(ops, vma, start, end, src, dst, private)
    migrate_vma_collect()
      src[] = MIGRATE_PFN_MIGRATE
    migrate_vma_prepare()
      /* no page to lock or isolate so OK */
    migrate_vma_unmap()
      /* no page to unmap so OK */
    ops->alloc_and_copy()
      /* driver allocates ZONE_DEVICE page for dst[] */
    migrate_vma_pages()
      migrate_vma_insert_page()
        page_add_new_anon_rmap()
          __page_set_anon_rmap()
            /* This check sees the page's stale mapping field */
            if (PageAnon(page))
              return
            /* page->mapping is not updated */

The result is that the migration appears to succeed but a subsequent CPU
fault will be unable to migrate the page back to system memory or worse.

Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
pointer data doesn't affect future page use.

Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
Fixes: b7a523109fb5c9d2d6dd ("mm: don't clear ->mapping in hmm_devmem_free")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: document zone device struct page field usage

Patch series "mm/hmm: fixes for device private page migration", v3.

Testing the latest linux git tree turned up a few bugs with page
migration to and from ZONE_DEVICE private and anonymous pages.
Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
mapping and index fields from the source anonymous page mapping.

This patch (of 3):

Struct page for ZONE_DEVICE private pages uses the page->mapping and and
page->index fields while the source anonymous pages are migrated to
device private memory. This is so rmap_walk() can find the page when
migrating the ZONE_DEVICE private page back to system memory.
ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
page->index fields when files are mapped into a process address space.

Add comments to struct page and remove the unused "_zd_pad_1" field to
make this more clear.

Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'tpmdd-next-20190813' of git://git.infradead.org/users/jjs/linux-tpmdd

Pull tpm fixes from Jarkko Sakkinen:
"One more bug fix for the next release"

* tag 'tpmdd-next-20190813' of git://git.infradead.org/users/jjs/linux-tpmdd:
KEYS: trusted: allow module init if TPM is inactive or deactivated

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"Single lpfc fix, for a single-cpu corner case"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Fix crash when cpu count is 1 and null irq affinity mask

KEYS: trusted: allow module init if TPM is inactive or deactivated

Commit c78719203fc6 ("KEYS: trusted: allow trusted.ko to initialize w/o a
TPM") allows the trusted module to be loaded even if a TPM is not found, to
avoid module dependency problems.

However, trusted module initialization can still fail if the TPM is
inactive or deactivated. tpm_get_random() returns an error.

This patch removes the call to tpm_get_random() and instead extends the PCR
specified by the user with zeros. The security of this alternative is
equivalent to the previous one, as either option prevents with a PCR update
unsealing and misuse of sealed data by a user space process.

Even if a PCR is extended with zeros, instead of random data, it is still
computationally infeasible to find a value as input for a new PCR extend
operation, to obtain again the PCR value that would allow unsealing.

Cc: stable@vger.kernel.org
Fixes: 240730437deb ("KEYS: trusted: explicitly use tpm_chip structure...")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

Linux 5.3-rc4

Merge tag 'dax-fixes-5.3-rc4' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull dax fixes from Dan Williams:
"A filesystem-dax and device-dax fix for v5.3.

  The filesystem-dax fix is tagged for stable as the implementation has
  been mistakenly throwing away all cow pages on any truncate or hole
  punch operation as part of the solution to coordinate device-dma vs
  truncate to dax pages.

  The device-dax change fixes up a regression this cycle from the
  introduction of a common 'internal per-cpu-ref' implementation.

  Summary:

   - Fix dax_layout_busy_page() to not discard private cow pages of
     fs/dax private mappings.

   - Update the memremap_pages core to properly cleanup on behalf of
     internal reference-count users like device-dax"

* tag 'dax-fixes-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  mm/memremap: Fix reuse of pgmap instances with internal references
  dax: dax_layout_busy_page() should not unmap cow pages

Merge tag 'ntb-5.3-bugfixes' of git://github.com/jonmason/ntb

Pull NTB fix from Jon Mason:
"Bug fix for NTB MSI kernel compile warning"

* tag 'ntb-5.3-bugfixes' of git://github.com/jonmason/ntb:
NTB/msi: remove incorrect MODULE defines

Merge tag 'riscv/for-v5.3-rc4' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V updates from Paul Walmsley:
"A few minor RISC-V updates for v5.3-rc4:

   - Remove __udivdi3() from the 32-bit Linux port, converting the only
     upstream user to use do_div(), per Linux policy

   - Convert the RISC-V standard clocksource away from per-cpu data
     structures, since only one is used by Linux, even on a multi-CPU
     system

   - A set of DT binding updates that remove an obsolete text binding in
     favor of a YAML binding, fix a bogus compatible string in the
     schema (thus fixing a "make dtbs_check" warning), and clarifies the
     future values expected in one of the RISC-V CPU properties"

* tag 'riscv/for-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  dt-bindings: riscv: fix the schema compatible string for the HiFive Unleashed board
  dt-bindings: riscv: remove obsolete cpus.txt
  RISC-V: Remove udivdi3
  riscv: delay: use do_div() instead of __udivdi3()
  dt-bindings: Update the riscv,isa string description
  RISC-V: Remove per cpu clocksource

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"A few fixes for x86:

   - Don't reset the carefully adjusted build flags for the purgatory
     and remove the unwanted flags instead. The 'reset all' approach led
     to build fails under certain circumstances.

   - Unbreak CLANG build of the purgatory by avoiding the builtin
     memcpy/memset implementations.

   - Address missing prototype warnings by including the proper header

   - Fix yet more fall-through issues"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib/cpu: Address missing prototypes warning
  x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS
  x86/purgatory: Do not use __builtin_memcpy and __builtin_memset
  x86: mtrr: cyrix: Mark expected switch fall-through
  x86/ptrace: Mark expected switch fall-through

Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf tooling fixes from Thomas Gleixner:
"Perf tooling fixes all over the place:

   - Fix the selection of the main thread COMM in db-export

   - Fix the disassemmbly display for BPF in annotate

   - Fix cpumap mask setup in perf ftrace when only one CPU is present

   - Add the missing 'cpu_clk_unhalted.core' event

   - Fix CPU 0 bindings in NUMA benchmarks

   - Fix the module size calculations for s390

   - Handle the gap between kernel end and module start on s390
     correctly

   - Build and typo fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf pmu-events: Fix missing "cpu_clk_unhalted.core" event
  perf annotate: Fix s390 gap between kernel end and module start
  perf record: Fix module size on s390
  perf tools: Fix include paths in ui directory
  perf tools: Fix a typo in a variable name in the Documentation Makefile
  perf cpumap: Fix writing to illegal memory in handling cpumap mask
  perf ftrace: Fix failure to set cpumask when only one cpu is present
  perf db-export: Fix thread__exec_comm()
  perf annotate: Fix printing of unaugmented disassembled instructions from BPF
  perf bench numa: Fix cpu0 binding

Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
"Three fixlets for the scheduler:

   - Avoid double bandwidth accounting in the push & pull code

   - Use a sane FIFO priority for the Pressure Stall Information (PSI)
     thread.

   - Avoid permission checks when setting the scheduler params for the
     PSI thread"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Do not require setsched permission from the trigger creator
  sched/psi: Reduce psimon FIFO priority
  sched/deadline: Fix double accounting of rq/running bw in push & pull

Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
"A small fix for the affinity spreading code.

  It failed to handle situations where a single vector was requested
  either due to only one CPU being available or vector exhaustion
  causing only a single interrupt to be granted.

  The fix is to simply remove the requirement in the affinity spreading
  code for more than one interrupt being available"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Create affinity mask for single vector

Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull objtool warning fix from Thomas Gleixner:
"The recent objtool fixes/enhancements unearthed a unbalanced CLAC in
  the i915 driver.

  Chris asked me to pick the fix up and route it through"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  drm/i915: Remove redundant user_access_end() from __copy_from_user() error path

Merge tag 'gfs2-v5.3-rc3.fixes' of git://git./linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
"Fix incorrect lseek / fiemap results"

* tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_walk_metadata fix

Makefile: Convert -Wimplicit-fallthrough=3 to just -Wimplicit-fallthrough for clang

A compilation -Wimplicit-fallthrough warning was enabled by commit
a035d552a93b ("Makefile: Globally enable fall-through warning")

Even though clang 10.0.0 does not currently support this warning without
a patch, clang currently does not support a value for this option.

Link: https://bugs.llvm.org/show_bug.cgi?id=39382
The gcc default for this warning is 3 so removing the =3 has no effect
for gcc and enables the warning for patched versions of clang.

Also remove the =3 from an existing use in a parisc Makefile:
arch/parisc/math-emu/Makefile

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-and-tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'char-misc-5.3-rc4' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes Greg KH:
"Here are some small char/misc driver fixes for 5.3-rc4.

  Two of these are for the habanalabs driver for issues found when
  running on a big-endian system (are they still alive?) The others are
  tiny fixes reported by people, and a MAINTAINERS update about the
  location of the fpga development tree.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  coresight: Fix DEBUG_LOCKS_WARN_ON for uninitialized attribute
  MAINTAINERS: Move linux-fpga tree to new location
  nvmem: Use the same permissions for eeprom as for nvmem
  habanalabs: fix host memory polling in BE architecture
  habanalabs: fix F/W download in BE architecture

Merge tag 'driver-core-5.3-rc4' of git://git./linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
"Here are two small fixes for some driver core issues that have been
  reported. There is also a kernfs "fix" here, which was then reverted
  because it was found to cause problems in linux-next.

  The driver core fixes both resolve reported issues, one with gpioint
  stuff that showed up in 5.3-rc1, and the other finally (and hopefully)
  resolves a very long standing race when removing glue directories.
  It's nice to get that issue finally resolved and the developers
  involved should be applauded for the persistence it took to get this
  patch finally accepted.

  All of these have been in linux-next for a while with no reported
  issues. Well, the one reported issue, hence the revert :)"

* tag 'driver-core-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  Revert "kernfs: fix memleak in kernel_ops_readdir()"
  kernfs: fix memleak in kernel_ops_readdir()
  driver core: Fix use-after-free and double free on glue directory
  driver core: platform: return -ENXIO for missing GpioInt

Merge tag 'tty-5.3-rc4' of git://git./linux/kernel/git/gregkh/tty

Pull tty fix from Greg KH:
"Here is a single tty kgdb fix for 5.3-rc4.

  It fixes an annoying log message that has caused kdb to become
  useless. It's another fallout from commit ddde3c18b700 ("vt: More
  locking checks") which tries to enforce locking checks more strictly
  in the tty layer, unfortunatly when kdb is stopped, there's no need
  for locks :)

  This patch has been linux-next for a while with no reported issues"

* tag 'tty-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  kgdboc: disable the console lock when in kgdb

Merge tag 'staging-5.3-rc4' of git://git./linux/kernel/git/gregkh/staging

Pull staging / IIO driver fixes from Greg KH:
"Here are some small staging and IIO driver fixes for 5.3-rc4.

  Nothing major, just resolutions for a number of small reported issues,
  full details in the shortlog.

  All have been in linux-next for a while with no reported issues"

* tag 'staging-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: adc: gyroadc: fix uninitialized return code
  docs: generic-counter.rst: fix broken references for ABI file
  staging: android: ion: Bail out upon SIGKILL when allocating memory.
  Staging: fbtft: Fix GPIO handling
  staging: unisys: visornic: Update the description of 'poll_for_irq()'
  staging: wilc1000: flush the workqueue before deinit the host
  staging: gasket: apex: fix copy-paste typo
  Staging: fbtft: Fix reset assertion when using gpio descriptor
  Staging: fbtft: Fix probing of gpio descriptor
  iio: imu: mpu6050: add missing available scan masks
  iio: cros_ec_accel_legacy: Fix incorrect channel setting
  IIO: Ingenic JZ47xx: Set clock divider on probe
  iio: adc: max9611: Fix misuse of GENMASK macro

Merge tag 'usb-5.3-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB fixes for 5.3-rc4.

  The "biggest" one here is moving code from one file to another in
  order to fix a long-standing race condition with the creation of sysfs
  files for USB devices. Turns out that there are now userspace tools
  out there that are hitting this long-known bug, so it's time to fix
  them. Thankfully the tool-maker in this case fixed the issue :)

  The other patches in here are all fixes for reported issues. Now that
  syzbot knows how to fuzz USB drivers better, and is starting to now
  fuzz the userspace facing side of them at the same time, there will be
  more and more small fixes like these coming, which is a good thing.

  All of these have been in linux-next with no reported issues"

* tag 'usb-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: setup authorized_default attributes using usb_bus_notify
  usb: iowarrior: fix deadlock on disconnect
  Revert "USB: rio500: simplify locking"
  usb: usbfs: fix double-free of usb memory upon submiturb error
  usb: yurex: Fix use-after-free in yurex_delete
  usb: typec: tcpm: Ignore unsupported/unknown alternate mode requests
  xhci: Fix NULL pointer dereference at endpoint zero reset.
  usb: host: xhci-rcar: Fix timeout in xhci_suspend()
  usb: typec: ucsi: ccg: Fix uninitilized symbol error
  usb: typec: tcpm: remove tcpm dir if no children
  usb: typec: tcpm: free log buf memory when remove debug file
  usb: typec: tcpm: Add NULL check before dereferencing config

dma-mapping: fix page attributes for dma_mmap_*

All the way back to introducing dma_common_mmap we've defaulted to mark
the pages as uncached. But this is wrong for DMA coherent devices.
Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
flag is only treated special on the alloc side for non-coherent devices.

Introduce a new dma_pgprot helper that deals with the check for coherent
devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
and we thus ensure no aliasing of page attributes happens, which makes
the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
remaining ones.

Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
we'll phase it out soon.

Fixes: 64ccc9c033c6 ("common: dma-mapping: add support for generic dma_mmap_* calls")
Reported-by: Shawn Anastasio <shawn@anastas.io>
Reported-by: Gavin Li <git@thegavinli.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64

dma-direct: don't truncate dma_required_mask to bus addressing capabilities

The dma required_mask needs to reflect the actual addressing capabilities
needed to handle the whole system RAM. When truncated down to the bus
addressing capabilities dma_addressing_limited() will incorrectly signal
no limitations for devices which are restricted by the bus_dma_mask.

Fixes: b4ebe6063204 (dma-direct: implement complete bus_dma_mask handling)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING

The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign
a dma_addr to work. Also skip it if the architecture needs
forced decryption handling, as that needs a kernel virtual
address.

Fixes: d98849aff879 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>

Merge tag 'pinctrl-v5.3-2' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

- Delay acquisition of regmaps in the Aspeed G5 driver.

- Make a symbol static to reduce compiler noise.

* tag 'pinctrl-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: aspeed: Make aspeed_pinmux_ips static
pinctrl: aspeed-g5: Delay acquisition of regmaps

Merge tag 'powerpc-5.3-4' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
"Just one fix, a revert of a commit that was meant to be a minor
  improvement to some inline asm, but ended up having no real benefit
  with GCC and broke booting 32-bit machines when using Clang.

  Thanks to: Arnd Bergmann, Christophe Leroy, Nathan Chancellor, Nick
  Desaulniers, Segher Boessenkool"

* tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  Revert "powerpc: slightly improve cache helpers"

Merge tag 'Wimplicit-fallthrough-5.3-rc4' of git://git./linux/kernel/git/gustavoars/linux

Pull fall-through fixes from Gustavo A. R. Silva:
"Mark more switch cases where we are expecting to fall through, fixing
  fall-through warnings in arm, sparc64, mips, i386 and s390"

* tag 'Wimplicit-fallthrough-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  ARM: ep93xx: Mark expected switch fall-through
  scsi: fas216: Mark expected switch fall-throughs
  pcmcia: db1xxx_ss: Mark expected switch fall-throughs
  video: fbdev: omapfb_main: Mark expected switch fall-throughs
  watchdog: riowd: Mark expected switch fall-through
  s390/net: Mark expected switch fall-throughs
  crypto: ux500/crypt: Mark expected switch fall-throughs
  watchdog: wdt977: Mark expected switch fall-through
  watchdog: scx200_wdt: Mark expected switch fall-through
  watchdog: Mark expected switch fall-throughs
  ARM: signal: Mark expected switch fall-through
  mfd: omap-usb-host: Mark expected switch fall-throughs
  mfd: db8500-prcmu: Mark expected switch fall-throughs
  ARM: OMAP: dma: Mark expected switch fall-throughs
  ARM: alignment: Mark expected switch fall-throughs
  ARM: tegra: Mark expected switch fall-through
  ARM/hw_breakpoint: Mark expected switch fall-throughs

Merge tag 'kbuild-fixes-v5.3-3' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- revive single target %.ko

- do not create built-in.a where it is unneeded

- do not create modules.order where it is unneeded

- show a warning if subdir-y/m is used to visit a module Makefile

* tag 'kbuild-fixes-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: show hint if subdir-y/m is used to visit module Makefile
  kbuild: generate modules.order only in directories visited by obj-y/m
  kbuild: fix false-positive need-builtin calculation
  kbuild: revive single target %.ko

ARM: ep93xx: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

Fix the following warnings (Building: arm-ep93xx_defconfig arm):

arch/arm/mach-ep93xx/crunch.c: In function 'crunch_do':
arch/arm/mach-ep93xx/crunch.c:46:3: warning: this statement may
fall through [-Wimplicit-fallthrough=]
      memset(crunch_state, 0, sizeof(*crunch_state));
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/arm/mach-ep93xx/crunch.c:53:2: note: here
     case THREAD_NOTIFY_EXIT:
     ^~~~

Notice that, in this particular case, the code comment is
modified in accordance with what GCC is expecting to find.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

scsi: fas216: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

Fix the following warnings (Building: rpc_defconfig arm):

drivers/scsi/arm/fas216.c: In function ‘fas216_disconnect_intr’:
drivers/scsi/arm/fas216.c:913:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (fas216_get_last_msg(info, info->scsi.msgin_fifo) == ABORT) {
      ^
drivers/scsi/arm/fas216.c:919:2: note: here
  default:    /* huh?     */
  ^~~~~~~
drivers/scsi/arm/fas216.c: In function ‘fas216_kick’:
drivers/scsi/arm/fas216.c:1959:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   fas216_allocate_tag(info, SCpnt);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/scsi/arm/fas216.c:1960:2: note: here
  case TYPE_OTHER:
  ^~~~
drivers/scsi/arm/fas216.c: In function ‘fas216_busservice_intr’:
drivers/scsi/arm/fas216.c:1413:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   fas216_stoptransfer(info);
   ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/scsi/arm/fas216.c:1414:2: note: here
  case STATE(STAT_STATUS, PHASE_SELSTEPS):/* Sel w/ steps -> Status       */
  ^~~~
drivers/scsi/arm/fas216.c:1424:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   fas216_stoptransfer(info);
   ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/scsi/arm/fas216.c:1425:2: note: here
  case STATE(STAT_MESGIN, PHASE_COMMAND): /* Command -> Message In */
  ^~~~
drivers/scsi/arm/fas216.c: In function ‘fas216_funcdone_intr’:
drivers/scsi/arm/fas216.c:1573:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if ((stat & STAT_BUSMASK) == STAT_MESGIN) {
      ^
drivers/scsi/arm/fas216.c:1579:2: note: here
  default:
  ^~~~~~~
drivers/scsi/arm/fas216.c: In function ‘fas216_handlesync’:
drivers/scsi/arm/fas216.c:605:20: warning: this statement may fall through [-Wimplicit-fallthrough=]
   info->scsi.phase = PHASE_MSGOUT_EXPECT;
   ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
drivers/scsi/arm/fas216.c:607:2: note: here
  case async:
  ^~~~

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

pcmcia: db1xxx_ss: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings (Building: db1xxx_defconfig mips):

drivers/pcmcia/db1xxx_ss.c:257:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/pcmcia/db1xxx_ss.c:269:3: warning: this statement may fall through [-Wimplicit-fallthrough=]

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

video: fbdev: omapfb_main: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning (Building: omap1_defconfig arm):

drivers/watchdog/wdt285.c:170:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/watchdog/ar7_wdt.c:237:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:449:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1549:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1547:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1545:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1543:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1540:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1538:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
drivers/video/fbdev/omap/omapfb_main.c:1535:3: warning: this statement may fall through [-Wimplicit-fallthrough=]

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

watchdog: riowd: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings (Building: sparc64):

drivers/watchdog/riowd.c: In function ‘riowd_ioctl’:
drivers/watchdog/riowd.c:136:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   riowd_writereg(p, riowd_timeout, WDTO_INDEX);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/watchdog/riowd.c:139:2: note: here
  case WDIOC_GETTIMEOUT:
  ^~~~

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

s390/net: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings (Building: s390):

drivers/s390/net/ctcm_fsms.c: In function ‘ctcmpc_chx_attnbusy’:
drivers/s390/net/ctcm_fsms.c:1703:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (grp->changed_side == 1) {
      ^
drivers/s390/net/ctcm_fsms.c:1707:2: note: here
  case MPCG_STATE_XID0IOWAIX:
  ^~~~

drivers/s390/net/ctcm_mpc.c: In function ‘ctc_mpc_alloc_channel’:
drivers/s390/net/ctcm_mpc.c:358:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (callback)
      ^
drivers/s390/net/ctcm_mpc.c:360:2: note: here
  case MPCG_STATE_XID0IOWAIT:
  ^~~~

drivers/s390/net/ctcm_mpc.c: In function ‘mpc_action_timeout’:
drivers/s390/net/ctcm_mpc.c:1469:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if ((fsm_getstate(rch->fsm) == CH_XID0_PENDING) &&
      ^
drivers/s390/net/ctcm_mpc.c:1472:2: note: here
  default:
  ^~~~~~~
drivers/s390/net/ctcm_mpc.c: In function ‘mpc_send_qllc_discontact’:
drivers/s390/net/ctcm_mpc.c:2087:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (grp->estconnfunc) {
      ^
drivers/s390/net/ctcm_mpc.c:2092:2: note: here
  case MPCG_STATE_FLOWC:
  ^~~~

drivers/s390/net/qeth_l2_main.c: In function ‘qeth_l2_process_inbound_buffer’:
drivers/s390/net/qeth_l2_main.c:328:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (IS_OSN(card)) {
       ^
drivers/s390/net/qeth_l2_main.c:337:3: note: here
   default:
   ^~~~~~~

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

crypto: ux500/crypt: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning (Building: arm):

drivers/crypto/ux500/cryp/cryp.c: In function ‘cryp_save_device_context’:
drivers/crypto/ux500/cryp/cryp.c:316:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
   ctx->key_4_r = readl_relaxed(&src_reg->key_4_r);
drivers/crypto/ux500/cryp/cryp.c:318:2: note: here
  case CRYP_KEY_SIZE_192:
  ^~~~
drivers/crypto/ux500/cryp/cryp.c:320:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
   ctx->key_3_r = readl_relaxed(&src_reg->key_3_r);
drivers/crypto/ux500/cryp/cryp.c:322:2: note: here
  case CRYP_KEY_SIZE_128:
  ^~~~
drivers/crypto/ux500/cryp/cryp.c:324:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
   ctx->key_2_r = readl_relaxed(&src_reg->key_2_r);
drivers/crypto/ux500/cryp/cryp.c:326:2: note: here
  default:
  ^~~~~~~
In file included from ./include/linux/io.h:13:0,
                 from drivers/crypto/ux500/cryp/cryp_p.h:14,
                 from drivers/crypto/ux500/cryp/cryp.c:15:
drivers/crypto/ux500/cryp/cryp.c: In function ‘cryp_restore_device_context’:
./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
#define __raw_writel __raw_writel
                      ^
./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’
#define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c)
                             ^~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:363:3: note: in expansion of macro ‘writel_relaxed’
   writel_relaxed(ctx->key_4_r, &reg->key_4_r);
   ^~~~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:365:2: note: here
  case CRYP_KEY_SIZE_192:
  ^~~~
In file included from ./include/linux/io.h:13:0,
                 from drivers/crypto/ux500/cryp/cryp_p.h:14,
                 from drivers/crypto/ux500/cryp/cryp.c:15:
./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
#define __raw_writel __raw_writel
                      ^
./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’
#define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c)
                             ^~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:367:3: note: in expansion of macro ‘writel_relaxed’
   writel_relaxed(ctx->key_3_r, &reg->key_3_r);
   ^~~~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:369:2: note: here
  case CRYP_KEY_SIZE_128:
  ^~~~
In file included from ./include/linux/io.h:13:0,
                 from drivers/crypto/ux500/cryp/cryp_p.h:14,
                 from drivers/crypto/ux500/cryp/cryp.c:15:
./arch/arm/include/asm/io.h:92:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
#define __raw_writel __raw_writel
                      ^
./arch/arm/include/asm/io.h:299:29: note: in expansion of macro ‘__raw_writel’
#define writel_relaxed(v,c) __raw_writel((__force u32) cpu_to_le32(v),c)
                             ^~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:371:3: note: in expansion of macro ‘writel_relaxed’
   writel_relaxed(ctx->key_2_r, &reg->key_2_r);
   ^~~~~~~~~~~~~~
drivers/crypto/ux500/cryp/cryp.c:373:2: note: here
  default:
  ^~~~~~~

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

watchdog: wdt977: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning (Building: arm):

drivers/watchdog/wdt977.c: In function ‘wdt977_ioctl’:
  LD [M]  drivers/media/platform/vicodec/vicodec.o
drivers/watchdog/wdt977.c:400:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   wdt977_keepalive();
   ^~~~~~~~~~~~~~~~~~
drivers/watchdog/wdt977.c:403:2: note: here
  case WDIOC_GETTIMEOUT:
  ^~~~

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

watchdog: scx200_wdt: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning (Building: i386):

drivers/watchdog/scx200_wdt.c: In function ‘scx200_wdt_ioctl’:
drivers/watchdog/scx200_wdt.c:188:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   scx200_wdt_ping();
   ^~~~~~~~~~~~~~~~~
drivers/watchdog/scx200_wdt.c:189:2: note: here
  case WDIOC_GETTIMEOUT:
  ^~~~

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

watchdog: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

drivers/watchdog/ar7_wdt.c: warning: this statement may fall
through [-Wimplicit-fallthrough=]:  => 237:3
drivers/watchdog/pcwd.c: warning: this statement may fall
through [-Wimplicit-fallthrough=]:  => 653:3
drivers/watchdog/sb_wdog.c: warning: this statement may fall
through [-Wimplicit-fallthrough=]:  => 204:3
drivers/watchdog/wdt.c: warning: this statement may fall
through [-Wimplicit-fallthrough=]:  => 391:3

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

ARM: signal: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning:

arch/arm/kernel/signal.c: In function 'do_signal':
arch/arm/kernel/signal.c:598:12: warning: this statement may fall through [-Wimplicit-fallthrough=]
    restart -= 2;
    ~~~~~~~~^~~~
arch/arm/kernel/signal.c:599:3: note: here
   case -ERESTARTNOHAND:
   ^~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

mfd: omap-usb-host: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

drivers/mfd/omap-usb-host.c: In function 'usbhs_runtime_resume':
drivers/mfd/omap-usb-host.c:303:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (!IS_ERR(omap->hsic480m_clk[i])) {
       ^
drivers/mfd/omap-usb-host.c:313:3: note: here
   case OMAP_EHCI_PORT_MODE_TLL:
   ^~~~
drivers/mfd/omap-usb-host.c: In function 'usbhs_runtime_suspend':
drivers/mfd/omap-usb-host.c:345:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (!IS_ERR(omap->hsic480m_clk[i]))
       ^
drivers/mfd/omap-usb-host.c:349:3: note: here
   case OMAP_EHCI_PORT_MODE_TLL:
   ^~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

mfd: db8500-prcmu: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

drivers/mfd/db8500-prcmu.c: In function 'dsiclk_rate':
drivers/mfd/db8500-prcmu.c:1592:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
   div *= 2;
   ~~~~^~~~
drivers/mfd/db8500-prcmu.c:1593:2: note: here
  case PRCM_DSI_PLLOUT_SEL_PHI_2:
  ^~~~
drivers/mfd/db8500-prcmu.c:1594:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
   div *= 2;
   ~~~~^~~~
drivers/mfd/db8500-prcmu.c:1595:2: note: here
  case PRCM_DSI_PLLOUT_SEL_PHI:
  ^~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

ARM: OMAP: dma: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

arch/arm/plat-omap/dma.c: In function 'omap_set_dma_src_burst_mode':
arch/arm/plat-omap/dma.c:384:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (dma_omap2plus()) {
      ^
arch/arm/plat-omap/dma.c:393:2: note: here
  case OMAP_DMA_DATA_BURST_16:
  ^~~~
arch/arm/plat-omap/dma.c:394:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (dma_omap2plus()) {
      ^
arch/arm/plat-omap/dma.c:402:2: note: here
  default:
  ^~~~~~~
arch/arm/plat-omap/dma.c: In function 'omap_set_dma_dest_burst_mode':
arch/arm/plat-omap/dma.c:473:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (dma_omap2plus()) {
      ^
arch/arm/plat-omap/dma.c:481:2: note: here
  default:
  ^~~~~~~

Notice that, in this particular case, the code comment is
modified in accordance with what GCC is expecting to find.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

ARM: alignment: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

arch/arm/mm/alignment.c: In function 'thumb2arm':
arch/arm/mm/alignment.c:688:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if ((tinstr & (3 << 9)) == 0x0400) {
      ^
arch/arm/mm/alignment.c:700:2: note: here
  default:
  ^~~~~~~
arch/arm/mm/alignment.c: In function 'do_alignment_t32_to_handler':
arch/arm/mm/alignment.c:753:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
   poffset->un = (tinst2 & 0xff) << 2;
   ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
arch/arm/mm/alignment.c:754:2: note: here
  case 0xe940:
  ^~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

ARM: tegra: Mark expected switch fall-through

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning:

arch/arm/mach-tegra/reset.c: In function 'tegra_cpu_reset_handler_enable':
arch/arm/mach-tegra/reset.c:72:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   tegra_cpu_reset_handler_set(reset_address);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/arm/mach-tegra/reset.c:74:2: note: here
  case 0:
  ^~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

ARM/hw_breakpoint: Mark expected switch fall-throughs

Mark switch cases where we are expecting to fall through.

This patch fixes the following warnings:

arch/arm/kernel/hw_breakpoint.c: In function 'hw_breakpoint_arch_parse':
arch/arm/kernel/hw_breakpoint.c:609:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (hw->ctrl.len == ARM_BREAKPOINT_LEN_2)
      ^
arch/arm/kernel/hw_breakpoint.c:611:2: note: here
  case 3:
  ^~~~
arch/arm/kernel/hw_breakpoint.c:613:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (hw->ctrl.len == ARM_BREAKPOINT_LEN_1)
      ^
arch/arm/kernel/hw_breakpoint.c:615:2: note: here
  default:
  ^~~~~~~
arch/arm/kernel/hw_breakpoint.c: In function 'arch_build_bp_info':
arch/arm/kernel/hw_breakpoint.c:544:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if ((hw->ctrl.type != ARM_BREAKPOINT_EXECUTE)
      ^
arch/arm/kernel/hw_breakpoint.c:547:2: note: here
  default:
  ^~~~~~~
In file included from include/linux/kernel.h:11,
                 from include/linux/list.h:9,
                 from include/linux/preempt.h:11,
                 from include/linux/hardirq.h:5,
                 from arch/arm/kernel/hw_breakpoint.c:16:
arch/arm/kernel/hw_breakpoint.c: In function 'hw_breakpoint_pending':
include/linux/compiler.h:78:22: warning: this statement may fall through [-Wimplicit-fallthrough=]
# define unlikely(x) __builtin_expect(!!(x), 0)
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/bug.h:136:2: note: in expansion of macro 'unlikely'
  unlikely(__ret_warn_on);     \
  ^~~~~~~~
arch/arm/kernel/hw_breakpoint.c:863:3: note: in expansion of macro 'WARN'
   WARN(1, "Asynchronous watchpoint exception taken. Debugging results may be unreliable\n");
   ^~~~
arch/arm/kernel/hw_breakpoint.c:864:2: note: here
  case ARM_ENTRY_SYNC_WATCHPOINT:
  ^~~~
arch/arm/kernel/hw_breakpoint.c: In function 'core_has_os_save_restore':
arch/arm/kernel/hw_breakpoint.c:910:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (oslsr & ARM_OSLSR_OSLM0)
      ^
arch/arm/kernel/hw_breakpoint.c:912:2: note: here
  default:
  ^~~~~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"Bugfixes (arm and x86) and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: kvm: Adding config fragments
  KVM: selftests: Update gitignore file for latest changes
  kvm: remove unnecessary PageReserved check
  KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable
  KVM: arm: Don't write junk to CP15 registers on reset
  KVM: arm64: Don't write junk to sysregs on reset
  KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block
  x86: kvm: remove useless calls to kvm_para_available
  KVM: no need to check return value of debugfs_create functions
  KVM: remove kvm_arch_has_vcpu_debugfs()
  KVM: Fix leak vCPU's VMCS value into other pCPU
  KVM: Check preempted_in_kernel for involuntary preemption
  KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire
  arm64: KVM: hyp: debug-sr: Mark expected switch fall-through
  KVM: arm64: Update kvm_arm_exception_class and esr_class_str for new EC
  KVM: arm: vgic-v3: Mark expected switch fall-through
  arm64: KVM: regmap: Fix unexpected switch fall-through
  KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter index

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

- newer systems with Elan touchpads will be switched over to SMBus

- HP Spectre X360 will be using SMbus/RMI4

- checks for invalid USB descriptors in kbtab and iforce

- build fixes for applespi driver (misconfigs)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: iforce - add sanity checks
  Input: applespi - use struct_size() helper
  Input: kbtab - sanity check for endpoint type
  Input: usbtouchscreen - initialize PM mutex before using it
  Input: applespi - add dependency on LEDS_CLASS
  Input: synaptics - enable RMI mode for HP Spectre X360
  Input: elantech - annotate fall-through case in elantech_use_host_notify()
  Input: elantech - enable SMBus on new (2018+) systems
  Input: applespi - fix trivial typo in struct description
  Input: applespi - select CRC16 module
  Input: applespi - fix warnings detected by sparse

mm/memremap: Fix reuse of pgmap instances with internal references

Currently, attempts to shutdown and re-enable a device-dax instance
trigger:

    Missing reference count teardown definition
    WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
    [..]
    RIP: 0010:devm_memremap_pages+0x234/0x850
    [..]
    Call Trace:
     dev_dax_probe+0x66/0x190 [device_dax]
     really_probe+0xef/0x390
     driver_probe_device+0xb4/0x100
     device_driver_attach+0x4f/0x60

Given that the setup path initializes pgmap->ref, arrange for it to be
also torn down so devm_memremap_pages() is ready to be called again and
not be mistaken for the 3rd-party per-cpu-ref case.

Fixes: 24917f6b1041 ("memremap: provide an optional internal refcount in struct dev_pagemap")
Reported-by: Fan Du <fan.du@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

drm/i915: Remove redundant user_access_end() from __copy_from_user() error path

Objtool reports:

drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: .altinstr_replacement+0x36: redundant UACCESS disable

__copy_from_user() already does both STAC and CLAC, so the
user_access_end() in its error path adds an extra unnecessary CLAC.

Fixes: 0b2c8f8b6b0c ("i915: fix missing user_access_end() in page fault exception case")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://github.com/ClangBuiltLinux/linux/issues/617
Link: https://lkml.kernel.org/r/51a4155c5bc2ca847a9cbe85c1c11918bb193141.1564086017.git.jpoimboe@redhat.com

kbuild: show hint if subdir-y/m is used to visit module Makefile

Since commit ff9b45c55b26 ("kbuild: modpost: read modules.order instead
of $(MODVERDIR)/*.mod"), a module is no longer built in the following
pattern:

  [Makefile]
  subdir-y := some-module

  [some-module/Makefile]
  obj-m := some-module.o

You cannot write Makefile this way in upstream because modules.order is
not correctly generated. subdir-y is used to descend to a sub-directory
that builds tools, device trees, etc.

For external modules, the modules order does not matter. So, the
Makefile above was known to work.

I believe the Makefile should be re-written as follows:

  [Makefile]
  obj-m := some-module/

  [some-module/Makefile]
  obj-m := some-module.o

However, people will have no idea if their Makefile suddenly stops
working. In fact, I received questions from multiple people.

Show a warning for a while if obj-m is specified in a Makefile visited
by subdir-y or subdir-m.

I touched the %/ rule to avoid false-positive warnings for the single
target.

Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Tom Stonecypher <thomas.edwardx.stonecypher@intel.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>

kbuild: generate modules.order only in directories visited by obj-y/m

The modules.order files in directories visited by the chain of obj-y
or obj-m are merged to the upper-level ones, and become parts of the
top-level modules.order. On the other hand, there is no need to
generate modules.order in directories visited by subdir-y or subdir-m
since they would become orphan anyway.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

kbuild: fix false-positive need-builtin calculation

The current implementation of need-builtin is false-positive,
for example, in the following Makefile:

obj-m := foo/
obj-y := foo/bar/

..., where foo/built-in.a is not required.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

kbuild: revive single target %.ko

I removed the single target %.ko in commit ff9b45c55b26 ("kbuild:
modpost: read modules.order instead of $(MODVERDIR)/*.mod") because
the modpost stage does not work reliably. For instance, the module
dependency, modversion, etc. do not work if we lack symbol information
from the other modules.

Yet, some people still want to build only one module in their interest,
and it may be still useful if it is used within those limitations.

Fixes: ff9b45c55b26 ("kbuild: modpost: read modules.order instead of $(MODVERDIR)/*.mod")
Reported-by: Don Brace <don.brace@microsemi.com>
Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

Merge tag 'drm-fixes-2019-08-09' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Usual fixes roundup. Nothing too crazy or serious, one non-released
  ioctl is removed in the amdkfd driver.

  core:
   - mode parser strncpy fix

  i915:
   - GLK DSI escape clock setting
   - HDCP memleak fix

  tegra:
   - one gpiod/of regression fix

  amdgpu:
   - fix VCN to handle the latest navi10 firmware
   - fix for fan control on navi10
   - properly handle SMU metrics table on navi10
   - fix a resume regression on Stoney
   - kfd revert a GWS ioctl

  vmwgfx:
   - memory leak fix

  rockchip:
   - suspend fix"

* tag 'drm-fixes-2019-08-09' of git://anongit.freedesktop.org/drm/drm:
  drm/vmwgfx: fix memory leak when too many retries have occurred
  Revert "drm/amdkfd: New IOCTL to allocate queue GWS"
  Revert "drm/amdgpu: fix transform feedback GDS hang on gfx10 (v2)"
  drm/amdgpu: pin the csb buffer on hw init for gfx v8
  drm/rockchip: Suspend DP late
  drm/i915: Fix wrong escape clock divisor init for GLK
  drm/i915: fix possible memory leak in intel_hdcp_auth_downstream()
  drm/modes: Fix unterminated strncpy
  drm/amd/powerplay: correct navi10 vcn powergate
  drm/amd/powerplay: honor hw limit on fetching metrics data for navi10
  drm/amd/powerplay: Allow changing of fan_control in smu_v11_0
  drm/amd/amdgpu/vcn_v2_0: Move VCN 2.0 specific dec ring test to vcn_v2_0
  drm/amd/amdgpu/vcn_v2_0: Mark RB commands as KMD commands
  drm/tegra: Fix gpiod_get_from_of_node() regression

Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
"Fix bad_pte warning caused by pte_mkdevmap() not setting PTE_SPECIAL"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64

Merge tag 's390-5.3-5' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Vasily Gorbik:

- Map vdso also for statically linked binaries like all other
   architectures.

- Fix no .bss usage compile-time check to account common objects with
   the help of binutils size tool. Top level Makefile change acked-by
   Masahiro.

- A fix to make perf happy with _etext symbol type.

- Fix dump_pagetables which is broken since p*d_offset implementation
   change to comply with mm/gup.c expectations.

- Revert memory sharing for diag calls in protected virtualization,
   since this is not required after all.

- Couple of other minor code cleanups.

* tag 's390-5.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/vdso: map vdso also for statically linked binaries
  s390/build: use size command to perform empty .bss check
  kbuild: add OBJSIZE variable for the size tool
  s390: put _stext and _etext into .text section
  s390/head64: cleanup unused labels
  s390/unwind: remove stack recursion warning
  s390/setup: adjust start_code of init_mm to _text
  s390/mm: fix dump_pagetables top level page table walking
  s390/protvirt: avoid memory sharing for diag 308 set/store

Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Revert of a bcache patch that caused an oops for some (Coly)

- ata rb532 unused warning fix (Gustavo)

- AoE kernel crash fix (He)

- Error handling fixup for blkdev_get() (Jan)

- libata read/write translation and SFF PIO fix (me)

- Use after free and error handling fix for O_DIRECT fragments. There's
   still a nowait + sync oddity in there, we'll nail that start next
   week. If all else fails, I'll queue a revert of the NOWAIT change.
   (me)

- Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)

- Two BFQ regression fixes that caused crashes (Paolo)

* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
  bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
  loop: set PF_MEMALLOC_NOIO for the worker thread
  bdev: Fixup error handling in blkdev_get()
  block, bfq: handle NULL return value by bfq_init_rq()
  block, bfq: move update of waker and woken list to queue freeing
  block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
  block: aoe: Fix kernel crash due to atomic sleep when exiting
  libata: add SG safety checks in SFF pio transfers
  libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
  block: fix O_DIRECT error handling for bio fragments
  ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe

Merge tag 'mmc-v5.3-rc3' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:

- cavium: Fix DMA support

- sdhci-sprd: Fix soft reset when runtime resuming"

* tag 'mmc-v5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: cavium: Add the missing dma unmap when the dma has finished.
  mmc: cavium: Set the correct dma max segment size for mmc_host
  mmc: sdhci-sprd: Fix the incorrect soft reset operation when runtime resuming

Merge tag 'fbdev-v5.3-rc4' of git://github.com/bzolnier/linux

Pull fbdev fix from Bartlomiej Zolnierkiewicz:
"fbdev patches will now go to upstream through drm-misc tree for
  improved maintainership and better integration testing so update
  MAINTAINERS file accordingly"

* tag 'fbdev-v5.3-rc4' of git://github.com/bzolnier/linux:
  MAINTAINERS: handle fbdev changes through drm-misc tree

Merge tag 'pwm/for-5.3-rc4' of git://git./linux/kernel/git/thierry.reding/linux-pwm

Pull pwm fix from Thierry Reding:
"A single fix for a backlight brightness regression introduced in
this merge window"

* tag 'pwm/for-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: Fallback to the static lookup-list when acpi_pwm_get fails

Merge tag 'sound-5.3-rc4' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Lots of small fixes at this time since we've received the ASoC fix
  batch now.

   - Some coverage in ASoC core mostly for minor issues like NULL checks
     for DPCM and proper error handling in DAI instantiation

   - A collection of small device-specific changes in various ASoC codec
     and platform drivers

   - OF-tree refcount fixes in a few ASoC drivers

   - Fixes of memory leaks in the error paths of various ASoC / ALSA
     drivers

   - A workaround for a long-standing issue on AMD HD-audio device

   - Updates of MAINTAINERS, mail addresses, file permission fixups"

* tag 'sound-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (38 commits)
  ALSA: firewire: fix a memory leak bug
  sound: fix a memory leak bug
  ALSA: hda - Workaround for crackled sound on AMD controller (1022:1457)
  ALSA: hiface: fix multiple memory leak bugs
  ALSA: hda - Don't override global PCM hw info flag
  ALSA: usb-audio: fix a memory leak bug
  ASoC: max98373: Remove executable bits
  ASoC: amd: acp3x: use dma address for acp3x dma driver
  ASoC: amd: acp3x: use dma_ops of parent device for acp3x dma driver
  ASoC: max98373: add 88200 and 96000 sampling rate support
  ASoC: sun4i-i2s: Incorrect SR and WSS computation
  MAINTAINERS: Update Intel ASoC drivers maintainers
  ASoC: ti: davinci-mcasp: Correct slot_width posed constraint
  ASoC: rockchip: Fix mono capture
  ASoC: Intel: Fix some acpi vs apci typo in somme comments
  ASoC: ti: davinci-mcasp: Fix clk PDIR handling for i2s master mode
  ASoC: Fail card instantiation if DAI format setup fails
  ASoC: SOF: Intel: hda: remove misleading error trace from IRQ thread
  ASoC: qcom: apq8016_sbc: Fix oops with multiple DAI links
  ASoC: dapm: fix a memory leak bug
  ...

Merge tag 'media/v5.3-3' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fix from Mauro Carvalho Chehab:
"A fix at the vivid CEC support"

* tag 'media/v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: vivid: fix missing cec adapter name

Merge tag 'pm-5.3-rc4' of git://git./linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
"Revert a recent PCI power management change that caused problems to
occur on multiple systems (Mika Westerberg)"

* tag 'pm-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "PCI: Add missing link delays required by the PCIe spec"

Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
"Fix a number of bugs in the ccp driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: ccp - Ignore tag length when decrypting GCM ciphertext
  crypto: ccp - Add support for valid authsize values less than 16
  crypto: ccp - Fix oops by properly managing allocated structures

gfs2: gfs2_walk_metadata fix

It turns out that the current version of gfs2_metadata_walker suffers
from multiple problems that can cause gfs2_hole_size to report an
incorrect size. This will confuse fiemap as well as lseek with the
SEEK_DATA flag.

Fix that by changing gfs2_hole_walker to compute the metapath to the
first data block after the hole (if any), and compute the hole size
based on that.

Fixes xfstest generic/490.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.18+

iommu/vt-d: Fix possible use-after-free of private domain

Multiple devices might share a private domain. One real example
is a pci bridge and all devices behind it. When remove a private
domain, make sure that it has been detached from all devices to
avoid use-after-free case.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Fixes: 942067f1b6b97 ("iommu/vt-d: Identify default domains replaced with private")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: Detach domain before using a private one

When the default domain of a group doesn't work for a device,
the iommu driver will try to use a private domain. The domain
which was previously attached to the device must be detached.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Fixes: 942067f1b6b97 ("iommu/vt-d: Identify default domains replaced with private")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lkml.org/lkml/2019/8/2/1379
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/dma: Handle SG length overflow better

Since scatterlist dimensions are all unsigned ints, in the relatively
rare cases where a device's max_segment_size is set to UINT_MAX, then
the "cur_len + s_length <= max_len" check in __finalise_sg() will always
return true. As a result, the corner case of such a device mapping an
excessively large scatterlist which is mergeable to or beyond a total
length of 4GB can lead to overflow and a bogus truncated dma_length in
the resulting segment.

As we already assume that any single segment must be no longer than
max_len to begin with, this can easily be addressed by reshuffling the
comparison.

Fixes: 809eac54cdd6 ("iommu/dma: Implement scatterlist segment merging")
Reported-by: Nicolin Chen <nicoleotsuka@gmail.com>
Tested-by: Nicolin Chen <nicoleotsuka@gmail.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

iommu/vt-d: Correctly check format of page table in debugfs

PASID support and enable bit in the context entry isn't the right
indicator for the type of tables (legacy or scalable mode). Check
the DMA_RTADDR_SMT bit in the root context pointer instead.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Sai Praneeth <sai.praneeth.prakhya@intel.com>
Fixes: dd5142ca5d24b ("iommu/vt-d: Add debugfs support to show scalable mode DMAR table internals")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

Merge tag 'kvmarm-fixes-for-5.3-2' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm fixes for 5.3, take #2

- Fix our system register reset so that we stop writing
  non-sensical values to them, and track which registers
  get reset instead.
- Sync VMCR back from the GIC on WFI so that KVM has an
  exact vue of PMR.
- Reevaluate state of HW-mapped, level triggered interrupts
  on enable.

Merge tag 'kvmarm-fixes-for-5.3' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm fixes for 5.3

- A bunch of switch/case fall-through annotation, fixing one actual bug
- Fix PMU reset bug
- Add missing exception class debug strings

selftests: kvm: Adding config fragments

selftests kvm test cases need pre-required kernel configs for the test
to get pass.

Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: selftests: Update gitignore file for latest changes

The kvm_create_max_vcpus test has been moved to the main directory,
and sync_regs_test is now available on s390x, too.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

kvm: remove unnecessary PageReserved check

The same check is already done in kvm_is_reserved_pfn.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

MAINTAINERS: handle fbdev changes through drm-misc tree

fbdev patches will now go to upstream through drm-misc tree (IOW
starting with v5.4 merge window fbdev changes will be included in
DRM pull request) for improved maintainership and better integration
testing. Update MAINTAINERS file accordingly.

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>

bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"

This reverts commit 89e0341af082dbc170019f908846f4a424efc86b.

In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), NULL pointer at
the end of list is necessary. Remove the NULL from last element of each
lists will cause the following panic,

[ 4340.455652] bcache: register_cache() registered cache device nvme0n1
[ 4340.464603] bcache: register_bdev() registered backing device sdk
[ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already
[ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6
[ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI
[ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4
[ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
[ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000
[ 5139.248021] RIP: 0010:string+0x21/0x70
[ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286
[ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04
[ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff
[ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a
[ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48
[ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855
[ 5139.248119] FS:  00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000
[ 5139.248137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0
[ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5139.248193] PKRU: 55555554
[ 5139.248200] Call Trace:
[ 5139.248210]  vsnprintf+0x1fb/0x510
[ 5139.248221]  snprintf+0x39/0x40
[ 5139.248238]  bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache]
[ 5139.248256]  __bch_cached_dev_show+0x44d/0x5f0 [bcache]
[ 5139.248270]  ? __alloc_pages_nodemask+0xb2/0x210
[ 5139.248284]  bch_cached_dev_show+0x2c/0x50 [bcache]
[ 5139.248297]  sysfs_kf_seq_show+0xbb/0x190
[ 5139.248308]  seq_read+0xfc/0x3c0
[ 5139.248317]  __vfs_read+0x26/0x140
[ 5139.248327]  vfs_read+0x87/0x130
[ 5139.248336]  SyS_read+0x42/0x90
[ 5139.248346]  do_syscall_64+0x74/0x160
[ 5139.248358]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5139.248370] RIP: 0033:0x7faf24eea370
[ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370
[ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003
[ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000
[ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000
[ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e

The simplest way to fix is to revert commit 89e0341af082 ("bcache: use
sysfs_match_string() instead of __sysfs_match_string()").

This bug was introduced in Linux v5.2, so this fix only applies to
Linux v5.2 is enough for stable tree maintainer.

Fixes: 89e0341af082 ("bcache: use sysfs_match_string() instead of __sysfs_match_string()")
Cc: stable@vger.kernel.org
Cc: Alexandru Ardelean <alexandru.ardelean@analog.com>
Reported-by: Peifeng Lin <pflin@suse.com>
Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

s390/vdso: map vdso also for statically linked binaries

s390 does not map the vdso for statically linked binaries, assuming
that this doesn't make sense. See commit fc5243d98ac2 ("[S390]
arch_setup_additional_pages arguments").

However with glibc commit d665367f596d ("linux: Enable vDSO for static
linking as default (BZ#19767)") and commit 5e855c895401 ("s390: Enable
VDSO for static linking") the vdso is also used for statically linked
binaries - if the kernel would make it available.

Therefore map the vdso always, just like all other architectures.

Reported-by: Stefan Liebler <stli@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable

A HW mapped level sensitive interrupt asserted by a device will not be put
into the ap_list if it is disabled at the VGIC level. When it is enabled
again, it will be inserted into the ap_list and written to a list register
on guest entry regardless of the state of the device.

We could argue that this can also happen on real hardware, when the command
to enable the interrupt reached the GIC before the device had the chance to
de-assert the interrupt signal; however, we emulate the distributor and
redistributors in software and we can do better than that.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm: Don't write junk to CP15 registers on reset

At the moment, the way we reset CP15 registers is mildly insane:
We write junk to them, call the reset functions, and then check that
we have something else in them.

The "fun" thing is that this can happen while the guest is running
(PSCI, for example). If anything in KVM has to evaluate the state
of a CP15 register while junk is in there, bad thing may happen.

Let's stop doing that. Instead, we track that we have called a
reset function for that register, and assume that the reset
function has done something.

In the end, the very need of this reset check is pretty dubious,
as it doesn't check everything (a lot of the CP15 reg leave outside
of the cp15_regs[] array). It may well be axed in the near future.

Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Don't write junk to sysregs on reset

At the moment, the way we reset system registers is mildly insane:
We write junk to them, call the reset functions, and then check that
we have something else in them.

The "fun" thing is that this can happen while the guest is running
(PSCI, for example). If anything in KVM has to evaluate the state
of a system register while junk is in there, bad thing may happen.

Let's stop doing that. Instead, we track that we have called a
reset function for that register, and assume that the reset
function has done something. This requires fixing a couple of
sysreg refinition in the trap table.

In the end, the very need of this reset check is pretty dubious,
as it doesn't check everything (a lot of the sysregs leave outside of
the sys_regs[] array). It may well be axed in the near future.

Tested-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>

Merge tag 'drm-intel-fixes-2019-08-08' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v5.3-rc4:
- Fix GLK DSI escape clock setting
- Fix a memleak on HDCP revoked Ksv error path

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87pnlghz79.fsf@intel.com