sfrench/cifs-2.6.git
Jan Kara [Wed, 6 Sep 2017 06:25:51 +0000 (14:25 +0800)]
bcache: Fix leak of bdev reference

If blkdev_get_by_path() in register_bcache() fails, we look up the
block device using lookup_bdev() to detect which situation we are in,
so that we can report the error properly. However, we never drop the
reference returned to us from lookup_bdev(). Fix that.
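
The pattern behind this fix can be sketched in userspace C. This is only an illustration, not the bcache code: struct fake_bdev, fake_lookup(), fake_put() and the error strings are hypothetical stand-ins for a reference-counted block device, lookup_bdev() and the matching put call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted block device. */
struct fake_bdev {
	int refcount;
};

/* Models a lookup that returns the device with an extra reference held. */
static struct fake_bdev *fake_lookup(struct fake_bdev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Models the matching put: drops one reference. */
static void fake_put(struct fake_bdev *dev)
{
	assert(dev->refcount > 0);
	dev->refcount--;
}

/*
 * Error path taken after the open failed: look the device up only to
 * choose an error message, then drop the reference again.
 */
static const char *report_open_failure(struct fake_bdev *dev)
{
	struct fake_bdev *found = fake_lookup(dev);
	const char *msg;

	if (!found)
		return "failed to open device";

	msg = "device busy";
	fake_put(found);	/* the kind of step that was missing */
	return msg;
}
```

After report_open_failure() returns, the reference count is back at its original value, which is the property the fix restores.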

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Fri, 1 Sep 2017 18:15:18 +0000 (11:15 -0700)]
block/loop: remove unused field

Nobody uses the list.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Fri, 1 Sep 2017 18:15:17 +0000 (11:15 -0700)]
block/loop: fix use after free

lo_rw_aio->call_read_iter->
1       aops->direct_IO
2       iov_iter_revert
lo_rw_aio_complete could run between steps 1 and 2, so the bio and bvec
could be freed before step 2, which accesses the bvec.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 30 Aug 2017 18:42:11 +0000 (11:42 -0700)]
bfq: Use icq_to_bic() consistently

Some code uses icq_to_bic() to convert an io_cq pointer to a
bfq_io_cq pointer while other code uses a direct cast. Convert
the code that uses a direct cast such that it uses icq_to_bic().
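
Why a conversion helper is preferable to a cast can be sketched as follows. All names here (sketch_icq, sketch_bic, sketch_icq_to_bic) are hypothetical, and the embedded member is deliberately not placed first, which is an assumption made only to show the difference from a plain cast.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the container_of idea. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_icq {
	int id;
};

struct sketch_bic {
	int weight;
	struct sketch_icq icq;	/* deliberately not the first member here */
};

/* One helper performs the conversion everywhere. */
static struct sketch_bic *sketch_icq_to_bic(struct sketch_icq *icq)
{
	return sketch_container_of(icq, struct sketch_bic, icq);
}
```

With a helper, a later change to the struct layout needs no changes at the call sites, whereas a direct cast silently depends on the embedded member being first.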

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 30 Aug 2017 18:42:10 +0000 (11:42 -0700)]
bfq: Suppress compiler warnings about comparisons

This patch prevents the following warnings from being reported when
building with W=1:

block/bfq-iosched.c: In function 'bfq_back_seek_max_store':
block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION'
 STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
 ^~~~~~~~~~~~~~
block/bfq-iosched.c: In function 'bfq_slice_idle_store':
block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION'
 STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
 ^~~~~~~~~~~~~~
block/bfq-iosched.c: In function 'bfq_slice_idle_us_store':
block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
 USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
 ^~~~~~~~~~~~~~~~~~~
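
One common way to avoid this class of warning is to compare against bounds held in variables rather than against a literal 0. The sketch below assumes a hypothetical helper, clamp_store(); it is not the macro from bfq-iosched.c.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical clamp helper: the bounds are function parameters, so the
 * compiler does not see a literal "unsigned < 0" comparison at this site.
 */
static unsigned long clamp_store(unsigned long data, unsigned long min,
				 unsigned long max)
{
	if (data < min)
		data = min;
	else if (data > max)
		data = max;
	return data;
}
```

A call such as clamp_store(value, 0, INT_MAX) then behaves like the macro's range check without the always-false comparison being spelled out in the source.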

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 30 Aug 2017 18:42:09 +0000 (11:42 -0700)]
bfq: Check kstrtoul() return value

Make sysfs writes fail for invalid numbers instead of storing
uninitialized data copied from the stack. This patch removes
all uninitialized_var() occurrences from the BFQ source code.
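
The idea can be sketched in userspace C, with strtoul() standing in for kstrtoul(). parse_ulong() and store_setting() are hypothetical names, and the parsing rules are a simplification of what kstrtoul() accepts.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace stand-in for a checked parse: returns 0 and stores the value
 * on success, or a negative errno without touching *out on failure.
 */
static int parse_ulong(const char *s, unsigned long *out)
{
	char *end;
	unsigned long v;

	if (*s < '0' || *s > '9')	/* also rejects "", "-1" and " 1" */
		return -EINVAL;
	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')		/* allow one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	*out = v;
	return 0;
}

/* Hypothetical sysfs-style store: reject bad input instead of storing junk. */
static int store_setting(unsigned long *setting, const char *buf)
{
	unsigned long v;
	int ret = parse_ulong(buf, &v);

	if (ret)
		return ret;	/* v was never written, so it must not be used */
	*setting = v;
	return 0;
}
```

On a failed parse the previous setting is left untouched and the error is passed back to the writer.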

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 30 Aug 2017 18:42:08 +0000 (11:42 -0700)]
bfq: Declare local functions static

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 30 Aug 2017 18:42:07 +0000 (11:42 -0700)]
bfq: Annotate fall-through in a switch statement

This patch prevents gcc 7 from issuing a fall-through warning
when building with W=1.
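
A minimal sketch of such an annotation, using a hypothetical function unrelated to the BFQ code: gcc 7 recognizes a "fall through" comment placed directly before the next case label.

```c
#include <assert.h>

/* Hypothetical example: count how many stages still have to run. */
static int stages_remaining(int stage)
{
	int n = 0;

	switch (stage) {
	case 0:
		n++;
		/* fall through */
	case 1:
		n++;
		/* fall through */
	case 2:
		n++;
		break;
	default:
		break;
	}
	return n;
}
```

The comments tell both the compiler and the reader that the missing break statements are intentional.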

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 1 Sep 2017 19:52:37 +0000 (13:52 -0600)]
Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-4.14/block-postmerge

Pull NVMe updates from Christoph:

"A few more nvme updates for 4.14:

 - generate a correct default NQN (Daniel Verkamp)
 - metadata passthrough for the NVME_IOCTL_IO_CMD ioctl, as well as
   related fixes and cleanups (Keith)
 - better scalability for connecting to the NVMeOF target (Roland Dreier)
 - target support for reading the host identifier (Omri Mann)"

Shaohua Li [Fri, 1 Sep 2017 05:09:46 +0000 (22:09 -0700)]
block/loop: allow request merge for directio mode

Currently loop disables request merging. While that makes sense for
buffered I/O mode, directio mode can benefit from merging. Without it,
loop may send small I/Os to the underlying disk and hurt performance.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Fri, 1 Sep 2017 05:09:45 +0000 (22:09 -0700)]
block/loop: set hw_sectors

Loop can handle requests of any size. Limiting it to 255 sectors just
burns CPU on bio splitting and request merging for the underlying disk,
and also causes bad fs block allocation in directio mode.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Daniel Verkamp [Wed, 30 Aug 2017 22:18:19 +0000 (15:18 -0700)]
nvme-fabrics: generate spec-compliant UUID NQNs

The default host NQN, which is generated based on the host's UUID,
does not follow the UUID-based NQN format laid out in the NVMe 1.3
specification.  Remove the "NVMf:" portion of the NQN to match the spec.
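
As an illustration, the UUID-based NQN form without the "NVMf:" portion can be built like this. format_uuid_nqn() is a hypothetical userspace helper, and the UUID is assumed to be passed in already formatted as text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a UUID-based NQN. uuid_str is assumed to be the canonical
 * 36-character text form of the UUID.
 */
static int format_uuid_nqn(char *buf, size_t len, const char *uuid_str)
{
	return snprintf(buf, len, "nqn.2014-08.org.nvmexpress:uuid:%s",
			uuid_str);
}
```

For the UUID 11111111-2222-3333-4444-555555555555 this yields nqn.2014-08.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555.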

Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Paolo Valente [Thu, 31 Aug 2017 18:00:31 +0000 (20:00 +0200)]
doc, block, bfq: better describe how to properly configure bfq

Many users have reported the lack of a HOWTO for properly configuring
bfq as a function of the goal one wants to achieve (max
responsiveness, max throughput, ...). In fact, all needed details are
already provided in the documentation file bfq-iosched.txt. Yet the
document lacks guidance on which parameter descriptions to look
at. This commit adds some simple direction.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com>
Reviewed-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Thu, 31 Aug 2017 18:00:30 +0000 (20:00 +0200)]
doc, block, bfq: fix some typos and remove stale stuff

In addition to containing some typos and stale sentences, the file
bfq-iosched.txt still mentioned a set of sysfs parameters that have
been removed from this version of bfq. This commit fixes all these
issues.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com>
Reviewed-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omar Sandoval [Thu, 24 Aug 2017 07:03:44 +0000 (00:03 -0700)]
loop: fold loop_switch() into callers

The comments here are really outdated, and blk-mq made flushing much
simpler, so just fold the two cases into the callers.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omar Sandoval [Thu, 24 Aug 2017 07:03:43 +0000 (00:03 -0700)]
loop: add ioctl for changing logical block size

This is a different approach from the first attempt in f2c6df7dbf9a
("loop: support 4k physical blocksize"). Rather than extending
LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
size.
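
From userspace, a dedicated ioctl of this kind might be called as sketched below. The sketch assumes the ioctl is LOOP_SET_BLOCK_SIZE with the value defined in <linux/loop.h> and that the block size is passed directly as the argument; set_loop_block_size() is a hypothetical wrapper.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>

#ifndef LOOP_SET_BLOCK_SIZE
#define LOOP_SET_BLOCK_SIZE 0x4C09	/* assumed value, as in <linux/loop.h> */
#endif

/*
 * Hypothetical wrapper: loop_fd is an open loop device such as /dev/loop0.
 * Returns 0 on success or a negative errno.
 */
static int set_loop_block_size(int loop_fd, unsigned long block_size)
{
	if (ioctl(loop_fd, LOOP_SET_BLOCK_SIZE, block_size) < 0)
		return -errno;
	return 0;
}
```

A caller would open the loop device first and then call, for example, set_loop_block_size(fd, 4096).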

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omar Sandoval [Thu, 24 Aug 2017 07:03:42 +0000 (00:03 -0700)]
loop: set physical block size to PAGE_SIZE

The physical block size is "the lowest possible sector size that the
hardware can operate on without reverting to read-modify-write
operations" (from the comment on blk_queue_physical_block_size()). Since
loop does buffered I/O on the backing file by default, the RMW unit is a
page. This isn't the case for direct I/O mode, but let's keep it simple.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omar Sandoval [Thu, 24 Aug 2017 07:03:41 +0000 (00:03 -0700)]
loop: get rid of lo_blocksize

This is only used for setting the soft block size on the struct
block_device once and then never used again.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Thu, 31 Aug 2017 06:46:31 +0000 (08:46 +0200)]
block, bfq: guarantee update_next_in_service always returns an eligible entity

If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, then it doesn't
invoke bfq_lookup_next_entity to get the next-in-service entity. In
contrast, it follows a shorter path: if E happens to be eligible (see
commit "bfq-sq-mq: make lookup_next_entity push up vtime on
expirations" for details on eligibility) and to have a lower virtual
finish time than the current candidate as next-in-service entity, then
E directly becomes the next-in-service entity. Unfortunately, there is
a corner case for which this shorter path makes
bfq_update_next_in_service choose a non eligible entity: it occurs if
both E and the current next-in-service entity happen to be non
eligible when bfq_update_next_in_service is invoked. In this case, E
is not set as next-in-service, and, since bfq_lookup_next_entity is
not invoked, the state of the parent entity is not updated so as to
end up with an eligible entity as the proper next-in-service entity.

In this respect, next-in-service is actually allowed to be non
eligible while some queue is in service: since no system-virtual-time
push-up can be performed in that case (see again commit "bfq-sq-mq:
make lookup_next_entity push up vtime on expirations" for details),
next-in-service is chosen, speculatively, as a function of the
possible value that the system virtual time may get after a push
up. But the correctness of the schedule breaks if next-in-service is
still a non eligible entity when it is time to set in service the next
entity. Unfortunately, this may happen in the above corner case.

This commit fixes this problem by making bfq_update_next_in_service
invoke bfq_lookup_next_entity not only if the above shorter path
cannot be taken, but also if the shorter path is taken but fails to
yield an eligible next-in-service entity.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Thu, 31 Aug 2017 06:46:30 +0000 (08:46 +0200)]
block, bfq: remove direct switch to an entity in higher class

If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, and finds out
that E belongs to a higher-priority class than that of the current
next-in-service entity, then it sets next_in_service directly to
E. But this may lead to anomalous schedules, because E may happen not
to be eligible for service, since its virtual start time is higher than
the system virtual time for its service tree.

This commit addresses this issue by simply removing this direct
switch.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Thu, 31 Aug 2017 06:46:29 +0000 (08:46 +0200)]
block, bfq: make lookup_next_entity push up vtime on expirations

To provide a very smooth service, bfq starts to serve a bfq_queue
only if the queue is 'eligible', i.e., if the same queue would
have started to be served in the ideal, perfectly fair system that
bfq simulates internally. This is obtained by associating each
queue with a virtual start time, and by computing a special system
virtual time quantity: a queue is eligible only if the system
virtual time has reached the virtual start time of the
queue. Finally, bfq guarantees that, when a new queue must be set
in service, there is always at least one eligible entity for each
active parent entity in the scheduler. To provide this guarantee,
the function __bfq_lookup_next_entity pushes up, for each parent
entity on which it is invoked, the system virtual time to the
minimum among the virtual start times of the entities in the
active tree for the parent entity (more precisely, the push up
occurs if the system virtual time happens to be lower than all
such virtual start times).
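
The eligibility rule and the push-up described above can be sketched as follows. This is a heavily simplified model, not the BFQ code: the names are hypothetical, the active entities sit in a flat array instead of a tree, and timestamp wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified entity: only the two virtual timestamps. */
struct sketch_entity {
	unsigned long long start;	/* virtual start time */
	unsigned long long finish;	/* virtual finish time */
};

/* An entity is eligible once the system virtual time reached its start. */
static int sketch_eligible(const struct sketch_entity *e,
			   unsigned long long vtime)
{
	return e->start <= vtime;
}

/*
 * Push vtime up to the smallest start time if it is below all of them,
 * then return the eligible entity with the smallest finish time.
 */
static const struct sketch_entity *
sketch_lookup_next(const struct sketch_entity *active, size_t n,
		   unsigned long long *vtime)
{
	const struct sketch_entity *best = NULL;
	unsigned long long min_start;
	size_t i;

	assert(n > 0);
	min_start = active[0].start;
	for (i = 1; i < n; i++)
		if (active[i].start < min_start)
			min_start = active[i].start;
	if (*vtime < min_start)
		*vtime = min_start;	/* the push-up */

	for (i = 0; i < n; i++) {
		if (!sketch_eligible(&active[i], *vtime))
			continue;
		if (!best || active[i].finish < best->finish)
			best = &active[i];
	}
	return best;	/* never NULL after the push-up */
}
```

Because of the push-up, at least the entity with the smallest start time is always eligible, so the lookup cannot come back empty.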

There is however a circumstance in which __bfq_lookup_next_entity
cannot push up the system virtual time for a parent entity, even
if the system virtual time is lower than the virtual start times
of all the child entities in the active tree. It happens if one of
the child entities is in service. In fact, in such a case, there
is already an eligible entity, the in-service one, even if it may
not be present in the active tree (because in-service entities
may be removed from the active tree).

Unfortunately, in the last re-design of the
hierarchical-scheduling engine, the reset of the pointer to the
in-service entity for a given parent entity--reset to be done as a
consequence of the expiration of the in-service entity--always
happens after the function __bfq_lookup_next_entity has been
invoked. This causes the function to think that there is still an
entity in service for the parent entity, and then that the system
virtual time cannot be pushed up, even if actually such a
no-more-in-service entity has already been properly reinserted
into the active tree (or into some other tree if no longer
active). Yet, the system virtual time *had* to be pushed up, to be
ready to correctly choose the next queue to serve. Because of the
lack of this push up, bfq may wrongly set in service a queue that
had been speculatively pre-computed as the possible
next-in-service queue, but that would no longer be the one to serve
after the expiration and the reinsertion into the active trees of
the previously in-service entities.

This commit addresses this issue by making
__bfq_lookup_next_entity properly push up the system virtual time
if an expiration is occurring.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omri Mann [Wed, 30 Aug 2017 12:22:59 +0000 (15:22 +0300)]
nvmet: add support for reporting the host identifier

And fix the Get/Set Log Page implementation to take all 8 bits of the
feature identifier into account.

Signed-off-by: Omri Mann <omri@excelero.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[hch: used the UUID API, updated changelog]

Keith Busch [Tue, 29 Aug 2017 21:46:04 +0000 (17:46 -0400)]
nvme: Use metadata for passthrough commands

The ioctls' struct allows the user to provide a metadata address and
length for a passthrough command. This patch uses these values that were
previously ignored and deletes the now unused wrapper function.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 29 Aug 2017 21:46:03 +0000 (17:46 -0400)]
nvme: Make nvme user functions static

These functions are used only locally in the nvme core.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 29 Aug 2017 21:46:02 +0000 (17:46 -0400)]
nvme/pci: Use req_op to determine DIF remapping

Only read and write commands need DIF remapping. Everything else uses
a passthrough integrity payload.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 29 Aug 2017 21:46:01 +0000 (17:46 -0400)]
nvme: factor metadata handling out of __nvme_submit_user_cmd

Keep the metadata code in a separate helper instead of making the
main function more complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Roland Dreier [Tue, 29 Aug 2017 17:33:44 +0000 (10:33 -0700)]
nvme-fabrics: Convert nvmf_transports_mutex to an rwsem

The mutex protects against the list of transports changing while a
controller is being created, but using a plain old mutex means that it
also serializes controller creation.  This unnecessarily slows down
creating multiple controllers - for example for the RDMA transport,
creating a controller involves establishing one connection for every IO
queue, which involves even more network/software round trips, so the
delay can become significant.

The simplest way to fix this is to change the mutex to an rwsem and only
hold it for writing when the list is being mutated.  Since we can take
the rwsem for reading while creating a controller, we can create multiple
controllers in parallel.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jens Axboe [Tue, 29 Aug 2017 15:09:11 +0000 (09:09 -0600)]
Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-4.14/block-postmerge

Pull NVMe changes from Christoph:

"Below is the current set of NVMe updates for Linux 4.14, now against
 your postmerge branch, and with three more patches.

 The biggest bit comes from Sagi and refactors the RDMA driver to
 prepare for more code sharing in the setup and teardown path.  But we
 have various features and bug fixes from a lot of people as well."

Christoph Hellwig [Thu, 17 Aug 2017 12:10:00 +0000 (14:10 +0200)]
nvme: don't blindly overwrite identifiers on disk revalidate

Instead validate that these identifiers do not change, as that is
prohibited by the specification.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Christoph Hellwig [Wed, 16 Aug 2017 14:14:47 +0000 (16:14 +0200)]
nvme: remove nvme_revalidate_ns

The function is used in two places, and the shared code for those will
diverge later in this series.

Instead factor out a new helper to get the ids for a namespace, simplify
the calling conventions for nvme_identify_ns and just open code the
sequence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Christoph Hellwig [Wed, 16 Aug 2017 13:47:37 +0000 (15:47 +0200)]
nvme: remove unused struct nvme_ns fields

And move the flags for the flags field near that field while touching
this area.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Christoph Hellwig [Tue, 22 Aug 2017 09:42:24 +0000 (11:42 +0200)]
nvme: allow calling nvme_change_ctrl_state from irq context

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Christoph Hellwig [Tue, 22 Aug 2017 08:17:03 +0000 (10:17 +0200)]
nvme: report more detailed status codes to the block layer

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Martin K. Petersen [Fri, 25 Aug 2017 23:14:50 +0000 (19:14 -0400)]
nvme: honor RTD3 Entry Latency for shutdowns

If an NVMe controller reports RTD3 Entry Latency larger than
shutdown_timeout, up to a maximum of 60 seconds, use that value to set
the shutdown timer. Otherwise fall back to the module parameter which
defaults to 5 seconds.
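
The selection rule described above can be sketched as plain arithmetic. pick_shutdown_timeout() is a hypothetical helper, and the sketch assumes RTD3 Entry Latency is reported in microseconds, with 0 meaning that no value is reported.

```c
#include <assert.h>

#define SKETCH_MAX_SHUTDOWN_SECS 60U

/*
 * rtd3e_us: RTD3 Entry Latency in microseconds (assumed unit).
 * param_secs: the module parameter, in seconds.
 * Returns the shutdown timeout to use, in seconds.
 */
static unsigned int pick_shutdown_timeout(unsigned int rtd3e_us,
					  unsigned int param_secs)
{
	unsigned int transition_secs = rtd3e_us / 1000000U;

	if (transition_secs > SKETCH_MAX_SHUTDOWN_SECS)
		transition_secs = SKETCH_MAX_SHUTDOWN_SECS;
	/* Use the reported latency only if it exceeds the parameter. */
	return transition_secs > param_secs ? transition_secs : param_secs;
}
```

For example, a controller reporting 8 seconds gets an 8 second timer with the default parameter of 5, while one reporting 2 seconds keeps the 5 second default.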

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: removed do_div, made transition time local scope]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jan H. Schönherr [Sun, 27 Aug 2017 13:56:37 +0000 (15:56 +0200)]
nvme: fix uninitialized prp2 value on small transfers

The value of iod->first_dma ends up as prp2 in NVMe commands. In case
there is not enough data to cross a page boundary, iod->first_dma is
never initialized and contains random data.

Comply with the NVMe specification and fill in 0 in that case.
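
The rule can be illustrated with a simplified model that covers only a single contiguous buffer spanning at most two pages. compute_prp2() is a hypothetical helper, not the driver's code, and longer transfers that need a PRP list are out of scope here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the second PRP entry for a transfer of `length` bytes starting
 * at `dma_addr`, or 0 when the data does not cross a page boundary.
 * Assumes a contiguous buffer that spans at most two pages.
 */
static uint64_t compute_prp2(uint64_t dma_addr, uint32_t length,
			     uint32_t page_size)
{
	uint32_t offset;

	/* page_size must be a nonzero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	offset = (uint32_t)(dma_addr & (page_size - 1));

	if (length <= page_size - offset)
		return 0;	/* everything fits in the first page */
	return dma_addr - offset + page_size;	/* start of the next page */
}
```

The important part is the explicit 0 in the first branch: the value is always defined, even when no second page is involved.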

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Mon, 14 Aug 2017 12:29:26 +0000 (15:29 +0300)]
nvme-rdma: Use unlikely macro in the fast path

This patch slightly improves performance (mainly for small block sizes).

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Martin Wilck [Mon, 14 Aug 2017 20:12:39 +0000 (22:12 +0200)]
nvmet: use memcpy_and_pad for identify sn/fr

This changes the earlier patch "nvmet: don't report 0-bytes
in serial number" to use the memcpy_and_pad() helper introduced
in a previous patch.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Martin Wilck [Mon, 14 Aug 2017 20:12:38 +0000 (22:12 +0200)]
string.h: add memcpy_and_pad()

This helper function is useful for the nvme subsystem, and maybe
others.

Note: the warnings reported by the kbuild test robot for this patch
are actually generated by the use of CONFIG_PROFILE_ALL_BRANCHES
together with __FORTIFY_INLINE.
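
A userspace sketch of what such a helper does: copy the source into a fixed-size field and fill the remainder with a pad byte. memcpy_and_pad_sketch() is a hypothetical reimplementation for illustration; the parameter order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `count` bytes from src into dest, which is dest_len bytes long.
 * If the source is shorter than the destination, fill the rest with
 * `pad`; if it is longer, truncate it to dest_len bytes.
 */
static void memcpy_and_pad_sketch(void *dest, size_t dest_len,
				  const void *src, size_t count, int pad)
{
	if (dest_len > count) {
		memcpy(dest, src, count);
		memset((char *)dest + count, pad, dest_len - count);
	} else {
		memcpy(dest, src, dest_len);
	}
}
```

This is the shape needed for fixed-width identify fields such as a serial number, which are space-padded rather than NUL-terminated.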

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
James Smart [Mon, 31 Jul 2017 20:21:14 +0000 (13:21 -0700)]
nvmet-fc: simplify sg list handling

The existing nvmet_fc sg list handling has 2 faults:
a) the request between LLDD and transport has too large of an sg
   list (256 elements), which is normally 256k (64 elements).
b) sglist handling doesn't optimize on the fact that each element
   is a page.

This patch removes the static sg list in the request and uses the
dynamic list already present in the nvmet_fc transport. It also
simplifies the handling of the sg list on multiple sequences to
take advantage of the per-page divisions.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
James Smart [Mon, 31 Jul 2017 20:20:30 +0000 (13:20 -0700)]
nvme-fc: Reattach to localports on re-registration

If the LLDD resets or detaches from an fc port, the LLDD will
deregister all remoteports seen by the fc port and deregister the
localport associated with the fc port. The teardown of the localport
structure will be held off due to reference counting until all the
remoteports are removed (and they are held off until all
controllers/associations are terminated). Currently, if the fc port
is reinit/reattached and registered again as a localport it is
treated as an independent entity from the prior localport and all
prior remoteports and controllers cannot be revived. They are
created as new and separate entities.

This patch changes the localport registration to look at the known
localports that are waiting to be torn down. If they are the same port
based on wwn's, the local port is transitioned out of the teardown
state.  This allows the remote ports and controller connections to
be reestablished and resumed as long as the localport can also be
reregistered within the timeout windows.

The patch adds a new routine nvme_fc_attach_to_unreg_lport() with
the functionality and moves the lport get/put routines to avoid
forward references.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Sun, 13 Aug 2017 16:21:07 +0000 (19:21 +0300)]
nvme: rename AMS symbolic constants to fit specification

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Sun, 13 Aug 2017 16:21:06 +0000 (19:21 +0300)]
nvme: add symbolic constants for CC identifiers

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Tue, 15 Aug 2017 09:24:05 +0000 (12:24 +0300)]
nvme: fix identify namespace logging

Use ctrl->device and lose the func name.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Guan Junxiong [Thu, 3 Aug 2017 13:40:26 +0000 (21:40 +0800)]
nvme-fabrics: log a warning if hostid is invalid

This helps users quickly spot why a connection fails when the hostid
does not comply with the uuid format.

Signed-off-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:39 +0000 (09:22 +0300)]
nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64

To make the nvme_rdma_configure_admin_queue generic in preparation of
moving it to common code.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:38 +0000 (09:22 +0300)]
nvme-rdma: cleanup error path in controller reset

No need to queue extra work to indirect controller removal; just call the
ctrl remove routine.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:37 +0000 (09:22 +0300)]
nvme-rdma: introduce nvme_rdma_start_queue

This should pair with nvme_rdma_stop_queue.  While this is not a complete
inverse, it still pairs up pretty well because in fabrics we don't have a
disconnect capsule (yet) but we simply teardown the transport association.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:36 +0000 (09:22 +0300)]
nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue

Give it a name symmetric to nvme_rdma_free_queue. Also pass in the ctrl
sqsize+1 and not the opts queue_size.  And suppress a superfluous
failure message.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:35 +0000 (09:22 +0300)]
nvme-rdma: stop queues instead of simply flipping their state

If we move the queues from LIVE state, we might as well stop them (drain
for rdma).  Do it after we stop the request queues to prevent a stray
request sneaking in .queue_rq after we stop the queue.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 28 Aug 2017 19:41:10 +0000 (21:41 +0200)]
nvme-rdma: introduce configure/destroy io queues

Make the handling symmetrical with the admin queue.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 28 Aug 2017 19:40:06 +0000 (21:40 +0200)]
nvme-rdma: reuse configure/destroy_admin_queue

No need to open-code it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:32 +0000 (09:22 +0300)]
nvme-rdma: don't free tagset on resets

We're not supposed to do that.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:31 +0000 (09:22 +0300)]
nvme-rdma: disable the controller on resets

Mimic the pci driver as a controller disable might be more lightweight
than a shutdown.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 10 Jul 2017 06:22:30 +0000 (09:22 +0300)]
nvme-rdma: move tagset allocation to a dedicated routine

We always pair tagset allocation with an rdma device reference, and the
two cases share some code. Centralize it, with an argument saying whether
it's an admin or IO tagset.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvme: Add admin_tagset pointer to nvme_ctrl
Sagi Grimberg [Mon, 10 Jul 2017 06:22:29 +0000 (09:22 +0300)]
nvme: Add admin_tagset pointer to nvme_ctrl

Will be used when we centralize control flows.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvme-rdma: move nvme_rdma_configure_admin_queue code location
Sagi Grimberg [Mon, 10 Jul 2017 06:22:28 +0000 (09:22 +0300)]
nvme-rdma: move nvme_rdma_configure_admin_queue code location

We will call it from other places so avoid having to forward declare it.
Also move it next to nvme_rdma_destroy_admin_queue.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE
Johannes Thumshirn [Thu, 3 Aug 2017 09:28:47 +0000 (11:28 +0200)]
nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE

NVME_RDMA_MAX_SEGMENT_SIZE is not used anywhere, zap it.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvmet-fcloop: remove ALL_OPTS define
Johannes Thumshirn [Thu, 3 Aug 2017 09:28:48 +0000 (11:28 +0200)]
nvmet-fcloop: remove ALL_OPTS define

ALL_OPTS isn't used anywhere, remove it.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvmet: fix the return error code of target if host is not allowed
Guan Junxiong [Fri, 4 Aug 2017 09:27:47 +0000 (17:27 +0800)]
nvmet: fix the return error code of target if host is not allowed

nvmf target shall return NVME_SC_CONNECT_INVALID_HOST instead of
the general code INVALID_PARAM when the given host nqn is not allowed
to connect. Refer to the 2.2.1 section of the NVMe over Fabrics Spec.

Signed-off-by: Guan Junxiong <guanjunxiong@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvmet: use NVME_NSID_ALL
Christoph Hellwig [Tue, 18 Jul 2017 17:46:36 +0000 (19:46 +0200)]
nvmet: use NVME_NSID_ALL

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
4 years agonvme: add support for NVMe 1.3 Timestamp Feature
Jon Derrick [Wed, 16 Aug 2017 07:51:29 +0000 (09:51 +0200)]
nvme: add support for NVMe 1.3 Timestamp Feature

NVME's Timestamp feature allows controllers to be aware of the epoch
time in milliseconds. This patch adds the set features hook for various
transports through the identify path, so that resets and resumes can
update the controller as necessary.
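
A rough illustration of the value involved, not the driver's code: assuming
the feature payload is an 8-byte little-endian field holding milliseconds since
the epoch (the payload size and the toy_* helper names are assumptions of this
sketch), the host-side conversion could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: milliseconds since the Unix epoch from a
 * seconds/nanoseconds pair, as a wall clock would report them. */
static uint64_t toy_epoch_ms(int64_t sec, long nsec)
{
    assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
    return (uint64_t)sec * 1000u + (uint64_t)nsec / 1000000u;
}

/* Hypothetical helper: store the value byte by byte, least significant
 * byte first, so the result is the same on big- and little-endian hosts.
 * The 8-byte buffer size is an assumption of this sketch. */
static void toy_encode_timestamp(uint64_t ms, uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(ms >> (8 * i));
}
```

Because the value is re-derived from the clock each time, sending it again
after a reset or resume is enough to bring the controller back in sync.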

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
[hch: rebased on top of nvme-4.13 error handling changes,
      changed nvme_configure_timestamp to return the status]
Signed-off-by: Christoph Hellwig <hch@lst.de>
4 years agonvme: define NVME_NSID_ALL
Arnav Dawn [Wed, 12 Jul 2017 10:41:53 +0000 (16:11 +0530)]
nvme: define NVME_NSID_ALL

Define the constant "0xffffffff" (used as nsid for all namespaces)
as NVME_NSID_ALL.

Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
4 years agonvme: add support for FW activation without reset
Arnav Dawn [Wed, 12 Jul 2017 10:40:40 +0000 (16:10 +0530)]
nvme: add support for FW activation without reset

This patch adds support for handling FW activation without reset.
On completion of the FW-activation-starting AER, all queues are
paused until CSTS.PP is cleared or the wait times out (exceeds the max
time for FW activation, MTFA). If the device fails to clear CSTS.PP within
MTFA, the driver issues a controller reset.
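
The wait-then-reset decision can be sketched in plain C. This is a userspace
model under stated assumptions, not the driver code: the bit position used for
CSTS.PP, the callback type, and the simulated controller are all hypothetical,
and time is counted rather than slept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CSTS_PP (1u << 5)   /* assumed position of the "processing paused" bit */

/* Hypothetical register accessor: returns the current CSTS value. */
typedef uint32_t (*toy_read_csts_fn)(void *ctx);

/* Simulated controller: reports PP for the next paused_reads reads. */
struct toy_fake_ctrl { unsigned int paused_reads; };

static uint32_t toy_fake_read_csts(void *ctx)
{
    struct toy_fake_ctrl *c = ctx;

    if (c->paused_reads == 0)
        return 0;
    c->paused_reads--;
    return TOY_CSTS_PP;
}

/*
 * Poll CSTS until PP clears or the MTFA budget (in ms) is used up.
 * Returns true if the queues may be restarted, false if the caller
 * should fall back to a controller reset.
 */
static bool toy_wait_fw_activation(toy_read_csts_fn read_csts, void *ctx,
                                   unsigned int mtfa_ms, unsigned int poll_ms)
{
    unsigned int elapsed = 0;

    assert(poll_ms > 0);
    for (;;) {
        if (!(read_csts(ctx) & TOY_CSTS_PP))
            return true;        /* activation finished in time */
        if (elapsed >= mtfa_ms)
            return false;       /* budget exhausted: reset the controller */
        elapsed += poll_ms;     /* stands in for sleeping poll_ms */
    }
}
```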

Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
4 years agoMerge tag 'v4.13-rc7' into for-4.14/block-postmerge
Jens Axboe [Mon, 28 Aug 2017 19:00:44 +0000 (13:00 -0600)]
Merge tag 'v4.13-rc7' into for-4.14/block-postmerge

Linux 4.13-rc7

Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoblock: fix warning when I/O elevator is changed as request_queue is being removed
David Jeffery [Mon, 28 Aug 2017 16:52:44 +0000 (10:52 -0600)]
block: fix warning when I/O elevator is changed as request_queue is being removed

There is a race between changing I/O elevator and request_queue removal
which can trigger the warning in kobject_add_internal.  A program can
use sysfs to request a change of elevator at the same time another task
is unregistering the request_queue the elevator would be attached to.
The elevator's kobject will then attempt to be connected to the
request_queue in the object tree when the request_queue has just been
removed from sysfs.  This triggers the warning in kobject_add_internal
as the request_queue no longer has a sysfs directory:

kobject_add_internal failed for iosched (error: -2 parent: queue)
------------[ cut here ]------------
WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0

To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
changing the elevator and use the request_queue's sysfs_lock to
serialize between clearing the flag and the elevator testing the flag.
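
A minimal userspace sketch of that serialization, using a pthread mutex in
place of the request_queue's sysfs_lock. The toy_* names, the boolean standing
in for QUEUE_FLAG_REGISTERED, and the -ENOENT return value are assumptions made
for illustration only:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_queue {
    pthread_mutex_t sysfs_lock;
    bool registered;            /* stands in for QUEUE_FLAG_REGISTERED */
    const char *elevator;
};

static void toy_queue_init(struct toy_queue *q, const char *elevator)
{
    pthread_mutex_init(&q->sysfs_lock, NULL);
    q->registered = true;
    q->elevator = elevator;
}

/* Removal path: clear the flag while holding the lock. */
static void toy_unregister_queue(struct toy_queue *q)
{
    pthread_mutex_lock(&q->sysfs_lock);
    q->registered = false;
    pthread_mutex_unlock(&q->sysfs_lock);
}

/* Elevator switch: test the flag under the same lock, so the switch
 * either completes before removal or sees the cleared flag and bails. */
static int toy_change_elevator(struct toy_queue *q, const char *name)
{
    int ret = 0;

    assert(name != NULL);
    pthread_mutex_lock(&q->sysfs_lock);
    if (!q->registered)
        ret = -ENOENT;          /* queue is gone: nothing to attach to */
    else
        q->elevator = name;
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}
```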

Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoblock, scheduler: convert xxx_var_store to void
weiping zhang [Thu, 24 Aug 2017 17:11:33 +0000 (01:11 +0800)]
block, scheduler: convert xxx_var_store to void

The last parameter "count" is never used in xxx_var_store, so
convert these functions to void.
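
The shape of the change, sketched in userspace C with hypothetical toy_* names
(the real helpers live in the individual schedulers and use kernel parsing
routines, so strtol() here is only a stand-in):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Before: "count" is passed in only to be handed straight back. */
static long toy_var_store_old(int *var, const char *page, long count)
{
    assert(var != NULL && page != NULL);
    *var = (int)strtol(page, NULL, 10);
    return count;
}

/* After: the helper just parses and stores. */
static void toy_var_store(int *var, const char *page)
{
    assert(var != NULL && page != NULL);
    *var = (int)strtol(page, NULL, 10);
}

/* The sysfs-style caller already knows count and returns it itself. */
static long toy_attr_store(int *var, const char *page, long count)
{
    toy_var_store(var, page);
    return count;
}
```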

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoLinux 4.13-rc7 v4.13-rc7
Linus Torvalds [Mon, 28 Aug 2017 00:20:40 +0000 (17:20 -0700)]
Linux 4.13-rc7

4 years agoMerge tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Mon, 28 Aug 2017 00:10:34 +0000 (17:10 -0700)]
Merge tag 'iommu-fixes-v4.13-rc6' of git://git./linux/kernel/git/joro/iommu

Pull IOMMU fix from Joerg Roedel:
 "Another fix, this time in common IOMMU sysfs code.

  In the conversion from the old iommu sysfs-code to the
  iommu_device_register interface, I forgot to update the release path
  for the struct device associated with an IOMMU. It freed the 'struct
  device', which was a pointer before, but is now embedded in another
  struct.

  Freeing from the middle of allocated memory had all kinds of nasty
  side effects when an IOMMU was unplugged. Unfortunately nobody
  unplugged an IOMMU until now, so this was not discovered earlier. The
  fix is to make the 'struct device' a pointer again"

* tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu: Fix wrong freeing of iommu_device->dev

4 years agoMerge tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Mon, 28 Aug 2017 00:08:37 +0000 (17:08 -0700)]
Merge tag 'char-misc-4.13-rc7' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc fix from Greg KH:
 "Here is a single misc driver fix for 4.13-rc7. It resolves a reported
  problem in the Android binder driver due to previous patches in
  4.13-rc.

  It's been in linux-next with no reported issues"

* tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  ANDROID: binder: fix proc->tsk check.

4 years agoMerge tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Mon, 28 Aug 2017 00:03:33 +0000 (17:03 -0700)]
Merge tag 'staging-4.13-rc7' of git://git./linux/kernel/git/gregkh/staging

Pull staging/iio fixes from Greg KH:
 "Here are few small staging driver fixes, and some more IIO driver
  fixes for 4.13-rc7. Nothing major, just resolutions for some reported
  problems.

  All of these have been in linux-next with no reported problems"

* tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: magnetometer: st_magn: remove ihl property for LSM303AGR
  iio: magnetometer: st_magn: fix status register address for LSM303AGR
  iio: hid-sensor-trigger: Fix the race with user space powering up sensors
  iio: trigger: stm32-timer: fix get trigger mode
  iio: imu: adis16480: Fix acceleration scale factor for adis16480
  iio: Fix some documentation warnings
  staging: rtl8188eu: add RNX-N150NUB support
  Revert "staging: fsl-mc: be consistent when checking strcmp() return"
  iio: adc: stm32: fix common clock rate
  iio: adc: ina219: Avoid underflow for sleeping time
  iio: trigger: stm32-timer: add enable attribute
  iio: trigger: stm32-timer: fix get/set down count direction
  iio: trigger: stm32-timer: fix write_raw return value
  iio: trigger: stm32-timer: fix quadrature mode get routine
  iio: bmp280: properly initialize device for humidity reading

4 years agoMerge tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb
Linus Torvalds [Mon, 28 Aug 2017 00:01:54 +0000 (17:01 -0700)]
Merge tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb

Pull NTB fixes from Jon Mason:
 "NTB bug fixes to address an incorrect ntb_mw_count reference in the
  NTB transport, improperly bringing down the link if SPADs are
  corrupted, and an out-of-order issue regarding link negotiation and
  data passing"

* tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb:
  ntb: ntb_test: ensure the link is up before trying to configure the mws
  ntb: transport shouldn't disable link due to bogus values in SPADs
  ntb: use correct mw_count function in ntb_tool and ntb_transport

4 years agoAvoid page waitqueue race leaving possible page locker waiting
Linus Torvalds [Sun, 27 Aug 2017 23:25:09 +0000 (16:25 -0700)]
Avoid page waitqueue race leaving possible page locker waiting

The "lock_page_killable()" function waits for exclusive access to the
page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
set.

That means that if it gets woken up, other waiters may have been
skipped.

That, in turn, means that if it sees the page being unlocked, it *must*
take that lock and return success, even if a lethal signal is also
pending.

So instead of checking for lethal signals first, we need to check for
them after we've checked the actual bit that we were waiting for.  Even
if that might then delay the killing of the process.

This matches the order of the old "wait_on_bit_lock()" infrastructure
that the page locking used to use (and is still used in a few other
areas).

Note that if we still return an error after having unsuccessfully tried
to acquire the page lock, that is ok: that means that some other thread
was able to get ahead of us and lock the page, and when that other
thread then unlocks the page, the wakeup event will be repeated.  So any
other pending waiters will now get properly woken up.
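
The ordering can be modelled in a few lines of single-threaded C. This is a
sketch with hypothetical toy_* names, not the kernel's wait loop: it only shows
what one wakeup of an exclusive, killable waiter should decide.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One-bit "page lock" for the model; not atomic, single-threaded only. */
struct toy_page { bool locked; };

static bool toy_trylock(struct toy_page *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

/*
 * Returns 0 if the lock was taken, -EINTR if the waiter gives up because
 * of a fatal signal, and 1 if it should go back to sleep.
 */
static int toy_exclusive_wakeup(struct toy_page *p, bool fatal_signal)
{
    assert(p != NULL);
    if (toy_trylock(p))     /* check the bit we were waiting for first */
        return 0;           /* must succeed even if a signal is pending */
    if (fatal_signal)       /* only then look at lethal signals */
        return -EINTR;
    return 1;
}
```

With the checks in the opposite order, the first case below would return
-EINTR with the page unlocked, and the waiters skipped by the exclusive wakeup
would never be woken.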

Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoMinor page waitqueue cleanups
Linus Torvalds [Sun, 27 Aug 2017 20:55:12 +0000 (13:55 -0700)]
Minor page waitqueue cleanups

Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists.  The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes.  That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.

In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably.  If you have thousands of threads
waiting for the same page, it will be painful.  We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:

 (a) we don't want to continue walking the page wait list if the bit
     we're waiting for already got set again (which seems to be one of
     the patterns of the bad load).  That makes no progress and just
     causes pointless cache pollution chasing the pointers.

 (b) we don't want to put the non-locking waiters always on the front of
     the queue, and the locking waiters always on the back.  Not only is
     that unfair, it means that we wake up thousands of reading threads
     that will just end up being blocked by the writer later anyway.
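
Point (a) can be sketched as a list walk that stops as soon as the bit is seen
set again. This is a simplified model with hypothetical names: the array
bit_seen[] stands in for re-reading the page flag before each waiter, which in
the real code can change under the walker because another thread took the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_waiter { bool woken; };

/*
 * Wake waiters in order, but stop once the bit is observed set again:
 * anyone woken after that point could not make progress anyway.
 * Returns the number of waiters actually woken.
 */
static size_t toy_wake_walk(struct toy_waiter *w, size_t n,
                            const bool *bit_seen)
{
    size_t woken = 0;

    assert(n == 0 || (w != NULL && bit_seen != NULL));
    for (size_t i = 0; i < n; i++) {
        if (bit_seen[i])
            break;          /* bit got set again: stop walking the list */
        w[i].woken = true;
        woken++;
    }
    return woken;
}
```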

Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members.  It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
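
Why the two keys happen to line up can be checked with a small C sketch. The
toy_* structs below are simplified stand-ins written for this illustration,
not the kernel definitions; the only property they share with the real ones is
that the flags word is the first member of the page structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page { unsigned long flags; void *mapping; };

struct toy_wait_bit_key  { void *flags; int bit_nr; };
struct toy_wait_page_key { struct toy_page *page; int bit_nr; int page_match; };

/* True if a page key and a bit key built for the same page and bit carry
 * the same values at the same offsets in their first two members. */
static bool toy_keys_compatible(struct toy_page *page, int bit_nr)
{
    assert(page != NULL);

    struct toy_wait_page_key pk = { .page = page, .bit_nr = bit_nr, .page_match = 0 };
    struct toy_wait_bit_key bk = { .flags = &page->flags, .bit_nr = bit_nr };

    return (void *)pk.page == bk.flags      /* first member: same address */
        && pk.bit_nr == bk.bit_nr
        && offsetof(struct toy_wait_page_key, page) ==
           offsetof(struct toy_wait_bit_key, flags)
        && offsetof(struct toy_wait_page_key, bit_nr) ==
           offsetof(struct toy_wait_bit_key, bit_nr);
}
```

The address comparison holds because C guarantees that a pointer to a struct
and a pointer to its first member compare equal after conversion; moving the
flags word away from the start of the page struct would break the match.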

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoClarify (and fix) MAX_LFS_FILESIZE macros
Linus Torvalds [Sun, 27 Aug 2017 19:12:25 +0000 (12:12 -0700)]
Clarify (and fix) MAX_LFS_FILESIZE macros

We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
filesystems (and other IO targets) that know they are 64-bit clean and
don't have any 32-bit limits in their IO path.

It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
the VM layer is limited by the page cache to only 32-bit index values,
but our logic for that was confusing and actually wrong.  We used to
define that value to

(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

which is actually odd in several ways: it limits the index to 31 bits,
and then it limits files so that they can't have data in that last byte
of a page that has the highest 31-bit index (ie page index 0x7fffffff).

Neither of those limitations makes sense.  The index is actually the full
32 bit unsigned value, and we can use that whole full page.  So the
maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".

However, we do want to avoid the maximum index, because we have code
that iterates over the page indexes, and we don't want that code to
overflow.  So the maximum size of a file on a 32-bit host should
actually be one page less than the full 32-bit index.

So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
not actually be using the page of that last index (ULONG_MAX), but we
can grow a file up to that limit.

The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
byte less), but the actual true VM limit is one page less than 16TiB.

This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop
in truncate_inode_pages_range()"), which started applying that
MAX_LFS_FILESIZE limit to block devices too.

NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
actually just the offset type itself (loff_t), which is signed.  But for
clarity, on 64-bit, just use the maximum signed value, and don't make
people have to count the number of 'f' characters in the hex constant.

So just use LLONG_MAX for the 64-bit case.  That was what the value had
been before too, just written out as a hex constant.
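
The 32-bit arithmetic can be worked through in portable C. This is a sketch
that hard-codes a 32-bit long and 4 KiB pages (both assumptions, as are the
TOY_* names) so that it gives the same numbers on any host:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12                          /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE     (INT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_BITS_PER_LONG 32                          /* modelled 32-bit host */
#define TOY_ULONG_MAX     INT64_C(0xffffffff)         /* 32-bit ULONG_MAX */

/* Old definition: ((loff_t)PAGE_SIZE << (BITS_PER_LONG-1)) - 1 */
static int64_t toy_old_limit(void)
{
    return (TOY_PAGE_SIZE << (TOY_BITS_PER_LONG - 1)) - 1;
}

/* New definition: ULONG_MAX << PAGE_SHIFT */
static int64_t toy_new_limit(void)
{
    return TOY_ULONG_MAX << TOY_PAGE_SHIFT;
}

/* Does a device or file of the given size fit under a limit? */
static bool toy_fits(int64_t size, int64_t limit)
{
    assert(size >= 0);
    return size <= limit;
}
```

Under these assumptions the old limit is 2^43 - 1 bytes (one byte short of
8TiB) and the new one is 2^44 - 4096 bytes (one page short of 16TiB), so a
9.7TB volume fails the old check and passes the new one.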

Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Sat, 26 Aug 2017 19:48:29 +0000 (12:48 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

 - a tweak to the IBM Trackpoint driver that helps recognize
   trackpoints on newer Lenovo Carbons

 - a fix to the ALPS driver solving scroll issues on some Dells

 - yet another ACPI ID has been added to the Elan I2C touchpad driver

 - quieted diagnostic message in soc_button_array driver

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: ALPS - fix two-finger scroll breakage in right side on ALPS touchpad
  Input: soc_button_array - silence -ENOENT error on Dell XPS13 9365
  Input: trackpoint - add new trackpoint firmware ID
  Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310

4 years agoMerge tag 'pci-v4.13-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaa...
Linus Torvalds [Sat, 26 Aug 2017 19:46:14 +0000 (12:46 -0700)]
Merge tag 'pci-v4.13-fixes-3' of git://git./linux/kernel/git/helgaas/pci

Pull PCI fix from Bjorn Helgaas:
 "Remove needlessly alarming MSI affinity warning (this is not actually
  a bug fix, but the warning prompts unnecessary bug reports)"

* tag 'pci-v4.13-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL

4 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 26 Aug 2017 16:06:28 +0000 (09:06 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Two fixes: one for an ldt_struct handling bug and a cherry-picked
  objtool fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix use-after-free of ldt_struct
  objtool: Fix '-mtune=atom' decoding support in objtool 2.0

4 years agoMerge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 26 Aug 2017 16:02:18 +0000 (09:02 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a timer granularity handling race+bug, which would manifest itself
  by spuriously increasing timeouts of some timers (from 1 jiffy to ~500
  jiffies in the worst case measured) in certain nohz states"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix excessive granularity of new timers after a nohz idle

4 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 26 Aug 2017 15:59:50 +0000 (08:59 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Ingo Molnar:
 "A single fix to not allow nonsensical event groups that result in
  kernel warnings"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix group {cpu,task} validation

4 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 26 Aug 2017 01:02:27 +0000 (18:02 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memblock.c: reversed logic in memblock_discard()
  fork: fix incorrect fput of ->exe_file causing use-after-free
  mm/madvise.c: fix freeing of locked page with MADV_FREE
  dax: fix deadlock due to misaligned PMD faults
  mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled
  PM/hibernate: touch NMI watchdog when creating snapshot

4 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sat, 26 Aug 2017 00:46:23 +0000 (17:46 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Bugfixes for x86, PPC and s390"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()
  KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state
  KVM: x86: simplify handling of PKRU
  KVM: x86: block guest protection keys unless the host has them enabled
  KVM: PPC: Book3S HV: Add missing barriers to XIVE code and document them
  KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss
  KVM: PPC: Book3S HV: Use msgsync with hypervisor doorbells on POWER9
  KVM: s390: sthyi: fix specification exception detection
  KVM: s390: sthyi: fix sthyi inline assembly

4 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Sat, 26 Aug 2017 00:40:03 +0000 (17:40 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "Fixes two obvious bugs in virtio pci"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_pci: fix cpu affinity support
  virtio_blk: fix incorrect message when disk is resized

4 years agoMerge tag 'powerpc-4.13-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sat, 26 Aug 2017 00:32:35 +0000 (17:32 -0700)]
Merge tag 'powerpc-4.13-8' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
 "Just one fix, to add a barrier in the switch_mm() code to make sure
  the mm cpumask update is ordered vs the MMU starting to load
  translations. As far as we know no one's actually hit the bug, but
  that's just luck.

  Thanks to Benjamin Herrenschmidt, Nicholas Piggin"

* tag 'powerpc-4.13-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Ensure cpumask update is ordered

4 years agoMerge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux
Linus Torvalds [Sat, 26 Aug 2017 00:27:26 +0000 (17:27 -0700)]
Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd bugfixes, neither 4.13 regressions, but both potentially
  serious"

* tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux:
  net: sunrpc: svcsock: fix NULL-pointer exception
  nfsd: Limit end of page list when decoding NFSv4 WRITE

4 years agoMerge tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 26 Aug 2017 00:22:33 +0000 (17:22 -0700)]
Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Some bug fixes for stable for cifs"

* tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
  cifs: Fix df output for users with quota limits

4 years agoMerge tag 'for-linus-20170825' of git://git.infradead.org/linux-mtd
Linus Torvalds [Sat, 26 Aug 2017 00:09:19 +0000 (17:09 -0700)]
Merge tag 'for-linus-20170825' of git://git.infradead.org/linux-mtd

Pull MTD fixes from Brian Norris:
 "Two fixes - one for a 4.13 regression, and the other for an older one:

   - Atmel NAND: since we started utilizing ONFI timings, we found that
     we were being too strict in rejecting them, partly due to
     discrepancies in ONFI 4.0 and earlier versions. Relax the
     restriction to keep these platforms booting. This is a 4.13-rc1
     regression.

   - nandsim: repeated probe/removal may not work after a failed init,
     because we didn't free up our debugfs files properly on the failure
     path. This has been around since 3.8, but it's nice to get this
     fixed now in a nice easy patch that can target -stable, since
     there's already refactoring work (that also fixes the issue)
     targeted for the next merge window"

* tag 'for-linus-20170825' of git://git.infradead.org/linux-mtd:
  mtd: nand: atmel: Relax tADL_min constraint
  mtd: nandsim: remove debugfs entries in error path

4 years agoMerge branch 'for-linus' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 26 Aug 2017 00:02:59 +0000 (17:02 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A small batch of fixes that should be included for the 4.13 release.
  This contains:

   - Revert of the 4k loop blocksize support. Even with a recent batch
     of 4 fixes, we're still not really happy with it. Rather than be
     stuck with an API issue, let's revert it and get it right for 4.14.

   - Trivial patch from Bart, adding a few flags to the blk-mq debugfs
     exports that were added in this release, but not to the debugfs
     parts.

   - Regression fix for bsg, fixing a potential kernel panic. From
     Benjamin.

   - Tweak for the blk throttling, improving how we account discards.
     From Shaohua"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq-debugfs: Add names for recently added flags
  bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
  Revert "loop: support 4k physical blocksize"
  blk-throttle: cap discard request size

4 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Fri, 25 Aug 2017 23:59:38 +0000 (16:59 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "I2C has some bugfixes for you: mainly Jarkko fixed up a few things in
  the designware driver regarding the new slave mode. But Ulf also fixed
  a long-standing and now agreed suspend problem. Plus, some simple
  stuff which nonetheless needs fixing"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: designware: Fix runtime PM for I2C slave mode
  i2c: designware: Remove needless pm_runtime_put_noidle() call
  i2c: aspeed: fixed potential null pointer dereference
  i2c: simtec: use release_mem_region instead of release_resource
  i2c: core: Make comment about I2C table requirement to reflect the code
  i2c: designware: Fix standard mode speed when configuring the slave mode
  i2c: designware: Fix oops from i2c_dw_irq_handler_slave
  i2c: designware: Fix system suspend

4 years agoPCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL
Christoph Hellwig [Fri, 25 Aug 2017 23:58:42 +0000 (18:58 -0500)]
PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL

irq_create_affinity_masks() can return NULL on non-SMP systems, when there
are not enough "free" vectors available to spread, or if memory allocation
for the CPU masks fails.  Only the allocation failure is of interest, and
even then the system will work just fine except for non-optimally spread
vectors.  Thus remove the warnings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
4 years agoMerge tag 'mmc-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 25 Aug 2017 23:57:53 +0000 (16:57 -0700)]
Merge tag 'mmc-v4.13-rc6' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fix from Ulf Hansson:
 "MMC core: don't return error code R1_OUT_OF_RANGE for open-ending mode"

* tag 'mmc-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: block: prevent propagating R1_OUT_OF_RANGE for open-ending mode

4 years agoMerge tag 'sound-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 25 Aug 2017 23:56:04 +0000 (16:56 -0700)]
Merge tag 'sound-4.13-rc7' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "We're keeping in a good shape, this batch contains just a few small
  fixes (a regression fix for ASoC rt5677 codec, NULL dereference and
  error-path fixes in firewire, and a corner-case ioctl error fix for
  user TLV), as well as usual quirks for USB-audio and HD-audio"

* tag 'sound-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: rt5677: Reintroduce I2C device IDs
  ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978)
  ALSA: core: Fix unexpected error at replacing user TLV
  ALSA: usb-audio: Add delay quirk for H650e/Jabra 550a USB headsets
  ALSA: firewire-motu: destroy stream data surely at failure of card initialization
  ALSA: firewire: fix NULL pointer dereference when releasing uninitialized data of iso-resource

4 years agoMerge tag 'dmaengine-fix-4.13-rc7' of git://git.infradead.org/users/vkoul/slave-dma
Linus Torvalds [Fri, 25 Aug 2017 23:43:08 +0000 (16:43 -0700)]
Merge tag 'dmaengine-fix-4.13-rc7' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fix from Vinod Koul:
 "A single fix for tegra210-adma driver to check of_irq_get() error"

* tag 'dmaengine-fix-4.13-rc7' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: tegra210-adma: fix of_irq_get() error check

4 years agoMerge tag 'drm-fixes-for-v4.13-rc7' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 25 Aug 2017 23:39:51 +0000 (16:39 -0700)]
Merge tag 'drm-fixes-for-v4.13-rc7' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "Fixes for rc7, nothing too crazy, some core, i915, and sunxi fixes,
  Intel CI has been responsible for some of these fixes being required"

* tag 'drm-fixes-for-v4.13-rc7' of git://people.freedesktop.org/~airlied/linux:
  drm/i915/gvt: Fix the kernel null pointer error
  drm: Release driver tracking before making the object available again
  drm/i915: Clear lost context-switch interrupts across reset
  drm/i915/bxt: use NULL for GPIO connection ID
  drm/i915/cnl: Fix LSPCON support.
  drm/i915/vbt: ignore extraneous child devices for a port
  drm/i915: Initialize 'data' in intel_dsi_dcs_backlight.c
  drm/atomic: If the atomic check fails, return its value first
  drm/atomic: Handle -EDEADLK with out-fences correctly
  drm: Fix framebuffer leak
  drm/imx: ipuv3-plane: fix YUV framebuffer scanout on the base plane
  gpu: ipu-v3: add DRM dependency
  drm/rockchip: Fix suspend crash when drm is not bound
  drm/sun4i: Implement drm_driver lastclose to restore fbdev console

4 years agomm/memblock.c: reversed logic in memblock_discard()
Pavel Tatashin [Fri, 25 Aug 2017 22:55:46 +0000 (15:55 -0700)]
mm/memblock.c: reversed logic in memblock_discard()

In the recently introduced memblock_discard() there is a reversed logic bug:
the memory of the static array is freed instead of that of the dynamically
allocated one.

Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
Fixes: 3010f876500f ("mm: discard memblock data later")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reported-by: Woody Suwalski <terraluna977@gmail.com>
Tested-by: Woody Suwalski <terraluna977@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agofork: fix incorrect fput of ->exe_file causing use-after-free
Eric Biggers [Fri, 25 Aug 2017 22:55:43 +0000 (15:55 -0700)]
fork: fix incorrect fput of ->exe_file causing use-after-free

Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced a new error path before
a reference is taken on the mm_struct's ->exe_file.  Since the
->exe_file of the new mm_struct was already set to the old ->exe_file by
the memcpy() in dup_mm(), it was possible for the mmput() in the error
path of dup_mm() to drop a reference to ->exe_file which was never
taken.

This caused the struct file to later be freed prematurely.

Fix it by updating mm_init() to NULL out the ->exe_file, in the same
place it clears other things like the list of mmaps.
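
The reference imbalance described above can be modelled in userspace.
This is only an illustrative sketch: struct file_model, struct mm_model
and the helper functions are hypothetical stand-ins, not the kernel's
types or code.  It shows that a pointer copied by memcpy() carries no
reference of its own, so dropping one in the error path steals the
parent's reference unless the pointer is cleared first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct file and struct mm_struct. */
struct file_model { int refcount; };
struct mm_model   { struct file_model *exe_file; };

/* Models the mmput() error path: drop one reference if a file is set. */
static void mm_put_model(struct mm_model *mm)
{
    if (mm->exe_file)
        mm->exe_file->refcount--;
}

/*
 * Models a duplication that fails before a reference on exe_file is
 * taken.  With clear_exe_file == 0 the copied pointer survives into the
 * error path (the bug); with 1 it is set to NULL first (the fix).
 * Returns the file's refcount after the failed duplication.
 */
int failed_dup_refcount(int clear_exe_file)
{
    struct file_model exe = { .refcount = 1 };      /* held by the parent */
    struct mm_model parent = { .exe_file = &exe };
    struct mm_model child;

    memcpy(&child, &parent, sizeof(child));  /* copies pointer, takes no ref */
    if (clear_exe_file)
        child.exe_file = NULL;               /* assumed analogue of the fix */

    mm_put_model(&child);                    /* error path runs */
    return exe.refcount;
}
```

In this model failed_dup_refcount(0) leaves the count at 0 although the
parent still uses the file, while failed_dup_refcount(1) leaves it at 1.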

This bug was found by syzkaller.  It can be reproduced using the
following C program:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *mmap_thread(void *_arg)
    {
        for (;;) {
            mmap(NULL, 0x1000000, PROT_READ,
                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        }
    }

    static void *fork_thread(void *_arg)
    {
        usleep(rand() % 10000);
        fork();
        return NULL;
    }

    int main(void)
    {
        fork();
        fork();
        fork();
        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                pthread_create(&t, NULL, mmap_thread, NULL);
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

No special kernel config options are needed.  It usually causes a NULL
pointer dereference in __remove_shared_vm_struct() during exit, or in
dup_mmap() (which is usually inlined into copy_process()) during fork.
Both are due to a vm_area_struct's ->vm_file being used after it's
already been freed.

Google Bug Id: 64772007

Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [v4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago mm/madvise.c: fix freeing of locked page with MADV_FREE
Eric Biggers [Fri, 25 Aug 2017 22:55:39 +0000 (15:55 -0700)]
mm/madvise.c: fix freeing of locked page with MADV_FREE

If madvise(..., MADV_FREE) split a transparent hugepage, it called
put_page() before unlock_page().

This was wrong because put_page() can free the page, e.g. if a
concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
mapping. put_page() then rightfully complained about freeing a locked
page.

Fix this by moving the unlock_page() before put_page().
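
Why the order matters can be sketched with a small userspace model.
Everything here is hypothetical (struct page_model and the *_model
helpers are not kernel code); it assumes the concurrent MADV_DONTNEED
has already dropped the mapping's reference, so the caller holds the
last one.

```c
#include <stdbool.h>

/* Hypothetical model of a page: a refcount, a lock bit, and a flag that
 * records whether the page was freed while still locked. */
struct page_model {
    int  refcount;
    bool locked;
    bool freed_while_locked;
};

/* Dropping the last reference "frees" the page; doing so while the lock
 * bit is set corresponds to the "Bad page state" report. */
static void put_page_model(struct page_model *p)
{
    if (--p->refcount == 0 && p->locked)
        p->freed_while_locked = true;
}

static void unlock_page_model(struct page_model *p)
{
    p->locked = false;
}

/* Returns true if the chosen ordering frees a locked page. */
bool split_path_frees_locked_page(bool unlock_first)
{
    struct page_model page = { .refcount = 1, .locked = true };

    if (unlock_first) {
        unlock_page_model(&page);   /* fixed order */
        put_page_model(&page);
    } else {
        put_page_model(&page);      /* buggy order */
        unlock_page_model(&page);
    }
    return page.freed_while_locked;
}
```

Only the put-before-unlock order trips the check in this model.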

This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798  pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0x194/0x257 lib/dump_stack.c:52
     bad_page+0x230/0x2b0 mm/page_alloc.c:565
     free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
     free_pages_check mm/page_alloc.c:952 [inline]
     free_pages_prepare mm/page_alloc.c:1043 [inline]
     free_pcp_prepare mm/page_alloc.c:1068 [inline]
     free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
     __put_single_page mm/swap.c:79 [inline]
     __put_page+0xfb/0x160 mm/swap.c:113
     put_page include/linux/mm.h:814 [inline]
     madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
     walk_pmd_range mm/pagewalk.c:50 [inline]
     walk_pud_range mm/pagewalk.c:108 [inline]
     walk_p4d_range mm/pagewalk.c:134 [inline]
     walk_pgd_range mm/pagewalk.c:160 [inline]
     __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
     walk_page_range+0x200/0x470 mm/pagewalk.c:326
     madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
     madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
     madvise_dontneed_free mm/madvise.c:555 [inline]
     madvise_vma mm/madvise.c:664 [inline]
     SYSC_madvise mm/madvise.c:832 [inline]
     SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
     entry_SYSCALL_64_fastpath+0x1f/0xbe

Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
        madvise(mapping, mapping_size, (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (;;) {
            mapping = mmap(NULL, mapping_size, PROT_WRITE,
                           MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

            munmap(mapping + mapping_size / 2, PAGE_SIZE);

            pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
            pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            munmap(mapping, mapping_size);
        }
    }

Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
CONFIG_DEBUG_VM=y are needed.

Google Bug Id: 64696096

Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [v4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago dax: fix deadlock due to misaligned PMD faults
Ross Zwisler [Fri, 25 Aug 2017 22:55:36 +0000 (15:55 -0700)]
dax: fix deadlock due to misaligned PMD faults

In DAX there are two separate places where the 2MiB range of a PMD is
defined.

The first is in the page tables, where a PMD mapping inserted for a
given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
address to the 2MiB boundary above the address.

So, for example, a fault at address 3MiB (0x30 0000) falls within the
PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.

So, for example, the file offset for 3MiB (pgoff 768) falls within the
PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
512) to 4MiB (pgoff 1024).

This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.

Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.

The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():

    if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
        goto unlock_fallback;

This test makes sure that the pfn we get from the driver is 2MiB
aligned, and relies on the assumption that the 2MiB alignment of the pfn
we get back from the driver matches the 2MiB alignment of the faulting
address.

However, faults to holes were not checked and we could hit the problem
described above.

This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:

  $ cd nvml/src/test/pmempool_sync
  $ make TEST5

You can grab NVML here:

  https://github.com/pmem/nvml/

The dmesg warning you see when you hit this error is:

  WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

Where we notice in dax_insert_mapping_entry() that the radix tree entry
we are about to replace doesn't match the locked entry that we had
previously inserted into the tree.  This happens because the initial
insertion was done in grab_mapping_entry() using a pgoff calculated from
the faulting address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
vmf->pgoff.

In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix
tree entries.

This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address.  This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.

Fix this by validating that the faulting address's PMD offset matches
the PMD offset from the start of the file.  This check is done at the
very beginning of the fault and covers faults that would have mapped to
storage as well as faults to holes.  I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.
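
The two PMD ranges and the offset comparison described above can be
worked through in userspace.  This is a sketch under stated assumptions,
not the patch itself: the constants assume 4KiB pages and 2MiB PMDs, and
the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for 4KiB pages and 2MiB PMDs. */
#define PAGE_SHIFT     12
#define PMD_SIZE       (UINT64_C(1) << 21)
#define PMD_MASK       (~(PMD_SIZE - 1))
#define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)   /* 511 */

/* Start of the 2MiB page-table range that contains this address. */
uint64_t pmd_start_for_address(uint64_t address)
{
    return address & PMD_MASK;
}

/* First page offset of the order 9 radix tree entry covering pgoff. */
uint64_t pmd_start_for_pgoff(uint64_t pgoff)
{
    return pgoff & ~PG_PMD_COLOUR;
}

/* True if the address and the file offset sit at the same offset within
 * their respective 2MiB ranges, so both ranges overlap exactly. */
bool pmd_colour_matches(uint64_t address, uint64_t pgoff)
{
    return (pgoff & PG_PMD_COLOUR) ==
           ((address >> PAGE_SHIFT) & PG_PMD_COLOUR);
}
```

With the examples from the text, an address of 3MiB paired with pgoff
768 matches (both are 256 pages into their range), while the same
address paired with pgoff 0 does not, which is the misaligned case that
has to fall back.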

Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled
Kirill A. Shutemov [Fri, 25 Aug 2017 22:55:33 +0000 (15:55 -0700)]
mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled

/sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether huge
pages are allocated when allocating pages for the private in-kernel
shmem mount.

Unfortunately, as Dan noticed, I've screwed it up and the only way to
make the kernel allocate huge pages for the mount is to use "force"
there.  All other values are effectively ignored.

Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago PM/hibernate: touch NMI watchdog when creating snapshot
Chen Yu [Fri, 25 Aug 2017 22:55:30 +0000 (15:55 -0700)]
PM/hibernate: touch NMI watchdog when creating snapshot

Counting the pages for the hibernation snapshot can take a significant
amount of time, especially on systems with large memory.  Since the
counting is performed with IRQs disabled, this can lead to an NMI
lockup.  The following warning was seen on a system with 1.5TB DRAM:

  Freezing user space processes ... (elapsed 0.002 seconds) done.
  OOM killer disabled.
  PM: Preallocating image memory...
  NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
  CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
  task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
  RIP: 0010:memory_bm_find_bit+0xf4/0x100
  Call Trace:
   swsusp_set_page_free+0x2b/0x30
   mark_free_pages+0x147/0x1c0
   count_data_pages+0x41/0xa0
   hibernate_preallocate_memory+0x80/0x450
   hibernation_snapshot+0x58/0x410
   hibernate+0x17c/0x310
   state_store+0xdf/0xf0
   kobj_attr_store+0xf/0x20
   sysfs_kf_write+0x37/0x40
   kernfs_fop_write+0x11c/0x1a0
   __vfs_write+0x37/0x170
   vfs_write+0xb1/0x1a0
   SyS_write+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xa5
  ...
  done (allocated 6590003 pages)
  PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)

The operation took nearly 20 seconds (on a 2.10GHz CPU), so the NMI
lockup was triggered.  If the NMI watchdog timeout is set to 1 second,
a safe interval would in theory be 6590003/20 = 320k pages.  However,
some platforms may run at a lower frequency, so feed the watchdog every
100k pages.
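
The feeding pattern can be sketched in userspace as follows.  This is an
illustration only: touch_watchdog_stub() and scan_pages() are
hypothetical names, and the stub merely counts calls.  The interval of
128k pages is the value mentioned in the notes below; a countdown is one
way to avoid a modulus in the per-page loop.

```c
/* Interval between watchdog touches, in pages. */
#define WD_PAGE_COUNT (128 * 1024)

/* Hypothetical stub: counts how often the watchdog would be touched. */
static unsigned long watchdog_touches;

static void touch_watchdog_stub(void)
{
    watchdog_touches++;
}

/* Walks nr_pages pages and touches the watchdog once per WD_PAGE_COUNT
 * pages.  Returns the number of touches. */
unsigned long scan_pages(unsigned long nr_pages)
{
    unsigned long page_count = WD_PAGE_COUNT;
    unsigned long pfn;

    watchdog_touches = 0;
    for (pfn = 0; pfn < nr_pages; pfn++) {
        /* per-page work would go here */
        if (!--page_count) {
            touch_watchdog_stub();
            page_count = WD_PAGE_COUNT;
        }
    }
    return watchdog_touches;
}
```

For the 6590003 pages in the log above this touches the watchdog 50
times.  At the measured rate of roughly 329500 pages per second, that is
a touch about every 0.4 seconds.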

[yu.c.chen@intel.com: simplification]
Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
[yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years ago skd: Remove SKD_ID_INCR
Bart Van Assche [Fri, 25 Aug 2017 21:24:14 +0000 (14:24 -0700)]
skd: Remove SKD_ID_INCR

The SKD_ID_INCR flag in skd_request_context.id duplicates information
that is already available otherwise, e.g. through the block layer
request state and through skd_request_context.state. Hence remove
the code that manipulates this flag and also the flag itself.
Since skd_isr_completion_posted() only uses the lower bits of
skd_request_context.id as hardware tag, this patch does not change
the behavior of the skd driver. I'm referring to the following code:

    tag = req_id & SKD_ID_SLOT_AND_TABLE_MASK;
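
A small sketch of why the masking makes the flag irrelevant to the tag.
The numeric values below are assumptions chosen for illustration (a mask
covering the low 10 bits and an increment just above it), and
skd_tag_from_id() is a hypothetical helper, not a driver function.

```c
#include <stdint.h>

#define SKD_ID_SLOT_AND_TABLE_MASK 0x3ffu  /* assumed value: low 10 bits */
#define SKD_ID_INCR                0x400u  /* assumed value: first bit above the mask */

/* Extracts the hardware tag from a request id, as in the line above. */
uint32_t skd_tag_from_id(uint32_t req_id)
{
    return req_id & SKD_ID_SLOT_AND_TABLE_MASK;
}
```

Under these assumptions, adding any multiple of SKD_ID_INCR to an id
leaves skd_tag_from_id() unchanged, because the increment only touches
bits that the mask discards.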

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>