sfrench/cifs-2.6.git
2 years ago blkcg: remove unused __blkg_release_rcu()
Dennis Zhou [Wed, 19 Dec 2018 22:43:53 +0000 (16:43 -0600)]
blkcg: remove unused __blkg_release_rcu()

An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting
to use percpu_ref") moved around the release call from blkg_put() to be
a part of the percpu_ref cleanup. Remove the additional unused code
which should have been removed earlier.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago blkcg: clean up blkg_tryget_closest()
Dennis Zhou [Wed, 19 Dec 2018 22:43:21 +0000 (16:43 -0600)]
blkcg: clean up blkg_tryget_closest()

The implementation of blkg_tryget_closest() wasn't super obvious and
became a point of suspicion when debugging [1]. So let's clean it up so
it's obviously not the problem.

Also add missing RCU read locking to bio_clone_blkg_association(), which
got exposed by adding the RCU read lock held check in
blkg_tryget_closest().

[1] https://lore.kernel.org/linux-block/a7e97e4b-0dd8-3a54-23b7-a0f27b17fde8@kernel.dk/

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: Change drbd_request_detach_interruptible's return type to int
Nathan Chancellor [Thu, 20 Dec 2018 16:23:44 +0000 (17:23 +0100)]
drbd: Change drbd_request_detach_interruptible's return type to int

Clang warns when an implicit conversion is done between enumerated
types:

drivers/block/drbd/drbd_state.c:708:8: warning: implicit conversion from
enumeration type 'enum drbd_ret_code' to different enumeration type
'enum drbd_state_rv' [-Wenum-conversion]
                rv = ERR_INTR;
                   ~ ^~~~~~~~

drbd_request_detach_interruptible's only call site is in the return
statement of adm_detach, which returns an int. Change the return type of
drbd_request_detach_interruptible to match, silencing Clang's warning.
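
A minimal sketch of the idea, not the actual DRBD code: the two enums
and their values below are hypothetical stand-ins, and request_detach()
is an illustrative name. Returning int lets one function hand back
values from either enumeration without an implicit enum-to-enum
conversion.

```c
/* Hypothetical stand-ins for the two enumerations named in the warning;
 * the numeric values are made up for illustration. */
enum ret_code { RC_NO_ERROR = 101, RC_ERR_INTR = 129 };
enum state_rv { RV_SUCCESS = 1, RV_NOTHING_TO_DO = 2 };

/* With an int return type, values of both enums convert to int
 * explicitly by design, so there is no enum-to-enum assignment
 * for the compiler to warn about. */
int request_detach(int interrupted)
{
	if (interrupted)
		return RC_ERR_INTR;	/* value from enum ret_code */
	return RV_SUCCESS;		/* value from enum state_rv */
}
```

The caller is assumed to return an int itself, so nothing is lost by
widening the type here.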

Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: Avoid Clang warning about pointless switch statment
Nathan Chancellor [Thu, 20 Dec 2018 16:23:43 +0000 (17:23 +0100)]
drbd: Avoid Clang warning about pointless switch statment

There are several warnings from Clang about no case statement matching
the constant 0:

In file included from drivers/block/drbd/drbd_receiver.c:48:
In file included from drivers/block/drbd/drbd_int.h:48:
In file included from ./include/linux/drbd_genl_api.h:54:
In file included from ./include/linux/genl_magic_struct.h:236:
./include/linux/drbd_genl.h:321:1: warning: no case matching constant
switch condition '0'
GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
'GENL_struct'
        switch (0) {
                ^

Silence this warning by adding a 'case 0:' statement. Additionally,
adjust the alignment of the statements in the ct_assert_unique macro to
avoid a checkpatch warning.
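
A simplified sketch of the technique, assuming all real tag numbers are
nonzero: CT_ASSERT_UNIQUE and demo_fields are hypothetical names, not
the kernel macros. Duplicate case labels fail to compile, which gives a
compile-time uniqueness check, and the added 'case 0:' gives the
constant switch condition something to match.

```c
/* Compile-time uniqueness check: every tag becomes a case label, and
 * two identical labels in one switch are a compile error.
 * The trailing 'case 0:' matches the constant condition, so the
 * compiler has no reason to warn about an unmatched switch (0). */
#define CT_ASSERT_UNIQUE(name, labels)			\
	void ct_assert_unique_##name(void)		\
	{						\
		switch (0) {				\
		labels					\
		case 0:					\
			;				\
		}					\
	}

/* Hypothetical field tags 1, 2 and 3; changing one of them to repeat
 * another (or to 0) would break the build. */
CT_ASSERT_UNIQUE(demo_fields, case 1: case 2: case 3:)
```

The generated function does nothing at run time; its only purpose is to
make the compiler see all labels in one switch.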

This solution was originally sent by Arnd Bergmann with a default case
statement: https://lore.kernel.org/patchwork/patch/756723/

Link: https://github.com/ClangBuiltLinux/linux/issues/43
Suggested-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")
Lars Ellenberg [Thu, 20 Dec 2018 16:23:42 +0000 (17:23 +0100)]
drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")

And also re-enable partial-zero-out + discard aligned.

With the introduction of REQ_OP_WRITE_ZEROES,
we started to use that for both WRITE_ZEROES and DISCARDS,
hoping that WRITE_ZEROES would "do what we want",
UNMAP if possible, zero-out the rest.

The example scenario is some LVM "thin" backend.

While an un-allocated block on dm-thin reads as zeroes, on a dm-thin
with "skip_block_zeroing=true", after a partial block write allocated
that block, that same block may well map "undefined old garbage" from
the backends on LBAs that have not yet been written to.

If we cannot distinguish between zero-out and discard on the receiving
side, then to keep "undefined old garbage" from popping up randomly at
later times on supposedly zero-initialized blocks, we'd need to map all
discards to zero-out on the receiving side.  But that would potentially
do a full alloc on thinly provisioned backends, even when the expectation
was to unmap/trim/discard/de-allocate.

We need to distinguish on the protocol level, whether we need to guarantee
zeroes (and thus use zero-out, potentially doing the mentioned full-alloc),
or if we want to put the emphasis on discard, and only do a "best effort
zeroing" (by "discarding" blocks aligned to discard-granularity, and zeroing
only potential unaligned head and tail clippings to at least *try* to
avoid "false positives" in an online-verify later), hoping that someone
set skip_block_zeroing=false.
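
The "best effort zeroing" split can be sketched as follows. This is an
illustration only: struct zero_plan and plan_discard() are hypothetical
names, and all quantities are assumed to be in sectors. The range is
cut into an unaligned head to zero, a granularity-aligned middle to
discard, and an unaligned tail to zero.

```c
#include <stdint.h>

/* Hypothetical result type: how a request [start, start + nr) is split. */
struct zero_plan {
	uint64_t head_zero;	/* sectors to zero before the aligned part */
	uint64_t discard_start;	/* first sector of the aligned part */
	uint64_t discard_len;	/* sectors to discard (multiple of granularity) */
	uint64_t tail_zero;	/* sectors to zero after the aligned part */
};

struct zero_plan plan_discard(uint64_t start, uint64_t nr, uint64_t granularity)
{
	struct zero_plan p = { 0, 0, 0, 0 };
	uint64_t end = start + nr;
	uint64_t aligned_start, aligned_end;

	/* No usable granularity: zero out everything. */
	if (granularity == 0) {
		p.head_zero = nr;
		return p;
	}

	aligned_start = (start + granularity - 1) / granularity * granularity; /* round up */
	aligned_end = end / granularity * granularity;			       /* round down */

	/* Request does not cover a full aligned chunk: zero out everything. */
	if (aligned_start >= aligned_end) {
		p.head_zero = nr;
		return p;
	}

	p.head_zero = aligned_start - start;
	p.discard_start = aligned_start;
	p.discard_len = aligned_end - aligned_start;
	p.tail_zero = end - aligned_end;
	return p;
}
```

For example, with a granularity of 8, a request for sectors 5..34 zeroes
3 sectors, discards sectors 8..31, and zeroes the last 3 sectors.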

For some discussion regarding this on dm-devel, see also
https://www.mail-archive.com/dm-devel%40redhat.com/msg07965.html
https://www.redhat.com/archives/dm-devel/2018-January/msg00271.html

For backward compatibility, P_TRIM means zero-out, unless the
DRBD_FF_WZEROES feature flag is agreed upon during handshake.

To have upper layers even try to submit WRITE ZEROES requests,
we need to announce "efficient zeroout" independently.

We need to fix up max_write_zeroes_sectors after blk_queue_stack_limits():
if we can handle "zeroes" efficiently on the protocol,
we want to do that, even if our backend does not announce
max_write_zeroes_sectors itself.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: skip spurious timeout (ping-timeo) when failing promote
Lars Ellenberg [Thu, 20 Dec 2018 16:23:41 +0000 (17:23 +0100)]
drbd: skip spurious timeout (ping-timeo) when failing promote

If you try to promote a Secondary while connected to a Primary
and allow-two-primaries is NOT set, we will wait for "ping-timeout"
to give this node a chance to detect a dead primary,
in case the cluster manager noticed faster than we did.

But if we then are *still* connected to a Primary,
we fail (after an additional timeout of ping-timeout).

This change skips the spurious second timeout.

Most people won't notice really,
since "ping-timeout" by default is half a second.

But in some installations, ping-timeout may be 10 or 20 seconds or more,
and spuriously delaying the error return becomes annoying.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: don't retry connection if peers do not agree on "authentication" settings
Lars Ellenberg [Thu, 20 Dec 2018 16:23:40 +0000 (17:23 +0100)]
drbd: don't retry connection if peers do not agree on "authentication" settings

emma: "Unexpected data packet AuthChallenge (0x0010)"
 ava: "expected AuthChallenge packet, received: ReportProtocol (0x000b)"
      "Authentication of peer failed, trying again."

Pattern repeats.

There is no point in retrying the handshake,
if we expect to receive an AuthChallenge,
but the peer is not even configured to expect or use a shared secret.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: fix print_st_err()'s prototype to match the definition
Luc Van Oostenryck [Thu, 20 Dec 2018 16:23:39 +0000 (17:23 +0100)]
drbd: fix print_st_err()'s prototype to match the definition

print_st_err() is defined with its 4th argument taking an
'enum drbd_state_rv' but its prototype uses an int for it.

Fix this by using 'enum drbd_state_rv' in the prototype too.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: avoid spurious self-outdating with concurrent disconnect / down
Lars Ellenberg [Thu, 20 Dec 2018 16:23:38 +0000 (17:23 +0100)]
drbd: avoid spurious self-outdating with concurrent disconnect / down

If peers are "simultaneously" told to disconnect from each other,
either explicitly, or implicitly by taking down the resource,
with bad timing, one side may see its disconnect "fail" with
a result of "state change failed by peer", and interpret this as
"please outdate yourself".

Try to catch this by checking for current connection status,
and possibly retry as local-only state change instead.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: do not block when adjusting "disk-options" while IO is frozen
Lars Ellenberg [Thu, 20 Dec 2018 16:23:37 +0000 (17:23 +0100)]
drbd: do not block when adjusting "disk-options" while IO is frozen

"suspending" IO is overloaded.
It can mean "do not allow new requests" (obviously),
but it also may mean "must not complete pending IO",
for example while the fencing handlers do their arbitration.

When adjusting disk options, we suspend io (disallow new requests), then
wait for the activity-log to become unused (drain all IO completions),
and possibly replace it with a new activity log of different size.

If the other "suspend IO" aspect is active, pending IO completions won't
happen, and we would block forever (unkillable drbdsetup process).

Fix this by skipping the activity log adjustment if the "al-extents"
setting did not change. Also, in case it did change, fail early without
blocking if it looks like we would block forever.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: fix comment typos
Lars Ellenberg [Thu, 20 Dec 2018 16:23:36 +0000 (17:23 +0100)]
drbd: fix comment typos

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: reject attach of unsuitable uuids even if connected
Lars Ellenberg [Thu, 20 Dec 2018 16:23:35 +0000 (17:23 +0100)]
drbd: reject attach of unsuitable uuids even if connected

Multiple failure scenario:
a) all good
   Connected Primary/Secondary UpToDate/UpToDate
b) lose disk on Primary,
   Connected Primary/Secondary Diskless/UpToDate
c) continue to write to the device,
   changes only make it to the Secondary storage.
d) lose disk on Secondary,
   Connected Primary/Secondary Diskless/Diskless
e) now try to re-attach on Primary

This would have succeeded before, even though that is clearly the
wrong data set to attach to (missing the modifications from c).
Because we only compared our "effective" and the "to-be-attached"
data generation uuid tags if (device->state.conn < C_CONNECTED).

Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE),
compare the uuids, and reject the attach.

This patch also tries to improve the reverse scenario:
first lose Secondary, then Primary disk,
then try to attach the disk on Secondary.

Before this patch, the attach on the Secondary succeeds, but since commit
drbd: disconnect, if the wrong UUIDs are attached on a connected peer
the Primary will notice unsuitable data, and drop the connection hard.

Though unfortunately at a point in time during the handshake where
we cannot easily abort the attach on the peer without more
refactoring of the handshake.

We now reject any attach to "unsuitable" uuids,
as long as we can see a Primary role,
unless we already have access to "good" data.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: attach on connected diskless peer must not shrink a consistent device
Lars Ellenberg [Thu, 20 Dec 2018 16:23:34 +0000 (17:23 +0100)]
drbd: attach on connected diskless peer must not shrink a consistent device

If we would reject a new handshake when the peer had attached first
and then connected, we should also force a disconnect when the peer
first connects and only then attaches.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: fix confusing error message during attach
Lars Ellenberg [Thu, 20 Dec 2018 16:23:33 +0000 (17:23 +0100)]
drbd: fix confusing error message during attach

If we attach a (consistent) backing device,
which knows about a last-agreed effective size,
and that effective size is *larger* than the currently requested size,
we refused to attach with ERR_DISK_TOO_SMALL
  Failure: (111) Low.dev. smaller than requested DRBD-dev. size.
which is confusing to say the least.

This patch changes the error code in that case to ERR_IMPLICIT_SHRINK
  Failure: (170) Implicit device shrinking not allowed. See kernel log.
  additional info from kernel:
  To-be-attached device has last effective > current size, and is consistent
  (9999 > 7777 sectors). Refusing to attach.

It also allows attaching with an explicit size.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: disconnect, if the wrong UUIDs are attached on a connected peer
Lars Ellenberg [Thu, 20 Dec 2018 16:23:32 +0000 (17:23 +0100)]
drbd: disconnect, if the wrong UUIDs are attached on a connected peer

With "on-no-data-accessible suspend-io", DRBD requires the next attach
or connect to be to the very same data generation uuid tag it lost last.

If we first lost connection to the peer,
then later lost connection to our own disk,
we would usually refuse to re-connect to the peer,
because it presents the wrong data set.

However, if the peer first connected without a disk,
and then attached its disk, we accepted that same wrong data set,
which would be "unexpected" by any user of that DRBD
and cause "undefined results" (read: very likely data corruption).

The fix is to forcefully disconnect as soon as we notice that the peer
attached to the "wrong" dataset.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: ignore "all zero" peer volume sizes in handshake
Lars Ellenberg [Thu, 20 Dec 2018 16:23:31 +0000 (17:23 +0100)]
drbd: ignore "all zero" peer volume sizes in handshake

During handshake, if we are diskless ourselves, we used to accept any size
presented by the peer.

Which could be zero if that peer was just brought up and connected
to us without having a disk attached first, in which case both
peers would just "flip" their volume sizes.

Now, even a diskless node will ignore "zero" sizes
presented by a diskless peer.

Also a currently Diskless Primary will refuse to shrink during handshake:
it may be frozen, and waiting for a "suitable" local disk or peer to
re-appear (on-no-data-accessible suspend-io). If the peer is smaller
than what we used to be, it is not suitable.

The logic for a diskless node during handshake is now supposed to be:
believe the peer, if
 - I don't have a current size myself
 - we agree on the size anyways
 - I do have a current size, am Secondary, and he has the only disk
 - I do have a current size, am Primary, and he has the only disk,
   which is larger than my current size
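
The list above can be read as a small decision function. This is a
sketch under assumptions, not the DRBD implementation:
diskless_accepts_peer_size() and its parameters are hypothetical names,
a size of 0 stands for "no current size", and the zero-size check for a
diskless peer is placed first as one possible reading of the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Should a diskless node believe the size presented by the peer?
 * my_size == 0 means "I don't have a current size myself". */
bool diskless_accepts_peer_size(uint64_t my_size, bool i_am_primary,
				bool peer_has_disk, uint64_t peer_size)
{
	/* Ignore "all zero" sizes presented by a diskless peer. */
	if (peer_size == 0 && !peer_has_disk)
		return false;
	/* I don't have a current size myself. */
	if (my_size == 0)
		return true;
	/* We agree on the size anyway. */
	if (my_size == peer_size)
		return true;
	/* From here on, the peer must have the only disk. */
	if (!peer_has_disk)
		return false;
	/* Secondary: believe the peer that has the only disk. */
	if (!i_am_primary)
		return true;
	/* Primary: refuse to shrink, accept only a larger peer disk. */
	return peer_size > my_size;
}
```

For instance, a diskless Primary with a current size of 100 sectors
rejects a peer disk of 50 sectors but accepts one of 200 sectors.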

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: centralize printk reporting of new size into drbd_set_my_capacity()
Lars Ellenberg [Thu, 20 Dec 2018 16:23:30 +0000 (17:23 +0100)]
drbd: centralize printk reporting of new size into drbd_set_my_capacity()

Previously, some implicit resizes that happened during handshake
were not reported as prominently as explicit resizes.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: must not use connection after kref_put(&connection->kref)
Lars Ellenberg [Thu, 20 Dec 2018 16:23:29 +0000 (17:23 +0100)]
drbd: must not use connection after kref_put(&connection->kref)

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago drbd: narrow rcu_read_lock in drbd_sync_handshake
Roland Kammerer [Thu, 20 Dec 2018 16:23:28 +0000 (17:23 +0100)]
drbd: narrow rcu_read_lock in drbd_sync_handshake

So far there was the possibility that we called
genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock().

This included cases like:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      drbd_bcast_event
        genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        mutex_lock --> may sleep

While using GFP_ATOMIC would have been possible in the first two cases,
the real fix is to narrow the rcu_read_lock.
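
What "narrowing" means can be shown with a userspace toy model. Nothing
below is kernel API: the model_* functions, shared_policy and both
handshake variants are hypothetical, and "sleeping" is only recorded in
a flag. The narrow variant copies what it needs inside the read-side
section and calls the possibly sleeping helper only after leaving it.

```c
#include <stdbool.h>

static int rcu_depth;		/* toy model of rcu_read_lock() nesting */
static bool slept_under_rcu;	/* set when a sleeping call runs under the lock */
static int shared_policy = 2;	/* stands in for RCU-protected configuration */

void model_rcu_read_lock(void)   { rcu_depth++; }
void model_rcu_read_unlock(void) { rcu_depth--; }

/* Stands in for genlmsg_new(GFP_NOIO) or mutex_lock(): may sleep. */
void model_may_sleep(void)
{
	if (rcu_depth > 0)
		slept_under_rcu = true;
}

/* Wide section: the sleeping helper runs while the read lock is held. */
bool handshake_wide(void)
{
	int policy;

	slept_under_rcu = false;
	model_rcu_read_lock();
	policy = shared_policy;
	(void)policy;
	model_may_sleep();
	model_rcu_read_unlock();
	return slept_under_rcu;
}

/* Narrow section: copy the value, drop the lock, then call the helper. */
bool handshake_narrow(void)
{
	int policy;

	slept_under_rcu = false;
	model_rcu_read_lock();
	policy = shared_policy;
	model_rcu_read_unlock();
	(void)policy;
	model_may_sleep();
	return slept_under_rcu;
}
```

In this model handshake_wide() reports a sleep under the read lock and
handshake_narrow() does not.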

Reported-by: Jia-Ju Bai <baijiaju1990@163.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: save irq state in blkg_lookup_create()
Ming Lei [Wed, 19 Dec 2018 16:29:15 +0000 (00:29 +0800)]
block: save irq state in blkg_lookup_create()

blkg_lookup_create() may be called from pool_map() in which
irq state is saved, so we have to do that in blkg_lookup_create().

Otherwise, the following lockdep warning can be triggered:

[  104.258537] ================================
[  104.259129] WARNING: inconsistent lock state
[  104.259725] 4.20.0-rc6+ #545 Not tainted
[  104.260268] --------------------------------
[  104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
[  104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.263747] {SOFTIRQ-ON-W} state was registered at:
[  104.264417]   _raw_spin_unlock_irq+0x29/0x4c
[  104.265014]   blkg_lookup_create+0xdc/0xe6
[  104.265609]   bio_associate_blkg_from_css+0xd3/0x13f
[  104.266312]   bio_associate_blkg+0x15a/0x1bb
[  104.266913]   pool_map+0xe8/0x103 [dm_thin_pool]
[  104.267572]   __map_bio+0x98/0x29c [dm_mod]
[  104.268162]   __split_and_process_non_flush+0x29e/0x306 [dm_mod]
[  104.269003]   __split_and_process_bio+0x16a/0x25b [dm_mod]
[  104.269971]   __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
[  104.270973]   generic_make_request+0x3f5/0x68b
[  104.271676]   process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
[  104.272531]   schedule_zero+0x239/0x273 [dm_thin_pool]
[  104.273245]   process_cell+0x60c/0x6f1 [dm_thin_pool]
[  104.273967]   do_worker+0x60c/0xca8 [dm_thin_pool]
[  104.274635]   process_one_work+0x4eb/0x834
[  104.275203]   worker_thread+0x318/0x484
[  104.275740]   kthread+0x1d1/0x1e1
[  104.276203]   ret_from_fork+0x3a/0x50
[  104.276714] irq event stamp: 170003
[  104.277201] hardirqs last  enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55
[  104.280273] softirqs last  enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73
[  104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d
[  104.282744]
[  104.282744] other info that might help us debug this:
[  104.283640]  Possible unsafe locking scenario:
[  104.283640]
[  104.284452]        CPU0
[  104.284803]        ----
[  104.285150]   lock(&(&pool->lock)->rlock#3);
[  104.285762]   <Interrupt>
[  104.286130]     lock(&(&pool->lock)->rlock#3);
[  104.286750]
[  104.286750]  *** DEADLOCK ***
[  104.286750]
[  104.287564] no locks held by swapper/49/0.
[  104.288129]
[  104.288129] stack backtrace:
[  104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
[  104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[  104.290858] Call Trace:
[  104.291204]  <IRQ>
[  104.291502]  dump_stack+0x9a/0xe6
[  104.291968]  mark_lock+0x56c/0x7a6
[  104.292442]  ? check_usage_backwards+0x209/0x209
[  104.293086]  __lock_acquire+0x400/0x15bf
[  104.293662]  ? check_chain_key+0x150/0x1aa
[  104.294236]  lock_acquire+0x1a6/0x1e3
[  104.294768]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.295444]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.296143]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.297031]  _raw_spin_lock_irqsave+0x46/0x55
[  104.297659]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.298335]  thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.298997]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.299886]  ? check_flags+0x20a/0x20a
[  104.300408]  ? lock_acquire+0x1a6/0x1e3
[  104.300954]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.301865]  clone_endio+0x1bb/0x22d [dm_mod]
[  104.302491]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.303200]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.303836]  ? bio_endio+0x2b2/0x2da
[  104.304349]  clone_endio+0x1f3/0x22d [dm_mod]
[  104.304978]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.305709]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.306333]  ? bio_endio+0x2b2/0x2da
[  104.306853]  clone_endio+0x1f3/0x22d [dm_mod]
[  104.307476]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.308185]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.308817]  ? bio_endio+0x2b2/0x2da
[  104.309319]  blk_update_request+0x2de/0x4cc
[  104.309927]  blk_mq_end_request+0x2a/0x183
[  104.310498]  blk_done_softirq+0x16a/0x1a6
[  104.311051]  ? blk_softirq_cpu_dead+0xe2/0xe2
[  104.311653]  ? __lock_is_held+0x2a/0x87
[  104.312186]  __do_softirq+0x250/0x4e8
[  104.312705]  irq_exit+0x7e/0x11d
[  104.313157]  call_function_single_interrupt+0xf/0x20
[  104.313860]  </IRQ>
[  104.314163] RIP: 0010:native_safe_halt+0x2/0x3
[  104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
[  104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[  104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
[  104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
[  104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
[  104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
[  104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
[  104.323307]  ? lockdep_hardirqs_on+0x26b/0x278
[  104.323927]  default_idle+0xd9/0x1a8
[  104.324427]  do_idle+0x162/0x2b2
[  104.324891]  ? arch_cpu_idle_exit+0x28/0x28
[  104.325467]  ? mark_held_locks+0x28/0x7f
[  104.326031]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.326719]  cpu_startup_entry+0x1d/0x1f
[  104.327261]  start_secondary+0x2cb/0x308
[  104.327806]  ? set_cpu_sibling_map+0x8a3/0x8a3
[  104.328421]  secondary_startup_64+0xa4/0xb0
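
The irq-state problem described before the trace can be modeled in
userspace. This is a toy model under stated assumptions, not kernel
code: irqs_enabled stands for the local interrupt state and all model_*
and callee_* names are hypothetical. A callee that unconditionally
re-enables interrupts on unlock breaks a caller that had them disabled,
while a callee that saves and restores the state does not.

```c
#include <stdbool.h>

static bool irqs_enabled = true;	/* toy model of local interrupt state */

/* Model of the irqsave/irqrestore pair: remember and restore the state. */
unsigned long model_irqsave(void)
{
	unsigned long flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

void model_irqrestore(unsigned long flags) { irqs_enabled = flags != 0; }

/* Model of the plain _irq pair: unlock always enables interrupts. */
void model_lock_irq(void)   { irqs_enabled = false; }
void model_unlock_irq(void) { irqs_enabled = true; }

void callee_irq(void)
{
	model_lock_irq();
	model_unlock_irq();	/* interrupts are on afterwards, always */
}

void callee_irqsave(void)
{
	unsigned long flags = model_irqsave();

	model_irqrestore(flags);	/* back to whatever the caller had */
}

/* The caller runs the callee with interrupts disabled and reports
 * whether they are still disabled when the callee returns. */
bool irqs_still_off_after(void (*callee)(void))
{
	unsigned long flags = model_irqsave();
	bool still_off;

	callee();
	still_off = !irqs_enabled;
	model_irqrestore(flags);
	return still_off;
}
```

In this model irqs_still_off_after(callee_irq) is false, which is the
inconsistent state lockdep complains about, and
irqs_still_off_after(callee_irqsave) is true.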

Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking")
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago dm: don't reuse bio for flushes
Jens Axboe [Wed, 19 Dec 2018 16:13:34 +0000 (09:13 -0700)]
dm: don't reuse bio for flushes

DM currently has a statically allocated bio that it uses to issue empty
flushes. It doesn't submit this bio, it just uses it for maintaining
state while setting up clones. Multiple users can access this bio at the
same time. This wasn't previously an issue, even if it was a bit iffy,
but with the blkg associations it can become one.

We set up the blkg association, then clone bios and submit, then remove
the blkg association again. But since we can have multiple tasks doing
this at the same time, against multiple blkg's, then we can either lose
references to a blkg, or put it twice. The latter causes complaints on
the percpu ref being <= 0 when released, and can cause use-after-free as
well. Ming reports that xfstest generic/475 triggers this:

------------[ cut here ]------------
percpu ref (blkg_release) <= 0 (0) after switching to atomic
WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0

Switch to just using an on-stack bio for this, and get rid of the
embedded bio.

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block
Jens Axboe [Wed, 19 Dec 2018 15:08:41 +0000 (08:08 -0700)]
Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block

Pull last batch of NVMe updates for 4.21 from Christoph:

"This contains a series from Sagi to restore poll support for nvme-rdma,
 a new tracepoint from yupeng and various fixes."

* 'nvme-4.21' of git://git.infradead.org/nvme:
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1

2 years ago nvme-pci: trace SQ status on completions
yupeng [Tue, 18 Dec 2018 16:59:53 +0000 (17:59 +0100)]
nvme-pci: trace SQ status on completions

Export the disk name, queue id, sq_head, sq_tail to a trace event in
completion handling.

Usage example:

cd /sys/kernel/debug/tracing/events/nvme/nvme_sq

echo 'disk=="nvme1n1"' > filter

echo 1 > enable

cat /sys/kernel/debug/tracing/trace_pipe

Signed-off-by: yupeng <yupeng0921@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: slight formatting tweaks, use standard nvme tracepoint
 conventions]
Signed-off-by: Christoph Hellwig <hch@lst.de>

2 years ago nvme-rdma: implement polling queue map
Sagi Grimberg [Fri, 14 Dec 2018 19:06:10 +0000 (11:06 -0800)]
nvme-rdma: implement polling queue map

When nr_poll_queues is passed, set up additional queues with cq polling
context IB_POLL_DIRECT (no interrupts) and make sure to set
QUEUE_FLAG_POLL on the connect_q. In addition, add the third queue
mapping for polling queues.

The nvmf connect on such a queue is polled for like all other requests, so
make nvmf_connect_io_queue poll for polling queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvme-fabrics: allow user to pass in nr_poll_queues
Sagi Grimberg [Fri, 14 Dec 2018 19:06:09 +0000 (11:06 -0800)]
nvme-fabrics: allow user to pass in nr_poll_queues

This argument will specify how many polling I/O queues to connect when
creating the controller. These I/O queues will host I/O that is set with
REQ_HIPRI.

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvme-fabrics: allow nvmf_connect_io_queue to poll
Sagi Grimberg [Fri, 14 Dec 2018 19:06:08 +0000 (11:06 -0800)]
nvme-fabrics: allow nvmf_connect_io_queue to poll

Preparation for polling support for fabrics. Polling support
means that our completion queues do not generate any interrupts,
which means we need to poll for the nvmf io queue connect as well.

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvme-core: optionally poll sync commands
Sagi Grimberg [Fri, 14 Dec 2018 19:06:07 +0000 (11:06 -0800)]
nvme-core: optionally poll sync commands

Pass a poll bool to indicate that the command needs to be polled. This
prepares us for polling support in nvmf, since connect is an I/O that will
be queued and has to be polled in order to complete. If poll is passed,
we call nvme_execute_rq_polled, which sends the request and polls
for its completion.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago block: make request_to_qc_t public
Sagi Grimberg [Fri, 14 Dec 2018 19:06:06 +0000 (11:06 -0800)]
block: make request_to_qc_t public

Block consumers will need it for polling requests that
are sent with blk_execute_rq_nowait. Also, get rid of
blk_tag_to_qc_t and open-code it instead.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
Colin Ian King [Fri, 14 Dec 2018 11:42:43 +0000 (11:42 +0000)]
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"

There is a spelling mistake in a dev_info message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvme-tcp: fix endianess annotations
Christoph Hellwig [Thu, 13 Dec 2018 08:46:59 +0000 (09:46 +0100)]
nvme-tcp: fix endianess annotations

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2 years ago nvmet-tcp: fix endianess annotations
Christoph Hellwig [Thu, 13 Dec 2018 08:41:15 +0000 (09:41 +0100)]
nvmet-tcp: fix endianess annotations

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
2 years ago nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
Christoph Hellwig [Thu, 13 Dec 2018 08:48:00 +0000 (09:48 +0100)]
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy

By duplicating the nvme_process_cq in both branches we keep the
sparse lock context checking happy, so do it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2 years ago nvme-pci: only set nr_maps to 2 if poll queues are supported
Christoph Hellwig [Fri, 14 Dec 2018 13:06:59 +0000 (14:06 +0100)]
nvme-pci: only set nr_maps to 2 if poll queues are supported

The block layer now enables polling support on a queue if nr_maps
includes the poll map, so we should only set that if we actually
support poll queues.

Fixes:  6544d229bf ("block: enable polling by default if a poll map is initalized")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2 years ago nvmet: use a macro for default error location
Chaitanya Kulkarni [Tue, 18 Dec 2018 02:35:29 +0000 (18:35 -0800)]
nvmet: use a macro for default error location

This patch defines a new macro NVMET_NO_ERROR_LOC to represent the
default error location value in the nvme-error-log-page.
This is a pure cleanup patch and it does not change any functionality.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years ago nvmet: fix comparison of a u16 with -1
Colin Ian King [Fri, 14 Dec 2018 18:31:21 +0000 (18:31 +0000)]
nvmet: fix comparison of a u16 with -1

Currently the u16 req->error_loc is compared to -1, which
is always false.  Fix this by casting -1 to u16.

Detected by clang:
  warning: result of comparison of constant -1 with expression of
  type 'u16' (aka 'unsigned short') is always false
  [-Wtautological-constant-out-of-range-compare]
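
A standalone sketch of why the comparison can never be true: the u16
typedef and both function names are illustrative, not the nvmet code,
and the integer promotion that C performs implicitly is written out so
that the example itself compiles without the warning.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t u16;

/* Before the fix: in 'error_loc == -1' the u16 is promoted to int,
 * so its value is in 0..65535 and can never equal -1. The promotion
 * is spelled out here with an explicit int variable. */
bool is_unset_before_fix(u16 error_loc)
{
	int promoted = error_loc;

	return promoted == -1;
}

/* After the fix: (u16)-1 is 0xFFFF, so the "all ones" marker matches. */
bool is_unset_after_fix(u16 error_loc)
{
	return error_loc == (u16)-1;
}
```

With error_loc set to 0xFFFF, the first function returns false and the
second returns true.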

Fixes: 76574f37bf4c ("nvmet: add interface to update error-log page")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agoblk-mq: enable IO poll if .nr_queues of type poll > 0
Ming Lei [Tue, 18 Dec 2018 04:15:29 +0000 (12:15 +0800)]
blk-mq: enable IO poll if .nr_queues of type poll > 0

The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues
is bigger than zero, so enhance the constraint by checking .nr_queues of type poll
before enabling IO poll.

Otherwise IO race & timeout can be observed when running block/007.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
Jens Axboe [Tue, 18 Dec 2018 04:11:17 +0000 (21:11 -0700)]
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()

There's a single user of this function, dm, and dm just wants
to check if IO is inflight, not that it's just allocated.

This fixes a hang with srp/002 in blktests with dm, where it tries
to suspend but waits for inflight IO to finish first. As it checks
for just allocated requests, this fails.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: skip zero-queue maps in blk_mq_map_swqueue
Ming Lei [Mon, 17 Dec 2018 17:28:56 +0000 (01:28 +0800)]
blk-mq: skip zero-queue maps in blk_mq_map_swqueue

Since 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping
table is not initialized if map->nr_queues is zero, so we can no longer
use blk_mq_map_queue_type() to retrieve the hctx for such a map.

Doing so may still produce a broken mapping; fix it by skipping
zero-queue maps in blk_mq_map_swqueue().
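
The skip can be pictured with a small userspace model. This is only a
sketch under stated assumptions: struct queue_map_model, NR_TYPES and
map_cpu() are hypothetical names, and the real blk_mq_map_swqueue()
does considerably more work per CPU:

```c
#include <stddef.h>

#define NR_TYPES 3	/* assumed: default, read and poll maps */

struct queue_map_model {
	const unsigned int *mq_map;	/* cpu -> hw queue index, may be unset */
	unsigned int nr_queues;
};

/*
 * For one CPU, look up the hw queue index in every map type.
 * A map with nr_queues == 0 is skipped, because its mq_map table
 * may never have been initialized. Skipped types are reported as -1.
 */
static void map_cpu(const struct queue_map_model maps[NR_TYPES],
		    unsigned int cpu, int out[NR_TYPES])
{
	for (unsigned int j = 0; j < NR_TYPES; j++) {
		if (!maps[j].nr_queues) {
			out[j] = -1;
			continue;
		}
		out[j] = (int)maps[j].mq_map[cpu];
	}
}
```

In this model a driver that only fills in the default map can leave the
other two entries zeroed without the lookup touching their tables.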

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: fix blk-iolatency accounting underflow
Dennis Zhou [Mon, 17 Dec 2018 16:03:51 +0000 (11:03 -0500)]
block: fix blk-iolatency accounting underflow

The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series [1] to tag all bios with blkgs undermined how iolatency
was determining which bios it was charging and should process in
rq_qos_done_bio(). Because all bios are being tagged, this caused the
atomic_t for the struct rq_wait inflight count to underflow and result
in a stall.

This patch adds a new flag BIO_TRACKED to let controllers know that a
bio is going through the rq_qos path. blk-iolatency now checks if this
flag is set to see if it should process the bio in rq_qos_done_bio().
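
A minimal userspace model of the accounting, assuming a plain int
counter instead of the atomic_t in struct rq_wait. The struct names,
the helper names and the bit position of BIO_TRACKED are hypothetical
and chosen only for illustration:

```c
#define BIO_TRACKED (1u << 0)	/* bit position is an assumption */

struct bio_model {
	unsigned int flags;
};

struct rq_wait_model {
	int inflight;
};

/* Throttle path: charge the bio and remember that it was charged. */
static void model_throttle(struct rq_wait_model *w, struct bio_model *bio)
{
	w->inflight++;
	bio->flags |= BIO_TRACKED;
}

/*
 * Completion path: only bios that went through the throttle path are
 * uncharged. Without the flag check, a merged bio that bypassed the
 * throttle path would push the counter below zero.
 */
static void model_done_bio(struct rq_wait_model *w, struct bio_model *bio)
{
	if (!(bio->flags & BIO_TRACKED))
		return;
	w->inflight--;
}
```

Completing one throttled bio and one merged bio leaves inflight at 0 in
this model rather than at -1.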

Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
BIO_THROTTLED was another candidate, but the flag is set for all bios
that have gone through blk-throttle code. Overloading a flag comes with
the burden of making sure that when either implementation changes, a
change in setting rules for one doesn't cause a bug in the other. So
here, we unfortunately opt for adding a new flag.

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: fix dispatch from sw queue
Ming Lei [Mon, 17 Dec 2018 15:44:05 +0000 (08:44 -0700)]
blk-mq: fix dispatch from sw queue

When a request is added to the rq list of a sw queue (ctx), the rq may be
from a different type of hctx, especially after multi queue mapping was
introduced.

So when dispatching requests from a sw queue via blk_mq_flush_busy_ctxs() or
blk_mq_dequeue_from_ctx(), a request belonging to another queue type of
hctx can be dispatched to the current hctx if the read queue or poll
queue is enabled.

This patch fixes this issue by introducing a per-queue-type list.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Changed by me to not use separately cacheline aligned lists, just
place them all in the same cacheline where we had just the one list
and lock before.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: mq-deadline: Fix write completion handling
Damien Le Moal [Mon, 17 Dec 2018 06:14:05 +0000 (15:14 +0900)]
block: mq-deadline: Fix write completion handling

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() is not called. As a result, the
blk_mq_sched_restart() call in __blk_mq_free_request() does not run the
queue when the already dispatched write request completes. The newly
inserted request stays stuck in the scheduler queue until eventually
another request is submitted.

This problem does not affect SCSI disk as the SCSI stack handles queue
restart on request completion. However, this problem can be triggered
with the nullblk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.

Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Add missing export of blk_mq_sched_restart()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvme-pci: don't share queue maps
Christoph Hellwig [Mon, 17 Dec 2018 11:16:27 +0000 (12:16 +0100)]
nvme-pci: don't share queue maps

Now that the block layer checks if a queue map has any queues inside
it there is no more reason to duplicate the maps for the non-default
types.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: only dispatch to non-defauly queue maps if they have queues
Christoph Hellwig [Mon, 17 Dec 2018 11:16:26 +0000 (12:16 +0100)]
blk-mq: only dispatch to non-defauly queue maps if they have queues

We should check if a given queue map actually has queues enabled before
dispatching to it.  This allows drivers to not initialize optional but
not used map types, which subsequently will allow fixing problems with
queue map rebuilds for that case.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: export hctx->type in debugfs instead of sysfs
Ming Lei [Mon, 17 Dec 2018 10:42:48 +0000 (18:42 +0800)]
blk-mq: export hctx->type in debugfs instead of sysfs

Currently we only export hctx->type via sysfs, and there is no such info
in the hctx entry under debugfs. We often use only debugfs to diagnose
queue mapping issues, so add the support in debugfs.

Queue mapping has become a bit more complicated now that multiple queue
mappings are supported; we may write a blktest to verify that queue
mapping is valid based on blk-mq-debugfs.

Since it is not necessary to export hctx->type twice, remove the export
from sysfs.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: fix allocation for queue mapping table
Ming Lei [Mon, 17 Dec 2018 10:42:45 +0000 (18:42 +0800)]
blk-mq: fix allocation for queue mapping table

The type of each element in the queue mapping table is 'unsigned int',
instead of 'struct blk_mq_queue_map', so fix it.
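
One common way to avoid this class of mistake is to size the allocation
from the pointer being assigned. The following is a userspace sketch
with calloc(), not the kernel allocation call; struct queue_map_model
and alloc_mq_map() are hypothetical names:

```c
#include <errno.h>
#include <stdlib.h>

struct queue_map_model {
	unsigned int *mq_map;	/* one hw queue index per CPU */
	unsigned int nr_queues;
};

/* Allocate a zeroed per-CPU table; returns 0 or -ENOMEM. */
static int alloc_mq_map(struct queue_map_model *map, size_t nr_cpus)
{
	/*
	 * sizeof(*map->mq_map) is sizeof(unsigned int), so the element
	 * size follows the table's real type and not the struct's size.
	 */
	map->mq_map = calloc(nr_cpus, sizeof(*map->mq_map));
	return map->mq_map ? 0 : -ENOMEM;
}
```

The caller owns the table and releases it with free() in this sketch.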

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-wbt: export internal state via debugfs
Ming Lei [Mon, 17 Dec 2018 01:46:01 +0000 (09:46 +0800)]
blk-wbt: export internal state via debugfs

This information is helpful to either investigate issues, or understand
wbt's internal behaviour.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq-debugfs: support rq_qos
Ming Lei [Mon, 17 Dec 2018 01:46:00 +0000 (09:46 +0800)]
blk-mq-debugfs: support rq_qos

blk-mq-debugfs has proved very helpful for debugging some
tough issues, such as IO hangs.

We have seen blk-wbt related IO hangs several times; even in
Red Hat BZ there is such a report that is not solved yet, so this patch
adds debugfs support for rq_qos.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: update sysfs documentation
Damien Le Moal [Fri, 30 Nov 2018 05:36:24 +0000 (14:36 +0900)]
block: update sysfs documentation

Add the description of the zoned, nr_zones and chunk_sectors sysfs queue
attributes to Documentation/block/queue-sysfs.txt. The descriptions of
the zoned and chunk_sectors attributes are mostly copied from
ABI/testing/sysfs-block (added a typo fix). While at it, also fix a
typo in the description of the io_poll_delay attribute.

nr_zones description is also added to ABI/testing/sysfs-block and
contact email address updated for the zoned attribute.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()
Chengguang Xu [Sun, 16 Dec 2018 09:35:00 +0000 (17:35 +0800)]
block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()

blk_mq_init_queue() will not return NULL pointer to its caller,
so it's better to replace IS_ERR_OR_NULL using IS_ERR in loop_add().

If in the future things change to check NULL pointer inside loop_add(),
we should return -ENOMEM as return code instead of PTR_ERR(NULL).
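
The concern about PTR_ERR(NULL) can be seen in a userspace sketch. The
helpers below re-create the kernel's error-pointer helpers for
illustration only, and queue_error_buggy() and queue_error() are
hypothetical callers, not code from loop_add():

```c
#include <errno.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel helpers, for illustration. */
static void *ERR_PTR(long error)
{
	return (void *)error;
}

static long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* PTR_ERR(NULL) is 0, so a NULL queue would be reported as success. */
static long queue_error_buggy(const void *q)
{
	return IS_ERR_OR_NULL(q) ? PTR_ERR(q) : 0;
}

/* If NULL ever had to be handled, it needs an explicit error code. */
static long queue_error(const void *q)
{
	if (!q)
		return -ENOMEM;
	return IS_ERR(q) ? PTR_ERR(q) : 0;
}
```

In this sketch queue_error_buggy(NULL) yields 0, while queue_error(NULL)
yields -ENOMEM.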

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoaoe: add __exit annotation
Chengguang Xu [Sun, 16 Dec 2018 06:08:18 +0000 (14:08 +0800)]
aoe: add __exit annotation

Add __exit annotation to cleanup helper which
is only called once in the module.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: clear REQ_HIPRI if polling is not supported
Christoph Hellwig [Fri, 14 Dec 2018 16:21:22 +0000 (17:21 +0100)]
block: clear REQ_HIPRI if polling is not supported

This prevents a HIPRI bio from being submitted through a stacking
driver that does not support polling and thus won't poll for I/O
completion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: replace and kill blk_mq_request_issue_directly
Jianchao Wang [Fri, 14 Dec 2018 01:28:20 +0000 (09:28 +0800)]
blk-mq: replace and kill blk_mq_request_issue_directly

Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly
in blk_insert_cloned_request and kill it as nobody uses it any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests
Jianchao Wang [Fri, 14 Dec 2018 01:28:19 +0000 (09:28 +0800)]
blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests

It is not necessary to issue request directly with bypass 'true'
in blk_mq_sched_insert_requests and handle the non-issued requests
itself. Just set bypass to 'false' and let blk_mq_try_issue_directly
handle them totally. Remove the blk_rq_can_direct_dispatch check,
because blk_mq_try_issue_directly can handle it well. If a request is
direct-issued unsuccessfully, insert the rest.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: refactor the code of issue request directly
Jianchao Wang [Fri, 14 Dec 2018 01:28:18 +0000 (09:28 +0800)]
blk-mq: refactor the code of issue request directly

Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
into one interface to unify the interfaces to issue requests
directly. The merged interface takes over the requests totally;
it can insert, end or do nothing based on the return value of
.queue_rq and the 'bypass' parameter. The caller then needs no other
handling and the code can be cleaned up.

Also, commit c616cbee ("blk-mq: punt failed direct issue
to dispatch list") always inserts requests to the hctx dispatch list
whenever it gets a BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE; this is
overkill and will harm the merging. We just need to do that for
requests that have been through .queue_rq. This patch also
fixes this.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the bio_integrity_advance export
Christoph Hellwig [Thu, 13 Dec 2018 20:32:14 +0000 (21:32 +0100)]
block: remove the bio_integrity_advance export

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the bioset_integrity_free export
Christoph Hellwig [Thu, 13 Dec 2018 20:32:13 +0000 (21:32 +0100)]
block: remove the bioset_integrity_free export

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the unused bio_set_pages_dirty and bio_check_pages_dirty exports
Christoph Hellwig [Thu, 13 Dec 2018 20:32:12 +0000 (21:32 +0100)]
block: remove the unused bio_set_pages_dirty and bio_check_pages_dirty exports

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the unused bio_iov_iter_get_pages export
Christoph Hellwig [Thu, 13 Dec 2018 20:32:11 +0000 (21:32 +0100)]
block: remove the unused bio_iov_iter_get_pages export

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the blk_recount_segments export
Christoph Hellwig [Thu, 13 Dec 2018 20:32:10 +0000 (21:32 +0100)]
block: remove the blk_recount_segments export

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove the bio_phys_segments export
Christoph Hellwig [Thu, 13 Dec 2018 20:32:09 +0000 (21:32 +0100)]
block: remove the bio_phys_segments export

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvme: fix kernel paging oops
Sagi Grimberg [Thu, 13 Dec 2018 20:34:07 +0000 (12:34 -0800)]
nvme: fix kernel paging oops

Free the controller discard_page correctly.

Fixes: cb5b7262b011 ("nvme: provide fallback for discard alloc failure")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoMerge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block
Jens Axboe [Thu, 13 Dec 2018 19:59:37 +0000 (12:59 -0700)]
Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block

Pull NVMe updates from Christoph:

"Here is the second large chunk of nvme updates for 4.21:

 - host and target support for NVMe over TCP (Sagi Grimberg,
Roy Shterman, Solganik Alexander)
 - error log page support in target (Chaitanya Kulkarni)

plus small fixes and improvements from Jens Axboe and Chengguang Xu."

* 'nvme-4.21' of git://git.infradead.org/nvme: (33 commits)
  nvme-rdma: support separate queue maps for read and write
  nvme-tcp: support separate queue maps for read and write
  nvme-fabrics: allow user to set nr_write_queues for separate queue maps
  nvme-fabrics: add missing nvmf_ctrl_options documentation
  blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
  nvmet: update smart log with num err log entries
  nvmet: add error log page cmd handler
  nvmet: add error log support for file backend
  nvmet: add error log support for bdev backend
  nvmet: add error log support for admin-cmd
  nvmet: add error log support for rdma backend
  nvmet: add error log support for fabrics-cmd
  nvmet: add error log support in the core
  nvmet: add interface to update error-log page
  nvmet: add error-log definitions
  nvme: add error log page slot definition
  nvme: remove nvme_common command cdw10 array
  nvmet: remove unused variable
  nvme: provide fallback for discard alloc failure
  nvme: add __exit annotation
  ...

2 years agobcache: print number of keys in trace_bcache_journal_write
Guoju Fang [Thu, 13 Dec 2018 14:53:57 +0000 (22:53 +0800)]
bcache: print number of keys in trace_bcache_journal_write

Sometimes journal flushes may be very frequent, so it's useful to dump
the number of keys on every journal write.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: set writeback_percent in a flexible range
Coly Li [Thu, 13 Dec 2018 14:53:56 +0000 (22:53 +0800)]
bcache: set writeback_percent in a flexible range

Because CUTOFF_WRITEBACK is defined as 40, before the change to
dynamic cutoff writeback values writeback_percent was limited to [0,
CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK was fixed
up to 40.

Now the cutoff writeback limit is a dynamic value bch_cutoff_writeback, so
the range of writeback_percent can be the more flexible range [0,
bch_cutoff_writeback]. It can be expanded to a
larger or smaller range than [0, 40], depending on how the value
bch_cutoff_writeback is specified.

The default value is still strongly recommended for most users and
most workloads. But people who want to do research on bcache
writeback performance tuning now have the chance to specify a more
flexible writeback_percent in the range [0, 70].

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: make cutoff_writeback and cutoff_writeback_sync tunable
Coly Li [Thu, 13 Dec 2018 14:53:55 +0000 (22:53 +0800)]
bcache: make cutoff_writeback and cutoff_writeback_sync tunable

Currently the cutoff writeback and cutoff writeback sync thresholds are
defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
static values. Most of the time they work fine, but when people want
to do research on bcache writeback mode performance tuning, there is no
way to modify the soft and hard cutoff writeback values.

This patch introduces two module parameters bch_cutoff_writeback_sync
and bch_cutoff_writeback which permit people to tune the values when
loading bcache.ko. If they are not specified by module loading, current
values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as
default and nothing changes.

When people want to tune these two values,
- cutoff_writeback can be set in range [1, 70]
- cutoff_writeback_sync can be set in range [1, 90]
- cutoff_writeback always <= cutoff_writeback_sync
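
The three rules above can be sketched as a small normalization helper.
This is an illustration under assumptions, not the bcache code: the
names CUTOFF_WRITEBACK_MAX, CUTOFF_WRITEBACK_SYNC_MAX, struct cutoffs
and normalize_cutoffs() are hypothetical, and treating 0 as "not
specified" is an assumption of this sketch:

```c
#define CUTOFF_WRITEBACK		40	/* default, from the text */
#define CUTOFF_WRITEBACK_SYNC		70	/* default, from the text */
#define CUTOFF_WRITEBACK_MAX		70	/* upper bound, from the text */
#define CUTOFF_WRITEBACK_SYNC_MAX	90	/* upper bound, from the text */

struct cutoffs {
	unsigned int writeback;
	unsigned int writeback_sync;
};

/* Apply defaults, clamp to the upper bounds, then order the pair. */
static struct cutoffs normalize_cutoffs(unsigned int wb, unsigned int wb_sync)
{
	struct cutoffs c;

	/* 0 is taken to mean "parameter not given" (assumption). */
	c.writeback_sync = wb_sync ? wb_sync : CUTOFF_WRITEBACK_SYNC;
	if (c.writeback_sync > CUTOFF_WRITEBACK_SYNC_MAX)
		c.writeback_sync = CUTOFF_WRITEBACK_SYNC_MAX;

	c.writeback = wb ? wb : CUTOFF_WRITEBACK;
	if (c.writeback > CUTOFF_WRITEBACK_MAX)
		c.writeback = CUTOFF_WRITEBACK_MAX;

	/* cutoff_writeback must never exceed cutoff_writeback_sync. */
	if (c.writeback > c.writeback_sync)
		c.writeback = c.writeback_sync;

	return c;
}
```

Clamping the soft cutoff down to the hard one is one possible policy;
rejecting the pair outright would be another.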

The default values are strongly recommended for most users and most
workloads. However, people who want to take their own risk and do research
on new writeback cutoff tuning for their own workload can now do so.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: add MODULE_DESCRIPTION information
Coly Li [Thu, 13 Dec 2018 14:53:54 +0000 (22:53 +0800)]
bcache: add MODULE_DESCRIPTION information

This patch moves MODULE_AUTHOR and MODULE_LICENSE to end of super.c, and
add MODULE_DESCRIPTION("Bcache: a Linux block layer cache").

This is preparation for adding module parameters.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: option to automatically run gc thread after writeback
Coly Li [Thu, 13 Dec 2018 14:53:53 +0000 (22:53 +0800)]
bcache: option to automatically run gc thread after writeback

The option gc_after_writeback is disabled by default, because garbage
collection will discard SSD data which drops cached data.

Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
enable this option, which wakes up the gc thread when writeback is accomplished
and all cached data is clean.

This option is helpful for people who care more about write performance. In
a heavy write workload, all cached data can become clean only when the
writeback thread cleans all cached data during I/O idle time. In such a
situation a subsequent gc run may help to shrink the bcache B+ tree and
discard more clean data, which may be helpful for future write
requests.

If you are not sure whether this is helpful for your own workload,
please leave it as disabled by default.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: introduce force_wake_up_gc()
Coly Li [Thu, 13 Dec 2018 14:53:52 +0000 (22:53 +0800)]
bcache: introduce force_wake_up_gc()

The garbage collection thread starts to work when c->sectors_to_gc is
a negative value; otherwise nothing will happen even if the gc thread is
woken up by wake_up_gc().

force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
wake_up_gc(), then gc thread may have chance to run if no one else sets
c->sectors_to_gc to a positive value before gc_should_run().

This routine can be called where the gc thread is woken up and required
to run in force.
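
A simplified single-threaded model of the behaviour described above.
It assumes a plain int where the kernel code must cope with concurrent
updates, and it reduces gc_should_run() to the sign check alone; struct
cache_set_model and the model_ prefixed helpers are hypothetical:

```c
struct cache_set_model {
	int sectors_to_gc;	/* gc is due once this goes negative */
	int wakeups;		/* counts wake-up calls, for the model only */
};

/* The gc thread only does work when the counter is negative. */
static int model_gc_should_run(const struct cache_set_model *c)
{
	return c->sectors_to_gc < 0;
}

/* A plain wake-up does not touch the counter. */
static void model_wake_up_gc(struct cache_set_model *c)
{
	c->wakeups++;
}

/* Force the counter negative first, then wake the thread. */
static void model_force_wake_up_gc(struct cache_set_model *c)
{
	c->sectors_to_gc = -1;
	model_wake_up_gc(c);
}
```

In this model a plain wake-up with a positive counter leaves
model_gc_should_run() at 0, while the forced variant makes it 1.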

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: cannot set writeback_running via sysfs if no writeback kthread created
Shenghui Wang [Thu, 13 Dec 2018 14:53:51 +0000 (22:53 +0800)]
bcache: cannot set writeback_running via sysfs if no writeback kthread created

"echo 1 > writeback_running" marks writeback_running even if no
writeback kthread is created, as "d_strtoul(writeback_running)" will simply
set dc->writeback_running without checking the existence of
dc->writeback_thread.

Add check for setting writeback_running via sysfs: if no writeback
kthread available, reject setting to 1.

v2 -> v3:
  * Make message on wrong assignment more clear.
  * Print name of bcache device instead of name of backing device.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: do not mark writeback_running too early
Shenghui Wang [Thu, 13 Dec 2018 14:53:50 +0000 (22:53 +0800)]
bcache: do not mark writeback_running too early

A fresh backing device is not attached to any cache_set, and
has no writeback kthread created until first attached to some
cache_set.

But bch_cached_dev_writeback_init() runs
"
dc->writeback_running = true;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
&dc->disk.flags));
"
for any newly formatted backing devices.

For a fresh standalone backing device, we can get something like
following even if no writeback kthread created:
------------------------
/sys/block/bcache0/bcache# cat writeback_running
1
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate: 512.0k/sec
dirty: 0.0k
target: 0.0k
proportional: 0.0k
integral: 0.0k
change: 0.0k/sec
next io: -15427384ms

The non-zero fields are misleading as there is no live writeback kthread yet.

Set dc->writeback_running to false in bch_cached_dev_writeback_init(),
as no writeback thread has been created yet.

The writeback thread is created and woken up in
bch_cached_dev_writeback_start(). Set dc->writeback_running to true before
bch_writeback_queue() is called, as the writeback thread checks whether
dc->writeback_running is true before writing back dirty data, and hangs
if it is false.

After the change, we can get the following output for a fresh standalone
backing device:
-----------------------
/sys/block/bcache0/bcache$ cat writeback_running
0
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate: 0.0k/sec
dirty: 0.0k
target: 0.0k
proportional: 0.0k
integral: 0.0k
change: 0.0k/sec
next io: 0ms

v1 -> v2:
  Set dc->writeback_running before bch_writeback_queue() is called.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: update comment in sysfs.c
Shenghui Wang [Thu, 13 Dec 2018 14:53:49 +0000 (22:53 +0800)]
bcache: update comment in sysfs.c

We have struct cached_dev allocated by kzalloc in register_bcache(),
which initializes all the fields of cached_dev with 0s. And commit
ce4c3e19e520 ("bcache: Replace bch_read_string_list() by
__sysfs_match_string()") has removed the string "default".

Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: update comment for bch_data_insert
Shenghui Wang [Thu, 13 Dec 2018 14:53:48 +0000 (22:53 +0800)]
bcache: update comment for bch_data_insert

commit 220bb38c21b8 ("bcache: Break up struct search") introduced
changes to struct search and s->iop. bypass/bio are fields of struct
data_insert_op now. Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: do not check if debug dentry is ERR or NULL explicitly on remove
Shenghui Wang [Thu, 13 Dec 2018 14:53:47 +0000 (22:53 +0800)]
bcache: do not check if debug dentry is ERR or NULL explicitly on remove

debugfs_remove and debugfs_remove_recursive will check if the dentry
pointer is NULL or ERR, and will do nothing in that case.

Remove the check in cache_set_free and bch_debug_init.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: add comment for cache_set->fill_iter
Shenghui Wang [Thu, 13 Dec 2018 14:53:46 +0000 (22:53 +0800)]
bcache: add comment for cache_set->fill_iter

We have the following define for btree iterator:
struct btree_iter {
size_t size, used;
#ifdef CONFIG_BCACHE_DEBUG
struct btree_keys *b;
#endif
struct btree_iter_set {
struct bkey *k, *end;
} data[MAX_BSETS];
};

We can see that the length of data[] field is static MAX_BSETS, which is
defined as 4 currently.

But a btree node on disk could have too many bsets for an iterator to fit
on the stack - maybe far more than MAX_BSETS. We have to dynamically allocate
space to host more btree_iter_sets.

bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
allocate an iterator equipped with enough room that can host
(sb.bucket_size / sb.block_size)
btree_iter_sets, which is more than static MAX_BSETS.

bch_btree_node_read_done() will use that pool to allocate one iterator, to
host many bsets in one btree node.
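
The sizing can be sketched in userspace as follows. This is not how
bcache lays out the memory: the sketch swaps in a C99 flexible array
member and calloc() where the text describes a preallocated pool, and
struct fill_iter_model, struct iter_set_model and alloc_fill_iter() are
hypothetical names:

```c
#include <stddef.h>
#include <stdlib.h>

struct bkey;	/* opaque here; only pointers to it are stored */

struct iter_set_model {
	struct bkey *k, *end;
};

struct fill_iter_model {
	size_t size, used;
	struct iter_set_model data[];	/* sized at allocation time */
};

/*
 * Allocate an iterator with room for bucket_size / block_size sets,
 * the count named in the text. Returns NULL on failure or when
 * block_size is 0.
 */
static struct fill_iter_model *alloc_fill_iter(size_t bucket_size,
					       size_t block_size)
{
	struct fill_iter_model *iter;
	size_t nr_sets;

	if (!block_size)
		return NULL;

	nr_sets = bucket_size / block_size;
	iter = calloc(1, sizeof(*iter) + nr_sets * sizeof(iter->data[0]));
	if (iter)
		iter->size = nr_sets;
	return iter;
}
```

For example, a bucket of 1024 sectors with 8-sector blocks gives room
for 128 sets, well above a static limit of 4.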

Add more comment around cache_set->fill_iter to make code less confusing.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvme-rdma: support separate queue maps for read and write
Sagi Grimberg [Wed, 12 Dec 2018 07:38:58 +0000 (23:38 -0800)]
nvme-rdma: support separate queue maps for read and write

Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues.  In
addition, implement .map_queues that will apply 2 queue maps for read
and write queue sets.

Note that with the separate queue map, HCTX_TYPE_READ will always use
nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: support separate queue maps for read and write
Sagi Grimberg [Wed, 12 Dec 2018 07:38:57 +0000 (23:38 -0800)]
nvme-tcp: support separate queue maps for read and write

Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues.  In
addition, implement .map_queues that will apply 2 queue maps for read
and write queue sets.

Note that with the separate queue map, HCTX_TYPE_READ will always use
nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: allow user to set nr_write_queues for separate queue maps
Sagi Grimberg [Wed, 12 Dec 2018 07:38:56 +0000 (23:38 -0800)]
nvme-fabrics: allow user to set nr_write_queues for separate queue maps

This argument will specify how many I/O queues will be connected in
create_ctrl in addition to nr_io_queues. With this configuration, I/O
that carries payload from the host to the target, will use the default
hctx queue map, and I/O that involves target to host transfers will use
the read hctx queue map.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: add missing nvmf_ctrl_options documentation
Sagi Grimberg [Wed, 12 Dec 2018 07:38:55 +0000 (23:38 -0800)]
nvme-fabrics: add missing nvmf_ctrl_options documentation

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agoblk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
Sagi Grimberg [Wed, 12 Dec 2018 07:38:54 +0000 (23:38 -0800)]
blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues

Will be used by nvme-rdma for queue map separation support.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: update smart log with num err log entries
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:48 +0000 (15:11 -0800)]
nvmet: update smart log with num err log entries

Now that we have an error log page implementation, update the smart log
command handler to provide the number of error log entries in the
lifetime of the controller field.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log page cmd handler
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:47 +0000 (15:11 -0800)]
nvmet: add error log page cmd handler

Now that we have support for all the major parts of the target, we add
an NVMe error log page handler so that the host can read the log page.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support for file backend
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:43 +0000 (15:11 -0800)]
nvmet: add error log support for file backend

This patch adds support for the file backend to populate the
error log entries. Here we map the errno to the NVMe status codes.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support for bdev backend
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:42 +0000 (15:11 -0800)]
nvmet: add error log support for bdev backend

This patch adds the support for the block device backend to populate the
error log entries. Here we map the blk_status_t to the NVMe status.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support for admin-cmd
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:46 +0000 (15:11 -0800)]
nvmet: add error log support for admin-cmd

This patch adds the support to maintain the error log page for admin
commands.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support for rdma backend
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:45 +0000 (15:11 -0800)]
nvmet: add error log support for rdma backend

This patch adds support for maintaining the error log page for the rdma
transport; it mainly focuses on the NVME_INVALID_FIELD errors.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support for fabrics-cmd
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:44 +0000 (15:11 -0800)]
nvmet: add error log support for fabrics-cmd

This patch adds support for maintaining the error log page for the fabrics
prop get, prop set, and admin connect commands. It also updates
discovery.c, the set/get features handlers, and the parse functions to
support the error log page.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add error log support in the core
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:41 +0000 (15:11 -0800)]
nvmet: add error log support in the core

This patch adds the support to maintain error log page for the
nvmet-core.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: add interface to update error-log page
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:40 +0000 (15:11 -0800)]
nvmet: add interface to update error-log page

This patch adds an nvmet_req based interface to the nvmet-core so that
we can update the error log page. The error log page is updated in
the request completion path when the status is not NVME_SC_SUCCESS.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
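
The completion-path update described above can be pictured as a small ring of
error slots per controller plus a lifetime counter. The following is only a
sketch under assumptions: the structure layouts, the slot count, and all
`sketch_` names are hypothetical, and the locking that a real target needs is
reduced to a comment.

```c
#include <stdint.h>

#define SKETCH_ERROR_LOG_SLOTS 4 /* hypothetical; stands in for NVMET_ERROR_LOG_SLOTS */
#define SKETCH_SC_SUCCESS      0

struct sketch_err_slot {
	uint64_t error_count; /* value of the lifetime counter for this entry */
	uint64_t lba;         /* starting LBA, if applicable */
	uint16_t cmd_id;
	uint16_t status;
	uint16_t error_loc;   /* byte offset of the offending field */
};

struct sketch_ctrl {
	struct sketch_err_slot slots[SKETCH_ERROR_LOG_SLOTS];
	uint64_t err_counter; /* errors seen over the controller lifetime */
};

struct sketch_req {
	uint64_t error_slba;
	uint16_t cmd_id;
	uint16_t error_loc;
};

/* Called on request completion; records an entry only for failures. */
static void sketch_complete(struct sketch_ctrl *ctrl,
			    const struct sketch_req *req, uint16_t status)
{
	struct sketch_err_slot *slot;

	if (status == SKETCH_SC_SUCCESS)
		return;

	/* A real target would hold a per-controller lock around this update. */
	slot = &ctrl->slots[ctrl->err_counter % SKETCH_ERROR_LOG_SLOTS];
	ctrl->err_counter++;

	slot->error_count = ctrl->err_counter;
	slot->lba = req->error_slba;
	slot->cmd_id = req->cmd_id;
	slot->status = status;
	slot->error_loc = req->error_loc;
}
```

Once the ring is full the oldest entry is overwritten, while `err_counter`
keeps growing, which is the kind of value a smart log "number of error log
entries" field could report.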
2 years agonvmet: add error-log definitions
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:39 +0000 (15:11 -0800)]
nvmet: add error-log definitions

This patch adds the necessary fields to the target data structures to
support the error log page. For a target controller, we add a new error
log field to maintain the error log; at any given point we keep
NVMET_ERROR_LOG_SLOTS error entries for each controller. Because a
following patch also updates the error log page entry in the I/O
completion path, we introduce a spinlock to synchronize access to the
log.

For nvmet_req, we add a new field, error_loc, to hold the location of
the error within the command when a request fails, along with a field
for the starting LBA where applicable.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: add error log page slot definition
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:38 +0000 (15:11 -0800)]
nvme: add error log page slot definition

This patch adds the NVMe error slot definition from the spec.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: remove nvme_common command cdw10 array
Chaitanya Kulkarni [Wed, 12 Dec 2018 23:11:37 +0000 (15:11 -0800)]
nvme: remove nvme_common command cdw10 array

This is a preparation patch which removes the nvme common command cdw10
array and replaces it with individual fields. This is needed by the nvmet
error log page implementation, where it makes assigning the error log page
entry offset easier.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
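
To see why individual fields help with offset assignment, consider this
sketch of a 64-byte command layout. The struct is a hypothetical stand-in
(the `sketch_` names and the plain `dptr[2]` member are assumptions, not the
kernel definition); the point is only that `offsetof()` can name each command
dword directly once it is a field of its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-byte NVMe common command. */
struct sketch_common_command {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint32_t cdw2[2];
	uint64_t metadata;
	uint64_t dptr[2]; /* stand-in for the data pointer union */
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
};

/* With named fields, the error location is a plain offsetof(). */
static size_t sketch_error_loc_cdw11(void)
{
	return offsetof(struct sketch_common_command, cdw11);
}
```

With an array member, the same location would have to be written as the array
offset plus an index multiplied by the element size, which is easier to get
wrong at each call site.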
2 years agonvmet: remove unused variable
Sagi Grimberg [Thu, 13 Dec 2018 07:01:54 +0000 (23:01 -0800)]
nvmet: remove unused variable

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: provide fallback for discard alloc failure
Jens Axboe [Wed, 12 Dec 2018 16:18:11 +0000 (09:18 -0700)]
nvme: provide fallback for discard alloc failure

When boxes are run near (or to) OOM, we have a problem with the discard
page allocation in nvme. If we fail allocating the special page, we
return busy, and it'll get retried. But since ordering is honored for
dispatch requests, we can keep retrying this same IO and failing. Behind
that IO could be requests that want to free memory, but they never get
the chance.

Allocate a fixed discard page per controller for a safe fallback, and use
that if the initial allocation fails.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
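
The fallback idea in the discard commit above can be sketched in userspace as
follows. This is not the driver code: `malloc()` stands in for the kernel
allocation, a C11 `atomic_flag` stands in for whatever busy marker the driver
uses, the `simulate_oom` parameter exists only to exercise the fallback path,
and all `sketch_` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

struct sketch_discard_ctrl {
	unsigned char fallback_page[SKETCH_PAGE_SIZE]; /* preallocated per controller */
	atomic_flag fallback_busy;                     /* set while the page is in use */
};

/*
 * Try a normal allocation first. If that fails, claim the fixed fallback
 * page. NULL means "busy, retry later"; only one request at a time can
 * hold the fallback page, so forward progress is still guaranteed.
 */
static void *sketch_get_discard_buf(struct sketch_discard_ctrl *ctrl,
				    size_t len, bool simulate_oom)
{
	void *buf = simulate_oom ? NULL : malloc(len);

	if (buf)
		return buf;
	if (len > SKETCH_PAGE_SIZE)
		return NULL;
	if (atomic_flag_test_and_set(&ctrl->fallback_busy))
		return NULL;
	return ctrl->fallback_page;
}

/* Release either the heap buffer or the fallback page. */
static void sketch_put_discard_buf(struct sketch_discard_ctrl *ctrl, void *buf)
{
	if (buf == (void *)ctrl->fallback_page)
		atomic_flag_clear(&ctrl->fallback_busy);
	else
		free(buf);
}
```

The design choice worth noting is that the fallback page is released on
completion, so the request at the head of the queue eventually succeeds even
when every dynamic allocation fails.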
2 years agonvme: add __exit annotation
Chengguang Xu [Mon, 10 Dec 2018 23:24:34 +0000 (07:24 +0800)]
nvme: add __exit annotation

Add __exit annotation to cleanup helper which is only
called once in the module.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: add NVMe over TCP host driver
Sagi Grimberg [Tue, 4 Dec 2018 01:52:17 +0000 (17:52 -0800)]
nvme-tcp: add NVMe over TCP host driver

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which specifies how NVMe over Fabrics
capsules and data are encapsulated in nvme-tcp PDUs and exchanged on top of
a TCP byte stream. nvme-tcp header and data digests are supported as well.

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP, use the following command:

    nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
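
For readers unfamiliar with the PDU framing and digests mentioned above, here
is a small sketch. It assumes an 8-byte common header (type, flags, header
length, data offset, little-endian total length) and assumes the digests are
CRC32C. Both match my understanding of the NVMe/TCP transport specification
but are not taken from this commit, and every `sketch_` name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed common header carried at the start of every PDU. */
struct sketch_tcp_hdr {
	uint8_t  type;  /* PDU type */
	uint8_t  flags; /* e.g. "header digest present" */
	uint8_t  hlen;  /* header length in bytes */
	uint8_t  pdo;   /* offset of the data within the PDU */
	uint32_t plen;  /* total PDU length, little-endian on the wire */
};

/* Serialize the header into its 8-byte wire form. */
static void sketch_hdr_encode(const struct sketch_tcp_hdr *h, uint8_t out[8])
{
	out[0] = h->type;
	out[1] = h->flags;
	out[2] = h->hlen;
	out[3] = h->pdo;
	for (int i = 0; i < 4; i++)
		out[4 + i] = (uint8_t)(h->plen >> (8 * i));
}

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t sketch_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}
```

A sender that enables digests would compute `sketch_crc32c()` over the header
bytes (and separately over the data) and append the result; the receiver
recomputes it and compares. A production implementation would use a
table-driven or hardware-accelerated CRC rather than this bit-at-a-time loop.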
2 years agonvmet: allow configfs tcp trtype configuration
Sagi Grimberg [Tue, 4 Dec 2018 01:52:16 +0000 (17:52 -0800)]
nvmet: allow configfs tcp trtype configuration

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-tcp: add NVMe over TCP target driver
Sagi Grimberg [Tue, 4 Dec 2018 01:52:15 +0000 (17:52 -0800)]
nvmet-tcp: add NVMe over TCP target driver

This patch implements the TCP transport driver for the NVMe over Fabrics
target stack. This allows exporting NVMe over Fabrics functionality over
good old TCP/IP.

The driver implements TP 8000, which specifies how NVMe over Fabrics
capsules and data are encapsulated in nvme-tcp PDUs and exchanged on top of
a TCP byte stream. nvme-tcp header and data digests are supported as well.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Solganik Alexander <sashas@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: Add protocol header
Sagi Grimberg [Tue, 4 Dec 2018 01:52:14 +0000 (17:52 -0800)]
nvme-tcp: Add protocol header

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: allow user passing data digest
Sagi Grimberg [Tue, 4 Dec 2018 01:52:13 +0000 (17:52 -0800)]
nvme-fabrics: allow user passing data digest

Data digest is an nvme-tcp specific feature, but nothing prevents other
transports from reusing the concept, so do not tie it solely to the tcp
transport.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: allow user passing header digest
Sagi Grimberg [Tue, 4 Dec 2018 01:52:12 +0000 (17:52 -0800)]
nvme-fabrics: allow user passing header digest

Header digest is an nvme-tcp specific feature, but nothing prevents other
transports from reusing the concept, so do not tie it solely to the tcp
transport.

Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>