Jeff Layton [Mon, 6 Mar 2017 15:08:16 +0000 (10:08 -0500)]
ceph: take cap references in write_begin, release them in write_end
We've gotten a report of the BUG_ON in ceph_set_page_dirty being hit in
the splice codepath:
BUG_ON(ci->i_wr_ref == 0);
Here's the stack trace we got from crash:
crash> bt
PID: 7872 TASK:
ffff881fbe3c5e20 CPU: 9 COMMAND: "pool"
#0 [
ffff881f7f23f8d0] machine_kexec at
ffffffff81059c7b
#1 [
ffff881f7f23f930] __crash_kexec at
ffffffff811052d2
#2 [
ffff881f7f23fa00] crash_kexec at
ffffffff811053c0
#3 [
ffff881f7f23fa18] oops_end at
ffffffff8168f088
#4 [
ffff881f7f23fa40] die at
ffffffff8102e93b
#5 [
ffff881f7f23fa70] do_trap at
ffffffff8168e740
#6 [
ffff881f7f23fac0] do_invalid_op at
ffffffff8102b144
#7 [
ffff881f7f23fb70] invalid_op at
ffffffff8169805e
[exception RIP: ceph_set_page_dirty+448]
RIP:
ffffffffa1391b80 RSP:
ffff881f7f23fc20 RFLAGS:
00010246
RAX:
0000000000140014 RBX:
ffffea00f86a76c0 RCX:
0000000000000011
RDX:
0000000000000000 RSI:
0000000000000011 RDI:
ffff881f94f7ef50
RBP:
ffff881f7f23fc80 R8:
0000000000000011 R9:
ffffea00f86a76c0
R10:
0000000000000000 R11:
0000000000000000 R12:
ffff881f94f7f280
R13:
ffff881f94f7f3d0 R14:
ffff881f94f7ef50 R15:
0000000000000000
ORIG_RAX:
ffffffffffffffff CS: 0010 SS: 0018
#8 [
ffff881f7f23fc48] ceph_update_writeable_page at
ffffffffa139362e [ceph]
#9 [
ffff881f7f23fce8] pagecache_write_end at
ffffffff81180386
#10 [
ffff881f7f23fd38] pipe_to_file at
ffffffff8122d4d6
#11 [
ffff881f7f23fdb0] splice_from_pipe_feed at
ffffffff8122cd1e
#12 [
ffff881f7f23fde8] generic_file_splice_write_actor at
ffffffff8122d70f
#13 [
ffff881f7f23fe20] splice_write_to_file at
ffffffff8122d589
#14 [
ffff881f7f23fea8] generic_file_splice_write at
ffffffff8122ee14
#15 [
ffff881f7f23fec0] do_splice_from at
ffffffff8122dafd
#16 [
ffff881f7f23ff00] sys_splice at
ffffffff8122f3da
#17 [
ffff881f7f23ff80] system_call_fastpath at
ffffffff816967c9
RIP:
00007fe5e01ec043 RSP:
00007fe5d11a07e0 RFLAGS:
00010202
RAX:
0000000000000113 RBX:
ffffffff816967c9 RCX:
000000000248cc08
RDX:
0000000000000010 RSI:
0000000000000000 RDI:
0000000000000011
RBP:
00007fe5d11a0898 R8:
0000000000000011 R9:
0000000000000004
R10:
00007fe5d11a0898 R11:
0000000000000293 R12:
00007fe5d11a08a8
R13:
0000000000000011 R14:
0000000000000000 R15:
0000000000000010
ORIG_RAX:
0000000000000113 CS: 0033 SS: 002b
If this BUG_ON is correct, then it seems like we should be holding cap
refs across the page dirtying. We could just take them in write_end,
but it's better to take them in write_begin and then hold them until
the page dirtying is done. That way if we get an error, it'll happen
before the page is dirtied, not after.
This patch has write_begin take cap references before returning, and
then release those caps in write_end. The opaque fsdata pointer is used
to pass the "got" value across functions.
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=
1428973
Reported-by: Vikhyat Umrao <vumrao@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Elena Reshetova [Fri, 3 Mar 2017 09:15:07 +0000 (11:15 +0200)]
ceph: convert ceph_cap_snap.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Elena Reshetova [Fri, 3 Mar 2017 09:15:06 +0000 (11:15 +0200)]
ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Fri, 3 Mar 2017 17:16:07 +0000 (18:16 +0100)]
libceph: supported_features module parameter
Add a readonly, exported to sysfs module parameter so that userspace
can generate meaningful error messages. It's a bit funky, but there is
no other libceph-specific place.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Fri, 3 Mar 2017 17:16:07 +0000 (18:16 +0100)]
libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case. Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Thu, 2 Mar 2017 18:56:57 +0000 (19:56 +0100)]
rbd: supported_features bus attribute
... so that userspace can generate meaningful error messages and spell
out unsupported features that need to be disabled.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Sun, 12 Feb 2017 16:11:07 +0000 (17:11 +0100)]
libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request. 0 (default)
means no limit.
osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds. With ack vs commit behaviour
gone, abort_request() is really simple.
This is based on a patch from Artur Molchanov <artur.molchanov@synesis.ru>.
Tested-by: Artur Molchanov <artur.molchanov@synesis.ru>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ilya Dryomov [Tue, 28 Feb 2017 17:53:53 +0000 (18:53 +0100)]
libceph: fix crush_decode() for older maps
Older (shorter) CRUSH maps too need to be finalized.
Fixes: 66a0e2d579db ("crush: remove mutable part of CRUSH map")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:40 +0000 (06:18 -0500)]
ceph: when seeing write errors on an inode, switch to sync writes
Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.
In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.
Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:39 +0000 (06:18 -0500)]
Revert "ceph: SetPageError() for writeback pages if writepages fails"
This reverts commit
b109eec6f4332bd517e2f41e207037c4b9065094.
If I'm filling up a filesystem with this sort of command:
$ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync
...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.
I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().
__filemap_fdatawait_range however, does this:
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
...and that -EIO ends up trumping the mapping's error if one exists.
When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:38 +0000 (06:18 -0500)]
ceph: handle epoch barriers in cap messages
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.
When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:37 +0000 (06:18 -0500)]
libceph: add an epoch_barrier field to struct ceph_osd_client
Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.
Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.
If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.
If we end up cancelling requests because of a new map showing a full OSD
or pool condition, then set the barrier higher than the highest replay
epoch of all the cancelled requests.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:36 +0000 (06:18 -0500)]
libceph: abort already submitted but abortable requests when map or pool goes full
When a Ceph volume hits capacity, a flag is set in the OSD map to
indicate that, and a new map is sprayed around the cluster. With cephfs
we want it to shut down any abortable requests that are in progress with
an -ENOSPC error as they'd just hang otherwise.
Add a new ceph_osdc_abort_on_full helper function to handle this. It
will first check whether there is an out-of-space condition in the
cluster. It will then walk the tree and abort any request that has
r_abort_on_full set with an ENOSPC error. Call this new function
directly whenever we get a new OSD map.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Mon, 27 Feb 2017 11:18:35 +0000 (06:18 -0500)]
libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.
If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.
Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.
A later patch will deal with requests that were submitted before the new
map showing the full condition came in.
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Wed, 25 Jan 2017 12:24:23 +0000 (07:24 -0500)]
[DO NOT MERGE] ceph: switch DIO code to use iov_iter_get_pages_alloc
xfstest generic/095 triggers soft lockups in kcephfs. It uses fio to
drive some I/O via vmsplice ane splice. Ceph then ends up trying to
access an ITER_BVEC type iov_iter as a ITER_IOVEC one. That causes it to
pick up a wrong offset and get stuck in an infinite loop while trying to
populate the page array. dio_get_pagev_size has a similar problem.
Now that iov_iter_get_pages_alloc doesn't stop after the first vector in
the array, we can just call it instead and dump the old code that tried
to do the same thing.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Jeff Layton [Thu, 26 Jan 2017 13:01:13 +0000 (08:01 -0500)]
[DO NOT MERGE] iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call
Currently, iov_iter_get_pages_alloc will only ever operate on the first
vector that iterate_all_kinds hands back. Most of the callers however
would like to have as long a set of pages as possible, to allow for
fewer, but larger I/Os.
When the previous vector ends on a page boundary and the current one
begins on one, we can continue to add more pages.
Change the function to first scan the iov_iter to see how long an
array of pages we could create from the current position up to the
maxsize passed in. Then, allocate an array that large and start
filling in that many pages.
The main impetus for this patch is to rip out a swath of code in ceph
that tries to do this but that doesn't handle ITER_BVEC correctly.
NFS also uses this function and this also allows the client to do larger
I/Os when userland passes down an array of page-aligned iovecs in an
O_DIRECT request. This should also make splice writes into an O_DIRECT
file on NFS use larger I/Os, since that's now done by passing down an
ITER_BVEC representing the data to be written.
I believe the other callers (lustre and 9p) may also benefit from this
change, but I don't have a great way to verify that.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ilya Dryomov [Sat, 20 Feb 2016 17:26:57 +0000 (18:26 +0100)]
[DO NOT MERGE] rbd: bump RBD_MAX_PARENT_CHAIN_LEN to 128
Bump RBD_MAX_PARENT_CHAIN_LEN from 16 to 128 to avoid fsx failures.
(The alternative is changing fsx to flatten unconditionally when the
limit of 16 is reached, which is ugly and not needed for librbd.)
Linus Torvalds [Fri, 3 Mar 2017 01:41:27 +0000 (17:41 -0800)]
Merge tag 'pm-turbostat-4.11-rc1' of git://git./linux/kernel/git/rafael/linux-pm
Pull turbostat utility updates from Rafael Wysocki:
"Power management turbostat utility updates.
These update turbostat significantly and in particular:
- default output is now verbose, --debug is no longer required to get
all counters. As a result, some options have been added to specify
exactly what output is wanted.
- added --quiet to skip system configuration output
- added --list, --show and --hide parameters
- added --cpu parameter
- enhanced Baytrail SoC support
- added Gemini Lake SoC support
- added sysfs C-state columns
Also the symbol definitions in arch/x86/include/asm/intel-family.h and
arch/x86/include/asm/msr-index.h are updated and the intel_idle and
intel_pstate drivers are modified to use the updated symbols.
Credits to Len Brown for all of these changes"
* tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (44 commits)
tools/power turbostat: version 17.02.24
tools/power turbostat: bugfix: --add u32 was printed as u64
tools/power turbostat: show error on exec
tools/power turbostat: dump p-state software config
tools/power turbostat: show package number, even without --debug
tools/power turbostat: support "--hide C1" etc.
tools/power turbostat: move --Package and --processor into the --cpu option
tools/power turbostat: turbostat.8 update
tools/power turbostat: update --list feature
tools/power turbostat: use wide columns to display large numbers
tools/power turbostat: Add --list option to show available header names
tools/power turbostat: fix zero IRQ count shown in one-shot command mode
tools/power turbostat: add --cpu parameter
tools/power turbostat: print sysfs C-state stats
tools/power turbostat: extend --add option to accept /sys path
tools/power turbostat: skip unused counters on BDX
tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits
tools/power turbostat: skip unused counters on SKX
tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7
tools/power turbostat: initial Gemini Lake SOC support
...
Linus Torvalds [Fri, 3 Mar 2017 01:38:30 +0000 (17:38 -0800)]
Merge tag 'acpi-extra-4.11-rc1' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"This fixes an apparent, but actually artificial, resource conflict
between the ACPI NVS memory region and the ACPI BERT (Boot Error
Record Table) address range (Huang Ying)"
* tag 'acpi-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: APEI: Fix BERT resources conflict with ACPI NVS area
Linus Torvalds [Fri, 3 Mar 2017 01:33:52 +0000 (17:33 -0800)]
Merge tag 'pm-extra-4.11-rc1' of git://git./linux/kernel/git/rafael/linux-pm
Pull more power management updates deom Rafael Wysocki:
"These fix two bugs introduced by recent power management updates (in
the cpuidle menu governor and intel_pstate) and a few other issues,
clean up things and remove unused code.
Specifics:
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that did
not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature added
recently that did not reinitialize P-state limits properly when
switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases of
system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet)"
* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / OPP: Documentation: Fix opp-microvolt in examples
intel_idle: stop exposing platform acronyms in sysfs
cpufreq: intel_pstate: Fix limits issue with operation mode switching
PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
PM / hibernate: Untangle power_down()
cpuidle: menu: Avoid taking spinlock for accessing QoS values
PM / QoS: Remove global notifiers
PM / runtime: Fix some typos
cpufreq: qoriq: clean up unused code
PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
PM / Domains: Power off masters immediately in the power off sequence
PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
PM / Domains: Move genpd_power_off() above genpd_power_on()
Linus Torvalds [Fri, 3 Mar 2017 00:03:00 +0000 (16:03 -0800)]
Merge branch 'for-linus-4.11' of git://git./linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
"Btrfs round two.
These are mostly a continuation of Dave Sterba's collection of
cleanups, but Filipe also has some bug fixes and performance
improvements"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits)
btrfs: add dummy callback for readpage_io_failed and drop checks
btrfs: drop checks for mandatory extent_io_ops callbacks
btrfs: document existence of extent_io ops callbacks
btrfs: let writepage_end_io_hook return void
btrfs: do proper error handling in btrfs_insert_xattr_item
btrfs: handle allocation error in update_dev_stat_item
btrfs: remove BUG_ON from __tree_mod_log_insert
btrfs: derive maximum output size in the compression implementation
btrfs: use predefined limits for calculating maximum number of pages for compression
btrfs: export compression buffer limits in a header
btrfs: merge nr_pages input and output parameter in compress_pages
btrfs: merge length input and output parameter in compress_pages
btrfs: constify name of subvolume in creation helpers
btrfs: constify buffers used by compression helpers
btrfs: constify input buffer of btrfs_csum_data
btrfs: constify device path passed to relevant helpers
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
btrfs: Make btrfs_add_nondir take btrfs_inode
btrfs: Make btrfs_add_link take btrfs_inode
...
Rafael J. Wysocki [Thu, 2 Mar 2017 23:45:22 +0000 (00:45 +0100)]
Merge branch 'acpi-apei'
* acpi-apei:
ACPI: APEI: Fix BERT resources conflict with ACPI NVS area
Rafael J. Wysocki [Thu, 2 Mar 2017 23:43:11 +0000 (00:43 +0100)]
Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'pm-sleep'
* pm-cpuidle:
intel_idle: stop exposing platform acronyms in sysfs
cpuidle: menu: Avoid taking spinlock for accessing QoS values
* pm-cpufreq:
cpufreq: intel_pstate: Fix limits issue with operation mode switching
cpufreq: qoriq: clean up unused code
* pm-sleep:
PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
PM / hibernate: Untangle power_down()
Rafael J. Wysocki [Thu, 2 Mar 2017 23:34:44 +0000 (00:34 +0100)]
Merge branches 'pm-core', 'pm-qos', 'pm-domains' and 'pm-opp'
* pm-core:
PM / runtime: Fix some typos
* pm-qos:
PM / QoS: Remove global notifiers
* pm-domains:
PM / Domains: Power off masters immediately in the power off sequence
PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
PM / Domains: Move genpd_power_off() above genpd_power_on()
* pm-opp:
PM / OPP: Documentation: Fix opp-microvolt in examples
PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
Linus Torvalds [Thu, 2 Mar 2017 23:20:00 +0000 (15:20 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs pile two from Al Viro:
- orangefs fix
- series of fs/namei.c cleanups from me
- VFS stuff coming from overlayfs tree
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
orangefs: Use RCU for destroy_inode
vfs: use helper for calling f_op->fsync()
mm: use helper for calling f_op->mmap()
vfs: use helpers for calling f_op->{read,write}_iter()
vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
vfs: extract common parts of {compat_,}do_readv_writev()
vfs: wrap write f_ops with file_{start,end}_write()
vfs: deny copy_file_range() for non regular files
vfs: deny fallocate() on directory
vfs: create vfs helper vfs_tmpfile()
namei.c: split unlazy_walk()
namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
lookup_fast(): clean up the logics around the fallback to non-rcu mode
namei: fold unlazy_link() into its sole caller
Linus Torvalds [Thu, 2 Mar 2017 23:16:38 +0000 (15:16 -0800)]
Merge branch 'work.sendmsg' of git://git./linux/kernel/git/viro/vfs
Pull vfs sendmsg updates from Al Viro:
"More sendmsg work.
This is a fairly separate isolated stuff (there's a continuation
around lustre, but that one was too late to soak in -next), thus the
separate pull request"
* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ncpfs: switch to sock_sendmsg()
ncpfs: don't mess with manually advancing iovec on send
ncpfs: sendmsg does *not* bugger iovec these days
ceph_tcp_sendpage(): use ITER_BVEC sendmsg
afs_send_pages(): use ITER_BVEC
rds: remove dead code
ceph: switch to sock_recvmsg()
usbip_recv(): switch to sock_recvmsg()
iscsi_target: deal with short writes on the tx side
[nbd] pass iov_iter to nbd_xmit()
[nbd] switch sock_xmit() to sock_{send,recv}msg()
[drbd] use sock_sendmsg()
Linus Torvalds [Thu, 2 Mar 2017 22:52:05 +0000 (14:52 -0800)]
Merge branch 'for-next' of git://git./linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
"The highlights this round include:
- enable dual mode (initiator + target) qla2xxx operation. (Quinn +
Himanshu)
- add a framework for qla2xxx async fabric discovery. (Quinn +
Himanshu)
- enable iscsi PDU DDP completion offload in cxgbit/T6 NICs. (Varun)
- fix target-core handling of aborted failed commands. (Bart)
- fix a long standing target-core issue NULL pointer dereference with
active I/O LUN shutdown. (Rob Millner + Bryant + nab)"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (44 commits)
target: Add counters for ABORT_TASK success + failure
iscsi-target: Fix early login failure statistics misses
target: Fix NULL dereference during LUN lookup + active I/O shutdown
target: Delete tmr from list before processing
target: Fix handling of aborted failed commands
uapi: fix linux/target_core_user.h userspace compilation errors
target: export protocol identifier
qla2xxx: Fix a warning reported by the "smatch" static checker
target/iscsi: Fix unsolicited data seq_end_offset calculation
target/cxgbit: add T6 iSCSI DDP completion feature
target/cxgbit: Enable DDP for T6 only if data sequence and pdu are in order
target/cxgbit: Use T6 specific macros to get ETH/IP hdr len
target/cxgbit: use cxgb4_tp_smt_idx() to get smt idx
target/iscsi: split iscsit_check_dataout_hdr()
target: Remove command flag CMD_T_DEV_ACTIVE
target: Remove command flag CMD_T_BUSY
target: Move session check from target_put_sess_cmd() into target_release_cmd_kref()
target: Inline transport_cmd_check_stop()
target: Remove an overly chatty debug message
target: Stop execution if CMD_T_STOP has been set
...
Linus Torvalds [Thu, 2 Mar 2017 22:36:00 +0000 (14:36 -0800)]
Merge tag 'dm-4.11-fixes' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a dm-raid stable@ fix for possible corruption when triggering a raid
reshape via lvm2; and an additional small patch ontop to bump version
of the dm-raid target outside of the stable@ fix
- a dm-raid fix for a 'dm-4.11-changes' regression introduced by a
commit that was meant to only cleanup confusing branching.
* tag 'dm-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm raid: bump the target version
dm raid: fix data corruption on reshape request
dm raid: fix raid "check" regression due to improper cleanup in raid_message()
Linus Torvalds [Thu, 2 Mar 2017 21:53:13 +0000 (13:53 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull vhost updates from Michael Tsirkin:
"virtio, vhost: optimizations, fixes
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-console: avoid DMA from stack
vhost: introduce O(1) vq metadata cache
virtio_scsi: use virtio IRQ affinity
virtio_blk: use virtio IRQ affinity
blk-mq: provide a default queue mapping for virtio device
virtio: provide a method to get the IRQ affinity mask for a virtqueue
virtio: allow drivers to request IRQ affinity when creating VQs
virtio_pci: simplify MSI-X setup
virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
virtio_pci: use shared interrupts for virtqueues
virtio_pci: remove struct virtio_pci_vq_info
vhost: try avoiding avail index access when getting descriptor
virtio_mmio: expose header to userspace
Linus Torvalds [Thu, 2 Mar 2017 21:22:18 +0000 (13:22 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/linux-security
Pull security subsystem fixes from James Morris:
"Two fixes for the security subsystem:
- keys: split both rcu_dereference_key() and user_key_payload() into
versions which can be called with or without holding the key
semaphore.
- SELinux: fix Android init(8) breakage due to new cgroup security
labeling support when using older policy"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
selinux: wrap cgroup seclabel support with its own policy capability
KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()
Linus Torvalds [Thu, 2 Mar 2017 20:45:46 +0000 (12:45 -0800)]
Merge tag 'watchdog-for-linus-v4.11-2' of git://git./linux/kernel/git/groeck/linux-staging
Pull more watchdog updates from Guenter Roeck:
- fix fallout from enabling COMPILE_TEST
- fix gcc-4.3 build of kempld watchdog driver
- use hrtimer in softdog
* tag 'watchdog-for-linus-v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
watchdog: retu: restore MFD dependency
watchdog: db8500: add back prmcu dependency
watchdog: kempld: fix gcc-4.3 build
watchdog: softdog: fire watchdog even if softirqs do not get to run
watchdog: kempld: revert to full dependency
watchdog: bcm2835: add CONFIG_OF dependency
watchdog: sp805: add back AMBA dependency
watchdog: menf21bmc: add I2C dependency
watchdog: geode: restore hard CS5535_MFGPT dependency
watchdog: wm831x watchdog really needs mfd
Linus Torvalds [Thu, 2 Mar 2017 20:17:22 +0000 (12:17 -0800)]
give up on gcc ilog2() constant optimizations
gcc-7 has an "optimization" pass that completely screws up, and
generates the code expansion for the (impossible) case of calling
ilog2() with a zero constant, even when the code gcc compiles does not
actually have a zero constant.
And we try to generate a compile-time error for anybody doing ilog2() on
a constant where that doesn't make sense (be it zero or negative). So
now gcc7 will fail the build due to our sanity checking, because it
created that constant-zero case that didn't actually exist in the source
code.
There's a whole long discussion on the kernel mailing about how to work
around this gcc bug. The gcc people themselevs have discussed their
"feature" in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72785
but it's all water under the bridge, because while it looked at one
point like it would be solved by the time gcc7 was released, that was
not to be.
So now we have to deal with this compiler braindamage.
And the only simple approach seems to be to just delete the code that
tries to warn about bad uses of ilog2().
So now "ilog2()" will just return 0 not just for the value 1, but for
any non-positive value too.
It's not like I can recall anybody having ever actually tried to use
this function on any invalid value, but maybe the sanity check just
meant that such code never made it out in public.
Reported-by: Laura Abbott <labbott@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>,
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al Viro [Thu, 2 Mar 2017 11:41:22 +0000 (06:41 -0500)]
Merge remote-tracking branch 'ovl/for-viro' into for-linus
Overlayfs-related series from Miklos and Amir
Al Viro [Thu, 2 Mar 2017 11:41:12 +0000 (06:41 -0500)]
Merge branch 'work.namei' into for-linus
Peter Zijlstra [Fri, 24 Feb 2017 15:43:36 +0000 (16:43 +0100)]
orangefs: Use RCU for destroy_inode
freeing of inodes must be RCU-delayed on all filesystems
Cc: stable@vger.kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Thu, 2 Mar 2017 01:04:50 +0000 (17:04 -0800)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull objtool relocation fixes from Ingo Molnar:
"Two fixes related to the module loading regression introduced by the
recent objtool changes"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool, modules: Discard objtool annotation sections for modules
objtool, compiler.h: Fix __unreachable section relocation size
Linus Torvalds [Thu, 2 Mar 2017 00:10:30 +0000 (16:10 -0800)]
Merge tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Stable bugfixes:
- NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
- xprtrdma: Fix Read chunk padding
- xprtrdma: Per-connection pad optimization
- xprtrdma: Disable pad optimization by default
- xprtrdma: Reduce required number of send SGEs
- nlm: Ensure callback code also checks that the files match
- pNFS/flexfiles: If the layout is invalid, it must be updated before
retrying
- NFSv4: Fix reboot recovery in copy offload
- Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION
replies to OP_SEQUENCE"
- NFSv4: fix getacl head length estimation
- NFSv4: fix getacl ERANGE for sum ACL buffer sizes
Features:
- Add and use dprintk_cont macros
- Various cleanups to NFS v4.x to reduce code duplication and
complexity
- Remove unused cr_magic related code
- Improvements to sunrpc "read from buffer" code
- Clean up sunrpc timeout code and allow changing TCP timeout
parameters
- Remove duplicate mw_list management code in xprtrdma
- Add generic functions for encoding and decoding xdr streams
Bugfixes:
- Clean up nfs_show_mountd_netid
- Make layoutreturn_ops static and use NULL instead of 0 to fix
sparse warnings
- Properly handle -ERESTARTSYS in nfs_rename()
- Check if register_shrinker() failed during rpcauth_init()
- Properly clean up procfs/pipefs entries
- Various NFS over RDMA related fixes
- Silence unititialized variable warning in sunrpc"
* tag 'nfs-for-4.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (64 commits)
NFSv4: fix getacl ERANGE for some ACL buffer sizes
NFSv4: fix getacl head length estimation
Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE"
NFSv4: Fix reboot recovery in copy offload
pNFS/flexfiles: If the layout is invalid, it must be updated before retrying
NFSv4: Clean up owner/group attribute decode
SUNRPC: Add a helper function xdr_stream_decode_string_dup()
NFSv4: Remove bogus "struct nfs_client" argument from decode_ace()
NFSv4: Fix the underestimation of delegation XDR space reservation
NFSv4: Replace callback string decode function with a generic
NFSv4: Replace the open coded decode_opaque_inline() with the new generic
NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics
SUNRPC: Add generic helpers for xdr_stream encode/decode
sunrpc: silence uninitialized variable warning
nlm: Ensure callback code also checks that the files match
sunrpc: Allow xprt->ops->timer method to sleep
xprtrdma: Refactor management of mw_list field
xprtrdma: Handle stale connection rejection
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Shrink send SGEs array
...
Linus Torvalds [Wed, 1 Mar 2017 23:55:04 +0000 (15:55 -0800)]
Merge tag 'for-f2fs-4.11' of git://git./linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This round introduces several interesting features such as on-disk NAT
bitmaps, IO alignment, and a discard thread. And it includes a couple
of major bug fixes as below.
Enhancements:
- introduce on-disk bitmaps to avoid scanning NAT blocks when getting
free nids
- support IO alignment to prepare open-channel SSD integration in
future
- introduce a discard thread to avoid long latency during checkpoint
and fstrim
- use SSR for warm node and enable inline_xattr by default
- introduce in-memory bitmaps to check FS consistency for debugging
- improve write_begin by avoiding needless read IO
Bug fixes:
- fix broken zone_reset behavior for SMR drive
- fix wrong victim selection policy during GC
- fix missing behavior when preparing discard commands
- fix bugs in atomic write support and fiemap
- workaround to handle multiple f2fs_add_link calls having same name
... and it includes a bunch of clean-up patches as well"
* tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (97 commits)
f2fs: avoid to flush nat journal entries
f2fs: avoid to issue redundant discard commands
f2fs: fix a plint compile warning
f2fs: add f2fs_drop_inode tracepoint
f2fs: Fix zoned block device support
f2fs: remove redundant set_page_dirty()
f2fs: fix to enlarge size of write_io_dummy mempool
f2fs: fix memory leak of write_io_dummy mempool during umount
f2fs: fix to update F2FS_{CP_}WB_DATA count correctly
f2fs: use MAX_FREE_NIDS for the free nids target
f2fs: introduce free nid bitmap
f2fs: new helper cur_cp_crc() getting crc in f2fs_checkpoint
f2fs: update the comment of default nr_pages to skipping
f2fs: drop the duplicate pval in f2fs_getxattr
f2fs: Don't update the xattr data that same as the exist
f2fs: kill __is_extent_same
f2fs: avoid bggc->fggc when enough free segments are avaliable after cp
f2fs: select target segment with closer temperature in SSR mode
f2fs: show simple call stack in fault injection message
f2fs: no need lock_op in f2fs_write_inline_data
...
Omar Sandoval [Wed, 1 Feb 2017 08:02:27 +0000 (00:02 -0800)]
virtio-console: avoid DMA from stack
put_chars() stuffs the buffer it gets into an sg, but that buffer may be
on the stack. This breaks with CONFIG_VMAP_STACK=y (for me, it
manifested as printks getting turned into NUL bytes).
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Jason Wang [Tue, 28 Feb 2017 09:56:02 +0000 (17:56 +0800)]
vhost: introduce O(1) vq metadata cache
When device IOTLB is enabled, all address translations were stored in
interval tree. O(lgN) searching time could be slow for virtqueue
metadata (avail, used and descriptors) since they were accessed much
often than other addresses. So this patch introduces an O(1) array
which points to the interval tree nodes that store the translations of
vq metadata. Those array were update during vq IOTLB prefetching and
were reset during each invalidation and tlb update. Each time we want
to access vq metadata, this small array were queried before interval
tree. This would be sufficient for static mappings but not dynamic
mappings, we could do optimizations on top.
Test were done with l2fwd in guest (2M hugepage):
noiommu | before | after
tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)
We can almost reach the same performance as noiommu mode.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Stephen Smalley [Tue, 28 Feb 2017 15:35:56 +0000 (10:35 -0500)]
selinux: wrap cgroup seclabel support with its own policy capability
commit
1ea0ce40690dff38935538e8dab7b12683ded0d3 ("selinux: allow
changing labels for cgroupfs") broke the Android init program,
which looks up security contexts whenever creating directories
and attempts to assign them via setfscreatecon().
When creating subdirectories in cgroup mounts, this would previously
be ignored since cgroup did not support userspace setting of security
contexts. However, after the commit, SELinux would attempt to honor
the requested context on cgroup directories and fail due to permission
denial. Avoid breaking existing userspace/policy by wrapping this change
with a conditional on a new cgroup_seclabel policy capability. This
preserves existing behavior until/unless a new policy explicitly enables
this capability.
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
David Howells [Wed, 1 Mar 2017 15:11:23 +0000 (15:11 +0000)]
KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:
(1) As a wrapper to rcu_dereference() - when only the RCU read lock used
to protect the key.
(2) As a wrapper to rcu_dereference_protected() - when the key semaphor is
used to protect the key and the may be being modified.
Fix this by splitting both of the key wrappers to produce:
(1) RCU accessors for keys when caller has the key semaphore locked:
dereference_key_locked()
user_key_payload_locked()
(2) RCU accessors for keys when caller holds the RCU read lock:
dereference_key_rcu()
user_key_payload_rcu()
This should fix following warning in the NFS idmapper
===============================
[ INFO: suspicious RCU usage. ]
4.10.0 #1 Tainted: G W
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by mount.nfs/5987:
#0: (rcu_read_lock){......}, at: [<
d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
stack backtrace:
CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1
Call Trace:
dump_stack+0xe8/0x154 (unreliable)
lockdep_rcu_suspicious+0x140/0x190
nfs_idmap_get_key+0x380/0x420 [nfsv4]
nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
call_decode+0x29c/0x910 [sunrpc]
__rpc_execute+0x140/0x8f0 [sunrpc]
rpc_run_task+0x170/0x200 [sunrpc]
nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
_nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
nfs4_lookup_root+0xe0/0x350 [nfsv4]
nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
nfs4_find_root_sec+0xc4/0x100 [nfsv4]
nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
nfs4_get_rootfh+0x6c/0x190 [nfsv4]
nfs4_server_common_setup+0xc4/0x260 [nfsv4]
nfs4_create_server+0x278/0x3c0 [nfsv4]
nfs4_remote_mount+0x50/0xb0 [nfsv4]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
nfs_do_root_mount+0xb0/0x140 [nfsv4]
nfs4_try_mount+0x60/0x100 [nfsv4]
nfs_fs_mount+0x5ec/0xda0 [nfs]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
do_mount+0x254/0xf70
SyS_mount+0x94/0x100
system_call+0x38/0xe0
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Rafael J. Wysocki [Wed, 1 Mar 2017 22:34:38 +0000 (23:34 +0100)]
Merge branch 'turbostat' of git://git./linux/kernel/git/lenb/linux
Pull changes related to turbostat for v4.11 from Len Brown.
* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (44 commits)
tools/power turbostat: version 17.02.24
tools/power turbostat: bugfix: --add u32 was printed as u64
tools/power turbostat: show error on exec
tools/power turbostat: dump p-state software config
tools/power turbostat: show package number, even without --debug
tools/power turbostat: support "--hide C1" etc.
tools/power turbostat: move --Package and --processor into the --cpu option
tools/power turbostat: turbostat.8 update
tools/power turbostat: update --list feature
tools/power turbostat: use wide columns to display large numbers
tools/power turbostat: Add --list option to show available header names
tools/power turbostat: fix zero IRQ count shown in one-shot command mode
tools/power turbostat: add --cpu parameter
tools/power turbostat: print sysfs C-state stats
tools/power turbostat: extend --add option to accept /sys path
tools/power turbostat: skip unused counters on BDX
tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits
tools/power turbostat: skip unused counters on SKX
tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7
tools/power turbostat: initial Gemini Lake SOC support
...
Viresh Kumar [Wed, 1 Mar 2017 11:09:55 +0000 (16:39 +0530)]
PM / OPP: Documentation: Fix opp-microvolt in examples
The triplet present in "opp-microvolt" property should be in the order
<target min max>, while all the examples have it in the order
<min target max>.
Fix it.
Luckily all of the users of "opp-microvolt" property have applied brain
instead of copying the examples from documentation and none of the
actual dts files have it wrong.
Reported-by: Rob Herring <robh@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Josh Poimboeuf [Wed, 1 Mar 2017 18:04:44 +0000 (12:04 -0600)]
objtool, modules: Discard objtool annotation sections for modules
The '__unreachable' and '__func_stack_frame_non_standard' sections are
only used at compile time. They're discarded for vmlinux but they
should also be discarded for modules.
Since this is a recurring pattern, prefix the section names with
".discard.". It's a nice convention and vmlinux.lds.h already discards
such sections.
Also remove the 'a' (allocatable) flag from the __unreachable section
since it doesn't make sense for a discarded section.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d1091c7fa3d5 ("objtool: Improve detection of BUG() and other dead ends")
Link: http://lkml.kernel.org/r/20170301180444.lhd53c5tibc4ns77@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Wed, 1 Mar 2017 18:32:30 +0000 (10:32 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The main fix here addresses a kernel panic triggered on Qualcomm
QDF2400 due to incorrect register usage in an erratum workaround
introduced during the merge window.
Summary:
- Fix kernel panic on specific Qualcomm platform due to broken
erratum workaround
- Revert contiguous bit support due to TLB conflict aborts in
simulation
- Don't treat all CPU ID register fields as 4-bit quantities"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/cpufeature: check correct field width when updating sys_val
Revert "arm64: mm: set the contiguous bit for kernel mappings where appropriate"
arm64: Avoid clobbering mm in erratum workaround on QDF2400
Linus Torvalds [Wed, 1 Mar 2017 18:10:16 +0000 (10:10 -0800)]
Merge tag 'powerpc-4.11-2' of git://git./linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
"Highlights include:
- an update of the disassembly code used by xmon to the latest
versions in binutils. We've received permission from all the
authors of the relevant binutils changes to relicense their changes
to the relevant files from GPLv3 to GPLv2, for inclusion in Linux.
Thanks to Peter Bergner for doing the leg work to get permission
from everyone.
- addition of the "architected" Power9 CPU table entry, allowing us
to boot in Power9 architected mode under a hypervisor.
- updates to the Power9 PMU code.
- implementation of clear_bit_unlock_is_negative_byte() to optimise
unlock_page().
- Freescale updates from Scott: "Highlights include 8xx breakpoints
and perf, t1042rdb display support, and board updates."
Thanks to:
Al Viro, Andrew Donnellan, Aneesh Kumar K.V, Balbir Singh, Douglas
Miller, Frédéric Weisbecker, Gavin Shan, Madhavan Srinivasan,
Michael Roth, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Peter
Bergner, Paul E. McKenney, Rashmica Gupta, Russell Currey, Sahil
Mehta, Stewart Smith"
* tag 'powerpc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
powerpc: Remove leftover cputime_to_nsecs call causing build error
powerpc/mm/hash: Always clear UPRT and Host Radix bits when setting up CPU
powerpc/optprobes: Fix TOC handling in optprobes trampoline
powerpc/pseries: Advertise Hot Plug Event support to firmware
cxl: fix nested locking hang during EEH hotplug
powerpc/xmon: Dump memory in CPU endian format
powerpc/pseries: Revert 'Auto-online hotplugged memory'
powerpc/powernv: Make PCI non-optional
powerpc/64: Implement clear_bit_unlock_is_negative_byte()
powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable()
powerpc/kernel: Remove error message in pcibios_setup_phb_resources()
powerpc/mm: Fix typo in set_pte_at()
pci/hotplug/pnv-php: Disable MSI and PCI device properly
pci/hotplug/pnv-php: Disable surprise hotplug capability on conflicts
pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot()
powerpc: Add POWER9 architected mode to cputable
powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags()
powerpc/perf: Avoid FAB_*_MATCH checks for power9
powerpc/perf: Add restrictions to PMC5 in power9 DD1
powerpc/perf: Use Instruction Counter value
...
Benjamin Tissoires [Wed, 1 Mar 2017 08:57:00 +0000 (09:57 +0100)]
Input: rmi4 - f30: detect INPUT_PROP_BUTTONPAD from the button count
INPUT_PROP_BUTTONPAD is currently only set through the platform data.
The RMI4 header doc says that this property is there to force the
buttonpad property, so we also need to detect it by looking at
the exported buttons count.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 1 Mar 2017 17:59:21 +0000 (09:59 -0800)]
Merge tag 'sound-fix-4.11-rc1' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A few last-minute fixes for rc1:
- ALSA core timer and sequencer fixes for bugs spotted by syzkaller
- a couple of trivial HD-audio fixups
- additional PCI / codec IDs for Intel Geminilake
- fixes for CT-XFi DMA mask bugs"
* tag 'sound-fix-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: seq: Fix link corruption by event error handling
ALSA: hda - Add subwoofer support for Dell Inspiron 17 7000 Gaming
ALSA: ctxfi: Fallback DMA mask to 32bit
ALSA: timer: Reject user params with too small ticks
ALSA: hda: Add Geminilake HDMI codec ID
ALSA: hda - Fix micmute hotkey problem for a lenovo AIO machine
ALSA: hda - Add Geminilake PCI ID
Linus Torvalds [Wed, 1 Mar 2017 17:54:32 +0000 (09:54 -0800)]
Merge branch 'next' of git://git./linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
- add thermal driver for R-Car Gen3 thermal sensors.
- add thermal driver for ZTE' zx2967 family thermal sensors.
- convert thermal ID allocation from IDR to IDA.
- fix a possible NULL dereference in imx thermal driver.
- fix a ti-soc-thermal driver dependency issue so that critical thermal
control is still available when CPU_THERMAL is not defined.
- update binding information for QorIQ thermal driver.
- a couple of cleanups in thermal core, intel_powerclamp, exynos,
dra752-thermal, mtk-thermal driver.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
powerpc/mpc85xx: Update TMU device tree node for T1023/T1024
powerpc/mpc85xx: Update TMU device tree node for T1040/T1042
dt-bindings: Update QorIQ TMU thermal bindings
thermal: mtk_thermal: Staticise a number of data variables
thermal: arm: dra752: Remove all TSHUT related definitions
thermal: arm: dra752: Remove TSHUT configuration
thermal: ti-soc-thermal: Remove CPU_THERMAL Dependency from TI_THERMAL
thermal: imx: Fix possible NULL dereference.
thermal: exynos: Remove parsing unused samsung,tmu_cal_mode property
thermal: zx2967: add thermal driver for ZTE's zx2967 family
thermal: use cpumask_var_t for on-stack cpu masks
dt: bindings: add documentation for zx2967 family thermal sensor
thermal/intel_powerclamp: Remove set-but-not-used variables
thermal: rcar_gen3_thermal: Add R-Car Gen3 thermal driver
thermal: rcar_gen3_thermal: Document the R-Car Gen3
thermal: convert devfreq_cooling to use an IDA
thermal: convert cpu_cooling to use an IDA
thermal: convert clock cooling to use an IDA
thermal core: convert ID allocation to IDA
Linus Torvalds [Wed, 1 Mar 2017 17:46:02 +0000 (09:46 -0800)]
Merge tag 'pwm/for-4.11-rc1' of git://git./linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"This set contains mostly fixes to existing drivers as well as cleanup
of code that's not been in active use for a while"
* tag 'pwm/for-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: (27 commits)
acpi: lpss: call pwm_add_table() for BSW PWM device
pwm: Try to load modules during pwm_get()
pwm: Don't hold pwm_lookup_lock longer than necessary
pwm: Make the PWM_POLARITY flag in DTB optional
pwm: Print error messages with pr_err() instead of pr_debug()
pwm: imx: Add polarity inversion support to i.MX's PWMv2
pwm: imx: doc: Update imx-pwm.txt documentation entry
pwm: imx: Remove redundant i.MX PWMv2 code
pwm: imx: Provide atomic PWM support for i.MX PWMv2
pwm: imx: Move PWMv2 wait for fifo slot code to a separate function
pwm: imx: Move PWMv2 software reset code to a separate function
pwm: imx: Rewrite v1 code to facilitate switch to atomic PWM
pwm: imx: Add separate set of PWM ops for v1 and v2
pwm: imx: Remove ipg clock and enable per clock when required
pwm: lpss: Add Intel Gemini Lake PCI ID
pwm: lpss: Do not export board infos for different PWM types
pwm: lpss: Avoid reconfiguring while UPDATE bit is still enabled
pwm: lpss: Switch to new atomic API
pwm: lpss: Allow duty cycle to be 0
pwm: lpss: Avoid potential overflow of base_unit
...
Linus Torvalds [Wed, 1 Mar 2017 17:42:42 +0000 (09:42 -0800)]
Merge tag 'drm-ast-2500-for-v4.11' of git://people.freedesktop.org/~airlied/linux
Pull drm AST2500 support from Dave Airlie:
"This is a set of changes to enable the AST2500 BMC hardware, and also
fix some bugs interacting with the older AST hardware.
Some of the bug fixes are cc'ed to stable"
* tag 'drm-ast-2500-for-v4.11' of git://people.freedesktop.org/~airlied/linux:
drm/ast: Call open_key before enable_mmio in POST code
drm/ast: Fix test for VGA enabled
drm/ast: POST code for the new AST2500
drm/ast: Rename ast_init_dram_2300 to ast_post_chip_2300
drm/ast: Factor mmc_test code in POST code
drm/ast: Fixed vram size incorrect issue on POWER
drm/ast: Base support for AST2500
drm/ast: Fix calculation of MCLK
drm/ast: Remove spurious include
drm/ast: const'ify mode setting tables
drm/ast: Handle configuration without P2A bridge
drm/ast: Fix AST2400 POST failure without BMC FW or VBIOS
Linus Torvalds [Wed, 1 Mar 2017 16:50:33 +0000 (08:50 -0800)]
Merge tag 'drm-fixes-for-v4.11-rc1' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Misc fixes for v4.11-rc1.
This is a selection of fixes for recent bugs, the vmwgfx one is
important to avoid a regression, and compat ioctl one is pretty urgent
for stable. Otherwise nothing too much.
I've got a separate pull req for some AST hw IBM need to enable"
* tag 'drm-fixes-for-v4.11-rc1' of git://people.freedesktop.org/~airlied/linux:
dma-buf: add support for compat ioctl
drm/vmwgfx: Work around drm removal of control nodes
drm/rockchip: cdn-dp: Fix error handling
drm/rockchip: add extcon dependency for DP
drm: zte: fix static checker warning on variable 'fmt'
Arnd Bergmann [Wed, 1 Mar 2017 09:15:31 +0000 (10:15 +0100)]
watchdog: retu: restore MFD dependency
The retu watchdog calls into the respective mfd driver, but fails to
link if that is diabled:
drivers/watchdog/built-in.o: In function `retu_wdt_set_timeout':
ziirave_wdt.c:(.text+0x8c88): undefined reference to `retu_write'
ziirave_wdt.c:(.text+0x8c88): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `retu_write'
drivers/watchdog/built-in.o: In function `retu_wdt_start':
ziirave_wdt.c:(.text+0x8cc8): undefined reference to `retu_write'
ziirave_wdt.c:(.text+0x8cc8): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `retu_write'
This restores the dependency as it was before
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Wed, 1 Mar 2017 09:15:30 +0000 (10:15 +0100)]
watchdog: db8500: add back prmcu dependency
When the db8500 watchdog is enabled without the PRCMU, we get a lot of
warnings about duplicate or missing helper functions:
In file included from drivers/watchdog/ux500_wdt.c:21:0:
include/linux/mfd/dbx500-prcmu.h:422:19: error: redefinition of 'prcmu_abb_read'
static inline int prcmu_abb_read(u8 slave, u8 reg, u8 *value, u8 size)
This restores the dependency as it was.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Wed, 1 Mar 2017 09:15:29 +0000 (10:15 +0100)]
watchdog: kempld: fix gcc-4.3 build
gcc-4.3 can't decide whether the constant value in
kempld_prescaler[PRESCALER_21] is built-time constant or
not, and gets confused by the logic in do_div():
drivers/watchdog/kempld_wdt.o: In function `kempld_wdt_set_stage_timeout':
kempld_wdt.c:(.text.kempld_wdt_set_stage_timeout+0x130): undefined reference to `__aeabi_uldivmod'
This adds a call to ACCESS_ONCE() to force it to not consider
it to be constant, and leaves the more efficient normal case
in place for modern compilers, using an #ifdef to annotate
why we do this hack.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Niklas Cassel [Mon, 27 Feb 2017 12:49:09 +0000 (13:49 +0100)]
watchdog: softdog: fire watchdog even if softirqs do not get to run
Checking for timer expiration is done from the softirq TIMER_SOFTIRQ.
Since commit
4cd13c21b207 ("softirq: Let ksoftirqd do its job"),
pending softirqs are no longer always handled immediately, instead,
if there are pending softirqs, and ksoftirqd is in state TASK_RUNNING,
the handling of the softirqs are deferred, and are instead supposed
to be handled by ksoftirqd, when ksoftirqd gets scheduled.
If a user space process with a real-time policy starts to misbehave
by never relinquishing the CPU while ksoftirqd is in state TASK_RUNNING,
what will happen is that all softirqs will get deferred, while ksoftirqd,
which is supposed to handle the deferred softirqs, will never get to run.
To make sure that the watchdog is able to fire even when we do not get
to run softirqs, replace the timers with hrtimers.
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:23 +0000 (22:01 +0100)]
watchdog: kempld: revert to full dependency
The kempld watchdog driver requires the respective MFD driver:
drivers/watchdog/built-in.o: In function `kempld_wdt_probe':
kempld_wdt.c:(.text+0x5c78): undefined reference to `kempld_get_mutex'
kempld_wdt.c:(.text+0x5c84): undefined reference to `kempld_read8'
kempld_wdt.c:(.text+0x5c8e): undefined reference to `kempld_release_mutex'
kempld_wdt.c:(.text+0x5d1c): undefined reference to `kempld_read8'
kempld_wdt.c:(.text+0x5d2c): undefined reference to `kempld_write8'
This adds the Kconfig dependency back.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:22 +0000 (22:01 +0100)]
watchdog: bcm2835: add CONFIG_OF dependency
Without CONFIG_OF, the driver fails to link:
drivers/watchdog/built-in.o: In function `bcm2835_power_off':
bcm2835_wdt.c:(.text+0x1946): undefined reference to `of_find_device_by_node'
This adds a new dependency, to allow the COMPILE_TEST check to succeed.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:21 +0000 (22:01 +0100)]
watchdog: sp805: add back AMBA dependency
The driver fails to link if ARM_AMBA is disabled:
drivers/watchdog/sp805_wdt.o: In function `sp805_wdt_driver_init':
sp805_wdt.c:(.init.text+0x4): undefined reference to `amba_driver_register'
It seems that the COMPILE_TEST was added in the wrong place, as there
is no architecture dependency, but a bus dependency. This moves
the dependency accordingly.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:20 +0000 (22:01 +0100)]
watchdog: menf21bmc: add I2C dependency
This driver fails to link when CONFIG_I2C is disabled or a loadable module while
the watchdog is built-in:
drivers/watchdog/built-in.o: In function `menf21bmc_wdt_shutdown':
menf21bmc_wdt.c:(.text+0x9b44): undefined reference to `i2c_smbus_write_word_data'
menf21bmc_wdt.c:(.text+0x9b44): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `i2c_smbus_write_word_data'
This adds a Kconfig dependency for it, to enforce a valid configuration.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:19 +0000 (22:01 +0100)]
watchdog: geode: restore hard CS5535_MFGPT dependency
Wtihout CONFIG_CS5535_MFGPT, the driver does not link right:
drivers/watchdog/built-in.o: In function `geodewdt_probe':
geodewdt.c:(.init.text+0xca3): undefined reference to `cs5535_mfgpt_alloc_timer'
geodewdt.c:(.init.text+0xcd4): undefined reference to `cs5535_mfgpt_write'
geodewdt.c:(.init.text+0xcef): undefined reference to `cs5535_mfgpt_toggle_event'
This adds back the dependency on this base driver.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Arnd Bergmann [Tue, 28 Feb 2017 21:01:18 +0000 (22:01 +0100)]
watchdog: wm831x watchdog really needs mfd
The wm831x watchdog driver can now be built without the wm831x mfd
driver, which results in a link error:
(.text+0x1a95c): undefined reference to `wm831x_set_bits'
(.text+0x1a95c): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `wm831x_set_bits'
(.text+0x1a968): undefined reference to `wm831x_reg_lock'
(.text+0x1a968): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `wm831x_reg_lock'
(.text+0x1a9dc): undefined reference to `wm831x_reg_unlock'
(.text+0x1a9dc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `wm831x_reg_unlock'
This adds back the dependency that was removed. We can still build test
this driver on all architectures by enabling the MFD driver for it first.
Fixes: da2a68b3eb47 ("watchdog: Enable COMPILE_TEST where possible")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Josh Poimboeuf [Wed, 1 Mar 2017 06:05:04 +0000 (00:05 -0600)]
objtool, compiler.h: Fix __unreachable section relocation size
Linus reported the following commit broke module loading on his laptop:
d1091c7fa3d5 ("objtool: Improve detection of BUG() and other dead ends")
It showed errors like the following:
module: overflow in relocation type 10 val
ffffffffc02afc81
module: 'nvme' likely not compiled with -mcmodel=kernel
The problem is that the __unreachable section addresses are stored using
the '.long' asm directive, which isn't big enough for .text section
kernel addresses. Use relative addresses instead:
".long %c0b - .\t\n"
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d1091c7fa3d5 ("objtool: Improve detection of BUG() and other dead ends")
Link: http://lkml.kernel.org/r/20170301060504.oltm3iws6fmubnom@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Len Brown [Sat, 21 Jan 2017 07:33:23 +0000 (02:33 -0500)]
tools/power turbostat: version 17.02.24
The turbostat before this last set of changes is obsolete.
This new version can do a lot more, but it also has
some different defaults, that might catch some off-guard.
So it seems a good time to give a new version number.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 23 Feb 2017 23:10:27 +0000 (18:10 -0500)]
tools/power turbostat: bugfix: --add u32 was printed as u64
When the "u32" keyword is used with --add, it means that
the output should be truncated to 32-bits. This was not
happening and all 64-bits were printed.
Also, when no column name was used for an added MSR,
The default column name was in deximal, eg. MSR16.
Users report that they tend to use hex MSR numbers,
so print them in hex. To always fit into the columns,
use the syntax M0x10. Note that the user can always
supply any column header that they want.
eg --add msr0x10,MY_TSC
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 23 Feb 2017 22:00:51 +0000 (17:00 -0500)]
tools/power turbostat: show error on exec
When turbostat is run in one-shot command mode,
the parent takes the 'before' counter snapshot,
fork/exec/wait for the child to exit,
takes the 'after' counter snapshot,
and prints the results.
however, if the child fails to exec the command,
it immediately returns, without indicating that
anythign was wrong.
Add an error message showing that exec failed:
sudo turbostat sleeeep 4
...
turbostat: exec sleeeep: No such file or directory
...
Note that the parent will still print out the statistics,
because it can't tell the difference between the failed
exec and a command that is purposefully returning
the same status. Unfortunately, this may obscure the
error message. However, if the --out parameter is used,
the error message is evident on stderr.
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 22 Feb 2017 05:11:12 +0000 (00:11 -0500)]
tools/power turbostat: dump p-state software config
cpu1: cpufreq driver: acpi-cpufreq
cpu1: cpufreq governor: ondemand
cpufreq boost: 1
or
cpu0: cpufreq driver: intel_pstate
cpu0: cpufreq governor: powersave
cpufreq intel_pstate no_turbo: 0
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 22 Feb 2017 04:43:41 +0000 (23:43 -0500)]
tools/power turbostat: show package number, even without --debug
On multi-package systems, the "Package" column was being displayed
only if --debug was used. Show it always.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 22 Feb 2017 04:21:13 +0000 (23:21 -0500)]
tools/power turbostat: support "--hide C1" etc.
Originally, the only way to hide the sysfs C-state statistics columns
was with "--hide sysfs". This was because we process "--hide" before
we probe for those columns.
hack --hide to remember deferred hide requests, and apply
them when sysfs is probed.
"--hide sysfs" is still available as short-hand to refer to
the entire group of counters.
The down-side of this change is that we no longer error check for
bogus --hide column names. But the user will quickly figure that
out if a column they mean to hide is still there...
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 22 Feb 2017 03:33:42 +0000 (22:33 -0500)]
tools/power turbostat: move --Package and --processor into the --cpu option
--Package is now "--cpu package",
which will display just the 1st CPU in each package
--processor is not "--cpu core"
which will display just the 1st CPU in each core
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 15 Feb 2017 05:30:22 +0000 (00:30 -0500)]
tools/power turbostat: turbostat.8 update
update examples to show recently updated features.
In particular
--add
--show
--hide
--cpu
--list
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 17 Feb 2017 04:07:51 +0000 (23:07 -0500)]
tools/power turbostat: update --list feature
Make it possible to take the entire un-edited output
from `turbostat --list` and feed it to "turbostat --show"
or "turbostat --hide".
To do this, the leading comma was removed
(no mater what columns are active)
and also they dynamic C-state "C1, C2, C3" etc are replaced
by the string "sysfs", which refers to them as a group.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 16 Feb 2017 02:45:40 +0000 (21:45 -0500)]
tools/power turbostat: use wide columns to display large numbers
When a counter overlfows 7 columns, it shifts the remaining
columns to the right, so they no longer line up under
their column header.
Update turbostat to dectect when it is handling large
numbers, and switch to wider columns where, necessary.
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 15 Feb 2017 22:15:11 +0000 (17:15 -0500)]
tools/power turbostat: Add --list option to show available header names
It is handy to know the list of column header names,
so that they can be used with --add and --skip
The new --list option shows them:
sudo ./turbostat --list --hide sysfs
,Core,CPU,Avg_MHz,Busy%,Bzy_MHz,TSC_MHz,IRQ,SMI,CPU%c1,CPU%c3,CPU%c6,CPU%c7,CoreTmp,PkgTmp,GFX%rc6,GFXMHz,PkgWatt,CorWatt,GFXWatt
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 15 Feb 2017 03:07:52 +0000 (22:07 -0500)]
tools/power turbostat: fix zero IRQ count shown in one-shot command mode
The IRQ column has been working for periodic mode,
but not in one-shot command mode, it shows only 0.
until now.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 11 Feb 2017 04:54:15 +0000 (23:54 -0500)]
tools/power turbostat: add --cpu parameter
With the --cpu parameter, turbostat prints only lines
for the specified set of CPUs:
sudo ./turbostat --quiet --show Core,CPU --cpu 0,1,3..5,6-7
Core CPU
- -
0 0
0 4
1 1
1 5
2 6
3 3
3 7
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 9 Feb 2017 23:25:22 +0000 (18:25 -0500)]
tools/power turbostat: print sysfs C-state stats
When turbostat shows % of time in a CPU idle power state,
it has always been showing information from underlying
hardware residency counters.
While this reflects what the hardware is doing, and is thus
useful for understanding the hardware,
it doesn't directly tell us what Linux requested --
which is useful for tuning Linux itself.
Here we add columns to turbostat to show the
Linux cpuidle sub-system statistics:
/sys/devices/system/cpu/cpu*/cpuidle/state*/*
The first group of columns are the "usage", which is the
number of times software requested that C-state in the
measurement interval. eg C1 below.
The second group of columns are the "time", which is the percentage
of the measurement interval time that software has requested
the specified C-state. eg C1% below.
These software counters can be compared to the underlying
hardware residency counters (eg CPU%c1 CPU%c3 CPU%c6 CPU%c7)
to compare what sofware requested to what the hardware delivered.
These sysfs attributes are discovered when turbostat starts,
rather than being "built in". So the --show and --hide
parameters do not know about these dynamic column names.
However "--show sysfs" and "--hide sysfs" act on the
entire group of columns:
turbostat --show sysfs
...
cpu4: POLL: CPUIDLE CORE POLL IDLE
cpu4: C1: MWAIT 0x00
cpu4: C1E: MWAIT 0x01
cpu4: C3: MWAIT 0x10
cpu4: C6: MWAIT 0x20
cpu4: C7s: MWAIT 0x32
...
C1 C1E C3 C6 C7s C1% C1E% C3% C6% C7s%
3 6 5 1 188 0.00 0.02 0.00 0.00 99.93
0 6 5 0 58 0.00 0.16 0.02 0.00 99.70
0 0 0 0 9 0.00 0.00 0.00 0.00 99.96
0 0 0 1 24 0.00 0.00 0.00 0.02 99.93
0 0 0 0 9 0.00 0.00 0.00 0.00 99.97
0 0 0 0 32 0.00 0.00 0.00 0.00 99.96
0 0 0 0 7 0.00 0.00 0.00 0.00 99.98
2 0 0 0 36 0.00 0.00 0.00 0.00 99.97
1 0 0 0 13 0.00 0.00 0.00 0.00 99.98
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 8 Feb 2017 07:41:51 +0000 (02:41 -0500)]
tools/power turbostat: extend --add option to accept /sys path
Previously, the --add option could specify only an MSR.
Here is is extended so an arbitrary /sys attribute,
as specified by an absolute file path name.
sudo ./turbostat --add /sys/devices/system/cpu/cpu0/cpuidle/state5/usage
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 10 Feb 2017 06:56:47 +0000 (01:56 -0500)]
tools/power turbostat: skip unused counters on BDX
Skip these two counters on BDX, as they are always zero:
cc7, pc7
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Wed, 1 Feb 2017 04:07:49 +0000 (23:07 -0500)]
tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits
Newer processors do not hard-code the the number of cpus in each bin
to {1, 2, 3, 4, 5, 6, 7, 8} Rather, they can specify any number
of CPUS in each of the 8 bins:
eg.
...
37 * 100.0 = 3600.0 MHz max turbo 4 active cores
38 * 100.0 = 3700.0 MHz max turbo 3 active cores
39 * 100.0 = 3800.0 MHz max turbo 2 active cores
39 * 100.0 = 3900.0 MHz max turbo 1 active cores
could now look something like this:
...
37 * 100.0 = 3600.0 MHz max turbo 16 active cores
38 * 100.0 = 3700.0 MHz max turbo 8 active cores
39 * 100.0 = 3800.0 MHz max turbo 4 active cores
39 * 100.0 = 3900.0 MHz max turbo 2 active cores
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 27 Jan 2017 07:36:41 +0000 (02:36 -0500)]
tools/power turbostat: skip unused counters on SKX
Skip these four counters on SKX, as they are always zero:
cc3, pc3
cc7, pc7
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 27 Jan 2017 07:13:27 +0000 (02:13 -0500)]
tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7
The CC1 column in tubostat can be computed by subtracting
the core c-state residency countes from the total Cx residency.
CC1 = (Idle_time_as_measured by MPERF) - (all core C-states with
residency counters)
However, as the underlying counter reads are not atomic,
error can be noticed in this calculations, especially
when the numbers are small.
Denverton has a hardware CC1 residency counter
to improve the accuracy of the cc1 statistic -- use it.
At the same time, Denverton has no concept of CC3, PC3, CC7, PC7,
so skip collecting and printing those columns.
Finally, a note of clarification.
Turbostat prints the standard PC2 residency counter,
but on Denverton hardware, that actually means PC1E.
Turbostat prints the standard PC6 residency counter,
but on Denverton hardware, that actually means PC2.
At this point, we document that differnce in this commit message,
rather than adding a quirk to the software.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 27 Jan 2017 06:45:35 +0000 (01:45 -0500)]
tools/power turbostat: initial Gemini Lake SOC support
Gemini Lake is similar to Apollo Lake (Broxton/Goldmont)
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 17 Feb 2017 04:48:50 +0000 (23:48 -0500)]
x86: intel-family.h: Add GEMINI_LAKE SOC
Cc: x86@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 27 Jan 2017 05:50:45 +0000 (00:50 -0500)]
tools/power turbostat: bug fixes to --add, --show/--hide features
Fix a bug with --add, where the title of the column
is un-initialized if not specified by the user.
The initial implementation of --show and --hide
neglected to handle the pc8/pc9/pc10 counters.
Fix a bug where "--show Core" only worked with --debug
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 10 Feb 2017 05:29:51 +0000 (00:29 -0500)]
tools/power turbostat: use tsc_tweak everwhere it is needed
The CPU ticks at a rate in the "bus clock" domain.
eg. 100 MHz * bus_ratio.
On newer processors, the TSC has been moved out of this BCLK
domain and into a separate crystal-clock domain.
While the TSC ticks "close to" the base frequency, those that look
closely at the numbers will notice small errors in calculations that
mix units of TSC clocks and bus clocks.
"tsc_tweak" was introduced to address the most visible
mixing -- the %Busy and the the Busy_MHz calculations.
(A simplification as since removed TSC from the BusyMHz calculation)
Here we apply the tsc_tweak to everyplace where BCLK
and TSC units are mixed. The results is that
on a system which is 100% idle, the sum of the C-states
are now much more likely to be closer to 100%.
Reported-by: Travis Downs <travis.downs@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 21 Jan 2017 07:26:00 +0000 (02:26 -0500)]
tools/power turbostat: print system config, unless --quiet
Some users want turbostat to tell them everything, by default.
Some users want turbostat to be quiet, by default.
I find that I'm in the 1st camp, and so I've never liked
needing to type the --debug parameter to decode the system
configuration.
So here we change the default and print the system configuration,
by default. (The --debug option is now un-documented, though
it does still exist for debugging turbostat internals)
When you do not want to see the system configuration
header, use the new "--quiet" option.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 21 Jan 2017 06:59:12 +0000 (01:59 -0500)]
tools/power turbostat: show all columns, independent of --debug
Some time ago, turbostat overflowed 80 columns.
So on the assumption that a "casual" user would always
want topology and frequency columns, we hid the rest
of the columns and the system configuration decoding
behind the --debug option.
Not everybody liked that change -- including me.
I use --debug 99% of the time...
Well, now we have "-o file" to put turbostat output into a file,
so unless you are watching real-time in a small window,
column count is less frequently a factor.
And more recently, we got the "--hide columnA,columnB" option
to specify columns to skip.
So now we "un-hide" the rest of the columns from behind --debug,
and show them all, by default.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 21 Jan 2017 06:26:16 +0000 (01:26 -0500)]
tools/power turbostat: decode MSR_MISC_FEATURE_CONTROL
useful for observing if the BIOS disabled prefetch
Not architectural, but docuemented as present on NHM, SNB
and is present on others.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 21 Jan 2017 06:15:09 +0000 (01:15 -0500)]
x86 msr_index.h: Define MSR_MISC_FEATURE_CONTROL
This non-architectural MSR has disable bits
for various prefetchers on modern processors.
While these bits are generally touched only by the BIOS,
say, via BIOS SETUP, it is useful to dump them
when examining options that can alter performance.
Cc: x86@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 21 Jan 2017 05:50:08 +0000 (00:50 -0500)]
tools/power turbostat: decode CPUID(6).TURBO
show the CPUID feature for turbo to clarify the case
when it may not be shown in MISC_ENABLE
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu4: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT TURBO)
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 13 Jan 2017 04:49:18 +0000 (23:49 -0500)]
tools/power turbostat: dump Atom P-states correctly
Turbostat dumps MSR_TURBO_RATIO_LIMIT on Core Architecture.
But Atom Architecture uses MSR_ATOM_CORE_RATIOS and
MSR_ATOM_CORE_TURBO_RATIOS.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sat, 25 Feb 2017 21:55:17 +0000 (16:55 -0500)]
intel_pstate: use MSR_ATOM_RATIOS definitions from msr-index.h
Originally, these MSRs were locally defined in this driver.
Now the definitions are in msr-index.h -- use them.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 13 Jan 2017 04:22:28 +0000 (23:22 -0500)]
x86 msr-index.h: Define Atom specific core ratio MSR locations
These MSRs are currently used by the intel_pstate driver,
using a local definition.
Cc: x86@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 12 Jan 2017 04:17:24 +0000 (23:17 -0500)]
tools/power turbostat: further decode MSR_IA32_MISC_ENABLE
Decode MISC_ENABLE.NO_TURBO,
also use the #defines in msr-index.h for decoding this register
cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT TURBO)
Although it is not architectural, decode also
MSR_IA32_MISC_ENABLE.prefetch-disable (bit-9).
documented to be present on: Core, P4, Intel-Xeon
reserved on: Atom, Silvermont, Nehalem, SNB, PHI ec.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Thu, 12 Jan 2017 03:12:25 +0000 (22:12 -0500)]
tools/power turbostat: add precision to --debug frequency output
Add a digit of precision to the --debug output for frequency range.
This is useful when BCLK is not an integer.
old:
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
new:
6 * 83.3 = 499.8 MHz max efficiency frequency
26 * 83.3 = 2165.8 MHz base frequency
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Fri, 10 Feb 2017 05:27:20 +0000 (00:27 -0500)]
tools/power turbostat: Baytrail c-state support
The Baytrail SOC, with its Silvermont core, has some unique properties:
1. a hardware CC1 residency counter
2. a module-c6 residency counter
3. a package-c6 counter at traditional package-c7 counter address.
The SOC does not support c3, pc3, c7 or pc7 counters.
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sun, 8 Jan 2017 04:26:22 +0000 (23:26 -0500)]
x86: msr-index.h: Remove unused MSR_NHM_SNB_PKG_CST_CFG_CTL
The two users, intel_idle driver and turbostat utility
are using the new name, MSR_PKG_CST_CONFIG_CONTROL
Cc: x86@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Sun, 8 Jan 2017 04:24:57 +0000 (23:24 -0500)]
tools/power turbostat: use new name for MSR_PKG_CST_CONFIG_CONTROL
Previously called MSR_NHM_SNB_PKG_CST_CFG_CTL
Signed-off-by: Len Brown <len.brown@intel.com>