sfrench/cifs-2.6.git
5 years agonet: bridge: convert mtu_set_by_user to a bit
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:06 +0000 (17:01 +0300)]
net: bridge: convert mtu_set_by_user to a bit

Convert the last remaining bool option to a bit thus reducing the overall
net_bridge size further by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: convert neigh_suppress_enabled option to a bit
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:05 +0000 (17:01 +0300)]
net: bridge: convert neigh_suppress_enabled option to a bit

Convert the neigh_suppress_enabled option to a bit.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: convert mcast options to bits
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:04 +0000 (17:01 +0300)]
net: bridge: convert mcast options to bits

This patch converts the rest of the mcast options to bits. It also packs
the mcast options a little better by moving multicast_mld_version to an
existing hole, reducing the net_bridge size by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: convert and rename mcast disabled
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:03 +0000 (17:01 +0300)]
net: bridge: convert and rename mcast disabled

Convert mcast disabled to an option bit and while doing so convert the
logic to check if multicast is enabled instead. That is make the logic
follow the option value - if it's set then mcast is enabled and vice versa.
This avoids a few confusing places where we inverted the value that's being
set to follow the mcast_disabled logic.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: convert group_addr_set option to a bit
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:02 +0000 (17:01 +0300)]
net: bridge: convert group_addr_set option to a bit

Convert group_addr_set internal bridge opt to a bit.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: convert nf call options to bits
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:01 +0000 (17:01 +0300)]
net: bridge: convert nf call options to bits

No functional change, convert of nf_call_[ip|ip6|arp]tables to bits.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: add bitfield for options and convert vlan opts
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:01:00 +0000 (17:01 +0300)]
net: bridge: add bitfield for options and convert vlan opts

Bridge options have usually been added as separate fields all over the
net_bridge struct taking up space and ending up in different cache lines.
Let's move them to a single bitfield to save up space and speedup lookups.
This patch adds a simple API for option modifying and retrieving using
bitops and converts the first user of the API - the bridge vlan options
(vlan_enabled and vlan_stats_enabled).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: make struct opening bracket consistent
Nikolay Aleksandrov [Wed, 26 Sep 2018 14:00:59 +0000 (17:00 +0300)]
net: bridge: make struct opening bracket consistent

Currently we have a mix of opening brackets on new lines and on the same
line, let's move them all on the same line.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 's390-net-next'
David S. Miller [Wed, 26 Sep 2018 16:56:08 +0000 (09:56 -0700)]
Merge branch 's390-net-next'

Julian Wiedmann says:

====================
s390/net: updates 2018-09-26

please apply one more series of cleanups and small improvements for qeth
to net-next. Note that one patch needs to touch both af_iucv and qeth, in
order to untangle their receive paths.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: remove duplicated carrier state tracking
Julian Wiedmann [Wed, 26 Sep 2018 16:29:16 +0000 (18:29 +0200)]
s390/qeth: remove duplicated carrier state tracking

The netdevice is always available, apply any carrier state changes to it
without caching them.
On a STARTLAN event (ie. carrier-up), defer updating the state to
qeth_core_hardsetup_card() in the subsequent recovery action.

Also remove the carrier-state checks from the xmit routines. Stopping
transmission on carrier-down is the responsibility of upper-level code
(eg see dev_direct_xmit()).

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: clean up drop conditions for received cmds
Julian Wiedmann [Wed, 26 Sep 2018 16:29:15 +0000 (18:29 +0200)]
s390/qeth: clean up drop conditions for received cmds

If qeth_check_ipa_data() consumed an event, there's no point in
processing it further. So drop it early, and make the surrounding code
a tiny bit more readable.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: re-indent qeth_check_ipa_data()
Julian Wiedmann [Wed, 26 Sep 2018 16:29:14 +0000 (18:29 +0200)]
s390/qeth: re-indent qeth_check_ipa_data()

Pull one level of checking up into qeth_send_control_data_cb(), and
clean up an else-after-return. No functional change.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: consume local address events
Julian Wiedmann [Wed, 26 Sep 2018 16:29:13 +0000 (18:29 +0200)]
s390/qeth: consume local address events

We have no code that is waiting for these events, so just drop them when
they arrive.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: remove various redundant code
Julian Wiedmann [Wed, 26 Sep 2018 16:29:12 +0000 (18:29 +0200)]
s390/qeth: remove various redundant code

1. tracing iob->rc makes no sense when it hasn't been modified by the
   callback,
2. the qeth_dbf_list is declared with LIST_HEAD, which also initializes
   the list,
3. the ccwgroup core only calls the thaw/restore callbacks if the gdev
   is online, so we don't have to check for it again.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: remove CARD_FROM_CDEV helper
Julian Wiedmann [Wed, 26 Sep 2018 16:29:11 +0000 (18:29 +0200)]
s390/qeth: remove CARD_FROM_CDEV helper

The cdev-to-card translation walks through two layers of drvdata,
with no locking or refcounting (where eg. the ccwgroup core only
accesses a cdev's drvdata while holding the ccwlock).

This might be safe for now, but any careless usage of the helper has the
potential for subtle races and use-after-free's. Luckily there's only
one occurrence where we _really_ need it (in qeth_irq()), for any other
user we can just pass through an appropriate card pointer.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: pass card pointer in iob callback
Julian Wiedmann [Wed, 26 Sep 2018 16:29:10 +0000 (18:29 +0200)]
s390/qeth: pass card pointer in iob callback

This allows us to remove the CARD_FROM_CDEV calls in the iob callbacks.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: re-use qeth_notify_skbs()
Julian Wiedmann [Wed, 26 Sep 2018 16:29:09 +0000 (18:29 +0200)]
s390/qeth: re-use qeth_notify_skbs()

When not using the CQ, this allows us avoid the second skb queue walk
in qeth_release_skbs().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: remove additional skb refcount
Julian Wiedmann [Wed, 26 Sep 2018 16:29:08 +0000 (18:29 +0200)]
s390/qeth: remove additional skb refcount

This was presumably left over from back when qeth recursed into
dev_queue_xmit().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: replace open-coded skb_queue_walk()
Julian Wiedmann [Wed, 26 Sep 2018 16:29:07 +0000 (18:29 +0200)]
s390/qeth: replace open-coded skb_queue_walk()

To match the use of __skb_queue_purge(), also make the skb's enqueue in
qeth_fill_buffer() lockless.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/af_iucv: locate IUCV header via skb_network_header()
Julian Wiedmann [Wed, 26 Sep 2018 16:29:06 +0000 (18:29 +0200)]
net/af_iucv: locate IUCV header via skb_network_header()

This patch attempts to untangle the TX and RX code in qeth from
af_iucv's respective HiperTransport path:
On the TX side, pointing skb_network_header() at the IUCV header
means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset
to access the header.
On the RX side, qeth pulls the (fake) L2 header off the skb like any
normal ethernet driver would. This makes working with the IUCV header
in af_iucv easier, since we no longer have to assume a fixed skb layout.

While at it, replace the open-coded length checks in af_iucv's RX path
with pskb_may_pull().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: on gdev release, reset drvdata
Julian Wiedmann [Wed, 26 Sep 2018 16:29:05 +0000 (18:29 +0200)]
s390/qeth: on gdev release, reset drvdata

qeth_core_probe_device() sets the gdev's drvdata, but doesn't reset it
on a subsequent error. Move the (re-)setting around a bit, so that it
happens symmetrically on allocating/freeing the qeth_card struct.

This is no actual problem, as the ccwgroup core will discard the gdev
on a probe error. But from qeth's perspective the gdev is an external
resource, so it's best to manage it cleanly.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: fix discipline unload after setup error
Julian Wiedmann [Wed, 26 Sep 2018 16:29:04 +0000 (18:29 +0200)]
s390/qeth: fix discipline unload after setup error

Device initialization code usually first loads a subdriver
(via qeth_core_load_discipline()), and then runs its setup() callback.
If this fails, it rolls back the load via qeth_core_free_discipline().

qeth_core_free_discipline() expects the options.layer attribute to be
initialized, but on error in setup() that's currently not the case.
Resulting in misbalanced symbol_put() calls.

Fix this by setting options.layer when loading the subdriver.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: use DEFINE_MUTEX for qeth_mod_mutex
Julian Wiedmann [Wed, 26 Sep 2018 16:29:03 +0000 (18:29 +0200)]
s390/qeth: use DEFINE_MUTEX for qeth_mod_mutex

Consolidate declaration and initialization of a static variable.
While at it reduce its scope in qeth_core_load_discipline(), and simplify
the return logic accordingly.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agos390/qeth: convert layer attribute to enum
Julian Wiedmann [Wed, 26 Sep 2018 16:29:02 +0000 (18:29 +0200)]
s390/qeth: convert layer attribute to enum

While the raw values are fixed due to their use in a sysfs attribute,
we can still use the proper QETH_DISCIPLINE_* enum within the driver.

Also move the initialization into qeth_set_initial_options(), along with
all other user-configurable fields.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: Fix build.
David S. Miller [Wed, 26 Sep 2018 05:41:31 +0000 (22:41 -0700)]
net: phy: marvell: Fix build.

Local variable 'autoneg' doesn't even exist:

drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in this function); did you mean 'put_net'?
  if (phydev->autoneg != autoneg || changed) {
                         ^~~~~~~

Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset")
Reported-by:Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER
Roopa Prabhu [Tue, 25 Sep 2018 21:39:14 +0000 (14:39 -0700)]
bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER

Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Wed, 26 Sep 2018 03:29:38 +0000 (20:29 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-09-25

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow for RX stack hardening by implementing the kernel's flow
   dissector in BPF. Idea was originally presented at netconf 2017 [0].
   Quote from merge commit:

     [...] Because of the rigorous checks of the BPF verifier, this
     provides significant security guarantees. In particular, the BPF
     flow dissector cannot get inside of an infinite loop, as with
     CVE-2013-4348, because BPF programs are guaranteed to terminate.
     It cannot read outside of packet bounds, because all memory accesses
     are checked. Also, with BPF the administrator can decide which
     protocols to support, reducing potential attack surface. Rarely
     encountered protocols can be excluded from dissection and the
     program can be updated without kernel recompile or reboot if a
     bug is discovered. [...]

   Also, a sample flow dissector has been implemented in BPF as part
   of this work, from Petar and Willem.

   [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

2) Add support for bpftool to list currently active attachment
   points of BPF networking programs providing a quick overview
   similar to bpftool's perf subcommand, from Yonghong.

3) Fix a verifier pruning instability bug where a union member
   from the register state was not cleared properly leading to
   branches not being pruned despite them being valid candidates,
   from Alexei.

4) Various smaller fast-path optimizations in XDP's map redirect
   code, from Jesper.

5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
   in bpftool, from Roman.

6) Remove a duplicate check in libbpf that probes for function
   storage, from Taeung.

7) Fix an issue in test_progs by avoid checking for errno since
   on success its value should not be checked, from Mauricio.

8) Fix unused variable warning in bpf_getsockopt() helper when
   CONFIG_INET is not configured, from Anders.

9) Fix a compilation failure in the BPF sample code's use of
   bpf_flow_keys, from Prashant.

10) Minor cleanups in BPF code, from Yue and Zhong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: lantiq_gswip: Depend on HAS_IOMEM
Hauke Mehrtens [Tue, 25 Sep 2018 19:55:33 +0000 (21:55 +0200)]
net: dsa: lantiq_gswip: Depend on HAS_IOMEM

The driver uses devm_ioremap_resource() which is only available when
CONFIG_HAS_IOMEM is set, make the driver depend on this config option.
User mode Linux does not have CONFIG_HAS_IOMEM set and the driver was
failing on this architecture.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'net-phy-Eliminate-unnecessary-soft'
David S. Miller [Wed, 26 Sep 2018 03:26:45 +0000 (20:26 -0700)]
Merge branch 'net-phy-Eliminate-unnecessary-soft'

Florian Fainelli says:

====================
net: phy: Eliminate unnecessary soft

This patch series eliminates unnecessary software resets of the PHY.
This should hopefully not break anybody's hardware; but I would
appreciate testing to make sure this is is the case.

Sorry for this long email list, I wanted to make sure I reached out to
all people who made changes to the Marvell PHY driver.

Thank you!

Changes since RFT:

- added Tested-by tags from Wang, Dongsheng, Andrew, Chris and Clemens
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: Avoid unnecessary soft reset
Florian Fainelli [Tue, 25 Sep 2018 18:28:46 +0000 (11:28 -0700)]
net: phy: marvell: Avoid unnecessary soft reset

The BMCR.RESET bit on the Marvell PHYs has a special meaning in that
it commits the register writes into the HW for it to latch and be
configured appropriately. Doing software resets causes link drops, and
this is unnecessary disruption if nothing changed.

Determine from marvell_set_polarity()'s return code whether the register value
was changed and if it was, propagate that to the logic that hits the software
reset bit.

This avoids doing unnecessary soft reset if the PHY is configured in
the same state it was previously, this also eliminates the need for a
m88e1111_config_aneg() function since it now is the same as
marvell_config_aneg().

Tested-by: Wang, Dongsheng <dongsheng.wang@hxt-semitech.com>
Tested-by: Chris Healy <cphealy@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: Stop with excessive soft reset
Florian Fainelli [Tue, 25 Sep 2018 18:28:45 +0000 (11:28 -0700)]
net: phy: Stop with excessive soft reset

While consolidating the PHY reset in phy_init_hw() an unconditionaly
BMCR soft-reset I became quite trigger happy with those. This was later
on deactivated for the Generic PHY driver on the premise that a prior
software entity (e.g: bootloader) might have applied workarounds in
commit 0878fff1f42c ("net: phy: Do not perform software reset for
Generic PHY").

Since we have a hook to wire-up a soft_reset callback, just use that and
get rid of the call to genphy_soft_reset() entirely. This speeds up
initialization and link establishment for most PHYs out there that do
not require a reset.

Fixes: 87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()")
Tested-by: Wang, Dongsheng <dongsheng.wang@hxt-semitech.com>
Tested-by: Chris Healy <cphealy@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Wed, 26 Sep 2018 03:25:00 +0000 (20:25 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-09-25

This series contains updates to i40e and xsk.

Mariusz fixes an issue where the VF link state was not being updated
properly when the PF is down or up.  Also cleaned up the promiscuous
configuration during a VF reset.

Patryk simplifies the code a bit to use the variables for PF and HW that
are declared, rather than using the VSI pointers.  Cleaned up the
message length parameter to several virtchnl functions, since it was not
being used (or needed).

Harshitha fixes two potential race conditions when trying to change VF
settings by creating a helper function to validate that the VF is
enabled and that the VSI is set up.

Sergey corrects a double "link down" message by putting in a check for
whether or not the link is up or going down.

Björn addresses an AF_XDP zero-copy issue that buffers passed
from userspace to the kernel was leaked when the hardware descriptor
ring was torn down.  A zero-copy capable driver picks buffers off the
fill ring and places them on the hardware receive ring to be completed at
a later point when DMA is complete. Similar on the transmit side; The
driver picks buffers off the transmit ring and places them on the
transmit hardware ring.

In the typical flow, the receive buffer will be placed onto an receive
ring (completed to the user), and the transmit buffer will be placed on
the completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers cannot be leaked. They have to be reused or completed back to
userspace.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'Refactor-classifier-API-to-work-with-Qdisc-blocks-without-rtnl-lock'
David S. Miller [Wed, 26 Sep 2018 03:17:36 +0000 (20:17 -0700)]
Merge branch 'Refactor-classifier-API-to-work-with-Qdisc-blocks-without-rtnl-lock'

Vlad Buslov says:

====================
Refactor classifier API to work with Qdisc/blocks without rtnl lock

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
classifiers API that prevent its control path from being executed
concurrently. Additional changes are required to refactor classifiers
API and individual classifiers for parallel execution. This patch set
lays groundwork to eventually register rule update handlers as
rtnl-unlocked by modifying code in cls API that works with Qdiscs and
blocks. Following patch set does the same for chains and classifiers.

The goal of this change is to refactor tcf_block_find() and its
dependencies to allow concurrent execution:
- Extend Qdisc API with rcu to lookup and take reference to Qdisc
  without relying on rtnl lock.
- Extend tcf_block with atomic reference counting and rcu.
- Always take reference to tcf_block while working with it.
- Implement tcf_block_release() to release resources obtained by
  tcf_block_find()
- Create infrastructure to allow registering Qdiscs with class ops that
  do not require the caller to hold rtnl lock.

All three netlink rule update handlers use tcf_block_find() to lookup
Qdisc and block, and this patch set introduces additional means of
synchronization to substitute rtnl lock in cls API.

Some functions in cls and sch APIs have historic names that no longer
clearly describe their intent. In order not make this code even more
confusing when introducing their concurrency-friendly versions, rename
these functions to describe actual implementation.

Changes from V2 to V3:
- Patch 1:
  - Explicitly include refcount.h in rtnetlink.h.
- Patch 3:
  - Move rcu_head field to the end of struct Qdisc.
  - Rearrange local variable declarations in qdisc_lookup_rcu().
- Patch 5:
  - Remove tcf_qdisc_put() and inline its content to callers.

Changes from V1 to V2:
- Rebase on latest net-next.
- Patch 8 - remove.
- Patch 9 - fold into patch 11.
- Patch 11:
  - Rename tcf_block_{get|put}() to tcf_block_refcnt_{get|put}().
- Patch 13 - remove.
====================

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: use reference counting for tcf blocks on rules update
Vlad Buslov [Mon, 24 Sep 2018 16:22:58 +0000 (19:22 +0300)]
net: sched: use reference counting for tcf blocks on rules update

In order to remove dependency on rtnl lock on rules update path, always
take reference to block while using it on rules update path. Change
tcf_block_get() error handling to properly release block with reference
counting, instead of just destroying it, in order to accommodate potential
concurrent users.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: implement tcf_block_refcnt_{get|put}()
Vlad Buslov [Mon, 24 Sep 2018 16:22:57 +0000 (19:22 +0300)]
net: sched: implement tcf_block_refcnt_{get|put}()

Implement get/put function for blocks that only take/release the reference
and perform deallocation. These functions are intended to be used by
unlocked rules update path to always hold reference to block while working
with it. They use on new fine-grained locking mechanisms introduced in
previous patches in this set, instead of relying on global protection
provided by rtnl lock.

Extract code that is common with tcf_block_detach_ext() into common
function __tcf_block_put().

Extend tcf_block with rcu to allow safe deallocation when it is accessed
concurrently.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: protect block idr with spinlock
Vlad Buslov [Mon, 24 Sep 2018 16:22:56 +0000 (19:22 +0300)]
net: sched: protect block idr with spinlock

Protect block idr access with spinlock, instead of relying on rtnl lock.
Take tn->idr_lock spinlock during block insertion and removal.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: implement functions to put and flush all chains
Vlad Buslov [Mon, 24 Sep 2018 16:22:55 +0000 (19:22 +0300)]
net: sched: implement functions to put and flush all chains

Extract code that flushes and puts all chains on tcf block to two
standalone function to be shared with functions that locklessly get/put
reference to block.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: change tcf block reference counter type to refcount_t
Vlad Buslov [Mon, 24 Sep 2018 16:22:54 +0000 (19:22 +0300)]
net: sched: change tcf block reference counter type to refcount_t

As a preparation for removing rtnl lock dependency from rules update path,
change tcf block reference counter type to refcount_t to allow modification
by concurrent users.

In block put function perform decrement and check reference counter once to
accommodate concurrent modification by unlocked users. After this change
tcf_chain_put at the end of block put function is called with
block->refcnt==0 and will deallocate block after the last chain is
released, so there is no need to manually deallocate block in this case.
However, if block reference counter reached 0 and there are no chains to
release, block must still be deallocated manually.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: use Qdisc rcu API instead of relying on rtnl lock
Vlad Buslov [Mon, 24 Sep 2018 16:22:53 +0000 (19:22 +0300)]
net: sched: use Qdisc rcu API instead of relying on rtnl lock

As a preparation from removing rtnl lock dependency from rules update path,
use Qdisc rcu and reference counting capabilities instead of relying on
rtnl lock while working with Qdiscs. Create new tcf_block_release()
function, and use it to free resources taken by tcf_block_find().
Currently, this function only releases Qdisc and it is extended in next
patches in this series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: add helper function to take reference to Qdisc
Vlad Buslov [Mon, 24 Sep 2018 16:22:52 +0000 (19:22 +0300)]
net: sched: add helper function to take reference to Qdisc

Implement function to take reference to Qdisc that relies on rcu read lock
instead of rtnl mutex. Function only takes reference to Qdisc if reference
counter isn't zero. Intended to be used by unlocked cls API.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: extend Qdisc with rcu
Vlad Buslov [Mon, 24 Sep 2018 16:22:51 +0000 (19:22 +0300)]
net: sched: extend Qdisc with rcu

Currently, Qdisc API functions assume that users have rtnl lock taken. To
implement rtnl unlocked classifiers update interface, Qdisc API must be
extended with functions that do not require rtnl lock.

Extend Qdisc structure with rcu. Implement special version of put function
qdisc_put_unlocked() that is called without rtnl lock taken. This function
only takes rtnl lock if Qdisc reference counter reached zero and is
intended to be used as optimization.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: rename qdisc_destroy() to qdisc_put()
Vlad Buslov [Mon, 24 Sep 2018 16:22:50 +0000 (19:22 +0300)]
net: sched: rename qdisc_destroy() to qdisc_put()

Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroy Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function currently implemented and used.

Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: core: netlink: add helper refcount dec and lock function
Vlad Buslov [Mon, 24 Sep 2018 16:22:49 +0000 (19:22 +0300)]
net: core: netlink: add helper refcount dec and lock function

Rtnl lock is encapsulated in netlink and cannot be accessed by other
modules directly. This means that reference counted objects that rely on
rtnl lock cannot use it with refcounter helper function that atomically
releases decrements reference and obtains mutex.

This patch implements simple wrapper function around refcount_dec_and_lock
that obtains rtnl lock if reference counter value reached 0.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoi40e: disallow changing the number of descriptors when AF_XDP is on
Björn Töpel [Fri, 7 Sep 2018 08:18:48 +0000 (10:18 +0200)]
i40e: disallow changing the number of descriptors when AF_XDP is on

When an AF_XDP UMEM is attached to any of the Rx rings, we disallow a
user to change the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: clean zero-copy XDP Rx ring on shutdown/reset
Björn Töpel [Fri, 7 Sep 2018 08:18:47 +0000 (10:18 +0200)]
i40e: clean zero-copy XDP Rx ring on shutdown/reset

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings comes up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast, and a slow
allocation. The "fast allocation" is used from the fast-path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agonet: xsk: add a simple buffer reuse queue
Jakub Kicinski [Fri, 7 Sep 2018 08:18:46 +0000 (10:18 +0200)]
net: xsk: add a simple buffer reuse queue

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: clean zero-copy XDP Tx ring on shutdown/reset
Björn Töpel [Fri, 7 Sep 2018 08:18:45 +0000 (10:18 +0200)]
i40e: clean zero-copy XDP Tx ring on shutdown/reset

When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.

The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: Remove unused msglen parameter from virtchnl functions
Patryk Małek [Tue, 28 Aug 2018 17:16:07 +0000 (10:16 -0700)]
i40e: Remove unused msglen parameter from virtchnl functions

msglen parameter seems to be unused in several virtchnl function.
This patch removes it from signatures of those functions.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: fix double 'NIC Link is Down' messages
Sergey Nemov [Tue, 28 Aug 2018 17:16:06 +0000 (10:16 -0700)]
i40e: fix double 'NIC Link is Down' messages

When isup is false meaning that interface is going to shut down
set new speed to 0 to avoid double 'NIC Link is Down' messages.

Signed-off-by: Sergey Nemov <sergey.nemov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: add a helper function to validate a VF based on the vf id
Harshitha Ramamurthy [Tue, 28 Aug 2018 17:16:05 +0000 (10:16 -0700)]
i40e: add a helper function to validate a VF based on the vf id

When we are trying to change VF settings, it is possible for 2 race
conditions to happen. One, when the VF is created but not yet enabled.
Second, the VF is enabled but the VSI is still not created or not yet
re-created in the VF reset flow.

This patch introduces a helper function to validate that the VF is
enabled and that the VSI is set up. This patch also calls this
function from other functions which could get into these race conditions.
While we are poking around here, remove unnecessary parenthesis that
checkpatch was complaining about.

Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: use declared variables for pf and hw
Patryk Małek [Tue, 28 Aug 2018 17:16:04 +0000 (10:16 -0700)]
i40e: use declared variables for pf and hw

In order to slightly simplify the code use the variables for pf and hw
that are declared in i40e_set_mac function.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: Unset promiscuous settings on VF reset
Mariusz Stachura [Mon, 20 Aug 2018 15:12:28 +0000 (08:12 -0700)]
i40e: Unset promiscuous settings on VF reset

This patch cleans up promiscuous configuration when a VF reset occurs.
Previously the promiscuous mode settings were still there after the VF
driver removal.

Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: Fix VF's link state notification
Mariusz Stachura [Mon, 20 Aug 2018 15:12:22 +0000 (08:12 -0700)]
i40e: Fix VF's link state notification

This resolves an issue where the VF link state was not being updated
when the PF is down or up, and the VF link state would always show
that it is running.

Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agotls: Fixed a memory leak during socket close
Vakul Garg [Tue, 25 Sep 2018 14:51:51 +0000 (20:21 +0530)]
tls: Fixed a memory leak during socket close

During socket close, if there is a open record with tx context, it needs
to be be freed apart from freeing up plaintext and encrypted scatter
lists. This patch frees up the open record if present in tx context.

Also tls_free_both_sg() has been renamed to tls_free_open_rec() to
indicate that the free record in tx context is being freed inside the
function.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotls: Fix socket mem accounting error under async encryption
Vakul Garg [Tue, 25 Sep 2018 10:56:17 +0000 (16:26 +0530)]
tls: Fix socket mem accounting error under async encryption

Current async encryption implementation sometimes showed up socket
memory accounting error during socket close. This results in kernel
warning calltrace. The root cause of the problem is that socket var
sk_forward_alloc gets corrupted due to access in sk_mem_charge()
and sk_mem_uncharge() being invoked from multiple concurrent contexts
in multicore processor. The apis sk_mem_charge() and sk_mem_uncharge()
are called from functions alloc_plaintext_sg(), free_sg() etc. It is
required that memory accounting apis are called under a socket lock.

The plaintext sg data sent for encryption is freed using free_sg() in
tls_encryption_done(). It is wrong to call free_sg() from this function.
This is because this function may run in irq context. We cannot acquire
socket lock in this function.

We remove calling of function free_sg() for plaintext data from
tls_encryption_done() and defer freeing up of plaintext data to the time
when the record is picked up from tx_list and transmitted/freed. When
tls_tx_records() gets called, socket is already locked and thus there is
no concurrent access problem.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
David S. Miller [Tue, 25 Sep 2018 17:35:29 +0000 (10:35 -0700)]
Merge ra./pub/scm/linux/kernel/git/davem/net

Version bump conflict in batman-adv, take what's in net-next.

iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: macb: Clean 64b dma addresses if they are not detected
Michal Simek [Tue, 25 Sep 2018 06:32:50 +0000 (08:32 +0200)]
net: macb: Clean 64b dma addresses if they are not detected

Clear ADDR64 dma bit in DMACFG register in case that HW_DMA_CAP_64B is
not detected on 64bit system.
The issue was observed when bootloader(u-boot) does not check macb
feature at DCFG6 register (DAW64_OFFSET) and enabling 64bit dma support
by default. Then macb driver is reading DMACFG register back and only
adding 64bit dma configuration but not cleaning it out.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'r8169-series-with-smaller-improvements'
David S. Miller [Tue, 25 Sep 2018 17:28:03 +0000 (10:28 -0700)]
Merge branch 'r8169-series-with-smaller-improvements'

Heiner Kallweit says:

====================
r8169: series with smaller improvements

This series includes smaller improvements, nothing exciting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: improve a check in rtl_init_one
Heiner Kallweit [Tue, 25 Sep 2018 05:59:36 +0000 (07:59 +0200)]
r8169: improve a check in rtl_init_one

The check for pci_is_pcie() is redundant here because all
chip versions >=18 are PCIe only anyway. In addition use
dma_set_mask_and_coherent() instead of separate calls to
pci_set_dma_mask() and pci_set_consistent_dma_mask().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: improve rtl8169_irq_mask_and_ack
Heiner Kallweit [Tue, 25 Sep 2018 05:58:00 +0000 (07:58 +0200)]
r8169: improve rtl8169_irq_mask_and_ack

Code can be slightly simplified by acking even events we're not
interested in. In addition add a comment making clear that the
read has no functional purpose and is just a PCI commit.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agor8169: use default watchdog timeout
Heiner Kallweit [Tue, 25 Sep 2018 05:56:53 +0000 (07:56 +0200)]
r8169: use default watchdog timeout

The networking core has a default watchdog timeout of 5s. I see no
need to define an own timeout of 6s which is basically the same.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Greg Kroah-Hartman [Tue, 25 Sep 2018 16:14:14 +0000 (18:14 +0200)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

James writes:
  "SCSI fixes on 20180925

   Nine obvious bug fixes mostly in individual drivers.  The target fix
   is of particular importance because it's CVE related."

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: don't crash the host on invalid commands
  scsi: ipr: System hung while dlpar adding primary ipr adapter back
  scsi: target: iscsi: Use bin2hex instead of a re-implementation
  scsi: target: iscsi: Use hex2bin instead of a re-implementation
  scsi: lpfc: Synchronize access to remoteport via rport
  scsi: ufs: Disable blk-mq for now
  scsi: sd: Contribute to randomness when running rotational device
  scsi: ibmvscsis: Ensure partition name is properly NUL terminated
  scsi: ibmvscsis: Fix a stringop-overflow warning

5 years agoMerge tag 'usb-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:54:55 +0000 (17:54 +0200)]
Merge tag 'usb-4.19-rc6' of git://git./linux/kernel/git/gregkh/usb

I wrote:
  "USB fixes for 4.19-rc6

   Here are some small USB core and driver fixes for reported issues for
   4.19-rc6.

   The most visible is the oops fix for when the USB core is built into the
   kernel that is present in 4.18.  Turns out not many people actually do
   that so it went unnoticed for a while.  The rest is some tiny typec,
   musb, and other core fixes.

   All have been in linux-next with no reported issues."

* tag 'usb-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: mux: Take care of driver module reference counting
  usb: core: safely deal with the dynamic quirk lists
  usb: roles: Take care of driver module reference counting
  USB: handle NULL config in usb_find_alt_setting()
  USB: fix error handling in usb_driver_claim_interface()
  USB: remove LPM management from usb_driver_claim_interface()
  USB: usbdevfs: restore warning for nonsensical flags
  USB: usbdevfs: sanitize flags more
  Revert "usb: cdc-wdm: Fix a sleep-in-atomic-context bug in service_outstanding_interrupt()"
  usb: musb: dsps: do not disable CPPI41 irq in driver teardown

5 years agoflow_dissector: lookup netns by skb->sk if skb->dev is NULL
Willem de Bruijn [Mon, 24 Sep 2018 20:49:57 +0000 (16:49 -0400)]
flow_dissector: lookup netns by skb->sk if skb->dev is NULL

BPF flow dissectors are configured per network namespace.
__skb_flow_dissect looks up the netns through dev_net(skb->dev).

In some dissector paths skb->dev is NULL, such as for Unix sockets.
In these cases fall back to looking up the netns by socket.

Analyzing the codepaths leading to __skb_flow_dissect I did not find
a case where both skb->dev and skb->sk are NULL. Warn and fall back to
standard flow dissector if one is found.

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpftool: add support for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
Roman Gushchin [Fri, 21 Sep 2018 22:47:20 +0000 (22:47 +0000)]
bpftool: add support for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps

Add BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map type to the list
of maps types which bpftool recognizes.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoMerge tag 'tty-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:22:50 +0000 (17:22 +0200)]
Merge tag 'tty-4.19-rc6' of git://git./linux/kernel/git/gregkh/tty

I wrote:
  "TTY/Serial driver fixes for 4.19-rc6

   Here are a number of small tty and serial driver fixes for reported
   issues for 4.19-rc6.

   One should hopefully resolve a much-reported issue that syzbot has found
   in the tty layer.  Although there are still more issues there, getting
   this fixed is nice to see finally happen.

   All of these have been in linux-next for a while with no reported
   issues."

* tag 'tty-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: imx: restore handshaking irq for imx1
  tty: vt_ioctl: fix potential Spectre v1
  tty: Drop tty->count on tty_reopen() failure
  serial: cpm_uart: return immediately from console poll
  tty: serial: lpuart: avoid leaking struct tty_struct
  serial: mvebu-uart: Fix reporting of effective CSIZE to userspace

5 years agoMerge tag 'char-misc-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:01:52 +0000 (17:01 +0200)]
Merge tag 'char-misc-4.19-rc6' of git://git./linux/kernel/git/gregkh/char-misc

Greg (well I), wrote:
  "Char/Misc driver fixes for 4.19-rc6

   Here are some soundwire and intel_th (tracing) driver fixes for some
   reported issues.

   All of these have been in linux-next for a week with no reported issues."

* tag 'char-misc-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  intel_th: pci: Add Ice Lake PCH support
  intel_th: Fix resource handling for ACPI glue layer
  intel_th: Fix device removal logic
  soundwire: Fix acquiring bus lock twice during master release
  soundwire: Fix incorrect exit after configuring stream
  soundwire: Fix duplicate stream state assignment

5 years agoRevert "uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name"
Lubomir Rintel [Mon, 24 Sep 2018 12:18:34 +0000 (13:18 +0100)]
Revert "uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name"

This changes UAPI, breaking iwd and libell:

  ell/key.c: In function 'kernel_dh_compute':
  ell/key.c:205:38: error: 'struct keyctl_dh_params' has no member named 'private'; did you mean 'dh_private'?
    struct keyctl_dh_params params = { .private = private,
                                        ^~~~~~~
                                        dh_private

This reverts commit 8a2336e549d385bb0b46880435b411df8d8200e8.

Fixes: 8a2336e549d3 ("uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name")
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Randy Dunlap <rdunlap@infradead.org>
cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
cc: Stephan Mueller <smueller@chronox.de>
cc: James Morris <jmorris@namei.org>
cc: "Serge E. Hallyn" <serge@hallyn.com>
cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: <stable@vger.kernel.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 years agoMerge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net
Greg Kroah-Hartman [Tue, 25 Sep 2018 09:19:28 +0000 (11:19 +0200)]
Merge gitolite./pub/scm/linux/kernel/git/davem/net

Dave writes:
  "Networking fixes:

  1) Fix multiqueue handling of coalesce timer in stmmac, from Jose
     Abreu.

   2) Fix memory corruption in NFC, from Suren Baghdasaryan.

   3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi.

   4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun.

   5) Fix TX done race in mvpp2, from Antoine Tenart.

   6) ipv6 metrics leak, from Wei Wang.

   7) Adjust firmware version requirements in mlxsw, from Petr Machata.

   8) Fix autonegotiation on resume in r8169, from Heiner Kallweit.

   9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff
      Barnhill.

   10) Fix double free in devlink, from Dan Carpenter.

   11) Fix ethtool regression from UFO feature removal, from Maciej
       Å»enczykowski.

   12) Fix drivers that have a ndo_poll_controller() that captures the
       cpu entirely on loaded hosts by trying to drain all rx and tx
       queues, from Eric Dumazet.

   13) Fix memory corruption with jumbo frames in aquantia driver, from
       Friedemann Gerold."

* gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits)
  net: mvneta: fix the remaining Rx descriptor unmapping issues
  ip_tunnel: be careful when accessing the inner header
  mpls: allow routes on ip6gre devices
  net: aquantia: memory corruption on jumbo frames
  tun: remove ndo_poll_controller
  nfp: remove ndo_poll_controller
  bnxt: remove ndo_poll_controller
  bnx2x: remove ndo_poll_controller
  mlx5: remove ndo_poll_controller
  mlx4: remove ndo_poll_controller
  i40evf: remove ndo_poll_controller
  ice: remove ndo_poll_controller
  igb: remove ndo_poll_controller
  ixgb: remove ndo_poll_controller
  fm10k: remove ndo_poll_controller
  ixgbevf: remove ndo_poll_controller
  ixgbe: remove ndo_poll_controller
  bonding: use netpoll_poll_dev() helper
  netpoll: make ndo_poll_controller() optional
  rds: Fix build regression.
  ...

5 years agodpaa2-eth: Make Rx flow hash key configurable
Ioana Ciocoi Radulescu [Mon, 24 Sep 2018 15:36:21 +0000 (15:36 +0000)]
dpaa2-eth: Make Rx flow hash key configurable

Until now, the Rx flow hash key was a 5-tuple (IP src, IP dst,
IP nextproto, L4 src port, L4 dst port) fixed value that we
configured at probe.

Add support for configuring this hash key at runtime.
We support all standard header fields configurable through ethtool,
but cannot differentiate between flow types, so the same hash key
is applied regardless of protocol.

We also don't support the discard option.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mvneta: fix the remaining Rx descriptor unmapping issues
Antoine Tenart [Mon, 24 Sep 2018 14:56:13 +0000 (16:56 +0200)]
net: mvneta: fix the remaining Rx descriptor unmapping issues

With CONFIG_DMA_API_DEBUG enabled we get DMA unmapping warning in
various places of the mvneta driver, for example when putting down an
interface while traffic is passing through.

The issue is when using s/w buffer management, the Rx buffers are mapped
using dma_map_page but unmapped with dma_unmap_single. This patch fixes
this by using the right unmapping function.

Fixes: 562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoip_tunnel: be careful when accessing the inner header
Paolo Abeni [Mon, 24 Sep 2018 13:48:19 +0000 (15:48 +0200)]
ip_tunnel: be careful when accessing the inner header

Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6
("ip6_tunnel: be careful when accessing the inner header")
even for ipv4 tunnels.

Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: qca_spi: Introduce write register verification
Stefan Wahren [Mon, 24 Sep 2018 11:20:10 +0000 (13:20 +0200)]
net: qca_spi: Introduce write register verification

The SPI protocol for the QCA7000 doesn't have any fault detection.
In order to increase the drivers reliability in noisy environments,
we could implement a write verification inspired by the enc28j60.
This should avoid situations were the driver wrongly assumes the
receive interrupt is enabled and miss all incoming packets.

This function is disabled per default and can be controlled via module
parameter wr_verify.

Signed-off-by: Michael Heimpold <michael.heimpold@i2se.com>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotls: Fixed uninitialised vars warning
Vakul Garg [Mon, 24 Sep 2018 10:39:49 +0000 (16:09 +0530)]
tls: Fixed uninitialised vars warning

In tls_sw_sendmsg() and tls_sw_sendpage(), it is possible that the
uninitialised variable 'ret' gets passed to sk_stream_error(). So
initialise local variable 'ret' to '0. The warnings were detected by
'smatch' tool.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/tls: Fixed race condition in async encryption
Vakul Garg [Mon, 24 Sep 2018 10:05:56 +0000 (15:35 +0530)]
net/tls: Fixed race condition in async encryption

On processors with multi-engine crypto accelerators, it is possible that
multiple records get encrypted in parallel and their encryption
completion is notified to different cpus in multicore processor. This
leads to the situation where tls_encrypt_done() starts executing in
parallel on different cores. In current implementation, encrypted
records are queued to tx_ready_list in tls_encrypt_done(). This requires
addition to linked list 'tx_ready_list' to be protected. As
tls_decrypt_done() could be executing in irq content, it is not possible
to protect linked list addition operation using a lock.

To fix the problem, we remove linked list addition operation from the
irq context. We do tx_ready_list addition/removal operation from
application context only and get rid of possible multiple access to
the linked list. Before starting encryption on the record, we add it to
the tail of tx_ready_list. To prevent tls_tx_records() from transmitting
it, we mark the record with a new flag 'tx_ready' in 'struct tls_rec'.
When record encryption gets completed, tls_encrypt_done() has to only
update the 'tx_ready' flag to true & linked list add operation is not
required.

The changed logic brings some other side benefits. Since the records
are always submitted in tls sequence number order for encryption, the
tx_ready_list always remains sorted and addition of new records to it
does not have to traverse the linked list.

Lastly, we renamed tx_ready_list in 'struct tls_sw_context_tx' to
'tx_list'. This is because now, the some of the records at the tail are
not ready to transmit.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'few-NTF_ROUTER-related-updates'
David S. Miller [Mon, 24 Sep 2018 19:21:33 +0000 (12:21 -0700)]
Merge branch 'few-NTF_ROUTER-related-updates'

Roopa Prabhu says:

====================
few NTF_ROUTER related updates

This series allows setting of NTF_ROUTER by an external
entity (eg BGP E-VPN control plane). Also fixes missing
netlink notification on neigh NTF_ROUTER flag changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoneighbour: send netlink notification if NTF_ROUTER changes
Roopa Prabhu [Sun, 23 Sep 2018 04:26:20 +0000 (21:26 -0700)]
neighbour: send netlink notification if NTF_ROUTER changes

send netlink notification if neigh_update results in NTF_ROUTER
change and if NEIGH_UPDATE_F_ISROUTER is on. Also move the
NTF_ROUTER change function into a helper.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoneighbour: allow admin to set NTF_ROUTER
Roopa Prabhu [Sun, 23 Sep 2018 04:26:19 +0000 (21:26 -0700)]
neighbour: allow admin to set NTF_ROUTER

This patch allows admin setting of NTF_ROUTER flag
on a neighbour entry. This enables external control
plane (like bgp evpn) to manage neigh entries with
NTF_ROUTER flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agompls: allow routes on ip6gre devices
Saif Hasan [Fri, 21 Sep 2018 21:30:05 +0000 (14:30 -0700)]
mpls: allow routes on ip6gre devices

Summary:

This appears to be necessary and sufficient change to enable `MPLS` on
`ip6gre` tunnels (RFC4023).

This diff allows IP6GRE devices to be recognized by MPLS kernel module
and hence user can configure interface to accept packets with mpls
headers as well setup mpls routes on them.

Test Plan:

Test plan consists of multiple containers connected via GRE-V6 tunnel.
Then carrying out testing steps as below.

- Carry out necessary sysctl settings on all containers

```
sysctl -w net.mpls.platform_labels=65536
sysctl -w net.mpls.ip_ttl_propagate=1
sysctl -w net.mpls.conf.lo.input=1
```

- Establish IP6GRE tunnels

```
ip -6 tunnel add name if_1_2_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::2 key 1
ip link set dev if_1_2_1 up
sysctl -w net.mpls.conf.if_1_2_1.input=1
ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link

ip -6 tunnel add name if_1_3_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::3 key 1
ip link set dev if_1_3_1 up
sysctl -w net.mpls.conf.if_1_3_1.input=1
ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link
```

- Install MPLS encap rules on node-1 towards node-2

```
ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \
  via inet 169.254.0.3 dev if_1_2_1
```

- Install MPLS forwarding rules on node-2 and node-3
```
// node2
ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1

// node3
ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1
```

- Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing
  towards 192.168.0.1 is via IP route directly towards node1 from node4)
```
ping 192.168.0.11
```

- tcpdump on interface to capture ping packets wrapped within MPLS
  header which inturn wrapped within IP6GRE header

```
16:43:41.121073 IP6
  2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2:
  DSTOPT GREv0, key=0x1, length 100:
  MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255)
  IP 192.168.0.1 > 192.168.0.11:
  ICMP echo request, id 1208, seq 45, length 64

0x0000:  6000 2cdb 006c 3c3f 2401 db00 0021 6048  `.,..l<?$....!`H
0x0010:  feed 0000 0000 0001 2401 db00 0021 6048  ........$....!`H
0x0020:  feed 0000 0000 0002 2f00 0401 0401 0100  ......../.......
0x0030:  2000 8847 0000 0001 0002 00ff 0004 01ff  ...G............
0x0040:  4500 0054 3280 4000 ff01 c7cb c0a8 0001  E..T2.@.........
0x0050:  c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b  ...........--<.[
0x0060:  0000 0000 bcd8 0100 0000 0000 1011 1213  ................
0x0070:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
0x0080:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
0x0090:  3435 3637                                4567
```

Signed-off-by: Saif Hasan <has@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'
David S. Miller [Mon, 24 Sep 2018 19:18:43 +0000 (12:18 -0700)]
Merge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'

Eelco Chaudron says:

====================
net/sched: Add hardware specific counters to TC actions

Add hardware specific counters to TC actions which will be exported
through the netlink API. This makes troubleshooting TC flower offload
easier, as it possible to differentiate the packets being offloaded.

v2 - Rebased on latest net-next
====================

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/sched: Add hardware specific counters to TC actions
Eelco Chaudron [Fri, 21 Sep 2018 11:14:02 +0000 (07:14 -0400)]
net/sched: Add hardware specific counters to TC actions

Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
displaying by the iproute2 tc tool

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/core: Add new basic hardware counter
Eelco Chaudron [Fri, 21 Sep 2018 11:13:54 +0000 (07:13 -0400)]
net/core: Add new basic hardware counter

Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Mon, 24 Sep 2018 17:08:45 +0000 (10:08 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2018-09-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several fixes for BPF sockmap to only allow sockets being attached in
   ESTABLISHED state, from John.

2) Fix up the license to LGPL/BSD for the libc compat header which contains
   fallback helpers that libbpf and bpftool is using, from Jakub.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'mvpp2-Add-txq-to-CPU-mapping'
David S. Miller [Mon, 24 Sep 2018 17:01:10 +0000 (10:01 -0700)]
Merge branch 'mvpp2-Add-txq-to-CPU-mapping'

Maxime Chevallier says:

====================
net: mvpp2: Add txq to CPU mapping

This short series adds XPS support to the mvpp2 driver, by mapping
txqs and CPUs. This comes with a patch using round-robin scheduling
for the HW to pick the next txq to transmit from, instead of the default
fixed-priority scheduling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mvpp2: use round-robin scheduling for TX queues on the same CPU
Maxime Chevallier [Mon, 24 Sep 2018 09:11:06 +0000 (11:11 +0200)]
net: mvpp2: use round-robin scheduling for TX queues on the same CPU

This commit allows each TXQ to be picked in a round-robin fashion by
the PPv2 transmit scheduling mechanism. This is opposed to the default
behaviour that prioritizes the highest numbered queues.

Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mvpp2: support XPS by mapping TX queues to CPUs
Maxime Chevallier [Mon, 24 Sep 2018 09:11:05 +0000 (11:11 +0200)]
net: mvpp2: support XPS by mapping TX queues to CPUs

Since the PPv2 controller has multiple TX queues, we can spread traffic
by assining TX queues to CPUs, allowing to use XPS to balance egress
traffic between CPUs.

Suggested-by : Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'media/v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Greg Kroah-Hartman [Mon, 24 Sep 2018 13:16:41 +0000 (15:16 +0200)]
Merge tag 'media/v4.19-2' of git://git./linux/kernel/git/mchehab/linux-media

Mauro briefly writes:
  "media fixes for v4.19-rc5

   some drivers and Kbuild fixes"

* tag 'media/v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: platform: fix cros-ec-cec build error
  media: staging/media/mt9t031/Kconfig: remove bogus entry
  media: i2c: mt9v111: Fix v4l2-ctrl error handling
  media: camss: add missing includes
  media: camss: Use managed memory allocations
  media: camss: mark PM functions as __maybe_unused
  media: af9035: prevent buffer overflow on write
  media: video_function_calls.rst: drop obsolete video-set-attributes reference

5 years agonet: aquantia: memory corruption on jumbo frames
Friedemann Gerold [Sat, 15 Sep 2018 15:03:39 +0000 (18:03 +0300)]
net: aquantia: memory corruption on jumbo frames

This patch fixes skb_shared area, which will be corrupted
upon reception of 4K jumbo packets.

Originally build_skb usage purpose was to reuse page for skb to eliminate
needs of extra fragments. But that logic does not take into account that
skb_shared_info should be reserved at the end of skb data area.

In case packet data consumes all the page (4K), skb_shinfo location
overflows the page. As a consequence, __build_skb zeroed shinfo data above
the allocated page, corrupting next page.

The issue is rarely seen in real life because jumbo are normally larger
than 4K and that causes another code path to trigger.
But it 100% reproducible with simple scapy packet, like:

    sendp(IP(dst="192.168.100.3") / TCP(dport=443) \
          / Raw(RandString(size=(4096-40))), iface="enp1s0")

Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
Reported-by: Friedemann Gerold <f.gerold@b-c-s.de>
Reported-by: Michael Rauch <michael@rauch.be>
Signed-off-by: Friedemann Gerold <f.gerold@b-c-s.de>
Tested-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'netpoll-avoid-capture-effects-for-NAPI-drivers'
David S. Miller [Mon, 24 Sep 2018 04:55:26 +0000 (21:55 -0700)]
Merge branch 'netpoll-avoid-capture-effects-for-NAPI-drivers'

Eric Dumazet says:

====================
netpoll: avoid capture effects for NAPI drivers

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC).

This capture, showing one ksoftirqd eating all cycles
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller() :

Most NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.

This patch series take care of the first round, we will
handle other drivers in future rounds.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotun: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:52 +0000 (15:27 -0700)]
tun: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

tun uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:51 +0000 (15:27 -0700)]
nfp: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

nfp uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobnxt: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:50 +0000 (15:27 -0700)]
bnxt: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

bnxt uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobnx2x: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:49 +0000 (15:27 -0700)]
bnx2x: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

bnx2x uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlx5: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:48 +0000 (15:27 -0700)]
mlx5: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

mlx5 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlx4: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:47 +0000 (15:27 -0700)]
mlx4: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

mlx4 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoi40evf: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:46 +0000 (15:27 -0700)]
i40evf: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

i40evf uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoice: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:45 +0000 (15:27 -0700)]
ice: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ice uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoigb: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:44 +0000 (15:27 -0700)]
igb: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

igb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoixgb: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:43 +0000 (15:27 -0700)]
ixgb: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ixgb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

This also removes a problematic use of disable_irq() in
a context it is forbidden, as explained in commit
af3e0fcf7887 ("8139too: Use disable_irq_nosync() in
rtl8139_poll_controller()")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agofm10k: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:42 +0000 (15:27 -0700)]
fm10k: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
lasts for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

fm10k uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>