git.samba.org - sfrench/cifs-2.6.git/log

Merge branch 'bpf-xsk-fix-mixed-mode'

Magnus Karlsson says:

====================
Previously, the xsk code did not record which umem was bound to a
specific queue id. This was not required if all drivers were zero-copy
enabled as this had to be recorded in the driver anyway. So if a user
tried to bind two umems to the same queue, the driver would say
no. But if copy-mode was first enabled and then zero-copy mode (or the
reverse order), we mistakenly enabled both of them on the same umem
leading to buggy behavior. The main culprit for this is that we did
not store the association of umem to queue id in the copy case and
only relied on the driver reporting this. As this relation was not
stored in the driver for copy mode (it does not rely on the AF_XDP
NDOs), this obviously could not work.

This patch fixes the problem by always recording the umem to queue id
relationship in the netdev_queue and netdev_rx_queue structs. This way
we always know what kind of umem has been bound to a queue id and can
act appropriately at bind time. To make the bind semantics consistent
with ethtool queue manipulations and to facilitate the implementation
of drivers, we also forbid decreasing the number of queues/channels
with ethtool if there is an active AF_XDP socket in the set of queues
that are disabled.

Jakub, please take a look at your patches. The last one I had to
change slightly to make it fit with the new interface
xdp_get_umem_from_qid(). An added bonus with this function is that we,
in the future, can also use it from the driver to get a umem, thus
simplifying driver implementations (and later remove the umem from the
NDO completely). Björn will mail patches, at a later point in time,
using this in the i40e and ixgbe drivers, that removes a good chunk of
code from the ZC implementations. I also made your code aware of Tx
queues. If we create a socket that only has a Tx queue, then the queue
id will refer to a Tx queue id only and could be larger than the
available amount of Rx queues. Please take a look at it.

Differences against v1:
* Included patches from Jakub that forbids decreasing the number of active
  queues if a queue to be deactivated has an AF_XDP socket. These have
  been adapted somewhat to the new interfaces in patch 2.
* Removed redundant check against real_num_[rt]x_queue in xsk_bind
* Only need to test against real_num_[rt]x_queues in
  xdp_clear_umem_at_qid.

Patch 1: Introduces a umem reference in the netdev_rx_queue and
         netdev_queue structs.
Patch 2: Records which queue_id is bound to which umem and make sure
         that you cannot bind two different umems to the same queue_id.
Patch 3: Pre patch to ethtool_set_channels.
Patch 4: Forbid decreasing the number of active queues if a deactivated
         queue has an AF_XDP socket.
Patch 5: Simplify xdp_clear_umem_at_qid now when ethtool cannot deactivate
         the queue id we are running on.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: simplify xdp_clear_umem_at_qid implementation

As we now do not allow ethtool to deactivate the queue id we are
running an AF_XDP socket on, we can simplify the implementation of
xdp_clear_umem_at_qid().

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

ethtool: don't allow disabling queues with umem installed

We already check the RSS indirection table does not use queues which
would be disabled by channel reconfiguration. Make sure user does not
try to disable queues which have a UMEM and zero-copy AF_XDP socket
installed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

ethtool: rename local variable max -> curr

ethtool_set_channels() validates the config against driver's max
settings. It retrieves the current config and stores it in a
variable called max. This was okay when only max settings were
accessed but we will soon want to access current settings as
well, so calling the entire structure max makes the code less
readable.

While at it drop unnecessary parenthesis.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: fix bug when trying to use both copy and zero-copy on one queue id

Previously, the xsk code did not record which umem was bound to a
specific queue id. This was not required if all drivers were zero-copy
enabled as this had to be recorded in the driver anyway. So if a user
tried to bind two umems to the same queue, the driver would say
no. But if copy-mode was first enabled and then zero-copy mode (or the
reverse order), we mistakenly enabled both of them on the same umem
leading to buggy behavior. The main culprit for this is that we did
not store the association of umem to queue id in the copy case and
only relied on the driver reporting this. As this relation was not
stored in the driver for copy mode (it does not rely on the AF_XDP
NDOs), this obviously could not work.

This patch fixes the problem by always recording the umem to queue id
relationship in the netdev_queue and netdev_rx_queue structs. This way
we always know what kind of umem has been bound to a queue id and can
act appropriately at bind time.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: add umem reference in netdev{_rx}_queue

These references to the umem will be used to store information
on what kind of AF_XDP umem that is bound to a queue id, if any.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: typo fix in Documentation/networking/af_xdp.rst

Fix a simple typo: Completetion -> Completion

Signed-off-by: Konrad Djimeli <kdjimeli@igalia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf, tracex3_user: erase "ARRAY_SIZE" redefined

There is a warning when compiling bpf sample programs in sample/bpf:

  make -C /home/foo/bpf/samples/bpf/../../tools/lib/bpf/ RM='rm -rf' LDFLAGS= srctree=/home/foo/bpf/samples/bpf/../../ O=
    HOSTCC  /home/foo/bpf/samples/bpf/tracex3_user.o
  /home/foo/bpf/samples/bpf/tracex3_user.c:20:0: warning: "ARRAY_SIZE" redefined
   #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

  In file included from /home/foo/bpf/samples/bpf/tracex3_user.c:18:0:
  ./tools/testing/selftests/bpf/bpf_util.h:48:0: note: this is the location of the previous definition
   # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-libbpf-consistent-iface'

Andrey Ignatov says:

====================
This patch set renames a few interfaces in libbpf, mostly netlink related,
so that all symbols provided by the library have only three possible
prefixes:

% nm -D tools/lib/bpf/libbpf.so  | \
    awk '$2 == "T" {sub(/[_\(].*/, "", $3); if ($3) print $3}' | \
    sort | \
    uniq -c
     91 bpf
      8 btf
     14 libbpf

libbpf is used more and more outside kernel tree. That means the library
should follow good practices in library design and implementation to
play well with third party code that uses it.

One of such practices is to have a common prefix (or a few) for every
interface, function or data structure, library provides. It helps to
avoid name conflicts with other libraries and keeps API/ABI consistent.

Inconsistent names in libbpf already cause problems in real life. E.g.
an application can't use both libbpf and libnl due to conflicting
symbols (specifically nla_parse, nla_parse_nested and a few others).

Some of problematic global symbols are not part of ABI and can be
restricted from export with either visibility attribute/pragma or export
map (what is useful by itself and can be done in addition). That won't
solve the problem for those that are part of ABI though. Also export
restrictions would help only in DSO case. If third party application links
libbpf statically it won't help, and people do it (e.g. Facebook links
most of libraries statically, including libbpf).

libbpf already uses the following prefixes for its interfaces:
* bpf_ for bpf system call wrappers, program/map/elf-object
  abstractions and a few other things;
* btf_ for BTF related API;
* libbpf_ for everything else.

The patch adds libbpf_ prefix to interfaces that use none of mentioned
above prefixes and don't fit well into the first two categories.

Long term benefits of having common prefix should outweigh possible
inconvenience of changing API for those functions now.

Patches 2-4 add libbpf_ prefix to libbpf interfaces: separate patch per
header. Other patches are simple improvements in API.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Use __u32 instead of u32 in bpf_program__load

Make bpf_program__load consistent with other interfaces: use __u32
instead of u32. That in turn fixes build of samples:

In file included from ./samples/bpf/trace_output_user.c:21:0:
./tools/lib/bpf/libbpf.h:132:9: error: unknown type name ‘u32’
u32 kern_version);
^

Fixes: commit 29cd77f41620d ("libbpf: Support loading individual progs")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Make include guards consistent

Rename include guards to have consistent names "__LIBBPF_<header_name>".

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Consistent prefixes for interfaces in str_error.h.

libbpf is used more and more outside kernel tree. That means the library
should follow good practices in library design and implementation to
play well with third party code that uses it.

One of such practices is to have a common prefix (or a few) for every
interface, function or data structure, library provides. I helps to
avoid name conflicts with other libraries and keeps API consistent.

Inconsistent names in libbpf already cause problems in real life. E.g.
an application can't use both libbpf and libnl due to conflicting
symbols.

Having common prefix will help to fix current and avoid future problems.

libbpf already uses the following prefixes for its interfaces:
* bpf_ for bpf system call wrappers, program/map/elf-object
abstractions and a few other things;
* btf_ for BTF related API;
* libbpf_ for everything else.

The patch renames function in str_error.h to have libbpf_ prefix since it
misses one and doesn't fit well into the first two categories.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Consistent prefixes for interfaces in nlattr.h.

libbpf is used more and more outside kernel tree. That means the library
should follow good practices in library design and implementation to
play well with third party code that uses it.

One of such practices is to have a common prefix (or a few) for every
interface, function or data structure, library provides. I helps to
avoid name conflicts with other libraries and keeps API consistent.

Inconsistent names in libbpf already cause problems in real life. E.g.
an application can't use both libbpf and libnl due to conflicting
symbols.

Having common prefix will help to fix current and avoid future problems.

libbpf already uses the following prefixes for its interfaces:
* bpf_ for bpf system call wrappers, program/map/elf-object
abstractions and a few other things;
* btf_ for BTF related API;
* libbpf_ for everything else.

The patch adds libbpf_ prefix to interfaces in nlattr.h that use none of
mentioned above prefixes and doesn't fit well into the first two
categories.

Since affected part of API is used in bpftool, the patch applies
corresponding change to bpftool as well. Having it in a separate patch
will cause a state of tree where bpftool is broken what may not be a
good idea.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Consistent prefixes for interfaces in libbpf.h.

libbpf is used more and more outside kernel tree. That means the library
should follow good practices in library design and implementation to
play well with third party code that uses it.

One of such practices is to have a common prefix (or a few) for every
interface, function or data structure, library provides. I helps to
avoid name conflicts with other libraries and keeps API consistent.

Inconsistent names in libbpf already cause problems in real life. E.g.
an application can't use both libbpf and libnl due to conflicting
symbols.

Having common prefix will help to fix current and avoid future problems.

libbpf already uses the following prefixes for its interfaces:
* bpf_ for bpf system call wrappers, program/map/elf-object
abstractions and a few other things;
* btf_ for BTF related API;
* libbpf_ for everything else.

The patch adds libbpf_ prefix to functions and typedef in libbpf.h that
use none of mentioned above prefixes and doesn't fit well into the first
two categories.

Since affected part of API is used in bpftool, the patch applies
corresponding change to bpftool as well. Having it in a separate patch
will cause a state of tree where bpftool is broken what may not be a
good idea.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Move __dump_nlmsg_t from API to implementation

This typedef is used only by implementation in netlink.c. Nothing uses
it in public API. Move it to netlink.c.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: core: Fix build with CONFIG_IPV6=m

Stephen Rothwell reports the following link failure with IPv6 as module:

x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup':
(.text+0x19219): undefined reference to `__udp6_lib_lookup'

Fix the build by only enabling the IPv6 socket lookup if IPv6 support is
compiled into the kernel.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-sk-lookup'

Joe Stringer says:

====================
This series proposes a new helper for the BPF API which allows BPF programs to
perform lookups for sockets in a network namespace. This would allow programs
to determine early on in processing whether the stack is expecting to receive
the packet, and perform some action (eg drop, forward somewhere) based on this
information.

The series is structured roughly into:
* Misc refactor
* Add the socket pointer type
* Add reference tracking to ensure that socket references are freed
* Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
* Add tests/documentation

The helper proposed in this series includes a parameter for a tuple which must
be filled in by the caller to determine the socket to look up. The simplest
case would be filling with the contents of the packet, ie mapping the packet's
5-tuple into the parameter. In common cases, it may alternatively be useful to
reverse the direction of the tuple and perform a lookup, to find the socket
that initiates this connection; and if the BPF program ever performs a form of
IP address translation, it may further be useful to be able to look up
arbitrary tuples that are not based upon the packet, but instead based on state
held in BPF maps or hardcoded in the BPF program.

Currently, access into the socket's fields are limited to those which are
otherwise already accessible, and are restricted to read-only access.

Changes since v3:
* New patch: "bpf: Reuse canonical string formatter for ctx errs"
* Add PTR_TO_SOCKET to is_ctx_reg().
* Add a few new checks to prevent mixing of socket/non-socket pointers.
* Swap order of checks in sock_filter_is_valid_access().
* Prefix register spill macros with "bpf_".
* Add acks from previous round
* Rebase

Changes since v2:
* New patch: "selftests/bpf: Generalize dummy program types".
  This enables adding verifier tests for socket lookup with tail calls.
* Define the semantics of the new helpers more clearly in uAPI header.
* Fix release of caller_net when netns is not specified.
* Use skb->sk to find caller net when skb->dev is unavailable.
* Fix build with !CONFIG_NET.
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT).
* Remove flags argument to sk_release().
* Add several new assembly tests suggested by Daniel.
* Add a few new C tests.
* Fix typo in verifier error message.

Changes since v1:
* Limit netns_id field to 32 bits
* Reuse reg_type_mismatch() in more places
* Reduce the number of passes at convert_ctx_access()
* Replace ptr_id defensive coding when releasing reference state with an
  internal error (-EFAULT)
* Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
* Allow direct packet access from helper
* Fix compile error with CONFIG_IPV6 enabled
* Improve commit messages

Changes since RFC:
* Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
* Only take references on the socket when necessary.
  * Make sk_release() only free the socket reference in this case.
* Fix some runtime reference leaks:
  * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
  * Disallow bpf_tail_call() while holding a reference.
* Prevent the same instruction being used for reference and other
  pointer type.
* Simplify locating copies of a reference during helper calls by caching
  the pointer id from the caller.
* Fix kbuild compilation warnings with particular configs.
* Improve code comments describing the new verifier pieces.
* Tested by Nitin
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Documentation: Describe bpf reference tracking

Document the new pointer types in the verifier and how the pointer ID
tracking works to ensure that references which are taken are later
released.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Add C tests for reference tracking

Add some tests that demonstrate and test the balanced lookup/free
nature of socket lookup. Section names that start with "fail" represent
programs that are expected to fail verification; all others should
succeed.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Support loading individual progs

Allow the individual program load to be invoked. This will help with
testing, where a single ELF may contain several sections, some of which
denote subprograms that are expected to fail verification, along with
some which are expected to pass verification. By allowing programs to be
iterated and individually loaded, each program can be independently
checked against its expected verification result.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Add tests for reference tracking

reference tracking: leak potential reference
reference tracking: leak potential reference on stack
reference tracking: leak potential reference on stack 2
reference tracking: zero potential reference
reference tracking: copy and zero potential references
reference tracking: release reference without check
reference tracking: release reference
reference tracking: release reference twice
reference tracking: release reference twice inside branch
reference tracking: alloc, check, free in one subbranch
reference tracking: alloc, check, free in both subbranches
reference tracking in call: free reference in subprog
reference tracking in call: free reference in subprog and outside
reference tracking in call: alloc & leak reference in subprog
reference tracking in call: alloc in subprog, release outside
reference tracking in call: sk_ptr leak into caller stack
reference tracking in call: sk_ptr spill into caller stack
reference tracking: allow LD_ABS
reference tracking: forbid LD_ABS while holding reference
reference tracking: allow LD_IND
reference tracking: forbid LD_IND while holding reference
reference tracking: check reference or tail call
reference tracking: release reference then tail call
reference tracking: leak possible reference over tail call
reference tracking: leak checked reference over tail call
reference tracking: mangle and release sock_or_null
reference tracking: mangle and release sock
reference tracking: access member
reference tracking: write to member
reference tracking: invalid 64-bit access of member
reference tracking: access after release
reference tracking: direct access for lookup
unpriv: spill/fill of different pointers stx - ctx and sock
unpriv: spill/fill of different pointers stx - leak sock
unpriv: spill/fill of different pointers stx - sock and ctx (read)
unpriv: spill/fill of different pointers stx - sock and ctx (write)

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Generalize dummy program types

Don't hardcode the dummy program types to SOCKET_FILTER type, as this
prevents testing bpf_tail_call in conjunction with other program types.
Instead, use the program type specified in the test case.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add helper to retrieve socket in BPF

This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.

By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:

  struct bpf_sock_tuple tuple;
  struct bpf_sock_ops *sk;

  populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
  sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
  if (!sk) {
    // Couldn't find a socket listening for this traffic. Drop.
    return TC_ACT_SHOT;
  }
  bpf_sk_release(sk, 0);
  return TC_ACT_OK;

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add reference tracking to verifier

Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.

To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Macrofy stack state copy

An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add PTR_TO_SOCKET verifier type

Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Generalize ptr_or_null regs check

This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Reuse canonical string formatter for ctx errs

The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Simplify ptr_min_max_vals adjustment

An upcoming commit will add another two pointer types that need very
similar behaviour, so generalise this function now.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Add iterator for spilled registers

Add this iterator for spilled registers, it concentrates the details of
how to get the current frame's spilled registers into a single macro
while clarifying the intention of the code which is calling the macro.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-big-map-entries'

Jakub Kicinski says:

====================
This series makes the control message parsing for interacting
with BPF maps more flexible. Up until now we had a hard limit
in the ABI for key and value size to be 64B at most. Using
TLV capability allows us to support large map entries.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

nfp: bpf: allow control message sizing for map ops

In current ABI the size of the messages carrying map elements was
statically defined to at most 16 words of key and 16 words of value
(NFP word is 4 bytes). We should not make this assumption and use
the max key and value sizes from the BPF capability instead.

To make sure old kernels don't get surprised with larger (or smaller)
messages bump the FW ABI version to 3 when key/value size is different
than 16 words.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

nfp: allow apps to request larger MTU on control vNIC

Some apps may want to have higher MTU on the control vNIC/queue.
Allow them to set the requested MTU at init time.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

nfp: bpf: parse global BPF ABI version capability

Up until now we only had per-vNIC BPF ABI version capabilities,
which are slightly awkward to use because bulk of the resources
and configuration does not relate to any particular vNIC. Add
a new capability for global ABI version and check the per-vNIC
version are equal to it. Assume the ABI version 2 if no explicit
version capability is present.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-per-cpu-cgroup-storage'

Roman Gushchin says:

====================
This patchset implements per-cpu cgroup local storage and provides
an example how per-cpu and shared cgroup local storage can be used
for efficient accounting of network traffic.

v4->v3:
  1) incorporated Alexei's feedback

v3->v2:
  1) incorporated Song's feedback
  2) rebased on top of current bpf-next

v2->v1:
  1) added a selftest implementing network counters
  2) added a missing free() in cgroup local storage selftest
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: cgroup local storage-based network counters

This commit adds a bpf kselftest, which demonstrates how percpu
and shared cgroup local storage can be used for efficient lookup-free
network accounting.

Cgroup local storage provides generic memory area with a very efficient
lookup free access. To avoid expensive atomic operations for each
packet, per-cpu cgroup local storage is used. Each packet is initially
charged to a per-cpu counter, and only if the counter reaches certain
value (32 in this case), the charge is moved into the global atomic
counter. This allows to amortize atomic operations, keeping reasonable
accuracy.

The test also implements a naive network traffic throttling, mostly to
demonstrate the possibility of bpf cgroup--based network bandwidth
control.

Expected output:
./test_netcnt
test_netcnt:PASS

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

samples/bpf: extend test_cgrp2_attach2 test to use per-cpu cgroup storage

This commit extends the test_cgrp2_attach2 test to cover per-cpu
cgroup storage. Bpf program will use shared and per-cpu cgroup
storages simultaneously, so a better coverage of corresponding
core code will be achieved.

Expected output:
  $ ./test_cgrp2_attach2
  Attached DROP prog. This ping in cgroup /foo should fail...
  ping: sendmsg: Operation not permitted
  Attached DROP prog. This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS prog. This ping in cgroup /foo/bar should pass...
  Detached PASS from /foo/bar while DROP is attached to /foo.
  This ping in cgroup /foo/bar should fail...
  ping: sendmsg: Operation not permitted
  Attached PASS from /foo/bar and detached DROP from /foo.
  This ping in cgroup /foo/bar should pass...
  ### override:PASS
  ### multi:PASS

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: extend the storage test to test per-cpu cgroup storage

This test extends the cgroup storage test to use per-cpu flavor
of the cgroup storage as well.

The test initializes a per-cpu cgroup storage to some non-zero initial
value (1000), and then simple bumps a per-cpu counter each time
the shared counter is atomically incremented. Then it reads all
per-cpu areas from the userspace side, and checks that the sum
of values adds to the expected sum.

Expected output:
$ ./test_cgroup_storage
test_cgroup_storage:PASS

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: add verifier per-cpu cgroup storage tests

This commits adds verifier tests covering per-cpu cgroup storage
functionality. There are 6 new tests, which are exactly the same
as for shared cgroup storage, but do use per-cpu cgroup storage
map.

Expected output:
  $ ./test_verifier
  #0/u add+sub+mul OK
  #0/p add+sub+mul OK
  ...
  #286/p invalid cgroup storage access 6 OK
  #287/p valid per-cpu cgroup storage access OK
  #288/p invalid per-cpu cgroup storage access 1 OK
  #289/p invalid per-cpu cgroup storage access 2 OK
  #290/p invalid per-cpu cgroup storage access 3 OK
  #291/p invalid per-cpu cgroup storage access 4 OK
  #292/p invalid per-cpu cgroup storage access 5 OK
  #293/p invalid per-cpu cgroup storage access 6 OK
  #294/p multiple registers share map_lookup_elem result OK
  ...
  #662/p mov64 src == dst OK
  #663/p mov64 src != dst OK
  Summary: 914 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpftool: add support for PERCPU_CGROUP_STORAGE maps

This commit adds support for BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE
map type.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sync include/uapi/linux/bpf.h to tools/include/uapi/linux/bpf.h

The sync is required due to the appearance of a new map type:
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, which implements per-cpu
cgroup local storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: don't allow create maps of per-cpu cgroup local storages

Explicitly forbid creating map of per-cpu cgroup local storages.
This behavior matches the behavior of shared cgroup storages.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: introduce per-cpu cgroup local storage

This commit introduced per-cpu cgroup local storage.

Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.

The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.

>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).

Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: rework cgroup storage pointer passing

To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit a mechanism of passing a pointer to a cgroup
storage into the bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.

It will help us to handle per-cpu storage later, which has
a different way of accessing to the actual data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: extend cgroup bpf core to allow multiple cgroup storage types

In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.

This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.

Each bpf program is now able to have one cgroup storage of each type.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: permit CGROUP_DEVICE programs accessing helper bpf_get_current_cgroup_id()

Currently, helper bpf_get_current_cgroup_id() is not permitted
for CGROUP_DEVICE type of programs. If the helper is used
in such cases, the verifier will log the following error:

  0: (bf) r6 = r1
  1: (69) r7 = *(u16 *)(r6 +0)
  2: (85) call bpf_get_current_cgroup_id#80
  unknown func bpf_get_current_cgroup_id#80

The bpf_get_current_cgroup_id() is useful for CGROUP_DEVICE
type of programs in order to customize action based on cgroup id.
This patch added such a support.

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-libbpf-attach-by-name'

Andrey Ignatov says:

====================
This patch set introduces libbpf_attach_type_by_name function in libbpf
to identify attach type by section name.

This is useful to avoid writing same logic over and over again in user
space applications that leverage libbpf.

Patch 1 has more details on the new function and problem being solved.
Patches 2 and 3 add support for new section names.
Patch 4 uses new function in a selftest.
Patch 5 adds selftest for libbpf_{prog,attach}_type_by_name.

As a side note there are a lot of inconsistencies now between names used
by libbpf and bpftool (e.g. cgroup/skb vs cgroup_skb, cgroup_device and
device vs cgroup/dev, sockops vs sock_ops, etc). This patch set does not
address it but it tries not to make it harder to address it in the future.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Test libbpf_{prog,attach}_type_by_name

Add selftest for libbpf functions libbpf_prog_type_by_name and
libbpf_attach_type_by_name.

Example of output:
% ./tools/testing/selftests/bpf/test_section_names
Summary: 35 PASSED, 0 FAILED

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Use libbpf_attach_type_by_name in test_socket_cookie

Use newly introduced libbpf_attach_type_by_name in test_socket_cookie
selftest.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Support sk_skb/stream_{parser, verdict} section names

Add section names for BPF_SK_SKB_STREAM_PARSER and
BPF_SK_SKB_STREAM_VERDICT attach types to be able to identify them in
libbpf_attach_type_by_name.

"stream_parser" and "stream_verdict" are used instead of simple "parser"
and "verdict" just to avoid possible confusion in a place where attach
type is used alone (e.g. in bpftool's show sub-commands) since there is
another attach point that can be named as "verdict": BPF_SK_MSG_VERDICT.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Support cgroup_skb/{e,in}gress section names

Add section names for BPF_CGROUP_INET_INGRESS and BPF_CGROUP_INET_EGRESS
attach types to be able to identify them in libbpf_attach_type_by_name.

"cgroup_skb" is used instead of "cgroup/skb" mostly to easy possible
unifying of how libbpf and bpftool works with section names:
* bpftool uses "cgroup_skb" to in "prog list" sub-command;
* bpftool uses "ingress" and "egress" in "cgroup list" sub-command;
* having two parts instead of three in a string like "cgroup_skb/ingress"
can be leveraged to split it to prog_type part and attach_type part,
or vise versa: use two parts to make a section name.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: Introduce libbpf_attach_type_by_name

There is a common use-case when ELF object contains multiple BPF
programs and every program has its own section name. If it's cgroup-bpf
then programs have to be 1) loaded and 2) attached to a cgroup.

It's convenient to have information necessary to load BPF program
together with program itself. This is where section name works fine in
conjunction with libbpf_prog_type_by_name that identifies prog_type and
expected_attach_type and these can be used with BPF_PROG_LOAD.

But there is currently no way to identify attach_type by section name
and it leads to messy code in user space that reinvents guessing logic
every time it has to identify attach type to use with BPF_PROG_ATTACH.

The patch introduces libbpf_attach_type_by_name that guesses attach type
by section name if a program can be attached.

The difference between expected_attach_type provided by
libbpf_prog_type_by_name and attach_type provided by
libbpf_attach_type_by_name is the former is used at BPF_PROG_LOAD time
and can be zero if a program of prog_type X has only one corresponding
attach type Y whether the latter provides specific attach type to use
with BPF_PROG_ATTACH.

No new section names were added to section_names array. Only existing
ones were reorganized and attach_type was added where appropriate.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: test_bpf: add init_net to dev for flow_dissector

Latest changes in __skb_flow_dissect() assume skb->dev has valid nd_net.
However, this is not true for test_bpf. As a result, test_bpf.ko crashes
the system with the following stack trace:

[ 1133.716622] BUG: unable to handle kernel paging request at 0000000000001030
[ 1133.716623] PGD 8000001fbf7ee067
[ 1133.716624] P4D 8000001fbf7ee067
[ 1133.716624] PUD 1f6c1cf067
[ 1133.716625] PMD 0
[ 1133.716628] Oops: 0000 [#1] SMP PTI
[ 1133.716630] CPU: 7 PID: 40473 Comm: modprobe Kdump: loaded Not tainted 4.19.0-rc5-00805-gca11cc92ccd2 #1167
[ 1133.716631] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM12.5 12/06/2017
[ 1133.716638] RIP: 0010:__skb_flow_dissect+0x83/0x1680
[ 1133.716639] Code: 04 00 00 41 0f b7 44 24 04 48 85 db 4d 8d 14 07 0f 84 01 02 00 00 48 8b 43 10 48 85 c0 0f 84 e5 01 00 00 48 8b 80 a8 04 00 00 <48> 8b 90 30 10 00 00 48 85 d2 0f 84 dd 01 00 00 31 c0 b9 05 00 00
[ 1133.716640] RSP: 0018:ffffc900303c7a80 EFLAGS: 00010282
[ 1133.716642] RAX: 0000000000000000 RBX: ffff881fea0b7400 RCX: 0000000000000000
[ 1133.716643] RDX: ffffc900303c7bb4 RSI: ffffffff8235c3e0 RDI: ffff881fea0b7400
[ 1133.716643] RBP: ffffc900303c7b80 R08: 0000000000000000 R09: 000000000000000e
[ 1133.716644] R10: ffffc900303c7bb4 R11: ffff881fb6840400 R12: ffffffff8235c3e0
[ 1133.716645] R13: 0000000000000008 R14: 000000000000001e R15: ffffc900303c7bb4
[ 1133.716646] FS:  00007f54e75d3740(0000) GS:ffff881fff5c0000(0000) knlGS:0000000000000000
[ 1133.716648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1133.716649] CR2: 0000000000001030 CR3: 0000001f6c226005 CR4: 00000000003606e0
[ 1133.716649] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1133.716650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1133.716651] Call Trace:
[ 1133.716660]  ? sched_clock_cpu+0xc/0xa0
[ 1133.716662]  ? sched_clock_cpu+0xc/0xa0
[ 1133.716665]  ? log_store+0x1b5/0x260
[ 1133.716667]  ? up+0x12/0x60
[ 1133.716669]  ? skb_get_poff+0x4b/0xa0
[ 1133.716674]  ? __kmalloc_reserve.isra.47+0x2e/0x80
[ 1133.716675]  skb_get_poff+0x4b/0xa0
[ 1133.716680]  bpf_skb_get_pay_offset+0xa/0x10
[ 1133.716686]  ? test_bpf_init+0x578/0x1000 [test_bpf]
[ 1133.716690]  ? netlink_broadcast_filtered+0x153/0x3d0
[ 1133.716695]  ? free_pcppages_bulk+0x324/0x600
[ 1133.716696]  ? 0xffffffffa0279000
[ 1133.716699]  ? do_one_initcall+0x46/0x1bd
[ 1133.716704]  ? kmem_cache_alloc_trace+0x144/0x1a0
[ 1133.716709]  ? do_init_module+0x5b/0x209
[ 1133.716712]  ? load_module+0x2136/0x25d0
[ 1133.716715]  ? __do_sys_finit_module+0xba/0xe0
[ 1133.716717]  ? __do_sys_finit_module+0xba/0xe0
[ 1133.716719]  ? do_syscall_64+0x48/0x100
[ 1133.716724]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch fixes tes_bpf by using init_net in the dummy dev.

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpftool: Fix bpftool net output

Print `bpftool net` output to stdout instead of stderr. Only errors
should be printed to stderr. Regular output should go to stdout and this
is what all other subcommands of bpftool do, including --json and
--pretty formats of `bpftool net` itself.

Fixes: commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net-ipv4: remove 2 always zero parameters from ipv4_redirect()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: Add support for 2500Mbps SGMII

The mvneta controller can handle speeds up to 2500Mbps on the SGMII
interface. This relies on serdes configuration, the lane must be
configured at 3.125Gbps and we can't use in-band autoneg at that speed.

The main issue when supporting that speed on this particular controller
is that the link partner can send ethernet frames with a shortened
preamble, which if not explicitly enabled in the controller will cause
unexpected behaviours.

This was tested on Armada 385, with the comphy configuration done in
bootloader.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-vhost-improve-performance-when-enable-busyloop'

Tonghao Zhang says:

====================
net: vhost: improve performance when enable busyloop

This patches improve the guest receive performance.
On the handle_tx side, we poll the sock receive queue
at the same time. handle_rx do that in the same way.

For more performance report, see patch 4
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: vhost: add rx busy polling in tx path

This patch improves the guest receive performance.
On the handle_tx side, we poll the sock receive queue at the
same time. handle_rx do that in the same way.

We set the poll-us=100us and use the netperf to test throughput
and mean latency. When running the tests, the vhost-net kthread
of that VM, is alway 100% CPU. The commands are shown as below.

Rx performance is greatly improved by this patch. There is not
notable performance change on tx with this series though. This
patch is useful for bi-directional traffic.

netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"

Topology:
[Host] ->linux bridge -> tap vhost-net ->[Guest]

TCP_STREAM:
* Without the patch: 19842.95 Mbps, 6.50 us mean latency
* With the patch: 37598.20 Mbps, 3.43 us mean latency

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vhost: factor out busy polling logic to vhost_net_busy_poll()

Factor out generic busy polling logic and will be
used for in tx path in the next patch. And with the patch,
qemu can set differently the busyloop_timeout for rx queue.

To avoid duplicate codes, introduce the helper functions:
* sock_has_rx_data(changed from sk_has_rx_data)
* vhost_net_busy_poll_try_queue

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vhost: replace magic number of lock annotation

Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vhost: lock the vqs one by one

This patch changes the way that lock all vqs
at the same, to lock them one by one. It will
be used for next patch to avoid the deadlock.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: expose sk_state in tcp_retransmit_skb tracepoint

After sk_state exposed, we can get in which state this retransmission
occurs. That could give us more detail for dignostic.
For example, if this retransmission occurs in SYN_SENT state, it may
also indicates that the syn packet may be dropped on the remote peer due
to syn backlog queue full and then we could check the remote peer.

BTW,SYNACK retransmission is traced in tcp_retransmit_synack tracepoint.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: faraday: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: smsc: fix return type of ndo_start_xmit function

The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, so make sure the implementation in
this driver has returns 'netdev_tx_t' value, and change the function
return type to netdev_tx_t.

Found by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: liquidio: list usage cleanup

Trival cleanup, list_move_tail will implement the same function that
list_del() + list_add_tail() will do. hence just replace them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qed: list usage cleanup

Trival cleanup, list_move_tail will implement the same function that
list_del() + list_add_tail() will do. hence just replace them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-bridge-convert-bool-options-to-bits'

Nikolay Aleksandrov says:

====================
net: bridge: convert bool options to bits

A lot of boolean bridge options have been added around the net_bridge
structure resulting in holes and more importantly different cache lines
that need to be fetched in the fast path. This set moves all of those
to bits in a bitfield which resides in a hot cache line thus reducing
the size of net_bridge, the number of holes and the number of cache
lines needed for the fast path.
The set is also sent in preparation for new boolean options to avoid
spreading them in the structure and making new holes.
One nice side-effect is that we avoid potential race conditions by using
the bitops since some of the options were bits being directly set in
parallel risking hard to debug issues (has_ipv6_addr).

Before:
size: 1184, holes: 8, sum holes: 30
After:
size: 1160, holes: 3, sum holes: 7

Patch 01 is a trivial style fix
Patch 02 adds the new options bitfield and converts the vlan boolean
options to bits
Patches 03-08 convert the rest of the boolean options to bits
Patch 09 re-arranges a few fields in net_bridge to further reduce size

v2: patch 09: remove the comment about offload_fwd_mark in net_bridge and
leave it where it is now, thanks to Ido for spotting it
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: pack net_bridge better

Further reduce the size of net_bridge with 8 bytes and reduce the number of
holes in it:
Before: holes: 5, sum holes: 15
After: holes: 3, sum holes: 7

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert mtu_set_by_user to a bit

Convert the last remaining bool option to a bit thus reducing the overall
net_bridge size further by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert neigh_suppress_enabled option to a bit

Convert the neigh_suppress_enabled option to a bit.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert mcast options to bits

This patch converts the rest of the mcast options to bits. It also packs
the mcast options a little better by moving multicast_mld_version to an
existing hole, reducing the net_bridge size by 8 bytes.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert and rename mcast disabled

Convert mcast disabled to an option bit and while doing so convert the
logic to check if multicast is enabled instead. That is make the logic
follow the option value - if it's set then mcast is enabled and vice versa.
This avoids a few confusing places where we inverted the value that's being
set to follow the mcast_disabled logic.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert group_addr_set option to a bit

Convert group_addr_set internal bridge opt to a bit.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: convert nf call options to bits

No functional change, convert of nf_call_[ip|ip6|arp]tables to bits.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: add bitfield for options and convert vlan opts

Bridge options have usually been added as separate fields all over the
net_bridge struct taking up space and ending up in different cache lines.
Let's move them to a single bitfield to save up space and speedup lookups.
This patch adds a simple API for option modifying and retrieving using
bitops and converts the first user of the API - the bridge vlan options
(vlan_enabled and vlan_stats_enabled).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: make struct opening bracket consistent

Currently we have a mix of opening brackets on new lines and on the same
line, let's move them all on the same line.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 's390-net-next'

Julian Wiedmann says:

====================
s390/net: updates 2018-09-26

please apply one more series of cleanups and small improvements for qeth
to net-next. Note that one patch needs to touch both af_iucv and qeth, in
order to untangle their receive paths.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: remove duplicated carrier state tracking

The netdevice is always available, apply any carrier state changes to it
without caching them.
On a STARTLAN event (ie. carrier-up), defer updating the state to
qeth_core_hardsetup_card() in the subsequent recovery action.

Also remove the carrier-state checks from the xmit routines. Stopping
transmission on carrier-down is the responsibility of upper-level code
(eg see dev_direct_xmit()).

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: clean up drop conditions for received cmds

If qeth_check_ipa_data() consumed an event, there's no point in
processing it further. So drop it early, and make the surrounding code
a tiny bit more readable.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: re-indent qeth_check_ipa_data()

Pull one level of checking up into qeth_send_control_data_cb(), and
clean up an else-after-return. No functional change.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: consume local address events

We have no code that is waiting for these events, so just drop them when
they arrive.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: remove various redundant code

1. tracing iob->rc makes no sense when it hasn't been modified by the
   callback,
2. the qeth_dbf_list is declared with LIST_HEAD, which also initializes
   the list,
3. the ccwgroup core only calls the thaw/restore callbacks if the gdev
   is online, so we don't have to check for it again.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: remove CARD_FROM_CDEV helper

The cdev-to-card translation walks through two layers of drvdata,
with no locking or refcounting (where eg. the ccwgroup core only
accesses a cdev's drvdata while holding the ccwlock).

This might be safe for now, but any careless usage of the helper has the
potential for subtle races and use-after-free's. Luckily there's only
one occurrence where we _really_ need it (in qeth_irq()), for any other
user we can just pass through an appropriate card pointer.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: pass card pointer in iob callback

This allows us to remove the CARD_FROM_CDEV calls in the iob callbacks.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: re-use qeth_notify_skbs()

When not using the CQ, this allows us avoid the second skb queue walk
in qeth_release_skbs().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: remove additional skb refcount

This was presumably left over from back when qeth recursed into
dev_queue_xmit().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: replace open-coded skb_queue_walk()

To match the use of __skb_queue_purge(), also make the skb's enqueue in
qeth_fill_buffer() lockless.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/af_iucv: locate IUCV header via skb_network_header()

This patch attempts to untangle the TX and RX code in qeth from
af_iucv's respective HiperTransport path:
On the TX side, pointing skb_network_header() at the IUCV header
means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset
to access the header.
On the RX side, qeth pulls the (fake) L2 header off the skb like any
normal ethernet driver would. This makes working with the IUCV header
in af_iucv easier, since we no longer have to assume a fixed skb layout.

While at it, replace the open-coded length checks in af_iucv's RX path
with pskb_may_pull().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: on gdev release, reset drvdata

qeth_core_probe_device() sets the gdev's drvdata, but doesn't reset it
on a subsequent error. Move the (re-)setting around a bit, so that it
happens symmetrically on allocating/freeing the qeth_card struct.

This is no actual problem, as the ccwgroup core will discard the gdev
on a probe error. But from qeth's perspective the gdev is an external
resource, so it's best to manage it cleanly.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: fix discipline unload after setup error

Device initialization code usually first loads a subdriver
(via qeth_core_load_discipline()), and then runs its setup() callback.
If this fails, it rolls back the load via qeth_core_free_discipline().

qeth_core_free_discipline() expects the options.layer attribute to be
initialized, but on error in setup() that's currently not the case.
Resulting in misbalanced symbol_put() calls.

Fix this by setting options.layer when loading the subdriver.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: use DEFINE_MUTEX for qeth_mod_mutex

Consolidate declaration and initialization of a static variable.
While at it reduce its scope in qeth_core_load_discipline(), and simplify
the return logic accordingly.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

s390/qeth: convert layer attribute to enum

While the raw values are fixed due to their use in a sysfs attribute,
we can still use the proper QETH_DISCIPLINE_* enum within the driver.

Also move the initialization into qeth_set_initial_options(), along with
all other user-configurable fields.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: marvell: Fix build.

Local variable 'autoneg' doesn't even exist:

drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in this function); did you mean 'put_net'?
if (phydev->autoneg != autoneg || changed) {
^~~~~~~

Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset")
Reported-by:Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER

Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-09-25

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow for RX stack hardening by implementing the kernel's flow
   dissector in BPF. Idea was originally presented at netconf 2017 [0].
   Quote from merge commit:

     [...] Because of the rigorous checks of the BPF verifier, this
     provides significant security guarantees. In particular, the BPF
     flow dissector cannot get inside of an infinite loop, as with
     CVE-2013-4348, because BPF programs are guaranteed to terminate.
     It cannot read outside of packet bounds, because all memory accesses
     are checked. Also, with BPF the administrator can decide which
     protocols to support, reducing potential attack surface. Rarely
     encountered protocols can be excluded from dissection and the
     program can be updated without kernel recompile or reboot if a
     bug is discovered. [...]

   Also, a sample flow dissector has been implemented in BPF as part
   of this work, from Petar and Willem.

   [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

2) Add support for bpftool to list currently active attachment
   points of BPF networking programs providing a quick overview
   similar to bpftool's perf subcommand, from Yonghong.

3) Fix a verifier pruning instability bug where a union member
   from the register state was not cleared properly leading to
   branches not being pruned despite them being valid candidates,
   from Alexei.

4) Various smaller fast-path optimizations in XDP's map redirect
   code, from Jesper.

5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
   in bpftool, from Roman.

6) Remove a duplicate check in libbpf that probes for function
   storage, from Taeung.

7) Fix an issue in test_progs by avoid checking for errno since
   on success its value should not be checked, from Mauricio.

8) Fix unused variable warning in bpf_getsockopt() helper when
   CONFIG_INET is not configured, from Anders.

9) Fix a compilation failure in the BPF sample code's use of
   bpf_flow_keys, from Prashant.

10) Minor cleanups in BPF code, from Yue and Zhong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lantiq_gswip: Depend on HAS_IOMEM

The driver uses devm_ioremap_resource() which is only available when
CONFIG_HAS_IOMEM is set, make the driver depend on this config option.
User mode Linux does not have CONFIG_HAS_IOMEM set and the driver was
failing on this architecture.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-phy-Eliminate-unnecessary-soft'

Florian Fainelli says:

====================
net: phy: Eliminate unnecessary soft

This patch series eliminates unnecessary software resets of the PHY.
This should hopefully not break anybody's hardware; but I would
appreciate testing to make sure this is is the case.

Sorry for this long email list, I wanted to make sure I reached out to
all people who made changes to the Marvell PHY driver.

Thank you!

Changes since RFT:

- added Tested-by tags from Wang, Dongsheng, Andrew, Chris and Clemens
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: marvell: Avoid unnecessary soft reset

The BMCR.RESET bit on the Marvell PHYs has a special meaning in that
it commits the register writes into the HW for it to latch and be
configured appropriately. Doing software resets causes link drops, and
this is unnecessary disruption if nothing changed.

Determine from marvell_set_polarity()'s return code whether the register value
was changed and if it was, propagate that to the logic that hits the software
reset bit.

This avoids doing unnecessary soft reset if the PHY is configured in
the same state it was previously, this also eliminates the need for a
m88e1111_config_aneg() function since it now is the same as
marvell_config_aneg().

Tested-by: Wang, Dongsheng <dongsheng.wang@hxt-semitech.com>
Tested-by: Chris Healy <cphealy@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Stop with excessive soft reset

While consolidating the PHY reset in phy_init_hw() an unconditionaly
BMCR soft-reset I became quite trigger happy with those. This was later
on deactivated for the Generic PHY driver on the premise that a prior
software entity (e.g: bootloader) might have applied workarounds in
commit 0878fff1f42c ("net: phy: Do not perform software reset for
Generic PHY").

Since we have a hook to wire-up a soft_reset callback, just use that and
get rid of the call to genphy_soft_reset() entirely. This speeds up
initialization and link establishment for most PHYs out there that do
not require a reset.

Fixes: 87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()")
Tested-by: Wang, Dongsheng <dongsheng.wang@hxt-semitech.com>
Tested-by: Chris Healy <cphealy@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>