Andrey Ignatov [Sun, 12 Aug 2018 17:49:30 +0000 (10:49 -0700)]
selftests/bpf: Selftest for bpf_skb_ancestor_cgroup_id
Add selftests for bpf_skb_ancestor_cgroup_id helper.
test_skb_cgroup_id.sh prepares testing interface and adds tc qdisc and
filter for it using BPF object compiled from test_skb_cgroup_id_kern.c
program.
BPF program in test_skb_cgroup_id_kern.c gets ancestor cgroup id using
the new helper at different levels of cgroup hierarchy that skb belongs
to, including root level and non-existing level, and saves it to the map
where the key is the level of corresponding cgroup and the value is its
id.
To trigger BPF program, user space program test_skb_cgroup_id_user is
run. It adds itself into testing cgroup and sends UDP datagram to
link-local multicast address of testing interface. Then it reads cgroup
ids saved in kernel for different levels from the BPF map and compares
them with those in user space. They must be equal for every level of
ancestry.
Example of run:
# ./test_skb_cgroup_id.sh
Wait for testing link-local IP to become available ... OK
Note: 8 bytes struct bpf_elf_map fixup performed due to size mismatch!
[PASS]
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrey Ignatov [Sun, 12 Aug 2018 17:49:29 +0000 (10:49 -0700)]
selftests/bpf: Add cgroup id helpers to bpf_helpers.h
Add bpf_skb_cgroup_id and bpf_skb_ancestor_cgroup_id helpers to
bpf_helpers.h to use them in tests and samples.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrey Ignatov [Sun, 12 Aug 2018 17:49:28 +0000 (10:49 -0700)]
bpf: Sync bpf.h to tools/
Sync skb_ancestor_cgroup_id() related bpf UAPI changes to tools/.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Andrey Ignatov [Sun, 12 Aug 2018 17:49:27 +0000 (10:49 -0700)]
bpf: Introduce bpf_skb_ancestor_cgroup_id helper
== Problem description ==
It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.
Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.
It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).
At the same time it may be useful to apply a policy to container as a
whole.
If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:
* bpf_skb_cgroup_id will return id of immediate cgroup associated with
skb, i.e. if it's a sub-cgroup inside container, it can't be used to
identify container's cgroup;
* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
i.e. if there are N containers on a host and a policy has to be
applied to M of them (0 <= M <= N), it'd require M calls to
bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
load new BPF program.
== Solution ==
The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.
That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.
E.g. if there is a cgroup hierarchy:
root/
root/container1/
root/container1/app11/
root/container1/app11/sub-app-a/
root/container1/app12/
root/container2/
root/container2/app21/
root/container2/app22/
root/container2/app22/sub-app-b/
, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.
Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.
Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.
== Implementation details ==
The helper gets ancestor by walking parents up to specified level.
Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).
Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Sat, 11 Aug 2018 23:59:17 +0000 (01:59 +0200)]
bpf: decouple btf from seq bpf fs dump and enable more maps
Commit
a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and
699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.
The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Daniel Borkmann [Fri, 10 Aug 2018 23:58:47 +0000 (01:58 +0200)]
Merge branch 'bpf-reuseport-map'
Martin KaFai Lau says:
====================
This series introduces a new map type "BPF_MAP_TYPE_REUSEPORT_SOCKARRAY"
and a new prog type BPF_PROG_TYPE_SK_REUSEPORT.
Here is a snippet from a commit message:
"To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]"
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:31 +0000 (01:01 -0700)]
bpf: Test BPF_PROG_TYPE_SK_REUSEPORT
This patch add tests for the new BPF_PROG_TYPE_SK_REUSEPORT.
The tests cover:
- IPv4/IPv6 + TCP/UDP
- TCP syncookie
- TCP fastopen
- Cases when the bpf_sk_select_reuseport() returning errors
- Cases when the bpf prog returns SK_DROP
- Values from sk_reuseport_md
- outer_map => reuseport_array
The test depends on
commit
3eee1f75f2b9 ("bpf: fix bpf_skb_load_bytes_relative pkt length check")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:30 +0000 (01:01 -0700)]
bpf: test BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch adds tests for the new BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:29 +0000 (01:01 -0700)]
bpf: Sync bpf.h uapi to tools/
This patch sync include/uapi/linux/bpf.h to
tools/include/uapi/linux/
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:27 +0000 (01:01 -0700)]
bpf: Refactor ARRAY_SIZE macro to bpf_util.h
This patch refactors the ARRAY_SIZE macro to bpf_util.h.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:26 +0000 (01:01 -0700)]
bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.
It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.
One level of "__reuseport_attach_prog()" call is removed. The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.
The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:25 +0000 (01:01 -0700)]
bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:24 +0000 (01:01 -0700)]
bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:22 +0000 (01:01 -0700)]
net: Add ID (if needed) to sock_reuseport and expose reuseport_lock
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map. When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map. Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[]. The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]). The map_lookup_elem()
will be lockless.
This patch also adds an ID to sock_reuseport. A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map. It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk, this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).
The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow(). The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this
patch is a no-op.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Wed, 8 Aug 2018 08:01:21 +0000 (01:01 -0700)]
tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()". The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").
The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case. However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.
When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie. For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
was updated.
2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed
and then sk2 was added. Here, sk2 does not have ts_recent_stamp set.
There are other ordering that will trigger the similar situation
below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
"tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
all syncookies sent by sk1 will be handled (and rejected)
by sk2 while sk1 is still alive.
The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).
The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map. Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.
A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[] OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group. However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.
This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.
"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group. Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.
A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed. Before and
after the change is ~9Mpps in IPv4.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 10 Aug 2018 18:54:08 +0000 (20:54 +0200)]
Merge branch 'bpf-btf-for-htab-lru'
Yonghong Song says:
====================
Commit
a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of a pinned
hashmap. Without this patch set, user will get an error instead.
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_hash_map:
87907: {87907,87908}
57354: {37354,57355}
76625: {76625,76626}
...
Patch #1 fixed a bug in bpffs map_seq_next() function so that
all elements in the hash table will be traversed.
Patch #2 implemented map_seq_show_elem() and map_check_btf()
callback functions for hash and lru hash maps.
Patch #3 enhanced tools/testing/selftests/bpf/test_btf.c to
test bpffs hash and lru hash map pretty print.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 9 Aug 2018 15:55:21 +0000 (08:55 -0700)]
tools/bpf: add bpffs pretty print btf test for hash/lru_hash maps
Pretty print tests for hash/lru_hash maps are added in test_btf.c.
The btf type blob is the same as pretty print array map test.
The test result:
$ mount -t bpf bpf /sys/fs/bpf
$ ./test_btf -p
BTF pretty print array......OK
BTF pretty print hash......OK
BTF pretty print lru hash......OK
PASS:3 SKIP:0 FAIL:0
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 9 Aug 2018 15:55:20 +0000 (08:55 -0700)]
bpf: btf: add pretty print for hash/lru_hash maps
Commit
a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of
a pinned hashmap:
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_hash_map:
87907: {87907,87908}
57354: {37354,57355}
76625: {76625,76626}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 9 Aug 2018 15:55:19 +0000 (08:55 -0700)]
bpf: fix bpffs non-array map seq_show issue
In function map_seq_next() of kernel/bpf/inode.c,
the first key will be the "0" regardless of the map type.
This works for array. But for hash type, if it happens
key "0" is in the map, the bpffs map show will miss
some items if the key "0" is not the first element of
the first bucket.
This patch fixed the issue by guaranteeing to get
the first element, if the seq_show is just started,
by passing NULL pointer key to map_get_next_key() callback.
This way, no missing elements will occur for
bpffs hash table show even if key "0" is in the map.
Fixes: a26ca7c982cb5 ("bpf: btf: Add pretty print support to the basic arraymap")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 10 Aug 2018 14:12:22 +0000 (16:12 +0200)]
Merge branch 'bpf-veth-xdp-support'
Toshiaki Makita says:
====================
This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.
NIC -----------> veth===veth
(XDP) (redirect) (XDP)
In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.
Envisioned use-cases
--------------------
* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.
* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.
* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.
Implementation
--------------
This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
- patch 1: Export a function needed for veth XDP.
- patch 2-3: Basic implementation of veth XDP.
- patch 4-6: Add ndo_xdp_xmit.
- patch 7-9: Add XDP_TX and XDP_REDIRECT.
- patch 10: Performance optimization for multi-queue env.
Tests and performance numbers
-----------------------------
Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 of Xeon Silver 2.20 GHz cores.
pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)
The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).
veth XDP action Flows Mpps
================================
DROP 1 10.6
DROP 2 21.2
DROP 100 36.0
TX 1 5.0
TX 2 10.0
TX 100 31.0
I also measured netperf TCP_STREAM but was not so great performance due
to lack of tx/rx checksum offload and TSO, etc.
netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)
Direction Flows Gbps
==============================
external->veth 1 20.8
external->veth 2 23.5
external->veth 100 23.6
veth->external 1 9.0
veth->external 2 17.8
veth->external 100 22.9
Also tested doing ifup/down or load/unload a XDP program repeatedly
during processing XDP packets in order to check if enabling/disabling
NAPI is working as expected, and found no problems.
v8:
- Don't use xdp_frame pointer address to calculate skb->head, headroom,
and xdp_buff.data_hard_start.
v7:
- Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
use it instead of memset().
v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
events.
- Fix sparse warning around EXPORT_SYMBOL.
v5:
- Fix broken SOBs.
v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
to avoid per packet copy cost.
v3:
- Drop skb bulk xmit patch since it makes little performance
difference. The hotspot in TCP skb xmit at this point is checksum
computation in skb_segment and packet copy on XDP_REDIRECT due to
cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.
v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).
====================
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:18 +0000 (16:58 +0900)]
veth: Support per queue XDP ring
Move XDP and napi related fields from veth_priv to newly created veth_rq
structure.
When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.
When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not less than the number of peer txqs, but leave the possibility to
achieve bulk skb xmit in the future because txq lock would make it
possible to remove rxq ptr_ring lock.
v3:
- Add extack messages.
- Fix array overrun in veth_xmit.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:17 +0000 (16:58 +0900)]
veth: Add XDP TX and REDIRECT
This allows further redirection of xdp_frames like
NIC -> veth--veth -> veth--veth
(XDP) (XDP) (XDP)
The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.
v8:
- Don't use xdp_frame pointer address for data_hard_start of xdp_buff.
v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
xdp_mem_info.
v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:16 +0000 (16:58 +0900)]
xdp: Helpers for disabling napi_direct of xdp_return_frame
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.
This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.
A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
avoid per-frame copy cost.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:15 +0000 (16:58 +0900)]
bpf: Make redirect_info accessible from modules
We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.
v6:
- Fix sparse warning around EXPORT_SYMBOL.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:14 +0000 (16:58 +0900)]
veth: Add ndo_xdp_xmit
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought as calling another XDP program by XDP program using
REDIRECT, when the peer enables driver XDP.
Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.
v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
Add comments about it and check only MTU.
v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:13 +0000 (16:58 +0900)]
veth: Handle xdp_frames in xdp napi ring
This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.
v8:
- Don't use xdp_frame pointer address to calculate skb->head and
headroom.
v7:
- Use xdp_scrub_frame() instead of memset().
v3:
- Revert v2 change around rings and use a flag to differentiate skb and
xdp_frame, since bulk skb xmit makes little performance difference
for now.
v2:
- Use another ring instead of using flag to differentiate skb and
xdp_frame. This approach makes bulk skb transmit possible in
veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:12 +0000 (16:58 +0900)]
xdp: Helper function to clear kernel pointers in xdp_frame
xdp_frame has kernel pointers which should not be readable from bpf
programs. When we want to reuse xdp_frame region but it may be read by
bpf programs later, we can use this helper to clear kernel pointers.
This is more efficient than calling memset() for the entire struct.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:11 +0000 (16:58 +0900)]
veth: Avoid drops by oversized packets when XDP is enabled
Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.
Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.
v4:
- Don't auto-adjust MTU but cap max MTU.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:10 +0000 (16:58 +0900)]
veth: Add driver XDP
This is the basic implementation of veth driver XDP.
Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.
This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.
This introduces NAPI when XDP is enabled, because XDP is now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.
Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.
Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.
v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
events.
v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.
v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Toshiaki Makita [Fri, 3 Aug 2018 07:58:09 +0000 (16:58 +0900)]
net: Export skb_headers_offset_update
This is needed for veth XDP which does skb_copy_expand()-like operation.
v2:
- Drop skb_copy_header part because it has already been exported now.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Fri, 10 Aug 2018 14:07:50 +0000 (16:07 +0200)]
Merge branch 'bpf-sample-cpumap-lb'
Jesper Dangaard Brouer says:
====================
Background: cpumap moves the SKB allocation out of the driver code,
and instead allocate it on the remote CPU, and invokes the regular
kernel network stack with the newly allocated SKB.
The idea behind the XDP CPU redirect feature, is to use XDP as a
load-balancer step in-front of regular kernel network stack. But the
current sample code does not provide a good example of this. Part of
the reason is that, I have implemented this as part of Suricata XDP
load-balancer.
Given this is the most frequent feature request I get. This patchset
implement the same XDP load-balancing as Suricata does, which is a
symmetric hash based on the IP-pairs + L4-protocol.
The expected setup for the use-case is to reduce the number of NIC RX
queues via ethtool (as XDP can handle more per core), and via
smp_affinity assign these RX queues to a set of CPUs, which will be
handling RX packets. The CPUs that runs the regular network stack is
supplied to the sample xdp_redirect_cpu tool by specifying
the --cpu option multiple times on the cmdline.
I do note that cpumap SKB creation is not feature complete yet, and
more work is coming. E.g. given GRO is not implemented yet, do expect
TCP workloads to be slower. My measurements do indicate UDP workloads
are faster.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jesper Dangaard Brouer [Fri, 10 Aug 2018 12:03:02 +0000 (14:03 +0200)]
samples/bpf: xdp_redirect_cpu load balance like Suricata
This implement XDP CPU redirection load-balancing across available
CPUs, based on the hashing IP-pairs + L4-protocol. This equivalent to
xdp-cpu-redirect feature in Suricata, which is inspired by the
Suricata 'ippair' hashing code.
An important property is that the hashing is flow symmetric, meaning
that if the source and destination gets swapped then the selected CPU
will remain the same. This is helps locality by placing both directions
of a flows on the same CPU, in a forwarding/routing scenario.
The hashing INITVAL (
15485863 the 10^6th prime number) was fairly
arbitrary choosen, but experiments with kernel tree pktgen scripts
(pktgen_sample04_many_flows.sh +pktgen_sample05_flow_per_thread.sh)
showed this improved the distribution.
This patch also change the default loaded XDP program to be this
load-balancer. As based on different user feedback, this seems to be
the expected behavior of the sample xdp_redirect_cpu.
Link: https://github.com/OISF/suricata/commit/796ec08dd7a63
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jesper Dangaard Brouer [Fri, 10 Aug 2018 12:02:57 +0000 (14:02 +0200)]
samples/bpf: add Paul Hsieh's (LGPL 2.1) hash function SuperFastHash
Adjusted function call API to take an initval. This allow the API
user to set the initial value, as a seed. This could also be used for
inputting the previous hash.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Björn Töpel [Fri, 10 Aug 2018 09:28:02 +0000 (11:28 +0200)]
Revert "xdp: add NULL pointer check in __xdp_return()"
This reverts commit
36e0f12bbfd3016f495904b35e41c5711707509f.
The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.
For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David S. Miller [Thu, 9 Aug 2018 18:52:36 +0000 (11:52 -0700)]
Merge ra./pub/scm/linux/kernel/git/davem/net
Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst
adding some tracepoints.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Aug 2018 18:16:29 +0000 (11:16 -0700)]
Merge branch 'Add-support-for-XGMAC2-in-stmmac'
Jose Abreu says:
====================
Add support for XGMAC2 in stmmac
This series adds support for 10Gigabit IP in stmmac. The IP is called XGMAC2
and has many similarities with GMAC4. Due to this, its relatively easy to
incorporate this new IP into stmmac driver by adding a new block and
filling the necessary callbacks.
The functionality added by this series is still reduced but its only a
starting point which will later be expanded.
I splitted the patches into funcionality and to ease the review. Only the
patch 8/9 really enables the XGMAC2 block by adding a new compatible string.
Version 4 addresses review comments of Florian Fainelli and Rob Herring.
NOTE: Although the IP supports 10G, for now it was only possible to test it
at 1G speed due to 10G PHY HW shipping problems. Here follows iperf3
results at 1G:
Connecting to host 192.168.0.10, port 5201
[ 4] local 192.168.0.3 port 39178 connected to 192.168.0.10 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 110 MBytes 920 Mbits/sec 0 482 KBytes
[ 4] 1.00-2.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
[ 4] 2.00-3.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
[ 4] 3.00-4.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
[ 4] 4.00-5.00 sec 112 MBytes 935 Mbits/sec 0 482 KBytes
[ 4] 5.00-6.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
[ 4] 6.00-7.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
[ 4] 7.00-8.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
[ 4] 8.00-9.00 sec 112 MBytes 937 Mbits/sec 0 482 KBytes
[ 4] 9.00-10.00 sec 113 MBytes 946 Mbits/sec 0 482 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1.09 GBytes 940 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 1.09 GBytes 938 Mbits/sec receiver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:37 +0000 (09:04 +0100)]
dt-bindings: net: stmmac: Add the bindings documentation for XGMAC2.
Adds the documentation for XGMAC2 DT bindings.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: devicetree@vger.kernel.org
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:36 +0000 (09:04 +0100)]
net: stmmac: Add the bindings parsing for XGMAC2
Add the bindings parsing for XGMAC2 IP block.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:35 +0000 (09:04 +0100)]
net: stmmac: Integrate XGMAC into main driver flow
Now that we have all the XGMAC related callbacks, lets start integrating
this IP block into main driver.
Also, we corrected the initialization flow to only start DMA after
setting descriptors length.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:34 +0000 (09:04 +0100)]
net: stmmac: Add PTP support for XGMAC2
XGMAC2 uses the same engine of timestamping as GMAC4. Let's use the same
callbacks.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:33 +0000 (09:04 +0100)]
net: stmmac: Add MDIO related functions for XGMAC2
Add the MDIO related funcionalities for the new IP block XGMAC2.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:32 +0000 (09:04 +0100)]
net: stmmac: Add descriptor related callbacks for XGMAC2
Add the descriptor related callbacks for the new IP block XGMAC2.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:31 +0000 (09:04 +0100)]
net: stmmac: Add DMA related callbacks for XGMAC2
Add the DMA related callbacks for the new IP block XGMAC2.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:30 +0000 (09:04 +0100)]
net: stmmac: Add MAC related callbacks for XGMAC2
Add the MAC related callbacks for the new IP block XGMAC2.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jose Abreu [Wed, 8 Aug 2018 08:04:29 +0000 (09:04 +0100)]
net: stmmac: Add XGMAC 2.10 HWIF entry
Add a new entry to HWIF table for XGMAC 2.10. For now we fill it with
empty callbacks which will be added in posterior patches.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Aug 2018 18:08:21 +0000 (11:08 -0700)]
Merge branch 'More-complete-PHYLINK-support-for-mv88e6xxx'
Andrew Lunn says:
====================
More complete PHYLINK support for mv88e6xxx
Previous patches added sufficient PHYLINK support to the mv88e6xxx
that it did not break existing use cases, basically fixed-link phys.
This patchset builds out the support so that SFP modules, up to
2.5Gbps can be supported, on mv88e6390X, on ports 9 and 10. It also
provides a framework which can be extended to support SFPs on ports
2-8 of mv88e6390X, 10Gbps PHYs, and SFP support on the 6352 family.
Russell King did much of the initial work, implementing the validate
and mac_link_state calls. However, there is an important TODO in the
commit message:
needs to call phylink_mac_change() when the port link comes up/goes down.
The remaining patches implement this, by adding more support for the
SERDES interfaces, in particular, interrupt support so we get notified
when the SERDES gains/looses sync.
This has been tested on the ZII devel C, using a Clearfog as peer
device.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:49 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Re-setup interrupts on CMODE change.
When a port changes CMODE, the SERDES interface being used can change.
Disable interrupts for the old SERDES interface, and enable interrupts
on the new.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:48 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Add SERDES phydev_mac_change up for 6390
phylink wants to know when the MAC layers notices a change in the
link. For the 6390 family, this is a change in the SERDES state.
Add interrupt support for the SERDES interface used to implement
SGMII/1000Base-X/2500Base-X. This is currently limited to ports 9 and
10. Support for the 10G SERDES and other ports will be added later,
building on this basic framework.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:47 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: link mv88e6xxx_port to mv88e6xxx_chip
An up coming change will register interrupts for individual switch
ports, using the mv88e6xxx_port as the interrupt context information.
Add members to the mv88e6xxx_port structure so we can link it back to
the mv88e6xxx_chip member the port belongs to and the port number of
the port.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:46 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Power on/off SERDES on cmode change
The 6390 family has a number of SERDES interfaces per port. When the
cmode changes, eg 1000Base-X to XAUI, the SERDES interface in use will
also change. Power down the old SERDES interface and power up the new
SERDES interface.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:45 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Cache the port cmode
The ports CMODE indicates the type of link between the MAC and the
PHY. It is used often in the SERDES code. Rather than read it each
time, cache its value.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:44 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: 2500Base-X uses the 1000Base-X SERDES
The 6390 has three different SERDES interface types. 2500Base-X is
implemented by the SGMII/1000Base-X SERDES. So power on/off the
correct SERDES.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:43 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Add serdes register read/write helper
Add a helper for accessing SERDES registers of the 6390 family.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:42 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Rename sgmii/10g power functions
There is a need to add more functions manipulating the SERDES
interfaces. Cleanup the namespace.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:41 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: 6390 vs 6390X SERDES support
The 6390 has two SERDES interfaces, used by ports 9 and 10. The 6390X
has eight SERDES interfaces. These allow ports 9 and 10 to do 10G. Or
if lower speeds are used, some of the SERDES interfaces can be used by
ports 2-8 for 1000Base-X.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:40 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Refactor SERDES lane code
The 6390 family has 8 SERDES lanes. What ports use these lanes depends
on how ports 9 and 10 are configured. If 9 and 10 does not make use of
a line, one of the lower ports can use it.
Add a function to return the lane a port is using, if any, and simplify
the code to power up/down the lane.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Thu, 9 Aug 2018 13:38:39 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: add phylink support
Add rudimentary phylink support to mv88e6xxx.
TODO:
- needs to call phylink_mac_change() when the port link comes up/goes down.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Thu, 9 Aug 2018 13:38:38 +0000 (15:38 +0200)]
phylink: add helper for configuring 2500BaseX modes
Add a helper for MAC drivers to use in their validate callback to deal
with 2500BaseX vs 1000BaseX modes, where the hardware supports both
but it is not possible to automatically select between them.
This helper defaults to 1000BaseX, as that is the 802.3 standard, and
will allow users to select 2500BaseX either by forcing the speed if
AN is disabled, or by changing the advertising mask if AN is enabled.
Disabling AN is not recommended as it is only the speed that we're
interested in controlling, not the duplex or pause mode parameters.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Thu, 9 Aug 2018 13:38:37 +0000 (15:38 +0200)]
net: dsa: mv88e6xxx: Add support to enabling pause
The 6185 can enable/disable 802.3z pause be setting the MyPause bit in
the port status register. Add an op to support this.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Aug 2018 17:36:11 +0000 (10:36 -0700)]
Merge branch 'mlxsw-Various-updates'
Ido Schimmel says:
====================
mlxsw: Various updates
Patches 1-3 update the driver to use a new firmware version. Due to a
recently discovered issue, the version (and future ones) does not
support matching on VLAN ID at egress. This is enforced in the driver
and reported back to the user via extack.
Patch 4 adds a new selftest for the recently introduced algorithmic
TCAM.
Patch 5 converts the driver to use SPDX identifiers.
Patches 6-7 fix a bug in ethtool stats reporting and expose counters for
all 16 TCs, following recent MC-aware changes that utilize TCs 8-15.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 9 Aug 2018 08:59:13 +0000 (11:59 +0300)]
mlxsw: spectrum: Expose counter for all 16 TCs
Before MC-aware mode was enabled in commit
7b8195306694 ("mlxsw:
spectrum: Configure MC-aware mode on mlxsw ports"), only 8 traffic
classes were used. Under MC-aware regime, however, besides using TCs
0-7 for UC traffic, it additionally uses TCs 8-15 for BUM traffic. It
is therefore desirable to show counters for these TCs as well.
Update ethtool stats pool length, mlxsw_sp_port_get_strings() and
mlxsw_sp_port_get_stats() to include artifacts for all 16 TCs. For
consistency and simplicity, expose tc_no_buffer_discard_uc_tc for BUM
TCs as well, even though it ought to stay at 0 all the time.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 9 Aug 2018 08:59:12 +0000 (11:59 +0300)]
mlxsw: spectrum: Include RFC-2819 counters in stats length
The function mlxsw_sp_port_get_sset_count() is supposed to return the
total number of ethtool strings that mlxsw supports. Specifically for
names of statistic counters (the only string type that mlxsw supports
as of now), that number is stored in MLXSW_SP_PORT_ETHTOOL_STATS_LEN.
However, when adding RFC-2891 counters, that define wasn't updated to
include the new counters. As a result, ethtool snips out the counters
towards the end of the list, which contains per-TC counters, and only
the first three traffic classes end up being reported.
Fix by adding MLXSW_SP_PORT_HW_RFC_2819_STATS_LEN as appropriate.
Fixes: 1222d15a01c7 ("mlxsw: spectrum: Expose counters for various packet sizes")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 9 Aug 2018 08:59:11 +0000 (11:59 +0300)]
mlxsw: Replace license text with SPDX identifiers and adjust copyrights
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 9 Aug 2018 08:59:10 +0000 (11:59 +0300)]
selftests: mlxsw: Add TC flower test for Spectrum-2
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 9 Aug 2018 08:59:09 +0000 (11:59 +0300)]
mlxsw: spectrum: Reset FW after flash
Recent FW fixes a bug and allows to load newly flashed FW image after
reset. So make sure the reset happens after flash. Indicate the need
down to PCI layer by -EAGAIN.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nir Dotan [Thu, 9 Aug 2018 08:59:08 +0000 (11:59 +0300)]
mlxsw: spectrum: Update the supported firmware to version 13.1702.6
This new firmware contains:
- Support for new types of cables
- Support for flashing future firmware without reboot
- Support for Router ARP BC and UC traps
Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nir Dotan [Thu, 9 Aug 2018 08:59:07 +0000 (11:59 +0300)]
mlxsw: spectrum_flower: Disallow usage of vlan_id key on egress
As recent spectrum FW imposes a limitation on using vlan_id key for
egress ACL, disallow the usage of that key accordingly and return a
proper extack message.
Signed-off-by: Nir Dotan <nird@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 9 Aug 2018 17:00:15 +0000 (10:00 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This fixes a performance regression in arm64 NEON crypto as well as a
crash in x86 aegis/morus on unsupported CPUs"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/aegis,morus - Fix and simplify CPUID checks
crypto: arm64 - revert NEON yield for fast AEAD implementations
Linus Torvalds [Thu, 9 Aug 2018 16:57:13 +0000 (09:57 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) The real fix for the ipv6 route metric leak Sabrina was seeing, from
Cong Wang.
2) Fix syzbot triggers AF_PACKET v3 ring buffer insufficient room
conditions, from Willem de Bruijn.
3) vsock can reinitialize active work struct, fix from Cong Wang.
4) RXRPC keepalive generator can wedge a cpu, fix from David Howells.
5) Fix locking in AF_SMC ioctl, from Ursula Braun.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
dsa: slave: eee: Allow ports to use phylink
net/smc: move sock lock in smc_ioctl()
net/smc: allow sysctl rmem and wmem defaults for servers
net/smc: no shutdown in state SMC_LISTEN
net: aquantia: Fix IFF_ALLMULTI flag functionality
rxrpc: Fix the keepalive generator [ver #2]
net/mlx5e: Cleanup of dcbnl related fields
net/mlx5e: Properly check if hairpin is possible between two functions
vhost: reset metadata cache when initializing new IOTLB
llc: use refcount_inc_not_zero() for llc_sap_find()
dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()
tipc: fix an interrupt unsafe locking scenario
vsock: split dwork to avoid reinitializations
net: thunderx: check for failed allocation lmac->dmacs
cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts
packet: refine ring v3 block size test to hold one frame
ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit
ipv6: fix double refcount of fib6_metrics
David S. Miller [Thu, 9 Aug 2018 02:34:55 +0000 (19:34 -0700)]
Merge branch 'mlx5-next'
Saeed Mahameed says:
====================
Mellanox, mlx5 next updates 2018-08-09
This series includes mlx5 core driver updates and mostly simple
cleanups.
From Denis: Use max #EQs reported by firmware to request MSIX vectors.
From Eli: Trivial cleanups, unused arguments/functions and reduce
command polling interval when command interface is in polling mode.
From Eran: Rename vport state enums, to better reflect their actual
usage.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Wed, 8 Aug 2018 23:23:53 +0000 (16:23 -0700)]
net/mlx5: Reduce command polling interval
Use cond_resched() instead of usleep_range() to decrease the time
between polling attempts thus reducing overall driver load time.
Below is a comparison before and after the change, of loading eight
virtual functions.
Before:
real 0m8.785s
user 0m0.093s
sys 0m0.090s
After:
real 0m5.730s
user 0m0.097s
sys 0m0.087s
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Wed, 8 Aug 2018 23:23:52 +0000 (16:23 -0700)]
net/mlx5: Unexport functions that need not be exported
mlx5_query_vport_state() and mlx5_modify_vport_admin_state() are used
only from within mlx5_core - unexport them.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Wed, 8 Aug 2018 23:23:51 +0000 (16:23 -0700)]
net/mlx5: Remove unused mlx5_query_vport_admin_state
mlx5_query_vport_admin_state() is not used anywhere. Remove it.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Wed, 8 Aug 2018 23:23:50 +0000 (16:23 -0700)]
net/mlx5: E-Switch, Remove unused argument when creating legacy FDB
Remove unused nvports argument.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Wed, 8 Aug 2018 23:23:49 +0000 (16:23 -0700)]
net/mlx5: Rename modify/query_vport state related enums
Modify and query vport state commands share the same admin_state and
op_mod values, rename the enums to fit them both.
In addition, remove the esw prefix from the admin state enum as this
also applied for vnic.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Drozdov [Wed, 8 Aug 2018 23:23:48 +0000 (16:23 -0700)]
net/mlx5: Use max_num_eqs for calculation of required MSIX vectors
New firmware has defined new HCA capability field called "max_num_eqs",
that is the number of available EQs after subtracting reserved FW EQs.
Before this capability the FW reported the EQ number in "log_max_eqs",
the reported value also contained FW reserved EQs, but the driver might
be failing to load on 320 cpus systems due to the fact that FW
reserved EQs were not available to the driver.
Now the driver has to obtain max_num_eqs value from new FW to get real
number of EQs available.
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 8 Aug 2018 18:56:40 +0000 (20:56 +0200)]
dsa: slave: eee: Allow ports to use phylink
For a port to be able to use EEE, both the MAC and the PHY must
support EEE. A phy can be provided by both a phydev or phylink. Verify
at least one of these exist, not just phydev.
Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Aug 2018 02:14:23 +0000 (19:14 -0700)]
Merge branch 'smc-fixes'
Ursula Braun says:
====================
net/smc: fixes 2018-08-08
here are small fixes for SMC: The first patch makes sure, shutdown code
is not executed for sockets in state SMC_LISTEN. The second patch resets
send and receive buffer values for accepted sockets, since TCP buffer size
optimizations for the internal CLC socket should not be forwarded to the
outer SMC socket. The third patch solves a race between connect and ioctl
reported by syzbot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Wed, 8 Aug 2018 12:13:21 +0000 (14:13 +0200)]
net/smc: move sock lock in smc_ioctl()
When an SMC socket is connecting it is decided whether fallback to
TCP is needed. To avoid races between connect and ioctl move the
sock lock before the use_fallback check.
Reported-by: syzbot+5b2cece1a8ecb2ca77d8@syzkaller.appspotmail.com
Reported-by: syzbot+19557374321ca3710990@syzkaller.appspotmail.com
Fixes: 1992d99882af ("net/smc: take sock lock in smc_ioctl()")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Wed, 8 Aug 2018 12:13:20 +0000 (14:13 +0200)]
net/smc: allow sysctl rmem and wmem defaults for servers
Without setsockopt SO_SNDBUF and SO_RCVBUF settings, the sysctl
defaults net.ipv4.tcp_wmem and net.ipv4.tcp_rmem should be the base
for the sizes of the SMC sndbuf and rcvbuf. Any TCP buffer size
optimizations for servers should be ignored.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Wed, 8 Aug 2018 12:13:19 +0000 (14:13 +0200)]
net/smc: no shutdown in state SMC_LISTEN
Invoking shutdown for a socket in state SMC_LISTEN does not make
sense. Nevertheless programs like syzbot fuzzing the kernel may
try to do this. For SMC this means a socket refcounting problem.
This patch makes sure a shutdown call for an SMC socket in state
SMC_LISTEN simply returns with -ENOTCONN.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bogdanov [Wed, 8 Aug 2018 11:06:32 +0000 (14:06 +0300)]
net: aquantia: Fix IFF_ALLMULTI flag functionality
It was noticed that NIC always pass all multicast traffic to the host
regardless of IFF_ALLMULTI flag on the interface.
The rule in MC Filter Table in NIC, that is configured to accept any
multicast packets, is turning on if IFF_MULTICAST flag is set on the
interface. It leads to passing all multicast traffic to the host.
This fix changes the condition to turn on that rule by checking
IFF_ALLMULTI flag as it should.
Fixes: b21f502f84be ("net:ethernet:aquantia: Fix for multicast filter handling.")
Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Wed, 8 Aug 2018 10:30:02 +0000 (11:30 +0100)]
rxrpc: Fix the keepalive generator [ver #2]
AF_RXRPC has a keepalive message generator that generates a message for a
peer ~20s after the last transmission to that peer to keep firewall ports
open. The implementation is incorrect in the following ways:
(1) It mixes up ktime_t and time64_t types.
(2) It uses ktime_get_real(), the output of which may jump forward or
backward due to adjustments to the time of day.
(3) If the current time jumps forward too much or jumps backwards, the
generator function will crank the base of the time ring round one slot
at a time (ie. a 1s period) until it catches up, spewing out VERSION
packets as it goes.
Fix the problem by:
(1) Only using time64_t. There's no need for sub-second resolution.
(2) Use ktime_get_seconds() rather than ktime_get_real() so that time
isn't perceived to go backwards.
(3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
parts:
(a) The "worker" function that manages the buckets and the timer.
(b) The "dispatch" function that takes the pending peers and
potentially transmits a keepalive packet before putting them back
in the ring into the slot appropriate to the revised last-Tx time.
(4) Taking everything that's pending out of the ring and splicing it into
a temporary collector list for processing.
In the case that there's been a significant jump forward, the ring
gets entirely emptied and then the time base can be warped forward
before the peers are processed.
The warping can't happen if the ring isn't empty because the slot a
peer is in is keepalive-time dependent, relative to the base time.
(5) Limit the number of iterations of the bucket array when scanning it.
(6) Set the timer to skip any empty slots as there's no point waking up if
there's nothing to do yet.
This can be triggered by an incoming call from a server after a reboot with
AF_RXRPC and AFS built into the kernel causing a peer record to be set up
before userspace is started. The system clock is then adjusted by
userspace, thereby potentially causing the keepalive generator to have a
meltdown - which leads to a message like:
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
...
Workqueue: krxrpcd rxrpc_peer_keepalive_worker
EIP: lock_acquire+0x69/0x80
...
Call Trace:
? rxrpc_peer_keepalive_worker+0x5e/0x350
? _raw_spin_lock_bh+0x29/0x60
? rxrpc_peer_keepalive_worker+0x5e/0x350
? rxrpc_peer_keepalive_worker+0x5e/0x350
? __lock_acquire+0x3d3/0x870
? process_one_work+0x110/0x340
? process_one_work+0x166/0x340
? process_one_work+0x110/0x340
? worker_thread+0x39/0x3c0
? kthread+0xdb/0x110
? cancel_delayed_work+0x90/0x90
? kthread_stop+0x70/0x70
? ret_from_fork+0x19/0x24
Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Aug 2018 02:07:38 +0000 (19:07 -0700)]
Merge branch 'mlx5-fixes'
Saeed Mahameed says:
====================
Mellanox, mlx5e fixes 2018-08-07
I know it is late into 4.18 release, and this is why I am submitting
only two mlx5e ethernet fixes.
The first one from Or, is needed for -stable and it fixes hairpin
for "same device" check.
The second fix is a non risk fix from Huy which cleans up and improves
error return value reporting for dcbnl_ieee_setapp.
For -stable v4.16
- net/mlx5e: Properly check if hairpin is possible between two functions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Huy Nguyen [Wed, 8 Aug 2018 22:48:08 +0000 (15:48 -0700)]
net/mlx5e: Cleanup of dcbnl related fields
Remove unused netdev_registered_init/remove in en.h
Return ENOSUPPORT if the check MLX5_DSCP_SUPPORTED fails.
Remove extra white space
Fixes: 2a5e7a1344f4 ("net/mlx5e: Add dcbnl dscp to priority support")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Cc: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Wed, 8 Aug 2018 22:48:07 +0000 (15:48 -0700)]
net/mlx5e: Properly check if hairpin is possible between two functions
The current check relies on function BDF addresses and can get
us wrong e.g when two VFs are assigned into a VM and the PCI
v-address is set by the hypervisor.
Fixes: 5c65c564c962 ('net/mlx5e: Support offloading TC NIC hairpin flows')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Alaa Hleihel <alaa@mellanox.com>
Tested-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Wed, 8 Aug 2018 03:10:39 +0000 (03:10 +0000)]
ieee802154: hwsim: fix missing unlock on error in hwsim_add_one()
Add the missing unlock before return from function hwsim_add_one()
in the error handling case.
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Wed, 8 Aug 2018 02:43:46 +0000 (02:43 +0000)]
ieee802154: hwsim: fix copy-paste error in hwsim_set_edge_lqi()
The return value from kzalloc() is not checked correctly. The
test is done against a wrong variable. This patch fix it.
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Tue, 7 Aug 2018 23:32:49 +0000 (19:32 -0400)]
ieee802154: hwsim: fix rcu handling
This patch adds missing rcu_assign_pointer()/rcu_dereference() to used rcu
pointers. There was already a previous commit
c5d99d2b35da ("ieee802154:
hwsim: fix rcu address annotation"), but there was more which was
pointed out on my side by using newest sparse version.
Cc: Stefan Schmidt <stefan@datenfreihafen.org>
Fixes: f25da51fdc38 ("ieee802154: hwsim: add replacement for fakelb")
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John David Anglin [Sun, 5 Aug 2018 17:30:31 +0000 (13:30 -0400)]
parisc: Define mb() and add memory barriers to assembler unlock sequences
For years I thought all parisc machines executed loads and stores in
order. However, Jeff Law recently indicated on gcc-patches that this is
not correct. There are various degrees of out-of-order execution all the
way back to the PA7xxx processor series (hit-under-miss). The PA8xxx
series has full out-of-order execution for both integer operations, and
loads and stores.
This is described in the following article:
http://web.archive.org/web/
20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml
For this reason, we need to define mb() and to insert a memory barrier
before the store unlocking spinlocks. This ensures that all memory
accesses are complete prior to unlocking. The ldcw instruction performs
the same function on entry.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Sat, 28 Jul 2018 09:47:17 +0000 (11:47 +0200)]
parisc: Enable CONFIG_MLONGCALLS by default
Enable the -mlong-calls compiler option by default, because otherwise in most
cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F
relocations. This fixes building the 64-bit defconfig.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
Zhao Chen [Wed, 8 Aug 2018 06:37:30 +0000 (06:37 +0000)]
net-next: hinic: fix a problem in free_tx_poll()
This patch fixes the problem below. The problem can be reproduced by the
following steps:
1) Connecting all HiNIC interfaces
2) On server side
# sudo ifconfig eth0 192.168.100.1 up #Using MLX CX4 card
# iperf -s
3) On client side
# sudo ifconfig eth0 192.168.100.2 up #Using our HiNIC card
# iperf -c 192.168.101.1 -P 10 -t 100000
after hours of testing, we will see errors:
hinic 0000:05:00.0: No MGMT msg handler, mod = 0
hinic 0000:05:00.0: No MGMT msg handler, mod = 0
hinic 0000:05:00.0: No MGMT msg handler, mod = 0
hinic 0000:05:00.0: No MGMT msg handler, mod = 0
The errors are caused by the following problem.
1) The hinic_get_wqe() checks the "wq->delta" to allocate new WQEs:
if (atomic_sub_return(num_wqebbs, &wq->delta) <= 0) {
atomic_add(num_wqebbs, &wq->delta);
return ERR_PTR(-EBUSY);
}
If the WQE occupies multiple pages, the shadow WQE will be used. Then the
hinic_xmit_frame() fills the WQE.
2) While in parallel with 1), the free_tx_poll() checks the "wq->delta"
to free old WQEs:
if ((atomic_read(&wq->delta) + num_wqebbs) > wq->q_depth)
return ERR_PTR(-EBUSY);
There is a probability that the shadow WQE which hinic_xmit_frame() is
using will be damaged by copy_wqe_to_shadow():
if (curr_pg != end_pg) {
void *shadow_addr = &wq->shadow_wqe[curr_pg * wq->max_wqe_size];
copy_wqe_to_shadow(wq, shadow_addr, num_wqebbs, *cons_idx);
return shadow_addr;
}
This can cause WQE data error and you will see the above error messages.
This patch fixes the problem.
Signed-off-by: Zhao Chen <zhaochen6@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Wed, 8 Aug 2018 03:43:04 +0000 (11:43 +0800)]
vhost: reset metadata cache when initializing new IOTLB
We need to reset metadata cache during new IOTLB initialization,
otherwise the stale pointers to previous IOTLB may be still accessed
which will lead a use after free.
Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zhong jiang [Tue, 7 Aug 2018 11:20:09 +0000 (19:20 +0800)]
net:mod: remove unneeded variable 'ret' in init_p9
The ret is modified after initalization, so just remove it and
return 0.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zhong jiang [Tue, 7 Aug 2018 11:20:08 +0000 (19:20 +0800)]
net:af_iucv: get rid of the unneeded variable 'err' in afiucv_pm_freeze
We will not use the variable 'err' after initalization, So remove it and
return 0.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moritz Fischer [Tue, 7 Aug 2018 23:35:20 +0000 (16:35 -0700)]
net: nixge: Get rid of unused struct member 'last_link'
Get rid of unused struct member 'last_link'
Signed-off-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 8 Aug 2018 00:54:21 +0000 (17:54 -0700)]
Merge branch 'net-ethernet-Mark-expected-switch-fall-throughs'
Gustavo A. R. Silva says:
====================
net: ethernet: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, this patchset aims
to add some annotations in order to mark switch cases where we are
expecting to fall through.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Tue, 7 Aug 2018 23:32:19 +0000 (18:32 -0500)]
net: ethernet: ti: cpts: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 114813 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Tue, 7 Aug 2018 23:31:46 +0000 (18:31 -0500)]
net: tlan: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 141440 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Tue, 7 Aug 2018 23:31:16 +0000 (18:31 -0500)]
net: sfc: falcon: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID:
1384500 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>