David S. Miller [Mon, 4 Dec 2017 16:33:02 +0000 (11:33 -0500)]
Merge branch 'rtnetlink-rework-handler-registration'
Florian Westphal says:
====================
rtnetlink: rework handler (un)registering
Peter Zijlstra reported (referring to commit 019a316992ee0d983,
"rtnetlink: add reference counting to prevent module unload while dump is in progress"):
1) it is not in fact a refcount, so using refcount_t is silly
2) there is a distinct lack of memory barriers, so we can easily
observe the decrement while the msg_handler is still in progress.
3) waiting with a schedule()/yield() loop is complete crap and subject
to live-locks, imagine doing that rtnl_unregister_all() from a RT task.
In ancient times rtnetlink exposed a statically-sized table with
preset doit/dumpit handlers to be called for a protocol/type pair.
Later the rtnl_register interface was added and the table was allocated
on demand. Eventually these were also used by modules.
Problem is that nothing prevents module unload while a netlink dump
is in progress. netlink dumps can span multiple recv calls and
netlink core saves the to-be-repeated dumper address for later invocation.
To prevent rmmod the netlink core expects callers to pass in the owning
module so a reference can be taken.
So far rtnetlink wasn't doing this, add new interface to pass THIS_MODULE.
Moreover, when converting parts of the rtnetlink handling to rcu this code
gained way too many READ_ONCE spots, remove them and the extra refcounting.
Take a module reference when running dumpit and doit callbacks
and never alter content of rtnl_link structures after they have been
published via rcu_assign_pointer.
Based partially on earlier patch from Peter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Sat, 2 Dec 2017 20:44:08 +0000 (21:44 +0100)]
rtnetlink: remove __rtnl_register
This removes __rtnl_register and switches callers to either
rtnl_register or rtnl_register_module.
Also, rtnl_register() will now print an error if memory allocation
failed rather than panic the kernel.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Sat, 2 Dec 2017 20:44:07 +0000 (21:44 +0100)]
net: use rtnl_register_module where needed
All of these can be compiled as a module, so use the new
_module version to make sure the module can no longer be removed
while a callback/dump is in use.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Sat, 2 Dec 2017 20:44:06 +0000 (21:44 +0100)]
rtnetlink: get reference on module before invoking handlers
Add yet another rtnl_register function. It will be used by modules
that can be removed.
The passed module struct is used to prevent module unload while
a netlink dump is in progress or when a DOIT_UNLOCKED doit callback
is called.
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Sat, 2 Dec 2017 20:44:05 +0000 (21:44 +0100)]
net: rtnetlink: use rcu to free rtnl message handlers
rtnetlink is littered with READ_ONCE() because we can have read accesses
while another cpu can write to the structure we're reading by
(un)registering doit or dumpit handlers.
This patch changes this so that (un)registering cpu allocates a new
structure and then publishes it via rcu_assign_pointer, i.e. once
another cpu can see such pointer no modifications will occur anymore.
based on initial patch from Peter Zijlstra.
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
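The copy-then-publish scheme described in "net: rtnetlink: use rcu to free rtnl message handlers" can be sketched in userspace C11. This is only an illustration under stated assumptions: struct link_handlers, register_handlers() and lookup_doit() are hypothetical names, a release store stands in for rcu_assign_pointer(), and the old copy is deliberately leaked where the kernel would free it after a grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef int (*handler_fn)(int);

/* Hypothetical stand-in for the per-message handler structure. */
struct link_handlers {
	handler_fn doit;
	handler_fn dumpit;
};

/* Readers only ever see fully initialised, never-modified copies. */
static _Atomic(struct link_handlers *) published;

static int register_handlers(handler_fn doit, handler_fn dumpit)
{
	struct link_handlers *old =
		atomic_load_explicit(&published, memory_order_relaxed);
	struct link_handlers *fresh = malloc(sizeof(*fresh));

	if (!fresh)
		return -1;
	/* Start from the old contents, then apply the update to the copy. */
	if (old) {
		*fresh = *old;
	} else {
		fresh->doit = NULL;
		fresh->dumpit = NULL;
	}
	if (doit)
		fresh->doit = doit;
	if (dumpit)
		fresh->dumpit = dumpit;
	/* Publish: after this store the copy is never written again. */
	atomic_store_explicit(&published, fresh, memory_order_release);
	/* Assumption: the old copy is leaked here; no grace period is modelled. */
	return 0;
}

static handler_fn lookup_doit(void)
{
	struct link_handlers *h =
		atomic_load_explicit(&published, memory_order_acquire);

	return h ? h->doit : NULL;
}

/* Sample handler used only to exercise the sketch. */
static int sample_doit(int x)
{
	return x + 1;
}
```

Because a published structure is immutable, readers need a single ordered pointer load rather than a READ_ONCE() on every field.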
Heiner Kallweit [Sat, 2 Dec 2017 09:50:37 +0000 (10:50 +0100)]
net: phy: broadcom: re-add mistakenly removed config settings
Previous patch mistakenly removed three chip-specific config settings.
Add them again.
Fixes: 80274abafc60 "net: phy: remove generic settings for callbacks config_aneg and read_status from drivers"
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 Dec 2017 16:04:20 +0000 (11:04 -0500)]
Merge branch 'ipv6-gre-collect_md'
William Tu says:
====================
add ip6 gre and gretap collect_md mode
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6gretap tunnels to
operate in collect metadata mode. The first patch adds the support to
ip6_gre.c. The second patch enables unsetting the csum for ipv6 tunnel,
when using bpf_skb_[gs]et_tunnel_key() helpers. Finally, the last patch
adds the ip6 gre and gretap tunnel test cases to BPF sample code.
The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151216943128087&w=2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Fri, 1 Dec 2017 23:26:10 +0000 (15:26 -0800)]
samples/bpf: extend test_tunnel_bpf.sh with ip6gre
Extend existing tests for vxlan, gre, geneve, ipip, erspan,
to include ip6 gre and gretap tunnel.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Fri, 1 Dec 2017 23:26:09 +0000 (15:26 -0800)]
bpf: allow disabling tunnel csum for ipv6
Before the patch, BPF_F_ZERO_CSUM_TX can be used only for ipv4 tunnel.
With introduction of ip6gretap collect_md mode, the flag should be also
supported for ipv6.
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Fri, 1 Dec 2017 23:26:08 +0000 (15:26 -0800)]
ip6_gre: add ip6 gre and gretap collect_md mode
Similar to gre, vxlan, geneve, ipip tunnels, allow ip6 gre and gretap
tunnels to operate in collect metadata mode. bpf_skb_[gs]et_tunnel_key()
helpers can make use of it right away. OVS can use it as well in the
future.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 30 Nov 2017 22:57:00 +0000 (23:57 +0100)]
net: phy: core: don't disable device interrupts in phy_change
If state is not PHY_HALTED I see no need to temporarily disable
interrupts on the device. As long as the current interrupt isn't acked
on the device no new interrupt can happen anyway.
In addition remove an unneeded enabling of interrupts in the state
machine when handling state PHY_CHANGELINK.
Tested on an Odroid-C2 with RTL8211F phy in interrupt mode.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 30 Nov 2017 22:55:15 +0000 (23:55 +0100)]
net: phy: core: remove now uneeded disabling of interrupts
After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
some simplification" all relevant code pieces run in process context
anyway and I don't think we need the disabling of interrupts any longer.
Interestingly enough, the latter commit already removed the comment
explaining why interrupts need to be temporarily disabled.
On my system phy interrupt mode works fine with this patch.
However, I may be missing something, especially in the context of shared phy
interrupts, therefore I'd appreciate it if more people could test this.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 3 Dec 2017 15:18:28 +0000 (10:18 -0500)]
Merge branch 'tcp-2nd-listener-hash'
Martin KaFai Lau says:
====================
tcp: Add a 2nd listener hashtable (port+addr)
This patch set adds a 2nd listener hashtable. It is to resolve
the performance issue when a process is listening at many IP
addresses with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443)
v2:
- Move the new lhash2 and lhash2_mask before the existing
listening_hash to avoid adding another cacheline
to inet_hashinfo (Suggested by Eric Dumazet, Thanks!)
- I take this chance to plug an existing 4 bytes hole while
adding 'unsigned int lhash2_mask'.
- Add some comments about lhash2 in inet_hashtables.h
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 1 Dec 2017 20:52:32 +0000 (12:52 -0800)]
tcp: Enable 2nd listener hashtable in TCP
Enable the second listener hashtable in TCP.
The scale is the same as UDP which is one slot per 2MB.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 1 Dec 2017 20:52:31 +0000 (12:52 -0800)]
inet: Add a 2nd listener hashtable (port+addr)
The current listener hashtable is hashed by port only.
When a process is listening at many IP addresses with the same port (e.g.
[IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
performance degrades to that of a linked-list walk. This makes it prone to SYN attacks.
UDP had a similar issue and a second hashtable was added to resolve it.
This patch adds a second hashtable for the listener's sockets.
The second hashtable is hashed by port and address.
It cannot reuse the existing skc_portaddr_node which is shared
with skc_bind_node. TCP listener needs to use skc_bind_node.
Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
the inet_connection_sock which the listener (like TCP) also belongs to.
The new portaddr hashtable may need two lookups (first by IP:PORT, then
by INADDR_ANY:PORT if IP:PORT is not found). Hence,
it implements a similar cut off as UDP such that it will only consult the
new portaddr hashtable if the current port-only hashtable has >10
sk in the linked list.
lhash2 and lhash2_mask are added to 'struct inet_hashinfo'. I take
this chance to plug a 4 bytes hole. It is done by first moving
the existing bind_bucket_cachep up and then adding the new
(int lhash2_mask, *lhash2) after the existing bhash_size.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
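The lookup policy described in "inet: Add a 2nd listener hashtable (port+addr)" can be sketched as follows. This is a simplified illustration, not the kernel code: the names are hypothetical, a linear scan over a small array stands in for a hashed bucket, and ANY_ADDR stands in for INADDR_ANY. It shows the cut-off (consult the port+addr table only when the port-only bucket holds more than 10 sockets) and the two-step lookup (exact address first, wildcard second).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_ADDR      0u   /* stands in for INADDR_ANY */
#define LHASH2_CUTOFF 10u  /* ">10 sk" threshold from the commit text */

struct listener {
	uint32_t addr;
	uint16_t port;
};

/* Use the port+addr table only when the port-only bucket is long. */
static int use_portaddr_table(unsigned int port_bucket_count)
{
	return port_bucket_count > LHASH2_CUTOFF;
}

/* Linear scan standing in for a walk over one hashed bucket. */
static const struct listener *find_exact(const struct listener *tbl, size_t n,
					 uint32_t addr, uint16_t port)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].addr == addr && tbl[i].port == port)
			return &tbl[i];
	return NULL;
}

/* First try IP:PORT, then fall back to ANY:PORT. */
static const struct listener *lookup_portaddr(const struct listener *tbl,
					      size_t n, uint32_t addr,
					      uint16_t port)
{
	const struct listener *l = find_exact(tbl, n, addr, port);

	if (!l)
		l = find_exact(tbl, n, ANY_ADDR, port);
	return l;
}

/* Sample data used only to exercise the sketch. */
static const struct listener demo_listeners[] = {
	{ 0x0a000001u, 443 },  /* 10.0.0.1:443 */
	{ ANY_ADDR, 80 },      /* *:80 */
};
```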
Martin KaFai Lau [Fri, 1 Dec 2017 20:52:30 +0000 (12:52 -0800)]
udp: Move udp[46]_portaddr_hash() to net/ip[v6].h
This patch moves the udp[46]_portaddr_hash()
to net/ip[v6].h. The function name is renamed to
ipv[46]_portaddr_hash().
It will be used by a later patch which adds a second listener
hashtable hashed by the address and port.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 1 Dec 2017 20:52:29 +0000 (12:52 -0800)]
inet: Add a count to struct inet_listen_hashbucket
This patch adds a count to the 'struct inet_listen_hashbucket'.
It counts how many sks are hashed to a bucket. It will be
used to decide if the (to-be-added) portaddr listener's hashtable
should be used during inet[6]_lookup_listener().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu Varadarajan [Fri, 1 Dec 2017 18:21:40 +0000 (10:21 -0800)]
enic: add sw timestamp support
Add ethtool ops to advertise sw timestamping.
Call skb_tx_timestamp() just before ringing the wq doorbell.
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 3 Dec 2017 15:10:03 +0000 (10:10 -0500)]
Merge branch 'hv_netvsc-minor-optimizations'
Stephen Hemminger says:
====================
hv_netvsc: minor optimizations
These are a set of local optimizations to the Hyper-V networking driver.
Also include a vmbus patch in this set, because it depends on the
netvsc change that removes the last use of that function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Fri, 1 Dec 2017 19:01:49 +0000 (11:01 -0800)]
vmbus: make hv_get_ringbuffer_availbytes local
The last use of hv_get_ringbuffer_availbytes in drivers is now
gone. Only used by the debug info routine so make it static. Also, add
READ_ONCE() to avoid any possible issues with potentially volatile
index values.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Fri, 1 Dec 2017 19:01:48 +0000 (11:01 -0800)]
hv_netvsc: optimize initialization of RNDIS header
The memset of the whole maximum possible RNDIS header is unnecessary.
For the main part of the header use a structure assignment.
No need to memset the whole per packet info. Instead rely on caller to
set what it wants. Also get rid of cast to void and signed/unsigned
conversion. Now return pointer to per packet data (rather than the
header) which simplifies use by code setting up the packet data.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Fri, 1 Dec 2017 19:01:47 +0000 (11:01 -0800)]
hv_netvsc: use reciprocal divide to speed up percent calculation
Every packet sent checks the available ring space. The calculation
can be sped up by using reciprocal divide which is multiplication.
Since ring_size can only be configured by module parameter, it doesn't
have to be passed around everywhere. Also it should be unsigned
since it is a number of pages.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
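The idea behind "hv_netvsc: use reciprocal divide to speed up percent calculation" is to precompute a reciprocal of the fixed divisor once, then replace each division by a multiply and shifts. The sketch below is a userspace re-implementation of that general technique, modelled on the kernel helpers reciprocal_value() and reciprocal_divide(). The names recip_value(), recip_divide() and avail_percent() are hypothetical, and how the driver actually applies the helper is not shown in the commit text.

```c
#include <assert.h>
#include <stdint.h>

struct recip {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* Position of the highest set bit, 1-based; 0 for x == 0. */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Precompute the reciprocal of d once. d must be non-zero. */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	int l = fls32(d - 1);
	uint64_t m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;

	r.m = (uint32_t)m;
	r.sh1 = l > 0 ? 1 : 0;                 /* min(l, 1) */
	r.sh2 = l > 0 ? (uint8_t)(l - 1) : 0;  /* max(l - 1, 0) */
	return r;
}

/* a / d using one multiply and two shifts. */
static uint32_t recip_divide(uint32_t a, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Percentage of the ring that is available; avail * 100 must fit in 32 bits. */
static uint32_t avail_percent(uint32_t avail, struct recip ring_size_recip)
{
	return recip_divide(avail * 100, ring_size_recip);
}
```

Since the ring size is fixed after module load, the reciprocal can be computed once and reused for every packet.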
Stephen Hemminger [Fri, 1 Dec 2017 19:01:46 +0000 (11:01 -0800)]
hv_netvsc: replace divide with mask when computing padding
Packet alignment is always a power of 2, therefore modulus can
be replaced with a faster AND operation.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
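The substitution in "hv_netvsc: replace divide with mask when computing padding" relies on the identity len % align == len & (align - 1) when align is a power of two. A minimal sketch, with hypothetical helper names pad_mod() and pad_mask() that are not taken from the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Padding needed to round len up to a multiple of align, using modulus. */
static uint32_t pad_mod(uint32_t len, uint32_t align)
{
	uint32_t rem = len % align;

	return rem ? align - rem : 0;
}

/* Same result with a mask; only valid when align is a power of two. */
static uint32_t pad_mask(uint32_t len, uint32_t align)
{
	uint32_t rem = len & (align - 1);

	return rem ? align - rem : 0;
}
```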
Stephen Hemminger [Fri, 1 Dec 2017 19:01:45 +0000 (11:01 -0800)]
hv_netvsc: don't need local xmit_more
Since skb is always non-NULL in the copy portion of netvsc_send,
the local variable is not needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Fri, 1 Dec 2017 19:01:44 +0000 (11:01 -0800)]
hv_netvsc: drop unused macros
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gao Feng [Fri, 1 Dec 2017 08:33:03 +0000 (16:33 +0800)]
ipvlan: Add new func ipvlan_is_valid_dev instead of duplicated codes
There are multiple duplicated condition checks in the current code, so
add a new function, ipvlan_is_valid_dev, to replace them and
check whether the netdev is a real ipvlan dev.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 3 Dec 2017 14:38:17 +0000 (09:38 -0500)]
Merge branch 'realtek-phy-improvements'
Martin Blumenstingl says:
====================
Realtek Ethernet PHY driver improvements
This series provides some small improvements and cleanups for the
Realtek Ethernet PHY driver.
None of the patches in this series should change any functionality.
The goal is to make the code a bit easier to read by:
- re-using the BIT and GENMASK macros (which makes it easier to compare
the #defines in the kernel with the values from the datasheets)
- rename a #define from a generic name to a PHY-specific name since it's
only used for one specific PHY
- logically group the register #defines and their register bit #defines
together
- indentation cleanups
- remove some code duplication for reading/writing registers on a
Realtek specific "page"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sat, 2 Dec 2017 21:51:28 +0000 (22:51 +0100)]
net: phy: realtek: add utility functions to read/write page addresses
Realtek PHYs implement the concept of so-called "extension pages". The
reason for this is probably that these PHYs expose more registers
than available in the standard address range.
After all read/write operations on such a page are done the driver
should switch back to page 0 where the standard MII registers (such as
MII_BMCR) are available.
When referring to such a register the datasheets of RTL8211E and
RTL8211F always specify:
- the page / "ext. page" which has to be written to RTL821x_PAGE_SELECT
- an address (sometimes also called reg)
These new utility functions make the existing code easier to read since
it removes some duplication (switching back to page 0 is done within the
new helpers for example).
No functional changes are intended.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
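The paged access pattern described in "net: phy: realtek: add utility functions to read/write page addresses" (select the page, access the register, switch back to page 0) can be sketched against a fake register file. Everything here is hypothetical: struct fake_phy, the helper names, and the value chosen for PAGE_SELECT_REG are illustration only and are not the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SELECT_REG 0x1f  /* assumed address standing in for RTL821x_PAGE_SELECT */
#define NUM_PAGES 4
#define NUM_REGS  32

/* Fake PHY: one bank of registers per page plus the current page. */
struct fake_phy {
	uint16_t page;
	uint16_t regs[NUM_PAGES][NUM_REGS];
};

static uint16_t fake_read(struct fake_phy *p, unsigned int reg)
{
	if (reg == PAGE_SELECT_REG)
		return p->page;
	return p->regs[p->page][reg];
}

static void fake_write(struct fake_phy *p, unsigned int reg, uint16_t val)
{
	if (reg == PAGE_SELECT_REG) {
		p->page = val % NUM_PAGES;
		return;
	}
	p->regs[p->page][reg] = val;
}

/* Select page, read, then always return to page 0. */
static uint16_t read_paged(struct fake_phy *p, uint16_t page, unsigned int reg)
{
	uint16_t val;

	fake_write(p, PAGE_SELECT_REG, page);
	val = fake_read(p, reg);
	fake_write(p, PAGE_SELECT_REG, 0);
	return val;
}

/* Select page, write, then always return to page 0. */
static void write_paged(struct fake_phy *p, uint16_t page, unsigned int reg,
			uint16_t val)
{
	fake_write(p, PAGE_SELECT_REG, page);
	fake_write(p, reg, val);
	fake_write(p, PAGE_SELECT_REG, 0);
}

/* Instance used only to exercise the sketch. */
static struct fake_phy demo_phy;
```

Keeping the switch back to page 0 inside the helpers means callers cannot forget it, which is the duplication the commit removes.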
Martin Blumenstingl [Sat, 2 Dec 2017 21:51:27 +0000 (22:51 +0100)]
net: phy: realtek: use the same indentation for all #defines
This simply makes the code easier to read. No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sat, 2 Dec 2017 21:51:26 +0000 (22:51 +0100)]
net: phy: realtek: group all register bit #defines for RTL821x_INER
This simply moves all register bit #defines which describe the (PHY
specific) bits in the RTL821x_INER right below the RTL821x_INER register
definition. This makes it easier to spot which registers and bits belong
together.
No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sat, 2 Dec 2017 21:51:25 +0000 (22:51 +0100)]
net: phy: realtek: rename RTL821x_INER_INIT to RTL8211B_INER_INIT
This macro is only used by the RTL8211B code. RTL8211E and RTL8211F both
use other bits to initialize the RTL821x_INER register.
No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Sat, 2 Dec 2017 21:51:24 +0000 (22:51 +0100)]
net: phy: realtek: use the BIT and GENMASK macros
This makes it easier to compare the #defines with the datasheets.
No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
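To illustrate why BIT and GENMASK make #defines easier to compare with a datasheet, here is a userspace sketch. The macros below are simplified 32-bit stand-ins for the kernel macros of the same name, and the EXAMPLE_* register fields are invented for illustration; they do not come from the Realtek driver.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 32-bit stand-ins for the kernel's BIT() and GENMASK(). */
#define BIT(n)        (UINT32_C(1) << (n))
#define GENMASK(h, l) ((UINT32_MAX >> (31 - (h))) & ~(BIT(l) - 1))

/* Hypothetical register fields, written the way a datasheet lists them. */
#define EXAMPLE_CTRL_ENABLE BIT(10)        /* "bit 10" */
#define EXAMPLE_CTRL_MODE   GENMASK(7, 4)  /* "bits 7:4" */

/* Extract the mode field from a register value. */
static uint32_t example_mode(uint32_t reg)
{
	return (reg & EXAMPLE_CTRL_MODE) >> 4;
}
```

Compared with writing 0x0400 and 0x00f0, the bit positions can be read directly off the definition.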
David S. Miller [Sun, 3 Dec 2017 02:21:29 +0000 (21:21 -0500)]
Merge branch 'dsa-cross-chip-FDB-support'
Vivien Didelot says:
====================
net: dsa: cross-chip FDB support
DSA can have interconnected switches. For instance, the ZII Dev Rev B
board described in arch/arm/boot/dts/vf610-zii-dev-rev-b.dts has a
switch fabric composed of 3 switch devices like this:
                           lan4            lan6
          CPU (eth1)       | lan5          | lan7
               |           | |             | |
    [0 1 2 3 4 6 5]---[6 0 1 2 3 4 5]---[9 0 1 2 3 4 5 6 7 8]
     | | |               |                     | | |
  lan0 | lan2            lan3               lan8 | optical4
      lan1                                    optical3
One current issue with DSA is cross-chip FDB. If we add a static MAC
address on lan3, only its parent switch 1 (the one in the middle) will
be programmed. That is not correct in a cross-chip environment, because
the DSA ports connecting to switch 1 of adjacent switch 0 (on the left)
and switch 2 (on the right) must be programmed too.
Without this patchset, a dump of the hardware FDB of switches 0, 1 and 2
after programming a MAC address on lan3 looks like this (*):
# bridge fdb add 11:22:33:44:55:66 dev lan3
# cat /sys/kernel/debug/mv88e6xxx/sw*/atu/0 | grep -v FID
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6
0 11:22:33:44:55:66 MC_STATIC_MGMT_PO n 0 - - - - - -
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6 7 8 9
With this patchset applied, adjacent DSA ports get programmed too:
# bridge fdb add 11:22:33:44:55:66 dev lan3
# cat /sys/kernel/debug/mv88e6xxx/sw*/atu/0 | grep -v FID
0 11:22:33:44:55:66 MC_STATIC_MGMT_PO n - - - - - 5 -
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6
0 11:22:33:44:55:66 MC_STATIC_MGMT_PO n 0 - - - - - -
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6
0 11:22:33:44:55:66 MC_STATIC_MGMT_PO n - - - - - - - - - 9
0 ff:ff:ff:ff:ff:ff MC_STATIC n 0 1 2 3 4 5 6 7 8 9
In order to do that, the first commit introduces a dsa_towards_port()
helper which returns the local port of a switch which must be used to
reach an arbitrary switch port (local or from an adjacent switch.)
The second patch uses this helper to configure the port reaching the
target port on every switch of the fabric.
(*) a patch for squashed debugfs interface which applies on top of this
patchset is available here:
https://github.com/vivien/linux/commit/f8e6ba34c68a72d3bf42f4dea79abacb2e61a3cc.patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 17:56:43 +0000 (12:56 -0500)]
net: dsa: support cross-chip FDB operations
When a MAC address is added to or removed from a switch port in the
fabric, the target switch must program its port and adjacent switches
must program their local DSA port used to reach the target switch.
For this purpose, use the dsa_towards_port() helper to identify the
local switch port which must be programmed.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 17:56:42 +0000 (12:56 -0500)]
net: dsa: introduce dsa_towards_port helper
Add a new helper returning the local port used to reach an arbitrary
switch port in the fabric.
Its only user at the moment is the dsa_upstream_port helper, which
returns the local port reaching the dedicated CPU port, but it will be
used in cross-chip FDB operations.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
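The behaviour of the dsa_towards_port() helper, as described above, can be sketched with a per-switch routing table. The struct and function names here are hypothetical, and the routing-table values are an assumption read off the cover letter's fabric diagram and FDB dump (switch 0 reaches the others via port 5, switch 2 via port 9, and lan3 is port 0 of switch 1).

```c
#include <assert.h>

#define MAX_SWITCHES 3

/* Hypothetical, reduced view of a switch in the fabric. */
struct fabric_switch {
	int index;
	int rtable[MAX_SWITCHES];  /* local port used to reach switch i */
};

/* Local port of s that leads towards (device, port). */
static int towards_port(const struct fabric_switch *s, int device, int port)
{
	if (device == s->index)
		return port;           /* target port is on this switch */
	return s->rtable[device];      /* otherwise use the DSA link */
}

/* Assumed routing tables for the three-switch fabric in the cover letter. */
static const struct fabric_switch fabric[MAX_SWITCHES] = {
	{ .index = 0, .rtable = { -1, 5, 5 } },
	{ .index = 1, .rtable = { 6, -1, 5 } },
	{ .index = 2, .rtable = { 9, 9, -1 } },
};
```

For a MAC address added on lan3 (switch 1, port 0), this yields port 5 on switch 0, port 0 on switch 1 and port 9 on switch 2, matching the dump shown with the patchset applied.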
David S. Miller [Sun, 3 Dec 2017 02:18:57 +0000 (21:18 -0500)]
Merge branch 'dsa-simplify-switchdev-prepare-phase'
Vivien Didelot says:
====================
net: dsa: simplify switchdev prepare phase
This patch series brings no functional changes.
It removes the unused switchdev_trans arguments from the dsa_switch_ops
for both MDB and VLAN operations, and provides functions to prepare and
add these objects for a given bitmap of ports.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 16:24:00 +0000 (11:24 -0500)]
net: dsa: add switch mdb bitmap functions
This patch brings no functional changes.
It moves out the MDB code iterating on a multicast group into new
dsa_switch_mdb_{prepare,add}_bitmap() functions.
This gives us a better isolation of the two switchdev phases.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 16:23:59 +0000 (11:23 -0500)]
net: dsa: add switch vlan bitmap functions
This patch brings no functional changes.
It moves out the VLAN code iterating on a list of VLAN members into new
dsa_switch_vlan_{prepare,add}_bitmap() functions.
This gives us a better isolation of the two switchdev phases.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 16:23:58 +0000 (11:23 -0500)]
net: dsa: remove trans argument from mdb ops
The DSA switch MDB ops pass the switchdev_trans structure down to the
drivers, but no one is using them and they aren't supposed to anyway.
Remove the trans argument from MDB prepare and add operations.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Thu, 30 Nov 2017 16:23:57 +0000 (11:23 -0500)]
net: dsa: remove trans argument from vlan ops
The DSA switch VLAN ops pass the switchdev_trans structure down to the
drivers, but no one is using them and they aren't supposed to anyway.
Remove the trans argument from VLAN prepare and add operations.
At the same time, fix the following checkpatch warning:
WARNING: line over 80 characters
#74: FILE: drivers/net/dsa/dsa_loop.c:177:
+ const struct switchdev_obj_port_vlan *vlan)
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 30 Nov 2017 14:35:33 +0000 (15:35 +0100)]
openvswitch: do not propagate headroom updates to internal port
After commit 3a927bc7cf9d ("ovs: propagate per dp max headroom to
all vports") the need_headroom for the internal vport is updated
according to the max needed headroom in its datapath.
That avoids the pskb_expand_head() costs when sending/forwarding
packets towards tunnel devices, at least for some scenarios.
We still require such copy when using the ovs-preferred configuration
for vxlan tunnels:
      br_int
     /      \
  tap      vxlan
           (remote_ip:X)

  br_phy
       \
      NIC
where the route towards the IP 'X' is via 'br_phy'.
When forwarding traffic from the tap towards the vxlan device, we
will call pskb_expand_head() in vxlan_build_skb() because
br-phy->needed_headroom is equal to tun->needed_headroom.
With this change we avoid updating the internal vport needed_headroom,
so that in the above scenario no head copy is needed, giving 5%
performance improvement in UDP throughput test.
As a trade-off, packets sent from the internal port towards a tunnel
device will now experience the head copy overhead. The rationale is
that the latter use-case is less relevant performance-wise.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Dec 2017 21:36:33 +0000 (16:36 -0500)]
Merge branch 'cpsw-ale-cleanups'
Grygorii Strashko says:
====================
net: ethernet: ti: cpsw/ale clean up and optimization
This is a set of non-critical cleanups and optimizations for the TI
CPSW and ALE drivers.
Rebased on top of net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:20 +0000 (18:21 -0600)]
net: ethernet: ti: ale: fix port check in cpsw_ale_control_set/get
ALE ports number includes the Host port and ext Ports, and
ALE ports numbering starts from 0, so correct corresponding port
checks in cpsw_ale_control_set/get().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:19 +0000 (18:21 -0600)]
net: ethernet: ti: ale: use devm_kzalloc in cpsw_ale_create()
Use devm_kzalloc in cpsw_ale_create(). This also makes the
cpsw_ale_destroy() function a no-op, so remove it.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:18 +0000 (18:21 -0600)]
net: ethernet: ti: ale: move static initialization in cpsw_ale_create()
Move static initialization from cpsw_ale_start() to cpsw_ale_create() as it
does not make much sense to perform static initialization in
cpsw_ale_start(), which is called every time netif[s] is opened.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:17 +0000 (18:21 -0600)]
net: ethernet: ti: ale: optimize ale entry mask bits configuartion
The ale->params.ale_ports parameter can be used to derive values for all
ale entry mask bits: port_mask_bits, port_mask_bits, port_num_bits.
Hence, calculate the above values and drop all hardcoded values. For the
port_num_bits calculation use the order_base_2() API.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
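The port_num_bits derivation mentioned above uses order_base_2(), which gives the number of bits needed to encode n distinct values (the base-2 logarithm rounded up). A userspace sketch of that calculation follows; order_base_2_sketch() is a hypothetical stand-in, not the kernel macro, and it assumes n is small enough that the shift cannot overflow.

```c
#include <assert.h>

/* Smallest bits such that (1 << bits) >= n; returns 0 for n <= 1. */
static unsigned int order_base_2_sketch(unsigned int n)
{
	unsigned int bits = 0;

	while (bits < 31 && (1u << bits) < n)
		bits++;
	return bits;
}
```

With 3 ALE ports this gives 2 bits for the port number field, so the value no longer needs to be hardcoded per SoC.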
Grygorii Strashko [Fri, 1 Dec 2017 00:21:16 +0000 (18:21 -0600)]
net: ethernet: ti: ale: disable ale from stop()
ALE is enabled from cpsw_ale_start() now, but disabled only from
cpsw_ale_destroy(), which introduces an inconsistency, as cpsw_ale_start() is
called when netif[s] is opened, but cpsw_ale_destroy() is called when the
driver is removed. Hence, move ALE disabling into cpsw_ale_stop().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:15 +0000 (18:21 -0600)]
net: ethernet: ti: ale: use proper io apis
Switch to the writel_relaxed/readl_relaxed() IO API instead of the raw
versions, as recommended.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:14 +0000 (18:21 -0600)]
net: ethernet: ti: cpsw: fix ale port numbers
TI OMAP/Sitara SoCs have a fixed number of ALE ports, 3, which includes the
Host port.
Hence, use a fixed value instead of a value calculated from DT, which can be
set by the user and might not reflect the actual HW configuration.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:13 +0000 (18:21 -0600)]
net: ethernet: ti: cpsw: move mac_hi/lo defines in cpsw.h
Move mac_hi/lo defines into the common header cpsw.h and re-use
them for netcp_ethss.c.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:12 +0000 (18:21 -0600)]
net: ethernet: ti: cpsw: move platform data struct to .c file
CPSW platform data struct cpsw_platform_data and struct cpsw_slave_data are
used only inside the cpsw.c module, so move these definitions there.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:11 +0000 (18:21 -0600)]
net: ethernet: ti: cpsw: use proper io apis
Switch to the writel_relaxed/readl_relaxed() IO API instead of the raw
versions, as recommended.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Fri, 1 Dec 2017 00:21:10 +0000 (18:21 -0600)]
net: ethernet: ti: cpsw: drop unused var poll from cpsw_update_channels_res
Drop unused variable "poll" from cpsw_update_channels_res().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 30 Nov 2017 22:47:52 +0000 (23:47 +0100)]
net: phy: remove generic settings for callbacks config_aneg and read_status from drivers
Remove generic settings for callbacks config_aneg and read_status
from drivers.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 30 Nov 2017 22:46:19 +0000 (23:46 +0100)]
net: phy: core: use genphy version of callbacks read_status and config_aneg per default
read_status and config_aneg are the only mandatory callbacks and most
of the time the generic implementation is used by drivers.
So make the core fall back to the generic version if a driver doesn't
implement the respective callback.
Also currently the core doesn't seem to verify that drivers implement
the mandatory calls. If a driver doesn't do so we'd just get a NULL pointer dereference.
With this patch this potential issue doesn't exist any longer.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
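The fallback described in "net: phy: core: use genphy version of callbacks read_status and config_aneg per default" amounts to a NULL check before dispatch. A reduced sketch, in which all struct and function names are hypothetical and the return values are arbitrary markers used only to tell the implementations apart:

```c
#include <assert.h>
#include <stddef.h>

struct phy_dev;

/* Hypothetical, reduced driver ops: both callbacks are optional. */
struct phy_drv {
	int (*config_aneg)(struct phy_dev *dev);
	int (*read_status)(struct phy_dev *dev);
};

struct phy_dev {
	const struct phy_drv *drv;
};

/* Generic implementations standing in for the genphy versions. */
static int generic_config_aneg(struct phy_dev *dev) { (void)dev; return 100; }
static int generic_read_status(struct phy_dev *dev) { (void)dev; return 200; }

/* A driver-specific override used by the sample driver below. */
static int custom_read_status(struct phy_dev *dev) { (void)dev; return 201; }

/* Core dispatch: use the driver callback if set, else the generic one. */
static int do_config_aneg(struct phy_dev *dev)
{
	if (dev->drv->config_aneg)
		return dev->drv->config_aneg(dev);
	return generic_config_aneg(dev);
}

static int do_read_status(struct phy_dev *dev)
{
	if (dev->drv->read_status)
		return dev->drv->read_status(dev);
	return generic_read_status(dev);
}

/* Sample driver that implements only read_status. */
static const struct phy_drv partial_drv = { .read_status = custom_read_status };
static struct phy_dev demo_dev = { .drv = &partial_drv };
```

A driver that leaves a callback unset gets the generic behaviour instead of a NULL pointer dereference, which is why drivers no longer need to set the generic callbacks explicitly.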
David S. Miller [Fri, 1 Dec 2017 20:33:27 +0000 (15:33 -0500)]
Merge branch 'ip6_gre-add-erspan-native-tunnel-for-ipv6'
William Tu says:
====================
ip6_gre: add erspan native tunnel for ipv6
The patch series adds support for ERSPAN tunnel over ipv6. The first patch
refactors the existing ipv4 gre implementation and the second refactors the
ipv6 gre's xmit code. Finally the last patch introduces erspan protocol.
change in v5:
- add cover-letter description
change in v4:
- rebase on top of net-next
- use log_ecn_error in ip6_tnl_rcv
change in v3:
- add inline for functions in header
- rebase on top of net-next
change in v2:
- remove inline
- fix some indent
- fix errors reported by clang and scan-build
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Thu, 30 Nov 2017 19:51:29 +0000 (11:51 -0800)]
ip6_gre: Add ERSPAN native tunnel support
The patch adds support for ERSPAN tunnel over ipv6.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Thu, 30 Nov 2017 19:51:28 +0000 (11:51 -0800)]
ip6_gre: Refactor ip6gre xmit codes
This patch refactors ip6gre_xmit_{ipv4, ipv6}.
It is prep work for adding the ip6erspan tunnel.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
William Tu [Thu, 30 Nov 2017 19:51:27 +0000 (11:51 -0800)]
ip_gre: Refector the erpsan tunnel code.
Move two erspan functions to header file, erspan.h, so ipv6
erspan implementation can use it.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Dec 2017 20:29:40 +0000 (15:29 -0500)]
Merge branch 'ethtool-reset-AP'
Scott Branden says:
====================
net: ethtool: add support for ETH_RESET_AP
Add support to reset application processors inside SmartNICs by
defining new ETH_RESET_AP bit.
And use new ETH_RESET_AP bit in bnxt ethernet driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Branden [Thu, 30 Nov 2017 19:36:00 +0000 (11:36 -0800)]
bnxt_en: Add ETH_RESET_AP support
Add ETH_RESET_AP support handling to reset the internal
Application Processor(s) of the SmartNIC card.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Branden [Thu, 30 Nov 2017 19:35:59 +0000 (11:35 -0800)]
net: ethtool: add support for reset of AP inside NIC interface.
Add ETH_RESET_AP to reset the application processor(s) inside the NIC
interface.
Current ETH_RESET_MGMT supports a management processor inside this NIC.
This is typically used for remote NIC management purposes.
Application processors exist inside some SmartNICs to run various
applications inside the NIC processor, ranging from a simple algorithm
without an OS to something as complex as hosting multiple VMs.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
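A minimal sketch of how a driver might honour the new reset bit. It assumes the usual ethtool reset convention, in which the driver clears the flags for the components it actually reset and leaves the others set. The bit positions, RESET_*_SKETCH names and reset_sketch() are illustrative only; the authoritative definitions are in the ethtool UAPI header.

```c
#include <stdint.h>

/* Illustrative bit positions, not copied from the kernel header. */
#define RESET_MGMT_SKETCH (1u << 0)
#define RESET_AP_SKETCH   (1u << 8)

/*
 * Hypothetical driver hook: reset the application processor if requested
 * and present, clear its bit, and leave unhandled bits set for the caller.
 */
static int reset_sketch(uint32_t *flags, int has_ap)
{
	if ((*flags & RESET_AP_SKETCH) && has_ap) {
		/* a real driver would ask the firmware to reset the AP here */
		*flags &= ~RESET_AP_SKETCH;
	}
	return 0;
}
```

On return the caller can inspect the remaining bits to see which requested components were not reset.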
David S. Miller [Fri, 1 Dec 2017 20:25:15 +0000 (15:25 -0500)]
Merge branch 'rds-tcp-netns-delete-related-fixes'
Sowmini Varadhan says:
====================
rds-tcp netns delete related fixes
Patchset contains cleanup and bug fixes. Patch 1 is the removal
of some redundant code/functions. Patch 2 and 3 are fixes for
corner cases identified by syzkaller. I've not been able to
reproduce the actual use-after-free race flagged in the syzkaller
reports, thus these fixes are based on code inspection plus
manual testing to make sure the modified code paths are executed
without problems in the commonly encountered timing cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Nov 2017 19:11:29 +0000 (11:11 -0800)]
rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete
The rds_tcp_kill_sock() function parses the rds_tcp_conn_list
to find the rds_connection entries marked for deletion as part
of the netns deletion under the protection of the rds_tcp_conn_lock.
Since the rds_tcp_conn_list tracks rds_tcp_connections (which
have a 1:1 mapping with rds_conn_path), multiple tc entries in
the rds_tcp_conn_list will map to a single rds_connection, and will
be deleted as part of the rds_conn_destroy() operation that is
done outside the rds_tcp_conn_lock.
The rds_tcp_conn_list traversal done under the protection of
rds_tcp_conn_lock should not leave any doomed tc entries in
the list after the rds_tcp_conn_lock is released, else another
concurrently executing netns delete (for a different netns) thread
may trip on these entries.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
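The idea in the commit above (unlink every doomed entry while the lock is held, destroy them afterwards) can be sketched with a simple singly linked list. This is not the rds code: struct conn_sketch, detach_doomed() and list_len() are hypothetical, and the lock is only indicated by a comment.

```c
#include <stddef.h>

/* Hypothetical list entry tagged with the namespace it belongs to. */
struct conn_sketch {
	int netns_id;
	struct conn_sketch *next;
};

/*
 * Unlink every entry belonging to netns_id from *list and return the
 * unlinked entries as a private list. In kernel code the caller would hold
 * the list lock for the whole call, so no doomed entry stays visible to
 * other threads once the lock is dropped.
 */
static struct conn_sketch *detach_doomed(struct conn_sketch **list, int netns_id)
{
	struct conn_sketch *doomed = NULL;
	struct conn_sketch **pp = list;

	while (*pp) {
		struct conn_sketch *c = *pp;

		if (c->netns_id == netns_id) {
			*pp = c->next;     /* unlink from the shared list */
			c->next = doomed;  /* push onto the private list */
			doomed = c;
		} else {
			pp = &c->next;
		}
	}
	return doomed;
}

/* Count the entries in a list. */
static int list_len(const struct conn_sketch *c)
{
	int n = 0;

	for (; c; c = c->next)
		n++;
	return n;
}
```

The returned private list can then be torn down without the lock, since no other thread can reach those entries any more.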
Sowmini Varadhan [Thu, 30 Nov 2017 19:11:28 +0000 (11:11 -0800)]
rds: tcp: correctly sequence cleanup on netns deletion.
Commit
8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
introduces a regression in rds-tcp netns cleanup. The cleanup_net()
(and thus rds_tcp_dev_event notification) is only called from put_net()
when all netns refcounts go to 0, but this cannot happen if the
rds_connection itself is holding a c_net ref that it expects to
release in rds_tcp_kill_sock.
Instead, the rds_tcp_kill_sock callback should make sure to
tear down state carefully, ensuring that the socket teardown
is only done after all data-structures and workqs that depend
on it are quiesced.
The original motivation for commit
8edc3affc077 ("rds: tcp: Take explicit
refcounts on struct net") was to resolve a race condition reported by
syzkaller where workqs for tx/rx/connect were triggered after the
namespace was deleted. Those worker threads should have been
cancelled/flushed before socket tear-down and indeed,
rds_conn_path_destroy() does try to sequence this by doing
/* cancel cp_send_w */
/* cancel cp_recv_w */
/* flush cp_down_w */
/* free data structures */
Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
we ought to have satisfied the requirement that "socket-close is
done after all other dependent state is quiesced". However,
rds_conn_shutdown has a bug in that it *always* triggers the reconnect
workq (and if connection is successful, we always restart tx/rx
workqs so with the right timing, we risk the race conditions reported
by syzkaller).
Netns deletion is like module teardown- no need to restart a
reconnect in this case. We can use the c_destroy_in_prog bit
to avoid restarting the reconnect.
Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 30 Nov 2017 19:11:27 +0000 (11:11 -0800)]
rds: tcp: remove redundant function rds_tcp_conn_paths_destroy()
A side-effect of Commit
c14b0366813a ("rds: tcp: set linger to 1
when unloading a rds-tcp") is that we always send a RST on the tcp
connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy()
is not needed any more and is removed in this patch.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 30 Nov 2017 15:47:25 +0000 (16:47 +0100)]
tipc: fall back to smaller MTU if allocation of local send skb fails
When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassembling the
buffer chain as if the message were arriving from a remote node.
At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.
Acked-by: Ying Xue <ying.xue@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
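The fallback described above can be sketched as a two-step attempt: build the message with the large local MTU first, and retry with a smaller MTU if an allocation fails. This is a simplified model, not the tipc code. It ignores per-fragment headers, the can_alloc_fn hook exists only so that failures can be simulated, and using 3744 bytes as the fallback size is an assumption borrowed from the broadcast-link value mentioned in the message.

```c
#include <stddef.h>

#define LOCAL_MTU_SKETCH    66060u  /* node-local 'mtu' quoted above */
#define FALLBACK_MTU_SKETCH 3744u   /* assumed fallback size */

/* Policy hook: nonzero means "an allocation of n bytes succeeds". */
typedef int (*can_alloc_fn)(size_t n);

static int always_ok(size_t n) { (void)n; return 1; }
static int only_small(size_t n) { return n <= 4096; }

/*
 * Split len bytes into buffers of at most mtu bytes.
 * Returns the number of buffers, or -1 if any allocation fails.
 */
static int build_chain(size_t len, size_t mtu, can_alloc_fn can_alloc)
{
	int count = 0;

	while (len > 0) {
		size_t chunk = len < mtu ? len : mtu;

		if (!can_alloc(chunk))
			return -1;
		len -= chunk;
		count++;
	}
	return count;
}

/* Try the large MTU first, then fall back to the small one. */
static int send_local(size_t len, can_alloc_fn can_alloc)
{
	int n = build_chain(len, LOCAL_MTU_SKETCH, can_alloc);

	if (n < 0)
		n = build_chain(len, FALLBACK_MTU_SKETCH, can_alloc);
	return n;
}
```

With this model a 60000-byte message is sent as one buffer when large allocations succeed and as 17 smaller buffers when only allocations up to 4096 bytes succeed.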
David S. Miller [Thu, 30 Nov 2017 19:12:47 +0000 (14:12 -0500)]
Merge branch 'macb-rx-packet-filtering'
Rafal Ozieblo says:
====================
Receive packets filtering for macb driver
This patch series adds support for receive packets
filtering for Cadence GEM driver. Packets can be redirected
to different hardware queues based on source IP, destination IP,
source port or destination port. To enable filtering,
support for RX queueing was added as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafal Ozieblo [Thu, 30 Nov 2017 18:20:44 +0000 (18:20 +0000)]
net: macb: Added support for RX filtering
This patch allows filtering received packets to different
hardware queues (aka ntuple).
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafal Ozieblo [Thu, 30 Nov 2017 18:19:56 +0000 (18:19 +0000)]
net: macb: Added some queue statistics
Added statistics per queue:
- qX_rx_packets
- qX_rx_bytes
- qX_rx_dropped
- qX_tx_packets
- qX_tx_bytes
- qX_tx_dropped
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafal Ozieblo [Thu, 30 Nov 2017 18:19:15 +0000 (18:19 +0000)]
net: macb: Added support for many RX queues
To receive packets on different RX queues, some
configuration has to be performed. This patch checks how many
hardware queues GEM supports and initializes them.
Signed-off-by: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shrikrishna Khare [Thu, 30 Nov 2017 18:29:51 +0000 (10:29 -0800)]
vmxnet3: increase default rx ring sizes
There are several reasons for increasing the receive ring sizes:
1. The original ring size of 256 was chosen about 10 years ago when
vmxnet3 was first created. At that time, 10Gbps Ethernet was not prevalent
and servers were dominated by 1Gbps Ethernet. Now 10Gbps is commonplace,
and higher bandwidth links -- 25Gbps, 40Gbps, 50Gbps -- are starting
to appear. 256 Rx ring entries are simply not enough to keep up with
higher link speed when there is a burst of network frames coming from
these high speed links. Even with full MTU size frames, they are gone
in a short time. It is also more common to have a mix of frame sizes,
and more likely bi-modal distribution of frame sizes so the average frame
size is not close to full MTU. If we consider average frame size of 800B,
1024 frames that come in a burst take ~0.65 ms to arrive at 10Gbps. With
256 entries, it takes ~0.16 ms to arrive at 10Gbps. At 25Gbps or 40Gbps,
this time is reduced accordingly.
2. On a hypervisor where there are many VMs and CPU is over committed,
i.e. the number of VCPUs is more than the number of PCPUs, each PCPU is
in effect time shared between multiple VMs/VCPUs. The time granularity at
which this multiplexing occurs is typically coarser than between processes
on a guest OS. Trying to time slice more finely is not efficient, for
example, if memory cache is barely warmed up when switching from one VM
to another occurs. This CPU overcommit adds delay to when the driver
in a VM can service incoming packets. Whether CPU is over committed
really depends on customer workloads. For certain situations, it is very
common. For example, workloads of desktop VMs and product testing setups.
Consolidation and sharing is what drives efficiency of a customer setup
for such workloads. In these situations, the raw network bandwidth may
not be very high, but the delays between when a VM is running or not
running can also be relatively long.
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Acked-by: Jin Heo <heoj@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Acked-by: Boon Ang <bang@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
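The burst-time figures quoted in point 1 follow from simple arithmetic: frames times bytes per frame times 8 bits, divided by the link rate. A small sketch (burst_ms() is a hypothetical helper, and framing overhead such as preamble and inter-frame gap is ignored, as in the message):

```c
/*
 * Time in milliseconds for `frames` frames of `frame_bytes` bytes each
 * to arrive back to back on a link running at `gbps` Gbit/s.
 */
static double burst_ms(unsigned frames, unsigned frame_bytes, double gbps)
{
	double bits = (double)frames * frame_bytes * 8.0;

	return bits / (gbps * 1e9) * 1e3;
}
```

For 800-byte frames at 10Gbps this gives about 0.655 ms for 1024 frames and about 0.164 ms for 256 frames, which matches the rounded values in the text. At 40Gbps a 1024-frame burst arrives in about 0.164 ms.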
Florian Fainelli [Thu, 30 Nov 2017 17:55:35 +0000 (09:55 -0800)]
net: dsa: bcm_sf2: Utilize b53_get_tag_protocol()
Utilize the much more capable b53_get_tag_protocol() which takes care of
all Broadcom switch specifics to resolve which port can have Broadcom
tags enabled or not.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 30 Nov 2017 14:39:34 +0000 (15:39 +0100)]
net/reuseport: drop legacy code
Since commit
e32ea7e74727 ("soreuseport: fast reuseport UDP socket
selection") and commit
c125e80b8868 ("soreuseport: fast reuseport
TCP socket selection") the relevant reuseport socket matching the current
packet is selected by the reuseport_select_sock() call. The only
exceptions are invalid BPF filters/filters returning out-of-range
indices.
In the latter case the code implicitly falls back to using the hash
demultiplexing, but instead of selecting the socket inside the
reuseport_select_sock() function, it relies on the hash selection
logic introduced with the early soreuseport implementation.
With this patch, in case of a BPF filter returning a bad socket
index value, we fall back to hash-based selection inside the
reuseport_select_sock() body, so that we can drop some duplicate
code in the ipv4 and ipv6 stack.
This also allows faster lookup in the above scenario and will allow
us to avoid computing the hash value for successful, BPF based
demultiplexing - in a later patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
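The selection logic described above can be sketched as follows: use the index returned by the BPF program when it is in range, otherwise map the flow hash onto the socket array. The function names are hypothetical. scale_hash() uses the multiply-and-shift technique that the kernel's reciprocal_scale() helper implements; whether the real function is structured exactly like this is an assumption.

```c
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) with a multiply and shift, no division. */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/*
 * Pick a socket index: trust the BPF result when it is in range,
 * fall back to hash-based selection when it is not.
 */
static uint32_t select_index(uint32_t bpf_index, uint32_t hash,
			     uint32_t num_socks)
{
	if (bpf_index < num_socks)
		return bpf_index;
	return scale_hash(hash, num_socks);
}
```

Keeping both paths in one function means callers do not need their own hash-based fallback code.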
Linus Walleij [Wed, 29 Nov 2017 15:34:38 +0000 (16:34 +0100)]
Documentation: net: dsa: Cut set_addr() documentation
This is not supported anymore; devices needing a MAC address
just assign one at random. It's just a driver peculiarity.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 30 Nov 2017 14:54:28 +0000 (09:54 -0500)]
Merge branch 'net-dst_entry-shrink'
David Miller says:
====================
net: Significantly shrink the size of routes.
Through a combination of several things, our route structures are
larger than they need to be.
Mostly this stems from having members in dst_entry which are only used
by one class of routes. So the majority of the work in this series is
about "un-commoning" these members and pushing them into the type
specific structures.
Unfortunately, IPSEC needed the most surgery. The majority of the
changes here had to do with bundle creation and management.
The other issue is the refcount alignment in dst_entry. Once we get
rid of the not-so-common members, it really opens the door to removing
that alignment entirely.
I think the new layout looks really nice, so I'll reproduce it here:
struct net_device *dev;
struct dst_ops *ops;
unsigned long _metrics;
unsigned long expires;
struct xfrm_state *xfrm;
int (*input)(struct sk_buff *);
int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb);
unsigned short flags;
short obsolete;
unsigned short header_len;
unsigned short trailer_len;
atomic_t __refcnt;
int __use;
unsigned long lastuse;
struct lwtunnel_state *lwtstate;
struct rcu_head rcu_head;
short error;
short __pad;
__u32 tclassid;
(This is for 64-bit, on 32-bit the __refcnt comes at the very end)
So, the good news:
1) struct dst_entry shrinks from 160 to 112 bytes.
2) struct rtable shrinks from 216 to 168 bytes.
3) struct rt6_info shrinks from 384 to 320 bytes.
Enjoy.
v2:
Collapse some patches logically based upon feedback.
Fix the strange patch #7.
v3: xfrm_dst_path() needs inline keyword
Properly align __refcnt on 32-bit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
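The 112-byte figure for struct dst_entry can be checked with a user-space mirror of the layout listed above. This is a sketch, not the kernel definition: the pointed-to kernel types are left opaque, struct atomic_sketch and struct rcu_head_sketch are stand-ins assumed to match the sizes of atomic_t and struct rcu_head, and the leading underscores of some member names are dropped.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins; only pointers to these types are stored. */
struct net_device;
struct dst_ops;
struct xfrm_state;
struct lwtunnel_state;
struct sk_buff;
struct net;
struct sock;

struct atomic_sketch { int counter; };                          /* for atomic_t */
struct rcu_head_sketch { void *next; void (*func)(void *); };   /* for rcu_head */

struct dst_entry_sketch {
	struct net_device *dev;
	struct dst_ops *ops;
	unsigned long metrics;          /* _metrics */
	unsigned long expires;
	struct xfrm_state *xfrm;
	int (*input)(struct sk_buff *);
	int (*output)(struct net *, struct sock *, struct sk_buff *);
	unsigned short flags;
	short obsolete;
	unsigned short header_len;
	unsigned short trailer_len;
	struct atomic_sketch refcnt;    /* __refcnt */
	int use;                        /* __use */
	unsigned long lastuse;
	struct lwtunnel_state *lwtstate;
	struct rcu_head_sketch rcu_head;
	short error;
	short pad;                      /* __pad */
	uint32_t tclassid;
};
```

On an LP64 target (8-byte pointers and longs) the members before refcnt add up to exactly 64 bytes, so refcnt starts at offset 64 without explicit padding, and the whole structure is 112 bytes.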
David Miller [Tue, 28 Nov 2017 20:41:07 +0000 (15:41 -0500)]
net: Remove dst->next
There are no more users.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:41:01 +0000 (15:41 -0500)]
xfrm: Stop using dst->next in bundle construction.
While building ipsec bundles, blocks of xfrm dsts are linked together
using dst->next from bottom to the top.
The only thing this is used for is initializing the pmtu values of the
xfrm stack, and for updating the mtu values at xfrm_bundle_ok() time.
The bundle pmtu entries must be processed in this order so that pmtu
values lower in the stack of routes can propagate up to the higher
ones.
Avoid using dst->next by simply maintaining an array of dst pointers
as we already do for the xfrm_state objects when building the bundle.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:40:53 +0000 (15:40 -0500)]
net: Rearrange dst_entry layout to avoid useless padding.
We have padding to try and align the refcount on a separate cache
line. But after several simplifications the padding has increased
substantially.
So now it's easy to change the layout to get rid of the padding
entirely.
We group the write-heavy __refcnt and __use with less often used
items such as the rcu_head and the error code.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:40:46 +0000 (15:40 -0500)]
xfrm: Move dst->path into struct xfrm_dst
The first member of an IPSEC route bundle chain sets its dst->path to
the underlying ipv4/ipv6 route that carries the bundle.
Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.
This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.
When we don't have the top of an IPSEC bundle 'dst->path == dst'.
Move it down into xfrm_dst and key off of dst->xfrm.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:40:40 +0000 (15:40 -0500)]
ipv6: Move dst->from into struct rt6_info.
The dst->from value is only used by ipv6 routes to track where
a route "came from".
Any time we clone or copy a core ipv6 route in the ipv6 routing
tables, we have the copy/clone's ->from point to the base route.
This is used to handle route expiration properly.
Only ipv6 uses this mechanism, and only ipv6 code references
it. So it is safe to move it into rt6_info.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:45:44 +0000 (15:45 -0500)]
xfrm: Move child route linkage into xfrm_dst.
XFRM bundle child chains look like this:
xdst1 --> xdst2 --> xdst3 --> path_dst
All of xdstN are xfrm_dst objects and xdst->u.dst.xfrm is non-NULL.
The final child pointer in the chain, here called 'path_dst', is some
other kind of route such as an ipv4 or ipv6 one.
The xfrm output path pops routes, one at a time, via the child
pointer, until we hit one which has a dst->xfrm pointer which
is NULL.
We can easily preserve the above mechanisms with child sitting
only in the xfrm_dst structure. All children in the chain
before we break out of the xfrm_output() loop have dst->xfrm
non-NULL and are therefore xfrm_dst objects.
Since we break out of the loop when we find dst->xfrm NULL, we
will not try to dereference 'dst' as if it were an xfrm_dst.
Signed-off-by: David S. Miller <davem@davemloft.net>
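The chain walk described above can be modelled in a few lines. The types and helpers here are hypothetical stand-ins (struct dst_sketch, struct xfrm_dst_sketch, child_of(), walk_to_path()); the sketch only shows why testing dst->xfrm before the cast keeps the walk safe when child lives in the xfrm-specific structure.

```c
#include <stddef.h>

/* Generic route: xfrm is non-NULL only for IPSEC routes. */
struct dst_sketch {
	const void *xfrm;
};

/* IPSEC route: embeds the generic route first and owns the child pointer. */
struct xfrm_dst_sketch {
	struct dst_sketch dst;
	struct dst_sketch *child;
};

/* Return the child of an IPSEC route, or NULL for any other route. */
static struct dst_sketch *child_of(struct dst_sketch *dst)
{
	if (dst->xfrm)
		return ((struct xfrm_dst_sketch *)dst)->child;
	return NULL;
}

/* Pop routes until one with a NULL xfrm pointer is reached. */
static struct dst_sketch *walk_to_path(struct dst_sketch *dst, int *hops)
{
	*hops = 0;
	while (dst->xfrm) {
		dst = child_of(dst);
		(*hops)++;
	}
	return dst;
}
```

For the chain xdst1 --> xdst2 --> xdst3 --> path_dst the walk takes three hops and stops at path_dst, which is never cast to the xfrm-specific type.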
David Miller [Tue, 28 Nov 2017 20:40:28 +0000 (15:40 -0500)]
ipsec: Create and use new helpers for dst child access.
This will make a future change moving the dst->child pointer less
invasive.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:40:22 +0000 (15:40 -0500)]
net: Create and use new helper xfrm_dst_child().
Only IPSEC routes have a non-NULL dst->child pointer. And IPSEC
routes are identified by a non-NULL dst->xfrm pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
David Miller [Tue, 28 Nov 2017 20:40:15 +0000 (15:40 -0500)]
ipv6: Move rt6_next from dst_entry into ipv6 route structure.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:40:08 +0000 (15:40 -0500)]
decnet: Move dn_next into decnet route structure.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller [Tue, 28 Nov 2017 20:39:59 +0000 (15:39 -0500)]
net: dst->rt_next is unused.
Delete it.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Zhu Yanjun [Tue, 28 Nov 2017 06:42:22 +0000 (01:42 -0500)]
forcedeth: optimize the xmit with unlikely
In xmit, it is very unlikely that TX_ERROR occurs. So using
unlikely() optimizes the xmit process.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
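A branch hint of the kind used in the commit above can be sketched with the GCC/Clang __builtin_expect() builtin. unlikely_sketch(), TX_ERROR_SKETCH and count_tx_errors() are hypothetical names for illustration; the hint changes only how the compiler lays out the code, not the result.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define unlikely_sketch(x) __builtin_expect(!!(x), 0)
#else
#define unlikely_sketch(x) (x)
#endif

#define TX_ERROR_SKETCH 0x1u  /* hypothetical descriptor error bit */

/* Count descriptors with the error bit set; errors are the rare case. */
static size_t count_tx_errors(const unsigned *flags, size_t n)
{
	size_t errors = 0;

	for (size_t i = 0; i < n; i++) {
		if (unlikely_sketch(flags[i] & TX_ERROR_SKETCH))
			errors++;
	}
	return errors;
}
```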
Tina Ruchandani [Mon, 27 Nov 2017 14:02:17 +0000 (15:02 +0100)]
atm: mpoa: remove 32-bit timekeeping
net/atm/mpoa_* files use 'struct timeval' to store event
timestamps. struct timeval uses a 32-bit seconds field which will
overflow in the year 2038 and beyond. Moreover, the timestamps are being
compared only to get seconds elapsed, so struct timeval which stores
a seconds and microseconds field is overkill. This patch replaces
the use of struct timeval with time64_t to store a 64-bit seconds field.
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
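A short sketch of why a 64-bit seconds value is enough for this use. time64_sketch_t, seconds_elapsed() and fits_in_32bit_seconds() are hypothetical names; the largest value a signed 32-bit seconds counter can hold is 2147483647, which corresponds to January 2038.

```c
#include <stdint.h>

typedef int64_t time64_sketch_t;  /* stand-in for a 64-bit seconds type */

/* Seconds elapsed between two timestamps. */
static int64_t seconds_elapsed(time64_sketch_t then, time64_sketch_t now)
{
	return now - then;
}

/* Nonzero if t can still be stored in a signed 32-bit seconds field. */
static int fits_in_32bit_seconds(int64_t t)
{
	return t >= INT32_MIN && t <= INT32_MAX;
}
```

Timestamps on either side of the 32-bit limit still compare correctly with the 64-bit type, while the later one no longer fits in 32 bits.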
Colin Ian King [Mon, 27 Nov 2017 13:15:10 +0000 (13:15 +0000)]
atm: eni: fix several indentation issues
There are several statements that have incorrect indentation. Fix
these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 27 Nov 2017 11:41:38 +0000 (12:41 +0100)]
openvswitch: use ktime_get_ts64() instead of ktime_get_ts()
timespec is deprecated because of the y2038 overflow, so let's convert
this one to ktime_get_ts64(). The code is already safe even on 32-bit
architectures, since it uses monotonic times. On 64-bit architectures,
nothing changes, while on 32-bit architectures this avoids one
type conversion.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 27 Nov 2017 11:39:57 +0000 (12:39 +0100)]
netxen: remove timespec usage
netxen_collect_minidump() evidently just wants to get a monotonic
timestamp. Using jiffies_to_timespec(jiffies, &ts) is not
appropriate here, since it will overflow after 2^32 jiffies,
which may be as short as 49 days of uptime.
ktime_get_seconds() is the correct interface here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
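The 49-day figure above follows from dividing 2^32 ticks by the tick rate. A small sketch, assuming a tick rate of 1000 per second for the shortest case (days_until_wrap() is a hypothetical helper):

```c
/* Days until a 32-bit tick counter wraps at `hz` ticks per second. */
static double days_until_wrap(unsigned hz)
{
	const double ticks = 4294967296.0;  /* 2^32 */

	return ticks / hz / 86400.0;
}
```

At 1000 ticks per second the counter wraps after about 49.7 days; at 100 ticks per second it lasts about 497 days.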
Richard Leitner [Mon, 27 Nov 2017 07:16:45 +0000 (08:16 +0100)]
net: phy: harmonize phy_id{,_mask} data type
Previously phy_id was u32 and phy_id_mask was unsigned int. As the
phy_id_mask defines the important bits of the phy_id (and is therefore
the same size) these two variables should be the same data type.
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
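The relationship between phy_id and phy_id_mask can be illustrated with a masked comparison, which is why both values need the same width. phy_id_matches() is a hypothetical helper and the IDs used with it are made-up examples.

```c
#include <stdint.h>

/*
 * Nonzero if the device ID equals the driver ID in every bit position
 * selected by the driver's mask.
 */
static int phy_id_matches(uint32_t dev_id, uint32_t drv_id, uint32_t drv_mask)
{
	return (dev_id & drv_mask) == (drv_id & drv_mask);
}
```

With a mask of 0xfffffff0 the low four bits (typically a revision number) are ignored, so 0x00221622 matches a driver ID of 0x00221620 while 0x00221632 does not.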
Lukas Wunner [Sat, 25 Nov 2017 11:18:19 +0000 (12:18 +0100)]
net: ethernet: davinci_emac: Deduplicate bus_find_device() by name matching
No need to reinvent the wheel, we have bus_find_device_by_name().
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Fri, 24 Nov 2017 12:04:03 +0000 (15:04 +0300)]
net: thunderx: Set max queue count taking XDP_TX into account
On T81 there are only 4 cores, hence setting max queue count to 4
would leave nothing for XDP_TX. This patch fixes this by doubling
max queue count in above scenarios.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: cjacob <cjacob@caviumnetworks.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Fri, 24 Nov 2017 12:03:26 +0000 (15:03 +0300)]
net: thunderx: Add support for xdp redirect
This patch adds support for XDP_REDIRECT. Flush is not
yet supported.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: cjacob <cjacob@caviumnetworks.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 29 Nov 2017 22:49:26 +0000 (14:49 -0800)]
Merge tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"I screwed up my merge window pull request; I only sent half of what I
meant to.
There were no new features, just bugfixes of various importance and
some very minor cleanup, so I think it's all still appropriate for
-rc2.
Highlights:
- Fixes from Trond for some races in the NFSv4 state code.
- Fix from Naofumi Honda for a typo in the blocked lock notification
code
- Fixes from Vasily Averin for some problems starting and stopping
lockd especially in network namespaces"
* tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux: (23 commits)
lockd: fix "list_add double add" caused by legacy signal interface
nlm_shutdown_hosts_net() cleanup
race of nfsd inetaddr notifiers vs nn->nfsd_serv change
race of lockd inetaddr notifiers vs nlmsvc_rqst change
SUNRPC: make cache_detail structures const
NFSD: make cache_detail structures const
sunrpc: make the function arg as const
nfsd: check for use of the closed special stateid
nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
lockd: lost rollback of set_grace_period() in lockd_down_net()
lockd: added cleanup checks in exit_net hook
grace: replace BUG_ON by WARN_ONCE in exit_net hook
nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class
lockd: remove net pointer from messages
nfsd: remove net pointer from debug messages
nfsd: Fix races with check_stateid_generation()
nfsd: Ensure we check stateid validity in the seqid operation checks
nfsd: Fix race in lock stateid creation
nfsd4: move find_lock_stateid
nfsd: Ensure we don't recognise lock stateids after freeing them
...
Linus Torvalds [Wed, 29 Nov 2017 22:26:50 +0000 (14:26 -0800)]
Merge tag 'for-4.15-rc2-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've collected some fixes in since the pre-merge window freeze.
There's technically only one regression fix for 4.15, but the rest
seems important and candidates for stable.
- fix missing flush bio puts in error cases (is serious, but rarely
happens)
- fix reporting stat::st_blocks for buffered append writes
- fix space cache invalidation
- fix out of bound memory access when setting zlib level
- fix potential memory corruption when fsync fails in the middle
- fix crash in integrity checker
- incremental send fix, path mixup for certain unlink/rename
combination
- pass flags to writeback so compressed writes can be throttled
properly
- error handling fixes"
* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: incremental send, fix wrong unlink path after renaming file
btrfs: tree-checker: Fix false panic for sanity test
Btrfs: fix list_add corruption and soft lockups in fsync
btrfs: Fix wild memory access in compression level parser
btrfs: fix deadlock when writing out space cache
btrfs: clear space cache inode generation always
Btrfs: fix reported number of inode blocks after buffered append writes
Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Btrfs: bail out gracefully rather than BUG_ON
btrfs: dev_alloc_list is not protected by RCU, use normal list_del
btrfs: add missing device::flush_bio puts
btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
Btrfs: add write_flags for compression bio
Linus Torvalds [Wed, 29 Nov 2017 22:19:22 +0000 (14:19 -0800)]
Merge tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze
Pull Microblaze fix from Michal Simek:
"Add missing header to mmu_context_mm.h"
* tag 'microblaze-4.15-rc2' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: add missing include to mmu_context_mm.h
Linus Torvalds [Wed, 29 Nov 2017 22:17:30 +0000 (14:17 -0800)]
Merge git://git./linux/kernel/git/davem/sparc
Pull sparc fix from David Miller:
"Sparc T4 and later cpu bootup regression fix"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix boot on T4 and later.