Subash Abhinov Kasiviswanathan [Tue, 12 Dec 2017 00:30:13 +0000 (17:30 -0700)]
net: qualcomm: rmnet: Process packets over ethernet
Add support to send and receive packets over ethernet.
An example of usage is testing the data path on UML. This can be
achieved by setting up two UML instances in multicast mode and
associating rmnet over the UML ethernet device.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Tue, 12 Dec 2017 00:30:12 +0000 (17:30 -0700)]
net: qualcomm: rmnet: Allow only one rmnet dev per muxid per real dev
Upon de-multiplexing data from one real dev, the packets can be sent
to an unique rmnet device for a given mux id.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Tue, 12 Dec 2017 00:30:11 +0000 (17:30 -0700)]
net: qualcomm: rmnet: Remove the some redundant macros
Multiplexing is always enabled when transmiting from a rmnet device,
so remove the redundant egress macros. De-multiplexing is always
enabled when receiving packets from a rmnet device, so remove those
ingress macros.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Tue, 12 Dec 2017 00:30:10 +0000 (17:30 -0700)]
net: qualcomm: rmnet: Remove the rmnet_map_results enum
Only the success and consumed entries were actually in use.
Use standard error codes instead.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Mon, 11 Dec 2017 23:42:53 +0000 (15:42 -0800)]
tcp: allow TLP in ECN CWR
This patch enables tail loss probe in cwnd reduction (CWR) state
to detect potential losses. Prior to this patch, since the sender
uses PRR to determine the cwnd in CWR state, the combination of
CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls
upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer()
defers sending wait for more ACKs. The ACKs may not come due to
packet losses.
Disallowing TLP when there is unused cwnd had the primary effect
of disallowing TLP when there is TSO deferral, Nagle deferral,
or we hit the rwin limit. Because basically every application
write() or incoming ACK will cause us to run tcp_write_xmit()
to see if we can send more, and then if we sent something we call
tcp_schedule_loss_probe() to see if we should schedule a TLP. At
that point, there are a few common reasons why some cwnd budget
could still be unused:
(a) rwin limit
(b) nagle check
(c) TSO deferral
(d) TSQ
For (d), after the next packet tx completion the TSQ mechanism
will allow us to send more packets, so we don't really need a
TLP (in practice it shouldn't matter whether we schedule one
or not). But for (a), (b), (c) the sender won't send any more
packets until it gets another ACK. But if the whole flight was
lost, or all the ACKs were lost, then we won't get any more ACKs,
and ideally we should schedule and send a TLP to get more feedback.
In particular for a long time we have wanted some kind of timer for
TSO deferral, and at least this would give us some kind of timer
Reported-by: Steve Ibanez <sibanez@stanford.edu>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Nandita Dukkipati <nanditad@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Mon, 11 Dec 2017 23:35:03 +0000 (15:35 -0800)]
net_sched: switch to exit_batch for action pernet ops
Since we now hold RTNL lock in tc_action_net_exit(), it is good to
batch them to speedup tc action dismantle.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Dec 2017 18:25:05 +0000 (13:25 -0500)]
Merge branch 'hv_netvsc-Fix-default-and-limit-of-recv-buffer'
Stephen Hemminger says:
====================
hv_netvsc: Fix default and limit of recv buffer
The default for receive buffer descriptors is not correct, it should
match the default receive buffer size and the upper limit of receive
buffer size is too low. Also, for older versions of Window servers
hosts, different lower limit check is necessary, otherwise the buffer
request will be rejected by the host, resulting vNIC not come up.
This patch set corrects these problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Mon, 11 Dec 2017 16:56:58 +0000 (08:56 -0800)]
hv_netvsc: Fix the TX/RX buffer default sizes
The values were not computed correctly. There are no significant
visible impact, though.
The intended size of RX buffer is 16 MB, and the default slot size is 1728.
So, NETVSC_DEFAULT_RX should be 16*1024*1024 / 1728 = 9709.
The intended size of TX buffer is 1 MB, and the slot size is 6144.
So, NETVSC_DEFAULT_TX should be 1024*1024 / 6144 = 170.
The patch puts the formula directly into the macro, and moves them to
hyperv_net.h, together with related macros.
Fixes: 5023a6db73196 ("netvsc: increase default receive buffer size")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Mon, 11 Dec 2017 16:56:57 +0000 (08:56 -0800)]
hv_netvsc: Fix the receive buffer size limit
The max should be 31 MB on host with NVSP version > 2.
On legacy hosts (NVSP version <=2) only 15 MB receive buffer is allowed,
otherwise the buffer request will be rejected by the host, resulting
vNIC not coming up.
The NVSP version is only available after negotiation. So, we add the
limit checking for legacy hosts in netvsc_init_buf().
Fixes: 5023a6db73196 ("netvsc: increase default receive buffer size")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Dec 2017 16:22:54 +0000 (11:22 -0500)]
Merge branch 'fec-fix-refclk-enable-for-SMSC-LAN8710-20'
Richard Leitner says:
====================
net: fec: fix refclk enable for SMSC LAN8710/20
This patch series fixes the use of the SMSC LAN8710/20 with a Freescale ETH
when the refclk is generated by the FSL.
This patchset depends on the "phylib: Add device reset GPIO support" patch
submitted by Geert Uytterhoeven/Sergei Shtylyov, which was merged to
net-next as commit
bafbdd527d56 ("phylib: Add device reset GPIO support").
Changes v5:
- fix reset delay calculation (max_t instead of min_t)
Changes v4:
- simplify dts parsing
- simplify reset delay evaluation and execution
- fec: ensure to only reset once during fec_enet_open()
- remove dependency notes from commit message
- add reviews and acks
Changes v3:
- use phylib to hard-reset the PHY
- implement reset delays in phylib
- add new phylib API & flag (PHY_RST_AFTER_CLK_EN) to determine if
a PHY is affected
Changes v2:
- simplify and fix fec_reset_phy function to support multiple calls
- include: linux: phy: harmonize phy_id{,_mask} type
- reset the phy instead of not turning the clock on and off
(which would have caused a power consumption regression)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Leitner [Mon, 11 Dec 2017 12:17:00 +0000 (13:17 +0100)]
net: fec: add phy_reset_after_clk_enable() support
Some PHYs (for example the SMSC LAN8710/LAN8720) doesn't allow turning
the refclk on and off again during operation (according to their
datasheet). Nonetheless exactly this behaviour was introduced for power
saving reasons by commit
e8fcfcd5684a ("net: fec: optimize the clock management to save power").
Therefore add support for the phy_reset_after_clk_enable function from
phylib to mitigate this issue.
Generally speaking this issue is only relevant if the ref clk for the
PHY is generated by the SoC and therefore the PHY is configured to
"REF_CLK In Mode". In our specific case (PCB) this problem does occur at
about every 10th to 50th POR of an LAN8710 connected to an i.MX6SOLO
SoC. The typical symptom of this problem is a "swinging" ethernet link.
Similar issues were reported by users of the NXP forum:
https://community.nxp.com/thread/389902
https://community.nxp.com/message/309354
With this patch applied the issue didn't occur for at least a few
hundret PORs of our board.
Fixes: e8fcfcd5684a ("net: fec: optimize the clock management to save power")
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Leitner [Mon, 11 Dec 2017 12:16:59 +0000 (13:16 +0100)]
net: phy: smsc: LAN8710/20: add PHY_RST_AFTER_CLK_EN flag
The Microchip/SMSC LAN8710/LAN8720 PHYs need (according to their
datasheet [1]) a continuous REF_CLK when configured to "REF_CLK In Mode".
Therefore set the PHY_RST_AFTER_CLK_EN flag for those PHYs to let the
ETH driver reset them after the REF_CLK is enabled.
[1] http://ww1.microchip.com/downloads/en/DeviceDoc/
00002165B.pdf
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Leitner [Mon, 11 Dec 2017 12:16:58 +0000 (13:16 +0100)]
phylib: add reset after clk enable support
Some PHYs need the refclk to be a continuous clock. Therefore they don't
allow turning it off and on again during operation. Nonetheless such a
clock switching is performed by some ETH drivers (namely FEC [1]) for
power saving reasons. An example for an affected PHY is the
SMSC/Microchip LAN8720 in "REF_CLK In Mode".
In order to provide a uniform method to overcome this problem this patch
adds a new phy_driver flag (PHY_RST_AFTER_CLK_EN) and corresponding
function phy_reset_after_clk_enable() to the phylib. These should be
used to trigger reset of the PHY after the refclk is switched on again.
[1] commit
e8fcfcd5684a ("net: fec: optimize the clock management to save power")
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Leitner [Mon, 11 Dec 2017 12:16:57 +0000 (13:16 +0100)]
phylib: Add device reset delay support
Some PHYs need a minimum time after the reset gpio was asserted and/or
deasserted. To ensure we meet these timing requirements add two new
optional devicetree parameters for the phy: reset-delay-us and
reset-post-delay-us.
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Dec 2017 16:16:51 +0000 (11:16 -0500)]
Merge branch 'mvpp2-various-improvements'
Antoine Tenart says:
====================
net: mvpp2: various improvements
These patches are sent as a series to avoid any possible conflict, even
though there're not entirely related. I can send them separately if
needed. The series applies on today's net-next tree.
Since v1:
- Removed the patch disabling TSO on allocation errors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 11 Dec 2017 08:13:29 +0000 (09:13 +0100)]
net: mvpp2: adjust the coalescing parameters
This patch adjust the coalescing parameters to the vendor
recommendations for the PPv2 network controller.
Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 11 Dec 2017 08:13:28 +0000 (09:13 +0100)]
net: mvpp2: report the tx-usec coalescing information to ethtool
This patch adds the tx-usec value to the informations reported to
ethtool by the get_coalesce function.
Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 11 Dec 2017 08:13:27 +0000 (09:13 +0100)]
net: mvpp2: align values in ethtool get_coalesce
Cosmetic patch aligning values in the ethtool get_coalesce function.
This patch do not modify in anyway the driver's behaviour.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yan Markman [Mon, 11 Dec 2017 08:13:26 +0000 (09:13 +0100)]
net: mvpp2: split the max ring size from the default one
The Rx/Tx ring sizes can be adjusted thanks to ethtool given specific
network needs. This commit splits the default ring size from its max
value to allow ethtool to vary the parameters in both ways.
Signed-off-by: Yan Markman <ymarkman@marvell.com>
[Antoine: commit message]
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Mon, 11 Dec 2017 08:13:25 +0000 (09:13 +0100)]
net: mvpp2: only free the TSO header buffers when it was allocated
This patch adds a check to only free the TSO header buffer when its
allocation previously succeeded.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 12 Dec 2017 15:53:04 +0000 (10:53 -0500)]
Merge branch 'tcp-better-receiver-autotuning'
Eric Dumazet says:
====================
tcp: better receiver autotuning
Now TCP senders no longer backoff when a drop is detected,
it appears we are very often receive window limited.
This series makes tcp_rcv_space_adjust() slightly more robust
and responsive.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 11 Dec 2017 01:55:04 +0000 (17:55 -0800)]
tcp: smoother receiver autotuning
Back in linux-3.13 (commit
b0983d3c9b13 ("tcp: fix dynamic right sizing"))
I addressed the pressing issues we had with receiver autotuning.
But DRS suffers from extra latencies caused by rcv_rtt_est.rtt_us
drifts. One common problem happens during slow start, since the
apparent RTT measured by the receiver can be inflated by ~50%,
at the end of one packet train.
Also, a single drop can delay read() calls by one RTT, meaning
tcp_rcv_space_adjust() can be called one RTT too late.
By replacing the tri-modal heuristic with a continuous function,
we can offset the effects of not growing 'at the optimal time'.
The curve of the function matches prior behavior if the space
increased by 25% and 50% exactly.
Cost of added multiply/divide is small, considering a TCP flow
typically would run this part of the code few times in its life.
I tested this patch with 100 ms RTT / 1% loss link, 100 runs
of (netperf -l 5), and got an average throughput of 4600 Mbit
instead of 1700 Mbit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 11 Dec 2017 01:55:03 +0000 (17:55 -0800)]
tcp: avoid integer overflows in tcp_rcv_space_adjust()
When using large tcp_rmem[2] values (I did tests with 500 MB),
I noticed overflows while computing rcvwin.
Lets fix this before the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 11 Dec 2017 01:55:02 +0000 (17:55 -0800)]
tcp: do not overshoot window_clamp in tcp_rcv_space_adjust()
While rcvbuf is properly clamped by tcp_rmem[2], rcvwin
is left to a potentially too big value.
It has no serious effect, since :
1) tcp_grow_window() has very strict checks.
2) window_clamp can be mangled by user space to any value anyway.
tcp_init_buffer_space() and companions use tcp_full_space(),
we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Sun, 10 Dec 2017 03:07:26 +0000 (22:07 -0500)]
forcedeth: remove unnecessary structure member
Since both tx_ring and first_tx are the head of tx ring, it not
necessary to use two structure members to statically indicate
the head of tx ring. So first_tx is removed.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Dec 2017 17:08:23 +0000 (12:08 -0500)]
Merge branch 'nfp-dead-code-clean-ups-and-slight-improvements'
Jakub Kicinski says:
====================
nfp: dead code, clean ups and slight improvements
This series contains small clean ups from John and Carl, and brings
no functional changes.
John's improvements target the flower code. First he makes sure we don't
allocate space in FW request messages for MAC matches if the TC rule does
not contain any. The remaining two patches remove some dead code and
unused defines.
Carl follows up with a slight optimization to his recent ethtool FW state
dumps, byte swapping input parameters once instead of the data for every
dumped item.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Carl Heymann [Sat, 9 Dec 2017 03:37:04 +0000 (19:37 -0800)]
nfp: debug dump - decrease endian conversions
Convert the requested dump level parameter to big-endian at the start of
nfp_net_dump_calculate_size() and nfp_net_dump_populate_buffer(), then
compare and assign it directly where needed in the traversal and prolog
code. This decreases the total number of conversions used.
Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Hurley [Sat, 9 Dec 2017 03:37:03 +0000 (19:37 -0800)]
nfp: flower: remove unused defines
Delete match field defines that are not supported at this time.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Hurley [Sat, 9 Dec 2017 03:37:02 +0000 (19:37 -0800)]
nfp: flower: remove dead code paths
Port matching is selected by default on every rule so remove check for it
and delete 'else' side of the statement. Remove nfp_flower_meta_one as now
it will not feature in the code. Rename nfp_flower_meta_two given that one
has been removed.
'Additional metadata' if statement can never be true so remove it as well.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Hurley [Sat, 9 Dec 2017 03:37:01 +0000 (19:37 -0800)]
nfp: flower: do not assume mac/mpls matches
Remove the matching of mac/mpls as a default selection. These are not
necessarily set by a TC rule (unlike the port). Previously a mac/mpls
field would exist in every match and be masked out if not used. This patch
has no impact on functionality but removes unnessary memory assignment in
the match cmsg.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabio Estevam [Fri, 8 Dec 2017 14:11:33 +0000 (12:11 -0200)]
dt-bindings: fec: Make the phy-reset-gpio polarity explicit
The GPIO polarity passed to phy-reset-gpio is ignored by the FEC
driver and it is assumed to be active low.
It can be active high only when the 'phy-reset-active-high' property
is present.
The current examples pass active high polarity and work fine, but
in order to improve the documentation make it explicit what the real
polarity is.
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Dec 2017 16:23:06 +0000 (11:23 -0500)]
Merge branch 'sctp-stream-interleave-part-1'
Xin Long says:
====================
sctp: Implement Stream Interleave: The I-DATA Chunk Supporting User Message Interleaving
Stream Interleave would be Implemented in two Parts:
1. The I-DATA Chunk Supporting User Message Interleaving
2. Interaction with Other SCTP Extensions
Overview in section 1.1 of RFC8260 for Part 1:
This document describes a new chunk carrying payload data called
I-DATA. This chunk incorporates the properties of the current SCTP
DATA chunk, all the flags and fields except the Stream Sequence
Number (SSN), and also adds two new fields in its chunk header -- the
Fragment Sequence Number (FSN) and the Message Identifier (MID). The
FSN is only used for reassembling all fragments that have the same
MID and the same ordering property. The TSN is only used for the
reliable transfer in combination with Selective Acknowledgment (SACK)
chunks.
In addition, the MID is also used for ensuring ordered delivery
instead of using the stream sequence number (the I-DATA chunk omits
an SSN).
As the 1st part of Stream Interleave Implementation, this patchset adds
an ops framework named sctp_stream_interleave with a bunch of stuff that
does lots of things needed somewhere.
Then it defines sctp_stream_interleave_0 to work for normal DATA chunks
and sctp_stream_interleave_1 for I-DATA chunks.
With these functions, hundreds of if-else checks for the different process
on I-DATA chunks would be avoided. Besides, very few codes could be shared
in these two function sets.
In this patchset, it adds some basic variables, structures and socket
options firstly, then implement these functions one by one to add the
procedures for ordered idata gradually, at last adjusts some codes to
make them work for unordered idata.
To make it safe to be implemented and also not break the normal data
chunk process, this feature can't be enabled to use until all stream
interleave codes are completely accomplished.
v1 -> v2:
- fixed a checkpatch warning that a blank line was missed.
- avoided a kbuild warning reported from gcc-4.9.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:09 +0000 (21:04 +0800)]
sctp: add support for the process of unordered idata
Unordered idata process is more complicated than unordered data:
- It has to add mid into sctp_stream_out to save the next mid value,
which is separated from ordered idata's.
- To support pd for unordered idata, another mid and pd_mode need to
be added to save the message id and pd state in sctp_stream_in.
- To make unordered idata reasm easier, it adds a new event queue
to save frags for idata.
The patch mostly adds the samilar reasm functions for unordered idata
as ordered idata's, and also adjusts some other codes on assign_mid,
abort_pd and ulpevent_data for idata.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:08 +0000 (21:04 +0800)]
sctp: implement abort_pd for sctp_stream_interleave
abort_pd is added as a member of sctp_stream_interleave, used to abort
partial delivery for data or idata, called in sctp_cmd_assoc_failed.
Since stream interleave allows to do partial delivery for each stream
at the same time, sctp_intl_abort_pd for idata would be very different
from the old function sctp_ulpq_abort_pd for data.
Note that sctp_ulpevent_make_pdapi will support per stream in this
patch by adding pdapi_stream and pdapi_seq in sctp_pdapi_event, as
described in section 6.1.7 of RFC6458.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:07 +0000 (21:04 +0800)]
sctp: implement start_pd for sctp_stream_interleave
start_pd is added as a member of sctp_stream_interleave, used to
do partial_delivery for data or idata when datalen >= asoc->rwnd
in sctp_eat_data. The codes have been done in last patches, but
they need to be extracted into start_pd, so that it could be used
for SCTP_CMD_PART_DELIVER cmd as well.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:06 +0000 (21:04 +0800)]
sctp: implement renege_events for sctp_stream_interleave
renege_events is added as a member of sctp_stream_interleave, used to
renege some old data or idata in reasm or lobby queue properly to free
some memory for the new data when there's memory stress.
It defines sctp_renege_events for idata, and leaves sctp_ulpq_renege
as it is for data.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:05 +0000 (21:04 +0800)]
sctp: implement enqueue_event for sctp_stream_interleave
enqueue_event is added as a member of sctp_stream_interleave, used to
enqueue either data, idata or notification events into user socket rx
queue.
It replaces sctp_ulpq_tail_event used in the other places with
enqueue_event.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:04 +0000 (21:04 +0800)]
sctp: implement ulpevent_data for sctp_stream_interleave
ulpevent_data is added as a member of sctp_stream_interleave, used to
do the most process in ulpq, including to convert data or idata chunk
to event, reasm them in reasm queue and put them in lobby queue in
right order, and deliver them up to user sk rx queue.
This procedure is described in section 2.2.3 of RFC8260.
It adds most functions for idata here to do the similar process as
the old functions for data. But since the details are very different
between them, the old functions can not be reused for idata.
event->ssn and event->ppid settings are moved to ulpevent_data from
sctp_ulpevent_make_rcvmsg, so that sctp_ulpevent_make_rcvmsg could
work for both data and idata.
Note that mid is added in sctp_ulpevent for idata, __packed has to
be used for defining sctp_ulpevent, or it would exceeds the skb cb
that saves a sctp_ulpevent variable for ulp layer process.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:03 +0000 (21:04 +0800)]
sctp: implement validate_data for sctp_stream_interleave
validate_data is added as a member of sctp_stream_interleave, used
to validate ssn/chunk type for data or mid (message id)/chunk type
for idata, called in sctp_eat_data.
If this check fails, an abort packet will be sent, as said in
section 2.2.3 of RFC8260.
It also adds the process for idata in rx path. As Marcelo pointed
out, there's no need to add event table for idata, but just share
chunk_event_table with data's. It would drop data chunk for idata
and drop idata chunk for data by calling validate_data in
sctp_eat_data.
As last patch did, it also replaces sizeof(struct sctp_data_chunk)
with sctp_datachk_len for rx path.
After this patch, the idata can be accepted and delivered to ulp
layer.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:02 +0000 (21:04 +0800)]
sctp: implement assign_number for sctp_stream_interleave
assign_number is added as a member of sctp_stream_interleave, used
to assign ssn for data or mid (message id) for idata, called in
sctp_packet_append_data. sctp_chunk_assign_ssn is left as it is,
and sctp_chunk_assign_mid is added for sctp_stream_interleave_1.
This procedure is described in section 2.2.2 of RFC8260.
All sizeof(struct sctp_data_chunk) in tx path is replaced with
sctp_datachk_len, to make it right for idata as well. And also
adjust sctp_chunk_is_data for SCTP_CID_I_DATA.
After this patch, idata can be built and sent in tx path.
Note that if sp strm_interleave is set, it has to wait_connect in
sctp_sendmsg, as asoc intl_enable need to be known after 4 shake-
hands, to decide if it should use data or idata later. data and
idata can't be mixed to send in one asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:01 +0000 (21:04 +0800)]
sctp: implement make_datafrag for sctp_stream_interleave
To avoid hundreds of checks for the different process on I-DATA chunk,
struct sctp_stream_interleave is defined as a group of functions used
to replace the codes in some place where it needs to do different job
according to if the asoc intl_enabled is set.
With these ops, it only needs to initialize asoc->stream.si with
sctp_stream_interleave_0 for normal data if asoc intl_enable is 0,
or sctp_stream_interleave_1 for idata if asoc intl_enable is set in
sctp_stream_init.
After that, the members in asoc->stream.si can be used directly in
some special places without checking asoc intl_enable.
make_datafrag is the first member for sctp_stream_interleave, it's
used to make data or idata frags, called in sctp_datamsg_from_user.
The old function sctp_make_datafrag_empty needs to be adjust some
to fit in this ops.
Note that as idata and data chunks have different length, it also
defines data_chunk_len for sctp_stream_interleave to describe the
chunk size.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:04:00 +0000 (21:04 +0800)]
sctp: add basic structures and make chunk function for idata
sctp_idatahdr and sctp_idata_chunk are used to define and parse
I-DATA chunk format, and sctp_make_idata is a function to build
the chunk.
The I-DATA Chunk Format is defined in section 2.1 of RFC8260.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:03:59 +0000 (21:03 +0800)]
sctp: add asoc intl_enable negotiation during 4 shakehands
asoc intl_enable will be set when local sp strm_interleave is set
and there's I-DATA chunk in init and init_ack extensions, as said
in section 2.2.1 of RFC8260.
asoc intl_enable indicates all data will be sent as I-DATA chunks.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Dec 2017 13:03:58 +0000 (21:03 +0800)]
sctp: add stream interleave enable members and sockopt
This patch adds intl_enable in asoc and netns, and strm_interleave in
sctp_sock to indicate if stream interleave is enabled and supported.
netns intl_enable would be set via procfs, but that is not added yet
until all stream interleave codes are completely implemented; asoc
intl_enable will be set when doing 4-shakehands.
sp strm_interleave can be set by sockopt SCTP_INTERLEAVING_SUPPORTED
which is also added in this patch. This socket option is defined in
section 4.3.1 of RFC8260.
Note that strm_interleave can only be set by sockopt when both netns
intl_enable and sp frag_interleave are set.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Thu, 7 Dec 2017 23:15:43 +0000 (15:15 -0800)]
ipvlan: add L2 check for packets arriving via virtual devices
Packets that don't have dest mac as the mac of the master device should
not be entertained by the IPvlan rx-handler. This is mostly true as the
packet path mostly takes care of that, except when the master device is
a virtual device. As demonstrated in the following case -
ip netns add ns1
ip link add ve1 type veth peer name ve2
ip link add link ve2 name iv1 type ipvlan mode l2
ip link set dev iv1 netns ns1
ip link set ve1 up
ip link set ve2 up
ip -n ns1 link set iv1 up
ip addr add 192.168.10.1/24 dev ve1
ip -n ns1 addr 192.168.10.2/24 dev iv1
ping -c2 192.168.10.2
<Works!>
ip neigh show dev ve1
ip neigh show 192.168.10.2 lladdr <random> dev ve1
ping -c2 192.168.10.2
<Still works! Wrong!!>
This patch adds that missing check in the IPvlan rx-handler.
Reported-by: Amit Sikka <amit.sikka@ericsson.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Wed, 6 Dec 2017 23:03:20 +0000 (15:03 -0800)]
netlink: convert netlink tap spinlock to mutex
Both netlink_add_tap() and netlink_remove_tap() are
called in process context, no need to bother spinlock.
Note, in fact, currently we always hold RTNL when calling
these two functions, so we don't need any other lock at
all, but keeping this lock doesn't harm anything.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Wed, 6 Dec 2017 23:03:19 +0000 (15:03 -0800)]
netlink: make netlink tap per netns
nlmon device is not supposed to capture netlink events from
other netns, so instead of filtering events, we can simply
make netlink tap itself per netns.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Dec 2017 14:58:39 +0000 (09:58 -0500)]
Merge branch 'rhashtable-New-features-in-walk-and-bucket'
Tom Herbert says:
====================
rhashtable: New features in walk and bucket
This patch contains some changes to related rhashtable:
- Above allow rhashtable_walk_start to return void
- Add a functon to peek at the next entry during a walk
- Abstract out function to compute a has for a table
- A library function to alloc a spinlocks bucket array
- Call the above function for rhashtable locks allocation
Tested: Exercised using various operations on an ILA xlat
table.
v2:
- Apply feedback from Herbert. Don't change semantics of resize
event reporting and -EAGAIN, just simplify API for callers that
ignore those.
- Add end_of_table in iter to reliably tell when the iterator has
reached to the eno.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 4 Dec 2017 18:31:45 +0000 (10:31 -0800)]
rhashtable: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks. This function is
based on the old alloc_bucket_locks in rhashtable and should
produce the same effect.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 4 Dec 2017 18:31:44 +0000 (10:31 -0800)]
spinlock: Add library function to allocate spinlock buckets array
Add two new library functions: alloc_bucket_spinlocks and
free_bucket_spinlocks. These are used to allocate and free an array
of spinlocks that are useful as locks for hash buckets. The interface
specifies the maximum number of spinlocks in the array as well
as a CPU multiplier to derive the number of spinlocks to allocate.
The number allocated is rounded up to a power of two to make the
array amenable to hash lookup.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 4 Dec 2017 18:31:43 +0000 (10:31 -0800)]
rhashtable: abstract out function to get hash
Split out most of rht_key_hashfn which is calculating the hash into
its own function. This way the hash function can be called separately to
get the hash value.
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 4 Dec 2017 18:31:42 +0000 (10:31 -0800)]
rhashtable: Add rhastable_walk_peek
This function is like rhashtable_walk_next except that it only returns
the current element in the inter and does not advance the iter.
This patch also creates __rhashtable_walk_find_next. It finds the next
element in the table when the entry cached in iter is NULL or at the end
of a slot. __rhashtable_walk_find_next is called from
rhashtable_walk_next and rhastable_walk_peek.
end_of_table is an added field to the iter structure. This indicates
that the end of table was reached (walker.tbl being NULL is not a
sufficient condition for end of table).
Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 4 Dec 2017 18:31:41 +0000 (10:31 -0800)]
rhashtable: Change rhashtable_walk_start to return void
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
like this is common:
ret = rhashtable_walk_start(rhiter);
if (ret && ret != -EAGAIN)
goto out;
Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.
This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in mulitple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Fri, 8 Dec 2017 23:34:13 +0000 (15:34 -0800)]
rtnetlink: fix typo in GSO max segments
Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 10 Dec 2017 03:09:55 +0000 (22:09 -0500)]
Merge git://git./linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Hocko [Wed, 6 Dec 2017 10:27:57 +0000 (11:27 +0100)]
kmemcheck: rip it out for real
Commit
4675ff05de2d ("kmemcheck: rip it out") has removed the code but
for some reason SPDX header stayed in place. This looks like a rebase
mistake in the mmotm tree or the merge mistake. Let's drop those
leftovers as well.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 8 Dec 2017 21:32:44 +0000 (13:32 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
Linus Torvalds [Fri, 8 Dec 2017 21:18:47 +0000 (13:18 -0800)]
Merge tag 'media/v4.15-2' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A series of fixes for the media subsytem:
- The largest amount of fixes in this series is with regards to
comments that aren't kernel-doc, but start with "/**".
A new check added for 4.15 makes it to produce a *huge* amount of
new warnings (I'm compiling here with W=1). Most of the patches in
this series fix those.
No code changes - just comment changes at the source files
- rc: some fixed in order to better handle RC repetition codes
- v4l-async: use the v4l2_dev from the root notifier when matching
sub-devices
- v4l2-fwnode: Check subdev count after checking port
- ov 13858 and et8ek8: compilation fix with randconfigs
- usbtv: a trivial new USB ID addition
- dibusb-common: don't do DMA on stack on firmware load
- imx274: Fix error handling, add MAINTAINERS entry
- sir_ir: detect presence of port"
* tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (50 commits)
media: imx274: Fix error handling, add MAINTAINERS entry
media: v4l: async: use the v4l2_dev from the root notifier when matching sub-devices
media: v4l2-fwnode: Check subdev count after checking port
media: et8ek8: select V4L2_FWNODE
media: ov13858: Select V4L2_FWNODE
media: rc: partial revert of "media: rc: per-protocol repeat period"
media: dvb: i2c transfers over usb cannot be done from stack
media: dvb-frontends: complete kernel-doc markups
media: docs: add documentation for frontend attach info
media: dvb_frontends: fix kernel-doc macros
media: drivers: remove "/**" from non-kernel-doc comments
media: lm3560: add a missing kernel-doc parameter
media: rcar_jpu: fix two kernel-doc markups
media: vsp1: add a missing kernel-doc parameter
media: soc_camera: fix a kernel-doc markup
media: mt2063: fix some kernel-doc warnings
media: radio-wl1273: fix a parameter name at kernel-doc macro
media: s3c-camif: add missing description at s3c_camif_find_format()
media: mtk-vpu: add description for wdt fields at struct mtk_vpu
media: vdec: fix some kernel-doc warnings
...
Linus Torvalds [Fri, 8 Dec 2017 21:11:57 +0000 (13:11 -0800)]
Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"This pull is a bit larger than I'd like but a large bunch of it is
license fixes, AMD wanted to fix the licenses for a bunch of files
that were missing them,
Otherwise a bunch of TTM regression fix since the hugepage support,
some i915 and gvt fixes, a core connector free in a safe context fix,
and one bridge fix"
* tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux: (26 commits)
drm/bridge: analogix dp: Fix runtime PM state in get_modes() callback
Revert "drm/i915: Display WA #1133 WaFbcSkipSegments:cnl, glk"
drm/vc4: Fix false positive WARN() backtrace on refcount_inc() usage
drm/i915: Call i915_gem_init_userptr() before taking struct_mutex
drm/exynos: remove unnecessary function declaration
drm/exynos: remove unnecessary descrptions
drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
drm/exynos: Fix dma-buf import
drm/ttm: swap consecutive allocated pooled pages v4
drm: safely free connectors from connector_iter
drm/i915/gvt: set max priority for gvt context
drm/i915/gvt: Don't mark vgpu context as inactive when preempted
drm/i915/gvt: Limit read hw reg to active vgpu
drm/i915/gvt: Export intel_gvt_render_mmio_to_ring_id()
drm/i915/gvt: Emulate PCI expansion ROM base address register
drm/ttm: swap consecutive allocated cached pages v3
drm/ttm: roundup the shrink request to prevent skip huge pool
drm/ttm: add page order support in ttm_pages_put
drm/ttm: add set_pages_wb for handling page order more than zero
drm/ttm: add page order in page pool
...
Linus Torvalds [Fri, 8 Dec 2017 21:03:02 +0000 (13:03 -0800)]
Merge tag 'md/4.15-rc2' of git://git./linux/kernel/git/shli/md
Pull md fixes from Shaohua Li:
"Some MD fixes.
The notable one is a raid5-cache deadlock bug with dm-raid, others are
not significant"
* tag 'md/4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid1/10: add missed blk plug
md: limit mdstat resync progress to max_sectors
md/r5cache: move mddev_lock() out of r5c_journal_mode_set()
md/raid5: correct degraded calculation in raid5_error
Linus Torvalds [Fri, 8 Dec 2017 21:00:51 +0000 (13:00 -0800)]
Merge tag 'devicetree-fixes-for-4.15-part2' of git://git./linux/kernel/git/robh/linux
Pull DeviceTree fixes from Rob Herring:
"Another set of DT fixes:
- Fixes from overlay code rework. A trifecta of fixes to the locking,
an out of bounds access, and a memory leak in of_overlay_apply()
- Clean-up at25 eeprom binding document
- Remove leading '0x' in unit-addresses from binding docs"
* tag 'devicetree-fixes-for-4.15-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: overlay: Make node skipping in init_overlay_changeset() clearer
of: overlay: Fix out-of-bounds write in init_overlay_changeset()
of: overlay: Fix (un)locking in of_overlay_apply()
of: overlay: Fix memory leak in of_overlay_apply() error path
dt-bindings: eeprom: at25: Document device-specific compatible values
dt-bindings: eeprom: at25: Grammar s/are can/can/
dt-bindings: Remove leading 0x from bindings notation
of: overlay: Remove else after goto
of: Spelling s/changset/changeset/
of: unittest: Remove bogus overlay mutex release from overlay_data_add()
Linus Torvalds [Fri, 8 Dec 2017 20:58:51 +0000 (12:58 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio bugfixes from Michael Tsirkin:
"A couple of minor bugfixes"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_net: fix return value check in receive_mergeable()
virtio_mmio: add cleanup for virtio_mmio_remove
virtio_mmio: add cleanup for virtio_mmio_probe
Linus Torvalds [Fri, 8 Dec 2017 20:53:43 +0000 (12:53 -0800)]
Merge tag 'for-linus-4.15-rc3-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Just two small fixes for the new pvcalls frontend driver"
* tag 'for-linus-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pvcalls: Fix a check in pvcalls_front_remove()
xen/pvcalls: check for xenbus_read() errors
Linus Torvalds [Fri, 8 Dec 2017 20:52:09 +0000 (12:52 -0800)]
Merge tag 'powerpc-4.15-4' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One notable fix for kexec on Power9, where we were not clearing MMU
PID properly which sometimes leads to hangs. Finally debugged to a
root cause by Nick.
A revert of a patch which tried to rework our panic handling to get
more output on the console, but inadvertently broke reporting the
panic to the hypervisor, which apparently people care about.
Then a fix for an oops in the PMU code, and finally some s/%p/%px/ in
xmon.
Thanks to: David Gibson, Nicholas Piggin, Ravi Bangoria"
* tag 'powerpc-4.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/xmon: Don't print hashed pointers in xmon
powerpc/64s: Initialize ISAv3 MMU registers before setting partition table
Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"
powerpc/perf: Fix oops when grouping different pmu events
David S. Miller [Fri, 8 Dec 2017 19:53:54 +0000 (14:53 -0500)]
Merge tag 'linux-can-fixes-for-4.15-
20171208' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2017-12-08
this is a pull request of 6 patches for net/master.
Martin Kelly provides 5 patches for various USB based CAN drivers, that
properly cancel the URBs on adapter unplug, so that the driver doesn't
end up in an endless loop. Stephane Grosjean provides a patch to restart
the tx queue if zero length packages are transmitted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Girish Moodalbail [Fri, 8 Dec 2017 14:03:26 +0000 (06:03 -0800)]
macvlan: fix memory hole in macvlan_dev
Move 'macaddr_count' from after 'netpoll' to after 'nest_level' to pack
and reduce a memory hole.
Fixes: 88ca59d1aaf28c25 (macvlan: remove unused fields in struct macvlan_dev)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Dec 2017 19:48:49 +0000 (14:48 -0500)]
Merge tag 'wireless-drivers-for-davem-2017-12-08' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.15
Second set of fixes for 4.15. This time a lot of iwlwifi patches and
two brcmfmac patches. Most important here are the MIC and IVC fixes
for iwlwifi to unbreak 9000 series.
iwlwifi
* fix rate-scaling to not start lowest possible rate
* fix the TX queue hang detection for AP/GO modes
* fix the TX queue hang timeout in monitor interfaces
* fix packet injection
* remove a wrong error message when dumping PCI registers
* fix race condition with RF-kill
* tell mac80211 when the MIC has been stripped (9000 series)
* tell mac80211 when the IVC has been stripped (9000 series)
* add 2 new PCI IDs, one for 9000 and one for 22000
* fix a queue hang due during a P2P Remain-on-Channel operation
brcmfmac
* fix a race which sometimes caused a crash during sdio unbind
* fix a kernel-doc related build error
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde [Fri, 8 Dec 2017 11:18:59 +0000 (12:18 +0100)]
slip: sl_alloc(): remove unused parameter "dev_t line"
The first and only parameter of sl_alloc() is unused, so remove it.
Fixes: 5342b77c4123 slip: ("Clean up create and destroy")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 8 Dec 2017 09:24:20 +0000 (10:24 +0100)]
net: mvpp2: fix the RSS table entry offset
The macro used to access or set an RSS table entry was using an offset
of 8, while it should use an offset of 0. This lead to wrongly configure
the RSS table, not accessing the right entries.
Fixes: 1d7d15d79fb4 ("net: mvpp2: initialize the RSS tables")
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Dec 2017 19:31:51 +0000 (14:31 -0500)]
Merge branch 'cxgb4-collect-hardware-logs-via-ethtool'
Rahul Lakkireddy says:
====================
cxgb4: collect hardware logs via ethtool
Collect more hardware logs via ethtool --get-dump facility.
Patch 1 collects on-chip memory layout information.
Patch 2 collects on-chip MC memory dumps.
Patch 3 collects HMA memory dump.
Patch 4 evaluates and skips TX and RX payload regions in memory dumps.
Patch 5 collects egress and ingress SGE queue contexts.
Patch 6 collects PCIe configuration logs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:41 +0000 (09:48 +0530)]
cxgb4: collect PCIe configuration logs
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:40 +0000 (09:48 +0530)]
cxgb4: collect egress and ingress SGE queue contexts
Use meminfo to identify the egress and ingress context regions and
fetch all valid contexts from those regions. Also flush all contexts
before attempting collection to prevent stale information.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:39 +0000 (09:48 +0530)]
cxgb4: skip TX and RX payload regions in memory dumps
Use meminfo to identify TX and RX payload regions and skip them in
collection of EDC, MC, and HMA.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:38 +0000 (09:48 +0530)]
cxgb4: collect HMA memory dump
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:37 +0000 (09:48 +0530)]
cxgb4: collect MC memory dump
Use meminfo to get base address and size of MC memory. Also use same
meminfo for EDC memory dumps.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 8 Dec 2017 04:18:36 +0000 (09:48 +0530)]
cxgb4: collect on-chip memory information
Collect memory layout of various on-chip memory regions. Move code
for collecting on-chip memory information to cudbg_lib.c and update
cxgb4_debugfs.c to use the common function. Also include
cudbg_entity.h before cudbg_lib.h to avoid adding cudbg entity
structure forward declarations in cudbg_lib.h.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Dec 2017 19:23:00 +0000 (14:23 -0500)]
Merge branch 'veth-and-GSO-maximums'
Stephen Hemminger says:
====================
veth and GSO maximums
This is the more general way to solving the issue of GSO limits
not being set correctly for containers on Azure. If a GSO packet
is sent to host that exceeds the limit (reported by NDIS), then
the host is forced to do segmentation in software which has noticeable
performance impact.
The core rtnetlink infrastructure already has the messages and
infrastructure to allow changing gso limits. With an updated iproute2
the following already works:
# ip li set dev dummy0 gso_max_size 30000
These patches are about making it easier with veth.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Thu, 7 Dec 2017 23:40:20 +0000 (15:40 -0800)]
veth: set peer GSO values
When new veth is created, and GSO values have been configured
on one device, clone those values to the peer.
For example:
# ip link add dev vm1 gso_max_size 65530 type veth peer name vm2
This should create vm1 <--> vm2 with both having GSO maximum
size set to 65530.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Thu, 7 Dec 2017 23:40:19 +0000 (15:40 -0800)]
rtnetlink: allow GSO maximums to be set on device creation
Netlink device already allows changing GSO sizes with
ip set command. The part that is missing is allowing overriding
GSO settings on device creation.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Cassel [Thu, 7 Dec 2017 22:56:10 +0000 (23:56 +0100)]
net: stmmac: fix broken dma_interrupt handling for multi-queues
There is nothing that says that number of TX queues == number of RX
queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue.
This code is obviously wrong:
for (chan = 0; chan < tx_channel_count; chan++) {
struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];
priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an
uninitialized napi_struct to __napi_schedule(), causing us to
crash in net_rx_action(), because napi_struct->poll is zero.
[12846.759880] Unable to handle kernel NULL pointer dereference at virtual address
00000000
[12846.768014] pgd = (ptrval)
[12846.770742] [
00000000] *pgd=
39ec7831, *pte=
00000000, *ppte=
00000000
[12846.777023] Internal error: Oops:
80000007 [#1] PREEMPT SMP ARM
[12846.782942] Modules linked in:
[12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted
4.15.0-rc2-00285-gf5fb5f2f39a7 #36
[12846.794177] Hardware name: Axis ARTPEC-6 Platform
[12846.798879] task: (ptrval) task.stack: (ptrval)
[12846.803407] PC is at 0x0
[12846.805942] LR is at net_rx_action+0x274/0x43c
[12846.810383] pc : [<
00000000>] lr : [<
80bff064>] psr:
200e0113
[12846.816648] sp :
b90d9ae8 ip :
b90d9ae8 fp :
b90d9b44
[12846.821871] r10:
00000008 r9 :
0013250e r8 :
00000100
[12846.827094] r7 :
0000012c r6 :
00000000 r5 :
00000001 r4 :
bac84900
[12846.833619] r3 :
00000000 r2 :
b90d9b08 r1 :
00000000 r0 :
bac84900
Since each DMA channel can be used for rx and tx simultaneously,
the current code should probably be rewritten so that napi_struct is
embedded in a new struct stmmac_channel.
That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue
where we got the IRQ, instead of looping through all tx queues.
This is also how the xgbe driver does it (another driver for this IP).
Fixes: c22a3f48ef99 ("net: stmmac: adding multiple napi mechanism")
Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Dec 2017 19:14:12 +0000 (14:14 -0500)]
Merge branch 'tcp-RACK-loss-recovery-bug-fixes'
Yuchung Cheng says:
====================
tcp: RACK loss recovery bug fixes
This patch set has four minor bug fixes in TCP RACK loss recovery.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 7 Dec 2017 19:33:33 +0000 (11:33 -0800)]
tcp: evaluate packet losses upon RTT change
RACK skips an ACK unless it advances the most recently delivered
TX timestamp (rack.mstamp). Since RACK also uses the most recent
RTT to decide if a packet is lost, RACK should still run the
loss detection whenever the most recent RTT changes. For example,
an ACK that does not advance the timestamp but triggers the cwnd
undo due to reordering, would then use the most recent (higher)
RTT measurement to detect further losses.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 7 Dec 2017 19:33:32 +0000 (11:33 -0800)]
tcp: fix off-by-one bug in RACK
RACK should mark a packet lost when remaining wait time is zero.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 7 Dec 2017 19:33:31 +0000 (11:33 -0800)]
tcp: always evaluate losses in RACK upon undo
When sender detects spurious retransmission, all packets
marked lost are remarked to be in-flight. However some may
be considered lost based on its timestamps in RACK. This patch
forces RACK to re-evaluate, which may be skipped previously if
the ACK does not advance RACK timestamp.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 7 Dec 2017 19:33:30 +0000 (11:33 -0800)]
tcp: correctly test congestion state in RACK
RACK does not test the loss recovery state correctly to compute
the reordering window. It assumes if lost_out is zero then TCP is
not in loss recovery. But it can be zero during recovery before
calling tcp_rack_detect_loss(): when an ACK acknowledges all
packets marked lost before receiving this ACK, but has not yet
to discover new ones by tcp_rack_detect_loss(). The fix is to
simply test the congestion state directly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Egil Hjelmeland [Thu, 7 Dec 2017 18:56:04 +0000 (19:56 +0100)]
net: dsa: lan9303: Protect ALR operations with mutex
ALR table operations are a sequence of related register operations which
should be protected from concurrent access. The alr_cache should also be
protected. Add alr_mutex doing that.
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 8 Dec 2017 18:27:27 +0000 (19:27 +0100)]
net: sched: fix use-after-free in tcf_block_put_ext
Since the block is freed with last chain being put, once we reach the
end of iteration of list_for_each_entry_safe, the block may be
already freed. I'm hitting this only by creating and deleting clsact:
[ 202.171952] ==================================================================
[ 202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
[ 202.187590] Read of size 8 at addr
ffff880225539a80 by task tc/796
[ 202.194508]
[ 202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
[ 202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[ 202.213613] Call Trace:
[ 202.216369] dump_stack+0xda/0x169
[ 202.220192] ? dma_virt_map_sg+0x147/0x147
[ 202.224790] ? show_regs_print_info+0x54/0x54
[ 202.229691] ? tcf_chain_destroy+0x1dc/0x250
[ 202.234494] print_address_description+0x83/0x3d0
[ 202.239781] ? tcf_block_put_ext+0x240/0x390
[ 202.244575] kasan_report+0x1ba/0x460
[ 202.248707] ? tcf_block_put_ext+0x240/0x390
[ 202.253518] tcf_block_put_ext+0x240/0x390
[ 202.258117] ? tcf_chain_flush+0x290/0x290
[ 202.262708] ? qdisc_hash_del+0x82/0x1a0
[ 202.267111] ? qdisc_hash_add+0x50/0x50
[ 202.271411] ? __lock_is_held+0x5f/0x1a0
[ 202.275843] clsact_destroy+0x3d/0x80 [sch_ingress]
[ 202.281323] qdisc_destroy+0xcb/0x240
[ 202.285445] qdisc_graft+0x216/0x7b0
[ 202.289497] tc_get_qdisc+0x260/0x560
Fix this by holding the block also by chain 0 and put chain 0
explicitly, out of the list_for_each_entry_safe loop at the very
end of tcf_block_put_ext.
Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calvin Owens [Fri, 8 Dec 2017 17:05:26 +0000 (09:05 -0800)]
bnxt_en: Fix sources of spurious netpoll warnings
After applying
2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and
903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."),
we still see the following WARN fire:
------------[ cut here ]------------
WARNING: CPU: 0 PID:
1875170 at net/core/netpoll.c:165 netpoll_poll_dev+0x15a/0x160
bnxt_poll+0x0/0xd0 exceeded budget in poll
<snip>
Call Trace:
[<
ffffffff814be5cd>] dump_stack+0x4d/0x70
[<
ffffffff8107e013>] __warn+0xd3/0xf0
[<
ffffffff8107e07f>] warn_slowpath_fmt+0x4f/0x60
[<
ffffffff8179519a>] netpoll_poll_dev+0x15a/0x160
[<
ffffffff81795f38>] netpoll_send_skb_on_dev+0x168/0x250
[<
ffffffff817962fc>] netpoll_send_udp+0x2dc/0x440
[<
ffffffff815fa9be>] write_ext_msg+0x20e/0x250
[<
ffffffff810c8125>] call_console_drivers.constprop.23+0xa5/0x110
[<
ffffffff810c9549>] console_unlock+0x339/0x5b0
[<
ffffffff810c9a88>] vprintk_emit+0x2c8/0x450
[<
ffffffff810c9d5f>] vprintk_default+0x1f/0x30
[<
ffffffff81173df5>] printk+0x48/0x50
[<
ffffffffa0197713>] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core]
[<
ffffffffa0197b9b>] edac_mc_handle_error+0x42b/0x6e0 [edac_core]
[<
ffffffffa01c3a60>] sbridge_mce_output_error+0x410/0x10d0 [sb_edac]
[<
ffffffffa01c47cc>] sbridge_check_error+0xac/0x130 [sb_edac]
[<
ffffffffa0197f3c>] edac_mc_workq_function+0x3c/0x90 [edac_core]
[<
ffffffff81095f8b>] process_one_work+0x19b/0x480
[<
ffffffff810967ca>] worker_thread+0x6a/0x520
[<
ffffffff8109c7c4>] kthread+0xe4/0x100
[<
ffffffff81884c52>] ret_from_fork+0x22/0x40
This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting
in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually
given a non-zero budget.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 8 Dec 2017 18:32:27 +0000 (13:32 -0500)]
Merge branch 'lockless-qdisc-series'
John Fastabend says:
====================
lockless qdisc series
This series adds support for building lockless qdiscs. This is
the result of noticing the qdisc lock is a common hot-spot in
perf analysis of the Linux network stack, especially when testing
with high packet per second rates. However, nothing is free and
most qdiscs rely on the qdisc lock for their data structures so
each qdisc must be converted on a case by case basis. In this
series, to kick things off, we make pfifo_fast, mq, and mqprio
lockless. Follow up series can address additional qdiscs as needed.
For example sch_tbf might be useful. To allow this the lockless
design is an opt-in flag. In some future utopia we convert all
qdiscs and we get to drop this case analysis, but in order to
make progress we live in the real-world.
There are also a handful of optimizations I have behind this
series and a few code cleanups that I couldn't figure out how
to fit neatly into this series with out increasing the patch
count. Once this is in additional patches can address this. The
most notable is in skb_dequeue we can push the consumer lock
out a bit and consume multiple skbs off the skb_array in pfifo
fast per iteration. Ideally we could push arrays of packets at
drivers as well but we would need the infrastructure for this.
The other notable improvement is to do less locking in the
overrun cases where bad tx queue list and gso_skb are being
hit. Although, nice in theory in practice this is the error
case and I haven't found a benchmark where this matters yet.
For testing...
My first test case uses multiple containers (via cilium) where
multiple client containers use 'wrk' to benchmark connections with
a server container running lighttpd. Where lighttpd is configured
to use multiple threads, one per core. Additionally this test has
a proxy agent running so all traffic takes an extra hop through a
proxy container. In these cases each TCP packet traverses the egress
qdisc layer at least four times and the ingress qdisc layer an
additional four times. This makes for a good stress test IMO, perf
details below.
The other micro-benchmark I run is injecting packets directly into
qdisc layer using pktgen. This uses the benchmark script,
./pktgen_bench_xmit_mode_queue_xmit.sh
Benchmarks taken in two cases, "base" running latest net-next no
changes to qdisc layer and "qdisc" tests run with qdisc lockless
updates. Numbers reported in req/sec. All virtual 'veth' devices
run with pfifo_fast in the qdisc test case.
`wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"`
conns 16 32 64 1024
-----------------------------------------------
base: 18831 20201 21393 29151
qdisc: 19309 21063 23899 29265
notice in all cases we see performance improvement when running
with qdisc case.
Microbenchmarks using pktgen are as follows,
`pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c
20000000
base(mq): 2.1Mpps
base(pfifo_fast): 2.1Mpps
qdisc(mq): 2.6Mpps
qdisc(pfifo_fast): 2.6Mpps
notice numbers are the same for mq and pfifo_fast because only
testing a single thread here. In both tests we see a nice bump
in performance gain. The key with 'mq' is it is already per
txq ring so contention is minimal in the above cases. Qdiscs
such as tbf or htb which have more contention will likely show
larger gains when/if lockless versions are implemented.
Thanks to everyone who helped with this work especially Daniel
Borkmann, Eric Dumazet and Willem de Bruijn for discussing the
design and reviewing versions of the code.
Changes from the RFC: dropped a couple patches off the end,
fixed a bug with skb_queue_walk_safe not unlinking skb in all
cases, fixed a lockdep splat with pfifo_fast_destroy not calling
*_bh lock variant, addressed _most_ of Willem's comments, there
was a bug in the bulk locking (final patch) of the RFC series.
@Willem, I left out lockdep annotation for a follow on series
to add lockdep more completely, rather than just in code I
touched.
Comments and feedback welcome.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:58:19 +0000 (09:58 -0800)]
net: sched: pfifo_fast use skb_array
This converts the pfifo_fast qdisc to use the skb_array data structure
and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
the lockless bit that can be a child of a qdisc requiring locking. So
we add logic to clear the lock bit on initialization in these cases when
the qdisc graft operation occurs.
This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:57:59 +0000 (09:57 -0800)]
net: skb_array: expose peek API
This adds a peek routine to skb_array.h for use with qdisc.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:57:39 +0000 (09:57 -0800)]
net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.
This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:57:20 +0000 (09:57 -0800)]
net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
The sch_mq qdisc creates a sub-qdisc per tx queue which are then
called independently for enqueue and dequeue operations. However
statistics are aggregated and pushed up to the "master" qdisc.
This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.
Also exports __gnet_stats_copy_queue() to use as a helper function.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:57:00 +0000 (09:57 -0800)]
net: sched: helpers to sum qlen and qlen for per cpu logic
Add qdisc qlen helper routines for lockless qdiscs to use.
The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed along with a helper to return
the summation of per cpu stats.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:56:42 +0000 (09:56 -0800)]
net: sched: check for frozen queue before skb_bad_txq check
I can not think of any reason to pull the bad txq skb off the qdisc if
the txq we plan to send this on is still frozen. So check for frozen
queue first and abort before dequeuing either skb_bad_txq skb or
normal qdisc dequeue() skb.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:56:23 +0000 (09:56 -0800)]
net: sched: use skb list for skb_bad_tx
Similar to how gso is handled use skb list for skb_bad_tx this is
required with lockless qdiscs because we may have multiple cores
attempting to push skbs into skb_bad_tx concurrently
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:56:04 +0000 (09:56 -0800)]
net: sched: drop qdisc_reset from dev_graft_qdisc
In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
a rcu grace period and call qdisc_rcu_free(). At which point
gso_cpu_skb is free'd along with all stats so no need to zero stats
and gso_cpu_skb from the graft operation itself.
Further after dropping the qdisc locks we can not continue to call
qdisc_reset before waiting an rcu grace period so that the qdisc is
detached from all cpus. By removing the qdisc_reset() here we get
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.
Note, a refcnt greater than 1 would cause the destroy operation to
be aborted however if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:55:45 +0000 (09:55 -0800)]
net: sched: explicit locking in gso_cpu fallback
This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless in the case we
overrun the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once its possible to have multiple
sk_buffs here so we turn gso_skb into a queue.
This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:55:26 +0000 (09:55 -0800)]
net: sched: a dflt qdisc may be used with per cpu stats
Enable dflt qdisc support for per cpu stats before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.
This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Dec 2017 17:55:07 +0000 (09:55 -0800)]
net: sched: provide per cpu qstat helpers
The per cpu qstats support was added with per cpu bstat support which
is currently used by the ingress qdisc. This patch adds a set of
helpers needed to make other qdiscs that use qstats per cpu as well.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>