David S. Miller [Thu, 20 Dec 2018 18:53:28 +0000 (10:53 -0800)]
Merge git://git./linux/kernel/git/davem/net
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 20 Dec 2018 16:26:16 +0000 (08:26 -0800)]
Merge branch 'bnxt_en-next'
Michael Chan says:
====================
bnxt_en: Update for net-next.
Three main changes in this series, besides the usual firmware spec
update:
1. Add support for a new firmware communication channel direct to the
firmware processor that handles flow offloads. This speeds up
flow offload operations.
2. Use 64-bit internal flow handles to increase the number of flows
that can be offloaded.
3. Add level-2 context memory paging so that we can configure more
context memory for RDMA on the 57500 chips. Allocate more context
memory if RDMA is enabled on the 57500 chips.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 20 Dec 2018 08:38:53 +0000 (03:38 -0500)]
bnxt_en: Adjust default RX coalescing ticks to 10 us.
For a little better performance on faster machines and faster link
speeds.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:52 +0000 (03:38 -0500)]
bnxt_en: Support for 64-bit flow handle.
Older firmware only supports 16-bit flow handle, because of which the
number of flows that can be offloaded can’t scale beyond a point.
Newer firmware supports 64-bit flow handle enabling the host to scale
upto millions of flows. With the new 64-bit flow handle support, driver
has to query flow stats in a different way compared to the older approach.
This patch adds support for 64-bit flow handle and new way to query
flow stats.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 20 Dec 2018 08:38:51 +0000 (03:38 -0500)]
bnxt_en: Increase context memory allocations on 57500 chips for RDMA.
If RDMA is supported on the 57500 chip, increase context memory
allocations for the resources used by RDMA.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 20 Dec 2018 08:38:50 +0000 (03:38 -0500)]
bnxt_en: Add Level 2 context memory paging support.
Add the new functions bnxt_alloc_ctx_pg_tbls()/bnxt_free_ctx_pg_tbls()
to allocate and free pages for context memory. The new functions
will handle the different levels of paging support and allocate/free
the pages accordingly using the existing functions.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 20 Dec 2018 08:38:49 +0000 (03:38 -0500)]
bnxt_en: Enhance bnxt_alloc_ring()/bnxt_free_ring().
To support level 2 context page memory structures, enhance the
bnxt_ring_mem_info structure with a "depth" field to specify the page
level and add a flag to specify using full pages for L1 and L2 page
tables. This is needed to support RDMA functionality on 57500 chips
since RDMA requires more context memory.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:48 +0000 (03:38 -0500)]
bnxt_en: Add support for 2nd firmware message channel.
Earlier, some of the firmware commands (ex: CFA_FLOW_*) which are processed
by KONG processor were sent to the CHIMP processor from the host. This
approach was taken as there was no direct message channel to KONG.
CHIMP in turn used to send them to KONG. Newer firmware supports a new
message channel which the host can send messages directly to the KONG
processor.
This patch adds support for required changes needed in the driver
to support direct KONG message channel. This speeds up flow related
messages sent to the firmware for CLS_FLOWER offload.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:47 +0000 (03:38 -0500)]
bnxt_en: Introduce bnxt_get_hwrm_resp_addr & bnxt_get_hwrm_seq_id routines.
These routines will be enhanced in the subsequent patch to
return the 2nd firmware comm. channel's hwrm response address &
sequence id respectively.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:46 +0000 (03:38 -0500)]
bnxt_en: Avoid arithmetic on void * pointer.
Typecast hwrm_cmd_resp_addr to (u8 *) from (void *) before doing
arithmetic.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:45 +0000 (03:38 -0500)]
bnxt_en: Use macros for firmware message doorbell offsets.
In preparation for adding a 2nd communication channel to firmware.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Thu, 20 Dec 2018 08:38:44 +0000 (03:38 -0500)]
bnxt_en: Set hwrm_intr_seq_id value to its inverted value.
Set hwrm_intr_seq_id value to its inverted value instead of
HWRM_SEQ_INVALID, when an hwrm completion of type
CMPL_BASE_TYPE_HWRM_DONE is received. This will enable us to use
the complete 16-bit sequence ID space.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 20 Dec 2018 08:38:43 +0000 (03:38 -0500)]
bnxt_en: Update firmware interface spec. to 1.10.0.33.
The major changes are in the flow offload firmware APIs.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 20 Dec 2018 16:21:47 +0000 (08:21 -0800)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-12-20
Two last patches for this release cycle:
1) Remove an unused variable in xfrm_policy_lookup_bytype().
From YueHaibing.
2) Fix possible infinite loop in __xfrm6_tunnel_alloc_spi().
Also from YueHaibing.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 20 Dec 2018 15:35:16 +0000 (07:35 -0800)]
Merge tag 'm68k-for-v4.20-tag2' of git://git./linux/kernel/git/geert/linux-m68k
Pull m68k fix from Geert Uytterhoeven:
"Fix memblock-related crashes"
* tag 'm68k-for-v4.20-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: Fix memblock-related crashes
Linus Torvalds [Thu, 20 Dec 2018 15:33:09 +0000 (07:33 -0800)]
Merge tag 'kbuild-fixes-v4.20-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fix from Masahiro Yamada:
"Fix false positive warning/error about missing library for objtool"
* tag 'kbuild-fixes-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: fix false positive warning/error about missing libelf
Linus Torvalds [Thu, 20 Dec 2018 15:30:37 +0000 (07:30 -0800)]
Merge tag 'char-misc-4.20-rc8' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are three tiny last-minute driver fixes for 4.20-rc8 that resolve
some reported issues, and one MAINTAINERS file update.
All of them are related to the hyper-v subsystem, it seems people are
actually testing and using it now, which is nice to see :)
The fixes are:
- uio_hv_generic: fix for opening multiple times
- Remove PCI dependancy on hyperv drivers
- return proper error code for an unopened channel.
And Sasha has signed up to help out with the hyperv maintainership.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.20-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
x86, hyperv: remove PCI dependency
MAINTAINERS: Patch monkey for the Hyper-V code
uio_hv_generic: set callbacks on open
Linus Torvalds [Thu, 20 Dec 2018 15:29:11 +0000 (07:29 -0800)]
Merge tag 'tty-4.20-rc8' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fix from Greg KH:
"Here is a single fix, a revert, for the 8250 serial driver to resolve
a reported problem.
There was some attempted patches to fix the issue, but people are
arguing about them, so reverting the patch to revert back to the 4.19
and older behavior is the best thing to do at this late in the release
cycle.
The revert has been in linux-next with no reported issues"
* tag 'tty-4.20-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "serial: 8250: Fix clearing FIFOs in RS485 mode again"
Linus Torvalds [Thu, 20 Dec 2018 15:27:39 +0000 (07:27 -0800)]
Merge tag 'usb-4.20-rc8' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes and ids from Greg KH:
"Here are some late xhci fixes for 4.20-rc8 as well as a few new device
ids for the option usb-serial driver.
The xhci fixes resolve some many-reported issues and all of these have
been in linux-next for a while with no reported problems"
* tag 'usb-4.20-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
USB: serial: option: add Telit LN940 series
USB: serial: option: add Fibocom NL668 series
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
USB: serial: option: add GosunCn ZTE WeLink ME3630
USB: serial: option: add HP lt4132
Linus Torvalds [Thu, 20 Dec 2018 15:25:31 +0000 (07:25 -0800)]
Merge tag 'mmc-v4.20-rc7' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Restore code to allow BKOPS and CACHE ctrl even if no HPI support
- Reset HPI enabled state during re-init
- Use a default minimum timeout when enabling CACHE ctrl
MMC host:
- omap_hsmmc: Fix DMA API warning
- sdhci-tegra: Fix dt parsing of SDMMC pads autocal values
- Correct register accesses when enabling v4 mode"
* tag 'mmc-v4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
mmc: core: Reset HPI enabled state during re-init and in case of errors
mmc: omap_hsmmc: fix DMA API warning
mmc: tegra: Fix for SDMMC pads autocal parsing from dt
mmc: sdhci: Fix sdhci_do_enable_v4_mode
Dave Chinner [Thu, 20 Dec 2018 12:23:24 +0000 (23:23 +1100)]
iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"
This reverts commit
61c6de667263184125d5ca75e894fcad632b0dd3.
The reverted commit added page reference counting to iomap page
structures that are used to track block size < page size state. This
was supposed to align the code with page migration page accounting
assumptions, but what it has done instead is break XFS filesystems.
Every fstests run I've done on sub-page block size XFS filesystems
has since picking up this commit 2 days ago has failed with bad page
state errors such as:
# ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
....
SECTION -- xfs
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
generic/038 454s ...
run fstests generic/038 at 2018-12-20 18:43:05
XFS (sdc): Unmounting Filesystem
XFS (sdc): Mounting V5 Filesystem
XFS (sdc): Ending clean mount
BUG: Bad page state in process kswapd0 pfn:3a7fa
page:
ffffea0000ccbeb0 count:0 mapcount:0 mapping:
ffff88800d9b6360 index:0x1
flags: 0xfffffc0000000()
raw:
000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
raw:
0000000000000001 0000000000000000 00000000ffffffff
page dumped because: non-NULL mapping
CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
Call Trace:
dump_stack+0x67/0x90
bad_page.cold.116+0x8a/0xbd
free_pcppages_bulk+0x4bf/0x6a0
free_unref_page_list+0x10f/0x1f0
shrink_page_list+0x49d/0xf50
shrink_inactive_list+0x19d/0x3b0
shrink_node_memcg.constprop.77+0x398/0x690
? shrink_slab.constprop.81+0x278/0x3f0
shrink_node+0x7a/0x2f0
kswapd+0x34b/0x6d0
? node_reclaim+0x240/0x240
kthread+0x11f/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Disabling lock debugging due to kernel taint
....
The failures are from anyway that frees pages and empties the
per-cpu page magazines, so it's not a predictable failure or an easy
to debug failure.
generic/038 is a reliable reproducer of this problem - it has a 9 in
10 failure rate on one of my test machines. Failure on other
machines have been at random points in fstests runs but every run
has ended up tripping this problem. Hence generic/038 was used to
bisect the failure because it was the most reliable failure.
It is too close to the 4.20 release (not to mention holidays) to
try to diagnose, fix and test the underlying cause of the problem,
so reverting the commit is the only option we have right now. The
revert has been tested against a current tot 4.20-rc7+ kernel across
multiple machines running sub-page block size XFs filesystems and
none of the bad page state failures have been seen.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Ahern [Thu, 20 Dec 2018 04:02:36 +0000 (20:02 -0800)]
neighbor: Use nda_policy for validating attributes in adds and dump requests
Add NDA_PROTOCOL to nda_policy and use the policy for attribute parsing and
validation for adding neighbors and in dump requests. Remove the now duplicate
checks on nla_len.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 20 Dec 2018 07:47:59 +0000 (23:47 -0800)]
Merge branch 'hns3-next'
Peng Li says:
====================
net: hns3: code optimizations & bugfixes for HNS3 driver
This patchset includes bugfixes and code optimizations for the HNS3
ethernet controller driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Thu, 20 Dec 2018 03:52:06 +0000 (11:52 +0800)]
net: hns3: remove redundant variable initialization
This patch removes the redundant variable initialization,
as driver will devm_kzalloc to set value to hdev soon.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Thu, 20 Dec 2018 03:52:05 +0000 (11:52 +0800)]
net: hns3: fix the descriptor index when get rss type
Driver gets rss information from the last descriptor of the packet.
When driver handle the rss type, ring->next_to_clean indicates the
first descriptor of next packet.
This patch fix the descriptor index with "ring->next_to_clean - 1".
Fixes: 232fc64b6e62 ("net: hns3: Add HW RSS hash information to RX skb")
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 20 Dec 2018 03:52:04 +0000 (11:52 +0800)]
net: hns3: don't restore rules when flow director is disabled
When user disables flow director, all the rules will be disabled. But
when reset happens, it will restore all the rules again. It's not
reasonable. This patch fixes it by add flow director status check before
restore fules.
Fixes: 6871af29b3ab ("net: hns3: Add reset handle for flow director")
Fixes: c17852a8932f ("net: hns3: Add support for enable/disable flow director")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 20 Dec 2018 03:52:03 +0000 (11:52 +0800)]
net: hns3: fix vf id check issue when add flow director rule
When add flow director fule for vf, the vf id is used as array
subscript before valid checking, which may cause memory overflow.
Fixes: dd74f815dd41 ("net: hns3: Add support for rule add/delete for flow director")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 20 Dec 2018 03:52:02 +0000 (11:52 +0800)]
net: hns3: reset tqp while doing DOWN operation
While doing DOWN operation, the driver will reclaim the memory which has
already used for TX. If the hardware is processing this memory, it will
cause a RCB error to the hardware. According the hardware's description,
the driver should reset the tqp before reclaim the memory during DOWN.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 20 Dec 2018 03:52:01 +0000 (11:52 +0800)]
net: hns3: add max vector number check for pf
Each pf supports max 64 vectors and 128 tqps. For 2p/4p core scenario,
there may be more than 64 cpus online. So the result of min_t(u16,
num_Online_cpus(), tqp_num) may be more than 64. This patch adds check
for the vector number.
Fixes: dd38c72604dc ("net: hns3: fix for coalesce configuration lost during reset")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Thu, 20 Dec 2018 03:52:00 +0000 (11:52 +0800)]
net: hns3: fix a bug caused by udelay
udelay() in driver may always occupancy processor. If there is only
one cpu in system, the VF driver may initialize fail when insmod
PF and VF driver in the same system. This patch use msleep() to free
cpu when VF wait PF message.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 20 Dec 2018 03:51:59 +0000 (11:51 +0800)]
net: hns3: change default tc state to close
In original codes, default tc value is set to the max tc. It's more
reasonable to close tc by changing default tc value to 1. Users can
enable it with lldp tool when they want to use tc.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 20 Dec 2018 03:51:58 +0000 (11:51 +0800)]
net: hns3: refine the handle for hns3_nic_net_open/stop()
When triggering nic down, there is a time window between bringing down
the protocol stack and stopping the work task. If the net is up in the
time window, it may bring up the protocol stack again.
This patch fixes it by stop the work task at the beginning of
hns3_nic_net_stop(). To keep symmetrical, start the work task at the
end of hns3_nic_net_open().
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 20 Dec 2018 07:34:33 +0000 (23:34 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Off by one in netlink parsing of mac802154_hwsim, from Alexander
Aring.
2) nf_tables RCU usage fix from Taehee Yoo.
3) Flow dissector needs nhoff and thoff clamping, from Stanislav
Fomichev.
4) Missing sin6_flowinfo initialization in SCTP, from Xin Long.
5) Spectrev1 in ipmr and ip6mr, from Gustavo A. R. Silva.
6) Fix r8169 crash when DEBUG_SHIRQ is enabled, from Heiner Kallweit.
7) Fix SKB leak in rtlwifi, from Larry Finger.
8) Fix state pruning in bpf verifier, from Jakub Kicinski.
9) Don't handle completely duplicate fragments as overlapping, from
Michal Kubecek.
10) Fix memory corruption with macb and 64-bit DMA, from Anssi Hannula.
11) Fix TCP fallback socket release in smc, from Myungho Jung.
12) gro_cells_destroy needs to napi_disable, from Lorenzo Bianconi.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (130 commits)
rds: Fix warning.
neighbor: NTF_PROXY is a valid ndm_flag for a dump request
net: mvpp2: fix the phylink mode validation
net/sched: cls_flower: Remove old entries from rhashtable
net/tls: allocate tls context using GFP_ATOMIC
iptunnel: make TUNNEL_FLAGS available in uapi
gro_cell: add napi_disable in gro_cells_destroy
lan743x: Remove MAC Reset from initialization
net/mlx5e: Remove the false indication of software timestamping support
net/mlx5: Typo fix in del_sw_hw_rule
net/mlx5e: RX, Fix wrong early return in receive queue poll
ipv6: explicitly initialize udp6_addr in udp_sock_create6()
bnxt_en: Fix ethtool self-test loopback.
net/rds: remove user triggered WARN_ON in rds_sendmsg
net/rds: fix warn in rds_message_alloc_sgs
ath10k: skip sending quiet mode cmd for WCN3990
mac80211: free skb fraglist before freeing the skb
nl80211: fix memory leak if validate_pae_over_nl80211() fails
net/smc: fix TCP fallback socket release
vxge: ensure data0 is initialized in when fetching firmware version information
...
David S. Miller [Thu, 20 Dec 2018 04:53:18 +0000 (20:53 -0800)]
rds: Fix warning.
>> net/rds/send.c:1109:42: warning: Using plain integer as NULL pointer
Fixes: ea010070d0a7 ("net/rds: fix warn in rds_message_alloc_sgs")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 20 Dec 2018 02:40:48 +0000 (18:40 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull virtio fix from Michael Tsirkin:
"A last-minute fix for a test build"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: fix test build after uio.h change
Linus Torvalds [Thu, 20 Dec 2018 02:38:54 +0000 (18:38 -0800)]
Merge tag 'nfs-for-4.20-6' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix TCP socket disconnection races by ensuring we always call
xprt_disconnect_done() after releasing the socket.
- Fix a race when clearing both XPRT_CONNECTING and XPRT_LOCKED
- Remove xprt_connect_status() so it does not mask errors that should
be handled by call_connect_status()
* tag 'nfs-for-4.20-6' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Remove xprt_connect_status()
SUNRPC: Fix a race with XPRT_CONNECTING
SUNRPC: Fix disconnection races
Linus Torvalds [Thu, 20 Dec 2018 02:27:58 +0000 (18:27 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- One nasty use-after-free bugfix, from this merge window however
- A less nasty use-after-free that can only zero some words at the
beginning of the page, and hence is not really exploitable
- A NULL pointer dereference
- A dummy implementation of an AMD chicken bit MSR that Windows uses
for some unknown reason
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
KVM: X86: Fix NULL deref in vcpu_scan_ioapic
KVM: Fix UAF in nested posted interrupt processing
KVM: fix unregistering coalesced mmio zone from wrong bus
Linus Torvalds [Thu, 20 Dec 2018 02:16:17 +0000 (18:16 -0800)]
Merge tag 'dma-mapping-4.20-4' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
"Fix a regression in dma-direct that didn't take account the magic AMD
memory encryption mask in the DMA address"
* tag 'dma-mapping-4.20-4' of git://git.infradead.org/users/hch/dma-mapping:
dma-direct: do not include SME mask in the DMA supported check
David Ahern [Thu, 20 Dec 2018 00:54:38 +0000 (16:54 -0800)]
neighbor: NTF_PROXY is a valid ndm_flag for a dump request
When dumping proxy entries the dump request has NTF_PROXY set in
ndm_flags. strict mode checking needs to be updated to allow this
flag.
Fixes: 51183d233b5a ("net/neighbor: Update neigh_dump_info for strict data checking")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 19 Dec 2018 23:53:22 +0000 (15:53 -0800)]
neighbor: Initialize protocol when new pneigh_entry are created
pneigh_lookup uses kmalloc versus kzalloc when new entries are allocated.
Given that the newly added protocol field needs to be initialized.
Fixes: df9b0e30d44c ("neighbor: Add protocol attribute")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Oskolkov [Wed, 19 Dec 2018 18:20:09 +0000 (10:20 -0800)]
selftests: net: refactor reuseport_addr_any test
This patch refactors reuseport_add_any selftest a bit:
- makes it more modular (eliminates several copy/pasted blocks);
- skips DCCP tests if DCCP is not supported
V2: added "Signed-off-by" tag.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 19 Dec 2018 17:28:54 +0000 (18:28 +0100)]
net: dsa: mv88e6xxx: Add missing watchdog ops for 6320 family
The 6320 family of switches uses the same watchdog registers as the
6390.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Wed, 19 Dec 2018 17:00:12 +0000 (18:00 +0100)]
net: mvpp2: fix the phylink mode validation
The mvpp2_phylink_validate() sets all modes that are supported by a
given PPv2 port. An mistake made the 10000baseT_Full mode being
advertised in some cases when a port wasn't configured to perform at
10G. This patch fixes this.
Fixes: d97c9f4ab000 ("net: mvpp2: 1000baseX support")
Reported-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roi Dayan [Wed, 19 Dec 2018 16:07:56 +0000 (18:07 +0200)]
net/sched: cls_flower: Remove old entries from rhashtable
When replacing a rule we add the new rule to the rhashtable
but only remove the old if not in skip_sw.
This commit fix this and remove the old rule anyway.
Fixes: 35cc3cefc4de ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ganesh Goudar [Wed, 19 Dec 2018 11:48:22 +0000 (17:18 +0530)]
net/tls: allocate tls context using GFP_ATOMIC
create_ctx can be called from atomic context, hence use
GFP_ATOMIC instead of GFP_KERNEL.
[ 395.962599] BUG: sleeping function called from invalid context at mm/slab.h:421
[ 395.979896] in_atomic(): 1, irqs_disabled(): 0, pid: 16254, name: openssl
[ 395.996564] 2 locks held by openssl/16254:
[ 396.010492] #0:
00000000347acb52 (sk_lock-AF_INET){+.+.}, at: do_tcp_setsockopt.isra.44+0x13b/0x9a0
[ 396.029838] #1:
000000006c9552b5 (device_spinlock){+...}, at: tls_init+0x1d/0x280
[ 396.047675] CPU: 5 PID: 16254 Comm: openssl Tainted: G O 4.20.0-rc6+ #25
[ 396.066019] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0c 09/25/2017
[ 396.083537] Call Trace:
[ 396.096265] dump_stack+0x5e/0x8b
[ 396.109876] ___might_sleep+0x216/0x250
[ 396.123940] kmem_cache_alloc_trace+0x1b0/0x240
[ 396.138800] create_ctx+0x1f/0x60
[ 396.152504] tls_init+0xbd/0x280
[ 396.166135] tcp_set_ulp+0x191/0x2d0
[ 396.180035] ? tcp_set_ulp+0x2c/0x2d0
[ 396.193960] do_tcp_setsockopt.isra.44+0x148/0x9a0
[ 396.209013] __sys_setsockopt+0x7c/0xe0
[ 396.223054] __x64_sys_setsockopt+0x20/0x30
[ 396.237378] do_syscall_64+0x4a/0x180
[ 396.251200] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: df9d4a178022 ("net/tls: sleeping function from invalid context")
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 20 Dec 2018 00:24:59 +0000 (16:24 -0800)]
Merge branch 'mt2712'
Biao Huang says:
====================
add ethernet binding and modify ethernet driver for mt2712
changes in v3:
resend this series base on the latest net-next tree.
changes in v2 as comments from Sean:
1. fix typo.
2. use capital letters for RMII/MII/RGMII in driver and bindings.
v1:
This new series is the result of discussion in:
http://lkml.org/lkml/2018/12/13/1007
http://lkml.org/lkml/2018/12/14/53
1. ethernet binding file move to this series.
2. remove fine tune property in device tree
3. remove fine tune flow in ethernet driver
4. set rgmii timing according to the value in device tree,
and don't care whether phy insert internal delay or not.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang [Wed, 19 Dec 2018 07:22:41 +0000 (15:22 +0800)]
net-next: stmmac: dwmac-mediatek: remove fine-tune property
1. remove fine-tune property and related setting to simplify
the timing adjustment flow.
2. set timing value according to the value from device tree,
and will not care whether PHY insert internal delay.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang [Wed, 19 Dec 2018 07:22:40 +0000 (15:22 +0800)]
net-next: dt-binding: dwmac-mediatek: remove fine-tune property
remove fine-tune property in device tree, modify
the corresponding description in dt-binding.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wenxu [Wed, 19 Dec 2018 06:11:15 +0000 (14:11 +0800)]
iptunnel: make TUNNEL_FLAGS available in uapi
ip l add dev tun type gretap external
ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap
For gretap Key example when the command set the id but don't set the
TUNNEL_KEY flags. There is no key field in the send packet
In the lwtunnel situation, some TUNNEL_FLAGS should can be set by
userspace
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Wed, 19 Dec 2018 22:23:00 +0000 (23:23 +0100)]
gro_cell: add napi_disable in gro_cells_destroy
Add napi_disable routine in gro_cells_destroy since starting from
commit
c42858eaf492 ("gro_cells: remove spinlock protecting receive
queues") gro_cell_poll and gro_cells_destroy can run concurrently on
napi_skbs list producing a kernel Oops if the tunnel interface is
removed while gro_cell_poll is running. The following Oops has been
triggered removing a vxlan device while the interface is receiving
traffic
[ 5628.948853] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[ 5628.949981] PGD 0 P4D 0
[ 5628.950308] Oops: 0002 [#1] SMP PTI
[ 5628.950748] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.20.0-rc6+ #41
[ 5628.952940] RIP: 0010:gro_cell_poll+0x49/0x80
[ 5628.955615] RSP: 0018:
ffffc9000004fdd8 EFLAGS:
00010202
[ 5628.956250] RAX:
0000000000000000 RBX:
ffffe8ffffc08150 RCX:
0000000000000000
[ 5628.957102] RDX:
0000000000000000 RSI:
ffff88802356bf00 RDI:
ffffe8ffffc08150
[ 5628.957940] RBP:
0000000000000026 R08:
0000000000000000 R09:
0000000000000000
[ 5628.958803] R10:
0000000000000001 R11:
0000000000000000 R12:
0000000000000040
[ 5628.959661] R13:
ffffe8ffffc08100 R14:
0000000000000000 R15:
0000000000000040
[ 5628.960682] FS:
0000000000000000(0000) GS:
ffff88803ea00000(0000) knlGS:
0000000000000000
[ 5628.961616] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 5628.962359] CR2:
0000000000000008 CR3:
000000000221c000 CR4:
00000000000006b0
[ 5628.963188] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 5628.964034] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 5628.964871] Call Trace:
[ 5628.965179] net_rx_action+0xf0/0x380
[ 5628.965637] __do_softirq+0xc7/0x431
[ 5628.966510] run_ksoftirqd+0x24/0x30
[ 5628.966957] smpboot_thread_fn+0xc5/0x160
[ 5628.967436] kthread+0x113/0x130
[ 5628.968283] ret_from_fork+0x3a/0x50
[ 5628.968721] Modules linked in:
[ 5628.969099] CR2:
0000000000000008
[ 5628.969510] ---[ end trace
9d9dedc7181661fe ]---
[ 5628.970073] RIP: 0010:gro_cell_poll+0x49/0x80
[ 5628.972965] RSP: 0018:
ffffc9000004fdd8 EFLAGS:
00010202
[ 5628.973611] RAX:
0000000000000000 RBX:
ffffe8ffffc08150 RCX:
0000000000000000
[ 5628.974504] RDX:
0000000000000000 RSI:
ffff88802356bf00 RDI:
ffffe8ffffc08150
[ 5628.975462] RBP:
0000000000000026 R08:
0000000000000000 R09:
0000000000000000
[ 5628.976413] R10:
0000000000000001 R11:
0000000000000000 R12:
0000000000000040
[ 5628.977375] R13:
ffffe8ffffc08100 R14:
0000000000000000 R15:
0000000000000040
[ 5628.978296] FS:
0000000000000000(0000) GS:
ffff88803ea00000(0000) knlGS:
0000000000000000
[ 5628.979327] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 5628.980044] CR2:
0000000000000008 CR3:
000000000221c000 CR4:
00000000000006b0
[ 5628.980929] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 5628.981736] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 5628.982409] Kernel panic - not syncing: Fatal exception in interrupt
[ 5628.983307] Kernel Offset: disabled
Fixes: c42858eaf492 ("gro_cells: remove spinlock protecting receive queues")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bryan Whitehead [Wed, 19 Dec 2018 21:55:15 +0000 (16:55 -0500)]
lan743x: Remove MAC Reset from initialization
The MAC Reset was noticed to erase important EEPROM settings.
It is also unnecessary since a chip wide reset was done earlier
in initialization, and that reset preserves EEPROM settings.
There for this patch removes the unnecessary MAC specific reset.
Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Wed, 19 Dec 2018 23:21:51 +0000 (18:21 -0500)]
virtio: fix test build after uio.h change
Fixes: d38499530e5 ("fs: decouple READ and WRITE from the block layer ops")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
David S. Miller [Wed, 19 Dec 2018 21:44:12 +0000 (13:44 -0800)]
Merge tag 'mlx5-fixes-2018-12-19' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-fixes-2018-12-19
Some fixes for the mlx5 driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Dec 2018 21:37:34 +0000 (13:37 -0800)]
Merge branch 'neigh-get-support'
Roopa Prabhu says:
====================
neigh get support
This series adds support for neigh get similar
to route and recently added fdb get.
v2: fix key len check. and some other fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Wed, 19 Dec 2018 20:51:39 +0000 (12:51 -0800)]
selftests: rtnetlink.sh: add testcase for neigh get
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Wed, 19 Dec 2018 20:51:38 +0000 (12:51 -0800)]
neighbour: register rtnl doit handler
this patch registers neigh doit handler. The doit handler
returns a neigh entry given dst and dev. This is similar
to route and fdb doit (get) handlers. Also moves nda_policy
declaration from rtnetlink.c to neighbour.c
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alaa Hleihel [Sun, 25 Nov 2018 09:46:09 +0000 (11:46 +0200)]
net/mlx5e: Remove the false indication of software timestamping support
mlx5 driver falsely advertises support of software timestamping.
Fix it by removing the false indication.
Fixes: ef9814deafd0 ("net/mlx5e: Add HW timestamping (TS) support")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Yuval Avnery [Thu, 13 Dec 2018 00:26:46 +0000 (02:26 +0200)]
net/mlx5: Typo fix in del_sw_hw_rule
Expression terminated with "," instead of ";", resulted in
set_fte getting bad value for modify_enable_mask field.
Fixes: bd5251dbf156 ("net/mlx5_core: Introduce flow steering destination of type counter")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Tariq Toukan [Sun, 2 Dec 2018 13:45:53 +0000 (15:45 +0200)]
net/mlx5e: RX, Fix wrong early return in receive queue poll
When the completion queue of the RQ is empty, do not immediately return.
If left-over decompressed CQEs (from the previous cycle) were processed,
need to go to the finalization part of the poll function.
Bug exists only when CQE compression is turned ON.
This solves the following issue:
mlx5_core 0000:82:00.1: mlx5_eq_int:544:(pid 0): CQ error on CQN 0xc08, syndrome 0x1
mlx5_core 0000:82:00.1 p4p2: mlx5e_cq_error_event: cqn=0x000c08 event=0x04
Fixes: 4b7dfc992514 ("net/mlx5e: Early-return on empty completion queues")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
David S. Miller [Wed, 19 Dec 2018 20:28:08 +0000 (12:28 -0800)]
Merge branch 'mlxsw-Make-driver-more-robust'
Ido Schimmel says:
====================
mlxsw: Make driver more robust
In recent months we fixed several bugs in the driver that could have
been avoided by re-evaluating some of the involved code paths and by
introducing relevant and comprehensive test cases.
This patchset tries to do that by introducing a set of small and mostly
non-functional changes in addition to a new test. I have further
improvements in mind, but they can be done in a different set.
Patch #1 makes sure we correctly sanitize upper devices of a VLAN
interface.
Patch #2 removes an unexpected behavior from the driver, in which routes
configured on a VLAN interface will cease being offloaded after certain
operations.
Patch #3 is a small cleanup.
Patch #4 simplifies the driver by removing reference counting from VLAN
entries configured on a port.
Patches #5-#6 simplify linking/unlinking from a bridge, especially when
LAG and VLAN devices are involved. They make both operations symmetric
even when ports are unlinked from a bridged LAG device.
Patch #7-#9 make router interface (RIF) deletion more robust by removing
reliance on device chain to indicate whether a NETDEV_DOWN event in the
inet{,6}addr notification chains should be processed. This is due to the
fact that IP addresses can be flushed from a netdev after it was
unlinked from its lower device.
Patch #10 adds a new test to for valid and invalid configurations over
mlxsw ports. Some of the test cases are derived from recent fixes. I
expect that more test cases will be added over time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:51 +0000 (06:08 +0000)]
selftests: mlxsw: Add rtnetlink tests
Add a new test that is focused on rtnetlink configuration. Its purpose
is to test valid and invalid (as deemed by mlxsw) configurations and
make sure that they succeed / fail without producing a trace.
Some of the test cases are derived from recent fixes in order to make
sure that the fixed bugs are not introduced again.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:50 +0000 (06:08 +0000)]
mlxsw: spectrum_router: Hold a reference on RIF's netdev
Previous patches tried to make RIF deletion more robust and avoid
use-after-free situations.
As another precaution, hold a reference on a RIF's netdev and release it
when the RIF is deleted.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:48 +0000 (06:08 +0000)]
mlxsw: spectrum_router: Make RIF deletion more robust
In the past we had multiple instances where RIFs were not properly
deleted.
One of the reasons for leaking a RIF was that at the time when IP
addresses were flushed from the respective netdev (prompting the
destruction of the RIF), the netdev was no longer a mlxsw upper. This
caused the inet{,6}addr notification blocks to ignore the NETDEV_DOWN
event and leak the RIF.
Instead of checking whether the netdev is our upper when an IP address
is removed, we can instead check if the netdev has a RIF configured.
To look up a RIF we need to access mlxsw private data, so the patch
stores the notification blocks inside a mlxsw struct. This then allows
us to use container_of() and extract the required private data.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:47 +0000 (06:08 +0000)]
mlxsw: spectrum_router: Propagate 'struct mlxsw_sp' further
Next patch is going to make RIF deletion more robust by removing
reliance on fragile mlxsw_sp_lower_get(). This is because a netdev is
not necessarily our upper anymore when its IP addresses are flushed.
The inet{,6}addr notification blocks are going to resolve 'struct
mlxsw_sp' using container_of(), but the functions they call still use
mlxsw_sp_lower_get().
As a preparation for the next patch, propagate 'struct mlxsw_sp' down to
the functions called from the notification blocks and remove reliance on
mlxsw_sp_lower_get().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:45 +0000 (06:08 +0000)]
mlxsw: spectrum: Properly cleanup LAG uppers when removing port from LAG
When a LAG device or a VLAN device on top of it is enslaved to a bridge,
the driver propagates the CHANGEUPPER event to the LAG's slaves.
This causes each physical port to increase the reference count of the
internal representation of the bridge port by calling
mlxsw_sp_port_bridge_join().
However, when a port is removed from a LAG, the corresponding leave()
function is not called and the reference count is not decremented. This
leads to ugly hacks such as mlxsw_sp_bridge_port_should_destroy() that
try to understand if the bridge port should be destroyed even when its
reference count is not 0.
Instead, make sure that when a port is unlinked from a LAG it would see
the same events as if the LAG (or its uppers) were unlinked from a
bridge.
The above is achieved by walking the LAG's uppers when a port is
unlinked and calling mlxsw_sp_port_bridge_leave() for each upper that is
enslaved to a bridge.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:43 +0000 (06:08 +0000)]
mlxsw: spectrum: Remove reference count from VLAN entries
Commit
b3529af6bb0d ("spectrum: Reference count VLAN entries") started
reference counting port-VLAN entries in a similar fashion to the 8021q
driver.
However, this is not actually needed and only complicates things.
Instead, the driver should forbid the creation of a VLAN on a port if
this VLAN already exists. This would also solve the issue fixed by the
mentioned commit.
Therefore, remove the get()/put() API and use create()/destroy()
instead.
One place that needs special attention is VLAN addition in a VLAN-aware
bridge via switchdev operations. In case the VLAN flags (e.g., 'pvid')
are toggled, then the VLAN entry already exists. To prevent the driver
from wrongly returning EEXIST, the driver is changed to check in the
prepare phase whether the entry already exists and only returns an error
in case it is not associated with the correct bridge port.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:41 +0000 (06:08 +0000)]
mlxsw: spectrum: Handle VLAN device unlinking
In commit
993107fea5ee ("mlxsw: spectrum_switchdev: Fix VLAN device
deletion via ioctl") I fixed a bug caused by the fact that the driver
views differently the deletion of a VLAN device when it is deleted via
an ioctl and netlink.
Instead of relying on a specific order of events (device being
unregistered vs. VLAN filter being updated), simply make sure that the
driver performs the necessary cleanup when the VLAN device is unlinked,
which always happens before the other two events.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:40 +0000 (06:08 +0000)]
mlxsw: spectrum_fid: Remove unused function
This function is no longer used. Remove it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:38 +0000 (06:08 +0000)]
mlxsw: spectrum_router: Do not destroy RIFs based on FID's reference count
Currently, when a RIF is constructed on top of a FID, the RIF increments
the FID's reference count and the RIF is destroyed when the FID's
reference count drops to 1. This effectively means that when no local
ports are member in the FID, the FID is destroyed regardless if the
router port is a member in the FID or not.
The above can lead to the unexpected behavior in which routes using a
VLAN interface as their nexthop device are no longer offloaded after the
last local port leaves the corresponding VLAN (FID).
Example:
# ip -4 route show dev br0.10
192.0.2.0/24 proto kernel scope link src 192.0.2.1 offload
# bridge vlan del vid 10 dev swp3
# ip -4 route show dev br0.10
192.0.2.0/24 proto kernel scope link src 192.0.2.1
After the patch, the route is offloaded before and after the VLAN is
removed from local port 'swp3', as the RIF corresponding to 'br0.10'
continues to exists.
In order to remove RIFs' reliance on the underlying FID's reference
count, we need to add a reference count to sub-port RIFs, which are RIFs
that correspond to physical ports and their uppers (e.g., LAG devices).
In this case, each {Port, VID} ('struct mlxsw_sp_port_vlan') needs to
hold a reference on the RIF. For example:
bond0.10
|
bond0
|
+-------+
| |
swp1 swp2
Both {Port 1, VID 10} and {Port 2, VID 10} will hold a reference on the
RIF corresponding to 'bond0.10'. When the last reference is dropped, the
RIF will be destroyed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 19 Dec 2018 06:08:37 +0000 (06:08 +0000)]
mlxsw: spectrum: Sanitize VLAN interface's uppers
Currently, only VRF and macvlan uppers are supported on top of VLAN
device configured over a bridge, so make sure the driver forbids other
uppers.
Note that enslavement to a VRF is handled earlier in the notification
block, so there is no need to check for a VRF upper here.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Wed, 19 Dec 2018 05:17:44 +0000 (21:17 -0800)]
ipv6: explicitly initialize udp6_addr in udp_sock_create6()
syzbot reported the use of uninitialized udp6_addr::sin6_scope_id.
We can just set ::sin6_scope_id to zero, as tunnels are unlikely
to use an IPv6 address that needs a scope id and there is no
interface to bind in this context.
For net-next, it looks different as we have cfg->bind_ifindex there
so we can probably call ipv6_iface_scope_id().
Same for ::sin6_flowinfo, tunnels don't use it.
Fixes: 8024e02879dd ("udp: Add udp_sock_create for UDP tunnels to open listener socket")
Reported-by: syzbot+c56449ed3652e6720f30@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Wed, 19 Dec 2018 04:42:19 +0000 (11:42 +0700)]
tipc: fix uninitialized value for broadcast retransmission
When sending broadcast message on high load system, there are a lot of
unnecessary packets restranmission. That issue was caused by missing in
initial criteria for retransmission.
To prevent this happen, just initialize this criteria for retransmission
in next 10 milliseconds.
Fixes: 31c4f4cc32f7 ("tipc: improve broadcast retransmission algorithm")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Dec 2018 19:49:25 +0000 (11:49 -0800)]
Merge branch 'tipc-tracepoints'
Tuong Lien says:
====================
tipc: tracepoints and trace_events in TIPC
The patch series is the first step of introducing a tracing framework in
TIPC, which will assist in collecting complete & plentiful data for post
analysis, even in the case of a single failure occurrence e.g. when the
failure is unreproducible.
The tracing code in TIPC utilizes the powerful kernel tracepoints, trace
events features along with particular dump functions to trace the TIPC
object data and events (incl. bearer, link, socket, node, etc.).
The tracing code should generate zero-load to TIPC when the trace events
are not enabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien [Wed, 19 Dec 2018 02:18:00 +0000 (09:18 +0700)]
tipc: add trace_events for tipc bearer
The commit adds the new trace_event for TIPC bearer, L2 device event:
trace_tipc_l2_device_event()
Also, it puts the trace at the tipc_l2_device_event() function, then
the device/bearer events and related info can be traced out during
runtime when needed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien [Wed, 19 Dec 2018 02:17:59 +0000 (09:17 +0700)]
tipc: add trace_events for tipc node
The commit adds the new trace_events for TIPC node object:
trace_tipc_node_create()
trace_tipc_node_delete()
trace_tipc_node_lost_contact()
trace_tipc_node_timeout()
trace_tipc_node_link_up()
trace_tipc_node_link_down()
trace_tipc_node_reset_links()
trace_tipc_node_fsm_evt()
trace_tipc_node_check_state()
Also, enables the traces for the following cases:
- When a node is created/deleted;
- When a node contact is lost;
- When a node timer is timed out;
- When a node link is up/down;
- When all node links are reset;
- When node state is changed;
- When a skb comes and node state needs to be checked/updated.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien [Wed, 19 Dec 2018 02:17:58 +0000 (09:17 +0700)]
tipc: add trace_events for tipc socket
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien [Wed, 19 Dec 2018 02:17:57 +0000 (09:17 +0700)]
tipc: add trace_events for tipc link
The commit adds the new trace_events for TIPC link object:
trace_tipc_link_timeout()
trace_tipc_link_fsm()
trace_tipc_link_reset()
trace_tipc_link_too_silent()
trace_tipc_link_retrans()
trace_tipc_link_bc_ack()
trace_tipc_link_conges()
And the traces for PROTOCOL messages at building and receiving:
trace_tipc_proto_build()
trace_tipc_proto_rcv()
Note:
a) The 'tipc_link_too_silent' event will only happen when the
'silent_intv_cnt' is about to reach the 'abort_limit' value (and the
event is enabled). The benefit for this kind of event is that we can
get an early indication about TIPC link loss issue due to timeout, then
can do some necessary actions for troubleshooting.
For example: To trigger the 'tipc_proto_rcv' when the 'too_silent'
event occurs:
echo 'enable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_too_silent/trigger
And disable it when TIPC link is reset:
echo 'disable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_reset/trigger
b) The 'tipc_link_retrans' or 'tipc_link_bc_ack' event is useful to
trace TIPC retransmission issues.
In addition, the commit adds the 'trace_tipc_list/link_dump()' at the
'retransmission failure' case. Then, if the issue occurs, the link
'transmq' along with the link data can be dumped for post-analysis.
These dump events should be enabled by default since it will only take
effect when the failure happens.
The same approach is also applied for the faulty case that the
validation of protocol message is failed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tuong Lien [Wed, 19 Dec 2018 02:17:56 +0000 (09:17 +0700)]
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Wed, 19 Dec 2018 18:46:50 +0000 (13:46 -0500)]
bnxt_en: Fix ethtool self-test loopback.
The current code has 2 problems. It assumes that the RX ring for
the loopback packet is combined with the TX ring. This is not
true if the ethtool channels are set to non-combined mode. The
second problem is that it won't work on 57500 chips without
adjusting the logic to get the proper completion ring (cpr) pointer.
Fix both issues by locating the proper cpr pointer through the RX
ring.
Fixes: e44758b78ae8 ("bnxt_en: Use bnxt_cp_ring_info struct pointer as parameter for RX path.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Dec 2018 19:21:45 +0000 (11:21 -0800)]
Merge branch 'sk_buff-add-extension-infrastructure'
Florian Westphal says:
====================
sk_buff: add extension infrastructure
TL;DR:
- objdiff shows no change if CONFIG_XFRM=n && BR_NETFILTER=n
- small size reduction when one or both options are set
- no changes in ipsec performance
Changes since v1:
- Allocate entire extension space from a kmem_cache.
- Avoid atomic_dec_and_test operation on skb_ext_put() for refcnt == 1 case.
(similar to kfree_skbmem() fclone_ref use).
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
The third (future) user is Multipath TCP which is still out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies:
1. define a bitmask of all extensions that need copy/cow on clone
2. change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE
3. set clone->active_extensions to 0 if test is false.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
Last patch converts skb->sp, secpath information gets stored as
new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.
Extra code added to skb clone and free paths (to deal with refcount/free
of extension area) replaces the existing code that does the same for
skb->nf_bridge and skb->secpath.
I don't see any other in-tree users that could benefit from this
infrastructure, it doesn't make sense to add an extension just for the sake
of a single flag bit (like skb->nf_trace).
Adding a new extension is a good fit if all of the following are true:
1. Data is related to the skb/packet aggregate
2. Data should be freed when the skb is free'd
3. Data is not going to be relevant/needed in normal case (udp, tcp,
forwarding workloads, ...)
4. There are no fancy action(s) needed on clone/free, such as callbacks
into kernel modules.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:27 +0000 (17:15 +0100)]
net: switch secpath to use skb extension infrastructure
Remove skb->sp and allocate secpath storage via extension
infrastructure. This also reduces sk_buff by 8 bytes on x86_64.
Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).
No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:26 +0000 (17:15 +0100)]
xfrm: prefer secpath_set over secpath_dup
secpath_set is a wrapper for secpath_dup that will not perform
an allocation if the secpath attached to the skb has a reference count
of one, i.e., it doesn't need to be COW'ed.
Also, secpath_dup doesn't attach the secpath to the skb, it leaves
this to the caller.
Use secpath_set in places that immediately assign the return value to
skb.
This allows to remove skb->sp without touching these spots again.
secpath_dup can eventually be removed in followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:25 +0000 (17:15 +0100)]
drivers: chelsio: use skb_sec_path helper
reduce noise when skb->sp is removed later in the series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:24 +0000 (17:15 +0100)]
xfrm: use secpath_exist where applicable
Will reduce noise when skb->sp is removed later in this series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:23 +0000 (17:15 +0100)]
drivers: net: netdevsim: use skb_sec_path helper
... so this won't have to be changed when skb->sp goes away.
v2: no changes, preserve ack.
Acked-by: Shannon Nelson <shannon.lee.nelson@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:22 +0000 (17:15 +0100)]
drivers: net: ethernet: mellanox: use skb_sec_path helper
Will avoid touching this when sp pointer is removed from sk_buff struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:21 +0000 (17:15 +0100)]
drivers: net: intel: use secpath helpers in more places
Use skb_sec_path and secpath_exists helpers where possible.
This reduces noise in followup patch that removes skb->sp pointer.
v2: no changes, preseve acks from v1.
Acked-by: Shannon Nelson <shannon.lee.nelson@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:20 +0000 (17:15 +0100)]
net: use skb_sec_path helper in more places
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:19 +0000 (17:15 +0100)]
net: move secpath_exist helper to sk_buff.h
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:18 +0000 (17:15 +0100)]
xfrm: change secpath_set to return secpath struct, not error value
It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.
This avoids direct access to skb->sp:
err = secpath_set(skb);
if (!err) ..
skb->sp-> ...
Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..
This reduces noise in followup patch which is going to remove skb->sp.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:17 +0000 (17:15 +0100)]
net: convert bridge_nf to use skb extension infrastructure
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.
The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:16 +0000 (17:15 +0100)]
sk_buff: add skb extension infrastructure
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (its unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additonal overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If recount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 18 Dec 2018 16:15:15 +0000 (17:15 +0100)]
netfilter: avoid using skb->nf_bridge directly
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Dec 2018 18:37:23 +0000 (10:37 -0800)]
Merge branch 'dpaa2-eth-add-QBMAN-statistics'
Ioana Ciornei says:
====================
dpaa2-eth: add QBMAN statistics
This patch set adds ethtool statistics for pending frames/bytes
in Rx/Tx conf FQs and number of buffers in pool.
The first patch adds support for the query APIs in the DPIO driver
while the latter actually exposes the statistics through ethtool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Radulescu [Tue, 18 Dec 2018 15:23:01 +0000 (15:23 +0000)]
dpaa2-eth: Add QBMAN related stats
Add statistics for pending frames in Rx/Tx conf FQs and
number of buffers in pool. Available through ethtool -S.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Ioana ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roy Pledge [Tue, 18 Dec 2018 15:23:01 +0000 (15:23 +0000)]
soc: fsl: dpio: Add BP and FQ query APIs
Add FQ (Frame Queue) and BP (Buffer Pool) query APIs that
users of QBMan can invoke to see the status of the queues
and pools that they are using.
Signed-off-by: Roy Pledge <roy.pledge@nxp.com>
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raju Lakkaraju [Tue, 18 Dec 2018 09:27:56 +0000 (14:57 +0530)]
net: phy: mscc: Fix the VSC 8531/41 Chip Init sequence
- Turn on Broadcast writes
- UNH 1.8.1 clear bias for UNH 1000BT distortion
- UNH 1.8.7 optimize pre-emphasis for 100BasTx UNH 100W fix
- Enable Token-ring during 'Coma Mode'
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 19 Dec 2018 18:27:58 +0000 (10:27 -0800)]
Merge branch 'rds-fixes'
Shamir Rabinovitch says:
====================
WARNING in rds_message_alloc_sgs
This patch set fix google syzbot rds bug found in linux-next.
The first patch solve the syzbot issue.
The second patch fix issue mentioned by Leon Romanovsky that
drivers should not call WARN_ON as result from user input.
syzbot bug report can be foud here: https://lkml.org/lkml/2018/10/31/28
v1->v2:
- patch 1: make rds_iov_vector fields name more descriptive (Hakon)
- patch 1: fix potential mem leak in rds_rm_size if krealloc fail
(Hakon)
v2->v3:
- patch 2: harden rds_sendmsg for invalid number of sgs (Gerd)
v3->v4
- Santosh a.b. on both patches + repost to net-dev
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
shamir rabinovitch [Sun, 16 Dec 2018 07:01:09 +0000 (09:01 +0200)]
net/rds: remove user triggered WARN_ON in rds_sendmsg
per comment from Leon in rdma mailing list
https://lkml.org/lkml/2018/10/31/312 :
Please don't forget to remove user triggered WARN_ON.
https://lwn.net/Articles/769365/
"Greg Kroah-Hartman raised the problem of core kernel API code that will
use WARN_ON_ONCE() to complain about bad usage; that will not generate
the desired result if WARN_ON_ONCE() is configured to crash the machine.
He was told that the code should just call pr_warn() instead, and that
the called function should return an error in such situations. It was
generally agreed that any WARN_ON() or WARN_ON_ONCE() calls that can be
triggered from user space need to be fixed."
in addition harden rds_sendmsg to detect and overcome issues with
invalid sg count and fail the sendmsg.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
shamir rabinovitch [Sun, 16 Dec 2018 07:01:08 +0000 (09:01 +0200)]
net/rds: fix warn in rds_message_alloc_sgs
redundant copy_from_user in rds_sendmsg system call expose rds
to issue where rds_rdma_extra_size walk the rds iovec and and
calculate the number pf pages (sgs) it need to add to the tail of
rds message and later rds_cmsg_rdma_args copy the rds iovec again
and re calculate the same number and get different result causing
WARN_ON in rds_message_alloc_sgs.
fix this by doing the copy_from_user only once per rds_sendmsg
system call.
When issue occur the below dump is seen:
WARNING: CPU: 0 PID: 19789 at net/rds/message.c:316 rds_message_alloc_sgs+0x10c/0x160 net/rds/message.c:316
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 19789 Comm: syz-executor827 Not tainted 4.19.0-next-
20181030+ #101
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
panic+0x2ad/0x55c kernel/panic.c:188
__warn.cold.8+0x20/0x45 kernel/panic.c:540
report_bug+0x254/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969
RIP: 0010:rds_message_alloc_sgs+0x10c/0x160 net/rds/message.c:316
Code: c0 74 04 3c 03 7e 6c 44 01 ab 78 01 00 00 e8 2b 9e 35 fa 4c 89 e0 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 14 9e 35 fa <0f> 0b 31 ff 44 89 ee e8 18 9f 35 fa 45 85 ed 75 1b e8 fe 9d 35 fa
RSP: 0018:
ffff8801c51b7460 EFLAGS:
00010293
RAX:
ffff8801bc412080 RBX:
ffff8801d7bf4040 RCX:
ffffffff8749c9e6
RDX:
0000000000000000 RSI:
ffffffff8749ca5c RDI:
0000000000000004
RBP:
ffff8801c51b7490 R08:
ffff8801bc412080 R09:
ffffed003b5c5b67
R10:
ffffed003b5c5b67 R11:
ffff8801dae2db3b R12:
0000000000000000
R13:
000000000007165c R14:
000000000007165c R15:
0000000000000005
rds_cmsg_rdma_args+0x82d/0x1510 net/rds/rdma.c:623
rds_cmsg_send net/rds/send.c:971 [inline]
rds_sendmsg+0x19a2/0x3180 net/rds/send.c:1273
sock_sendmsg_nosec net/socket.c:622 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:632
___sys_sendmsg+0x7fd/0x930 net/socket.c:2117
__sys_sendmsg+0x11d/0x280 net/socket.c:2155
__do_sys_sendmsg net/socket.c:2164 [inline]
__se_sys_sendmsg net/socket.c:2162 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44a859
Code: e8 dc e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 6b cb fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007f1d4710ada8 EFLAGS:
00000297 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
00000000006dcc28 RCX:
000000000044a859
RDX:
0000000000000000 RSI:
0000000020001600 RDI:
0000000000000003
RBP:
00000000006dcc20 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000297 R12:
00000000006dcc2c
R13:
646e732f7665642f R14:
00007f1d4710b9c0 R15:
00000000006dcd2c
Kernel Offset: disabled
Rebooting in 86400 seconds..
Reported-by: syzbot+26de17458aeda9d305d8@syzkaller.appspotmail.com
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>