openwrt/staging/blogic.git
5 years agortnetlink: ifinfo: perform strict checks also for doit handler
Jakub Kicinski [Fri, 18 Jan 2019 18:46:16 +0000 (10:46 -0800)]
rtnetlink: ifinfo: perform strict checks also for doit handler

Make RTM_GETLINK's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agortnetlink: stats: reject requests for unknown stats
Jakub Kicinski [Fri, 18 Jan 2019 18:46:15 +0000 (10:46 -0800)]
rtnetlink: stats: reject requests for unknown stats

In the spirit of strict checks reject requests of stats the kernel
does not support when NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agortnetlink: stats: validate attributes in get as well as dumps
Jakub Kicinski [Fri, 18 Jan 2019 18:46:14 +0000 (10:46 -0800)]
rtnetlink: stats: validate attributes in get as well as dumps

Make sure NETLINK_GET_STRICT_CHK influences both GETSTATS doit
as well as the dump.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: netlink: add helper to retrieve NETLINK_F_STRICT_CHK
Jakub Kicinski [Fri, 18 Jan 2019 18:46:13 +0000 (10:46 -0800)]
net: netlink: add helper to retrieve NETLINK_F_STRICT_CHK

Dumps can read state of the NETLINK_F_STRICT_CHK flag from
a field in the callback structure.  For non-dump GET requests
we need a way to access the state of that flag from a socket.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovirtio-net: per-queue RPS config
Willem de Bruijn [Fri, 18 Jan 2019 01:08:53 +0000 (20:08 -0500)]
virtio-net: per-queue RPS config

On multiqueue network devices, RPS maps are configured independently
for each receive queue through /sys/class/net/$DEV/queues/rx-*.

On virtio-net currently all packets use the map from rx-0, because the
real rx queue is not known at time of map lookup by get_rps_cpu.

Call skb_record_rx_queue in the driver rx path to make lookup work.

Recording the receive queue has ramifications beyond RPS, such as in
sticky load balancing decisions for sockets (skb_tx_hash) and XPS.

Reported-by: Mark Hlady <mhlady@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosch_api: Change signature of qdisc_tree_reduce_backlog() to use ints
Toke Høiland-Jørgensen [Wed, 9 Jan 2019 16:10:57 +0000 (17:10 +0100)]
sch_api: Change signature of qdisc_tree_reduce_backlog() to use ints

There are now several places where qdisc_tree_reduce_backlog() is called
with a negative number of packets (to signal an increase in number of
packets in the queue). Rather than rely on overflow behaviour, change the
function signature to use signed integers to communicate this usage to
people reading the code.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'hns3-fixes'
David S. Miller [Fri, 18 Jan 2019 23:10:22 +0000 (15:10 -0800)]
Merge branch 'hns3-fixes'

Huazhong Tan says:

====================
net: hns3: code optimizations & bugfixes for HNS3 driver

This patchset includes bugfixes and code optimizations for the HNS3
ethernet controller driver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add HNAE3_RESTORE_CLIENT interface in enet module
Yunsheng Lin [Fri, 18 Jan 2019 08:13:14 +0000 (16:13 +0800)]
net: hns3: add HNAE3_RESTORE_CLIENT interface in enet module

The HNAE3_INIT_CLIENT interface is also used when changing tc
configuration, vlan/mac hardware table does not need to be restored
when tc configuration changes.

This patch adds a HNAE3_RESTORE_CLIENT interface to restore the
vlan/mac hardware table when resetting.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: do reinitialization while ETS configuration changed
Huazhong Tan [Fri, 18 Jan 2019 08:13:13 +0000 (16:13 +0800)]
net: hns3: do reinitialization while ETS configuration changed

When the ETS information is changed, the network device needs to be
re-initialized, otherwise the information such as the receiving queue
will be incorrect.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: fix wrong combined count returned by ethtool -l
Huazhong Tan [Fri, 18 Jan 2019 08:13:12 +0000 (16:13 +0800)]
net: hns3: fix wrong combined count returned by ethtool -l

The current code returns the number of all queues that can be used and
the number of queues that have been allocated, which is incorrect.
What should be returned is the number of queues allocated for each enabled
TC and the number of queues that can be allocated.

This patch fixes it.

Fixes: 482d2e9c1cc7 ("net: hns3: add support to query tqps number")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: adjust the use of alloc_tqps and num_tqps
Huazhong Tan [Fri, 18 Jan 2019 08:13:11 +0000 (16:13 +0800)]
net: hns3: adjust the use of alloc_tqps and num_tqps

The alloc_tqps field of struct hclge_vport represents the total number
of tqps allocated to the vport. The num_tqps of struct
hnae3_knic_private_info indicates the total number of all enabled tqps,
which needs to be distinguished during use.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: fix user configuration loss for ethtool -L
Huazhong Tan [Fri, 18 Jan 2019 08:13:10 +0000 (16:13 +0800)]
net: hns3: fix user configuration loss for ethtool -L

Ethtool -L option with the combined parameter is for changing the number of
multi-purpose channels of the specified network device. Under the current
scheme, the user configuration information will be lost after the reset or
TC information changed.

This patch fixes this issue. By default, this configuration is set to the
minimum between the number of queues for each enabled TCs and the maximum
number support available in the hardware. When there is a user
configuration, regardless of the reset or TC information change, it should
keep the user's configuration while it is under the hardware limits,
otherwise set to the maximum number support available in the hardware.

Fixes: 09f2af6405b8 ("net: hns3: add support to modify tqps number")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: remove redundant codes in hclge_knic_setup
Huazhong Tan [Fri, 18 Jan 2019 08:13:09 +0000 (16:13 +0800)]
net: hns3: remove redundant codes in hclge_knic_setup

The TC info will be updated in hclge_tm_vport_tc_info_update(),
so hclge_knic_setup() no need to do it again.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: modify parameter checks in the hns3_set_channels
Huazhong Tan [Fri, 18 Jan 2019 08:13:08 +0000 (16:13 +0800)]
net: hns3: modify parameter checks in the hns3_set_channels

The number of queues for each enabled TC should range from 1 to
the maximum available value, and return directly if the value
is same as the current one.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add interface hclge_tm_bp_setup
Huazhong Tan [Fri, 18 Jan 2019 08:13:07 +0000 (16:13 +0800)]
net: hns3: add interface hclge_tm_bp_setup

Provide a common interface to complete the back pressure settings
of all enabled TCs. So other functions directly call this interface
to complete the corresponding operation.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: reuse reinitialization interface in the hns3_set_channels
Huazhong Tan [Fri, 18 Jan 2019 08:13:06 +0000 (16:13 +0800)]
net: hns3: reuse reinitialization interface in the hns3_set_channels

There is already common interface for network device reinitialization,
so hns3_set_channels() should just call them.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: remove unnecessary hns3_adjust_tqps_num
Huazhong Tan [Fri, 18 Jan 2019 08:13:05 +0000 (16:13 +0800)]
net: hns3: remove unnecessary hns3_adjust_tqps_num

The parameter passed to hns3_set_channels() are already the number of
queues per channel of the enabled TC, so it is not need to divide
the number of enabled TCs.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: remove unused member in struct hns3_enet_ring
Huazhong Tan [Fri, 18 Jan 2019 08:13:04 +0000 (16:13 +0800)]
net: hns3: remove unused member in struct hns3_enet_ring

The irq_init_flag field in struct hns3_enet_ring is unnecessary.
This patch removes it.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: modify enet reinitialization interface
Huazhong Tan [Fri, 18 Jan 2019 08:13:03 +0000 (16:13 +0800)]
net: hns3: modify enet reinitialization interface

hns3_reset_notify_init_enet and hns3_reset_notify_uninit_enet are the
reinitialization interface that will be called when the device reset,
the number of TC changed, or the queue length changed. So these two
function should call hns3_get_ring_config() and hns3_put_ring_config()
to allocate and free memory for the ring with the correct number.

Also this patch fixes a double free problem when
hns3_reset_notify_uninit_enet calling hns3_nic_dealloc_vector_data

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'Devlink-health-reporting-and-recovery-system'
David S. Miller [Fri, 18 Jan 2019 22:51:23 +0000 (14:51 -0800)]
Merge branch 'Devlink-health-reporting-and-recovery-system'

Eran Ben Elisha says:

====================
Devlink health reporting and recovery system

The health mechanism is targeted for Real Time Alerting, in order to know when
something bad had happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
  information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
Device driver creates a "health reporter" per each error/health type.
Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by devlink.
Device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will do the following actions:
  * A log is being send to the kernel trace events buffer
  * Health status and statistics are being updated for the reporter instance
  * Object dump is being taken and saved at the reporter instance (as long as
    there is no other dump which is already stored)
  * Auto recovery attempt is being done. Depends on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recover

The user interface:
User can access/change each reporter attributes and driver specific callbacks
via devlink, e.g per error type (per health reporter)
 - Configure reporter's generic attributes (like: Disable/enable auto recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If an dump is not already stored by the devlink
  for this reporter, devlink generates a new dump.
  dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

In this patchset, mlx5e TX reporter is implemented.

v2:
- Remove FW* reporters to decrease the amount of patches in the patchset
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add Documentation/networking/devlink-health.txt
Aya Levin [Thu, 17 Jan 2019 21:59:20 +0000 (23:59 +0200)]
devlink: Add Documentation/networking/devlink-health.txt

This patch adds a new file to add information about devlink health
mechanism.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx5e: Add TX timeout support for mlx5e TX reporter
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:19 +0000 (23:59 +0200)]
net/mlx5e: Add TX timeout support for mlx5e TX reporter

With this patch, ndo_tx_timeout callback will be redirected to the TX
reporter in order to detect a TX timeout error and report it to the
devlink health. (The watchdog detects TX timeouts, but the driver verify
the issue still exists before launching any recover method).

In addition, recover from TX timeout in case of lost interrupt was added
to the TX reporter recover method. The TX timeout recover from lost
interrupt is not a new feature in the driver, this patch re-organize the
functionality and move it to the TX reporter recovery flow.

TX timeout example:
(with auto_recover set to false, if set to true, the manual recover and
diagnose sections are irrelevant)

$cat /sys/kernel/debug/tracing/trace
...
devlink_health_report: bus_name=pci dev_name=0000:00:09.0
driver_name=mlx5_core reporter_name=TX: TX timeout on queue: 0, SQ: 0xd8a, CQ:
0x406, SQ Cons: 0x2 SQ Prod: 0x2, usecs since last trans: 13972000

$devlink health diagnose pci/0000:00:09 reporter TX
SQ 0xd8a: HW state: 1, stopped: 1
SQ 0xe44: HW state: 1, stopped: 0
SQ 0xeb4: HW state: 1, stopped: 0
SQ 0xf1f: HW state: 1, stopped: 0
SQ 0xf80: HW state: 1, stopped: 0
SQ 0xfe5: HW state: 1, stopped: 0

$devlink health recover pci/0000:00:09 reporter TX
$devlink health show
pci/0000:00:09.0:
  name TX state healthy #err 1 #recover 1 last_dump_ts N/A dump_available false
    attributes:
        grace_period 500 auto_recover false

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx5e: Add TX reporter support
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:18 +0000 (23:59 +0200)]
net/mlx5e: Add TX reporter support

Add mlx5e tx reporter to devlink health reporters. This reporter will be
responsible for diagnosing, reporting and recovering of TX errors.
This patch declares the TX reporter operations and allocate it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQE only. In addition, it adds diagnose
information for the open SQs.

For a local SQ recover (due to driver error report), in case of SQ recover
failure, the recover operation will be considered as a failure.
For a full TX recover, an attempt to close and open the channels will be
done. If this one passed successfully, it will be considered as a
successful recover.

The SQ recover from error CQE flow is not a new feature in the driver,
this patch re-organize the functions and adapt them for the devlink
health API. For this purpose, move code from en_main.c to a new file
named reporter_tx.c.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health dump {get,clear} commands
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:17 +0000 (23:59 +0200)]
devlink: Add health dump {get,clear} commands

Add devlink health dump commands, in order to run an dump operation
over a specific reporter.

The supported operations are dump_get in order to get last saved
dump (if not exist, dump now) and dump_clear to clear last saved
dump.

It is expected from driver's callback for diagnose command to fill it
via the buffer descriptors API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health diagnose command
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:16 +0000 (23:59 +0200)]
devlink: Add health diagnose command

Add devlink health diagnose command, in order to run a diagnose
operation over a specific reporter.

It is expected from driver's callback for diagnose command to fill it
via the buffer descriptors API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health recover command
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:15 +0000 (23:59 +0200)]
devlink: Add health recover command

Add devlink health recover command to the uapi, in order to allow the user
to execute a recover operation over a specific reporter.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health set command
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:14 +0000 (23:59 +0200)]
devlink: Add health set command

Add devlink health set command, in order to set configuration parameters
for a specific reporter.
Supported parameters are:
- graceful_period: Time interval between auto recoveries (in msec)
- auto_recover: Determines if the devlink shall execute recover upon
receiving error for the reporter

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health get command
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:13 +0000 (23:59 +0200)]
devlink: Add health get command

Add devlink health get command to provide reporter/s data for user space.
Add the ability to get data per reporter or dump data from all available
reporters.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health report functionality
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:12 +0000 (23:59 +0200)]
devlink: Add health report functionality

Upon error discover, every driver can report it to the devlink health
mechanism via devlink_health_report function, using the appropriate
reporter registered to it. Driver can pass error specific context which
will be delivered to it as part of the dump / recovery callbacks.

Once an error is reported, devlink health will do the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and stored at the reporter instance (as long
  as there is no other dump which is already stored)
* Auto recovery attempt is being done. depends on:
  - Auto Recovery configuration
  - Grace period vs. time since last recover

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health reporter create/destroy functionality
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:11 +0000 (23:59 +0200)]
devlink: Add health reporter create/destroy functionality

Devlink health reporter is an instance for reporting, diagnosing and
recovering from run time errors discovered by the reporters.
Define it's data structure and supported operations.
In addition, expose devlink API to create and destroy a reporter.
Each devlink instance will hold it's own reporters list.

As part of the allocation, driver shall provide a set of callbacks which
will be used the devlink in order to handle health reports and user
commands related to this reporter. In addition, driver is entitled to
provide some priv pointer, which can be fetched from the reporter by
devlink_health_reporter_priv function.

For each reporter, devlink will hold a metadata of statistics,
buffers and status.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodevlink: Add health buffer support
Eran Ben Elisha [Thu, 17 Jan 2019 21:59:10 +0000 (23:59 +0200)]
devlink: Add health buffer support

Devlink health buffer is a mechanism to pass descriptors between drivers
and devlink. The API allows the driver to add objects, object pair,
value array (nested attributes), value and name.

Driver can use this API to fill the buffers in a format which can be
translated by the devlink to the netlink message.

In order to fulfill it, an internal buffer descriptor is defined. This
will hold the data and metadata per each attribute and by used to pass
actual commands to the netlink.

This mechanism will be later used in devlink health for dump and diagnose
data store by the drivers.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet_sched: add hit counter for matchall
Cong Wang [Thu, 17 Jan 2019 20:44:25 +0000 (12:44 -0800)]
net_sched: add hit counter for matchall

Although matchall always matches packets, however, it still
relies on a protocol match first. So it is still useful to have
such a counter for matchall. Of course, unlike u32, every time
we hit a matchall filter, it is always a success, so we don't
have to distinguish them.

Sample output:

filter protocol 802.1Q pref 100 matchall chain 0
filter protocol 802.1Q pref 100 matchall chain 0 handle 0x1
  not_in_hw (rule hit 10)
action order 1: vlan  pop continue
 index 1 ref 1 bind 1 installed 40 sec used 1 sec
Action statistics:
Sent 836 bytes 10 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Reported-by: Martin Olsson <martin.olsson+netdev@sentorsecurity.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'phy-improve-stopping-PHY'
David S. Miller [Fri, 18 Jan 2019 22:12:25 +0000 (14:12 -0800)]
Merge branch 'phy-improve-stopping-PHY'

Heiner Kallweit says:

====================
net: phy: improve stopping PHY

This patchset improves and simplifies stopping the PHY.

Heiner Kallweit (3):
  net: phy: stop PHY if needed when entering phy_disconnect
  net: phy: ensure phylib state machine is stopped after calling phy_stop
  net: phy: remove phy_stop_interrupts

v2:
- break down the patch to a patchset
v3:
- don't warn if driver didn't call phy_stop before phy_disconnect
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: remove phy_stop_interrupts
Heiner Kallweit [Thu, 17 Jan 2019 19:09:21 +0000 (20:09 +0100)]
net: phy: remove phy_stop_interrupts

Interrupts have been disabled in phy_stop() already. So we can remove
phy_stop_interrupts() and free the interrupt in phy_disconnect()
directly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: ensure phylib state machine is stopped after calling phy_stop
Heiner Kallweit [Thu, 17 Jan 2019 19:08:39 +0000 (20:08 +0100)]
net: phy: ensure phylib state machine is stopped after calling phy_stop

The call to the phylib state machine in phy_stop() just ensures that
the state machine isn't re-triggered, but a state machine call may
be scheduled already. So lets's call phy_stop_machine().
This also allows to get rid of the call to phy_stop_machine() in
phy_disconnect().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: stop PHY if needed when entering phy_disconnect
Heiner Kallweit [Thu, 17 Jan 2019 19:07:54 +0000 (20:07 +0100)]
net: phy: stop PHY if needed when entering phy_disconnect

Stop PHY if needed when entering phy_disconnect. This allows drivers
that don't need a separate call to phy_stop() to omit this call.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: declare tcp_mmap() only when CONFIG_MMU is set
Yafang Shao [Thu, 17 Jan 2019 10:03:14 +0000 (18:03 +0800)]
tcp: declare tcp_mmap() only when CONFIG_MMU is set

Since tcp_mmap() is defined when CONFIG_MMU is set.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: jme: fix indentation issues
Colin Ian King [Thu, 17 Jan 2019 00:03:26 +0000 (00:03 +0000)]
net: jme: fix indentation issues

There are two lines that have indentation issues, fix these. Also remove
an empty line.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: vxge: fix indentation issue
Colin Ian King [Wed, 16 Jan 2019 23:59:10 +0000 (23:59 +0000)]
net: vxge: fix indentation issue

There is a goto statement that indented too deeply, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: improve get_phy_id
Heiner Kallweit [Wed, 16 Jan 2019 18:52:51 +0000 (19:52 +0100)]
net: phy: improve get_phy_id

Only caller of get_phy_id() is get_phy_device(). There a PHY ID of
0xffffffff is translated back to -ENODEV. So we can avoid some
overhead by returning -ENODEV directly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: remove state PHY_CHANGELINK
Heiner Kallweit [Wed, 16 Jan 2019 18:47:57 +0000 (19:47 +0100)]
net: phy: remove state PHY_CHANGELINK

Since recent changes to the phylib state machine state PHY_CHANGELINK
isn't used any longer. Therefore let's remove it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ip6_gre: remove gre_hdr_len from ip6erspan_rcv
Lorenzo Bianconi [Wed, 16 Jan 2019 18:38:05 +0000 (19:38 +0100)]
net: ip6_gre: remove gre_hdr_len from ip6erspan_rcv

Remove gre_hdr_len from ip6erspan_rcv routine signature since
it is not longer used

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'tcp_openreq_child'
David S. Miller [Fri, 18 Jan 2019 06:19:05 +0000 (22:19 -0800)]
Merge branch 'tcp_openreq_child'

Eric Dumazet says:

====================
tcp: remove code from tcp_create_openreq_child()

tcp_create_openreq_child() is essentially cloning a listener, then
must initialize some fields that can not be inherited.

Listeners are either fresh sockets, or sockets that came through
tcp_disconnect() after a session that dirtied many fields.

By moving code to tcp_disconnect(), we can shorten time taken
to create a clone, since tcp_disconnect() operation is very
unlikely.
====================

Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move rx_opt & syn_data_acked init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:42 +0000 (11:23 -0800)]
tcp: move rx_opt & syn_data_acked init to tcp_disconnect()

If we make sure all listeners have these fields cleared, then a clone
will also inherit zero values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move tp->rack init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:41 +0000 (11:23 -0800)]
tcp: move tp->rack init to tcp_disconnect()

If we make sure all listeners have proper tp->rack value,
then a clone will also inherit proper initial value.

Note that fresh sockets init tp->rack from tcp_init_sock()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move app_limited init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:40 +0000 (11:23 -0800)]
tcp: move app_limited init to tcp_disconnect()

If we make sure all listeners have app_limited set to ~0U,
then a clone will also inherit proper initial value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move retrans_out, sacked_out, tlp_high_seq, last_oow_ack_time init to tcp_discon...
Eric Dumazet [Thu, 17 Jan 2019 19:23:39 +0000 (11:23 -0800)]
tcp: move retrans_out, sacked_out, tlp_high_seq, last_oow_ack_time init to tcp_disconnect()

If we make sure all listeners have these fields cleared, then a clone
will also inherit zero values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: do not clear urg_data in tcp_create_openreq_child
Eric Dumazet [Thu, 17 Jan 2019 19:23:38 +0000 (11:23 -0800)]
tcp: do not clear urg_data in tcp_create_openreq_child

All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets have also a zero value here.

So a clone will inherit a zero value here.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move snd_cwnd & snd_cwnd_cnt init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:37 +0000 (11:23 -0800)]
tcp: move snd_cwnd & snd_cwnd_cnt init to tcp_disconnect()

Passive connections can inherit proper value by cloning,
if we make sure all listeners have the proper values there.

tcp_disconnect() was setting snd_cwnd to 2, which seems
quite obsolete since IW10 adoption.

Also remove an obsolete comment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move mdev_us init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:36 +0000 (11:23 -0800)]
tcp: move mdev_us init to tcp_disconnect()

If we make sure a listener always has its mdev_us
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: do not clear srtt_us in tcp_create_openreq_child
Eric Dumazet [Thu, 17 Jan 2019 19:23:35 +0000 (11:23 -0800)]
tcp: do not clear srtt_us in tcp_create_openreq_child

All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets have also a zero value here.

So a clone will inherit a zero value here.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: do not clear packets_out in tcp_create_openreq_child()
Eric Dumazet [Thu, 17 Jan 2019 19:23:34 +0000 (11:23 -0800)]
tcp: do not clear packets_out in tcp_create_openreq_child()

New sockets have this field cleared, and tcp_disconnect()
calls tcp_write_queue_purge() which among other things
also clear tp->packets_out

So a listener is guaranteed to have this field cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: move icsk_rto init to tcp_disconnect()
Eric Dumazet [Thu, 17 Jan 2019 19:23:33 +0000 (11:23 -0800)]
tcp: move icsk_rto init to tcp_disconnect()

If we make sure a listener always has its icsk_rto
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: do not set snd_ssthresh in tcp_create_openreq_child()
Eric Dumazet [Thu, 17 Jan 2019 19:23:32 +0000 (11:23 -0800)]
tcp: do not set snd_ssthresh in tcp_create_openreq_child()

New sockets get the field set to TCP_INFINITE_SSTHRESH in tcp_init_sock()
In case a socket had this field changed and transitions to TCP_LISTEN
state, tcp_disconnect() also makes sure snd_ssthresh is set to
TCP_INFINITE_SSTHRESH.

So a listener has this field set to TCP_INFINITE_SSTHRESH already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx4: remove unneeded semicolon
YueHaibing [Thu, 17 Jan 2019 13:03:56 +0000 (21:03 +0800)]
net/mlx4: remove unneeded semicolon

Remove unneeded semicolon.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw-phy-sel: remove unneeded semicolon
YueHaibing [Thu, 17 Jan 2019 12:59:19 +0000 (20:59 +0800)]
net: ethernet: ti: cpsw-phy-sel: remove unneeded semicolon

Remove unneeded semicolon.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: remove unneeded semicolon in trace.c
YueHaibing [Thu, 17 Jan 2019 12:57:08 +0000 (20:57 +0800)]
tipc: remove unneeded semicolon in trace.c

Remove unneeded semicolon

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: remove duplicated include from qed_if.h
YueHaibing [Thu, 17 Jan 2019 07:22:20 +0000 (15:22 +0800)]
qed: remove duplicated include from qed_if.h

Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Denis Bolotin <dbolotin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosb1000: fix a couple of indentation issues and remove assignment in if statements
Colin Ian King [Thu, 17 Jan 2019 00:35:43 +0000 (00:35 +0000)]
sb1000: fix a couple of indentation issues and remove assignment in if statements

There is an if statement and a return statement that are incorrectly
indented. Fix these.  Also replace the assignment-in-if statements
to assignment followed by an if to keep to the coding style.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: add a route cache full diagnostic message
Peter Oskolkov [Wed, 16 Jan 2019 16:50:28 +0000 (08:50 -0800)]
net: add a route cache full diagnostic message

In some testing scenarios, dst/route cache can fill up so quickly
that even an explicit GC call occasionally fails to clean it up. This leads
to sporadically failing calls to dst_alloc and "network unreachable" errors
to the user, which is confusing.

This patch adds a diagnostic message to make the cause of the failure
easier to determine.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodpaa2-eth: Fix ndo_stop routine
Ioana Ciocoi Radulescu [Wed, 16 Jan 2019 16:51:44 +0000 (16:51 +0000)]
dpaa2-eth: Fix ndo_stop routine

In the current implementation, on interface down we disabled NAPI and
then manually drained any remaining ingress frames. This could lead
to a situation when, under heavy traffic, the data availability
notification for some of the channels would not get rearmed correctly.

Change the implementation such that we let all remaining ingress frames
be processed as usual and only disable NAPI once the hardware queues
are empty.

We also add a wait on the Tx side, to allow hardware time to process
all in-flight Tx frames before issueing the disable command.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agowan: dscc4: fix various indentation issues
Colin Ian King [Wed, 16 Jan 2019 17:08:17 +0000 (17:08 +0000)]
wan: dscc4: fix various indentation issues

There are some lines that have indentation issues, fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'vxlan-FDB-veto'
David S. Miller [Thu, 17 Jan 2019 23:18:47 +0000 (15:18 -0800)]
Merge branch 'vxlan-FDB-veto'

Petr Machata says:

====================
vxlan: Allow vetoing FDB operations

mlxsw does not implement handling of the more advanced types of VXLAN
FDB entries. In order to provide visibility to users, it is important to
be able to reject such FDB entries, ideally with an explanation passed
in extended ack. This patch set implements this.

In patches #1-#4, vxlan is gradually transformed to support vetoing of
FDB entries added (or modified) through vxlan_fdb_update(), and the
default FDB entry added in __vxlan_dev_create().

Patches #5-#7 deal with vxlan_changelink(). The existing code recognizes
that vxlan_fdb_update() may fail, but doesn't attempt to keep things
intact if it does. These patches change the function in several steps to
gracefully handle vetoes (or other failures).

Then in patches #8-#11, extack arguments are added, respectively, to
ndo_fdb_add(), mlxsw's mlxsw_sp_nve_ops.fdb_replay, the functions that
connect to the VXLAN vetoing code, and call_switchdev_notifiers(). Note
that call_switchdev_blocking_notifiers() already does support extack.

Finally in patch #12, mlxsw is extended to add extack messages to
rejected FDB entries. In patch #13, the functionality is tested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: mlxsw: Test veto of unsupported VXLAN FDBs
Petr Machata [Wed, 16 Jan 2019 23:07:00 +0000 (23:07 +0000)]
selftests: mlxsw: Test veto of unsupported VXLAN FDBs

mlxsw doesn't implement offloading of all types of FDB entries that the
VXLAN driver supports. Test that such FDB entries are rejected. That
makes sure that the decision made by the existing validation code in
mlxsw propagates up the stack. It also exercises rollback functionality
in VXLAN, and tests that extack is returned.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Add extack messages to VXLAN FDB rejection
Petr Machata [Wed, 16 Jan 2019 23:06:58 +0000 (23:06 +0000)]
mlxsw: spectrum: Add extack messages to VXLAN FDB rejection

Annotate the rejections in mlxsw_sp_switchdev_vxlan_work_prepare() with
textual reasons.

Because this code ends up being invoked for FDB replay as well, drop the
default message from there, so that the more accurate error message is
not overwritten.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoswitchdev: Add extack argument to call_switchdev_notifiers()
Petr Machata [Wed, 16 Jan 2019 23:06:56 +0000 (23:06 +0000)]
switchdev: Add extack argument to call_switchdev_notifiers()

A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: Add extack to switchdev operations
Petr Machata [Wed, 16 Jan 2019 23:06:54 +0000 (23:06 +0000)]
vxlan: Add extack to switchdev operations

There are four sources of VXLAN switchdev notifier calls:

- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().

Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.

To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: Add extack to mlxsw_sp_nve_ops.fdb_replay
Petr Machata [Wed, 16 Jan 2019 23:06:52 +0000 (23:06 +0000)]
mlxsw: Add extack to mlxsw_sp_nve_ops.fdb_replay

A follow-up patch will extend vxlan_fdb_replay() with an extack
argument. Extend the fdb_replay callback in mlxsw likewise so that the
argument is ready for the vxlan conversion.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Add extack argument to ndo_fdb_add()
Petr Machata [Wed, 16 Jan 2019 23:06:50 +0000 (23:06 +0000)]
net: Add extack argument to ndo_fdb_add()

Drivers may not be able to support certain FDB entries, and an error
code is insufficient to give clear hints as to the reasons of rejection.

In order to make it possible to communicate the rejection reason, extend
ndo_fdb_add() with an extack argument. Adapt the existing
implementations of ndo_fdb_add() to take the parameter (and ignore it).
Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: changelink: Delete remote after update
Petr Machata [Wed, 16 Jan 2019 23:06:43 +0000 (23:06 +0000)]
vxlan: changelink: Delete remote after update

If a change in remote address prompts a change in a default FDB entry,
that change might be vetoed. If that happens, it would then be necessary
to reinstate the already-removed default FDB entry corresponding to the
previous remote address.

Instead, arrange to have the previous address removed only after the
FDB is successfully vetted.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: changelink: Postpone vxlan_config_apply()
Petr Machata [Wed, 16 Jan 2019 23:06:41 +0000 (23:06 +0000)]
vxlan: changelink: Postpone vxlan_config_apply()

When an FDB entry is vetoed, it is necessary to unroll the changes that
have already been done. To avoid having to unroll vxlan_config_apply(),
postpone the call after the point where the vetoing takes place. Since
the call can't fail, it doesn't necessitate any cleanups in the
preceding FDB update logic.

Correspondingly, move down the mod_timer() call as well.

References to *dst need to be replaced with references to conf.
Additionally, old_dst and old_age_interval are not necessary anymore,
and therefore drop them.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: changelink: Inline vxlan_dev_configure()
Petr Machata [Wed, 16 Jan 2019 23:06:39 +0000 (23:06 +0000)]
vxlan: changelink: Inline vxlan_dev_configure()

The changelink operation may cause change in remote address, and
therefore an FDB update, which can be vetoed. To properly handle
vetoing, vxlan_changelink() needs to be gradually updated.

In this patch simply replace vxlan_dev_configure() with the two
constituent calls.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: Allow vetoing of FDB notifications
Petr Machata [Wed, 16 Jan 2019 23:06:38 +0000 (23:06 +0000)]
vxlan: Allow vetoing of FDB notifications

Change vxlan_fdb_switchdev_call_notifiers() to return the result from
calling switchdev notifiers. Propagate the error number up the stack.

In vxlan_fdb_update_existing() and vxlan_fdb_update_create() add
rollbacks to clean up the work that was done before the veto.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: Have vxlan_fdb_replace() save original rdst value
Petr Machata [Wed, 16 Jan 2019 23:06:34 +0000 (23:06 +0000)]
vxlan: Have vxlan_fdb_replace() save original rdst value

To enable rollbacks after vetoed FDB updates, extend vxlan_fdb_replace()
to take an additional argument where it should store the original values
of a modified rdst. Update the sole caller.

The following patch will make use of the saved value.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: Split vxlan_fdb_update() in two
Petr Machata [Wed, 16 Jan 2019 23:06:32 +0000 (23:06 +0000)]
vxlan: Split vxlan_fdb_update() in two

In order to make it easier to implement rollbacks after FDB update
vetoing, separate the FDB update code to two parts: one that deals with
updates of existing FDB entries, and one that creates new entries.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: Move up vxlan_fdb_free(), vxlan_fdb_destroy()
Petr Machata [Wed, 16 Jan 2019 23:06:30 +0000 (23:06 +0000)]
vxlan: Move up vxlan_fdb_free(), vxlan_fdb_destroy()

These functions will be needed for rollbacks of vetoed FDB entries. Move
them up so that they are visible at their intended point of use.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'improving-TCP-behavior-on-host-congestion'
David S. Miller [Thu, 17 Jan 2019 23:12:26 +0000 (15:12 -0800)]
Merge branch 'improving-TCP-behavior-on-host-congestion'

Yuchung Cheng says:

====================
improving TCP behavior on host congestion

This patch set aims to improve how TCP handle local qdisc congestion
by simplifying the previous implementation.  Previously when an
skb fails to (re)transmit due to local qdisc congestion or other
resource issue, TCP refrains from setting the skb timestamp or the
recovery starting time.

This design makes determining when to abort a stalling socket more
complicated, as the timestamps of these tranmission attempts were
missing. The stack needs to sort of infer when the original attempt
happens. A by-product is a socket may disregard the system timeout
limit (i.e. sysctl net.ipv4.tcp_retries2 or USER_TIMEOUT option),
and continue to retry until the transmission is successful.

In data-center environment when TCP RTO is small, this could cause
the socket to retry frequently for long during qdisc congestion.

The solution is to first unconditionally timestamp skb and recovery
attempt. Then retry more conservatively (twice a second) on local
qdisc congestion but abort the sockets according to the system limit.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: less aggressive window probing on local congestion
Yuchung Cheng [Wed, 16 Jan 2019 23:05:35 +0000 (15:05 -0800)]
tcp: less aggressive window probing on local congestion

Previously when the sender fails to send (original) data packet or
window probes due to congestion in the local host (e.g. throttling
in qdisc), it'll retry within an RTO or two up to 500ms.

In low-RTT networks such as data-centers, RTO is often far below
the default minimum 200ms. Then local host congestion could trigger
a retry storm pouring gas to the fire. Worse yet, the probe counter
(icsk_probes_out) is not properly updated so the aggressive retry
may exceed the system limit (15 rounds) until the packet finally
slips through.

On such rare events, it's wise to retry more conservatively
(500ms) and update the stats properly to reflect these incidents
and follow the system limit. Note that this is consistent with
the behaviors when a keep-alive probe or RTO retry is dropped
due to local congestion.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: retry more conservatively on local congestion
Yuchung Cheng [Wed, 16 Jan 2019 23:05:34 +0000 (15:05 -0800)]
tcp: retry more conservatively on local congestion

Previously when the sender fails to retransmit a data packet on
timeout due to congestion in the local host (e.g. throttling in
qdisc), it'll retry within an RTO up to 500ms.

In low-RTT networks such as data-centers, RTO is often far
below the default minimum 200ms (and the cap 500ms). Then local
host congestion could trigger a retry storm pouring gas to the
fire. Worse yet, the retry counter (icsk_retransmits) is not
properly updated so the aggressive retry may exceed the system
limit (15 rounds) until the packet finally slips through.

On such rare events, it's wise to retry more conservatively (500ms)
and update the stats properly to reflect these incidents and follow
the system limit. Note that this is consistent with the behavior
when a keep-alive probe is dropped due to local congestion.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: simplify window probe aborting on USER_TIMEOUT
Yuchung Cheng [Wed, 16 Jan 2019 23:05:33 +0000 (15:05 -0800)]
tcp: simplify window probe aborting on USER_TIMEOUT

Previously we use the next unsent skb's timestamp to determine
when to abort a socket stalling on window probes. This no longer
works as skb timestamp reflects the last instead of the first
transmission.

Instead we can estimate how long the socket has been stalling
with the probe count and the exponential backoff behavior.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: create a helper to model exponential backoff
Yuchung Cheng [Wed, 16 Jan 2019 23:05:32 +0000 (15:05 -0800)]
tcp: create a helper to model exponential backoff

Create a helper to model TCP exponential backoff for the next patch.
This is pure refactor w no behavior change.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: properly track retry time on passive Fast Open
Yuchung Cheng [Wed, 16 Jan 2019 23:05:31 +0000 (15:05 -0800)]
tcp: properly track retry time on passive Fast Open

This patch addresses a corner issue on timeout behavior of a
passive Fast Open socket.  A passive Fast Open server may write
and close the socket when it is re-trying SYN-ACK to complete
the handshake. After the handshake is completely, the server does
not properly stamp the recovery start time (tp->retrans_stamp is
0), and the socket may abort immediately on the very first FIN
timeout, instead of retying until it passes the system or user
specified limit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: always set retrans_stamp on recovery
Yuchung Cheng [Wed, 16 Jan 2019 23:05:30 +0000 (15:05 -0800)]
tcp: always set retrans_stamp on recovery

Previously TCP socket's retrans_stamp is not set if the
retransmission has failed to send. As a result if a socket is
experiencing local issues to retransmit packets, determining when
to abort a socket is complicated w/o knowning the starting time of
the recovery since retrans_stamp may remain zero.

This complication causes sub-optimal behavior that TCP may use the
latest, instead of the first, retransmission time to compute the
elapsed time of a stalling connection due to local issues. Then TCP
may disrecard TCP retries settings and keep retrying until it finally
succeed: not a good idea when the local host is already strained.

The simple fix is to always timestamp the start of a recovery.
It's worth noting that retrans_stamp is also used to compare echo
timestamp values to detect spurious recovery. This patch does
not break that because retrans_stamp is still later than when the
original packet was sent.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: always timestamp on every skb transmission
Yuchung Cheng [Wed, 16 Jan 2019 23:05:29 +0000 (15:05 -0800)]
tcp: always timestamp on every skb transmission

Previously TCP skbs are not always timestamped if the transmission
failed due to memory or other local issues. This makes deciding
when to abort a socket tricky and complicated because the first
unacknowledged skb's timestamp may be 0 on TCP timeout.

The straight-forward fix is to always timestamp skb on every
transmission attempt. Also every skb retransmission needs to be
flagged properly to avoid RTT under-estimation. This can happen
upon receiving an ACK for the original packet and the a previous
(spurious) retransmission has failed.

It's worth noting that this reverts to the old time-stamping
style before commit 8c72c65b426b ("tcp: update skb->skb_mstamp more
carefully") which addresses a problem in computing the elapsed time
of a stalled window-probing socket. The problem will be addressed
differently in the next patches with a simpler approach.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: exit if nothing to retransmit on RTO timeout
Yuchung Cheng [Wed, 16 Jan 2019 23:05:28 +0000 (15:05 -0800)]
tcp: exit if nothing to retransmit on RTO timeout

Previously TCP only warns if its RTO timer fires and the
retransmission queue is empty, but it'll cause null pointer
reference later on. It's better to avoid such catastrophic failure
and simply exit with a warning.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: micrel: use phy_read_mmd and phy_write_mmd
Heiner Kallweit [Wed, 16 Jan 2019 20:52:22 +0000 (21:52 +0100)]
net: phy: micrel: use phy_read_mmd and phy_write_mmd

This driver implements open-coded versions of phy_read_mmd() and
phy_write_mmd() for KSZ9031. That's not needed, let's use the
phylib functions directly.

This is compile-tested only because I have no such hardware.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodavicom: Annotate implicit fall through in dm9000_set_io
Mathieu Malaterre [Wed, 16 Jan 2019 19:49:25 +0000 (20:49 +0100)]
davicom: Annotate implicit fall through in dm9000_set_io

There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

This commit removes the following warning:

  include/linux/device.h:1480:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
  drivers/net/ethernet/davicom/dm9000.c:397:3: note: in expansion of macro 'dev_dbg'
  drivers/net/ethernet/davicom/dm9000.c:398:2: note: here

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/ipv6/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE
David Herrmann [Tue, 15 Jan 2019 13:42:16 +0000 (14:42 +0100)]
net/ipv6/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE

The udp-tunnel setup allows binding sockets to a network device. Prefer
the new SO_BINDTOIFINDEX to avoid temporarily resolving the device-name
just to look it up in the ioctl again.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/ipv4/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE
David Herrmann [Tue, 15 Jan 2019 13:42:15 +0000 (14:42 +0100)]
net/ipv4/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE

The udp-tunnel setup allows binding sockets to a network device. Prefer
the new SO_BINDTOIFINDEX to avoid temporarily resolving the device-name
just to look it up in the ioctl again.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: introduce SO_BINDTOIFINDEX sockopt
David Herrmann [Tue, 15 Jan 2019 13:42:14 +0000 (14:42 +0100)]
net: introduce SO_BINDTOIFINDEX sockopt

This introduces a new generic SOL_SOCKET-level socket option called
SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a
network interface index as argument, rather than the network interface
name.

User-space often refers to network-interfaces via their index, but has
to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
This might pose problems when the network-device is renamed
asynchronously by other parts of the system. When this happens, the
SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
device.

In most cases user-space only ever operates on devices which they
either manage themselves, or otherwise have a guarantee that the device
name will not change (e.g., devices that are UP cannot be renamed).
However, particularly in libraries this guarantee is non-obvious and it
would be nice if that race-condition would simply not exist. It would
make it easier for those libraries to operate even in situations where
the device-name might change under the hood.

A real use-case that we recently hit is trying to start the network
stack early in the initrd but make it survive into the real system.
Existing distributions rename network-interfaces during the transition
from initrd into the real system. This, obviously, cannot affect
devices that are up and running (unless you also consider moving them
between network-namespaces). However, the network manager now has to
make sure its management engine for dormant devices will not run in
parallel to these renames. Particularly, when you offload operations
like DHCP into separate processes, these might setup their sockets
early, and thus have to resolve the device-name possibly running into
this race-condition.

By avoiding a call to resolve the device-name, we no longer depend on
the name and can run network setup of dormant devices in parallel to
the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this
race.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotls: Fix recvmsg() to be able to peek across multiple records
Vakul Garg [Wed, 16 Jan 2019 10:40:16 +0000 (10:40 +0000)]
tls: Fix recvmsg() to be able to peek across multiple records

This fixes recvmsg() to be able to peek across multiple tls records.
Without this patch, the tls's selftests test case
'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
maintains a 'rx_list' to retain incoming skb carrying tls records. If a
tls record needs to be retained e.g. for peek case or for the case when
the buffer passed to recvmsg() has a length smaller than decrypted
record length, then it is added to 'rx_list'. Additionally, records are
added in 'rx_list' if the crypto operation runs in async mode. The
records are dequeued from 'rx_list' after the decrypted data is consumed
by copying into the buffer passed to recvmsg(). In case, the MSG_PEEK
flag is used in recvmsg(), then records are not consumed or removed
from the 'rx_list'.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'dsa-lantiq_gswip-probe-fixes-and-remove-cleanup'
David S. Miller [Thu, 17 Jan 2019 20:12:19 +0000 (12:12 -0800)]
Merge branch 'dsa-lantiq_gswip-probe-fixes-and-remove-cleanup'

Johan Hovold says:

====================
net: dsa: lantiq_gswip: probe fixes and remove cleanup

This series fix a few issues found through inspection when fixing up new
bad uses of of_find_compatible_node() that have crept in since 4.19.

Note that these have only been compile tested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: lantiq_gswip: drop bogus drvdata check
Johan Hovold [Wed, 16 Jan 2019 10:23:35 +0000 (11:23 +0100)]
net: dsa: lantiq_gswip: drop bogus drvdata check

The platform-device driver data is set on successful probe and will
never be NULL on remove (or we have much bigger problems).

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: lantiq_gswip: fix OF child-node lookups
Johan Hovold [Wed, 16 Jan 2019 10:23:34 +0000 (11:23 +0100)]
net: dsa: lantiq_gswip: fix OF child-node lookups

Use the new of_get_compatible_child() helper to look up child nodes to
avoid ever matching non-child nodes elsewhere in the tree.

Also fix up the related struct device_node leaks.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Cc: stable <stable@vger.kernel.org> # 4.20
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: lantiq_gswip: fix use-after-free on failed probe
Johan Hovold [Wed, 16 Jan 2019 10:23:33 +0000 (11:23 +0100)]
net: dsa: lantiq_gswip: fix use-after-free on failed probe

Make sure to disable and deregister the switch on late probe errors to
avoid use-after-free when the device-resource-managed switch is freed.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Cc: stable <stable@vger.kernel.org> # 4.20
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agosfc: extend MTD support for newer hardware
Bert Kenward [Wed, 16 Jan 2019 10:00:39 +0000 (10:00 +0000)]
sfc: extend MTD support for newer hardware

The X2 family of NICs (based on the SFC9250) have additional
MTD partitions for firmware and configuration. This includes
partitions that are read-only.

The NICs also have extended versions of the NVRAM interface,
allowing more detailed status information to be returned.

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests/tls: Fix recv partial/large_buff test cases
Vakul Garg [Wed, 16 Jan 2019 08:40:58 +0000 (08:40 +0000)]
selftests/tls: Fix recv partial/large_buff test cases

TLS test cases recv_partial & recv_peek_large_buf_mult_recs expect to
receive a certain amount of data and then compare it against known
strings using memcmp. To prevent recvmsg() from returning lesser than
expected number of bytes (compared in memcmp), MSG_WAITALL needs to be
passed in recvmsg().

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: check return code when requesting PHY driver module
Heiner Kallweit [Wed, 16 Jan 2019 07:07:38 +0000 (08:07 +0100)]
net: phy: check return code when requesting PHY driver module

When requesting the PHY driver module fails we'll bind the genphy
driver later. This isn't obvious to the user and may cause, depending
on the PHY, different types of issues. Therefore check the return code
of request_module(). Note that we only check for failures in loading
the module, not whether a module exists for the respective PHY ID.

v2:
- add comment explaining what is checked and what is not
- return error from phy_device_create() if loading module fails

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/tls: Make function tls_sw_do_sendpage static
YueHaibing [Wed, 16 Jan 2019 02:39:28 +0000 (10:39 +0800)]
net/tls: Make function tls_sw_do_sendpage static

Fixes the following sparse warning:

 net/tls/tls_sw.c:1023:5: warning:
 symbol 'tls_sw_do_sendpage' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/tls: remove unused function tls_sw_sendpage_locked
YueHaibing [Wed, 16 Jan 2019 02:39:27 +0000 (10:39 +0800)]
net/tls: remove unused function tls_sw_sendpage_locked

There are no in-tree callers.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>