openwrt/staging/blogic.git
6 years agonvme: fixup crash on failed discovery
Hannes Reinecke [Tue, 7 Aug 2018 10:43:42 +0000 (12:43 +0200)]
nvme: fixup crash on failed discovery

When the initial discovery fails the subsystem hasn't been setup yet
in nvme_mpath_stop, and we can't dereference ctrl->subsys.

Fixes: 0d0b660f ("nvme: add ANA support")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agolightnvm: remove minor version check for 2.0
Matias Bjørling [Sun, 5 Aug 2018 14:25:19 +0000 (08:25 -0600)]
lightnvm: remove minor version check for 2.0

A minor version number increase should not break backwards
compatibility.

Fixes: 3cb98f84d368b ("lightnvm: add minor version to generic geometry")
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoMerge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2
Jens Axboe [Mon, 6 Aug 2018 01:34:09 +0000 (19:34 -0600)]
Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2

Pull NVMe changes from Christoph:

"This contains the support for TP4004, Asymmetric Namespace Access,
 which makes NVMe multipathing usable in practice."

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: use Retain Async Event bit to clear AEN
  nvmet: support configuring ANA groups
  nvmet: add minimal ANA support
  nvmet: track and limit the number of namespaces per subsystem
  nvmet: keep a port pointer in nvmet_ctrl
  nvme: add ANA support
  nvme: remove nvme_req_needs_failover
  nvme: simplify the API for getting log pages
  nvme.h: add ANA definitions
  nvme.h: add support for the log specific field

Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoMerge tag 'v4.18-rc6' into for-4.19/block2
Jens Axboe [Mon, 6 Aug 2018 01:32:09 +0000 (19:32 -0600)]
Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
6 years agoscsi: Check sense buffer size at build time
Kees Cook [Tue, 31 Jul 2018 19:51:54 +0000 (12:51 -0700)]
scsi: Check sense buffer size at build time

To avoid introducing problems like those fixed in commit f7068114d45e
("sr: pass down correctly sized SCSI sense buffer"), this creates a macro
wrapper for scsi_execute() that verifies the size of the sense buffer
similar to what was done for command string sizes in commit 3756f6401c30
("exec: avoid gcc-8 warning for get_task_comm").

Another solution could be to add a length argument to scsi_execute(),
but this function already takes a lot of arguments and Jens was not fond
of that approach.

Additionally, this moves the SCSI_SENSE_BUFFERSIZE definition into
scsi_device.h, and removes a redundant include for scsi_device.h from
scsi_cmnd.h.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agolibata-scsi: Move sense buffers onto stack
Kees Cook [Tue, 31 Jul 2018 19:51:53 +0000 (12:51 -0700)]
libata-scsi: Move sense buffers onto stack

To support future compile-time sizeof() checks that will be able to
validate the length of sense buffers, this removes the only dynamically
allocated sense buffers in the tree by putting the 96 byte sense buffers
on the stack.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agocdrom: Use struct scsi_sense_hdr internally
Kees Cook [Tue, 31 Jul 2018 19:51:52 +0000 (12:51 -0700)]
cdrom: Use struct scsi_sense_hdr internally

This removes more casts of struct request_sense and uses the standard
struct scsi_sense_hdr instead. This also fixes any possible stale values
since the prior code did not check the sense length.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoide-cd: Remove redundant sense buffer
Kees Cook [Tue, 31 Jul 2018 19:51:51 +0000 (12:51 -0700)]
ide-cd: Remove redundant sense buffer

This is already able to process the sense buffer, so remove the redundant
parsing during the failure path. This also fixes any possible stale values
since the prior code did not check the sense length.

Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: Switch struct packet_command to use struct scsi_sense_hdr
Kees Cook [Thu, 2 Aug 2018 21:22:13 +0000 (15:22 -0600)]
block: Switch struct packet_command to use struct scsi_sense_hdr

There is a lot of needless struct request_sense usage in the CDROM
code. These can all be struct scsi_sense_hdr instead, to avoid any
confusion over their respective structure sizes. This patch is a lot
of noise changing "sense" to "sshdr", but the final code is more
readable to distinguish between "sense" meaning "struct request_sense"
and "sshdr" meaning "struct scsi_sense_hdr".

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agotarget: don't depend on SCSI
Christoph Hellwig [Tue, 31 Jul 2018 19:51:49 +0000 (12:51 -0700)]
target: don't depend on SCSI

The core target code only needs code from scsi_common.c, which is now
separately selectable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoscsi: build scsi_common.o for all scsi passthrough request users
Christoph Hellwig [Tue, 31 Jul 2018 19:51:48 +0000 (12:51 -0700)]
scsi: build scsi_common.o for all scsi passthrough request users

Split scsi_common.o out of SCSI so that non-SCSI users can pull it in
easily for future sense buffer helper usage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoscsi: cxlflash: Drop unused sense buffers
Kees Cook [Tue, 31 Jul 2018 19:51:47 +0000 (12:51 -0700)]
scsi: cxlflash: Drop unused sense buffers

This removes the unused sense buffer in read_cap16() and write_same16().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoide-cd: Drop unused sense buffers
Kees Cook [Tue, 31 Jul 2018 19:51:46 +0000 (12:51 -0700)]
ide-cd: Drop unused sense buffers

This drops unused sense buffers from:

cdrom_eject()
cdrom_read_capacity()
cdrom_read_tocentry()
ide_cd_lockdoor()
ide_cd_read_toc()

Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblk-mq: fix updating tags depth
Ming Lei [Thu, 2 Aug 2018 10:23:26 +0000 (18:23 +0800)]
blk-mq: fix updating tags depth

The passed 'nr' from userspace represents the total depth, meantime
inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
and 'nr_reserved_tags' stores the reserved part.

There are two issues in blk_mq_tag_update_depth() now:

1) for growing tags, we should have used the passed 'nr', and keep the
number of reserved tags not changed.

2) the passed 'nr' should have been used for checking against
'tags->nr_tags', instead of number of the normal part.

This patch fixes the above two cases, and avoids kernel crash caused
by wrong resizing sbitmap queue.

Cc: "Ewan D. Milne" <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Tested by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: really disable runtime-pm for blk-mq
Ming Lei [Mon, 30 Jul 2018 12:02:19 +0000 (20:02 +0800)]
block: really disable runtime-pm for blk-mq

Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space still can switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.

This patch disables runtime-pm for blk-mq really by pm_runtime_disable()
and fixes all kinds of PM related kernel crash.

Cc: Tomas Janousek <tomi@nomi.cz>
Cc: Przemek Socha <soprwa@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <stable@vger.kernel.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoaoe: mark expected switch fall-through
Gustavo A. R. Silva [Wed, 1 Aug 2018 22:30:20 +0000 (17:30 -0500)]
aoe: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114722 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: make iolatency avg_lat exponentially decay
Dennis Zhou (Facebook) [Thu, 2 Aug 2018 06:15:41 +0000 (23:15 -0700)]
block: make iolatency avg_lat exponentially decay

Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.

This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.

The current sample window size is exposed in the debug info to enable
calculation of the window range.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblk-cgroup: clear the throttle queue on fork
Josef Bacik [Tue, 31 Jul 2018 16:39:04 +0000 (12:39 -0400)]
blk-cgroup: clear the throttle queue on fork

We were hitting a panic in production where we put too many times on the
request queue.  This is because we'd get the throttle_queue of the
parent if we fork()'ed while we needed to be throttled, but we didn't
have a reference on it.  Instead just clear these flags on fork so the
child doesn't pay for the sins of its father.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblk-cgroup: hold the queue ref during throttling
Josef Bacik [Tue, 31 Jul 2018 16:39:03 +0000 (12:39 -0400)]
blk-cgroup: hold the queue ref during throttling

The blkg lifetime is protected by the queue lifetime, so we need to put
the queue _after_ we're done using the blkg.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblk-iolatency: fix blkg leak in timer_fn
Josef Bacik [Tue, 31 Jul 2018 16:39:02 +0000 (12:39 -0400)]
blk-iolatency: fix blkg leak in timer_fn

At this point we have a ref on the blkg, we need to drop it if we don't
have a iolat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path
zhong jiang [Tue, 31 Jul 2018 16:13:14 +0000 (00:13 +0800)]
block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path

Simplify the code by using the PTR_ERR_OR_ZERO, instead of the
open code. It is better.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agot10-pi: provide empty t10_pi_complete() for !CONFIG_BLK_DEV_INTEGRITY
Jens Axboe [Tue, 31 Jul 2018 15:10:26 +0000 (09:10 -0600)]
t10-pi: provide empty t10_pi_complete() for !CONFIG_BLK_DEV_INTEGRITY

Fixes a link failure whtn BLK_DEV_INTEGRITY isn't defined.

Fixes: 10c41ddd6132 ("block: move dif_prepare/dif_complete functions to block layer")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: blk_init_allocated_queue() set q->fq as NULL in the fail case
xiao jin [Mon, 30 Jul 2018 06:11:12 +0000 (14:11 +0800)]
block: blk_init_allocated_queue() set q->fq as NULL in the fail case

We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.

Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.

Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.

The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().

Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agonvme: use blk API to remap ref tags for IOs with metadata
Max Gurtovoy [Sun, 29 Jul 2018 21:15:33 +0000 (00:15 +0300)]
nvme: use blk API to remap ref tags for IOs with metadata

Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: move dif_prepare/dif_complete functions to block layer
Max Gurtovoy [Sun, 29 Jul 2018 21:15:32 +0000 (00:15 +0300)]
block: move dif_prepare/dif_complete functions to block layer

Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: move ref_tag calculation func to the block layer
Max Gurtovoy [Sun, 29 Jul 2018 21:15:31 +0000 (00:15 +0300)]
block: move ref_tag calculation func to the block layer

Currently this function is implemented in the scsi layer, but it's
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: don't account for split bio's size in cgroup stats
Josef Bacik [Mon, 30 Jul 2018 14:10:01 +0000 (10:10 -0400)]
block: don't account for split bio's size in cgroup stats

We need to check in blkcg_bio_issue_check if the bio is flagged as
QUEUE_ENTERED, because if it is then we've already accounted for the
size of the IO in the cgroup stats.  We can still however account for
the extra IO since it'll be another request.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agopktcdvd: Fix possible Spectre-v1 for pkt_devs
Jinbum Park [Sat, 28 Jul 2018 04:20:44 +0000 (13:20 +0900)]
pktcdvd: Fix possible Spectre-v1 for pkt_devs

User controls @dev_minor which to be used as index of pkt_devs.
So, It can be exploited via Spectre-like attack. (speculative execution)

This kind of attack leaks address of pkt_devs, [1]
It leads an attacker to bypass security mechanism such as KASLR.

So sanitize @dev_minor before using it to prevent attack.

[1] https://github.com/jinb-park/linux-exploit/
tree/master/exploit-remaining-spectre-gadget/leak_pkt_devs.c

Signed-off-by: Jinbum Park <jinb.park7@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agonvmet: use Retain Async Event bit to clear AEN
Chaitanya Kulkarni [Thu, 26 Jul 2018 21:00:41 +0000 (14:00 -0700)]
nvmet: use Retain Async Event bit to clear AEN

In the current implementation, we clear the AEN bit when we get the
"get log page" command if given log page is associated with AEN.
This patch allows optionally retaining the AEN for the ctrl
under consideration when Retain Asynchronous Event (RAE) bit is set
as a part of "get log page" command.

This allows the host to read the Log page and optionally retaining the
AEN associated with this log page when using userspace tools like
nvme-cli.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: also use the new helper in the just merged ANA code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet: support configuring ANA groups
Christoph Hellwig [Fri, 1 Jun 2018 06:59:25 +0000 (08:59 +0200)]
nvmet: support configuring ANA groups

Allow creating non-default ANA groups (group ID > 1).  Groups are created
either by assigning the group ID to a namespace, or by creating a configfs
group object under a specific port.  All namespaces assigned to a group
that doesn't have a configfs object for a given port are marked as
inaccessible.

Allow changing the ANA state on a per-port basis by creating an
ana_groups directory under each port, and another directory with an
ana_state file in it.  The default ANA group 1 directory is created
automatically for each port.

For all changes in ANA configuration the ANA change AEN is sent.  We only
keep a global changecount instead of additional per-group changecounts to
keep the implementation as simple as possible.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvmet: add minimal ANA support
Christoph Hellwig [Thu, 19 Jul 2018 14:35:20 +0000 (07:35 -0700)]
nvmet: add minimal ANA support

Add support for Asynchronous Namespace Access as specified in NVMe 1.3
TP 4004.

Just add a default ANA group 1 that is optimized on all ports.  This is
(and will remain) the default assignment for any namespace not epxlicitly
assigned to another ANA group.  The ANA state can be manually changed
through the configfs interface, including the change state.

Includes fixes and improvements from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvmet: track and limit the number of namespaces per subsystem
Christoph Hellwig [Sun, 13 May 2018 17:00:13 +0000 (19:00 +0200)]
nvmet: track and limit the number of namespaces per subsystem

TP 4004 introduces a new 'Maximum Number of Allocated Namespaces' field
in the Identify controller data to help the host size resources.  Put
an upper limit on the supported namespaces to be able to support this
value as supporting 32-bits worth of namespaces would lead to very
large buffers.  The limit is completely arbitrary at this point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvmet: keep a port pointer in nvmet_ctrl
Christoph Hellwig [Thu, 7 Jun 2018 13:09:50 +0000 (15:09 +0200)]
nvmet: keep a port pointer in nvmet_ctrl

This will be needed for the ANA AEN code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvme: add ANA support
Christoph Hellwig [Mon, 14 May 2018 06:48:54 +0000 (08:48 +0200)]
nvme: add ANA support

Add support for Asynchronous Namespace Access as specified in NVMe 1.3
TP 4004.  With ANA each namespace attached to a controller belongs to an
ANA group that describes the characteristics of accessing the namespaces
through this controller.  In the optimized and non-optimized states
namespaces can be accessed regularly, although in a multi-pathing
environment we should always prefer to access a namespace through a
controller where an optimized relationship exists.  Namespaces in
Inaccessible, Permanent-Loss or Change state for a given controller
should not be accessed.

The states are updated through reading the ANA log page, which is read
once during controller initialization, whenever the ANA change notice
AEN is received, or when one of the ANA specific status codes that
signal a state change is received on a command.

The ANA state is kept in the nvme_ns structure, which makes the checks in
the fast path very simple.  Updating the ANA state when reading the log
page is also very simple, the only downside is that finding the initial
ANA state when scanning for namespaces is a bit cumbersome.

The gendisk for a ns_head is only registered once a live path for it
exists.  Without that the kernel would hang during partition scanning.

Includes fixes and improvements from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvme: remove nvme_req_needs_failover
Christoph Hellwig [Mon, 4 Jun 2018 06:43:00 +0000 (08:43 +0200)]
nvme: remove nvme_req_needs_failover

Now that we just call out to blk_path_error there isn't really any good
reason to not merge it into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvme: simplify the API for getting log pages
Christoph Hellwig [Wed, 6 Jun 2018 12:39:00 +0000 (14:39 +0200)]
nvme: simplify the API for getting log pages

Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes
a plain nsid instead of the nvme_ns pointer.  Also add support for the
log specific field while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvme.h: add ANA definitions
Christoph Hellwig [Sun, 13 May 2018 16:53:57 +0000 (18:53 +0200)]
nvme.h: add ANA definitions

Add various defintions from NVMe 1.3 TP 4004.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agonvme.h: add support for the log specific field
Christoph Hellwig [Sat, 12 May 2018 16:18:12 +0000 (18:18 +0200)]
nvme.h: add support for the log specific field

NVMe 1.3 added a new log specific field to the get log page CQ
defintion, add it to our get_log_page SQ structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
6 years agopartitions/aix: append null character to print data from disk
Mauricio Faria de Oliveira [Thu, 26 Jul 2018 01:46:29 +0000 (22:46 -0300)]
partitions/aix: append null character to print data from disk

Even if properly initialized, the lvname array (i.e., strings)
is read from disk, and might contain corrupt data (e.g., lack
the null terminating character for strings).

So, make sure the partition name string used in pr_warn() has
the null terminating character.

Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agopartitions/aix: fix usage of uninitialized lv_info and lvname structures
Mauricio Faria de Oliveira [Thu, 26 Jul 2018 01:46:28 +0000 (22:46 -0300)]
partitions/aix: fix usage of uninitialized lv_info and lvname structures

The if-block that sets a successful return value in aix_partition()
uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
initialized, but are used anyway if alloc_pvd() succeeds after it.

So, make the alloc_pvd() call conditional on their initialization.

This has been hit when attaching an apparently corrupted/stressed
AIX LUN, misleading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: stop using the deprecated get_seconds()
Arnd Bergmann [Thu, 26 Jul 2018 04:17:41 +0000 (12:17 +0800)]
bcache: stop using the deprecated get_seconds()

The get_seconds function is deprecated now since it returns a 32-bit
value that will eventually overflow, and we are replacing it throughout
the kernel with ktime_get_seconds() or ktime_get_real_seconds() that
return a time64_t.

bcache uses get_seconds() to read the current system time and store it in
the superblock as well as in uuid_entry structures that are user visible.

Unfortunately, the two structures in are still limited to 32 bits, so this
won't fix any real problems but will still overflow in year 2106. Let's
at least document that properly, in case we get an updated format in the
future it can be fixed. We still have a long time before the overflow
and checking the tools at https://github.com/koverstreet/bcache-tools
reveals no access to any of them.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: do not assign in if condition in bcache_device_init()
Florian Schmaus [Thu, 26 Jul 2018 04:17:40 +0000 (12:17 +0800)]
bcache: do not assign in if condition in bcache_device_init()

Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: do not assign in if condition in bcache_init()
Florian Schmaus [Thu, 26 Jul 2018 04:17:39 +0000 (12:17 +0800)]
bcache: do not assign in if condition in bcache_init()

Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: free heap cache_set->flush_btree in bch_journal_free
Shenghui Wang [Thu, 26 Jul 2018 04:17:38 +0000 (12:17 +0800)]
bcache: free heap cache_set->flush_btree in bch_journal_free

Free the cache_set->flush_bree heap memory on journal free.

Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: do not assign in if condition register_bcache()
Florian Schmaus [Thu, 26 Jul 2018 04:17:37 +0000 (12:17 +0800)]
bcache: do not assign in if condition register_bcache()

Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: fix I/O significant decline while backend devices registering
Tang Junhui [Thu, 26 Jul 2018 04:17:36 +0000 (12:17 +0800)]
bcache: fix I/O significant decline while backend devices registering

I attached several backend devices in the same cache set, and produced lots
of dirty data by running small rand I/O writes in a long time, then I
continue run I/O in the others cached devices, and stopped a cached device,
after a mean while, I register the stopped device again, I see the running
I/O in the others cached devices dropped significantly, sometimes even
jumps to zero.

In currently code, bcache would traverse each keys and btree node to count
the dirty data under read locker, and the writes threads can not get the
btree write locker, and when there is a lot of keys and btree node in the
registering device, it would last several seconds, so the write I/Os in
others cached device are blocked and declined significantly.

In this patch, when a device registering to a ache set, which exist others
cached devices with running I/Os, we get the amount of dirty data of the
device in an incremental way, and do not block other cached devices all the
time.

Patch v2: Rename some variables and macros name as Coly suggested.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: calculate the number of incremental GC nodes according to the total of btree...
Tang Junhui [Thu, 26 Jul 2018 04:17:35 +0000 (12:17 +0800)]
bcache: calculate the number of incremental GC nodes according to the total of btree nodes

This patch base on "[PATCH] bcache: finish incremental GC".

Since incremental GC would stop 100ms when front side I/O comes, so when
there are many btree nodes, if GC only processes constant (100) nodes each
time, GC would last a long time, and the front I/Os would run out of the
buckets (since no new bucket can be allocated during GC), and I/Os be
blocked again.

So GC should not process constant nodes, but varied nodes according to the
number of btree nodes. In this patch, GC is divided into constant (100)
times, so when there are many btree nodes, GC can process more nodes each
time, otherwise GC will process less nodes each time (but no less than
MIN_GC_NODES).

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: finish incremental GC
Tang Junhui [Thu, 26 Jul 2018 04:17:34 +0000 (12:17 +0800)]
bcache: finish incremental GC

In GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but in currently code, we didn't realize
it. When GC runs, front side IO would be blocked until the GC over, it
would be a long time if there is a lot of btree nodes.

This patch realizes incremental GC, the main ideal is that, when there
are front side I/Os, after GC some nodes (100), we stop GC, release locker
of the btree node, and go to process the front side I/Os for some times
(100 ms), then go back to GC again.

By this patch, when we doing GC, I/Os are not blocked all the time, and
there is no obvious I/Os zero jump problem any more.

Patch v2: Rename some variables and macros name as Coly suggested.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: simplify the calculation of the total amount of flash dirty data
Tang Junhui [Thu, 26 Jul 2018 04:17:33 +0000 (12:17 +0800)]
bcache: simplify the calculation of the total amount of flash dirty data

Currently we calculate the total amount of flash only devices dirty data
by adding the dirty data of each flash only device under registering
locker. It is very inefficient.

In this patch, we add a member flash_dev_dirty_sectors in struct cache_set
to record the total amount of flash only devices dirty data in real time,
so we didn't need to calculate the total amount of dirty data any more.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoreadahead: stricter check for bdi io_pages
Markus Stockhausen [Fri, 27 Jul 2018 15:09:53 +0000 (09:09 -0600)]
readahead: stricter check for bdi io_pages

ondemand_readahead() checks bdi->io_pages to cap the maximum pages
that need to be processed. This works until the readit section. If
we would do an async only readahead (async size = sync size) and
target is at beginning of window we expand the pages by another
get_next_ra_size() pages. Btrace for large reads shows that kernel
always issues a doubled size read at the beginning of processing.
Add an additional check for io_pages in the lower part of the func.
The fix helps devices that hard limit bio pages and rely on proper
handling of max_hw_read_sectors (e.g. older FusionIO cards). For
that reason it could qualify for stable.

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Signed-off-by: Markus Stockhausen stockhausen@collogia.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoscsi: virtio_scsi: fix pi_bytes{out,in} on 4 KiB block size devices
Greg Edwards [Thu, 26 Jul 2018 19:52:54 +0000 (15:52 -0400)]
scsi: virtio_scsi: fix pi_bytes{out,in} on 4 KiB block size devices

When the underlying device is a 4 KiB logical block size device with a
protection interval exponent of 0, i.e. 4096 bytes data + 8 bytes PI, the
driver miscalculates the pi_bytes{out,in} by a factor of 8x (64 bytes).

This leads to errors on all reads and writes on 4 KiB logical block size
devices when CONFIG_BLK_DEV_INTEGRITY is enabled and the
VIRTIO_SCSI_F_T10_PI feature bit has been negotiated.

Fixes: e6dc783a38ec0 ("virtio-scsi: Enable DIF/DIX modes in SCSI host LLD")
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: move bio_integrity_{intervals,bytes} into blkdev.h
Greg Edwards [Wed, 25 Jul 2018 14:22:58 +0000 (10:22 -0400)]
block: move bio_integrity_{intervals,bytes} into blkdev.h

This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoxen/blkfront: remove unused macros
Juergen Gross [Wed, 25 Jul 2018 07:42:07 +0000 (09:42 +0200)]
xen/blkfront: remove unused macros

Remove some macros not used anywhere.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoMerge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block
Jens Axboe [Wed, 25 Jul 2018 14:43:30 +0000 (08:43 -0600)]
Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block

Pull NVMe updates from Christoph:

"Highlights:

 - massively improved tracepoints (Keith Busch)
 - support for larger inline data in the RDMA host and target
   (Steve Wise)
 - RDMA setup/teardown path fixes and refactor (Sagi Grimberg)
 - Command Supported and Effects log support for the NVMe target
   (Chaitanya Kulkarni)
 - buffered I/O support for the NVMe target (Chaitanya Kulkarni)

 plus the usual set of cleanups and small enhancements."

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: don't use uuid_le type
  nvmet: check fileio lba range access boundaries
  nvmet: fix file discard return status
  nvme-rdma: centralize admin/io queue teardown sequence
  nvme-rdma: centralize controller setup sequence
  nvme-rdma: unquiesce queues when deleting the controller
  nvme-rdma: mark expected switch fall-through
  nvme: add disk name to trace events
  nvme: add controller name to trace events
  nvme: use hw qid in trace events
  nvme: cache struct nvme_ctrl reference to struct nvme_request
  nvmet-rdma: add an error flow for post_recv failures
  nvmet-rdma: add unlikely check in the fast path
  nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
  nvme-rdma: support up to 4 segments of inline data
  nvmet: add buffered I/O support for file backed ns
  nvmet: add commands supported and effects log page
  nvme: move init of keep_alive work item to controller initialization
  nvme.h: resync with nvme-cli

6 years agoblock: allow max_discard_segments to be stacked
Mike Snitzer [Fri, 20 Jul 2018 18:57:38 +0000 (14:57 -0400)]
block: allow max_discard_segments to be stacked

Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
that blk_stack_limits() can stack up this limit for stacked devices.

before:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
1

after:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
256

Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: unexport bio_clone_bioset
Christoph Hellwig [Tue, 24 Jul 2018 07:52:34 +0000 (09:52 +0200)]
block: unexport bio_clone_bioset

Now only used by the bounce code, so move it there and mark the function
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agomd: remove a bogus comment
Christoph Hellwig [Tue, 24 Jul 2018 07:52:33 +0000 (09:52 +0200)]
md: remove a bogus comment

The function name mentioned doesn't exist, and the code next to it
doesn't match the description either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: remove bio_clone_kmalloc
Christoph Hellwig [Tue, 24 Jul 2018 07:52:32 +0000 (09:52 +0200)]
block: remove bio_clone_kmalloc

Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoexofs: use bio_clone_fast in _write_mirror
Christoph Hellwig [Tue, 24 Jul 2018 07:52:31 +0000 (09:52 +0200)]
exofs: use bio_clone_fast in _write_mirror

The mirroring code never changes the bio data or biovecs.  This means
we can reuse the biovec allocation easily instead of duplicating it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agobcache: don't clone bio in bch_data_verify
Christoph Hellwig [Tue, 24 Jul 2018 07:52:30 +0000 (09:52 +0200)]
bcache: don't clone bio in bch_data_verify

We immediately overwrite the biovec array, so instead just allocate
a new bio and copy over the disk, setor and size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: bio_set_pages_dirty can't see NULL bv_page in a valid bio_vec
Christoph Hellwig [Tue, 24 Jul 2018 12:04:13 +0000 (14:04 +0200)]
block: bio_set_pages_dirty can't see NULL bv_page in a valid bio_vec

So don't bother handling it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: simplify bio_check_pages_dirty
Christoph Hellwig [Tue, 24 Jul 2018 12:04:12 +0000 (14:04 +0200)]
block: simplify bio_check_pages_dirty

bio_check_pages_dirty currently inviolates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be zero, and that is going to become
really annoying with multpath biovecs.  Fortunately there isn't any
all that good reason for it - once we decide to defer freeing the bio
to a workqueue holding onto a few additional pages isn't really an
issue anymore.  So just check if there is a clean page that needs
dirtying in the first path, and do a second pass to free them if there
was none, while the cache is still hot.

Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoblock: Rename the null_blk_mod kernel module back into null_blk
Bart Van Assche [Mon, 23 Jul 2018 21:18:33 +0000 (14:18 -0700)]
block: Rename the null_blk_mod kernel module back into null_blk

Commit ca4b2a011948 ("null_blk: add zone support") breaks several
blktests scripts because it renamed the null_blk kernel module into
null_blk_mod. Hence rename null_blk_mod back into null_blk.

Fixes: ca4b2a011948 ("null_blk: add zone support")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Matias Bjorling <matias.bjorling@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agonvmet: don't use uuid_le type
Andy Shevchenko [Tue, 17 Jul 2018 14:17:36 +0000 (17:17 +0300)]
nvmet: don't use uuid_le type

Don't use sizeof(uuid_le) where none of the parameters is type of uuid_le.
Since both arguments are u8 [16], use size of destination there.

Moreover, uuid_le is a deprecated type, and nvmet is using uuid_t
already.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet: check fileio lba range access boundaries
Sagi Grimberg [Wed, 11 Jul 2018 13:13:04 +0000 (16:13 +0300)]
nvmet: check fileio lba range access boundaries

Fail out-of-bounds with a proper status code.

Fixes: d5eff33ee6f8 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet: fix file discard return status
Sagi Grimberg [Wed, 11 Jul 2018 09:43:16 +0000 (12:43 +0300)]
nvmet: fix file discard return status

If nvmet_copy_from_sgl failed, we falsly return successful
completion status.

Fixes: d5eff33ee6f8 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme-rdma: centralize admin/io queue teardown sequence
Sagi Grimberg [Mon, 9 Jul 2018 09:49:07 +0000 (12:49 +0300)]
nvme-rdma: centralize admin/io queue teardown sequence

We follow the same queue teardown sequence in delete, reset and error
recovery. Centralize the logic.  This patch does not change any
functionality.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme-rdma: centralize controller setup sequence
Sagi Grimberg [Mon, 9 Jul 2018 09:49:06 +0000 (12:49 +0300)]
nvme-rdma: centralize controller setup sequence

Centralize controller sequence to a single routine that correctly cleans
up after failures instead of having multiple apperances in several flows
(create, reset, reconnect).

One thing that we also gain here are the sanity/boundary checks also
when connecting back to a dynamic controller.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme-rdma: unquiesce queues when deleting the controller
Sagi Grimberg [Mon, 9 Jul 2018 09:49:05 +0000 (12:49 +0300)]
nvme-rdma: unquiesce queues when deleting the controller

If the controller is going away, we need to unquiesce the IO queues so
that all pending request can fail gracefully before moving forward with
controller deletion. Do that before we destroy the IO queues so
blk_cleanup_queue won't block in freeze.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme-rdma: mark expected switch fall-through
Gustavo A. R. Silva [Thu, 5 Jul 2018 13:14:00 +0000 (08:14 -0500)]
nvme-rdma: mark expected switch fall-through

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme: add disk name to trace events
Keith Busch [Fri, 29 Jun 2018 22:50:03 +0000 (16:50 -0600)]
nvme: add disk name to trace events

This will print the disk name to the nvme event trace for io requests so
a user can better distinguish traffic to different disks. This can be used
to  create disk based filters. For example, to see only nvme0n2 traffic:

  echo "disk == \"nvme0n2\"" > /sys/kernel/debug/tracing/events/nvme/filter

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: turned __assign_disk_name into an inline function]
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme: add controller name to trace events
Keith Busch [Mon, 2 Jul 2018 15:15:03 +0000 (09:15 -0600)]
nvme: add controller name to trace events

This appends the controller instance to the nvme trace buffer to
distinguish which controller is dispatching and completing a command.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme: use hw qid in trace events
Keith Busch [Fri, 29 Jun 2018 22:50:01 +0000 (16:50 -0600)]
nvme: use hw qid in trace events

We can not match a command to its completion based on the command
id alone. We need the submitting queue identifier to pair with the
completion, so this patch adds that to the trace buffer.

This patch is also collapsing the admin and IO submission traces into a
single one so we don't need to duplicate this and creating unnecessary
code branches: we know if the command is an admin vs IO based on the qid.

And since we're here, the patch fixes code formatting in the area.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: move the qid helper to nvme.h and made it an inline function]
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme: cache struct nvme_ctrl reference to struct nvme_request
Sagi Grimberg [Fri, 29 Jun 2018 22:50:00 +0000 (16:50 -0600)]
nvme: cache struct nvme_ctrl reference to struct nvme_request

We will need to reference the controller in the setup and completion
time for tracing and future traffic based keep alive support.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet-rdma: add an error flow for post_recv failures
Max Gurtovoy [Sun, 1 Jul 2018 09:20:24 +0000 (12:20 +0300)]
nvmet-rdma: add an error flow for post_recv failures

Posting receive buffer operation can fail, thus we should make
sure to have an error flow during initialization phase. While
we're here, add a debug print in case of a failure.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet-rdma: add unlikely check in the fast path
Max Gurtovoy [Wed, 27 Jun 2018 11:58:02 +0000 (14:58 +0300)]
nvmet-rdma: add unlikely check in the fast path

ib_post_send operation should succeed unless something unusual
happened to the ib device.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet-rdma: support max(16KB, PAGE_SIZE) inline data
Steve Wise [Wed, 20 Jun 2018 14:15:10 +0000 (07:15 -0700)]
nvmet-rdma: support max(16KB, PAGE_SIZE) inline data

The patch enables inline data sizes using up to 4 recv sges, and capping
the size at 16KB or at least 1 page size.  So on a 4K page system, up to
16KB is supported, and for a 64K page system 1 page of 64KB is supported.

We avoid > 0 order page allocations for the inline buffers by using
multiple recv sges, one for each page.  If the device cannot support
the configured inline data size due to lack of enough recv sges, then
log a warning and reduce the inline size.

Add a new configfs port attribute, called param_inline_data_size,
to allow configuring the size of inline data for a given nvmf port.
The maximum size allowed is still enforced by nvmet-rdma with
NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is now max(16KB, PAGE_SIZE).
And the default size, if not specified via configfs, is still PAGE_SIZE.
This preserves the existing behavior, but allows larger inline sizes
for small page systems.  If the configured inline data size exceeds
NVMET_RDMA_MAX_INLINE_DATA_SIZE, a warning is logged and the size is
reduced.  If param_inline_data_size is set to 0, then inline data is
disabled for that nvmf port.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme-rdma: support up to 4 segments of inline data
Steve Wise [Wed, 20 Jun 2018 14:15:05 +0000 (07:15 -0700)]
nvme-rdma: support up to 4 segments of inline data

Allow up to 4 segments of inline data for NVMF WRITE operations. This
reduces latency for small WRITEs by removing the need for the target to
issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp.

Also cap the inline segments used based on the limitations of the
device.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet: add buffered I/O support for file backed ns
Chaitanya Kulkarni [Wed, 20 Jun 2018 04:01:41 +0000 (00:01 -0400)]
nvmet: add buffered I/O support for file backed ns

Add a new "buffered_io" attribute, which disabled direct I/O and thus
enables page cache based caching when enabled.   The attribute can only
be changed when the namespace is disabled as the file has to be reopend
for the change to take effect.

The possibly blocking read/write are deferred to a newly introduced
global workqueue.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvmet: add commands supported and effects log page
Chaitanya Kulkarni [Mon, 11 Jun 2018 17:40:07 +0000 (13:40 -0400)]
nvmet: add commands supported and effects log page

This patch adds support for Commands Supported and Effects log page
(Log Identifier 05h) for NVMeOF. This also makes it easier to find
which commands are supported, e.g. :-

subnqn    : testnqn1
Admin Command Set
ACS2     [Get Log Page                    ] 00000001
ACS6     [Identify                        ] 00000001
ACS8     [Abort                           ] 00000001
ACS9     [Set Features                    ] 00000001
ACS10    [Get Features                    ] 00000001
ACS12    [Asynchronous Event Request      ] 00000001
ACS24    [Keep Alive                      ] 00000001

NVM Command Set
IOCS0    [Flush                           ] 00000001
IOCS1    [Write                           ] 00000001
IOCS2    [Read                            ] 00000001
IOCS8    [Write Zeroes                    ] 00000001
IOCS9    [Dataset Management              ] 00000001

This partticular functionality can be used from the host side to examine
the NVMeOF ctrl commands supported.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme: move init of keep_alive work item to controller initialization
James Smart [Tue, 12 Jun 2018 23:28:24 +0000 (16:28 -0700)]
nvme: move init of keep_alive work item to controller initialization

Currently, the code initializes the keep alive work item whenever
nvme_start_keep_alive() is called. However, this routine is called
several times while reconnecting, etc. Although it's hoped that keep
alive is always disabled and not scheduled when start is called,
re-initing if it were scheduled or completing can have very bad
side effects. There's no need for re-initialization.

Move the keep_alive work item and cmd struct initialization to
controller init.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agonvme.h: resync with nvme-cli
Revanth Rajashekar [Fri, 15 Jun 2018 18:39:27 +0000 (12:39 -0600)]
nvme.h: resync with nvme-cli

Added some feature ids present in nvme-cli but not kernel.

Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 years agoblk-mq: fail the request in case issue failure
Ming Lei [Sun, 22 Jul 2018 06:10:15 +0000 (14:10 +0800)]
blk-mq: fail the request in case issue failure

Inside blk_mq_try_issue_list_directly(), if the request is issued as
failed, we shouldn't try to do it again, otherwise the warning in
blk_mq_start_request() will be triggered. This change is aligned to
behaviour of other ways of request issue & dispatch.

Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoLinux 4.18-rc6
Linus Torvalds [Sun, 22 Jul 2018 21:12:20 +0000 (14:12 -0700)]
Linux 4.18-rc6

6 years agoMerge tag 'nvme-for-4.18' of git://git.infradead.org/nvme
Linus Torvalds [Sun, 22 Jul 2018 20:21:45 +0000 (13:21 -0700)]
Merge tag 'nvme-for-4.18' of git://git.infradead.org/nvme

Pull NVMe fixes from Christoph Hellwig:

 - fix a regression in 4.18 that causes a memory leak on probe failure
   (Keith Bush)

 - fix a deadlock in the passthrough ioctl code (Scott Bauer)

 - don't enable AENs if not supported (Weiping Zhang)

 - fix an old regression in metadata handling in the passthrough ioctl
   code (Roland Dreier)

* tag 'nvme-for-4.18' of git://git.infradead.org/nvme:
  nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
  nvme: don't enable AEN if not supported
  nvme: ensure forward progress during Admin passthru
  nvme-pci: fix memory leak on probe failure

6 years agoMerge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sun, 22 Jul 2018 19:04:51 +0000 (12:04 -0700)]
Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "Fix several places that screw up cleanups after failures halfway
  through opening a file (one open-coding filp_clone_open() and getting
  it wrong, two misusing alloc_file()). That part is -stable fodder from
  the 'work.open' branch.

  And Christoph's regression fix for uapi breakage in aio series;
  include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
  definition of sigset_t, the reason for doing so in the first place had
  been bogus - there's no need to expose struct __aio_sigset in
  aio_abi.h at all"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: don't expose __aio_sigset in uapi
  ocxlflash_getfile(): fix double-iput() on alloc_file() failures
  cxl_getfile(): fix double-iput() on alloc_file() failures
  drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

6 years agoalpha: fix osf_wait4() breakage
Al Viro [Sun, 22 Jul 2018 14:07:11 +0000 (15:07 +0100)]
alpha: fix osf_wait4() breakage

kernel_wait4() expects a userland address for status - it's only
rusage that goes as a kernel one (and needs a copyout afterwards)

[ Also, fix the prototype of kernel_wait4() to have that __user
  annotation   - Linus ]

Fixes: 92ebce5ac55d ("osf_wait4: switch to kernel_wait4()")
Cc: stable@kernel.org # v4.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agoblk-rq-qos: make depth comparisons unsigned
Josef Bacik [Fri, 20 Jul 2018 01:42:13 +0000 (21:42 -0400)]
blk-rq-qos: make depth comparisons unsigned

With the change to use UINT_MAX I broke the depth check as any value of
inflight (ie 0) would be less than (int)UINT_MAX.  Fix this by changing
everything to unsigned int to match the depth.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 years agoMerge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Linus Torvalds [Sun, 22 Jul 2018 00:27:42 +0000 (17:27 -0700)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:

 - Fix interrupt type on ethernet switch for i.MX-based RDU2

 - GPC on i.MX exposed too large a register window which resulted in
   userspace being able to crash the machine.

 - Fixup of bad merge resolution moving GPIO DT nodes under pinctrl on
   droid4.

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: dts: imx6: RDU2: fix irq type for mv88e6xxx switch
  soc: imx: gpc: restrict register range for regmap access
  ARM: dts: omap4-droid4: fix dts w.r.t. pwm

6 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 22 Jul 2018 00:25:49 +0000 (17:25 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
 "A single fix for a MCE-polling regression, which prevented the
  disabling of polling"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/MCE: Remove min interval polling limitation

6 years agoMerge branch 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 22 Jul 2018 00:23:58 +0000 (17:23 -0700)]
Merge branch 'x86-pti-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 pti fixes from Ingo Molnar:
 "An APM fix, and a BTS hardware-tracing fix related to PTI changes"

* 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apm: Don't access __preempt_count with zeroed fs
  x86/events/intel/ds: Fix bts_interrupt_threshold alignment

6 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 22 Jul 2018 00:21:34 +0000 (17:21 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix switched_from_dl() warning
  stop_machine: Disable preemption when waking two stopper threads

6 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 21 Jul 2018 23:52:08 +0000 (16:52 -0700)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull core kernel fixes from Ingo Molnar:
 "This is mostly the copy_to_user_mcsafe() related fixes from Dan
  Williams, and an ORC fix for Clang"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
  lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()
  lib/iov_iter: Document _copy_to_iter_flushcache()
  lib/iov_iter: Document _copy_to_iter_mcsafe()
  objtool: Use '.strtab' if '.shstrtab' doesn't exist, to support ORC tables on Clang

6 years agoMerge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sat, 21 Jul 2018 23:46:53 +0000 (16:46 -0700)]
Merge tag 'powerpc-4.18-4' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Two regression fixes, one for xmon disassembly formatting and the
  other to fix the E500 build.

  Two commits to fix a potential security issue in the VFIO code under
  obscure circumstances.

  And finally a fix to the Power9 idle code to restore SPRG3, which is
  user visible and used for sched_getcpu().

  Thanks to: Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy,
  James Clarke"

* tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
  powerpc/Makefile: Assemble with -me500 when building for E500
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  powerpc/xmon: Fix disassembly since printf changes

6 years agoMerge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Sat, 21 Jul 2018 23:42:03 +0000 (16:42 -0700)]
Merge tag 'for-4.18-rc5-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix of a corruption regarding fsync and clone, under some very
  specific conditions explained in the patch.

  The fix is marked for stable 3.16+ so I'd like to get it merged now
  given the impact"

* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix file data corruption after cloning a range and fsync

6 years agomm: make vm_area_alloc() initialize core fields
Linus Torvalds [Sat, 21 Jul 2018 22:24:03 +0000 (15:24 -0700)]
mm: make vm_area_alloc() initialize core fields

Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agomm: make vm_area_dup() actually copy the old vma data
Linus Torvalds [Sat, 21 Jul 2018 21:48:45 +0000 (14:48 -0700)]
mm: make vm_area_dup() actually copy the old vma data

.. and re-initialize th eanon_vma_chain head.

This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agomm: use helper functions for allocating and freeing vm_area structs
Linus Torvalds [Sat, 21 Jul 2018 20:48:51 +0000 (13:48 -0700)]
mm: use helper functions for allocating and freeing vm_area structs

The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sat, 21 Jul 2018 20:14:17 +0000 (13:14 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: memcg: fix use after free in mem_cgroup_iter()
  mm/huge_memory.c: fix data loss when splitting a file pmd
  fat: fix memory allocation failure handling of match_strdup()
  MAINTAINERS: Peter has moved
  mm/memblock: add missing include <linux/bootmem.h>

6 years agomm: memcg: fix use after free in mem_cgroup_iter()
Jing Xia [Sat, 21 Jul 2018 00:53:48 +0000 (17:53 -0700)]
mm: memcg: fix use after free in mem_cgroup_iter()

It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
  mem_cgroup_iter+0x2e0/0x6d4
  shrink_zone+0x8c/0x324
  balance_pgdat+0x450/0x640
  kswapd+0x130/0x4b8
  kthread+0xe8/0xfc
  ret_from_fork+0x10/0x20

  mem_cgroup_iter():
      ......
      if (css_tryget(css))    <-- crash here
    break;
      ......

The crashing reason is that mem_cgroup_iter() uses the memcg object whose
pointer is stored in iter->position, which has been freed before and
filled with POISON_FREE(0x6b).

And the root cause of the use-after-free issue is that
invalidate_reclaim_iterators() fails to reset the value of iter->position
to NULL when the css of the memcg is released in non- hierarchical mode.

Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
Fixes: 6df38689e0e9 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <chunyan.zhang@unisoc.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>