openwrt/staging/blogic.git
7 years agoblk-mq: don't special case flush inserts for blk-mq-sched
Jens Axboe [Fri, 17 Feb 2017 18:38:36 +0000 (11:38 -0700)]
blk-mq: don't special case flush inserts for blk-mq-sched

The current request insertion machinery works just fine for
directly inserting flushes, so no need to special case
this anymore.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
7 years agoblk-mq-sched: don't add flushes to the head of requeue queue
Jens Axboe [Fri, 17 Feb 2017 18:37:14 +0000 (11:37 -0700)]
blk-mq-sched: don't add flushes to the head of requeue queue

If we are currently out of driver tags, we don't want to add a
new flush (without a tag) to the head of the requeue list. We
want to add it to the back, behind the others that are
potentially also waiting for a tag.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
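
The ordering argument above can be sketched in plain C. This is only an illustration under assumed names (demo_rq, demo_list and the helpers are hypothetical, not the kernel's requeue code): a flush that has no driver tag is appended behind the requests that are already waiting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not the kernel's struct request. */
struct demo_rq {
    int id;
    struct demo_rq *next;
};

/* Hypothetical singly linked requeue list with head and tail pointers. */
struct demo_list {
    struct demo_rq *head;
    struct demo_rq *tail;
};

/* Append a request behind everything already on the list. */
static void demo_add_tail(struct demo_list *l, struct demo_rq *rq)
{
    rq->next = NULL;
    if (l->tail)
        l->tail->next = rq;
    else
        l->head = rq;
    l->tail = rq;
}

/* A flush without a driver tag waits its turn at the back. */
static void demo_requeue_untagged_flush(struct demo_list *l,
                                        struct demo_rq *flush)
{
    demo_add_tail(l, flush);
}
```

With two requests already waiting, the flush ends up as the list tail and the earlier waiters keep their positions.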
7 years agoblk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
Jens Axboe [Fri, 17 Feb 2017 18:35:35 +0000 (11:35 -0700)]
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not

Currently we're almost there, but if we dispatch nothing, then we
still return success.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
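
A minimal sketch of the return-value contract described above, assuming a hypothetical driver hook (demo_queue_fn) that reports whether it accepted a request. It is not the real blk_mq_dispatch_rq_list(); it only shows "return true if and only if something was queued".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver hook: returns true if it accepted the request. */
typedef bool (*demo_queue_fn)(int rq);

/*
 * Dispatch up to n requests, stopping at the first one the driver
 * refuses. Returns true only if at least one request was queued.
 */
static bool demo_dispatch_list(const int *rqs, size_t n,
                               demo_queue_fn queue_rq)
{
    size_t queued = 0;

    for (size_t i = 0; i < n; i++) {
        if (!queue_rq(rqs[i]))
            break;
        queued++;
    }
    return queued != 0;
}

/* Two toy drivers: one always accepts, one is always busy. */
static bool demo_accept_all(int rq) { (void)rq; return true; }
static bool demo_busy(int rq) { (void)rq; return false; }
```

An empty list and a busy driver both yield false here, which is the case the commit message says used to be reported as success.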
7 years agoblock: do not allow updates through sysfs until registration completes
Tahsin Erdogan [Wed, 15 Feb 2017 03:27:38 +0000 (19:27 -0800)]
block: do not allow updates through sysfs until registration completes

When a new disk shows up, sysfs queue directory is created before elevator
is registered. This allows a user to attempt a scheduler switch even though
the initial registration hasn't completed yet.

In one scenario, blk_register_queue() calls elv_register_queue() and
right before cfq_registered_queue() is called, another process executes
elevator_switch() and replaces q->elevator with deadline scheduler. When
cfq_registered_queue() executes it interprets e->elevator_data as struct
cfq_data even though it is actually struct deadline_data.

Grab q->sysfs_lock in blk_register_queue() to synchronize with sysfs
callers.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: don't hold queue_lock when calling exit_icq
Omar Sandoval [Fri, 10 Feb 2017 18:32:34 +0000 (10:32 -0800)]
blk-mq-sched: don't hold queue_lock when calling exit_icq

None of the other blk-mq elevator hooks are called with this lock held.
Additionally, it can lead to circular locking dependencies between
queue_lock and the private scheduler lock.

Reported-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: set make_request_fn manually in blk_mq_update_nr_hw_queues
Josef Bacik [Fri, 10 Feb 2017 18:03:33 +0000 (13:03 -0500)]
block: set make_request_fn manually in blk_mq_update_nr_hw_queues

Calling blk_queue_make_request resets a bunch of settings on the
request_queue, but all we really want is to update the make_request_fn,
so do this directly so we don't lose things like the logical and
physical block sizes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
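
The difference between the two approaches can be sketched as follows. The struct, the helper names and the 512-byte defaults are assumptions for illustration only; the real request_queue carries many more settings.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_make_request_fn)(int bio);

/* Hypothetical, heavily reduced request queue. */
struct demo_queue {
    demo_make_request_fn make_request_fn;
    unsigned int logical_block_size;
    unsigned int physical_block_size;
};

/* Mimics a setup helper that also resets limits to (assumed) defaults. */
static void demo_queue_make_request(struct demo_queue *q,
                                    demo_make_request_fn fn)
{
    q->logical_block_size = 512;
    q->physical_block_size = 512;
    q->make_request_fn = fn;
}

/* Updating only the function pointer leaves the limits alone. */
static void demo_update_make_request(struct demo_queue *q,
                                     demo_make_request_fn fn)
{
    q->make_request_fn = fn;
}

static int demo_fn_a(int bio) { return bio; }
static int demo_fn_b(int bio) { return bio + 1; }
```

If a driver has set 4096-byte block sizes, calling the setup helper again silently drops them back to the defaults, while the direct assignment keeps them.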
7 years agoblk-mq: pass bio to blk_mq_sched_get_rq_priv
Paolo Valente [Tue, 7 Feb 2017 17:24:43 +0000 (18:24 +0100)]
blk-mq: pass bio to blk_mq_sched_get_rq_priv

bio is used in bfq-mq's get_rq_priv to get the request group. We could
pass the group directly here, but I thought that passing the bio was
more general, since it makes it possible to get other pieces of
information if needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: fix double-free in the failure path of cgwb_bdi_init()
Tejun Heo [Wed, 8 Feb 2017 20:19:07 +0000 (15:19 -0500)]
block: fix double-free in the failure path of cgwb_bdi_init()

When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
at bdi->wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
doesn't do further initialization.  This usually works fine as the
reference count gets bumped to 1 by wb_init() and the put from
wb_exit() releases it.

However, when wb_init() fails, it puts the wb base ref automatically
freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
ends up trying to free the same pointer the second time causing a
double-free.

Fix it by explicitly initializing the refcnt to 1 and putting the base
ref from cgwb_bdi_destroy().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Jens Axboe <axboe@fb.com>
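
The idea of the fix can be sketched with a toy refcounted object. All names here are hypothetical and the counter exists only so the behaviour can be checked; the point is that the object starts with a base reference of 1 and is released only through the put path, never by a separate explicit free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for the congested state. */
struct demo_congested {
    int refcnt;
};

static int demo_free_calls; /* counts frees so the sketch can be checked */

/* Allocate the object with its base reference already taken. */
static struct demo_congested *demo_congested_alloc(void)
{
    struct demo_congested *c = calloc(1, sizeof(*c));

    if (c)
        c->refcnt = 1;
    return c;
}

static void demo_congested_get(struct demo_congested *c)
{
    c->refcnt++;
}

/* Drop a reference; the object is freed only by the final put. */
static void demo_congested_put(struct demo_congested *c)
{
    if (--c->refcnt == 0) {
        demo_free_calls++;
        free(c);
    }
}
```

A failing initializer that takes and then drops its own reference no longer frees the object, so the error path can drop the base reference exactly once.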
7 years agonvme: support ranged discard requests
Christoph Hellwig [Wed, 8 Feb 2017 13:46:50 +0000 (14:46 +0100)]
nvme: support ranged discard requests

NVMe supports up to 256 ranges per DSM command, so wire up support
for ranged discards up to that limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: optionally merge discontiguous discard bios into a single request
Christoph Hellwig [Wed, 8 Feb 2017 13:46:49 +0000 (14:46 +0100)]
block: optionally merge discontiguous discard bios into a single request

Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code.  I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
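
A rough sketch of the merge strategy described above: keep appending discard ranges to one request until either the range count or the total size limit is hit. The types and the helper are hypothetical; the 256-range cap mirrors the NVMe DSM limit mentioned in this log, and per-range size limits are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_DISCARD_RANGES 256 /* assumed cap, after the NVMe DSM limit */

/* One discontiguous discard range. */
struct demo_range {
    unsigned long long sector;
    unsigned int nr_sectors;
};

/* Hypothetical request that collects discard ranges. */
struct demo_discard_rq {
    struct demo_range ranges[DEMO_MAX_DISCARD_RANGES];
    size_t nr_ranges;
    unsigned long long total_sectors;
};

/* Try to merge one more range; refuse once either limit would be exceeded. */
static bool demo_discard_merge(struct demo_discard_rq *rq, struct demo_range r,
                               unsigned long long max_sectors)
{
    if (rq->nr_ranges >= DEMO_MAX_DISCARD_RANGES)
        return false;
    if (rq->total_sectors + r.nr_sectors > max_sectors)
        return false;
    rq->ranges[rq->nr_ranges++] = r;
    rq->total_sectors += r.nr_sectors;
    return true;
}
```

When the helper returns false the caller would start a new request for the range instead of merging it.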
7 years agoblock: enumify ELEVATOR_*_MERGE
Christoph Hellwig [Wed, 8 Feb 2017 13:46:48 +0000 (14:46 +0100)]
block: enumify ELEVATOR_*_MERGE

Switch these constants to an enum, and let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
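
Why an enum helps can be shown with a small sketch. The enumerator names below are hypothetical stand-ins, not the kernel's definitions. With GCC or Clang, -Wall enables -Wswitch, which warns when a switch over an enum type has no default label and misses an enumerator, so adding a new merge type flags every switch that has not been updated.

```c
#include <assert.h>

/* Hypothetical merge results; stand-ins for the ELEVATOR_*_MERGE values. */
enum demo_elv_merge {
    DEMO_NO_MERGE,
    DEMO_FRONT_MERGE,
    DEMO_BACK_MERGE,
};

/*
 * No default label on purpose: the compiler can then warn about any
 * enumerator that is added later but not handled here.
 */
static const char *demo_merge_name(enum demo_elv_merge m)
{
    switch (m) {
    case DEMO_NO_MERGE:
        return "none";
    case DEMO_FRONT_MERGE:
        return "front";
    case DEMO_BACK_MERGE:
        return "back";
    }
    return "unknown"; /* only reached for out-of-range values */
}
```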
7 years agoblock: move req_set_nomerge to blk.h
Christoph Hellwig [Wed, 8 Feb 2017 13:46:47 +0000 (14:46 +0100)]
block: move req_set_nomerge to blk.h

This makes it available outside of blk-merge.c, and inlining such a trivial
helper seems pretty useful to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: (un)register elevator when (un)registering queue
Omar Sandoval [Mon, 6 Feb 2017 20:52:24 +0000 (12:52 -0800)]
blk-mq-sched: (un)register elevator when (un)registering queue

I noticed that when booting with a default blk-mq I/O scheduler, the
/sys/block/*/queue/iosched directory was missing. However, switching
after boot did create the directory. This is because we skip the initial
elevator register/unregister when we don't have a ->request_fn(), but we
should still do it for the ->mq_ops case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agodm: don't allow ioctls to targets that don't map to whole devices
Christoph Hellwig [Sat, 4 Feb 2017 09:45:03 +0000 (10:45 +0100)]
dm: don't allow ioctls to targets that don't map to whole devices

.. at least for unprivileged users.  Before we called into the SCSI
ioctl code to allow exemptions for a few SCSI passthrough ioctls,
but this is pretty unsafe and except for this call dm knows nothing
about SCSI ioctls.

As the SCSI ioctl code is now optional, we really don't want to
drag it in for DM, and the exception is not very useful anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: free merged request in the caller
Jens Axboe [Fri, 3 Feb 2017 16:48:28 +0000 (09:48 -0700)]
block: free merged request in the caller

If we end up doing a request-to-request merge when we have completed
a bio-to-request merge, we free the request from deep down in that
path. For blk-mq-sched, the merge path has to hold the appropriate
lock, but we don't need it for freeing the request. And in fact
holding the lock is problematic, since we are now calling the
mq sched put_rq_private() hook with the lock held. Other call paths
do not hold this lock.

Fix this inconsistency by ensuring that the caller frees a merged
request. Then we can do it outside of the lock, making it both more
efficient and fixing the blk-mq-sched problem of invoking parts of
the scheduler with an unknown lock state.

Reported-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
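
The caller-frees pattern can be sketched like this. Everything here is hypothetical: the lock is modelled as a plain flag so the sketch stays single-threaded, and the counter only records whether a free ever happened while the "lock" was held. The merge helper hands back the request that became redundant, and the caller frees it after dropping the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical request carrying only a size. */
struct demo_rq {
    unsigned int sectors;
};

struct demo_sched {
    bool locked;      /* stand-in for the scheduler lock */
    int freed_locked; /* frees that happened with the lock held */
};

/* Merge next into rq; return the now-redundant request without freeing it. */
static struct demo_rq *demo_attempt_merge(struct demo_rq *rq,
                                          struct demo_rq *next)
{
    if (!rq || !next)
        return NULL;
    rq->sectors += next->sectors;
    return next;
}

static void demo_free_rq(struct demo_sched *s, struct demo_rq *rq)
{
    if (s->locked)
        s->freed_locked++;
    free(rq);
}

/* Merge under the lock, free the leftover request outside of it. */
static void demo_merge_and_free(struct demo_sched *s, struct demo_rq *rq,
                                struct demo_rq *next)
{
    struct demo_rq *victim;

    s->locked = true;
    victim = demo_attempt_merge(rq, next);
    s->locked = false;

    if (victim)
        demo_free_rq(s, victim);
}
```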
7 years agoblk-merge: return the merged request
Jens Axboe [Thu, 2 Feb 2017 15:54:40 +0000 (08:54 -0700)]
blk-merge: return the merged request

When we attempt to merge request-to-request, we return a 0/1 if we
ended up merging or not. Change that to return the pointer to the
request that we freed. We will use this to move the freeing of
that request out of the merge logic, so that callers can drop
locks before freeing the request.

There should be no functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
7 years agoblkcg: fix double free of new_blkg in blkcg_init_queue
Hou Tao [Fri, 3 Feb 2017 09:19:07 +0000 (17:19 +0800)]
blkcg: fix double free of new_blkg in blkcg_init_queue

If blkg_create fails, new_blkg passed as an argument will
be freed by blkg_create, so there is no need to free it again.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: bypass the scheduler for flushes entirely
Omar Sandoval [Thu, 2 Feb 2017 23:42:39 +0000 (15:42 -0800)]
blk-mq-sched: bypass the scheduler for flushes entirely

There's a weird inconsistency that flushes are mostly hidden from the
scheduler, but it needs to be aware of them in ->insert_requests().
Instead of having every scheduler call blk_mq_sched_bypass_insert(),
let's do it in the common framework.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agozram_drv: update for backing dev info changes
Jens Axboe [Thu, 2 Feb 2017 23:53:07 +0000 (16:53 -0700)]
zram_drv: update for backing dev info changes

A previous commit made the bdi embedded in the request queue
a pointer, but neglected to fixup zram. Fix it up.

Fixes: dc3b17cc8bf ("block: Use pointer to backing_dev_info from request_queue")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblktrace: use existing disk debugfs directory
Omar Sandoval [Tue, 31 Jan 2017 22:53:22 +0000 (14:53 -0800)]
blktrace: use existing disk debugfs directory

We may already have a directory to put the blktrace stuff in if

1. The disk uses blk-mq
2. CONFIG_BLK_DEBUG_FS is enabled
3. We are tracing the whole disk and not a partition

Instead of hardcoding this very specific case, let's use the new
debugfs_lookup(). If the directory exists, we use it, otherwise we
create one and clean it up later.

Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
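
The "look it up, otherwise create it and remember to clean it up" logic can be sketched without debugfs at all. The flat name table and both helpers are hypothetical; the real code works on dentries returned by debugfs_lookup().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_DIRS 8

/* Hypothetical flat directory table; real debugfs works on dentries. */
struct demo_fs {
    const char *names[DEMO_MAX_DIRS];
    size_t count;
};

/* Return the index of an existing directory, or -1 if there is none. */
static int demo_lookup(const struct demo_fs *fs, const char *name)
{
    for (size_t i = 0; i < fs->count; i++)
        if (strcmp(fs->names[i], name) == 0)
            return (int)i;
    return -1;
}

/*
 * Return the index of the directory, creating it if needed. *created
 * tells the caller whether it owns the directory and must remove it later.
 */
static int demo_get_dir(struct demo_fs *fs, const char *name, bool *created)
{
    int idx = demo_lookup(fs, name);

    *created = false;
    if (idx >= 0)
        return idx;
    if (fs->count == DEMO_MAX_DIRS)
        return -1;
    fs->names[fs->count] = name;
    *created = true;
    return (int)fs->count++;
}
```

The created flag replaces the implied knowledge about blk-mq, CONFIG_BLK_DEBUG_FS and whole-disk tracing: the caller only removes what it created itself.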
7 years agoblk-mq: move debugfs_remove() of disk dir to blk_release_queue()
Omar Sandoval [Tue, 31 Jan 2017 22:53:21 +0000 (14:53 -0800)]
blk-mq: move debugfs_remove() of disk dir to blk_release_queue()

This needs to happen after we tear down blktrace.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: use same block debugfs directory for blk-mq and blktrace
Omar Sandoval [Tue, 31 Jan 2017 22:53:20 +0000 (14:53 -0800)]
block: use same block debugfs directory for blk-mq and blktrace

When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblktrace: make do_blk_trace_setup() static
Omar Sandoval [Tue, 31 Jan 2017 22:53:19 +0000 (14:53 -0800)]
blktrace: make do_blk_trace_setup() static

This isn't used outside of blktrace.c anymore.

Fixes: 62c2a7d969f3 ("block: push BKL into blktrace ioctls")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: fix debugfs config conditional in struct request_queue
Omar Sandoval [Tue, 31 Jan 2017 22:53:18 +0000 (14:53 -0800)]
block: fix debugfs config conditional in struct request_queue

The debugfs dentries are only used for CONFIG_BLK_DEBUG_FS, so make them
conditional on that instead of CONFIG_DEBUG_FS.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agodebugfs: add debugfs_lookup()
Omar Sandoval [Tue, 31 Jan 2017 22:53:17 +0000 (14:53 -0800)]
debugfs: add debugfs_lookup()

We don't always have easy access to the dentry of a file or directory we
created in debugfs. Add a helper which allows us to get a dentry we
previously created.

The motivation for this change is a problem with blktrace and the blk-mq
debugfs entries introduced in 07e4fead45e6 ("blk-mq: create debugfs
directory tree"). Namely, in some cases, the directory that blktrace
needs to create may already exist, but in other cases, it may not. We
_could_ rely on a bunch of implied knowledge to decide whether to create
the directory or not, but it's much cleaner on our end to just look it
up.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi, block: fix duplicate bdi name registration crashes
Dan Williams [Wed, 1 Feb 2017 22:05:23 +0000 (14:05 -0800)]
scsi, block: fix duplicate bdi name registration crashes

Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

 WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? kernfs_path_from_node+0x4f/0x60
  sysfs_warn_dup+0x62/0x80
  sysfs_create_dir_ns+0x77/0x90
  kobject_add_internal+0xb2/0x350
  kobject_add+0x75/0xd0
  device_add+0x15a/0x650
  device_create_groups_vargs+0xe0/0xf0
  device_create_vargs+0x1c/0x20
  bdi_register+0x90/0x240
  ? lockdep_init_map+0x57/0x200
  bdi_register_owner+0x36/0x60
  device_add_disk+0x1bb/0x4e0
  ? __pm_runtime_use_autosuspend+0x5c/0x70
  sd_probe_async+0x10d/0x1c0
  async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Reported-by: Omar Sandoval <osandov@osandov.com>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Get rid of blk_get_backing_dev_info()
Jan Kara [Thu, 2 Feb 2017 14:56:53 +0000 (15:56 +0100)]
block: Get rid of blk_get_backing_dev_info()

blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Make blk_get_backing_dev_info() safe without open bdev
Jan Kara [Thu, 2 Feb 2017 14:56:52 +0000 (15:56 +0100)]
block: Make blk_get_backing_dev_info() safe without open bdev

Currently blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Dynamically allocate and refcount backing_dev_info
Jan Kara [Thu, 2 Feb 2017 14:56:51 +0000 (15:56 +0100)]
block: Dynamically allocate and refcount backing_dev_info

Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Use pointer to backing_dev_info from request_queue
Jan Kara [Thu, 2 Feb 2017 14:56:50 +0000 (15:56 +0100)]
block: Use pointer to backing_dev_info from request_queue

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Unhash block device inodes on gendisk destruction
Jan Kara [Thu, 2 Feb 2017 14:56:49 +0000 (15:56 +0100)]
block: Unhash block device inodes on gendisk destruction

Currently, block device inodes stay around after corresponding gendisk
has died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.

Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonbd: use an idr to keep track of nbd devices
Josef Bacik [Wed, 1 Feb 2017 21:11:40 +0000 (16:11 -0500)]
nbd: use an idr to keep track of nbd devices

To prepare for dynamically adding new nbd devices to the system switch
from using an array for the nbd devices and instead use an idr.  This
copies what loop does for keeping track of its devices.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonbd: use our own workqueue for recv threads
Josef Bacik [Wed, 1 Feb 2017 21:11:11 +0000 (16:11 -0500)]
nbd: use our own workqueue for recv threads

Since we are in the memory reclaim path we need our recv work to be on a
workqueue that has WQ_MEM_RECLAIM set so we can avoid deadlocks.  Also
set WQ_HIGHPRI since we are in the completion path for IO.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debug: Introduce debugfs_create_files()
Bart Van Assche [Wed, 1 Feb 2017 18:20:59 +0000 (10:20 -0800)]
blk-mq-debug: Introduce debugfs_create_files()

Replace the two debugfs_create_file() loops by a call to the new
debugfs_create_files() function. Add an empty element at the end
of the two attribute arrays such that the array size does not have
to be passed to debugfs_create_files().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
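
The sentinel-terminated array idea looks roughly like this in isolation. The attribute struct, the entry names and the callback are hypothetical; the real helper creates debugfs files rather than counting entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute description. */
struct demo_attr {
    const char *name;
    int mode;
};

/*
 * Walk the array until the empty sentinel element, so no array size has
 * to be passed in. Returns the number of entries visited.
 */
static size_t demo_create_files(const struct demo_attr *attr,
                                void (*create)(const struct demo_attr *))
{
    size_t n = 0;

    for (; attr->name; attr++) {
        if (create)
            create(attr);
        n++;
    }
    return n;
}

static const struct demo_attr demo_attrs[] = {
    { "state", 0400 },
    { "flags", 0400 },
    { "dispatch", 0400 },
    { NULL, 0 }, /* sentinel: terminates the walk */
};
```

Forgetting the sentinel would make the loop run past the end of the array, which is the usual cost of trading an explicit length for a terminator.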
7 years agoblk-mq-debug: Make show() operations interruptible
Bart Van Assche [Wed, 1 Feb 2017 18:20:58 +0000 (10:20 -0800)]
blk-mq-debug: Make show() operations interruptible

Allow users to interrupt show operations instead of making a user
space process unkillable if ownership of q->sysfs_lock cannot be
obtained.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debug: Avoid that sparse complains about req_flags_t usage
Bart Van Assche [Wed, 1 Feb 2017 19:22:23 +0000 (12:22 -0700)]
blk-mq-debug: Avoid that sparse complains about req_flags_t usage

Avoid that sparse reports the following complaints:

block/elevator.c:541:29: warning: incorrect type in assignment (different base types)
block/elevator.c:541:29:    expected bool [unsigned] [usertype] next_sorted
block/elevator.c:541:29:    got restricted req_flags_t

block/blk-mq-debugfs.c:92:54: warning: cast from restricted req_flags_t

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: Add missing __acquires() / __releases() annotations
Bart Van Assche [Wed, 1 Feb 2017 18:20:56 +0000 (10:20 -0800)]
blk-mq-debugfs: Add missing __acquires() / __releases() annotations

This patch avoids that sparse complains about lock imbalances.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: move internal_tag to same cache line as tag
Jens Axboe [Tue, 31 Jan 2017 19:34:41 +0000 (12:34 -0700)]
block: move internal_tag to same cache line as tag

Since we removed cmd_type, we now have a hole in the struct. Move
the internal_tag member to the same cacheline as tag, since we
use them at the same time.

This doesn't fix the hole, just moves it elsewhere.

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: fold cmd_type into the REQ_OP_ space
Christoph Hellwig [Tue, 31 Jan 2017 15:57:31 +0000 (16:57 +0100)]
block: fold cmd_type into the REQ_OP_ space

Instead of keeping two levels of indirection for requests types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, although it has to add two for each so that we
can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
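
A sketch of what folding the request type into the operation can look like. The enum names and numeric values are illustrative assumptions, as is the choice to encode the data direction in the low bit; the commit message only says that each passthrough type needs an "in" and an "out" operation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; each "out" operation is odd in this sketch. */
enum demo_req_op {
    DEMO_OP_READ     = 0,
    DEMO_OP_WRITE    = 1,
    DEMO_OP_SCSI_IN  = 32,
    DEMO_OP_SCSI_OUT = 33,
    DEMO_OP_DRV_IN   = 34,
    DEMO_OP_DRV_OUT  = 35,
};

/* Data flows to the device when the low bit is set (assumed encoding). */
static bool demo_op_is_write(enum demo_req_op op)
{
    return (op & 1) != 0;
}

static bool demo_op_is_scsi(enum demo_req_op op)
{
    return op == DEMO_OP_SCSI_IN || op == DEMO_OP_SCSI_OUT;
}

static bool demo_op_is_private(enum demo_req_op op)
{
    return op == DEMO_OP_DRV_IN || op == DEMO_OP_DRV_OUT;
}

/* Anything that is not a plain filesystem request is passthrough. */
static bool demo_op_is_passthrough(enum demo_req_op op)
{
    return demo_op_is_scsi(op) || demo_op_is_private(op);
}
```

With a single field, both the kind of request and its data direction can be derived from the operation, so a separate type field is no longer needed.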
7 years agoide: don't abuse cmd_type
Christoph Hellwig [Tue, 31 Jan 2017 15:57:30 +0000 (16:57 +0100)]
ide: don't abuse cmd_type

Currently the legacy ide driver defines several request types of its own,
which is in the way of removing that field entirely.

Instead add a type field to struct ide_request and use that to distinguish
the different types of IDE-internal requests.

It's a bit of a mess, but so is the surrounding code..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: introduce blk_rq_is_passthrough
Christoph Hellwig [Tue, 31 Jan 2017 15:57:29 +0000 (16:57 +0100)]
block: introduce blk_rq_is_passthrough

This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonbd: move request validity checking into nbd_send_cmd
Christoph Hellwig [Tue, 31 Jan 2017 15:57:28 +0000 (16:57 +0100)]
nbd: move request validity checking into nbd_send_cmd

This is where we do the rest of the request handling, which will
become much simpler soon, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonbd: remove REQ_TYPE_DRV_PRIV leftovers
Christoph Hellwig [Tue, 31 Jan 2017 15:57:27 +0000 (16:57 +0100)]
nbd: remove REQ_TYPE_DRV_PRIV leftovers

Disconnects don't use block layer requests these days, so all handling
of private requests is dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agomspro_block: remove pointless prep_fn
Christoph Hellwig [Tue, 31 Jan 2017 15:57:26 +0000 (16:57 +0100)]
mspro_block: remove pointless prep_fn

This driver will never see non-fs requests, and doesn't do anything
else in the prep_fn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoms_block: remove pointless prep_fn
Christoph Hellwig [Tue, 31 Jan 2017 15:57:25 +0000 (16:57 +0100)]
ms_block: remove pointless prep_fn

This driver will never see non-fs requests, and doesn't do anything
else in the prep_fn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agommc: remove pointless request type check in mmc_prep_request
Christoph Hellwig [Tue, 31 Jan 2017 15:57:24 +0000 (16:57 +0100)]
mmc: remove pointless request type check in mmc_prep_request

The block layer won't send requests the driver isn't asking for,
so remove this check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agosd: remove pointless REQ_TYPE_FS check
Christoph Hellwig [Tue, 31 Jan 2017 15:57:23 +0000 (16:57 +0100)]
sd: remove pointless REQ_TYPE_FS check

->done can only be called for fs requests, so no need to check again here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscm_blk: remove unneeded REQ_TYPE_FS check
Christoph Hellwig [Tue, 31 Jan 2017 15:57:22 +0000 (16:57 +0100)]
scm_blk: remove unneeded REQ_TYPE_FS check

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agovirtio_blk: make SCSI passthrough support configurable
Christoph Hellwig [Sat, 28 Jan 2017 08:32:53 +0000 (09:32 +0100)]
virtio_blk: make SCSI passthrough support configurable

The SCSI passthrough idea was a bad idea to start with (guess who came
up with it?), and has been removed from the virtio 1.0 spec, and is not
enabled by default by any host I know of.  Add a separate config option
for it so that we don't need to enable it for most setups.  That way
any bugs related to it (like the one recently fixed for vmapped stacks)
do not affect other users, and the size of the virtblk_req structure
also shrinks significantly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agovirtio_blk: remove struct request backpointer from virtblk_req
Christoph Hellwig [Sat, 28 Jan 2017 08:32:52 +0000 (09:32 +0100)]
virtio_blk: remove struct request backpointer from virtblk_req

We can simply use blk_mq_rq_from_pdu to get back at the request at
I/O completion time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
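
Why the backpointer is redundant can be seen from the memory layout: the driver's per-request data sits directly behind the request, so either pointer can be computed from the other. The sketch below uses hypothetical types and helpers and assumes the request size keeps the payload suitably aligned; it is not the blk-mq implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical block-layer request. */
struct demo_request {
    long tag;
};

/* Hypothetical driver payload ("pdu") stored right behind the request. */
struct demo_cmd {
    int status;
};

/* Allocate a request with cmd_size bytes of driver data right behind it. */
static struct demo_request *demo_alloc_request(size_t cmd_size)
{
    return calloc(1, sizeof(struct demo_request) + cmd_size);
}

/* The payload starts immediately after the request. */
static void *demo_rq_to_pdu(struct demo_request *rq)
{
    return rq + 1;
}

/* Step back over the request to recover it from the payload pointer. */
static struct demo_request *demo_rq_from_pdu(void *pdu)
{
    return (struct demo_request *)((char *)pdu - sizeof(struct demo_request));
}
```

At completion time a driver that only holds its payload pointer can get the request back with the second helper, so storing a struct request pointer inside the payload buys nothing.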
7 years agoblock: make scsi_request and scsi ioctl support optional
Christoph Hellwig [Sat, 28 Jan 2017 08:32:51 +0000 (09:32 +0100)]
block: make scsi_request and scsi ioctl support optional

We only need this code to support scsi, ide, cciss and virtio.  And at
least for virtio it's a deprecated feature to start with.

This should shrink the kernel size a bit for embedded devices that only
use, say, eMMC.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoskd: implement trivial scsi ioctls directly
Christoph Hellwig [Sat, 28 Jan 2017 08:32:50 +0000 (09:32 +0100)]
skd: implement trivial scsi ioctls directly

This way there is no need to drag in a dependency on the
BLOCK_PC code, which is going to become optional.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonvme/scsi: don't rely on BLK_MAX_CDB
Christoph Hellwig [Sat, 28 Jan 2017 08:32:49 +0000 (09:32 +0100)]
nvme/scsi: don't rely on BLK_MAX_CDB

The NVMe SCSI emulation doesn't use BLOCK_PC requests, so BLK_MAX_CDB
doesn't have a meaning for it.  Instead opencode the value of 16
and refactor the code a bit so that related checks are next to each
other and we only need to use the value in one place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonvme: fix compilation of scsi component
Jens Axboe [Mon, 30 Jan 2017 03:04:49 +0000 (20:04 -0700)]
nvme: fix compilation of scsi component

Since we moved the cdb parts and define out of the block proper,
we need to include scsi/scsi_request.h for the nvme scsi layer.

Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: don't assign cmd_flags in __blk_rq_prep_clone
Christoph Hellwig [Mon, 23 Jan 2017 13:31:09 +0000 (14:31 +0100)]
block: don't assign cmd_flags in __blk_rq_prep_clone

These days we have the proper flags set since request allocation time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: split scsi_request out of struct request
Christoph Hellwig [Fri, 27 Jan 2017 08:46:29 +0000 (09:46 +0100)]
block: split scsi_request out of struct request

And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data.  To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock/bsg: move queue creation into bsg_setup_queue
Christoph Hellwig [Tue, 3 Jan 2017 12:25:02 +0000 (15:25 +0300)]
block/bsg: move queue creation into bsg_setup_queue

Simplify the boilerplate code needed for bsg nodes a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi: allocate scsi_cmnd structures as part of struct request
Christoph Hellwig [Mon, 2 Jan 2017 18:55:26 +0000 (21:55 +0300)]
scsi: allocate scsi_cmnd structures as part of struct request

Rely on the new block layer functionality to allocate additional driver
specific data behind struct request instead of implementing it in SCSI
itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi: remove __scsi_alloc_queue
Christoph Hellwig [Mon, 2 Jan 2017 18:52:10 +0000 (21:52 +0300)]
scsi: remove __scsi_alloc_queue

Instead do an internal export of __scsi_init_queue for the transport
classes that export BSG nodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi: remove scsi_cmd_dma_pool
Christoph Hellwig [Mon, 2 Jan 2017 12:26:34 +0000 (15:26 +0300)]
scsi: remove scsi_cmd_dma_pool

There is no need for GFP_DMA allocations of the scsi_cmnd structures
themselves, all that might be DMAed to or from is the actual payload,
or the sense buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi: respect unchecked_isa_dma for blk-mq
Christoph Hellwig [Tue, 3 Jan 2017 05:28:41 +0000 (08:28 +0300)]
scsi: respect unchecked_isa_dma for blk-mq

Currently blk-mq always allocates the sense buffer using normal GFP_KERNEL
allocation.  Refactor the cmd pool code to split the cmd and sense allocation
and share the code to allocate the sense buffers as well as the sense buffer
slab caches between the legacy and blk-mq path.

Note that this switches to lazy allocation of the sense slab caches - the
slab caches (not the actual allocations) won't be destroyed until the scsi
module is unloaded instead of keeping track of hosts using them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi: remove gfp_flags member in scsi_host_cmd_pool
Christoph Hellwig [Mon, 2 Jan 2017 11:38:03 +0000 (14:38 +0300)]
scsi: remove gfp_flags member in scsi_host_cmd_pool

When using the slab allocator we already decide at cache creation time if
an allocation comes from a GFP_DMA pool using the SLAB_CACHE_DMA flag,
and there is no point passing the kmalloc-family only GFP_DMA flag to
kmem_cache_alloc.  Drop all the infrastructure for doing so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi_dh_hp_sw: switch to scsi_execute_req_flags()
Hannes Reinecke [Thu, 3 Nov 2016 13:20:23 +0000 (14:20 +0100)]
scsi_dh_hp_sw: switch to scsi_execute_req_flags()

Switch to scsi_execute_req_flags() instead of using the block interface
directly.  This will set REQ_QUIET and REQ_PREEMPT, but this is okay as
we're evaluating the errors anyway and should be able to send the command
even if the device is quiesced.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi_dh_emc: switch to scsi_execute_req_flags()
Hannes Reinecke [Thu, 3 Nov 2016 13:20:22 +0000 (14:20 +0100)]
scsi_dh_emc: switch to scsi_execute_req_flags()

Switch to scsi_execute_req_flags() and scsi_get_vpd_page() instead of
open-coding it.  Using scsi_execute_req_flags() will set REQ_QUIET and
REQ_PREEMPT, but this is okay as we're evaluating the errors anyway and
should be able to send the command even if the device is quiesced.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoscsi_dh_rdac: switch to scsi_execute_req_flags()
Hannes Reinecke [Thu, 3 Nov 2016 13:20:21 +0000 (14:20 +0100)]
scsi_dh_rdac: switch to scsi_execute_req_flags()

Switch to scsi_execute_req_flags() and scsi_get_vpd_page() instead of
open-coding it.  Using scsi_execute_req_flags() will set REQ_QUIET and
REQ_PREEMPT, but this is okay as we're evaluating the errors anyway and
should be able to send the command even if the device is quiesced.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agodm: always defer request allocation to the owner of the request_queue
Christoph Hellwig [Sun, 22 Jan 2017 17:32:46 +0000 (18:32 +0100)]
dm: always defer request allocation to the owner of the request_queue

DM already calls blk_mq_alloc_request on the request_queue of the
underlying device if it is a blk-mq device.  But now that we allow drivers
to allocate additional data and initialize it ahead of time we need to do
the same for all drivers.   Doing so and using the new cmd_size
infrastructure in the block layer greatly simplifies the dm-rq and mpath
code, and should also make arbitrary combinations of SQ and MQ devices
with SQ or MQ device mapper tables easily possible as a further step.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agodm: remove incomplete BLOCK_PC support
Christoph Hellwig [Tue, 10 Jan 2017 09:03:39 +0000 (10:03 +0100)]
dm: remove incomplete BLOCK_PC support

DM tries to copy a few fields around for BLOCK_PC requests, but given
that no dm-target ever wires up scsi_cmd_ioctl BLOCK_PC can't actually
be sent to dm.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: cleanup tracing
Christoph Hellwig [Fri, 27 Jan 2017 08:35:54 +0000 (09:35 +0100)]
block: cleanup tracing

A couple tweaks to the tracing code:

 - trace the request size for all requests
 - trace request sector and nr_sectors only for fs requests, enforced by
   helpers
 - drop SCSI CDB tracing - we have SCSI tracing for this and are going
   to move the CDB out of the generic struct request soon.

With this the tracing code no longer knows about BLOCK_PC requests at all,
it's just FS vs passthrough requests now, where the latter includes any
driver-private requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: allow specifying size for extra command data
Christoph Hellwig [Fri, 27 Jan 2017 16:51:45 +0000 (09:51 -0700)]
block: allow specifying size for extra command data

This mirrors the blk-mq capabilities to allocate extra drivers-specific
data behind struct request by setting a cmd_size field, as well as having
a constructor / destructor for it.
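
As a rough userspace illustration of the idea (not the block layer's actual code), the sketch below allocates driver data directly behind a request and runs an optional constructor and destructor. Every name here (ex_request, ex_queue, ex_rq_to_pdu, ex_cmd) is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct ex_request { int tag; };   /* stand-in for struct request */

struct ex_queue {
	size_t cmd_size;                          /* bytes of driver data per request */
	int  (*init_rq)(struct ex_request *rq);   /* optional constructor */
	void (*exit_rq)(struct ex_request *rq);   /* optional destructor */
};

/* Driver data lives directly behind the request. */
static inline void *ex_rq_to_pdu(struct ex_request *rq)
{
	assert(rq != NULL);
	return rq + 1;
}

/* Allocate request plus driver data as one zeroed chunk, then construct. */
struct ex_request *ex_alloc_request(struct ex_queue *q)
{
	struct ex_request *rq = calloc(1, sizeof(*rq) + q->cmd_size);

	if (rq && q->init_rq && q->init_rq(rq) != 0) {
		free(rq);
		return NULL;
	}
	return rq;
}

/* Run the destructor, if any, before releasing the single allocation. */
void ex_free_request(struct ex_queue *q, struct ex_request *rq)
{
	if (rq && q->exit_rq)
		q->exit_rq(rq);
	free(rq);
}

/* Hypothetical driver: keeps a 16-byte command buffer per request. */
struct ex_cmd { unsigned char cdb[16]; };

int ex_cmd_init(struct ex_request *rq)
{
	struct ex_cmd *cmd = ex_rq_to_pdu(rq);

	cmd->cdb[0] = 0x12;   /* arbitrary marker byte for the sketch */
	return 0;
}
```

A driver in this sketch would set cmd_size to sizeof(struct ex_cmd) and init_rq to ex_cmd_init. The sketch assumes the driver data needs no stricter alignment than the request structure itself.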

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: simplify blk_init_allocated_queue
Christoph Hellwig [Tue, 3 Jan 2017 11:52:44 +0000 (14:52 +0300)]
block: simplify blk_init_allocated_queue

Return an errno value instead of the passed in queue so that the callers
don't have to keep track of two queues, and move the assignment of the
request_fn and lock to the caller as passing them as argument doesn't
simplify anything.  While we're at it also remove two pointless NULL
assignments, given that the request structure is zeroed on allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: fix elevator init check
Christoph Hellwig [Wed, 25 Jan 2017 10:17:11 +0000 (11:17 +0100)]
block: fix elevator init check

We can't initialize the elevator fields for flushes as flushes share space
in struct request with the elevator data.  But currently we can't
communicate that a request is a flush through blk_get_request as we
can only pass READ or WRITE, and the low-level code looks at the
possible NULL bio to check for a flush.

Fix this by allowing callers to pass any block op and flags, and by checking for
the flush flags in __get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agomd: cleanup bio op / flags handling in raid1_write_request
Christoph Hellwig [Wed, 25 Jan 2017 10:15:20 +0000 (11:15 +0100)]
md: cleanup bio op / flags handling in raid1_write_request

No need for the local variables, the bio is still live and we can just
assign the bits we want directly.  Makes me wonder why we can't assign
all the bio flags to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoMerge branch 'for-4.11/block' into for-4.11/rq-refactor
Jens Axboe [Fri, 27 Jan 2017 22:08:31 +0000 (15:08 -0700)]
Merge branch 'for-4.11/block' into for-4.11/rq-refactor

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: fix debugfs compilation issues
Omar Sandoval [Fri, 27 Jan 2017 22:03:01 +0000 (15:03 -0700)]
blk-mq: fix debugfs compilation issues

This fixes a couple of problems:

1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
   all.

Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Augment Kconfig description.

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: cleanup remaining manual checks for PREFLUSH|FUA
Jens Axboe [Fri, 27 Jan 2017 16:08:23 +0000 (09:08 -0700)]
block: cleanup remaining manual checks for PREFLUSH|FUA

Use op_is_flush() where applicable.

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: add flush insertion into blk_mq_sched_insert_request()
Jens Axboe [Fri, 27 Jan 2017 08:00:47 +0000 (01:00 -0700)]
blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()

Instead of letting the caller check this and handle the details
of inserting a flush request, put the logic in the scheduler
insertion function. This fixes direct flush insertion outside
of the usual make_request_fn calls, like from dm via
blk_insert_cloned_request().

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: add a op_is_flush helper
Christoph Hellwig [Fri, 27 Jan 2017 15:30:47 +0000 (08:30 -0700)]
block: add a op_is_flush helper

This centralizes the checks for bios that need to go into the flush
state machine.
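
A minimal sketch of what such a helper could look like, assuming made-up flag bits. The EX_REQ_* values and ex_op_is_flush name are illustrative only and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, chosen for this sketch only. */
#define EX_REQ_SYNC     (1u << 0)
#define EX_REQ_FUA      (1u << 1)
#define EX_REQ_PREFLUSH (1u << 2)

/* True if the op/flags word must go through the flush state machine. */
static inline bool ex_op_is_flush(unsigned int op_flags)
{
	return (op_flags & (EX_REQ_FUA | EX_REQ_PREFLUSH)) != 0;
}

/* Usage: callers test one predicate instead of open-coding the mask. */
void ex_op_is_flush_usage(void)
{
	assert(ex_op_is_flush(EX_REQ_PREFLUSH));
	assert(ex_op_is_flush(EX_REQ_SYNC | EX_REQ_FUA));
	assert(!ex_op_is_flush(EX_REQ_SYNC));
}
```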

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: change ->dispatch_requests() to ->dispatch_request()
Jens Axboe [Thu, 26 Jan 2017 19:40:07 +0000 (12:40 -0700)]
blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()

When we invoke dispatch_requests(), the scheduler empties everything
into the passed in list. This isn't always a good thing, since it
means that we remove items that we could have potentially merged
with.

Change the function to dispatch a single request at a time. If
we do that, we can back off exactly at the point where the device
can't consume more IO, and leave the rest with the scheduler for
better merging and future dispatch decision making.
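
The difference can be sketched in userspace as a loop that pulls one request at a time and stops when the device is full. This is a simplified model, not the blk-mq code: ex_sched, ex_device, and both functions are hypothetical, and device capacity is checked up front here rather than discovered through a failed dispatch.

```c
#include <assert.h>

struct ex_sched  { int pending; };      /* requests still held by the scheduler */
struct ex_device { int free_slots; };   /* requests the device can still accept */

/* Hand out at most one request; returns 1 if one was taken, 0 if none pending. */
static int ex_dispatch_request(struct ex_sched *s)
{
	if (s->pending == 0)
		return 0;
	s->pending--;
	return 1;
}

/* Pull requests one at a time and stop as soon as the device is full. */
int ex_run_queue(struct ex_sched *s, struct ex_device *d)
{
	int issued = 0;

	while (d->free_slots > 0 && ex_dispatch_request(s)) {
		d->free_slots--;
		issued++;
	}
	return issued;
}

void ex_run_queue_usage(void)
{
	struct ex_sched s = { .pending = 5 };
	struct ex_device d = { .free_slots = 2 };

	assert(ex_run_queue(&s, &d) == 2);
	assert(s.pending == 3);   /* the rest stays with the scheduler */
}
```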

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
7 years agoblk-mq-sched: fix starvation for multiple hardware queues and shared tags
Jens Axboe [Thu, 26 Jan 2017 21:42:34 +0000 (14:42 -0700)]
blk-mq-sched: fix starvation for multiple hardware queues and shared tags

If we have both multiple hardware queues and shared tag map between
devices, we need to ensure that we propagate the hardware queue
restart bit higher up. This is because we can get into a situation
where we don't have any IO pending on a hardware queue, yet we fail
getting a tag to start new IO. If that happens, it's not enough to
mark the hardware queue as needing a restart, we need to bubble
that up to the higher level queue as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
7 years agoblk-mq: release driver tag on a requeue event
Jens Axboe [Thu, 26 Jan 2017 19:32:32 +0000 (12:32 -0700)]
blk-mq: release driver tag on a requeue event

We don't want to hold on to this resource when we have a scheduler
attached.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
7 years agoblk-mq: fix potential race in queue restart and driver tag allocation
Jens Axboe [Thu, 26 Jan 2017 19:50:36 +0000 (12:50 -0700)]
blk-mq: fix potential race in queue restart and driver tag allocation

Once we mark the queue as needing a restart, re-check if we can
get a driver tag. This fixes a theoretical issue where the needed
IO completes _after_ blk_mq_get_driver_tag() fails, but before we
manage to set the restart bit.
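
The ordering described above can be sketched with C11 atomics as follows. This is a userspace model under stated assumptions, not the blk-mq implementation: ex_hctx, ex_get_driver_tag, and ex_get_tag_or_mark_restart are hypothetical names, and tags are modelled as a simple counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ex_hctx {
	atomic_int  free_tags;      /* driver tags currently available */
	atomic_bool need_restart;   /* completion path reruns the queue if set */
};

/* Try to take one tag; never blocks. */
static bool ex_get_driver_tag(struct ex_hctx *h)
{
	int n = atomic_load(&h->free_tags);

	while (n > 0) {
		/* On failure n is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&h->free_tags, &n, n - 1))
			return true;
	}
	return false;
}

bool ex_get_tag_or_mark_restart(struct ex_hctx *h)
{
	if (ex_get_driver_tag(h))
		return true;

	atomic_store(&h->need_restart, true);

	/*
	 * A tag may have been freed after the failed attempt but before
	 * the restart bit became visible, so try once more. If this
	 * succeeds the bit stays set, which only costs a spurious rerun.
	 */
	return ex_get_driver_tag(h);
}

void ex_restart_usage(void)
{
	struct ex_hctx h;

	atomic_init(&h.free_tags, 0);
	atomic_init(&h.need_restart, false);
	assert(!ex_get_tag_or_mark_restart(&h));
	assert(atomic_load(&h.need_restart));
}
```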

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
7 years agoblk-mq: improve scheduler queue sync/async running
Jens Axboe [Thu, 26 Jan 2017 19:28:10 +0000 (12:28 -0700)]
blk-mq: improve scheduler queue sync/async running

We'll use the same criteria for whether we need to run the queue sync
or async when we have a scheduler, as we do without one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
7 years agoblk-mq: move hctx and ctx counters from sysfs to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:49 +0000 (08:06 -0800)]
blk-mq: move hctx and ctx counters from sysfs to debugfs

These counters aren't as out-of-place in sysfs as the other stuff, but
debugfs is a slightly better home for them.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:48 +0000 (08:06 -0800)]
blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs

These statistics _might_ be useful to userspace, but it's better not to
commit to an ABI for these yet. Also, the dispatched file in sysfs
couldn't be cleared, so make it clearable like the others in debugfs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: add tags and sched_tags bitmaps to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:47 +0000 (08:06 -0800)]
blk-mq: add tags and sched_tags bitmaps to debugfs

These can be used to debug issues like tag leaks and stuck requests.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: move tags and sched_tags info from sysfs to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:46 +0000 (08:06 -0800)]
blk-mq: move tags and sched_tags info from sysfs to debugfs

These are very tied to the blk-mq tag implementation, so exposing them
to sysfs isn't a great idea. Move the debugging information to debugfs
and add basic entries for the number of tags and the number of reserved
tags to sysfs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: export software queue pending map to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:45 +0000 (08:06 -0800)]
blk-mq: export software queue pending map to debugfs

This is useful for debugging problems where we've gotten stuck with
requests in the software queues.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agosbitmap: add helpers for dumping to a seq_file
Omar Sandoval [Wed, 25 Jan 2017 22:32:13 +0000 (14:32 -0800)]
sbitmap: add helpers for dumping to a seq_file

This is useful debugging information that will be used in the blk-mq
debugfs directory.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Changed 'weight' to 'busy'.

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: add extra request information to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:43 +0000 (08:06 -0800)]
blk-mq: add extra request information to debugfs

The request pointers by themselves aren't super useful.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:42 +0000 (08:06 -0800)]
blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs

These lists are only useful for debugging; they definitely don't belong
in sysfs. Putting them in debugfs also removes the limitation of a
single page of output.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: add hctx->{state,flags} to debugfs
Omar Sandoval [Wed, 25 Jan 2017 16:06:41 +0000 (08:06 -0800)]
blk-mq: add hctx->{state,flags} to debugfs

hctx->state could come in handy for bugs where the hardware queue gets
stuck in the stopped state, and hctx->flags is just useful to know.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: create debugfs directory tree
Omar Sandoval [Wed, 25 Jan 2017 16:06:40 +0000 (08:06 -0800)]
blk-mq: create debugfs directory tree

In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-sched: check for successful allocation before assigning tag
Jens Axboe [Thu, 26 Jan 2017 21:52:20 +0000 (14:52 -0700)]
blk-mq-sched: check for successful allocation before assigning tag

We don't trigger this from the normal IO path, since we always use
blocking allocations from there. But Bart saw it testing multipath
dm, since that is a heavy user of atomic request allocations in
the map and clone path.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: don't lose flags passed in to blk_mq_alloc_request()
Jens Axboe [Thu, 26 Jan 2017 19:22:11 +0000 (12:22 -0700)]
blk-mq: don't lose flags passed in to blk_mq_alloc_request()

If we come in from blk_mq_alloc_request() with NOWAIT set in flags,
we must ensure that we don't later overwrite that in
blk_mq_sched_get_request(). Initialize alloc_data->flags before
passing it in.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: only apply active queue tag throttling for driver tags
Jens Axboe [Wed, 25 Jan 2017 15:11:38 +0000 (08:11 -0700)]
blk-mq: only apply active queue tag throttling for driver tags

If we have a scheduler attached, we have two sets of tags. We don't
want to apply our active queue throttling for the scheduler side
of tags, that only applies to driver tags since that's the resource
we need to dispatch an IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agocfq-iosched: Adjust one function call together with a variable assignment
Markus Elfring [Sat, 21 Jan 2017 21:44:07 +0000 (22:44 +0100)]
cfq-iosched: Adjust one function call together with a variable assignment

The script "checkpatch.pl" reported the following error.

ERROR: do not use assignment in if condition

Thus fix the affected source code place.
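
For readers unfamiliar with this checkpatch complaint, the sketch below shows the general shape of such a change. It is not taken from cfq-iosched; ex_entry, ex_lookup, and ex_weight_of are hypothetical and exist only to make the example compile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ex_entry { const char *name; int weight; };

static struct ex_entry ex_table[] = { { "be", 4 }, { "rt", 8 } };

/* Hypothetical lookup helper; returns NULL when the name is unknown. */
static struct ex_entry *ex_lookup(const char *name)
{
	for (size_t i = 0; i < sizeof(ex_table) / sizeof(ex_table[0]); i++)
		if (strcmp(ex_table[i].name, name) == 0)
			return &ex_table[i];
	return NULL;
}

/*
 * Flagged form:   if (!(e = ex_lookup(name))) return -1;
 * Preferred form: assign first, then test the variable.
 */
int ex_weight_of(const char *name)
{
	struct ex_entry *e;

	e = ex_lookup(name);
	if (!e)
		return -1;
	return e->weight;
}

void ex_weight_usage(void)
{
	assert(ex_weight_of("rt") == 8);
	assert(ex_weight_of("idle") == -1);
}
```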

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-throttle: Adjust two function calls together with a variable assignment
Markus Elfring [Sat, 21 Jan 2017 21:15:33 +0000 (22:15 +0100)]
blk-throttle: Adjust two function calls together with a variable assignment

The script "checkpatch.pl" reported the following error.

ERROR: do not use assignment in if condition

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock: Initialize cfqq->ioprio_class in cfq_get_queue()
Alexander Potapenko [Mon, 23 Jan 2017 14:06:43 +0000 (15:06 +0100)]
block: Initialize cfqq->ioprio_class in cfq_get_queue()

KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
uninitialized memory in cfq_init_cfqq():

==================================================================
BUG: KMSAN: use of unitialized memory
...
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff8202ac97>] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
 [<ffffffff813e9b65>] kmsan_report+0x205/0x360 ??:?
 [<ffffffff813eabbb>] __msan_warning+0x5b/0xb0 ??:?
 [<     inline     >] cfq_init_cfqq block/cfq-iosched.c:3754
 [<ffffffff8201e110>] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
...
origin:
 [<ffffffff8103ab37>] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
 [<ffffffff813e836b>] kmsan_internal_poison_shadow+0xab/0x150 ??:?
 [<ffffffff813e88ab>] kmsan_poison_slab+0xbb/0x120 ??:?
 [<     inline     >] allocate_slab mm/slub.c:1627
 [<ffffffff813e533f>] new_slab+0x3af/0x4b0 mm/slub.c:1641
 [<     inline     >] new_slab_objects mm/slub.c:2407
 [<ffffffff813e0ef3>] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
 [<     inline     >] __slab_alloc mm/slub.c:2606
 [<     inline     >] slab_alloc_node mm/slub.c:2669
 [<ffffffff813dfb42>] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
 [<ffffffff8201d90d>] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
...
==================================================================
(the line numbers are relative to 4.8-rc6, but the bug persists
upstream)

The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
before it's initialized.
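
The class of bug can be modelled in userspace like this: an allocator that does not zero memory, followed by an init helper that reads a field. The sketch only shows the corrected ordering. All names (ex_cfqq, ex_init_cfqq, ex_get_queue, the EX_IOPRIO_* constants) are hypothetical, and what the helper does with the field is an assumption made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum { EX_IOPRIO_CLASS_NONE = 0, EX_IOPRIO_CLASS_BE = 2 };

struct ex_cfqq {
	int ioprio_class;
	int org_ioprio_class;
};

/* Init helper that reads ioprio_class, so the caller must set it first. */
static void ex_init_cfqq(struct ex_cfqq *q)
{
	assert(q != NULL);
	if (q->ioprio_class == EX_IOPRIO_CLASS_NONE)
		q->ioprio_class = EX_IOPRIO_CLASS_BE;
	q->org_ioprio_class = q->ioprio_class;
}

struct ex_cfqq *ex_get_queue(void)
{
	/* malloc() leaves the contents indeterminate, like a non-zeroing slab allocation. */
	struct ex_cfqq *q = malloc(sizeof(*q));

	if (!q)
		return NULL;

	/* Give the field a defined value before the helper reads it. */
	q->ioprio_class = EX_IOPRIO_CLASS_NONE;
	ex_init_cfqq(q);
	return q;
}
```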

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoLinux 4.10-rc5
Linus Torvalds [Sun, 22 Jan 2017 20:54:15 +0000 (12:54 -0800)]
Linux 4.10-rc5

7 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 22 Jan 2017 20:47:48 +0000 (12:47 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "Restore the retrigger callbacks in the IO APIC irq chips. That
  addresses a long standing regression which got introduced with the
  rewrite of the x86 irq subsystem two years ago and went unnoticed so
  far"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Restore IO-APIC irq_chip retrigger callback