openwrt/staging/blogic.git
Jens Axboe [Sat, 6 Jan 2018 16:23:11 +0000 (09:23 -0700)]
mq-deadline: make it clear that __dd_dispatch_request() works on all hw queues

Don't pass in the hardware queue to __dd_dispatch_request(), since it
leads the reader to believe that we are returning a request for that
specific hardware queue. That's not how mq-deadline works: the state
for determining which request to serve next is shared across all
hardware queues for a device.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
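
For reference, the resulting shape is roughly the following (a minimal
sketch; the deadline_data lookup and locking only approximate the real
mq-deadline code):

  /* __dd_dispatch_request() now takes the per-device deadline_data rather
   * than a hardware queue, since the dispatch state is shared across all
   * hardware queues of the device. */
  static struct request *__dd_dispatch_request(struct deadline_data *dd);

  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
  {
          struct deadline_data *dd = hctx->queue->elevator->elevator_data;
          struct request *rq;

          spin_lock(&dd->lock);
          rq = __dd_dispatch_request(dd);  /* no hctx argument any more */
          spin_unlock(&dd->lock);

          return rq;
  }
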
Bart Van Assche [Fri, 5 Jan 2018 16:26:50 +0000 (08:26 -0800)]
target: Use sgl_alloc_order() and sgl_free()

Use the sgl_alloc_order() and sgl_free() functions instead of open
coding these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 5 Jan 2018 16:26:49 +0000 (08:26 -0800)]
nvmet/rdma: Use sgl_alloc() and sgl_free()

Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 5 Jan 2018 16:26:48 +0000 (08:26 -0800)]
nvmet/fc: Use sgl_alloc() and sgl_free()

Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 5 Jan 2018 16:26:47 +0000 (08:26 -0800)]
crypto: scompress - use sgl_alloc() and sgl_free()

Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 5 Jan 2018 16:26:46 +0000 (08:26 -0800)]
lib/scatterlist: Introduce sgl_alloc() and sgl_free()

Many kernel drivers contain code that allocates and frees both a
scatterlist and the pages that populate that scatterlist.
Introduce functions in lib/scatterlist.c that perform these tasks
instead of duplicating this functionality in multiple drivers.
Only include these functions in the build if CONFIG_SGL_ALLOC=y,
so that the kernel size does not increase when this functionality
is not used.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
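
A rough usage sketch of the new interface follows; the example function,
buffer size and error handling are invented for illustration:

  static int example_alloc_buffer(struct scatterlist **sglp)
  {
          struct scatterlist *sgl;
          unsigned int nents;

          /* allocate a scatterlist plus backing pages for a 1 MiB buffer */
          sgl = sgl_alloc(SZ_1M, GFP_KERNEL, &nents);
          if (!sgl)
                  return -ENOMEM;

          *sglp = sgl;
          return 0;
  }

A matching sgl_free(sgl) call later releases both the pages and the
scatterlist itself, which is what the converted drivers above used to
open code.
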
Wang Long [Tue, 5 Dec 2017 12:23:19 +0000 (07:23 -0500)]
writeback: update comment in inode_io_list_move_locked

The @head can be wb->b_dirty_time, so update the comment.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Mon, 11 Dec 2017 12:11:17 +0000 (13:11 +0100)]
DAC960: split up ioctl function to reduce stack size

When CONFIG_KASAN is set, all the local variables in this function are
allocated on the stack together, leading to a warning about possible
kernel stack overflow:

drivers/block/DAC960.c: In function 'DAC960_gam_ioctl':
drivers/block/DAC960.c:7061:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

By splitting up the function into smaller chunks, we can avoid that and
make the code slightly more readable at the same time. The coding style
in this file is completely nonstandard, and I chose to not touch that
at all, leaving the unconventional indentation unchanged to make it
easier to review the diff.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
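
The pattern, in a generic and purely illustrative form (names and sizes
are invented; assumes <linux/uaccess.h>):

  /* Each sub-command gets its own helper, so its large locals live only
   * in that helper's stack frame instead of piling up in one huge frame. */
  static noinline long gam_cmd_get_info(void __user *argp)
  {
          char info[1024] = { };          /* confined to this frame */

          return copy_to_user(argp, info, sizeof(info)) ? -EFAULT : 0;
  }

  static long gam_ioctl(unsigned int cmd, void __user *argp)
  {
          switch (cmd) {
          case 0x01:
                  return gam_cmd_get_info(argp);
          default:
                  return -EINVAL;
          }
  }
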
Ming Lei [Mon, 18 Dec 2017 12:22:16 +0000 (20:22 +0800)]
block: blk-merge: remove unnecessary check

In this case, 'sectors' can't be zero at all, so remove the check
and let the bio be split.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:15 +0000 (20:22 +0800)]
block: blk-merge: try to make front segments in full size

When merging one bvec into a segment, if the bvec is too big
to merge, the current policy is to move the whole bvec into a
new segment.

This patch changes the policy to try to maximize the size of the
front segments: in the situation above, part of the bvec is merged
into the current segment, and the remainder is put into the next
segment.

This prepares for multipage bvec support, because this case can
become quite common and we should try to make the front segments
full size.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:14 +0000 (20:22 +0800)]
blk-merge: compute bio->bi_seg_front_size efficiently

It is enough to check and compute bio->bi_seg_front_size just
after the 1st segment is found, but the current code checks it
for each bvec, which is inefficient.

This patch follows the approach used in __blk_recalc_rq_segments()
for computing bio->bi_seg_front_size; it is more efficient and the
code becomes more readable too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:13 +0000 (20:22 +0800)]
dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()

The bio is always freed after running crypt_free_buffer_pages(), so it
isn't necessary to clear bv->bv_page.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
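
The free path then looks roughly like this (a sketch based on the
description above; the mempool field name is assumed):

  static void crypt_free_buffer_pages(struct crypt_config *cc,
                                      struct bio *clone)
  {
          unsigned int i;
          struct bio_vec *bv;

          bio_for_each_segment_all(bv, clone, i) {
                  BUG_ON(!bv->bv_page);
                  mempool_free(bv->bv_page, cc->page_pool);
                  /* no need to clear bv->bv_page: the clone bio is freed
                   * right after this function returns */
          }
  }
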
Ming Lei [Mon, 18 Dec 2017 12:22:12 +0000 (20:22 +0800)]
btrfs: avoid accessing bvec table directly for a cloned bio

Commit 17347cec15f919901c90 ("Btrfs: change how we iterate bios in endio")
mentioned that for dio the submitted bio may be fast-cloned. We
can't access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
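
The safe pattern looks roughly like this (illustrative helper, not the
exact btrfs hunk):

  static unsigned int first_bvec_len(struct bio *bio)
  {
          struct bio_vec bvec;

          /* safe for fast-cloned bios, unlike dereferencing
           * bio->bi_io_vec[0] directly */
          bio_get_first_bvec(bio, &bvec);
          return bvec.bv_len;
  }
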
Ming Lei [Mon, 18 Dec 2017 12:22:11 +0000 (20:22 +0800)]
btrfs: avoid access to .bi_vcnt directly

BTRFS uses bio->bi_vcnt to figure out the number of pages, but this
approach is no longer correct once we start to enable multipage bvecs.

Use bio_nr_pages() to do that instead.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:10 +0000 (20:22 +0800)]
block: move bio_alloc_pages() to bcache

bcache is the only user of bio_alloc_pages(), so move this function into
bcache, and avoid it being misused in the future.

Also rename it to bch_bio_alloc_pages() since it is bcache only.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:09 +0000 (20:22 +0800)]
bcache: comment on direct access to bvec table

All direct access to the bvec table here is safe even after multipage
bvecs are supported.

Cc: linux-bcache@vger.kernel.org
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:08 +0000 (20:22 +0800)]
dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE

For bio-based DM, some targets, such as the crypt target, aren't ready
to deal with incoming bios bigger than 1 Mbyte.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:07 +0000 (20:22 +0800)]
block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq

Firstly this patch introduces BVEC_ITER_ALL_INIT for iterating one bio
from start to end.

As we need to support multipage bvecs, don't access bio->bi_io_vec
in copy_to_high_bio_irq(), and just use the standard iterator for that.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
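
The new initializer is roughly the following (sketch):

  #define BVEC_ITER_ALL_INIT (struct bvec_iter)          \
  {                                                      \
          .bi_sector      = 0,                           \
          .bi_size        = UINT_MAX,                    \
          .bi_idx         = 0,                           \
          .bi_bvec_done   = 0,                           \
  }

With this, copy_to_high_bio_irq() can walk the original bio with a local
struct bvec_iter initialized to BVEC_ITER_ALL_INIT and advanced via
bio_advance_iter(), instead of indexing bio->bi_io_vec directly.
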
Ming Lei [Mon, 18 Dec 2017 12:22:06 +0000 (20:22 +0800)]
block: bounce: avoid direct access to bvec table

We will support multipage bvecs in the future, so switch to the iterator
for getting the bv_page of a bvec from the original bio.

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:05 +0000 (20:22 +0800)]
fs: convert to bio_last_bvec_all()

This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:04 +0000 (20:22 +0800)]
block: convert to bio_first_bvec_all & bio_first_page_all

This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 18 Dec 2017 12:22:03 +0000 (20:22 +0800)]
block: introduce bio helpers for converting to multipage bvec

The following helpers are introduced for converting current users of
direct access to the bvec table, and to prepare for supporting multipage
bvecs:

bio_pages_all()
bio_first_bvec_all()
bio_first_page_all()
bio_last_bvec_all()

All are named bio_*_all(), following bio_for_each_segment_all(); they
can only be used on a bio with !bio_flagged(bio, BIO_CLONED), which means
the whole bvec table is covered.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
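
Their shape is roughly the following (sketch; the exact WARN_ON placement
may differ in the final header):

  static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
  {
          WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
          return bio->bi_io_vec;
  }

  static inline struct page *bio_first_page_all(struct bio *bio)
  {
          return bio_first_bvec_all(bio)->bv_page;
  }

  static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
  {
          WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
          return &bio->bi_io_vec[bio->bi_vcnt - 1];
  }

  static inline unsigned bio_pages_all(struct bio *bio)
  {
          WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
          return bio->bi_vcnt;
  }
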
Paolo Valente [Mon, 4 Dec 2017 10:42:05 +0000 (11:42 +0100)]
block, bfq: remove batches of confusing ifdefs

Commit a33801e8b473 ("block, bfq: move debug blkio stats behind
CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
one reported in [1], plus a similar one in another function. This
commit removes both batches, in the way suggested in [1].

[1] https://www.spinics.net/lists/linux-block/msg20043.html

Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Fri, 15 Dec 2017 06:23:12 +0000 (07:23 +0100)]
block, bfq: consider also past I/O in soft real-time detection

BFQ privileges the I/O of soft real-time applications, such as video
players, to guarantee these applications a high bandwidth and a low
latency. In this respect, it is not easy to correctly detect when an
application is soft real-time. A particularly nasty false positive is
that of an I/O-bound application that occasionally happens to meet all
requirements to be deemed as soft real-time. After being detected as
soft real-time, such an application monopolizes the device. Fortunately,
BFQ will soon realize that the application is actually not soft
real-time and suspend every privilege. Yet, the application may happen
to be wrongly detected as soft real-time again, and so on.

As highlighted by our tests, this problem causes BFQ to occasionally
fail to guarantee a high responsiveness, in the presence of heavy
background I/O workloads. The reason is that the background workload
happens to be detected as soft real-time, more or less frequently,
during the execution of the interactive task under test. To give an
idea, because of this problem, Libreoffice Writer occasionally takes 8
seconds, instead of 3, to start up, if there are sequential reads and
writes in the background, on a Kingston SSDNow V300.

This commit addresses this issue by leveraging the following facts.

The reason why some applications are detected as soft real-time despite
all BFQ checks to avoid false positives, is simply that, during high
CPU or storage-device load, I/O-bound applications may happen to do
I/O slowly enough to meet all soft real-time requirements, and pass
all BFQ extra checks. Yet, this happens only for limited time periods:
slow-speed time intervals are usually interspersed between other time
intervals during which these applications do I/O at a very high speed.
To exploit these facts, this commit introduces a little change, in the
detection of soft real-time behavior, to systematically consider also
the recent past: the higher the speed was in the recent past, the
later the next I/O should arrive for the application to be considered
soft real-time. At the beginning of a slow-speed interval, the minimum
arrival time allowed for the next I/O usually happens to still be so
high as to fall *after* the end of the slow-speed period itself. As a
consequence, the application does not risk being deemed soft
real-time during the slow-speed interval. Then, during the next
high-speed interval, the application cannot, evidently, be deemed as
soft real-time (exactly because of its speed), and so on.

This extra filtering proved to be rather effective: in the above test,
the frequency of false positives became so low that the start-up time
was 3 seconds in all iterations (apart from occasional outliers,
caused by page-cache-management issues, which are out of the scope of
this commit, and cannot be solved by an I/O scheduler).

Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Angelo Ruocco [Wed, 20 Dec 2017 11:38:34 +0000 (12:38 +0100)]
block, bfq: remove superfluous check in queue-merging setup

When two or more processes do I/O in such a way that their requests are
sequential with respect to one another, BFQ merges the bfq_queues associated
with the processes. This way the overall I/O pattern becomes sequential,
and thus there is a boost in throughput.
These cooperating processes usually start or restart to do I/O shortly
after each other. So, in order to avoid merging non-cooperating processes,
BFQ ensures that none of these queues has been in weight raising for too
long.

In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
only shortly after being created", BFQ checks whether any queue (and not
only weight-raised ones) is doing I/O continuously for too long to be
merged.

This new additional check makes the first one useless: a queue that has
been doing I/O for long enough, if it is being weight-raised, has also
been in weight raising for too long to be merged. Accordingly, this commit
removes the first check.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Wed, 20 Dec 2017 11:38:33 +0000 (12:38 +0100)]
block, bfq: let a queue be merged only shortly after starting I/O

In BFQ and CFQ, two processes are said to be cooperating if they do
I/O in such a way that the union of their I/O requests yields a
sequential I/O pattern. To get such a sequential I/O pattern out of
the non-sequential pattern of each cooperating process, BFQ and CFQ
merge the queues associated with these processes. In more detail,
cooperating processes, and thus their associated queues, usually
start, or restart, to do I/O shortly after each other. This is the
case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Based on this assumption, this commit allows a bfq_queue to
be merged only during a short time interval (100ms) after it starts,
or re-starts, to do I/O.  This filtering provides two important
benefits.

First, it greatly reduces the probability that two non-cooperating
processes have their queues merged by mistake, if they just happen to
do I/O close to each other for a short time interval. These spurious
merges cause loss of service guarantees. A low-weight bfq_queue may
unjustly get more than its expected share of the throughput: if such a
low-weight queue is merged with a high-weight queue, then the I/O for
the low-weight queue is served as if the queue had a high weight. This
may damage other high-weight queues unexpectedly.  For instance,
because of this issue, lxterminal occasionally took 7.5 seconds to
start, instead of 6.5 seconds, when some sequential readers and
writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
reason is that the bfq_queues associated with some of the readers or
the writers were merged with the high-weight queues of some processes
that had to do some urgent but little I/O. The readers then exploited
the inherited high weight for all or most of their I/O, during the
start-up of the terminal. The filtering introduced by this commit
eliminated any outlier caused by spurious queue merges in our start-up
time tests.

This filtering also provides a little boost of the throughput
sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
once a bfq_queue cannot be merged any longer, this commit makes BFQ
stop updating the data needed to handle merging for the queue.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Angelo Ruocco [Wed, 20 Dec 2017 11:38:32 +0000 (12:38 +0100)]
block, bfq: check low_latency flag in bfq_bfqq_save_state()

A just-created bfq_queue will certainly be deemed as interactive on
the arrival of its first I/O request, if the low_latency flag is
set. Yet, if the queue is merged with another queue on the arrival of
its first I/O request, it will not have the chance to be flagged as
interactive. Nevertheless, if the queue is then split soon enough, it
has to be flagged as interactive after the split.

To handle this early-merge scenario correctly, BFQ saves the state of
the queue, on the merge, as if the latter had already been deemed
interactive. So, if the queue is split soon, it will get
weight-raised, because the previous state of the queue is resumed on
the split.

Unfortunately, in the act of saving the state of the newly-created
queue, BFQ doesn't check whether the low_latency flag is set, and this
causes early-merged queues to be then weight-raised, on queue splits,
even if low_latency is off. This commit addresses this problem by
adding the missing check.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Wed, 20 Dec 2017 11:38:31 +0000 (12:38 +0100)]
block, bfq: add missing rq_pos_tree update on rq removal

If two processes do I/O close to each other, then BFQ merges the
bfq_queues associated with these processes, to get a more sequential
I/O, and thus a higher throughput.  In this respect, to detect whether
two processes are doing I/O close to each other, BFQ keeps a list of
the head-of-line I/O requests of all active bfq_queues.  The list is
ordered by initial sectors, and implemented through a red-black tree
(rq_pos_tree).

Unfortunately, the update of the rq_pos_tree was incomplete, because
the tree was not updated on the removal of the head-of-line I/O
request of a bfq_queue, in case the queue did not remain empty. This
commit adds the missing update.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Paolo Valente [Wed, 20 Dec 2017 16:27:36 +0000 (17:27 +0100)]
block, bfq: increase threshold to deem I/O as random

If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ (and CFQ's) nomenclature, then BFQ merges their
associated bfq_queues, so as to get sequential I/O from the union of
the I/O requests of the processes, and thus reach a higher
throughput. A merged queue is then split if its I/O stops being
sequential. In this respect, BFQ deems the I/O of a bfq_queue as
(mostly) sequential only if less than 4 I/O requests are random, out
of the last 32 requests inserted into the queue.

Unfortunately, extensive testing (with the interleaved_io benchmark of
the S suite [1], and with real applications spawning cooperating
processes) has clearly shown that, with such a low threshold, only a
rather low I/O throughput may be reached when several cooperating
processes do I/O. In particular, the outcome of each test run was
bimodal: if queue merging occurred and was stable during the test,
then the throughput was close to the peak rate of the storage device,
otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability of getting the
unlucky outcomes grew with the number of cooperating processes: it was
already significant with 5 processes, and close to one with 7 or more
processes.

The cause of the low throughput in the unlucky runs was that the
merged queues containing the I/O of these cooperating processes were
soon split, because they contained more random I/O requests than those
tolerated by the 4/32 threshold, but
- that I/O would have however allowed the storage device to reach
  peak throughput or almost peak throughput;
- in contrast, the I/O of these processes, if served individually
  (from separate queues) yielded a rather low throughput.

So we repeated our tests with increasing values of the threshold,
until we found the minimum value (19) for which we obtained maximum
throughput, reliably, with at least up to 9 cooperating
processes. Then we checked that the use of that higher threshold value
did not cause any regression for any other benchmark in the suite [1].
This commit raises the threshold to such a higher value.

[1] https://github.com/Algodev-github/S

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 21 Dec 2017 06:43:42 +0000 (15:43 +0900)]
deadline-iosched: Introduce zone locking support

Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
      its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
      request in the dispatch queue tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes and the method
deadline_request_completed() is called, or when the request is requeued
using the method deadline_add_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the initial write order. If no write request can be
dispatched, allow reads to be dispatched even if the write batch is not
done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify deadline behavior.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
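
In sketch form, the write-selection logic described above behaves as
follows (the helper names here are purely illustrative, not the
scheduler's real ones):

  static struct request *deadline_pick_write(struct deadline_data *dd)
  {
          struct request *rq;

          list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
                  /* writes to conventional zones always proceed */
                  if (!rq_is_seq_zoned_write(rq))
                          return rq;
                  /* writes to a sequential zone need an unlocked zone */
                  if (!zone_is_write_locked(rq)) {
                          zone_write_lock(rq);
                          return rq;
                  }
                  /* zone locked: skip this request and test the next one */
          }
          return NULL;    /* no dispatchable write: let reads go instead */
  }
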
Damien Le Moal [Thu, 21 Dec 2017 06:43:41 +0000 (15:43 +0900)]
deadline-iosched: Introduce dispatch helpers

Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
deadline_dispatch_requests() for zoned block devices.

While at it, also remove the unnecessary forward declaration of the
function deadline_move_request().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 21 Dec 2017 06:43:40 +0000 (15:43 +0900)]
mq-deadline: Introduce zone locking support

Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
      its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
      request in the dispatch queue tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes with a call to the method
deadline_request_completed() or when the request is requeued using
dd_insert_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the lba ordering for write requests. If no write request
can be dispatched, allow reads to be dispatched even if the write batch
is not done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify mq-deadline behavior.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 21 Dec 2017 06:43:39 +0000 (15:43 +0900)]
mq-deadline: Introduce dispatch helpers

Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
__dd_dispatch_request() for zoned block devices.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 21 Dec 2017 06:43:38 +0000 (15:43 +0900)]
block: introduce zoned block devices zone write locking

Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have a limited knowledge of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also results in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).

Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zone_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).

Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.

Based on previous patches from Damien Le Moal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Fixed comments and indentation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
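
A sketch of how a consumer can use these fields (field and helper names
follow the description above; the real header may differ in detail):

  static inline unsigned int blk_queue_zone_no(struct request_queue *q,
                                               sector_t sector)
  {
          /* the zone size (chunk_sectors) is a power-of-two sector count */
          return sector >> ilog2(q->limits.chunk_sectors);
  }

  static inline bool blk_queue_zone_is_seq(struct request_queue *q,
                                           sector_t sector)
  {
          if (!q->seq_zone_bitmap)
                  return false;
          return test_bit(blk_queue_zone_no(q, sector), q->seq_zone_bitmap);
  }
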
Bart Van Assche [Tue, 2 Jan 2018 19:39:48 +0000 (11:39 -0800)]
pktcdvd: Fix a recently introduced NULL pointer dereference

Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized
instead of just before that pointer has been initialized. This patch
prevents the following command

pktsetup 1 /dev/sr0

from triggering the following kernel crash:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
IP: pkt_setup_dev+0x2db/0x670 [pktcdvd]
CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1
Call Trace:
 pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd]
 do_vfs_ioctl+0x8e/0x670
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x23/0x9a

Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: commit ca18d6f769d2 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 2 Jan 2018 19:39:47 +0000 (11:39 -0800)]
pktcdvd: Fix pkt_setup_dev() error path

Commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue")
modified add_disk() and disk_release() but did not update any of the
error paths that trigger a put_disk() call after disk->queue has been
assigned. That introduced the following behavior in the pktcdvd driver
if pkt_new_dev() fails:

Kernel BUG at 00000000e98fd882 [verbose debug info unavailable]

Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL,
fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev()
error path.

Fixes: commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: <stable@vger.kernel.org> # v3.2
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:16:21 +0000 (14:16 +0100)]
lightnvm: pblk: refactor pblk_ppa_comp function

Shorten the function to simply return the result of the comparison
instead of wrapping it in an if statement.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
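
That is, roughly (assuming the ppa field is the u64 address union member
of struct ppa_addr):

  /* before */
  static inline int pblk_ppa_comp(struct ppa_addr lppa, struct ppa_addr rppa)
  {
          if (lppa.ppa == rppa.ppa)
                  return 1;
          return 0;
  }

  /* after */
  static inline int pblk_ppa_comp(struct ppa_addr lppa, struct ppa_addr rppa)
  {
          return lppa.ppa == rppa.ppa;
  }
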
Javier González [Fri, 5 Jan 2018 13:16:20 +0000 (14:16 +0100)]
lightnvm: pblk: add iostat support

Since pblk registers its own block device, the iostat accounting is
not automatically done for us. Therefore, add the necessary
accounting logic to satisfy the iostat interface.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:19 +0000 (14:16 +0100)]
lightnvm: pblk: print instance name on instance info

Add the instance name to the information printed out on target creation.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:18 +0000 (14:16 +0100)]
lightnvm: pblk: free write buffer on init failure

Refactor the way we free the write buffer to ensure that all entries get
freed in case of an error on the init sequence.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:17 +0000 (14:16 +0100)]
lightnvm: pblk: ensure kthread alloc. before kicking it

When creating the write thread, ensure that the kthread has been created
before initializing the timer responsible for kicking it. Otherwise, if
the kthread creation fails or gets killed from user space, we risk
kicking an empty thread structure.

Also, since the kthread creation can be interrupted from user space,
adapt the error path to not report an error when this happens, since it
is intentional that the instance creation is aborted.

Signed-off-by: Javier González <javier@cnexlabs.com>
Updated source to reflect the new timer_setup API.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
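
The resulting ordering, sketched (identifiers only approximate pblk's and
are not verbatim):

  static int pblk_writer_init(struct pblk *pblk)
  {
          pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
          if (IS_ERR(pblk->writer_ts)) {
                  int err = PTR_ERR(pblk->writer_ts);

                  /* creation interrupted from user space (-EINTR) is not
                   * an error worth reporting */
                  if (err != -EINTR)
                          pr_err("pblk: could not allocate writer kthread (%d)\n",
                                 err);
                  return err;
          }

          /* only now arm the timer that periodically kicks the kthread */
          timer_setup(&pblk->wtimer, pblk_write_timer_fn, 0);
          mod_timer(&pblk->wtimer, jiffies + msecs_to_jiffies(100));

          return 0;
  }
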
Javier González [Fri, 5 Jan 2018 13:16:16 +0000 (14:16 +0100)]
lightnvm: pblk: do not log recovery read errors

On scan recovery, reads can fail. This happens because the first page
for each line is read in order to determine if the line has been used
(and thus needs to be recovered), or not. This can lead to "empty page"
read errors.

Since these errors are normal, do not log them, as they are confusing
when reviewing the logs.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:15 +0000 (14:16 +0100)]
lightnvm: pblk: ignore high ecc errors on recovery

On recovery, do not stop L2P recovery if reads report a high ECC error,
as the data is still available.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:14 +0000 (14:16 +0100)]
lightnvm: set target over-provision on create ioctl

Allow setting the over-provision percentage on target creation. In case
the value is not provided, fall back to the default value set by
the target.

In pblk, set the default OP to 11% of the total size of the device.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:13 +0000 (14:16 +0100)]
lightnvm: pblk: use exact free block counter in RL

Until now, pblk's rate-limiter has used a heuristic to reserve space for
GC I/O given that the over-provision area was fixed.

In preparation for allowing to define the over-provision area on target
creation, define a dedicated free_block counter in the rate-limiter to
track the number of blocks being used for user data.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:12 +0000 (14:16 +0100)]
lightnvm: pblk: remove pblk_gc_stop

pblk_gc_stop just sets pblk->gc->gc_active to zero, ignoring
the flush parameter. This is plain confusing, so remove the
function and set the gc active flag at the call points instead.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:11 +0000 (14:16 +0100)]
lightnvm: pblk: prevent premature sync point resets

Unless we protect flush pointer updates with a lock, we risk
resetting new flush points before we've synced all sectors
up to that point.

This patch protects new flush points with the same spin lock
that is being held when advancing the sync pointer and
resetting completed flush points.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
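
Schematically (lock, field and helper names assumed for illustration):

  static void pblk_rb_set_flush_point(struct pblk_rb *rb, struct bio *bio,
                                      unsigned int pos)
  {
          /* taking s_lock means this cannot race with the completion path,
           * which holds the same lock while advancing the sync pointer and
           * clearing completed flush points */
          spin_lock_irq(&rb->s_lock);
          pblk_rb_flush_point_set(rb, bio, pos);
          spin_unlock_irq(&rb->s_lock);
  }
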
Hans Holmberg [Fri, 5 Jan 2018 13:16:10 +0000 (14:16 +0100)]
lightnvm: pblk: clear flush point on completed writes

Move completion of syncs and clearing of flush points to the
write completion path - this ensures that the data has been
committed to the media before completing bios containing syncs.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:09 +0000 (14:16 +0100)]
lightnvm: pblk: rename sync_point to flush_point

Sync point is a really confusing name for keeping track of
the last entry that needs to be flushed, so change the name
to flush_point instead.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:08 +0000 (14:16 +0100)]
lightnvm: pblk: refactor emeta consistency check

Currently pblk_recov_get_lba_list() does two separate things:
it checks the consistency of the emeta and extracts the lba list.

This patch separates the consistency check to make the code easier
to read and to prepare for version checks of the line emeta
persistent data format version.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:07 +0000 (14:16 +0100)]
lightnvm: pblk: remove pblk_for_each_lun helper

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:06 +0000 (14:16 +0100)]
lightnvm: pblk: compress and reorder helper functions

Over time, we have accumulated some redundant helper functions.
Refactor them to eliminate redundant and unnecessary code. Also, reorder
them to improve readability.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:05 +0000 (14:16 +0100)]
lightnvm: guarantee target unique name across devs.

Until now, target unique naming is only guaranteed per device. This is
ok from a lightnvm perspective, but not from a sysfs one, since groups
will collide regardless of the underlying device.

Check that names are unique across all lightnvm-capable devices.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:04 +0000 (14:16 +0100)]
lightnvm: refactor target type lookup

Refactor target type lookup to take or not take the lock explicitly,
instead of using a hidden parameter to control locking inside the
function.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:16:03 +0000 (14:16 +0100)]
lightnvm: make geometry structures 2.0 ready

Prepare for the 2.0 revision by adapting the geometry
structures to coexist with the 1.2 revision.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:16:02 +0000 (14:16 +0100)]
lightnvm: remove lower page tables

The lower page table is unused. All page tables reported by 1.2
devices describe a sequential 1:1 page mapping. This is also not
used going forward with the 2.0 revision.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:01 +0000 (14:16 +0100)]
lightnvm: remove unnecessary field from nvm_rq

Remove the wait field in nvm_rq. It is not used anymore, as targets rely
on the functionality provided by the LightNVM subsystem when sending
sync I/O.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:16:00 +0000 (14:16 +0100)]
lightnvm: remove hybrid ocssd 1.2 support

Now that rrpc has been removed, also remove the hybrid 1.2 support
from the core.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:15:59 +0000 (14:15 +0100)]
lightnvm: use internal pblk methods

Now that rrpc has been removed, the only user of the ppa helpers
is pblk. However, pblk already defines similar functions.

Switch pblk to use the internal ones, and remove the generic ppa
helpers.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:15:58 +0000 (14:15 +0100)]
lightnvm: remove rrpc

The hybrid mode for the 1.2 revision was deprecated and has
no users. Remove it to make it easier to move to the 2.0 revision.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Matias Bjørling [Fri, 5 Jan 2018 13:15:57 +0000 (14:15 +0100)]
null_blk: remove lightnvm support

With rrpc to be removed, the null_blk lightnvm support is no longer
functional. Remove the lightnvm implementation and maybe add it to
another module in the future if someone takes on the challenge.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Bo [Fri, 5 Jan 2018 07:09:06 +0000 (00:09 -0700)]
blk-mq: remove confusing comment of blk_mq_sched_dispatch_requests

Commit de1482974080
("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
changes the function to return bool type, and then commit 1f460b63d4b3
("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
changes it back to void, but the comment remains.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 14 Nov 2017 17:24:58 +0000 (10:24 -0700)]
blk-mq: improve heavily contended tag case

Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty;
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.

What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.

Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.

An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
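
A simplified sketch of the wait path (the real code also runs the hardware
queue before sleeping; names follow blk-mq-tag.c and sbitmap loosely):

  static int bt_wait_for_tag(struct sbitmap_queue *bt,
                             struct blk_mq_hw_ctx *hctx)
  {
          struct sbq_wait_state *ws = bt_wait_ptr(bt, hctx);
          DEFINE_WAIT(wait);
          int tag;

          for (;;) {
                  /* check for a free tag BEFORE adding ourselves
                   * to the waitqueue */
                  tag = __sbitmap_queue_get(bt);
                  if (tag != -1)
                          break;

                  /* exclusive wait: freeing a batch of tags wakes only a
                   * batch of waiters instead of the whole waitqueue */
                  prepare_to_wait_exclusive(&ws->wait, &wait,
                                            TASK_UNINTERRUPTIBLE);

                  tag = __sbitmap_queue_get(bt);
                  if (tag != -1)
                          break;

                  io_schedule();
          }
          finish_wait(&ws->wait, &wait);
          return tag;
  }
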
Linus Torvalds [Mon, 18 Dec 2017 02:59:59 +0000 (18:59 -0800)]
Linux 4.15-rc4

Kees Cook [Tue, 12 Dec 2017 19:28:38 +0000 (11:28 -0800)]
Revert "exec: avoid RLIMIT_STACK races with prlimit()"

This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

SELinux runs with secureexec for all non-"noatsecure" domain transitions,
which means lots of processes end up hitting the stack hard-limit change
that was introduced in order to fix a race with prlimit(). That race fix
will need to be redesigned.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Tomáš Trnka <trnka@scm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 17 Dec 2017 21:57:08 +0000 (13:57 -0800)]
Merge branch 'WIP.x86-pti.base-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Page Table Isolation (PTI) v4.14 backporting base tree from Ingo Molnar:
 "This tree contains the v4.14 PTI backport preparatory tree, which
  consists of four merges of upstream trees and 7 cherry-picked commits,
  which the upcoming PTI work depends on"

NOTE! The resulting tree is exactly the same as the original base tree
(ie the diff between this commit and its immediate first parent is
empty).

The only reason for this merge is literally to have a common point for
the actual PTI changes so that the commits can be shared in both the
4.15 and 4.14 trees.

* 'WIP.x86-pti.base-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow
  locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
  locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
  bpf: fix build issues on um due to mising bpf_perf_event.h
  perf/x86: Enable free running PEBS for REGS_USER/INTR
  x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
  x86/cpufeature: Add User-Mode Instruction Prevention definitions

Linus Torvalds [Sun, 17 Dec 2017 21:54:31 +0000 (13:54 -0800)]
Merge branch 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Page Table Isolation (PTI) preparatory tree from Ingo Molnar:
 "This does a rename to free up linux/pti.h to be used by the upcoming
  page table isolation feature"

* 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  drivers/misc/intel/pti: Rename the header file to free up the namespace

Linus Torvalds [Sun, 17 Dec 2017 21:48:50 +0000 (13:48 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single bugfix which prevents arbitrary sigev_notify values in
  posix-timers"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timer: Properly check sigevent->sigev_notify

Linus Torvalds [Sun, 17 Dec 2017 21:28:49 +0000 (13:28 -0800)]
Merge tag 'dmaengine-fix-4.15-rc4' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
 "This time consisting of fixes in a bunch of drivers and the dmatest
  module:

   - Fix for disable clk on error path in fsl-edma driver
   - Disable clk fail fix in jz4740 driver
   - Fix long pending bug in dmatest driver for dangling pointer
   - Fix potential NULL pointer dereference in at_hdmac driver
   - Error handling path in ioat driver"

* tag 'dmaengine-fix-4.15-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: fsl-edma: disable clks on all error paths
  dmaengine: jz4740: disable/unprepare clk if probe fails
  dmaengine: dmatest: move callback wait queue to thread context
  dmaengine: at_hdmac: fix potential NULL pointer dereference in atc_prep_dma_interleaved
  dmaengine: ioat: Fix error handling path

Arnd Bergmann [Fri, 10 Nov 2017 14:57:21 +0000 (15:57 +0100)]
cramfs: fix MTD dependency

With CONFIG_MTD=m and CONFIG_CRAMFS=y, we now get a link failure:

  fs/cramfs/inode.o: In function `cramfs_mount': inode.c:(.text+0x220): undefined reference to `mount_mtd'
  fs/cramfs/inode.o: In function `cramfs_mtd_fill_super':
  inode.c:(.text+0x6d8): undefined reference to `mtd_point'
  inode.c:(.text+0xae4): undefined reference to `mtd_unpoint'

This adds a more specific Kconfig dependency to avoid the broken
configuration.

Alternatively we could make CRAMFS itself depend on "MTD || !MTD" with a
similar result.

Fixes: 99c18ce580c6 ("cramfs: direct memory access support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 17 Dec 2017 20:18:35 +0000 (12:18 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "The alloc_super() one is a regression in this merge window, lazytime
  thing is older..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Handle lazytime in do_mount()
  alloc_super(): do ->s_umount initialization earlier

Linus Torvalds [Sun, 17 Dec 2017 20:14:33 +0000 (12:14 -0800)]
Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a regression which caused us to fail to interpret symlinks in very
  ancient ext3 file system images.

  Also fix two xfstests failures, one of which could cause an OOPS, plus
  an additional bug fix caught by fuzz testing"

* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix crash when a directory's i_size is too small
  ext4: add missing error check in __ext4_new_inode()
  ext4: fix fdatasync(2) after fallocate(2) operation
  ext4: support fast symlinks from ext3 file systems

Andrey Ryabinin [Thu, 16 Nov 2017 01:36:35 +0000 (17:36 -0800)]
x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow

[ Note, this is a Git cherry-pick of the following commit:

    d17a1d97dc20: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")

  ... for easier x86 PTI code testing and back-porting. ]

The KASAN shadow is currently mapped using vmemmap_populate() since that
provides a semi-convenient way to map pages into init_top_pgt.  However,
since that no longer zeroes the mapped pages, it is not suitable for
KASAN, which requires zeroed shadow memory.

Add kasan_populate_shadow() interface and use it instead of
vmemmap_populate().  Besides, this allows us to take advantage of
gigantic pages and use them to populate the shadow, which should save us
some memory wasted on page tables and reduce TLB pressure.

Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Will Deacon [Tue, 24 Oct 2017 10:22:48 +0000 (11:22 +0100)]
locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

[ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

  ... for easier x86 PTI code testing and back-porting. ]

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agolocking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
Will Deacon [Tue, 24 Oct 2017 10:22:47 +0000 (11:22 +0100)]
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()

[ Note, this is a Git cherry-pick of the following commit:

    76ebbe78f739 ("locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()")

  ... for easier x86 PTI code testing and back-porting. ]

In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.
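
Conceptually the change boils down to issuing the dependency barrier right
after the volatile load; a simplified sketch of the idea (the real macro in
include/linux/compiler.h is considerably more involved):

  /* Simplified sketch, not the real macro. */
  #define READ_ONCE(x)                                                  \
  ({                                                                    \
          typeof(x) __val = *(volatile typeof(x) *)&(x);                \
          smp_read_barrier_depends();     /* no-op except on Alpha */   \
          __val;                                                        \
  })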

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agobpf: fix build issues on um due to mising bpf_perf_event.h
Daniel Borkmann [Tue, 12 Dec 2017 01:25:31 +0000 (02:25 +0100)]
bpf: fix build issues on um due to mising bpf_perf_event.h

[ Note, this is a Git cherry-pick of the following commit:

    a23f06f06dbe ("bpf: fix build issues on um due to mising bpf_perf_event.h")

  ... for easier x86 PTI code testing and back-porting. ]

Since c895f6f703ad ("bpf: correct broken uapi for
BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build
on i386 or x86_64:

  [...]
    CC      init/main.o
  In file included from ../include/linux/perf_event.h:18:0,
                   from ../include/linux/trace_events.h:10,
                   from ../include/trace/syscall.h:7,
                   from ../include/linux/syscalls.h:82,
                   from ../init/main.c:20:
  ../include/uapi/linux/bpf_perf_event.h:11:32: fatal error:
  asm/bpf_perf_event.h: No such file or directory
   #include <asm/bpf_perf_event.h>
  [...]

Let's add the missing bpf_perf_event.h to the um arch as well. This seems
to be the only architecture still missing it.
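
One plausible shape for such a fix is a thin wrapper header that pulls in the
asm-generic version; the exact path, and whether the generic header is
sufficient for um, are assumptions here rather than details taken from the
commit:

  /* arch/um/include/asm/bpf_perf_event.h -- hypothetical path */
  #include <asm-generic/bpf_perf_event.h>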

Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Richard Weinberger <richard@sigma-star.at>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Richard Weinberger <richard@sigma-star.at>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoperf/x86: Enable free running PEBS for REGS_USER/INTR
Andi Kleen [Thu, 31 Aug 2017 21:46:30 +0000 (14:46 -0700)]
perf/x86: Enable free running PEBS for REGS_USER/INTR

[ Note, this is a Git cherry-pick of the following commit:

    a47ba4d77e12 ("perf/x86: Enable free running PEBS for REGS_USER/INTR")

  ... for easier x86 PTI code testing and back-porting. ]

Currently free running PEBS is disabled when user or interrupt
registers are requested. Most of the registers are actually
available in the PEBS record and can be supported.

So we just need to check for the supported registers and then
allow it: everything is supported except for the segment registers.

For user registers this only works when the counter is limited
to ring 3, so that also needs to be checked.
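
A hedged sketch of the check being described; the mask and helper names are
illustrative, and the real logic lives in the x86 PEBS event constraint code:

  /* Illustrative only: allow free running PEBS unless unsupported
   * (segment) registers are requested; user registers additionally
   * require a counter restricted to ring 3. */
  static bool pebs_free_running_ok(struct perf_event *event)
  {
          u64 regs = event->attr.sample_regs_user |
                     event->attr.sample_regs_intr;

          if (regs & ~PEBS_SUPPORTED_REGS)        /* hypothetical mask */
                  return false;

          if (event->attr.sample_regs_user && !event->attr.exclude_kernel)
                  return false;

          return true;
  }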

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170831214630.21892-1-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agox86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
Rudolf Marek [Tue, 28 Nov 2017 21:01:06 +0000 (22:01 +0100)]
x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD

[ Note, this is a Git cherry-pick of the following commit:

    2b67799bdf25 ("x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD")

  ... for easier x86 PTI code testing and back-porting. ]

The latest AMD AMD64 Architecture Programmer's Manual
adds a CPUID feature XSaveErPtr (CPUID_Fn80000008_EBX[2]).

If this feature is set, the FXSAVE, XSAVE, FXSAVEOPT, XSAVEC and XSAVES
/ FXRSTOR, XRSTOR and XRSTORS instructions always save/restore the error
pointers, making the X86_BUG_FXSAVE_LEAK workaround obsolete on such CPUs.
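
In practice that amounts to a check along these lines; this is a sketch, and
the kernel models the bit as a synthetic CPU feature flag rather than reading
CPUID directly at this point:

  /* Sketch: only flag the erratum when the CPU does not advertise
   * XSaveErPtr (CPUID Fn8000_0008, EBX bit 2). */
  static void check_fxsave_leak(struct cpuinfo_x86 *c)
  {
          if (!(cpuid_ebx(0x80000008) & BIT(2)))
                  set_cpu_bug(c, X86_BUG_FXSAVE_LEAK);
  }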

Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/bdcebe90-62c5-1f05-083c-eba7f08b2540@assembler.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agox86/cpufeature: Add User-Mode Instruction Prevention definitions
Ricardo Neri [Mon, 6 Nov 2017 02:27:51 +0000 (18:27 -0800)]
x86/cpufeature: Add User-Mode Instruction Prevention definitions

[ Note, this is a Git cherry-pick of the following commit: (limited to the cpufeatures.h file)

    3522c2a6a4f3 ("x86/cpufeature: Add User-Mode Instruction Prevention definitions")

  ... for easier x86 PTI code testing and back-porting. ]

User-Mode Instruction Prevention is a security feature present in newer
Intel processors that, when enabled, prevents a subset of instructions
from being executed in user mode (CPL > 0). Attempting to execute any of
these instructions causes a general protection exception.

The subset of instructions comprises:

 * SGDT - Store Global Descriptor Table
 * SIDT - Store Interrupt Descriptor Table
 * SLDT - Store Local Descriptor Table
 * SMSW - Store Machine Status Word
 * STR  - Store Task Register

This feature is also added to the list of disabled-features to allow
a cleaner handling of build-time configuration.
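
A rough sketch of the two definitions described above; the bit position
follows CPUID.(EAX=7,ECX=0):ECX[2], while the surrounding macro plumbing and
the Kconfig symbol are simplified assumptions:

  /* arch/x86/include/asm/cpufeatures.h (sketch) */
  #define X86_FEATURE_UMIP        (16*32 + 2) /* User-Mode Instruction Prevention */

  /* arch/x86/include/asm/disabled-features.h (sketch) */
  #ifdef CONFIG_X86_INTEL_UMIP
  # define DISABLE_UMIP           0
  #else
  # define DISABLE_UMIP           (1 << (X86_FEATURE_UMIP & 31))
  #endif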

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-7-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoMerge commit 'upstream-x86-virt' into WIP.x86/mm
Ingo Molnar [Fri, 1 Dec 2017 09:34:04 +0000 (10:34 +0100)]
Merge commit 'upstream-x86-virt' into WIP.x86/mm

Merge a minimal set of virt cleanups, as a base for the MM isolation patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoMerge branch 'upstream-acpi-fixes' into WIP.x86/pti.base
Ingo Molnar [Sun, 17 Dec 2017 12:09:31 +0000 (13:09 +0100)]
Merge branch 'upstream-acpi-fixes' into WIP.x86/pti.base

Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoMerge branch 'upstream-x86-selftests' into WIP.x86/pti.base
Ingo Molnar [Sun, 17 Dec 2017 12:04:28 +0000 (13:04 +0100)]
Merge branch 'upstream-x86-selftests' into WIP.x86/pti.base

Conflicts:
arch/x86/kernel/cpu/Makefile

Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoMerge commit 'upstream-x86-entry' into WIP.x86/mm
Ingo Molnar [Fri, 1 Dec 2017 09:32:48 +0000 (10:32 +0100)]
Merge commit 'upstream-x86-entry' into WIP.x86/mm

Pull in a minimal set of v4.15 entry code changes, as a base for the MM isolation patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agodrivers/misc/intel/pti: Rename the header file to free up the namespace
Ingo Molnar [Tue, 5 Dec 2017 13:14:47 +0000 (14:14 +0100)]
drivers/misc/intel/pti: Rename the header file to free up the namespace

We'd like to use the 'PTI' acronym for 'Page Table Isolation', so free up
the namespace by renaming the <linux/pti.h> driver header to
<linux/intel-pti.h>.

(Also standardize the header guard name while we're at it.)
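
After the rename, the driver header ends up looking roughly like this; the
guard name shown is an illustrative assumption:

  /* include/linux/intel-pti.h (sketch) */
  #ifndef LINUX_INTEL_PTI_H
  #define LINUX_INTEL_PTI_H

  /* unchanged Intel Parallel Trace Interface driver declarations */

  #endif /* LINUX_INTEL_PTI_H */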

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: J Freyensee <james_p_freyensee@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Sat, 16 Dec 2017 21:43:08 +0000 (13:43 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "More fixes from testing done on the rc kernel, including more SELinux
  testing. Looking forward, lockdep found regression today in ipoib
  which is still being fixed.

  Summary:

   - Fix for SELinux on the umad SMI path. Some old hardware does not
     fill the PKey properly, exposing another bug in the newer SELinux
     code.

   - Check the input port, as we can exceed array bounds from this
     user-supplied value

   - Users are unable to use the hash field support as they want due to
     incorrect checks on the field restrictions; correct that so the
     feature works as intended

   - User-triggerable oops in the NETLINK_RDMA handler

   - cxgb4 driver fix for a bad interaction with CQ flushing in iser
     caused by patches in this merge window, and bad CQ flushing during
     normal close.

   - Unbalanced memalloc_noio in ipoib in an error path"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/ipoib: Restore MM behavior in case of tx_ring allocation failure
  iw_cxgb4: only insert drain cqes if wq is flushed
  iw_cxgb4: only clear the ARMED bit if a notification is needed
  RDMA/netlink: Fix general protection fault
  IB/mlx4: Fix RSS hash fields restrictions
  IB/core: Don't enforce PKey security on SMI MADs
  IB/core: Bound check alternate path port number

7 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 16 Dec 2017 21:34:38 +0000 (13:34 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "Two bugfixes for the AT24 I2C eeprom driver and some minor corrections
  for I2C bus drivers"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: piix4: Fix port number check on release
  i2c: stm32: Fix copyrights
  i2c-cht-wc: constify platform_device_id
  eeprom: at24: change nvmem stride to 1
  eeprom: at24: fix I2C device selection for runtime PM

7 years agoMerge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Linus Torvalds [Sat, 16 Dec 2017 21:12:53 +0000 (13:12 -0800)]
Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization

7 years agoRevert "mm: replace p??_write with pte_access_permitted in fault + gup paths"
Linus Torvalds [Sat, 16 Dec 2017 02:53:22 +0000 (18:53 -0800)]
Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"

This reverts commits 5c9d2d5c269c, c7da82b894e9 and e7fe7b5cae90.

We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.

Particularly when doing a "remote" page lookup (i.e. in somebody else's VM,
not your own), you need to be much more careful than this was.  Dave
Hansen says:

 "So, the underlying bug here is that we now a get_user_pages_remote()
  and then go ahead and do the p*_access_permitted() checks against the
  current PKRU. This was introduced recently with the addition of the
  new p??_access_permitted() calls.

  We have checks in the VMA path for the "remote" gups and we avoid
  consulting PKRU for them. This got missed in the pkeys selftests
  because I did a ptrace read, but not a *write*. I also didn't
  explicitly test it against something where a COW needed to be done"

It's also not entirely clear that it makes sense to check the protection
key bits at this level at all.  But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.

We'll see.
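
A purely hypothetical sketch of that eventual approach; pte_pkey_bits() is an
invented helper, and this is explicitly not what the kernel does today:

  /* Sketch: in the generic gup_fast PTE walker, bail out when any
   * protection-key bits are set so the caller falls back to the
   * vma-aware get_user_pages() path, where pkeys can be checked. */
  static inline bool gup_fast_pte_allowed(pte_t pte, bool write)
  {
          if (pte_pkey_bits(pte))         /* invented helper */
                  return false;           /* force the slow-path fallback */

          return write ? pte_write(pte) : true;
  }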

Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back.  Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Fri, 15 Dec 2017 21:08:37 +0000 (13:08 -0800)]
Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.

 2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
    Brueckner.

 3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
    Berg.

 4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.

 5) Don't advertise gigabit in sh_eth when not available, from Thomas
    Petazzoni.

 6) Check network namespace when delivering to netlink taps, from Kevin
    Cernekee.

 7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.

 8) Use correct address in TCP md5 lookups when replying to an incoming
    segment, from Christoph Paasch.

 9) Add schedule points to BPF map alloc/free, from Eric Dumazet.

10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
    from Eric Dumazet.

11) Fix SKB leak in tipc, from Jon Maloy.

12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.

13) SKB leak fix in skb_complete_tx_timestamp(), from Willem de Bruijn.

14) Add some new qmi_wwan device IDs, from Daniele Palmas.

15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
  net: qcom/emac: Reduce timeout for mdio read/write
  net: sched: fix static key imbalance in case of ingress/clsact_init error
  net: sched: fix clsact init error path
  ip_gre: fix wrong return value of erspan_rcv
  net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
  pkt_sched: Remove TC_RED_OFFLOADED from uapi
  net: sched: Move to new offload indication in RED
  net: sched: Add TCA_HW_OFFLOAD
  net: aquantia: Increment driver version
  net: aquantia: Fix typo in ethtool statistics names
  net: aquantia: Update hw counters on hw init
  net: aquantia: Improve link state and statistics check interval callback
  net: aquantia: Fill in multicast counter in ndev stats from hardware
  net: aquantia: Fill ndev stat couters from hardware
  net: aquantia: Extend stat counters to 64bit values
  net: aquantia: Fix hardware DMA stream overload on large MRRS
  net: aquantia: Fix actual speed capabilities reporting
  sock: free skb in skb_complete_tx_timestamp on error
  s390/qeth: update takeover IPs after configuration change
  s390/qeth: lock IP table while applying takeover changes
  ...

7 years agoMerge tag 'usb-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Fri, 15 Dec 2017 21:03:25 +0000 (13:03 -0800)]
Merge tag 'usb-4.15-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some USB fixes for 4.15-rc4.

  There is the usual handful of gadget/dwc2/dwc3 fixes for reported
  issues. But the most important things in here are the core fix from
  Alan Stern to resolve a nasty security bug (my first attempt is
  reverted; Alan's was much cleaner), as well as a number of usbip fixes
  from Shuah Khan to resolve the reported security issues.

  All of these have been in linux-next with no reported issues"

* tag 'usb-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: core: prevent malicious bNumInterfaces overflow
  Revert "USB: core: only clean up what we allocated"
  USB: core: only clean up what we allocated
  Revert "usb: gadget: allow to enable legacy drivers without USB_ETH"
  usb: gadget: webcam: fix V4L2 Kconfig dependency
  usb: dwc2: Fix TxFIFOn sizes and total TxFIFO size issues
  usb: dwc3: gadget: Fix PCM1 for ISOC EP with ep->mult less than 3
  usb: dwc3: of-simple: set dev_pm_ops
  usb: dwc3: of-simple: fix missing clk_disable_unprepare
  usb: dwc3: gadget: Wait longer for controller to end command processing
  usb: xhci: fix TDS for MTK xHCI1.1
  xhci: Don't add a virt_dev to the devs array before it's fully allocated
  usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
  usbip: prevent vhci_hcd driver from leaking a socket pointer address
  usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
  usbip: fix stub_rx: get_pipe() to validate endpoint number
  tools/usbip: fixes potential (minor) "buffer overflow" (detected on recent gcc with -Werror)
  USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
  usb: musb: da8xx: fix babble condition handling

7 years agoMerge tag 'staging-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Fri, 15 Dec 2017 20:59:48 +0000 (12:59 -0800)]
Merge tag 'staging-4.15-rc4' of git://git./linux/kernel/git/gregkh/staging

Pull staging fixes from Greg KH:
 "Here are some small staging driver fixes for 4.15-rc4.

  One patch for the ccree driver to prevent an uninitialized value from
  being returned to a caller, and the other fixes a logic error in the
  pi433 driver"

* tag 'staging-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: pi433: Fixes issue with bit shift in rf69_get_modulation
  staging: ccree: Uninitialized return in ssi_ahash_import()

7 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Fri, 15 Dec 2017 20:56:23 +0000 (12:56 -0800)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio regression fixes from Michael Tsirkin:
 "Fixes two issues in the latest kernel"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_mmio: fix devm cleanup
  ptr_ring: fix up after recent ptr_ring changes

7 years agoMerge tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device...
Linus Torvalds [Fri, 15 Dec 2017 20:53:37 +0000 (12:53 -0800)]
Merge tag 'for-4.15/dm-fixes' of git://git./linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.

 - fix various targets to call dm_register_target() after their module
   __init resources are created; otherwise racing lvm2 commands could
   result in a NULL pointer dereference during initialization of the
   associated DM kernel module.

 - fix regression in bio-based DM multipath queue_if_no_path handling.

 - fix DM bufio's shrinker to reclaim more than one buffer per scan.

* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
  dm mpath: fix bio-based multipath queue_if_no_path handling
  dm: fix various targets to dm_register_target after module __init resources created
  dm table: fix regression from improper dm_dev_internal.count refcount_t conversion

7 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Fri, 15 Dec 2017 20:51:42 +0000 (12:51 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "The most important one is the bfa fix because it's easy to oops the
  kernel with this driver (this includes the commit that corrects the
  compiler warning in the original), a regression in the new timespec
  conversion in aacraid and a regression in the Fibre Channel ELS
  handling patch.

  The other three are a theoretical problem with termination in the
  vendor/host matching code and a use after free in lpfc.

  The additional patches are a fix for an I/O hang in the mq code under
  certain circumstances and a rare oops in some debugging code"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: core: Fix a scsi_show_rq() NULL pointer dereference
  scsi: MAINTAINERS: change FCoE list to linux-scsi
  scsi: libsas: fix length error in sas_smp_handler()
  scsi: bfa: fix type conversion warning
  scsi: core: run queue if SCSI device queue isn't ready and queue is idle
  scsi: scsi_devinfo: cleanly zero-pad devinfo strings
  scsi: scsi_devinfo: handle non-terminated strings
  scsi: bfa: fix access to bfad_im_port_s
  scsi: aacraid: address UBSAN warning regression
  scsi: libfc: fix ELS request handling
  scsi: lpfc: Use after free in lpfc_rq_buf_free()

7 years agoMerge tag 'mmc-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 15 Dec 2017 20:49:54 +0000 (12:49 -0800)]
Merge tag 'mmc-v4.15-rc2' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
 "A couple of MMC fixes:

   - fix use of uninitialized drv_type variable

   - apply NO_CMD23 quirk to some specific SD cards to make them work"

* tag 'mmc-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: core: apply NO_CMD23 quirk to some specific cards
  mmc: core: properly init drv_type

7 years agoMerge tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client
Linus Torvalds [Fri, 15 Dec 2017 20:48:27 +0000 (12:48 -0800)]
Merge tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "CephFS inode trimming fix from Zheng, marked for stable"

* tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client:
  ceph: drop negative child dentries before try pruning inode's alias

7 years agoMerge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszer...
Linus Torvalds [Fri, 15 Dec 2017 20:46:48 +0000 (12:46 -0800)]
Merge branch 'overlayfs-linus' of git://git./linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:

 - fix incomplete syncing of filesystem

 - fix regression in readdir on ovl over 9p

 - only follow redirects when needed

 - misc fixes and cleanups

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix overlay: warning prefix
  ovl: Use PTR_ERR_OR_ZERO()
  ovl: Sync upper dirty data when syncing overlayfs
  ovl: update ctx->pos on impure dir iteration
  ovl: Pass ovl_get_nlink() parameters in right order
  ovl: don't follow redirects if redirect_dir=off

7 years agonet: qcom/emac: Reduce timeout for mdio read/write
Hemanth Puranik [Fri, 15 Dec 2017 14:35:58 +0000 (20:05 +0530)]
net: qcom/emac: Reduce timeout for mdio read/write

Currently an mdio read/write takes around 115us because the timeout
between status checks is set to 100us.
By reducing the timeout to 1us, an mdio read/write takes ~15us to
complete. This improves the link-up event response time.
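
The change amounts to tightening the delay between status polls in something
like the following; the register and field names here are illustrative rather
than the emac driver's actual ones:

  /* Sketch: poll the MDIO controller every 1us instead of every 100us,
   * keeping a comparable overall timeout.  Names are illustrative. */
  static int emac_mdio_wait_idle(void __iomem *base)
  {
          u32 val;

          return readl_poll_timeout(base + EMAC_MDIO_CTRL, val,
                                    !(val & MDIO_BUSY),
                                    1,          /* poll interval: was 100us */
                                    100 * 100); /* overall timeout in us */
  }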

Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
Acked-by: Timur Tabi <timur@codeaurora.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 15 Dec 2017 20:44:49 +0000 (12:44 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "There are some significant fixes in here for FP state corruption,
  hardware access/dirty PTE corruption and an erratum workaround for the
  Falkor CPU.

  I'm hoping that things finally settle down now, but never say never...

  Summary:

   - Fix FPSIMD context switch regression introduced in -rc2

   - Fix ABI break with SVE CPUID register reporting

   - Fix use of uninitialised variable

   - Fixes to hardware access/dirty management and sanity checking

   - CPU erratum workaround for Falkor CPUs

   - Fix reporting of writeable+executable mappings

   - Fix signal reporting for RAS errors"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: fpsimd: Fix copying of FP state from signal frame into task struct
  arm64/sve: Report SVE to userspace via CPUID only if supported
  arm64: fix CONFIG_DEBUG_WX address reporting
  arm64: fault: avoid send SIGBUS two times
  arm64: hw_breakpoint: Use linux/uaccess.h instead of asm/uaccess.h
  arm64: Add software workaround for Falkor erratum 1041
  arm64: Define cputype macros for Falkor CPU
  arm64: mm: Fix false positives in set_pte_at access/dirty race detection
  arm64: mm: Fix pte_mkclean, pte_mkdirty semantics
  arm64: Initialise high_memory global variable earlier

7 years agonet: sched: fix static key imbalance in case of ingress/clsact_init error
Jiri Pirko [Fri, 15 Dec 2017 11:40:13 +0000 (12:40 +0100)]
net: sched: fix static key imbalance in case of ingress/clsact_init error

Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.
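
A sketch of the pairing being described; the structure and the filter-setup
call are simplified stand-ins, not the exact sch_ingress.c code:

  static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
  {
          int err;

          /* Increment first: ingress_destroy() decrements unconditionally
           * and is called even when this function fails further down. */
          net_inc_ingress_queue();

          err = ingress_setup_filters(sch);   /* stand-in for tcf block setup */
          if (err)
                  return err;                 /* destroy() undoes the increment */

          sch->flags |= TCQ_F_CPUSTATS;
          return 0;
  }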

Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>