Jeff Layton [Fri, 21 Nov 2014 19:19:29 +0000 (14:19 -0500)]
sunrpc: fix potential races in pool_stats collection
In a later patch, we'll be removing some spinlocking around the socket
and thread queueing code in order to fix some contention problems. At
that point, the stats counters will no longer be protected by the
sp_lock.
Change the counters to atomic_long_t fields, except for the
"sockets_queued" counter which will still be manipulated under a
spinlock.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Fri, 21 Nov 2014 19:19:28 +0000 (14:19 -0500)]
sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
...also make the manipulation of sp_all_threads list use RCU-friendly
functions.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:22 +0000 (07:51 -0500)]
sunrpc: require svc_create callers to pass in meaningful shutdown routine
Currently all svc_create callers pass in NULL for the shutdown parm,
which then gets fixed up to be svc_rpcb_cleanup if the service uses
rpcbind.
Simplify this by instead having the the only caller that requires it
(lockd) pass in svc_rpcb_cleanup and get rid of the special casing.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:21 +0000 (07:51 -0500)]
sunrpc: have svc_wake_up only deal with pool 0
The way that svc_wake_up works is a bit inefficient. It walks all of the
available pools for a service and either wakes up a task in each one or
sets the SP_TASK_PENDING flag in each one.
When svc_wake_up is called, there is no need to wake up more than one
thread to do this work. In practice, only lockd currently uses this
function and it's single threaded anyway. Thus, this just boils down to
doing a wake up of a thread in pool 0 or setting a single flag.
Eliminate the for loop in this function and change it to just operate on
pool 0. Also update the comments that sit above it and get rid of some
code that has been commented out for years now.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:20 +0000 (07:51 -0500)]
sunrpc: convert sp_task_pending flag to use atomic bitops
In a later patch, we'll want to be able to handle this flag without
holding the sp_lock. Change this field to an unsigned long flags
field, and declare a new flag in it that can be managed with atomic
bitops.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:19 +0000 (07:51 -0500)]
sunrpc: move rq_cachetype field to better optimize space
There are a couple of holes in the svc_rqst field on x86_64. Move the
rq_cachetype to a different location to eliminate both of them.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:18 +0000 (07:51 -0500)]
sunrpc: move rq_splice_ok flag into rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:17 +0000 (07:51 -0500)]
sunrpc: move rq_dropme flag into rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:16 +0000 (07:51 -0500)]
sunrpc: move rq_usedeferral flag to rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:15 +0000 (07:51 -0500)]
sunrpc: move rq_local field to rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:14 +0000 (07:51 -0500)]
sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
In a later patch, we're going to need some atomic bit flags. Since that
field will need to be an unsigned long, we mitigate that space
consumption by migrating some other bitflags to the new field. Start
with the rq_secure flag.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
J. Bruce Fields [Tue, 9 Dec 2014 16:12:26 +0000 (11:12 -0500)]
Merge tag 'nfs-for-3.19-1' into nfsd for-3.19 branch
Mainly what I need is
860a0d9e511f "sunrpc: add some tracepoints in
svc_rqst handling functions", which subsequent server rpc patches from
jlayton depend on. I'm merging this later tag on the assumption that's
more likely to be a tested and stable point.
Dan Carpenter [Thu, 27 Nov 2014 15:58:54 +0000 (18:58 +0300)]
nfsd: minor off by one checks in __write_versions()
My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.
The intent of the code is that we print as many versions as we can
without truncating a version. Then we put a newline at the end. If the
newline can't fit we return -EINVAL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 19 Nov 2014 12:51:13 +0000 (07:51 -0500)]
sunrpc: release svc_pool_map reference when serv allocation fails
Currently, it leaks when the allocation fails.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Mon, 17 Nov 2014 22:02:57 +0000 (17:02 -0500)]
sunrpc: eliminate the XPT_DETACHED flag
All it does is indicate whether a xprt has already been deleted from
a list or not, which is unnecessary since we use list_del_init and it's
always set and checked under the sv_lock anyway.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Jeff Layton [Wed, 26 Nov 2014 19:44:44 +0000 (14:44 -0500)]
sunrpc: add a debugfs rpc_xprt directory with an info file in it
Add a new directory heirarchy under the debugfs sunrpc/ directory:
sunrpc/
rpc_xprt/
<xprt id>/
Within that directory, we can put files that give info about the
xprts. We do have the (minor) problem that there is no succinct,
unique identifier for rpc_xprts. So we generate them synthetically
with a static atomic_t counter.
For now, this directory just holds an "info" file, but we may add
other files to it in the future.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Wed, 26 Nov 2014 19:44:43 +0000 (14:44 -0500)]
sunrpc: add debugfs file for displaying client rpc_task queue
It's possible to get a dump of the RPC task queue by writing a value to
/proc/sys/sunrpc/rpc_debug. If you write any value to that file, you get
a dump of the RPC client task list into the log buffer. This is a rather
inconvenient interface however, and makes it hard to get immediate info
about the task queue.
Add a new directory hierarchy under debugfs:
sunrpc/
rpc_clnt/
<clientid>/
Within each clientid directory we create a new "tasks" file that will
dump info similar to what shows up in the log buffer, but with a few
small differences -- we avoid printing raw kernel addresses in favor of
symbolic names and the XID is also displayed.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 26 Nov 2014 22:37:13 +0000 (17:37 -0500)]
Merge tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next
Pull NFS client RDMA changes for 3.19 from Anna Schumaker:
"NFS: Client side changes for RDMA
These patches various bugfixes and cleanups for using NFS over RDMA, including
better error handling and performance improvements by using pad optimization.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"
* tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
xprtrdma: Display async errors
xprtrdma: Enable pad optimization
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: Refactor tasklet scheduling
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: Cap req_cqinit
xprtrdma: Return an errno from rpcrdma_register_external()
Trond Myklebust [Wed, 26 Nov 2014 22:27:43 +0000 (17:27 -0500)]
Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next
Pull pull additional NFS client changes for 3.19 from Anna Schumaker:
"NFS: Generic client side changes from Chuck
These patches fixes for iostats and SETCLIENTID in addition to cleaning
up the nfs4_init_callback() function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"
* tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
NFS: Clean up nfs4_init_callback()
NFS: SETCLIENTID XDR buffer sizes are incorrect
SUNRPC: serialize iostats updates
Anna Schumaker [Tue, 25 Nov 2014 18:18:16 +0000 (13:18 -0500)]
nfs: Add DEALLOCATE support
This patch adds support for using the NFS v4.2 operation DEALLOCATE to
punch holes in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Anna Schumaker [Tue, 25 Nov 2014 18:18:15 +0000 (13:18 -0500)]
nfs: Add ALLOCATE support
This patch adds support for using the NFS v4.2 operation ALLOCATE to
preallocate data in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Chuck Lever [Sun, 9 Nov 2014 01:15:26 +0000 (20:15 -0500)]
NFS: Clean up nfs4_init_callback()
nfs4_init_callback() is never invoked for NFS versions other than 4.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:15:18 +0000 (20:15 -0500)]
NFS: SETCLIENTID XDR buffer sizes are incorrect
Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.
The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).
Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:15:09 +0000 (20:15 -0500)]
SUNRPC: serialize iostats updates
Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.
Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:15:01 +0000 (20:15 -0500)]
xprtrdma: Display async errors
An async error upcall is a hard error, and should be reported in
the system log.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:53 +0000 (20:14 -0500)]
xprtrdma: Enable pad optimization
The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
pad optimization was enabled. That bug was fixed by commit
e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
it").
We can now enable pad optimization on the client, which helps
performance and is supported now by both Linux and Solaris servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:45 +0000 (20:14 -0500)]
xprtrdma: Re-write rpcrdma_flush_cqs()
Currently rpcrdma_flush_cqs() attempts to avoid code duplication,
and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.
1. rpcrdma_flush_cqs() can run concurrently with provider upcalls.
Both flush_cqs() and the upcalls were invoking ib_poll_cq() in
different threads using the same wc buffers (ep->rep_recv_wcs
and ep->rep_send_wcs), added by commit
1c00dd077654 ("xprtrmda:
Reduce calls to ib_poll_cq() in completion handlers").
During transport disconnect processing, this sometimes resulted
in the same reply getting added to the rpcrdma_tasklets_g list
more than once, which corrupted the list.
2. The upcall functions drain only a limited number of CQEs,
thanks to the poll budget added by commit
8301a2c047cc
("xprtrdma: Limit work done by completion handler").
Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:37 +0000 (20:14 -0500)]
xprtrdma: Refactor tasklet scheduling
Restore the separate function that schedules the reply handling
tasklet. I need to call it from two different paths.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:29 +0000 (20:14 -0500)]
xprtrdma: unmap all FMRs during transport disconnect
When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.
Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:20 +0000 (20:14 -0500)]
xprtrdma: Cap req_cqinit
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.
Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.
The number of Work Requests that are allowed to remain unsignaled is
determined by the value of req_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.
For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).
For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.
Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.
I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.
Cap the rep_cqinit setting so completions are not left turned off
for too long.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sun, 9 Nov 2014 01:14:12 +0000 (20:14 -0500)]
xprtrdma: Return an errno from rpcrdma_register_external()
The RPC/RDMA send_request method and the chunk registration code
expects an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Li RongQing [Sun, 23 Nov 2014 04:47:41 +0000 (12:47 +0800)]
nfs: define nfs_inc_fscache_stats and using it as possible
Define and use nfs_inc_fscache_stats when plus one, which can save to
pass one parameter.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Li RongQing [Sun, 23 Nov 2014 04:47:17 +0000 (12:47 +0800)]
nfs: replace nfs_add_stats with nfs_inc_stats when add one
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Markus Elfring [Tue, 18 Nov 2014 12:23:43 +0000 (13:23 +0100)]
NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
The nfs_put_client() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Mon, 17 Nov 2014 21:58:05 +0000 (16:58 -0500)]
sunrpc: eliminate RPC_TRACEPOINTS
It's always set to the same value as CONFIG_TRACEPOINTS, so we can just
use that instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Mon, 17 Nov 2014 21:58:04 +0000 (16:58 -0500)]
sunrpc: eliminate RPC_DEBUG
It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Mon, 17 Nov 2014 21:58:03 +0000 (16:58 -0500)]
lockd: eliminate LOCKD_DEBUG
LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can
just use it instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Peng Tao [Mon, 17 Nov 2014 03:05:17 +0000 (11:05 +0800)]
nfs41: fix nfs4_proc_layoutget error handling
nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Fixes: a47970ff78147 ("NFSv4.1: Hold reference to layout hdr in layoutget")
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Weston Andros Adamson [Wed, 12 Nov 2014 17:08:00 +0000 (12:08 -0500)]
NFS: fix subtle change in COMMIT behavior
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.
This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Christoph Hellwig [Mon, 24 Nov 2014 21:47:02 +0000 (16:47 -0500)]
pnfs/blocklayout: fix end calculation in pnfs_num_cont_bytes
Use the number of pages in the pagecache mapping instead of the
number of pnfs requests which is only slightly related.
Reported-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Tue, 28 Oct 2014 18:24:14 +0000 (14:24 -0400)]
sunrpc: add tracepoints in xs_tcp_data_recv
Add tracepoints inside the main loop on xs_tcp_data_recv that allow
us to keep an eye on what's happening during each phase of it.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Tue, 28 Oct 2014 18:24:13 +0000 (14:24 -0400)]
sunrpc: add new tracepoints in xprt handling code
...so we can keep track of when calls are sent and replies received.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff Layton [Tue, 28 Oct 2014 18:24:12 +0000 (14:24 -0400)]
sunrpc: add some tracepoints in svc_rqst handling functions
...just around svc_send, svc_recv and svc_process for now.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Anna Schumaker [Thu, 23 Oct 2014 18:00:54 +0000 (14:00 -0400)]
NFS: Use nfs_server_capable() for checknig NFS_CAP_SEEK
This should make the code easier to maintain in the future.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jan Kara [Thu, 23 Oct 2014 15:56:11 +0000 (17:56 +0200)]
nfs: Remove dead case from nfs4_map_errors()
NFS4ERR_ACCESS has number 13 and thus is matched and returned
immediately at the beginning of nfs4_map_errors() and there's no point
in checking it later.
Coverity-id: 733891
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Linus Torvalds [Sun, 23 Nov 2014 23:25:20 +0000 (15:25 -0800)]
Linux 3.18-rc6
Andy Lutomirski [Fri, 21 Nov 2014 21:26:07 +0000 (13:26 -0800)]
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.
Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 23 Nov 2014 22:04:52 +0000 (23:04 +0100)]
sched: Provide update_curr callbacks for stop/idle scheduling classes
Chris bisected a NULL pointer deference in task_sched_runtime() to
commit
6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency'.
Chris observed crashes in atop or other /proc walking programs when he
started fork bombs on his machine. He assumed that this is a new exit
race, but that does not make any sense when looking at that commit.
What's interesting is that, the commit provides update_curr callbacks
for all scheduling classes except stop_task and idle_task.
While nothing can ever hit that via the clock_nanosleep() and
clock_gettime() interfaces, which have been the target of the commit in
question, the author obviously forgot that there are other code paths
which invoke task_sched_runtime()
do_task_stat(()
thread_group_cputime_adjusted()
thread_group_cputime()
task_cputime()
task_sched_runtime()
if (task_current(rq, p) && task_on_rq_queued(p)) {
update_rq_clock(rq);
up->sched_class->update_curr(rq);
}
If the stats are read for a stomp machine task, aka 'migration/N' and
that task is current on its cpu, this will happily call the NULL pointer
of stop_task->update_curr. Ooops.
Chris observation that this happens faster when he runs the fork bomb
makes sense as the fork bomb will kick migration threads more often so
the probability to hit the issue will increase.
Add the missing update_curr callbacks to the scheduler classes stop_task
and idle_task. While idle tasks cannot be monitored via /proc we have
other means to hit the idle case.
Fixes: 6e998916dfe3 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
Reported-by: Chris Mason <clm@fb.com>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Nov 2014 21:56:55 +0000 (13:56 -0800)]
Merge branch 'x86-traps' (trap handling from Andy Lutomirski)
Merge x86-64 iret fixes from Andy Lutomirski:
"This addresses the following issues:
- an unrecoverable double-fault triggerable with modify_ldt.
- invalid stack usage in espfix64 failed IRET recovery from IST
context.
- invalid stack usage in non-espfix64 failed IRET recovery from IST
context.
It also makes a good but IMO scary change: non-espfix64 failed IRET
will now report the correct error. Hopefully nothing depended on the
old incorrect behavior, but maybe Wine will get confused in some
obscure corner case"
* emailed patches from Andy Lutomirski <luto@amacapital.net>:
x86_64, traps: Rework bad_iret
x86_64, traps: Stop using IST for #SS
x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
Andy Lutomirski [Sun, 23 Nov 2014 02:00:33 +0000 (18:00 -0800)]
x86_64, traps: Rework bad_iret
It's possible for iretq to userspace to fail. This can happen because
of a bad CS, SS, or RIP.
Historically, we've handled it by fixing up an exception from iretq to
land at bad_iret, which pretends that the failed iret frame was really
the hardware part of #GP(0) from userspace. To make this work, there's
an extra fixup to fudge the gs base into a usable state.
This is suboptimal because it loses the original exception. It's also
buggy because there's no guarantee that we were on the kernel stack to
begin with. For example, if the failing iret happened on return from an
NMI, then we'll end up executing general_protection on the NMI stack.
This is bad for several reasons, the most immediate of which is that
general_protection, as a non-paranoid idtentry, will try to deliver
signals and/or schedule from the wrong stack.
This patch throws out bad_iret entirely. As a replacement, it augments
the existing swapgs fudge into a full-blown iret fixup, mostly written
in C. It's should be clearer and more correct.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski [Sun, 23 Nov 2014 02:00:32 +0000 (18:00 -0800)]
x86_64, traps: Stop using IST for #SS
On a 32-bit kernel, this has no effect, since there are no IST stacks.
On a 64-bit kernel, #SS can only happen in user code, on a failed iret
to user space, a canonical violation on access via RSP or RBP, or a
genuine stack segment violation in 32-bit kernel code. The first two
cases don't need IST, and the latter two cases are unlikely fatal bugs,
and promoting them to double faults would be fine.
This fixes a bug in which the espfix64 code mishandles a stack segment
violation.
This saves 4k of memory per CPU and a tiny bit of code.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski [Sun, 23 Nov 2014 02:00:31 +0000 (18:00 -0800)]
x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
There's nothing special enough about the espfix64 double fault fixup to
justify writing it in assembly. Move it to C.
This also fixes a bug: if the double fault came from an IST stack, the
old asm code would return to a partially uninitialized stack frame.
Fixes: 3891a04aafd668686239349ea58f3314ea2af86b
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 23 Nov 2014 19:46:01 +0000 (11:46 -0800)]
Merge tag 'armsoc-for-rc6' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A collection of fixes this week:
- A set of clock fixes for shmobile platforms
- A fix for tegra that moves serial port labels to be per board.
We're choosing to merge this for 3.18 because the labels will start
being parsed in 3.19, and without this change serial port numbers
that used to be stable since the dawn of time will change numbers.
- A few other DT tweaks for Tegra.
- A fix for multi_v7_defconfig that makes it stop spewing cpufreq
errors on Arndale (Exynos)"
* tag 'armsoc-for-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: multi_v7_defconfig: fix failure setting CPU voltage by enabling dependent I2C controller
ARM: tegra: roth: Fix SD card VDD_IO regulator
ARM: tegra: Remove eMMC vmmc property for roth/tn7
ARM: dts: tegra: move serial aliases to per-board
ARM: tegra: Add serial port labels to Tegra124 DT
ARM: shmobile: kzm9g legacy: Set i2c clks_per_count to 2
ARM: shmobile: r8a7740 dtsi: Correct IIC0 parent clock
ARM: shmobile: r8a7790: Fix SD3CKCR address to device tree
ARM: shmobile: r8a7740 legacy: Correct IIC0 parent clock
ARM: shmobile: r8a7740 legacy: Add missing INTCA clock for irqpin module
ARM: shmobile: r8a7790: Fix SD3CKCR address
ARM: dts: sun6i: Re-parent ahb1_mux to pll6 as required by dma controller
Linus Torvalds [Sun, 23 Nov 2014 19:33:49 +0000 (11:33 -0800)]
Merge branch 'for-3.18-fixes' of git://git./linux/kernel/git/tj/percpu
Pull percpu fix from Tejun Heo:
"This contains one patch to fix a race condition which can lead to
percpu_ref using a percpu pointer which is corrupted with a set DEAD
bit. The bug was introduced while separating out the ATOMIC mode flag
from the DEAD flag. The fix is pretty straight forward.
I just committed the patch to the percpu tree but am sending out the
pull request early as I'll be on vacation for a week. The patch
should be fairly safe and while the latency will be higher I'll be
checking emails"
* 'for-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu-ref: fix DEAD flag contamination of percpu pointer
Linus Torvalds [Sun, 23 Nov 2014 19:16:36 +0000 (11:16 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs deadlock fix from Chris Mason:
"This has a fix for a long standing deadlock that we've been trying to
nail down for a while. It ended up being a bad interaction with the
fair reader/writer locks and the order btrfs reacquires locks in the
btree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix lockups from btrfs_clear_path_blocking
Tejun Heo [Sat, 22 Nov 2014 14:22:42 +0000 (09:22 -0500)]
percpu-ref: fix DEAD flag contamination of percpu pointer
While decoupling ATOMIC and DEAD flags,
f47ad4578461 ("percpu_ref:
decouple switching to percpu mode and reinit") updated
__ref_is_percpu() so that it only tests ATOMIC flag to determine
whether the ref is in percpu mode or not; however, while DEAD implies
ATOMIC, the two flags are set separately during percpu_ref_kill() and
if __ref_is_percpu() races percpu_ref_kill(), it may see DEAD w/o
ATOMIC. Because __ref_is_percpu() returns @ref->percpu_count_ptr
value verbatim as the percpu pointer after testing ATOMIC, the pointer
may now be contaminated with the DEAD flag.
This can be fixed by clearing the flag bits before returning the
pointer which was the fix proposed by Shaohua; however, as DEAD
implies ATOMIC, we can just test for both flags at once and avoid the
explicit masking.
Update __ref_is_percpu() so that it tests that both ATOMIC and DEAD
are clear before returning @ref->percpu_count_ptr as the percpu
pointer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Reviewed-by: Shaohua Li <shli@kernel.org>
Link: http://lkml.kernel.org/r/995deb699f5b873c45d667df4add3b06f73c2c25.1416638887.git.shli@kernel.org
Fixes: f47ad4578461 ("percpu_ref: decouple switching to percpu mode and reinit")
Linus Torvalds [Sat, 22 Nov 2014 22:33:11 +0000 (14:33 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single bugfix for an init order problem in the sun4i subarch
clockevents code"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevent: sun4i: Fix race condition in the probe code
Linus Torvalds [Sat, 22 Nov 2014 22:15:27 +0000 (14:15 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Assorted fixes, most in overlayfs land"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: ovl_dir_fsync() cleanup
ovl: update MAINTAINERS
ovl: pass dentry into ovl_dir_read_merged()
ovl: use lockless_dereference() for upperdentry
ovl: allow filenames with comma
ovl: fix race in private xattr checks
ovl: fix remove/copy-up race
ovl: rename filesystem type to "overlay"
isofs: avoid unused function warning
vfs: fix reference leak in d_prune_aliases()
Linus Torvalds [Sat, 22 Nov 2014 01:20:36 +0000 (17:20 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix BUG when decrypting empty packets in mac80211, from Ronald Wahl.
2) nf_nat_range is not fully initialized and this is copied back to
userspace, from Daniel Borkmann.
3) Fix read past end of b uffer in netfilter ipset, also from Dan
Carpenter.
4) Signed integer overflow in ipv4 address mask creation helper
inet_make_mask(), from Vincent BENAYOUN.
5) VXLAN, be2net, mlx4_en, and qlcnic need ->ndo_gso_check() methods to
properly describe the device's capabilities, from Joe Stringer.
6) Fix memory leaks and checksum miscalculations in openvswitch, from
Pravin B SHelar and Jesse Gross.
7) FIB rules passes back ambiguous error code for unreachable routes,
making behavior confusing for userspace. Fix from Panu Matilainen.
8) ieee802154fake_probe() doesn't release resources properly on error,
from Alexey Khoroshilov.
9) Fix skb_over_panic in add_grhead(), from Daniel Borkmann.
10) Fix access of stale slave pointers in bonding code, from Nikolay
Aleksandrov.
11) Fix stack info leak in PPP pptp code, from Mathias Krause.
12) Cure locking bug in IPX stack, from Jiri Bohac.
13) Revert SKB fclone memory freeing optimization that is racey and can
allow accesses to freed up memory, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (71 commits)
tcp: Restore RFC5961-compliant behavior for SYN packets
net: Revert "net: avoid one atomic operation in skb_clone()"
virtio-net: validate features during probe
cxgb4 : Fix DCB priority groups being returned in wrong order
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
openvswitch: Don't validate IPv6 label masks.
pptp: fix stack info leak in pptp_getname()
brcmfmac: don't include linux/unaligned/access_ok.h
cxgb4i : Don't block unload/cxgb4 unload when remote closes TCP connection
ipv6: delete protocol and unregister rtnetlink when cleanup
net/mlx4_en: Add VXLAN ndo calls to the PF net device ops too
bonding: fix curr_active_slave/carrier with loadbalance arp monitoring
mac80211: minstrel_ht: fix a crash in rate sorting
vxlan: Inline vxlan_gso_check().
can: m_can: update to support CAN FD features
can: m_can: fix incorrect error messages
can: m_can: add missing delay after setting CCCR_INIT bit
can: m_can: fix not set can_dlc for remote frame
can: m_can: fix possible sleep in napi poll
can: m_can: add missing message RAM initialization
...
Linus Torvalds [Sat, 22 Nov 2014 01:15:28 +0000 (17:15 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Just two radeon and two intel fixes: endian and regression fixes"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: fix endian swapping in vbios fetch for tdp table
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
drm/i915: Kick fbdev before vgacon
drm/i915: drop WaSetupGtModeTdRowDispatch:snb
Linus Torvalds [Sat, 22 Nov 2014 01:11:56 +0000 (17:11 -0800)]
Merge tag 'sound-3.18-rc6' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This batch ended up as a relatively high volume due to pending ASoC
fixes. But most of fixes there are trivial and/or device- specific
fixes and quirks, so safe to apply. The only (ASoC) core fixes are
the DPCM race fix and the machine-driver matching fix for
componentization"
* tag 'sound-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - fix the mic mute led problem for Latitude E5550
ALSA: hda - move DELL_WMI_MIC_MUTE_LED to the tail in the quirk chain
ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use
ALSA: usb-audio: Set the Control Selector to SU_SELECTOR_CONTROL for UAC2
ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices
ASoC: sgtl5000: Fix SMALL_POP bit definition
ASoC: cs42l51: re-hook of_match_table pointer
ASoC: rt5670: change dapm routes of PLL connection
ASoC: rt5670: correct the incorrect default values
ASoC: samsung: Add MODULE_DEVICE_TABLE for Snow
ASoC: max98090: Correct pclk divisor settings
ASoC: dpcm: Fix race between FE/BE updates and trigger
ASoC: Fix snd_soc_find_dai() matching component by name
ASoC: rsnd: remove unsupported PAUSE flag
ASoC: fsi: remove unsupported PAUSE flag
ASoC: rt5645: Mark RT5645_TDM_CTRL_3 as readable
ASoC: rockchip-i2s: fix infinite loop in rockchip_snd_rxctrl
ASoC: es8328-i2c: Fix i2c_device_id name field in es8328_id
ASoC: fsl_asrc: Add reg_defaults for regmap to fix kernel dump
Linus Torvalds [Sat, 22 Nov 2014 00:56:25 +0000 (16:56 -0800)]
Merge tag 'pm+acpi-3.18-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI power management fix from Rafael Wysocki:
"This is just a one-liner fixing a regression introduced in 3.13 that
broke system suspend on some Chromebooks.
On those machines there are ACPI device objects for some I2C devices
that can wake up the system from sleep states, but that is done via a
platform-specific mechanism and the ACPI objects don't contain any
wakeup-related information. When we started to use ACPI power
management with those devices (which happened during the 3.13 cycle),
their configuration confused the ACPI PM layer that returned error
codes from suspend callbacks for them causing system suspend to fail.
However, the ACPI PM layer can safely ignore the wakeup setting from a
device driver if the ACPI object corresponding to the device in
question doesn't contain wakeup information in which case the driver
itself is responsible for setting up the device for system wakeup"
* tag 'pm+acpi-3.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Ignore wakeup setting if the ACPI companion can't wake up
Linus Torvalds [Sat, 22 Nov 2014 00:40:41 +0000 (16:40 -0800)]
Merge tag 'devicetree-fixes-for-3.18' of git://git./linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
"DeviceTree fixes for 3.18:
- two fixes for OF selftest code
- fix for PowerPC address parsing to disable work-around except on
old PowerMACs
- fix a crash when earlycon is enabled, but no device is found
- DT documentation fixes and missing vendor prefixes
All but the doc updates are also for stable"
* tag 'devicetree-fixes-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/selftest: Fix testing when /aliases is missing
of/selftest: Fix off-by-one error in removal path
documentation: pinctrl bindings: Fix trivial typo 'abitrary'
devicetree: bindings: Add vendor prefix for Micron Technology, Inc.
of: Add vendor prefix for Chips&Media, Inc.
of/base: Fix PowerPC address parsing hack
devicetree: vendor-prefixes.txt: fix whitespace
of: Fix crash if an earlycon driver is not found
of/irq: Drop obsolete 'interrupts' vs 'interrupts-extended' text
of: Spelling s/stucture/structure/
devicetree: bindings: add sandisk to the vendor prefixes
Linus Torvalds [Sat, 22 Nov 2014 00:36:42 +0000 (16:36 -0800)]
Merge tag 'pci-v3.18-fixes-3' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"These are fixes for an issue with 64-bit PCI bus addresses on 32-bit
PAE kernels, an APM X-Gene problem (it depended on a generic change we
removed before merging), a fix for my hotplug device configuration
changes, and a devicetree documentation update.
Resource management:
- Support 64-bit bridge windows if we have 64-bit dma_addr_t (Yinghai Lu)
PCI device hotplug:
- Apply _HPX Link Control settings to all devices with a link (Yinghai Lu)
Generic host bridge driver:
- Add DT binding for "linux,pci-domain" property (Lucas Stach)
APM X-Gene:
- Assign resources to bus before adding new devices (Duc Dang)"
* tag 'pci-v3.18-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: Support 64-bit bridge windows if we have 64-bit dma_addr_t
PCI: Apply _HPX Link Control settings to all devices with a link
PCI: Add missing DT binding for "linux,pci-domain" property
PCI: xgene: Assign resources to bus before adding new devices
Linus Torvalds [Sat, 22 Nov 2014 00:28:45 +0000 (16:28 -0800)]
Merge git://git./linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
"Here are the target-pending fixes queued for v3.18-rc6.
The highlights include:
- target-core OOPs fix with tcm_qla2xxx + vxworks FC initiators +
zero length SCSI commands having a transfer direction set. (Roland
+ Craig Watson)
- vhost-scsi OOPs fix to explicitly prevent WWPN endpoint configfs
group removal while qemu still has an active reference. (Paolo +
nab)
- ib_srpt fix for RDMA hardware with lower srp_sq_size limits.
(Bart)
- two ib_isert work-arounds for running on ocrdma hardware (Or + Sagi
+ Chris)
- iscsi-target discovery portal typo + SPC-3 PR Preempt SA key
matching fix (Steve)"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
IB/isert: Adjust CQ size to HW limits
target: return CONFLICT only when SA key unmatched
iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
ib_isert: Add max_send_sge=2 minimum for control PDU responses
srp-target: Retry when QP creation fails with ENOMEM
iscsi-target: return the correct port in SendTargets
vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
target: Don't call TFO->write_pending if data_length == 0
Linus Torvalds [Sat, 22 Nov 2014 00:24:27 +0000 (16:24 -0800)]
Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"We have couple of fixes for dmaengine queued up:
- dma mempcy fix for dma configuration of sun6i by Maxime
- pl330 fixes: First the fixing allocation for data buffers by Liviu
and then Jon's fixe for fifo width and usage"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: Fix allocation size for PL330 data buffer depth.
dmaengine: pl330: Limit MFIFO usage for memcpy to avoid exhausting entries
dmaengine: pl330: Align DMA memcpy operations to MFIFO width
dmaengine: sun6i: Fix memcpy operation
Linus Torvalds [Sat, 22 Nov 2014 00:14:58 +0000 (16:14 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"More 3.18 fixes for MIPS:
- backtraces were not quite working on on 64-bit kernels
- loongson needs a different cache coherency setting
- Loongson 3 is a MIPS64 R2 version but due to erratum we treat is an
older architecture revision.
- fix build errors due to undefined references to __node_distances
for certain configurations.
- fix instruction decodig in the jump label code.
- for certain configurations copy_{from,to}_user destroy the content
of $3 so that register needs to be marked as clobbed by the calling
code.
- Hardware Table Walker fixes.
- fill the delay slot of the last instruction of memcpy otherwise
whatever ends up there randomly might have undesirable effects.
- ensure get_user/__get_user always zero the variable to be read even
in case of an error"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: jump_label.c: Handle the microMIPS J instruction encoding
MIPS: jump_label.c: Correct the span of the J instruction
MIPS: Zero variable read by get_user / __get_user in case of an error.
MIPS: lib: memcpy: Restore NOP on delay slot before returning to caller
MIPS: tlb-r4k: Add missing HTW stop/start sequences
MIPS: asm: uaccess: Add v1 register to clobber list on EVA
MIPS: oprofile: Fix backtrace on 64-bit kernel
MIPS: Loongson: Set Loongson-3's ISA level to MIPS64R1
MIPS: Loongson: Fix the write-combine CCA value setting
MIPS: IP27: Fix __node_distances undefined error
MIPS: Loongson3: Fix __node_distances undefined error
Linus Torvalds [Sat, 22 Nov 2014 00:13:34 +0000 (16:13 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mpe/linux
Pull powerpc fix from Michael Ellerman:
"One fix from Scott, he says:
This patch fixes a crash (introduced in v3.18-rc1) in the FSL MSI driver
when threaded IRQs are enabled"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/fsl_msi: mark the msi cascade handler IRQF_NO_THREAD
Linus Torvalds [Fri, 21 Nov 2014 23:46:17 +0000 (15:46 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Misc fixes:
- gold linker build fix
- noxsave command line parsing fix
- bugfix for NX setup
- microcode resume path bug fix
- _TIF_NOHZ versus TIF_NOHZ bugfix as discussed in the mysterious
lockup thread"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
x86, kaslr: Handle Gold linker for finding bss/brk
x86, mm: Set NX across entire PMD at boot
x86, microcode: Update BSPs microcode on resume
x86: Require exact match for 'noxsave' command line option
Linus Torvalds [Fri, 21 Nov 2014 23:44:54 +0000 (15:44 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: two NUMA fixes, two cputime fixes and an RCU/lockdep fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
sched/cputime: Fix cpu_timer_sample_group() double accounting
sched/numa: Avoid selecting oneself as swap target
sched/numa: Fix out of bounds read in sched_init_numa()
sched: Remove lockdep check in sched_move_task()
Linus Torvalds [Fri, 21 Nov 2014 23:44:07 +0000 (15:44 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes: two Intel uncore driver fixes, a CPU-hotplug fix and a
build dependencies fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix boot crash on SBOX PMU on Haswell-EP
perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
perf: Fix corruption of sibling list with hotplug
perf/x86: Fix embarrasing typo
Linus Torvalds [Fri, 21 Nov 2014 23:38:21 +0000 (15:38 -0800)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull core fix from Ingo Molnar:
"Fix GENMASK macro shift overflow"
Nobody seems to currently use GENMASK() to fill every single last bit
(which is what overflows) in-tree, and gcc would warn about it, so we
have that going for us. But apparently there are pending changes that
want this.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bitops: Fix shift overflow in GENMASK macros
Calvin Owens [Thu, 20 Nov 2014 23:09:53 +0000 (15:09 -0800)]
tcp: Restore RFC5961-compliant behavior for SYN packets
Commit
c3ae62af8e755 ("tcp: should drop incoming frames without ACK
flag set") was created to mitigate a security vulnerability in which a
local attacker is able to inject data into locally-opened sockets by
using TCP protocol statistics in procfs to quickly find the correct
sequence number.
This broke the RFC5961 requirement to send a challenge ACK in response
to spurious RST packets, which was subsequently fixed by commit
7b514a886ba50 ("tcp: accept RST without ACK flag").
Unfortunately, the RFC5961 requirement that spurious SYN packets be
handled in a similar manner remains broken.
RFC5961 section 4 states that:
... the handling of the SYN in the synchronized state SHOULD be
performed as follows:
1) If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
After sending the acknowledgment, TCP MUST drop the unacceptable
segment and stop processing further.
By sending an ACK, the remote peer is challenged to confirm the loss
of the previous connection and the request to start a new connection.
A legitimate peer, after restart, would not have a TCB in the
synchronized state. Thus, when the ACK arrives, the peer should send
a RST segment back with the sequence number derived from the ACK
field that caused the RST.
This RST will confirm that the remote peer has indeed closed the
previous connection. Upon receipt of a valid RST, the local TCP
endpoint MUST terminate its connection. The local TCP endpoint
should then rely on SYN retransmission from the remote end to
re-establish the connection.
This patch lets SYN packets through the discard added in
c3ae62af8e755,
so that spurious SYN packets are properly dealt with as per the RFC.
The challenge ACK is sent unconditionally and is rate-limited, so the
original vulnerability is not reintroduced by this patch.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 21 Nov 2014 19:47:16 +0000 (11:47 -0800)]
net: Revert "net: avoid one atomic operation in skb_clone()"
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.
By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.
We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.
Reported-by: Chris Mason <clm@fb.com>
Fixes: ce1a4ea3f1258 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro [Fri, 21 Nov 2014 16:51:08 +0000 (11:51 -0500)]
Merge branch 'overlayfs-current' of git://git./linux/kernel/git/mszeredi/vfs into for-linus
"The biggest change is to rename the filesystem from "overlayfs" to "overlay".
This will allow legacy overlayfs to be easily carried by distros alongside the
new mainline one. Also fix a couple of copy-up races and allow escaping comma
character in filenames."
The last bit is about commas in pathname mount options...
Jason Wang [Thu, 20 Nov 2014 09:03:05 +0000 (17:03 +0800)]
virtio-net: validate features during probe
We currently trigger BUG when VIRTIO_NET_F_CTRL_VQ
is not set but one of features depending on it is.
That's not a friendly way to report errors to
hypervisors.
Let's check, and fail probe instead.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 21 Nov 2014 05:12:39 +0000 (00:12 -0500)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two bugfixes for your net tree, they are:
1) Validate netlink group from nfnetlink to avoid an out of bound array
access. This should only happen with superuser priviledges though.
Discovered by Andrey Ryabinin using trinity.
2) Don't push ethernet header before calling the netfilter output hook
for multicast traffic, this breaks ebtables since it expects to see
skb->data pointing to the network header, patch from Linus Luessing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 21 Nov 2014 05:07:51 +0000 (00:07 -0500)]
Merge tag 'master-2014-11-20' of git://git./linux/kernel/git/linville/wireless
John W. Linville says:
====================
pull request: wireless 2014-11-20
Please full this little batch of fixes intended for the 3.18 stream!
For the mac80211 patch, Johannes says:
"Here's another last minute fix, for minstrel HT crashing
depending on the value of some uninitialised stack."
On top of that...
Ben Greear fixes an ath9k regression in which a BSSID mask is
miscalculated.
Dmitry Torokhov corrects an error handling routing in brcmfmac which
was checking an unsigned variable for a negative value.
Johannes Berg avoids a build problem in brcmfmac for arches where
linux/unaligned/access_ok.h and asm/unaligned.h conflict.
Mathy Vanhoef addresses another brcmfmac issue so as to eliminate a
use-after-free of the URB transfer buffer if a timeout occurs.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Anish Bhatt [Fri, 21 Nov 2014 01:11:46 +0000 (17:11 -0800)]
cxgb4 : Fix DCB priority groups being returned in wrong order
Peer priority groups were being reversed, but this was missed in the previous
fix sent out for this issue.
v2 : Previous patch was doing extra unnecessary work, result is the same.
Please ignore previous patch
Fixes :
ee7bc3cdc270 ('cxgb4 : dcb open-lldp interop fixes')
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Bohac [Wed, 19 Nov 2014 22:05:49 +0000 (23:05 +0100)]
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
This fixes an old regression introduced by commit
b0d0d915 (ipx: remove the BKL).
When a recvmsg syscall blocks waiting for new data, no data can be sent on the
same socket with sendmsg because ipx_recvmsg() sleeps with the socket locked.
This breaks mars-nwe (NetWare emulator):
- the ncpserv process reads the request using recvmsg
- ncpserv forks and spawns nwconn
- ncpserv calls a (blocking) recvmsg and waits for new requests
- nwconn deadlocks in sendmsg on the same socket
Commit
b0d0d915 has simply replaced BKL locking with
lock_sock/release_sock. Unlike now, BKL got unlocked while
sleeping, so a blocking recvmsg did not block a concurrent
sendmsg.
Only keep the socket locked while actually working with the socket data and
release it prior to calling skb_recv_datagram().
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Stringer [Wed, 19 Nov 2014 21:54:49 +0000 (13:54 -0800)]
openvswitch: Don't validate IPv6 label masks.
When userspace doesn't provide a mask, OVS datapath generates a fully
unwildcarded mask for the flow by copying the flow and setting all bits
in all fields. For IPv6 label, this creates a mask that matches on the
upper 12 bits, causing the following error:
openvswitch: netlink: Invalid IPv6 flow label value (value=
ffffffff, max=fffff)
This patch ignores the label validation check for masks, avoiding this
error.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathias Krause [Wed, 19 Nov 2014 17:05:26 +0000 (18:05 +0100)]
pptp: fix stack info leak in pptp_getname()
pptp_getname() only partially initializes the stack variable sa,
particularly only fills the pptp part of the sa_addr union. The code
thereby discloses 16 bytes of kernel stack memory via getsockname().
Fix this by memset(0)'ing the union before.
Cc: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Fri, 21 Nov 2014 02:19:19 +0000 (12:19 +1000)]
Merge branch 'drm-fixes-3.18' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
fix one regression and one endian issue.
* 'drm-fixes-3.18' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix endian swapping in vbios fetch for tdp table
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
Andy Lutomirski [Wed, 19 Nov 2014 21:56:19 +0000 (13:56 -0800)]
x86, syscall: Fix _TIF_NOHZ handling in syscall_trace_enter_phase1
TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).
This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/a6cd3b60a3f53afb6e1c8081b0ec30ff19003dd7.1416434075.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Johannes Berg [Wed, 19 Nov 2014 21:13:10 +0000 (22:13 +0100)]
brcmfmac: don't include linux/unaligned/access_ok.h
This is a specific implementation, <asm/unaligned.h> is the
multiplexer that has the arch-specific knowledge of which
of the implementations needs to be used, so include that.
This issue was revealed by kbuild testing
when <asm/unaligned.h> was added in <linux/ieee80211.h>
resulting in redefinition of get_unaligned_be16 (and
probably others).
Cc: stable@vger.kernel.org # v3.17
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Alex Deucher [Thu, 13 Nov 2014 00:17:02 +0000 (19:17 -0500)]
drm/radeon: fix endian swapping in vbios fetch for tdp table
Value needs to be swapped on BE.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Alex Deucher [Wed, 19 Nov 2014 18:12:54 +0000 (13:12 -0500)]
drm/radeon: disable native backlight control on pre-r6xx asics (v2)
Just use the acpi interface. That's what windows uses on this
generation and it's the only thing that seems to work reliably
on these generation parts.
You can still force the native backlight interface by setting
radeon.backlight=1
Bug:
https://bugzilla.kernel.org/show_bug.cgi?id=88501
v2: merge into above if/else block
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Miklos Szeredi [Thu, 20 Nov 2014 15:40:02 +0000 (16:40 +0100)]
ovl: ovl_dir_fsync() cleanup
Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE. For a copied up
directory the two are currently equivalent.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:40:01 +0000 (16:40 +0100)]
ovl: update MAINTAINERS
There's a union/overlay specific mailing list now. Also add a git tree.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:40:01 +0000 (16:40 +0100)]
ovl: pass dentry into ovl_dir_read_merged()
Pass dentry into ovl_dir_read_merged() insted of upperpath and lowerpath.
This cleans up callers and paves the way for multi-layer directory reads.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:40:01 +0000 (16:40 +0100)]
ovl: use lockless_dereference() for upperdentry
Don't open code lockless_dereference() in ovl_upperdentry_dereference().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:40:00 +0000 (16:40 +0100)]
ovl: allow filenames with comma
Allow option separator (comma) to be escaped with backslash.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:40:00 +0000 (16:40 +0100)]
ovl: fix race in private xattr checks
Xattr operations can race with copy up. This does not matter as long as
we consistently fiter out "trunsted.overlay.opaque" attribute on upper
directories.
Previously we checked parent against OVL_PATH_MERGE. This is too general,
and prone to race with copy-up. I.e. we found the parent to be on the
lower layer but ovl_dentry_real() would return the copied-up dentry,
possibly with the "opaque" attribute.
So instead use ovl_path_real() and decide to filter the attributes based on
the actual type of the dentry we'll use.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:39:59 +0000 (16:39 +0100)]
ovl: fix remove/copy-up race
ovl_remove_and_whiteout() needs to check if upper dentry exists or not
after having locked upper parent directory.
Previously we used a "type" value computed before locking the upper parent
directory, which is susceptible to racing with copy-up.
There's a similar check in ovl_check_empty_and_clear(). This one is not
actually racy, since copy-up doesn't change the "emptyness" property of a
directory. Add a comment to this effect, and check the existence of upper
dentry locally to make the code cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Miklos Szeredi [Thu, 20 Nov 2014 15:39:59 +0000 (16:39 +0100)]
ovl: rename filesystem type to "overlay"
Some distributions carry an "old" format of overlayfs while mainline has a
"new" format.
The distros will possibly want to keep the old overlayfs alongside the new
for compatibility reasons.
To make it possible to differentiate the two versions change the name of
the new one from "overlayfs" to "overlay".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Andy Whitcroft <apw@canonical.com>
Grant Likely [Wed, 19 Nov 2014 17:13:44 +0000 (17:13 +0000)]
of/selftest: Fix testing when /aliases is missing
The /aliases node isn't always present in the device tree, but the
unittest code assumes that /aliases is there. Add a check when inserting
the testcase data to see if of_aliases needs to be updated, and undo the
settings when the nodes are removed.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Gaurav Minocha <gaurav.minocha.os@gmail.com>
Cc: <stable@vger.kernel.org>
Chris Moore [Tue, 4 Nov 2014 16:28:29 +0000 (16:28 +0000)]
IB/isert: Adjust CQ size to HW limits
isert has an issue of trying to create a CQ with more CQEs than are
supported by the hardware, that currently results in failures during
isert_device creation during first session login.
This is the isert version of the patch that Minh Tran submitted for
iser, and is simple a workaround required to function with existing
ocrdma hardware.
Signed-off-by: Chris Moore <chris.moore@emulex.com>
Reviewied-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Dave Airlie [Thu, 20 Nov 2014 02:58:11 +0000 (12:58 +1000)]
Merge tag 'drm-intel-fixes-2014-11-19' of git://anongit.freedesktop.org/drm-intel into drm-fixes
two regression fixes.
* tag 'drm-intel-fixes-2014-11-19' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Kick fbdev before vgacon
drm/i915: drop WaSetupGtModeTdRowDispatch:snb
Rafael J. Wysocki [Wed, 19 Nov 2014 00:44:11 +0000 (01:44 +0100)]
ACPI / PM: Ignore wakeup setting if the ACPI companion can't wake up
As reported by Dmitry, on some Chromebooks there are devices with
corresponding ACPI objects and with unusual system wakeup
configuration. Namely, they technically are wakeup-capable, but the
wakeup is handled via a platform-specific out-of-band mechanism and
the ACPI PM layer has no information on the wakeup capability. As
a result, device_may_wakeup(dev) called from acpi_dev_suspend_late()
returns 'true' for those devices, but the wakeup.flags.valid flag is
unset for the corresponding ACPI device objects, so acpi_device_wakeup()
reproducibly fails for them causing acpi_dev_suspend_late() to return
an error code. The entire system suspend is then aborted and the
machines in question cannot suspend at all.
Address the problem by ignoring the device_may_wakeup(dev) return
value in acpi_dev_suspend_late() if the ACPI companion of the device
being handled has wakeup.flags.valid unset (in which case it is clear
that the wakeup is supposed to be handled by other means).
This fixes a regression introduced by commit
a76e9bd89ae7 (i2c:
attach/detach I2C client device to the ACPI power domain) as the
affected systems could suspend and resume successfully before that
commit.
Fixes: a76e9bd89ae7 (i2c: attach/detach I2C client device to the ACPI power domain)
Reported-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-by: Dmitry Torokhov <dtor@chromium.org>
Cc: 3.13+ <stable@vger.kernel.org> # 3.13+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Anish Bhatt [Wed, 19 Nov 2014 03:09:51 +0000 (19:09 -0800)]
cxgb4i : Don't block unload/cxgb4 unload when remote closes TCP connection
cxgb4i was returning wrong error and not releasing module reference if remote
end abruptly closed TCP connection. This prevents the cxgb4 network module from
being unloaded, further affecting other network drivers dependent on cxgb4
Sending to net as this affects all cxgb4 based network drivers.
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>