Kirill A. Shutemov [Thu, 21 Nov 2013 22:32:11 +0000 (14:32 -0800)]
mm: place page->pmd_huge_pte to right union
I don't know what went wrong, mis-merge or something, but ->pmd_huge_pte
placed in wrong union within struct page.
In original patch[1] it's placed to union with ->lru and ->slab, but in
commit
e009bb30c8df ("mm: implement split page table lock for PMD
level") it's in union with ->index and ->freelist.
That union seems also unused for pages with table tables and safe to
re-use, but it's not what I've tested.
Let's move it to original place. It fixes indentation at least. :)
[1] https://lkml.org/lkml/2013/10/7/288
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Haiyang Zhang [Thu, 21 Nov 2013 22:32:10 +0000 (14:32 -0800)]
MAINTAINERS: add keyboard driver to Hyper-V file list
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 21 Nov 2013 22:32:09 +0000 (14:32 -0800)]
x86, mm: do not leak page->ptl for pmd page tables
There are two code paths how page with pmd page table can be freed:
pmd_free() and pmd_free_tlb().
I've missed the second one and didn't add page table destructor call
there. It leads to leak of page->ptl for pmd page tables, if
dynamically allocated page->ptl is in use.
The patch adds the missed destructor and modifies documentation
accordingly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Andrey Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jesper Nilsson [Thu, 21 Nov 2013 22:32:08 +0000 (14:32 -0800)]
ipc,shm: correct error return value in shmctl (SHM_UNLOCK)
Commit
2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
restructured the ipc shm to shorten critical region, but introduced a
path where the return value could be -EPERM, even if the operation
actually was performed.
Before the commit, the err return value was reset by the return value
from security_shm_shmctl() after the if (!ns_capable(...)) statement.
Now, we still exit the if statement with err set to -EPERM, and in the
case of SHM_UNLOCK, it is not reset at all, and used as the return value
from shmctl.
To fix this, we only set err when errors occur, leaving the fallthrough
case alone.
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 21 Nov 2013 22:32:06 +0000 (14:32 -0800)]
mm, mempolicy: silence gcc warning
Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:
mm/mempolicy.c: In function 'mpol_to_str':
mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments
Kees says this is because he is using -Wformat-security.
Silence the warning.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Antti P Miettinen [Thu, 21 Nov 2013 22:32:05 +0000 (14:32 -0800)]
block/partitions/efi.c: fix bound check
Use ARRAY_SIZE instead of sizeof to get proper max for label length.
Since this is just a read out of bounds it's not that bad, but the
problem becomes user-visible eg if one tries to use DEBUG_PAGEALLOC and
DEBUG_RODATA, at least with some enhancements from Hiroshi. Of course
the destination array can contain garbage when we read beyond the end of
source array so that would be another user-visible problem.
Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com>
Reviewed-by: Hiroshi Doyu <hdoyu@nvidia.com>
Tested-by: Hiroshi Doyu <hdoyu@nvidia.com>
Cc: Will Drewry <wad@chromium.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Thu, 21 Nov 2013 22:32:04 +0000 (14:32 -0800)]
ARM: drivers/rtc/rtc-at91rm9200.c: disable interrupts at shutdown
Make sure RTC-interrupts are disabled at shutdown.
As the RTC is generally powered by backup power (VDDBU), its interrupts
are not disabled on wake-up, user, watchdog or software reset. This
could cause troubles on other systems (e.g. older kernels) if an
interrupt occurs before a handler has been installed at next boot.
Let us be well-behaved and disable them on clean shutdowns at least (as
do the RTT-based rtc-at91sam9 driver).
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Andrew Victor <linux@maxim.org.za>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Thu, 21 Nov 2013 22:32:02 +0000 (14:32 -0800)]
mm: hugetlbfs: fix hugetlbfs optimization
Commit
7cb2ef56e6a8 ("mm: fix aio performance regression for database
caused by THP") can cause dereference of a dangling pointer if
split_huge_page runs during PageHuge() if there are updates to the
tail_page->private field.
Also it is repeating compound_head twice for hugetlbfs and it is running
compound_head+compound_trans_head for THP when a single one is needed in
both cases.
The new code within the PageSlab() check doesn't need to verify that the
THP page size is never bigger than the smallest hugetlbfs page size, to
avoid memory corruption.
A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_unlock label, that is
relevant for the compound_lock path too).
By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the below problem:
echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
BUG: Bad page state in process bash pfn:59a01
page:
ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
page flags: 0x1c00000000008000(tail)
Modules linked in:
CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x55/0x76
bad_page+0xd5/0x130
free_pages_prepare+0x213/0x280
__free_pages+0x36/0x80
update_and_free_page+0xc1/0xd0
free_pool_huge_page+0xc2/0xe0
set_max_huge_pages.part.58+0x14c/0x220
nr_hugepages_store_common.isra.60+0xd0/0xf0
nr_hugepages_store+0x13/0x20
kobj_attr_store+0xf/0x20
sysfs_write_file+0x189/0x1e0
vfs_write+0xc5/0x1f0
SyS_write+0x55/0xb0
system_call_fastpath+0x16/0x1b
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yuanhan Liu [Thu, 21 Nov 2013 22:32:01 +0000 (14:32 -0800)]
kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
Remove CONFIG_USE_GENERIC_SMP_HELPERS left by commit
0a06ff068f12
("kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS").
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Thelen [Thu, 21 Nov 2013 22:32:00 +0000 (14:32 -0800)]
ipc,shm: fix shm_file deletion races
When IPC_RMID races with other shm operations there's potential for
use-after-free of the shm object's associated file (shm_file).
Here's the race before this patch:
TASK 1 TASK 2
------ ------
shm_rmid()
ipc_lock_object()
shmctl()
shp = shm_obtain_object_check()
shm_destroy()
shum_unlock()
fput(shp->shm_file)
ipc_lock_object()
shmem_lock(shp->shm_file)
<OOPS>
The oops is caused because shm_destroy() calls fput() after dropping the
ipc_lock. fput() clears the file's f_inode, f_path.dentry, and
f_path.mnt, which causes various NULL pointer references in task 2. I
reliably see the oops in task 2 if with shmlock, shmu
This patch fixes the races by:
1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock().
2) modify at risk operations to check shm_file while holding
ipc_object_lock().
Example workloads, which each trigger oops...
Workload 1:
while true; do
id=$(shmget 1 4096)
shm_rmid $id &
shmlock $id &
wait
done
The oops stack shows accessing NULL f_inode due to racing fput:
_raw_spin_lock
shmem_lock
SyS_shmctl
Workload 2:
while true; do
id=$(shmget 1 4096)
shmat $id 4096 &
shm_rmid $id &
wait
done
The oops stack is similar to workload 1 due to NULL f_inode:
touch_atime
shmem_mmap
shm_mmap
mmap_region
do_mmap_pgoff
do_shmat
SyS_shmat
Workload 3:
while true; do
id=$(shmget 1 4096)
shmlock $id
shm_rmid $id &
shmunlock $id &
wait
done
The oops stack shows second fput tripping on an NULL f_inode. The
first fput() completed via from shm_destroy(), but a racing thread did
a get_file() and queued this fput():
locks_remove_flock
__fput
____fput
task_work_run
do_notify_resume
int_signal
Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat")
Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org> # 3.10.17+ 3.11.6+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Thu, 21 Nov 2013 22:31:58 +0000 (14:31 -0800)]
mm: thp: give transparent hugepage code a separate copy_page
Right now, the migration code in migrate_page_copy() uses copy_huge_page()
for hugetlbfs and thp pages:
if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
So, yay for code reuse. But:
void copy_huge_page(struct page *dst, struct page *src)
{
struct hstate *h = page_hstate(src);
and a non-hugetlbfs page has no page_hstate(). This works 99% of the
time because page_hstate() determines the hstate from the page order
alone. Since the page order of a THP page matches the default hugetlbfs
page order, it works.
But, if you change the default huge page size on the boot command-line
(say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
so page_hstate() returns null and copy_huge_page() oopses pretty fast
since copy_huge_page() dereferences the hstate:
void copy_huge_page(struct page *dst, struct page *src)
{
struct hstate *h = page_hstate(src);
if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
...
Mel noticed that the migration code is really the only user of these
functions. This moves all the copy code over to migrate.c and makes
copy_huge_page() work for THP by checking for it explicitly.
I believe the bug was introduced in commit
b32967ff101a ("mm: numa: Add
THP migration for the NUMA working set scanning fault case")
[akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 21 Nov 2013 22:31:57 +0000 (14:31 -0800)]
checkpatch: fix "Use of uninitialized value" warnings
checkpatch is currently confused about some complex macros and references
undefined variables $stat and $cond.
Make sure these are defined before using them.
Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Gerhard Sittig <gsi@denx.de>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Thu, 21 Nov 2013 22:31:56 +0000 (14:31 -0800)]
configfs: fix race between dentry put and lookup
A race window in configfs, it starts from one dentry is UNHASHED and end
before configfs_d_iput is called. In this window, if a lookup happen,
since the original dentry was UNHASHED, so a new dentry will be
allocated, and then in configfs_attach_attr(), sd->s_dentry will be
updated to the new dentry. Then in configfs_d_iput(),
BUG_ON(sd->s_dentry != dentry) will be triggered and system panic.
sys_open: sys_close:
... fput
dput
dentry_kill
__d_drop <--- dentry unhashed here,
but sd->dentry still point
to this dentry.
lookup_real
configfs_lookup
configfs_attach_attr---> update sd->s_dentry
to new allocated dentry here.
d_kill
configfs_d_iput <--- BUG_ON(sd->s_dentry != dentry)
triggered here.
To fix it, change configfs_d_iput to not update sd->s_dentry if
sd->s_count > 2, that means there are another dentry is using the sd
beside the one that is going to be put. Use configfs_dirent_lock in
configfs_attach_attr to sync with configfs_d_iput.
With the following steps, you can reproduce the bug.
1. enable ocfs2, this will mount configfs at /sys/kernel/config and
fill configure in it.
2. run the following script.
while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 20 Nov 2013 23:13:47 +0000 (15:13 -0800)]
Merge branch 'next' of git://git./linux/kernel/git/benh/powerpc
Pull powerpc LE updates from Ben Herrenschmidt:
"With my previous pull request I mentioned some remaining Little Endian
patches, notably support for our new ABI, which I was sitting on
making sure it was all finalized.
The toolchain folks confirmed it now, the new ABI is stable and merged
with gcc, so we are all good. Oh and we actually missed the actual
Kconfig switch for LE so here it is, along with a couple more bug
fixes.
I have more fixes but not related to LE so I'll send them as a
separate pull request tomorrow, let's get this one out of the way.
Note that this supports running user space binaries using the new ABI,
but the kernel itself still needs to be built with the old one. We'll
bring fixes for that after -rc1.
Here's Anton log that goes with this series:
This patch series adds support for the new ABI, LPAR support for
H_SET_MODE and finally adds a kconfig option and defconfig.
ABIv2 support was recently committed to binutils and gcc, and should
be merged into glibc soon. There are a number of very nice
improvements including the removal of function descriptors. Rusty's
kernel patches allow binaries of either ABI to work, easing the
transition"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Wrong DWARF CFI in the kernel vdso for little-endian / ELFv2
powerpc: Add pseries_le_defconfig
powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.
powerpc: Don't use ELFv2 ABI to build the kernel
powerpc: ELF2 binaries signal handling
powerpc: ELF2 binaries launched directly.
powerpc: Set eflags correctly for ELF ABIv2 core dumps.
powerpc: Add TIF_ELF2ABI flag.
pseries: Add H_SET_MODE to change exception endianness
powerpc/pseries: Fix endian issues in pseries EEH code
Linus Torvalds [Wed, 20 Nov 2013 23:03:45 +0000 (15:03 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mattst88/alpha
Pull alpha updates from Matt Turner:
"It contains a few fixes and some work from Richard to make alpha
emulation under QEMU much more usable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: Prevent a NULL ptr dereference in csum_partial_copy.
alpha: perf: fix out-of-bounds array access triggered from raw event
alpha: Use qemu+cserve provided high-res clock and alarm.
alpha: Switch to GENERIC_CLOCKEVENTS
alpha: Enable the rpcc clocksource for single processor
alpha: Reorganize rtc handling
alpha: Primitive support for CPU power down.
alpha: Allow HZ to be configured
alpha: Notice if we're being run under QEMU
alpha: Eliminate compiler warning from memset macro
Linus Torvalds [Wed, 20 Nov 2013 23:02:50 +0000 (15:02 -0800)]
Merge branch 'parisc-3.13' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- revert an access_ok() patch which broke 32bit userspace on 64bit
kernels
- avoid a gcc miscompilation in two internal pa_memcpy() functions by
not inlining those
- do not export the definition of SOCK_NONBLOCK via uapi header (fixes
build of audit package)
- depending on the fault type we now correctly report either SIGBUS or
SIGSEGV
- a small fix to not compare a size_t variable for < 0
* 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: size_t is unsigned, so comparison size < 0 doesn't make sense.
parisc: improve SIGBUS/SIGSEGV error reporting
parisc: break out SOCK_NONBLOCK define to own asm header file
parisc: do not inline pa_memcpy() internal functions
Revert "parisc: implement full version of access_ok()"
Linus Torvalds [Wed, 20 Nov 2013 23:02:22 +0000 (15:02 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/egtvedt/linux-avr32
Pull AVR32 updates from Hans-Christian Egtvedt.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
avr32: uapi: be sure of "_UAPI" prefix for all guard macros
avr32: add kprobe_ctlblk memory struct
avr32: fix out-of-range jump in large kernels
avr32: setup crt for early panic()
Linus Torvalds [Wed, 20 Nov 2013 22:51:37 +0000 (14:51 -0800)]
Merge tag 'squashfs-updates' of git://git./linux/kernel/git/pkl/squashfs-next
Pull squashfs updates from Phillip Lougher:
"These patches optionally improve the multi-threading peformance of
Squashfs by adding parallel decompression, and direct decompression
into the page cache, eliminating an intermediate buffer (removing
memcpy overhead and lock contention)"
* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: Check stream is not NULL in decompressor_multi.c
Squashfs: Directly decompress into the page cache for file data
Squashfs: Restructure squashfs_readpage()
Squashfs: Generalise paging handling in the decompressors
Squashfs: add multi-threaded decompression using percpu variable
squashfs: Enhance parallel I/O
Squashfs: Refactor decompressor interface and code
Linus Torvalds [Wed, 20 Nov 2013 22:41:47 +0000 (14:41 -0800)]
Revert "mm: create a separate slab for page->ptl allocation"
This reverts commit
ea1e7ed33708c7a760419ff9ded0a6cb90586a50.
Al points out that while the commit *does* actually create a separate
slab for the page->ptl allocation, that slab is never actually used, and
the code continues to use kmalloc/kfree.
Damien Wyart points out that the original patch did have the conversion
to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.
Revert the half-arsed attempt that didn't do anything. If we really do
want the special slab (remember: this is all relevant just for debug
builds, so it's not necessarily all that critical) we might as well redo
the patch fully.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 20 Nov 2013 22:25:39 +0000 (14:25 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull vfs bits and pieces from Al Viro:
"Assorted bits that got missed in the first pull request + fixes for a
couple of coredump regressions"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fold try_to_ascend() into the sole remaining caller
dcache.c: get rid of pointless macros
take read_seqbegin_or_lock() and friends to seqlock.h
consolidate simple ->d_delete() instances
gfs2: endianness misannotations
dump_emit(): use __kernel_write(), not vfs_write()
dump_align(): fix the dumb braino
Al Viro [Wed, 20 Nov 2013 22:16:36 +0000 (22:16 +0000)]
Wrong page freed on preallocate_pmds() failure exit
Note that pmds[i] is simply uninitialized at that point...
Granted, it's very hard to hit (you need split page locks *and*
kmalloc(sizeof(spinlock_t), GFP_KERNEL) failing), but the code is
obviously bogus.
Introduced by commit
09ef4939850a ("x86: add missed
pgtable_pmd_page_ctor/dtor calls for preallocated pmds")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ulrich Weigand [Wed, 20 Nov 2013 20:38:05 +0000 (07:38 +1100)]
powerpc: Wrong DWARF CFI in the kernel vdso for little-endian / ELFv2
I've finally tracked down why my CR signal-unwind test case still
fails on little-endian. The problem turned to be that the kernel
installs a signal trampoline in the vDSO, and provides a DWARF CFI
record for that trampoline. This CFI describes the save location
for CR:
rsave (70, 38*RSIZE + (RSIZE - CRSIZE))
which is correct for big-endian, but points to the wrong word on
little-endian. This is wrong no matter which ABI.
In addition, for the ELFv2 ABI, we should not only provide a CFI
record for register 70 (cr2), but for all CR fields separately.
Strictly speaking, I guess this would mean providing two separate
vDSO images, one for ELFv1 processes and one for ELFv2 processes (or
maybe playing some tricks with conditional DWARF expressions).
However, having CFI records for the other CR fields in ELFv1 is not
actually wrong, they just will be ignored. So it seems the simplest
fix would be just to always provide CFI for all the fields.
Signed-off-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 20 Nov 2013 11:15:06 +0000 (22:15 +1100)]
powerpc: Add pseries_le_defconfig
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 20 Nov 2013 11:15:05 +0000 (22:15 +1100)]
powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.
With the little endian support merged, we can add the
CONFIG_CPU_LITTLE_ENDIAN kernel config option.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alistair Popple [Wed, 20 Nov 2013 11:15:04 +0000 (22:15 +1100)]
powerpc: Don't use ELFv2 ABI to build the kernel
The kernel doesn't build correctly using the ELFv2 ABI. This patch
ensures that the ELFv1 ABI is used when building a kernel with an
ELFv2 enabled compiler.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:03 +0000 (22:15 +1100)]
powerpc: ELF2 binaries signal handling
For the ELFv2 ABI, the hander is the entry point, not a function descriptor.
We also need to set up r12, and fortunately the fast_exception_return
exit path restores r12 for us so nothing else is required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:02 +0000 (22:15 +1100)]
powerpc: ELF2 binaries launched directly.
No function descriptor, but we set r12 up and set TIF_RESTOREALL as it
normally isn't restored on return from syscall.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:01 +0000 (22:15 +1100)]
powerpc: Set eflags correctly for ELF ABIv2 core dumps.
We leave it at zero (though it could be 1) for old tasks.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:00 +0000 (22:15 +1100)]
powerpc: Add TIF_ELF2ABI flag.
Little endian ppc64 is getting an exciting new ABI. This is reflected
by the bottom two bits of e_flags in the ELF header:
0 == legacy binaries (v1 ABI)
1 == binaries using the old ABI (compiled with a new toolchain)
2 == binaries using the new ABI.
We store this in a thread flag, because we need to set it in core
dumps and for signal delivery. Our chief concern is that it doesn't
use function descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 20 Nov 2013 11:14:59 +0000 (22:14 +1100)]
pseries: Add H_SET_MODE to change exception endianness
On little endian builds call H_SET_MODE so exceptions have the
correct endianness. We need to reset the endian during kexec
so do that in the MMU hashtable clear callback.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 20 Nov 2013 11:14:58 +0000 (22:14 +1100)]
powerpc/pseries: Fix endian issues in pseries EEH code
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Wed, 20 Nov 2013 21:25:04 +0000 (13:25 -0800)]
Merge tag 'pm+acpi-2-3.13-rc1' of git://git./linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
- ACPI-based device hotplug fixes for issues introduced recently and a
fix for an older error code path bug in the ACPI PCI host bridge
driver
- Fix for recently broken OMAP cpufreq build from Viresh Kumar
- Fix for a recent hibernation regression related to s2disk
- Fix for a locking-related regression in the ACPI EC driver from
Puneet Kumar
- System suspend error code path fix related to runtime PM and runtime
PM documentation update from Ulf Hansson
- cpufreq's conservative governor fix from Xiaoguang Chen
- New processor IDs for intel_idle and turbostat and removal of an
obsolete Kconfig option from Len Brown
- New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg
- Removal of several ACPI video DMI blacklist entries that are not
necessary any more from Aaron Lu
- Rework of the ACPI companion representation in struct device and code
cleanup related to that change from Rafael J Wysocki, Lan Tianyu and
Jarkko Nikula
- Fixes for assigning names to ACPI-enumerated I2C and SPI devices from
Jarkko Nikula
* tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits)
PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
ACPI / PCI root: Clear driver_data before failing enumeration
ACPI / hotplug: Fix PCI host bridge hot removal
ACPI / hotplug: Fix acpi_bus_get_device() return value check
cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
ACPI / video: clean up DMI table for initial black screen problem
ACPI / EC: Ensure lock is acquired before accessing ec struct members
PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac
spi: Use stable dev_name for ACPI enumerated SPI slaves
i2c: Use stable dev_name for ACPI enumerated I2C slaves
ACPI: Provide acpi_dev_name accessor for struct acpi_device device name
ACPI / bind: Use (put|get)_device() on ACPI device objects too
ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro
ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node
cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
PM / Runtime: Fix error path for prepare
PM / Runtime: Update documentation around probe|remove|suspend
cpufreq: conservative: set requested_freq to policy max when it is over policy max
...
Linus Torvalds [Wed, 20 Nov 2013 21:20:24 +0000 (13:20 -0800)]
Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine changes from Vinod Koul:
"This brings for slave dmaengine:
- Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as
dmaengine can only transfer and not verify validaty of dma
transfers
- Bunch of fixes across drivers:
- cppi41 driver fixes from Daniel
- 8 channel freescale dma engine support and updated bindings from
Hongbo
- msx-dma fixes and cleanup by Markus
- DMAengine updates from Dan:
- Bartlomiej and Dan finalized a rework of the dma address unmap
implementation.
- In the course of testing 1/ a collection of enhancements to
dmatest fell out. Notably basic performance statistics, and
fixed / enhanced test control through new module parameters
'run', 'wait', 'noverify', and 'verbose'. Thanks to Andriy and
Linus [Walleij] for their review.
- Testing the raid related corner cases of 1/ triggered bugs in
the recently added 16-source operation support in the ioatdma
driver.
- Some minor fixes / cleanups to mv_xor and ioatdma"
* 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits)
dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers
dma: mv_xor: Remove unneeded NULL address check
ioat: fix ioat3_irq_reinit
ioat: kill msix_single_vector support
raid6test: add new corner case for ioatdma driver
ioatdma: clean up sed pool kmem_cache
ioatdma: fix selection of 16 vs 8 source path
ioatdma: fix sed pool selection
ioatdma: Fix bug in selftest after removal of DMA_MEMSET.
dmatest: verbose mode
dmatest: convert to dmaengine_unmap_data
dmatest: add a 'wait' parameter
dmatest: add basic performance metrics
dmatest: add support for skipping verification and random data setup
dmatest: use pseudo random numbers
dmatest: support xor-only, or pq-only channels in tests
dmatest: restore ability to start test at module load and init
dmatest: cleanup redundant "dmatest: " prefixes
dmatest: replace stored results mechanism, with uniform messages
Revert "dmatest: append verify result to results"
...
Linus Torvalds [Wed, 20 Nov 2013 21:06:20 +0000 (13:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
"Normally I'd defer my initial for-linus pull request until after the
merge window, but a race was uncovered in the virtio-blk conversion to
blk-mq that could cause hangs. So here's a small collection of fixes
for you to pull:
- The fix for the virtio-blk IO hang reported by Dave Chinner, from
Shaohua and myself.
- Add the Insert blktrace event for blk-mq. This makes 'btt' happy
when it is doing it's state transition analysis.
- Ensure that blk-mq has disk/partition stats enabled by default,
instead of making it opt-in.
- A fix for __bio_add_page() and large sector counts"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: add blktrace insert event trace
virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
blk-mq: ensure that we set REQ_IO_STAT so diskstats work
bio: fix argument of __bio_add_page() for max_sectors > 0xffff
Linus Torvalds [Wed, 20 Nov 2013 21:05:25 +0000 (13:05 -0800)]
Merge tag 'md/3.13' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
Chen Gang [Tue, 12 Nov 2013 08:38:47 +0000 (16:38 +0800)]
avr32: uapi: be sure of "_UAPI" prefix for all guard macros
For all uapi headers, need use "_UAPI" prefix for its guard macro
(which will be stripped by "scripts/headers_installer.sh").
Also remove redundant files (bitsperlong.h, errno.h, fcntl.h, ioctl.h,
ioctls.h, ipcbuf.h, kvm_para.h, mman.h, poll.h, resource.h, siginfo.h,
statfs.h, and unistd.h) which are already in Kbuild.
Also be sure that all "#endif" only have one empty line above, and each
file has guard macro.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Hans-Christian Egtvedt <hegtvedt@cisco.com>
Eirik Aanonsen [Wed, 6 Nov 2013 21:00:33 +0000 (22:00 +0100)]
avr32: add kprobe_ctlblk memory struct
This re-enables kprobes on AVR32 architecture.
Signed-off-by: Eirik Aanonsen <eaa@wprmedical.com>
Signed-off-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Andreas Bießmann [Thu, 24 Oct 2013 10:31:04 +0000 (12:31 +0200)]
avr32: fix out-of-range jump in large kernels
This patch fixes following error (for big kernels):
---8<---
arch/avr32/boot/u-boot/head.o: In function `no_tag_table':
(.init.text+0x44): relocation truncated to fit: R_AVR32_22H_PCREL against symbol `panic' defined in .text.unlikely section in kernel/built-in.o
arch/avr32/kernel/built-in.o: In function `bad_return':
(.ex.text+0x236): relocation truncated to fit: R_AVR32_22H_PCREL against symbol `panic' defined in .text.unlikely section in kernel/built-in.o
--->8---
It comes up when the kernel increases and 'panic()' is too far away to fit in
the +/- 2MiB range. Which in turn issues from the 21-bit displacement in
'br{cond4}' mnemonic which is one of the two ways to do jumps (rjmp has just
10-bit displacement and therefore a way smaller range). This fact was stated
before in
8d29b7b9f81d6b83d869ff054e6c189d6da73f1f.
One solution to solve this is to add a local storage for the symbol address
and just load the $pc with that value.
Signed-off-by: Andreas Bießmann <andreas@biessmann.de>
Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: stable@vger.kernel.org
Andreas Bießmann [Thu, 24 Oct 2013 10:31:03 +0000 (12:31 +0200)]
avr32: setup crt for early panic()
Before the CRT was (fully) set up in kernel_entry (bss cleared before in
_start, but also not before jump to panic() in no_tag_table case).
This patch fixes this up to have a fully working CRT when branching to panic()
in no_tag_table.
Signed-off-by: Andreas Bießmann <andreas@biessmann.de>
Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: stable@vger.kernel.org
Phillip Lougher [Sun, 10 Nov 2013 00:02:29 +0000 (00:02 +0000)]
Squashfs: Check stream is not NULL in decompressor_multi.c
Fix static checker complaint that stream is not checked in
squashfs_decompressor_destroy().
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Phillip Lougher [Wed, 13 Nov 2013 02:04:19 +0000 (02:04 +0000)]
Squashfs: Directly decompress into the page cache for file data
This introduces an implementation of squashfs_readpage_block()
that directly decompresses into the page cache.
This uses the previously added page handler abstraction to push
down the necessary kmap_atomic/kunmap_atomic operations on the
page cache buffers into the decompressors. This enables
direct copying into the page cache without using the slow
kmap/kunmap calls.
The code detects when multiple threads are racing in
squashfs_readpage() to decompress the same block, and avoids
this regression by falling back to using an intermediate
buffer.
This patch enhances the performance of Squashfs significantly
when multiple processes are accessing the filesystem simultaneously
because it not only reduces memcopying, but it more importantly
eliminates the lock contention on the intermediate buffer.
Using single-thread decompression.
dd if=file1 of=/dev/null bs=4096 &
dd if=file2 of=/dev/null bs=4096 &
dd if=file3 of=/dev/null bs=4096 &
dd if=file4 of=/dev/null bs=4096
Before:
629145600 bytes (629 MB) copied, 45.8046 s, 13.7 MB/s
After:
629145600 bytes (629 MB) copied, 9.29414 s, 67.7 MB/s
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Phillip Lougher [Thu, 31 Oct 2013 19:24:27 +0000 (19:24 +0000)]
Squashfs: Restructure squashfs_readpage()
Restructure squashfs_readpage() splitting it into separate
functions for datablocks, fragments and sparse blocks.
Move the memcpying (from squashfs cache entry) implementation of
squashfs_readpage_block into file_cache.c
This allows different implementations to be supported.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Phillip Lougher [Mon, 18 Nov 2013 02:59:12 +0000 (02:59 +0000)]
Squashfs: Generalise paging handling in the decompressors
Further generalise the decompressors by adding a page handler
abstraction. This adds helpers to allow the decompressors
to access and process the output buffers in an implementation
independant manner.
This allows different types of output buffer to be passed
to the decompressors, with the implementation specific
aspects handled at decompression time, but without the
knowledge being held in the decompressor wrapper code.
This will allow the decompressors to handle Squashfs
cache buffers, and page cache pages.
This patch adds the abstraction and an implementation for
the caches.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Phillip Lougher [Mon, 18 Nov 2013 02:31:36 +0000 (02:31 +0000)]
Squashfs: add multi-threaded decompression using percpu variable
Add a multi-threaded decompression implementation which uses
percpu variables.
Using percpu variables has advantages and disadvantages over
implementations which do not use percpu variables.
Advantages:
* the nature of percpu variables ensures decompression is
load-balanced across the multiple cores.
* simplicity.
Disadvantages: it limits decompression to one thread per core.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Minchan Kim [Mon, 28 Oct 2013 05:26:30 +0000 (14:26 +0900)]
squashfs: Enhance parallel I/O
Now squashfs have used for only one stream buffer for decompression
so it hurts parallel read performance so this patch supports
multiple decompressor to enhance performance parallel I/O.
Four 1G file dd read on KVM machine which has 2 CPU and 4G memory.
dd if=test/test1.dat of=/dev/null &
dd if=test/test2.dat of=/dev/null &
dd if=test/test3.dat of=/dev/null &
dd if=test/test4.dat of=/dev/null &
old : 1m39s -> new : 9s
* From v1
* Change comp_strm with decomp_strm - Phillip
* Change/add comments - Phillip
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Phillip Lougher [Wed, 13 Nov 2013 02:56:26 +0000 (02:56 +0000)]
Squashfs: Refactor decompressor interface and code
The decompressor interface and code was written from
the point of view of single-threaded operation. In doing
so it mixed a lot of single-threaded implementation specific
aspects into the decompressor code and elsewhere which makes it
difficult to seamlessly support multiple different decompressor
implementations.
This patch does the following:
1. It removes compressor_options parsing from the decompressor
init() function. This allows the decompressor init() function
to be dynamically called to instantiate multiple decompressors,
without the compressor options needing to be read and parsed each
time.
2. It moves threading and all sleeping operations out of the
decompressors. In doing so, it makes the decompressors
non-blocking wrappers which only deal with interfacing with
the decompressor implementation.
3. It splits decompressor.[ch] into decompressor generic functions
in decompressor.[ch], and moves the single threaded
decompressor implementation into decompressor_single.c.
The result of this patch is Squashfs should now be able to
support multiple decompressors by adding new decompressor_xxx.c
files with specialised implementations of the functions in
decompressor_single.c
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Jens Axboe [Wed, 20 Nov 2013 01:59:10 +0000 (18:59 -0700)]
blk-mq: add blktrace insert event trace
We need it to make 'btt' from blktrace happy, otherwise
we are missing one state transition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Wed, 20 Nov 2013 01:57:24 +0000 (18:57 -0700)]
virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
It isn't safe to call it without holding the vblk->vq_lock.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Fixed another condition of virtqueue_kick() not holding the lock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mahesh Rajashekhara [Thu, 31 Oct 2013 08:31:02 +0000 (14:01 +0530)]
aacraid: prevent invalid pointer dereference
It appears that driver runs into a problem here if fibsize is too small
because we allocate user_srbcmd with fibsize size only but later we
access it until user_srbcmd->sg.count to copy it over to srbcmd.
It is not correct to test (fibsize < sizeof(*user_srbcmd)) because this
structure already includes one sg element and this is not needed for
commands without data. So, we would recommend to add the following
(instead of test for fibsize == 0).
Signed-off-by: Mahesh Rajashekhara <Mahesh.Rajashekhara@pmcs.com>
Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 19 Nov 2013 23:50:47 +0000 (15:50 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
Linus Torvalds [Tue, 19 Nov 2013 23:50:03 +0000 (15:50 -0800)]
Merge git://git./linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
"Two merge window fallout build fixes"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: merge fix
sparc64: fix build regession
Linus Torvalds [Tue, 19 Nov 2013 23:49:31 +0000 (15:49 -0800)]
Merge tag 'please-pull-fixia64' of git://git./linux/kernel/git/aegl/linux
Pull ia64 fix from Tony Luck:
"Unbreak ia64 build by avoiding circular dependency"
* tag 'please-pull-fixia64' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
kernel/bounds: avoid circular dependencies in generated headers
Helge Deller [Tue, 12 Nov 2013 20:01:24 +0000 (21:01 +0100)]
parisc: size_t is unsigned, so comparison size < 0 doesn't make sense.
Signed-off-by: Helge Deller <deller@gmx.de>
CC: Mikulas Patocka <mpatocka@redhat.com>
Helge Deller [Mon, 18 Nov 2013 21:12:11 +0000 (22:12 +0100)]
parisc: improve SIGBUS/SIGSEGV error reporting
This patch fixes most of the Linux Test Project testcases, e.g. fstat05.
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Mon, 14 Oct 2013 19:04:13 +0000 (21:04 +0200)]
parisc: break out SOCK_NONBLOCK define to own asm header file
Break SOCK_NONBLOCK out to its own asm-file as other arches do. This
fixes build errors with auditd and probably other packages.
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Sun, 17 Nov 2013 21:03:11 +0000 (22:03 +0100)]
parisc: do not inline pa_memcpy() internal functions
gcc (4.8.x) creates wrong code when the pa_memcpy() functions are
inlined. Especially in 32bit builds it calculates wrong return values
if we encounter a fault during execution of the memcpy.
Signed-off-by: Helge Deller <deller@gmx.de>
Helge Deller [Tue, 19 Nov 2013 22:31:35 +0000 (23:31 +0100)]
Revert "parisc: implement full version of access_ok()"
This reverts commit
63379c135331c724d40a87b98eb62d2122981341.
It broke userspace and adding more checking is not needed.
Even checking if a syscall would access memory in page zero doesn't
makes sense since it may lead to some syscalls returning -EFAULT
where we would return other error codes on other platforms.
In summary, just drop this change and return to always return 1.
Signed-off-by: Helge Deller <deller@gmx.de>
Kirill A. Shutemov [Mon, 18 Nov 2013 08:47:27 +0000 (10:47 +0200)]
kernel/bounds: avoid circular dependencies in generated headers
<linux/spinlock.h> has heavy dependencies on other header files.
It triggers circular dependencies in generated headers on IA64, at
least:
CC kernel/bounds.s
In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0,
from include/linux/thread_info.h:54,
from include/asm-generic/preempt.h:4,
from arch/ia64/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/spinlock.h:50,
from kernel/bounds.c:14:
/home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory
compilation terminated.
Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, it's
enough to find out size of spinlock_t.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-and-Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
David S. Miller [Tue, 19 Nov 2013 21:39:42 +0000 (16:39 -0500)]
Merge branch 'genetlink_mcast'
Johannes Berg says:
====================
genetlink: clean up multicast group APIs
The generic netlink multicast group registration doesn't have to
be dynamic, and can thus be simplified just like I did with the
ops. This removes some complexity in registration code.
Additionally, two users of generic netlink already use multicast
groups in a wrong way, add workarounds for those two to keep the
userspace API working, but at the same time make them not clash
with other users of multicast groups as might happen now.
While making it all a bit easier, also prevent such abuse by adding
checks to the APIs so each family can only use the groups it owns.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:39 +0000 (15:19 +0100)]
genetlink: make multicast groups const, prevent abuse
Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because the only have one group.
This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.
At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:38 +0000 (15:19 +0100)]
genetlink: pass family to functions using groups
This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:37 +0000 (15:19 +0100)]
genetlink: add and use genl_set_err()
Add a static inline to generic netlink to wrap netlink_set_err()
to make it easier to use here - use it in openvswitch (the only
generic netlink user of netlink_set_err()).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:36 +0000 (15:19 +0100)]
genetlink: remove family pointer from genl_multicast_group
There's no reason to have the family pointer there since it
can just be passed internally where needed, so remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:35 +0000 (15:19 +0100)]
genetlink: remove genl_unregister_mc_group()
There are no users of this API remaining, and we'll soon
change group registration to be static (like ops are now)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:34 +0000 (15:19 +0100)]
hsr: don't call genl_unregister_mc_group()
There's no need to unregister the multicast group if the
generic netlink family is registered immediately after.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:33 +0000 (15:19 +0100)]
quota/genetlink: use proper genetlink multicast APIs
The quota code is abusing the genetlink API and is using
its family ID as the multicast group ID, which is invalid
and may belong to somebody else (and likely will.)
Make the quota code use the correct API, but since this
is already used as-is by userspace, reserve a family ID
for this code and also reserve that group ID to not break
userspace assumptions.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:32 +0000 (15:19 +0100)]
drop_monitor/genetlink: use proper genetlink multicast APIs
The drop monitor code is abusing the genetlink API and is
statically using the generic netlink multicast group 1, even
if that group belongs to somebody else (which it invariably
will, since it's not reserved.)
Make the drop monitor code use the proper APIs to reserve a
group ID, but also reserve the group id 1 in generic netlink
code to preserve the userspace API. Since drop monitor can
be a module, don't clear the bit for it on unregistration.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:31 +0000 (15:19 +0100)]
genetlink: only pass array to genl_register_family_with_ops()
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Tue, 19 Nov 2013 18:10:06 +0000 (22:10 +0400)]
tcp: don't update snd_nxt, when a socket is switched from repair mode
snd_nxt must be updated synchronously with sk_send_head. Otherwise
tp->packets_out may be updated incorrectly, what may bring a kernel panic.
Here is a kernel panic from my host.
[ 103.043194] BUG: unable to handle kernel NULL pointer dereference at
0000000000000048
[ 103.044025] IP: [<
ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
...
[ 146.301158] Call Trace:
[ 146.301158] [<
ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
Before this panic a tcp socket was restored. This socket had sent and
unsent data in the write queue. Sent data was restored in repair mode,
then the socket was switched from reapair mode and unsent data was
restored. After that the socket was switched back into repair mode.
In that moment we had a socket where write queue looks like this:
snd_una snd_nxt write_seq
|_________|________|
|
sk_send_head
After a second switching from repair mode the state of socket was
changed:
snd_una snd_nxt, write_seq
|_________ ________|
|
sk_send_head
This state is inconsistent, because snd_nxt and sk_send_head are not
synchronized.
Bellow you can find a call trace, how packets_out can be incremented
twice for one skb, if snd_nxt and sk_send_head are not synchronized.
In this case packets_out will be always positive, even when
sk_write_queue is empty.
tcp_write_wakeup
skb = tcp_send_head(sk);
tcp_fragment
if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
tcp_adjust_pcount(sk, skb, diff);
tcp_event_new_data_sent
tp->packets_out += tcp_skb_pcount(skb);
I think update of snd_nxt isn't required, when a socket is switched from
repair mode. Because it's initialized in tcp_connect_init. Then when a
write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent,
so it's always is in consistent state.
I have checked, that the bug is not reproduced with this patch and
all tests about restoring tcp connections work fine.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Tue, 19 Nov 2013 10:09:27 +0000 (18:09 +0800)]
atm: idt77252: fix dev refcnt leak
init_card() calls dev_get_by_name() to get a network deceive. But it
doesn't decrease network device reference count after the device is
used.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fan.du [Tue, 19 Nov 2013 08:53:28 +0000 (16:53 +0800)]
xfrm: Release dst if this dst is improper for vti tunnel
After searching rt by the vti tunnel dst/src parameter,
if this rt has neither attached to any transformation
nor the transformation is not tunnel oriented, this rt
should be released back to ip layer.
otherwise causing dst memory leakage.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafael J. Wysocki [Tue, 19 Nov 2013 20:18:13 +0000 (21:18 +0100)]
Merge branch 'acpi-hotplug'
* acpi-hotplug:
PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
Mika Westerberg [Tue, 19 Nov 2013 14:15:35 +0000 (16:15 +0200)]
PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
Commit
bd950799d951 (PCI: acpiphp: Convert to dynamic debug) removed users
of acpiphp_debug variable and the variable itself but the declaration was
left in the header file. Drop this unused declaration.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Johannes Berg [Tue, 19 Nov 2013 09:35:40 +0000 (10:35 +0100)]
netlink: fix documentation typo in netlink_set_err()
The parameter is just 'group', not 'groups', fix the documentation typo.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 19 Nov 2013 19:44:15 +0000 (11:44 -0800)]
Merge tag 'arc-v3.13-rc1-part2' of git://git./linux/kernel/git/vgupta/arc
Pull second set of ARC changes from Vineet Gupta:
- Support for Perf from Mischa
- Enabling GPIO/Pinctrl drivers for Abilis TB10x platform
- New defconfig for buildroot
* tag 'arc-v3.13-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-arcfpga] Add defconfig without initramfs location
ARC: perf: ARC 700 PMU doesn't support sampling events
ARC: Add documentation on DT binding for ARC700 PMU
ARC: Add perf support for ARC700 cores
ARC: [TB10x] Updates for GPIO and pinctrl
Linus Torvalds [Tue, 19 Nov 2013 19:43:21 +0000 (11:43 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull second set of s390 patches from Martin Schwidefsky:
"The handling of the PCI hotplug notifications has been improved, the
zfcp dumper can now detect the HSA size dynamically and the default
install kernel has been changed to the compressed bzImage. And two
bug-fixes for scm and 3720"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: implement hotplug notifications
s390/scm_block: do not hide eadm subchannel dependency
s390/sclp: Consolidate early sclp init calls to sclp_early_detect()
s390/sclp: Move early code from sclp_cmd.c to sclp_early.c
s390/sclp: Determine HSA size dynamically for zfcpdump
s390/sclp: Move declarations for sclp_sdias into separate header file
s390/pci: implement pcibios_remove_bus
s390/pci: improve handling of bus resources
s390/3270: fix missing device_destroy() call
s390/boot: Install bzImage as default kernel image
Linus Torvalds [Tue, 19 Nov 2013 19:42:32 +0000 (11:42 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/rw/uml
Pull UML changes from Richard Weinberger:
"This pile contains a nice defconfig cleanup, a rewritten stack
unwinder and various cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Remove unused declarations from <as-layout.h>
um: remove used STDIO_CONSOLE Kconfig param
um/vdso: add .gitignore for a couple of targets
arch/um: make it work with defconfig and x86_64
um: Make kstack_depth_to_print conform to arch/x86
um: Get rid of thread_struct->saved_task
um: Make stack trace reliable against kernel mode faults
um: Rewrite show_stack()
Linus Torvalds [Tue, 19 Nov 2013 18:48:19 +0000 (10:48 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"A modular build fix for certain .config's"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Export 'boot_cpu_physical_apicid' to modules
Linus Torvalds [Tue, 19 Nov 2013 18:40:00 +0000 (10:40 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq cleanups from Ingo Molnar:
"This is a multi-arch cleanup series from Thomas Gleixner, which we
kept to near the end of the merge window, to not interfere with
architecture updates.
This series (motivated by the -rt kernel) unifies more aspects of IRQ
handling and generalizes PREEMPT_ACTIVE"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Make PREEMPT_ACTIVE generic
sparc: Use preempt_schedule_irq
ia64: Use preempt_schedule_irq
m32r: Use preempt_schedule_irq
hardirq: Make hardirq bits generic
m68k: Simplify low level interrupt handling code
genirq: Prevent spurious detection for unconditionally polled interrupts
Jens Axboe [Tue, 19 Nov 2013 16:25:07 +0000 (09:25 -0700)]
blk-mq: ensure that we set REQ_IO_STAT so diskstats work
If disk stats are enabled on the queue, a request needs to
be marked with REQ_IO_STAT for accounting to be active on
that request. This fixes an issue with virtio-blk not
showing up in /proc/diskstats after the conversion to
blk-mq.
Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group
completion on by default.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
majianpeng [Thu, 14 Nov 2013 04:16:20 +0000 (15:16 +1100)]
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
When we change group_thread_cnt from sysfs entry, it can OOPS.
The kernel messages are:
[ 135.299021] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 135.299073] IP: [<
ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.299107] PGD 0
[ 135.299122] Oops: 0000 [#1] SMP
[ 135.299144] Modules linked in: netconsole e1000e ptp pps_core
[ 135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
[ 135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 135.299255] task:
ffff8800b9638f80 ti:
ffff8800b77a4000 task.ti:
ffff8800b77a4000
[ 135.299283] RIP: 0010:[<
ffffffff815188ab>] [<
ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.299323] RSP: 0018:
ffff8800b77a5c48 EFLAGS:
00010002
[ 135.299344] RAX:
ffff880037bb5c70 RBX:
0000000000000000 RCX:
0000000000000008
[ 135.299371] RDX:
ffff880037bb5cb8 RSI:
0000000000000001 RDI:
ffff880037bb5c00
[ 135.299398] RBP:
ffff8800b77a5d08 R08:
0000000000000001 R09:
0000000000000000
[ 135.299425] R10:
ffff8800b77a5c98 R11:
00000000ffffffff R12:
ffff880037bb5c00
[ 135.299452] R13:
0000000000000000 R14:
0000000000000000 R15:
ffff880037bb5c70
[ 135.299479] FS:
0000000000000000(0000) GS:
ffff88013fd80000(0000) knlGS:
0000000000000000
[ 135.299510] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[ 135.299532] CR2:
0000000000000000 CR3:
0000000001c0b000 CR4:
00000000000407e0
[ 135.299559] Stack:
[ 135.299570]
ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
[ 135.299611]
000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
[ 135.299654]
000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
[ 135.299696] Call Trace:
[ 135.299711] [<
ffffffff8107383e>] ? __wake_up+0x4e/0x70
[ 135.299733] [<
ffffffff81518f88>] raid5d+0x4c8/0x680
[ 135.299756] [<
ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
[ 135.299781] [<
ffffffff81524c9f>] md_thread+0x11f/0x170
[ 135.299804] [<
ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
[ 135.299826] [<
ffffffff81524b80>] ? md_rdev_init+0x110/0x110
[ 135.299850] [<
ffffffff81069656>] kthread+0xc6/0xd0
[ 135.299871] [<
ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 135.299899] [<
ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[ 135.299923] [<
ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
[ 135.300005] RIP [<
ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.300005] RSP <
ffff8800b77a5c48>
[ 135.300005] CR2:
0000000000000000
[ 135.300005] ---[ end trace
504854e5bb7562ed ]---
[ 135.300005] Kernel panic - not syncing: Fatal exception
This is because raid5d() can be running when the multi-thread
resources are changed via system. We see need to provide locking.
mddev->device_lock is suitable, but we cannot simple call
alloc_thread_groups under this lock as we cannot allocate memory
while holding a spinlock.
So change alloc_thread_groups() to allocate and return the data
structures, then raid5_store_group_thread_cnt() can take the lock
while updating the pointers to the data structures.
This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
stable series.
Fixes: b721420e8719131896b009b11edbbd27
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
majianpeng [Thu, 14 Nov 2013 04:16:19 +0000 (15:16 +1100)]
md/raid5: Before freeing old multi-thread worker, it should flush them.
When changing group_thread_cnt from sysfs entry, the kernel can oops.
The kernel messages are:
[ 740.961389] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[ 740.961444] IP: [<
ffffffff81062570>] process_one_work+0x30/0x500
[ 740.961476] PGD
b9013067 PUD
b651e067 PMD 0
[ 740.961503] Oops: 0000 [#1] SMP
[ 740.961525] Modules linked in: netconsole e1000e ptp pps_core
[ 740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
[ 740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 740.961646] task:
ffff88013abe0000 ti:
ffff88013a246000 task.ti:
ffff88013a246000
[ 740.961673] RIP: 0010:[<
ffffffff81062570>] [<
ffffffff81062570>] process_one_work+0x30/0x500
[ 740.961708] RSP: 0018:
ffff88013a247e08 EFLAGS:
00010086
[ 740.961730] RAX:
ffff8800b912b400 RBX:
ffff88013a61e680 RCX:
ffff8800b912b400
[ 740.961757] RDX:
ffff8800b912b600 RSI:
ffff8800b912b600 RDI:
ffff88013a61e680
[ 740.961782] RBP:
ffff88013a247e48 R08:
ffff88013a246000 R09:
000000000002c09d
[ 740.961808] R10:
000000000000010f R11:
0000000000000000 R12:
ffff88013b00cc00
[ 740.961833] R13:
0000000000000000 R14:
ffff88013b00cf80 R15:
ffff88013a61e6b0
[ 740.961861] FS:
0000000000000000(0000) GS:
ffff88013fc00000(0000) knlGS:
0000000000000000
[ 740.961893] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[ 740.962001] CR2:
00000000000000b8 CR3:
00000000b24fe000 CR4:
00000000000407f0
[ 740.962001] Stack:
[ 740.962001]
0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
[ 740.962001]
ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
[ 740.962001]
ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
[ 740.962001] Call Trace:
[ 740.962001] [<
ffffffff810639c6>] worker_thread+0x206/0x3f0
[ 740.962001] [<
ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
[ 740.962001] [<
ffffffff81069656>] kthread+0xc6/0xd0
[ 740.962001] [<
ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 740.962001] [<
ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[ 740.962001] [<
ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
[ 740.962001] RIP [<
ffffffff81062570>] process_one_work+0x30/0x500
[ 740.962001] RSP <
ffff88013a247e08>
[ 740.962001] CR2:
0000000000000008
[ 740.962001] ---[ end trace
39181460000748de ]---
[ 740.962001] Kernel panic - not syncing: Fatal exception
This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
A worker is queued to handle them.
But before calling raid5_do_work, raid5d handles those
stripes making conf->active_stripe = 0.
So mddev_suspend() can return.
We might then free old worker resources before the queued
raid5_do_work() handled them. When it runs, it crashes.
raid5d() raid5_store_group_thread_cnt()
queue_work mddev_suspend()
handle_strips
active_stripe=0
free(old worker resources)
process_one_work
raid5_do_work
To avoid this, we should only flush the worker resources before freeing them.
This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
stable series.
Cc: stable@vger.kernel.org (3.12)
Fixes: b721420e8719131896b009b11edbbd27
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
majianpeng [Thu, 14 Nov 2013 04:16:19 +0000 (15:16 +1100)]
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
For R5_ReadNoMerge,it mean this bio can't merge with other bios or
request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the
same work.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Aurelien Jarno [Thu, 14 Nov 2013 04:16:19 +0000 (15:16 +1100)]
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
linux/raid/md_p.h is using conditionals depending on endianess and fails
with an error if neither of __BIG_ENDIAN, __LITTLE_ENDIAN or
__BYTE_ORDER are defined, but it doesn't include any header which can
define these constants. This make this header unusable alone.
This patch adds a #include <asm/byteorder.h> at the beginning of this
header to make it usable alone. This is needed to compile klibc on MIPS.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Fri, 15 Nov 2013 06:55:02 +0000 (14:55 +0800)]
raid1: Rewrite the implementation of iobarrier.
There is an iobarrier in raid1 because of contention between normal IO and
resync IO. It suspends all normal IO when resync/recovery happens.
However if normal IO is out side the resync window, there is no contention.
So this patch changes the barrier mechanism to only block IO that
could contend with the resync that is currently happening.
We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
start next_resync start_next_window end_window
start + RESYNC_WINDOW = next_resync
next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
start_next_window + NEXT_NORMALIO_DISTANCE = end_window
Firstly we introduce some concepts:
1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
and start_next_window. It also indicates the distance between
start_next_window and end_window.
It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
next_resync.
5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
beyond next_resync. Normal-io after this position doesn't need to
wait for resync-io to complete.
6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
next_resync. This also doesn't need to wait, but is counted
differently.
7 - current_window_requests: the count of normalIO between
start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.
NormalIO will be partitioned into four types:
NormIO1: the end sector of bio is smaller or equal the start
NormIO2: the start sector of bio larger or equal to end_window
NormIO3: the start sector of bio larger or equal to
start_next_window.
NormIO4: the location between start_next_window and end_window
|--------|-----------|--------------------|----------------|-------------|
| start | next_resync | start_next_window | end_window |
NormIO1 NormIO4 NormIO4 NormIO3 NormIO2
For NormIO1, we don't need any io barrier.
For NormIO4, we used a similar approach to the original iobarrier
mechanism. The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
and "next_window_requests". They indicate the count of active
requests in the two window.
For these, we don't wait for resync io to complete.
For resync action, if there are NormIO4s, we must wait for it.
If not, we can proceed.
But if resync action reaches start_next_window and
current_window_requests > 0 (that is there are NormIO3s), we must
wait until the current_window_requests becomes zero.
When current_window_requests becomes zero, start_next_window also
moves forward. Then current_window_requests will replaced by
next_window_requests.
There is a problem which when and how to change from NormIO2 to
NormIO3. Only then can sync action progress.
We add a field in struct r1conf "start_next_window".
A: if start_next_window == MaxSector, it means there are no NormIO2/3.
So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 && next_window_requests != 0, it
means start_next_window move to end_window
There is another problem which how to differentiate between
old NormIO2(now it is NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and a bio which is
NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.
We add a field in struct r1bio "start_next_window".
This is used to record the position conf->start_next_window when the call
to wait_barrier() is made in make_request().
In allow_barrier(), we check the conf->start_next_window.
If r1bio->stat_next_window == conf->start_next_window, it means
there is no transition between NormIO2 and NormIO3.
If r1bio->start_next_window != conf->start_next_window, it mean
there was a transition between NormIO2 and NormIO3. There can only
have been one transition. So it only means the bio is old NormIO2.
For one bio, there may be many r1bio's. So we make sure
all the r1bio->start_next_window are the same value.
If we met blocked_dev in make_request(), it must call allow_barrier
and wait_barrier. So the former and the later value of
conf->start_next_window will be change.
If there are many r1bio's with differnet start_next_window,
for the relevant bio, it depend on the last value of r1bio.
It will cause error. To avoid this, we must wait for previous r1bios
to complete.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
raid1: Add some macros to make code clearly.
In a subsequent patch, we'll use some const parameters.
Using macros will make the code clearly.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
We used to use raise_barrier to suspend normal IO while we reconfigure
the array. However raise_barrier will soon only suspend some normal
IO, not all. So we need something else.
Change it to use freeze_array.
But freeze_array not only suspends normal io, it also suspends
resync io.
For the place where call raise_barrier for reconfigure, it isn't a
problem.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
raid1: Add a field array_frozen to indicate whether raid in freeze state.
Because the following patch will rewrite the content between normal IO
and resync IO. So we used a parameter to indicate whether raid is in freeze
array.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Joe Perches [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
md: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 14 Nov 2013 04:16:17 +0000 (15:16 +1100)]
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
When raid5 recovery hits a fresh badblock, this badblock will flagged as unack
badblock until md_update_sb() is called.
But md_stop will take reconfig lock which means raid5d can't call
md_update_sb() in md_check_recovery(), the badblock will always
be unack, so raid5d thread enters an infinite loop and md_stop_write()
can never stop sync_thread. This causes deadlock.
To solve this, when STOP_ARRAY ioctl is issued and sync_thread is
running, we need set md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take reconfig lock.
This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.
Reported-by: Jianpeng Ma <majianpeng@gmail.com>
Reported-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Tue, 19 Nov 2013 01:02:01 +0000 (12:02 +1100)]
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.
So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.
Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 14 Nov 2013 06:54:51 +0000 (17:54 +1100)]
md: fix some places where mddev_lock return value is not checked.
Sometimes we need to lock and mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.
Signed-off-by: NeilBrown <neilb@suse.de>
Bian Yu [Thu, 14 Nov 2013 04:16:17 +0000 (15:16 +1100)]
raid5: Retry R5_ReadNoMerge flag when hit a read error.
Because of block layer merge, one bio fails will cause other bios
which belongs to the same request fails, so raid5_end_read_request
will record all these bios as badblocks.
If retry request with R5_ReadNoMerge flag to avoid bios merge,
badblocks can only record sector which is bad exactly.
test:
hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
mdadm /dev/md0 -f /dev/sdd
mdadm /dev/md0 -r /dev/sdd
mdadm --zero-superblock /dev/sdd
mdadm /dev/md0 -a /dev/sdd
1. Without this patch:
cat /sys/block/md0/md/rd*/bad_blocks
299776 256
299776 256
2. With this patch:
cat /sys/block/md0/md/rd*/bad_blocks
300000 8
300000 8
Signed-off-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Shaohua Li [Thu, 14 Nov 2013 04:16:17 +0000 (15:16 +1100)]
raid5: relieve lock contention in get_active_stripe()
track empty inactive list count, so md_raid5_congested() can use it to make
decision.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Al Viro [Tue, 19 Nov 2013 01:20:43 +0000 (01:20 +0000)]
seq_file: always clear m->count when we free m->buf
Once we'd freed m->buf, m->count should become zero - we have no valid
contents reachable via m->buf.
Reported-by: Charley (Hao Chuan) Chu <charley.chu@broadcom.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael J. Wysocki [Tue, 19 Nov 2013 00:07:08 +0000 (01:07 +0100)]
Merge branch 'pm-sleep'
* pm-sleep:
PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
Rafael J. Wysocki [Tue, 19 Nov 2013 00:07:00 +0000 (01:07 +0100)]
Merge branch 'pm-cpufreq'
* pm-cpufreq:
cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
cpufreq: conservative: set requested_freq to policy max when it is over policy max
Rafael J. Wysocki [Tue, 19 Nov 2013 00:06:49 +0000 (01:06 +0100)]
Merge branch 'pm-runtime'
* pm-runtime:
PM / Runtime: Fix error path for prepare
PM / Runtime: Update documentation around probe|remove|suspend
Rafael J. Wysocki [Tue, 19 Nov 2013 00:06:38 +0000 (01:06 +0100)]
Merge branch 'pm-tools'
* pm-tools:
tools / power turbostat: Support Silvermont
Rafael J. Wysocki [Tue, 19 Nov 2013 00:06:28 +0000 (01:06 +0100)]
Merge branch 'pm-cpuidle'
* pm-cpuidle:
intel_idle: Support Intel Atom Processor C2000 Product Family