Alex Shi [Thu, 28 Jun 2012 01:02:23 +0000 (09:02 +0800)]
x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
There are 32 INVALIDATE_TLB_VECTOR now in kernel. That is quite big
amount of vector in IDT. But it is still not enough, since modern x86
sever has more cpu number. That still causes heavy lock contention
in TLB flushing.
The patch using generic smp call function to replace it. That saved 32
vector number in IDT, and resolved the lock contention in TLB
flushing on large system.
In the NHM EX machine 4P * 8cores * HT = 64 CPUs, hackbench pthread
has 3% performance increase.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:22 +0000 (09:02 +0800)]
x86/tlb: enable tlb flush range support for x86
Not every tlb_flush execution moment is really need to evacuate all
TLB entries, like in munmap, just few 'invlpg' is better for whole
process performance, since it leaves most of TLB entries for later
accessing.
This patch also rewrite flush_tlb_range for 2 purposes:
1, split it out to get flush_blt_mm_range function.
2, clean up to reduce line breaking, thanks for Borislav's input.
My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
show that the random memory access on other CPU has 0~50% speed up
on a 2P * 4cores * HT NHM EP while do 'munmap'.
Thanks Yongjie's testing on this patch:
-------------
I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
kernel.
After running two benchmarks in Xen HVM guest, I found his patches
brought about 1%~3% performance gain in 'kernel build' and 'netperf'
testing, though the performance gain was not very stable in 'kernel
build' testing.
Some detailed testing results are below.
Testing Environment:
Hardware: Romley-EP platform
Xen version: latest upstream
Linux kernel: 3.4-RC6
Guest vCPU number: 8
NIC: Intel 82599 (10GB bandwidth)
In 'kernel build' testing in guest:
Command line | performance gain
make -j 4 | 3.81%
make -j 8 | 0.37%
make -j 16 | -0.52%
In 'netperf' testing, we tested TCP_STREAM with default socket size
16384 byte as large packet and 64 byte as small packet.
I used several clients to add networking pressure, then 'netperf' server
automatically generated several threads to response them.
I also used large-size packet and small-size packet in the testing.
Packet size | Thread number | performance gain
16384 bytes | 4 | 0.02%
16384 bytes | 8 | 2.21%
16384 bytes | 16 | 2.04%
64 bytes | 4 | 1.07%
64 bytes | 8 | 3.31%
64 bytes | 16 | 0.71%
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:21 +0000 (09:02 +0800)]
mm/mmu_gather: enable tlb flush range in generic mmu_gather
This patch enabled the tlb flush range support in generic mmu layer.
Most of arch has self tlb flush range support, like ARM/IA64 etc.
X86 arch has no this support in hardware yet. But another instruction
'invlpg' can implement this function in some degree. So, enable this
feather in generic layer for x86 now. and maybe useful for other archs
in further.
Generic mmu_gather struct is protected by micro
HAVE_GENERIC_MMU_GATHER. Other archs that has flush range supported
own self mmu_gather struct. So, now this change is safe for them.
In future we may unify this struct and related functions on multiple
archs.
Thanks for Peter Zijlstra time and time reminder for multiple
architecture code safe!
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:20 +0000 (09:02 +0800)]
x86/tlb: add tlb_flushall_shift knob into debugfs
kernel will replace cr3 rewrite with invlpg when
tlb_flush_entries <= active_tlb_entries / 2^tlb_flushall_factor
if tlb_flushall_factor is -1, kernel won't do this replacement.
User can modify its value according to specific CPU/applications.
Thanks for Borislav providing the help message of
CONFIG_DEBUG_TLBFLUSH.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:19 +0000 (09:02 +0800)]
x86/tlb: add tlb_flushall_shift for specific CPU
Testing show different CPU type(micro architectures and NUMA mode) has
different balance points between the TLB flush all and multiple invlpg.
And there also has cases the tlb flush change has no any help.
This patch give a interface to let x86 vendor developers have a chance
to set different shift for different CPU type.
like some machine in my hands, balance points is 16 entries on
Romely-EP; while it is at 8 entries on Bloomfield NHM-EP; and is 256 on
IVB mobile CPU. but on model 15 core2 Xeon using invlpg has nothing
help.
For untested machine, do a conservative optimization, same as NHM CPU.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:18 +0000 (09:02 +0800)]
x86/tlb: fall back to flush all when meet a THP large page
We don't need to flush large pages by PAGE_SIZE step, that just waste
time. and actually, large page don't need 'invlpg' optimizing according
to our micro benchmark. So, just flush whole TLB is enough for them.
The following result is tested on a 2CPU * 4cores * 2HT NHM EP machine,
with THP 'always' setting.
Multi-thread testing, '-t' paramter is thread number:
without this patch with this patch
./mprotect -t 1 14ns 13ns
./mprotect -t 2 13ns 13ns
./mprotect -t 4 12ns 11ns
./mprotect -t 8 14ns 10ns
./mprotect -t 16 28ns 28ns
./mprotect -t 32 54ns 52ns
./mprotect -t 128 200ns 200ns
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:17 +0000 (09:02 +0800)]
x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
x86 has no flush_tlb_range support in instruction level. Currently the
flush_tlb_range just implemented by flushing all page table. That is not
the best solution for all scenarios. In fact, if we just use 'invlpg' to
flush few lines from TLB, we can get the performance gain from later
remain TLB lines accessing.
But the 'invlpg' instruction costs much of time. Its execution time can
compete with cr3 rewriting, and even a bit more on SNB CPU.
So, on a 512 4KB TLB entries CPU, the balance points is at:
(512 - X) * 100ns(assumed TLB refill cost) =
X(TLB flush entries) * 100ns(assumed invlpg cost)
Here, X is 256, that is 1/2 of 512 entries.
But with the mysterious CPU pre-fetcher and page miss handler Unit, the
assumed TLB refill cost is far lower then 100ns in sequential access. And
2 HT siblings in one core makes the memory access more faster if they are
accessing the same memory. So, in the patch, I just do the change when
the target entries is less than 1/16 of whole active tlb entries.
Actually, I have no data support for the percentage '1/16', so any
suggestions are welcomed.
As to hugetlb, guess due to smaller page table, and smaller active TLB
entries, I didn't see benefit via my benchmark, so no optimizing now.
My micro benchmark show in ideal scenarios, the performance improves 70
percent in reading. And in worst scenario, the reading/writing
performance is similar with unpatched 3.4-rc4 kernel.
Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
'always':
multi thread testing, '-t' paramter is thread number:
with patch unpatched 3.4-rc4
./mprotect -t 1 14ns 24ns
./mprotect -t 2 13ns 22ns
./mprotect -t 4 12ns 19ns
./mprotect -t 8 14ns 16ns
./mprotect -t 16 28ns 26ns
./mprotect -t 32 54ns 51ns
./mprotect -t 128 200ns 199ns
Single process with sequencial flushing and memory accessing:
with patch unpatched 3.4-rc4
./mprotect 7ns 11ns
./mprotect -p 4096 -l 8 -n 10240
21ns 21ns
[ hpa: http://lkml.kernel.org/r/
1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
has additional performance numbers. ]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Alex Shi [Thu, 28 Jun 2012 01:02:16 +0000 (09:02 +0800)]
x86/tlb_info: get last level TLB entry number of CPU
For 4KB pages, x86 CPU has 2 or 1 level TLB, first level is data TLB and
instruction TLB, second level is shared TLB for both data and instructions.
For hupe page TLB, usually there is just one level and seperated by 2MB/4MB
and 1GB.
Although each levels TLB size is important for performance tuning, but for
genernal and rude optimizing, last level TLB entry number is suitable. And
in fact, last level TLB always has the biggest entry number.
This patch will get the biggest TLB entry number and use it in furture TLB
optimizing.
Accroding Borislav's suggestion, except tlb_ll[i/d]_* array, other
function and data will be released after system boot up.
For all kinds of x86 vendor friendly, vendor specific code was moved to its
specific files.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-2-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Vlad Zolotarov [Mon, 11 Jun 2012 09:56:52 +0000 (12:56 +0300)]
x86: Add read_mostly declaration/definition to variables from smp.h
Add "read-mostly" qualifier to the following variables in
smp.h:
- cpu_sibling_map
- cpu_core_map
- cpu_llc_shared_map
- cpu_llc_id
- cpu_number
- x86_cpu_to_apicid
- x86_bios_cpu_apicid
- x86_cpu_to_logical_apicid
As long as all the variables above are only written during the
initialization, this change is meant to prevent the false
sharing. More specifically, on vSMP Foundation platform
x86_cpu_to_apicid shared the same internode_cache_line with
frequently written lapic_events.
From the analysis of the first 33 per_cpu variables out of 219
(memories they describe, to be more specific) the 8 have read_mostly
nature (tlb_vector_offset, cpu_loops_per_jiffy, xen_debug_irq, etc.)
and 25 are frequently written (irq_stack_union, gdt_page,
exception_stacks, idt_desc, etc.).
Assuming that the spread of the rest of the per_cpu variables is
similar, identifying the read mostly memories will make more sense
in terms of long-term code maintenance comparing to identifying
frequently written memories.
Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Cc: Shai Fultheim (Shai@ScaleMP.com) <Shai@scalemp.com>
Cc: ido@wizery.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1719258.EYKzE4Zbq5@vlad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ido Yariv [Mon, 11 Jun 2012 09:56:45 +0000 (12:56 +0300)]
x86: Define early read-mostly per-cpu macros
Some read-mostly per-cpu data may need to be declared or defined
early, so it can be initialized and accessed before per_cpu
areas are allocated.
Only the data that resides in the per_cpu areas should be
read-mostly, as there is little benefit in optimizing cache
lines on initialization.
Signed-off-by: Ido Yariv <ido@wizery.com>
[ Added the missing declarations in !SMP code. ]
Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/46188571.ddB8aVQYWo@vlad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Mon, 11 Jun 2012 03:57:43 +0000 (06:57 +0300)]
Merge tag 'regmap-3.5' of git://git./linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Nothing too exciting - a cleanup for debugfs in error handling and a
fix for the padding (which has only just acquired real use) and
exporting a function that's supposed to be usable by drivers."
* tag 'regmap-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: Export regmap_reinit_cache()
regmap: Fix the size calculation for map->format.buf_size
regmap: clean up debugfs if regmap_init fails
Linus Torvalds [Mon, 11 Jun 2012 03:53:48 +0000 (06:53 +0300)]
Merge tag 'regulator-3.5' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of small fixes, plus larger fixes for the gpio-regulator
driver the most recent changes for which had apparently not been
tested at all in -next (or elsewhere from the looks of it)."
* tag 'regulator-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Properly handle the case min_uV < rdev->desc->min_uV in map_voltage_linear
regulator: max8649: fix missing regmap in rdev
regulator: gpio-regulator: populate selector from set_voltage
regulator: gpio-regulator: Fix finding of smallest value
regulator: gpio-regulator: do not pass drvdata pointer as reference
regulator: anatop: Use correct __devexit_p annotation
regulator: palmas: Fix wrong kfree calls
Linus Torvalds [Sat, 9 Jun 2012 01:40:09 +0000 (18:40 -0700)]
Linux 3.5-rc2
David Rientjes [Fri, 8 Jun 2012 20:21:26 +0000 (13:21 -0700)]
mm, oom: fix badness score underflow
If the privileges given to root threads (3% of allowable memory) or a
negative value of /proc/pid/oom_score_adj happen to exceed the amount of
rss of a thread, its badness score overflows as a result of commit
a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
for userspace").
Fix this by making the type signed and return 1, meaning the thread is
still eligible for kill, if the value is negative.
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 8 Jun 2012 21:59:29 +0000 (14:59 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix the relax_domain_level boot parameter
sched: Validate assumptions in sched_init_numa()
sched: Always initialize cpu-power
sched: Fix domain iteration
sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
sched/numa: Load balance between remote nodes
sched/x86: Calculate booted cores after construction of sibling_mask
Randy Dunlap [Fri, 8 Jun 2012 20:18:33 +0000 (13:18 -0700)]
sched/fair: fix lots of kernel-doc warnings
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
.. more warnings
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 8 Jun 2012 21:53:06 +0000 (14:53 -0700)]
Revert "drm/i915/crt: Do not rely upon the HPD presence pin"
This reverts commit
9e612a008fa7fe493a473454def56aa321479495.
It incorrectly finds VGA connectors where none are attached, apparently
not noticing that nothing replied to the EDID queries, and happily using
the default EDID modes that have nothing to do with actual hardware.
That in turn then causes X to fall down to the lowest common
denominator, which is usually the default 1024x768 mode that is in the
default EDID and pretty much anything supports).
I'd suggest that if not relying on the HDP pin, the code should at least
check whether it gets valid EDID data back, rather than just assume
there's something on the VGA connector.
Cc: Dave Airlie <airlied@linux.ie>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 8 Jun 2012 18:15:31 +0000 (11:15 -0700)]
Merge tag 'ext4_for_linus' of git://git./linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Theodore Ts'o:
"This update contains two bug fixes, both destined for the stable tree.
Perhaps the most important is one which fixes ext4 when used with file
systems originally formatted for use with ext3, but then later
converted to take advantage of ext4."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: don't set i_flags in EXT4_IOC_SETFLAGS
ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg
Linus Torvalds [Fri, 8 Jun 2012 18:06:01 +0000 (11:06 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/paulus/powerpc
Pull powerpc fixes from Paul Mackerras:
"Two small fixes for powerpc:
- a fix for a regression since 3.2 that causes 4-second (or longer)
pauses
- a fix for a potential oops when loading kernel modules on 32-bit
embedded systems."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
powerpc: Fix kernel panic during kernel module load
powerpc/time: Sanity check of decrementer expiration is necessary
Linus Torvalds [Fri, 8 Jun 2012 18:04:06 +0000 (11:04 -0700)]
Merge tag 'upstream-3.5-rc2' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Artem Bityutskiy:
"Fix UBI and UBIFS - they refuse to work without debugfs. This was
broken by the 3.5-rc1 UBI/UBIFS changes when we removed the debugging
Kconfig switches.
Also, correct locking in 'ubi_wl_flush()' - it was extended to support
flushing a specific LEB in 3.5-rc1, and the locking was sub-optimal."
* tag 'upstream-3.5-rc2' of git://git.infradead.org/linux-ubifs:
UBI: correct ubi_wl_flush locking
UBIFS: fix debugfs-less systems support
UBI: fix debugfs-less systems support
Linus Torvalds [Fri, 8 Jun 2012 17:34:03 +0000 (10:34 -0700)]
Revert "vfs: stop d_splice_alias creating directory aliases"
This reverts commit
7732a557b1342c6e6966efb5f07effcf99f56167 (and commit
3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up
cleanup).
We're chasing an elusive bug that Dave Jones can apparently reproduce
using his system call fuzzer tool, and that looks like some kind of
locking ordering problem on the directory i_mutex chain. Our i_mutex
locking is rather complex, and depends on the topological ordering of
the directories, which is why we have been very wary of splicing
directory entries around.
Of course, we really don't want to ever see aliased unconnected
directories anyway, so none of this should ever happen, but this revert
aims to basically get us back to a known older state.
Bruce points to some of the previous discussion at
http://marc.info/?i=<
20110310105821.GE22723@ZenIV.linux.org.uk>
and in particular a long post from Neil:
http://marc.info/?i=<
20110311150749.
2fa2be66@notabene.brown>
It should be noted that it's possible that Dave's problems come from
other changes altohgether, including possibly just the fact that Dave
constantly is teachning his fuzzer new tricks. So what appears to be a
new bug could in fact be an old one that just gets newly triggered, but
reverting these patches as "still under heavy discussion" is the right
thing regardless.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 8 Jun 2012 16:26:55 +0000 (09:26 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix section mismatch warnings on 32-bit
x86/uv: Fix UV2 BAU legacy mode
x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space
x86, efi stub: Add .reloc section back into image
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86/reboot: Fix a warning message triggered by stop_other_cpus()
x86/intel/moorestown: Change intel_scu_devices_create() to __devinit
x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init()
x86/gart: Fix kmemleak warning
x86: mce: Add the dropped timer interval init back
x86/mce: Fix the MCE poll timer logic
Linus Torvalds [Fri, 8 Jun 2012 16:14:46 +0000 (09:14 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)
There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...
Linus Torvalds [Fri, 8 Jun 2012 16:12:21 +0000 (09:12 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm intel and exynos fixes from Dave Airlie:
"A bunch of fixes for Intel and exynos, nothing too major, a new intel
PCI ID, and a fix for CRT detection."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
char/agp: add another Ironlake host bridge
drm/i915: fix up ivb plane 3 pageflips
drm/exynos: fixed blending for hdmi graphic layer
drm/exynos: Remove dummy encoder get_crtc operation implementation
drm/exynos: Keep a reference to frame buffer GEM objects
drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
drm/exynos: fixed size type.
drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}
drm/i915: hold forcewake around ring hw init
drm/i915: Mark the ringbuffers as being in the GTT domain
drm/i915/crt: Do not rely upon the HPD presence pin
drm/i915: Reset last_retired_head when resetting ring
Linus Torvalds [Fri, 8 Jun 2012 16:11:33 +0000 (09:11 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull leap second timer fix from Thomas Gleixner.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
Linus Torvalds [Fri, 8 Jun 2012 16:10:35 +0000 (09:10 -0700)]
Merge tag 'moduleparam-for-linus' of git://git./linux/kernel/git/rusty/linux-2.6-for-linus
Pull minor module param fixes from Rusty Russell:
"One bugfix for multiple moduleparam levels, one removal of overzealous
printk."
* tag 'moduleparam-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
init: Drop initcall level output
module_param: stop double-calling parameters.
Don Zickus [Wed, 6 Jun 2012 14:05:42 +0000 (10:05 -0400)]
x86/nmi: Fix section mismatch warnings on 32-bit
It was reported that compiling for 32-bit caused a bunch of
section mismatch warnings:
VDSOSYM arch/x86/vdso/vdso32-syms.lds
LD arch/x86/vdso/built-in.o
LD arch/x86/built-in.o
WARNING: arch/x86/built-in.o(.data+0x5af0): Section mismatch in
reference from the variable test_nmi_ipi_callback_na.10451 to
the function .init.text:test_nmi_ipi_callback() [...]
WARNING: arch/x86/built-in.o(.data+0x5b04): Section mismatch in
reference from the variable nmi_unk_cb_na.10399 to the function
.init.text:nmi_unk_cb() The variable nmi_unk_cb_na.10399
references the function __init nmi_unk_cb() [...]
Both of these are attributed to the internal representation of
the nmiaction struct created during register_nmi_handler. The
reason for this is that those structs are not defined in the
init section whereas the rest of the code in nmi_selftest.c is.
To resolve this, I created a new #define,
register_nmi_handler_initonly, that tags the struct as
__initdata to resolve the mismatch. This #define should only be
used in rare situations where the register/unregister is called
during init of the kernel.
Big thanks to Jan Beulich for decoding this for me as I didn't
have a clue what was going on.
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Tested-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1338991542-23000-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steffen Rumler [Wed, 6 Jun 2012 14:37:17 +0000 (16:37 +0200)]
powerpc: Fix kernel panic during kernel module load
This fixes a problem which can causes kernel oopses while loading
a kernel module.
According to the PowerPC EABI specification, GPR r11 is assigned
the dedicated function to point to the previous stack frame.
In the powerpc-specific kernel module loader, do_plt_call()
(in arch/powerpc/kernel/module_32.c), GPR r11 is also used
to generate trampoline code.
This combination crashes the kernel, in the case where the compiler
chooses to use a helper function for saving GPRs on entry, and the
module loader has placed the .init.text section far away from the
.text section, meaning that it has to generate a trampoline for
functions in the .init.text section to call the GPR save helper.
Because the trampoline trashes r11, references to the stack frame
using r11 can cause an oops.
The fix just uses GPR r12 instead of GPR r11 for generating the
trampoline code. According to the statements from Freescale, this is
safe from an EABI perspective.
I've tested the fix for kernel 2.6.33 on MPC8541.
Cc: stable@vger.kernel.org
Signed-off-by: Steffen Rumler <steffen.rumler.ext@nsn.com>
[paulus@samba.org: reworded the description]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cliff Wickman [Thu, 7 Jun 2012 13:31:40 +0000 (08:31 -0500)]
x86/uv: Fix UV2 BAU legacy mode
The SGI Altix UV2 BAU (Broadcast Assist Unit) as used for
tlb-shootdown (selective broadcast mode) always uses UV2
broadcast descriptor format. There is no need to clear the
'legacy' (UV1) mode, because the hardware always uses UV2 mode
for selective broadcast.
But the BIOS uses general broadcast and legacy mode, and the
hardware pays attention to the legacy mode bit for general
broadcast. So the kernel must not clear that mode bit.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/E1SccoO-0002Lh-Cb@eag09.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yinghai Lu [Wed, 6 Jun 2012 17:55:40 +0000 (10:55 -0700)]
x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space
Robin found this regression:
| I just tried to boot an 8TB system. It fails very early in boot with:
| Kernel panic - not syncing: Cannot find space for the kernel page tables
git bisect commit
722bc6b16771ed80871e1fd81c86d3627dda2ac8.
A git revert of that commit does boot past that point on the 8TB
configuration.
That commit will add up extra pages for all memory range even
above 4g.
Try to limit that extra page count adding to first entry only.
Bisected-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/CAE9FiQUj3wyzQxtq9yzBNc9u220p8JZ1FYHG7t%3DMOzJ%3D9BZMYA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Airlie [Fri, 8 Jun 2012 08:42:51 +0000 (09:42 +0100)]
Merge branch 'exynos-drm-fixes' of git://git.infradead.org/users/kmpark/linux-samsung into drm-fixes
* 'exynos-drm-fixes' of git://git.infradead.org/users/kmpark/linux-samsung:
drm/exynos: fixed blending for hdmi graphic layer
drm/exynos: Remove dummy encoder get_crtc operation implementation
drm/exynos: Keep a reference to frame buffer GEM objects
drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
drm/exynos: fixed size type.
drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}
Dave Airlie [Fri, 8 Jun 2012 08:42:35 +0000 (09:42 +0100)]
Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes
* 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel:
drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
char/agp: add another Ironlake host bridge
drm/i915: fix up ivb plane 3 pageflips
drm/i915: hold forcewake around ring hw init
drm/i915: Mark the ringbuffers as being in the GTT domain
drm/i915/crt: Do not rely upon the HPD presence pin
drm/i915: Reset last_retired_head when resetting ring
Borislav Petkov [Fri, 1 Jun 2012 16:56:00 +0000 (18:56 +0200)]
init: Drop initcall level output
9fb48c744ba6a ("params: add 3rd arg to option handler callback
signature") added similar lines to dmesg:
initlevel:0=early, 4 registered initcalls
initlevel:1=core, 31 registered initcalls
initlevel:2=postcore, 11 registered initcalls
initlevel:3=arch, 7 registered initcalls
initlevel:4=subsys, 40 registered initcalls
initlevel:5=fs, 30 registered initcalls
initlevel:6=device, 250 registered initcalls
initlevel:7=late, 35 registered initcalls
but they don't contain any info for the general user staring at dmesg.
I'm very doubtful the count of initcalls registered per level helps
anyone so drop that output completely.
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Fri, 8 Jun 2012 05:28:13 +0000 (14:58 +0930)]
module_param: stop double-calling parameters.
Commit
026cee0086fe1df4cf74691cf273062cc769617d "params:
<level>_initcall-like kernel parameters" set old-style module
parameters to level 0. And we call those level 0 calls where we used
to, early in start_kernel().
We also loop through the initcall levels and call the levelled
module_params before the corresponding initcall. Unfortunately level
0 is early_init(), so we call the standard module_param calls twice.
(Turns out most things don't care, but at least ubi.mtd does).
Change the level to -1 for standard module_param calls.
Reported-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Paul Mackerras [Fri, 1 Jun 2012 08:13:43 +0000 (18:13 +1000)]
powerpc/time: Sanity check of decrementer expiration is necessary
This reverts
68568add2c ("powerpc/time: Remove unnecessary sanity check
of decrementer expiration"). We do need to check whether we have reached
the expiration time of the next event, because we sometimes get an early
decrementer interrupt, most notably when we set the decrementer to 1 in
arch_irq_work_raise(). The effect of not having the sanity check is that
if timer_interrupt() gets called early, we leave the decrementer set to
its maximum value, which means we then don't get any more decrementer
interrupts for about 4 seconds (or longer, depending on timebase
frequency). I saw these pauses as a consequence of getting a stray
hypervisor decrementer interrupt left over from exiting a KVM guest.
This isn't quite a straight revert because of changes to the surrounding
code, but it restores the same algorithm as was previously used.
Cc: stable@vger.kernel.org
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Linus Torvalds [Fri, 8 Jun 2012 00:54:07 +0000 (17:54 -0700)]
Revert "mm: correctly synchronize rss-counters at exit/exec"
This reverts commit
40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.
It's horribly and utterly broken for at least the following reasons:
- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.
- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.
- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.
.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Lin [Thu, 7 Jun 2012 01:52:12 +0000 (09:52 +0800)]
regulator: core: Properly handle the case min_uV < rdev->desc->min_uV in map_voltage_linear
Properly handle the case if the specified min_uV is less than the voltage given
by the lowest selector.
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Tao Ma [Thu, 7 Jun 2012 23:04:19 +0000 (19:04 -0400)]
ext4: don't set i_flags in EXT4_IOC_SETFLAGS
Commit
7990696 uses the ext4_{set,clear}_inode_flags() functions to
change the i_flags automatically but fails to remove the error setting
of i_flags. So we still have the problem of trashing state flags.
Fix this by removing the assignment.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Theodore Ts'o [Thu, 7 Jun 2012 22:56:06 +0000 (18:56 -0400)]
ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg
Ext3 filesystems that are converted to use as many ext4 file system
features as possible will enable uninit_bg to speed up e2fsck times.
These file systems will have a native ext3 layout of inode tables and
block allocation bitmaps (as opposed to ext4's flex_bg layout).
Unfortunately, in these cases, when first allocating a block in an
uninitialized block group, ext4 would incorrectly calculate the number
of free blocks in that block group, and then errorneously report that
the file system was corrupt:
EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd
This problem can be reproduced via:
mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g
mount -t ext4 /dev/vdd /mnt
fallocate -l 4600m /mnt/test
The problem was caused by a bone headed mistake in the check to see if a
particular metadata block was part of the block group.
Many thanks to Kees Cook for finding and bisecting the buggy commit
which introduced this bug (commit
fd034a84e1, present since v3.2).
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: stable@kernel.org
Linus Torvalds [Thu, 7 Jun 2012 22:05:43 +0000 (15:05 -0700)]
Merge branch 'akpm' (Andrew's fixups)
Merge random fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (11 patches)
mm: correctly synchronize rss-counters at exit/exec
btree: catch NULL value before it does harm
btree: fix tree corruption in btree_get_prev()
ipc: shm: restore MADV_REMOVE functionality on shared memory segments
drivers/platform/x86/acerhdf.c: correct Boris' mail address
c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
c/r: prctl: add ability to get clear_tid_address
c/r: prctl: add minimal address test to PR_SET_MM
c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal
MAINTAINERS: whitespace fixes
shmem: replace_page must flush_dcache and others
Mark Brown [Mon, 14 May 2012 09:00:12 +0000 (10:00 +0100)]
regmap: Export regmap_reinit_cache()
It's supposed to be there for drivers.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Konstantin Khlebnikov [Thu, 7 Jun 2012 21:21:14 +0000 (14:21 -0700)]
mm: correctly synchronize rss-counters at exit/exec
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
| BUG: Bad rss-counter state mm:
ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:
ffff88020813c380 idx:2 val:1
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joern Engel [Thu, 7 Jun 2012 21:21:14 +0000 (14:21 -0700)]
btree: catch NULL value before it does harm
Storing NULL values in the btree is illegal and can lead to memory
corruption and possible other fun as well. Catch it on insert, instead
of waiting for the inevitable.
Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roland Dreier [Thu, 7 Jun 2012 21:21:13 +0000 (14:21 -0700)]
btree: fix tree corruption in btree_get_prev()
The memory the parameter __key points to is used as an iterator in
btree_get_prev(), so if we save off a bkey() pointer in retry_key and
then assign that to __key, we'll end up corrupting the btree internals
when we do eg
longcpy(__key, bkey(geo, node, i), geo->keylen);
to return the key value. What we should do instead is use longcpy() to
copy the key value that retry_key points to __key.
This can cause a btree to get corrupted by seemingly read-only
operations such as btree_for_each_safe.
[akpm@linux-foundation.org: avoid the double longcpy()]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Thu, 7 Jun 2012 21:21:13 +0000 (14:21 -0700)]
ipc: shm: restore MADV_REMOVE functionality on shared memory segments
Commit
17cf28afea2a ("mm/fs: remove truncate_range") removed the
truncate_range inode operation in favour of the fallocate file
operation.
When using SYSV IPC shared memory segments, calling madvise with the
MADV_REMOVE advice on an area of shared memory will attempt to invoke
the .fallocate function for the shm_file_operations, which is NULL and
therefore returns -EOPNOTSUPP to userspace. The previous behaviour
would inherit the inode_operations from the underlying tmpfs file and
invoke truncate_range there.
This patch restores the previous behaviour by wrapping the underlying
fallocate function in shm_fallocate, as we do for fsync.
[hughd@google.com: use -ENOTSUPP in shm_fallocate()]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Borislav Petkov [Thu, 7 Jun 2012 21:21:12 +0000 (14:21 -0700)]
drivers/platform/x86/acerhdf.c: correct Boris' mail address
Correct mail address reference to a mail account which I actually read.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Peter Feuerer <peter@piie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cyrill Gorcunov [Thu, 7 Jun 2012 21:21:12 +0000 (14:21 -0700)]
c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
In commit
b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).
So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.
As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cyrill Gorcunov [Thu, 7 Jun 2012 21:21:12 +0000 (14:21 -0700)]
c/r: prctl: add ability to get clear_tid_address
Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().
We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.
Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.
This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.
This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.
[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cyrill Gorcunov [Thu, 7 Jun 2012 21:21:11 +0000 (14:21 -0700)]
c/r: prctl: add minimal address test to PR_SET_MM
Make sure the address being set is greater than mmap_min_addr (as
suggested by Kees Cook).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Thu, 7 Jun 2012 21:21:11 +0000 (14:21 -0700)]
c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal
A fix for commit
b32dfe377102 ("c/r: prctl: add ability to set new
mm_struct::exe_file").
After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until
final mmput(), it never becomes NULL while task is alive.
We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
order to forbid second changing of mm->exe_file.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Thu, 7 Jun 2012 21:21:10 +0000 (14:21 -0700)]
MAINTAINERS: whitespace fixes
Remove trailing spaces at EOL.
Always use a tab after the type :
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Thu, 7 Jun 2012 21:21:09 +0000 (14:21 -0700)]
shmem: replace_page must flush_dcache and others
Commit
bde05d1ccd51 ("shmem: replace page if mapping excludes its zone")
is not at all likely to break for anyone, but it was an earlier version
from before review feedback was incorporated. Fix that up now.
* shmem_replace_page must flush_dcache_page after copy_highpage [akpm]
* Expand comment on why shmem_unuse_inode needs page_swapcount [akpm]
* Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong]
* Check page_private matches swap before calling shmem_replace_page [hughd]
* shmem_replace_page allow for unexpected race in radix_tree lookup [hughd]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephane Marchesin <marcheu@chromium.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Rob Clark <rob.clark@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jordan Justen [Thu, 7 Jun 2012 16:05:21 +0000 (09:05 -0700)]
x86, efi stub: Add .reloc section back into image
Some UEFI firmware will not load a .efi with a .reloc section
with a size of 0.
Therefore, we create a .efi image with 4 main areas and 3 sections.
1. PE/COFF file header
2. .setup section (covers all setup code following the first sector)
3. .reloc section (contains 1 dummy reloc entry, created in build.c)
4. .text section (covers the remaining kernel image)
To make room for the new .setup section data, the header
bugger_off_msg had to be shortened.
Reported-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Link: http://lkml.kernel.org/r/1339085121-12760-1-git-send-email-jordan.l.justen@intel.com
Tested-by: Lee G Rosenbaum <lee.g.rosenbaum@intel.com>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Linus Torvalds [Thu, 7 Jun 2012 16:06:54 +0000 (09:06 -0700)]
Merge tag 'parisc-fixes' of git://git./linux/kernel/git/jejb/parisc-2.6
Pull PARISC fixes from James Bottomley:
"This is a set of three bug fixes for minor build breakages that got
introduced just before 3.5-rc1 was released."
* tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] fix code to find libgcc
[PARISC] fix compile break in use of lib/strncopy_from_user.c
[PARISC] fix missing TAINT_WARN problem
Linus Torvalds [Thu, 7 Jun 2012 16:06:13 +0000 (09:06 -0700)]
Merge git://git./linux/kernel/git/cmetcalf/linux-tile
Pull tile fixes from Chris Metcalf:
"These two minor bug fixes fix build failures from some changes that
were merged in during the 3.5 merge window."
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: add #include to unbreak build after generic init_task conversion
tile: remove cpu_idle_on_new_stack
Artem Bityutskiy [Thu, 7 Jun 2012 12:15:30 +0000 (15:15 +0300)]
UBI: correct ubi_wl_flush locking
Commit "
62f38455 UBI: modify ubi_wl_flush function to clear work queue for a lnum"
takes the 'work_sem' semaphore in write mode for the entire loop, which is not
very good because it will block other workers for potentially long time. We do
not need to have it in write mode - read mode is enough, and we do not need to
hole it over the entire loop. So this patch turns changes the locking: takes
'work_sem' in read mode and pushes it down to the loop.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Artem Bityutskiy [Wed, 6 Jun 2012 13:03:10 +0000 (16:03 +0300)]
UBIFS: fix debugfs-less systems support
Commit "
f70b7e5 UBIFS: remove Kconfig debugging option" broke UBIFS and it
refuses to initialize if debugfs (CONFIG_DEBUG_FS) is disabled. I incorrectly
assumed that debugfs files creation function will return success if debugfs
is disabled, but they actually return -ENODEV. This patch fixes the issue.
Reported-by: Paul Parsons <lost.distance@yahoo.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Paul Parsons <lost.distance@yahoo.com>
Artem Bityutskiy [Wed, 6 Jun 2012 12:22:41 +0000 (15:22 +0300)]
UBI: fix debugfs-less systems support
Commit "
aa44d1d UBI: remove Kconfig debugging option" broke UBI and it
refuses to initialize if debugfs (CONFIG_DEBUG_FS) is disabled. I incorrectly
assumed that debugfs files creation function will return success if debugfs
is disabled, but they actually return -ENODEV. This patch fixes the issue.
Reported-by: Paul Parsons <lost.distance@yahoo.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Paul Parsons <lost.distance@yahoo.com>
Adam Jackson [Wed, 6 Jun 2012 19:45:44 +0000 (15:45 -0400)]
drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
Cougar/Panther Point redefine the bits in SDEIIR pretty completely.
This function is just debugging, but if we're debugging we probably want
to be told accurate things instead of lies.
I'm told Lynx Point changes this yet more, but I have no idea how...
Note from Eugeni's review:
"For the record and for future enabling efforts, for LPT, bits 28-31
and 1-14 are gone since CPT/PPT (e.g., those must be zero). And there
is the bit 15 as a new addition, but we are not using it yet and
probably won't be using in foreseeable future."
Signed-off-by: Adam Jackson <ajax@redhat.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35103
Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Linus Torvalds [Wed, 6 Jun 2012 18:56:45 +0000 (11:56 -0700)]
Merge tag 'hwmon-for-linus' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Update e-mail address in MAINTAINERS"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
MAINTAINERS: Update my e-mail address
Linus Torvalds [Wed, 6 Jun 2012 17:47:15 +0000 (10:47 -0700)]
Merge branch 'release' of git://git./linux/kernel/git/lenb/linux
Pull ACPI and Power Management changes from Len Brown.
This does an evil merge to fix up what I think is a mismerge by Len to
the gma500 driver, and restore it to the mainline state.
In that driver, both branches had commented out the call to
acpi_video_register(), and Len resolved the merge to that commented-out
version.
However, in mainline, further changes by Alan (commit
d839ede47a56:
"gma500: opregion and ACPI" to be exact) had re-enabled the ACPI video
registration, so the current state of the driver seems to want it.
Alan is apparently still feeling the effects of partying with the Queen,
so he didn't reply to my query, but I'll do the evil merge anyway.
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPI: fix acpi_bus.h build warnings when ACPI is not enabled
drivers: acpi: Fix dependency for ACPI_HOTPLUG_CPU
tools/power turbostat: fix IVB support
tools/power turbostat: fix un-intended affinity of forked program
ACPI video: use after input_unregister_device()
gma500: don't register the ACPI video bus
acpi_video: Intel video is not always i915
acpi_video: fix leaking PCI references
ACPI: Ignore invalid _PSS entries, but use valid ones
ACPI battery: only refresh the sysfs files when pertinent information changes
Linus Torvalds [Wed, 6 Jun 2012 17:45:21 +0000 (10:45 -0700)]
Merge tag 'rdma-fixes' of git://git./linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA fixes from Roland Dreier:
"All in hardware drivers:
- Fix crash in cxgb4
- Fixes to new ocrdma driver
- Regression fixes for mlx4"
* tag 'rdma-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Fix max_wqe capacity reported from query device
mlx4_core: Fix setting VL_cap in mlx4_SET_PORT wrapper flow
IB/mlx4: Fix EQ deallocation in legacy mode
RDMA/cxgb4: Fix crash when peer address is 0.0.0.0
RDMA/ocrdma: Remove unnecessary version.h includes
RDMA/ocrdma: Fix signaled event for SRQ_LIMIT_REACHED
RDMA/ocrdma: Correct queue free count math
Roland Dreier [Wed, 6 Jun 2012 17:08:11 +0000 (10:08 -0700)]
Merge branches 'cxgb4', 'mlx4' and 'ocrdma' into for-linus
Sagi Grimberg [Thu, 24 May 2012 13:08:08 +0000 (16:08 +0300)]
IB/mlx4: Fix max_wqe capacity reported from query device
1. Limit the max number of WQEs per QP reported when querying the
device, so that ib_create_qp() will not fail for a QP size that the
device claimed to support due to additional headroom WQEs being
allocated.
2. Limit qp resources accepted for ib_create_qp() to the limits
reported in ib_query_device(). In kernel space, make sure that the
limits returned to the caller following qp creation also lie within
the reported device limits. For userspace, report as before, and do
adjustment in libmlx4 (so as not to break ABI).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Jack Morgenstein [Thu, 24 May 2012 13:08:09 +0000 (16:08 +0300)]
mlx4_core: Fix setting VL_cap in mlx4_SET_PORT wrapper flow
Commit
096335b3f983 ("mlx4_core: Allow dynamic MTU configuration for
IB ports") modifies the port VL setting. This exposes a bug in
mlx4_common_set_port(), where the VL cap value passed in (inside the
command mailbox) is incorrectly zeroed-out:
mlx4_SET_PORT modifies the VL_cap field (byte 3 of the mailbox).
Since the SET_PORT command is paravirtualized on the master as well as
on the slaves, mlx4_SET_PORT_wrapper() is invoked on the master. This
calls mlx4_common_set_port() where mailbox byte 3 gets overwritten by
code which should only set a single bit in that byte (for the reset
qkey counter flag) -- but instead overwrites the entire byte.
The result is that when running in SR-IOV mode, the VL_cap will be set
to zero -- fix this.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Linus Torvalds [Wed, 6 Jun 2012 16:49:28 +0000 (09:49 -0700)]
Merge tag 'md-3.5-fixes' of git://neil.brown.name/md
Pull two md fixes from NeilBrown:
"One sparse-warning fix, one bugfix for 3.4-stable"
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md: raid1/raid10: fix problem with merge_bvec_fn
lib/raid6: fix sparse warnings in recovery functions
Linus Torvalds [Wed, 6 Jun 2012 16:47:57 +0000 (09:47 -0700)]
Merge tag 'iommu-fixes-3.5-rc1' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
"Two patches are in here which fix AMD IOMMU specific issues. One
patch fixes a long-standing warning on resume because the
amd_iommu_resume function enabled interrupts. The other patch fixes a
deadlock in an error-path of the page-fault request handling code of
the IOMMU driver.
* tag 'iommu-fixes-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Fix deadlock in ppr-handling error path
iommu/amd: Cache pdev pointer to root-bridge
Chris Metcalf [Wed, 6 Jun 2012 15:23:19 +0000 (11:23 -0400)]
tile: add #include to unbreak build after generic init_task conversion
Some code was moved from init_task.c to setup.c but the appropriate
header needed to be moved as well.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Chris Metcalf [Wed, 6 Jun 2012 15:21:44 +0000 (11:21 -0400)]
tile: remove cpu_idle_on_new_stack
This routine isn't used unless CONFIG_HOMECACHE is enabled, which
isn't even available as a public configuration option yet.
Since it no longer links correctly in 3.4, just remove it for now.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Arun Sharma [Fri, 20 Apr 2012 22:41:36 +0000 (15:41 -0700)]
perf/x86: Check user address explicitly in copy_from_user_nmi()
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-5-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Arun Sharma [Fri, 20 Apr 2012 22:41:35 +0000 (15:41 -0700)]
perf/x86: Check if user fp is valid
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-4-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Arun Sharma [Fri, 20 Apr 2012 22:41:34 +0000 (15:41 -0700)]
perf: Limit callchains to 127
Stack depth of 255 seems excessive, given that copy_from_user_nmi()
could be slow.
Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-3-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Arun Sharma [Fri, 20 Apr 2012 22:41:33 +0000 (15:41 -0700)]
perf/x86: Allow multiple stacks
Without this patch, applications with two different stack
regions (eg: native stack vs JIT stack) get truncated
callchains even when RBP chaining is present. GDB shows proper
stack traces and the frame pointer chaining is intact.
This patch disables the (fp < RSP) check, hoping that other checks
in the code save the day for us. In our limited testing, this
didn't seem to break anything.
In the long term, we could potentially have userspace advise
the kernel on the range of valid stack addresses, so we don't
spend a lot of time unwinding from bogus addresses.
Signed-off-by: Arun Sharma <asharma@fb.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-2-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dimitri Sivanich [Tue, 5 Jun 2012 18:44:36 +0000 (13:44 -0500)]
sched: Fix the relax_domain_level boot parameter
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.
Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.
Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Eugeni Dodonov [Wed, 6 Jun 2012 14:59:06 +0000 (11:59 -0300)]
char/agp: add another Ironlake host bridge
This seems to come on Gigabyte H55M-S2V and was discovered through the
https://bugs.freedesktop.org/show_bug.cgi?id=50381 debugging.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50381
Signed-off-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Peter Zijlstra [Tue, 5 Jun 2012 08:26:43 +0000 (10:26 +0200)]
perf/x86: Update SNB PEBS constraints
Afaict there's no need to (incompletely) iterate the
MEM_UOPS_RETIRED.* umask state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 5 Jun 2012 08:26:43 +0000 (10:26 +0200)]
perf/x86: Enable/Add IvyBridge hardware support
Implement rudimentary IVB perf support. The SDM states its identical
to SNB with exception of the exact event tables, but a quick look
suggests they're similar enough.
Also mark SNB-EP as broken for now.
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 5 Jun 2012 08:26:43 +0000 (10:26 +0200)]
perf/x86: Implement cycles:p for SNB/IVB
Now that there's finally a chip with working PEBS (IvyBridge), we can
enable the hardware and implement cycles:p for SNB/IVB.
Cc: Stephane Eranian <eranian@google.com>
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 5 Jun 2012 13:30:31 +0000 (15:30 +0200)]
perf/x86: Fix Intel shared extra MSR allocation
Zheng Yan reported that event group validation can wreck event state
when Intel extra_reg allocation changes event state.
Validation shouldn't change any persistent state. Cloning events in
validate_{event,group}() isn't really pretty either, so add a few
special cases to avoid modifying the event state.
The code is restructured to minimize the special case impact.
Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com>
Acked-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 31 May 2012 19:20:16 +0000 (21:20 +0200)]
sched: Validate assumptions in sched_init_numa()
Add some code to validate assumptions we're making and output
warnings if they are not.
If this trigger we want to know about it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alex Shi <lkml.alex@gmail.com>
Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 31 May 2012 10:05:32 +0000 (12:05 +0200)]
sched: Always initialize cpu-power
Often when we run into mis-shapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in /0 traps.
Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 31 May 2012 12:47:33 +0000 (14:47 +0200)]
sched: Fix domain iteration
Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.
For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.
The topology that triggered it is the one from David Rientjes:
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
resulting in boxes that wouldn't even boot.
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 17 May 2012 19:19:46 +0000 (21:19 +0200)]
sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.
This bit gives us the right clue:
> [89945.640512] [<
ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<
ffffffff8104c546>] push_rt_task+0xc6/0x290
if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].
Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path, if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.
Reported-by: Roland Dreier <roland@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Alex Shi [Wed, 6 Jun 2012 06:52:51 +0000 (14:52 +0800)]
sched/numa: Load balance between remote nodes
Commit
cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
if the node distance in SLIT table is farther than REMOTE_DISTANCE,
if so, it will lose the load balance chance at exec/fork/wake_affine
points.
But actually, even the node distance is farther than REMOTE_DISTANCE.
Modern CPUs also has QPI like connections, which ensures that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machine causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.
This patch will recover the scheduler behavior to old mode on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
perfromance regressions. (all of them just have 2 kinds distance, 10, 21)
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kamalesh Babulal [Thu, 31 May 2012 07:37:38 +0000 (13:07 +0530)]
sched/x86: Calculate booted cores after construction of sibling_mask
Commit
316ad248307fb ("sched/x86: Rewrite set_cpu_sibling_map()")
broke the booted_cores accounting.
The problem is that the booted_cores accounting needs all the
sibling links set up. So restore the second loop and add a comment as
to why its needed.
On qemu booted with -smp sockets=1,cores=2,threads=2;
Before:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 1
cpu cores : 4
cpu cores : 3
With the patch:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 2
cpu cores : 2
cpu cores : 2
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120531073738.GH7511@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tomoki Sekiyama [Mon, 28 May 2012 09:09:18 +0000 (18:09 +0900)]
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
In current Linux, percpu variable `vector_irq' is not cleared on
offlined cpus while disabling devices' irqs. If the cpu that has
the disabled irqs in vector_irq is hotplugged,
__setup_vector_irq() hits invalid irq vector and may crash.
This bug can be reproduced as following;
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
This patch fixes this bug by clearing vector_irq in
__clear_irq_vector() even if the cpu is offlined.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: ltc-kernel@ml.yrl.intra.hitachi.co.jp
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/4FC340BE.7080101@hitachi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Feng Tang [Wed, 30 May 2012 15:15:41 +0000 (23:15 +0800)]
x86/reboot: Fix a warning message triggered by stop_other_cpus()
When rebooting our 24 CPU Westmere servers with 3.4-rc6, we
always see this warning msg:
Restarting system.
machine restart
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:125
native_smp_send_reschedule+0x74/0xa7() Hardware name: X8DTN
Modules linked in: igb [last unloaded: scsi_wait_scan]
Pid: 1, comm: systemd-shutdow Not tainted 3.4.0-rc6+ #22
Call Trace:
<IRQ> [<
ffffffff8102a41f>] warn_slowpath_common+0x7e/0x96
[<
ffffffff8102a44c>] warn_slowpath_null+0x15/0x17
[<
ffffffff81018cf7>] native_smp_send_reschedule+0x74/0xa7
[<
ffffffff810561c1>] trigger_load_balance+0x279/0x2a6
[<
ffffffff81050112>] scheduler_tick+0xe0/0xe9
[<
ffffffff81036768>] update_process_times+0x60/0x70
[<
ffffffff81062f2f>] tick_sched_timer+0x68/0x92
[<
ffffffff81046e33>] __run_hrtimer+0xb3/0x13c
[<
ffffffff81062ec7>] ? tick_nohz_handler+0xd0/0xd0
[<
ffffffff810474f2>] hrtimer_interrupt+0xdb/0x198
[<
ffffffff81019a35>] smp_apic_timer_interrupt+0x81/0x94
[<
ffffffff81655187>] apic_timer_interrupt+0x67/0x70
<EOI> [<
ffffffff8101a3c4>] ? default_send_IPI_mask_allbutself_phys+0xb4/0xc4
[<
ffffffff8101c680>] physflat_send_IPI_allbutself+0x12/0x14
[<
ffffffff81018db4>] native_nmi_stop_other_cpus+0x8a/0xd6
[<
ffffffff810188ba>] native_machine_shutdown+0x50/0x67
[<
ffffffff81018926>] machine_shutdown+0xa/0xc
[<
ffffffff8101897e>] native_machine_restart+0x20/0x32
[<
ffffffff810189b0>] machine_restart+0xa/0xc
[<
ffffffff8103b196>] kernel_restart+0x47/0x4c
[<
ffffffff8103b2e6>] sys_reboot+0x13e/0x17c
[<
ffffffff8164e436>] ? _raw_spin_unlock_bh+0x10/0x12
[<
ffffffff810fcac9>] ? bdi_queue_work+0xcf/0xd8
[<
ffffffff810fe82f>] ? __bdi_start_writeback+0xae/0xb7
[<
ffffffff810e0d64>] ? iterate_supers+0xa3/0xb7
[<
ffffffff816547a2>] system_call_fastpath+0x16/0x1b
---[ end trace
320af5cb1cb60c5b ]---
The root cause seems to be the
default_send_IPI_mask_allbutself_phys() takes quite some time (I
measured it could be several ms) to complete sending NMIs to all
the other 23 CPUs, and for HZ=250/1000 system, the time is long
enough for a timer interrupt to happen, which will in turn
trigger to kick load balance to a stopped CPU and cause this
warning in native_smp_send_reschedule().
So disabling the local irq before stop_other_cpu() can fix this
problem (tested 25 times reboot ok), and it is fine as there
should be nobody caring the timer interrupt in such reboot
stage.
The latest 3.4 kernel slightly changes this behavior by sending
REBOOT_VECTOR first and only send NMI_VECTOR if the REBOOT_VCTOR
fails, and this patch is still needed to prevent the problem.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120530231541.4c13433a@feng-i7
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sebastian Andrzej Siewior [Thu, 31 May 2012 21:20:25 +0000 (23:20 +0200)]
x86/intel/moorestown: Change intel_scu_devices_create() to __devinit
The allmodconfig hits:
WARNING: vmlinux.o(.text+0x6553d): Section mismatch in
reference from the function intel_scu_devices_create() to the
function .devinit.text: spi_register_board_info()
[...]
This patch marks intel_scu_devices_create() as devinit because
it only calls a devinit function, spi_register_board_info().
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Link: http://lkml.kernel.org/r/20120531212025.GA8519@breakpoint.cc
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yasuaki Ishimatsu [Mon, 4 Jun 2012 02:42:32 +0000 (11:42 +0900)]
x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init()
When hot-adding a CPU, the system outputs following messages
since node_to_cpumask_map[2] was not allocated memory.
Booting Node 2 Processor 32 APIC 0xc0
node_to_cpumask_map[2] NULL
Pid: 0, comm: swapper/32 Tainted: G A 3.3.5-acd #21
Call Trace:
[<
ffffffff81048845>] debug_cpumask_set_cpu+0x155/0x160
[<
ffffffff8105e28a>] ? add_timer_on+0xaa/0x120
[<
ffffffff8150665f>] numa_add_cpu+0x1e/0x22
[<
ffffffff815020bb>] identify_cpu+0x1df/0x1e4
[<
ffffffff815020d6>] identify_econdary_cpu+0x16/0x1d
[<
ffffffff81504614>] smp_store_cpu_info+0x3c/0x3e
[<
ffffffff81505263>] smp_callin+0x139/0x1be
[<
ffffffff815052fb>] start_secondary+0x13/0xeb
The reason is that the bit of node 2 was not set at
numa_nodes_parsed. numa_nodes_parsed is set by only
acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init. Thus even if hot-added memory
which is same PXM as hot-added CPU is written in ACPI SRAT
Table, if the hot-added CPU is not written in ACPI SRAT table,
numa_nodes_parsed is not set.
But according to ACPI Spec Rev 5.0, it says about ACPI SRAT
table as follows: This optional table provides information that
allows OSPM to associate processors and memory ranges, including
ranges of memory provided by hot-added memory devices, with
system localities / proximity domains and clock domains.
It means that ACPI SRAT table only provides information for CPUs
present at boot time and for memory including hot-added memory.
So hot-added memory is written in ACPI SRAT table, but hot-added
CPU is not written in it. Thus numa_nodes_parsed should be set
by not only acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init but also
acpi_numa_memory_affinity_init for the case.
Additionally, if system has cpuless memory node,
acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init cannot set numa_nodes_parseds
since these functions cannot find cpu description for the node.
In this case, numa_nodes_parsed needs to be set by
acpi_numa_memory_affinity_init.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: liuj97@gmail.com
Cc: kosaki.motohiro@gmail.com
Link: http://lkml.kernel.org/r/4FCC2098.4030007@jp.fujitsu.com
[ merged it ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Xiaotian Feng [Tue, 5 Jun 2012 19:00:31 +0000 (15:00 -0400)]
x86/gart: Fix kmemleak warning
aperture_64.c now is using memblock, the previous
kmemleak_ignore() for alloc_bootmem() should be removed then.
Otherwise, with kmemleak enabled, kernel will throw warnings
like:
[ 0.000000] kmemleak: Trying to color unknown object at 0xffff8800c4000000 as Black
[ 0.000000] Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc1-next-
20120605+ #130
[ 0.000000] Call Trace:
[ 0.000000] [<
ffffffff811b27e6>] paint_ptr+0x66/0xc0
[ 0.000000] [<
ffffffff816b90fb>] kmemleak_ignore+0x2b/0x60
[ 0.000000] [<
ffffffff81ef7bc0>] kmemleak_init+0x217/0x2c1
[ 0.000000] [<
ffffffff81ed2b97>] start_kernel+0x32d/0x3eb
[ 0.000000] [<
ffffffff81ed25e4>] ? repair_env_string+0x5a/0x5a
[ 0.000000] [<
ffffffff81ed2356>] x86_64_start_reservations+0x131/0x135
[ 0.000000] [<
ffffffff81ed2120>] ? early_idt_handlers+0x120/0x120
[ 0.000000] [<
ffffffff81ed245c>] x86_64_start_kernel+0x102/0x111
[ 0.000000] kmemleak: Early log backtrace:
[ 0.000000] [<
ffffffff816b911b>] kmemleak_ignore+0x4b/0x60
[ 0.000000] [<
ffffffff81ee6a38>] gart_iommu_hole_init+0x3e7/0x547
[ 0.000000] [<
ffffffff81edb20b>] pci_iommu_alloc+0x44/0x6f
[ 0.000000] [<
ffffffff81ee81ad>] mem_init+0x19/0xec
[ 0.000000] [<
ffffffff81ed2a54>] start_kernel+0x1ea/0x3eb
[ 0.000000] [<
ffffffff81ed2356>] x86_64_start_reservations+0x131/0x135
[ 0.000000] [<
ffffffff81ed245c>] x86_64_start_kernel+0x102/0x111
[ 0.000000] [<
ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1338922831-2847-1-git-send-email-xtfeng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Thomas Gleixner [Wed, 6 Jun 2012 09:33:21 +0000 (11:33 +0200)]
x86: mce: Add the dropped timer interval init back
commit
82f7af09 ("x86/mce: Cleanup timer mess) dropped the
initialization of the per cpu timer interval. Duh :(
Restore the previous behaviour.
Reported-by: Chen Gong <gong.chen@linux.intel.com>
Cc: bp@amd64.org
Cc: tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Masami Hiramatsu [Mon, 4 Jun 2012 15:09:11 +0000 (00:09 +0900)]
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
Fix the x86 instruction decoder to decode bsr/bsf/jmpe with
operand-size prefix (66h). This fixes the test case failure
reported by Linus, attached below.
bsf/bsr/jmpe have a special encoding. Opcode map in
Intel Software Developers Manual vol2 says they have
TZCNT/LZCNT variants if it has F3h prefix. However, there
is no information if it has other 66h or F2h prefixes.
Current instruction decoder supposes that those are
bad instructions, but it actually accepts at least
operand-size prefixes.
H. Peter Anvin further explains:
" TZCNT/LZCNT are F3 + BSF/BSR exactly because the F2 and
F3 prefixes have historically been no-ops with most instructions.
This allows software to unconditionally use the prefixed versions
and get TZCNT/LZCNT on the processors that have them if they don't
care about the difference. "
This fixes errors reported by test_get_len:
Warning: arch/x86/tools/test_get_len found difference at <em_bsf>:
ffffffff81036d87
Warning:
ffffffff81036de5: 66 0f bc c2 bsf %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: arch/x86/tools/test_get_len found difference at <em_bsr>:
ffffffff81036ea6
Warning:
ffffffff81036f04: 66 0f bd c2 bsr %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: decoded and checked
13298882 instructions with 2 warnings
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <yrl.pp-manager.tt@hitachi.com>
Link: http://lkml.kernel.org/r/20120604150911.22338.43296.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Wed, 6 Jun 2012 06:40:10 +0000 (08:40 +0200)]
Merge tag 'perf-urgent-for-mingo' of git://git./linux/kernel/git/acme/linux into perf/urgent
Pull perf fixes from Arnaldo Carvalho de Melo:
* Endianness fixes from Jiri Olsa
* Fixes for make perf tarball
* Fix for DSO name in perf script callchains, from David Ahern
* Segfault fixes for perf top --callchain, from Namhyung Kim
* Minor function result fixes from Srikar Dronamraju
* Add missing 3rd ioctl parameter, from Namhyung Kim
* Fix pager usage in minimal embedded systems, from Avik Sil
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Chen Gong [Tue, 5 Jun 2012 02:35:02 +0000 (10:35 +0800)]
x86/mce: Fix the MCE poll timer logic
In commit
82f7af09 ("x86/mce: Cleanup timer mess), Thomas just
forgot the "/ 2" there while cleaning up.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@amd64.org
Cc: tony.luck@intel.com
Link: http://lkml.kernel.org/r/1338863702-9245-1-git-send-email-gong.chen@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Tue, 5 Jun 2012 22:15:04 +0000 (15:15 -0700)]
Merge tag 'please-pull-mce' of git://git./linux/kernel/git/ras/ras
Pull MCE regression fix from Tony Luck:
"Typo/thinko in a cleanup caused a semantic change. Fix it."
* tag 'please-pull-mce' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Fix the MCE poll timer logic
Linus Torvalds [Tue, 5 Jun 2012 20:23:17 +0000 (13:23 -0700)]
Merge branch 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull arm CMA fix from Marek Szyprowski:
"This removes the ARMv6+ CMA dependency and lets one use old, well-
tested dma-mapping implementation also on ARMv6+ systems without the
need to use EXPERIMENTAL stuff."
Russell King complained (rightly) about the experimental feature being
forced on by the ARM config.
Here CMA is "continuous memory allocator", not "cross-memory attach".
We really neet to stop using insane TLA's for things that aren't big
industry standards.
* 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
ARM: dma-mapping: remove unconditional dependency on CMA
Bryan Wu [Thu, 31 May 2012 11:51:37 +0000 (19:51 +0800)]
MAINTAINERS: add linux-leds mail list address and git URL
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Vetter [Wed, 23 May 2012 12:02:00 +0000 (14:02 +0200)]
drm/i915: fix up ivb plane 3 pageflips
Or at least plug another gapping hole. Apparrently hw desingers only
moved the bit field, but did not bother ot re-enumerate the planes
when adding support for a 3rd pipe.
Discovered by i-g-t/flip_test.
This may or may not fix the reference bugzilla, because that one
smells like we have still larger fish to fry.
v2: Fixup the impossible case to catch programming errors, noticed by
Chris Wilson.
References: https://bugs.freedesktop.org/show_bug.cgi?id=50069
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Eugeni Dodonov <eugeni.dodonov@intel.com>
Cc: stable@vger.kernel.org
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Karel Zak [Fri, 1 Jun 2012 08:13:09 +0000 (10:13 +0200)]
MAINTAINERS: update util-linux info
Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 5 Jun 2012 18:55:27 +0000 (11:55 -0700)]
Merge branch '3.5-merge-fixes' of git://git./linux/kernel/git/nab/target-pending
Pull two scsi target fixes from Nicholas Bellinger:
"The first is a small name-space collision fix from Stefan for the new
sbp-target / ieee-1394 code, and second is the FILEIO backend
conversion patch to always use O_DSYNC by default instead of O_SYNC as
recommended by hch. Note the latter is CC'ed stable."
* '3.5-merge-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target/file: Use O_DSYNC by default for FILEIO backends
sbp-target: rename a variable to avoid name clash