Christophe Leroy [Sat, 13 Oct 2018 09:45:12 +0000 (09:45 +0000)]
powerpc/64: properly initialise the stackprotector canary on SMP.
commit
06ec27aea9fc ("powerpc/64: add stack protector support")
doesn't initialise the stack canary on SMP secondary CPU's paca,
leading to the following false positive report from the
stack protector.
smp: Bringing up secondary CPUs ...
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __schedule+0x978/0xa80
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-rc7-next-
20181010-autotest-autotest #1
Call Trace:
[
c000001fed5b3bf0] [
c000000000a0ef3c] dump_stack+0xb0/0xf4 (unreliable)
[
c000001fed5b3c30] [
c0000000000f9d68] panic+0x140/0x308
[
c000001fed5b3cc0] [
c0000000000f9844] __stack_chk_fail+0x24/0x30
[
c000001fed5b3d20] [
c000000000a2c3a8] __schedule+0x978/0xa80
[
c000001fed5b3e00] [
c000000000a2c9b4] schedule_idle+0x34/0x60
[
c000001fed5b3e30] [
c00000000013d344] do_idle+0x224/0x3d0
[
c000001fed5b3ec0] [
c00000000013d6e0] cpu_startup_entry+0x30/0x50
[
c000001fed5b3ef0] [
c000000000047f34] start_secondary+0x4d4/0x520
[
c000001fed5b3f90] [
c00000000000b370] start_secondary_prolog+0x10/0x14
This patch properly initialises the stack_canary of the secondary
idle tasks.
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 9 Oct 2018 05:51:05 +0000 (16:51 +1100)]
Merge branch 'fixes' into next
Merge our fixes branch. It has a few important fixes that are needed for
futher testing and also some commits that will conflict with content in
next.
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/via-macii, macintosh/adb-iop: Clean up whitespace
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/via-macii, macintosh/adb-iop: Modernize printk calls
Add missing severity level to log messages.
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/via-macii: Simplify locking
Modifying the request queue or changing the current state requires
mutual exclusion. Use local_irq_disable() consistently for this
rather than disabling the ADB interrupt. This simplifies the locking
scheme and brings via-macii into line with the other ADB drivers.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/via-macii: Remove BUG_ON assertions
The BUG_ON assertions I added to the via-macii driver over a decade ago
haven't fired AFAIK. Some can never fire (by inspection). One assertion
checks for a NULL pointer, but that would merely substitute a BUG crash
for an Oops crash. Remove the pointless BUG_ON assertions and replace
the others with a WARN_ON and an array bounds check.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/via-macii: Synchronous bus reset
Make the reset operation synchronous, like the other ADB drivers.
The reset request is static data but callers may not know that.
This way the struct is not in use when the reset method returns.
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh/adb: Rework printk output again
Avoid the KERN_CONT problem by avoiding message fragments. The problem
arises during async ADB bus probing, when ADB messages may get mixed up
with other messages. See also, commit
4bcc595ccd80 ("printk: reinstate
KERN_CONT for printing continuation lines").
Remove a number of printk() continuation lines by logging handler
changes in adb_try_handler_change() instead.
This patch addresses the problematic use of "\n" at the beginning of
pr_cont() messages, which got overlooked in commit
f2be6295684b
("macintosh/adb: Properly mark continued kernel messages").
That commit also changed printk(KERN_DEBUG ...) to pr_debug(...), which
hinders work on low-level ADB driver bugs. Revert that change.
Cc: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Finn Thain [Wed, 12 Sep 2018 00:18:44 +0000 (20:18 -0400)]
macintosh: Use common code to access RTC
Now that the 68k Mac port has adopted the via-pmu driver, the same RTC
code can be shared between m68k and powerpc. Replace duplicated code in
arch/powerpc and arch/m68k with common RTC accessors for Cuda and PMU.
Drop the problematic WARN_ON which was introduced in commit
22db552b50fa
("powerpc/powermac: Fix rtc read/write functions").
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Srikar Dronamraju [Fri, 28 Sep 2018 03:47:32 +0000 (09:17 +0530)]
powerpc/numa: Skip onlining a offline node in kdump path
With commit
2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), kdump kernel on shared LPAR may crash.
The necessary conditions are
- Shared LPAR with at least 2 nodes having memory and CPUs.
- Memory requirement for kdump kernel must be met by the first N-1
nodes where there are at least N nodes with memory and CPUs.
Example numactl of such a machine.
$ numactl -H
available: 5 nodes (0,2,5-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 255 MB
node 2 free: 189 MB
node 5 cpus: 24 25 26 27 28 29 30 31
node 5 size: 4095 MB
node 5 free: 4024 MB
node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 6 size: 6353 MB
node 6 free: 5998 MB
node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 7 size: 7640 MB
node 7 free: 7164 MB
node distances:
node 0 2 5 6 7
0: 10 40 40 40 40
2: 40 10 40 40 40
5: 40 40 10 40 40
6: 40 40 40 10 20
7: 40 40 40 20 10
Steps to reproduce.
1. Load / start kdump service.
2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)
When booting a kdump kernel with 2048M:
kexec: Starting switchover sequence.
I'm in purgatory
Using 1TB segments
hash-mmu: Initializing hash mmu with SLB
Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
Found initrd at 0xc000000009e70000:0xc00000000ae554b4
Using pSeries machine description
-----------------------------------------------------
ppc64_pft_size = 0x1e
phys_mem_size = 0x88000000
dcache_bsize = 0x80
icache_bsize = 0x80
cpu_features = 0x000000ff8f5d91a7
possible = 0x0000fbffcf5fb1a7
always = 0x0000006f8b5c91a1
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features = 0x7c006001
firmware_features = 0x00000007c45bfc57
htab_hash_mask = 0x7fffff
physical_start = 0x8000000
-----------------------------------------------------
numa: NODE_DATA [mem 0x87d5e300-0x87d67fff]
numa: NODE_DATA(0) on node 6
numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff]
Top of RAM: 0x88000000, Total RAM: 0x88000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x0000000087ffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 6: [mem 0x0000000000000000-0x0000000087ffffff]
Could not find start_pfn for node 0
Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
On node 0 totalpages: 0
Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
On node 6 totalpages: 34816
Unable to handle kernel paging request for data at address 0x00000060
Faulting instruction address: 0xc000000008703a54
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
NIP:
c000000008703a54 LR:
c000000008703a38 CTR:
0000000000000000
REGS:
c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+)
MSR:
8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR:
24022022 XER:
20000002
CFAR:
c0000000086fc238 IRQMASK: 0
GPR00:
c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
GPR04:
0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
GPR08:
0000000000000000 0000000000000000 0000000000000000 0000000000000220
GPR12:
0000000000002200 c000000009e51400 0000000000000000 0000000000000008
GPR16:
0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
GPR20:
c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
GPR24:
0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
GPR28:
c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
NIP [
c000000008703a54] bus_add_device+0x84/0x1e0
LR [
c000000008703a38] bus_add_device+0x68/0x1e0
Call Trace:
[
c00000000b6736c0] [
c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
[
c00000000b673740] [
c000000008700194] device_add+0x454/0x7c0
[
c00000000b673800] [
c00000000872e660] __register_one_node+0xb0/0x240
[
c00000000b673860] [
c00000000839a6bc] __try_online_node+0x12c/0x180
[
c00000000b673900] [
c00000000839b978] try_online_node+0x58/0x90
[
c00000000b673930] [
c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
[
c00000000b673a10] [
c0000000080848a0] numa_update_cpu_topology+0x190/0x580
[
c00000000b673c00] [
c000000008d3f2e4] smp_cpus_done+0x94/0x108
[
c00000000b673c70] [
c000000008d5c00c] smp_init+0x174/0x19c
[
c00000000b673d00] [
c000000008d346b8] kernel_init_freeable+0x1e0/0x450
[
c00000000b673dc0] [
c0000000080102e8] kernel_init+0x28/0x160
[
c00000000b673e30] [
c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
e8bf0050 e93e0098 3b9f0010 2fa50000 <
e8690060>
38630018 419e0114 7f84e378
---[ end trace
593577668c2daa65 ]---
However a regular kernel with 4096M (2048 gets reserved for crash
kernel) boots properly.
Unlike regular kernels, which mark all available nodes as online,
kdump kernel only marks just enough nodes as online and marks the rest
as offline at boot. However kdump kernel boots with all available
CPUs. With Commit
2ea626306810 ("powerpc/topology: Get topology for
shared processors at boot"), all CPUs are onlined on their respective
nodes at boot time. try_online_node() tries to online the offline
nodes but fails as all needed subsystems are not yet initialized.
As part of fix, detect and skip early onlining of a offline node.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 5 Oct 2018 06:43:55 +0000 (16:43 +1000)]
powerpc: Don't print kernel instructions in show_user_instructions()
Recently we implemented show_user_instructions() which dumps the code
around the NIP when a user space process dies with an unhandled
signal. This was modelled on the x86 code, and we even went so far as
to implement the exact same bug, namely that if the user process
crashed with its NIP pointing into the kernel we will dump kernel text
to dmesg. eg:
bad-bctr[2996]: segfault (11) at
c000000000010000 nip
c000000000010000 lr
12d0b0894 code 1
bad-bctr[2996]: code:
fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
bad-bctr[2996]: code:
3860001c f8010080 48242371 60000000 <
7c7b1b79>
4082002c e8010080 eb610048
This was discovered on x86 by Jann Horn and fixed in commit
342db04ae712 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
Fix it by checking the adjusted NIP value (pc) and number of
instructions against USER_DS, and bail if we fail the check, eg:
bad-bctr[2969]: segfault (11) at
c000000000010000 nip
c000000000010000 lr
107930894 code 1
bad-bctr[2969]: Bad NIP, not dumping instructions.
Fixes: 88b0fe175735 ("powerpc: Add show_user_instructions()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Mon, 27 Aug 2018 03:03:02 +0000 (13:03 +1000)]
powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation
Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process scoped LPID flush
did not have a flush, so add it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Mon, 27 Aug 2018 03:03:01 +0000 (13:03 +1000)]
powerpc/64s/hash: Do not use PPC_INVALIDATE_ERAT on CPUs before POWER9
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits
7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Mon, 1 Oct 2018 23:01:05 +0000 (09:01 +1000)]
powerpc/time: Add set_state_oneshot_stopped decrementer callback
If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:
if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
set_dec(0x7fffffff);
else
set_dec(decrementer_max);
If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.
As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Mon, 1 Oct 2018 23:01:04 +0000 (09:01 +1000)]
powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mark Hairgrove [Wed, 3 Oct 2018 18:51:34 +0000 (11:51 -0700)]
powerpc/powernv/npu: Remove atsd_threshold debugfs setting
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mark Hairgrove [Wed, 3 Oct 2018 18:51:33 +0000 (11:51 -0700)]
powerpc/powernv/npu: Use size-based ATSD invalidates
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.
The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.
ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.
With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.
To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.
One NPU (1 GPU):
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.3 5.6 5%
1M 39.3 57.4 46%
2M 49.7 82.6 66%
4M 286.6 285.7 0%
Two NPUs (6 GPUs):
mprotect rate (GB/s)
Stride Before After Speedup
64K 6.5 7.4 13%
1M 33.4 67.9 103%
2M 38.7 93.1 141%
4M 356.7 354.6 -1%
Anything over 2M is roughly the same as before since both cases issue a
single ATSD.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mark Hairgrove [Wed, 3 Oct 2018 18:51:32 +0000 (11:51 -0700)]
powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
- EIEIO
- Write the launch value to issue the ATSD
First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.
Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
EIEIO
for each NPU
- Write the launch value to issue the ATSD
Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.
With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).
With two NPUs active (on a 6-GPU system) the effect is more noticeable:
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.9 6.5 10%
1M 31.2 33.4 7%
2M 36.3 38.7 7%
4M 322.6 356.7 11%
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Masahiro Yamada [Mon, 1 Oct 2018 06:10:24 +0000 (15:10 +0900)]
powerpc: remove leftover code of old GCC version checks
Clean up the leftover of commit
f2910f0e6835 ("powerpc: remove old
GCC version checks").
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Mon, 1 Oct 2018 06:21:51 +0000 (16:21 +1000)]
powerpc/nohash: fix undefined behaviour when testing page size support
When enumerating page size definitions to check hardware support,
we construct a constant which is (1U << (def->shift - 10)).
However, the array of page size definitions is only initalised for
various MMU_PAGE_* constants, so it contains a number of 0-initialised
elements with def->shift == 0. This means we end up shifting by a
very large number, which gives the following UBSan splat:
================================================================================
UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
shift exponent
4294967286 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted
4.19.0-rc3-00045-ga604f927b012-dirty #6
Call Trace:
[
c00000000101bc20] [
c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
[
c00000000101bcb0] [
c0000000004f20a8] .ubsan_epilogue+0x18/0x64
[
c00000000101bd30] [
c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
[
c00000000101be20] [
c000000000d21760] .early_init_mmu+0x1b4/0x5a0
[
c00000000101bf10] [
c000000000d1ba28] .early_setup+0x100/0x130
[
c00000000101bf90] [
c000000000000528] start_here_multiplatform+0x68/0x80
================================================================================
Fix this by first checking if the element exists (shift != 0) before
constructing the constant.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 28 Sep 2018 15:39:20 +0000 (15:39 +0000)]
powerpc: Wire up memtest
Add call to early_memtest() so that kernel compiled with
CONFIG_MEMTEST really perform memtest at startup when requested
via 'memtest' boot parameter.
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 13 Aug 2018 13:19:52 +0000 (13:19 +0000)]
powerpc/mm: Don't report hugepage tables as memory leaks when using kmemleak
When a process allocates a hugepage, the following leak is
reported by kmemleak. This is a false positive which is
due to the pointer to the table being stored in the PGD
as physical memory address and not virtual memory pointer.
unreferenced object 0xc30f8200 (size 512):
comm "mmap", pid 374, jiffies
4872494 (age 627.630s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<
e32b68da>] huge_pte_alloc+0xdc/0x1f8
[<
9e0df1e1>] hugetlb_fault+0x560/0x8f8
[<
7938ec6c>] follow_hugetlb_page+0x14c/0x44c
[<
afbdb405>] __get_user_pages+0x1c4/0x3dc
[<
b8fd7cd9>] __mm_populate+0xac/0x140
[<
3215421e>] vm_mmap_pgoff+0xb4/0xb8
[<
c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
[<
4fcd760f>] ret_from_syscall+0x0/0x38
See commit
a984506c542e2 ("powerpc/mm: Don't report PUDs as
memory leaks when using kmemleak") for detailed explanation.
To fix that, this patch tells kmemleak to ignore the allocated
hugepage table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Thu, 27 Sep 2018 05:05:15 +0000 (15:05 +1000)]
powerpc/tm: Reformat comments
The comments in this file don't conform to the coding style so take
them to "Comment Formatting Re-Education Camp".
Suggested-by: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Reflow some comments and add full stops, fix spelling of Sergeant.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Petr Vorel [Wed, 26 Sep 2018 14:10:56 +0000 (16:10 +0200)]
powerpc/config: Enable CONFIG_PRINTK_TIME
for 64bit configs which use for CONFIG_LOG_BUF_SHIFT the same
or higher value than the default (currently 17).
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
YueHaibing [Sun, 23 Sep 2018 08:12:08 +0000 (08:12 +0000)]
powerpc: Remove duplicated include from pci_32.c
Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michal Suchanek [Wed, 26 Sep 2018 12:24:30 +0000 (14:24 +0200)]
powerpc/64s: consolidate MCE counter increment.
The code in machine_check_exception excludes 64s hvmode when
incrementing the MCE counter only to call opal_machine_check to
increment it specifically for this case.
Remove the exclusion and special case.
Fixes: a43c1590426c ("powerpc/pseries: Flush SLB contents on SLB MCE
errors.")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Tue, 7 Aug 2018 13:35:00 +0000 (10:35 -0300)]
powerpc/tm: Print 64-bits MSR
On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.
This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Thu, 16 Aug 2018 17:21:07 +0000 (14:21 -0300)]
powerpc/tm: Remove msr_tm_active()
Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
This function is not necessary, since MSR_TM_ACTIVE() just do the same and
could be used, removing the dualism and simplifying the code.
This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE().
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Tue, 25 Sep 2018 14:29:33 +0000 (11:29 -0300)]
powerpc/powernv: Mark function as __noreturn
There is a mismatch between function pnv_platform_error_reboot() definition
and declaration regarding function modifiers. In the declaration part, it
contains the function attribute __noreturn, while function definition
itself lacks it.
This was reported by sparse tool as an error:
arch/powerpc/platforms/powernv/opal.c:538:6: error: symbol 'pnv_platform_error_reboot' redeclared with different type (originally declared at arch/powerpc/platforms/powernv/powernv.h:11) - different modifiers
I checked and the function is already being considered as being 'noreturn'
by the compiler, thus, I understand this patch does not change any code
being generated.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Thu, 20 Sep 2018 16:45:07 +0000 (13:45 -0300)]
selftests/powerpc: New PTRACE_SYSEMU test
This patch adds a new test for the new PTRACE_SYSEMU ptrace request.
This test also relies on PTRACE_GETREGS and PTRACE_SETREGS requests to
run properly, since the trace instruction (gettid() syscall) is being
modified at run-time (by PTRACE_SETREGS) and re-executed three times.
PTRACE_GETREGS is being used to check that the registers are still
sane.
This test basically creates a child process that executes syscalls
and the parent process check if it is being traced appropriately. The
parent process guarantees that the SYSCALLs are being traced, with
PTRACE_SYSEMU, and ptrace stops the child application before a syscall is
executed. The way the tests validates it, is by guaranteeing that the
system calls arguments, as argv[0] (r3) which is the same register that
will have the syscall return value on powerpc, are not being corrupted on
PTRACE_SYSEMU with a return value, i.e, it continues to have the current
arguments instead, meaning that the registers where not clobbered.
This test is basically the same test for x86 located at
tools/testing/selftests/x86/ptrace_syscall.c, limited to test PTRACE_SYSEMU
request, and ported to PowerPC.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Thu, 20 Sep 2018 16:45:06 +0000 (13:45 -0300)]
powerpc/ptrace: Add support for PTRACE_SYSEMU
This is a patch that adds support for PTRACE_SYSEMU ptrace request in
PowerPC architecture.
When ptrace(PTRACE_SYSEMU, ...) request is called, it will be handled by
the arch independent function ptrace_resume(), which will tag the task with
the TIF_SYSCALL_EMU flag. This flag needs to be handled from a platform
dependent point of view, which is what this patch does.
This patch adds this task's flag as part of the _TIF_SYSCALL_DOTRACE, which
is the MACRO that is used to trace syscalls at entrance/exit.
Since TIF_SYSCALL_EMU is now part of _TIF_SYSCALL_DOTRACE, if the task has
_TIF_SYSCALL_DOTRACE set, it will hit do_syscall_trace_enter() at syscall
entrance and do_syscall_trace_leave() at syscall leave.
do_syscall_trace_enter() needs to handle the TIF_SYSCALL_EMU flag properly,
which will interrupt the syscall executing if TIF_SYSCALL_EMU is set. The
output values should not be changed, i.e. the return value (r3) should
contain the original syscall argument on exit.
With this flag set, the syscall is not executed fundamentally, because
do_syscall_trace_enter() is returning -1 which is bigger than NR_syscall,
thus, skipping the syscall execution and exiting userspace.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Thu, 20 Sep 2018 16:45:05 +0000 (13:45 -0300)]
powerpc: Redefine TIF_32BITS thread flag
Moving TIF_32BIT to use bit 20 instead of 4 in the task flag field.
This change is making room for an upcoming new task macro
(_TIF_SYSCALL_EMU) which is preferred to set a bit in the lower 16-bits
part of the word.
This upcoming flag macro will take part in a composed macro
(_TIF_SYSCALL_DOTRACE) which will contain other flags as well, and it is
preferred that the whole _TIF_SYSCALL_DOTRACE macro only sets the lower 16
bits of a word, so, it could be handled using immediate operations (as load
immediate, add immediate, ...) where the immediate operand (SI) is limited
to 16-bits.
Another possible solution would be using the LOAD_REG_IMMEDIATE() macro
to load a full 64-bits word immediate, but it takes 5 operations instead of
one.
Having TIF_32BITS being redefined to use an upper bit is not a problem
since there is only one place in the assembly code where TIF_32BIT is being
used, and it could be replaced with an operation with right shift (addis),
since it is used alone, i.e. not being part of a composed macro, which has
different bits set, and would require LOAD_REG_IMMEDIATE().
Tested on a 64 bits Big Endian machine running a 32 bits task.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 27 Sep 2018 07:05:55 +0000 (07:05 +0000)]
powerpc/64: add stack protector support
On PPC64, as register r13 points to the paca_struct at all time,
this patch adds a copy of the canary there, which is copied at
task_switch.
That new canary is then used by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r13
-mstack-protector-guard-offset=offsetof(struct paca_struct, canary))
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 27 Sep 2018 07:05:53 +0000 (07:05 +0000)]
powerpc/32: add stack protector support
This functionality was tentatively added in the past
(commit
6533b7c16ee5 ("powerpc: Initial stack protector
(-fstack-protector) support")) but had to be reverted
(commit
f2574030b0e3 ("powerpc: Revert the initial stack
protector support") because of GCC implementing it differently
whether it had been built with libc support or not.
Now, GCC offers the possibility to manually set the
stack-protector mode (global or tls) regardless of libc support.
This time, the patch selects HAVE_STACKPROTECTOR only if
-mstack-protector-guard=tls is supported by GCC.
On PPC32, as register r2 points to current task_struct at
all time, the stack_canary located inside task_struct can be
used directly by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r2
-mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary))
The protector is disabled for prom_init and bootx_init as
it is too early to handle it properly.
$ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
[ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64
[ 134.943666]
[ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted
4.18.0-s3k-dev-12143-ga3272be41209 #835
[ 134.963380] Call Trace:
[ 134.965860] [
c6615d60] [
c001f76c] panic+0x118/0x260 (unreliable)
[ 134.971775] [
c6615dc0] [
c001f654] panic+0x0/0x260
[ 134.976435] [
c6615dd0] [
c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64
[ 134.982769] [
c6615e00] [
ffffffff] 0xffffffff
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
zhong jiang [Wed, 26 Sep 2018 12:09:32 +0000 (20:09 +0800)]
powerpc/xive: Move a dereference below a NULL test
Move the dereference of xc below the NULL test.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Thu, 27 Sep 2018 08:10:58 +0000 (13:40 +0530)]
powerpc/pseries: Fix how we iterate over the DTL entries
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.
Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Thu, 27 Sep 2018 08:10:57 +0000 (13:40 +0530)]
powerpc/pseries: Fix DTL buffer registration
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we register the DTL
buffer for a cpu when the associated file under powerpc/dtl in debugfs
is opened. When doing so, we need to set the size of the buffer being
registered in the second u32 word of the buffer. This needs to be in big
endian, but we are not doing the conversion resulting in the below error
showing up in dmesg:
dtl_start: DTL registration for cpu 0 (hw 0) failed with -4
Fix this in the obvious manner.
Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Tue, 25 Sep 2018 14:10:04 +0000 (14:10 +0000)]
powerpc/traps: merge unrecoverable_exception() and nonrecoverable_exception()
PPC32 uses nonrecoverable_exception() while PPC64 uses
unrecoverable_exception().
Both functions are doing almost the same thing.
This patch removes nonrecoverable_exception()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rob Herring [Tue, 28 Aug 2018 01:52:07 +0000 (20:52 -0500)]
powerpc: Convert to using %pOFn instead of device_node.name
In preparation to remove the node name pointer from struct device_node,
convert printf users to use the %pOFn format specifier.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rob Herring [Tue, 4 Sep 2018 21:27:44 +0000 (16:27 -0500)]
macintosh: Convert to using %pOFn instead of device_node.name
In preparation to remove the node name pointer from struct device_node,
convert printf users to use the %pOFn format specifier.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rob Herring [Thu, 1 Feb 2018 17:59:22 +0000 (11:59 -0600)]
powerpc/pseries: Use of_irq_get helper() in request_event_sources_irqs()
Instead of calling both of_irq_parse_one() and
irq_create_of_mapping(), call of_irq_get() instead which does
essentially the same thing. of_irq_get() also calls irq_find_host()
for deferred probe support, but this should be fine as
irq_create_of_mapping() also calls that internally. This gets us
closer to making the former 2 functions static.
In the process of simplifying request_event_sources_irqs(), combine
the the pr_err() and WARN_ON() calls to just a WARN().
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rob Herring [Thu, 4 Jan 2018 22:45:41 +0000 (16:45 -0600)]
powerpc/cell: Use irq_of_parse_and_map() helper
Instead of calling both of_irq_parse_one() and
irq_create_of_mapping(), call of_irq_parse_and_map() instead which
does the same thing. This gets us closer to making the former 2
functions static.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:47 +0000 (23:39 +0530)]
powerpc/mm:book3s: Enable THP migration support
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:46 +0000 (23:39 +0530)]
powerpc/mm/thp: update pmd_trans_huge to check for pmd_present
We need to make sure pmd_trans_huge returns false for a pmd migration entry.
We mark the migration entry by clearing the _PAGE_PRESENT bit. We keep the
_PAGE_PTE bit set to indicate a leaf page table entry. Hence we need to make
sure we check for pmd_present() so that pmd_trans_huge won't return true on
pmd migration entry.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:45 +0000 (23:39 +0530)]
arch/powerpc/mm/hash: validate the pte entries before handling the hash fault
Make sure we are operating on THP and hugetlb entries in the respective hash
fault handling routines.
No functional change in this patch. If we walked the table wrongly before, we
will retry the access.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:44 +0000 (23:39 +0530)]
powerpc/mm/book3s: Check for pmd_large instead of pmd_trans_huge
Update few code paths to check for pmd_large.
set_pmd_at:
We want to use this to store swap pte at pmd level. For swap ptes we don't want
to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at. This remove
the false WARN_ON when using this with swap pmd entry.
pmd_page:
We don't really use them on pmd migration entries. But they can also work with
migration entries and we don't differentiate at the pte level. Hence update
pmd_page to work with pmd migration entries too
__find_linux_pte:
lockless page table walk need to handle pmd migration entries. pmd_trans_huge
check will return false on them. We don't set thp = 1 for such entries, but
update hpage_shift correctly. Without this we will walk pmd migration entries
as a pte page pointer which is wrong.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:43 +0000 (23:39 +0530)]
powerpc/mm/hugetlb/book3s: add _PAGE_PRESENT to hugepd pointer.
This make hugetlb directory pointer similar to other page able entries. A hugepd
entry is identified by lack of _PAGE_PTE bit set and directory size stored in
HUGEPD_SHIFT_MASK. We update that to also look at _PAGE_PRESENT
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Thu, 20 Sep 2018 18:09:42 +0000 (23:39 +0530)]
powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid
pgd/pud/pmd entry. We also switch the p**_present() to look at this bit.
With pmd_present, we have a special case. We need to make sure we consider a
pmd marked invalid during THP split as present. Right now we clear the
_PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special
case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is
only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit
for this special case. pmd_present is also updated to look at _PAGE_INVALID.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vaibhav Jain [Fri, 7 Sep 2018 07:34:48 +0000 (13:04 +0530)]
powerpc/powernv: Make possible for user to force a full ipl cec reboot
Ever since fast reboot is enabled by default in opal,
opal_cec_reboot() will use fast-reset instead of full IPL to perform
system reboot. This leaves the user with no direct way to force a full
IPL reboot except changing an nvram setting that persistently disables
fast-reset for all subsequent reboots.
This patch provides a more direct way for the user to force a one-shot
full IPL reboot by passing the command line argument 'full' to the
reboot command. So the user will be able to tweak the reboot behavior
via:
$ sudo reboot full # Force a full ipl reboot skipping fast-reset
or
$ sudo reboot # default reboot path (usually fast-reset)
The reboot command passes the un-parsed command argument to the kernel
via the 'Reboot' syscall which is then passed on to the arch function
pnv_restart(). The patch updates pnv_restart() to handle this cmd-arg
and issues opal_cec_reboot2 with OPAL_REBOOT_FULL_IPL to force a full
IPL reset.
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Thu, 20 Sep 2018 09:41:11 +0000 (19:41 +1000)]
powerpc/perf: Add missing break in power7_marked_instr_event()
In power7_marked_instr_event() there is a switch case that is missing
a break or an explicit fallthrough, it's not immediately clear which
it should be.
The function determines based on the PMU event code, whether the event
is a "marked" event (which then requires us to configure the PMU in a
certain way). On Power7 there is no specific bit(s) in the event to
tell us that, we just have to know.
Rather than having a full list of every event and whether they are
marked, we pull apart the event code and for events with certain
values of certain fields we can say that those are all marked events.
We take the psel (bits 0-7) of the event, and look at bits 4-7. For a
value of 6 we say that if the entire psel == 0x64 then if the pmc == 3
the event is marked, else not, and otherwise we continue.
It is then that we fallthrough to the 8 case, where we return true if
the unit == 0xd.
The question is should the 6 case also fallthrough and check for
unit == 0xd, or should it return.
Looking at the full list of events we see that there are zero events
where (psel >> 4) == 0x6 and unit == 0xd.
So the answer is it doesn't really matter, there are no valid event
codes that will return a different result whether we fallthrough or
break.
But equally, testing the 6 case events against unit == 0xd is slightly
bogus, as there are no such events. So to make the code clearer, and
avoid any future confusion, have the 6 case break rather than falling
through.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Michael Ellerman [Tue, 2 Oct 2018 13:56:39 +0000 (23:56 +1000)]
Revert "convert SLB miss handlers to C" and subsequent commits
This reverts commits:
5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C")
8fed04d0f6ae ("powerpc/64s/hash: remove user SLB data from the paca")
655deecf67b2 ("powerpc/64s/hash: SLB allocation status bitmaps")
2e1626744e8d ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache")
This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 1 Oct 2018 12:21:10 +0000 (12:21 +0000)]
powerpc/lib: fix book3s/32 boot failure due to code patching
Commit
51c3c62b58b3 ("powerpc: Avoid code patching freed init
sections") accesses 'init_mem_is_free' flag too early, before the
kernel is relocated. This provokes early boot failure (before the
console is active).
As it is not necessary to do this verification that early, this
patch moves the test into patch_instruction() instead of
__patch_instruction().
This modification also has the advantage of avoiding unnecessary
remappings.
Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections")
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Joel Stanley [Fri, 21 Sep 2018 02:54:31 +0000 (12:24 +0930)]
lib/xz: Put CRC32_POLY_LE in xz_private.h
This fixes a regression introduced by
faa16bc404d72a5 ("lib: Use
existing define with polynomial").
The cleanup added a dependency on include/linux, which broke the PowerPC
boot wrapper/decompresser when KERNEL_XZ is enabled:
BOOTCC arch/powerpc/boot/decompress.o
In file included from arch/powerpc/boot/../../../lib/decompress_unxz.c:233,
from arch/powerpc/boot/decompress.c:42:
arch/powerpc/boot/../../../lib/xz/xz_crc32.c:18:10: fatal error:
linux/crc32poly.h: No such file or directory
#include <linux/crc32poly.h>
^~~~~~~~~~~~~~~~~~~
The powerpc decompresser is a hairy corner of the kernel. Even while building
a 64-bit kernel it needs to build a 32-bit binary and therefore avoid including
files from include/linux.
This allows users of the xz library to avoid including headers from
'include/linux/' while still achieving the cleanup of the magic number.
Fixes: faa16bc404d72a5 ("lib: Use existing define with polynomial")
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Christophe LEROY <christophe.leroy@c-s.fr>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 28 Sep 2018 04:53:18 +0000 (14:53 +1000)]
selftests/powerpc: Fix Makefiles for headers_install change
Commit
b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
introduced a requirement that Makefiles more than one level below the
selftests directory need to define top_srcdir, but it didn't update
any of the powerpc Makefiles.
This broke building all the powerpc selftests with eg:
make[1]: Entering directory '/src/linux/tools/testing/selftests/powerpc'
BUILD_TARGET=/src/linux/tools/testing/selftests/powerpc/alignment; mkdir -p $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C alignment all
make[2]: Entering directory '/src/linux/tools/testing/selftests/powerpc/alignment'
../../lib.mk:20: ../../../../scripts/subarch.include: No such file or directory
make[2]: *** No rule to make target '../../../../scripts/subarch.include'.
make[2]: Failed to remake makefile '../../../../scripts/subarch.include'.
Makefile:38: recipe for target 'alignment' failed
Fix it by setting top_srcdir in the affected Makefiles.
Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Srikar Dronamraju [Tue, 25 Sep 2018 12:25:15 +0000 (17:55 +0530)]
powerpc/numa: Use associativity if VPHN hcall is successful
Currently associativity is used to lookup node-id even if the
preceding VPHN hcall failed. However this can cause CPU to be made
part of the wrong node, (most likely to be node 0). This is because
VPHN is not enabled on KVM guests.
With
2ea6263 ("powerpc/topology: Get topology for shared processors at
boot"), associativity is used to set to the wrong node. Hence KVM
guest topology is broken.
For example : A 4 node KVM guest before would have reported.
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1746 MB
node 0 free: 1604 MB
node 1 cpus: 4 5 6 7
node 1 size: 2044 MB
node 1 free: 1765 MB
node 2 cpus: 8 9 10 11
node 2 size: 2044 MB
node 2 free: 1837 MB
node 3 cpus: 12 13 14 15
node 3 size: 2044 MB
node 3 free: 1903 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Would now report:
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 1746 MB
node 0 free: 1244 MB
node 1 cpus:
node 1 size: 2044 MB
node 1 free: 2032 MB
node 2 cpus: 1
node 2 size: 2044 MB
node 2 free: 2028 MB
node 3 cpus:
node 3 size: 2044 MB
node 3 free: 2032 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Fix this by skipping associativity lookup if the VPHN hcall failed.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Tue, 25 Sep 2018 09:36:47 +0000 (19:36 +1000)]
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
Current we store the userspace r1 to PACATMSCRATCH before finally
saving it to the thread struct.
In theory an exception could be taken here (like a machine check or
SLB miss) that could write PACATMSCRATCH and hence corrupt the
userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
others do.
We've never actually seen this happen but it's theoretically
possible. Either way, the code is fragile as it is.
This patch saves r1 to the kernel stack (which can't fault) before we
turn MSR[RI] back on. PACATMSCRATCH is still used but only with
MSR[RI] off. We then copy r1 from the kernel stack to the thread
struct once we have MSR[RI] back on.
Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Mon, 24 Sep 2018 07:27:04 +0000 (17:27 +1000)]
powerpc/tm: Fix userspace r13 corruption
When we treclaim we store the userspace checkpointed r13 to a scratch
SPR and then later save the scratch SPR to the user thread struct.
Unfortunately, this doesn't work as accessing the user thread struct
can take an SLB fault and the SLB fault handler will write the same
scratch SPRG that now contains the userspace r13.
To fix this, we store r13 to the kernel stack (which can't fault)
before we access the user thread struct.
Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
as a random userspace segfault with r13 looking like a kernel address.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Bringmann [Thu, 20 Sep 2018 16:45:13 +0000 (11:45 -0500)]
powerpc/pseries: Fix unitialized timer reset on migration
After migration of a powerpc LPAR, the kernel executes code to
update the system state to reflect new platform characteristics.
Such changes include modifications to device tree properties provided
to the system by PHYP. Property notifications received by the
post_mobility_fixup() code are passed along to the kernel in general
through a call to of_update_property() which in turn passes such
events back to all modules through entries like the '.notifier_call'
function within the NUMA module.
When the NUMA module updates its state, it resets its event timer. If
this occurs after a previous call to stop_topology_update() or on a
system without VPHN enabled, the code runs into an unitialized timer
structure and crashes. This patch adds a safety check along this path
toward the problem code.
An example crash log is as follows.
ibmvscsi
30000081: Re-enabling adapter!
------------[ cut here ]------------
kernel BUG at kernel/time/timer.c:958!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
...
NIP mod_timer+0x4c/0x400
LR reset_topology_timer+0x40/0x60
Call Trace:
0xc0000003f9407830 (unreliable)
reset_topology_timer+0x40/0x60
dt_update_callback+0x100/0x120
notifier_call_chain+0x90/0x100
__blocking_notifier_call_chain+0x60/0x90
of_property_notify+0x90/0xd0
of_update_property+0x104/0x150
update_dt_property+0xdc/0x1f0
pseries_devicetree_update+0x2d0/0x510
post_mobility_fixup+0x7c/0xf0
migration_store+0xa4/0xc0
kobj_attr_store+0x30/0x60
sysfs_kf_write+0x64/0xa0
kernfs_fop_write+0x16c/0x240
__vfs_write+0x40/0x200
vfs_write+0xc8/0x240
ksys_write+0x5c/0x100
system_call+0x58/0x6c
Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Thiago Jung Bauermann [Thu, 20 Sep 2018 04:38:58 +0000 (01:38 -0300)]
powerpc/pkeys: Fix reading of ibm, processor-storage-keys property
scan_pkey_feature() uses of_property_read_u32_array() to read the
ibm,processor-storage-keys property and calls be32_to_cpu() on the
value it gets. The problem is that of_property_read_u32_array() already
returns the value converted to the CPU byte order.
The value of pkeys_total ends up more or less sane because there's a min()
call in pkey_initialize() which reduces pkeys_total to 32. So in practice
the kernel ignores the fact that the hypervisor reserved one key for
itself (the device tree advertises 31 keys in my test VM).
This is wrong, but the effect in practice is that when a process tries to
allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which
would indicate that there aren't any keys available
Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 10 Sep 2018 06:09:04 +0000 (06:09 +0000)]
powerpc: fix csum_ipv6_magic() on little endian platforms
On little endian platforms, csum_ipv6_magic() keeps len and proto in
CPU byte order. This generates a bad results leading to ICMPv6 packets
from other hosts being dropped by powerpc64le platforms.
In order to fix this, len and proto should be converted to network
byte order ie bigendian byte order. However checksumming 0x12345678
and 0x56341278 provide the exact same result so it is enough to
rotate the sum of len and proto by 1 byte.
PPC32 only support bigendian so the fix is needed for PPC64 only
Fixes: e9c4943a107b ("powerpc: Implement csum_ipv6_magic in assembly")
Reported-by: Jianlin Shi <jishi@redhat.com>
Reported-by: Xin Long <lucien.xin@gmail.com>
Cc: <stable@vger.kernel.org> # 4.18+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Tue, 11 Sep 2018 05:38:05 +0000 (15:38 +1000)]
powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
mpe: This was fixed originally in commit
d3d4ffaae439
("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but
contrary to what the merge commit says was inadvertently lost by me in
commit
ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next") which
brought in changes that moved the code to a new file. So reapply it to
the new file.
Original commit message follows:
We use PHB in mode1 which uses bit 59 to select a correct DMA window.
However there is mode2 which uses bits 59:55 and allows up to 32 DMA
windows per a PE.
Even though documentation does not clearly specify that, it seems that
the actual hardware does not support bits 59:55 even in mode1, in
other words we can create a window as big as 1<<58 but DMA simply
won't work.
This reduces the upper limit from 59 to 55 bits to let the userspace
know about the hardware limits.
Fixes: ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hari Bathini [Fri, 14 Sep 2018 14:06:02 +0000 (19:36 +0530)]
powerpc/fadump: re-register firmware-assisted dump if already registered
Firmware-Assisted Dump (FADump) needs to be registered again after any
memory hot add/remove operation to update the crash memory ranges. But
currently, the kernel returns '-EEXIST' if we try to register without
uregistering it first. This could expose the system to racing issues
while unregistering and registering FADump from userspace during udev
events. Spare the userspace of this and let it be taken care of in the
kernel space for a simpler interface.
Since this change, running 'echo 1 > /sys/kernel/fadump_registered'
would result in re-regisering (unregistering and registering) FADump,
if it was already registered.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Suraj Jitindar Singh [Wed, 5 Sep 2018 02:09:51 +0000 (12:09 +1000)]
powerpc/pseries: Remove VLA from lparcfg_write()
In lparcfg_write we hard code kbuf_sz and then use this as the variable
length of kbuf creating a variable length array. Since we're hard coding
the length anyway just define the array using this as the length and
remove the need for kbuf_sz, thus removing the variable length array.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Suraj Jitindar Singh [Wed, 5 Sep 2018 02:09:50 +0000 (12:09 +1000)]
powerpc/prom: Remove VLA in prom_check_platform_support()
In prom_check_platform_support() we retrieve and parse the
"ibm,arch-vec-5-platform-support" property of the chosen node.
Currently we use a variable length array however to avoid this use an
array of constant length 8.
This property is used to indicate the supported options of vector 5
bytes 23-26 of the ibm,architecture.vec node. Each of these options
is a pair of bytes, thus for 4 options we have a max length of 8 bytes.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Fri, 14 Sep 2018 04:06:48 +0000 (13:36 +0930)]
powerpc: Fix duplicate const clang warning in user access code
This re-applies commit
b91c1e3e7a6f ("powerpc: Fix duplicate const
clang warning in user access code") (Jun 2015) which was undone in
commits:
f2ca80905929 ("powerpc/sparse: Constify the address pointer in __get_user_nosleep()") (Feb 2017)
d466f6c5cac1 ("powerpc/sparse: Constify the address pointer in __get_user_nocheck()") (Feb 2017)
f84ed59a612d ("powerpc/sparse: Constify the address pointer in __get_user_check()") (Feb 2017)
We see a large number of duplicate const errors in the user access
code when building with llvm/clang:
include/linux/pagemap.h:576:8: warning: duplicate 'const' declaration specifier [-Wduplicate-decl-specifier]
ret = __get_user(c, uaddr);
The problem is we are doing const __typeof__(*(ptr)), which will hit
the warning if ptr is marked const.
Removing const does not seem to have any effect on GCC code
generation.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Joel Stanley [Fri, 14 Sep 2018 04:06:47 +0000 (13:36 +0930)]
powerpc/boot: Ensure _zimage_start is a weak symbol
When building with clang crt0's _zimage_start is not marked weak, which
breaks the build when linking the kernel image:
$ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
0000000000000058 g .text
0000000000000000 _zimage_start
ld: arch/powerpc/boot/wrapper.a(crt0.o): in function '_zimage_start':
(.text+0x58): multiple definition of '_zimage_start';
arch/powerpc/boot/pseries-head.o:(.text+0x0): first defined here
Clang requires the .weak directive to appear after the symbol is
declared. The binutils manual says:
This directive sets the weak attribute on the comma separated list of
symbol names. If the symbols do not already exist, they will be
created.
So it appears this is different with clang. The only reference I could
see for this was an OpenBSD mailing list post[1].
Changing it to be after the declaration fixes building with Clang, and
still works with GCC.
$ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
0000000000000058 w .text
0000000000000000 _zimage_start
Reported to clang as https://bugs.llvm.org/show_bug.cgi?id=38921
[1] https://groups.google.com/forum/#!topic/fa.openbsd.tech/PAgKKen2YCY
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Joel Stanley [Tue, 18 Sep 2018 03:36:17 +0000 (13:06 +0930)]
powerpc/configs: Update skiroot defconfig
Disable new features from recent releases, and clean out some other
unused options:
- Enable EXPERT, so we can disable some things
- Disable non-powerpc BPF decoders
- Disable TASKSTATS
- Disable unused syscalls
- Set more things to be modules
- Turn off unused network vendors
- PPC_OF_BOOT_TRAMPOLINE and FB_OF are unused on powernv
- Drop unused Radeon and Matrox GPU drivers
- IPV6 support landed in petitboot
- Bringup related command line powersave=off dropped, switch to quiet
Set CONFIG_I2C_CHARDEV=y as the module is not loaded automatically, and
without this i2cget etc. will fail in the skiroot environment.
This defconfig gets us build coverage of KERNEL_XZ, which was broken in
the 4.19 merge window for powerpc.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 17 Sep 2018 19:14:02 +0000 (14:14 -0500)]
powerpc/pseries: Disable CPU hotplug across migrations
When performing partition migrations all present CPUs must be online
as all present CPUs must make the H_JOIN call as part of the migration
process. Once all present CPUs make the H_JOIN call, one CPU is returned
to make the rtas call to perform the migration to the destination system.
During testing of migration and changing the SMT state we have found
instances where CPUs are offlined, as part of the SMT state change,
before they make the H_JOIN call. This results in a hung system where
every CPU is either in H_JOIN or offline.
To prevent this this patch disables CPU hotplug during the migration
process.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 10 Sep 2018 14:57:07 +0000 (09:57 -0500)]
powerpc/pseries: Remove unneeded uses of dlpar work queue
There are three instances in which dlpar hotplug events are invoked;
handling a hotplug interrupt (in a kvm guest), handling a dlpar
request through sysfs, and updating LMB affinity when handling a
PRRN event. Only in the case of handling a hotplug interrupt do we
have to put the work on a workqueue, the other cases can handle the
dlpar request directly.
This patch exports the handle_dlpar_errorlog() function so that
dlpar hotplug events can be handled directly and updates the two
instances mentioned above to use the direct invocation.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 10 Sep 2018 14:57:00 +0000 (09:57 -0500)]
powerpc/pseries: Remove prrn_work workqueue
When a PRRN event is received we are already running in a worker
thread. Instead of spawning off another worker thread on the prrn_work
workqueue to handle the PRRN event we can just call the PRRN handler
routine directly.
With this update we can also pass the scope variable for the PRRN
event directly to the handler instead of it being a global variable.
This patch fixes the following oops mnessage we are seeing in PRRN testing:
Oops: Bad kernel stack pointer, sig: 6 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: Yes, External 54
CPU: 7 PID: 18967 Comm: kworker/u96:0 Tainted: G X 4.4.126-94.22-default #1
Workqueue: pseries hotplug workque pseries_hp_work_fn
task:
c000000775367790 ti:
c00000001ebd4000 task.ti:
c00000070d140000
NIP:
0000000000000000 LR:
000000001fb3d050 CTR:
0000000000000000
REGS:
c00000001ebd7d40 TRAP: 0700 Tainted: G X (4.4.126-94.22-default)
MSR:
8000000102081000 <41,VEC,ME5 CR:
28000002 XER:
20040018 4
CFAR:
000000001fb3d084 40 419 1 3
GPR00:
000000000000000040000000000010007 000000001ffff400 000000041fffe200
GPR04:
000000000000008050000000000000000 000000001fb15fa8 0000000500000500
GPR08:
000000000001f40040000000000000001 0000000000000000 000005:
5200040002
GPR12:
00000000000000005c000000007a05400 c0000000000e89f8 000000001ed9f668
GPR16:
000000001fbeff944000000001fbeff94 000000001fb545e4 0000006000000060
GPR20:
ffffffffffffffff4ffffffffffffffff 0000000000000000 0000000000000000
GPR24:
00000000000000005400000001fb3c000 0000000000000000 000000001fb1b040
GPR28:
000000001fb240004000000001fb440d8 0000000000000008 0000000000000000
NIP [
0000000000000000] 5 (null)
LR [
000000001fb3d050]
031fb3d050
Call Trace: 4
Instruction dump: 4 5:47 12 2
XXXXXXXX XXXXXXXX XXXXX4XX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXX5XX XXXXXXXX
60000000 60000000 60000000 60000000
---[ end trace
aa5627b04a7d9d6b ]--- 3NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [kworker/27:0:13903]
Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache binfmt_misc reiserfs vfat fat rpadlpar_io(X) rpaphp(X) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag af_packet xfs libcrc32c dm_service_time ibmveth(X) ses enclosure scsi_transport_sas rtc_generic btrfs xor raid6_pq sd_mod ibmvscsi(X) scsi_transport_srp ipr(X) libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: Yes, External
CPU: 27 PID: 13903 Comm: kworker/27:0 Tainted: G D X 4.4.126-94.22-default #1
Workqueue: events prrn_work_fn
task:
c000000747cfa390 ti:
c00000074712c000 task.ti:
c00000074712c000
NIP:
c0000000008002a8 LR:
c000000000090770 CTR:
000000000032e088
REGS:
c00000074712f7b0 TRAP: 0901 Tainted: G D X (4.4.126-94.22-default)
MSR:
8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR:
22482044 XER:
20040000
CFAR:
c0000000008002c4 SOFTE: 1
GPR00:
c000000000090770 c00000074712fa30 c000000000f09800 c000000000fa1928 6:02
GPR04:
c000000775f5e000 fffffffffffffffe 0000000000000001 c000000000f42db8
GPR08:
0000000000000001 0000000080000007 0000000000000000 0000000000000000
GPR12:
8006210083180000 c000000007a14400
NIP [
c0000000008002a8] _raw_spin_lock+0x68/0xd0
LR [
c000000000090770] mobility_rtas_call+0x50/0x100
Call Trace: 59 5
[
c00000074712fa60] [
c000000000090770] mobility_rtas_call+0x50/0x100
[
c00000074712faf0] [
c000000000090b08] pseries_devicetree_update+0xf8/0x530
[
c00000074712fc20] [
c000000000031ba4] prrn_work_fn+0x34/0x50
[
c00000074712fc40] [
c0000000000e0390] process_one_work+0x1a0/0x4e0
[
c00000074712fcd0] [
c0000000000e0870] worker_thread+0x1a0/0x6105:57 2
[
c00000074712fd80] [
c0000000000e8b18] kthread+0x128/0x150
[
c00000074712fe30] [
c0000000000096f8] ret_from_kernel_thread+0x5c/0x64
Instruction dump:
2c090000 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 40de0018 5:540030 3
e8010010 ebe1fff8 7c0803a6 4e800020 <
7c210b78>
e92d0000 89290009 792affe3
Signed-off-by: John Allen <jallen@linux.ibm.com>
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Fri, 20 Apr 2018 20:29:48 +0000 (15:29 -0500)]
powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR request
The updates to powerpc numa and memory hotplug code now use the
in-kernel LMB array instead of the device tree. This change allows the
pseries memory DLPAR code to only update the device tree once after
successfully handling a DLPAR request.
Prior to the in-kernel LMB array, the numa code looked up the affinity
for memory being added in the device tree, the code now looks this up
in the LMB array. This change means the memory hotplug code can just
update the affinity for an LMB in the LMB array instead of updating
the device tree.
This also provides a savings in kernel memory. When updating the
device tree old properties are never free'ed since there is no
usecount on properties. This behavior leads to a new copy of the
property being allocated every time a LMB is added or removed (i.e. a
request to add 100 LMBs creates 100 new copies of the property). With
this update only a single new property is created when a DLPAR request
completes successfully.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 05:08:54 +0000 (15:08 +1000)]
powerpc: avoid -mno-sched-epilog on GCC 4.9 and newer
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 05:08:53 +0000 (15:08 +1000)]
powerpc: consolidate -mno-sched-epilog into FTRACE flags
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 05:08:52 +0000 (15:08 +1000)]
powerpc: remove old GCC version checks
GCC 4.6 is the minimum supported now.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:56 +0000 (01:30 +1000)]
powerpc/64s/hash: Add a SLB preload cache
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:55 +0000 (01:30 +1000)]
powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:54 +0000 (01:30 +1000)]
powerpc/64s: xmon do not dump hash fields when using radix mode
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:53 +0000 (01:30 +1000)]
powerpc/64s/hash: SLB allocation status bitmaps
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:52 +0000 (01:30 +1000)]
powerpc/64s/hash: remove user SLB data from the paca
User SLB mappig data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.
After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:51 +0000 (01:30 +1000)]
powerpc/64s/hash: convert SLB miss handlers to C
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory may not be accessed when handling kernel space
SLB misses, so care should be taken there. However user SLB misses can
access any kernel memory, which can be used to move some fields out of
the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to bad
address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh), we may
look at consolidating these tests and tightenig up the code but for
a first pass we decided it's better to check carefully.
Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:50 +0000 (01:30 +1000)]
powerpc/64s/hash: Use POWER9 SLBIA IH=3 variant in switch_slb
POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and
associated lookaside information that have a class value of 1, which
Linux assigns to user addresses. This matches what switch_slb wants,
and allows a simple fast implementation that avoids the slb_cache
complexity.
As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is
also avoided on POWER9.
Process context switching rate is improved about 2.2% for a small
process that hits the slb cache which is the best case for the current
code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:49 +0000 (01:30 +1000)]
powerpc/64s/hash: Use POWER6 SLBIA IH=1 variant in switch_slb
The SLBIA IH=1 hint will remove all non-zero SLBEs, but only
invalidate ERAT entries associated with a class value of 1, for
processors that support the hint (e.g., POWER6 and newer), which
Linux assigns to user addresses.
This prevents kernel ERAT entries from being invalidated when
context switchig (if the thread faulted in more than 8 user SLBEs).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:48 +0000 (01:30 +1000)]
powerpc/64s/hash: remove the vmalloc segment from the bolted SLB
Remove the vmalloc segment from bolted SLBEs. This is not required to
be bolted, and seems like it was added to help pre-load the SLB on
context switch. However there are now other segments like the vmemmap
segment and non-zero node memory that often take misses after a context
switch, so it is better to solve this in a more general way.
A subsequent change will track free SLB entries and uses those rather
than round-robin overwrite valid entries, which makes it far less
likely for kernel SLBEs to be evicted after they are installed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:47 +0000 (01:30 +1000)]
powerpc/64s/hash: move POWER5 < DD2.1 slbie workaround where it is needed
The POWER5 < DD2.1 issue is that slbie needs to be issued more than
once. It came in with this change:
ChangeSet@1.1608, 2004-04-29 07:12:31-07:00, david@gibson.dropbear.id.au
[PATCH] POWER5 erratum workaround
Early POWER5 revisions (<DD2.1) have a problem requiring slbie
instructions to be repeated under some circumstances. The patch below
adds a workaround (patch made by Anton Blanchard).
(aka.
3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree)
The extra slbie in switch_slb is done even for the case where slbia is
called (slb_flush_and_rebolt). I don't believe that is required
because there are other slb_flush_and_rebolt callers which do not
issue the workaround slbie, which would be broken if it was required.
It also seems to be fine inside the isync with the first slbie, as it
is in the kernel stack switch code.
So move this workaround to where it is required. This is not much of
an optimisation because this is the fast path, but it makes the code
more understandable and neater.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Retain slbie_data initialisation to avoid compiler warning]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:46 +0000 (01:30 +1000)]
powerpc/64s/hash: avoid the POWER5 < DD2.1 slb invalidate workaround on POWER8/9
I only have POWER8/9 to test, so just remove it for those.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 14 Sep 2018 15:30:45 +0000 (01:30 +1000)]
powerpc/64s/hash: Fix stab_rr off by one initialization
This causes SLB alloation to start 1 beyond the start of the SLB.
There is no real problem because after it wraps it stats behaving
properly, it's just surprisig to see when looking at SLB traces.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Tue, 11 Sep 2018 14:27:23 +0000 (19:57 +0530)]
powernv/pseries: consolidate code for mce early handling.
Now that other platforms also implements real mode mce handler,
lets consolidate the code by sharing existing powernv machine check
early code. Rename machine_check_powernv_early to
machine_check_common_early and reuse the code.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Tue, 11 Sep 2018 14:27:15 +0000 (19:57 +0530)]
powerpc/pseries: Dump the SLB contents on SLB MCE errors.
If we get a machine check exceptions due to SLB errors then dump the
current SLB contents which will be very much helpful in debugging the
root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
faulty SLB entries. In real mode mce handler saves the old SLB contents
into this buffer accessible through paca and print it out later in virtual
mode.
With this patch the console will log SLB contents like below on SLB MCE
errors:
[ 507.297236] SLB contents of cpu 0x1
[ 507.297237] Last SLB entry inserted at slot 16
[ 507.297238] 00
c000000008000000 400ea1b217000500
[ 507.297239] 1T ESID= c00000 VSID=
ea1b217 LLP:100
[ 507.297240] 01
d000000008000000 400d43642f000510
[ 507.297242] 1T ESID= d00000 VSID=
d43642f LLP:110
[ 507.297243] 11
f000000008000000 400a86c85f000500
[ 507.297244] 1T ESID= f00000 VSID=
a86c85f LLP:100
[ 507.297245] 12
00007f0008000000 4008119624000d90
[ 507.297246] 1T ESID= 7f VSID=
8119624 LLP:110
[ 507.297247] 13
0000000018000000 00092885f5150d90
[ 507.297247] 256M ESID= 1 VSID=
92885f5150 LLP:110
[ 507.297248] 14
0000010008000000 4009e7cb50000d90
[ 507.297249] 1T ESID= 1 VSID=
9e7cb50 LLP:110
[ 507.297250] 15
d000000008000000 400d43642f000510
[ 507.297251] 1T ESID= d00000 VSID=
d43642f LLP:110
[ 507.297252] 16
d000000008000000 400d43642f000510
[ 507.297253] 1T ESID= d00000 VSID=
d43642f LLP:110
[ 507.297253] ----------------------------------
[ 507.297254] SLB cache ptr value = 3
[ 507.297254] Valid SLB cache entries:
[ 507.297255] 00 EA[0-35]= 7f000
[ 507.297256] 01 EA[0-35]= 1
[ 507.297257] 02 EA[0-35]= 1000
[ 507.297257] Rest of SLB cache entries:
[ 507.297258] 03 EA[0-35]= 7f000
[ 507.297258] 04 EA[0-35]= 1
[ 507.297259] 05 EA[0-35]= 1000
[ 507.297260] 06 EA[0-35]= 12
[ 507.297260] 07 EA[0-35]= 7f000
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Tue, 11 Sep 2018 14:27:07 +0000 (19:57 +0530)]
powerpc/pseries: Display machine check error details.
Extract the MCE error details from RTAS extended log and display it to
console.
With this patch you should now see mce logs like below:
[ 142.371818] Severe Machine check interrupt [Recovered]
[ 142.371822] NIP [
d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[ 142.371822] Initiator: CPU
[ 142.371823] Error type: SLB [Multihit]
[ 142.371824] Effective address:
d00000000ca70000
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Tue, 11 Sep 2018 14:27:00 +0000 (19:57 +0530)]
powerpc/pseries: Flush SLB contents on SLB MCE errors.
On pseries, as of today system crashes if we get a machine check
exceptions due to SLB errors. These are soft errors and can be fixed
by flushing the SLBs so the kernel can continue to function instead of
system crash. We do this in real mode before turning on MMU. Otherwise
we would run into nested machine checks. This patch now fetches the
rtas error log in real mode and flushes the SLBs on SLB/ERAT errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Tue, 11 Sep 2018 14:26:52 +0000 (19:56 +0530)]
powerpc/pseries: Define MCE error event section.
On pseries, the machine check error details are part of RTAS extended
event log passed under Machine check exception section. This patch adds
the definition of rtas MCE event section and related helper
functions.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Wed, 12 Sep 2018 20:31:05 +0000 (17:31 -0300)]
selftests/powerpc: Do not fail with reschedule
There are cases where the test is not expecting to have the transaction
aborted, but, the test process might have been rescheduled, either in the
OS level or by KVM (if it is running on a KVM guest machine). The process
reschedule will cause a treclaim/recheckpoint which will cause the
transaction to doom, aborting the transaction as soon as the process is
rescheduled back to the CPU. This might cause the test to fail, but this is
not a failure in essence.
If that is the case, TEXASR[FC] is indicated with either
TM_CAUSE_RESCHEDULE or TM_CAUSE_KVM_RESCHEDULE for KVM interruptions.
In this scenario, ignore these two failures and avoid the whole test to
return failure.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Gustavo Romero <gromero@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Tue, 21 Aug 2018 18:44:48 +0000 (15:44 -0300)]
powerpc/iommu: Avoid derefence before pointer check
The tbl pointer is being derefenced by IOMMU_PAGE_SIZE prior the check
if it is not NULL.
Just moving the dereference code to after the check, where there will
be guarantee that 'tbl' will not be NULL.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Thu, 23 Aug 2018 23:26:39 +0000 (20:26 -0300)]
powerpc/xive: Use xive_cpu->chip_id instead of looking it up again
Function xive_native_get_ipi() might use chip_id without it being
initialized, if the CPU node is not found, as reported by smatch:
error: uninitialized symbol 'chip_id'
As suggested by Cédric, we can use xc->chip_id instead of consulting
the device tree for chip id, which is safe since xive_prepare_cpu()
should have initialized ->chip_id by the time xive_native_get_ipi() is
called.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
[mpe: Tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Lombard [Tue, 14 Aug 2018 12:45:15 +0000 (14:45 +0200)]
ocxl: Fix access to the AFU Descriptor Data
The AFU Information DVSEC capability is a means to extract common,
general information about all of the AFUs associated with a Function
independent of the specific functionality that each AFU provides.
Write in the AFU Index field allows to access to the descriptor data
for each AFU.
With the current code, we are not able to access to these specific data
when the index >= 1 because we are writing to the wrong location.
All requests to the data of each AFU are pointing to those of the AFU 0,
which could have impacts when using a card with more than one AFU per
function.
This patch fixes the access to the AFU Descriptor Data indexed by the
AFU Info Index field.
Fixes: 5ef3166e8a32 ("ocxl: Driver code for 'generic' opencapi devices")
Cc: stable <stable@vger.kernel.org> # 4.16
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Fri, 17 Aug 2018 04:25:01 +0000 (14:25 +1000)]
powerpc/memtrace: Remove memory in chunks
When hot-removing memory release_mem_region_adjustable() splits iomem
resources if they are not the exact size of the memory being
hot-deleted. Adding this memory back to the kernel adds a new resource.
Eg a node has memory 0x0 - 0xfffffffff. Hot-removing 1GB from
0xf40000000 results in the single resource 0x0-0xfffffffff being split
into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
When we hot-add the memory back we now have three resources:
0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
This is an issue if we try to remove some memory that overlaps
resources. Eg when trying to remove 2GB at address 0xf40000000,
release_mem_region_adjustable() fails as it expects the chunk of memory
to be within the boundaries of a single resource. We then get the
warning: "Unable to release resource" and attempting to use memtrace
again gives us this error: "bash: echo: write error: Resource
temporarily unavailable"
This patch makes memtrace remove memory in chunks that are always the
same size from an address that is always equal to end_of_memory -
n*size, for some n. So hotremoving and hotadding memory of different
sizes will now not attempt to remove memory that spans multiple
resources.
Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Fri, 14 Sep 2018 01:14:11 +0000 (11:14 +1000)]
powerpc: Avoid code patching freed init sections
This stops us from doing code patching in init sections after they've
been freed.
In this chain:
kvm_guest_init() ->
kvm_use_magic_page() ->
fault_in_pages_readable() ->
__get_user() ->
__get_user_nocheck() ->
barrier_nospec();
We have a code patching location at barrier_nospec() and
kvm_guest_init() is an init function. This whole chain gets inlined,
so when we free the init section (hence kvm_guest_init()), this code
goes away and hence should no longer be patched.
We seen this as userspace memory corruption when using a memory
checker while doing partition migration testing on powervm (this
starts the code patching post migration via
/sys/kernel/mobility/migration). In theory, it could also happen when
using /sys/kernel/debug/powerpc/barrier_nospec.
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Tue, 21 Aug 2018 01:04:12 +0000 (11:04 +1000)]
powerpc/64: Remove static branch hints from memset()
Static branch hints override dynamic branch prediction on recent
POWER CPUs. We should only use them when we are overwhelmingly
sure of the direction.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Laurent Dufour [Mon, 20 Aug 2018 14:29:36 +0000 (16:29 +0200)]
powerpc/pseries/mm: call H_BLOCK_REMOVE
This hypervisor's call allows to remove up to 8 ptes with only call to
tlbie.
The virtual pages must be all within the same naturally aligned 8 pages
virtual address block and have the same page and segment size encodings.
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Laurent Dufour [Mon, 20 Aug 2018 14:29:35 +0000 (16:29 +0200)]
powerpc/pseries/mm: factorize PTE slot computation
This part of code will be called also when dealing with H_BLOCK_REMOVE.
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>