Nikos Tsironis [Wed, 17 Jul 2019 11:24:10 +0000 (14:24 +0300)]
dm kcopyd: Increase default sub-job size to 512KB
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.
This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).
Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.
Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.
We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].
The benchmark:
1. Creates a 1G thin LV
2. Provisions the thin LV
3. Takes a snapshot of the thin LV
4. Writes to the thin LV with:
dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>
Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:
+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
| 1 MB | 256 KB | 242 | 280 |
| 1 MB | 512 KB | 238 | 295 |
| | | | |
| 2 MB | 256 KB | 238 | 354 |
| 2 MB | 512 KB | 241 | 380 |
| 2 MB | 1 MB | 245 | 394 |
| | | | |
| 4 MB | 256 KB | 248 | 412 |
| 4 MB | 512 KB | 234 | 432 |
| 4 MB | 1 MB | 251 | 474 |
| 4 MB | 2 MB | 257 | 504 |
| | | | |
| 8 MB | 256 KB | 239 | 420 |
| 8 MB | 512 KB | 256 | 431 |
| 8 MB | 1 MB | 264 | 467 |
| 8 MB | 2 MB | 264 | 502 |
| 8 MB | 4 MB | 281 | 537 |
+-----------------+-------------+------------------+-----------------+
[1] https://github.com/jthornber/device-mapper-test-suite
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Wed, 17 Jul 2019 15:12:30 +0000 (11:12 -0400)]
dm snapshot: fix oversights in optional discard support
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly. Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.
Fixes: 2e6023850e177d ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Tue, 16 Jul 2019 05:39:34 +0000 (14:39 +0900)]
dm zoned: fix zone state management race
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:
WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.
This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Junxiao Bi [Wed, 10 Jul 2019 00:17:19 +0000 (17:17 -0700)]
dm bufio: fix deadlock with loop device
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:
One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().
But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete. Resulting in ABBA
deadlock.
Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Wed, 19 Jun 2019 21:05:54 +0000 (17:05 -0400)]
dm snapshot: add optional discard support features
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.
discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device. This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.
The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.
When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet. Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted. In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist. If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused. Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.
Requested-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Milan Broz [Tue, 9 Jul 2019 13:22:14 +0000 (15:22 +0200)]
dm crypt: implement eboiv - encrypted byte-offset initialization vector
This IV is used in some BitLocker devices with CBC encryption mode.
IV is encrypted little-endian byte-offset (with the same key and cipher
as the volume).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Milan Broz [Tue, 9 Jul 2019 13:22:13 +0000 (15:22 +0200)]
dm crypt: remove obsolete comment about plumb IV
The URL is no longer valid and the comment is obsolete anyway
(the plumb IV was never used).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Milan Broz [Tue, 9 Jul 2019 13:22:12 +0000 (15:22 +0200)]
dm crypt: wipe private IV struct after key invalid flag is set
If a private IV wipe function fails, the code does not set the key
invalid flag. To fix this, move code to after the flag is set to
prevent the device from resuming in an inconsistent state.
Also, this allows using of a randomized key in private wipe function
(to be used in a following commit).
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fuqian Huang [Fri, 28 Jun 2019 02:47:34 +0000 (10:47 +0800)]
dm integrity: use kzalloc() instead of kmalloc() + memset()
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pavel Begunkov [Thu, 20 Jun 2019 17:50:50 +0000 (20:50 +0300)]
dm: update stale comment in end_clone_bio()
Since commit
a1ce35fa49852db60fc6e268 ("block: remove dead elevator
code") blk_end_request() has been replaced with blk_mq_end_request().
So update comment to reference blk_mq_end_request() accordingly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Qu Wenruo [Tue, 18 Jun 2019 05:39:38 +0000 (13:39 +0800)]
dm log writes: fix incorrect comment about the logged sequence example
dm-log-writes records the sequence of completion, not submission, thus
for the following sequence (W=write, C=complete):
Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd
The logged results in log device should be:
c,a,b,flush,fua
Fix the comment to give a better example.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Zhengyuan Liu [Wed, 12 Jun 2019 06:14:46 +0000 (14:14 +0800)]
dm log writes: use struct_size() to calculate size of pending_block
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Zhengyuan Liu [Wed, 12 Jun 2019 06:14:45 +0000 (14:14 +0800)]
dm crypt: use struct_size() when allocating encryption context
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Milan Broz [Wed, 22 May 2019 11:29:44 +0000 (13:29 +0200)]
dm integrity: always set version on superblock update
The new integrity bitmap mode uses the dirty flag. The dirty flag
should not be set in older superblock versions.
The current code sets it unconditionally, even if the superblock
was already formatted without bitmap in older system.
Fix this by moving the version check to one common place and check
version on every superblock write.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 2 Jul 2019 19:50:08 +0000 (15:50 -0400)]
dm thin metadata: check if in fail_io mode when setting needs_check
Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
crash in dm_bm_write_lock() while accessing the block manager object
that was previously destroyed as part of a failed
dm_pool_abort_metadata() that ultimately set fail_io to begin with.
Also, update DMERR() message to more accurately describe
superblock_lock() failure.
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Milan Broz [Thu, 20 Jun 2019 11:00:19 +0000 (13:00 +0200)]
dm verity: use message limit for data block corruption message
DM verity should also use DMERR_LIMIT to limit repeat data block
corruption messages.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Jerome Marchand [Wed, 12 Jun 2019 16:22:26 +0000 (18:22 +0200)]
dm table: don't copy from a NULL pointer in realloc_argv()
For the first call to realloc_argv() in dm_split_args(), old_argv is
NULL and size is zero. Then memcpy is called, with the NULL old_argv
as the source argument and a zero size argument. AFAIK, this is
undefined behavior and generates the following warning when compiled
with UBSAN on ppc64le:
In file included from ./arch/powerpc/include/asm/paca.h:19,
from ./arch/powerpc/include/asm/current.h:16,
from ./include/linux/sched.h:12,
from ./include/linux/kthread.h:6,
from drivers/md/dm-core.h:12,
from drivers/md/dm-table.c:8:
In function 'memcpy',
inlined from 'realloc_argv' at drivers/md/dm-table.c:565:3,
inlined from 'dm_split_args' at drivers/md/dm-table.c:588:9:
./include/linux/string.h:345:9: error: argument 2 null where non-null expected [-Werror=nonnull]
return __builtin_memcpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/dm-table.c: In function 'dm_split_args':
./include/linux/string.h:345:9: note: in a call to built-in function '__builtin_memcpy'
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
zhangyi (F) [Wed, 5 Jun 2019 13:27:08 +0000 (21:27 +0800)]
dm log writes: make sure super sector log updates are written in order
Currently, although we submit super bios in order (and super.nr_entries
is incremented by each logged entry), submit_bio() is async so each
super sector may not be written to log device in order and then the
final nr_entries may be smaller than it should be.
This problem can be reproduced by the xfstests generic/455 with ext4:
QA output created by 455
-Silence is golden
+mark 'end' does not exist
Fix this by serializing submission of super sectors to make sure each
is written to the log disk in order.
Fixes: 0e9cebe724597 ("dm: add log writes target")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Stephen Boyd [Wed, 5 Jun 2019 01:27:29 +0000 (18:27 -0700)]
dm init: remove trailing newline from calls to DMERR() and DMINFO()
These printing macros already add a trailing newline, so having another
one here just makes for blank lines when these prints are enabled.
Remove these needless newlines.
Fixes: 6bbc923dfcf5 ("dm: add support to directly boot to a mapped device")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Gen Zhang [Wed, 29 May 2019 01:33:20 +0000 (09:33 +0800)]
dm init: fix incorrect uses of kstrndup()
Fix 2 kstrndup() calls with incorrect argument order.
Fixes: 6bbc923dfcf5 ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org # v5.1
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Linus Torvalds [Sun, 16 Jun 2019 18:49:45 +0000 (08:49 -1000)]
Linux 5.2-rc5
Linus Torvalds [Sun, 16 Jun 2019 17:28:14 +0000 (07:28 -1000)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"The accumulated fixes from this and last week:
- Fix vmalloc TLB flush and map range calculations which lead to
stale TLBs, spurious faults and other hard to diagnose issues.
- Use fault_in_pages_writable() for prefaulting the user stack in the
FPU code as it's less fragile than the current solution
- Use the PF_KTHREAD flag when checking for a kernel thread instead
of current->mm as the latter can give the wrong answer due to
use_mm()
- Compute the vmemmap size correctly for KASLR and 5-Level paging.
Otherwise this can end up with a way too small vmemmap area.
- Make KASAN and 5-level paging work again by making sure that all
invalid bits are masked out when computing the P4D offset. This
worked before but got broken recently when the LDT remap area was
moved.
- Prevent a NULL pointer dereference in the resource control code
which can be triggered with certain mount options when the
requested resource is not available.
- Enforce ordering of microcode loading vs. perf initialization on
secondary CPUs. Otherwise perf tries to access a non-existing MSR
as the boot CPU marked it as available.
- Don't stop the resource control group walk early otherwise the
control bitmaps are not updated correctly and become inconsistent.
- Unbreak kgdb by returning 0 on success from
kgdb_arch_set_breakpoint() instead of an error code.
- Add more Icelake CPU model defines so depending changes can be
queued in other trees"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
x86/kasan: Fix boot with 5-level paging and KASAN
x86/fpu: Don't use current->mm to check for a kthread
x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
x86/resctrl: Don't stop walking closids when a locksetup group is found
x86/fpu: Update kernel's FPU state before using for the fsave header
x86/mm/KASLR: Compute the size of the vmemmap section properly
x86/fpu: Use fault_in_pages_writeable() for pre-faulting
x86/CPU: Add more Icelake model numbers
mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
mm/vmalloc: Fix calculation of direct map addr range
Linus Torvalds [Sun, 16 Jun 2019 17:22:56 +0000 (07:22 -1000)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A set of small fixes:
- Repair the ktime_get_coarse() functions so they actually deliver
what they are supposed to: tick granular time stamps. The current
code missed to add the accumulated nanoseconds part of the
timekeeper so the resulting granularity was 1 second.
- Prevent the tracer from infinitely recursing into time getter
functions in the arm architectured timer by marking these functions
notrace
- Fix a trivial compiler warning caused by wrong qualifier ordering"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Repair ktime_get_coarse*() granularity
clocksource/drivers/arm_arch_timer: Don't trace count reader functions
clocksource/drivers/timer-ti-dm: Change to new style declaration
Linus Torvalds [Sun, 16 Jun 2019 17:19:15 +0000 (07:19 -1000)]
Merge branch 'ras-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two small fixes for RAS:
- Use a proper search algorithm to find the correct element in the
CEC array. The replacement was a better choice than fixing the
crash causes by the original search function with horrible duct
tape.
- Move the timer based decay function into thread context so it can
actually acquire the mutex which protects the CEC array to prevent
corruption"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
RAS/CEC: Convert the timer callback to a workqueue
RAS/CEC: Fix binary search function
Linus Torvalds [Sat, 15 Jun 2019 17:38:54 +0000 (07:38 -1000)]
Merge tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Andy Shevchenko:
- fix a couple of Mellanox driver enumeration issues
- fix ASUS laptop regression with backlight
- fix Dell computers that got a wrong mode (tablet versus laptop) after
resume
* tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86:
platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow
platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
platform/x86: intel-vbtn: Report switch events when event wakes device
platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
Linus Torvalds [Sat, 15 Jun 2019 17:34:23 +0000 (07:34 -1000)]
Merge tag 'usb-5.2-rc5' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.2-rc5
Nothing major, just some small gadget fixes, usb-serial new device
ids, a few new quirks, and some small fixes for some regressions that
have been found after the big 5.2-rc1 merge.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: Make sure an alt mode exist before getting its partner
usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()
usb: gadget: dwc2: fix zlp handling
usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA
usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()
usb: dwc2: Fix DMA cache alignment issues
usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
USB: usb-storage: Add new ID to ums-realtek
usb: typec: ucsi: ccg: fix memory leak in do_flash
USB: serial: option: add Telit 0x1260 and 0x1261 compositions
USB: serial: pl2303: add Allied Telesis VT-Kit3
USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
Linus Torvalds [Sat, 15 Jun 2019 17:29:32 +0000 (07:29 -1000)]
Merge tag 'powerpc-5.2-4' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix for a regression introduced by our 32-bit KASAN support, which
broke booting on machines with "bootx" early debugging enabled.
A fix for a bug which broke kexec on 32-bit, introduced by changes to
the 32-bit STRICT_KERNEL_RWX support in v5.1.
Finally two fixes going to stable for our THP split/collapse handling,
discovered by Nick. The first fixes random crashes and/or corruption
in guests under sufficient load.
Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
Malaterre"
* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
powerpc/64s: Fix THP PMD collapse serialisation
powerpc: Fix kexec failure on book3s/32
Linus Torvalds [Sat, 15 Jun 2019 17:24:11 +0000 (07:24 -1000)]
Merge tag 'trace-v5.2-rc4' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Out of range read of stack trace output
- Fix for NULL pointer dereference in trace_uprobe_create()
- Fix to a livepatching / ftrace permission race in the module code
- Fix for NULL pointer dereference in free_ftrace_func_mapper()
- A couple of build warning clean ups
* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
module: Fix livepatch/ftrace module text permissions race
tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
tracing: Make two symbols static
tracing: avoid build warning with HAVE_NOP_MCOUNT
tracing: Fix out-of-range read in trace_stack_print()
Borislav Petkov [Thu, 13 Jun 2019 13:49:02 +0000 (15:49 +0200)]
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
Adric Blake reported the following warning during suspend-resume:
Enabling non-boot CPUs ...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
Call Trace:
intel_set_tfa
intel_pmu_cpu_starting
? x86_pmu_dead_cpu
x86_pmu_starting_cpu
cpuhp_invoke_callback
? _raw_spin_lock_irqsave
notify_cpu_starting
start_secondary
secondary_startup_64
microcode: sig=0x806ea, pf=0x80, revision=0x96
microcode: updated to revision 0xb4, date = 2019-04-01
CPU1 is up
The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
by microcode. The log above shows that the microcode loader callback
happens after the PMU restoration, leading to the conjecture that
because the microcode hasn't been updated yet, that MSR is not present
yet, leading to the #GP.
Add a microcode loader-specific hotplug vector which comes before
the PERF vectors and thus executes earlier and makes sure the MSR is
present.
Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Adric Blake <promarbler14@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: x86@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
Linus Torvalds [Sat, 15 Jun 2019 03:46:14 +0000 (17:46 -1000)]
Merge branch 'for-5.2-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"This has an unusually high density of tricky fixes:
- task_get_css() could deadlock when it races against a dying cgroup.
- cgroup.procs didn't list thread group leaders with live threads.
This could mislead readers to think that a cgroup is empty when
it's not. Fixed by making PROCS iterator include dead tasks. I made
a couple mistakes making this change and this pull request contains
a couple follow-up patches.
- When cpusets run out of online cpus, it updates cpusmasks of member
tasks in bizarre ways. Joel improved the behavior significantly"
* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: restore sanity to cpuset_cpus_allowed_fallback()
cgroup: Fix css_task_iter_advance_css_set() cset skip condition
cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
cgroup: Include dying leaders with live threads in PROCS iterations
cgroup: Implement css_task_iter_skip()
cgroup: Call cgroup_release() before __exit_signal()
docs cgroups: add another example size for hugetlb
cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
Linus Torvalds [Sat, 15 Jun 2019 03:34:45 +0000 (17:34 -1000)]
Merge tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
"Nothing unsettling here, also not aware of anything serious still
pending.
The edid override regression fix took a bit longer since this seems to
be an area with an overabundance of bad options. But the fix we have
now seems like a good path forward.
Next week it should be back to Dave.
Summary:
- fix regression on amdgpu on SI
- fix edid override regression
- driver fixes: amdgpu, i915, mediatek, meson, panfrost
- fix writecombine for vmap in gem-shmem helper (used by panfrost)
- add more panel quirks"
* tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm: (25 commits)
drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
drm: add fallback override/firmware EDID modes workaround
drm/edid: abstract override/firmware EDID retrieval
drm/i915/perf: fix whitelist on Gen10+
drm/i915/sdvo: Implement proper HDMI audio support for SDVO
drm/i915: Fix per-pixel alpha with CCS
drm/i915/dmc: protect against reading random memory
drm/i915/dsi: Use a fuzzy check for burst mode clock check
drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc
drm/panfrost: Require the simple_ondemand governor
drm/panfrost: make devfreq optional again
drm/gem_shmem: Use a writecombine mapping for ->vaddr
drm: panel-orientation-quirks: Add quirk for GPD MicroPC
drm: panel-orientation-quirks: Add quirk for GPD pocket2
drm/meson: fix G12A primary plane disabling
drm/meson: fix primary plane disabling
drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()
drm/mediatek: clear num_pipes when unbind driver
...
Linus Torvalds [Sat, 15 Jun 2019 03:27:12 +0000 (17:27 -1000)]
Merge tag 'gfs2-v5.2.fixes2' of git://git./linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"Fix rounding error in gfs2_iomap_page_prepare"
* tag 'gfs2-v5.2.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix rounding error in gfs2_iomap_page_prepare
Linus Torvalds [Sat, 15 Jun 2019 01:52:51 +0000 (15:52 -1000)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"A single bug fix for hpsa.
The user visible consequences aren't clear, but the ioaccel2 raid
acceleration may misfire on the malformed request assuming the payload
is big enough to require chaining (more than 31 sg entries)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hpsa: correct ioaccel2 chaining
Linus Torvalds [Sat, 15 Jun 2019 01:41:18 +0000 (15:41 -1000)]
Merge tag 'for-linus-
20190614' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Remove references to old schedulers for the scheduler switching and
blkio controller documentation (Andreas)
- Kill duplicate check for report zone for null_blk (Chaitanya)
- Two bcache fixes (Coly)
- Ensure that mq-deadline is selected if zoned block device is enabled,
as we need that to support them (Damien)
- Fix io_uring memory leak (Eric)
- ps3vram fallout from LBDAF removal (Geert)
- Redundant blk-mq debugfs debugfs_create return check cleanup (Greg)
- Extend NOPLM quirk for ST1000LM024 drives (Hans)
- Remove error path warning that can now trigger after the queue
removal/addition fixes (Ming)
* tag 'for-linus-
20190614' of git://git.kernel.dk/linux-block:
block/ps3vram: Use %llu to format sector_t after LBDAF removal
libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
bcache: fix stack corruption by PRECEDING_KEY()
blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests
blkio-controller.txt: Remove references to CFQ
block/switching-sched.txt: Update to blk-mq schedulers
null_blk: remove duplicate check for report zone
blk-mq: no need to check return value of debugfs_create functions
io_uring: fix memory leak of UNIX domain socket inode
block: force select mq-deadline for zoned block devices
Linus Torvalds [Sat, 15 Jun 2019 01:25:27 +0000 (15:25 -1000)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C has two simple but wanted driver fixes for you"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pca-platform: Fix GPIO lookup code
i2c: acorn: fix i2c warning
Casey Schaufler [Fri, 31 May 2019 10:53:33 +0000 (11:53 +0100)]
Smack: Restore the smackfsdef mount option and add missing prefixes
The 5.1 mount system rework changed the smackfsdef mount option to
smackfsdefault. This fixes the regression by making smackfsdef treated
the same way as smackfsdefault.
Also fix the smack_param_specs[] to have "smack" prefixes on all the
names. This isn't visible to a user unless they either:
(a) Try to mount a filesystem that's converted to the internal mount API
and that implements the ->parse_monolithic() context operation - and
only then if they call security_fs_context_parse_param() rather than
security_sb_eat_lsm_opts().
There are no examples of this upstream yet, but nfs will probably want
to do this for nfs2 or nfs3.
(b) Use fsconfig() to configure the filesystem - in which case
security_fs_context_parse_param() will be called.
This issue is that smack_sb_eat_lsm_opts() checks for the "smack" prefix
on the options, but smack_fs_context_parse_param() does not.
Fixes: c3300aaf95fb ("smack: get rid of match_token()")
Fixes: 2febd254adc4 ("smack: Implement filesystem context security hooks")
Cc: stable@vger.kernel.org
Reported-by: Jose Bollo <jose.bollo@iot.bzh>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Li [Thu, 6 Jun 2019 03:17:54 +0000 (11:17 +0800)]
ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
The mapper may be NULL when called from register_ftrace_function_probe()
with probe->data == NULL.
This issue can be reproduced as follow (it may be covered by compiler
optimization sometime):
/ # cat /sys/kernel/debug/tracing/set_ftrace_filter
#### all functions enabled ####
/ # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
[ 206.949100] Unable to handle kernel NULL pointer dereference at virtual address
0000000000000000
[ 206.952402] Mem abort info:
[ 206.952819] ESR = 0x96000006
[ 206.955326] Exception class = DABT (current EL), IL = 32 bits
[ 206.955844] SET = 0, FnV = 0
[ 206.956272] EA = 0, S1PTW = 0
[ 206.956652] Data abort info:
[ 206.957320] ISV = 0, ISS = 0x00000006
[ 206.959271] CM = 0, WnR = 0
[ 206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=
0000000419f3a000
[ 206.960483] [
0000000000000000] pgd=
0000000411a87003, pud=
0000000411a83003, pmd=
0000000000000000
[ 206.964953] Internal error: Oops:
96000006 [#1] SMP
[ 206.971122] Dumping ftrace buffer:
[ 206.973677] (ftrace buffer empty)
[ 206.975258] Modules linked in:
[ 206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
[ 206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
[ 206.978955] Hardware name: linux,dummy-virt (DT)
[ 206.979883] pstate:
60000005 (nZCv daif -PAN -UAO)
[ 206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
[ 206.980874] lr : ftrace_count_free+0x68/0x80
[ 206.982539] sp :
ffff0000182f3ab0
[ 206.983102] x29:
ffff0000182f3ab0 x28:
ffff8003d0ec1700
[ 206.983632] x27:
ffff000013054b40 x26:
0000000000000001
[ 206.984000] x25:
ffff00001385f000 x24:
0000000000000000
[ 206.984394] x23:
ffff000013453000 x22:
ffff000013054000
[ 206.984775] x21:
0000000000000000 x20:
ffff00001385fe28
[ 206.986575] x19:
ffff000013872c30 x18:
0000000000000000
[ 206.987111] x17:
0000000000000000 x16:
0000000000000000
[ 206.987491] x15:
ffffffffffffffb0 x14:
0000000000000000
[ 206.987850] x13:
000000000017430e x12:
0000000000000580
[ 206.988251] x11:
0000000000000000 x10:
cccccccccccccccc
[ 206.988740] x9 :
0000000000000000 x8 :
ffff000013917550
[ 206.990198] x7 :
ffff000012fac2e8 x6 :
ffff000012fac000
[ 206.991008] x5 :
ffff0000103da588 x4 :
0000000000000001
[ 206.991395] x3 :
0000000000000001 x2 :
ffff000013872a28
[ 206.991771] x1 :
0000000000000000 x0 :
0000000000000000
[ 206.992557] Call trace:
[ 206.993101] free_ftrace_func_mapper+0x2c/0x118
[ 206.994827] ftrace_count_free+0x68/0x80
[ 206.995238] release_probe+0xfc/0x1d0
[ 206.995555] register_ftrace_function_probe+0x4a8/0x868
[ 206.995923] ftrace_trace_probe_callback.isra.4+0xb8/0x180
[ 206.996330] ftrace_dump_callback+0x50/0x70
[ 206.996663] ftrace_regex_write.isra.29+0x290/0x3a8
[ 206.997157] ftrace_filter_write+0x44/0x60
[ 206.998971] __vfs_write+0x64/0xf0
[ 206.999285] vfs_write+0x14c/0x2f0
[ 206.999591] ksys_write+0xbc/0x1b0
[ 206.999888] __arm64_sys_write+0x3c/0x58
[ 207.000246] el0_svc_common.constprop.0+0x408/0x5f0
[ 207.000607] el0_svc_handler+0x144/0x1c8
[ 207.000916] el0_svc+0x8/0xc
[ 207.003699] Code:
aa0003f8 a9025bf5 aa0103f5 f946ea80 (
f9400303)
[ 207.008388] ---[ end trace
7b6d11b5f542bdf1 ]---
[ 207.010126] Kernel panic - not syncing: Fatal exception
[ 207.011322] SMP: stopping secondary CPUs
[ 207.013956] Dumping ftrace buffer:
[ 207.014595] (ftrace buffer empty)
[ 207.015632] Kernel Offset: disabled
[ 207.017187] CPU features: 0x002,
20006008
[ 207.017985] Memory Limit: none
[ 207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---
Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.com
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Josh Poimboeuf [Fri, 14 Jun 2019 01:07:22 +0000 (20:07 -0500)]
module: Fix livepatch/ftrace module text permissions race
It's possible for livepatch and ftrace to be toggling a module's text
permissions at the same time, resulting in the following panic:
BUG: unable to handle page fault for address:
ffffffffc005b1d9
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD
3ea0c067 P4D
3ea0c067 PUD
3ea0e067 PMD
3cc13067 PTE
3b8a1061
Oops: 0003 [#1] PREEMPT SMP PTI
CPU: 1 PID: 453 Comm: insmod Tainted: G O K 5.2.0-rc1-
a188339ca5 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
RIP: 0010:apply_relocate_add+0xbe/0x14c
Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
RSP: 0018:
ffffb223c00dbb10 EFLAGS:
00010246
RAX:
ffffffffc005b1d9 RBX:
0000000000000000 RCX:
ffffffff8b200060
RDX:
000000000000000b RSI:
0000004b0000000b RDI:
ffff96bdfcd33000
RBP:
ffffb223c00dbb38 R08:
ffffffffc005d040 R09:
ffffffffc005c1f0
R10:
ffff96bdfcd33c40 R11:
ffff96bdfcd33b80 R12:
0000000000000018
R13:
ffffffffc005c1f0 R14:
ffffffffc005e708 R15:
ffffffff8b2fbc74
FS:
00007f5f447beba8(0000) GS:
ffff96bdff900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
ffffffffc005b1d9 CR3:
000000003cedc002 CR4:
0000000000360ea0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
klp_init_object_loaded+0x10f/0x219
? preempt_latency_start+0x21/0x57
klp_enable_patch+0x662/0x809
? virt_to_head_page+0x3a/0x3c
? kfree+0x8c/0x126
patch_init+0x2ed/0x1000 [livepatch_test02]
? 0xffffffffc0060000
do_one_initcall+0x9f/0x1c5
? kmem_cache_alloc_trace+0xc4/0xd4
? do_init_module+0x27/0x210
do_init_module+0x5f/0x210
load_module+0x1c41/0x2290
? fsnotify_path+0x3b/0x42
? strstarts+0x2b/0x2b
? kernel_read+0x58/0x65
__do_sys_finit_module+0x9f/0xc3
? __do_sys_finit_module+0x9f/0xc3
__x64_sys_finit_module+0x1a/0x1c
do_syscall_64+0x52/0x61
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The above panic occurs when loading two modules at the same time with
ftrace enabled, where at least one of the modules is a livepatch module:
CPU0 CPU1
klp_enable_patch()
klp_init_object_loaded()
module_disable_ro()
ftrace_module_enable()
ftrace_arch_code_modify_post_process()
set_all_modules_text_ro()
klp_write_object_relocations()
apply_relocate_add()
*patches read-only code* - BOOM
A similar race exists when toggling ftrace while loading a livepatch
module.
Fix it by ensuring that the livepatch and ftrace code patching
operations -- and their respective permissions changes -- are protected
by the text_mutex.
Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com
Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
Fixes: 444d13ff10fb ("modules: add ro_after_init support")
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Eiichi Tsukata [Fri, 14 Jun 2019 07:40:26 +0000 (16:40 +0900)]
tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
Commit
0597c49c69d5 ("tracing/uprobes: Use dyn_event framework for
uprobe events") cleaned up the usage of trace_uprobe_create(), and the
function has been no longer used for removing uprobe/uretprobe.
Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.com
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Eiichi Tsukata [Fri, 14 Jun 2019 07:40:25 +0000 (16:40 +0900)]
tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
Just like the case of commit
8b05a3a7503c ("tracing/kprobes: Fix NULL
pointer dereference in trace_kprobe_create()"), writing an incorrectly
formatted string to uprobe_events can trigger NULL pointer dereference.
Reporeducer:
# echo r > /sys/kernel/debug/tracing/uprobe_events
dmesg:
BUG: kernel NULL pointer dereference, address:
0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD
8000000079d12067 P4D
8000000079d12067 PUD
7b7ab067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
RIP: 0010:strchr+0x0/0x30
Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 75 0e eb 13 0f b6 47 01 48 83 c
RSP: 0018:
ffffb55fc0403d10 EFLAGS:
00010293
RAX:
ffff993ffb793400 RBX:
0000000000000000 RCX:
ffffffffa4852625
RDX:
0000000000000000 RSI:
000000000000002f RDI:
0000000000000000
RBP:
ffffb55fc0403dd0 R08:
ffff993ffb793400 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000000
R13:
ffff993ff9cc1668 R14:
0000000000000001 R15:
0000000000000000
FS:
00007f30c5147700(0000) GS:
ffff993ffda00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000000 CR3:
000000007b628000 CR4:
00000000000006f0
Call Trace:
trace_uprobe_create+0xe6/0xb10
? __kmalloc_track_caller+0xe6/0x1c0
? __kmalloc+0xf0/0x1d0
? trace_uprobe_create+0xb10/0xb10
create_or_delete_trace_uprobe+0x35/0x90
? trace_uprobe_create+0xb10/0xb10
trace_run_command+0x9c/0xb0
trace_parse_run_command+0xf9/0x1eb
? probes_open+0x80/0x80
__vfs_write+0x43/0x90
vfs_write+0x14a/0x2a0
ksys_write+0xa2/0x170
do_syscall_64+0x7f/0x200
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: http://lkml.kernel.org/r/20190614074026.8045-1-devel@etsukata.com
Cc: stable@vger.kernel.org
Fixes: 0597c49c69d5 ("tracing/uprobes: Use dyn_event framework for uprobe events")
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
YueHaibing [Fri, 14 Jun 2019 15:32:10 +0000 (23:32 +0800)]
tracing: Make two symbols static
Fix sparse warnings:
kernel/trace/trace.c:6927:24: warning:
symbol 'get_tracing_log_err' was not declared. Should it be static?
kernel/trace/trace.c:8196:15: warning:
symbol 'trace_instance_dir' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.com
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Vasily Gorbik [Wed, 5 Jun 2019 11:11:58 +0000 (13:11 +0200)]
tracing: avoid build warning with HAVE_NOP_MCOUNT
Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
testing CC_USING_* defines) to avoid conditional compilation and fix
the following gcc 9 warning on s390:
kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
but not used [-Wunused-function]
Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
Fixes: 2f4df0017baed ("tracing: Add -mcount-nop option support")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Eiichi Tsukata [Mon, 10 Jun 2019 04:00:16 +0000 (13:00 +0900)]
tracing: Fix out-of-range read in trace_stack_print()
Puts range check before dereferencing the pointer.
Reproducer:
# echo stacktrace > trace_options
# echo 1 > events/enable
# cat trace > /dev/null
KASAN report:
==================================================================
BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
Read of size 8 at addr
ffff888069d20000 by task cat/1953
CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
Call Trace:
dump_stack+0x8a/0xce
print_address_description+0x60/0x224
? trace_stack_print+0x26b/0x2c0
? trace_stack_print+0x26b/0x2c0
__kasan_report.cold+0x1a/0x3e
? trace_stack_print+0x26b/0x2c0
kasan_report+0xe/0x20
trace_stack_print+0x26b/0x2c0
print_trace_line+0x6ea/0x14d0
? tracing_buffers_read+0x700/0x700
? trace_find_next_entry_inc+0x158/0x1d0
s_show+0xea/0x310
seq_read+0xaa7/0x10e0
? seq_escape+0x230/0x230
__vfs_read+0x7c/0x100
vfs_read+0x16c/0x3a0
ksys_read+0x121/0x240
? kernel_write+0x110/0x110
? perf_trace_sys_enter+0x8a0/0x8a0
? syscall_slow_exit_work+0xa9/0x410
do_syscall_64+0xb7/0x390
? prepare_exit_to_usermode+0x165/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f867681f910
Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
RSP: 002b:
00007ffdabf23488 EFLAGS:
00000246 ORIG_RAX:
0000000000000000
RAX:
ffffffffffffffda RBX:
0000000000020000 RCX:
00007f867681f910
RDX:
0000000000020000 RSI:
00007f8676cde000 RDI:
0000000000000003
RBP:
00007f8676cde000 R08:
ffffffffffffffff R09:
0000000000000000
R10:
0000000000000871 R11:
0000000000000246 R12:
00007f8676cde000
R13:
0000000000000003 R14:
0000000000020000 R15:
0000000000000ec0
Allocated by task 1214:
save_stack+0x1b/0x80
__kasan_kmalloc.constprop.0+0xc2/0xd0
kmem_cache_alloc+0xaf/0x1a0
getname_flags+0xd2/0x5b0
do_sys_open+0x277/0x5a0
do_syscall_64+0xb7/0x390
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 1214:
save_stack+0x1b/0x80
__kasan_slab_free+0x12c/0x170
kmem_cache_free+0x8a/0x1c0
putname+0xe1/0x120
do_sys_open+0x2c5/0x5a0
do_syscall_64+0xb7/0x390
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at
ffff888069d20000
which belongs to the cache names_cache of size 4096
The buggy address is located 0 bytes inside of
4096-byte region [
ffff888069d20000,
ffff888069d21000)
The buggy address belongs to the page:
page:
ffffea0001a74800 refcount:1 mapcount:0 mapping:
ffff88806ccd1380 index:0x0 compound_mapcount: 0
flags: 0x100000000010200(slab|head)
raw:
0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
raw:
0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com
Fixes: 4285f2fcef80 ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Andreas Gruenbacher [Sat, 8 Jun 2019 12:09:02 +0000 (13:09 +0100)]
gfs2: Fix rounding error in gfs2_iomap_page_prepare
The pos and len arguments to the iomap page_prepare callback are not
block aligned, so we need to take that into account when computing the
number of blocks.
Fixes: d0a22a4b03b8 ("gfs2: Fix iomap write page reclaim deadlock")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Linus Torvalds [Fri, 14 Jun 2019 16:16:47 +0000 (06:16 -1000)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Here are some arm64 fixes for -rc5.
The only non-trivial change (in terms of the diffstat) is fixing our
SVE ptrace API for big-endian machines, but the majority of this is
actually the addition of much-needed comments and updates to the
documentation to try to avoid this mess biting us again in future.
There are still a couple of small things on the horizon, but nothing
major at this point.
Summary:
- Fix broken SVE ptrace API when running in a big-endian configuration
- Fix performance regression due to off-by-one in TLBI range checking
- Fix build regression when using Clang"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/sve: Fix missing SVE/FPSIMD endianness conversions
arm64: tlbflush: Ensure start/end of address range are aligned to stride
arm64: Don't unconditionally add -Wno-psabi to KBUILD_CFLAGS
Linus Torvalds [Fri, 14 Jun 2019 16:08:46 +0000 (06:08 -1000)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/devm_memremap_pages: fix final page put race
PCI/P2PDMA: track pgmap references per resource, not globally
lib/genalloc: introduce chunk owners
PCI/P2PDMA: fix the gen_pool_add_virt() failure path
mm/devm_memremap_pages: introduce devm_memunmap_pages
drivers/base/devres: introduce devm_release_action()
mm/vmscan.c: fix trying to reclaim unevictable LRU page
coredump: fix race condition between collapse_huge_page() and core dumping
mm/mlock.c: change count_mm_mlocked_page_nr return type
mm: mmu_gather: remove __tlb_reset_range() for force flush
fs/ocfs2: fix race in ocfs2_dentry_attach_lock()
mm/vmscan.c: fix recent_rotated history
mm/mlock.c: mlockall error for flag MCL_ONFAULT
scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE
mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
mm: memcontrol: don't batch updates of local VM stats and events
Daniel Vetter [Fri, 14 Jun 2019 15:46:54 +0000 (17:46 +0200)]
Merge branch 'drm-fixes-5.2' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Fixes for 5.2:
- Extend previous vce fix for resume to uvd and vcn
- Fix bounds checking in ras debugfs interface
- Fix a regression on SI using amdgpu
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190613021856.3307-1-alexander.deucher@amd.com
Linus Torvalds [Fri, 14 Jun 2019 15:49:35 +0000 (05:49 -1000)]
Merge tag 'iommu-fixes-v5.2-rc4' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- three fixes for Intel VT-d to fix a potential dead-lock, a formatting
fix and a bit setting fix
- one fix for the ARM-SMMU to make it work on some platforms with
sub-optimal SMMU emulation
* tag 'iommu-fixes-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/arm-smmu: Avoid constant zero in TLBI writes
iommu/vt-d: Set the right field for Page Walk Snoop
iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock
iommu: Add missing new line for dma type
Linus Torvalds [Fri, 14 Jun 2019 15:48:29 +0000 (05:48 -1000)]
Merge tag 'gpio-v5.2-3' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fix from Linus Walleij:
"A single fix for the PCA953x driver affecting some fringe variants of
the chip"
* tag 'gpio-v5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: pca953x: hack to fix 24 bit gpio expanders
Linus Torvalds [Fri, 14 Jun 2019 15:37:06 +0000 (05:37 -1000)]
Merge tag 'sound-5.2-rc5' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"It might feel like deja vu to receive a bulk of changes at rc5, and it
happens again; we've got a collection of fixes for ASoC. Most of fixes
are targeted for the newly merged SOF (Sound Open Firmware) stuff and
the relevant fixes for Intel platforms.
Other than that, there are a few regression fixes for the recent ASoC
core changes and HD-audio quirk, as well as a couple of FireWire fixes
and for other ASoC codecs"
* tag 'sound-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (54 commits)
Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"
ALSA: ice1712: Check correct return value to snd_i2c_sendbytes (EWS/DMX 6Fire)
ALSA: oxfw: allow PCM capture for Stanton SCS.1m
ALSA: firewire-motu: fix destruction of data for isochronous resources
ASoC: Intel: sst: fix kmalloc call with wrong flags
ASoC: core: Fix deadlock in snd_soc_instantiate_card()
SoC: rt274: Fix internal jack assignment in set_jack callback
ALSA: hdac: fix memory release for SST and SOF drivers
ASoC: SOF: Intel: hda: use the defined ppcap functions
ASoC: core: move DAI pre-links initiation to snd_soc_instantiate_card
ASoC: Intel: cht_bsw_rt5672: fix kernel oops with platform_name override
ASoC: Intel: cht_bsw_nau8824: fix kernel oops with platform_name override
ASoC: Intel: bytcht_es8316: fix kernel oops with platform_name override
ASoC: Intel: cht_bsw_max98090: fix kernel oops with platform_name override
ASoC: sun4i-i2s: Add offset to RX channel select
ASoC: sun4i-i2s: Fix sun8i tx channel offset mask
ASoC: max98090: remove 24-bit format support if RJ is 0
ASoC: da7219: Fix build error without CONFIG_I2C
ASoC: SOF: Intel: hda: Fix COMPILE_TEST build error
ASoC: SOF: fix DSP oops definitions in FW ABI
...
Andrey Ryabinin [Fri, 14 Jun 2019 14:31:49 +0000 (17:31 +0300)]
x86/kasan: Fix boot with 5-level paging and KASAN
Since commit
d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on
5-level paging") kernel doesn't boot with KASAN on 5-level paging machines.
The bug is actually in early_p4d_offset() and introduced by commit
12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")
early_p4d_offset() tries to convert pgd_val(*pgd) value to a physical
address. This doesn't make sense because pgd_val() already contains the
physical address.
It did work prior to commit
d52888aa2753 because the result of
"__pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK" was the same as "pgd_val(*pgd)
& PTE_PFN_MASK". __pa_nodebug() just set some high bits which were masked
out by applying PTE_PFN_MASK.
After the change of the PAGE_OFFSET offset in commit
d52888aa2753
__pa_nodebug(pgd_val(*pgd)) started to return a value with more high bits
set and PTE_PFN_MASK wasn't enough to mask out all of them. So it returns a
wrong not even canonical address and crashes on the attempt to dereference
it.
Switch back to pgd_val() & PTE_PFN_MASK to cure the issue.
Fixes: 12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Cc: stable@vger.kernel.org
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190614143149.2227-1-aryabinin@virtuozzo.com
Thomas Gleixner [Thu, 13 Jun 2019 19:40:45 +0000 (21:40 +0200)]
timekeeping: Repair ktime_get_coarse*() granularity
Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.
The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.
Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse tinme.
Fixes: b9ff604cff11 ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
Daniel Vetter [Thu, 13 Jun 2019 20:44:21 +0000 (22:44 +0200)]
Merge tag 'drm-misc-fixes-2019-06-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Sean writes:
meson: A few G12A fixes across the driver (Neil)
quirks: A couple quirks for GPD devices (Hans)
gem_shmem: Use writecombine when vmapping non-dmabuf BOs (Boris)
panfrost: A couple tweaks to requiring devfreq (Neil & Ezequiel)
edid: Ensure we return the override mode when ddc probe fails (Jani)
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Ezequiel Garcia <ezequiel@collabora.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20190613143946.GA24233@art_vandelay
Hui Wang [Fri, 14 Jun 2019 08:44:12 +0000 (16:44 +0800)]
Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"
This reverts commit
9cb40eb184c4220d244a532bd940c6345ad9dbd9.
This patch introduces noise and headphone playback issue after
rebooting or suspending/resuming. Let us revert it.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=203831
Fixes: 9cb40eb184c4 ("ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops")
Cc: <stable@vger.kernel.org>
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Dan Williams [Thu, 13 Jun 2019 22:56:33 +0000 (15:56 -0700)]
mm/devm_memremap_pages: fix final page put race
Logan noticed that devm_memremap_pages_release() kills the percpu_ref
drops all the page references that were acquired at init and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap. If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.
As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.
Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback. Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a851304 ("add devm_memremap_pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 13 Jun 2019 22:56:30 +0000 (15:56 -0700)]
PCI/P2PDMA: track pgmap references per resource, not globally
In preparation for fixing a race between devm_memremap_pages_release()
and the final put of a page from the device-page-map, allocate a
percpu-ref per p2pdma resource mapping.
Link: http://lkml.kernel.org/r/155727338646.292046.9922678317501435597.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 13 Jun 2019 22:56:27 +0000 (15:56 -0700)]
lib/genalloc: introduce chunk owners
The p2pdma facility enables a provider to publish a pool of dma
addresses for a consumer to allocate. A genpool is used internally by
p2pdma to collect dma resources, 'chunks', to be handed out to
consumers. Whenever a consumer allocates a resource it needs to pin the
'struct dev_pagemap' instance that backs the chunk selected by
pci_alloc_p2pmem().
Currently that reference is taken globally on the entire provider
device. That sets up a lifetime mismatch whereby the p2pdma core needs
to maintain hacks to make sure the percpu_ref is not released twice.
This lifetime mismatch also stands in the way of a fix to
devm_memremap_pages() whereby devm_memremap_pages_release() must wait for
the percpu_ref ->release() callback to complete before it can proceed to
teardown pages.
So, towards fixing this situation, introduce the ability to store a 'chunk
owner' at gen_pool_add() time, and a facility to retrieve the owner at
gen_pool_{alloc,free}() time. For p2pdma this will be used to store and
recall individual dev_pagemap reference counter instances per-chunk.
Link: http://lkml.kernel.org/r/155727338118.292046.13407378933221579644.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 13 Jun 2019 22:56:24 +0000 (15:56 -0700)]
PCI/P2PDMA: fix the gen_pool_add_virt() failure path
The pci_p2pdma_add_resource() implementation immediately frees the pgmap
if gen_pool_add_virt() fails. However, that means that when @dev
triggers a devres release devm_memremap_pages_release() will crash
trying to access the freed @pgmap.
Use the new devm_memunmap_pages() to manually free the mapping in the
error path.
Link: http://lkml.kernel.org/r/155727337603.292046.13101332703665246702.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 52916982af48 ("PCI/P2PDMA: Support peer-to-peer memory")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 13 Jun 2019 22:56:21 +0000 (15:56 -0700)]
mm/devm_memremap_pages: introduce devm_memunmap_pages
Use the new devm_release_action() facility to allow
devm_memremap_pages_release() to be manually triggered.
Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Thu, 13 Jun 2019 22:56:18 +0000 (15:56 -0700)]
drivers/base/devres: introduce devm_release_action()
Patch series "mm/devm_memremap_pages: Fix page release race", v2.
Logan audited the devm_memremap_pages() shutdown path and noticed that
it was possible to proceed to arch_remove_memory() before all potential
page references have been reaped.
Introduce a new ->cleanup() callback to do the work of waiting for any
straggling page references and then perform the percpu_ref_exit() in
devm_memremap_pages_release() context.
For p2pdma this involves some deeper reworks to reference count
resources on a per-instance basis rather than a per pci-device basis. A
modified genalloc api is introduced to convey a driver-private pointer
through gen_pool_{alloc,free}() interfaces. Also, a
devm_memunmap_pages() api is introduced since p2pdma does not
auto-release resources on a setup failure.
The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
changes should now pass testing with the pci_p2pdma_release() fix.
Jrme, how does this look for HMM?
This patch (of 6):
The devm_add_action() facility allows a resource allocation routine to
add custom devm semantics. One such user is devm_memremap_pages().
There is now a need to manually trigger
devm_memremap_pages_release(). Introduce devm_release_action() so the
release action can be triggered via a new devm_memunmap_pages() api in a
follow-on change.
Link: http://lkml.kernel.org/r/155727336530.292046.2926860263201336366.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 13 Jun 2019 22:56:15 +0000 (15:56 -0700)]
mm/vmscan.c: fix trying to reclaim unevictable LRU page
There was the below bug report from Wu Fangsuo.
On the CMA allocation path, isolate_migratepages_range() could isolate
unevictable LRU pages and reclaim_clean_page_from_list() can try to
reclaim them if they are clean file-backed pages.
page:
ffffffbf02f33b40 count:86 mapcount:84 mapping:
ffffffc08fa7a810 index:0x24
flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
raw:
000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
raw:
ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
page->mem_cgroup:
ffffffc09bf3ee80
------------[ cut here ]------------
kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S 4.14.81 #3
Hardware name: ASR AQUILAC EVB (DT)
task:
ffffffc00a54cd00 task.stack:
ffffffc009b00000
PC is at shrink_page_list+0x1998/0x3240
LR is at shrink_page_list+0x1998/0x3240
pc : [<
ffffff90083a2158>] lr : [<
ffffff90083a2158>] pstate:
60400045
sp :
ffffffc009b05940
..
shrink_page_list+0x1998/0x3240
reclaim_clean_pages_from_list+0x3c0/0x4f0
alloc_contig_range+0x3bc/0x650
cma_alloc+0x214/0x668
ion_cma_allocate+0x98/0x1d8
ion_alloc+0x200/0x7e0
ion_ioctl+0x18c/0x378
do_vfs_ioctl+0x17c/0x1780
SyS_ioctl+0xac/0xc0
Wu found it's due to commit
ad6b67041a45 ("mm: remove SWAP_MLOCK in
ttu"). Before that, unevictable pages go to cull_mlocked so that we
can't reach the VM_BUG_ON_PAGE line.
To fix the issue, this patch filters out unevictable LRU pages from the
reclaim_clean_pages_from_list in CMA.
Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
Fixes: ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
Cc: <stable@vger.kernel.org> [4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Thu, 13 Jun 2019 22:56:11 +0000 (15:56 -0700)]
coredump: fix race condition between collapse_huge_page() and core dumping
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in
04f5866e41fb70 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.
If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.
khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.
collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().
Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().
So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump... as long as the coredump
can invoke page faults without holding the mmap_sem for reading.
This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().
Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47d8 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swkhack [Thu, 13 Jun 2019 22:56:08 +0000 (15:56 -0700)]
mm/mlock.c: change count_mm_mlocked_page_nr return type
On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may be
negative when using 32 bit ints and the "count >> PAGE_SHIFT"'s result
will be wrong. So change the local variable and return value to
unsigned long to fix the problem.
Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
Fixes: 0cf2f6f6dc60 ("mm: mlock: check against vma for actual mlock() size")
Signed-off-by: swkhack <swkhack@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Thu, 13 Jun 2019 22:56:05 +0000 (15:56 -0700)]
mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for
huge page by telling what level of page table is changed.
__tlb_reset_range() is used to reset all these page table state to
unchanged, which is called by TLB flush for parallel mapping changes for
the same range under non-exclusive lock (i.e. read mmap_sem).
Before commit
dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
PTEs in parallel don't remove page tables. But, the forementioned
commit may do munmap() under read mmap_sem and free page tables. This
may result in program hang on aarch64 reported by Jan Stancek. The
problem could be reproduced by his test program with slightly modified
below.
---8<---
static int map_size = 4096;
static int num_iter = 500;
static long threads_total;
static void *distant_area;
void *map_write_unmap(void *ptr)
{
int *fd = ptr;
unsigned char *map_address;
int i, j = 0;
for (i = 0; i < num_iter; i++) {
map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (map_address == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (j = 0; j < map_size; j++)
map_address[j] = 'b';
if (munmap(map_address, map_size) == -1) {
perror("munmap");
exit(1);
}
}
return NULL;
}
void *dummy(void *ptr)
{
return NULL;
}
int main(void)
{
pthread_t thid[2];
/* hint for mmap in map_write_unmap() */
distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
distant_area += DISTANT_MMAP_SIZE / 2;
while (1) {
pthread_create(&thid[0], NULL, map_write_unmap, NULL);
pthread_create(&thid[1], NULL, dummy, NULL);
pthread_join(thid[0], NULL);
pthread_join(thid[1], NULL);
}
}
---8<---
The program may bring in parallel execution like below:
t1 t2
munmap(map_address)
downgrade_write(&mm->mmap_sem);
unmap_region()
tlb_gather_mmu()
inc_tlb_flush_pending(tlb->mm);
free_pgtables()
tlb->freed_tables = 1
tlb->cleared_pmds = 1
pthread_exit()
madvise(thread_stack, 8M, MADV_DONTNEED)
zap_page_range()
tlb_gather_mmu()
inc_tlb_flush_pending(tlb->mm);
tlb_finish_mmu()
if (mm_tlb_flush_nested(tlb->mm))
__tlb_reset_range()
__tlb_reset_range() would reset freed_tables and cleared_* bits, but this
may cause inconsistency for munmap() which do free page tables. Then it
may result in some architectures, e.g. aarch64, may not flush TLB
completely as expected to have stale TLB entries remained.
Use fullmm flush since it yields much better performance on aarch64 and
non-fullmm doesn't yields significant difference on x86.
The original proposed fix came from Jan Stancek who mainly debugged this
issue, I just wrapped up everything together.
Jan's testing results:
v5.2-rc2-24-gbec7550cca10
--------------------------
mean stddev
real 37.382 2.780
user 1.420 0.078
sys 54.658 1.855
v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
---------------------------------------------------------------------------------------_
mean stddev
real 37.119 2.105
user 1.548 0.087
sys 55.698 1.357
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wengang Wang [Thu, 13 Jun 2019 22:56:01 +0000 (15:56 -0700)]
fs/ocfs2: fix race in ocfs2_dentry_attach_lock()
ocfs2_dentry_attach_lock() can be executed in parallel threads against the
same dentry. Make that race safe. The race is like this:
thread A thread B
(A1) enter ocfs2_dentry_attach_lock,
seeing dentry->d_fsdata is NULL,
and no alias found by
ocfs2_find_local_alias, so kmalloc
a new ocfs2_dentry_lock structure
to local variable "dl", dl1
.....
(B1) enter ocfs2_dentry_attach_lock,
seeing dentry->d_fsdata is NULL,
and no alias found by
ocfs2_find_local_alias so kmalloc
a new ocfs2_dentry_lock structure
to local variable "dl", dl2.
......
(A2) set dentry->d_fsdata with dl1,
call ocfs2_dentry_lock() and increase
dl1->dl_lockres.l_ro_holders to 1 on
success.
......
(B2) set dentry->d_fsdata with dl2
call ocfs2_dentry_lock() and increase
dl2->dl_lockres.l_ro_holders to 1 on
success.
......
(A3) call ocfs2_dentry_unlock()
and decrease
dl2->dl_lockres.l_ro_holders to 0
on success.
....
(B3) call ocfs2_dentry_unlock(),
decreasing
dl2->dl_lockres.l_ro_holders, but
see it's zero now, panic
Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reported-by: Daniel Sobe <daniel.sobe@nxp.com>
Tested-by: Daniel Sobe <daniel.sobe@nxp.com>
Reviewed-by: Changwei Ge <gechangwei@live.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill Tkhai [Thu, 13 Jun 2019 22:55:58 +0000 (15:55 -0700)]
mm/vmscan.c: fix recent_rotated history
Johannes pointed out that after commit
886cf1901db9 ("mm: move
recent_rotated pages calculation to shrink_inactive_list()") we lost all
zone_reclaim_stat::recent_rotated history.
This fixes it.
Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
Fixes: 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Potyra, Stefan [Thu, 13 Jun 2019 22:55:55 +0000 (15:55 -0700)]
mm/mlock.c: mlockall error for flag MCL_ONFAULT
If mlockall() is called with only MCL_ONFAULT as flag, it removes any
previously applied lockings and does nothing else.
This behavior is counter-intuitive and doesn't match the Linux man page.
For mlockall():
EINVAL Unknown flags were specified or MCL_ONFAULT was specified
without either MCL_FUTURE or MCL_CURRENT.
Consequently, return the error EINVAL, if only MCL_ONFAULT is passed.
That way, applications will at least detect that they are calling
mlockall() incorrectly.
Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manuel Traut [Thu, 13 Jun 2019 22:55:52 +0000 (15:55 -0700)]
scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE
At least for ARM64 kernels compiled with the crosstoolchain from
Debian/stretch or with the toolchain from kernel.org the line number is
not decoded correctly by 'decode_stacktrace.sh':
$ echo "[ 136.513051] f1+0x0/0xc [kcrash]" | \
CROSS_COMPILE=/opt/gcc-8.1.0-nolibc/aarch64-linux/bin/aarch64-linux- \
./scripts/decode_stacktrace.sh /scratch/linux-arm64/vmlinux \
/scratch/linux-arm64 \
/nfs/debian/lib/modules/4.20.0-devel
[ 136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:68) kcrash
If addr2line from the toolchain is used the decoded line number is correct:
[ 136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:57) kcrash
Link: http://lkml.kernel.org/r/20190527083425.3763-1-manut@linutronix.de
Signed-off-by: Manuel Traut <manut@linutronix.de>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Thu, 13 Jun 2019 22:55:49 +0000 (15:55 -0700)]
mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
Syzbot reported following memory leak:
ffffffffda RBX:
0000000000000003 RCX:
0000000000441f79
BUG: memory leak
unreferenced object 0xffff888114f26040 (size 32):
comm "syz-executor626", pid 7056, jiffies
4294948701 (age 39.410s)
hex dump (first 32 bytes):
40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff @`......@`......
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
slab_post_alloc_hook mm/slab.h:439 [inline]
slab_alloc mm/slab.c:3326 [inline]
kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
kmalloc include/linux/slab.h:547 [inline]
__memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
memcg_init_list_lru_node mm/list_lru.c:375 [inline]
memcg_init_list_lru mm/list_lru.c:459 [inline]
__list_lru_init+0x193/0x2a0 mm/list_lru.c:626
alloc_super+0x2e0/0x310 fs/super.c:269
sget_userns+0x94/0x2a0 fs/super.c:609
sget+0x8d/0xb0 fs/super.c:660
mount_nodev+0x31/0xb0 fs/super.c:1387
fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
legacy_get_tree+0x27/0x80 fs/fs_context.c:661
vfs_get_tree+0x2e/0x120 fs/super.c:1476
do_new_mount fs/namespace.c:2790 [inline]
do_mount+0x932/0xc50 fs/namespace.c:3110
ksys_mount+0xab/0x120 fs/namespace.c:3319
__do_sys_mount fs/namespace.c:3333 [inline]
__se_sys_mount fs/namespace.c:3330 [inline]
__x64_sys_mount+0x26/0x30 fs/namespace.c:3330
do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is a simple off by one bug on the error path.
Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Thu, 13 Jun 2019 22:55:46 +0000 (15:55 -0700)]
mm: memcontrol: don't batch updates of local VM stats and events
The kernel test robot noticed a 26% will-it-scale pagefault regression
from commit
42a300353577 ("mm: memcontrol: fix recursive statistics
correctness & scalabilty"). This appears to be caused by bouncing the
additional cachelines from the new hierarchical statistics counters.
We can fix this by getting rid of the batched local counters instead.
Originally, there were *only* group-local counters, and they were fully
maintained per cpu. A reader of a stats file high up in the cgroup tree
would have to walk the entire subtree and collect each level's per-cpu
counters to get the recursive view. This was prohibitively expensive,
and so we switched to per-cpu batched updates of the local counters
during
a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), reducing the complexity from nr_subgroups *
nr_cpus to nr_subgroups.
With growing machines and cgroup trees, the tree walk itself became too
expensive for monitoring top-level groups, and this is when the culprit
patch added hierarchy counters on each cgroup level. When the per-cpu
batch size would be reached, both the local and the hierarchy counters
would get batch-updated from the per-cpu delta simultaneously.
This makes local and hierarchical counter reads blazingly fast, but it
unfortunately makes the write-side too cache line intense.
Since local counter reads were never a problem - we only centralized
them to accelerate the hierarchy walk - and use of the local counters
are becoming rarer due to replacement with hierarchical views (ongoing
rework in the page reclaim and workingset code), we can make those local
counters unbatched per-cpu counters again.
The scheme will then be as such:
when a memcg statistic changes, the writer will:
- update the local counter (per-cpu)
- update the batch counter (per-cpu). If the batch is full:
- spill the batch into the group's atomic_t
- spill the batch into all ancestors' atomic_ts
- empty out the batch counter (per-cpu)
when a local memcg counter is read, the reader will:
- collect the local counter from all cpus
when a hiearchy memcg counter is read, the reader will:
- read the atomic_t
We might be able to simplify this further and make the recursive
counters unbatched per-cpu counters as well (batch upward propagation,
but leave per-cpu collection to the readers), but that will require a
more in-depth analysis and testing of all the callsites. Deal with the
immediate regression for now.
Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig [Tue, 4 Jun 2019 17:54:12 +0000 (19:54 +0200)]
x86/fpu: Don't use current->mm to check for a kthread
current->mm can be non-NULL if a kthread calls use_mm(). Check for
PF_KTHREAD instead to decide when to store user mode FP state.
Fixes: 2722146eb784 ("x86/fpu: Remove fpu->initialized")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190604175411.GA27477@lst.de
Thomas Gleixner [Thu, 13 Jun 2019 16:34:54 +0000 (18:34 +0200)]
Merge tag 'timers-v5.2-rc1' of https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent
Pull timer fixes from Daniel Lezcano:
- Fix missing notrace leading to deadlock on arch_arm_timer (Julien Thierry)
- Fix compilation warning on timer-ti-dm (Philippe Mazenauer)
Linus Torvalds [Thu, 13 Jun 2019 15:59:05 +0000 (05:59 -1000)]
Merge branch 'for-linus' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- regression fixes (reverts) for module loading changes that turned out
to be incompatible with some userspace, from Benjamin Tissoires
- regression fix for special Logitech unifiying receiver 0xc52f, from
Hans de Goede
- a few device ID additions to logitech driver, from Hans de Goede
- fix for Bluetooth support on 2nd-gen Wacom Intuos Pro, from Jason
Gerecke
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: logitech-dj: Fix 064d:c52f receiver support
Revert "HID: core: Call request_module before doing device_add"
Revert "HID: core: Do not call request_module() in async context"
Revert "HID: Increase maximum report size allowed by hid_field_extract()"
HID: a4tech: fix horizontal scrolling
HID: hyperv: Add a module description line
HID: logitech-hidpp: Add support for the S510 remote control
HID: multitouch: handle faulty Elo touch device
HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary
HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth
HID: wacom: Send BTN_TOUCH in response to INTUOSP2_BT eraser contact
HID: wacom: Don't report anything prior to the tool entering range
HID: wacom: Don't set tool type until we're in range
HID: rmi: Use SET_REPORT request on control endpoint for Acer Switch 3 and 5
HID: logitech-hidpp: add support for the MX5500 keyboard
HID: logitech-dj: add support for the Logitech MX5500's Bluetooth Mini-Receiver
HID: i2c-hid: add iBall Aer3 to descriptor override
Takashi Iwai [Thu, 13 Jun 2019 15:33:34 +0000 (17:33 +0200)]
Merge tag 'asoc-fix-v5.2-rc4' of https://git./linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v5.2
There's an awful lot of fixes here, almost all for the newly introduced
SoF DSP drivers (including a few things it turned up in shared code).
This is a large and complex piece of code so it's not surprising that
there have been quite a few issues here, fortunately things seem to have
mostly calmed down now. Otherwise there's just a smattering of small fixes.
Daniel Vetter [Thu, 13 Jun 2019 09:49:15 +0000 (11:49 +0200)]
Merge tag 'drm-intel-fixes-2019-06-13' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
drm/i915 fixes for v5.2-rc5:
- Fix DMC firmware input validation to avoid buffer overflow
- Fix perf register access whitelist for userspace
- Fix DSI panel on GPD MicroPC
- Fix per-pixel alpha with CCS
- Fix HDMI audio for SDVO
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87y325x22w.fsf@intel.com
Geert Uytterhoeven [Thu, 13 Jun 2019 07:30:06 +0000 (09:30 +0200)]
block/ps3vram: Use %llu to format sector_t after LBDAF removal
The removal of CONFIG_LBDAF changed the type of sector_t from "unsigned
long" to "u64" aka "unsigned long long" on 64-bit platforms, leading to
a compiler warning regression:
drivers/block/ps3vram.c: In function ‘ps3vram_probe’:
drivers/block/ps3vram.c:770:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘sector_t {aka long long unsigned int}’ [-Wformat=]
Fix this by using "%llu" instead.
Fixes: 72deb455b5ec619f ("block: remove CONFIG_LBDAF")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans de Goede [Tue, 11 Jun 2019 14:32:59 +0000 (16:32 +0200)]
libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
We've received a bugreport that using LPM with ST1000LM024 drives leads
to system lockups. So it seems that these models are buggy in more then
1 way. Add NOLPM quirk to the existing quirks entry for BROKEN_FPDMA_AA.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1571330
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sun, 9 Jun 2019 22:13:35 +0000 (06:13 +0800)]
bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
When people set a writeback percent via sysfs file,
/sys/block/bcache<N>/bcache/writeback_percent
current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
and schedules kworker dc->writeback_rate_update.
If there is no cache set attached to, the writeback kernel thread is
not running indeed, running dc->writeback_rate_update does not make
sense and may cause NULL pointer deference when reference cache set
pointer inside update_writeback_rate().
This patch checks whether the cache set point (dc->disk.c) is NULL in
sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
means the cache device is attached to a cache set).
This problem might be introduced from initial bcache commit, but
commit
3fd47bfe55b0 ("bcache: stop dc->writeback_rate_update properly")
changes part of the original code piece, so I add 'Fixes:
3fd47bfe55b0'
to indicate from which commit this patch can be applied.
Fixes: 3fd47bfe55b0 ("bcache: stop dc->writeback_rate_update properly")
Reported-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Sun, 9 Jun 2019 22:13:34 +0000 (06:13 +0800)]
bcache: fix stack corruption by PRECEDING_KEY()
Recently people report bcache code compiled with gcc9 is broken, one of
the buggy behavior I observe is that two adjacent 4KB I/Os should merge
into one but they don't. Finally it turns out to be a stack corruption
caused by macro PRECEDING_KEY().
See how PRECEDING_KEY() is defined in bset.h,
437 #define PRECEDING_KEY(_k) \
438 ({ \
439 struct bkey *_ret = NULL; \
440 \
441 if (KEY_INODE(_k) || KEY_OFFSET(_k)) { \
442 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0); \
443 \
444 if (!_ret->low) \
445 _ret->high--; \
446 _ret->low--; \
447 } \
448 \
449 _ret; \
450 })
At line 442, _ret points to address of a on-stack variable combined by
KEY(), the life range of this on-stack variable is in line 442-446,
once _ret is returned to bch_btree_insert_key(), the returned address
points to an invalid stack address and this address is overwritten in
the following called bch_btree_iter_init(). Then argument 'search' of
bch_btree_iter_init() points to some address inside stackframe of
bch_btree_iter_init(), exact address depends on how the compiler
allocates stack space. Now the stack is corrupted.
Fixes: 0eacac22034c ("bcache: PRECEDING_KEY()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reviewed-by: Pierre JUHEN <pierre.juhen@orange.fr>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Tested-by: Pierre JUHEN <pierre.juhen@orange.fr>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dave Martin [Wed, 12 Jun 2019 16:00:32 +0000 (17:00 +0100)]
arm64/sve: Fix missing SVE/FPSIMD endianness conversions
The in-memory representation of SVE and FPSIMD registers is
different: the FPSIMD V-registers are stored as single 128-bit
host-endian values, whereas SVE registers are stored in an
endianness-invariant byte order.
This means that the two representations differ when running on a
big-endian host. But we blindly copy data from one representation
to another when converting between the two, resulting in the
register contents being unintentionally byteswapped in certain
situations. Currently this can be triggered by the first SVE
instruction after a syscall, for example (though the potential
trigger points may vary in future).
So, fix the conversion functions fpsimd_to_sve(), sve_to_fpsimd()
and sve_sync_from_fpsimd_zeropad() to swab where appropriate.
There is no common swahl128() or swab128() that we could use here.
Maybe it would be worth making this generic, but for now add a
simple local hack.
Since the byte order differences are exposed in ABI, also clarify
the documentation.
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Alan Hayward <alan.hayward@arm.com>
Cc: Julien Grall <julien.grall@arm.com>
Fixes: bc0ee4760364 ("arm64/sve: Core task context handling")
Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support")
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
[will: Fix typos in comments and docs spotted by Julien]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Ming Lei [Tue, 11 Jun 2019 09:31:53 +0000 (17:31 +0800)]
blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests
blk_mq_sched_free_requests() may be called in failure path in which
q->elevator may not be setup yet, so remove WARN_ON(!q->elevator) from
blk_mq_sched_free_requests for avoiding the false positive.
This function is actually safe to call in case of !q->elevator because
hctx->sched_tags is checked.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yi Zhang <yi.zhang@redhat.com>
Fixes: c3e2219216c9 ("block: free sched's request pool in blk_cleanup_queue")
Reported-by: syzbot+b9d0d56867048c7bcfde@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Herrmann [Wed, 12 Jun 2019 06:17:32 +0000 (08:17 +0200)]
blkio-controller.txt: Remove references to CFQ
CFQ is gone. No need anymore to document its "proportional weight time
based division of disk policy".
Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andreas Herrmann [Wed, 12 Jun 2019 06:50:09 +0000 (08:50 +0200)]
block/switching-sched.txt: Update to blk-mq schedulers
Remove references to CFQ and legacy block layer which are gone.
Update example with what's available under blk-mq.
Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chaitanya Kulkarni [Tue, 11 Jun 2019 22:10:17 +0000 (15:10 -0700)]
null_blk: remove duplicate check for report zone
This patch removes the check in the null_blk_zoned for report zone
command, where it checks for the dev-,>zoned before executing the report
zone.
The null_zone_report() function is a block_device operation callback
which is initialized in the null_blk_main.c and gets called as a part
of blkdev for report zone IOCTL (BLKREPORTZONE).
blkdev_ioctl()
blkdev_report_zones_ioctl()
blkdev_report_zones()
blk_report_zones()
disk->fops->report_zones()
nullb_zone_report();
The null_zone_report() will never get executed on the non-zoned block
device, in the non zoned block device blk_queue_is_zoned() will always
be false which is first check the blkdev_report_zones_ioctl()
before actual low level driver report zone callback is executed.
Here is the detailed scenario:-
1. modprobe null_blk
null_init
null_alloc_dev
dev->zoned = 0
null_add_dev
dev->zoned == 0
so we don't set the q->limits.zoned = BLK_ZONED_HR
2. blkzone report /dev/nullb0
blkdev_ioctl()
blkdev_report_zones_ioctl()
blk_queue_is_zoned()
blk_queue_is_zoned
q->limits.zoned == 0
return false
if (!blk_queue_is_zoned(q)) <--- true
return -ENOTTY;
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Greg Kroah-Hartman [Wed, 12 Jun 2019 12:30:19 +0000 (14:30 +0200)]
blk-mq: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
When all of these checks are cleaned up, lots of the functions used in
the blk-mq-debugfs code can now return void, as no need to check the
return value of them either.
Overall, this ends up cleaning up the code and making it smaller, always
a nice win.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Wed, 12 Jun 2019 21:58:43 +0000 (14:58 -0700)]
io_uring: fix memory leak of UNIX domain socket inode
Opening and closing an io_uring instance leaks a UNIX domain socket
inode. This is because the ->file of the io_uring instance's internal
UNIX domain socket is set to point to the io_uring file, but then
sock_release() sees the non-NULL ->file and assumes the inode reference
is held by the file so doesn't call iput(). That's not the case here,
since the reference is still meant to be held by the socket; the actual
inode of the io_uring file is different.
Fix this leak by NULL-ing out ->file before releasing the socket.
Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com
Fixes: 2b188cc1bb85 ("Add io_uring IO interface")
Cc: <stable@vger.kernel.org> # v5.1+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 4 Jun 2019 07:23:40 +0000 (16:23 +0900)]
block: force select mq-deadline for zoned block devices
In most use cases of zoned block devices (aka SMR disks), the
mq-deadline scheduler is mandatory as it implements sequential write
command processing guarantees with zone write locking. So make sure that
this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected.
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Thu, 13 Jun 2019 02:10:57 +0000 (16:10 -1000)]
Merge tag 'selinux-pr-
20190612' of git://git./linux/kernel/git/pcmoore/selinux
Pull selinux fixes from Paul Moore:
"Three patches for v5.2.
One fixes a problem where we weren't correctly logging raw SELinux
labels, the other two fix problems where we weren't properly checking
calls to kmemdup()"
* tag 'selinux-pr-
20190612' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()
selinux: fix a missing-check bug in selinux_add_mnt_opt( )
selinux: log raw contexts as untrusted strings
Alex Deucher [Tue, 11 Jun 2019 14:45:51 +0000 (09:45 -0500)]
drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
Fixes SI cards running on amdgpu.
Fixes: 1929059893022 ("drm/amd/amdgpu: add RLC firmware to support raven1 refresh")
Bug: https://bugs.freedesktop.org/show_bug.cgi?id=110883
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Dan Carpenter [Sat, 8 Jun 2019 09:23:57 +0000 (12:23 +0300)]
drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
The "block" variable can be set by the user through debugfs, so it can
be quite large which leads to shift wrapping here. This means we report
a "block" as supported when it's not, and that leads to array overflows
later on.
This bug is not really a security issue in real life, because debugfs is
generally root only.
Fixes: 36ea1bd2d084 ("drm/amdgpu: add debugfs ctrl node")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Joel Savitz [Wed, 12 Jun 2019 15:50:48 +0000 (11:50 -0400)]
cpuset: restore sanity to cpuset_cpus_allowed_fallback()
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.
This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.
However, this is not sane behavior.
I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:
# grep -i cpu /proc/$$/status
Cpus_allowed:
ffffffff,
fffffff
Cpus_allowed_list: 0-63
(Where cpus 32-63 are provided via smt.)
If we limit our current shell process to cpu2 only and then offline it
and reonline it:
# taskset -p 4 $$
pid 2272's current affinity mask:
ffffffffffffffff
pid 2272's new affinity mask: 4
# echo off > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -3
[ 2195.866089] process 2272 (bash) no longer affine to cpu2
[ 2195.872700] IRQ 114: no longer affine to CPU2
[ 2195.879128] smpboot: CPU 2 is now offline
# echo on > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -1
[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:
# grep -i cpu /proc/$$/status
Cpus_allowed:
ffffffff,
fffffffb
Cpus_allowed_list: 0-1,3-63
This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!
Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:
# taskset -p
f000000000 $$
# grep -i cpu /proc/$$/status
Cpus_allowed:
000000f0,
00000000
Cpus_allowed_list: 36-39
A double toggle of smt results in the following behavior:
# echo off > /sys/devices/system/cpu/smt/control
# echo on > /sys/devices/system/cpu/smt/control
# grep -i cpus /proc/$$/status
Cpus_allowed:
ffffff00,
ffffffff
Cpus_allowed_list: 0-31,40-63
This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.
With this patch applied, both of these cases end in the following state:
# grep -i cpu /proc/$$/status
Cpus_allowed:
ffffffff,
ffffffff
Cpus_allowed_list: 0-63
The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.
This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.
Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().
Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Matt Mullins [Fri, 31 May 2019 19:47:54 +0000 (12:47 -0700)]
x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
err must be nonzero in order to reach text_poke(), which caused kgdb to
fail to set breakpoints:
(gdb) break __x64_sys_sync
Breakpoint 1 at 0xffffffff81288910: file ../fs/sync.c, line 124.
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0xffffffff81288910
Command aborted.
Fixes: 86a22057127d ("x86/kgdb: Avoid redundant comparison of patched code")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nadav Amit <namit@vmware.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190531194755.6320-1-mmullins@fb.com
Gen Zhang [Wed, 12 Jun 2019 13:55:38 +0000 (21:55 +0800)]
selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()
In selinux_sb_eat_lsm_opts(), 'arg' is allocated by kmemdup_nul(). It
returns NULL when fails. So 'arg' should be checked. And 'mnt_opts'
should be freed when error.
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Fixes: 99dbbb593fe6 ("selinux: rewrite selinux_sb_eat_lsm_opts()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Daniel Vetter [Wed, 12 Jun 2019 16:22:28 +0000 (18:22 +0200)]
Merge branch 'mediatek-drm-fixes-5.2' of https://github.com/ckhu-mediatek/linux.git-tags into drm-fixes
CK writes:
This include unbind error fix, clock control flow refinement, and PRIME
mmap with page offset.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1560325868.3259.6.camel@mtksdaap41
Linus Torvalds [Wed, 12 Jun 2019 15:57:05 +0000 (05:57 -1000)]
Merge tag 'media/v5.2-2' of git://git./linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- a debug warning for satellite tuning at dvb core was producing too
much noise
- a regression at hfi_parser on Venus driver
* tag 'media/v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: venus: hfi_parser: fix a regression in parser
media: dvb: warning about dvb frequency limits produces too much noise
Gen Zhang [Wed, 12 Jun 2019 13:28:21 +0000 (21:28 +0800)]
selinux: fix a missing-check bug in selinux_add_mnt_opt( )
In selinux_add_mnt_opt(), 'val' is allocated by kmemdup_nul(). It returns
NULL when fails. So 'val' should be checked. And 'mnt_opts' should be
freed when error.
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Fixes: 757cbe597fe8 ("LSM: new method: ->sb_add_mnt_opt()")
Cc: <stable@vger.kernel.org>
[PM: fixed some indenting problems]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Will Deacon [Tue, 11 Jun 2019 11:47:34 +0000 (12:47 +0100)]
arm64: tlbflush: Ensure start/end of address range are aligned to stride
Since commit
3d65b6bbc01e ("arm64: tlbi: Set MAX_TLBI_OPS to
PTRS_PER_PTE"), we resort to per-ASID invalidation when attempting to
perform more than PTRS_PER_PTE invalidation instructions in a single
call to __flush_tlb_range(). Whilst this is beneficial, the mmu_gather
code does not ensure that the end address of the range is rounded-up
to the stride when freeing intermediate page tables in pXX_free_tlb(),
which defeats our range checking.
Align the bounds passed into __flush_tlb_range().
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Heikki Krogerus [Wed, 12 Jun 2019 14:15:40 +0000 (17:15 +0300)]
usb: typec: Make sure an alt mode exist before getting its partner
Adding check to typec_altmode_get_partner() to prevent
potential NULL pointer dereference.
Reported-by: Vladimir Yerilov <openmindead@gmail.com>
Fixes: ad74b8649bea ("usb: typec: ucsi: Preliminary support for alternate modes")
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Julien Thierry [Fri, 24 May 2019 09:10:25 +0000 (10:10 +0100)]
clocksource/drivers/arm_arch_timer: Don't trace count reader functions
With v5.2-rc1, The ftrace functions_graph tracer locks up whenever it is
enabled on arm64.
Since commit
0ea415390cd3 ("clocksource/arm_arch_timer: Use
arch_timer_read_counter to access stable counters") a function pointer
is consistently used to read the counter instead of potentially
referencing an inlinable function.
The graph tracers relies on accessing the timer counters to compute the
time spent in functions which causes the lockup when attempting to trace
these code paths.
Annotate the arm arch timer counter accessors as notrace.
Fixes: 0ea415390cd3 ("clocksource/arm_arch_timer: Use
arch_timer_read_counter to access stable counters")
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Linus Walleij [Thu, 30 May 2019 20:24:24 +0000 (22:24 +0200)]
i2c: pca-platform: Fix GPIO lookup code
The devm_gpiod_request_gpiod() call will add "-gpios" to
any passed connection ID before looking it up.
I do not think the reset GPIO on this platform is named
"reset-gpios-gpios" but rather "reset-gpios" in the device
tree, so fix this up so that we get a proper reset GPIO
handle.
Also drop the inclusion of the legacy GPIO header.
Fixes: 0e8ce93bdceb ("i2c: pca-platform: add devicetree awareness")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>