Nikolay Borisov [Wed, 27 Mar 2019 12:24:16 +0000 (14:24 +0200)]
btrfs: Optimize unallocated chunks discard
Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.
Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.
This applies only within a single mount of the filesystem, as the
information is not stored permanently.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
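Editorial sketch of the idea described above, not the kernel code: the toy device below is split into a fixed number of regions, and the names toy_device, REGION_ALLOCATED and REGION_TRIMMED are hypothetical stand-ins for the alloc_state tree and its CHUNK_* bits. It only illustrates that a trim skips space already trimmed and that allocation clears the trimmed mark.

```c
#include <assert.h>
#include <stddef.h>

#define NR_REGIONS       8      /* hypothetical: device split into 8 regions */
#define REGION_ALLOCATED 0x1u   /* stand-in for the "allocated" state bit */
#define REGION_TRIMMED   0x2u   /* stand-in for CHUNK_TRIMMED */

struct toy_device {
	unsigned int state[NR_REGIONS];
};

/* Allocating a region invalidates any earlier trim of that space. */
static void toy_alloc(struct toy_device *dev, size_t idx)
{
	assert(idx < NR_REGIONS);
	dev->state[idx] |= REGION_ALLOCATED;
	dev->state[idx] &= ~REGION_TRIMMED;
}

/* Freeing returns the region to the unallocated pool, still untrimmed. */
static void toy_free(struct toy_device *dev, size_t idx)
{
	assert(idx < NR_REGIONS);
	dev->state[idx] &= ~REGION_ALLOCATED;
}

/* Trim unallocated, not-yet-trimmed regions; returns discards "submitted". */
static size_t toy_trim(struct toy_device *dev)
{
	size_t issued = 0;

	for (size_t i = 0; i < NR_REGIONS; i++) {
		if (dev->state[i] & (REGION_ALLOCATED | REGION_TRIMMED))
			continue;       /* in use, or unchanged since last trim */
		dev->state[i] |= REGION_TRIMMED;
		issued++;
	}
	return issued;
}
```

Because the state lives only in memory, a fresh mount would start with no region marked as trimmed, which matches the single-mount limitation noted in the commit message.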
Nikolay Borisov [Wed, 27 Mar 2019 12:24:15 +0000 (14:24 +0200)]
btrfs: Factor out in_range macro
This is used in more than one place, so let's factor it out into ctree.h.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
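Editorial sketch: a range-membership macro of this kind is typically written over a half-open interval. The definition below is an assumption about its shape rather than a quote from ctree.h, and the compile-time checks are illustrative only.

```c
#include <assert.h>

/* True when b lies in the half-open interval [first, first + len). */
#define in_range(b, first, len) \
	((b) >= (first) && (b) < (first) + (len))

/* Compile-time checks of the half-open semantics. */
static_assert(in_range(4096, 4096, 4096), "start of range is inside");
static_assert(!in_range(8192, 4096, 4096), "end of range is outside");
```

Note that the macro evaluates `first` twice, so arguments with side effects should be avoided.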
Nikolay Borisov [Wed, 27 Mar 2019 12:24:14 +0000 (14:24 +0200)]
btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
Now that these functions no longer require a transaction handle to
inspect pending/pinned chunks, the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 27 Mar 2019 12:24:12 +0000 (14:24 +0200)]
btrfs: replace pending/pinned chunks lists with io tree
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.
The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.
The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.
This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 12 Feb 2019 14:13:14 +0000 (16:13 +0200)]
btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree
Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to use-after-free.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:11 +0000 (14:24 +0200)]
btrfs: Stop using call_rcu for device freeing
btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.
This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 27 Mar 2019 12:24:10 +0000 (14:24 +0200)]
btrfs: Implement set_extent_bits_nowait
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:26 +0000 (14:31 +0200)]
btrfs: Introduce new bits for device allocation tree
Rather than hijacking the existing defines, let's just define new bits
with more descriptive names. Instead of using yet more bits for the new
flags (currently at 18), use the fact that those flags will be specific to
the device allocation tree and define them using existing EXTENT_* flags.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:25 +0000 (14:31 +0200)]
btrfs: Populate ->orig_block_len during read_one_chunk
Chunks read from disk currently don't get their ->orig_block_len member
set. In contrast, when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk.
Let's apply the same strategy for chunks which are read from
disk. Not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree), but it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:24 +0000 (14:31 +0200)]
btrfs: Rename and export clear_btree_io_tree
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:23 +0000 (14:31 +0200)]
btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink
During device shrink, pinned/pending chunks (i.e. those which have been
deleted/created respectively in the current transaction and haven't
touched disk) need to be accounted for. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another pass through the body of the function.
Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This
simplifies the code flow, i.e. there is no need to use 'goto again'.
A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:22 +0000 (14:31 +0200)]
btrfs: combine device update operations during transaction commit
We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.
We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 25 Mar 2019 12:31:21 +0000 (14:31 +0200)]
btrfs: Honour FITRIM range constraints during free space trim
Up until now trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example, fstrim's -o/-l arguments
were entirely ignored. Fix it by correctly handling those parameters.
This requires breaking if the found freespace extent is after the end of
the passed range as well as completing trim after trimming
fstrim_range::len bytes.
Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
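Editorial sketch of honouring an offset/length window while walking free space. This is not the kernel implementation: struct free_extent and trim_in_range() are hypothetical, and the sketch clamps each extent to the window, which also guarantees that no more than the requested length is trimmed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Walk free extents sorted by start and "trim" only the parts inside
 * [range_start, range_start + range_len). Returns the bytes trimmed.
 */
static uint64_t trim_in_range(const struct free_extent *ext, size_t n,
			      uint64_t range_start, uint64_t range_len)
{
	uint64_t range_end, trimmed = 0;

	assert(range_len <= UINT64_MAX - range_start);  /* no overflow */
	range_end = range_start + range_len;

	for (size_t i = 0; i < n; i++) {
		uint64_t start = ext[i].start;
		uint64_t end = ext[i].start + ext[i].len;

		if (start >= range_end)
			break;          /* extent lies after the range: done */
		if (end <= range_start)
			continue;       /* extent lies before the range: skip */
		if (start < range_start)
			start = range_start;
		if (end > range_end)
			end = range_end;
		trimmed += end - start;
	}
	return trimmed;
}
```

The early break corresponds to stopping once a found free space extent starts after the end of the passed range.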
Robbie Ko [Fri, 29 Mar 2019 10:03:27 +0000 (18:03 +0800)]
Btrfs: send, improve clone range
Improve clone_range in two scenarios.
1. Remove the limit on inode size when finding clone inodes. We can do a
partial clone, so there is no need to limit the size of the candidate
inode. When cloning a range, we clone only the legal range, determined by
bytenr, offset, len and inode size.
2. In the rewrite or clone_range scenarios, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When sending, inode 258 file offset 4096~1048576 (item 23) has a chance to
use clone_range, but because data_offset does not match inode 257 (item 16),
the clone is missed and the actual data has to be transferred.
Fix the problem by checking whether the current data_offset
overlaps with the file extent item and, if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
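Editorial sketch of the overlap adjustment described above, under the assumption that both file extent items reference the same on-disk extent. struct extent_ref and clone_overlap() are hypothetical names, and the real send code tracks more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent_ref {
	uint64_t data_offset;   /* offset into the shared on-disk extent */
	uint64_t len;           /* bytes referenced from that offset */
};

/*
 * Compute the part of 'want' that is also covered by 'src'.
 * Returns false when the two references do not overlap at all.
 */
static bool clone_overlap(struct extent_ref src, struct extent_ref want,
			  struct extent_ref *out)
{
	uint64_t src_end = src.data_offset + src.len;
	uint64_t want_end = want.data_offset + want.len;
	uint64_t start = src.data_offset > want.data_offset ?
			 src.data_offset : want.data_offset;
	uint64_t end = src_end < want_end ? src_end : want_end;

	assert(out != NULL);
	if (start >= end)
		return false;
	out->data_offset = start;
	out->len = end - start;
	return true;
}
```

With the layout from the example, item 16 references offsets 0 to 1048576 of the extent and item 23 wants 1044480 bytes starting at offset 4096, so the whole wanted range can be cloned even though the two data_offset values differ.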
Anand Jain [Tue, 2 Apr 2019 10:07:41 +0000 (18:07 +0800)]
btrfs: prop: open code btrfs_set_prop in inherit_prop
When an inode inherits a property from its parent, we call btrfs_set_prop().
btrfs_set_prop() does elaborate checks, which are not required in the
context of inheriting a property. Instead, open-code only the required
items from btrfs_set_prop() and then call btrfs_setxattr() directly. So
now the only user of btrfs_set_prop() is gone (except for the wrapper
function btrfs_set_prop_trans()).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 29 Mar 2019 06:03:17 +0000 (14:03 +0800)]
btrfs: drop unused parameter in mount_subvol
@device_name in mount_subvol() is not used, drop it. Also see:
5bedc48a8f9e ("btrfs: drop unused parameters from mount_subvol").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in check_inode_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in check_dev_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in dev_item_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in chunk_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:58 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in check_leaf
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:22:00 +0000 (16:22 +0100)]
btrfs: tree-checker: get fs_info from eb in check_leaf_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:21:10 +0000 (16:21 +0100)]
btrfs: tree-checker: get fs_info from eb in check_extent_data_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:19:31 +0000 (16:19 +0100)]
btrfs: tree-checker: get fs_info from eb in check_block_group_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:18:57 +0000 (16:18 +0100)]
btrfs: tree-checker: get fs_info from eb in block_group_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:17:46 +0000 (16:17 +0100)]
btrfs: tree-checker: get fs_info from eb in check_dir_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:07:27 +0000 (16:07 +0100)]
btrfs: tree-checker: get fs_info from eb in dir_item_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 15:02:56 +0000 (16:02 +0100)]
btrfs: tree-checker: get fs_info from eb in check_csum_item
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 14:32:46 +0000 (15:32 +0100)]
btrfs: tree-checker: get fs_info from eb in file_extent_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 14:31:28 +0000 (15:31 +0100)]
btrfs: tree-checker: get fs_info from eb in generic_err
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 13 Mar 2019 05:55:11 +0000 (13:55 +0800)]
btrfs: inode: Verify inode mode to avoid NULL pointer dereference
[BUG]
When accessing a file on a crafted image, btrfs can crash in block layer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 136501067 P4D 136501067 PUD 124519067 PMD 0
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252
RIP: 0010:end_bio_extent_readpage+0x144/0x700
Call Trace:
<IRQ>
blk_update_request+0x8f/0x350
blk_mq_end_request+0x1a/0x120
blk_done_softirq+0x99/0xc0
__do_softirq+0xc7/0x467
irq_exit+0xd1/0xe0
call_function_single_interrupt+0xf/0x20
</IRQ>
RIP: 0010:default_idle+0x1e/0x170
[CAUSE]
The crafted image has a tricky corruption: the INODE_ITEM has a
different type than its parent dir item:
item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160
generation 13 transid 13 size 1048576 nbytes 1048576
block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0
sequence 9 flags 0x0(none)
This mode number 0120000 means it's a symlink.
But the dir item thinks it's still a regular file:
item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
For a symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it
empty, as a symlink is only designed to have an inline extent, handled
entirely by tree block read. Thus there is no need to trigger
btrfs_submit_bio_hook() for an inline file extent.
However, end_bio_extent_readpage() expects tree->ops to be populated, as
it's reading a regular data extent. This causes the NULL pointer dereference.
[FIX]
This patch fixes the problem in two ways:
- Verify inode mode against its dir item when looking up inode
So in btrfs_lookup_dentry(), if we find an inode mode mismatch with the dir
item, we error out so that the corrupted inode will not be accessed.
- Verify inode mode when getting extent mapping
Only regular files should have regular or preallocated extents.
If we find a regular/preallocated file extent for a symlink or
any other type, we error out before submitting the read bio.
With this fix the crafted image is rejected gracefully:
BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
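Editorial sketch of the first check: derive a directory-entry file type from the inode mode and compare it with the type recorded in the dir item. The TOY_* constants mirror the usual Linux octal values and the enum mirrors the numbering visible in the error message above (regular file = 1, symlink = 7); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* File type bits of the mode field, with the octal values used on Linux. */
#define TOY_S_IFMT   0170000u
#define TOY_S_IFSOCK 0140000u
#define TOY_S_IFLNK  0120000u
#define TOY_S_IFREG  0100000u
#define TOY_S_IFBLK  0060000u
#define TOY_S_IFDIR  0040000u
#define TOY_S_IFCHR  0020000u
#define TOY_S_IFIFO  0010000u

enum toy_ft {
	FT_UNKNOWN, FT_REG_FILE, FT_DIR, FT_CHRDEV,
	FT_BLKDEV, FT_FIFO, FT_SOCK, FT_SYMLINK,
};

/* The mode from the crafted image decodes as a symlink. */
static_assert((0121644u & TOY_S_IFMT) == TOY_S_IFLNK, "mode is a symlink");

static enum toy_ft mode_to_ft(unsigned int mode)
{
	switch (mode & TOY_S_IFMT) {
	case TOY_S_IFREG:  return FT_REG_FILE;
	case TOY_S_IFDIR:  return FT_DIR;
	case TOY_S_IFCHR:  return FT_CHRDEV;
	case TOY_S_IFBLK:  return FT_BLKDEV;
	case TOY_S_IFIFO:  return FT_FIFO;
	case TOY_S_IFSOCK: return FT_SOCK;
	case TOY_S_IFLNK:  return FT_SYMLINK;
	default:           return FT_UNKNOWN;
	}
}

/* A lookup would be rejected when this returns false. */
static bool mode_matches_dir_type(unsigned int mode, enum toy_ft dir_type)
{
	return mode_to_ft(mode) == dir_type;
}
```

For the image in the report, the inode mode maps to type 7 while the dir item says type 1, so the comparison fails and the lookup can error out.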
Qu Wenruo [Wed, 13 Mar 2019 06:31:35 +0000 (14:31 +0800)]
btrfs: tree-checker: Verify inode item
There is a report in kernel bugzilla about a file type mismatch between
dir item and inode item.
This inspires us to check the inode mode in the inode item.
This patch will check the following members:
- inode key objectid
Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
- inode key offset
Should be 0
- inode item generation
- inode item transid
No newer than sb generation + 1.
The +1 is for log tree.
- inode item mode
No unknown bits.
No invalid S_IF* bit.
NOTE: S_IFMT check is not enough, need to check every known type.
- inode item nlink
A dir should have no more than 1 link.
- inode item flags
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
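Editorial sketch of the mode and nlink checks from the list above. It is a simplified illustration with hypothetical names, not the tree-checker code. It shows why masking with S_IFMT alone is insufficient: a value such as 0130000 has only type bits set, yet matches no known file type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_S_IFMT   0170000u
#define TOY_S_IFSOCK 0140000u
#define TOY_S_IFLNK  0120000u
#define TOY_S_IFREG  0100000u
#define TOY_S_IFBLK  0060000u
#define TOY_S_IFDIR  0040000u
#define TOY_S_IFCHR  0020000u
#define TOY_S_IFIFO  0010000u

/* Check the type field against every known type explicitly. */
static bool toy_known_type(uint32_t mode)
{
	switch (mode & TOY_S_IFMT) {
	case TOY_S_IFSOCK: case TOY_S_IFLNK: case TOY_S_IFREG:
	case TOY_S_IFBLK:  case TOY_S_IFDIR: case TOY_S_IFCHR:
	case TOY_S_IFIFO:
		return true;
	default:
		return false;
	}
}

static bool toy_inode_item_valid(uint32_t mode, uint32_t nlink)
{
	/* No bits outside the type field and the 12 permission bits. */
	if (mode & ~(TOY_S_IFMT | 07777u))
		return false;
	if (!toy_known_type(mode))
		return false;
	/* A directory should have no more than 1 link. */
	if ((mode & TOY_S_IFMT) == TOY_S_IFDIR && nlink > 1)
		return false;
	return true;
}
```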
Qu Wenruo [Wed, 13 Mar 2019 04:17:50 +0000 (12:17 +0800)]
btrfs: tree-checker: Enhance chunk checker to validate chunk profile
Btrfs-progs already has a comprehensive type checker, to ensure there
is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk
profile bits.
Do the same in the kernel.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
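Editorial sketch of the "0 or 1 profile bit" rule. The BG_* values are assumed to follow the btrfs on-disk block group flag layout (RAID0 at bit 3 through RAID6 at bit 8); treat both the values and the helper name as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BG_RAID0  (1ULL << 3)
#define BG_RAID1  (1ULL << 4)
#define BG_DUP    (1ULL << 5)
#define BG_RAID10 (1ULL << 6)
#define BG_RAID5  (1ULL << 7)
#define BG_RAID6  (1ULL << 8)
#define BG_PROFILE_MASK \
	(BG_RAID0 | BG_RAID1 | BG_DUP | BG_RAID10 | BG_RAID5 | BG_RAID6)

static_assert(BG_PROFILE_MASK == 0x1f8ULL, "six contiguous profile bits");

/* Valid when no profile bit (SINGLE) or exactly one profile bit is set. */
static bool chunk_profile_valid(uint64_t type)
{
	uint64_t profile = type & BG_PROFILE_MASK;

	/* x & (x - 1) clears the lowest set bit; zero means at most one bit. */
	return (profile & (profile - 1)) == 0;
}
```

Bits outside the profile mask, such as the data/metadata/system type bits, are ignored by this particular check.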
Qu Wenruo [Fri, 8 Mar 2019 06:20:03 +0000 (14:20 +0800)]
btrfs: tree-checker: Verify dev item
[BUG]
For a fuzzed image whose DEV_ITEM has an invalid total_bytes of 0, the
kernel will just panic:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
#PF error: [normal kernel read fault]
PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
Call Trace:
open_ctree+0x160d/0x2149
btrfs_mount_root+0x5b2/0x680
[CAUSE]
If device extent verification finds a device with 0 total_bytes, it
assumes it's a seed dummy and searches for seed devices.
But in this case there is no seed device at all, causing the NULL pointer
dereference.
[FIX]
Since this is caused by a fuzzed image, let's go the tree-checker way and
just add a new verification for the device item.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 05:42:33 +0000 (13:42 +0800)]
btrfs: tree-checker: Check chunk item at tree block read time
Since we have btrfs_check_chunk_valid() in tree-checker, let's do
chunk item verification in tree-checker too.
Since the tree-checker is run at endio time, if one chunk leaf fails
chunk verification, we can still retry the other copy, making btrfs more
robust against fuzzed images as we may still get a good chunk item.
Also, since chunk verification is now done at tree block read time, skip
the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading
chunk items from a leaf.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
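Editorial sketch of why validating at read time helps: when a copy fails verification, the reader can fall back to another copy. Everything here (toy_chunk, toy_check_chunk(), toy_read_chunk()) is hypothetical, and the real endio path is considerably more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EUCLEAN
#define EUCLEAN 117     /* "Structure needs cleaning" on Linux */
#endif

struct toy_chunk {
	unsigned int num_stripes;
};

/* Minimal validity check: a chunk must have at least one stripe. */
static int toy_check_chunk(const struct toy_chunk *c)
{
	return c->num_stripes == 0 ? -EUCLEAN : 0;
}

/*
 * Try each mirror copy in turn. Returns the index of the first copy that
 * passes verification, or -EUCLEAN when every copy is corrupted.
 */
static int toy_read_chunk(const struct toy_chunk *copies, size_t n)
{
	assert(copies != NULL);
	for (size_t i = 0; i < n; i++) {
		if (toy_check_chunk(&copies[i]) == 0)
			return (int)i;
	}
	return -EUCLEAN;
}
```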
Qu Wenruo [Wed, 20 Mar 2019 05:39:14 +0000 (13:39 +0800)]
btrfs: tree-checker: Make btrfs_check_chunk_valid() return EUCLEAN instead of EIO
To follow the standard behavior of tree-checker.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 05:36:06 +0000 (13:36 +0800)]
btrfs: tree-checker: Make chunk item checker messages more readable
Old error message would be something like:
BTRFS error (device dm-3): invalid chunk num_stripes: 0
New error message would be:
Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0
Or
Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0
For certain error messages, the expected value is also output.
The error message levels are changed from error to critical.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 05:16:42 +0000 (13:16 +0800)]
btrfs: Move btrfs_check_chunk_valid() to tree-checker.[ch] and export it
Functionally, chunk item verification fits better inside
tree-checker.
So move btrfs_check_chunk_valid() to tree-checker.c and export it.
And since it's now moved to tree-checker, also add a better comment for
what this function is doing.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 27 Mar 2019 15:55:26 +0000 (16:55 +0100)]
btrfs: qgroup: remove obsolete fs_info members
The commit fcebe4562dec ("Btrfs: rework qgroup accounting") reworked
qgroups and added some new structures. Another rework of qgroup
mechanics, e69bcee37692 ("btrfs: qgroup: Cleanup the old
ref_node-oriented mechanism."), stopped using them but left them in place.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:58:13 +0000 (14:58 +0100)]
btrfs: get fs_info from eb in btrfs_verify_level_key
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:56:39 +0000 (14:56 +0100)]
btrfs: get fs_info from eb in btree_read_extent_buffer_pages
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:54:01 +0000 (14:54 +0100)]
btrfs: get fs_info from eb in read_node_slot
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:36:46 +0000 (14:36 +0100)]
btrfs: get fs_info from eb in btrfs_leaf_free_space
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:30:02 +0000 (14:30 +0100)]
btrfs: get fs_info from eb in clean_tree_block
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 13:22:04 +0000 (14:22 +0100)]
btrfs: get fs_info from eb in tree_mod_log_eb_copy
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 12:12:00 +0000 (13:12 +0100)]
btrfs: get fs_info from eb in check_tree_block_fsid
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 11:14:33 +0000 (12:14 +0100)]
btrfs: get fs_info from eb in btrfs_exclude_logged_extents
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:33:10 +0000 (11:33 +0100)]
btrfs: get fs_info from eb in leaf_data_end
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:27:57 +0000 (11:27 +0100)]
btrfs: get fs_info from eb in write_one_eb
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:23:44 +0000 (11:23 +0100)]
btrfs: get fs_info from eb in repair_eb_io_failure
We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Mar 2019 10:21:41 +0000 (11:21 +0100)]
btrfs: get fs_info from eb in lock_extent_buffer_for_io
We can read fs_info from extent buffer and can drop it from the
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Phillip Potter [Tue, 26 Mar 2019 21:39:34 +0000 (21:39 +0000)]
btrfs: use common file type conversion
Deduplicate the btrfs file type conversion implementation - file systems
that use the same file types as defined by POSIX do not need to define
their own versions and can use the common helper functions declared in
fs_types.h and implemented in fs_types.c.
The common implementation can be found in commit
bbe7449e2599 ("fs: common implementation of file type").
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Goldwyn Rodrigues [Mon, 25 Feb 2019 19:07:44 +0000 (13:07 -0600)]
btrfs: Perform locking/unlocking in btrfs_remap_file_range()
Move code to make it more readable, so that locking and unlocking are
done in the same function. The generic checks that are now performed in
the locked section are unaffected.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Arnd Bergmann [Mon, 25 Mar 2019 13:02:25 +0000 (14:02 +0100)]
btrfs: use BUG() instead of BUG_ON(1)
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0
Change it to BUG() so clang can see that this code path can never
continue.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
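Editorial sketch of the underlying issue in user space: when the fallback branch calls a function the compiler knows never returns, the variable cannot be read uninitialized on that path. toy_bug() is a hypothetical stand-in for BUG(), and the sizes are made-up illustration values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for BUG(): declared noreturn, so control never continues. */
static _Noreturn void toy_bug(void)
{
	abort();
}

enum toy_type { TOY_DATA, TOY_METADATA, TOY_SYSTEM };

static uint64_t toy_max_chunk_size(int type)
{
	uint64_t max_chunk_size;

	if (type == TOY_DATA)
		max_chunk_size = 10ULL << 30;   /* illustrative value */
	else if (type == TOY_METADATA)
		max_chunk_size = 1ULL << 30;    /* illustrative value */
	else if (type == TOY_SYSTEM)
		max_chunk_size = 32ULL << 20;   /* illustrative value */
	else
		toy_bug();      /* this branch cannot fall through */

	return max_chunk_size;
}
```

Wrapping the same call in a condition that the compiler cannot prove to be always true, as a BUG_ON(1) under branch profiling does, reintroduces a path on which the variable looks uninitialized.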
David Sterba [Thu, 21 Mar 2019 19:21:05 +0000 (20:21 +0100)]
btrfs: move tree block wait and write helpers to tree-log
The wrapper names better describe what's happening so they're not
deleted though they're trivial, but at least moved closer to their place
of use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Mar 2019 19:20:48 +0000 (20:20 +0100)]
btrfs: remove stale definition of BUFFER_LRU_MAX
A long time ago (2008), the extent buffers were organized in an LRU list
and were switched to an rb-tree in 6af118ce51b52ced ("Btrfs: Index extent
buffers in an rbtree"). There was one stale macro definition left.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 18 Mar 2019 13:06:55 +0000 (14:06 +0100)]
btrfs: tests: unify messages when tests start
- make the messages more visually consistent and use same format
"running ... test", any error or other warning can be easily spotted
- move some message to the test entry function
- add message to the inode tests
Example output:
[ 8.187391] Btrfs loaded, crc32c=crc32c-generic, assert=on, integrity-checker=on, ref-verify=on
[ 8.189476] BTRFS: selftest: sectorsize: 4096 nodesize: 4096
[ 8.190761] BTRFS: selftest: running btrfs free space cache tests
[ 8.192245] BTRFS: selftest: running extent only tests
[ 8.193573] BTRFS: selftest: running bitmap only tests
[ 8.194876] BTRFS: selftest: running bitmap and extent tests
[ 8.196166] BTRFS: selftest: running space stealing from bitmap to extent tests
[ 8.198026] BTRFS: selftest: running extent buffer operation tests
[ 8.199328] BTRFS: selftest: running btrfs_split_item tests
[ 8.200653] BTRFS: selftest: running extent I/O tests
[ 8.201808] BTRFS: selftest: running find delalloc tests
[ 8.320733] BTRFS: selftest: running extent buffer bitmap tests
[ 8.340795] BTRFS: selftest: running inode tests
[ 8.341766] BTRFS: selftest: running btrfs_get_extent tests
[ 8.342981] BTRFS: selftest: running hole first btrfs_get_extent test
[ 8.344342] BTRFS: selftest: running outstanding_extents tests
[ 8.345575] BTRFS: selftest: running qgroup tests
[ 8.346537] BTRFS: selftest: running qgroup add/remove tests
[ 8.347725] BTRFS: selftest: running qgroup multiple refs test
[ 8.354982] BTRFS: selftest: running free space tree tests
[ 8.372175] BTRFS: selftest: sectorsize: 4096 nodesize: 8192
[ 8.373539] BTRFS: selftest: running btrfs free space cache tests
[ 8.374989] BTRFS: selftest: running extent only tests
[ 8.376236] BTRFS: selftest: running bitmap only tests
[ 8.377483] BTRFS: selftest: running bitmap and extent tests
[ 8.378854] BTRFS: selftest: running space stealing from bitmap to extent tests
...
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 18 Mar 2019 12:54:36 +0000 (13:54 +0100)]
btrfs: tests: drop messages when some tests finish
Messages like 'extent I/O tests finished' are redundant: if a test
fails it's quite obvious in the log, and a hang is also noticeable. No
tests other than the extent_io and free space tree tests print them, so
drop them for consistency.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 18 Mar 2019 13:19:33 +0000 (14:19 +0100)]
btrfs: tests: fix comments about tested extent map ranges
Comments about ranges did not match the code, the correct calculation is
to use start and start+len as the interval boundaries.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 18 Mar 2019 13:14:35 +0000 (14:14 +0100)]
btrfs: tests: use SZ_ constants everywhere
There are a few remaining constants that are not powers of two and
haven't been converted.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after extent map allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 18 Mar 2019 14:05:27 +0000 (15:05 +0100)]
btrfs: tests: return error from all extent map test cases
The way the extent map tests handle errors does not conform to the rest
of the suite, where the first failure is reported and then it stops.
Do the same now that we have the errors returned from all the functions.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 17:41:06 +0000 (18:41 +0100)]
btrfs: tests: return errors from extent map test case 4
Replace asserts with error messages and return errors.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 17:41:06 +0000 (18:41 +0100)]
btrfs: tests: return errors from extent map test case 3
Replace asserts with error messages and return errors.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 17:41:06 +0000 (18:41 +0100)]
btrfs: tests: return errors from extent map test case 2
Replace asserts with error messages and return errors.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 17:41:06 +0000 (18:41 +0100)]
btrfs: tests: return errors from extent map test case 1
Replace asserts with error messages and return errors.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 17:06:16 +0000 (18:06 +0100)]
btrfs: tests: return errors from extent map tests
The individual testcases for extent maps do not return an error on
allocation failures. This is not a big problem as allocations don't
fail in general, but there are functional tests handled with ASSERTs.
This makes the tests dependent on them, which is not reliable.
This patch adds the allocation failure handling and allows for the
conversion of the asserts to proper error handling and reporting.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:42:07 +0000 (17:42 +0100)]
btrfs: tests: properly initialize fs_info of extent buffer
The fs_info is supposed to be valid, even though it's not used right
now and the test does not crash.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after block group allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after inode allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after path allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after extent buffer allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after root allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:28:46 +0000 (17:28 +0100)]
btrfs: tests: use standard error message after fs_info allocation failure
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:23:30 +0000 (17:23 +0100)]
btrfs: tests: add table of most common errors
Allocation of main objects like fs_info or extent buffers is in each
test so let's simplify and unify the error messages to a table and add a
convenience helper.
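A minimal userspace sketch of the table-plus-helper idea. The enum
values, message strings and the test_err_msg() helper below are
illustrative assumptions, not the names used in the tree:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error indices; the names are illustrative only. */
enum test_err {
	TEST_ALLOC_FS_INFO,
	TEST_ALLOC_ROOT,
	TEST_ALLOC_EXTENT_BUFFER,
	TEST_ERR_COUNT,
};

/* One message per index, so every test prints the same text. */
static const char *const test_err_str[TEST_ERR_COUNT] = {
	[TEST_ALLOC_FS_INFO]       = "cannot allocate fs_info",
	[TEST_ALLOC_ROOT]          = "cannot allocate root",
	[TEST_ALLOC_EXTENT_BUFFER] = "cannot allocate extent buffer",
};

/* Look up the message for an index; out-of-range maps to a fallback. */
static const char *test_err_msg(int index)
{
	if (index < 0 || index >= TEST_ERR_COUNT)
		return "unknown error";
	/* Guard against a hole left in the table by a new enum value. */
	assert(test_err_str[index] != NULL);
	return test_err_str[index];
}
```

A caller would then report a failed allocation by index instead of
repeating the string in each test.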
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 16:03:55 +0000 (17:03 +0100)]
btrfs: tests: print file:line for error messages
For better diagnostics print the file name and line to locate the
errors. Sample output:
[ 9.052924] BTRFS: selftest: fs/btrfs/tests/extent-io-tests.c:283 offset bits do not match
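A userspace sketch of how a macro can capture the call site. The
TEST_ERR() macro and format_test_err() function are hypothetical names
for illustration, only the "BTRFS: selftest:" prefix is taken from the
sample output:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "prefix file:line message" into buf.
 * Returns what snprintf() returns, i.e. the untruncated length.
 */
static int format_test_err(char *buf, size_t size, const char *file,
			   int line, const char *msg)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "BTRFS: selftest: %s:%d %s",
			file, line, msg);
}

/* The macro expands at the call site, so __FILE__/__LINE__ point there. */
#define TEST_ERR(buf, size, msg) \
	format_test_err((buf), (size), __FILE__, __LINE__, (msg))
```

Because the location is filled in by the preprocessor, callers pass only
the message text.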
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 15:46:55 +0000 (16:46 +0100)]
btrfs: tests: don't leak fs_info in extent_io bitmap tests
The fs_info is not freed at the end of the function and leaks. The
function is called twice so there can be up to 2x sizeof(struct
btrfs_fs_info) of leaked memory. Fortunately this affects only testing
builds; the size could be 16k with several debugging features enabled.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 15 Mar 2019 15:43:11 +0000 (16:43 +0100)]
btrfs: tests: handle fs_info allocation failure in extent_io tests
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:40 +0000 (14:27 +0800)]
btrfs: disk-io: Show the timing of corrupted tree block explicitly
Just add one extra line to show when the corruption is detected.
Currently only read time detection is possible.
The planned distinguishing lines would be:
read time:
<detailed report>
block=XXXXX read time tree block corruption detected
write time:
<detailed report>
block=XXXXX write time tree block corruption detected
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Mon, 25 Feb 2019 16:14:45 +0000 (11:14 -0500)]
btrfs: fix panic during relocation after ENOSPC before writeback happens
We've been seeing the following sporadically throughout our fleet
panic: kernel BUG at fs/btrfs/relocation.c:4584!
netversion: 5.0-0
Backtrace:
#0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
#1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
#2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
#3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
#4 [ffffc90003adb9c0] do_trap at ffffffff81019114
#5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
#6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
[exception RIP: btrfs_reloc_cow_block+692]
RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246
RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690
RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0
RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000
R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0
R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
#8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
#9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
The way relocation moves data extents is by creating a reloc inode and
preallocating extents in this inode and then copying the data into these
preallocated extents. Once we've done this for all of our extents,
we'll write out these dirty pages, which marks the extent written, and
goes into btrfs_reloc_cow_block(). From here we get our current
reloc_control, which _should_ match the reloc_control for the current
block group we're relocating.
However if we get an ENOSPC in this path at some point we'll bail out,
never initiating writeback on this inode. Not a huge deal, unless we
happen to be doing relocation on a different block group, and this block
group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in
btrfs_reloc_cow_block(), because we expect to be done modifying the data
inode. We are in fact done modifying the metadata for the data inode
we're currently using, but not the one from the failed block group, and
thus we BUG_ON().
(This happens when writeback finishes for extents from the previous
group, when we are at btrfs_finish_ordered_io() which updates the data
reloc tree (inode item, drops/adds extent items, etc).)
Fix this by writing out the reloc data inode always, and then breaking
out of the loop after that point to keep from tripping this BUG_ON()
later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add note from Filipe ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 20 Mar 2019 19:53:16 +0000 (21:53 +0200)]
btrfs: Use less confusing condition for uptodate parameter to btrfs_writepage_endio_finish_ordered
The uptodate parameter of btrfs_writepage_endio_finish_ordered is used
to signal whether an error has occurred while writing the given page.
0 signals an error, which is propagated to callees, and 1 signifies
success. In end_compressed_bio_write the ->bi_status is checked and
based on it either BLK_STS_OK (0) or BLK_STS_NOTSUPP (1) is used. While
from a functional point of view this is ok, it's confusing for the poor
reader of the code, since the block layer values are conflated with the
semantics of the parameter.
Just use plain 0 or 1. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:48 +0000 (14:27 +0800)]
btrfs: extent_io: Handle errors better in extent_writepages()
We can only get <=0 from extent_write_cache_pages, add an ASSERT() for
it just in case.
Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
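The control flow described above can be sketched in userspace as
follows. All sketch_* names and the mock state are assumptions made for
illustration; they only mimic the roles of the bio cleanup and
flush_write_bio() and are not the kernel functions:

```c
#include <assert.h>
#include <errno.h>

/* Mock bio state; every field here is a stand-in for illustration. */
struct sketch_epd {
	int bio_pending;   /* a bio has been built but not submitted */
	int submitted;     /* number of bios submitted */
	int cleaned;       /* number of bios dropped on error */
	int flush_error;   /* error that submission should report, or 0 */
};

/* Drop a half-built bio without submitting it. */
static void sketch_end_write_bio(struct sketch_epd *epd)
{
	if (epd->bio_pending) {
		epd->bio_pending = 0;
		epd->cleaned++;
	}
}

/* Submit the pending bio and report the submission result. */
static int sketch_flush_write_bio(struct sketch_epd *epd)
{
	if (!epd->bio_pending)
		return 0;
	epd->bio_pending = 0;
	if (epd->flush_error)
		return epd->flush_error;
	epd->submitted++;
	return 0;
}

/* ret is the result of the page-writing step and must be <= 0. */
static int sketch_finish_writepages(struct sketch_epd *epd, int ret)
{
	assert(ret <= 0);
	if (ret < 0) {
		/* Earlier error: clean up, never submit the bio. */
		sketch_end_write_bio(epd);
		return ret;
	}
	/* No error so far: the flush result becomes the return value. */
	return sketch_flush_write_bio(epd);
}
```

The point of the ordering is that a bio assembled while an error was
already pending is discarded rather than written out.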
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:46 +0000 (14:27 +0800)]
btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()
This function needs some extra checks on locked pages and eb. For error
handling we need to unlock locked pages and the eb.
There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.
Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio(). So there
shouldn't be any problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:45 +0000 (14:27 +0800)]
btrfs: extent_io: Handle errors better in extent_write_locked_range()
We can only get @ret <= 0. Add an ASSERT() for it just in case.
Then, instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:44 +0000 (14:27 +0800)]
btrfs: extent_io: Kill dead condition in extent_write_cache_pages()
Since __extent_writepage() will no longer return >0 value,
(ret == AOP_WRITEPAGE_ACTIVATE) will never be true.
Kill that dead branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:43 +0000 (14:27 +0800)]
btrfs: extent_io: Handle errors better in btree_write_cache_pages()
In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.
Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:42 +0000 (14:27 +0800)]
btrfs: extent_io: Handle errors better in extent_write_full_page()
Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.
If __extent_writepage() fails, we do cleanup, and return error without
submitting the possibly corrupted or half-baked bio.
If __extent_writepage() succeeds, then we call flush_write_bio() and
return the result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:41 +0000 (14:27 +0800)]
btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up
We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().
Move the BUG_ON() one level up to all its callers.
This patch will introduce a temporary variable, @flush_ret, to keep the
code change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 20 Mar 2019 06:27:39 +0000 (14:27 +0800)]
btrfs: Always output error message when key/level verification fails
We have an internal report of a strange transaction abort due to EUCLEAN
without any error message.
Since error message inside verify_level_key() is only enabled for
CONFIG_BTRFS_DEBUG, the error message won't be printed on most builds.
This patch will make the error message mandatory, so when problem
happens we know what's causing the problem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 12 Mar 2019 09:10:40 +0000 (17:10 +0800)]
btrfs: Check the first key and level for cached extent buffer
[BUG]
When reading a file from a fuzzed image, kernel can panic like:
BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3500!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
Call Trace:
btrfs_lookup_csum+0x52/0x150 [btrfs]
__btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
btrfs_submit_bio_hook+0x103/0x170 [btrfs]
submit_one_bio+0x59/0x80 [btrfs]
extent_read_full_page+0x58/0x80 [btrfs]
generic_file_read_iter+0x2f6/0x9d0
__vfs_read+0x14d/0x1a0
vfs_read+0x8d/0x140
ksys_read+0x52/0xc0
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
The fuzzed image has a corrupted leaf whose first key doesn't match its
parent:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
...
key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
leaf 29761536 flags 0x1(WRITTEN) backref revision 1
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
range start 8798638964736 end 8798641262592 length 2297856
When reading the above tree block, we have extent_buffer->refs = 2 in
the context:
- initial one from __alloc_extent_buffer()
alloc_extent_buffer()
|- __alloc_extent_buffer()
|- atomic_set(&eb->refs, 1)
- one being added to fs_info->buffer_radix
alloc_extent_buffer()
|- check_buffer_tree_ref()
|- atomic_inc(&eb->refs)
So even if we call free_extent_buffer() in read_tree_block or other
similar situation, we only decrease the refs by 1; it doesn't reach 0
and the buffer won't be freed right now.
The stale eb and its corrupted content will still be kept cached.
Furthermore, we have several extra cases where we either don't do first
key check or the check is not proper for all callers:
- scrub
We just don't have first key in this context.
- shared tree block
One tree block can be shared by several snapshot/subvolume trees.
In that case, the first key check for one subvolume doesn't apply to
another.
So for the above reasons, a corrupted extent buffer can sneak into the
buffer cache.
[FIX]
Call verify_level_key in read_block_for_search to do another
verification. For that purpose the function is exported.
Due to above reasons, although we can free corrupted extent buffer from
cache, we still need the check in read_block_for_search(), for scrub and
shared tree blocks.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 14 Mar 2019 07:52:35 +0000 (09:52 +0200)]
btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
If an eb fails to be read for whatever reason - it's corrupted on disk
and parent transid/key validations fail or IO for eb pages fails - then
this buffer must be removed from the buffer cache. Currently the code
calls free_extent_buffer if an error occurs. Unfortunately this doesn't
achieve the desired behavior since btrfs_find_create_tree_block returns
with eb->refs == 2.
On the other hand free_extent_buffer will only decrement the refs once
leaving it added to the buffer cache radix tree. This enables later
code to look up the buffer from the cache and utilize it potentially
leading to a crash.
The correct way to free the buffer is to call free_extent_buffer_stale.
This function will correctly call atomic_dec explicitly for the buffer
and subsequently call release_extent_buffer which will decrement the
final reference, thus correctly removing the invalid buffer from the
buffer cache. This change affects only newly allocated buffers since they have
eb->refs == 2.
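A toy userspace model of the reference counting described here. The
sketch_* names are hypothetical and the model ignores everything except
the counter and cache membership:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an extent buffer; not the kernel structure. */
struct sketch_eb {
	int refs;        /* reference count */
	bool in_cache;   /* still reachable through the buffer cache */
	bool freed;      /* memory would have been released */
};

/* A newly created buffer: one ref for the caller, one for the cache. */
static struct sketch_eb sketch_new_eb(void)
{
	struct sketch_eb eb = { .refs = 2, .in_cache = true, .freed = false };
	return eb;
}

/* Drop one reference; the buffer leaves the cache only at refs == 0. */
static void sketch_put(struct sketch_eb *eb)
{
	assert(eb->refs > 0);
	if (--eb->refs == 0) {
		eb->in_cache = false;
		eb->freed = true;
	}
}

/* Stale variant: also give up the reference held for the cache. */
static void sketch_put_stale(struct sketch_eb *eb)
{
	if (eb->refs == 2)
		eb->refs--;   /* explicit decrement of the cache's reference */
	sketch_put(eb);       /* final decrement releases the buffer */
}
```

In this model a single sketch_put() on a fresh buffer leaves refs at 1
with the buffer still cached, while sketch_put_stale() brings it to 0
and removes it.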
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Reported-by: Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 19 Mar 2019 06:04:17 +0000 (14:04 +0800)]
btrfs: Make btrfs_(set|clear)_header_flag return void
From the introduction of btrfs_(set|clear)_header_flag, there is no
usage of its return value. So just make it return void.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 18 Mar 2019 02:48:19 +0000 (10:48 +0800)]
btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan
Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots()") expands the life span of root->reloc_root.
This breaks certain checks of fs_info->reloc_ctl. Before that commit, if
we have a root with valid reloc_root, then it's ensured to have
fs_info->reloc_ctl.
But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
such check is unreliable and can cause the following NULL pointer
dereference:
BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1
IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 10379 Comm: snapperd Not tainted
Call Trace:
create_pending_snapshot+0xd7/0xfc0 [btrfs]
create_pending_snapshots+0x8e/0xb0 [btrfs]
btrfs_commit_transaction+0x2ac/0x8f0 [btrfs]
btrfs_mksubvol+0x561/0x570 [btrfs]
btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs]
btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs]
btrfs_ioctl+0x5c9/0x1e60 [btrfs]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fd7cdab8467
Fix it by explicitly checking fs_info->reloc_ctl rather than using the
implied root->reloc_root.
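A userspace sketch of the explicit check. The sketch_* structures and
the helper name are assumptions for illustration and only model the two
pointers involved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the structures named in the message. */
struct sketch_reloc_ctl { int merge_reloc_tree; };
struct sketch_fs_info { struct sketch_reloc_ctl *reloc_ctl; };
struct sketch_root {
	struct sketch_fs_info *fs_info;
	void *reloc_root;
};

/*
 * True only when relocation state may be dereferenced. A non-NULL
 * reloc_root alone no longer implies a valid reloc_ctl, so the
 * control structure is tested first and explicitly.
 */
static bool sketch_relocation_running(const struct sketch_root *root)
{
	assert(root != NULL && root->fs_info != NULL);
	if (!root->fs_info->reloc_ctl)
		return false;
	return root->reloc_root != NULL;
}
```

A caller that previously tested only reloc_root would bail out early
when this helper returns false instead of touching reloc_ctl.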
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 18 Mar 2019 15:45:18 +0000 (17:45 +0200)]
btrfs: Remove unused -EIO assignment in end_bio_extent_readpage
In case we hit the error case for a metadata buffer in
end_bio_extent_readpage then 'ret' won't really be checked before it's
written to again. This means the -EIO in this case will never be
checked, just remove it.
Fixes-coverity-id: 1442513
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 11 Mar 2019 07:55:38 +0000 (09:55 +0200)]
btrfs: Exploit the fact that pages passed to extent_readpages are always contiguous
Currently extent_readpages (called from btrfs_readpages) will always
call __extent_readpages which tries to create contiguous range of pages
and call __do_contiguous_readpages when such contiguous range is
created.
It turns out this is unnecessary due to the fact that generic MM code
always calls filesystem's ->readpages callback (btrfs_readpages in
this case) with already contiguous pages. Armed with this knowledge it's
possible to simplify extent_readpages by eliminating the call to
__extent_readpages and directly calling contiguous_readpages.
The only edge case that needs to be handled is when
add_to_page_cache_lru fails. This is easy as all that is needed is to
submit whatever is the number of pages successfully added to the lru.
This can happen when the page is already in the range, so it does not
need to be read again, and we can't do anything else in case of other
errors.
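A userspace sketch of the batching with the edge case handled. The
sketch_* names, the pool size and the already_cached[] array (which
models a failed add_to_page_cache_lru) are assumptions; ending the
batch at a failed page is a choice of this sketch to keep every
submitted range contiguous:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POOL 16

/* Record of one submitted batch: first page index and page count. */
struct sketch_batch {
	size_t first;
	size_t count;
};

/*
 * Walk npages contiguous pages. already_cached[i] models a failed
 * insert into the page cache for page i. Returns the number of
 * batches written to out.
 */
static size_t sketch_readpages(const bool *already_cached, size_t npages,
			       struct sketch_batch *out, size_t max_out)
{
	size_t nbatches = 0;
	size_t i = 0;

	while (i < npages) {
		size_t first = i;
		size_t nr = 0;

		/* Collect pages until the pool is full or an insert fails. */
		while (i < npages && nr < SKETCH_POOL) {
			if (already_cached[i]) {
				i++;   /* skip it; the batch ends here */
				break;
			}
			nr++;
			i++;
		}
		/* Submit whatever was successfully collected, if anything. */
		if (nr) {
			assert(nbatches < max_out);
			out[nbatches].first = first;
			out[nbatches].count = nr;
			nbatches++;
		}
	}
	return nbatches;
}
```

For four pages where the third is already cached, this yields one batch
covering pages 0-1 and a second batch covering page 3.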
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 24 Aug 2018 14:31:17 +0000 (16:31 +0200)]
btrfs: switch extent_buffer::lock_nested to bool
The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 24 Aug 2018 14:24:26 +0000 (16:24 +0200)]
btrfs: use assertion helpers for extent buffer write lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks becomes unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 24 Aug 2018 14:20:02 +0000 (16:20 +0200)]
btrfs: add assertion helpers for extent buffer write lock counters
The write_locks are a simple counter to track locking balance and used
to assert tree locks. Add helpers to make it conditionally work only in
DEBUG builds. Will be used in followup patches.
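A userspace sketch of helpers that only do work in debug builds. The
SKETCH_DEBUG switch and the sketch_* names are hypothetical stand-ins
for the kernel config option and helpers:

```c
#include <assert.h>

/* Defined here so the debug branch compiles; drop it to get the stubs. */
#define SKETCH_DEBUG 1

struct sketch_eb {
#ifdef SKETCH_DEBUG
	int write_locks;   /* balance counter, present only in debug builds */
#endif
	int level;         /* stand-in for the always-present fields */
};

#ifdef SKETCH_DEBUG
static void sketch_write_locks_get(struct sketch_eb *eb)
{
	eb->write_locks++;
}

static void sketch_write_locks_put(struct sketch_eb *eb)
{
	assert(eb->write_locks > 0);   /* catches an unbalanced unlock */
	eb->write_locks--;
}

static int sketch_write_locks(const struct sketch_eb *eb)
{
	return eb->write_locks;
}
#else
/* Non-debug builds: the helpers compile to nothing. */
static void sketch_write_locks_get(struct sketch_eb *eb) { (void)eb; }
static void sketch_write_locks_put(struct sketch_eb *eb) { (void)eb; }
static int sketch_write_locks(const struct sketch_eb *eb) { (void)eb; return 0; }
#endif
```

Since callers go through the helpers only, the counter field can be
compiled out without touching the locking code itself.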
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 24 Aug 2018 14:15:51 +0000 (16:15 +0200)]
btrfs: use assertion helpers for extent buffer read lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks becomes unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 24 Aug 2018 14:13:41 +0000 (16:13 +0200)]
btrfs: add assertion helpers for extent buffer read lock counters
The read_locks are a simple counter to track locking balance and used to
assert tree locks. Add helpers to make it conditionally work only in
DEBUG builds. Will be used in followup patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>