dnode_next_offset: backtrack if lower level does not match #16025

rrevans · 2024-03-25T13:15:20Z

This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level.

Motivation and Context

Normally a higher level search match in the first loop of dnode_next_offset always points to a matching block in the second loop, but there are cases where this does not happen:

Racing block pointer updates from dbuf_write_ready.

Before f664f1e (Reduce lock contention on dn_struct_rwlock #8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock which protected against pointer writes from concurrent syncs.

This no longer applies, so sync context can f.e. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BP's are updated.

dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block.

This case was found with ztest.
txg > 0, non-hole case. This is subtle bug in dnode_next_offset() with txg > 0 #11196.

Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0.

Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg.

Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated.

The same behavior is possible with dnode search at L0.

This is reachable from dsl_destroy_head for synchronous freeing. When this happens open context fails to free objects leaving sync context stuck freeing potentially many objects.

This is also reachable from traverse_pool for extreme rewind where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely as ESRCH result implies no further matches in the entire object.

This PR is also a first step towards teaching dnode_next_offset to consider dirty dbuf state. In the next PR, dnode_next_offset_level is modified to stop at any dirty indirect block when a new flag is set. This allows dnode_next_offset to match dirty L0 blocks (or freed-but-not-synced L0 ranges) the same as synced out data blocks (or holes). However that approach requires backtracking since a dirty higher-level indirect may not match once the L0/L1 state is inspected (f.e. consider a data search reaching an L2 block that is dirty but all L0 blocks previously created under that L2 are now newly freed in dirty state).

Description

Old algorithm:

Start at minlvl
Increment lvl until a matching block is found or maxlvl exceeded.
Decrement lvl until minlvl reached or no match found.

New algorithm:

Start at minlvl
Do a tree traversal checking for a match at each block:
a. If matched, decrement lvl until minlvl reached.
b. If not matched, adjust offset to next BP at lvl+1 and increment lvl.

The new algorithm continues the search at the next possible offset at the next higher level when no match is found. This performs in-order traversal of the tree while skipping non-existing or non-matching ranges.

This also cleans up some fiddliness around *offset handling and adds new assertions:

Large objects can exceed 2**64 bytes at maxlvl. Previously the code would search for pointers at offsets beyond this limit and set error accordingly, but then clamp the output offset to start after *offset << span overflowed. Now the code limits the search to representable offsets within the object which prevents offset overflow altogether.
When matching the same block, dnode_next_level_offset could previously return with *offset set to a value less than the starting offset (or greater than for backwards search). This is because offsets within the starting block were truncated away. This PR guarantees that on output the offset is not before/after the starting offset.
For reverse search, previously search would fail if *offset was such that it pointed beyond the number of BPs at dn_nlevels. This occurs because the i < epb test for the forward search would stop iteration (also prevents reading beyond the end of the array). This is notionally wrong as reverse search from such offsets should match blocks that exist in the file. dnode_next_offset_level now addresses this by clamping the starting block.
Also previously for reverse search, if search goes past offset zero, then the returned offset was set somewhat incidentally to min(start, 0 + (1 << span) - 1). This is confusing and does not match the forward search case which leaves *offset unmodified on overflow. The new logic leaves *offset unmodified to match the forward search case.

How Has This Been Tested?

Many ztest and ZTS runs as well as seek(SEEK_DATA/SEEK_HOLE) stress tests. This surfaced a lot of problems getting *offset semantics right, and also found a novel PANIC in free_children which this happens to fix.

I don't know how to really test maxlvl == 0 changes (see also comments in #11200), and it would be nice to have more unit-oriented tests for dnode_next_offset. Any feedback appreciated.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

This walk is inherently racy w.r.t. dbuf eviction and sync. Consider: 0. A large sparse file with 3 levels of indirection. 1. A new L1 block is added to a brand new L2 block. 2. The L1 block syncs out and is immediately evicted. 3. Before the L3->L2 BP is updated in the L3 block, dnode_free_range attempts to free the new L1. In this case neither dnode_dirty_l1range nor dnode_next_offset can find the newly synced-out L1 block and its L0 blocks: - dnode_dirty_l1range uses in-memory index but the L1 is evicted - dnode_next_offset considers on-disk BPs but the L3->L2 is missing And then free_children will later PANIC because the L1 was not dirtied during open context when freeing the range. This case was found during testing llseek(SEEK_HOLE/SEEK_DATA) without txg sync and is distinct from the _other_ free_childen panic found and addressed by openzfs#16025. The fix is to replace dnode_dirty_l1range with dnode_next_offset(DNODE_FIND_DIRTY) which knows how to find all dirty L1 blocks. This PR also changes to use minlvl=1 to avoid redirtying L2 blocks that are only dirtied in a prior txg. Successive frees otherwise needlessly redirty already-empty L1s which wastes time during txg sync turning them back into holes. Signed-off-by: Robert Evans <[email protected]>

rrevans · 2024-03-28T13:13:42Z

See master...rrevans:zfs:find_dirty for the rest of the patchset here.

dnode_next_offset: add DNODE_FIND_DIRTY
dmu_offset_next: Use DNODE_FIND_DIRTY for SEEK_HOLE/SEEK_DATA
dnode_free_range: Replace L1 dirty walk with DNODE_FIND_DIRTY

adamdmoss · 2024-05-14T15:56:29Z

I found your notes quite educational as a background so I'm repeating the link here for future readers:
https://gist.github.com/rrevans/e6a2e14be9dea7f9711b83c2d18303d5

tonyhutter · 2024-08-19T21:55:44Z

@rrevans sorry no one has taken a look at this yet. I just tried pulling down the patch, but looks like it's now out of sync with master. Would you mind re-basing on top of master?

jumbi77 · 2024-09-20T17:04:43Z

@rrevans sorry no one has taken a look at this yet. I just tried pulling down the patch, but looks like it's now out of sync with master. Would you mind re-basing on top of master?

Politely ping @rrevans to rebase this great work!

rrevans · 2024-12-23T01:45:12Z

@jumbi77 @tonyhutter thanks for the ping. I'll have a look here and post a rebase in the next week or so.

Edit: Updated!

This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level. Normally higher level blocks always point to matching blocks, but there are cases where this does not happen: 1. Racing block pointer updates from dbuf_write_ready. Before f664f1e (openzfs#8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock which protected against pointer writes from concurrent syncs. This no longer applies, so sync context can f.e. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BP's are updated. dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block. This case was found with ztest. 2. txg > 0, non-hole case. This is openzfs#11196. Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0. Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg. Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated. The same behavior is possible with dnode search at L0. This is reachable from dsl_destroy_head for synchronous freeing. When this happens open context fails to free objects leaving sync context stuck freeing potentially many objects. This is also reachable from traverse_pool for extreme rewind where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case. In both of these cases, without backtracking the search ends prematurely as ESRCH result implies no more matches in the entire object. Signed-off-by: Robert Evans <[email protected]>

rrevans force-pushed the traverse branch from 62bced1 to 0fe0b07 Compare March 25, 2024 14:10

behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 27, 2024

behlendorf self-requested a review March 27, 2024 22:03

rrevans mentioned this pull request Mar 28, 2024

dnode_is_dirty: use dn_dirty_txg to check dirtiness #15615

Draft

13 tasks

rincebrain mentioned this pull request May 11, 2024

Criteria for re-enabling block cloning (toggle zfs_bclone_enabled) by default #16189

Closed

rrevans force-pushed the traverse branch from 0fe0b07 to a33e429 Compare December 23, 2024 15:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dnode_next_offset: backtrack if lower level does not match #16025

dnode_next_offset: backtrack if lower level does not match #16025

rrevans commented Mar 25, 2024 •

edited

Loading

rrevans commented Mar 28, 2024

adamdmoss commented May 14, 2024

tonyhutter commented Aug 19, 2024

jumbi77 commented Sep 20, 2024

rrevans commented Dec 23, 2024 •

edited

Loading

dnode_next_offset: backtrack if lower level does not match #16025

Are you sure you want to change the base?

dnode_next_offset: backtrack if lower level does not match #16025

Conversation

rrevans commented Mar 25, 2024 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

rrevans commented Mar 28, 2024

adamdmoss commented May 14, 2024

tonyhutter commented Aug 19, 2024

jumbi77 commented Sep 20, 2024

rrevans commented Dec 23, 2024 • edited Loading

rrevans commented Mar 25, 2024 •

edited

Loading

rrevans commented Dec 23, 2024 •

edited

Loading