
Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter() #16896

Open · wants to merge 1 commit into master

Conversation

@ryao (Contributor) commented Dec 22, 2024

Motivation and Context

The other day, a friend complained about slow rm -r operations on a pool that had two mirrored spinning disks. He was unlinking a large number of files, so I suggested a parallel rm, and what should have taken several more minutes completed in seconds:

find /path/to/subtree -type f | parallel -j250 rm --
rm -r /path/to/subtree

I have long known that our VFS operations are slower than those of other filesystems such as ext4, and I have long attributed that to our use of range locks for scaling versus the spin locks that ext4 uses. Since scalability is more important than single-threaded performance, I made no attempt to change this. Today, I had a flash of inspiration after writing a comment about this in a Hacker News discussion:

https://news.ycombinator.com/item?id=42486279

It occurred to me that the uncontended case could be made faster. My initial idea was to try to make an adaptive range lock that would minimize work done in uncontended cases to approach the performance of ext4's spin locks. However, upon looking at the range lock code, I realized that the amount of work being done in the uncontended case is already somewhat minimal, while a significant improvement in all cases would be possible by changing how we allocate memory for zfs_locked_range_t.

Instead of allocating memory for zfs_locked_range_t as part of zfs_rangelock_{try,}enter(), we can allocate it externally either on the stack or as part of another structure. This is possible with trivial refactoring in all cases except vdev_raidz_io_start(), where slightly more extensive refactoring is needed to allocate zfs_locked_range_t as part of raidz_map_t.
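To make the change concrete, here is a sketch of the before and after prototypes. These signatures are illustrative and may not match the patch byte-for-byte:

/*
 * Old: the callee allocates the handle with kmem_alloc() and returns
 * it, so every lock acquisition pays for an allocation.
 *
 * zfs_locked_range_t *zfs_rangelock_enter(zfs_rangelock_t *rl,
 *     uint64_t off, uint64_t len, zfs_rangelock_type_t type);
 *
 * New: the caller supplies the handle, from its stack frame or
 * embedded in another structure, so no allocation occurs.
 */
void zfs_rangelock_enter(zfs_rangelock_t *rl, zfs_locked_range_t *lr,
    uint64_t off, uint64_t len, zfs_rangelock_type_t type);
boolean_t zfs_rangelock_tryenter(zfs_rangelock_t *rl, zfs_locked_range_t *lr,
    uint64_t off, uint64_t len, zfs_rangelock_type_t type);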

I recall seeing flame graphs from close to the start of ZFSOnLinux showing significant time spent in kmem_alloc() when doing range locks. Doing this refactoring eliminates that entirely. This should make our VFS and zvol operations slightly faster. Some RAID-Z operations will also become slightly faster.

Description

We do trivial refactoring to allocate zfs_locked_range_t either on the stack or as part of other structures in all cases but vdev_raidz_io_start(), where non-trivial refactoring is needed.
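As an illustration of what the trivial cases look like at a call site, a caller would change roughly as follows (the function and variable names here are made up for illustration, not taken from the tree):

static void
example_vfs_op(znode_t *zp, uint64_t off, uint64_t len)
{
        zfs_locked_range_t lr;  /* previously: zfs_locked_range_t *lr */

        /* Old: lr = zfs_rangelock_enter(&zp->z_rangelock, off, len, RL_WRITER); */
        zfs_rangelock_enter(&zp->z_rangelock, &lr, off, len, RL_WRITER);
        /* ... do the work under the range lock ... */
        zfs_rangelock_exit(&lr);
}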

How Has This Been Tested?

I have only done build tests so far. I am letting the buildbot do runtime testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin (Member) left a comment

I don't remember if I saw it as a problem, but I can believe that allocations are not great. Though modern allocators are pretty well cached, and I think we are talking about one allocation per operation here.

The patch looks OK to me, but my worry here is stack consumption. I haven't checked on Linux, but on FreeBSD zfs_locked_range_t takes 96 bytes, which will now be on the stack instead of an 8-byte pointer. It would be nice to try to compact it, if possible, since you are already changing the API. For example, lr_rangelock seems to be used in only 2 places, and I guess we could just pass it directly to zfs_rangelock_exit() and zfs_rangelock_reduce() to save 8 bytes. Maybe lr_type could be reduced to uint8_t, which is not very clean, but could save another 8 bytes.
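Something like this, to sketch the idea (hypothetical signatures):

/*
 * The parent lock would be passed explicitly at the two call sites
 * that need it, letting us drop the 8-byte lr_rangelock back-pointer.
 */
void zfs_rangelock_exit(zfs_rangelock_t *rl, zfs_locked_range_t *lr);
void zfs_rangelock_reduce(zfs_rangelock_t *rl, zfs_locked_range_t *lr,
    uint64_t off, uint64_t len);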

@ryao (Contributor, Author) commented Dec 22, 2024

I don't remember if I saw it as a problem, but I can believe that allocations are not great. Though modern allocators are pretty well cached, and I think we are talking about one allocation per operation here.

It was never huge, but as I recall it was non-negligible. When I looked into why ext4 was so much faster at unlinking than we were in the past, I blamed the range locks and dropped the matter, since range locks were important for scalability. In hindsight that was not an entirely accurate assessment, but it was the conclusion I reached at the time.

The patch looks OK to me, but my worry here is stack consumption.

The VFS and block layer on the zvol side have never been stack constrained as far as I know, so the only place where this is potentially a problem is in raidz_reflow_scratch_sync(), which is executed in txg_sync_thread. I am not aware of that one being stack constrained either (correct me if I am wrong), so we should be okay here.

How about we subject this to testing to see if this is a problem and if it passes, let it be? If it is a problem, then we can switch back to kmem_alloc() for raidz_reflow_scratch_sync().

I haven't checked on Linux, but on FreeBSD zfs_locked_range_t takes 96 bytes, which will now be on the stack instead of an 8-byte pointer. It would be nice to try to compact it, if possible, since you are already changing the API. For example, lr_rangelock seems to be used in only 2 places, and I guess we could just pass it directly to zfs_rangelock_exit() and zfs_rangelock_reduce() to save 8 bytes.

In multiple places in the code, the zfs_locked_range_t is passed to another thread and freed elsewhere, and we depend on ->lr_rangelock to find the original range lock since the pointer to it is gone. In order to do what you suggest, we would need to pass a pointer to the original range lock alongside the zfs_locked_range_t. A few places in the raidz code and {zfs,zvol,ztest}_get_done would need to change. It is cleaner to do it the current way, although I don't mind changing this if it results in a measurable improvement. However, I am not sure whether it would result in any savings, as you will see below.

Maybe lr_type could be reduced to uint8_t, which is not very clean, but could save another 8 bytes.

It takes 208 bytes on my Linux machine:

$ pahole $(find . -name zfs.ko) -C zfs_locked_range
struct zfs_locked_range {
        zfs_rangelock_t *          lr_rangelock;         /*     0     8 */
        avl_node_t                 lr_node;              /*     8    24 */
        uint64_t                   lr_offset;            /*    32     8 */
        uint64_t                   lr_length;            /*    40     8 */
        uint_t                     lr_count;             /*    48     4 */
        zfs_rangelock_type_t       lr_type;              /*    52     4 */
        kcondvar_t                 lr_write_cv;          /*    56    72 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        kcondvar_t                 lr_read_cv;           /*   128    72 */
        /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
        uint8_t                    lr_proxy;             /*   200     1 */
        uint8_t                    lr_write_wanted;      /*   201     1 */
        uint8_t                    lr_read_wanted;       /*   202     1 */

        /* size: 208, cachelines: 4, members: 11 */
        /* padding: 5 */
        /* last cacheline: 16 bytes */
};

This is because the Linux SPL's kcondvar_t is huge:

$ pahole $(find . -name spl.ko) -C kcondvar_t
typedef struct {
        int                cv_magic;             /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        wait_queue_head_t  cv_event;             /*     8    24 */
        wait_queue_head_t  cv_destroy;           /*    32    24 */
        atomic_t           cv_refs;              /*    56     4 */
        atomic_t           cv_waiters;           /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        kmutex_t *         cv_mutex;             /*    64     8 */

        /* size: 72, cachelines: 2, members: 6 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* last cacheline: 8 bytes */
} kcondvar_t;

Interestingly, if I try to save some space by switching lr_type to uint8_t and moving it to the end, nothing changes due to compiler-added padding:

$ pahole $(find . -name zfs.ko) -C zfs_locked_range
struct zfs_locked_range {
        zfs_rangelock_t *          lr_rangelock;         /*     0     8 */
        avl_node_t                 lr_node;              /*     8    24 */
        uint64_t                   lr_offset;            /*    32     8 */
        uint64_t                   lr_length;            /*    40     8 */
        uint_t                     lr_count;             /*    48     4 */

        /* XXX 4 bytes hole, try to pack */

        kcondvar_t                 lr_write_cv;          /*    56    72 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        kcondvar_t                 lr_read_cv;           /*   128    72 */
        /* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */
        uint8_t                    lr_proxy;             /*   200     1 */
        uint8_t                    lr_write_wanted;      /*   201     1 */
        uint8_t                    lr_read_wanted;       /*   202     1 */
        uint8_t                    lr_type;              /*   203     1 */

        /* size: 208, cachelines: 4, members: 11 */
        /* sum members: 200, holes: 1, sum holes: 4 */
        /* padding: 4 */
        /* last cacheline: 16 bytes */
};

This yields no improvement. I wonder if FreeBSD sees any savings.

This also makes me wonder if removing ->lr_rangelock is just going to cause the hole to become bigger. If it does, the additional work to pass the pointer externally from zfs_locked_range_t in the RAID-Z code would paradoxically result in us using more memory.

@amotin (Member) commented Dec 22, 2024

It takes 208 bytes on my Linux machine

It makes me worry about stack usage even more. We don't know from what context the VFS calls us, and if it is not directly from user-space, I worry there might be some unpleasant surprises.

This is because the Linux SPL's kcondvar_t is huge

Yea. On FreeBSD it takes only 16 bytes, which makes other optimizations more noticeable.

if I try to save some space by switching lr_type to uint8_t and moving it to the end

Why move it to the end if there is a 4-byte hole after lr_count where it was before?

@ryao (Contributor, Author) commented Dec 22, 2024

It takes 208 bytes on my Linux machine

It makes me worry about stack usage even more. We don't know from what context the VFS calls us, and if it is not directly from user-space, I worry there might be some unpleasant surprises.

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

I am not very concerned about this on Linux. In 2014, Linux switched from the 8KB stacks under which stack space had been painful for us to 16KB stacks. We still support a few of the kernels that have 8KB stacks, but stack space when called from the VFS was never a problem even in the 8KB stack days, so adding this to the stack should not create an issue.

if I try to save some space by switching lr_type to uint8_t and moving it to the end

Why move it to the end if there is a 4-byte hole after lr_count where it was before?

The enum is an int per the C standard, and int is 4 bytes in our memory model. Changing the enum to a uint8_t would create a 3-byte hole, which I expected to be padded to ensure that the following member had at least 4-byte alignment, so I moved it to the end where there was already padding. To my surprise, the compiler is padding to give at least 8-byte alignment, such that a 4-byte hole emerged from the move of lr_type to the end.

Even if I shrink lr_type to 1 byte from 4 bytes, the saved bytes just become padding, no matter where I put it. Things might be different on 32-bit systems, but I don't think enough people use those anymore to make it worth doing this just for them. It is counterintuitive, but there are no savings from shrinking lr_type.

@amotin (Member) commented Dec 23, 2024

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

Maybe. But even if so, I guess there might be some more complicated structures like nullfs and similar mounts, etc., increasing call depth.

To my surprise, the compiler is padding to give at least 8-byte alignment,

It is not a surprise that the structure is padded to 8 bytes, since it includes 8-byte elements that must be aligned. That is why I thought about reusing the already existing hole after lr_count. Then the structure would end just after lr_read_cv.
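To sketch what I mean, with offsets computed by hand from your Linux pahole output above (an approximation, not compiler output):

struct zfs_locked_range {
        zfs_rangelock_t *          lr_rangelock;         /*     0     8 */
        avl_node_t                 lr_node;              /*     8    24 */
        uint64_t                   lr_offset;            /*    32     8 */
        uint64_t                   lr_length;            /*    40     8 */
        uint_t                     lr_count;             /*    48     4 */
        uint8_t                    lr_type;              /*    52     1 */
        uint8_t                    lr_proxy;             /*    53     1 */
        uint8_t                    lr_write_wanted;      /*    54     1 */
        uint8_t                    lr_read_wanted;       /*    55     1 */
        kcondvar_t                 lr_write_cv;          /*    56    72 */
        kcondvar_t                 lr_read_cv;           /*   128    72 */

        /* size: 200, no tail padding */
};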

@ryao (Contributor, Author) commented Dec 23, 2024

Very few places in the Linux kernel call into the VFS. It is generally discouraged.

Maybe. But even if so, I guess there might be some more complicated structures like nullfs and similar mounts, etc., increasing call depth.

Linux bind mounts would be the equivalent of the FreeBSD nullfs driver. Those are handled by the Linux VFS, if I recall correctly. The main things that would call into the VFS would be the loop driver, for mounting filesystems from files, and FUSE. However, none of this has ever been a problem, even on the kernels with 8KB stacks. Linux actually had a push to move to 4KB stacks:

https://lwn.net/Articles/279229/

However, practicality won, and in 2014 Linux went to 16KB stacks:

https://lwn.net/Articles/600644/

However, the 4KB stack enablement work did plenty to minimize stack usage in these paths in upstream Linux. Linux also has a tool for measuring maximum stack utilization, although I have never had to use it, as the stack space issues in the project had already been fixed by @behlendorf before I started contributing.

To my surprise, the compiler is padding to give at least 8-byte alignment,

It is not a surprise that the structure is padded to 8 bytes, since it includes 8-byte elements that must be aligned. That is why I thought about reusing the already existing hole after lr_count. Then the structure would end just after lr_read_cv.

There is no existing hole after lr_count. That hole appeared after I shrank lr_type and moved it to the end. See the first pahole $(find . -name zfs.ko) -C zfs_locked_range output for what the structure currently is. The second one reflected your suggested change and created the hole. The size of both is the same, regardless of whether lr_type is left where it was or is moved to the end after shrinking it.

We might be able to save 8 bytes some of the time by removing ->lr_rangelock as you suggested, but it is a bunch of refactoring. I have done about half of the work needed locally so far, but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here. The code is cleaner the current way.

@amotin (Member) commented Dec 23, 2024

but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here.

When compared to 208 bytes, maybe not. But I am not buying that we are not stack limited here or anywhere in the kernel. I'll let others comment.

Commit: Allocate zfs_locked_range_t memory externally from zfs_rangelock_{try,}enter()

Typically, the memory allocated by kmem_alloc() can be trivially
allocated from either the stack or as part of another structure. The
only case where it cannot is in vdev_raidz_io_start(), although some
further refactoring should be able to eliminate that case too.
Allocating from the stack or as part of another data structure is faster
as it gives us this memory for free, so there is little reason not to do
it.

This eliminates a non-negligible amount of CPU time that I have seen in
flame graphs going back to the early days of OpenZFS when the tree was
the ZFSOnLinux tree. This should make our VFS and zvol operations
slightly faster. Some RAID-Z operations will also become slightly
faster.

Signed-off-by: Richard Yao <[email protected]>
@ryao (Contributor, Author) commented Dec 23, 2024

but I wonder if saving 8 bytes on the stack is worth it when we are not stack limited here.

When compared to 208 bytes, maybe not. But I am not buying that we are not stack limited here or anywhere in the kernel. I'll let others comment.

We are stack limited in the ZIO threads, which this does not touch. I am not aware of stack limitations elsewhere. If it helps, I could set aside time to try measuring.

If we were to decide not to stack-allocate things, this API change would still be beneficial, since it coalesces allocations in a number of places, such as in the RAID-Z code and {zfs,zvol,ztest}_get_done(); see the sketch below.
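For example, embedding the handle in zgd_t folds the range lock allocation into the existing zgd_t allocation. Roughly (a simplified sketch; the real structure has additional members):

typedef struct zgd {
        struct lwb              *zgd_lwb;
        zfs_locked_range_t      zgd_lr;   /* was: zfs_locked_range_t *zgd_lr */
        void                    *zgd_private;
        /* ... other members unchanged ... */
} zgd_t;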

That being said, there is something subtly wrong here. From qemu-x86 (freebsd13-3r):

Fatal trap 12: page fault while in kernel mode
  cpuid = 1; apic id = 01
  fault virtual address	= 0x22710010
  fault code		= supervisor read data, page not present
  instruction pointer	= 0x20:0xffffffff8280012a
  stack pointer	        = 0x28:0xfffffe009a2d6860
  frame pointer	        = 0x28:0xfffffe009a2d6870
  code segment		= base 0x0, limit 0xfffff, type 0x1b
  			= DPL 0, pres 1, long 1, def32 0, gran 1
  processor eflags	= interrupt enabled, resume, IOPL = 0
  current process		= 1273 (txg_thread_enter)
  trap number		= 12
  panic: page fault
  cpuid = 1
  time = 1734896301
  KDB: stack backtrace:
  #0 0xffffffff80c514c5 at kdb_backtrace+0x65
  #1 0xffffffff80c04e22 at vpanic+0x152
  #2 0xffffffff80c04cc3 at panic+0x43
  #3 0xffffffff810c206d at trap_fatal+0x38d
  #4 0xffffffff810c20bf at trap_pfault+0x4f
  #5 0xffffffff81099c18 at calltrap+0x8
  #6 0xffffffff82a5772e at zfs_rangelock_enter_impl+0x1ee
  #7 0xffffffff82a5ba57 at zfs_get_data+0x207
  #8 0xffffffff82a66620 at zil_lwb_write_issue+0x2c0
  #9 0xffffffff82a620eb at zil_commit_impl+0x3cb
  #10 0xffffffff82a59c03 at zfs_fsync+0x63
  #11 0xffffffff828568bc at zfs_file_fsync+0x7c
  #12 0xffffffff8284a3a8 at vdev_file_io_start+0x48
  #13 0xffffffff82a74f77 at zio_vdev_io_start+0x367
  #14 0xffffffff82a6b70c at zio_nowait+0x10c
  #15 0xffffffff829a5753 at vdev_config_sync+0x173
  #16 0xffffffff82969f1a at spa_sync+0x118a
  #17 0xffffffff82984381 at txg_sync_thread+0x4c1
  Uptime: 8m10s
  Automatic reboot in 15 seconds - press a key on the console to abort
  Rebooting...
  cpu_reset: Restarting BSP
  cpu_reset_proxy: Stopped CPU 1

Unfortunately, I am not sure what it is. This case is not even stack-allocated, as it is part of the zgd_t allocation. I am going to move the structure member to the top to see how the buildbot reacts, on the chance that this is somehow alignment-related, although this backtrace does not make sense to me unless there is some other zgd_t definition on FreeBSD that I am not seeing in the tree.
