Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: Map CompareAndSwap is not atomic with respect to other map operations in Go 1.24 #70970

Open
znkr opened this issue Dec 23, 2024 · 12 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@znkr
Copy link
Contributor

znkr commented Dec 23, 2024

Go version

go version devel go1.24-eef35e3bd9 Mon Dec 23 07:08:40 2024 -0800 darwin/arm64

Output of go env in your module/workspace:

AR='ar'
CC='clang'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='clang++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='arm64'
GOARM64='v8.0'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/Users/flo/Library/Caches/go-build'
GODEBUG=''
GOENV='/Users/flo/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/1_/dsgxrj_j2ws_5h4ykmddqjyr0000gn/T/go-build418421246=/tmp/go-build -gno-record-gcc-switches -fno-common'
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMOD='/Users/flo/Projects/dex/go.mod'
GOMODCACHE='/Users/flo/go/pkg/mod'
GONOPROXY='*.home.znkr.io'
GONOSUMDB='*.home.znkr.io'
GOOS='darwin'
GOPATH='/Users/flo/go'
GOPRIVATE='*.home.znkr.io'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/flo/Projects/go/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='on'
GOTELEMETRYDIR='/Users/flo/Library/Application Support/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/flo/Projects/go/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='devel go1.24-eef35e3bd9 Mon Dec 23 07:08:40 2024 -0800'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I am experimenting with weak maps and ran into a problem that I can't explain. I am either missing something and there's a bug somewhere in my implementation or there's a bug in Go.

I reduced the problem to

package main

import (
	"fmt"
	"runtime"
	"sync"
	"weak"
)

const N = 1_000_000
const P = 5_000

type dummy [32]byte

func main() {
	var m sync.Map
	var wg sync.WaitGroup
	wg.Add(N)
	for i := range N {
		go func() {
			defer wg.Done()
			a := get(&m, i%P)
			b := get(&m, i%P)
			if a != b {
				fmt.Printf("a != b (%p vs %p)\n", a, b)
			}
		}()
	}
	wg.Wait()
}

func get(m *sync.Map, key int) *dummy {
	nv := new(dummy)
	nw := weak.Make(nv)
	for {
		w, loaded := m.LoadOrStore(key, nw)
		if !loaded {
			runtime.AddCleanup(nv, cleanup, cleanupArg{m, key, nw})
			return nv
		}
		if v := w.(weak.Pointer[dummy]).Value(); v != nil {
			return v
		}

		// Weak pointer was reclaimed, try to replace it with nw.
		if m.CompareAndSwap(key, w, nw) {
			runtime.AddCleanup(nv, cleanup, cleanupArg{m, key, nw})
			return nv
		}

		// This works:
		// m.CompareAndDelete(key, w)
	}
}

type cleanupArg struct {
	m     *sync.Map
	key   int
	value weak.Pointer[dummy]
}

func cleanup(arg cleanupArg) {
	arg.m.CompareAndDelete(arg.key, arg.value)
}

What did you see happen?

Unfortunately, this is a bit flaky and it doesn't always reproduce. I sometimes have to run the program multiple times. When it reproduces it prints something like this:

a != b (0x1400506bee0 vs 0x1400506bf00)
a != b (0x14006316540 vs 0x1400206dc40)
a != b (0x140040dca40 vs 0x140040dca80)

I can also reproduce it on the Go playground: https://go.dev/play/p/HBJ2IU2IGdF?v=gotip

What did you expect to see?

I expect to always observe that same pointer being returned by get for the same key.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 23, 2024
@dr2chase
Copy link
Contributor

dr2chase commented Dec 23, 2024

I'm going to spend a little time staring at your code to try to figure out what I think ought to be happening, but it might not hurt if comments in the example or a later commentary explained what you thought was supposed to be going on. It can be easier to check your assumptions, than to run this GC-interacting piece of code in my error-prone head.

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 23, 2024
@cherrymui cherrymui added this to the Go1.24 milestone Dec 23, 2024
@znkr
Copy link
Contributor Author

znkr commented Dec 23, 2024

You're absolutely right. I don't have time right now, but I'll get back to it later today. Sorry for the missing comments, I was too deep in the weeds to notice.

@dr2chase
Copy link
Contributor

dr2chase commented Dec 23, 2024

I think this is the doc comment you would want to prepend to get:

// get returns the non-nil (pointer) value of the weak pointer
// in m at key, possibly storing a new weak pointer there to
// make it so, if (and only if) the key is not in the map 
// or the weak pointer found there has gone empty.
// If/when a new weak pointer is stored, get adds
// a cleanup for the weak-referenced/returned value that will
// remove it from the map if it is present.

And the larger expectation is that because the returned value from the first call (assigned to a) is live until the compare/print:

  • the cleanup will not run
  • therefore the key remains populated
  • the weak pointer at the key will not go nil
    therefore a and b should always have the same value.

If this is your bug, it is darn subtle, because I don't see it. I will keep looking. I don't see where it needs a keepalive, either.

And thank you, I am betting "our bug" at this point, and this close to release it needs attention.

@mknyszek

@dr2chase dr2chase added Critical A critical problem that affects the availability or correctness of production systems built using Go release-blocker labels Dec 23, 2024
@mknyszek
Copy link
Contributor

mknyszek commented Dec 23, 2024

This is a bug in the map implementation, I think. CompareAndDelete and CompareAndSwap must appear atomic with respect to one another, and they don't. I can't reproduce this in the old map implementation.

I think I know how to fix the problem. I should have a CL out shortly. Worst case we switch the GOEXPERIMENT default back to the old map implementation, but I don't think it'll come to that.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 23, 2024

I'm fairly certain the problem was introduced in CL 606462 and that rolling back that change will resolve this issue. That being said, I will take a little bit of time to try and salvage that optimization with high confidence that we don't introduce any newer, subtler bugs.

@mknyszek mknyszek changed the title runtime: sync.Map or weak.Pointer issue sync: Map CompareAndSwap is not atomic with respect to other map operations in Go 1.24 Dec 23, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/638559 mentions this issue: internal/sync: guarantee atomicity of swaps in HashTrieMap

@mknyszek
Copy link
Contributor

mknyszek commented Dec 23, 2024

CL 638559 is a fix that works. It still needs a test and a little bit of cleanup, but I feel OK about it.

It's a bit late in the release cycle. Here are our options:

  • Revert go.dev/cl/606462, which means a possible performance regression.
  • Land go.dev/cl/638559, which has a risk of unknown bugs.
  • Disable the GOEXPERIMENT, which is risk-free, but also means that we don't get any sync.Map optimizations in this release and we should remove the change from the release notes, or just note that it's opt-in only.

@znkr
Copy link
Contributor Author

znkr commented Dec 23, 2024

I guess I don't need to update the snippet above with better comments now, I'll keep in mind for next time though. Great job in pin-pointing that bug so quickly, btw.

@ianlancetaylor
Copy link
Member

@mknyszek How noticeable is the performance regression?

@mknyszek
Copy link
Contributor

I'll measure it. IIRC, there was one microbenchmark in the existing sync.Map benchmark suite that really cared.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/638615 mentions this issue: Revert "internal/sync: optimize CompareAndSwap and Swap"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

7 participants