Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

converged-timeout is currently broken #743

Open
ffrank opened this issue Mar 5, 2024 · 6 comments
Open

converged-timeout is currently broken #743

ffrank opened this issue Mar 5, 2024 · 6 comments

Comments

@ffrank
Copy link
Contributor

ffrank commented Mar 5, 2024

Versions:

  • mgmt version (eg: mgmt --version): 0.0.24-199-g296fc484

  • operating system/distribution (eg: uname -a):

Linux jammy 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • golang version (eg: go version): go version go1.22.0 linux/amd64

Description:

The --converged-timeout value is not applied. When supplied, mgmt begins shutting down immediately after assuming a converged state for the first time.

$ ./mgmt run --converged-timeout=3600 lang examples/lang/file0.mcl
2024/03/05 15:41:32 cli: lang: lexing/parsing...
2024/03/05 15:41:32 cli: lang: init...
2024/03/05 15:41:32 cli: lang: interpolating...
2024/03/05 15:41:32 cli: lang: building scope...
2024/03/05 15:41:32 cli: lang: running type unification...
2024/03/05 15:41:32 cli: lang: input: examples/lang/file0.mcl
2024/03/05 15:41:32 cli: lang: tree:
/
├── file0.mcl
└── metadata.yaml
This is: mgmt, version: 0.0.24-199-g296fc484
Copyright (C) 2013-2024+ James Shubin and the project contributors
Written by James Shubin <[email protected]> and the project contributors
15:41:32 main: start: 1709653292375592509
15:41:32 main: working prefix is: /home/felix/.cache/mgmt/
15:41:32 PGP: Imported key: B7BFCF22
15:41:32 main: no seeds specified!
15:41:32 etcd: running...
15:41:32 etcd: watching chooser...
15:41:32 etcd: bootstrapping...
15:41:32 etcd: nominated: jammy=http://localhost:2380
15:41:32 etcd: waiting for server...
15:41:32 etcd: server: runServer: (newCluster=true): jammy=http://localhost:2380
15:41:32 etcd: server: starting...
15:41:33 etcd: server: ready
15:41:33 etcd: connect...
15:41:33 etcd: connected!
15:41:33 etcd: watching nominees...
15:41:33 etcd: nominated: jammy=http://localhost:2380
15:41:33 etcd: watching volunteers...
15:41:33 etcd: volunteering...
15:41:33 etcd: volunteers: jammy=http://localhost:2380
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/(jammy=http://localhost:2380)
15:41:33 etcd: chooser result(+/-): []/[]
15:41:33 etcd: watching endpoints...
15:41:33 main: etcd is ready!
15:41:33 main: waiting...
15:41:33 etcd: volunteers: jammy=http://localhost:2380
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/(jammy=http://localhost:2380)
15:41:33 etcd: chooser result(+/-): []/[]
15:41:33 main: running...
15:41:33 main: waiting...
15:41:33 gapi: generating new graph...
15:41:33 gapi: swap!
15:41:33 gapi: lang: lexing/parsing...
15:41:33 gapi: lang: init...
15:41:33 gapi: lang: interpolating...
15:41:33 gapi: lang: building scope...
15:41:33 gapi: lang: running type unification...
15:41:33 gapi: lang: building function graph...
15:41:33 gapi: lang: function engine initializing...
15:41:33 gapi: lang: function engine starting...
15:41:33 gapi: lang: stream...
15:41:33 gapi: lang: funcs: aggregated send
15:41:33 gapi: generating new graph...
15:41:33 gapi: lang: running interpret...
15:41:33 main: new graph took: 94.156µs
15:41:33 engine: autoedge: adding autoedges...
15:41:33 main: auto edges took: 75.351µs
15:41:33 engine: autogroup: algorithm: wrappedGrouper: NonReachabilityGrouper...
15:41:33 main: auto grouping took: 41.668µs
15:41:33 main: send/recv building took: 801ns
15:41:33 main: commit...
15:41:33 engine: graph sync...
15:41:33 main: graph: Vertices(2), Edges(0)
15:41:33 main: waiting...
15:41:33 main: converged for 3600 seconds, exiting!
15:41:33 engine: file[/tmp/dir1/]: Working...
15:41:33 engine: file[/tmp/dir1/]: CheckApply(true)
15:41:33 engine: file[/tmp/dir1/]: CheckApply(true): Return(true, <nil>)
15:41:33 engine: file[/tmp/file1]: Working...
15:41:33 engine: file[/tmp/file1]: CheckApply(true)
15:41:33 engine: file[/tmp/file1]: CheckApply(true): Return(true, <nil>)
15:41:33 main: destroy...
15:41:33 main: deploy: exited
15:41:33 main: loop: exited
15:41:33 engine: graph sync...
15:41:33 engine: file[/tmp/file1]: Exited...
15:41:33 engine: file[/tmp/dir1/]: Exited...
15:41:33 etcd: destroy...
15:41:33 etcd: got chooser shutdown...
15:41:33 etcd: unvolunteering...
15:41:33 etcd: list of volunteers is empty
15:41:33 etcd: chooser: (jammy=http://localhost:2380)/()
15:41:33 etcd: chooser result(+/-): []/[jammy]
15:41:33 etcd: unnominated: shutting down 1 members...
15:41:33 etcd: unnominate: removing self...
15:41:33 etcd: list of nominations is empty
15:41:33 etcd: server: destroyServer...
15:41:33 etcd: server: runServer: done!
15:41:33 etcd: server: destroyServer: done!
15:41:33 main: goodbye!

Note especially:

15:41:32 main: start: 1709653292375592509
...
15:41:33 main: converged for 3600 seconds, exiting!
@ffrank
Copy link
Contributor Author

ffrank commented Mar 5, 2024

From my reading: When the converger was rewritten, it got shiny new StartTimer() and ResetTimer() methods...but those do not seem to be called.
Instead we add a "StateFn" that fires when we transition from not converged to converged, and immediately shuts down the main loop.

@ffrank
Copy link
Contributor Author

ffrank commented Mar 5, 2024

As a side note, I think tests are currently passing only because FileRes and friends manage to finish their work before the shutdown hits.
(I have added a shell test for my rewritten MountRes and this one does NOT manage to CheckApply before the premature finish.)

@purpleidea
Copy link
Owner

I am not surprised if this has rotted. Further more the converger package was probably some of the first code I wrote, and it is probably terrible and needs fixing. Thank you for letting me know!

@ffrank
Copy link
Contributor Author

ffrank commented Mar 5, 2024

Upon further reading, I now think it wasn't the 2019 rewrite of the converger package, but rather the 2019 rewrite of the core algorithm in 253ed78 that made us lose those method calls 😅

@purpleidea
Copy link
Owner

Could be. I'll definitely have to dig into it at some point... My reprieve is that I don't think it's a very high-priority feature anymore, but if you have a real-world use case where it's required, please let me know. This could definitely be a "thing we add when a customer needs it" sort of deal.

@ffrank
Copy link
Contributor Author

ffrank commented Mar 5, 2024

No I agree that it's a terrible crutch in our test shell scripts, and should be replaced with something more sustainable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants