Improve RemoteSequencer tracing and logging #2286

Draft · muhamadazmy wants to merge 1 commit into main from pr2286

Conversation

@muhamadazmy (Contributor) commented Nov 13, 2024

Improve RemoteSequencer tracing and logging

Also avoid sending over the open connection if the
inflight commits channel is closed

@muhamadazmy force-pushed the pr2286 branch 2 times, most recently from 49e7718 to 8ab3785 on November 13, 2024 15:24
@muhamadazmy marked this pull request as ready for review on November 13, 2024 15:24
@muhamadazmy (Contributor Author) commented Nov 13, 2024

Couple of comments:

  • It took multiple iterations to actually land on the remote sequencer issue. After this fix I was not able to reproduce the same issue, but I did get other failures, which I will add to the main issues.
  • I am wondering if the "remote-sequencer-connection" task should get a TaskKind::SequencerAppender so it can be shut down gracefully and have a chance to process pending commit tokens. (I will update the PR)

@tillrohrmann (Contributor)

So the problem was that the inflight commits channel was closed? If yes, how did this happen?

@muhamadazmy (Contributor Author)

@tillrohrmann One case where this can happen is if a sealed response was received: this breaks the loop, the channel is then drained, and all inflight commits are resolved as 'sealed'. But at the same time the connection is still valid, so sending the next append will still work, while pushing to the inflight channel will fail.

Anyway, I am refactoring this code again. Consider this PR a draft for now.

@tillrohrmann (Contributor)

> @tillrohrmann One case where this can happen is if a sealed response was received: this breaks the loop, the channel is then drained, and all inflight commits are resolved as 'sealed'. But at the same time the connection is still valid, so sending the next append will still work, while pushing to the inflight channel will fail.
>
> Anyway, I am refactoring this code again. Consider this PR a draft for now.

And the sealed information wouldn't have propagated and eventually closed the RemoteSequencer? Like here

if self.known_global_tail().is_sealed() {
?

@tillrohrmann (Contributor)

The reason I am asking about the exact failure scenario is that in the scenario I observed, the appends still completed. So I am wondering whether this indicates that the loglet wasn't sealed. Maybe you observed a different scenario, then?

@muhamadazmy (Contributor Author)

You are right, I didn't observe a seal. But I was trying to find a scenario where it is possible for the channel to be closed while the connection is not.

I added a few logs and then noticed a situation where that is exactly the case. But there was no attempt to reconnect, just repeated failures to send over the channel. This made me suspect a few things, so I:

  • Expanded the list of send errors on which we retry, in case the connection returns a different error on failure to send.
  • Changed the task kind to a different type (to basically wait for the drain).
  • Made sure the channel is checked before attempting to send (see the sketch below).
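A rough sketch of the last point, with hypothetical names (enqueue_append, and a u64 payload standing in for the real inflight append type), not the actual code in this PR:

use tokio::sync::mpsc;

// Hypothetical send path: bail out before touching the connection if the
// inflight-commits channel is already closed. A closed channel means the
// response-handling task is gone, so the append would never be resolved.
async fn enqueue_append(
    inflight: &mpsc::Sender<u64>,
    append: u64,
) -> Result<(), &'static str> {
    if inflight.is_closed() {
        // Treat this like a lost connection: the caller should renew the
        // connection (and its response-handling task) before retrying.
        return Err("inflight channel closed; renew the connection first");
    }
    // ... send the append over the network connection here (elided) ...
    inflight
        .send(append)
        .await
        .map_err(|_| "inflight channel closed while enqueuing")
}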

@tillrohrmann (Contributor)

I think it would be great to understand the exact situation to learn from it (also for my personal closure ;-))

@muhamadazmy (Contributor Author)

@tillrohrmann I totally understand your concern. I will try to reproduce it again without the fix but with enough logs. Unfortunately, I lost the logs from the failed run.

@muhamadazmy force-pushed the pr2286 branch 3 times, most recently from 446aa89 to 1993419 on November 14, 2024 08:16
/// If a LogletCommitResolver is dropped without being
/// 'resolved', we resolve it automatically as cancelled
/// to distinguish it from a Shutdown.
impl Drop for LogletCommitResolver {
@muhamadazmy (Contributor Author)

I am not sure this is necessary. If the LogletCommitResolver is dropped and the sender channel is closed, a RecvErr is received on the receiver side, which is then mapped into an AppendError::Shutdown.

I am wondering whether a Shutdown is what we want to return here. This can still happen, for example, if a connection is lost in the case of the remote sequencer.

What do you think @AhmedSoliman?

Contributor:

Perhaps we can map the RecvErr to a more descriptive type. For instance, CommitStatusUnknown and mark it as retryable append error?
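A rough sketch of that idea, assuming a hypothetical CommitStatusUnknown error and a plain oneshot channel (the real AppendError/resolver types may look different):

use tokio::sync::oneshot;

// Hypothetical error: the resolver was dropped before resolving, so we
// genuinely don't know whether the append was committed. Marking it as
// retryable would let the appender re-drive the append instead of treating
// it as a shutdown.
#[derive(Debug, thiserror::Error)]
#[error("commit status unknown; the append may or may not have been committed")]
struct CommitStatusUnknown;

async fn await_commit(rx: oneshot::Receiver<u64>) -> Result<u64, CommitStatusUnknown> {
    // RecvError means the sending side (the commit resolver) was dropped
    // without resolving; map it to the explicit error instead of Shutdown.
    rx.await.map_err(|_recv_err| CommitStatusUnknown)
}

Whether retrying on it is actually safe is of course a separate question.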

};

let appended = match appended {
let appended = match rpc_token.recv().await {
@muhamadazmy (Contributor Author), Nov 14, 2024

This could block forever if the response message never arrives (lost). Do you think it's better to time out here?

@tillrohrmann @AhmedSoliman
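If a timeout is the way to go, it could look roughly like this (the duration, the u64 payload, and the error handling are placeholders, not a concrete proposal for this code):

use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

// Hypothetical wrapper: give up waiting for the appended response after a
// deadline instead of blocking forever if the response message was lost.
async fn recv_appended_with_deadline(
    rx: oneshot::Receiver<u64>,
    deadline: Duration,
) -> Result<u64, &'static str> {
    match timeout(deadline, rx).await {
        Ok(Ok(appended)) => Ok(appended),
        Ok(Err(_recv_err)) => Err("sender dropped before responding"),
        // Elapsed: the response may still arrive later, so the connection
        // should probably be renewed rather than reused blindly.
        Err(_elapsed) => Err("timed out waiting for appended response"),
    }
}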

@tillrohrmann (Contributor)

My theory is that https://github.com/restatedev/restate/blob/891c350981f021f815b5b17fdba1dbfc37df4f9f/crates/bifrost/src/providers/replicated_loglet/remote_sequencer.rs#L243 solved the problem we were seeing.

What could have caused the problem initially is the following: There is already a connection from node 0 to node 1 where the sequencer for log-1 is running. Now PP-0 on node 0 wants to append something to log-1 and therefore creates a RemoteSequencer for this loglet. It will reuse the existing connection but spawn handle_appended_responses on its own runtime pp-0. Now PP-0 gets shut down, which also stops the task for handle_appended_responses. Now PP-0 gets restarted and wants to append to log-1 yet again. It will reuse the existing RemoteSequencer, which still has a valid connection assigned but no longer a handle_appended_responses task. Note that this wouldn't have been a problem if the connection reactor had also run on pp-0, because then it would have been terminated as well.

So maybe we revert this change in our tests to see whether we can reproduce the problem.

@AhmedSoliman (Contributor)

@tillrohrmann yep, that's exactly what was happening and that's why I fixed it.

In this PR we should improve observability and figure out if there is a design change that makes this component more resilient but I know that the actual root cause of the particular issue we saw is fixed by my change ;)

@muhamadazmy force-pushed the pr2286 branch 2 times, most recently from ec730c6 to 14eced8 on November 14, 2024 12:58
@tillrohrmann (Contributor)

> @tillrohrmann yep, that's exactly what was happening and that's why I fixed it.
>
> In this PR we should improve observability and figure out if there is a design change that makes this component more resilient but I know that the actual root cause of the particular issue we saw is fixed by my change ;)

Alrighty. I was under the wrong impression that we are still trying to fix the root cause of the observed problem with this PR. Then I've reached my personal closure with this bug :-)

@tillrohrmann (Contributor)

Only semi-related: runtime context inheritance can be quite tricky at times. Especially when components are shared across runtimes and one of them eventually gets stopped, this can lead to surprising effects.

Also avoid sending over the open connection if the
inflight commits channel is closed
@AhmedSoliman (Contributor)

@tillrohrmann It's true, and there may be a way to make this more explicit, or we actually go to a thread-per-core design and fix the root of this rolling ball of issues.

@muhamadazmy (Contributor Author)

@tillrohrmann tbh, without the enriched logs and an understanding of the 'inherited' runtime problem, it was really hard for me to figure out why the task was aborted. The only thing I was sure of is that the task exited while the connection was still valid. So I built my solution to render the connection as disconnected if that ever happens again (for this reason or another).

I also make sure that on abort of this task it actually does a graceful shutdown and drains all inflight commits. This was done implicitly of course, since when the CommitResolver is dropped it automatically resolves as Shutdown, so this improvement might not be necessary.
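What I have in mind for the drain is roughly the following (the channel payload and error value are placeholders, not the actual resolver types):

use tokio::sync::{mpsc, oneshot};

// Hypothetical graceful-shutdown path: close the inflight channel, then drain
// whatever is still queued and resolve each pending commit explicitly instead
// of relying on the resolver's Drop impl.
async fn drain_inflight(
    mut inflight_rx: mpsc::Receiver<oneshot::Sender<Result<u64, &'static str>>>,
) {
    // Stop accepting new inflight appends; already-buffered ones remain.
    inflight_rx.close();
    while let Some(resolver) = inflight_rx.recv().await {
        // Resolve explicitly; ignore the error if the waiting side is gone.
        let _ = resolver.send(Err("connection to the sequencer was lost"));
    }
}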

@AhmedSoliman (Contributor) left a comment

Left a couple of comments. In general, I don't think the current design of a task per connection will be reliable. I propose that we change the strategy and invest in RpcRouter providing the facility to resolve receive tokens when the underlying connection is dropped.

@@ -156,21 +156,40 @@ pub type SendableLogletReadStream = Pin<Box<dyn LogletReadStream + Send>>;

#[allow(dead_code)]
Contributor:

do we still need this?

/// If a LogletCommitResolver is dropped without being
/// 'resolved', we resolve it automatically as cancelled
/// to distinguish it from a Shutdown.
impl Drop for LogletCommitResolver {
Contributor:

Perhaps we can map the RecvErr to a more descriptive type. For instance, CommitStatusUnknown and mark it as retryable append error?


#[derive(Debug, Clone, Copy, thiserror::Error)]
#[error("Commit resolver was dropped")]
struct CommitCancelled;
Contributor:

The term "cancelled" often indicates a graceful, user-requested cancellation of something. "Abort", on the other hand, is used to denote an abrupt, ungraceful one.

That said, I'm not sure this error type adds enough context for the caller to understand why the commit was aborted. Do you have ideas on either:
A) making this error more precise, to pinpoint why the append was aborted, or
B) possibly removing the need for this error in lieu of an existing one?

Comment on lines +138 to +143
let permits = self
.record_permits
.clone()
.acquire_many_owned(len)
.await
.unwrap();
Contributor:

idea: Can we close() the semaphore when we detect that the loglet is sealed?
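For reference, with tokio's Semaphore that could behave roughly like this (the Sealed error here is a stand-in for the loglet's real sealed error):

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Hypothetical stand-in for the loglet's real sealed error.
#[derive(Debug, thiserror::Error)]
#[error("loglet is sealed")]
struct Sealed;

async fn acquire_record_permits(
    record_permits: Arc<Semaphore>,
    len: u32,
) -> Result<OwnedSemaphorePermit, Sealed> {
    // Once record_permits.close() has been called (e.g. when a seal is
    // observed), acquire_many_owned returns an AcquireError instead of
    // waiting, so the unwrap() above can become a proper sealed error.
    record_permits
        .acquire_many_owned(len)
        .await
        .map_err(|_closed| Sealed)
}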

| NetworkError::ConnectionClosed(_)
| NetworkError::Timeout(_) => {
// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
Contributor:

We should never receive this error, as you are using an async blocking send on the channel.

// we retry to re-connect one time
err @ NetworkError::Full => return Err(err.into()),
_ => {
// we retry on any other network error
Contributor:

I wonder if this is what we want. If the sequencer node died and the loglet was sealed, should we retry a few times (retry policy) and then let Bifrost's appender check whether the loglet was sealed / metadata was updated before retrying again?
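Roughly what I'm picturing, sketched with a hypothetical connect closure and hard-coded backoff instead of the actual retry policy types:

use std::future::Future;
use std::time::Duration;

// Hypothetical bounded retry: attempt to reconnect a few times with backoff,
// then give up and let the caller (e.g. Bifrost's appender) re-check the seal
// status / metadata before driving further retries.
async fn reconnect_with_bounded_retries<F, Fut>(
    mut connect: F,
    max_attempts: usize,
) -> Result<(), &'static str>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), &'static str>>,
{
    let mut backoff = Duration::from_millis(100);
    for _attempt in 0..max_attempts {
        if connect().await.is_ok() {
            return Ok(());
        }
        tokio::time::sleep(backoff).await;
        backoff = (backoff * 2).min(Duration::from_secs(5));
    }
    // Surface the failure so the caller can check whether the loglet was
    // sealed or the metadata changed before retrying again.
    Err("exhausted reconnect attempts; re-check seal/metadata before retrying")
}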

let connection = self
.networking
.node_connection(self.params.sequencer)
.await?;

Span::current().record("renewed", true);
Contributor:

I'm not sure this really adds value. Let's remove it. The log message should be sufficient.

}

/// Gets or starts a new remote sequencer connection
#[instrument(level = "debug", skip_all)]
Contributor:

The span doesn't add any value here since it has no context to add

@@ -228,26 +233,32 @@ where
#[derive(Clone)]
struct RemoteSequencerConnection {
inner: WeakConnection,
tx: mpsc::UnboundedSender<RemoteInflightAppend>,
inflight: mpsc::Sender<RemoteInflightAppend>,
Contributor:

I'm not entirely convinced that this is a good design in general. A different approach could be to have one task associated with the remote sequencer that handles all responses, and to refactor RpcRouter to resolve the receive token when the underlying connection is dropped.

@muhamadazmy (Contributor Author)

Thank you @AhmedSoliman for your review. I will go over and process all comments asap.

@AhmedSoliman (Contributor)

Should we consider this one abandoned and close it?

@muhamadazmy (Contributor Author)

@AhmedSoliman I wanted to pick it up again later. I will convert it to a draft for now

@muhamadazmy marked this pull request as draft on December 19, 2024 12:58
@muhamadazmy removed the request for review from tillrohrmann on December 19, 2024 12:58