-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve lost-increment message in repo analysis #131200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve lost-increment message in repo analysis #131200
Conversation
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @DaveCTurner, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one minor question
| value is [%s], but reading its value with [%s] unexpectedly yielded [%s]. This \ | ||
| anomaly may indicate an atomicity failure amongst concurrent compare-and-exchange \ | ||
| operations on registers in this repository.""", | ||
| expectedFinalRegisterValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the number of increments, right? Do you mean to pass it down from the caller of finalRegisterValueVerifier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the number of increments, yes: it counts the number of successful responses to a ContendedRegisterAnalyzeAction (each of which does one increment operation). The caller of finalRegisterValueVerifier doesn't know this yet, because this is called before any of the actions have run.
We could simplify this because we always either increment this value or fail the whole analysis (skipping these checks). IIRC that wasn't always the case in some earlier draft and this never got cleaned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see the content is doubled as the count as well since it starts from 0 and increment 1 each time. Thanks for explaining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #131274
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] Successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
indicate an atomicity failure amongst concurrent
compare-and-exchange operations on registers in this repository.
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] Successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
indicate an atomicity failure amongst concurrent
compare-and-exchange operations on registers in this repository.
Today repository analysis may fail with a message like the following:
This is confusing because one might interpret
should have value [10]as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write: