
Conversation

@DaveCTurner (Contributor) commented Jul 14, 2025

Today repository analysis may fail with a message like the following:

    [test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
    should have value [10] but instead had value
    [OptionalBytesReference[00 00 00 00 00 00 00 09]]

This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that: we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, so the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
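
For illustration, each increment is built from compare-and-exchange operations on the register. Here is a minimal sketch of the idea (a hypothetical `Register` interface backed by an `AtomicLong`, not the actual Elasticsearch `BlobContainer` API), showing how a non-atomic compare-and-exchange can silently lose an increment:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical register abstraction standing in for a blob-store register;
// not the actual Elasticsearch BlobContainer API.
interface Register {
    // Set the value to `updated` if it currently equals `expected`; return
    // the value witnessed at the time of the operation either way.
    long compareAndExchange(long expected, long updated);

    long get();
}

// In-memory implementation, purely for illustration.
final class InMemoryRegister implements Register {
    private final AtomicLong value = new AtomicLong();

    public long compareAndExchange(long expected, long updated) {
        return value.compareAndExchange(expected, updated);
    }

    public long get() {
        return value.get();
    }
}

final class RegisterIncrementer {
    // Retry until exactly one increment is applied. If the store's
    // compare-and-exchange is not truly atomic, two concurrent callers can
    // both observe the same witness and both "succeed", so one increment is
    // lost: the anomaly the reworded message describes.
    static long increment(Register register) {
        while (true) {
            long witness = register.get();
            if (register.compareAndExchange(witness, witness + 1) == witness) {
                return witness + 1;
            }
        }
    }
}
```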

This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:

    [test-repo] Successfully completed all [10] atomic increments of
    register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
    expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
    but reading its value with [getRegister] unexpectedly yielded
    [OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
    indicate an atomicity failure amongst concurrent
    compare-and-exchange operations on registers in this repository.

@DaveCTurner DaveCTurner requested a review from ywangd July 14, 2025 11:28
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.2.0 labels Jul 14, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 14, 2025
@elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you.

@ywangd (Member) left a comment

I have one minor question

```java
    value is [%s], but reading its value with [%s] unexpectedly yielded [%s]. This \
    anomaly may indicate an atomicity failure amongst concurrent compare-and-exchange \
    operations on registers in this repository.""",
expectedFinalRegisterValue,
```
@ywangd (Member)

This is not the number of increments, right? Do you mean to pass it down from the caller of `finalRegisterValueVerifier`?

@DaveCTurner (Contributor, author)

It is the number of increments, yes: it counts the number of successful responses to a `ContendedRegisterAnalyzeAction` (each of which does one increment operation). The caller of `finalRegisterValueVerifier` doesn't know this yet, because this is called before any of the actions have run.

We could simplify this because we always either increment this value or fail the whole analysis (skipping these checks). IIRC that wasn't always the case in some earlier draft and this never got cleaned up.
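
In other words, the register starts at 0 and each successful action adds exactly 1, so the final register value doubles as the count of successful increments. A minimal sketch of the resulting check (hypothetical helper, not the real `finalRegisterValueVerifier`):

```java
// Sketch of the invariant the analysis relies on: the register starts at 0
// and each successful increment adds exactly 1, so the final value must
// equal the number of increments that reported success. Hypothetical helper;
// not the real finalRegisterValueVerifier.
final class FinalValueCheck {
    static void verify(long successfulIncrements, long observedValue) {
        if (observedValue != successfulIncrements) {
            throw new IllegalStateException(
                "successfully completed all [" + successfulIncrements
                    + "] atomic increments but reading the register yielded ["
                    + observedValue + "]; this may indicate an atomicity failure"
                    + " amongst concurrent compare-and-exchange operations"
            );
        }
    }
}
```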

@ywangd (Member)

OK, I see: the content doubles as the count as well, since it starts from 0 and increments by 1 each time. Thanks for explaining.

@DaveCTurner (Contributor, author)

I opened #131274

@ywangd (Member) left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 6f55796 into elastic:main Jul 15, 2025
33 checks passed
@DaveCTurner DaveCTurner deleted the 2025/07/14/repo-analysis-lost-increment-message branch July 15, 2025 07:32
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 17, 2025
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 17, 2025