cli: TestHalfOnlineLossOfQuorumRecovery failed #124386

Closed
cockroach-teamcity opened this issue May 18, 2024 · 9 comments · Fixed by #125094
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team X-duplicate Closed as a duplicate of another issue.

@cockroach-teamcity
Member

cockroach-teamcity commented May 18, 2024

cli.TestHalfOnlineLossOfQuorumRecovery failed on master @ 93ad913106b6f0f6ec98bc2cfa788ff6d8085bd4:

=== RUN   TestHalfOnlineLossOfQuorumRecovery
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery557329092
    test_log_scope.go:81: use -show-logs to present logs inline
[debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:34717 --plan=/var/lib/engflow/worker/work/3/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery916069754/recovery-plan.json]
    debug_recover_loss_of_quorum_test.go:535: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:535
        	Error:      	"debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:34717 --plan=/var/lib/engflow/worker/work/3/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery916069754/recovery-plan.json\nERROR: failed to retrieve replica info from cluster: rpc error: code = Unknown desc = failed retrieving replicas from node n1 during fan-out: recv msg error: grpc: key /Local/RangeID/1/r/RangeGCThreshold does not have /Local/Range prefix [code 2/Unknown]\n" does not contain "- node n1"
        	Test:       	TestHalfOnlineLossOfQuorumRecovery
        	Messages:   	planner didn't provide correct apply instructions
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery557329092
--- FAIL: TestHalfOnlineLossOfQuorumRecovery (31.38s)

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=14
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-38861

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels May 18, 2024
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV May 18, 2024
@kvoli kvoli added the X-duplicate Closed as a duplicate of another issue. label May 20, 2024
@cockroach-teamcity
Member Author

cli.TestHalfOnlineLossOfQuorumRecovery failed on master @ d146ecff6f687e438706cf63591cafca60cc116d:

=== RUN   TestHalfOnlineLossOfQuorumRecovery
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery3317157959
    test_log_scope.go:81: use -show-logs to present logs inline
[debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:43875 --plan=/var/lib/engflow/worker/work/2/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery1639678948/recovery-plan.json]
    debug_recover_loss_of_quorum_test.go:535: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:535
        	Error:      	"debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:43875 --plan=/var/lib/engflow/worker/work/2/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery1639678948/recovery-plan.json\nERROR: failed to retrieve replica info from cluster: rpc error: code = Unknown desc = failed retrieving replicas from node n1 during fan-out: recv msg error: grpc: key /Local/RangeID/1/r/RangeGCThreshold does not have /Local/Range prefix [code 2/Unknown]\n" does not contain "- node n1"
        	Test:       	TestHalfOnlineLossOfQuorumRecovery
        	Messages:   	planner didn't provide correct apply instructions
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery3317157959
--- FAIL: TestHalfOnlineLossOfQuorumRecovery (46.72s)

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=14

@cockroach-teamcity
Member Author

cli.TestHalfOnlineLossOfQuorumRecovery failed on master @ 0ea8b622c5354a891cfb867d8d872a0746ba77b3:

=== RUN   TestHalfOnlineLossOfQuorumRecovery
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery4224013447
    test_log_scope.go:81: use -show-logs to present logs inline
[debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:37559 --plan=/var/lib/engflow/worker/work/0/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery3625772260/recovery-plan.json]
    debug_recover_loss_of_quorum_test.go:535: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:535
        	Error:      	"debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:37559 --plan=/var/lib/engflow/worker/work/0/exec/_tmp/2dcc0a708d6e811d98e64cdd1ab12ba9/TestHalfOnlineLossOfQuorumRecovery3625772260/recovery-plan.json\nERROR: failed to retrieve replica info from cluster: rpc error: code = Unknown desc = failed retrieving replicas from node n1 during fan-out: recv msg error: grpc: key /Local/RangeID/1/r/RangeGCThreshold does not have /Local/Range prefix [code 2/Unknown]\n" does not contain "- node n1"
        	Test:       	TestHalfOnlineLossOfQuorumRecovery
        	Messages:   	planner didn't provide correct apply instructions
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery4224013447
--- FAIL: TestHalfOnlineLossOfQuorumRecovery (32.06s)

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=14

@rickystewart
Collaborator

Maybe this should be skipped under race?

@cockroach-teamcity
Member Author

cli.TestHalfOnlineLossOfQuorumRecovery failed on master @ e5f5597b1c1570fff03ee6dfd4a1e8be65c0e008:

=== RUN   TestHalfOnlineLossOfQuorumRecovery
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery1927781118
    test_log_scope.go:81: use -show-logs to present logs inline
[debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:40383 --plan=/var/lib/engflow/worker/work/2/exec/_tmp/627c07b75a1b49827ad9adf3e9f30827/TestHalfOnlineLossOfQuorumRecovery4013921476/recovery-plan.json]
    debug_recover_loss_of_quorum_test.go:535: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/cli/debug_recover_loss_of_quorum_test.go:535
        	Error:      	"debug recover make-plan --confirm=y --certs-dir=test_certs --host=127.0.0.1:40383 --plan=/var/lib/engflow/worker/work/2/exec/_tmp/627c07b75a1b49827ad9adf3e9f30827/TestHalfOnlineLossOfQuorumRecovery4013921476/recovery-plan.json\nERROR: failed to retrieve replica info from cluster: rpc error: code = Unknown desc = failed retrieving replicas from node n1 during fan-out: recv msg error: grpc: key /Local/RangeID/1/r/RangeGCThreshold does not have /Local/Range prefix [code 2/Unknown]\n" does not contain "- node n1"
        	Test:       	TestHalfOnlineLossOfQuorumRecovery
        	Messages:   	planner didn't provide correct apply instructions
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestHalfOnlineLossOfQuorumRecovery1927781118
--- FAIL: TestHalfOnlineLossOfQuorumRecovery (32.19s)

Parameters:

  • attempt=1
  • race=true
  • run=3
  • shard=14

@andrewbaptist andrewbaptist added the P-2 Issues/test failures with a fix SLA of 3 months label Jun 3, 2024
@nvanbenschoten
Member

The timing of this lines up with #123959 and the error looks related:

recv msg error: grpc: key /Local/RangeID/1/r/RangeGCThreshold does not have /Local/Range prefix [code 2/Unknown]\n" does not contain "- node n1"

@RaduBerinde can you take a look when you get a chance?

@nvanbenschoten nvanbenschoten added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team and removed T-kv KV Team labels Jun 3, 2024
@blathers-crl blathers-crl bot added this to Incoming in Storage Jun 3, 2024
@nvanbenschoten nvanbenschoten removed this from roachtest/unit test backlog in KV Jun 3, 2024
@RaduBerinde
Member

RaduBerinde commented Jun 4, 2024

Interesting. We set UpperBound: keys.LocalRangeMax on the iterator, so I'm not sure how we can see a /Local/RangeID key.

@RaduBerinde
Member

Oh, /Local/RangeID keys sort before /Local/Range keys, so the upper bound does not exclude them. That is also odd, since we are seeking to LocalRangePrefix and to keys made by RangeDescriptorKey...

@jbowens jbowens moved this from Incoming to In Progress (this milestone) in Storage Jun 4, 2024
@RaduBerinde
Member

This is very strange. With some debugging info added:

panic: key: /Local/RangeID/1/r/RangeGCThreshold  "\x01i\x89rlgc-"  Last SeekGE() key: empty timestamp case /Local/Range/Table/Max/RangeDescriptor/1717532590.198522999,2147483647  "\x01k\x12\xfa\x00\x01rdsc"  NextKey() calls since: 0

We have an MVCCIterator that we SeekGE to \x01k\x12\xfa\x00\x01rdsc. The iterator returns the key "\x01i\x89rlgc-", which is smaller than the seek key. This is a problem in MVCCIterator or inside Pebble.

@RaduBerinde
Member

Oh, we are using unsafeMVCCIterator, which explains why we see this only in race mode.

craig bot pushed a commit that referenced this issue Jun 5, 2024
125094: kvstorage: fix key aliasing during range descriptor iteration r=RaduBerinde a=RaduBerinde

In one case, we were passing the key returned by the `MVCCIterator`
back to `SeekGE`. This is not legal, and causes the
`unsafeMVCCIterator` (used in race mode) to trigger a failure.

Fixes: #124386
Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
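
The aliasing problem described in the commit message can be illustrated with a toy iterator. This is a sketch under stated assumptions, not the CockroachDB code: `toyIter`, `newIter`, `reseekAliased`, and `reseekCloned` are hypothetical names, and the scrambling behavior is only assumed to resemble what `unsafeMVCCIterator` does in race builds. The toy iterator returns keys backed by an internal buffer and overwrites that buffer on every `SeekGE`, so a caller that passes a returned key straight back in ends up seeking to garbage.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// toyIter is a hypothetical iterator whose returned keys are only valid
// until the next positioning call.
type toyIter struct {
	keys [][]byte // sorted keys
	buf  []byte   // backing storage for the last returned key
}

// SeekGE returns the first key >= target, or nil if there is none.
func (it *toyIter) SeekGE(target []byte) []byte {
	// Scramble the previously returned key. If target aliases it.buf,
	// target is overwritten as well.
	for i := range it.buf {
		it.buf[i] = 0
	}
	i := sort.Search(len(it.keys), func(i int) bool {
		return bytes.Compare(it.keys[i], target) >= 0
	})
	if i == len(it.keys) {
		return nil
	}
	it.buf = append(it.buf[:0], it.keys[i]...)
	return it.buf
}

func newIter() *toyIter {
	return &toyIter{keys: [][]byte{[]byte("a"), []byte("k"), []byte("z")}}
}

// reseekAliased passes the returned key straight back to SeekGE (the bug).
func reseekAliased() string {
	it := newIter()
	k := it.SeekGE([]byte("k"))
	return string(it.SeekGE(k))
}

// reseekCloned copies the returned key before seeking again (the fix pattern).
func reseekCloned() string {
	it := newIter()
	k := it.SeekGE([]byte("k"))
	return string(it.SeekGE(append([]byte(nil), k...)))
}

func main() {
	fmt.Println(reseekAliased()) // a
	fmt.Println(reseekCloned())  // k
}
```

In the aliased case the seek target is zeroed before the search runs, so the iterator lands on a key smaller than the one the caller intended, which matches the symptom reported in this issue. Copying the key first keeps the target independent of the iterator's buffer.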
@craig craig bot closed this as completed in 07aede1 Jun 5, 2024
Projects
Status: Done
Storage
  
In Progress (this milestone)

6 participants