Allocate_stale_primary appears to succeed on wrong node #37098

@DaveCTurner

Today, if one issues an allocate_stale_primary reroute command asking for the primary to be allocated on a node that does not hold a stale copy of the shard in question, the reroute command still returns 200 OK. The recovery subsequently fails, of course, because there is no copy of the shard from which to recover:

[2019-01-03T08:48:16,731][WARN ][o.e.c.r.a.AllocationService] [node-0] failing shard [failed shard, shard [i][1], node[0ZoVYAp6TzC6VjGW7qo2-w], [P], recovery_source[existing store recovery; bootstrap_history_uuid=true], s[INITIALIZING], a[id=H9r7KK24TkuzUz6OdNzwlg], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-01-03T08:47:40.487Z], failed_attempts[1], delayed=false, details[failed shard on node [0ZoVYAp6TzC6VjGW7qo2-w]: failed recovery, failure RecoveryFailedException[[i][1]: Recovery failed on {node-1}{0ZoVYAp6TzC6VjGW7qo2-w}{psSlOmPKQZeVqw63jcncyA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: FileNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/Users/davidturner/discuss/162719/elasticsearch-6.5.1/data-1/nodes/0/indices/YPnPuK8hRnyDHjzM2EhLtQ/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@46a205e7)): files: []]; ], allocation_status[no_valid_shard_copy]], message [failed recovery], failure [RecoveryFailedException[[i][1]: Recovery failed on {node-1}{0ZoVYAp6TzC6VjGW7qo2-w}{psSlOmPKQZeVqw63jcncyA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: FileNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/Users/davidturner/discuss/162719/elasticsearch-6.5.1/data-1/nodes/0/indices/YPnPuK8hRnyDHjzM2EhLtQ/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@46a205e7)): 
files: []]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [i][1]: Recovery failed on {node-1}{0ZoVYAp6TzC6VjGW7qo2-w}{psSlOmPKQZeVqw63jcncyA}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2139) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.5.1.jar:6.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:389) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1645) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2135) ~[elasticsearch-6.5.1.jar:6.5.1]
	... 4 more
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:374) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1645) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2135) ~[elasticsearch-6.5.1.jar:6.5.1]
	... 4 more
Caused by: java.io.FileNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/Users/davidturner/discuss/162719/elasticsearch-6.5.1/data-1/nodes/0/indices/YPnPuK8hRnyDHjzM2EhLtQ/1/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@46a205e7)): files: []
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:131) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:201) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:186) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:364) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1645) ~[elasticsearch-6.5.1.jar:6.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2135) ~[elasticsearch-6.5.1.jar:6.5.1]
	... 4 more
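
Until the reroute command itself rejects such a request, a caller could check for an on-disk copy before sending it. The following is a minimal sketch, not the fix for this issue: it assumes the response of `GET /<index>/_shard_stores` has already been fetched and parsed into a dict, and that each store entry holds the node's ID as a key alongside `allocation_id`, `allocation` and, optionally, `store_exception`. The helper names `nodes_holding_copy` and `stale_primary_command` are hypothetical.

```python
# Keys in a shard-store entry that are not node IDs (assumed response shape).
_NON_NODE_KEYS = {"allocation_id", "allocation", "store_exception", "legacy_version"}


def nodes_holding_copy(shard_stores, index, shard):
    """Return the names of nodes that report a readable copy of the shard."""
    stores = (
        shard_stores.get("indices", {})
        .get(index, {})
        .get("shards", {})
        .get(str(shard), {})
        .get("stores", [])
    )
    names = set()
    for store in stores:
        if "store_exception" in store:
            continue  # a copy was found but could not be opened; skip it
        for key, value in store.items():
            if key not in _NON_NODE_KEYS and isinstance(value, dict):
                # The node's details sit under its ID; fall back to the ID
                # itself if no name is reported.
                names.add(value.get("name", key))
    return names


def stale_primary_command(shard_stores, index, shard, node):
    """Build a reroute body, refusing nodes that hold no copy of the shard."""
    if node not in nodes_holding_copy(shard_stores, index, shard):
        raise ValueError(
            "node [%s] holds no copy of [%s][%d]" % (node, index, shard)
        )
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }
```

The returned dict is the JSON body that would be sent with `POST /_cluster/reroute`. In the scenario above, asking for shard 1 of index `i` on `node-1` would raise `ValueError` on the client instead of producing a 200 OK followed by a failed recovery.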

Labels

:Distributed Coordination/Allocation, >bug, v6.5.1
