Investigate Argus crash on 2024-03-21 #5111

Open
ignazio-bovo opened this issue Mar 21, 2024 · 2 comments
Labels
argus Argus distributor node

Comments

@ignazio-bovo
Contributor

TLDR

A ping at 7:40 am CET on 2024-03-21 revealed that multiple nodes had crashed.

Image

@ignazio-bovo ignazio-bovo self-assigned this Mar 21, 2024
@ignazio-bovo ignazio-bovo added the argus Argus distributor node label Mar 21, 2024
@kdembler
Member

2024-03-21 06:41:38:4138 StorageNodeApi error: Request timeout of 5000ms reached
{
    "0": {
        "endpoint": "https://sieemmastorage.com/storage/api/v1"
    },
    "timeoutMs": 5000,
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 StorageNodeApi error: Unexpected error while requesting data object
{
    "0": {
        "endpoint": "https://sieemmastorage.com/storage/api/v1"
    },
    "objectId": "2513914",
    "err": {
        "message": "Request timeout"
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 NetworkingManager error: Data object download failed
{
    "err": {
        "message": "Failed to download object 2513914 from any availablable storage provider",
        "stack": "Error: Failed to download object 2513914 from any availablable storage provider\n    at fail (/joystream/distributor-node/lib/services/networking/NetworkingService.js:224:24)\n    at Queue.<anonymous> (/joystream/distributor-node/lib/services/networking/NetworkingService.js:265:21)\n    at Queue.emit (node:events:517:28)\n    at Queue.done (/joystream/node_modules/queue/index.js:194:8)\n    at next (/joystream/node_modules/queue/index.js:118:16)\n    at /joystream/node_modules/queue/index.js:150:14\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at runNextTicks (node:internal/process/task_queues:64:3)\n    at listOnTimeout (node:internal/timers:538:9)\n    at process.processTimers (node:internal/timers:512:7)"
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 PublicApi error: middlewareError
{
    "err": {
        "message": "Failed to download object 2513914 from any availablable storage provider",
        "stack": "Error: Failed to download object 2513914 from any availablable storage provider\n    at fail (/joystream/distributor-node/lib/services/networking/NetworkingService.js:223:25)\n    at Queue.<anonymous> (/joystream/distributor-node/lib/services/networking/NetworkingService.js:265:21)\n    at Queue.emit (node:events:517:28)\n    at Queue.done (/joystream/node_modules/queue/index.js:194:8)\n    at next (/joystream/node_modules/queue/index.js:118:16)\n    at /joystream/node_modules/queue/index.js:150:14\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at runNextTicks (node:internal/process/task_queues:64:3)\n    at listOnTimeout (node:internal/timers:538:9)\n    at process.processTimers (node:internal/timers:512:7)"
    },
    "req": {
        "url": "/api/v1/assets/2513914",
        "method": "GET",
        "httpVersion": "1.1",
        "originalUrl": "/api/v1/assets/2513914",
        "query": {}
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 PublicApi http: HTTP GET /api/v1/assets/2513914
{
    "meta": {},
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}

<--- Last few GCs --->

[7:0x57ad870] 121787624 ms: Mark-sweep 4042.3 (4129.7) -> 4038.5 (4126.1) MB, 1341.4 / 0.0 ms  (average mu = 0.242, current mu = 0.079) allocation failure; scavenge might not succeed
[7:0x57ad870] 121789771 ms: Mark-sweep 4056.1 (4127.9) -> 4054.2 (4157.8) MB, 2132.2 / 0.0 ms  (average mu = 0.139, current mu = 0.007) allocation failure; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb95b60 node::Abort() [node]
 2: 0xa9a7f8  [node]
 3: 0xd6f2f0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xd6f697 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xf4cba5  [node]
 6: 0xf5f08d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf3978e v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xf3ab57 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xf1bd2a v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x12e114f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11: 0x170deb9  [node]
/joystream/distributor-node/runner.sh: line 8:     7 Aborted                 (core dumped) node --require @joystream/opentelemetry ./bin/run $*
Loaded Application Instrumentation: "Distributor Node"
Starting tracing..

There are hundreds of log entries within the exact same second, all about failing to download an object. I think there may be an infinite loop or recursion somewhere in there, and the process simply runs out of memory.
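
One way to check the "hundreds of entries in the same second" observation is to tally error entries per second and per object ID from a saved log. The following is a minimal sketch, not part of the Argus codebase: `summariseLog` and its patterns are hypothetical and assume the entry layout shown in the log above (a header line with timestamp, component and level, followed by a JSON body). Only entries that carry an `"objectId"` field are counted per object.

```typescript
// Hypothetical helper (not Argus code): summarise a distributor-node log to
// see how many error entries share a second and which object IDs they name.

// Matches ANSI colour sequences such as ESC[31m emitted by the logger.
const ANSI_PATTERN = /\u001b\[[0-9;]*m/g;

// Matches an entry header, e.g.
// "2024-03-21 06:41:38:4138 StorageNodeApi error: Request timeout ..."
const HEADER_PATTERN = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\d+ (\S+) (\w+): /;

// Matches the "objectId" field inside the JSON body of an entry.
const OBJECT_ID_PATTERN = /"objectId":\s*"(\d+)"/;

interface LogSummary {
  errorsPerSecond: Record<string, number>;
  errorsPerObject: Record<string, number>;
}

function summariseLog(rawLog: string): LogSummary {
  const errorsPerSecond: Record<string, number> = {};
  const errorsPerObject: Record<string, number> = {};
  let inErrorEntry = false;

  for (const rawLine of rawLog.split("\n")) {
    const line = rawLine.replace(ANSI_PATTERN, "");
    const header = HEADER_PATTERN.exec(line);
    if (header) {
      const second = header[1];
      // Track whether the following JSON body belongs to an error entry.
      inErrorEntry = header[3] === "error";
      if (inErrorEntry) {
        errorsPerSecond[second] = (errorsPerSecond[second] ?? 0) + 1;
      }
      continue;
    }
    const objectId = OBJECT_ID_PATTERN.exec(line);
    if (inErrorEntry && objectId) {
      const id = objectId[1];
      errorsPerObject[id] = (errorsPerObject[id] ?? 0) + 1;
    }
  }
  return { errorsPerSecond, errorsPerObject };
}
```

If a single object ID dominates `errorsPerObject` within one second, that would be consistent with a retry loop on that object; an even spread across many IDs would point more towards a burst of independent requests.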

@ignazio-bovo
Contributor Author

This looks very hard to reproduce.
I also suspect that the download is failing because there isn't sufficient heap space for the file to be held in memory before it is saved to disk (or something along those lines), so the root cause might be somewhere else, as pointed out by Klaudiusz.
I would leave this issue open and, if the error recurs often (at least once per week), proceed with a proper investigation. I won't do anything in the meantime, as this looks very time-consuming to reproduce and I am not the one who wrote the Argus code.
Let me know what you think @kdembler @zeeshanakram3
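
While the issue stays open, a lightweight heap check could make a recurrence easier to diagnose. The GC lines above show the heap at roughly 4,040 to 4,056 MB right before the abort, so logging usage against the V8 heap limit ahead of time might reveal whether memory grows gradually or spikes. This is a minimal sketch using Node's built-in `v8.getHeapStatistics()`; the names `readHeap` and `watchHeap`, the threshold and the interval are assumptions for illustration, not existing Argus code.

```typescript
import { getHeapStatistics } from "node:v8";

// Hypothetical shape for one heap measurement (not an Argus type).
interface HeapReading {
  usedMb: number;
  limitMb: number;
  usedRatio: number; // used heap as a fraction of the V8 heap limit
}

// Read the current V8 heap usage and the configured heap limit.
function readHeap(): HeapReading {
  const stats = getHeapStatistics();
  const toMb = (bytes: number): number => bytes / (1024 * 1024);
  return {
    usedMb: toMb(stats.used_heap_size),
    limitMb: toMb(stats.heap_size_limit),
    usedRatio: stats.used_heap_size / stats.heap_size_limit,
  };
}

// Poll the heap and invoke `onHigh` whenever usage reaches the threshold.
// Both `thresholdRatio` and `intervalMs` are illustrative parameters.
function watchHeap(
  thresholdRatio: number,
  intervalMs: number,
  onHigh: (reading: HeapReading) => void
): NodeJS.Timeout {
  const timer = setInterval(() => {
    const reading = readHeap();
    if (reading.usedRatio >= thresholdRatio) {
      onHigh(reading);
    }
  }, intervalMs);
  // Do not keep the process alive only for monitoring.
  timer.unref();
  return timer;
}
```

A call such as `watchHeap(0.8, 10_000, (r) => console.warn("heap high", r))` would log a warning every ten seconds while usage is above 80% of the limit. Node also has a `--heapsnapshot-near-heap-limit` CLI option that may be worth trying for this kind of crash, although whether it fits the `runner.sh` setup shown in the log is untested here.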
