
Storage Squid integration with Argus/Colossus #5001

Merged
merged 35 commits on Jan 23, 2024

Conversation

zeeshanakram3
Contributor

@zeeshanakram3 zeeshanakram3 commented Dec 13, 2023

closes #4957

This PR:

@kdembler kdembler self-assigned this Dec 21, 2023
@kdembler kdembler self-requested a review December 21, 2023 19:14
@mnaamani mnaamani added colossus argus Argus distributor node labels Dec 23, 2023
Member

@kdembler kdembler left a comment

Amazing to see this realized, great work @zeeshanakram3

I'll try to get to a proper code review later; for now I tried Argus out and I'm seeing some problems:

  1. Getting {"type":"exception","message":"Unexpected end of JSON input"} from /status when squid is not available
  2. On start (serving bucket 0:0) I'm seeing
2023-12-28 22:59:51:5951 ContentService info: ContentService initializing...
{
    "supportedObjects": 0,
    "filesCountOnStartup": 0,
    "cacheItemsCountOnStartup": 0
}
2023-12-28 22:59:51:5951 ContentService info: ContentService initialized
{
    "filesDropped": 0,
    "cacheItemsDropped": 0,
    "contentSizeSum": 0
}

I expected supportedObjects to show the proper number. I can fetch objects fine, though.

@@ -1,6 +1,6 @@
id: test-node
endpoints:
queryNode: http://localhost:8081/graphql
queryNode: http://localhost:4352/graphql
Member

I think we should make it clear that this is not the query node anymore. I suggest renaming it to storageSquid or just squid.

Contributor

@mnaamani mnaamani Jan 4, 2024

I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?

Contributor Author

I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?

@mnaamani yes.

Member

I don't mind the name so much.

It's not purely cosmetic. I'm thinking about operators upgrading and possibly forgetting to update this config value. If we leave the current name and somebody keeps the old value, they will try to query the old QN, which will most likely lead to weird GraphQL errors that aren't very readable or clear.

If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.

Contributor

If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.

I think that is a good argument actually.
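The fail-fast argument above can be sketched in a few lines. This is a hypothetical helper, not the actual Argus config schema; the key name `storageSquid` and the function `requireEndpoint` are illustrative assumptions.

```typescript
// Hypothetical sketch: failing fast on a missing config key after a rename.
type Endpoints = Record<string, string | undefined>

function requireEndpoint(endpoints: Endpoints, key: string): string {
  const value = endpoints[key]
  if (value === undefined || value === '') {
    // An operator who kept the old `queryNode` key gets an explicit,
    // actionable error instead of opaque GraphQL failures against the old QN.
    throw new Error(`Missing required config key: endpoints.${key}`)
  }
  return value
}
```

With an outdated config like `{ queryNode: '...' }`, asking for `storageSquid` throws immediately at startup instead of failing later at query time.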

Contributor

@mnaamani mnaamani left a comment

Great work. I left a few comments.

I see the annoying `curator moderation actions` test failing often. I think this started happening more frequently when we introduced the accept-pending service, because of the delay between an object being uploaded and its accepted status being updated?

.env Outdated
SQUID_DB_PORT=23332

# Processor service prometheus port
SQUID_PROCESSOR_PROMETHEUS_PORT=3337
Contributor

PROCESSOR_PROMETHEUS_PORT earlier in the file is also assigned 3337. But this is only a problem if the Orion processor is also running on the same host.

Contributor Author

Addressed in 36212f0

import { DataObjectDetailsFragment } from './query-node/generated/queries'
import { DistributionBucketOperatorStatus } from './query-node/generated/schema'
import { RuntimeApi } from './runtime/api'
import { StorageNodeApi } from './storage-node/api'
Contributor

I think we discussed this once, but it is hard to see what imports were added or removed when they are also being re-ordered.

Contributor Author

Reverted the re-ordering change

@@ -138,6 +138,7 @@
"prepack": "rm -rf lib && tsc -b && oclif-dev manifest && generate:all",
"test": "nyc --extension .ts mocha --forbid-only \"test/**/*.test.ts\"",
"version": "generate:docs:cli && git add docs/cli/*",
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",
Contributor

Suggested change
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",
"generate:schema:graphql": "docker inspect joystream/storage-squid:latest && docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",

Contributor

get-graphql-schema is exiting with code 0 when the graphql endpoint is unreachable:

FetchError: request to http://localhost:4352/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4352

In cases where docker run fails or the script doesn't connect, the resulting src/services/networking/query-node/schema.graphql is overwritten with no content.

This step is done manually and is not part of the automated build, but it should fail more gracefully.

I'm guessing that on my machine 5 seconds was not long enough for the graphql-server to start up.

Interactively it works fine:

docker run -it joystream/storage-squid:latest sh

Contributor Author

I have created a new shell script that only writes to the output file if the returned schema response is non-empty. Let me know if that's fine.

Added in 3a6b0e4
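The guard described above could look roughly like this. This is an illustrative sketch, not the exact script added in 3a6b0e4; the function name and output path are assumptions.

```shell
#!/usr/bin/env bash
# Only overwrite the schema file when the fetched schema is non-empty;
# otherwise keep the existing file and fail loudly.
write_schema_if_nonempty() {
  local out="$1" schema
  schema=$(cat)
  if [ -n "$schema" ]; then
    printf '%s\n' "$schema" > "$out"
  else
    echo "Error: fetched schema is empty, keeping existing $out" >&2
    return 1
  fi
}
```

Usage would be piping the fetch into the guard, e.g. `docker run --rm joystream/storage-squid:latest npm run get-graphql-schema | write_schema_if_nonempty schema.graphql`, so a failed docker run or unreachable endpoint can no longer clobber the checked-in schema.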

- DB_PORT=${SQUID_DB_PORT}
- DB_HOST=${SQUID_DB_HOST}
- GQL_PORT=${SQUID_GQL_PORT}
- ARCHIVE_GATEWAY_URL=${SQUID_ARCHIVE_GATEWAY_URL}
Contributor

SQUID_ARCHIVE_GATEWAY_URL is empty in .env; what is the default if ARCHIVE_GATEWAY_URL is empty? Does it sync directly from the chain using the RPC endpoint?

Contributor

Answering myself: Yes :)

tests/network-tests/start-storage.sh (thread resolved)

echo "Staring storage infrastructure"

HOST_IP=`$THIS_DIR/get-host-ip.sh`
# Start Storage-Squid
docker-compose -f $THIS_DIR/../../docker-compose.storage-squid.yml up -d
Contributor

I think we want to also start the storage-squid in project root
start.sh and start-multi-storage.sh

Contributor Author

Addressed in 8c03814

@@ -65,6 +67,12 @@ export class NetworkingService {
this.downloadQueue.on('error', (err) => {
this.logger.error('Data object download failed', { err })
})

this.initRuntimeApi().catch((err) => this.logger.error('Runtime API initialization failed:', { err }))
Contributor

I suppose it is not critical for runtimeApi to be functioning, as the only place I see it used is in the public API endpoint that reports status. But if the API is not initialized, the endpoint is called, and getQueryNodeStatus() throws, we should ensure the request handler catches it and returns HTTP 503 rather than letting an unhandled exception crash the express server process.

Contributor Author

we should ensure the request handler catches it and returns HTTP 503 rather than letting an unhandled exception crash the express server process.

I don't think this is currently happening for the distributor-node (though yes, the storage-node is crashing); if getQueryNodeStatus() throws, it is caught by the express error handler middleware.

Member

if getQueryNodeStatus() throws, then it is being caught by the express error handler middleware.

Was this added recently? Referencing my previous comment:

Getting {"type":"exception","message":"Unexpected end of JSON input"} from /status when squid is not available

Member

Hmm, actually it is handled and Argus doesn't crash, so I guess we just need a better error message in that case.
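The "better error message" idea could be sketched as a small mapping from low-level squid connection failures to a readable /status error. The helper name `describeSquidError` and the string heuristics are assumptions for illustration, not the actual Argus code.

```typescript
// Hypothetical sketch: translate a squid connection failure into a
// readable /status error instead of forwarding the raw parse error.
interface StatusError {
  httpStatus: number
  message: string
}

function describeSquidError(err: unknown): StatusError {
  const raw = err instanceof Error ? err.message : String(err)
  if (raw.includes('ECONNREFUSED') || raw.includes('Unexpected end of JSON input')) {
    // The raw JSON parse error is what the operator currently sees; a 503
    // with an explicit message is far easier to act on.
    return { httpStatus: 503, message: 'Storage squid GraphQL endpoint is unreachable' }
  }
  return { httpStatus: 500, message: raw }
}
```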

this.apolloClient = new ApolloClient({
link: splitLink,
link: from([errorLink, new HttpLink({ uri: this.config.endpoints.queryNode, fetch })]),
Contributor

I don't see where this from() function is imported. Is it TypeScript/ES6 syntax?

Contributor Author

Contributor

👍 thanks, I guess it was hard to spot because of the many occurrences of the from keyword.

@@ -158,7 +159,7 @@ export async function getStatus(req: express.Request, res: express.Response<Stat
downloadBuckets,
sync,
cleanup,
queryNodeStatus: await getQueryNodeStatus(qnApi),
queryNodeStatus: await getQueryNodeStatus(api, qnApi),
Contributor

How does express handle the response if we get an uncaught exception here? Is it handled by

app.use((err: Error, req: express.Request, res: express.Response, next: express.NextFunction) => {
?

Contributor

I guess this is a similar point made in #5001 (review)

Contributor Author

Actually, the error handler middleware is not really used/called in the storage-node (even in all previous versions), so whenever an unhandled exception occurred in a route handler, it crashed the app.

If an async route handler throws, it's the responsibility of the app to pass the error to the next() function, which will then be handled by the default error handler.

I have fixed this for the storage-node; for the distributor-node this is not a problem, as errors thrown from route handlers are already passed to the next() function.

Contributor Author

Addressed in e2ce216

@kdembler
Member

I'd suggest doing it on a server so that it indexes all the time. Locally (only syncing while my laptop was on) it didn't happen to me either.

After your comment I dropped my squid and started a fresh sync, and it crashed again.

@kdembler
Member

@zeeshanakram3 one more question: do you think it would be a lot of effort to include storage squid version in argus&colossus status endpoints? #4732

@zeeshanakram3
Contributor Author

@zeeshanakram3 one more question: do you think it would be a lot of effort to include storage squid version in argus&colossus status endpoints? #4732

I will implement it.

Contributor

@mnaamani mnaamani left a comment

Some more code to drop (upload-authentication related), and a couple of suggestions on the docker-compose config for squid.

networks:
- joystream
ports:
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
Contributor

Suggested change
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
- '127.0.0.1:${SQUID_DB_PORT}:${SQUID_DB_PORT}'

To prevent the DB from being exposed on a public interface?

Contributor Author

Addressed.

ports:
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
command: ['postgres', '-c', 'config_file=/etc/postgresql/postgresql.conf', '-p', '${SQUID_DB_PORT}']
shm_size: 1g
Contributor

Should this be a bit more than what postgres tries to allocate as configured in postgresql.conf?

Contributor Author

Addressed.

squid_db:
container_name: squid_db
hostname: squid-db
image: postgres:14
Contributor

Does squid work with postgres:16? Is there any downside to using the latest if it's compatible?

Contributor Author

Yeah, it should be safe to upgrade to postgres:16, so I changed it.

// Delete the stale objects from the pending folder
await Promise.allSettled(
deletedObjectIds.map(({ data: { dataObjectId } }) =>
fsPromises.unlink(path.join(this.pendingDataObjectsDir, dataObjectId))
Contributor

We should also remove the id from the pendingDataObjects array so we don't attempt to process it in the next step.

Contributor Author

If an object has been deleted from the runtime, then pendingDataObjects won't contain it in the first place, since the QN won't return it. Wouldn't that be the case?

Contributor

Correct, but the original array I'm referring to includes files picked up from the pending folder. So here we are deleting when we find that the object has been deleted by the runtime. Now, it is odd for such a file to be placed there, but nonetheless, if it was placed (or somehow uploaded again), we would delete it but later still try to move it to the uploads folder, because no matching bucket would be found for it.

Contributor

We might actually try to compute the IPFS hash of it first... but the general point still applies, correct?

Contributor

Sorry my bad, I confused it with the pendingIds array.
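The pruning suggested earlier in this thread (moot here once the arrays were untangled, but it illustrates the general point) amounts to one filter step. The function name `prunePending` is an illustrative assumption, not code from this PR.

```typescript
// Hypothetical sketch: after unlinking runtime-deleted objects from the
// pending folder, also drop their ids from the in-memory pending list so
// later steps don't try to move files that no longer exist on disk.
function prunePending(pendingIds: string[], deletedIds: Set<string>): string[] {
  return pendingIds.filter((id) => !deletedIds.has(id))
}
```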

@@ -82,7 +79,8 @@ export async function createApp(config: AppConfig): Promise<Express> {
validateRequests: true,
operationHandlers: {
basePath: path.join(__dirname, './controllers'),
resolver: OpenApiValidator.resolvers.modulePathResolver,
resolver: (basePath: string, route: RouteMetadata, apiDoc: OpenAPIV3.Document) =>
asyncHandler(OpenApiValidator.resolvers.modulePathResolver(basePath, route, apiDoc)),
Contributor

What is this change fixing?

Contributor Author

It catches errors thrown by async route handlers and passes them to the default error handler. Without this change, if any route handler (e.g. the handler for /status) threw an error, it would crash the application, since the error would not be caught anywhere.

It's the equivalent of following piece of code in distributor node
https://github.com/zeeshanakram3/joystream/blob/0b8c3f8bdd6539a408bfcf6d9d4539a53cbb7649/distributor-node/src/services/httpApi/HttpApiBase.ts#L27-L31
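The asyncHandler pattern referenced above can be sketched in a few lines. The express types are simplified here to keep the example self-contained; this is a minimal illustration of the idea, not the linked implementation.

```typescript
// Wrap an async route handler so a rejected promise (or a synchronous
// throw) is forwarded to next() -- and thus to the error-handling
// middleware -- instead of becoming an unhandled rejection.
type Next = (err?: unknown) => void
type Handler = (req: unknown, res: unknown, next: Next) => unknown

function asyncHandler(handler: Handler) {
  return (req: unknown, res: unknown, next: Next): Promise<unknown> =>
    // Promise.resolve().then(...) also captures synchronous throws.
    Promise.resolve()
      .then(() => handler(req, res, next))
      .catch(next)
}
```

Wrapping every resolved operation handler this way is what lets the default express error handler return a 500/503 instead of the process exiting.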

storage-node/src/services/webApi/app.ts (thread resolved)
})

return app
}

/**
Contributor

Since we are getting rid of these, we can also drop

    // when enabling upload auth ensure the keyring has the operator role key and set it here.
    const enableUploadingAuth = false
    const operatorRoleKey = undefined

in server.ts and the corresponding fields in the AppConfig type

Does it make sense, then, to also remove the relevant parts of the API spec openapi.yaml?

Contributor Author

Done in fd44d0d. Also removed relevant parts from the API spec file.

@@ -237,7 +237,7 @@ services:
"

indexer:
image: joystream/hydra-indexer:v5.0.0-alpha.1
image: hydra-indexer
Contributor

I think, like me, you were using a local build of hydra-indexer ;)

Contributor Author

Yeah ;), fixed that in 190daa0

@mnaamani mnaamani self-requested a review January 16, 2024 13:40
Contributor

@mnaamani mnaamani left a comment

Everything looks good. Final changes:

  • Fix the Prometheus port conflict in the docker compose file.
  • Bump the storage node version to v3.11.0 and the storage CLI package to v3.3.0.

- RPC_ENDPOINT=${JOYSTREAM_NODE_WS}

ports:
- '127.0.0.1:${PROCESSOR_PROMETHEUS_PORT}:${PROCESSOR_PROMETHEUS_PORT}'
Contributor

Change to use SQUID_PROCESSOR_PROMETHEUS_PORT instead. Otherwise services fail to start with:

Error response from daemon: driver failed programming external connectivity on endpoint squid_processor (e1a9bb40428326391a37414aa61f52dfd4817f99dc1662c278013f4780d7c979): Bind for 127.0.0.1:3337 failed: port is already allocated

When we start the network with start.sh or start-multistorage.sh, the Orion services start first and listen on the PROCESSOR_PROMETHEUS_PORT port.

Contributor Author

Done in 897c7c0

@zeeshanakram3
Contributor Author

  • Bump storage node version to v3.11.0 and the storage cli package to v3.3.0

@mnaamani regarding the package versions, shouldn't we bump the major versions in both Argus & Colossus, since we have a breaking change in configuration options (changed --queryNodeEndpoint to --storageSquidEndpoint)?

@mnaamani
Contributor

  • Bump storage node version to v3.11.0 and the storage cli package to v3.3.0

@mnaamani regarding the package versions, shouldn't we bump the major versions in both Argus & Colossus, since we have a breaking change in configuration options (changed --queryNodeEndpoint to --storageSquidEndpoint)?

That might actually make more sense yes.

@mnaamani
Contributor

Should we also consider forking the current master to keep the old version on a separate branch, so we can slowly transition all operators, in case we need to create an important patch while operators are still on the older version?

@zeeshanakram3
Contributor Author

Should we also consider forking the current master to keep the old version on a separate branch, so we can slowly transition all operators, in case we need to create an important patch while operators are still on the older version?

Yeah, makes sense

@zeeshanakram3
Contributor Author

Failing integration tests should be fixed by Joystream/storage-squid#13

@mnaamani mnaamani self-requested a review January 18, 2024 05:16
@kdembler
Member

@zeeshanakram3 could you reply to this? #5001 (comment) I think Argus (and Colossus) should still provide a valid response for the /status request even when the squid is down. Currently we're just forwarding the raw error to the user.

As a counterargument, we need that kind of handling for asset requests as well, not only for /status, and that currently isn't handled.

Member

@kdembler kdembler left a comment

Created followup for the thing I mentioned: #5056

Great work on this @zeeshanakram3, LGTM!

@mnaamani mnaamani merged commit 3b0965e into Joystream:master Jan 23, 2024
23 checks passed