Issue reported on discord regarding un-uploaded YPP videos #174

Open
zeeshanakram3 opened this issue Apr 2, 2023 · 3 comments

Comments

@zeeshanakram3
Contributor

Summary

An issue was reported: of the 109 videos submitted through YPP, 49 were not uploaded. I investigated, and it turns out that the failed uploads were a symptom of another, more serious issue: some of the videos were duplicated on Joystream.
The duplicates had two causes: a database configuration issue, and a lack of checks in the YT-synch backend for handling DB exceptions.

Explanation

Explanation of Database configuration issue

Before describing the DB configuration problem, here is some context on the DB architecture that the YT-synch service uses. YT-synch uses DynamoDB (an AWS cloud-based DB) to persist the records for channels and videos and their state. DynamoDB is a fully managed solution, so as an application developer you don’t have to manage any DB-related infrastructure such as servers, disk space, and memory specs, or upgrade the infrastructure when the read/write load grows.

However, as an application developer, you still have to specify how much throughput you need, and DynamoDB will automatically scale to meet that requirement. It provides two capacity modes for specifying this (one must be selected):

  • PROVISIONED: Specify the number of reads and writes per second that you require for your application (default)
  • ON_DEMAND: pay-per-request, scales automatically as the read/write requests increase
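To make the difference between the two modes concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: `ToyTable`, `count_throttled`, and the exception class are hypothetical names, not part of any AWS SDK, and the model ignores details of the real service such as burst capacity and item-size accounting.

```python
class ProvisionedThroughputExceeded(Exception):
    """Stand-in for DynamoDB's ProvisionedThroughputExceededException."""


class ToyTable:
    """Toy table with a fixed per-second write budget, or no budget at all."""

    def __init__(self, writes_per_second=None):
        # None models ON_DEMAND (no fixed budget); an integer models PROVISIONED.
        self.writes_per_second = writes_per_second
        self.used_in_current_second = 0
        self.items = {}

    def tick(self):
        """Advance the simulated clock by one second, resetting the budget."""
        self.used_in_current_second = 0

    def put(self, key, value):
        limit = self.writes_per_second
        if limit is not None and self.used_in_current_second >= limit:
            raise ProvisionedThroughputExceeded(f"write budget of {limit}/s exhausted")
        self.used_in_current_second += 1
        self.items[key] = value


def count_throttled(table, n_writes):
    """Attempt n_writes within one simulated second; return how many were rejected."""
    throttled = 0
    for i in range(n_writes):
        try:
            table.put(f"video-{i}", "New")
        except ProvisionedThroughputExceeded:
            throttled += 1
    return throttled
```

In this toy model, a table with a budget of 1 write per second rejects 47 of 48 back-to-back writes, while a table without a fixed budget accepts all of them.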

Actual Problem

A YouTube channel with ID UC1p45mMUW1ivJ2dVN7Eo_KQ signed up for the YPP program with 48 videos on 24/03/23. At that time the DynamoDB capacity mode was set to PROVISIONED with the absolute minimum read/write capacity (1 read, 1 write). These were the default values specified in the template that was used to set up the production tables.

After the channel was added, the YT-synch service 1) started downloading the videos by querying the DB for URLs, and eventually 2) created these videos by sending on-chain transactions. However, by the time the service finished step 2), the PROVISIONED capacity had already been exhausted, so no further reads or writes were possible. As a result, the service failed to commit the state of the video it had just created as VideoCreated, so the video was retried for on-chain creation, which led to duplicates. As the server logs show, the video state update operation failed with a ProvisionedThroughputExceededException error.

[image: server log showing the ProvisionedThroughputExceededException error]
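The failure sequence can be reproduced with a small simulation. This is a sketch of the mechanism only, not the service's code: `sync_video`, `run_passes`, and the video ID are hypothetical, and the on-chain transaction is modelled as appending to a list.

```python
class ProvisionedThroughputExceeded(Exception):
    """Stand-in for DynamoDB's ProvisionedThroughputExceededException."""


def sync_video(yt_video_id, db_state, chain, db_write_ok):
    """One sync pass: create the video on chain, then record that in the DB."""
    if db_state.get(yt_video_id) == "VideoCreated":
        return  # already synced, nothing to do
    chain.append(yt_video_id)  # side effect: the on-chain creation succeeds
    if not db_write_ok:
        # The DB rejects the state update, so the video still looks uncreated.
        raise ProvisionedThroughputExceeded("state update rejected")
    db_state[yt_video_id] = "VideoCreated"


def run_passes(yt_video_id, write_results):
    """Run one sync pass per entry in write_results (True = DB write succeeds)."""
    db_state = {yt_video_id: "New"}
    chain = []
    for db_write_ok in write_results:
        try:
            sync_video(yt_video_id, db_state, chain, db_write_ok)
        except ProvisionedThroughputExceeded:
            pass  # exception not acted upon: the video is simply retried later
    return chain, db_state
```

With two throttled state updates followed by a successful one, `run_passes("yt-video-1", [False, False, True, True])` leaves three on-chain copies of the same video, even though the DB ends up recording a single VideoCreated entry.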

YT-Synch error handling

Regardless of the unsuitable capacity configuration, the YT-synch service should have handled the ProvisionedThroughputExceededException error gracefully, so I investigated why this was not caught in either fault tolerance testing or community load testing.

  • For infra load testing, when the community set up the YT-synch service, they used a local instance of DynamoDB (not the real AWS-based DB). The local DynamoDB does not have this read/write capacity limitation. That's why, in community testing, one channel with 512 videos was added for syncing and got synced without any exceptions.

  • I looked at the fault tolerance QA plan for the reason for this discrepancy. Although it was pretty detailed and covered different failures of external APIs (e.g. RPC, QN, Storage Node & Google API), it did not have any test cases to mock/test database API failures.

Also, we didn't hit this issue in the YT-synch dev setup, which was used for a considerable amount of time, because its capacity was configured to ON_DEMAND.
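One possible shape for graceful handling, together with the kind of mocked database failure the QA plan was missing, is sketched below. All names here (`commit_with_retry`, `FlakyDb`, the exception class) are hypothetical illustrations and not the actual YT-synch implementation; the retry policy and its parameters are assumptions.

```python
import time


class ProvisionedThroughputExceeded(Exception):
    """Stand-in for DynamoDB's ProvisionedThroughputExceededException."""


def commit_with_retry(write, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call write() until it succeeds, backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        try:
            return write()
        except ProvisionedThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # give up; the caller must not assume the video is uncreated
            sleep(base_delay * 2 ** attempt)


class FlakyDb:
    """Test double that rejects the first `failures` writes, then accepts."""

    def __init__(self, failures):
        self.failures = failures
        self.state = {}

    def set_state(self, video_id, state):
        if self.failures > 0:
            self.failures -= 1
            raise ProvisionedThroughputExceeded("throttled")
        self.state[video_id] = state
        return state
```

A test can inject a recording function for `sleep`, for example `commit_with_retry(lambda: db.set_state("v1", "VideoCreated"), sleep=delays.append)`, so that the throttling path runs without real waiting and without a real AWS table.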

Problem resolution

The problem eventually resolved itself. As some videos were successfully created, the number of videos whose state needed to be periodically queried and updated went down, so the reads/writes used stayed within the capacity limits, and the state of newly created videos was successfully committed to the DB.

State of the Affected Channel (UC1p45mMUW1ivJ2dVN7Eo_KQ)

This table lists the duplicate videos (~28) of the affected channel. The first column lists the YouTube video IDs, the second column shows the count of each video (including duplicates), and the third column shows the Joystream video IDs of the duplicates (these duplicate videos were only created on chain; their assets could not be uploaded to storage nodes).

YouTube video ID   Count   Duplicate videos (Joystream video IDs)
7-j8LFh0uxU        3       1154, 1246
jItgCuucmMg        2       1222
aLj8TDX1ly4        2       1251
8gXXQhKAFN0        2       1228
rtPRsFCBhEM        2       1249
gpmLDQc6rz8        2       1236
C8qzmthTL4Y        4       1193, 1191
YgemQs1iaW8        3       1235
vkgY-bKUGxU        6       1242, 1224, 1192, 1185, 1162
IbKLZUbZFvo        3       1219, 1207
Znkln1MkhcE        3       1175, 1157
GsK-O-jAtGU        3       1229, 1225
wnExEF82GDY        3       1218
ypjTseOJTJE        3       1213, 1161
EYKMYXI9ySo        2       1197
Ryk1_zlacgQ        5       1217, 1212, 1204, 1184
JwfuhwtaMIY        7       1205, 1200, 1187, 1179, 1152
Xp16O9L8_L4        3       1206, 1198
XVBK_RDq8aA        4       1214, 1181
AWfLROY8OVM        2       1174
eXQmLM_IoKw        2       1166
OZY04_g_doU        2       1153
iy0Ful_LLvo        3       1168, 1150
nTxMX-320Zg        3       1186
RJC-kONbKpc        2       1176
7o-8CYW3VHE        2       1178
yN4T4xvDsg8        2       1160
wi4XlhmIG5c        2       1148
8EreFLjrNSc        3       1156, 1147

I am not sure what the best course of action is here. Can the creator remove the video IDs that I listed, or can any moderator do that?

@bedeho
Member

bedeho commented Apr 2, 2023

Let me first start by addressing this

I am not sure what's the best action can be taken in this regard. The creator can remove the video ID that I mentioned, or any moderator can do that?

Def. the creators have to do this on their own, but moderators can get in touch and help them. Moderator powers are probably too broad right now, see here

Joystream/joystream#4589

@bedeho
Member

bedeho commented Apr 2, 2023

  1. First off, thank you for unpacking and reviewing this so quickly!
  2. Excellent work also looking into why this was not identified earlier in our own QA. That is always very useful to know, to make sure we improve our methods for future releases.
  3. I think I understood the problem, at least at a high level, but am I understanding you to say that the only software change you are recommending is to handle the particular exception better? I don't think I understand why that would be sufficient, because once that exception occurs, you are basically in a state where, for some time, you cannot commit the true state of your system to your DB. Any fault or crash that occurs between that moment and whenever you are in fact able to reconcile your DB falls in a period during which the system is at high risk of ending up in an inconsistent state, even if you handle the exception itself perfectly. Isn't the fundamental issue here that you cannot start causing side effects on the Joystream blockchain in a way that is not atomic with reflecting this action in the DB? I am not at all certain here, because the details of how this works are not clear to me, but I thought I should ask this question regardless. Perhaps whatever issue you make that describes the fix can go into some detail on this.
  4. Does each operator of yt-synch have to decide individually on what configuration to use? If so, should we just switch to ON_DEMAND, to sidestep this problem for Gleev specifically?

@zeeshanakram3
Contributor Author

Sorry, I almost forgot to reply to this amid other work.

  1. ... Isn't the fundamental issue here that you cannot start causing side-effects on the Joystream blockchain in a non-atomic way with the reflecting this action in the DB?

Is your concern whether, even without this specific DB exception, the side effects on Joystream happen in a fully atomic way? Yes, creating a video on Joystream is an atomic operation (if we set aside this particular exception). The way it works is that right before sending the extrinsic, the service does a pre-commit, changing the state of the video from New to CreatingVideo. Now, even if the service crashes before the video is recorded as VideoCreated in the database (assuming the extrinsic was successful), we can resolve this state inconsistency whenever the service restarts.

Upon service initialization, a process runs that checks all the videos in the DB that are in the CreatingVideo state and matches their IDs against video.ytVideoId from the QN. Each of those videos is then moved to VideoCreated if it already exists on chain, or back to New if it does not.

Conceptually, this is how most databases internally make transactions atomic. The act of pre-committing is known as write-ahead logging (WAL) in PostgreSQL, and as transaction logs in many other databases. Whenever the DB restarts, these transaction logs are used to roll back or apply the unfinished changes specified in a transaction, by looking at the state of the changes already committed in the DB.
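The pre-commit and the startup reconciliation described above could look roughly like this. It is a minimal sketch, assuming the DB is a plain dictionary and that the set of YouTube IDs already present on chain has been fetched from the QN beforehand; `create_video`, `reconcile_on_startup`, and `send_extrinsic` are hypothetical names.

```python
def create_video(yt_video_id, db_state, send_extrinsic):
    """Pre-commit, create the video on chain, then record the result."""
    db_state[yt_video_id] = "CreatingVideo"  # pre-commit before the side effect
    send_extrinsic(yt_video_id)              # a crash may happen right after this
    db_state[yt_video_id] = "VideoCreated"


def reconcile_on_startup(db_state, yt_ids_on_chain):
    """Resolve videos left in CreatingVideo by checking what exists on chain."""
    for yt_video_id, state in db_state.items():
        if state != "CreatingVideo":
            continue  # only in-flight videos are ambiguous
        if yt_video_id in yt_ids_on_chain:
            db_state[yt_video_id] = "VideoCreated"  # the extrinsic went through
        else:
            db_state[yt_video_id] = "New"           # safe to create again
```

The point of the intermediate CreatingVideo state is that the service never retries an on-chain creation blindly: a video in that state is always checked against the chain first.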

  1. ... but am I understanding you to say that the only software change you are recommending is to handle the particular exception better? I don't think I understand why that would be sufficient, because once that exception occurs, you are basically in a state where you for some time cannot commit the true state of your system to your DB

So I think handling this DB exception (not being able to commit the state to the DB), coupled with the fact that video creation is an atomic operation, will fully solve the problem.

  1. Does each operator of yt-synch have to decide individually on what configuration to use? if so, should we just switch to ON_DEMAND, to sidestep this problem for Gleev specifically?

Yes, it's up to them, but if they use the Infrastructure-as-Code template provided in the YT-synch repo to bootstrap the database tables, the tables will be created with the ON_DEMAND capacity option.

For Gleev's instance of YPP, I switched to ON_DEMAND yesterday, when we released v1.1.0 with the operator reward script & collaborator status endpoint features.
