Avoiding Postgres transaction deadlocks #268

zeeshanakram3 · 2023-12-04T09:36:51Z

Problem

From the Postgres documentation, why the deadlocks occur:

Consider the case in which two concurrent transactions modify a table. The first transaction executes:
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111;
This acquires a row-level lock on the row with the specified account number. Then, the second transaction executes:
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111;
The first UPDATE statement successfully acquires a row-level lock on the specified row, so it succeeds in updating that row. However, the second UPDATE statement finds that the row it is attempting to update has already been locked, so it waits for the transaction that acquired the lock to complete. Transaction two is now waiting on transaction one to complete before it continues execution. Now, transaction one executes:

UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222;
Transaction one attempts to acquire a row-level lock on the specified row, but it cannot: transaction two already holds such a lock. So it waits for transaction two to complete. Thus, transaction one is blocked on transaction two, and transaction two is
blocked on transaction one: a deadlock condition. PostgreSQL will detect this situation and abort one of the transactions.

How to fix deadlocks

The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order. In the example above, if both transactions had updated the rows in the same order, no deadlock would have occurred.

e.g. based on the above example, the following order of transactions execution won't cause any deadlock

-- Transaction 1
BEGIN;
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111; 

-- Transaction 2
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111; 

-- Transaction 1
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222; 
COMMIT;

-- Transaction 2
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;
COMMIT;

Avoiding deadlocks in Orion code

Identify all the processes/applications that are concurrently updating the Postgres DB
Ensure that the queries those concurrent applications execute have all the entity updates in the same order (i.e. ascending)

For example, we know that at least two different connections are responsible for concurrent updates,

The processor that sends a bunch of update requests to DB in a single transaction
The updateVideoRelevanceValue function that periodically updates the video_relevance scores for all videos.

So taking the example of only the video table, when the processor submits a transaction to the DB to update all the videos (updates caused due to processing of mappings over N blocks), we need to ensure that both transactions - the processor transaction and the video relevance transaction - has the video entities updates in the ascending order of their IDs.

The text was updated successfully, but these errors were encountered:

ignazio-bovo · 2023-12-04T12:12:11Z

There are several instances of deadlock that I have occurred for a deadlock:

Conflicting updates on overlay flush

This suggest a possible conflict between two update made during the overlay flush operation, in which internally the EntityManager save operation is called. Reading the message it says that it's happening when 2 TX are updating the same row 446881 on the video table and they want to update their Share lock into an Exclusive lock. This upgrade cannot be controlled via Typeorm (it could be mitigated by having a TX committing one row at a time instead of multiple row as above as recommended here)
But the most direct solution would be to catch the deadlock error and retry like it is done here on the typeorm documentation (https://typeorm.io/transactions)

Conflicting updates on the updateVideoRelevanceValue loop

Video relavence is update via transaction here

orion/src/utils/VideoRelevanceManager.ts

Line 30 in ed864f4

const [

This is a big transaction that relies on postgres autocommit feature (i.e. transaction query is not wrapped in BEGING + COMMIT statements). I tried to add them to the and re run the synch process and I am getting weird errors like

Without them appears to work smoothly but it's not reliable as I have seen the deadlock error happened on production deployment (but I haven't save the screeshot).
Hence this update loop is quite hard to debug and unpredictable, thus:

I solution could be adding a separate table
(videoId, videoRelevance) in order to READ from the video table and WRITE on the relevance table
A same reasoning could be applied to the comment relevance update loop. In this way the competing transactions actually operates on two different rows when updating the same video id

ignazio-bovo · 2023-12-04T12:53:26Z

Overlay

The overlay feature was provided as a in memory cache to avoid constantly querying the database in order to speed up the processor synching and block processing.
We had so far several issue with concurrency with the overlay, but they are not directly related to the overlay functionality per se, but more on how it handles concurrent transactions via the entity manager instance.
There's no clean way to turn it off / on and at this point we should live with it, as I believe it is useful and it is just a matter of retouching the flush operations via transactional statements with proper isolation level
But at this point it is extremely inconvenient to abandon the overlay

zeeshanakram3 · 2023-12-04T13:29:23Z

I think that for this type of information in order to simplify everything it could be worth adding a separate table
(videoId, videoRelevance) in order to READ from the video table and WRITE on the relevance table
A same reasoning could be applied to the comment relevance update loop

I think doing schema change does not really either address the problem at its code or fix the problem, it only moves the possibility of deadlock happening from video to new video_relevance table (If two concurrent processes try to update the video_relevance table) .

The solution I provided (as well as suggested in the Postgres official documentation) The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order. is the best way to avoid deadlock scenarios from ever happening I think.

So coming to the actual solution, I see that the processor's save transaction already tries to submit the UPDATE queries in ascending order of the IDs (see the above screenshot). Now only change you have to do this to to make sure the updateVideoRelevanceValue also tries to submit the update video relevance score in the same order. This is how the updated function should look like:

async updateVideoRelevanceValue(em: EntityManager, forceUpdateAll?: boolean) {
 if (this.videosToUpdate.size || forceUpdateAll) {
   const [
     newnessWeight,
     viewsWeight,
     commentsWeight,
     reactionsWeight,
     [joystreamTimestampWeight, ytTimestampWeight] = [7, 3],
     defaultChannelWeight,
   ] = await config.get(ConfigVariable.RelevanceWeights, em);
   const channelWeight = defaultChannelWeight ?? 1;
   await em.query(`
     WITH weighted_timestamp AS (
       SELECT 
         "video"."id",
         (
           extract(epoch from video.created_at)*${joystreamTimestampWeight} +
           COALESCE(extract(epoch from video.published_before_joystream), extract(epoch from video.created_at))*${ytTimestampWeight}
         ) / ${joystreamTimestampWeight} + ${ytTimestampWeight} as wtEpoch,
         "channel"."channel_weight" as CW
       FROM 
         "video" 
         INNER JOIN
           "channel" ON "video"."channel_id" = "channel"."id"
       ${
         forceUpdateAll
           ? ''
           : `WHERE "video"."id" IN (${[...this.videosToUpdate.values()]
               .map((id) => `'${id}'`)
               .join(', ')})`
       }
+        ORDER BY "video"."id" -- Added ORDER BY clause to sort by ID
     )
     UPDATE 
       "video"
     SET
       "video_relevance" = ROUND(
         (
           (extract(epoch from now()) - wtEpoch) / ${NEWNESS_SECONDS_DIVIDER} * ${newnessWeight * -1} +
           (views_num * ${viewsWeight}) +
           (comments_count * ${commentsWeight}) +
           (reactions_count * ${reactionsWeight})
         ) * COALESCE(CW, ${channelWeight}),
         2)
     FROM
       weighted_timestamp
     WHERE
       "video".id = weighted_timestamp.id
   `);
   this.videosToUpdate.clear();
 }
}

ignazio-bovo · 2023-12-05T12:03:17Z

The ORDER BY clause is not allowed in the postgresql (at least the way you placed it)

The issue here, as I mentioned, are two:

Potential conflicts in between Overlay save operation on entities

In the screenshot presented there are 2 TX both holding a shared lock on the videoId highlited. This is happening basically:

Tx1 acquires Shared lock (i.e. read only lock)
Tx2 acquires Shared lock
Tx1 wants to upgrade to Execute lock (i.e. write lock), thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released
Tx2 wants to upgrade to Execute lock (i.e. write lock), thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released

The result is that Tx1 and Tx2 are waiting for each other.
There's not really much we can do about this, unless having explicitly in the typeorm code deadlock detection (i.e. a try-catch block) with a retry policy

Concurrency with Upgrade query

The concurrency here is given by :
a. Overlay that wants to update the video table
b. Update query that wants to update the video table just for the relevancy score

What I am suggesting is that we split the two tables so that Overlay operates on any table except the relevancy score one and the Update loop query operates solely on the relevancy score.

This should solve any concurrency issues between overlay and update query, which is what we are trying to solve here

zeeshanakram3 · 2023-12-05T14:50:40Z

The ORDER BY clause is not allowed in the postgresql (at least the way you placed it)

Correct, I have updated the query

Potential conflicts in between Overlay save operation on entities
...thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released

I think not entirely accurate since as many transactions can hold Shared lock(read only locks)

You are right that the conflict is happening b/w Overlay and Video relevance score update transactions, call them Tx1 and Tx2, respectively. Now on my attempt to explaining how adding ORDER BY should fix the deadlock, say Tx1 would update videos Id [1, 10] while Tx2 would update [10, 1] (basically same videos but in different order)

Tx1 starts, it wants to acquire row exclusive lock on video ID 1, there is no other write only lock on the 1 so it acquires the lock successfully and makes the update
Tx2 starts, it wants to acquire row exclusive lock on video ID 10, there is no other write only lock on the 10 so it acquires the lock successfully and makes the update
Tx1 wants to acquires row exclusive lock on video ID 10, but Tx2 already holds lock on that row, so Tx1 will wait for Tx2 to complete and release the lock
Tx2 wants to acquires row exclusive lock on video ID 1, but Tx1 already holds lock on that row, so Tx2 will wait for Tx1 to complete and release the lock

Now, you see that after step 4, both these transaction are waiting for each other (i.e. cyclic dependency), so the Postgres will detect that and abort one of the transaction.

There's not really much we can do about this, unless having explicitly in the typeorm code deadlock detection (i.e. a try-catch block) with a retry policy

This is not entirely true, we can prevent any deadlock scenario from happening in the first place. Considering above example, if the Tx2 had updated the rows in the same order as Tx1. Tx2 would actually never have started in the first place till Tx1 was done

The concurrency here is given by :
a. Overlay that wants to update the video table
b. Update query that wants to update the video table just for the relevancy score

Correct.

What I am suggesting is that we split the two tables so that Overlay operates on any table except the relevancy score one and the Update loop query operates solely on the relevancy score.

Yeah, this should work, but I don't think we should employ that, as this seems like an anti-pattern and this is not usually deadlock scenarios are fixed.

zeeshanakram3 mentioned this issue Dec 5, 2023

Fix/homepage scoring #251

Merged

ignazio-bovo mentioned this issue Dec 5, 2023

Hotfix/deadlock #270

Closed

WRadoslaw added the new issue label Jan 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoiding Postgres transaction deadlocks #268

Avoiding Postgres transaction deadlocks #268

zeeshanakram3 commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 4, 2023 •

edited

zeeshanakram3 commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 5, 2023 •

edited

zeeshanakram3 commented Dec 5, 2023

Avoiding Postgres transaction deadlocks #268

Avoiding Postgres transaction deadlocks #268

Comments

zeeshanakram3 commented Dec 4, 2023 • edited

Problem

How to fix deadlocks

Avoiding deadlocks in Orion code

ignazio-bovo commented Dec 4, 2023 • edited

Conflicting updates on overlay flush

Conflicting updates on the updateVideoRelevanceValue loop

ignazio-bovo commented Dec 4, 2023 • edited

Overlay

zeeshanakram3 commented Dec 4, 2023 • edited

ignazio-bovo commented Dec 5, 2023 • edited

Potential conflicts in between Overlay save operation on entities

Concurrency with Upgrade query

zeeshanakram3 commented Dec 5, 2023

zeeshanakram3 commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 4, 2023 •

edited

zeeshanakram3 commented Dec 4, 2023 •

edited

ignazio-bovo commented Dec 5, 2023 •

edited