Sunday, December 23, 2012

Deadlock in highly concurrent situation

I recently experienced a deadlock case which happened in highly concurrent situation. So many concurrent access were requested at the same time against a single stored procedure and that caused a lot of deadlock situations. I think real case was more complex than what I describe here, but I just want to summarize some interesting deadlock case here.

The deadlock case occurred around a stored procedure called GetNextId. This SP returns a next job id from Job and JobQueue tables based on certain condition.

CASE 1: Here is the SP that caused deadlock. JobJobQueueView is a simple view that joins Job an JobQueue table based on JobId. So the SP selects a next job id from the view with certain where condition. It works fine in normal sitation but if highly concurrent calls are made, deadlock situation occurred.

WITH CTE1 (NextJobId) AS 
( 
   SELECT TOP 1 JobId 
   FROM JobJobQueueView WITH (HOLDLOCK) 
   WHERE (JobType = @JobType) 
    AND (SentToMachine IS NULL) 
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue 
SET @JobId = JobId, SentToMachine = @MachineName,StartTime = GETDATE()
FROM CTE1 
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobID is OUTPUT

The problem is : one SPID (55) holds Shared (S) lock on JobQueue table (and Job) when executing SELECT statement. At the same time another SPID (56) hold Shared lock on JobQueue while running SELECT. Since it has HOLDLOCK lock hint, the Shared locks are not released. Now SPID 55 tried to acquire exclusive X lock on JobQueue table to update data but cannot get it because SPID 56 holds Shared lock. At the same time, SPID 56 tried to get X lock on JobQueue but failed because SPID 55 still holds the S lock. Hence the deadlock. HOLDLOCK has the same effect of setting transaction isolation level to SERIALIZABLE.

CASE 2: To eliminate the side effect of HOLDLOCK (or SERIALIZABLE level), the HOLDLOCK lock hint was removed. Now no deadlock occurred but there is another side effect which only occurs in highly concurrent situation - the race condition. Some GetNextId requests got the same Id.

WITH CTE1 (NextJobId) AS 
(
    SELECT TOP 1 JQ.JobId
    FROM Job J, JobQueue JQ
    WHERE J.JobId = JQ.JobId
      AND (SentToMachine IS NULL)
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue
SET @JobId = JobId, SentToMachine = @MachineName, StartTime = GETDATE()
FROM CTE1
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobId is OUTPUT

The problem is : one SPID (55) holds Shared (S) lock on JobQueue and Job tables while executing SELECT, and at the same time another SPID (56) holds Shared (S) lock on JobQueue during SELECT, so they have the same next job id. But each Shared lock will be released as soon as the SELECT ends. One of them, say SPID 55, goes into UPDATE and hold exclusive lock on JobQueue table. SPID 56 will wait for X lock release. As soon as X lock is released from 55, 56 will update with the same id. There are other cases similar to this race condition. For example, SPID 55 release S lock after SELECT and if SPID 56 manages to read and write the same next job id before SPID 55 gets X lock on JobQueue, it is also possible for two SPIDs to have the same id.

CASE 3: To eliminate the side effect of case 2, I put UPDLOCK lock hint on JobQueue table. UPDLOCK allows other SPIDs to have shared lock but disallows others to have exclusive or update lock. This prevents the same id to be allowed at the same time.

WITH CTE1 (NextJobId) AS 
(
    SELECT TOP 1 JQ.JobId
    FROM Job J, JobQueue JQ WITH (UPDLOCK)
    WHERE J.JobId = JQ.JobId
      AND (JobType = @JobType)
      AND (SentToMachine IS NULL)
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue
SET @JobId = JobId, SentToMachine = @MachineName, StartTime = GETDATE()
FROM CTE1
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobId is OUTPUT

By holding Update lock on JobQueue table during SELECT, another SPID cannot acquire UPDLOCK at the same time and should wait until the U lock is released. When one SPID (55) holds UPDLOCK in SELECT, it allows others to get S lock but disallows U or X lock. Another SPID (56) enters SELECT statement but cannot acquire U lock since it is already held by 55. SPID 56 will simply wait until 55 is done with update. Once 55 completes UPDATE statement, SPID 56 can proceed with SELECT statement where it will get next job id. When 55 executes UPDATE JobQueue statement, 55 will release U lock and acquire X lock (can be expressed as converting U to X lock) and so no other SPIDs or threads cannot access this resource.

Note: I wrote a C# test application to simulate deadlock. Click here for details.

No comments:

Post a Comment