drm/amd/sched: fix deadlock caused by unsignaled fences of deleted jobs
authorNicolai Hähnle <nicolai.haehnle@amd.com>
Thu, 28 Sep 2017 09:57:32 +0000 (11:57 +0200)
committerAlex Deucher <alexander.deucher@amd.com>
Fri, 6 Oct 2017 21:44:32 +0000 (17:44 -0400)
Highly concurrent Piglit runs can trigger a race condition where a pending
SDMA job on a buffer object is never executed because the corresponding
process is killed (perhaps due to a crash). Since the job's fences were
never signaled, the buffer object was effectively leaked. Worse, the
buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with
kernel stacks like:

    [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
    [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
    [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
    [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
    [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
    [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
    [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
    [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
    [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
    [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
    [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
    [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
    [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
    [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
    [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
    [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
    [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
    [<ffffffffffffffff>] 0xffffffffffffffff

Note: The correctness of this change depends on the earlier commit
"drm/amd/sched: move adding finish callback to amd_sched_job_begin"

v2: set an error on the finished fence

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/scheduler/gpu_scheduler.c

index 4693be20e30a1ce0038c9def92436d50e5fdd341..08e1332d814a2f090885f46c433f8446a3b87f0a 100644 (file)
@@ -227,8 +227,14 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
                 */
                kthread_park(sched->thread);
                kthread_unpark(sched->thread);
-               while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
+               while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
+                       struct amd_sched_fence *s_fence = job->s_fence;
+                       amd_sched_fence_scheduled(s_fence);
+                       dma_fence_set_error(&s_fence->finished, -ESRCH);
+                       amd_sched_fence_finished(s_fence);
+                       dma_fence_put(&s_fence->finished);
                        sched->ops->free_job(job);
+               }
 
        }
        kfifo_free(&entity->job_queue);