Unexplained stalls in CUDA API calls - reproducer attached

Hi,

Sorry for the late reply.

I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.
Please remember to launch the kernels on different streams to allow concurrent execution.
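As a minimal sketch of what I mean (the kernel and names here are hypothetical, for illustration only), each launch takes a distinct non-default stream as its fourth launch parameter:

```cuda
#include <cuda_runtime.h>

// Hypothetical trivial kernel, for illustration only.
__global__ void dummyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void) {
    const int nStreams = 3;
    const int n = 1024;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_data[i], n * sizeof(float));
    }

    // Each launch goes to its own stream (4th launch parameter),
    // so the kernels are eligible to run concurrently.
    for (int i = 0; i < nStreams; ++i)
        dummyKernel<<<n / 256, 256, 0, streams[i]>>>(d_data[i]);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaFree(d_data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Launches that omit the stream argument all go to the default stream and serialize.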

From the picture you shared in #17, the kernels don’t execute concurrently.
Could you share the profiler results of the concurrent version?

Thanks.

Hi AastaLLL,

I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.

No, I was referring to array summation above, ~/NVIDIA_CUDA-6.5_Samples/6_Advanced/reduction/.

Please remember to launch the kernels on different streams to allow concurrent execution.
From the picture you shared in #17, the kernels don’t execute concurrently.

The screenshot I attached in #17 was with a single execution stream only.

Could you share the profiler results of the concurrent version?

Attached. You can see that the kernels in streams 13, 14, and 15 now run more or less concurrently, presumably depending on what the scheduler decides to do.

However, you can see the real problem is the same as before: there is a random stall at cudaLaunch. Sometimes it happens at cudaEventRecord or cudaSetupArgument instead.

Hi AastaLLL,

I’ve made a new observation which may shed some light on the elusive random-stall problem.

Please refer to the attached test program source.

On my side, if I build the program to force memory transfers to complete synchronously, I observe output similar to the below:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 
Compiled with: FORCE_SYNC_MEM_TRANSFERS=1

Running scenario 1


Processed 8192 frames, time avg: 0.192 min: 0.176 max: 1.152 disparity: 0.976

Processing stats:
           1ms:     8192 frames (100.00%)
           2ms:        0 frames (  0.00%)
           3ms:        0 frames (  0.00%)
           4ms:        0 frames (  0.00%)
           5ms:        0 frames (  0.00%)
           6ms:        0 frames (  0.00%)
           7ms:        0 frames (  0.00%)
           8ms:        0 frames (  0.00%)
           9ms:        0 frames (  0.00%)
          10ms:        0 frames (  0.00%)
          11ms:        0 frames (  0.00%)
          12ms:        0 frames (  0.00%)
          13ms:        0 frames (  0.00%)
          14ms:        0 frames (  0.00%)
        >=15ms:        0 frames (  0.00%)

Total run time: 00:00:26

On the other hand, if I don’t, the output I observe is similar to the below:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 
Compiled with: FORCE_SYNC_MEM_TRANSFERS=0

Running scenario 1

        Disparity problem: 4.481 frame 4
                CUDA processing time: 4.673ms CPU time: 4847us
        Disparity problem: 4.365 frame 120
                CUDA processing time: 4.531ms CPU time: 4679us

Processed 8192 frames, time avg: 0.171 min: 0.165 max: 4.673 disparity: 4.508

Processing stats:
           1ms:     8189 frames ( 99.96%)
           2ms:        1 frames (  0.01%)
           3ms:        0 frames (  0.00%)
           4ms:        0 frames (  0.00%)
           5ms:        2 frames (  0.02%)
           6ms:        0 frames (  0.00%)
           7ms:        0 frames (  0.00%)
           8ms:        0 frames (  0.00%)
           9ms:        0 frames (  0.00%)
          10ms:        0 frames (  0.00%)
          11ms:        0 frames (  0.00%)
          12ms:        0 frames (  0.00%)
          13ms:        0 frames (  0.00%)
          14ms:        0 frames (  0.00%)
        >=15ms:        0 frames (  0.00%)

Total run time: 00:00:25

Could you comment on the reason for this?
sync-async-mem-transfers.zip (123 KB)

Hi,

This result is expected.

For example, suppose the maximum number of concurrent threads on your device is three, and each job requires two threads at a time.

[Time]: [Thread1] [Thread2] [Thread3]

Without sync:
T0: Job1.A, Job1.B, Job2.A
T1: Job2.B, Job3.A, Job3.B

Execution time of Job1 is one slot (T0).
Execution time of Job2 is two slots (it spans T0 and T1), so its measured time doubles.

With sync:
T0: Job1.A, Job1.B, IDLE
T1: Job2.A, Job2.B, IDLE

Execution time of Job1 is one slot (T0).
Execution time of Job2 is also one slot (T1).

Thanks.

Hi AastaLLL,

I agree with what you say in general, but in the case I attached, the jobs are already surrounded by synchronisation code:

checkCuda(cudaEventRecord(startEvent, 0));
load(); // GPU jobs submitted here
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));

The cudaEventSynchronize call will “wait until the completion of all device work preceding the most recent call to cudaEventRecord()”, as per the documentation.

For this reason, the extra cudaDeviceSynchronize calls (guarded by the FORCE_SYNC_MEM_TRANSFERS block) only affect the memory transfer operations in the source code I attached. Do you agree?
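To illustrate the pattern I mean (the buffer and stream names here are placeholders, not the actual ones from the attached source):

```cuda
// Sketch only; names are placeholders, not the attached source.
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
#if FORCE_SYNC_MEM_TRANSFERS
    // Force the transfer to finish before any kernels are queued.
    checkCuda(cudaDeviceSynchronize());
#endif
// Kernel launches and the event-based timing follow here.
```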

Hi,

cudaDeviceSynchronize makes sure all the CUDA threads have already finished their jobs.
If the extra calls only affect the memory transfers, then the answer to your question should be yes.

Thanks.

OK, thanks for clarifying.

Can I ask for your help on the other question I’ve posted?