Unexplained stalls in CUDA API calls - reproducer attached

Hi,

Sorry for the late reply.

I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.
Please remember to launch the kernels on different streams to allow concurrent execution.
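As a minimal sketch of what I mean (the kernel and names here are hypothetical, for illustration only), each launch takes a distinct non-default stream as its fourth launch parameter:

```cuda
#include <cuda_runtime.h>

// Hypothetical trivial kernel, for illustration only.
__global__ void dummyKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void) {
    const int nStreams = 3;
    const int n = 1024;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_data[i], n * sizeof(float));
    }

    // Each launch goes to its own stream (4th launch parameter),
    // so the kernels are eligible to run concurrently.
    for (int i = 0; i < nStreams; ++i)
        dummyKernel<<<n / 256, 256, 0, streams[i]>>>(d_data[i]);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaFree(d_data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Launches that omit the stream argument all go to the default stream and serialize.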

From the picture you shared in #17, the kernels don’t execute concurrently.
Could you share the profiler results of the concurrent version?

Thanks.

Hi AastaLLL,

I guess you are referring to the sample ‘~/NVIDIA_CUDA-8.0_Samples/6_Advanced/concurrentKernels’.

No, I was referring to array summation above, ~/NVIDIA_CUDA-6.5_Samples/6_Advanced/reduction/.

Please remember to launch the kernels on different streams to allow concurrent execution.
From the picture you shared in #17, the kernels don’t execute concurrently.

The screenshot I attached in #17 was with a single execution stream only.

Could you share the profiler results of the concurrent version?

Attached. You can see that the kernels in streams 13, 14, and 15 now run more or less concurrently, presumably depending on what the scheduler decides to do.

However, you can see the real problem is the same as before: there is a random stall at cudaLaunch. Sometimes it happens at cudaEventRecord or cudaSetupArgument instead.

Hi AastaLLL,

I’ve made a new observation which may shed some light on the elusive random-stall problem.

Please refer to the attached test program source.

On my side, if I build the program to force memory transfers to complete synchronously, I observe output similar to the below:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 
Compiled with: FORCE_SYNC_MEM_TRANSFERS=1

Running scenario 1


Processed 8192 frames, time avg: 0.192 min: 0.176 max: 1.152 disparity: 0.976

Processing stats:
           1ms:     8192 frames (100.00%)
           2ms:        0 frames (  0.00%)
           3ms:        0 frames (  0.00%)
           4ms:        0 frames (  0.00%)
           5ms:        0 frames (  0.00%)
           6ms:        0 frames (  0.00%)
           7ms:        0 frames (  0.00%)
           8ms:        0 frames (  0.00%)
           9ms:        0 frames (  0.00%)
          10ms:        0 frames (  0.00%)
          11ms:        0 frames (  0.00%)
          12ms:        0 frames (  0.00%)
          13ms:        0 frames (  0.00%)
          14ms:        0 frames (  0.00%)
        >=15ms:        0 frames (  0.00%)

Total run time: 00:00:26

On the other hand, if I don’t, the output I observe is similar to the below:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 
Compiled with: FORCE_SYNC_MEM_TRANSFERS=0

Running scenario 1

        Disparity problem: 4.481 frame 4
                CUDA processing time: 4.673ms CPU time: 4847us
        Disparity problem: 4.365 frame 120
                CUDA processing time: 4.531ms CPU time: 4679us

Processed 8192 frames, time avg: 0.171 min: 0.165 max: 4.673 disparity: 4.508

Processing stats:
           1ms:     8189 frames ( 99.96%)
           2ms:        1 frames (  0.01%)
           3ms:        0 frames (  0.00%)
           4ms:        0 frames (  0.00%)
           5ms:        2 frames (  0.02%)
           6ms:        0 frames (  0.00%)
           7ms:        0 frames (  0.00%)
           8ms:        0 frames (  0.00%)
           9ms:        0 frames (  0.00%)
          10ms:        0 frames (  0.00%)
          11ms:        0 frames (  0.00%)
          12ms:        0 frames (  0.00%)
          13ms:        0 frames (  0.00%)
          14ms:        0 frames (  0.00%)
        >=15ms:        0 frames (  0.00%)

Total run time: 00:00:25

Could you comment on the reason for this?
sync-async-mem-transfers.zip (123 KB)

Hi,

This result is expected.

For example, suppose the maximum number of concurrent threads on your device is three, and each job requires two threads at a time.

[Time]: [Thread1] [Thread2] [Thread3]

Without sync:
T0: Job1.A, Job1.B, Job2.A
T1: Job2.B, Job3.A, Job3.B

Execution time of Job1 is one slot (T0).
Execution time of Job2 is two slots (it spans T0 and T1), so its measured time doubles.

With sync:
T0: Job1.A, Job1.B, IDLE
T1: Job2.A, Job2.B, IDLE

Execution time of Job1 is one slot (T0).
Execution time of Job2 is also one slot (T1).

Thanks.

Hi AastaLLL,

I agree with what you say in general, but in the case I attached, the jobs are already surrounded by synchronisation code:

checkCuda(cudaEventRecord(startEvent, 0));
load(); // GPU jobs submitted here
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));

The cudaEventSynchronize call will “wait until the completion of all device work preceding the most recent call to cudaEventRecord()”, as per the documentation.

For this reason, the extra cudaDeviceSynchronize calls (guarded by the FORCE_SYNC_MEM_TRANSFERS block) only affect the memory transfer operations in the source code I attached. Do you agree?
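To illustrate the pattern I mean (the buffer and stream names here are placeholders, not the actual ones from the attached source):

```cuda
// Sketch only; names are placeholders, not the attached source.
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
#if FORCE_SYNC_MEM_TRANSFERS
    // Force the transfer to finish before any kernels are queued.
    checkCuda(cudaDeviceSynchronize());
#endif
// Kernel launches and the event-based timing follow here.
```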

Hi,

cudaDeviceSynchronize makes sure all the CUDA threads have already finished their jobs.
If the extra calls only affect the memory transfers, then the answer to your question should be yes.

Thanks.

OK, thanks for clarifying.

Can I ask for your help on the other question I’ve posted?