simpleStreams mistake ?! Concurrent kernels

Yunior · June 29, 2010, 1:24pm

I tested simpleStreams and what I recieve is:

[ simpleStreams ]

Device name : GeForce GTX 480
CUDA Capable SM 2.0 hardware with 15 multi-processors
scale_factor = 1.0000
array_size = 16777216

memcopy: 10.60
kernel: 0.99
non-streamed: 11.37 (11.59 expected)
4 streams: 10.84 (3.64 expected with compute capability 1.1 or later)

PASSED

Press ENTER to exit…

And question are:
a) why there is 10.84, NOT 3.64 ?
B) should I add some extra instruction during compilation ?
The improvement from 11.37 to 10.84 is not significant - I have expected much more…

Y.

PS. From device query:
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
…
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device
simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.0, CUDA Runtime Version = 3.0, NumDevs = 1, Device = GeForce GTX 480

Cliff_Woolley · June 29, 2010, 10:37pm

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)

    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

Cliff_Woolley · June 29, 2010, 10:37pm

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)

    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

Yunior · June 30, 2010, 7:21am

Ok, it helped for sdk sample (about 17% improvement).

[1] However for my problem (4 kernels without any transfer between gpu-cpu - each of them does one operation on matrix (multiplication or addition) for big matrices (100,000 x 100,000 or 500,000x500,000) there is only 2-3% improvement comparing 4-streams with non-streamed 4 kernels. And in my opinion this problem should keep Fermi busy…

cudaEventRecord(start_event, 0);

for(int k = 0; k < 1000; k++){

// asynchronously launch nstreams kernels, each operating on its own portion of data

kernel1_gpu<<< grid1, thread1, 0, streams[0]>>>( … );

kernel2_gpu<<< grid2, thread2, 0, streams[1]>>>( … );

kernel3_gpu<<< grid3, thread3, 0, streams[2]>>>( … );

kernel4_gpu<<< grid4, thread4, 0, streams[3]>>>( … );

}

cudaEventRecord(stop_event, 0);

cudaEventSynchronize(stop_event);

cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );

printf(“\n%d streams:\t%.2f \n”, nstreams, elapsed_time );

[2] Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems

Y.

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)
    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

Yunior · June 30, 2010, 7:21am

Ok, it helped for sdk sample (about 17% improvement).

[1] However for my problem (4 kernels without any transfer between gpu-cpu - each of them does one operation on matrix (multiplication or addition) for big matrices (100,000 x 100,000 or 500,000x500,000) there is only 2-3% improvement comparing 4-streams with non-streamed 4 kernels. And in my opinion this problem should keep Fermi busy…

cudaEventRecord(start_event, 0);

for(int k = 0; k < 1000; k++){

// asynchronously launch nstreams kernels, each operating on its own portion of data

kernel1_gpu<<< grid1, thread1, 0, streams[0]>>>( … );

kernel2_gpu<<< grid2, thread2, 0, streams[1]>>>( … );

kernel3_gpu<<< grid3, thread3, 0, streams[2]>>>( … );

kernel4_gpu<<< grid4, thread4, 0, streams[3]>>>( … );

}

cudaEventRecord(stop_event, 0);

cudaEventSynchronize(stop_event);

cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );

printf(“\n%d streams:\t%.2f \n”, nstreams, elapsed_time );

[2] Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems

Y.

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)
    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

tmurray · June 30, 2010, 7:29am

If you’re doing very large operations, why do you think concurrent kernels should help? You’re already filling the GPU, so the only advantage you’re going to see is by launching the next kernel before the previous kernel has drained completely.

tmurray · June 30, 2010, 7:29am

If you’re doing very large operations, why do you think concurrent kernels should help? You’re already filling the GPU, so the only advantage you’re going to see is by launching the next kernel before the previous kernel has drained completely.

Yunior · June 30, 2010, 7:31am

Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems. Which combination will be the best:

small-small-big-big
small-big-small-big
big-small-big-small
big-big-small-small

? Of course I can test it myself- but maybe there is some rule.

Y

Yunior · June 30, 2010, 7:31am

Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems. Which combination will be the best:

small-small-big-big
small-big-small-big
big-small-big-small
big-big-small-small

? Of course I can test it myself- but maybe there is some rule.

Y

Yunior · June 30, 2010, 7:45am

I expected improvment according to “White Paper” (page 18).

Yunior · June 30, 2010, 7:45am

I expected improvment according to “White Paper” (page 18).

seibert · June 30, 2010, 4:15pm

And the white paper says that concurrent execution is useful for cases where one kernel cannot efficiently use the entire device.

seibert · June 30, 2010, 4:15pm

And the white paper says that concurrent execution is useful for cases where one kernel cannot efficiently use the entire device.

kappa · June 30, 2010, 6:30pm

There is a reason to do what he is doing (even if that is not the reason he is): future scalability. Ignoring the other reasons for not getting maximum performance from a GPU of memory transfer, cache/shared memory usage, synchronization/thread divergence, etc. and just focusing on keeping the GPU busy in the form of occupancy, concurrent kernels can be great.

As an aside, I would like to express my true, sincere appreciation to everyone at NVIDIA for concurrent kernels in FERMI/API 3.x. I was counting on NVIDIA to introduce concurrent kernels long before the first whisper about FERMI. As a partial explanation as to why: one of my ultimate interests/goals is functional analysis–my company logo expresses this since it is the functional kernel for a Feynman path integral, Ïˆ (Î» Îº).

There are several ways of handling the problem of achieving maximum occupancy:

1> Hand tune the kernel and its launch for a given GPU or set of GPUs–this may be done with either the CUDA runtime or driver API but is not very future proof or scalable.

2> Use the CUDA driver API to JIT compile the kernel, get its real GPU attributes (register usage, etc). and dynamically size the kernel launches to achieve maximum occupancy.

3> Put the kernel launches on different streams (and make sure there are no synchronizations, etc.) to achieve concurrent kernels. IMHO this is the most scalable and future proof method. This method is compatible with method 2–although maximum occupancy is (more or less) maximum occupancy.

4> Don’t care–do nothing (well–it is always an option and it is a local maximum External Media ).

Feel free to list other methods.

BTW, I also think NVIDIA is doing a great job of addressing other possible performance bottlenecks such as memory transfer, etc.

kappa · June 30, 2010, 6:30pm

There is a reason to do what he is doing (even if that is not the reason he is): future scalability. Ignoring the other reasons for not getting maximum performance from a GPU of memory transfer, cache/shared memory usage, synchronization/thread divergence, etc. and just focusing on keeping the GPU busy in the form of occupancy, concurrent kernels can be great.

As an aside, I would like to express my true, sincere appreciation to everyone at NVIDIA for concurrent kernels in FERMI/API 3.x. I was counting on NVIDIA to introduce concurrent kernels long before the first whisper about FERMI. As a partial explanation as to why: one of my ultimate interests/goals is functional analysis–my company logo expresses this since it is the functional kernel for a Feynman path integral, Ïˆ (Î» Îº).

There are several ways of handling the problem of achieving maximum occupancy:

1> Hand tune the kernel and its launch for a given GPU or set of GPUs–this may be done with either the CUDA runtime or driver API but is not very future proof or scalable.

2> Use the CUDA driver API to JIT compile the kernel, get its real GPU attributes (register usage, etc). and dynamically size the kernel launches to achieve maximum occupancy.

3> Put the kernel launches on different streams (and make sure there are no synchronizations, etc.) to achieve concurrent kernels. IMHO this is the most scalable and future proof method. This method is compatible with method 2–although maximum occupancy is (more or less) maximum occupancy.

4> Don’t care–do nothing (well–it is always an option and it is a local maximum External Media ).

Feel free to list other methods.

BTW, I also think NVIDIA is doing a great job of addressing other possible performance bottlenecks such as memory transfer, etc.

tmurray · June 30, 2010, 6:57pm

Oh sure, on a hypothetical GPU with 1024 SMs, concurrent kernels would be incredibly important. My point is that if you know you’re filling the GPU now and you compare performance with and without concurrent kernels enabled, you’re not going to see a huge benefit from concurrent kernels.

tmurray · June 30, 2010, 6:57pm

Oh sure, on a hypothetical GPU with 1024 SMs, concurrent kernels would be incredibly important. My point is that if you know you’re filling the GPU now and you compare performance with and without concurrent kernels enabled, you’re not going to see a huge benefit from concurrent kernels.

kappa · June 30, 2010, 7:28pm

If your code is worthwhile, future-proofing your code is not just a hypothetical, it will keep future programmers (including yourself) from cursing your name. The benefits, as you point out, may not be in terms of current performance,but in terms of future software use and maintenance. I have had code that I designed and wrote to be future-proof for five years be deployed for worldwide use by a major multinational AFTER the end of that five years (at the same time that that same multinational was eliminating almost all internally developed software). (That last statement should not necessarily be taken to imply that there were not programmers cursing my name :unsure: .) Developing only for the GPU sitting in front of them is not what anybody should be encouraged to do. I am open to suggestions (especially from employees of NVIDIA External Media ) for better ways to design CUDA software that will work and scale into the future.

Also, though, concurrent kernels provide one important method today, to fill the GPU–if you are not already. For his group of smaller kernels for example.

kappa · June 30, 2010, 7:28pm

If your code is worthwhile, future-proofing your code is not just a hypothetical, it will keep future programmers (including yourself) from cursing your name. The benefits, as you point out, may not be in terms of current performance,but in terms of future software use and maintenance. I have had code that I designed and wrote to be future-proof for five years be deployed for worldwide use by a major multinational AFTER the end of that five years (at the same time that that same multinational was eliminating almost all internally developed software). (That last statement should not necessarily be taken to imply that there were not programmers cursing my name :unsure: .) Developing only for the GPU sitting in front of them is not what anybody should be encouraged to do. I am open to suggestions (especially from employees of NVIDIA External Media ) for better ways to design CUDA software that will work and scale into the future.

Also, though, concurrent kernels provide one important method today, to fill the GPU–if you are not already. For his group of smaller kernels for example.

Topic		Replies	Views
Cannot get any stream parallelism. CUDA Programming and Performance	13	1236	December 31, 2019
Question on Stream, Connection and Performance CUDA Programming and Performance hw , cuda	6	950	February 23, 2024
Why does cudaStreamAddCallback serialize kernel execution and break concurrency? CUDA Programming and Performance	12	7923	April 5, 2015
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	2060	February 5, 2020
Not answered question? CUDA and OpenMP? CUDA Programming and Performance	28	10885	September 30, 2010
CUDA 4.0 CUDA Programming and Performance	63	507393	March 28, 2013
CUDA very slow performance CUDA Programming and Performance	21	16390	March 6, 2020
Unexplained stalls in CUDA API calls - reproducer attached Jetson TK1	27	2920	October 18, 2021
Fermi? Sounds interesting... CUDA Programming and Performance	58	15504	October 18, 2009
Concurrently kernels running on one device CUDA Programming and Performance	17	2724	March 2, 2010

simpleStreams mistake ?! Concurrent kernels

memcopy: 10.60 kernel: 0.99 non-streamed: 11.37 (11.59 expected) 4 streams: 10.84 (3.64 expected with compute capability 1.1 or later)

Related topics

memcopy: 10.60
kernel: 0.99
non-streamed: 11.37 (11.59 expected)
4 streams: 10.84 (3.64 expected with compute capability 1.1 or later)