simpleStreams mistake ?! Concurrent kernels

I tested simpleStreams and what I recieve is:

[ simpleStreams ]

Device name : GeForce GTX 480
CUDA Capable SM 2.0 hardware with 15 multi-processors
scale_factor = 1.0000
array_size = 16777216

memcopy: 10.60
kernel: 0.99
non-streamed: 11.37 (11.59 expected)
4 streams: 10.84 (3.64 expected with compute capability 1.1 or later)

PASSED

Press ENTER to exit…

And question are:
a) why there is 10.84, NOT 3.64 ?
B) should I add some extra instruction during compilation ?
The improvement from 11.37 to 10.84 is not significant - I have expected much more…

Y.

PS. From device query:
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0

Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device
simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.0, CUDA Runtime Version = 3.0, NumDevs = 1, Device = GeForce GTX 480

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)

    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

This is due to a problem with the SDK sample where it’s not submitting enough work to keep Fermi busy long enough to get good measurements. We fixed the issue already, but the fix didn’t make it in to the SDK in time for the CUDA 3.1 release of the SDK.

The fix is as follows (in simpleStreams.cu at around line 121):

[font=“Courier New”] if(deviceProp.major >= 2)

    niterations = 100;

else

{		    

    if(deviceProp.minor > 1)

	niterations = 5;

    else

	niterations = 1; // reduced workload for compute capability 1.0 and 1.1

}[/font]

Ok, it helped for sdk sample (about 17% improvement).

[1] However for my problem (4 kernels without any transfer between gpu-cpu - each of them does one operation on matrix (multiplication or addition) for big matrices (100,000 x 100,000 or 500,000x500,000) there is only 2-3% improvement comparing 4-streams with non-streamed 4 kernels. And in my opinion this problem should keep Fermi busy…

cudaEventRecord(start_event, 0);

for(int k = 0; k < 1000; k++){

// asynchronously launch nstreams kernels, each operating on its own portion of data

kernel1_gpu<<< grid1, thread1, 0, streams[0]>>>( … );

kernel2_gpu<<< grid2, thread2, 0, streams[1]>>>( … );

kernel3_gpu<<< grid3, thread3, 0, streams[2]>>>( … );

kernel4_gpu<<< grid4, thread4, 0, streams[3]>>>( … );

}

cudaEventRecord(stop_event, 0);

cudaEventSynchronize(stop_event);

cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );

printf("\n%d streams:\t%.2f \n", nstreams, elapsed_time );

[2] Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems

Y.

Ok, it helped for sdk sample (about 17% improvement).

[1] However for my problem (4 kernels without any transfer between gpu-cpu - each of them does one operation on matrix (multiplication or addition) for big matrices (100,000 x 100,000 or 500,000x500,000) there is only 2-3% improvement comparing 4-streams with non-streamed 4 kernels. And in my opinion this problem should keep Fermi busy…

cudaEventRecord(start_event, 0);

for(int k = 0; k < 1000; k++){

// asynchronously launch nstreams kernels, each operating on its own portion of data

kernel1_gpu<<< grid1, thread1, 0, streams[0]>>>( … );

kernel2_gpu<<< grid2, thread2, 0, streams[1]>>>( … );

kernel3_gpu<<< grid3, thread3, 0, streams[2]>>>( … );

kernel4_gpu<<< grid4, thread4, 0, streams[3]>>>( … );

}

cudaEventRecord(stop_event, 0);

cudaEventSynchronize(stop_event);

cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );

printf("\n%d streams:\t%.2f \n", nstreams, elapsed_time );

[2] Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems

Y.

If you’re doing very large operations, why do you think concurrent kernels should help? You’re already filling the GPU, so the only advantage you’re going to see is by launching the next kernel before the previous kernel has drained completely.

If you’re doing very large operations, why do you think concurrent kernels should help? You’re already filling the GPU, so the only advantage you’re going to see is by launching the next kernel before the previous kernel has drained completely.

Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems. Which combination will be the best:

small-small-big-big
small-big-small-big
big-small-big-small
big-big-small-small

? Of course I can test it myself- but maybe there is some rule.

Y

Any ideas how to group kernels - because I have kernel1,kernel3 which work on small problems, and kernel2, kernel4 which work on big problems. Which combination will be the best:

small-small-big-big
small-big-small-big
big-small-big-small
big-big-small-small

? Of course I can test it myself- but maybe there is some rule.

Y

I expected improvment according to “White Paper” (page 18).

I expected improvment according to “White Paper” (page 18).

And the white paper says that concurrent execution is useful for cases where one kernel cannot efficiently use the entire device.

And the white paper says that concurrent execution is useful for cases where one kernel cannot efficiently use the entire device.

There is a reason to do what he is doing (even if that is not the reason he is): future scalability. Ignoring the other reasons for not getting maximum performance from a GPU of memory transfer, cache/shared memory usage, synchronization/thread divergence, etc. and just focusing on keeping the GPU busy in the form of occupancy, concurrent kernels can be great.

As an aside, I would like to express my true, sincere appreciation to everyone at NVIDIA for concurrent kernels in FERMI/API 3.x. I was counting on NVIDIA to introduce concurrent kernels long before the first whisper about FERMI. As a partial explanation as to why: one of my ultimate interests/goals is functional analysis–my company logo expresses this since it is the functional kernel for a Feynman path integral, ψ (λ κ).

There are several ways of handling the problem of achieving maximum occupancy:

1> Hand tune the kernel and its launch for a given GPU or set of GPUs–this may be done with either the CUDA runtime or driver API but is not very future proof or scalable.

2> Use the CUDA driver API to JIT compile the kernel, get its real GPU attributes (register usage, etc). and dynamically size the kernel launches to achieve maximum occupancy.

3> Put the kernel launches on different streams (and make sure there are no synchronizations, etc.) to achieve concurrent kernels. IMHO this is the most scalable and future proof method. This method is compatible with method 2–although maximum occupancy is (more or less) maximum occupancy.

4> Don’t care–do nothing (well–it is always an option and it is a local maximum :devil: ).

Feel free to list other methods.

BTW, I also think NVIDIA is doing a great job of addressing other possible performance bottlenecks such as memory transfer, etc.

There is a reason to do what he is doing (even if that is not the reason he is): future scalability. Ignoring the other reasons for not getting maximum performance from a GPU of memory transfer, cache/shared memory usage, synchronization/thread divergence, etc. and just focusing on keeping the GPU busy in the form of occupancy, concurrent kernels can be great.

As an aside, I would like to express my true, sincere appreciation to everyone at NVIDIA for concurrent kernels in FERMI/API 3.x. I was counting on NVIDIA to introduce concurrent kernels long before the first whisper about FERMI. As a partial explanation as to why: one of my ultimate interests/goals is functional analysis–my company logo expresses this since it is the functional kernel for a Feynman path integral, ψ (λ κ).

There are several ways of handling the problem of achieving maximum occupancy:

1> Hand tune the kernel and its launch for a given GPU or set of GPUs–this may be done with either the CUDA runtime or driver API but is not very future proof or scalable.

2> Use the CUDA driver API to JIT compile the kernel, get its real GPU attributes (register usage, etc). and dynamically size the kernel launches to achieve maximum occupancy.

3> Put the kernel launches on different streams (and make sure there are no synchronizations, etc.) to achieve concurrent kernels. IMHO this is the most scalable and future proof method. This method is compatible with method 2–although maximum occupancy is (more or less) maximum occupancy.

4> Don’t care–do nothing (well–it is always an option and it is a local maximum :devil: ).

Feel free to list other methods.

BTW, I also think NVIDIA is doing a great job of addressing other possible performance bottlenecks such as memory transfer, etc.

Oh sure, on a hypothetical GPU with 1024 SMs, concurrent kernels would be incredibly important. My point is that if you know you’re filling the GPU now and you compare performance with and without concurrent kernels enabled, you’re not going to see a huge benefit from concurrent kernels.

Oh sure, on a hypothetical GPU with 1024 SMs, concurrent kernels would be incredibly important. My point is that if you know you’re filling the GPU now and you compare performance with and without concurrent kernels enabled, you’re not going to see a huge benefit from concurrent kernels.

If your code is worthwhile, future-proofing your code is not just a hypothetical, it will keep future programmers (including yourself) from cursing your name. The benefits, as you point out, may not be in terms of current performance,but in terms of future software use and maintenance. I have had code that I designed and wrote to be future-proof for five years be deployed for worldwide use by a major multinational AFTER the end of that five years (at the same time that that same multinational was eliminating almost all internally developed software). (That last statement should not necessarily be taken to imply that there were not programmers cursing my name :unsure: .) Developing only for the GPU sitting in front of them is not what anybody should be encouraged to do. I am open to suggestions (especially from employees of NVIDIA :whistling: ) for better ways to design CUDA software that will work and scale into the future.

Also, though, concurrent kernels provide one important method today, to fill the GPU–if you are not already. For his group of smaller kernels for example.

If your code is worthwhile, future-proofing your code is not just a hypothetical, it will keep future programmers (including yourself) from cursing your name. The benefits, as you point out, may not be in terms of current performance,but in terms of future software use and maintenance. I have had code that I designed and wrote to be future-proof for five years be deployed for worldwide use by a major multinational AFTER the end of that five years (at the same time that that same multinational was eliminating almost all internally developed software). (That last statement should not necessarily be taken to imply that there were not programmers cursing my name :unsure: .) Developing only for the GPU sitting in front of them is not what anybody should be encouraged to do. I am open to suggestions (especially from employees of NVIDIA :whistling: ) for better ways to design CUDA software that will work and scale into the future.

Also, though, concurrent kernels provide one important method today, to fill the GPU–if you are not already. For his group of smaller kernels for example.