CUDA_PROFILE disables streams

We’ve just had a whale of a time figuring out that setting the CUDA_PROFILE environment variable apparently disables streams. We found no hint of this in the documentation or on the forums.

To reproduce, run the simpleStreams SDK example twice: once with CUDA_PROFILE set to 1, and once with it set to 0 or unset. There is no need to install the Visual Profiler; just set the variable in the shell you launch from. Here’s an example run:

[codebox]

SDK2/bin/linux/release>./simpleStreams

running on: GeForce GTX 280

memcopy: 33.06

kernel: 17.20

non-streamed: 50.03 (50.26 expected)

4 streams: 33.79 (25.46 expected with compute capability 1.1 or later)


Test PASSED

Press ENTER to exit…

SDK2/bin/linux/release>setenv CUDA_PROFILE 1

SDK2/bin/linux/release>./simpleStreams

running on: GeForce GTX 280

memcopy: 33.07

kernel: 17.21

non-streamed: 50.04 (50.28 expected)

4 streams: 50.31 (25.48 expected with compute capability 1.1 or later)


Test PASSED

Press ENTER to exit…

[/codebox]
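For what it’s worth, the "expected" figures in these runs look consistent with simple arithmetic: non-streamed is memcopy + kernel, and the streamed estimate is kernel + memcopy / number of streams. That formula is inferred from the printed numbers, not taken from the SDK source, so treat the Python sketch below as an assumption (the function names are made up for illustration):

```python
def expected_times(memcopy_ms, kernel_ms, n_streams):
    """Return (non_streamed, streamed) estimates in milliseconds.

    Assumption (inferred from the printed output, not from the SDK source):
    with n_streams streams only 1/n_streams of the copy stays exposed,
    the rest overlaps with kernel execution.
    """
    non_streamed = memcopy_ms + kernel_ms
    streamed = kernel_ms + memcopy_ms / n_streams
    return non_streamed, streamed


def realized_overlap(measured_ms, memcopy_ms, kernel_ms, n_streams):
    """Fraction of the estimated saving that the measured run achieved.

    Roughly 0.0 means the copy and kernel were fully serialized,
    1.0 means the run hit the streamed estimate.
    """
    non_streamed, streamed = expected_times(memcopy_ms, kernel_ms, n_streams)
    return (non_streamed - measured_ms) / (non_streamed - streamed)


if __name__ == "__main__":
    # (label, measured 4-stream time, memcopy, kernel) from the two runs above
    runs = [
        ("CUDA_PROFILE unset", 33.79, 33.06, 17.20),
        ("CUDA_PROFILE=1", 50.31, 33.07, 17.21),
    ]
    for label, measured, memcopy, kernel in runs:
        frac = realized_overlap(measured, memcopy, kernel, 4)
        print(f"{label}: realized overlap fraction {frac:.2f}")
```

With the numbers above, the unprofiled run recovers roughly two thirds of the estimated saving, while the CUDA_PROFILE=1 run recovers essentially none: its 4-stream time equals the serialized memcopy + kernel time.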

We tried a couple of GTX 280s and C1060s, with driver versions 180.22, 180.29 and 185.??, on Linux x86 and x86_64 (supported distros, of course), using the 2.1 final toolkit.

Is this expected behaviour? I tend to doubt it, because the documentation explicitly talks about the stream ID (CUDA_profiler_2_1.txt in the doc/ directory). Any clue what we might be doing wrong?

Thanks,

dom

Pretty sure this was fixed in 2.2. In fact, let’s find out!

tim@thor:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./simpleStreams 

running on: Tesla C1060

memcopy:	19.82

kernel:		25.15

non-streamed:	44.86 (44.98 expected)

4 streams:	27.37 (30.11 expected with compute capability 1.1 or later)

-------------------------------

Test PASSED

Press ENTER to exit...

tim@thor:~/NVIDIA_CUDA_SDK/bin/linux/release$ CUDA_PROFILE=1 ./simpleStreams 

running on: Tesla C1060

memcopy:	19.84

kernel:		25.13

non-streamed:	44.88 (44.97 expected)

4 streams:	28.01 (30.09 expected with compute capability 1.1 or later)

-------------------------------

Test PASSED

Press ENTER to exit...

So yes, this is fixed in 2.2: the 4-stream time is essentially the same with CUDA_PROFILE=1 (28.01) as without it (27.37).
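The with/without comparison used in both posts can also be scripted, which makes it easier to repeat across driver or toolkit versions. This is only a sketch: the binary path is an assumption, the helper names are hypothetical, and the parser just looks for the "N streams:" line as it appears in the output above.

```python
import os
import re
import subprocess

# Matches lines such as "4 streams: 33.79 (..." or "4 streams:\t28.01 (..."
STREAMED_RE = re.compile(r"^\s*(\d+)\s+streams:\s*([0-9.]+)", re.MULTILINE)


def build_env(profile):
    """Copy the current environment with CUDA_PROFILE set to "1" or removed."""
    env = dict(os.environ)
    if profile:
        env["CUDA_PROFILE"] = "1"
    else:
        env.pop("CUDA_PROFILE", None)
    return env


def parse_streamed_ms(output):
    """Extract the 'N streams:' timing from simpleStreams-style output."""
    match = STREAMED_RE.search(output)
    if match is None:
        raise ValueError("no 'N streams:' line found in output")
    return float(match.group(2))


def streamed_ms(binary, profile):
    """Run the binary once and return its streamed timing."""
    # input="\n" answers the "Press ENTER to exit" prompt seen in the runs above
    result = subprocess.run([binary], env=build_env(profile), input="\n",
                            capture_output=True, text=True)
    return parse_streamed_ms(result.stdout)


if __name__ == "__main__":
    binary = "./simpleStreams"  # assumed path; adjust to your SDK build
    if os.path.exists(binary):
        plain = streamed_ms(binary, profile=False)
        profiled = streamed_ms(binary, profile=True)
        print(f"streamed time without CUDA_PROFILE: {plain:.2f}")
        print(f"streamed time with CUDA_PROFILE=1:  {profiled:.2f}")
    else:
        print(f"{binary} not found; nothing to run")
```

Building the environment per run avoids any dependence on whether the launching shell is csh (setenv) or bash (VAR=1 prefix), which differ between the two posts above.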

Now that’s what I call a fast turnaround time! Great, thanks :)

Is there a list of known outstanding issues in CUDA 2.2? Knowing about them up front would save us time during development. Thanks.

Gordon

The lack of async support was just a limitation in the profiler, and it has been fixed in 2.2. I’m not aware of any really obvious limitations anymore, although I bet CUDA_PROFILE still sets CPU affinity to a single core (to be able to produce meaningful timestamps).
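If anyone wants to check the affinity guess, on Linux the allowed-CPU set of a running process can be read from outside while the CUDA program runs. A minimal Python sketch, assuming Linux (os.sched_getaffinity is not available elsewhere); the helper names are hypothetical, and since affinity is per thread this only looks at the main thread of the given PID:

```python
import os
import sys


def allowed_cpus(pid=0):
    """Return the sorted CPUs the process may run on (pid 0 = this process)."""
    return sorted(os.sched_getaffinity(pid))


def is_pinned_to_single_core(pid=0):
    """True if the process is restricted to exactly one CPU."""
    return len(allowed_cpus(pid)) == 1


if __name__ == "__main__":
    # Optional argument: PID of a running CUDA program to inspect
    has_pid = len(sys.argv) > 1 and sys.argv[1].isdigit()
    pid = int(sys.argv[1]) if has_pid else 0
    cpus = allowed_cpus(pid)
    print(f"pid {pid or os.getpid()}: allowed CPUs {cpus}")
    if len(cpus) == 1:
        print("pinned to a single core")
    else:
        print("not pinned to a single core")
```

Running it against the PID of a long-running CUDA program, once launched with CUDA_PROFILE=1 and once without, should show whether the profiler narrows the set.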