GTX480 Streams Issues

Hello,

I’m having trouble with this GPU when I try to use streams in my code. I’ve checked the three requirements for overlap (deviceOverlap is reported as supported, the kernel executions and data transfers to be overlapped are issued in different non-default streams, and the host memory involved is pinned), but the NVIDIA profiler shows that the data transfers and kernels do not overlap.
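To make those three requirements concrete, here is a minimal sketch. It is not the code from my application and not the basic example mentioned below; the kernel `scale`, the helper `blocksFor`, and all sizes are hypothetical. It reads the `cudaDevAttrAsyncEngineCount` attribute, which is the newer counterpart of the deviceOverlap flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;                     // elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    // Requirement 1: the device must have at least one async copy engine.
    int engines = 0;
    cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
    printf("async copy engines: %d\n", engines);

    // Requirement 3: host memory must be pinned (page-locked).
    float *h_data = nullptr, *d_data = nullptr;
    cudaMallocHost((void **)&h_data, n * sizeof(float));
    cudaMalloc((void **)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Requirement 2: copies and kernels go into different non-default streams.
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<blocksFor(chunk, 256), 256, 0, streams[s]>>>(d_data + offset, chunk, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams and report the final status.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Whether the profiler timeline actually shows overlap for a pattern like this still depends on the device, the driver model, and the profiler settings.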

Finally, I tried running this basic example, just to make sure the problem is not in my code…

But it still does not work: the memcpys and kernels do not overlap.

Does anyone know whether this GPU has some kind of problem with streams?

It may help you.
I just copied the code and ran it on my Tesla K20c, and the overlap works perfectly fine. H2D, kernel, and D2H all overlap.

Ok, thanks!

So, you simply copied the code, compiled it, and ran the .exe under the profiler, without adding anything. Is this correct?

Yes.

I am not familiar with the GTX 480 or the linked example. A couple of thoughts:

(1) The GTX 480 has a single Copy Engine (DMA engine), while the Tesla K20c has two. This means that the GTX 480 can overlap kernel execution with a copy in one direction (either host->device, OR device->host), but cannot perform simultaneous up- and downloads. The Tesla K20c can, at the same time, execute a kernel, transfer data host->device, and transfer device->host.

(2) There are various ways in which concurrent copies could be disabled. For example, setting the environment variable CUDA_LAUNCH_BLOCKING=1, invoking the profiler with --concurrent-kernels-off, or enabling serialized trace. Check the option under Nsight|Options…|Analysis|CUDA Kernel Trace Mode.
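Part of both points can be checked from a small program. The following is only a sketch: the helpers `describeCopyEngines` and `launchBlockingSet` are hypothetical names, and treating only the value `1` as "enabled" is an assumption of the sketch. Profiler and Nsight settings cannot be read this way and still have to be checked by hand.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: explains what a given number of copy engines allows.
const char *describeCopyEngines(int count) {
    if (count <= 0) return "no copy/kernel overlap";
    if (count == 1) return "kernel + copy in one direction at a time";
    return "kernel + simultaneous host->device and device->host copies";
}

// Hypothetical helper: true when the variable's value is exactly "1".
bool launchBlockingSet(const char *value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

int main() {
    int device = 0, engines = 0, concurrentKernels = 0;

    // Query the number of DMA copy engines and concurrent-kernel support.
    cudaError_t err = cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device);
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&concurrentKernels, cudaDevAttrConcurrentKernels, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("copy engines: %d (%s)\n", engines, describeCopyEngines(engines));
    printf("concurrent kernels supported: %s\n", concurrentKernels ? "yes" : "no");

    // CUDA_LAUNCH_BLOCKING=1 in the environment serializes launches.
    printf("CUDA_LAUNCH_BLOCKING set to 1: %s\n",
           launchBlockingSet(std::getenv("CUDA_LAUNCH_BLOCKING")) ? "yes" : "no");
    return 0;
}
```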

OK, I reviewed all these ideas and it is still not working. One question: which operating system are you using, Linux or Windows?

Me?
Anyways, Linux.

Note this: under Linux, my code uses streams without problems. I do not know whether there is some issue between this card (GTX 480) and Windows that prevents data transfers and kernels from overlapping.

Thanks for your help!

In my experience, it’s difficult to get a WDDM GPU in Windows to work correctly with concurrency. One of the issues is that WDDM batches commands to the GPU. This batching can interfere with the expected sequencing of operations, which becomes visible when you try to profile the app.

You won’t be able to put your GeForce device in TCC mode, but for GPUs that can be run in TCC mode, it’s usually easier to get expected results in these cases.
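As a rough sketch of how one might check the driver model and try a commonly suggested workaround: calling `cudaStreamQuery` right after enqueuing work, in the hope that it nudges the WDDM driver to submit its pending batch. Whether this helps on a given system is an assumption, not a guarantee. The kernel `addOne` and the helper `driverModelName` are hypothetical names.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel: adds one to each element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: names the driver model from the TCC attribute.
const char *driverModelName(int tccDriver) {
    return tccDriver ? "TCC" : "non-TCC (WDDM on Windows)";
}

int main() {
    // Report whether the device is running under the TCC driver.
    int tcc = 0;
    cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, 0);
    printf("driver model: %s\n", driverModelName(tcc));

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, bytes);      // pinned host buffer
    cudaMalloc((void **)&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue copy-in, kernel, and copy-out in one non-default stream.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // Non-blocking status query. Assumption: under WDDM this also prompts
    // the driver to submit the queued commands instead of holding them.
    cudaError_t status = cudaStreamQuery(stream);
    printf("status right after enqueue: %s\n", cudaGetErrorString(status));

    cudaStreamSynchronize(stream);           // wait for the work to finish
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```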