Odd cudaStream_t behavior

So I’ve got 8 streams allocated on a GPU doing overlapping cudaMemcpyAsync calls and then launching a series of kernels. The basic flow is as follows:

for(i = 0; i < 8; i++)
    create stream #i
for(i = 0; i < 8; i++)
    memcpyasync(… stream #i, host to device);
for(i = 0; i < 8; i++)
    launch kernel in stream #i
for(i = 0; i < 8; i++)
    memcpyasync(… stream #i, device to host);
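For reference, a minimal compilable sketch of that flow (buffer names, sizes and the kernel are hypothetical; pinned host memory is required for cudaMemcpyAsync to actually overlap):

```
cudaStream_t stream[8];
for (int i = 0; i < 8; i++)
    cudaStreamCreate(&stream[i]);

// One chunk of the input per stream: copy in, process, copy out.
for (int i = 0; i < 8; i++)
    cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 8; i++)
    myKernel<<<blocks, threads, 0, stream[i]>>>(d_in + i * chunk, d_out + i * chunk);
for (int i = 0; i < 8; i++)
    cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
```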

Unfortunately, every time a kernel is queued for launch, cudaGetLastError() returns cudaErrorLaunchOutOfResources. This behaviour does not occur in device emulation mode. Further, when I increase the stream count from 2 to 32 in the stream example from the CUDA programming guide, I get similar results. I’m running the current version of CUDA and have tried all CUDA 2.0 revisions that support the GTX 280, with the same results.


The exact same kernel launches fine in another program without streams, with 512 blocks and 512 threads per block.

Inside the program that uses streams I can run at most 256 blocks with 256 threads per block in a single launch. Further, the version that uses streams does not appear to make any system call that reports an error, so the bug is definitely in libcudart.

This project has turned up numerous bugs in cudart. Is there an email address to which I can forward a list of these, along with code and debugging output? Or is this forum the only means of reporting bugs?


Some more information:

I am using shared memory, but it’s far below the maximum shared memory per block reported by my card (approximately 1/8th of it). It also appears this bug only triggers if cudaGetDeviceProperties is called on the device after cudaSetDevice. When I reorder these calls I still get the same error with 512 threads per block, but not with 384… Definitely something fishy in the CUDA runtime lib, since the ioctls stop after the async memcpy request.

I think you need to call cudaStreamSynchronize(l_asyncStream) followed by cudaThreadSynchronize() at the end (and not only cudaThreadSynchronize()).
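Something like this, assuming an array of 8 streams as in the original post:

```
// Drain each stream explicitly, then the whole device.
for (int i = 0; i < 8; i++)
    cudaStreamSynchronize(stream[i]);
cudaThreadSynchronize();  // in later CUDA releases, cudaDeviceSynchronize()
```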

How many registers is your kernel using? On G80/G92 you can have at most 16 registers per thread when you want to use 512 threads per block (8192 registers / 512 threads). The number of blocks should not matter, except when going beyond 65535 in one grid dimension.
If your kernel uses less than 16 KB of shared memory you will also never have a problem launching the kernel (it just might limit the number of concurrent blocks per SM, but with 512 threads you will have just 1 block per SM on G80/G92 anyway).

It turns out to be a combination of things. One is a bug in CUDA’s resource-usage estimate (it never asks the card, it just looks at the fatbin headers), and the second is a bug in nvcc involving improper unrolling of loops. Must we write everything in PTX? If so, please keep the PTX ISA documentation in sync.

The numbers in the .cubin are not estimates; it is machine code that knows which registers it uses, so those numbers should be exact. Otherwise it would be smart to file a bug report.

If a loop gets improperly unrolled you can put #pragma unroll 1 on the line before it to prevent unrolling. But maybe it is smart to also file a bug for this one.
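Applied to the loop quoted later in this thread, that would look something like this (the kernel wrapper and variable names are made up for illustration):

```
__global__ void scramble(unsigned int *out, const unsigned int *x, unsigned int v)
{
    unsigned int b = 0;
    #pragma unroll 1   // keep the loop rolled; the compiler must not unroll it
    for (int i = 0; i < 32; i++) {
        b ^= x[v & 1];
        v >>= 1;
    }
    out[threadIdx.x] = b;
}
```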

I checked the binary: 17 registers per thread, 512 threads per block and 2 KB of shared memory per block. Far below the limits of any CUDA-compatible hardware. I’ve sent a reproduction stub to a fellow from NVIDIA who contacted me, but have yet to hear back whether he got it.

Unrolling this loop is a no-brainer as the iterator has nothing to do with the statements in the loop:

for(i = 0; i < 32; i++) {
    b ^= x[v & 1];
    v >>= 1;
}


17 * 512 = 8704, which is above the 8192 registers available on compute capability 1.0/1.1 hardware. Unless you are using GT200 hardware it won’t run.