launching kernels using value from device memory

Say I have a kernel that does some work, computes some integer N, and stores it at device memory address d_s

I want to launch a new kernel with a number of blocks equal to N, which is stored at d_s

I don’t want to have to cudaMemcpy the value of *d_s back to the host to launch the kernel, as cudaMemcpy is very slow. Any tips?

And obviously something like kernel_launch<<<*d_s, blah>>>(); fails miserably, since d_s is a device pointer and can’t be dereferenced on the host

Is there an upper limit N_max known a priori? If so, you could launch a kernel with N_max blocks, passing in a pointer to d_s. The threads in each block would check their block index against d_s (which contains N) and exit immediately if the block index is greater than or equal to the contents of d_s (i.e. N).
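The over-launch idea above could look something like this. This is a minimal sketch: N_MAX, threadsPerBlock, and the kernel body are placeholders; only the pointer d_s holding N comes from the thread.

```cuda
// Kernel launched with the a-priori upper bound N_MAX blocks.
// d_s points to the integer N written by the previous kernel.
__global__ void kernel_launch(const int *d_s)
{
    // Every thread reads N; within a warp this is a single broadcast load.
    if (blockIdx.x >= *d_s)
        return;              // surplus blocks (N .. N_MAX-1) exit immediately

    // ... real work for blocks 0 .. N-1 ...
}

// Host side: launch the known upper bound and let the extra blocks retire.
// kernel_launch<<<N_MAX, threadsPerBlock>>>(d_s);
```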

Hmm, that seems more like a hack to me than a solution - each thread in a block would still need to do a global read (broadcast within a warp, I guess?) of that value.

BUMP! anyone?

Try using zero-copy memory; the transfer is implicit and pretty fast.
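A sketch of the zero-copy route: keep N in mapped (pinned) host memory, let the first kernel write it through the device-side alias, and read it directly on the host at launch time. The names h_s, d_s_alias, and compute_n are placeholders, not from the thread.

```cuda
int *h_s, *d_s_alias;

// Must be set before the CUDA context is created.
cudaSetDeviceFlags(cudaDeviceMapHost);

// Pinned host allocation that is also mapped into the device address space.
cudaHostAlloc(&h_s, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_s_alias, h_s, 0);

// First kernel stores N through the device alias.
compute_n<<<grid, block>>>(d_s_alias);
cudaDeviceSynchronize();   // make the device's write visible to the host

// *h_s now holds N; use it directly as the grid size.
kernel_launch<<<*h_s, block>>>();
```

The write still crosses PCIe, but only as a single implicit 4-byte transaction rather than an explicit cudaMemcpy call.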

Or try placing it in the constant memory space.

In any case, the value should end up in the L1 and L2 caches, so it shouldn’t require more than a few global reads. When a new block is launched on a given SM, it can check whether or not to run based on the data already in that SM’s L1 cache.

Say the upper limit is about 250,000 threads in total (448 blocks × 512 threads), and after computing N I only need about 125,000 (256 × 512).

That’s still an extra 125,000 threads reading a global value; even if the reads are all coalesced (broadcast), each warp would still need to fetch at least once, no?

i.e. 125,000 / 32 ≈ 4,000 extra global reads.

No, I wouldn’t think so. For a simple global fetch on the Fermi architecture, only the first, say, 15 blocks would do the actual global fetch (which caches the value N in fast L1); once the same SM switches to a new block, the data should still be in L1, so no new fetch is required.

The other option is to do a cudaMemcpyDeviceToDevice into the constant memory space, which is quickly accessible by all blocks.
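The constant-memory variant might be sketched as follows. Note it only removes the global read inside the kernel; the launch itself still needs the a-priori upper bound N_MAX, since the grid dimension is still unknown on the host. The symbol name c_n and the kernel body are my placeholders.

```cuda
__constant__ int c_n;   // holds N, visible to all blocks via the constant cache

__global__ void kernel_launch_body()
{
    // Constant-cache broadcast; no global memory read per warp.
    if (blockIdx.x >= c_n)
        return;
    // ... work for blocks 0 .. N-1 ...
}

// Host side, after the first kernel has written N to d_s:
// device -> constant copy, never touching host memory.
// cudaMemcpyToSymbol(c_n, d_s, sizeof(int), 0, cudaMemcpyDeviceToDevice);
// kernel_launch_body<<<N_MAX, block>>>();
```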

About “Hmm seems more like a hack to me than a solution”: I would say you are definitely doing something a little out of the ordinary, and if you want code that looks streamlined and generic, maybe you should go with mfatica’s suggestion.

Hope this helps.

Just cuMemcpyDtoH and launch with the normal mechanism. It’s not going to slow you down, and it’s hardly worth trying the crazy ideas mentioned here. But it is interesting that Direct3D 11 (which is generally pretty bad for GPGPU) does actually have exactly this feature: ID3D11DeviceContext::DispatchIndirect.
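For reference, the straightforward route being suggested is just a 4-byte copy back to the host, which costs on the order of the kernel-launch latency itself (a few microseconds). Variable names here are placeholders; the runtime-API cudaMemcpy is shown in place of the driver-API cuMemcpyDtoH, which does the same thing.

```cuda
int n;
// 4-byte device-to-host copy; also implicitly synchronizes with the
// kernel that wrote *d_s.
cudaMemcpy(&n, d_s, sizeof(int), cudaMemcpyDeviceToHost);

// Normal launch with the now-known grid size.
kernel_launch<<<n, threadsPerBlock>>>();
```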