Is there an upper limit N_max known a priori? If so, you could launch a kernel with N_max blocks, passing in a pointer to d_s. The threads in each block would check their block index against d_s (which contains N) and exit immediately if the block index is greater than or equal to the contents of d_s (i.e. N).
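A minimal sketch of that idea, assuming a device pointer d_n holding the computed N and a placeholder workload (names and the doubling operation are illustrative, not from the original):

```cuda
// Early-exit sketch: launch the a-priori maximum number of blocks and let
// blocks with index >= *d_n return immediately.
__global__ void process(const int *d_n, float *data)
{
    // Every thread in the block reads the same address, so the read is a
    // broadcast within the warp; blocks beyond N-1 do no work.
    if (blockIdx.x >= *d_n)
        return;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;   // placeholder for the real per-element work
}

// Host side, with N_MAX the known upper bound on the block count:
// process<<<N_MAX, 512>>>(d_n, d_data);
```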
Hmm, that seems more like a hack to me than a solution - each thread in a block would still need to do a global read (broadcast, I guess?) to get that value.
Anyway, the value should be cached in L1 and L2, so it shouldn't require a global read more than a few times. When a new block is launched on a given SM, it can query whether or not to run based on the data already sitting in that SM's L1 cache.
Say the upper limit is roughly 250,000 threads in total (448 blocks of 512 threads), and upon calculating N I only needed about 125,000 (256 blocks of 512 threads).
That's still an extra 125,000 threads reading a global value - even if the reads are all coalesced (broadcasts), each warp would still need to fetch at least once, no?
No, I wouldn't think so. With simple global fetches on the Fermi architecture, only the first, say, 15 blocks would do an actual global fetch (which stores the value N in the fast L1 cache); once the same SM context-switches to a new block, the data should still be in L1, so no new fetch is required.
The other option is to copy the value into constant memory space with a device-to-device cudaMemcpyToSymbol, which is quickly accessible by all blocks.
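A sketch of the constant-memory variant, assuming the same device pointer d_n holding N (the symbol name c_n and the workload are illustrative):

```cuda
// Copy the device-resident N into __constant__ memory so every block reads
// it through the constant cache rather than global memory.
__constant__ int c_n;

__global__ void process(float *data)
{
    // After the first access, this read is served by the constant cache.
    if (blockIdx.x >= c_n)
        return;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;   // placeholder for the real per-element work
}

// Host side, with d_n the device pointer written by the previous kernel:
// cudaMemcpyToSymbol(c_n, d_n, sizeof(int), 0, cudaMemcpyDeviceToDevice);
// process<<<N_MAX, 512>>>(d_data);
```

The copy stays on the device; the host never sees N.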
About "Hmm seems more like a hack to me than a solution": I would say you are definitely doing something a little out of the ordinary, and if you want code that looks streamlined and generic, maybe you should go with mfatica's suggestion.
Just cuMemcpyDtoH and launch with the normal mechanism. It’s not going to slow you down. It’s hardly worth trying the crazy ideas mentioned here. But it is interesting that Direct3D 11 (which is generally pretty bad for GPGPU) does actually have this feature: ID3D11DeviceContext::DispatchIndirect.
I assume "the value to launch the next kernel" refers to something needed for the block or grid dimensions. Another possibility is to launch a kernel of fixed dimensions - perhaps with a grid-stride loop - and have the loop traversal limits use the value in device memory. No copy or other transfer of data from D->H is needed in that case.
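A sketch of that grid-stride variant, again assuming a device pointer d_n holding the element count N (the launch dimensions and workload are illustrative):

```cuda
// Grid-stride loop with a fixed launch configuration: the loop bound comes
// from device memory, so no device-to-host copy of N is needed.
__global__ void process(const int *d_n, float *data)
{
    int n = *d_n;  // one broadcast read, then reused for the whole loop
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= 2.0f;   // placeholder for the real per-element work
    }
}

// Host side: pick a fixed grid sized to fill the machine, e.g.
// process<<<numSMs * blocksPerSM, 256>>>(d_n, d_data);
```

If N turns out smaller than the grid, the extra threads fall out of the loop immediately; if it is larger, each thread simply processes more than one element.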