Is there an upper limit N_max known a priori? If so, you could launch the kernel with N_max blocks and pass in a pointer to d_s. The threads in each block would check their block index against d_s (which contains N) and exit immediately if the block index is greater than or equal to the contents of d_s (i.e. N).
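A minimal sketch of that idea, assuming N_max is a known upper bound and d_s points to a single device int holding the true block count N (names are illustrative):

```cuda
// Assumes d_s points to one int in global memory holding N (N <= N_max).
__global__ void kernel(const int *d_s /*, other args */)
{
    // Blocks with index >= N retire immediately; only blocks 0..N-1 do work.
    if (blockIdx.x >= *d_s)
        return;

    // ... real work for this block ...
}

// Host side: always launch the upper bound; the surplus blocks exit early.
// kernel<<<N_max, threadsPerBlock>>>(d_s /*, other args */);
```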
In any case, the value should be cached in L1 and L2, so it shouldn't require a global read more than a few times. When a new block is launched on a given SM, it can check whether to run based on the data already sitting in the L1 cache.
No, I wouldn't think so. In the case of simple global fetches on the Fermi architecture, the first, say, 15 blocks would do the actual global fetch (which stores the value N in the fast L1 cache); once the same SM context-switches to a new block, the data should still be in L1, so no new fetch is required.
The other option is to do a cudaMemcpyDeviceToDevice into constant memory space, which is quickly accessible by all blocks.
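A sketch of the constant-memory variant, assuming the same d_s pointer as above (the symbol name c_N is illustrative):

```cuda
// N is broadcast to all blocks through the constant cache.
__constant__ int c_N;

__global__ void kernel(/* args */)
{
    // Constant-cache read; cheap once warmed, uniform across the grid.
    if (blockIdx.x >= c_N)
        return;

    // ... real work ...
}

// Host side, before the launch: device-to-device copy from d_s into the
// constant symbol, then launch with the N_max upper bound as before.
// cudaMemcpyToSymbol(c_N, d_s, sizeof(int), 0, cudaMemcpyDeviceToDevice);
// kernel<<<N_max, threadsPerBlock>>>(/* args */);
```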
About "Hmm, seems more like a hack to me than a solution": I would say you are definitely doing something a little out of the ordinary, and if you want code that looks streamlined and generic, maybe you should go with mfatica's suggestion.
Just cuMemcpyDtoH and launch with the normal mechanism. It’s not going to slow you down. It’s hardly worth trying the crazy ideas mentioned here. But it is interesting that Direct3D 11 (which is generally pretty bad for GPGPU) does actually have this feature: ID3D11DeviceContext::DispatchIndirect.
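For completeness, the straightforward approach sketched here with the runtime-API equivalent of cuMemcpyDtoH (d_s as above):

```cuda
// Copy N back to the host, then launch with the normal mechanism.
int N;
cudaMemcpy(&N, d_s, sizeof(int), cudaMemcpyDeviceToHost);

// kernel<<<N, threadsPerBlock>>>(/* args */);
```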