What happens if every thread in a warp reads from the same address in global memory? Is this read in one memory transaction and broadcast to all threads or is this counted as non-coalesced, requiring 32 separate memory transactions? I have been using constant memory for this access pattern, but I am running out of it.
Unless you are using a 2xx-class card, there will be one separate (i.e. uncoalesced) read for each thread. Put differently: don’t even think about doing it like that.
You could use tex1Dfetch, but IMO the performance is not really good, and I think textures are likely to cause pain if you use multiple host threads (though I think CUDA + multiple host threads probably always is).
Just load it into shared memory once (either once per block + syncthreads or once per warp) and use it from there.
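To make the once-per-block variant concrete, here is a minimal sketch. The kernel name scaleByShared, the helper scaleOnDevice, and the block size of 128 are all made up for illustration, and error checking is left out, so treat it as a starting point rather than tuned code.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread needs the same value, *coeff.
// One thread per block reads it from global memory; all others use the shared copy.
__global__ void scaleByShared(const float* coeff, const float* in, float* out, int n)
{
    __shared__ float c;              // one copy per block

    if (threadIdx.x == 0) {
        c = *coeff;                  // single global read per block
    }
    __syncthreads();                 // make c visible to every thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = c * in[i];          // all threads read the same shared word (broadcast)
    }
}

// Hypothetical host helper (error checking omitted for brevity).
void scaleOnDevice(float coeff, const float* hIn, float* hOut, int n)
{
    float *dCoeff, *dIn, *dOut;
    cudaMalloc(&dCoeff, sizeof(float));
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dCoeff, &coeff, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    scaleByShared<<<blocks, threads>>>(dCoeff, dIn, dOut, n);

    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCoeff);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Note that __syncthreads() sits before the bounds check, so every thread in the block reaches it even when n is not a multiple of the block size.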
On 2xx hardware, reading the same global address directly probably has similar (maybe somewhat better) performance to the “load into shared memory once per warp” approach, but if you are memory-bandwidth bound you can still be faster by caching the value manually in shared memory.
Can you show how you are using it?
How do you put data into constant memory?
From the Programming Guide:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
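For context, here is a rough sketch of how a kernel might then read that symbol. The 256-element size, the kernel name sumWithConst, and the helper applyConst are assumptions for illustration only, and error checking is omitted. On each loop iteration all threads in a warp read the same constData[k], which is the access pattern constant memory is meant for.

```cuda
#include <cuda_runtime.h>

#define NUM_COEFFS 256                      // assumed table size

__constant__ float constData[NUM_COEFFS];   // constant memory, read-only inside kernels

// Hypothetical kernel: on each iteration every thread reads the same constData[k].
__global__ void sumWithConst(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < NUM_COEFFS; ++k) {
        acc += constData[k] * in[i];        // same address across the warp
    }
    out[i] = acc;
}

// Hypothetical host helper (error checking omitted for brevity).
// hData must point to NUM_COEFFS floats.
void applyConst(const float* hData, const float* hIn, float* hOut, int n)
{
    // Copy the host table into the __constant__ symbol.
    cudaMemcpyToSymbol(constData, hData, NUM_COEFFS * sizeof(float));

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    sumWithConst<<<blocks, threads>>>(dIn, dOut, n);

    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```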