If multiple threads access the same memory location in constant memory, and that location resides in the local constant memory cache, will the accesses be serialized? Like they would if multiple threads access the same shared memory location?
I don't think so. I am getting excellent performance from constant memory with all threads hammering the same location in the constant cache. In my project it's polygon edge coordinates: all screen pixels get tested against these, with each thread in a block testing a different screen pixel against the same edges.
If those accesses were serialized, I couldn't possibly be seeing the performance I'm seeing.
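A minimal sketch of that access pattern (the names `d_edges`, `NUM_EDGES`, and the half-plane edge test are illustrative assumptions, not the poster's actual code):

```cuda
// Hypothetical sketch: on each loop iteration every thread in the warp
// reads the SAME __constant__ entry, so the value is broadcast from the
// constant cache instead of being serialized.

#define NUM_EDGES 64  // illustrative; constant memory size is fixed at compile time

// Edge stored as line coefficients (a, b, c): a*x + b*y + c >= 0 means "inside".
__constant__ float3 d_edges[NUM_EDGES];

__global__ void testPixels(int width, unsigned char *inside)
{
    // Each thread handles one screen pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    bool in = true;
    for (int e = 0; e < NUM_EDGES; ++e) {
        // Same address across the whole warp -> broadcast read.
        float3 edge = d_edges[e];
        in = in && (edge.x * x + edge.y * y + edge.z >= 0.0f);
    }
    inside[y * width + x] = in ? 1 : 0;
}
```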
In fact, this is the ideal way to access constant memory. It is fastest when the value is broadcast to all threads in a warp.
Warps are serialized when threads in the warp read different values in the constant memory, making textures and/or shared memory potentially more attractive for this memory pattern.
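To illustrate the contrast, here is a hypothetical kernel with both access patterns side by side (names and sizes are made up):

```cuda
__constant__ float table[256];

__global__ void constReads(float *out, const int *idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Broadcast: every thread in the warp reads the same address,
    // so this is effectively a single constant-cache access.
    float fast = table[0];

    // Divergent: threads read different addresses, so the warp's
    // constant-cache accesses get serialized.
    float slow = table[idx[tid] & 255];

    out[tid] = fast + slow;
}
```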
OK… so when all reads are to the same address, it’s best to use constant memory. And if reads are to different addresses, it’s best to use shared memory (or textures).
A few more questions:
- Does the texture cache have a certain number of banks, like the shared memory does?
- Do registers have similar restrictions? (E.g., if multiple threads read the same register, do they get serialized?)
The thing is that I'm going to make semi-random read accesses, and I can't guarantee that the reads will go to different banks. I'm also wondering why there is such a restriction on reading from the same address; a read-after-read sequence shouldn't pose any synchronization problem.
Not that I’m aware of.
Registers are allocated per-thread, and you can’t read one thread’s registers from another.
The documentation doesn’t really say, but I would guess that there is only one constant memory read unit per warp in the hardware. Hence, when a warp accesses multiple values from constant memory, it must serialize access to that hardware unit. But that is just a guess. Consider that constant memory is probably the underlying hardware that graphics shaders use for constant parameters to the shader. In that case, every thread in the shader is reading the same parameter simultaneously, so the hardware would be optimized for this use case to give the best graphics performance.
If you are going to make semi-random read accesses and are lucky enough that your data fits in shared memory, that will probably be your best bet. But it never hurts to try out the various ways to see which is faster in your circumstance.
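For example, staging the table into shared memory first might look like this (a sketch; `TABLE_SIZE` and the names are assumptions):

```cuda
#define TABLE_SIZE 1024  // assumed small enough to fit in shared memory

__global__ void randomReads(const float *g_table, const int *idx, float *out)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperatively copy the table from global to shared memory.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    // The semi-random reads now hit shared memory. Bank conflicts can
    // still serialize accesses to different words in the same bank, but
    // threads reading the SAME word are served by a broadcast.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = s_table[idx[tid] % TABLE_SIZE];
}
```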