I’m working on the design of a kernel that requires a full set of arrays (~30) stored in the device’s global memory. The question that now arises is how to pass the addresses of these arrays to my kernel function.
The simplest strategy (the one used in the SDK samples) is to pass the pointers as arguments to the kernel. These (array) pointers will then be stored in shared memory (am I right on this point?) and will thus be accessible from all the threads in my block. There won’t be any bank conflict, because all threads read the pointer from the same address, which is a broadcast access. Each thread will then read global memory through this pointer.
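For concreteness, here is a minimal sketch of that first strategy, assuming the pointers are grouped in a struct passed by value; the struct and all names are illustrative, not from my actual code:

```cpp
// Hypothetical container for the ~30 device pointers.
struct ArrayTable {
    float* a;
    float* b;
    // ... up to ~30 pointers in total
};

// The struct is passed by value, so the pointers travel with the kernel
// parameters (shared memory on compute 1.x, constant memory on 2.x).
__global__ void kernelByValue(ArrayTable t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        t.b[i] = t.a[i];   // pointers come straight from the parameter area
}
```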
As I may run short of shared memory, I tried to think of another solution. I’m not using much of my constant memory, and since it is cached, accessing it is pretty fast. I could therefore store the pointers to the global-memory arrays in constant memory and have all my threads read from there to find the addresses of the arrays in global memory.
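A minimal sketch of what I have in mind, assuming a fixed-size pointer table in constant memory (again, names are illustrative):

```cpp
#define NUM_ARRAYS 30

// Pointer table in constant memory; every thread reads the same entry,
// so the access is a cached broadcast.
__constant__ float* g_arrays[NUM_ARRAYS];

__global__ void kernelConstTable(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        g_arrays[1][i] = g_arrays[0][i];
}

// Host side, done once: h_arrays[] holds the cudaMalloc'd device pointers.
// cudaMemcpyToSymbol(g_arrays, h_arrays, sizeof(h_arrays));
```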
I haven’t found much information about constant memory in the documentation, so I don’t know how big the cache is or how fast a (broadcast) read from constant memory is. Could you help me here?
Could you also tell me whether the strategy I’m trying to implement is smart or not? Would you advise another way of doing this?
Thanks
P.S.: The other solution I have is to group all my arrays into a single one and then play with the indices on the device. This would, however, require a big refactoring of my code.
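For illustration, a sketch of that single-array variant, with per-array offsets computed on the host (everything here is assumed, not taken from my actual code):

```cpp
// One big allocation; array k starts at base + its precomputed offset.
// Only two offsets are shown here, passed as plain kernel arguments.
__global__ void kernelOffsets(float* base, int offA, int offB, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        base[offB + i] = base[offA + i];   // index arithmetic replaces pointers
}
```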
Mmmhh. OK. But if I allocate this in the device’s global memory, then reading any value in one of my arrays will require two reads from global memory instead of one. Am I missing something?
You are correct. The two reads will have disastrous performance consequences on compute 1.x hardware; it is not so bad on 2.x, where the pointer will likely be kept in the L1 cache. And while we are on the topic of compute 2.x: it automatically places all kernel parameters in constant memory for you.
Are you really so worried about losing 240 bytes of shared memory that you don’t want to pass the pointers into the kernel by value? At that level, what you should be worried about is the 256-byte limit on kernel function parameters on compute 1.x.
Sure, if the compiler keeps the pointer around in a register. Common subexpression elimination would probably do this effectively even if you used pData->array everywhere.
I guess maybe I over-worry about the overhead; I’m still used to thinking about coalescing on compute 1.0, where an uncoalesced read runs at 1/16th of peak performance. In a complicated kernel that reads pData->array dozens of times, the overhead of the pointer read is probably not that large. In a simple kernel, however, it could be huge. For example:
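(The code example originally linked here is missing from the thread; the following is a hedged reconstruction that matches the byte counts discussed below, with all names and types assumed.)

```cpp
struct Data { int* in; int* out; /* ... plus ~20 more pointers ... */ };

// First kernel: the pointer struct lives in global memory, so each thread
// loads two 8-byte pointers on top of its one int read and one int write.
__global__ void kernelIndirect(Data* pData, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pData->out[i] = pData->in[i];
}

// Second kernel: the pointers arrive by value as kernel parameters, so each
// thread touches global memory only for one int read and one int write.
__global__ void kernelDirect(int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```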
In the first kernel, there are 2 pointer reads, 1 (let’s say int) read and 1 write, for a total of 2×8 + 2×4 = 24 bytes per thread.
In the second, only 8 bytes per thread are accessed, so the kernel will likely run 3× faster (or even more if you factor in the cost of the uncoalesced pointer read).
Arguably, one probably would not pass a struct of 20 pointers into such a simple kernel; I’m just memory bandwidth bound in almost every kernel I write, so I have a habit of avoiding absolutely all unnecessary memory reads.