I have an array which contains 48 elements with each element of size 4 bytes. The array is read-only for both the host and the device.
For the purpose of optimization, i thought i should declare the array as global constant, but according to the cuda documentation
“the constant cache is best when threads in the same warp accesses only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access”
But in my case every thread in a warp will access random element of the array, so most probably the cache-miss will occur.
The same array is used by both the host function and the device kernel.
In which manner can i declare this read-only array so that only one definition is used by both (host and device).
Because currently i have two definition of the same lookup table, one for the device in , say kernel.cu; and the other for the host in host.c
Is their any other approach besides constant which can somehow benefit me in the performance gain in both (host and device).
Constant memory indeed is a poor choice for the reason you cited. Shared memory would be a much better location.
A lookup table of up to 32 elements can even be placed in registers and accessed via __shfl_sync() (using one register of each thread and keeping separate tables for each warp).
Larger lookup tables that do not fit into shared memory can be placed in a texture, although nowadays there is little difference vs just keeping them in global memory.
In order to avoid duplication you can cudaMemcpy() the table from the host.
If the table is listed literally in the source code, you could also use a #define to avoid duplication.
When there is non-uniform access to the constant cache across a warp, the problem is not cache misses, but serialization. The constant cache can serve one chunk of data per cycle, but has a broadcast feature that can supply that data to all threads in a warp in parallel. If multiple different addresses are presented across the warp, data for these will be served in consecutive cycles until all requests are satisfied (serialization).
Empirically, for up to three different addresses presented across the warp, putting such an array into constant memory is often still the best choice; otherwise use the approach recommended by tera. Since it is relatively easy to misjudge the amount of intra-warp address divergence (been there, done that, got the t-shirt), I would suggest prototyping both solutions and running a quick experiment.
Yet another alternative might be to replace this small table by computation. Sometimes standards will represent a functional relationship as a table which can also be expressed as a simple function. For example, the beta table in H.264 in-loop deblocking (table 8-16 in my PDF copy) expresses a simple piece-wise linear relationship.
It would not be feasible to use registers for lookup table because its almost 256 bytes in size.
I have tried the shared memory approach for the lookup table and the results are acceptable. Though i am a little confused about why the shared load transaction per access is greater then the ideal value i.e. 1, in my case. Because each entry of my lookup table is 4 bytes in size so there shouldn’t be any alignment issues while reading at entries level (4 bytes level).
Bank conflicts shouldn’t be an issue because warp-threads are just READING the values from shared memory (Almost Randomly) and if more then one thread within the warp try to read the same bank, it should just be broadcast (not serialized). Correct me if i’m missing something here.
more than 1 transaction per request will come about if the values are not organized one per bank.
bank conflicts can occur on reads as well as writes.
If two threads access the same location, then broadcast will occur. But if two threads access two separate values in the same bank, then bank conflicts will occur.