This may be a simple question, but is there a good way to load an entire (small) array into a thread’s local memory? Each array is contiguous and aligned, but which array each thread will have to load is not known a priori, and the arrays will most likely not be near each other in memory. Is there a way to “coalesce” the loads for a single thread, since all its private loads will be in order, or will bandwidth always be wasted because individual threads are not accessing adjacent memory simultaneously?
There is no way to “coalesce” loads from a single thread. Coalescing refers to collapsing the requests from multiple threads in a warp into a single transaction for the warp. If many threads are each doing this kind of array load, then an interleaved data format will allow you to coalesce the reads. But if the only loading going on is for a single thread, the best you can do is load the data in the largest chunks possible (16 bytes at a time, e.g. as float4) and load them sequentially to get the cache benefit on cc 2.0 and later. Multiple transactions from a single thread will always be multiple transactions, at least as far as the first level of cache.
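To make the "largest chunks possible" advice concrete, here is a minimal sketch of one thread pulling its own small array into private storage with sequential float4 loads. Everything named here is an assumption for illustration and not taken from the question: the array length `kElems` (16 floats), the `pool`/`start` layout, and the names `loadPrivate` and `run_private_load_demo`. It also assumes every start offset is a multiple of 4 floats so that each load is 16-byte aligned.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kElems = 16;          // floats per small array (assumed; multiple of 4)
constexpr int kVecs  = kElems / 4;  // number of 16-byte loads per thread

// Each thread copies one small array into private storage using float4 loads,
// then sums it so the host can check the result.
// start[i] is the index of the first float of thread i's array; it must be a
// multiple of 4 so the float4 loads are 16-byte aligned.
__global__ void loadPrivate(const float* pool, const int* start, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    const float4* src = reinterpret_cast<const float4*>(pool + start[tid]);
    float4 priv[kVecs];             // private copy; fully unrolled, so typically kept in registers
#pragma unroll
    for (int v = 0; v < kVecs; ++v)
        priv[v] = src[v];           // sequential 16-byte loads, one transaction each

    float sum = 0.0f;
#pragma unroll
    for (int v = 0; v < kVecs; ++v)
        sum += priv[v].x + priv[v].y + priv[v].z + priv[v].w;
    out[tid] = sum;
}

// Host-side check: returns true if every thread summed the right array.
bool run_private_load_demo()
{
    const int n = 256;              // threads, one small array each
    const int numArrays = 1024;     // arrays available in the pool
    std::vector<float> hPool(numArrays * kElems);
    for (size_t i = 0; i < hPool.size(); ++i)
        hPool[i] = static_cast<float>(i % 7);   // small integers, exact in float
    std::vector<int> hStart(n);
    for (int i = 0; i < n; ++i)
        hStart[i] = ((i * 37) % numArrays) * kElems;  // scattered, non-adjacent arrays

    float* dPool = nullptr;
    float* dOut = nullptr;
    int* dStart = nullptr;
    bool ok = cudaMalloc(&dPool, hPool.size() * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dStart, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dPool, hPool.data(), hPool.size() * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dStart, hStart.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;

    std::vector<float> hOut(n, -1.0f);
    if (ok) {
        loadPrivate<<<(n + 127) / 128, 128>>>(dPool, dStart, dOut, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Compare each thread's sum with a host-side sum of the same array.
    for (int i = 0; ok && i < n; ++i) {
        float ref = 0.0f;
        for (int k = 0; k < kElems; ++k)
            ref += hPool[hStart[i] + k];
        ok = (hOut[i] == ref);
    }

    cudaFree(dPool);
    cudaFree(dStart);
    cudaFree(dOut);
    return ok;
}
```

Calling `run_private_load_demo()` from a `main` and building with `nvcc` should be enough to try it. Note that this only reduces the number of transactions per thread; the loads of different threads in a warp still go to unrelated addresses, so they are not coalesced. Getting coalescing would need the interleaved layout instead, for example storing element `k` of array `i` at `pool[k * n + i]`, which only works if the set of arrays is known when the data is laid out.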
Perfect explanation, thanks!