Coalesced access in a single thread?

matthew9bc3t · August 27, 2019, 1:19am

Sometimes one thread needs consecutive data. AFAIK the common way to do that is to use the whole block to cooperatively load the data into shared memory with coalesced accesses.

Is there a way to do this without the cooperative load and shared memory hassle?
(One hacky way I can think of: if you need 2 consecutive elements, you can cast the pointer to a type twice the size of the original data, and then you get coalesced access across threads. Obviously this doesn't work for arbitrary sizes.)

Thanks

Robert_Crovella · August 27, 2019, 1:39am

There is no concept of coalescing applied to a single thread. Coalescing refers to address grouping behavior considered warp-wide.

The maximum amount of consecutive data that can be loaded by a single thread in a single instruction is 16 bytes. I usually refer to larger-than-8-byte loads as vector loads, because they require using a vector type (e.g. int4, double2, etc.) to achieve them (referring to CUDA C++, anyway).
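A minimal sketch of such a 16-byte vector load, assuming a float array whose length is a multiple of 4 and whose base address is 16-byte aligned. The kernel name `scale_vec4` and the host helper `run_scale_vec4` are hypothetical, made up for illustration only:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread loads 16 consecutive bytes (one float4) in a single access.
// Adjacent threads in a warp read adjacent float4 values, so the warp-wide
// access pattern is still contiguous.
__global__ void scale_vec4(const float4* in, float4* out, int n_vec, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec) {
        float4 v = in[i];   // one 16-byte load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[i] = v;         // one 16-byte store
    }
}

// Hypothetical host helper: runs the kernel on n floats (n is assumed to be a
// multiple of 4) and returns the number of mismatching elements, or -1 on a
// CUDA error.
int run_scale_vec4(int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1024);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int n_vec = n / 4;
    int threads = 256;
    int blocks = (n_vec + threads - 1) / threads;
    // Pointers returned by cudaMalloc are aligned well beyond 16 bytes, so
    // reinterpreting them as float4 pointers is valid here.
    scale_vec4<<<blocks, threads>>>(reinterpret_cast<const float4*>(d_in),
                                    reinterpret_cast<float4*>(d_out),
                                    n_vec, 2.0f);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++bad;
    return bad;
}
```

Calling `run_scale_vec4(1 << 20)` from `main` should return 0 on a working GPU. If the pointer were offset so that it is no longer 16-byte aligned, the `float4` cast would not be valid.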

It’s generally not wise to think of operations using only a single thread in CUDA. Even in the case of a 16-byte “vector load”, there are still efficiencies to be gained doing this warp-wide, and using the data pulled in by an entire warp (hopefully in a coalesced fashion).

Constructing isolated operations to be performed by a single thread is generally not an efficient use of the machine, which is why you commonly see the “cooperative load and shared memory hassle” or other approaches to make use of operations warp-wide, at a minimum.
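For reference, the cooperative-load pattern can be sketched roughly as follows, for a case where each thread needs 3 consecutive floats (a size the pointer-cast trick does not cover). The names `sum_k_consecutive`, `run_sum_k`, `BLOCK`, and `K` are assumptions chosen for this illustration, not anything from the thread:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // threads per block (assumed for this sketch)
constexpr int K = 3;        // consecutive elements needed by each thread

// Thread t needs in[t*K .. t*K + K-1]. Reading those directly from global
// memory would give a stride-K pattern across the warp, so the block first
// stages its whole tile in shared memory using contiguous warp-wide reads.
__global__ void sum_k_consecutive(const float* in, float* out, int n_out)
{
    __shared__ float tile[BLOCK * K];
    int base = blockIdx.x * BLOCK * K;  // first input element of this block
    int n_in = n_out * K;

    // Cooperative load: in each pass, adjacent threads read adjacent floats.
    for (int j = 0; j < K; ++j) {
        int idx = j * BLOCK + threadIdx.x;
        if (base + idx < n_in) tile[idx] = in[base + idx];
    }
    __syncthreads();

    // Each thread now reads its K consecutive values from shared memory.
    int t = blockIdx.x * BLOCK + threadIdx.x;
    if (t < n_out) {
        float s = 0.0f;
        for (int j = 0; j < K; ++j) s += tile[threadIdx.x * K + j];
        out[t] = s;
    }
}

// Hypothetical host helper: returns the number of mismatching outputs,
// or -1 on a CUDA error.
int run_sum_k(int n_out)
{
    int n_in = n_out * K;
    std::vector<float> h_in(n_in), h_out(n_out);
    for (int i = 0; i < n_in; ++i) h_in[i] = static_cast<float>(i % 7);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n_in * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n_out * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n_in * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n_out + BLOCK - 1) / BLOCK;
    sum_k_consecutive<<<blocks, BLOCK>>>(d_in, d_out, n_out);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n_out * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int t = 0; t < n_out; ++t) {
        float expect = 0.0f;
        for (int j = 0; j < K; ++j) expect += h_in[t * K + j];
        if (h_out[t] != expect) ++bad;
    }
    return bad;
}
```

The global reads are contiguous across each warp in every pass of the load loop, and only the shared-memory reads are strided. With `K = 3` the stride is odd, so those shared-memory reads should not cause bank conflicts. An even `K` may need padding or a different layout.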