How does the LDS.128 SASS instruction work? For example, the compiler might generate the SASS below:
LDS.128 R8, [R2]
My understanding is that LDS.128 is a vector load from shared memory. Is this the same kind of “vector load” that occurs when threads in the same warp load multiple elements from shared memory in one transaction in the absence of bank conflicts?
Followup question: How does the vector load/store mechanism affect global memory coalescing?
For example, they provided the following code in the blog post that uses int4 instead of int to get vector loads/stores:
__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
}
// in only one thread, process final elements (if there are any)
int remainder = N%4;
if (idx==N/4 && remainder!=0) {
while(remainder) {
int idx = N - remainder--;
d_out[idx] = d_in[idx];
}
}
}
void device_copy_vector4(int* d_in, int* d_out, int N) {
int threads = 128;
int blocks = min((N/4 + threads-1) / threads, MAX_BLOCKS);
device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);
}
In this code, assuming we assign one thread per int4 element, if we only look at loads, does each warp still use coalescing to do all the loads in 4 memory transactions, where each transaction would load 32 consecutive 32-bit elements?