LDS.128 loads from shared memory

How does the LDS.128 SASS instruction work? For example, the compiler might generate the SASS below:

LDS.128 R8,  [R2]

My understanding is that LDS.128 is a vector load from shared memory. Is this the same kind of “vector load” that occurs when threads in the same warp load multiple elements from shared memory in one transaction in the absence of bank conflicts?

The general vector load/store mechanism applies to other memory types as well, - global, constant and there’s a good explantion of it here.

As for:

bear in mind that successive 32bit elements are stored in different banks in shared memory, so conflict free is not guaranteed.

Followup question: How does the vector load/store mechanism affect global memory coalescing?

For example, they provided the following code in the blog post that uses int4 instead of int to get vector loads/stores:

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];

  // in only one thread, process final elements (if there are any)
  int remainder = N%4;
  if (idx==N/4 && remainder!=0) {
    while(remainder) {
      int idx = N - remainder--;
      d_out[idx] = d_in[idx];

void device_copy_vector4(int* d_in, int* d_out, int N) {
  int threads = 128;
  int blocks = min((N/4 + threads-1) / threads, MAX_BLOCKS);

  device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);

In this code, assuming we assign one thread per int4 element, if we only look at loads, does each warp still use coalescing to do all the loads in 4 memory transactions, where each transaction would load 32 consecutive 32-bit elements?

Yes, given correct alignment. The hardware accesses memory in 32, 64 or 128 byte transactions, as outlined here and here.