How to conveniently avoid bank conflict when load 1-dimension data to shared memory?

When using shared memory, I know a classic way to avoid bank conflicts is to use padding, for example:

__shared__ int s_data[32][32+1]

But if the data is one-dimensional, I mean:

__shared__ int s_data[1024]

is there a convenient way to use padding?
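The only way I can think of is to skew the index by hand, i.e. reproduce the [32][32+1] layout inside a one-dimensional array. A rough sketch of what I mean (assuming 32 banks and 4-byte elements; the kernel name is just for illustration):

__global__ void padded_example(const int* data_d, int cols) {
    // one extra (unused) slot per 32 elements, same idea as [32][32+1]
    __shared__ int s_data[1024 + 1024 / 32];
    int tid = threadIdx.x;
    int p = tid + tid / 32;   // skewed index: logical i becomes i + i/32
    s_data[p] = data_d[tid + blockIdx.x * cols];
    __syncthreads();
    // ... every later access would also have to go through i + i/32 ...
}

But then every shared-memory access has to go through that remap, which does not feel very convenient.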

What is your access pattern for shared memory?

I would suggest showing your actual code that accesses shared memory, along with the launch configuration. Ideally, you would extract the relevant code into a minimal, self-contained repro that others can build and run.

Thanks for your suggestion.

I want to use CUDA to perform segmented sorting. Currently, I am trying to implement bitonic sorting in shared memory. For indexing convenience, I am using a one-dimensional index. However, this inevitably causes bank conflicts. My kernel function is as follows:

__global__ void bitonic_sort(int* data_d, int* sorted, int cols, int rows) {
    int tid = threadIdx.x;
    __shared__ int s_data[1024];
    int temp = 0;
    if (cols <= 1024) {
        s_data[tid] = data_d[tid + blockIdx.x * cols];
        for (int i = 2; i <= cols; i <<= 1) {
            for (int j = i >> 1; j > 0; j >>= 1) {
                int tid_comp = tid ^ j;
                if (tid_comp > tid) {
                    if ((tid & i) == 0) {  // ascending
                        if (s_data[tid] > s_data[tid_comp]) {
                            temp = s_data[tid];
                            s_data[tid] = s_data[tid_comp];
                            s_data[tid_comp] = temp;
                        }
                    } else {  // descending
                        if (s_data[tid] < s_data[tid_comp]) {
                            temp = s_data[tid];
                            s_data[tid] = s_data[tid_comp];
                            s_data[tid_comp] = temp;
                        }
                    }
                }
                __syncthreads();
            }
            sorted[tid + blockIdx.x * cols] = s_data[tid];
        }
    }
}

And the main function is as follows:

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>

int main() {
    const int cols = 1024;
    const int rows = 1024;
    int *data, *data_d, *sorted_d;
    data = (int*)malloc(cols * rows * sizeof(int));
    cudaMalloc(&data_d, cols * rows * sizeof(int));
    cudaMalloc(&sorted_d, cols * rows * sizeof(int));

    for (int i = 0; i < cols * rows; i++) data[i] = cols * rows - i;
    cudaMemcpy(data_d, data, cols * rows * sizeof(int), cudaMemcpyHostToDevice);

    bitonic_sort<<<rows, 1024, 1024 * sizeof(int)>>>(data_d, sorted_d, cols, rows);

    cudaMemcpy(data, sorted_d, cols * rows * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("%d  ", data[(rows - 1 - i) * cols]);

    cudaFree(sorted_d);
    cudaFree(data_d);
    free(data);
    return 0;
}

The output should be:

1  1025  2049  3073  4097  5121  6145  7169  8193  9217

Thank you for your reply. I have pasted the complete code in another reply.

  1. I imagine you need a __syncthreads() statement in-between this line:

s_data[tid] = data_d[tid + blockIdx.x * cols];

and this line:

for (int i = 2; i <= cols; i <<= 1) {

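For example (just a sketch of where the barrier would go, not a tested fix; without it, a thread can read its partner's s_data element before the partner has written it):

s_data[tid] = data_d[tid + blockIdx.x * cols];
__syncthreads();   // make sure every thread's element is in shared memory
                   // before any thread reads its partner's element below
for (int i = 2; i <= cols; i <<= 1) {
    ...
}
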
  2. You might want to use a cub block-level sort rather than writing your own.
  3. Regarding:

“However, this inevitably causes bank conflicts.”

Regarding access patterns and shared bank conflicts, this should not be an issue:

s_data[tid]

so that leaves this to think about:

s_data[tid_comp]

i only ever has a single bit set, and j is just that bit of i shifted right by varying amounts. Let’s assume cols is 1024, so across threads we must consider threadIdx.x exclusive-or’ed with 2^9, 2^8, and so on down to 2^0. Example:

tid  |  tid^(2^5)   |  tid^(2^2)
  0  |  32+0  = 32  |   4+0 =  4
  1  |  32+1  = 33  |   4+1 =  5
  2  |  32+2  = 34  |   4+2 =  6
  3  |  32+3  = 35  |   4+3 =  7
  4  |  32+4  = 36  |   4-4 =  0
  5  |  32+5  = 37  |   5-4 =  1
 ...
 31  |  32+31 = 63  |  31-4 = 27

Based on that non-exhaustive pattern analysis, the “threadIdx.x exclusive-or’ed with a single set bit” pattern does not appear to introduce any bank conflicts: within a warp, XOR with a fixed value just permutes (or leaves unchanged) the low 5 bits of the index, so the 32 lanes still touch 32 distinct banks.
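If you want to check that mechanically, a small host-side program can enumerate the banks touched by one warp for each single-bit j (this assumes the usual mapping of 4-byte words to 32 banks, i.e. bank = word index % 32):

#include <stdio.h>

int main(void) {
    // For each single-bit stride j the kernel uses, look at the banks touched
    // by the 32 lanes of a warp when they access s_data[tid ^ j].
    for (int j = 1; j < 1024; j <<= 1) {
        int used[32] = {0};
        int conflicts = 0;
        for (int lane = 0; lane < 32; lane++) {
            int idx = lane ^ j;    // higher warps behave the same way: only the
                                   // low 5 bits of the index select the bank
            int bank = idx % 32;
            if (used[bank]++) conflicts++;
        }
        printf("j = %4d : %s\n", j, conflicts ? "bank conflict" : "no conflict");
    }
    return 0;
}

Every single-bit j either permutes the lanes within a 32-entry window or shifts the whole warp to a different window, so each lane still lands in its own bank.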

So from my perspective, this statement:

“However, this inevitably causes bank conflicts.”

is not immediately obvious, and would need clarification of the concern. I don’t see it.


Thanks a lot for your patient explanation.

CUB or ModernGPU libraries can indeed meet my needs. But I want to increase the workload for my thesis, so I chose to speed up some kernel functions myself.

I still need to study your analysis carefully. However, after profiling the program with Nsight Compute, I did find a few bank conflicts reported. I will continue to update my results later.

Assessing bank conflicts using Nsight Compute may require some analysis.

If you have identified the shared access patterns properly and have ruled out bank-conflicted patterns, then there is no reason to interpret the Nsight Compute output differently. Your code is simple enough that ruling out shared bank conflicts by studying the access patterns in the source code should be possible.
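If it helps, the raw shared-memory conflict counters can be requested directly, something like this (metric names as they appear in the Nsight Compute versions I have used, and “bitonic” standing in for whatever your executable is called):

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./bitonic

Checking whether those counters are actually nonzero for the shared accesses in question, and then comparing against what the source-level analysis predicts, would be a reasonable next step.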