I would suggest showing your actual code that accesses shared memory, and showing the launch configuration as well. Ideally you would extract the relevant code into a minimal, self-contained repro that others can build and run.
I want to use CUDA to perform segmented sorting. Currently, I am trying to implement bitonic sorting in shared memory. For indexing convenience, I am using a one-dimensional index. However, this inevitably causes bank conflicts. My kernel function is as follows:
__global__ void bitonic_sort(int* data_d, int* sorted, int cols, int rows) {
    int tid = threadIdx.x;
    __shared__ int s_data[1024];
    int temp = 0;
    if (cols <= 1024) {
        // each thread loads one element of this block's row into shared memory
        s_data[tid] = data_d[tid + blockIdx.x * cols];
        __syncthreads(); // make the whole row visible before the first compare
        for (int i = 2; i <= cols; i <<= 1) {      // length of the bitonic sequences
            for (int j = i >> 1; j > 0; j >>= 1) { // compare-exchange stride
                int tid_comp = tid ^ j;            // partner index
                if (tid_comp > tid) {              // only the lower thread of each pair swaps
                    if ((tid & i) == 0) { // ascending
                        if (s_data[tid] > s_data[tid_comp]) {
                            temp = s_data[tid];
                            s_data[tid] = s_data[tid_comp];
                            s_data[tid_comp] = temp;
                        }
                    } else { // descending
                        if (s_data[tid] < s_data[tid_comp]) {
                            temp = s_data[tid];
                            s_data[tid] = s_data[tid_comp];
                            s_data[tid_comp] = temp;
                        }
                    }
                }
                __syncthreads(); // all swaps of this stride must finish before the next
            }
        }
        // write the sorted row back once, after the final pass
        sorted[tid + blockIdx.x * cols] = s_data[tid];
    }
}
And the main function is as follows:
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>

int main() {
    const int cols = 1024;
    const int rows = 1024;
    int *data, *data_d, *sorted_d;
    data = (int*)malloc(cols * rows * sizeof(int));
    cudaMalloc(&data_d, cols * rows * sizeof(int));
    cudaMalloc(&sorted_d, cols * rows * sizeof(int));
    // fill with descending values so every row needs sorting
    for (int i = 0; i < cols * rows; i++) data[i] = cols * rows - i;
    cudaMemcpy(data_d, data, cols * rows * sizeof(int), cudaMemcpyHostToDevice);
    // one block per row; shared memory is declared statically in the kernel,
    // so no dynamic shared-memory size is needed in the launch configuration
    bitonic_sort<<<rows, 1024>>>(data_d, sorted_d, cols, rows);
    cudaMemcpy(data, sorted_d, cols * rows * sizeof(int), cudaMemcpyDeviceToHost);
    // print the first element of the last 10 rows
    for (int i = 0; i < 10; i++) printf("%d ", data[(rows - 1 - i) * cols]);
    cudaFree(sorted_d);
    cudaFree(data_d);
    free(data);
    return 0;
}
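For reference, assuming the code above is saved as bitonic_sort.cu, it can be built and run with, for example:

nvcc -o bitonic_sort bitonic_sort.cu
./bitonic_sort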
Regarding access patterns and shared bank conflicts, the accesses indexed directly by tid (the load into s_data[tid], the reads and writes of s_data[tid], and the final store to sorted) should not be an issue: adjacent threads touch adjacent 32-bit words and therefore distinct banks.

So that leaves the s_data[tid_comp] accesses to think about:

i only ever has a single bit set, and j is just that i bit shifted right by varying amounts, so j also always has exactly one bit set. Let's assume cols is 1024, so we must consider threadIdx.x exclusive-or'ed with 2^9, 2^8, and so on down to 2^0, across the threads of each warp. Example:
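As a quick illustration (this is a standalone host-side check I put together, not part of the kernel above), we can enumerate the bank that each lane of a warp would touch for every single-bit stride j, and confirm that every warp hits 32 distinct banks:

#include <stdio.h>

int main() {
    // the bank of a 4-byte word at shared-memory index idx is idx % 32
    for (int j = 1; j <= 512; j <<= 1) {            // every single-bit stride used by the sort
        for (int warp = 0; warp < 1024 / 32; warp++) {
            unsigned banks = 0;                     // bitmask of banks touched by this warp
            for (int lane = 0; lane < 32; lane++) {
                int tid = warp * 32 + lane;
                banks |= 1u << ((tid ^ j) % 32);
            }
            if (banks != 0xFFFFFFFFu)               // fewer than 32 distinct banks => conflict
                printf("conflict: j=%d warp=%d\n", j, warp);
        }
    }
    printf("done (any conflicting pattern would be printed above)\n");
    return 0;
}

Running this prints no conflicts for any of the strides in question.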
Based on that (non-exhaustive) pattern analysis, the "threadIdx.x exclusive-or'ed with a single set bit" pattern does not appear to introduce any bank-conflicted patterns: XOR with a bit below 32 merely permutes the lanes within a warp, so the warp still touches 32 distinct banks, and XOR with a bit of 32 or greater leaves the low 5 bits of the index, and hence the bank, unchanged.
So from my perspective, the statement that the one-dimensional indexing "inevitably causes bank conflicts" is not immediately obvious, and would need clarification of the concern. I don't see it.
CUB or ModernGPU libraries could indeed meet my needs. But I want to increase the amount of work in my thesis, so I chose to write and speed up some of the kernel functions myself.
I still need to study your analysis carefully. However, after profiling the program with Nsight Compute, I did find a few bank conflicts. I will continue to update my results here later.
Assessing bank conflicts using Nsight Compute may require some careful analysis.

If you have identified the shared access patterns properly and have ruled out bank-conflicted patterns, then there is no reason to interpret the Nsight Compute output in some different fashion. Your code is simple enough that ruling out shared bank conflicts by studying the access patterns in the source code should be possible.
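If you want to look at the raw counters, one possible invocation is shown below (the exact metric names can vary by Nsight Compute version; ncu --query-metrics lists what your version supports, and the binary name here matches the build example above):

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./bitonic_sort

These two metrics count extra shared-memory wavefronts attributed to bank conflicts on loads and stores, respectively, which you can then correlate against the source-level analysis above.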