I am currently trying to implement a 32-element lookup table of 4-byte values using __shfl_sync. However, I am not able to reach the throughput of 32 per SM advertised in the programming guide. When I saw the topic about shared memory bandwidth (shared memory would be an alternative implementation, but shuffle suits me better because the table is updated frequently), I thought the two topics might be related and decided to hold off on creating my own post. As the results of that other thread confuse me more and more, though, it seems more convenient to explore the __shfl_sync behaviour in a separate thread.
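For context, the lookup-table pattern I am aiming for looks roughly like this (a minimal sketch; table_lookup is just an illustrative name, not code from my benchmark):

// Sketch: each lane of a warp holds one of the 32 table entries in a
// register, so a table lookup becomes a single indexed shuffle.
__device__ int table_lookup(int table_entry, int index) {
    // table_entry: this lane's element of the 32-entry table
    // index: source lane to read from (0..31), may differ per lane
    return __shfl_sync(0xFFFFFFFFu, table_entry, index);
}

Updating the table is cheap here, since each lane just overwrites its own register, which is why I prefer this over a shared-memory table.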
Currently I use the following code for benchmarking:
#include <iostream>
#include <cuda.h>

#define N 8000000
#define BLOCK_COUNT (28L * 3)   // 3 blocks per SM on a 28-SM GPU
#define BLOCK_THREADS (32 * 16) // 16 warps per block

__global__ void benchmark(int* out) {
    int offset = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int pos = threadIdx.x, pos2 = threadIdx.x;
    // out is zero-initialized, so salt/salt2 are 0 at runtime, but the
    // compiler cannot know that and must keep the shuffles.
    int salt = out[offset], salt2 = 2 * out[offset];
    int res = 0, res2 = 0;
    for (int i = 0; i < N; i += 2) {
        // Two independent shuffle chains to hide latency.
        res += __shfl_sync(0xFFFFFFFFu, salt, pos);
        pos += salt;
        res2 += __shfl_sync(0xFFFFFFFFu, salt2, pos2);
        pos2 += salt2;
    }
    out[offset] = res + res2; // prevent dead-code elimination
}

int main() {
    int *out, *d_out;
    out = (int*)malloc(sizeof(int) * BLOCK_COUNT * BLOCK_THREADS);
    for (int i = 0; i < BLOCK_COUNT * BLOCK_THREADS; i++) out[i] = 0;
    cudaMalloc((void**)&d_out, sizeof(int) * BLOCK_COUNT * BLOCK_THREADS);
    cudaMemcpy(d_out, out, sizeof(int) * BLOCK_COUNT * BLOCK_THREADS, cudaMemcpyHostToDevice);

    std::cout << "Invoking Kernel" << std::endl;
    float milliseconds;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    benchmark<<<BLOCK_COUNT, BLOCK_THREADS>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&milliseconds, start, stop);
    std::cout << "Finished Kernel" << std::endl;
    std::cout << "Execution took " << milliseconds << " ms" << std::endl;

    // N shuffles per thread (N/2 iterations x 2 shuffles); cast to double
    // to avoid integer overflow on platforms with a 32-bit long.
    double gops = (double)N * BLOCK_COUNT * BLOCK_THREADS / (milliseconds * 1e6);
    // 28 SMs at ~1.987 GHz; divide by 32 lanes for bytes per thread per cycle.
    std::cout << "That's " << sizeof(int) * gops / 28 / 1.987 / 32 << " bytes per thread per cycle" << std::endl;

    cudaMemcpy(out, d_out, sizeof(int) * BLOCK_COUNT * BLOCK_THREADS, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    free(out);

    cudaError_t cudaError = cudaGetLastError();
    if (cudaError != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(cudaError) << std::endl;
        return 1;
    }
    return 0;
}
Note that the out array is initialized to 0 and is only there to keep the compiler from optimizing the shuffles away. Since salt is therefore 0 at runtime, pos never changes, so each thread is basically reading its own lane's value in every shuffle.
I would expect to see 4 bytes per thread per SM cycle, since the throughput is given as 32 (i.e., one full 32-lane shuffle per SM per cycle, with each lane moving 4 bytes). However, I only get ~2 bytes per thread per cycle, i.e., an effective throughput of 16. If I make the shuffle exchange 8 bytes, the rate increases to ~4 bytes where 8 would be expected, which is again consistent with a throughput of 16.
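For reference, the 8-byte test just widens the shuffled value to a 64-bit type; the inner loop then looks roughly like this (a sketch reusing pos and offset from the kernel above, assuming the long long overload of __shfl_sync):

// 8-byte variant: shuffle a 64-bit value per lane instead of a 32-bit one.
long long salt64 = out[offset]; // zero at runtime, unknown to the compiler
long long res64 = 0;
for (int i = 0; i < N; i++) {
    res64 += __shfl_sync(0xFFFFFFFFu, salt64, pos);
    pos += (int)salt64;
}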
As the latency of a shuffle is allegedly quite high, I run two independent shuffle chains in parallel (the lines using salt and salt2), but the throughput did not increase (there is no data dependency between the two chains anyway). Unrolling the loop did not help either.
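The unrolled variant was along these lines (a sketch; the unroll factor 8 is just one of the values one might try):

// Manual unrolling hint; the compiler may already unroll on its own.
#pragma unroll 8
for (int i = 0; i < N; i += 2) {
    res += __shfl_sync(0xFFFFFFFFu, salt, pos);
    pos += salt;
    res2 += __shfl_sync(0xFFFFFFFFu, salt2, pos2);
    pos2 += salt2;
}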
Is there anything I am doing wrong here, or am I just misinterpreting what a throughput of 32 means? Nsight Compute reports the MIO queue to be full most of the time.