I implemented array reversal (using multiple blocks) in two ways. When I compare their performance with the CUDA profiler, one approach performs much better than the other, but I am confused about why. Would anyone please explain?
Approach 1:
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
Approach 2:
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;
    int N   = blockDim.x * gridDim.x;
    int id1 = N - 1 - id;

    // Load one element per thread from device memory and store it
    // into temporary shared memory; here the *read* is mirrored
    s_data[tid] = d_in[id1];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory straight back to the
    // same (forward) global index
    d_out[id] = s_data[tid];
}
If you want to reverse arrays larger than your block size, you could use a 1D texture for the read and perform a coalesced write.
And nutti is correct: you coalesce all accesses in your first kernel but not in the second. Coalescing is the number-one optimization you must perform. It can mean the difference between code that performs no better than (or slower than) the CPU and code that is 30 times faster.
Your array reversal performs essentially no computations, so optimizing the memory bandwidth by coalescing is the only thing you can do to improve the performance.
You may get a tiny, sub-1% performance boost by reading two ints at a time as an int2, but it probably isn’t worth the added code complexity.
Hello,
I was also looking into Optimizing Array Reversal. (It’s one of the recommended exercises for beginners by NVIDIA.)
Reading this thread and the Programming Guide, I gather that performance can be improved by coalescing accesses to global device memory and performing the reversal itself in shared memory.
I get the general idea, but how do I enforce coalesced memory access? How do I take advantage of it in my own kernel?
Thanks!