Optimizing Array Reversal

I implemented array reversal (using multiple blocks) with two approaches. When I compare their performance using the CUDA profiler, one approach performs much better than the other, but I am confused about why. Would anyone please explain?

Approach 1:

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

Approach 2:

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + threadIdx.x;
    int N   = blockDim.x * gridDim.x;
    int id1 = N - 1 - id;   // mirrored global index

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[tid] = d_in[id1];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory straight back to the thread's own
    // global index
    d_out[id] = s_data[tid];
}
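For reference, both kernels are launched the same way; because they declare extern __shared__, the per-block shared-memory size has to be passed as the third launch parameter. Here is a minimal host-side harness (a simplified sketch, not the exact code I profiled; the names numThreads/numBlocks are illustrative, and N is an exact multiple of the block size):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// reverseArrayBlock kernel as defined above
__global__ void reverseArrayBlock(int *d_out, int *d_in);

int main()
{
    const int numThreads = 256;              // block size
    const int numBlocks  = 1024;             // grid size
    const int N = numThreads * numBlocks;    // array length

    int *h = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in,  N * sizeof(int));
    cudaMalloc((void **)&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h, N * sizeof(int), cudaMemcpyHostToDevice);

    // The third launch parameter is the dynamic shared-memory size per block,
    // required because the kernel declares extern __shared__ int s_data[].
    reverseArrayBlock<<<numBlocks, numThreads, numThreads * sizeof(int)>>>(d_out, d_in);

    cudaMemcpy(h, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d (expected %d)\n", h[0], N - 1);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return 0;
}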

Output from the CUDA profiler:

                Approach #1    Approach #2
    memcopy        395.84        396.064
    kernel         179.776       459.456
    memcpy         512.416       507.136
    CPU time      2513.16       2771.64

You are breaking the coalescing rules in the second kernel: the load d_in[id1] makes consecutive threads of a half-warp read *decreasing* addresses, so the reads cannot be combined into a single memory transaction.

See section “5.1.2.1 Global Memory” in the programming guide (CUDA_Programming_Guide1.1.pdf).

Björn

If you want to reverse arrays larger than your block size, you could use a 1D texture for the read and perform a coalesced write.
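Something like this (a rough sketch using the texture-reference API; the names texIn and reverseArrayTexture are mine, and error checking is omitted):

// 1D texture reference bound to the input array in device memory
texture<int, 1, cudaReadModeElementType> texIn;

__global__ void reverseArrayTexture(int *d_out, int N)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < N)
        // The reversed read goes through the texture cache instead of an
        // uncoalesced global load; the write to d_out[id] is coalesced.
        d_out[id] = tex1Dfetch(texIn, N - 1 - id);
}

// Host side:
//     cudaBindTexture(0, texIn, d_in, N * sizeof(int));
//     reverseArrayTexture<<<numBlocks, numThreads>>>(d_out, N);
//     cudaUnbindTexture(texIn);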

And nutti is correct: you coalesce all accesses in your first kernel but not in the second. Coalescing is the NUMBER ONE optimization you must perform. It can mean the difference between code that performs equal to (or slower than) the CPU and code that is 30 times faster.

Hmm, after spending a good amount of time on it, I've got the concept of coalescing. Thanks a lot to all of you.

Is there any other optimization that I can apply here to get better performance?

Here’s a paper that I haven’t really read yet, but the title is awesome:

[b]Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA[/b]

http://delivery.acm.org/10.1145/1350000/13…CFTOKEN=6184618

Mark Harris’s slides on CUDA optimization from the Supercomputing 2007 CUDA tutorial are dense and good:

http://www.gpgpu.org/sc2007/

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf

Your array reversal performs essentially no computations, so optimizing the memory bandwidth by coalescing is the only thing you can do to improve the performance.

You may get a tiny, sub-1% performance boost by reading two ints at a time as an int2, but it probably isn’t worth the code complexity.
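For what it's worth, the int2 version would look roughly like this (my sketch, following the same pattern as Approach 1; the name reverseArrayBlockInt2 is mine, the array length in ints is assumed to equal 2 * blockDim.x * gridDim.x, and the launch needs blockDim.x * sizeof(int2) bytes of dynamic shared memory):

// Reverse an array of ints by treating it as an array of int2 pairs.
__global__ void reverseArrayBlockInt2(int2 *d_out, int2 *d_in)
{
    extern __shared__ int2 s_pair[];

    int in = blockDim.x * blockIdx.x + threadIdx.x;

    // Coalesced 8-byte read; swap the pair so the two ints inside
    // the int2 also end up reversed.
    int2 v = d_in[in];
    s_pair[blockDim.x - 1 - threadIdx.x] = make_int2(v.y, v.x);

    __syncthreads();

    // Coalesced write to the mirrored block offset, as in Approach 1.
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
    d_out[out] = s_pair[threadIdx.x];
}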

OK, thanks a lot, guys. I read the paper and it’s really helpful.

Hello,
I was also looking into Optimizing Array Reversal. (It’s one of the recommended exercises for beginners by NVIDIA.)
Reading this thread and the Programming Guide, I gather that performance can be improved via coalesced access to global device memory, and then performing the reversal operation in shared memory.
I get the idea, but how do I enforce coalesced memory access? How do I take advantage of it?
Thanks!

The Programming Guide says stuff about half-warps with regard to coalesced memory access, but I don’t get it.