Reverse array low instruction throughput :(

Hi all,

I just started working with CUDA. I went through the Dr. Dobb's column on CUDA and decided to improve the performance of the reverseArrayBlock kernel. I came up with this solution:

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    // global index of this thread's input element
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    // mirrored output index: last element of the array minus the input index
    int out = (blockDim.x * gridDim.x - 1) - in;
    d_out[out] = d_in[in];
}

This gives the following output in the profiler:

=== Start profiling for session 'Session1' ===

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #1 ...

Total elapsed time for kernel: 0.275648 msecs.Correct!

Program run #1 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #2 ...

Total elapsed time for kernel: 0.087936 msecs.Correct!

Program run #2 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #3 ...

Total elapsed time for kernel: 0.200640 msecs.Correct!

Program run #3 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #4 ...

Total elapsed time for kernel: 0.088576 msecs.Correct!

The original reverse-array code gave me this output:

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #1 ...

Total elapsed time for kernel: 0.353568 msecs.Correct!

Program run #1 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #2 ...

Total elapsed time for kernel: 0.089248 msecs.Correct!

Program run #2 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #3 ...

Total elapsed time for kernel: 0.201856 msecs.Correct!

Program run #3 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #4 ...

Total elapsed time for kernel: 0.151360 msecs.Correct!

My version performs slightly better in terms of elapsed time, but the instruction throughput for my version is 0.313969, whereas for the original version it is 0.548697. Could anyone explain to me why my version has a lower instruction throughput?

Regards,

Mobeen
