Reverse array low instruction throughput :(

Hi all,

I just started working with CUDA. I went through the Dr. Dobb's column on CUDA and decided to improve the performance of the reverseArrayBlock kernel. I came up with this solution:

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    // global index of this thread's input element
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    // mirrored output index: last element of the array minus the input index
    int out = (blockDim.x * gridDim.x - 1) - in;
    d_out[out] = d_in[in];
}

This gives the following output in the profiler:

=== Start profiling for session 'Session1' ===

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #1 ...

Total elapsed time for kernel: 0.275648 msecs.Correct!

Program run #1 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #2 ...

Total elapsed time for kernel: 0.087936 msecs.Correct!

Program run #2 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #3 ...

Total elapsed time for kernel: 0.200640 msecs.Correct!

Program run #3 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArrayNew.exe' run #4 ...

Total elapsed time for kernel: 0.088576 msecs.Correct!

The original reverse-array code gave me this output:

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #1 ...

Total elapsed time for kernel: 0.353568 msecs.Correct!

Program run #1 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #2 ...

Total elapsed time for kernel: 0.089248 msecs.Correct!

Program run #2 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #3 ...

Total elapsed time for kernel: 0.201856 msecs.Correct!

Program run #3 completed.

Start program 'D:/PhD_Stuff/PhD/Codes/CUDA Stuff/SupercomputingFortheMassesCodes/RevArray/RevArray/RevArray.exe' run #4 ...

Total elapsed time for kernel: 0.151360 msecs.Correct!

My version performs slightly better in terms of elapsed time, but the instruction throughput for my version is 0.313969, whereas for the original version it is 0.548697. Could anyone explain to me why my version has a lower instruction throughput?

Regards,

Mobeen
