I am trying to improve the usability of my code by reducing the number of function parameters.
This could be achieved by grouping the logically connected values (array data, width and height) into a structure.
Before changing my complete code, I created a test file to compare the performance of two otherwise identical implementations.
The result is quite bad: the function using structures runs 20% slower.
Here are the functions and the generated assembly code:
First function, separate variables:
__global__ void mul1(float *array, const float *array2, const int width, const int height)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    for (int j = i; j < width * height; j += gridDim.x * blockDim.x)
    {
        array[j] = array[j] * array2[j];
    }
}
The assembly code generated for line 6 of mul1 (the statement array[j] = array[j] * array2[j];) looks like this:
SHL R6, R5, 0x2;
IADD R6, R0, R6;
MOV R6, R6;
LD R7, [R6];
SHL R6, R5, 0x2;
IADD R6, R2, R6;
MOV R6, R6;
LD R6, [R6];
FMUL R7, R7, R6;
SHL R6, R5, 0x2;
IADD R6, R0, R6;
MOV R6, R6;
ST [R6], R7;
Second function, using a structure containing array, width and height:
__global__ void mul2(s_array Array1, const s_array Array2)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    for (int j = i; j < Array1.height * Array1.width; j += gridDim.x * blockDim.x)
    {
        Array1.data[j] = Array1.data[j] * Array2.data[j];
    }
}
The assembly code generated for line 6 of mul2 (the statement Array1.data[j] = Array1.data[j] * Array2.data[j];) looks like this:
IADD R3, R0, RZ;
MOV R3, R3;
LD R3, [R3];
SHL R4, R2, 0x2;
IADD R3, R3, R4;
MOV R3, R3;
LD R4, [R3];
IADD R3, R0, 0x10;
MOV R3, R3;
LD R3, [R3];
SHL R5, R2, 0x2;
IADD R3, R3, R5;
MOV R3, R3;
LD R3, [R3];
FMUL R5, R4, R3;
IADD R3, R0, RZ;
MOV R3, R3;
LD R3, [R3];
SHL R4, R2, 0x2;
IADD R3, R3, R4;
MOV R3, R3;
ST [R3], R5;
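The definition of s_array used by mul2 is not shown in this post. A minimal sketch, assuming it holds nothing more than the data pointer and the two dimensions, could look like the following; the field order and the helper element_count are assumptions made for illustration only.

```cuda
// Assumed layout of s_array; the real definition is not shown in the post.
struct s_array
{
    float *data;  // pointer to width * height floats in device memory
    int width;
    int height;
};

// Hypothetical helper (not part of the original code): number of elements
// described by the structure.
__host__ __device__ inline int element_count(const s_array &a)
{
    return a.width * a.height;
}
```

With a layout like this, every access to Array1.data[j] needs the pointer member first and the element second, which is the extra load visible in the second listing.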
Also noticeable is that width*height is calculated multiple times, even though both values are passed as constant arguments, so the product could be stored instead of being multiplied again.
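To make the point about storing the product concrete, hoisting it by hand might look like the sketch below. The names mul1_hoisted, n, stride and the host helper run_mul1_hoisted are introduced here for illustration and are not part of the original test file; the launch configuration is arbitrary and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same arithmetic as mul1, but the loop bound and the grid stride are computed
// once per thread instead of appearing in the loop condition and increment.
__global__ void mul1_hoisted(float *array, const float *array2,
                             const int width, const int height)
{
    const int n = width * height;               // loop-invariant element count
    const int stride = gridDim.x * blockDim.x;  // loop-invariant grid stride
    for (int j = threadIdx.x + blockIdx.x * blockDim.x; j < n; j += stride)
    {
        array[j] = array[j] * array2[j];
    }
}

// Hypothetical host helper: copies both inputs to the device, runs the kernel
// and returns the element-wise product. Error checking is omitted for brevity.
std::vector<float> run_mul1_hoisted(std::vector<float> a, const std::vector<float> &b,
                                    int width, int height)
{
    const size_t bytes = a.size() * sizeof(float);
    float *d_a = 0;
    float *d_b = 0;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    mul1_hoisted<<<2, 64>>>(d_a, d_b, width, height);  // arbitrary launch size
    cudaMemcpy(a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return a;
}
```

An optimizing build would normally be expected to perform this hoisting on its own, so comparing the generated code for both variants is one way to check whether optimization is actually taking effect.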
Is there any reason the generated code is so different?
To me it looks as if the ugly code produces the faster program.
Compiler optimizations are set to maximum, and no restriction on register usage was set.
I am using Nsight Eclipse Edition with CUDA Toolkit 6.5.