I am wondering whether there is a way to truly eliminate address-computation instructions when running pointer-chasing benchmarks.
For example, consider the following kernel:
__global__ void myKernel(
    float a,
    float b,
    float *array,
    unsigned int *startTime,
    unsigned int *endTime,
    unsigned int *sm_id
)
{
    __shared__ float svalue;
    b = b - 1;
    int iter = 1000;
    unsigned int start, end;
    start = clock();
    for (int i = 0; i < iter; i++) {
        a = array[__float_as_uint(a)];
        repeat30(a = a + b;)
    }
    end = clock();
    if (threadIdx.x == 0) svalue = a;
    int pid = threadIdx.x + blockIdx.x * blockDim.x;
    startTime[pid] = start;
    endTime[pid] = end;
    sm_id[pid] = get_smid();
}
The repeat30 macro is defined in repeat.h, which can be found in "Demystifying GPU Microarchitecture through Microbenchmarking" (stuffedcow).
In the compiled SASS code (nvcc 10.2, -arch=sm_35), two instructions precede every memory load instruction:
ISCADD R8.CC, R2, c[0x0][0x148], 0x2;
IMAD.U32.U32.HI.X R9, R2, R3, c[0x0][0x14c];
LD.E R8, [R8];
I understand that every time the index changes (it is held in R2 in this listing), the address has to be recomputed as index*4 + &array[0], which is what the two instructions above, ISCADD and IMAD, do.
My question is whether we can:
- initialize the array elements so that array[i] = &array[j] rather than array[i] = j, as in the traditional pointer-chasing benchmark, and
- eliminate the two address-computation instructions before every memory load instruction in the SASS code?
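
To make the first point concrete, here is a minimal sketch of what I have in mind, not a verified solution. Each element holds the device address of the next element, so the loaded value is used directly as the next load address. The names chaseKernel, d_array, stride and so on are made up for illustration, and the repeat30 arithmetic from my kernel is left out. I have not checked the SASS this produces. On a 64-bit build the pointers are 8 bytes, so the load would be a 64-bit load, and the output would still need to be inspected with cuobjdump -sass to see whether the ISCADD/IMAD pair is really gone.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every element stores the device address of the next
// element, so no index-to-address conversion is written in the source.
__global__ void chaseKernel(void **start, int iters, void **out, unsigned int *cycles)
{
    void **p = start;
    unsigned int t0 = clock();
    for (int i = 0; i < iters; i++) {
        p = (void **)*p;            // the loaded value is the next address
    }
    unsigned int t1 = clock();
    *out = (void *)p;               // keep the dependent chain from being optimized away
    *cycles = t1 - t0;
}

int main()
{
    const int n = 1024, stride = 1, iters = 1000;   // assumed sizes, for illustration only

    void **d_array = nullptr;
    void **d_out = nullptr;
    unsigned int *d_cycles = nullptr;
    cudaMalloc((void **)&d_array, n * sizeof(void *));
    cudaMalloc((void **)&d_out, sizeof(void *));
    cudaMalloc((void **)&d_cycles, sizeof(unsigned int));

    // Build the chain on the host from device addresses:
    // array[i] = &array[(i + stride) % n]. The host only does pointer
    // arithmetic on d_array here and never dereferences it.
    std::vector<void *> h_array(n);
    for (int i = 0; i < n; i++)
        h_array[i] = (void *)(d_array + (i + stride) % n);
    cudaMemcpy(d_array, h_array.data(), n * sizeof(void *), cudaMemcpyHostToDevice);

    chaseKernel<<<1, 1>>>(d_array, iters, d_out, d_cycles);
    cudaDeviceSynchronize();

    void *h_out = nullptr;
    unsigned int h_cycles = 0;
    cudaMemcpy(&h_out, d_out, sizeof(void *), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    // Sanity check: after iters steps the chain should sit at (iters * stride) % n.
    long idx = (long)((void **)h_out - d_array);
    printf("final index: %ld (chain predicts %d)\n", idx, (iters * stride) % n);
    printf("cycles for %d dependent loads: %u\n", iters, h_cycles);

    cudaFree(d_array);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

Because the elements are now pointers rather than floats, the a = a + b chain from my original kernel cannot be applied to them unchanged. Any dependent arithmetic between loads would presumably have to be integer arithmetic on the pointer value.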