What is the memory access granularity when the GPU reads memory allocated with cudaMallocHost? If I want to measure it, how should I proceed? Below is one approach I have considered:
#include <stdio.h>

#define KB 1024
#define MB (1024 * 1024)

// Single-thread kernel: reads the array with stride i, wrapping at lengthMod.
__global__ void get(int *array, int i) {
    int lengthMod = 8 * MB - 1;
    int tmp = 0;
    for (int j = 0; j < 6; j++) {
        for (unsigned int k = 0; k < 512 * MB; k++) {
            tmp += array[(k * i) & lengthMod];
        }
    }
}

void test_time() {
    int *array;
    cudaMallocHost(&array, 8 * MB);
    int i;

    // Warm-up launches.
    get<<<1, 1>>>(array, 1);
    get<<<1, 1>>>(array, 1);
    get<<<1, 1>>>(array, 1);

    // Time one launch per stride, doubling the stride each time.
    for (i = 1; i <= 2048 * 2048; i *= 2) {
        cudaEvent_t startEvent, stopEvent;
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);
        cudaEventRecord(startEvent, 0);
        get<<<1, 1>>>(array, i);
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
        float time;
        cudaEventElapsedTime(&time, startEvent, stopEvent);
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
        time /= 6;  // average over the 6 repetitions of the outer kernel loop
        printf("For i value: %d Time Taken:%lf\n", i, time);
    }
    cudaFreeHost(array);
}

int main() {
    test_time();
    return 0;
}
I hoped this approach would reveal the access granularity, analogous to the cache line size on a CPU. However, the results (in milliseconds, as reported by cudaEventElapsedTime) are as follows:
For i value: 1 Time Taken:0.000853
For i value: 2 Time Taken:0.000853
For i value: 4 Time Taken:0.000848
For i value: 8 Time Taken:0.000667
For i value: 16 Time Taken:0.000709
For i value: 32 Time Taken:0.000715
For i value: 64 Time Taken:0.000667
For i value: 128 Time Taken:0.000720
For i value: 256 Time Taken:0.000683
For i value: 512 Time Taken:0.000683
For i value: 1024 Time Taken:0.000853
For i value: 2048 Time Taken:0.000677
For i value: 4096 Time Taken:0.000512
For i value: 8192 Time Taken:0.000683
For i value: 16384 Time Taken:0.000683
For i value: 32768 Time Taken:0.000683
For i value: 65536 Time Taken:0.000512
For i value: 131072 Time Taken:0.000683
For i value: 262144 Time Taken:0.000683
For i value: 524288 Time Taken:0.000683
For i value: 1048576 Time Taken:0.000512
For i value: 2097152 Time Taken:0.000683
For i value: 4194304 Time Taken:0.000683
As you can see, the time does not change significantly as the stride grows. I see two possibilities: either the access granularity really is 1 byte, or the cost attributable to granularity is dwarfed by other, larger overheads. I hope someone can help clarify this. Thank you very much.
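
One way to separate these two possibilities might be to first make sure the loads are actually executed. In the kernel above, tmp is never written anywhere, so the compiler is free to drop the whole loop, and a total time of a few microseconds for 6 x 512M iterations would be consistent with that. The following is only a sketch under assumptions, not a definitive benchmark: the names strided_index, strided_read and run_sweep are hypothetical, the access count (1 << 16) is an arbitrary choice to keep the run short, and the index mask is sized in elements (2M ints for an 8 MiB buffer) so that no access falls outside the allocation. It reads through a volatile pointer and stores the sum to device memory so that the loads cannot be optimized away.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Index of the k-th access for a given stride, wrapped into a
// power-of-two sized array (mask = element count - 1).
__host__ __device__ inline size_t strided_index(size_t k, size_t stride, size_t mask) {
    return (k * stride) & mask;
}

// One thread issues `accesses` strided loads from pinned host memory.
// `volatile` forces every load to be issued; storing the sum to *sink
// keeps the loop from being removed as dead code.
__global__ void strided_read(const volatile int *array, size_t mask,
                             size_t stride, size_t accesses, int *sink) {
    int sum = 0;
    for (size_t k = 0; k < accesses; k++) {
        sum += array[strided_index(k, stride, mask)];
    }
    *sink = sum;
}

// Runs the stride sweep; returns 0 on success, 1 on a CUDA error.
int run_sweep() {
    const size_t numInts = 2u * 1024 * 1024;  // 8 MiB of ints (assumed size)
    const size_t accesses = 1u << 16;         // assumed; keeps each launch short
    int *array = nullptr;
    int *sink = nullptr;
    if (cudaMallocHost(&array, numInts * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&sink, sizeof(int)) != cudaSuccess) {
        cudaFreeHost(array);
        return 1;
    }
    for (size_t n = 0; n < numInts; n++) array[n] = 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the first timed launch does not pay one-time costs.
    strided_read<<<1, 1>>>(array, numInts - 1, 1, accesses, sink);
    cudaDeviceSynchronize();

    for (size_t stride = 1; stride < numInts; stride *= 2) {
        cudaEventRecord(start, 0);
        strided_read<<<1, 1>>>(array, numInts - 1, stride, accesses, sink);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("stride %zu ints (%zu bytes): %f ms\n",
               stride, stride * sizeof(int), ms);
    }

    // Surface any launch or execution error from the sweep.
    cudaError_t err = cudaDeviceSynchronize();
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(sink);
    cudaFreeHost(array);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling run_sweep() from main would print one line per stride. If the time per launch rises with the stride and then flattens, the stride at which it flattens would be a candidate for the granularity; if it stays flat from stride 1 onward, per-access latency is probably dominating and a different design (for example, many threads, or dependent pointer-chasing loads) may be needed.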