Recently I was testing the pinned memory feature of CUDA, and I ran the following code:
float* h_data_paged_a = (float*) malloc(sizeof(float)*data_length);
float* h_data_paged_b = (float*) malloc(sizeof(float)*data_length);
float* d_data;
for (int i = 0; i < data_length; i++) {
    h_data_paged_a[i] = i;
}
I then transfer the data from host buffer a (h_data_paged_a) to device memory, and then transfer it from device memory back to host buffer b (h_data_paged_b). I got the following output:
test pageable transfer bandwidth...
Host to Device bandwidth (GB/s): 9.622258
Device to Host bandwidth (GB/s): 5.180916
As you can see, the device-to-host bandwidth is much lower (about 5.2 GB/s versus 9.6 GB/s host-to-device). But if I add this line to my code:
memset(h_data_paged_b, 0, sizeof(float) * data_length);
I get the following output:
Host to Device bandwidth (GB/s): 8.906919
Device to Host bandwidth (GB/s): 8.595308
Why is the device-to-host bandwidth lower when I leave h_data_paged_b uninitialized (that is, without the memset)?
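One way to narrow this down would be to take CUDA out of the picture and time a plain host-side copy into a freshly malloc'd buffer versus one that has already been written to. The following is only a minimal sketch under some assumptions: it assumes a POSIX system with clock_gettime, and the helper names now_seconds, timed_copy and compare_first_touch are hypothetical, not part of my original test. If the untouched buffer is also slower here, the difference is probably a host-side first-touch cost rather than something specific to the device-to-host transfer.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in seconds (POSIX clock_gettime). */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copy n floats from src to dst and return the elapsed time in seconds. */
double timed_copy(float *dst, const float *src, size_t n) {
    double start = now_seconds();
    memcpy(dst, src, n * sizeof(float));
    return now_seconds() - start;
}

/* Time a copy into an untouched buffer and into a pre-written buffer.
 * Returns 0 on success, -1 if an allocation fails. */
int compare_first_touch(size_t n, double *fresh_s, double *touched_s) {
    float *src = malloc(n * sizeof(float));
    float *dst_fresh = malloc(n * sizeof(float));
    float *dst_touched = malloc(n * sizeof(float));
    if (!src || !dst_fresh || !dst_touched) {
        free(src);
        free(dst_fresh);
        free(dst_touched);
        return -1;
    }

    /* Fill the source, as in the original test. */
    for (size_t i = 0; i < n; i++) {
        src[i] = (float)i;
    }

    /* Write to every byte of one destination before timing. A nonzero
     * fill is used because some compilers may fold malloc + memset(0)
     * into calloc, which could leave the pages untouched. */
    memset(dst_touched, 0xFF, n * sizeof(float));

    *fresh_s = timed_copy(dst_fresh, src, n);     /* first write happens here */
    *touched_s = timed_copy(dst_touched, src, n); /* already written once */

    free(src);
    free(dst_fresh);
    free(dst_touched);
    return 0;
}
```

Calling compare_first_touch with a large n (for example the same data_length as in the test above) and printing the two times should show whether the untouched buffer is slower on its own. With a small n the allocator may hand back memory that was already used, so the effect may not show up.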