Slower device-to-host transfer if the host memory is not set. Why?

Recently I was testing the pinned memory feature of CUDA, and I ran the following code:

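    // Pageable (non-pinned) host buffers allocated with malloc
    float* h_data_paged_a = (float*) malloc(sizeof(float)*data_length);
    float* h_data_paged_b = (float*) malloc(sizeof(float)*data_length);
    float* d_data;

    // Initialize the source buffer; h_data_paged_b is left untouched
    for(int i = 0; i < data_length; i++){
        h_data_paged_a[i] = i;
    }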

Then I transfer the data from host buffer a to device memory, and transfer the device memory back to host buffer b. I got the following output:

test pageable transfer bandwidth...
  Host to Device bandwidth (GB/s): 9.622258
  Device to Host bandwidth (GB/s): 5.180916

You can see that the device-to-host bandwidth is much lower. But if I add this line to my code before the transfers:

memset(h_data_paged_b, 0, sizeof(float) * data_length);

I get the following output:

  Host to Device bandwidth (GB/s): 8.906919
  Device to Host bandwidth (GB/s): 8.595308

Why do I get lower bandwidth if I leave “h_data_paged_b” unset?

The device-to-host bandwidth is around 5 GB/s when “h_data_paged_b” is not set, which is a little strange. Is this a bug or a feature of the compiler?

On some operating systems, malloc is a “lazy” allocator. The address range is assigned and reserved, but the pages behind it are not necessarily paged into existence until the memory is “touched” (accessed for the first time).

When you “touch” it yourself (for example with memset), the pages are instantiated and ready to go for the cudaMemcpy operation. If you don’t “touch” it yourself, the cudaMemcpy operation incurs additional overhead because the pages are “paged into existence” during the copy itself. So this is operating system behavior, not a compiler bug or feature.
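The first-touch cost can be observed without CUDA at all. Below is a minimal host-only sketch in plain C, assuming a system where malloc behaves lazily for large allocations. The function names `touch_pages` and `measure_touch` and the idea of passing a 4096-byte stride (a common page size) are made up for illustration and are not from the original code. It writes one byte per page of a fresh malloc buffer twice and reports the time of each pass. `clock()` measures processor time, which on Linux includes kernel time spent servicing page faults; that may differ on other platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Write one byte every `stride` bytes and return the elapsed processor time
   in seconds. The volatile pointer keeps the compiler from dropping the writes. */
static double touch_pages(unsigned char *buf, size_t bytes, size_t stride) {
    assert(stride > 0);
    volatile unsigned char *p = buf;
    clock_t start = clock();
    for (size_t i = 0; i < bytes; i += stride) {
        p[i] = 1;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: allocate `bytes` with malloc, then time a first pass
   (pages may have to be instantiated) and a second pass (pages already exist).
   Returns 0 on success, -1 if the allocation fails. */
int measure_touch(size_t bytes, size_t stride, double *first_s, double *second_s) {
    unsigned char *buf = (unsigned char *) malloc(bytes);
    if (buf == NULL) {
        return -1;
    }
    *first_s = touch_pages(buf, bytes, stride);   /* first touch */
    *second_s = touch_pages(buf, bytes, stride);  /* already touched */
    free(buf);
    return 0;
}
```

Calling `measure_touch((size_t)256 << 20, 4096, &first, &second)` from your own `main` and printing both values should show whether the first pass is slower on your system. If it is, the same extra work is what ends up inside the device-to-host cudaMemcpy when “h_data_paged_b” has never been touched.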

oh, I got it!! thanks for clearing my confusion~