Pinned memory slows CPU computation

On the Jetson kit, when we allocate pinned memory to speed up H2D/D2H transfers, we observe that host-side access to that memory becomes slower. For example:

If I allocate memory with malloc and do a simple computation such as vector addition, it takes less time than doing the same computation on the CPU with pinned memory.

The slowdown is as high as 2x. To gain speed on the GPU we end up increasing host-side time. This holds even for small allocations of a few KB (chosen to rule out the effect of the OS not getting enough pages).

Has anyone observed this behavior?

Hi bharatkumarsharma

CPU and GPU frequency scaling is supported on Jetson, and may impact your result.

Is the measurement taken with fixed or floating clocks? Is the host copy into the pinned buffer included in or excluded from the timing?

Cheers

Hi,

The CPU and GPU have been set to performance mode, with all CPU cores always active at maximum frequency, as described in http://elinux.org/Jetson/Performance

Here is the sample snapshot of code:

////////////////Normal allocation calculation////////////////////
float *temp = (float *)malloc(N * sizeof(float));
float *ptr = (float *)malloc(N * sizeof(float));
for(int i =0; i < N; i++)
{
ptr[i] = (float)i;
temp[i] = (float)i;
}
//Time this part
for(int i =0 ; i < N; i++)
{
ptr[i] = temp[i];
}

versus
////////////////Pinned allocation calculation////////////////////

float *temp = (float *)malloc(N * sizeof(float));
float *ptr;
cudaMallocHost((void **)&ptr, size);

for(int i =0; i < N; i++)
{
ptr[i] = (float)i;
temp[i] = (float)i;
}
//Time this part
for(int i =0 ; i < N; i++)
{
ptr[i] = temp[i];
}

With the pinned allocation, the timed loop is twice as slow as with malloc. We are using gettimeofday to time this.

Hi bharatkumarsharma,

On TK1, memory allocated by cudaMallocHost is marked CPU-uncached, which is why you’re seeing slower CPU access compared to malloc memory.

Cheers

This makes sense. But is there any way to get the benefits of pinned memory (fast transfers … ) without sacrificing CPU performance, i.e. keep it CPU-cached?

The reason for asking is that we want to balance the workload between the GPU and the CPU, so computation needs to run on both. If I use pinned memory, the CPU part of the computation suffers.

I found a similar post here with an explanation:

I will try with cudaMallocManaged to see if there is any gain.