cudaMemcpy HostToDevice vs. DeviceToHost

Can someone tell me why this function takes different times to transfer the same amount of memory?

In this case, for 300 MB of memory, HtoD takes 200 ms and DtoH takes 400 ms.

Perhaps in one case the operation is from/to pinned memory, and in the other case it is not. The Visual Profiler will give you considerable information about a cudaMemcpy operation if you hover your mouse over it, but I'm not sure this can be answered just by looking at the timeline.

A complete code, along with a description of the system you are running it on, may be necessary.
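
For reference, here is a rough, untested sketch of how you could compare pageable and pinned host transfers yourself with CUDA event timing (SIZE matches your 300 MB case; error checking omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define SIZE 39321600   //same element count as in your program (~300 MB of doubles)

//Time one HtoD and one DtoH transfer for a given host pointer using CUDA events
static void time_transfers(const char* label, double* host, double* dev)
{
	cudaEvent_t start, stop;
	float ms;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	cudaMemcpy(dev, host, sizeof(double)*SIZE, cudaMemcpyHostToDevice);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&ms, start, stop);
	printf("%s HtoD: %.1f ms\n", label, ms);

	cudaEventRecord(start);
	cudaMemcpy(host, dev, sizeof(double)*SIZE, cudaMemcpyDeviceToHost);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&ms, start, stop);
	printf("%s DtoH: %.1f ms\n", label, ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

int main(void)
{
	double *dev, *pageable, *pinned;
	cudaMalloc((void**)&dev, sizeof(double)*SIZE);

	pageable = (double*)malloc(sizeof(double)*SIZE);        //ordinary pageable allocation
	cudaMallocHost((void**)&pinned, sizeof(double)*SIZE);   //page-locked (pinned) allocation

	for (int i = 0; i < SIZE; i++) {                        //touch every page of both buffers
		pageable[i] = i;
		pinned[i] = i;
	}

	time_transfers("pageable", pageable, dev);
	time_transfers("pinned  ", pinned, dev);

	free(pageable);
	cudaFreeHost(pinned);
	cudaFree(dev);
	return 0;
}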

Ehm no, the info is the same:

HtoD

Source: Pageable
Destination: Device

DtoH

Source: Device
Destination: Pageable

This is the code. I use a Jetson TK1.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

//Define some constants. SIZE is the number of memory positions allocated
#define NUM_THREAD 128
#define NUM_BLOCK 16
#define SIZE 39321600

//This is a simple kernel that only does loads and stores from global memory
__global__ void globalmemory_kernel(double* a, double* b) {

	int step = gridDim.x*blockDim.x;
	int tid = threadIdx.x + blockIdx.x*blockDim.x;

#pragma unroll
	for (int t = 0; t < 100; t++){
#pragma unroll
		for(int i = tid; i < SIZE; i += step)
			b[i] = a[i];
	}

}

/**
 * Host function that prepares data array and passes it to the CUDA kernel.
 */
int main(void) {

	double* a;
	double* a_dev;
	double* b;
	double* b_dev;

	//Host memory allocation
	a = (double*)malloc(sizeof(double)*SIZE);
	b = (double*)malloc(sizeof(double)*SIZE);


	for (int i = 0; i< SIZE; i++)
		a[i] = i;

	//Device memory allocation
	cudaMalloc((void**)&a_dev,sizeof(double)*SIZE);
	cudaMalloc((void**)&b_dev,sizeof(double)*SIZE);

	cudaMemcpy(a_dev,a,sizeof(double)*SIZE,cudaMemcpyHostToDevice);

	globalmemory_kernel<<<NUM_BLOCK,NUM_THREAD>>>(a_dev,b_dev);

	cudaMemcpy(b,b_dev,sizeof(double)*SIZE,cudaMemcpyDeviceToHost);

	free(a);
	free(b);
	cudaFree(a_dev);
	cudaFree(b_dev);
	cudaDeviceReset();
	return 0;
}

In the case of the first (faster) transfer, using a, you touched every element of a just prior to transferring it to the device. In the case of the second (slower) transfer, using b, you did not; only the allocation of b occurred. This has an impact on OS activity. A cudaMemcpy device-to-host operation on non-pinned memory effectively involves creating a temporary pinned host allocation, copying from the device into that pinned buffer, and then copying from the pinned buffer to your unpinned variable b.
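
Conceptually (this is just an illustration of the idea, not the actual runtime code), the device-to-host copy into pageable memory behaves roughly like this staged copy through a pinned buffer; CHUNK is an arbitrary staging size chosen only for illustration:

#include <string.h>
#include <cuda_runtime.h>

#define CHUNK ((size_t)(4*1024*1024))   //4 MB staging buffer, chosen arbitrarily

void staged_copy_to_pageable(void* dst_pageable, const void* src_device, size_t bytes)
{
	char* staging;
	cudaMallocHost((void**)&staging, CHUNK);   //temporary pinned allocation

	for (size_t off = 0; off < bytes; off += CHUNK) {
		size_t n = (bytes - off < CHUNK) ? (bytes - off) : CHUNK;
		//device -> pinned staging buffer
		cudaMemcpy(staging, (const char*)src_device + off, n, cudaMemcpyDeviceToHost);
		//pinned staging buffer -> pageable destination; this is where the destination
		//pages have to be faulted in and mapped by the OS, hence the extra cost
		memcpy((char*)dst_pageable + off, staging, n);
	}
	cudaFreeHost(staging);
}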

Try adding the following code immediately prior to the cudaMemcpy(b, b_dev,…) operation:

memset(b, 0, sizeof(double)*SIZE);

and I think you'll see the transfer times match much more closely. This is, I believe, due to a difference in the OS paging associated with b, which is not fully set up by the malloc operation but is fully established once you touch every element of b.
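
If you want to see the effect in isolation, an untested sketch along these lines should show it (two identical pageable buffers, one untouched and one touched with memset; error checking omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

#define SIZE 39321600

//Time a single cudaMemcpy with CUDA events and return milliseconds
static float timed_memcpy(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind)
{
	cudaEvent_t start, stop;
	float ms;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start);
	cudaMemcpy(dst, src, bytes, kind);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&ms, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;
}

int main(void)
{
	size_t bytes = sizeof(double)*SIZE;
	double* b1 = (double*)malloc(bytes);   //left untouched after malloc
	double* b2 = (double*)malloc(bytes);   //touched with memset before the copy
	double* b_dev;
	cudaMalloc((void**)&b_dev, bytes);

	memset(b2, 0, bytes);                  //fault in and map every page of b2

	printf("DtoH into untouched buffer: %.1f ms\n",
			timed_memcpy(b1, b_dev, bytes, cudaMemcpyDeviceToHost));
	printf("DtoH into touched buffer:   %.1f ms\n",
			timed_memcpy(b2, b_dev, bytes, cudaMemcpyDeviceToHost));

	free(b1);
	free(b2);
	cudaFree(b_dev);
	return 0;
}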

Yeah, it is much closer to the first transfer now, and the reason is clear. Thanks a lot for every answer on this forum :D