help me understanding the report of Profiler about reading the Profiler report

Hi, I tried to use Visual Profiler today, but I found I cannot understand the report from Profiler, can anyone help me?

I have a small code, to only test the timing of memory copy:

TIMER_LOGGER_BEGIN("api novector memory benchmark");

	

	unsigned int number = 1024 * 1024 / 4 * 100; // 100 MB

	size_t size = number * sizeof(float);

	unsigned int sizeInMB = size / 1024 / 1024;

	float* seqFloat = (float*)malloc(size);

	memset(seqFloat, 0, size);

	

	float* devMem = 0;

	cudaSetDevice(0);

	

	TIMER_LOGGER_NEXT("scatter with api novector directly malloc of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMalloc = cudaMalloc((void**)&devMem, size);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #1 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #2 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t ret2	  = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #3 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t ret3	  = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly finished");

	cudaError_t retFree = cudaFree(devMem);

	cudaThreadSynchronize();

	free(seqFloat);

	

	//cout << retMemcpy << endl;

	

	BOOST_CHECK(retMalloc == cudaSuccess);

	BOOST_CHECK(retMemcpy == cudaSuccess);

	BOOST_CHECK(retFree == cudaSuccess);

	

	TIMER_LOGGER_END();

	TIMER_LOGGER_DUMP_LOG();

The timer logger is a time measure of our own, and when I run this code, it report:

Running 1 test case...

api novector memory benchmark									  0.221

scatter with api novector directly malloc of 100 MB				0.046

scatter with api novector directly #1 of 100 MB					0.121

scatter with api novector directly #2 of 100 MB					0.122

scatter with api novector directly #3 of 100 MB					0.114

scatter with api novector directly finished						0.012

which means the malloc took 0.046s, and memory copy took around 0.120s (Vista+8600+CUDA2.0)

But when I run this test program from Profiler, I got strange report:

# method	GPU Time

1 memcopy   46170.8

2 memcopy   47152.1

3 memcopy   47688.4

I want to know two things:

  1. what’s the time unit of Profiler? Because I cannot match this result to my time logger output, they are too different.

  2. why the time became more longer? in Profiler report, #1 < #2 < #3, and it’s always like this no matter how many time I ran it. But in my time logger report, the time differs, sometimes, #2 is the longest, sometimes, others.

If I change the data block to 1 MB, we can see the strange things more clearly:

Running 1 test case...

api novector memory benchmark									  0.032

scatter with api novector directly malloc of 1 MB				  0.033

scatter with api novector directly #1 of 1 MB					  0.002

scatter with api novector directly #2 of 1 MB					  0.001

scatter with api novector directly #3 of 1 MB					  0.001

scatter with api novector directly finished						0.000

and here is the Profiler output:

# method	GPU Time

1 memcopy   463.328

2 memcopy   932.896

3 memcopy   934.016

see it? the #2 uses 2 times time as #1, this is very strange.

I did another test, this time, I use cudaMallocHost instead of malloc, seems it speed up the memcpy a lot, here is the new code, and I

set the data size to 10 MB:

TIMER_LOGGER_BEGIN("api host memory benchmark");

	

	cudaSetDevice(0);

	size_t size = number * sizeof(float);

	unsigned int sizeInMB = size / 1024 / 1024;

	vector<float> seqFloat(number, 8.8f);

	

	float* hostMem = 0;

	TIMER_LOGGER_NEXT("malloc host and memcpy");

	cudaError_t retMallocHost = cudaMallocHost((void**)&hostMem, size);

	memcpy(hostMem, &seqFloat[0], size);

	

	float* devMem = 0;

	TIMER_LOGGER_NEXT("scatter with api host directly malloc of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMalloc  = cudaMalloc((void**)&devMem, size);

	TIMER_LOGGER_NEXT("scatter with api host directly #1 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy1 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly #2 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy2 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly #3 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy3 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly finished");

	cudaError_t retFree = cudaFree(devMem);

	cudaError_t retFreeHost = cudaFreeHost(hostMem);

	

	BOOST_CHECK(retMalloc == cudaSuccess);

	BOOST_CHECK(retMemcpy1 == cudaSuccess);

	BOOST_CHECK(retMemcpy2 == cudaSuccess);

	BOOST_CHECK(retMemcpy3 == cudaSuccess);

	BOOST_CHECK(retFree == cudaSuccess);

	BOOST_CHECK(retMallocHost == cudaSuccess);

	BOOST_CHECK(retFreeHost == cudaSuccess);

	

	TIMER_LOGGER_END();

	TIMER_LOGGER_DUMP_LOG();

and here is the result:

api host memory benchmark										  0.042

malloc host and memcpy											 0.049

scatter with api host directly malloc of 10 MB					 0.002

scatter with api host directly #1 of 10 MB						 0.008

scatter with api host directly #2 of 10 MB						 0.005

scatter with api host directly #3 of 10 MB						 0.005

scatter with api host directly finished							0.003

but I cannot use Profiler to run this program (with cudaMallocHost), it always said:

Error -96 in reading profiler output. No data rows in profiler output file.

does anyone know why this happen?

For a compare, here is the output when not using cudaMallocHost for 10MB size data:

api novector memory benchmark									  0.044

scatter with api novector directly malloc of 10 MB				 0.038

scatter with api novector directly #1 of 10 MB					 0.011

scatter with api novector directly #2 of 10 MB					 0.011

scatter with api novector directly #3 of 10 MB					 0.010

scatter with api novector directly finished						0.002

so as a result:

  • use cudaMallocHost can speed up around 2x of memory copy

  • use cudaMallocHost cannot use Profiler to get timing data.