help me understanding the report of Profiler about reading the Profiler report

Zhang_Xiang · December 23, 2008, 9:41am

Hi, I tried to use Visual Profiler today, but I found I cannot understand the report from Profiler, can anyone help me?

I have a small code, to only test the timing of memory copy:

TIMER_LOGGER_BEGIN("api novector memory benchmark");

	

	unsigned int number = 1024 * 1024 / 4 * 100; // 100 MB

	size_t size = number * sizeof(float);

	unsigned int sizeInMB = size / 1024 / 1024;

	float* seqFloat = (float*)malloc(size);

	memset(seqFloat, 0, size);

	

	float* devMem = 0;

	cudaSetDevice(0);

	

	TIMER_LOGGER_NEXT("scatter with api novector directly malloc of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMalloc = cudaMalloc((void**)&devMem, size);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #1 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #2 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t ret2	  = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly #3 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t ret3	  = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);

	cudaThreadSynchronize();

	

	TIMER_LOGGER_NEXT("scatter with api novector directly finished");

	cudaError_t retFree = cudaFree(devMem);

	cudaThreadSynchronize();

	free(seqFloat);

	

	//cout << retMemcpy << endl;

	

	BOOST_CHECK(retMalloc == cudaSuccess);

	BOOST_CHECK(retMemcpy == cudaSuccess);

	BOOST_CHECK(retFree == cudaSuccess);

	

	TIMER_LOGGER_END();

	TIMER_LOGGER_DUMP_LOG();

The timer logger is a time measure of our own, and when I run this code, it report:

Running 1 test case...

api novector memory benchmark									  0.221

scatter with api novector directly malloc of 100 MB				0.046

scatter with api novector directly #1 of 100 MB					0.121

scatter with api novector directly #2 of 100 MB					0.122

scatter with api novector directly #3 of 100 MB					0.114

scatter with api novector directly finished						0.012

which means the malloc took 0.046s, and memory copy took around 0.120s (Vista+8600+CUDA2.0)

But when I run this test program from Profiler, I got strange report:

# method	GPU Time

1 memcopy   46170.8

2 memcopy   47152.1

3 memcopy   47688.4

I want to know two things:

what’s the time unit of Profiler? Because I cannot match this result to my time logger output, they are too different.
why the time became more longer? in Profiler report, #1 < #2 < #3, and it’s always like this no matter how many time I ran it. But in my time logger report, the time differs, sometimes, #2 is the longest, sometimes, others.

If I change the data block to 1 MB, we can see the strange things more clearly:

Running 1 test case...

api novector memory benchmark									  0.032

scatter with api novector directly malloc of 1 MB				  0.033

scatter with api novector directly #1 of 1 MB					  0.002

scatter with api novector directly #2 of 1 MB					  0.001

scatter with api novector directly #3 of 1 MB					  0.001

scatter with api novector directly finished						0.000

and here is the Profiler output:

# method	GPU Time

1 memcopy   463.328

2 memcopy   932.896

3 memcopy   934.016

see it? the #2 uses 2 times time as #1, this is very strange.

Zhang_Xiang · December 23, 2008, 11:42am

I did another test, this time, I use cudaMallocHost instead of malloc, seems it speed up the memcpy a lot, here is the new code, and I

set the data size to 10 MB:

TIMER_LOGGER_BEGIN("api host memory benchmark");

	

	cudaSetDevice(0);

	size_t size = number * sizeof(float);

	unsigned int sizeInMB = size / 1024 / 1024;

	vector<float> seqFloat(number, 8.8f);

	

	float* hostMem = 0;

	TIMER_LOGGER_NEXT("malloc host and memcpy");

	cudaError_t retMallocHost = cudaMallocHost((void**)&hostMem, size);

	memcpy(hostMem, &seqFloat[0], size);

	

	float* devMem = 0;

	TIMER_LOGGER_NEXT("scatter with api host directly malloc of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMalloc  = cudaMalloc((void**)&devMem, size);

	TIMER_LOGGER_NEXT("scatter with api host directly #1 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy1 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly #2 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy2 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly #3 of " + boost::lexical_cast<string>(sizeInMB) + " MB");

	cudaError_t retMemcpy3 = cudaMemcpy(devMem, hostMem, size, cudaMemcpyHostToDevice);

	TIMER_LOGGER_NEXT("scatter with api host directly finished");

	cudaError_t retFree = cudaFree(devMem);

	cudaError_t retFreeHost = cudaFreeHost(hostMem);

	

	BOOST_CHECK(retMalloc == cudaSuccess);

	BOOST_CHECK(retMemcpy1 == cudaSuccess);

	BOOST_CHECK(retMemcpy2 == cudaSuccess);

	BOOST_CHECK(retMemcpy3 == cudaSuccess);

	BOOST_CHECK(retFree == cudaSuccess);

	BOOST_CHECK(retMallocHost == cudaSuccess);

	BOOST_CHECK(retFreeHost == cudaSuccess);

	

	TIMER_LOGGER_END();

	TIMER_LOGGER_DUMP_LOG();

and here is the result:

api host memory benchmark										  0.042

malloc host and memcpy											 0.049

scatter with api host directly malloc of 10 MB					 0.002

scatter with api host directly #1 of 10 MB						 0.008

scatter with api host directly #2 of 10 MB						 0.005

scatter with api host directly #3 of 10 MB						 0.005

scatter with api host directly finished							0.003

but I cannot use Profiler to run this program (with cudaMallocHost), it always said:

Error -96 in reading profiler output. No data rows in profiler output file.

does anyone know why this happen?

For a compare, here is the output when not using cudaMallocHost for 10MB size data:

api novector memory benchmark									  0.044

scatter with api novector directly malloc of 10 MB				 0.038

scatter with api novector directly #1 of 10 MB					 0.011

scatter with api novector directly #2 of 10 MB					 0.011

scatter with api novector directly #3 of 10 MB					 0.010

scatter with api novector directly finished						0.002

so as a result:

use cudaMallocHost can speed up around 2x of memory copy
use cudaMallocHost cannot use Profiler to get timing data.

Topic		Replies	Views
How to explain the performance difference? CUDA Programming and Performance	7	3569	March 26, 2008
reliability of profiler output data profiler output data unplausible? CUDA Programming and Performance	3	2231	May 6, 2008
CUDA Profiler [memcopy] weird result CUDA Programming and Performance	7	7315	November 8, 2007
Execution time is different in Profiller and Console. why? CUDA Programming and Performance	4	3792	August 3, 2009
preview of NVIDIA Visual Profiler CUDA Programming and Performance	76	89346	May 18, 2010
CUDA 2.0 and Visual Profiler 1.0.11 cudaprof reports are missing memcopy's CUDA Programming and Performance	5	5714	August 26, 2008
cuda visual profiler CUDA Programming and Performance	12	8264	July 30, 2008
cudaMemcpy timing CUDA Programming and Performance	1	6815	December 8, 2010
Profiler Times just need some info CUDA Programming and Performance	4	4579	June 16, 2010
Visual profiler missing information CUDA Programming and Performance	6	8460	May 26, 2009

help me understanding the report of Profiler about reading the Profiler report

Related topics