Hi, I tried to use Visual Profiler today, but I found I cannot understand the report from Profiler, can anyone help me?
I have a small code, to only test the timing of memory copy:
TIMER_LOGGER_BEGIN("api novector memory benchmark");
unsigned int number = 1024 * 1024 / 4 * 100; // 100 MB
size_t size = number * sizeof(float);
unsigned int sizeInMB = size / 1024 / 1024;
float* seqFloat = (float*)malloc(size);
memset(seqFloat, 0, size);
float* devMem = 0;
cudaSetDevice(0);
TIMER_LOGGER_NEXT("scatter with api novector directly malloc of " + boost::lexical_cast<string>(sizeInMB) + " MB");
cudaError_t retMalloc = cudaMalloc((void**)&devMem, size);
cudaThreadSynchronize();
TIMER_LOGGER_NEXT("scatter with api novector directly #1 of " + boost::lexical_cast<string>(sizeInMB) + " MB");
cudaError_t retMemcpy = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);
cudaThreadSynchronize();
TIMER_LOGGER_NEXT("scatter with api novector directly #2 of " + boost::lexical_cast<string>(sizeInMB) + " MB");
cudaError_t ret2 = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);
cudaThreadSynchronize();
TIMER_LOGGER_NEXT("scatter with api novector directly #3 of " + boost::lexical_cast<string>(sizeInMB) + " MB");
cudaError_t ret3 = cudaMemcpy(devMem, seqFloat, size, cudaMemcpyHostToDevice);
cudaThreadSynchronize();
TIMER_LOGGER_NEXT("scatter with api novector directly finished");
cudaError_t retFree = cudaFree(devMem);
cudaThreadSynchronize();
free(seqFloat);
//cout << retMemcpy << endl;
BOOST_CHECK(retMalloc == cudaSuccess);
BOOST_CHECK(retMemcpy == cudaSuccess);
BOOST_CHECK(retFree == cudaSuccess);
TIMER_LOGGER_END();
TIMER_LOGGER_DUMP_LOG();
The timer logger is a time measure of our own, and when I run this code, it report:
Running 1 test case...
api novector memory benchmark 0.221
scatter with api novector directly malloc of 100 MB 0.046
scatter with api novector directly #1 of 100 MB 0.121
scatter with api novector directly #2 of 100 MB 0.122
scatter with api novector directly #3 of 100 MB 0.114
scatter with api novector directly finished 0.012
which means the malloc took 0.046s, and memory copy took around 0.120s (Vista+8600+CUDA2.0)
But when I run this test program from Profiler, I got strange report:
# method GPU Time
1 memcopy 46170.8
2 memcopy 47152.1
3 memcopy 47688.4
I want to know two things:
-
what’s the time unit of Profiler? Because I cannot match this result to my time logger output, they are too different.
-
why the time became more longer? in Profiler report, #1 < #2 < #3, and it’s always like this no matter how many time I ran it. But in my time logger report, the time differs, sometimes, #2 is the longest, sometimes, others.
If I change the data block to 1 MB, we can see the strange things more clearly:
Running 1 test case...
api novector memory benchmark 0.032
scatter with api novector directly malloc of 1 MB 0.033
scatter with api novector directly #1 of 1 MB 0.002
scatter with api novector directly #2 of 1 MB 0.001
scatter with api novector directly #3 of 1 MB 0.001
scatter with api novector directly finished 0.000
and here is the Profiler output:
# method GPU Time
1 memcopy 463.328
2 memcopy 932.896
3 memcopy 934.016
see it? the #2 uses 2 times time as #1, this is very strange.