Data transfer from CPU to GPU

I have a problem with data transfer from CPU to GPU.

Copying 496 kB takes 0.2156455 ms
Copying 768 kB takes 0.2149359 ms
Copying 1.4 MB takes 0.2054923 ms
Copying 4.76 MB takes 0.2105404 ms

Why does it take the same time regardless of size? :no:
Can anybody tell me what the problem is?
I write the main() function in a *.cpp file and call the CUDA function from a *.cu file.

This is my code:

//start timer for the CPU-to-GPU copy
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));

//copy data from CPU to GPU
unsigned char* zero_gpu;
int Main_size = sizeof(unsigned char) * mainzero_col * mainzero_row;
cudaMalloc((void**)&zero_gpu, Main_size);
cudaMemcpy(zero_gpu, zero_cpu, Main_size, cudaMemcpyHostToDevice);

// stop and destroy timer
CUT_SAFE_CALL(cutStopTimer(timer));
printf("copy CPU - GPU time: %f (ms) \n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));

When I copy data back from GPU to CPU, it is very quick:

Copying 496 kB takes 0.0090532 ms
Copying 768 kB takes 0.0165901 ms
Copying 1.4 MB takes 0.031168 ms
Copying 4.76 MB takes 0.1980835 ms

I'm not sure, but try a cudaThreadSynchronize(); I think I had the same problem with that. And why are you so upset about those timings, considering that there is also some overhead in making the connection with the GPU for the first time?

It is only 200 microseconds, isn't it?
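For what it's worth, a common alternative to the cutil timers is CUDA events, which are recorded in the GPU's command stream and so measure the transfer itself rather than CPU-side overhead. A minimal sketch, assuming a 4 MB buffer (the size and variable names are placeholders, not from the thread's code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 4 * 1024 * 1024;          // assumed 4 MB test buffer
    unsigned char *h_buf = new unsigned char[N];
    unsigned char *d_buf;
    cudaMalloc((void**)&d_buf, N);

    // Events are queued on the GPU, so the elapsed time between them
    // covers only the work the GPU did in between.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, N, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // wait until the copy has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}
```

This needs a CUDA-capable GPU to run, so take it as a sketch rather than a drop-in replacement.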

Thank you very much

I had already used it.

When I measure the time on the GPU it looks OK, but when I measure the time on the CPU it is different.

What I mean is: in the main() function, before I call the CUDA function, I start a timer (call it CPU_timer). Inside the CUDA function I start another timer (call it GPU_timer). But the results of the two timers are very different.

My code in the *.cpp file:

int main()
{
    //do something here
    clock_t start;
    clock_t stop;

    start = clock() + CLOCKS_PER_SEC;
    data_transfer(tem_0, tem_col, tem_row); //call the CUDA function
    stop = clock() + CLOCKS_PER_SEC;

    double duration = (double)(stop - start) / CLOCKS_PER_SEC;
    printf("%2.6f seconds\n", duration);
}

My code in the *.cu file:

void data_transfer(unsigned char *zero_cpu, int mainzero_col, int mainzero_row)
{
    cudaThreadSynchronize();

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    //copy data from CPU to GPU
    unsigned char* zero_gpu;
    int Main_size = sizeof(unsigned char) * mainzero_col * mainzero_row;
    cudaMalloc((void**)&zero_gpu, Main_size);
    cudaMemcpy(zero_gpu, zero_cpu, Main_size, cudaMemcpyHostToDevice); //the matrix is now in global memory
    cudaThreadSynchronize();

    // stop and destroy timer
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("copy CPU - GPU time: %f (ms) \n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));
}

The CPU_timer reading is much higher:

CPU_timer: 0.321340 seconds
GPU_timer: 0.009102 seconds

I don't think the overhead of calling the function should account for a gap as large as CPU_timer - GPU_timer.

Memory allocation may take more time than the copying.
Try to measure the time of the copy only.
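One way to separate the two costs is to time the cudaMalloc and the cudaMemcpy independently; a sketch under the same assumptions as before (placeholder buffer size and names, needs a GPU to run):

```cpp
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

int main() {
    const int N = 4 * 1024 * 1024;          // assumed 4 MB test buffer
    unsigned char *h_buf = new unsigned char[N];
    unsigned char *d_buf;

    // Time the allocation on its own...
    clock_t t0 = clock();
    cudaMalloc((void**)&d_buf, N);
    clock_t t1 = clock();

    // ...then the copy on its own, with a synchronize so the
    // measurement covers the whole transfer.
    cudaMemcpy(d_buf, h_buf, N, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();
    clock_t t2 = clock();

    printf("cudaMalloc: %f s, cudaMemcpy: %f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);

    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}
```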

Before I used cudaThreadSynchronize(), the difference between GPU_timer and CPU_timer was very small. After I added it, the difference became very large. So I think something is happening there. :(

The client only sees the CPU_timer and doesn't care about the GPU_timer. :(

Whether or not I call cudaThreadSynchronize(), the CPU_timer doesn't change (it stays constant), but the GPU_timer changes a lot.

Another problem: why is the GPU_timer for copying data from CPU to GPU not the same as the GPU_timer for copying data from GPU to CPU? The difference is very large. I use two timers, one for each direction:

CPU to GPU: copying 4.76 MB takes 0.2105404 ms
GPU to CPU: copying 4.76 MB takes 0.1980835 ms

Please give me the answer.

You should average the times over hundreds of runs, because the timings will fluctuate.

CPU->GPU copies usually have a different bandwidth than GPU->CPU copies (look at the output of bandwidthTest).

The reason the CPU time is higher than the GPU time is that the CPU has to move the data to a pinned memory buffer, then ask the GPU to do the DMA transfer, and wait for the GPU to finish. So the CPU has some extra work to do besides waiting for the GPU to finish the DMA transfer.
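If that extra staging copy matters, allocating the host buffer as page-locked (pinned) memory with cudaMallocHost lets the DMA engine read it directly. A sketch, not from the thread's code (placeholder size and names, needs a GPU to run):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 4 * 1024 * 1024;          // assumed 4 MB test buffer
    unsigned char *h_buf;                   // pinned (page-locked) host buffer
    unsigned char *d_buf;

    // cudaMallocHost returns page-locked memory, so the GPU can DMA
    // straight from it without the intermediate staging copy.
    cudaMallocHost((void**)&h_buf, N);
    cudaMalloc((void**)&d_buf, N);

    cudaMemcpy(d_buf, h_buf, N, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);                    // pinned memory has its own free call
    return 0;
}
```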

Thank you, DenisR.

My code doesn't have any mistakes (in syntax or algorithm), but I'm trying to find what is wrong with it. I used the CUDA profiler and its result is the same as my GPU_timer, so I think your idea is correct. But it is not what I expected. :(