cudaMemcpyAsync not behaving asynchronously

stencil · July 1, 2008, 11:20pm

Hi, I’m trying to get asynchronous memory copy calls working correctly, but have been unable to. I am using a GeForce 8800 GTX, which is compute capability 1.0, but asynchronous memory copy should still work, right?

Here is a sample program that does not behave as expected:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime_api.h>

/* Abort with a message if a CUDA runtime call fails. */
#define SAFE_CALL(call) do {                                          \
    cudaError_t err = call;                                           \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                __FILE__, __LINE__, cudaGetErrorString(err));         \
        exit(1);                                                      \
    } } while (0)

#define NUM_BYTES 200000000

void timestamp(const char* message);

int main() {
    char* host_ptr;
    char* device_ptr;

    /* Page-locked host memory, needed for the copy to be asynchronous. */
    SAFE_CALL(cudaMallocHost((void**)(&host_ptr), NUM_BYTES));
    SAFE_CALL(cudaMalloc((void**)(&device_ptr), NUM_BYTES));
    timestamp("done mallocing");

    SAFE_CALL(cudaMemcpyAsync(device_ptr, host_ptr, NUM_BYTES, cudaMemcpyHostToDevice, 0));
    timestamp("done issuing memory copy");

    SAFE_CALL(cudaThreadSynchronize());
    timestamp("completed memory copy");
}

clock_t last_time = 0;

/* Print the time since the previous call and the overall time reported by clock(). */
void timestamp(const char* message) {
    clock_t current_time = (clock() * 1000) / CLOCKS_PER_SEC;
    fprintf(stderr, "%s +%ldms (overall time=%ldms)\n", message,
            (long)(current_time - last_time), (long)current_time);
    last_time = current_time;
}

If I compile this program as nvcc test.cu, then run it, I get the following output:

done mallocing +187ms (overall time=187ms)

done issuing memory copy +78ms (overall time=265ms)

completed memory copy +0ms (overall time=265ms)

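Am I doing something wrong? I expected the cudaMemcpyAsync call to return almost immediately and the time to show up in cudaThreadSynchronize instead, but the output shows the opposite. Can someone else try this code and let me know what you get? Thanks!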

Sarnath · July 2, 2008, 4:23am

Your timing includes the time of “fprintf” as well!

stencil · July 2, 2008, 7:50pm

Hi,

The time for fprintf is insignificant here. If I set NUM_BYTES to 1, the memory copy portion of the program (including the associated fprintfs) takes 0ms. The entire output is:
done mallocing +109ms (overall time=109ms)
done issuing memory copy +0ms (overall time=109ms)
completed memory copy +0ms (overall time=109ms)

(post edited because I had the output wrong at first)
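
For anyone who wants to rule out fprintf completely, here is a rough sketch (not code from this thread) that stores host timestamps around the issue and the wait, and leaves all printing to the caller. The helper name time_async_copy, the use of std::chrono, a separately created stream, and cudaStreamSynchronize are all assumptions on my part; they assume a much newer toolkit than CUDA 1.1.

```cuda
#include <chrono>
#include <cuda_runtime_api.h>

// Hypothetical helper (not from the thread): measures how long
// cudaMemcpyAsync takes to return (issue_ms) and how long the following
// synchronize blocks (wait_ms). No I/O happens inside a measured interval.
cudaError_t time_async_copy(size_t num_bytes, double* issue_ms, double* wait_ms) {
    using clk = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    char* host_ptr = nullptr;
    char* device_ptr = nullptr;
    cudaStream_t stream = nullptr;

    // Pinned host memory, as in the original program.
    cudaError_t err = cudaMallocHost((void**)&host_ptr, num_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&device_ptr, num_bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);

    if (err == cudaSuccess) {
        clk::time_point t0 = clk::now();
        err = cudaMemcpyAsync(device_ptr, host_ptr, num_bytes,
                              cudaMemcpyHostToDevice, stream);
        clk::time_point t1 = clk::now();  // the call has returned to the host
        if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
        clk::time_point t2 = clk::now();  // the copy has finished (or failed)

        *issue_ms = ms(t1 - t0).count();
        *wait_ms = ms(t2 - t1).count();
    }

    // Release whatever was successfully created.
    if (stream) cudaStreamDestroy(stream);
    if (device_ptr) cudaFree(device_ptr);
    if (host_ptr) cudaFreeHost(host_ptr);
    return err;
}
```

The caller would print issue_ms and wait_ms only after the function returns. If the copy really is asynchronous, I would expect issue_ms to be small compared to wait_ms for a large transfer; if most of the time lands in issue_ms, the call is effectively blocking.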

Kravell · July 3, 2008, 12:38pm

Hi, I tried your code on a Tesla C870, which is basically the same hardware as the 8800 GTX.

I get the following output:

This shows that concurrent memcopy/CPU execution is working (I have no idea why it doesn’t work in your case). But if I remember correctly, you don’t need “cudaMemcpyAsync” for that. “cudaMemcpyAsync” is intended to provide concurrent memcopy/kernel execution with streams, but only on compute capability 1.1 hardware.
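
To illustrate the stream usage described above, here is a minimal sketch (not code from this thread) that launches a kernel in one stream and issues a host-to-device copy in another. The names busy_kernel and copy_alongside_kernel are made up for illustration. The sketch assumes device 0, n > 0, and a recent runtime that provides cudaDeviceGetAttribute with cudaDevAttrAsyncEngineCount.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): keeps the GPU busy while a copy is in flight.
__global__ void busy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Issues a kernel in one stream and a pinned host-to-device copy in another.
// *can_overlap reports whether device 0 advertises at least one copy engine
// that can run alongside kernels. Returns the first CUDA error encountered.
cudaError_t copy_alongside_kernel(size_t copy_bytes, int n, bool* can_overlap) {
    int engines = 0;
    cudaError_t err = cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
    if (err != cudaSuccess) return err;
    *can_overlap = (engines > 0);

    char* host_buf = nullptr;
    char* dev_buf = nullptr;
    float* work = nullptr;
    cudaStream_t copy_stream = nullptr;
    cudaStream_t kernel_stream = nullptr;

    err = cudaMallocHost((void**)&host_buf, copy_bytes);  // pinned memory
    if (err == cudaSuccess) err = cudaMalloc((void**)&dev_buf, copy_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&work, n * sizeof(float));
    if (err == cudaSuccess) err = cudaStreamCreate(&copy_stream);
    if (err == cudaSuccess) err = cudaStreamCreate(&kernel_stream);

    // Kernel work is queued in kernel_stream...
    if (err == cudaSuccess)
        err = cudaMemsetAsync(work, 0, n * sizeof(float), kernel_stream);
    if (err == cudaSuccess) {
        busy_kernel<<<(n + 255) / 256, 256, 0, kernel_stream>>>(work, n);
        err = cudaGetLastError();
    }
    // ...and the copy in copy_stream, so the two are free to overlap.
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dev_buf, host_buf, copy_bytes,
                              cudaMemcpyHostToDevice, copy_stream);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    // Release whatever was successfully created.
    if (copy_stream) cudaStreamDestroy(copy_stream);
    if (kernel_stream) cudaStreamDestroy(kernel_stream);
    if (work) cudaFree(work);
    if (dev_buf) cudaFree(dev_buf);
    if (host_buf) cudaFreeHost(host_buf);
    return err;
}
```

Note that this only sets up the conditions for overlap; it does not prove the copy and the kernel actually ran at the same time. A profiler timeline would be needed to confirm that on a given card.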

stencil · July 3, 2008, 7:23pm

Thanks for the info. What version of CUDA are you using? I’m using version 1.1.

Kravell · July 4, 2008, 7:58am

I am using version 2.0 with driver 177.13.
