In the code below, I get an error from the call to cudaMemcpyAsync. If I replace it with cudaMemcpy, using the same arguments, the code works. Can anyone
take a look and give me a pointer on this? I have marked the relevant code in bold and underline.
////////////////////////////////////////////////////////////////////////////////
// GPU thread
////////////////////////////////////////////////////////////////////////////////
typedef struct {
//Device id
int device;
//Host-side input data
int dataN;
float *h_Data;
//Partial sum for this GPU
float *h_Sum;
} TGPUplan;
static CUT_THREADPROC solverThread(TGPUplan *plan){
const int BLOCK_N = 32;
const int THREAD_N = 256;
const int ACCUM_N = BLOCK_N * THREAD_N;
This is wrong. The last argument of cudaMemcpyAsync is a cudaStream_t obtained through cudaStreamCreate. See the CUDA Reference Manual, pages 13 and 34.
In addition to passing a cudaStream_t as the last argument in the above two calls, you may have to ensure that plan->h_Data and h_Sum both point to page-locked memory. This can be done with cudaMallocHost() (see page 30 of CudaReferenceManual_2.0.pdf and the description of cudaMemcpyAsync() on page 34).
I am looking at the code example called simpleMultiGpu. What I am trying to do is an async copy instead of the plain copy. I added code, as you suggested, to create the cudaStream_t in plan->device; I thought cudaStream_t is an int. cudaMallocHost is used for host memory and cudaMalloc for GPU memory. This seems correct to me, but still the last parameter (as you have pointed … is incorrect). Can you show me the correct line?
static CUT_THREADPROC solverThread(TGPUplan *plan){
const int BLOCK_N = 32;
const int THREAD_N = 256;
const int ACCUM_N = BLOCK_N * THREAD_N;