I have a CUDA program and it seems to take too long to run.
My input is rfData_h:
cudaMalloc((void **) &rfData_d, size_nmpts_nmlne);
cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);
Then my kernel invocation, then:
cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);
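For reference, here is a minimal, self-contained sketch of that sequence. `myKernel`, `n`, and the pass-through body are placeholders for my real code; the allocation and copy calls match what I showed above:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder for my real kernel: just copies input to output.
__global__ void myKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main(void) {
    const int n = 1 << 20;                       // placeholder size
    const size_t size_nmpts_nmlne = n * sizeof(float);

    // Host buffer allocated with malloc, i.e. pageable memory.
    float *rfData_h = (float *)malloc(size_nmpts_nmlne);

    float *rfData_d, *rfDataMODE_d;
    cudaMalloc((void **)&rfData_d, size_nmpts_nmlne);
    cudaMalloc((void **)&rfDataMODE_d, size_nmpts_nmlne);

    // copy in -> kernel -> copy out, as described above
    cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);
    myKernel<<<(n + 255) / 256, 256>>>(rfData_d, rfDataMODE_d, n);
    cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);

    cudaFree(rfData_d);
    cudaFree(rfDataMODE_d);
    free(rfData_h);
    return 0;
}
```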
The kernel itself is fast, so no problem there (<300 µs), but the memory transfers take too long (>100 ms).
Could I use shared memory instead of local memory to make this faster?
I need the whole copy -> kernel -> retrieve sequence to finish in under 1 ms.