Fast copy (device->host) after async kernel call

Hi all,
I’d like to copy data from device to host after the execution of a kernel queued in a given stream. I want to minimize the time between the end of the kernel’s execution and the end of the copy.
I have two scenarios:
copy of a single value, a few bytes (like a long)
copy of a vector, around 500 bytes to 64 KB

I’ve tried two approaches:
-map the pinned host memory into the CUDA address space and launch the copy from the kernel (coalesced access, each thread copies 4/8… bytes at a time) (I’ve also tried write-combined memory)
-add a callback to the stream queue after the async kernel call… this callback contains a cudaMemcpy or cudaMemcpyAsync… (yes, I know I could use a cudaMemcpyAsync immediately after the kernel invocation, without a handwritten addCallback, but I’d like to have more flexibility in the future)…

The first approach always performs worse than the second (5-20 times)… my GPU is a GT 540M (Fermi)… do you know if cudaMemcpyAsync is always better than copying from inside the kernel, even for small amounts of data? Or does it depend more on the architecture used? Or can you suggest other approaches?
thanks for your time,
Marco

Technically a cudaMemcpy in a stream callback is illegal:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-callbacks

“A callback must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself if it makes such a call leading to a deadlock.”

Considering only the cudaMemcpy operation, just putting cudaMemcpyAsync in the stream immediately after the kernel call should be just as fast as using a callback.
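For reference, that pattern looks roughly like this (a minimal sketch, not taken from your code; the kernel and variable names are illustrative, and h_result is assumed to be allocated with cudaHostAlloc so the copy can actually be asynchronous):

```cuda
// Sketch: kernel followed by cudaMemcpyAsync in the same stream.
// Stream ordering guarantees the copy begins only after the kernel
// completes, with no callback needed.
__global__ void mykernel(int *d_result) { *d_result = 42; }

void launch_and_copy(int *d_result, int *h_result, cudaStream_t stream)
{
    mykernel<<<1, 1, 0, stream>>>(d_result);
    cudaMemcpyAsync(h_result, d_result, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    // ... do other host work here, then wait only for this stream:
    cudaStreamSynchronize(stream);
}
```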

I don’t see how the first approach could perform 5-20 times worse than the second approach (whatever 5-20 times means), and you’ve provided no support for that statement whatsoever, so I’ll leave it at that. Perhaps you’re measuring pinning costs or other overheads in your measurement.

Thanks, I didn’t see that statement. It works, but I hadn’t thought about legality

I agree… the fact is that in the future I may need to perform some atomic operations on data “returned” by the kernel… so, if I perform the copy from the kernel, I would use the CUDA atomics stuff; otherwise I’d need some lock/atomics on the CPU side… but I’m not sure I will need this part
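For the device-atomics variant, what I have in mind is something like this hypothetical sketch (names are illustrative): the threads accumulate atomically into a single device word, which is then copied back with one small transfer, so the atomics stay in device memory rather than targeting the mapped host pointer:

```cuda
// Hypothetical sketch: device-side atomics into device memory, then a
// single small device->host copy of the accumulated result.
__global__ void reduce_sum(const int *d_in, int *d_sum, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        atomicAdd(d_sum, d_in[idx]);  // atomic on global device memory

}

// host side (stream is assumed to be a created cudaStream_t,
// h_sum pinned host memory):
//   cudaMemsetAsync(d_sum, 0, sizeof(int), stream);
//   reduce_sum<<<blocks, threads, 0, stream>>>(d_in, d_sum, n);
//   cudaMemcpyAsync(h_sum, d_sum, sizeof(int),
//                   cudaMemcpyDeviceToHost, stream);
```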

I’ve looked at the time measurements from the CUDA visual profiler (Linux)…
According to my measurements, the copy performed by the kernel is slightly worse than cudaMemcpy when executed on a big array, like 1 GB… In that case, the kernel_copy/memcpy time ratio is 1.37 for host->device and 1.14 for device->host…

but I’m working with small kernels (about 1 ms each)… in that case, I compute something and need to store a few values on the host… it seems faster to call a cudaMemcpy after the kernel instead of doing everything inside the kernel… otherwise the kernel would take an additional 5-20x the time used by the memcpy…

Generally speaking, kernels running for 1 ms are unproblematic. Specific performance issues with short-running kernels typically don’t show up until kernel execution times fall below the 100 us mark. The approach with cudaMemcpyAsync() in the same stream as the kernel is sound. However, your transfers are very small, which means the efficiency of each such transfer is low due to relatively high overhead cost.

Without knowing any specifics of the use case, it is only possible to make vague recommendations. You may want to consider the following to improve performance: (a) obviously you would want to use at least two CUDA streams to hide the transfers behind the kernel execution; (b) eliminate the small device->host transfers by porting the (control?) code currently residing on the host to the GPU; (c) increase the size of each transfer, e.g. by increasing the amount of data processed by each kernel call (larger input sets and larger output sets).
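Recommendation (a) can be sketched as follows (illustrative only: mykernel, buffer names, and launch dimensions are placeholders, and the host buffers are assumed to be pinned with cudaHostAlloc, otherwise the copies serialize):

```cuda
// Sketch: two streams with double-buffered device memory, so the
// device->host copy of one batch overlaps the kernel of the next.
__global__ void mykernel(const int *in, int *out, size_t n);  // placeholder

void pipeline(int *d_in[2], int *d_out[2], int *h_out[],
              size_t n, size_t bytes, int nbatch)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);
    for (int b = 0; b < nbatch; b++) {
        int j = b & 1;                            // alternate streams/buffers
        mykernel<<<64, 256, 0, s[j]>>>(d_in[j], d_out[j], n);
        cudaMemcpyAsync(h_out[b], d_out[j], bytes,
                        cudaMemcpyDeviceToHost, s[j]);
    }
    cudaDeviceSynchronize();                      // wait for all batches
    for (int i = 0; i < 2; i++) cudaStreamDestroy(s[i]);
}
```

Because work within a stream executes in order, reusing buffer j for batch b+2 is safe: its kernel cannot start until batch b’s copy in the same stream has completed.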

I would expect the two approaches to be “roughly” comparable in terms of execution time.

Since you’ve not bothered to show any code or demonstrate what exactly you are timing, I put something together.

Depending on what size of data I operate on, I find that in one case, the cuda-kernel-copy-to-host method is slightly faster, and in the other case the kernel followed by a cudaMemcpy is slightly faster, but I don’t see anything to suggest a 5-20x perf difference between the two cases.

Starting with a kernel that operates on about 40MB of data, and in the kernel+cudaMemcpy case, takes about 1ms for the kernel to execute, the code is like this:

#include <stdio.h>

#define DSIZE 1048576*10
#define nTPB 256
#define nBLK 64
#define TVAL 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void datakernel(const T *d_in, T *d_out, const size_t len, const T tval){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;

  while (idx < len){
    d_out[idx] = d_in[idx] + tval;
    idx += blockDim.x*gridDim.x;}
}

__global__ void flagkernel(int *data){

  *data = 1;
}

int main(){

  int *d_data1, *d_data2, *h_rpin, *h_flag;

  size_t dsize = DSIZE*sizeof(int);
  cudaMalloc(&d_data1, dsize);
  cudaMalloc(&d_data2, dsize);
  cudaCheckErrors("cudaMalloc fail");
  cudaHostAlloc(&h_rpin, dsize, cudaHostAllocMapped);
  cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  cudaMemset(d_data1, 0, dsize);
  cudaMemset(d_data2, 0, dsize);
  memset(h_rpin, 0, dsize);
  memset(h_flag, 0, sizeof(int));
  unsigned long long t1 = dtime_usec(0);
  datakernel<<<nBLK, nTPB>>>(d_data1, h_rpin, DSIZE, TVAL);
  flagkernel<<<1,1>>>(h_flag);
  while (!(*((volatile int *)h_flag)));
  t1 = dtime_usec(t1);
  unsigned long long t2 = dtime_usec(0);
  datakernel<<<nBLK, nTPB>>>(d_data1, d_data2, DSIZE, TVAL);
  cudaMemcpy(h_rpin, d_data2, dsize, cudaMemcpyDeviceToHost);
  t2 = dtime_usec(t2);
  printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);
  return 0;

}

Compiling on Fedora 20/CUDA 7, for a Quadro 5000 (also a Fermi device) and running the code with the profiler:

$ nvcc -O3 -o t836 t836.cu
$ nvprof --print-gpu-trace ./t836
==616== NVPROF is profiling process 616, command: ./t836
t1 = 0.006693s, t2 = 0.007332s
==616== Profiling application: ./t836
==616== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
766.83ms  368.73us                    -               -         -         -         -  41.943MB  113.75GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
767.20ms  366.52us                    -               -         -         -         -  41.943MB  114.44GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
778.29ms  6.6061ms             (64 1 1)       (256 1 1)        10        0B        0B         -           -  Quadro 5000 (0)         1         7  void datakernel<int>(int const *, int*, unsigned long, int) [190]
784.90ms  1.8440us              (1 1 1)         (1 1 1)         7        0B        0B         -           -  Quadro 5000 (0)         1         7  flagkernel(int*) [193]
784.91ms  855.31us             (64 1 1)       (256 1 1)        10        0B        0B         -           -  Quadro 5000 (0)         1         7  void datakernel<int>(int const *, int*, unsigned long, int) [199]
785.77ms  6.4553ms                    -               -         -         -         -  41.943MB  6.4975GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

So using the timing built into the code, we see that the t1 case (kernel copying directly to host) is somewhat faster than the t2 case (kernel followed by cudaMemcpy). Looking at the breakdown in the profiler, we see that in the kernel-only case, the kernel runs for 6.6ms. In the kernel + cudaMemcpy case, the kernel itself runs for 0.855ms and the subsequent cudaMemcpy operation takes 6.45ms, achieving about 6.5GB/s bandwidth (which is correct for this device - a Gen2 x16 PCIE link).

The kernels are not doing precisely the same thing: the kernel-only case has 1 read from global memory, followed by a write to host memory. In the kernel + cudaMemcpy case, the kernel does a read and write to global memory, and the cudaMemcpy does a read from global memory and a write to host memory. The time for a read or a write to global memory is approximated by the duration of the [CUDA memset] kernels, which are about 370us each. If we factor that into the above comparison, we see that the two cases are nearly identical.

If I reduce the data set size by a factor of 10 (to 4MB), then the kernel+cudaMemcpy case starts to win, but not by anything like a factor of 5x to 20x. Here’s that data for comparison:

t1 = 0.000956s, t2 = 0.000757s
==683== Profiling application: ./t836
==683== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
750.45ms  38.783us                    -               -         -         -         -  4.1943MB  108.15GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
750.49ms  37.024us                    -               -         -         -         -  4.1943MB  113.29GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
750.84ms  913.75us             (64 1 1)       (256 1 1)        10        0B        0B         -           -  Quadro 5000 (0)         1         7  void datakernel<int>(int const *, int*, unsigned long, int) [190]
751.75ms  1.8440us              (1 1 1)         (1 1 1)         7        0B        0B         -           -  Quadro 5000 (0)         1         7  flagkernel(int*) [193]
751.77ms  86.413us             (64 1 1)       (256 1 1)        10        0B        0B         -           -  Quadro 5000 (0)         1         7  void datakernel<int>(int const *, int*, unsigned long, int) [199]
751.86ms  648.86us                    -               -         -         -         -  4.1943MB  6.4641GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

In this case, the code measurement for the t1 case is a little less than 1.5x the t2 case, and the t1 kernel takes almost 1.5x longer than the cudaMemcpy time for the t2 case, which is a little surprising to me, but nothing like the 5-20x you mentioned.

Here’s a modified version of the code above that steps through various sizes. Sometimes one method is faster, sometimes the other method is faster:

$ cat t836.cu
#include <stdio.h>

#define DSIZE 1048576*10
#define nTPB 256
#define nBLK 64
#define TVAL 1

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void datakernel(const T *d_in, T *d_out, const size_t len, const T tval){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;

  while (idx < len){
    d_out[idx] = d_in[idx] + tval;
    idx += blockDim.x*gridDim.x;}
}

__global__ void flagkernel(int *data){

  *data = 1;
}

int main(){

  int *d_data1, *d_data2, *h_rpin, *h_flag;

  size_t dsize = DSIZE*sizeof(int);
  cudaMalloc(&d_data1, dsize);
  cudaMalloc(&d_data2, dsize);
  cudaCheckErrors("cudaMalloc fail");
  cudaHostAlloc(&h_rpin, dsize, cudaHostAllocMapped);
  cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  for (; dsize > 4; dsize /= 10){
    cudaMemset(d_data1, 0, dsize);
    cudaMemset(d_data2, 0, dsize);
    memset(h_rpin, 0, dsize);
    memset(h_flag, 0, sizeof(int));
    unsigned long long t1 = dtime_usec(0);
    datakernel<<<nBLK, nTPB>>>(d_data1, h_rpin, dsize/sizeof(int), TVAL);
    flagkernel<<<1,1>>>(h_flag);
    while (!(*((volatile int *)h_flag)));
    t1 = dtime_usec(t1);
    unsigned long long t2 = dtime_usec(0);
    datakernel<<<nBLK, nTPB>>>(d_data1, d_data2, dsize/sizeof(int), TVAL);
    cudaMemcpy(h_rpin, d_data2, dsize, cudaMemcpyDeviceToHost);
    t2 = dtime_usec(t2);
    printf("t1 = %fs, t2 = %fs, dsize: %d\n", t1/(float)USECPSEC, t2/(float)USECPSEC, (int)(dsize/sizeof(int)));}
  return 0;

}
$ nvcc -O3 -o t836 t836.cu
$ ./t836
t1 = 0.006709s, t2 = 0.007336s, dsize: 10485760
t1 = 0.000897s, t2 = 0.000742s, dsize: 1048576
t1 = 0.000105s, t2 = 0.000085s, dsize: 104857
t1 = 0.000020s, t2 = 0.000043s, dsize: 10485
t1 = 0.000012s, t2 = 0.000014s, dsize: 1048
t1 = 0.000011s, t2 = 0.000013s, dsize: 104
t1 = 0.000011s, t2 = 0.000013s, dsize: 10
$

You’re right… copying from the kernel can be better than memcpy… the strange thing is that, for testing purposes, I used a kernel similar to yours, but with a “for” loop instead of a “while”… and my version used 22 (or 20) registers, while yours only 15…

Thanks for your replies,
marco

Just one question… is your flag-kernel approach faster than using cudaDeviceSynchronize() (and similar)?

Probably not. But you have the code and the testing methodology. You can just copy my code, make whatever changes you want, and recompile and run the test. If you use nvprof just as I have shown, you can answer for yourself which is quicker.
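For instance, the change amounts to replacing the flag kernel and the host spin-wait in the t1 measurement above with a synchronize call, roughly like this (a sketch against the posted code):

```cuda
// Variant of the t1 measurement: cudaDeviceSynchronize() instead of
// flagkernel + host polling of the mapped flag.
unsigned long long t1 = dtime_usec(0);
datakernel<<<nBLK, nTPB>>>(d_data1, h_rpin, DSIZE, TVAL);
cudaDeviceSynchronize();   // replaces flagkernel<<<1,1>>> + while(!*h_flag);
t1 = dtime_usec(t1);
```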

I’ve seen that on my platform (Fermi) it is not faster… I asked because I don’t know if it could be faster with other compute capabilities ;-)

I don’t know what it would be across other compute capabilities. I was testing on Fermi too.

Ok, no problem… Thank you for your exhaustive answers