I would expect the two approaches to be “roughly” comparable in terms of execution time.
Since you’ve not bothered to show any code or demonstrate what exactly you are timing, I put something together.
Depending on the size of the data I operate on, I find that in one case the cuda-kernel-copy-to-host method is slightly faster, and in the other case the kernel followed by a cudaMemcpy is slightly faster, but I don’t see anything to suggest a 5-20x perf difference between the two cases.
Starting with a kernel that operates on about 40MB of data and, in the kernel+cudaMemcpy case, takes about 1ms to execute, the code is like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

#define DSIZE (1048576*10)
#define nTPB 256
#define nBLK 64
#define TVAL 1

#define cudaCheckErrors(msg) \
    do { \
      cudaError_t __err = cudaGetLastError(); \
      if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
      } \
    } while (0)

#define USECPSEC 1000000ULL

// microsecond-resolution host-side timer
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// grid-stride loop: d_out[i] = d_in[i] + tval
// d_out may point at device memory or at mapped (pinned) host memory
template <typename T>
__global__ void datakernel(const T *d_in, T *d_out, const size_t len, const T tval){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < len){
    d_out[idx] = d_in[idx] + tval;
    idx += blockDim.x*gridDim.x;}
}

// sets a flag in mapped host memory, so the host can spin-wait for kernel completion
__global__ void flagkernel(int *data){
  *data = 1;
}

int main(){

  int *d_data1, *d_data2, *h_rpin, *h_flag;
  size_t dsize = DSIZE*sizeof(int);
  cudaMalloc(&d_data1, dsize);
  cudaMalloc(&d_data2, dsize);
  cudaCheckErrors("cudaMalloc fail");
  cudaHostAlloc(&h_rpin, dsize, cudaHostAllocMapped);
  cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  cudaMemset(d_data1, 0, dsize);
  cudaMemset(d_data2, 0, dsize);
  memset(h_rpin, 0, dsize);
  memset(h_flag, 0, sizeof(int));

  // case 1 (t1): kernel writes its results directly to mapped host memory,
  // then a tiny flag kernel signals completion and the host spins on the flag
  unsigned long long t1 = dtime_usec(0);
  datakernel<<<nBLK, nTPB>>>(d_data1, h_rpin, DSIZE, TVAL);
  flagkernel<<<1,1>>>(h_flag);
  while (!(*((volatile int *)h_flag)));
  t1 = dtime_usec(t1);

  // case 2 (t2): kernel writes its results to device memory, followed by a cudaMemcpy to host
  unsigned long long t2 = dtime_usec(0);
  datakernel<<<nBLK, nTPB>>>(d_data1, d_data2, DSIZE, TVAL);
  cudaMemcpy(h_rpin, d_data2, dsize, cudaMemcpyDeviceToHost);
  t2 = dtime_usec(t2);

  printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);
  return 0;
}
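As an aside, the code passes the cudaHostAllocMapped pointers (h_rpin, h_flag) directly to the kernels; that works here because UVA is in effect (64-bit Linux, Fermi-class GPU). If UVA could not be assumed, the usual pattern is to enable host mapping up front and ask the runtime for the device-side alias of each pinned pointer. A minimal sketch of how the allocation/launch lines above would change (this variant is for illustration only and is not what I timed):

cudaSetDeviceFlags(cudaDeviceMapHost);          // must precede any allocation that creates the context

int *h_rpin, *d_rpin;
cudaHostAlloc(&h_rpin, dsize, cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_rpin, h_rpin, 0);   // device-visible alias of the pinned host buffer
cudaCheckErrors("mapped alloc fail");

// launch with the device alias instead of the raw host pointer
datakernel<<<nBLK, nTPB>>>(d_data1, d_rpin, DSIZE, TVAL);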
Compiling on Fedora 20 with CUDA 7 for a Quadro 5000 (also a Fermi device), and running the code under the profiler:
$ nvcc -O3 -o t836 t836.cu
$ nvprof --print-gpu-trace ./t836
==616== NVPROF is profiling process 616, command: ./t836
t1 = 0.006693s, t2 = 0.007332s
==616== Profiling application: ./t836
==616== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
766.83ms 368.73us - - - - - 41.943MB 113.75GB/s Quadro 5000 (0) 1 7 [CUDA memset]
767.20ms 366.52us - - - - - 41.943MB 114.44GB/s Quadro 5000 (0) 1 7 [CUDA memset]
778.29ms 6.6061ms (64 1 1) (256 1 1) 10 0B 0B - - Quadro 5000 (0) 1 7 void datakernel<int>(int const *, int*, unsigned long, int) [190]
784.90ms 1.8440us (1 1 1) (1 1 1) 7 0B 0B - - Quadro 5000 (0) 1 7 flagkernel(int*) [193]
784.91ms 855.31us (64 1 1) (256 1 1) 10 0B 0B - - Quadro 5000 (0) 1 7 void datakernel<int>(int const *, int*, unsigned long, int) [199]
785.77ms 6.4553ms - - - - - 41.943MB 6.4975GB/s Quadro 5000 (0) 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
So using the timing built into the code, we see that the t1 case (kernel copying directly to host) is somewhat faster than the t2 case (kernel followed by cudaMemcpy). Looking at the breakdown in the profiler, we see that in the kernel-only case, the kernel runs for 6.6ms. In the kernel + cudaMemcpy case, the kernel itself runs for 0.855ms and the subsequent cudaMemcpy operation takes 6.45ms, achieving about 6.5GB/s bandwidth (which is correct for this device - a Gen2 x16 PCIe link).

The kernels are not doing precisely the same thing: the kernel-only case does one read from global memory followed by a write to host memory. In the kernel + cudaMemcpy case, the kernel does a read and a write to global memory, and the cudaMemcpy then does a read from global memory and a write to host memory. The cost of one extra read or write of this data in global memory is approximated by the duration of the [CUDA memset] operations, which is about 370us each. Factoring that in, the extra global-memory write (in the kernel) plus the extra global-memory read (in the cudaMemcpy) account for roughly 2 x 0.37ms = 0.74ms, which is almost exactly the gap between the ~7.3ms total for the t2 case and the 6.6ms kernel in the t1 case. The two cases are therefore nearly identical.
If I reduce the data set size by a factor of 10 (to 4MB), then the kernel+cudaMemcpy case starts to win, but not by anything like a factor of 5x to 20x. Here’s that data for comparison:
t1 = 0.000956s, t2 = 0.000757s
==683== Profiling application: ./t836
==683== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
750.45ms 38.783us - - - - - 4.1943MB 108.15GB/s Quadro 5000 (0) 1 7 [CUDA memset]
750.49ms 37.024us - - - - - 4.1943MB 113.29GB/s Quadro 5000 (0) 1 7 [CUDA memset]
750.84ms 913.75us (64 1 1) (256 1 1) 10 0B 0B - - Quadro 5000 (0) 1 7 void datakernel<int>(int const *, int*, unsigned long, int) [190]
751.75ms 1.8440us (1 1 1) (1 1 1) 7 0B 0B - - Quadro 5000 (0) 1 7 flagkernel(int*) [193]
751.77ms 86.413us (64 1 1) (256 1 1) 10 0B 0B - - Quadro 5000 (0) 1 7 void datakernel<int>(int const *, int*, unsigned long, int) [199]
751.86ms 648.86us - - - - - 4.1943MB 6.4641GB/s Quadro 5000 (0) 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
In this case, the in-code measurement for the t1 case is about 1.3x the t2 measurement, and the t1 kernel takes almost 1.5x as long as the cudaMemcpy in the t2 case. That is a little surprising to me, but it is nothing like the 5-20x you mentioned.
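As a side note, if you want to cross-check the gettimeofday-based host timing against the GPU timeline, you could bracket each case with CUDA events. A minimal sketch, reusing the pointers and defines from the code above (this is just the standard cudaEvent API; it is not part of the runs shown here):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time the kernel + cudaMemcpy case (t2) on the GPU timeline
cudaEventRecord(start);
datakernel<<<nBLK, nTPB>>>(d_data1, d_data2, DSIZE, TVAL);
cudaMemcpy(h_rpin, d_data2, dsize, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the stop event has completed
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel+memcpy: %f ms\n", ms);

Bracketing the datakernel/flagkernel pair in the t1 case the same way should line up with the durations nvprof reports, to within launch overhead.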