Hello!
I’m trying to add streaming to my app. I have to launch the same kernel many times (a few hundred to a few thousand launches) on different inputs. Before each launch I have to send two chunks of data (the first is between a few kB and 3 MB, the second is 195 kB). The kernel doesn’t do much computation and is very small: basically it performs 5 memory reads, 1 multiply, 3 adds, and a final memory write.
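To give an idea of the arithmetic intensity, the kernel is shaped roughly like this (a simplified sketch; the buffer names and indexing are illustrative, not my real code):

```cuda
// Sketch of the kernel's workload: 5 reads, 1 mul, 3 adds, 1 write.
// Names and indexing are illustrative only.
__global__ void tinyKernel(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 5 memory reads
        float x0 = a[i];
        float x1 = a[i + n];
        float x2 = b[i];
        float x3 = b[i + n];
        float x4 = b[i + 2 * n];
        // 1 mul, 3 adds, then the final memory write
        out[i] = x0 * x1 + x2 + x3 + x4;
    }
}
```

So the kernel is completely memory-bound and each launch is very short.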
I thought my problem would be a perfect fit for streaming; however, I get worse performance with streaming than without it.
Of course the host memory is pinned with cudaMallocHost, and the device is able to overlap copies with kernel execution: it’s a GTX 295. Suspecting that cudaMemcpyAsync has too much overhead for small chunks of data, I simply modified the “n” variable in the simpleStreams SDK example to check. As a result, streaming was always slower below 768000 ints (3 MB of data).
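For context, my per-launch pattern looks roughly like this (a simplified sketch; NSTREAMS, the buffer names, and the grid/block sizes are illustrative, not my exact code):

```cuda
// Rough shape of my streaming loop. h_in1/h_in2 are pinned with
// cudaMallocHost; per-stream device buffers avoid races between streams.
#define NSTREAMS 4

cudaStream_t streams[NSTREAMS];
float *d_in1[NSTREAMS], *d_in2[NSTREAMS], *d_out[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc((void **)&d_in1[s], maxChunk1Bytes);   // up to 3 MB
    cudaMalloc((void **)&d_in2[s], 195 * 1024);       // the fixed 195 kB chunk
    cudaMalloc((void **)&d_out[s], outBytes);
}

for (int i = 0; i < nLaunches; ++i) {
    int s = i % NSTREAMS;
    // Two async copies, then the tiny kernel, all queued in the same stream
    cudaMemcpyAsync(d_in1[s], h_in1[i], chunk1Bytes[i],
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(d_in2[s], h_in2[i], 195 * 1024,
                    cudaMemcpyHostToDevice, streams[s]);
    tinyKernel<<<grid, block, 0, streams[s]>>>(d_in1[s], d_in2[s], d_out[s], n);
}
cudaThreadSynchronize();   // CUDA 2.x-era sync call
```

Work within one stream is serialized, so reusing a stream’s buffers across iterations is safe; the hope was that copies in one stream would overlap kernels in another.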
Here is an extract of the output :
-------------------------------
> array_size = 716800
memcopy: 0.55
kernel: 0.88
non-streamed: 1.37 (1.44 expected)
4 streams: 1.42 (1.02 expected with compute capability 1.1 or later)
-------------------------------
> array_size = 768000
memcopy: 0.59
kernel: 0.95
non-streamed: 1.46 (1.55 expected)
4 streams: 1.45 (1.10 expected with compute capability 1.1 or later)
-------------------------------
> array_size = 819200
memcopy: 0.63
kernel: 1.01
non-streamed: 1.55 (1.63 expected)
4 streams: 1.51 (1.17 expected with compute capability 1.1 or later)
-------------------------------
Well, can anyone confirm my experiments: does cudaMemcpyAsync have too much overhead for streams to be worthwhile with small (<3-4 MB) chunks of data?
Could I expect better results with zero-copy?
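If it helps to be concrete, here is what I understand the zero-copy setup would look like, based on the programming guide (a hedged sketch; the names are mine, and the device must report canMapHostMemory):

```cuda
// Zero-copy sketch: map pinned host memory into the device address space,
// so the kernel reads it directly over PCIe with no cudaMemcpyAsync at all.
cudaSetDeviceFlags(cudaDeviceMapHost);   // must run before the context is created

float *h_in1, *h_in2, *d_in1, *d_in2;
cudaHostAlloc((void **)&h_in1, maxChunk1Bytes, cudaHostAllocMapped);
cudaHostAlloc((void **)&h_in2, 195 * 1024, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_in1, h_in1, 0);
cudaHostGetDevicePointer((void **)&d_in2, h_in2, 0);

// Fill h_in1/h_in2 on the CPU, then launch directly on the mapped pointers
tinyKernel<<<grid, block>>>(d_in1, d_in2, d_out, n);
cudaThreadSynchronize();
```

That would remove the per-launch copy overhead entirely, at the cost of every kernel read going over PCIe.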