In this Discussion
- fwende February 10
- paulvisschers February 10
Tesla multi copy not as fast as expected
When I run the simpleMultiCopy in the SDK (4.0) on the Tesla C2050 I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
This shows that the expected runtime with full overlap is 2.7 ms, while the measured runtime with 4 streams is 4.3 ms. What exactly causes this discrepancy?
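As a cross-check, the three "theoretical limits" in the output can be reproduced from the measured timings. This is a minimal sketch, not the sample's code: the function name is made up for illustration, and the middle formula (both transfers serialized, kernel hidden behind them) is an assumption that happens to match the printed value.

```python
# Timings copied from the "Measured timings (throughput)" section above, in ms.
h2d_ms = 2.725792     # memcpy host to device
d2h_ms = 2.723360     # memcpy device to host
kernel_ms = 0.611264  # kernel


def theoretical_limits(h2d, d2h, kernel):
    """Return (no overlap, one transfer overlapped, both overlapped) in ms.

    Hypothetical helper: a simple model, not taken from the SDK source.
    """
    # Nothing overlaps: upload, compute and download run back to back.
    no_overlap = h2d + kernel + d2h
    # Assumed model: the two copies serialize and the kernel hides behind them.
    one_transfer = max(h2d + d2h, kernel)
    # Upload, download and kernel all run concurrently; the slowest one dominates.
    both_transfers = max(h2d, d2h, kernel)
    return no_overlap, one_transfer, both_transfers


limits = theoretical_limits(h2d_ms, d2h_ms, kernel_ms)
labels = ("no overlap", "one transfer overlapped", "both transfers overlapped")
for label, value in zip(labels, limits):
    print(f"{label}: {value:.6f} ms")
```

The 2.7 ms figure is therefore just the slowest of the three individually measured operations, with no allowance for contention or synchronization.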
hi,
i guess the problem here is that (see the source code) the kernel overlaps with just one copy operation, the one that brings new data to the device. the second copy, which moves the computed data from the device to the host, runs on the same stream as the kernel, so the kernel and the async memcpy execute serially. this is necessary because the data is valid only once the kernel has finished. so you cannot compare the 2.7 ms with the 4.3 ms: the former assumes data transfers in both directions (as supported by your device) running concurrently with kernel execution. but why is it not (2.73 + 0.61) ms = 3.34 ms? the kernel and the memcpy operations run inside a loop that forces a stream synchronisation in each iteration, and this is expensive. in the end we get 4.3 ms.
the output of the sample is somewhat misleading, as it suggests that 4 streams are used in order to get high performance.
the 2.7 ms is computed as max( memcpy-host-to-device, memcpy-device-to-host, kernel-execution ). if you incorporated some kind of double buffering, it might be possible to achieve the 2.7 ms.
@fwende: Thanks for your response. I believe you misunderstood the source code of the program though.
In each iteration data is uploaded using the next stream; in the following iteration this next stream becomes the current stream. This means that the kernel and download issued in that iteration go into the same stream as the upload that provided the data for the kernel. You want these operations in a single stream, as they must wait for each other.
Each iteration switches the stream in use. This means that another upload-kernel-download frame is scheduled on a different stream, so all three operations of this iteration can overlap with those of the previous iteration. So the 4 streams are actually useful. You can check this by reducing the number of streams: the run time increases.
Finally, an event is recorded once each frame is done. Note that the synchronization on this event is not done in the very next iteration, but several iterations later, when we've round-robined back to this stream. The synch makes sure that the CPU paces itself and doesn't spam-schedule all the operations on the GPU at once, which would actually decrease performance (disable the synch to see).
The wonky order in which operations are scheduled is due to the fact that, if you only have one copy engine, scheduling in a more straightforward manner leads to unwanted inter-stream synchronization.
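The round-robin scheme described in this post can be traced with a small host-side model. This is only a sketch of the description above, not the sample's source: `issue_order` is a hypothetical helper, and the order of operations inside one iteration (kernel, next upload, download) is an assumption.

```python
def issue_order(num_streams, num_frames):
    """Yield (frame, operation, stream) in the order the host issues the work.

    Hypothetical model of the round-robin scheme described in the post.
    """
    current = 0
    # The first frame has to be uploaded before the loop starts.
    yield (0, "upload", current)
    for frame in range(num_frames):
        nxt = (current + 1) % num_streams
        # Process the frame whose data was uploaded on the current stream.
        yield (frame, "kernel", current)
        # Upload the following frame on the next stream, if there is one.
        if frame + 1 < num_frames:
            yield (frame + 1, "upload", nxt)
        # Download the result on the same stream as its kernel.
        yield (frame, "download", current)
        # The next stream becomes the current stream for the next iteration.
        current = nxt


for frame, op, stream in issue_order(num_streams=4, num_frames=6):
    print(f"frame {frame}: {op:8s} -> stream {stream}")
```

In this model every frame keeps its upload, kernel and download on one stream, while consecutive frames land on different streams and are therefore free to overlap.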
hi again,
first of all, you're right: almost all kernel calls overlap with at least two memcopies (i overlooked the last line in the loop, where streams are switched round robin). that is, the problem is not the concurrent kernel-memcpy scheme. but the point is that between lines 157 and 182 in the code, the runtimes for kernel execution and memcopies are measured with no concurrency. the result is what you see in the program output under 'measured timings (throughput)'.
if kernels overlap with memcopies within the loop, the situation is somewhat different, because now memcpy-host-to-device and memcpy-device-to-host share the same pcie bus, which allows at most 8 GB/s in both directions (pcie 2). i measured on two different up-to-date systems (with tesla m2090 cards installed) that even when one kernel, one memcpy-host-to-device and one memcpy-device-to-host run on 3 different streams, the pcie bus does not deliver 6 GB/s in both directions, but something significantly below that. if i run the kernel concurrently with just one memcpy, i do get the 6 GB/s. so the values given by the sdk sample assume that your system is capable of writing data to the device at 6 GB/s while at the same time reading data from the device at 6 GB/s, which does not seem to be the case.
so what is given are theoretical values. what is actually measured differs due to limited pcie bandwidth (or limited memory bandwidth of your host system). if you draw the gpu queue/pipeline filled with kernels and memcopies, you will also see that there are 4 memcopies trying to move data over the same pcie bus at the same time. it should be clear that none of them gets 6 GB/s up- or downstream. and since stream synchronizations become necessary after the 3rd iteration, overall timings slow down.
to summarize: the time for concurrent memcpy-host-to-device and memcpy-device-to-host is not max( t1, t2 ), where t1 and t2 are the respective times given in the 'measured timings (throughput)' section (here the 2.72.. ms). but this is what the sdk sample assumes as the theoretical value.
i hope that my post no longer suffers from 'misunderstandings' :)
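To put numbers on this summary, the snippet below compares the max( t1, t2 ) model with the measured overlapped time from the original post. It is a rough sketch: the 4-byte element size is an assumption (it reproduces the throughput printed by the sample), and the derived per-direction rate also absorbs kernel time and synchronization overhead, so it is only a lower bound on what the bus delivered.

```python
# Values copied from the program output in the original post.
array_size = 4194304          # elements per transfer
bytes_per_element = 4         # assumption: 4-byte elements
h2d_ms, d2h_ms, kernel_ms = 2.725792, 2.723360, 0.611264
overlapped_ms = 4.308822      # avg. time when overlapped using 4 streams

array_bytes = array_size * bytes_per_element

# What the sample assumes for fully overlapped execution.
ideal_ms = max(h2d_ms, d2h_ms, kernel_ms)

# How far the measurement is from that assumption.
slowdown = overlapped_ms / ideal_ms

# Throughput of one isolated upload vs. the effective rate per direction
# when uploads, downloads and kernels are all in flight (GB = 1e9 bytes).
isolated_gbs = array_bytes / (h2d_ms * 1e-3) / 1e9
per_direction_gbs = array_bytes / (overlapped_ms * 1e-3) / 1e9

print(f"ideal overlapped time : {ideal_ms:.3f} ms")
print(f"measured / ideal      : {slowdown:.2f}x")
print(f"isolated upload       : {isolated_gbs:.2f} GB/s")
print(f"per direction, overlap: {per_direction_gbs:.2f} GB/s")
```

Under these assumptions each direction effectively moves data at well under the isolated rate once both directions are active, which is consistent with the measurements described above.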
@fwende: I understand that the theoretical value is incorrect because either something is being done serially, some overhead is introduced somewhere, or there is a bandwidth issue somewhere in the hardware. But I don't know exactly which it is, and I would like to know so that I can adapt my model that predicts the run time of programs. (I have done it with a correction factor derived from my own measurements, but a better understanding of what is actually happening would make the model more robust and more insightful.)
You're not the first to mention the bandwidth of the PCIe bus, but it can transfer 8 GB/s in each direction (so 16 GB/s in total when using bidirectional transfers). So it shouldn't be the cause of the slowdown.
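For reference, the 8 GB/s figure follows from the PCIe 2.0 link parameters. The sketch below assumes a 16-lane link (not stated in the thread) and ignores packet and protocol overhead, so it gives the raw per-direction ceiling rather than an achievable copy rate.

```python
# PCIe 2.0 link parameters.
GT_PER_S_PER_LANE = 5.0        # signalling rate per lane, gigatransfers/s
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line code: 8 data bits per 10 line bits
LANES = 16                     # assumption: the card sits in an x16 slot

# Raw data rate per direction in GB/s (8 bits per byte).
peak_gbs = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8 * LANES

# Isolated host-to-device throughput from the program output above.
measured_gbs = 6.154988
fraction = measured_gbs / peak_gbs

print(f"raw peak per direction: {peak_gbs:.1f} GB/s")
print(f"isolated memcpy reaches {fraction:.0%} of that")
```

Even a single transfer with nothing else running stays below this ceiling, so the ceiling alone is not a reliable predictor of what two concurrent transfers will achieve.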
@paulvisschers: 8 GB/s is the theoretical limit, but most likely you will never reach it. there are a couple of limiting factors that make your program run slower than theoretically expected. as you can see from the program output, your system is not even capable of copying data from device to host (or vice versa) at the theoretical 8 GB/s, with no kernel and no other memcpy interfering. maybe it's your gpu that limits the bandwidth from or to the device, or it is your host's memory that limits it. so why should your system give you 8 GB/s (or 6 GB/s) in both directions when performing two simultaneous memcopies?
if i run the simpleMultiCopy program on a tesla m2090 cluster node, the program output is almost the same as yours, and on gtx590/580 workstations the bandwidths are also below 7 GB/s.
if you have a multi-socket system (maybe a nehalem dual xeon system), it is possible that the calling host thread runs on a cpu that is not directly connected to the gpu your calculations run on. you can try to pin your host program to one of the sockets and then run your kernels on a gpu that is connected to that socket. for instance, you can use libnuma in your code, or you can run your program with 'numactl -N xxx -l ./program.x' (where xxx=0,1,2,... is the numa node) or something like that.
runtime predictions: according to the program output, your system allows for up to 7.8 GB/s in the asynchronous case, and up to 5.5 GB/s for serial execution (for the scenario considered). so you have upper and lower bounds that can be used for runtime 'predictions'. if you are going to use your gpu in multi-threaded applications with high cpu load, it becomes hard to give a prediction, as multiple threads (if you run more threads than your system has cores) may interfere with each other and barriers are not reached at the expected times.
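The bounds mentioned above can be turned into a simple runtime estimate. This is a minimal sketch for the scenario in this thread only: the function name is hypothetical, the two throughput values are the ones printed by the sample, and they are assumed to count the bytes moved in both directions (which matches the printed timings).

```python
# Throughputs from the "Measured throughput" section of the program output.
OVERLAPPED_GBS = 7.787379   # overlapped using 4 streams
SERIALIZED_GBS = 5.488530   # fully serialized execution


def predicted_runtime_ms(bytes_each_way, throughput_gbs):
    """Estimate the time in ms to upload and download bytes_each_way bytes.

    Hypothetical helper; assumes the throughput counts both directions.
    """
    total_bytes = 2 * bytes_each_way
    return total_bytes / (throughput_gbs * 1e9) * 1e3


frame_bytes = 4194304 * 4   # one frame of the sample, assuming 4-byte elements
best_ms = predicted_runtime_ms(frame_bytes, OVERLAPPED_GBS)
worst_ms = predicted_runtime_ms(frame_bytes, SERIALIZED_GBS)
print(f"expected runtime per frame: {best_ms:.3f} ms to {worst_ms:.3f} ms")
```

For other transfer sizes or kernels these bounds would have to be measured again, since both throughputs depend on the ratio of copy time to compute time.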