I’m sorry if the same topic has been created before; in that case I will be glad to receive a link. I would also be grateful if somebody could point me toward a solution to the following problem:
I have to develop an iterative algorithm based on CUDA capabilities. The problem looks like this:
1. CUDA kernel call…
2. Receive data from the device (cudaMemcpy)
3. Write the data to a file
4. Return to step 1
Step 2 is very expensive; it takes much longer than step 1. So I need suggestions on how to optimize this scheme.
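In code, the loop looks roughly like this (myKernel, d_data, h_data, N and outFile are just placeholder names):
[code]
// placeholder sketch of the iteration described above
for (int iter = 0; iter < numIterations; ++iter) {
    myKernel<<<grid, block>>>(d_data, N);                      // step 1
    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);                        // step 2
    fwrite(h_data, sizeof(float), N, outFile);                 // step 3
}
[/code]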
Obviously the device-to-host copy is the first thing to optimize then. How fast is that transfer? Do you get anywhere near the results from the bandwidth test in the SDK?
How much data is transferred in each iteration? If it is not much, you might want to save it in a buffer and then transfer results from several iterations at once.
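Something along these lines, as a sketch (d_results sized to hold the output of K iterations; all names are placeholders):
[code]
// accumulate K iterations' results in one device buffer,
// then do a single large transfer instead of K small ones
for (int i = 0; i < K; ++i)
    myKernel<<<grid, block>>>(d_results + i * resultSize);

cudaMemcpy(h_results, d_results, K * resultSize * sizeof(float),
           cudaMemcpyDeviceToHost);
fwrite(h_results, sizeof(float), K * resultSize, outFile);
[/code]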
If optimizing the transfer somehow gets the transfer time closer to the execution time of the kernel, overlapping them becomes worthwhile too.
I don’t have much advice other than normal methods for optimizing data transfers (use pinned memory and perhaps zero copy).
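A rough sketch of what I mean, assuming a double-buffered setup (myKernel, N and NUM_ITER are placeholders, and copy/kernel overlap additionally needs a device that reports deviceOverlap):
[code]
// pinned host memory plus asynchronous copies in two streams,
// so the copy of one iteration can overlap the kernel of the next
float *h_buf[2], *d_buf[2];
cudaStream_t stream[2];
for (int s = 0; s < 2; ++s) {
    cudaMallocHost((void **)&h_buf[s], N * sizeof(float));  // pinned memory
    cudaMalloc((void **)&d_buf[s], N * sizeof(float));
    cudaStreamCreate(&stream[s]);
}

for (int i = 0; i < NUM_ITER; ++i) {
    int s = i & 1;                       // alternate between the two buffers
    cudaStreamSynchronize(stream[s]);    // previous use of this buffer done
    myKernel<<<grid, block, 0, stream[s]>>>(d_buf[s], N);
    cudaMemcpyAsync(h_buf[s], d_buf[s], N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
    // while stream s is busy, the host could fwrite() the other buffer here
}
cudaThreadSynchronize();
[/code]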
FYI, DirectCompute includes a “DispatchIndirect” function where the number of threads to launch can be specified indirectly from the contents of a buffer in GPU memory, which avoids reading back data to the CPU. This kind of functionality will probably come to CUDA at some point.
I’ve implemented a SelectNth/Median algorithm which involves an arbitrary number of kernel launches. See here. My solution was relatively simple and effective:
Implement a cutoff: once the data is reduced to a small size (where kernel launch time dominates), I just switch to sorting the remaining data instead of continuing to iterate.
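In outline (CUTOFF, reduceStep, nextSize and finishOnCpu are placeholders for illustration):
[code]
// iterate on the GPU while the problem is large,
// then finish the small remainder on the CPU
int n = initialSize;
while (n > CUTOFF) {
    reduceStep<<<grid, block>>>(d_data, n);   // one GPU iteration
    n = nextSize(n);                          // problem shrinks each pass
}
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
finishOnCpu(h_data, n);   // e.g. just sort the small remainder on the CPU
[/code]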
Other solutions:
Wait for an on-die CPU, which will greatly reduce launch latency.
Speculative execution: launch the kernel multiple times speculatively, but check at the beginning of each launch whether the termination condition has already been met, and do nothing if so.
Thanks to all… I’m going to try the suggestions you gave one by one. I was also thinking of copying the data to a buffer and then copying it to the host. It seems to be a good idea.
I was following the suggestions tera gave me. I use a GeForce 9800 GT, and the bandwidth test indicates that the host-to-device (and device-to-host) transfer speed is approximately 1.5 GB per second. I suppose that must be enough to transfer 12 MB of data, mustn’t it?
I also ran a simple test (though I don’t have the numbers right now), and it shows that as the data size approaches 12 MB, my application slows down dramatically.
I concluded that I need to split the data into a series of small parts (this seems to be the solution Uncle Joe suggested) and then process them separately.
Sorry, I misunderstood your question, Dmitrii. My comments don’t really apply to your situation.
If the copy step takes much longer than the calculation step, then your GPU utilization will be low, even if you overlap transfers and computation perfectly.
The question I would ask is whether all that data really needs to be written to the file. What does your application do, and can you reduce the data being written?
If not, then I don’t think you want to do that part of the application on the GPU.
If transfer time dominates your iteration and you need to transfer 12 MB at 1.5 GB/s, each copy takes about 8 ms, so you should still be able to run about 100 iterations per second.
If you actually get to that order of magnitude, writing the data out to disk will be the bottleneck, not the PCIe transfer.
Are you aware that kernel launches are asynchronous? I.e. if you launch a kernel and then do a device->host memcpy (your steps 1. and 2.), the kernel launch will return immediately and the memcpy will then wait for the kernel to finish before doing the actual transfer.
Maybe I have a logical error?
It should work faster!
Maybe I’m looking for the solution in the wrong place?
The solution should be similar to a solver for a system of linear equations…
As I wrote earlier, you need a cudaThreadSynchronize(); after the kernel launch in order to measure the time to execute the kernel. Without this, you just measure the time to launch the kernel, and the cudaMemcpy will appear to take up all the time while in fact it is just waiting for the kernel to finish.
To make your kernel run faster, use a blockSize.x that is a multiple of 16. This will allow the 16 memory accesses from a half-warp to be coalesced into a single memory transaction. You will however never be able to reach the speed of the CPU version as this simple kernel is entirely bound by the time it takes to transfer the data through PCIe to and from the GPU. You need a task with more computational intensity (more computation per byte transferred) in order to beat the CPU.
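For illustration, a minimal kernel and launch with a coalescing-friendly configuration (add, d_a, d_b and n are placeholder names):
[code]
// thread i touches element i, so the 16 accesses of a half-warp
// fall into a single memory transaction
__global__ void add(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += b[i];
}

// launch with 256 threads per block (a multiple of 16)
add<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
[/code]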
Times for both CPU and GPU execution seem rather large (it shouldn’t take the computer long to add just 64 KB of data), which is probably due to the fact that the time() function has quite low precision (somewhere between 10 and 55 milliseconds I think, depending on your hardware). Use a more precise timer to get accurate timings: gettimeofday() would be the way to go under Linux and other Unix-like systems, and QueryPerformanceCounter() is the function of choice under Windows, though I am not familiar with the latter.
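For example, under Linux (a sketch; myKernel and the launch configuration are placeholders):
[code]
#include <stdio.h>
#include <sys/time.h>

// wall-clock time in seconds with microsecond resolution
static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

// ... around the launch:
double t0 = seconds();
myKernel<<<grid, block>>>(d_data, n);
cudaThreadSynchronize();   // without this you only time the launch itself
printf("kernel time: %.3f ms\n", 1e3 * (seconds() - t0));
[/code]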
Finally, next time you post code in the forum, please enclose it between [code] and [/code] tags.
Merge all your kernels into one big kernel with manual block synchronization.
The biggest drawback of this method is that the register count will increase significantly, for example, if you are merging kernels which use 30 regs each, then the merged kernel may use from 30 to 60 regs.
As mentioned: do speculative kernel launches on the CPU, and spin checking some status variable which you update every time in the kernel. The biggest drawback here is that this method actually introduces a CPU-GPU synchronization, and if you want to submit further work to the GPU after that, there will be a gap during which the GPU does nothing (this can be up to a couple of milliseconds), because the CPU has not yet submitted anything.
This works only if you know the maximum iteration count: you just submit that many iterations up front, and inside the kernel, before doing any work, you check whether this iteration still needs to run or can be skipped because everything was already done in previous iterations.
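A minimal sketch of that idea (all names illustrative):
[code]
// a flag in device memory records completion; later (speculative)
// launches return immediately once it is set
__device__ int d_done = 0;

__global__ void iterationKernel(float *data, int n)
{
    if (d_done)          // the work already finished in an earlier launch
        return;
    // ... one iteration's work on data; when the termination condition
    // is detected, some thread sets d_done = 1 ...
}

// host side: submit the worst case up front, no readbacks in between
for (int i = 0; i < MAX_ITER; ++i)
    iterationKernel<<<grid, block>>>(d_data, n);
[/code]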
As for the first method, I could provide you with some code that manages the virtual block configuration/scheduling inside the “merged” kernel, including the inter-block synchronization, since I’ve been playing with this recently.
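To give an idea of what the inter-block synchronization looks like, here is a very rough one-shot sketch (not the code I mentioned, just a toy):
[code]
// one-shot inter-block barrier: only safe if every block of the grid is
// resident on the GPU at the same time (otherwise it deadlocks), and this
// version never resets the counter, so a reusable barrier needs more care
__device__ volatile int g_arrived = 0;

__device__ void globalBarrier(int numBlocks)
{
    __syncthreads();                      // all threads of the block arrive
    if (threadIdx.x == 0) {
        __threadfence();                  // publish this block's writes
        atomicAdd((int *)&g_arrived, 1);
        while (g_arrived < numBlocks)     // spin until every block checks in
            ;
    }
    __syncthreads();                      // release the whole block
}
[/code]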