I am getting some weird timing for a very simple kernel.
The following function is used to find the min of an array.
__global__ void k_gpuFCT_timestep_pass1(float *dt, int nbrPoints, float *d)
{
    // A single thread scans the whole array sequentially.
    float minv = dt[0];
    for (int i = 1; i < nbrPoints; i++) minv = min(minv, dt[i]);
    d[SOLVERDATA_TIMESTEP] = minv;
}
This is run with only one block and one thread. d and dt are two cudaMalloc'd arrays; d holds about 16 floats and dt holds 32,000 floats.
Timed like this, the routine takes about 6.1 ms.
If I comment out the d[SOLVERDATA_TIMESTEP] = minv line, the timing drops to 0.03 ms.
I cannot figure out why that one extra line is a problem. Any ideas?
In fact, if somebody has a better way to compute the min of an array without having to fall back to 1 block x 1 thread, I would appreciate the tip. I have looked at the scan SDK example and adapted it, but it is far too expensive for such a simple problem.
I have also tried replacing d[SOLVERDATA_TIMESTEP] with a __device__ variable, without much luck.
If I understand your code correctly, d[SOLVERDATA_TIMESTEP] actually has to be a device variable. It is the result of the min calculation, right?
In that case, if you comment out this line, you are not storing the result of the calculation, so the compiler will optimize everything away, leaving you with a do-nothing kernel, which of course runs fast.
The scan is actually the fastest way I have found so far to compute the min/max/sum etc. of an array.
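For what it's worth, here is a minimal tree-reduction sketch of the usual alternative (kernel and variable names are made up; it assumes blockDim.x is a power of two):

// Each block reduces blockDim.x elements down to one partial minimum.
__global__ void k_min_reduce(const float *in, float *blockMins, int n)
{
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load one element per thread, padding past the end with FLT_MAX.
    smem[tid] = (idx < n) ? in[idx] : 3.402823466e+38f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] = fminf(smem[tid], smem[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's partial minimum.
    if (tid == 0)
        blockMins[blockIdx.x] = smem[0];
}

Launched as k_min_reduce<<<numBlocks, 256, 256 * sizeof(float)>>>(dt, blockMins, nbrPoints); this leaves numBlocks partial minima, which you then reduce in a second, much smaller pass (or copy back and finish on the host).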
Yes, if I comment out the last line I do not store the result, which makes the kernel useless; it was just for timing purposes. Nevertheless, I do not understand such a difference in timing:
6.1 ms for the version with d = minv and 0.032 ms for the version without it.
I tried allocating the d array with cudaMalloc and also defining d as a __device__ variable straight in the .cu file. Is there any difference? Timing-wise, none whatsoever.
I will give the scan version another try.
I have an extra question. The solver I am coding requires some initial constant parameters that are passed to the kernels, such as the number of sub-iterations, various factors, and flags. What is the best way to store those (integer/float) parameters on the GPU? I do not need to modify them.
On another front, the kernels compute parameters (error values, timestep) that reside on the device and are sometimes extracted to the host, let's say the timestep every 10 cycles. Those parameters (10-16 floats at most) are rewritten continuously at each cycle. What would be the best way to store them, and to retrieve them?
Is cudaMemcpy(*, *, *, cudaMemcpyDeviceToHost) the only efficient way?
As I said above, if you remove the final use of minv, the compiler will figure out that it does not need to compute it at all and consequently will remove the entire loop. With the loop gone, minv is never updated, so the compiler will also remove the first line, leaving you with an empty kernel. The 0.032 ms is thus the time needed to call a kernel that returns immediately, so you are now basically measuring the overhead of a kernel launch.
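In other words, after dead-code elimination your kernel is effectively reduced to something like:

__global__ void k_gpuFCT_timestep_pass1(float *dt, int nbrPoints, float *d)
{
    // loop and minv removed as dead code; nothing left to execute
}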
As for cudaMalloc versus a __device__ variable: there is no kernel runtime difference between the two.
Use constant memory. It is cached on each multiprocessor, so once a constant has been fetched there is no further latency. I use a constant array of char type and calculate pointers into it for mixed float/int/struct etc. storage.
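A minimal sketch of the pattern (names made up, error checking omitted):

// File scope in the .cu file; the array lives in constant memory.
__constant__ float c_solverParams[16];

// Host side, once at setup time:
float h_params[16];
h_params[0] = 10.0f;  // e.g. number of sub-iterations
h_params[1] = 0.5f;   // e.g. some relaxation factor
cudaMemcpyToSymbol(c_solverParams, h_params, sizeof(h_params));

// Device side: any kernel can simply read the constants.
__global__ void k_step(float *out)
{
    int nSubIter = (int)c_solverParams[0];
    out[threadIdx.x] = c_solverParams[1] * nSubIter;
}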
Yes. It is also the only way to retrieve data written by a kernel back to the host. If the data to be retrieved isn't too big, use cuMemAllocHost to create a host array that sits in special (page-locked) system memory and thus allows higher transfer bandwidth.
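With the runtime API the equivalent of cuMemAllocHost is cudaMallocHost; a small sketch (d_results stands for whatever device array your kernels write to):

// Page-locked (pinned) host buffer: allocated once, reused for every readback.
float *h_results;
cudaMallocHost((void**)&h_results, 16 * sizeof(float));

// Every 10 cycles or so, pull the solver parameters back to the host.
cudaMemcpy(h_results, d_results, 16 * sizeof(float), cudaMemcpyDeviceToHost);

// Release with the matching free at shutdown.
cudaFreeHost(h_results);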
I understand what you meant by the optimization: doing nothing is obviously really quick. Is the kernel launch overhead always the same?
In the 0.9 doc (page 20) it is said that a constant variable cannot be assigned from the device, only from the host. Would you have a small example of creating it, setting it from the host, and reading it from the device? I would appreciate it.
My data is indeed very small, so cuMemAllocHost will be useful. Correct me if I am wrong: this command allocates an array on the device in a separate way from cuMemAlloc? Do you still need to use cudaMemcpy() to retrieve the data? I am currently using the util that came with the SDK; is it preferable to use the driver API directly (I did not find cudaMemAllocHost there)?
Another question related to the scan SDK example. If I understood the method correctly, it needs an extra array twice as big as the one being worked on, am I correct? If so, that will be a problem for me.
This has been discussed on the forum several times now. You might want to do a search.
No, cuMemAllocHost allocates memory on the host, hence the name. Yes, you retrieve data with cudaMemcpy; it will recognize that the target memory is special system memory.
Advice: read the manual a second time, check all the SDK examples.