I have a 3D array of floats with dimensions 256×256×256. If I call the dimensions of this 3D cube [X, Y, Z], then I would like to compute each cell value of the cube by the following substitution:
Z*X*(-1 + cos((1/20)*sqrt(40*Z^2 + 10*X^2 + 10*Y^2)))
In the above equation, X, Y and Z are the index values of the corresponding cell in the cube.
I guess that using CUDA for this calculation will be 10 times faster than the CPU. Do I need to use the CUBLAS library, or do I need to write raw device functions…
Also, is there any link, sample, or source in the CUDA SDK for just this kind of substitution operation?
I am a little confused by your terminology. Are you simply trying to populate the matrix based on your function? In this case the matrix will contain the same values each time it is populated, right?
Assuming you are just wanting to populate the matrix…
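Assuming that is the goal, a minimal kernel might look like the following sketch. The launch shape (one 256-thread block per (y, z) row) and the flattened row-major layout are my assumptions, not anything from the SDK:

```
#define N 256

// f(x,y,z) = z*x*(-1 + cos(sqrt(40z^2 + 10x^2 + 10y^2)/20)),
// with x, y, z taken from the thread and block indices.
__global__ void fillCube(float *cube)
{
    int x = threadIdx.x;   // 256 threads per block cover one row in X
    int y = blockIdx.x;
    int z = blockIdx.y;
    float arg = sqrtf(40.0f * z * z + 10.0f * x * x + 10.0f * y * y) / 20.0f;
    cube[(z * N + y) * N + x] = (float)(z * x) * (-1.0f + cosf(arg));
}

// Host side:
//   float *d_cube;
//   cudaMalloc((void **)&d_cube, N * N * N * sizeof(float));
//   fillCube<<<dim3(N, N), N>>>(d_cube);
```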
On a machine with a Q6600 overclocked to 3.2 GHz and an 8800 GTX, compiled with gcc 4.1.2 20061115 and -O2, I see 0.88 s on the CPU (1.19 s with -ffloat-store) and 0.15 s on the GPU (much of it spent in the memcpy). Of course the CPU version could easily be improved with OpenMP, probably to the point of matching the GPU version. GPU occupancy is 100%.
But if you don’t need the result in CPU memory afterwards, you can leave the memcpy out of the timings and replace it with a cudaThreadSynchronize(). In that case the speedup will probably be a lot higher.
Thank you very much for your reply. In the future I don’t need to copy it back to CPU memory, so you are right. But I don’t understand where to use the cudaThreadSynchronize() function. On which line do I need to add this call, and for what purpose? Aren’t these threads already synchronized?
Based on Appendix B of the programming guide, it looks like __cosf() is only a good idea if you know your argument is in the range [-pi, pi]. cosf() is accurate over the interval [-48039, +48039], presumably because it does some kind of argument reduction first before calling __cosf().
I’m not sure cudaThreadSynchronize() is needed between kernel calls, but it is needed if you want to measure the timing of a kernel. Simply insert the call immediately after the kernel launch. Without it the kernel will appear to have completed much faster than it actually did.
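For example, something like the following sketch (the kernel name and the timer calls are hypothetical placeholders for whatever timing mechanism you use):

```
// Kernel launches are asynchronous: the <<<>>> call returns immediately.
timer_start();                          // hypothetical timer
fillCube<<<dim3(256, 256), 256>>>(d_cube);
cudaThreadSynchronize();                // block until the kernel really finishes
timer_stop();                           // without the sync you time only the launch
```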
Set the environment variable CUDA_PROFILE to 1, run your application, and then examine the newly generated file. I forget the name… cuda_profile.log, I think.
Compile your program with the -keep compiler option, examine the resulting .cubin file, and plug the numbers into the NVIDIA-provided Excel occupancy calculator spreadsheet.
You might want to experiment with how the x, y and z indices are calculated, as different memory access patterns can have quite an impact.
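For instance, with a row-major cube, which dimension threadIdx.x maps to decides whether the writes are coalesced. A sketch, assuming the flattened layout with X varying fastest:

```
// Coalesced: threadIdx.x walks X, the fastest-varying dimension,
// so the 256 threads of a block write 256 consecutive floats.
cube[(z * N + y) * N + x] = value;

// Uncoalesced: consecutive threads now stride by N*N floats,
// which on 1.x-era hardware costs a separate memory transaction each.
cube[(x * N + y) * N + z] = value;
```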