Albert Obviously never used a GPU - “Insanity: doing the same thing over and over again and expecting different results.” - Albert Einstein

Well, I must be insane, as I expect (based on this thing I call experience) and do get different results. When compiling and loading a CUDA kernel for the first time, I get all zeros for a volume reconstruction. If run again, I get an answer. I have read that in order to correctly time GPU kernels, the GPU needs to be “warmed up”, or that the kernel needs to be loaded, usually by calling it one time. Is there a CUDA call/method for ensuring the GPU has actually loaded the kernel and that it is ready to be called, without actually calling it one time?
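For reference, the warm-up idiom I mean looks roughly like this; the kernel name is hypothetical, and this is just a sketch of the usual pattern, not my actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the reconstruction kernel.
__global__ void recon(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // Warm-up call: forces context creation and kernel load, so the
    // timed run below measures only kernel execution.
    recon<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaThreadSynchronize();   // cudaDeviceSynchronize() in later toolkits

    // Timed run using CUDA events, which time on the GPU itself.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    recon<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```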

No, there isn’t. But this is only relevant for measuring the running time of a program. There is probably something wrong with your program if it returns incorrect results for the first run. In other words: your kernel apparently does not do the same thing over and over again ;)

If doing the same thing causes different results, then such differences can be pumped infinitely to get arbitrage (just like in option theory, or free energy)… There’s no free lunch. You pay for what you get… and you get what you pay for.

EDIT: sorry about double posting

I would agree there is something wrong… what it is I am unsure. After the first run the kernel does do the same thing over and over again. I tested it: I take the difference between the volumes of sequential runs, and only between the first and second is there a difference. Basically I am doing a medical imaging application where I use projection data and calibration information to perform a reconstruction and generate a volume. The data is transferred to the GPU before the kernel is called, using cudaMemcpy[3D], cudaMalloc, and friends. Could it be that the data has not reached the GPU before the reconstruction starts? I am not performing any fancy memory transfers, and I was under the impression that the calls I am making (minus the kernel launch) are all blocking calls as far as the GPU is concerned.

Basically my steps are as follows:

  1. Allocate memory on the GPU

  2. Transfer data to the GPU-allocated memory

  3. perform recon (call kernel)

  4. Call cudaThreadSynchronize to wait for the kernel to complete

  5. Copy the result from the GPU back to the CPU

  6. display volume
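The steps above, roughly as code (hypothetical kernel name and sizes, not my actual reconstruction); I have added return-code checks after each call, since the first error after the launch may be what reveals why the first run comes back all zeros:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the reconstruction kernel.
__global__ void recon(const float *proj, float *vol, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vol[i] = proj[i];
}

#define CHECK(call) do {                                        \
    cudaError_t e = (call);                                     \
    if (e != cudaSuccess)                                       \
        printf("%s: %s\n", #call, cudaGetErrorString(e));       \
} while (0)

int main() {
    const int n = 1 << 20;
    float *h_proj = new float[n];   // projection data (contents don't matter here)
    float *h_vol  = new float[n];
    float *d_proj, *d_vol;

    CHECK(cudaMalloc(&d_proj, n * sizeof(float)));        // 1. allocate
    CHECK(cudaMalloc(&d_vol,  n * sizeof(float)));
    CHECK(cudaMemcpy(d_proj, h_proj, n * sizeof(float),
                     cudaMemcpyHostToDevice));            // 2. upload
    recon<<<(n + 255) / 256, 256>>>(d_proj, d_vol, n);    // 3. reconstruct
    CHECK(cudaGetLastError());                            //    catch launch errors
    CHECK(cudaThreadSynchronize());                       // 4. wait for kernel
    CHECK(cudaMemcpy(h_vol, d_vol, n * sizeof(float),
                     cudaMemcpyDeviceToHost));            // 5. download
    // 6. display h_vol ...

    cudaFree(d_proj); cudaFree(d_vol);
    delete[] h_proj;  delete[] h_vol;
    return 0;
}
```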

Yep, already got plenty of those splattered throughout the kernel in an attempt to figure out what is going on.

Global memory read write dependency?

Step 4 is superfluous; synchronization happens implicitly when you copy data from device memory to host memory :)

What is to keep the CPU from copying the data back before the GPU kernel has had a chance to write it, or from deleting the memory before then? Reading the Programming Guide and Reference Manual, I am still unclear on the topic. However, I know that cudaThreadSynchronize would be needed for the async calls, which I am not using.
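My current understanding of the blocking behaviour, sketched out (hypothetical kernel, and assuming everything runs on the default stream):

```cuda
#include <cuda_runtime.h>

// Hypothetical one-thread kernel that writes a single value.
__global__ void write_one(float *out) { *out = 1.0f; }

int main() {
    float *d, h = 0.0f;
    cudaMalloc(&d, sizeof(float));

    write_one<<<1, 1>>>(d);   // asynchronous: returns to the CPU immediately

    // Operations issued to the default stream execute in order, so this
    // blocking copy cannot start until the kernel has finished writing,
    // and it does not return until the data is in host memory.
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);

    // h should be 1.0f here without any explicit cudaThreadSynchronize();
    // cudaFree is likewise ordered after all prior work touching d.
    cudaFree(d);
    return 0;
}
```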