So guys, I am using cudaMemcpy in a loop just to test what is going on inside the kernel, and I am getting very strange results.

The first time I call cudaMemcpy the result is correct, but the second and all later iterations seem to give corrupted values. The call is:

gpuAssert( cudaMemcpy(temp, device_largestElements, sizeof(float), cudaMemcpyDeviceToHost) );

where temp is a host array of float containing just one element,

and device_largestElements is a device array that contains 22332 floats.

I only wish to get the first element of the array. Is it possible that cudaMemcpy is misbehaving because of the difference in array sizes?

(By the way, if I move the cudaMemcpy out of the loop and just call it once after the loop, then the result is correct.)

Almost certainly not. In most of my linear algebra codes, I run my own GPU memory manager which looks after a gigantic chunk of pre-allocated device memory for the life of the application. My codes wind up doing the equivalent of

cudaMemcpy(host_chunk, device_chunk+random_offset, (size_t)random_size * sizeof(random_type), cudaMemcpyDeviceToHost)

all day long (literally thousands of times over hours/days) and never miss a beat, so anecdotally I would suggest you look somewhere else for your problem.
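
To make that concrete, here is a minimal sketch of the same pattern: a kernel launched in a loop, with a single float copied back from an arbitrary offset of a much larger device array after every launch. None of it is from your code. The CHECK macro is an assumed stand-in for your gpuAssert (whose definition isn't shown), and the bump kernel and count_mismatches helper are hypothetical names made up for illustration. The explicit cudaGetLastError / cudaDeviceSynchronize calls are there so that a failing kernel is reported at the launch rather than at the next cudaMemcpy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed stand-in for the thread's gpuAssert (its real definition is not shown).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel: adds 1.0f to every element.
__global__ void bump(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // bounds check: the grid may overshoot n
}

// Launches the kernel `iterations` times and, after each launch, copies back
// only the single float at index `offset`. Returns how many read-backs did
// not match the expected value.
int count_mismatches(int n, int offset, int iterations)
{
    float* d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));  // all-zero bytes == 0.0f

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int mismatches = 0;

    for (int it = 1; it <= iterations; ++it) {
        bump<<<blocks, threads>>>(d_data, n);
        CHECK(cudaGetLastError());       // catches launch-time errors
        CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

        float temp[1];                   // one-element host array, as in the question
        CHECK(cudaMemcpy(temp, d_data + offset, sizeof(float),
                         cudaMemcpyDeviceToHost));
        if (temp[0] != static_cast<float>(it)) ++mismatches;
    }

    CHECK(cudaFree(d_data));
    return mismatches;
}
```

Calling count_mismatches(22332, 0, 10) mirrors your case (first element of a 22332-float array). If a stripped-down test like this behaves while the real code does not, the usual suspects are an out-of-bounds write in the kernel or a kernel that assumes state left over from the previous iteration, rather than the size of the copy.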

You made me laugh with the emphasis and effort you put in to prove a point :D Needed that, thanks.