Async Memcpy calls blocking main thread

i7-975
12 GB RAM
1.5 GB gtx580
win7 64-bit
using VS 2010, CUDA 4.0

OK, so I’m having an issue I couldn’t find covered on this forum: my cudaMemcpyAsync calls don’t seem to actually be asynchronous.

For example:

cpu_array[0] = -1;

// gpu_array holds only positive values, so after the copy cpu_array[0] is positive
cudaMemcpyAsync(cpu_array, gpu_array, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

if (cpu_array[0] != -1) {
    // It always ends up here: cpu_array[0] already equals the value that was supposed to be copied.
}

If the call were asynchronous, the condition should evaluate to false. Since I set cpu_array[0] to -1 right before the memcpy call, that value should still be in cache, so it seems highly unlikely that the condition is only evaluated after the GPU has copied anything to RAM. At least that’s what I think; correct me if I’m wrong.
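For what it's worth, reading the destination buffer is a fragile way to test this. A more direct check is to ask the stream whether the copy is still pending right after the call returns. Below is a minimal sketch of that idea, not the code from this post. It assumes the host buffer is page-locked (allocated with cudaMallocHost): the CUDA documentation notes that device-to-host copies into pageable memory may only return once the copy has completed, and the post does not say how cpu_array was allocated, so that cause is a guess. The element count n is a made-up value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24;  // hypothetical size: 16M floats (64 MB)
    float *gpu_array = NULL;
    float *cpu_array = NULL;
    cudaStream_t stream;

    if (cudaMalloc((void **)&gpu_array, n * sizeof(float)) != cudaSuccess ||
        // Assumption: page-locked host memory, so the copy can run in the background.
        cudaMallocHost((void **)&cpu_array, n * sizeof(float)) != cudaSuccess ||
        cudaStreamCreate(&stream) != cudaSuccess) {
        fprintf(stderr, "setup failed: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    cudaMemset(gpu_array, 0, n * sizeof(float));

    cudaMemcpyAsync(cpu_array, gpu_array, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Query the stream immediately after the call returns.
    cudaError_t status = cudaStreamQuery(stream);
    if (status == cudaErrorNotReady) {
        printf("copy still in flight: the call returned before the transfer finished\n");
    } else if (status == cudaSuccess) {
        printf("copy already complete at the first query\n");
    } else {
        printf("stream error: %s\n", cudaGetErrorString(status));
    }

    // Only after this point is it safe to read cpu_array.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(cpu_array);
    cudaFree(gpu_array);
    return 0;
}
```

A "still in flight" result shows the call returned early. An "already complete" result on a small transfer is inconclusive, since the copy may simply have finished before the query ran.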

Also, I have another issue (which I resolved by moving to floats, which is an acceptable change for what I’m working on) that I started a post on here:

It’s starting to seem like it’s compiling for a lower compute capability, but I’ve made sure in the project settings that compute_20 and sm_20 are set.

Is there any way to print out the CC at runtime? What could possibly be the problem here?
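On the runtime question, a minimal sketch using the runtime API: cudaGetDeviceProperties reports the compute capability of the hardware, and cudaFuncGetAttributes reports which architecture a given kernel was actually built for, which is probably the more useful number here. The kernel probe_kernel is a hypothetical placeholder; in practice you would pass one of your own kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only so there is something to inspect.
__global__ void probe_kernel() {}

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        fprintf(stderr, "no CUDA device available\n");
        return 1;
    }

    // Compute capability of the hardware itself.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    printf("Device %d (%s): compute capability %d.%d\n",
           device, prop.name, prop.major, prop.minor);

    // Architecture the kernel was compiled for, encoded as major*10 + minor.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("probe_kernel: binaryVersion %d, ptxVersion %d\n",
           attr.binaryVersion, attr.ptxVersion);
    return 0;
}
```

If the device reports 2.0 but the kernel's versions come back lower, the build settings are likely not reaching the file that contains the kernel.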

Thanks in advance

The PCIe transfer will be snooped and will invalidate your cache line as soon as the copy arrives.

Thanks for the reply.

I’m not sure I understand what you’re saying. It sounds like you mean the CPU will go to main memory instead of the cache because main memory has been changed? But I’m not sure if that’s what you mean. And what exactly does "snooped" mean? Sorry for being a newb lol.

Anyway, assuming that that is what you meant by your statement, how can the main memory be accessed and changed so quickly? I figured that by the time the GPU started even thinking about transferring to main memory, the CPU would have already completed the conditional statement.

Either way, I just want to make sure that my memcpy calls are asynchronous. Are you saying that they are? I’ve even compared the time it takes to issue kernel launches (which are always asynchronous) against my cudaMemcpyAsync calls, and a single cudaMemcpyAsync call was taking 4-5x as long as launching four kernels.

OK, well, I just tried increasing the amount of memory being transferred and compared again the time it took to launch the 4 kernels vs. the one cudaMemcpyAsync call, and the difference was the same, even when I increased it to 20 times the original amount of memory being transferred… So I guess it is asynchronous? I’m confused…
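The experiment described above can be sketched roughly as follows: time only the host-side cost of the cudaMemcpyAsync call at two sizes, one 20 times the other. If the call time stays flat as the size grows, the call is returning before the transfer finishes; if it grows with the size, the call is blocking. The buffer sizes, the helper time_call_ms, and the pageable vs. page-locked comparison are assumptions added for illustration, not details from the post, and the sketch needs a toolchain with C++11 std::chrono.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: host-side milliseconds spent inside the call itself.
static double time_call_ms(float *dst, const float *src, size_t bytes,
                           cudaStream_t stream) {
    cudaStreamSynchronize(stream);  // start from an idle stream
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToHost, stream);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);  // let the copy finish before the next run
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t small_bytes = 4u << 20;          // 4 MB (made-up size)
    const size_t large_bytes = 20 * small_bytes;  // 20x larger, as in the post
    float *gpu = NULL, *pinned = NULL;
    float *pageable = (float *)malloc(large_bytes);
    cudaStream_t stream;

    if (pageable == NULL ||
        cudaMalloc((void **)&gpu, large_bytes) != cudaSuccess ||
        cudaMallocHost((void **)&pinned, large_bytes) != cudaSuccess ||
        cudaStreamCreate(&stream) != cudaSuccess) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }
    cudaMemset(gpu, 0, large_bytes);

    // Warm-up runs so one-time initialization costs are not measured.
    time_call_ms(pinned, gpu, small_bytes, stream);
    time_call_ms(pageable, gpu, small_bytes, stream);

    printf("pageable small: %.3f ms\n", time_call_ms(pageable, gpu, small_bytes, stream));
    printf("pageable large: %.3f ms\n", time_call_ms(pageable, gpu, large_bytes, stream));
    printf("pinned   small: %.3f ms\n", time_call_ms(pinned, gpu, small_bytes, stream));
    printf("pinned   large: %.3f ms\n", time_call_ms(pinned, gpu, large_bytes, stream));

    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
    cudaFree(gpu);
    free(pageable);
    return 0;
}
```

A fixed per-call overhead that is larger than a kernel launch but does not scale with the transfer size would be consistent with the numbers reported above.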