cudaMapAddress() has been removed from the API because it is slow and uses the bus very inefficiently. Memory copying (as opposed to memory mapping) is the best way to read from or write to device memory.
To answer your first point below: cudaMapAddress() didn’t enable writing to main memory from a kernel; you cannot call cudaMapAddress(), or any other function from the host runtime component for that matter, from a kernel (see section 4.5 of the programming guide).
Although I cannot call cudaMapAddress() from a kernel, it seems that if a kernel writes to an address that has been mapped to some address in CPU memory, that write might be redirected to the CPU memory.
Here is another question: does CUDA not support writing to CPU memory from a kernel? This ability seems very important for combining the CPU and GPU, especially when using the GPU as a parallel co-processor.
As Cyril pointed out above, you have the cudaMemcpy functions instead. Indeed, I prefer the copy approach to the mapping approach: with a copy you have fewer concurrent memory accesses to worry about, i.e. you are sure not to have race conditions with CPU threads accessing memory that is currently mapped. Otherwise, the synchronization between the CUDA threads would need an extension to also lock CPU access, which would be a very slow implementation.
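To make the copy approach concrete, here is a minimal sketch of the pattern: allocate device memory, run a kernel, and pull the results back with an explicit cudaMemcpy. The kernel and sizes are hypothetical placeholders, not anything from the thread above.

```cuda
// Sketch: explicit device-to-host copying instead of memory mapping.
// The kernel, array size, and variable names here are illustrative.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void compute(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;   // placeholder computation
}

int main(void) {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_out = (float *)malloc(bytes);
    float *d_out;
    cudaMalloc((void **)&d_out, bytes);

    compute<<<(n + 255) / 256, 256>>>(d_out, n);

    // Explicit copy back to the host. No CPU thread can race on d_out,
    // because the only host-side access to it goes through cudaMemcpy.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("h_out[10] = %f\n", h_out[10]);

    cudaFree(d_out);
    free(h_out);
    return 0;
}
```

Because the host only ever sees a private copy of the data, there is nothing for other CPU threads to synchronize against while the kernel runs.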
But cudaMemcpy is inappropriate in my situation. I use the GPU as a co-processor to the CPU.
Initially, the CPU assigns a computational task to the GPU; later, the CPU wants to check whether the GPU has obtained some interesting results. I hope there is a very cheap way to query the status of the data on the GPU. In OpenGL it would be an occlusion query, which is a pipelined operation; but in CUDA I can only use cudaMemcpy, which, I presume, will flush the GPU pipeline and slow down the CPU.
I think you misunderstand how CUDA works. There is no such thing as a display context, so there is no asynchronous command submission (currently). In other words, your CUDA kernel call will block until it has completed.
So if you want parallel work to be done on CPU and GPU, you need a separate (CPU) thread for CUDA anyway. It can download a small status array at the right moment and place it in CPU shared memory for the other CPU threads to read. So this should be perfectly parallel.
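The dedicated-thread pattern above can be sketched roughly as follows, assuming pthreads on the host. The kernel name (step_kernel), the status flag, and the loop structure are all hypothetical; the point is that one CPU thread owns every CUDA call, and after each (blocking) kernel launch it copies a small status word back and publishes it for the other CPU threads.

```cuda
// Sketch of a dedicated CUDA thread publishing GPU status to other
// CPU threads. step_kernel and g_status are hypothetical names.
#include <cuda_runtime.h>
#include <pthread.h>

volatile int     g_status = 0;   // status visible to the other CPU threads
pthread_mutex_t  g_lock   = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical kernel that writes a nonzero flag when it finds a result.
__global__ void step_kernel(int *d_status) {
    // ... do a slice of work; set *d_status = 1 on success ...
}

void *cuda_thread(void *arg) {
    int *d_status;
    cudaMalloc((void **)&d_status, sizeof(int));
    cudaMemset(d_status, 0, sizeof(int));

    for (;;) {
        // Kernel call blocks until complete (per the behaviour
        // described above), so this thread alone absorbs the wait.
        step_kernel<<<1, 64>>>(d_status);

        int h_status;
        cudaMemcpy(&h_status, d_status, sizeof(int),
                   cudaMemcpyDeviceToHost);

        pthread_mutex_lock(&g_lock);
        g_status = h_status;          // publish for other CPU threads
        pthread_mutex_unlock(&g_lock);

        if (h_status != 0) break;     // interesting result found
    }
    cudaFree(d_status);
    return NULL;
}
```

The other CPU threads only ever read g_status under the mutex; they never touch the CUDA runtime, so the CPU-side work stays fully parallel with the GPU.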
Be warned, however, if you run CUDA and rendering on the same card; see other discussions in this forum for why.