Can my CUDA kernel access host memory if running on NVIDIA ION?

vvolkov · August 3, 2010, 2:53pm

As far as I understand, host and device memory on an NVIDIA ION system are physically the same. So it should not be necessary to copy data between the device and host memory spaces. However, I tried accessing host memory from a CUDA kernel on ION and it didn’t work. Am I doing something wrong?

Is it still necessary to copy data from the host memory to the device memory on ION?

Vasily

pcchen · August 3, 2010, 3:00pm

To my understanding, you still need to allocate pinned memory; otherwise there would be problems with the CPU’s cache behavior.

vvolkov · August 3, 2010, 3:40pm

I can’t make it work even with pinned memory.

For example, it works if I allocate device memory using cudaMalloc. But if I change every cudaMalloc to cudaMallocHost, it doesn’t work.

tera · August 3, 2010, 4:13pm

Are you trying to use host pointers as device pointers?

I think that doesn’t work even on ION, as the GPU doesn’t go through the CPU’s MMU and so uses either physical addresses or its own logical address space.

pcchen · August 3, 2010, 4:14pm

When using pinned memory, the host pointer can’t be used directly on the device, IIRC. You need to call cudaHostGetDevicePointer to get a device pointer before passing it to the kernel function. Take a look at the “simpleZeroCopy” sample for more details.

vvolkov · August 3, 2010, 6:00pm

Thanks, now it works! I also had to make a few other changes: (i) call cudaSetDeviceFlags( cudaDeviceMapHost ) when initializing; (ii) add a couple of cudaThreadSynchronize() to ensure that data has passed over PCIe. There is a useful detailed document on the topic at http://forums.nvidia.com/index.php?showtopic=98502.

Vasily
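The steps that made this work can be put together roughly as follows. This is a minimal sketch, not the code from the thread: the kernel `scale`, the `CHECK` macro, and the names `h_data` and `d_alias` are made up for illustration, and it assumes the device reports support for mapped host memory. It uses cudaHostAlloc with the cudaHostAllocMapped flag, and cudaDeviceSynchronize, which replaced cudaThreadSynchronize in later toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and bail out of main() on failure.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;

    // (i) Enable host-memory mapping; this must happen before any call
    // that creates the CUDA context.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Allocate pinned host memory that is also mapped into the
    // device address space.
    float *h_data = NULL;
    CHECK(cudaHostAlloc((void **)&h_data, n * sizeof(float),
                        cudaHostAllocMapped));
    for (int i = 0; i < n; ++i)
        h_data[i] = 1.0f;

    // The host pointer is not passed to the kernel; ask the runtime
    // for the device-side alias of the same allocation.
    float *d_alias = NULL;
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    scale<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());

    // (ii) Wait for the kernel to finish before the CPU reads the data.
    CHECK(cudaDeviceSynchronize());

    printf("h_data[0] = %f\n", h_data[0]);

    CHECK(cudaFreeHost(h_data));
    return 0;
}
```

No cudaMemcpy appears anywhere: the kernel reads and writes the host allocation through `d_alias`, and the synchronize call is what makes the results safe to read on the CPU.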

vvolkov · August 3, 2010, 8:51pm

New problem: mapped host memory appears to be about half as fast as global memory on NVIDIA ION.

Say, cudaMemcpy() runs at 4.9 GB/s if arrays are allocated using cudaMalloc(), and only at 2.4 GB/s if they are allocated using cudaHostAlloc().

Any fix?

Vasily
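A comparison like this can be reproduced with a small timing harness along the following lines. It is only a sketch under assumptions: the helper `copy_gbps`, the 32 MB buffer size, and the repetition count are arbitrary choices rather than anything from the thread, and error checking is omitted for brevity. It counts each byte copied once, so figures from other tools are only comparable if they use the same convention.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time repeated cudaMemcpy calls with CUDA events
// and return the rate in GB/s, counting each copied byte once.
static double copy_gbps(void *dst, const void *src, size_t bytes,
                        cudaMemcpyKind kind)
{
    const int reps = 10;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, kind);          // warm-up, not timed

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait for all copies

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (double)bytes * reps / (ms * 1e-3) / 1e9;
}

int main()
{
    const size_t bytes = 32u << 20;             // 32 MB per buffer

    void *d_a = NULL, *d_b = NULL, *h_a = NULL;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);   // pinned host memory

    printf("host-to-device:   %.2f GB/s\n",
           copy_gbps(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    printf("device-to-host:   %.2f GB/s\n",
           copy_gbps(h_a, d_a, bytes, cudaMemcpyDeviceToHost));
    printf("device-to-device: %.2f GB/s\n",
           copy_gbps(d_b, d_a, bytes, cudaMemcpyDeviceToDevice));

    cudaFreeHost(h_a);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```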

pcchen · August 3, 2010, 9:33pm

I don’t have an ION system, so I’m not sure what the problem is. However, I have a Mac mini with a GeForce 9400M, which is the same GPU as ION.
I ran bandwidthTest with pinned memory and the result is something like:

host-to-device: 4GB/s
device-to-host: 4.4GB/s
device-to-device: 7.4GB/s

So I suspect that host-to-device and device-to-host copies have to go through the PCI Express bus even though both buffers reside in the same physical memory. That’s probably why they are slower.

vvolkov · August 3, 2010, 9:43pm

Yeah, I get about the same figure on my ION:

host-to-device: 2.1GB/s
device-to-host: 2.3GB/s
device-to-device: 4.6GB/s

Quite unexpected.