As MisterAnderson said, the device_grid pointer is allocated on the host.
It looks like you are using the cells array as your host storage and want device_grid to be your device storage. If I understand correctly, you should cudaHostAlloc the cells array and then fill it with values on the CPU. Once it is populated, you can cudaMalloc the device_grid device pointer and cudaMemcpy from the cells array to device_grid (cudaMemcpyHostToDevice) to transfer the data from the CPU to the device.
That way your host data will be in the cells array and your device data will be on the GPU and pointed to by device_grid.
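The flow above can be sketched like this — a minimal sketch, assuming a float grid; the element type and size are made up, only the `cells`/`device_grid` names come from the thread:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t num_cells = 1024;                 // assumed grid size
    const size_t bytes = num_cells * sizeof(float);

    // Host storage: pinned allocation so the later copy is fast.
    float *cells = NULL;
    cudaHostAlloc((void **)&cells, bytes, cudaHostAllocDefault);

    // Fill the host array on the CPU.
    for (size_t i = 0; i < num_cells; ++i)
        cells[i] = (float)i;

    // Device storage, then copy host -> device.
    float *device_grid = NULL;
    cudaMalloc((void **)&device_grid, bytes);
    cudaMemcpy(device_grid, cells, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels that read device_grid ...

    cudaFree(device_grid);
    cudaFreeHost(cells);
    return 0;
}
```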
That’s a bit poo :(. Glad I made this discovery anyhow. Is there something texture-like that can use mapped memory, or is that just not possible? My requirements are that the memory needs to be dynamic and large, which texture memory provides, but textures apparently can’t be backed by mapped memory?
Any ideas? Can I use a cuda array or something? Do you think mapped memory will help much in my case?
I’m creating a raytracer with triangles and a grid structure stored in textures which get updated every frame. Any help with the above questions would be greatly appreciated.
Well, you could just try reading the device pointer directly. Since this is mapped memory on the host, the normal coalescing rules don’t apply in quite the same way. Tim did mention in a previous post that you still want threads in a warp accessing nearby values in the array to get the most out of each PCI-e burst, but that is something you would have to do to get good performance from textures anyway.
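For reference, the setup for reading mapped (zero-copy) memory from a kernel looks roughly like this — a sketch assuming a device that supports host-memory mapping; the kernel name and sizes are made up:

```cuda
#include <cuda_runtime.h>

__global__ void read_kernel(const float *grid, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // neighbouring threads read neighbouring
        out[i] = grid[i];    // elements, so each PCI-e burst is fully used
}

int main(void)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    float *host_grid = NULL, *dev_view = NULL, *dev_out = NULL;

    // Pinned + mapped host allocation.
    cudaHostAlloc((void **)&host_grid, n * sizeof(float), cudaHostAllocMapped);

    // Device-side pointer aliasing the same host memory.
    cudaHostGetDevicePointer((void **)&dev_view, host_grid, 0);

    cudaMalloc((void **)&dev_out, n * sizeof(float));
    read_kernel<<<(n + 255) / 256, 256>>>(dev_view, dev_out, n);
    cudaDeviceSynchronize();

    cudaFree(dev_out);
    cudaFreeHost(host_grid);
    return 0;
}
```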
I don’t know whether mapped memory makes the most sense for your application. As I see it, the one “big win” situation for mapped memory is a massively compute-bound problem that only needs to slowly pull in hundreds of MiB to many GiB of data: you can run the kernel and read mapped memory as it executes, instead of wasting all that “start-up” time copying the data over. The other interesting application I see is letting kernels with small outputs (like a sum reduction) write their results directly into host memory, potentially removing the latency of a small 4-byte cudaMemcpy.
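The small-output idea can be sketched like this — a hypothetical kernel, deliberately naive (a real reduction would combine partial sums across many threads); only the write-back path is the point:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void sum_kernel(const float *data, int n, float *result)
{
    // Single-thread sum, just to show the direct write-back;
    // a real reduction would use shared memory and many threads.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += data[i];
        *result = s;    // lands directly in mapped host memory
    }
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 256;
    float *data = NULL;
    cudaMalloc((void **)&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // The 4-byte result lives in mapped host memory: no cudaMemcpy needed.
    float *host_result = NULL, *dev_result = NULL;
    cudaHostAlloc((void **)&host_result, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_result, host_result, 0);

    sum_kernel<<<1, 1>>>(data, n, dev_result);
    cudaDeviceSynchronize();   // result is now visible at *host_result

    printf("sum = %f\n", *host_result);
    cudaFree(data);
    cudaFreeHost(host_result);
    return 0;
}
```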
Interesting stuff. I have another problem, though: my kernel is getting very long, and it now takes about 20 minutes to build. Have you got any tips to make it build faster? God knows what nvcc is doing to make it take that long.