Pinned memory error: invalid device pointer

Hi,

I am trying to allocate some pinned memory but keep getting an “invalid device pointer” error. Any ideas?

Please help

[codebox]
void *device_grid;

CUDA_SAFE_CALL(cudaFreeHost(device_grid));
CUDA_SAFE_CALL(cudaHostAlloc(&device_grid, sizeof(int) * totalCounts, cudaHostAllocWriteCombined));
CUDA_SAFE_CALL(cudaMemcpy(device_grid, &cells[0], sizeof(int) * totalCounts, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaBindTexture(0, grid, device_grid, sizeof(int) * totalCounts));
[/codebox]

cudaHostAlloc allocates a host pointer, but you then cudaMemcpy to that pointer with cudaMemcpyHostToDevice, which is why you are getting an “invalid device pointer” error.

I see… that’s quite a noobish thing I’m doing. But what should I be doing instead?

As MisterAnderson said, the device_grid pointer is allocated on the host.

It looks like you are using the cells array as your host storage and want device_grid to be your device storage. If I understand correctly, you should have done a cudaHostAlloc for the cells array and then filled that with values on the CPU. Once that is populated, you can cudaMalloc the device_grid device pointer and do the cudaMemcpy from the cells array to device_grid (cudaMemcpyHostToDevice) to transfer the data from the CPU to the device.

That way your host data will be in the cells array and your device data will be on the GPU and pointed to by device_grid.
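
Roughly like this (an untested sketch using the names from your snippet, and assuming cells holds ints and totalCounts is the element count):

[codebox]
int *cells = 0;

// pinned host storage, filled on the CPU
CUDA_SAFE_CALL(cudaHostAlloc((void **)&cells, sizeof(int) * totalCounts, cudaHostAllocWriteCombined));
// ... fill cells[0 .. totalCounts-1] here ...

// device storage
int *device_grid = 0;
CUDA_SAFE_CALL(cudaMalloc((void **)&device_grid, sizeof(int) * totalCounts));

// copy host -> device, then bind the device pointer to the texture
CUDA_SAFE_CALL(cudaMemcpy(device_grid, cells, sizeof(int) * totalCounts, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaBindTexture(0, grid, device_grid, sizeof(int) * totalCounts));
[/codebox]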

Thank you both; I have fixed the noobish error now.

I’ve run into another one though. I’m doing this and getting another “invalid device pointer” error:

[codebox]
float4 *hostPtr = 0;
void *devPtr;
int cudaError2;

cudaSetDeviceFlags(cudaDeviceMapHost);
cudaSetDevice(0);

cudaHostAlloc((void **)&hostPtr, sizeof(float4) * 30000, cudaHostAllocMapped | cudaHostAllocPortable);

cudaError2 = cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
if (cudaError2)
    printf("Failed to get device pointer\n");

CUDA_SAFE_CALL(cudaBindTexture(0, grid, devPtr, sizeof(int) * totalCounts));
[/codebox]

I’m trying to map the memory to eliminate the memory copy to the device. Is this possible? What am I doing wrong for it not to be working?

Here is your code wrapped up into a full reproduction case:

[codebox]
#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> grid;

int main()
{
    float4 *hostPtr = 0;
    void *devPtr;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(0);

    cudaError_t error = cudaHostAlloc((void **)&hostPtr, sizeof(float4) * 30000,
                                      cudaHostAllocMapped | cudaHostAllocPortable);
    if (error != cudaSuccess)
        printf("error allocating\n");

    error = cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
    if (error != cudaSuccess)
        printf("error mapping\n");

    error = cudaBindTexture(0, grid, devPtr, sizeof(float4) * 30000);
    if (error != cudaSuccess)
        printf("error binding\n");

    return 0;
}
[/codebox]

When I run it on my CUDA 2.2 beta machine on a GTX 280, I get “error binding” as the output.

I think everything you are doing is correct… at least according to the manual.

Maybe Tim will comment on whether or not host-mapped memory can be bound to a texture. It isn’t mentioned specifically in the programming guide.

It indeed cannot! I should probably make sure the programming guide contains that limitation…

That’s a bit poo :(. Glad I made this discovery anyhow. Is there something similar to a texture that can use mapped memory, or is it just not possible? My requirements are that the memory needs to be dynamic and large, which texture memory provides, but it apparently can’t be used with mapped memory.

Any ideas? Can I use a CUDA array or something? Do you think mapped memory will help much in my case?

I’m creating a raytracer with triangles and a grid structure stored in textures which get updated every frame. Any help with the above questions would be greatly appreciated.

Well, you could just try reading the device pointer directly. Since this is mapped memory from the host, the normal coalescing rules don’t apply in quite the same way. Tim did mention in a previous post that you still want the threads in a warp to access nearby values in the array to get the most out of each PCI-e burst, but that is something you would have to do to get good performance from textures anyway.
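
For example, something along these lines in place of a tex1Dfetch (an untested sketch; the kernel name, output buffer and launch configuration are made up for illustration):

[codebox]
// devPtr is the device-side alias of the mapped host allocation,
// obtained from cudaHostGetDevicePointer as in your code above.
__global__ void readMapped(const float4 *devGrid, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = devGrid[i];   // adjacent threads read adjacent float4s
}

// launch, e.g.:
// readMapped<<<(30000 + 255) / 256, 256>>>((float4 *)devPtr, d_out, 30000);
[/codebox]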

I don’t know whether mapped memory makes the most sense for your application. As I see it, the one “big win” situation for mapped memory is a massive compute-bound problem that only needs to slowly pull in hundreds of MiB to many GiB of data: you can run the kernel and read mapped memory as it executes, instead of wasting “start-up” time copying all the data over first. The other interesting application I see for it is to let kernels with small outputs (like a sum reduction) write their results directly into host memory, potentially removing the latency of a small 4-byte cudaMemcpy.
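
The second case would look roughly like this (an untested sketch; reduceKernel, d_in, blocks, threads and n are placeholders, not code from anywhere real):

[codebox]
float *h_result = 0, *d_result = 0;

// one mapped float that the kernel can write directly
cudaHostAlloc((void **)&h_result, sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_result, h_result, 0);

// hypothetical reduction kernel that ends with: *d_result = totalSum;
reduceKernel<<<blocks, threads>>>(d_in, d_result, n);
cudaThreadSynchronize();          // make sure the device write has landed

printf("sum = %f\n", *h_result);  // read the result with no cudaMemcpy at all
[/codebox]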

Interesting stuff. I have another problem though: my kernel is getting very long and it is taking about 20 minutes to build now. Have you got any tips to make it build faster? God knows what nvcc is doing to make it take that long.