As MisterAnderson said, the device_grid pointer is allocated on the host.
It looks like you are using the cells array as your host storage and want device_grid to be your device storage. If I understand correctly, you should cudaHostAlloc the cells array and then fill it with values on the CPU. Once it is populated, you can cudaMalloc the device_grid device pointer and cudaMemcpy from the cells array to device_grid (cudaMemcpyHostToDevice) to transfer the data from the CPU to the device.
That way your host data will be in the cells array and your device data will be on the GPU and pointed to by device_grid.
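The flow above can be sketched like this — a minimal sketch, assuming a float grid; the element type and size are made up, only the `cells`/`device_grid` names come from the thread:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t num_cells = 1024;                 // assumed grid size
    const size_t bytes = num_cells * sizeof(float);

    // Host storage: pinned allocation so the later copy is fast.
    float *cells = NULL;
    cudaHostAlloc((void **)&cells, bytes, cudaHostAllocDefault);

    // Fill the host array on the CPU.
    for (size_t i = 0; i < num_cells; ++i)
        cells[i] = (float)i;

    // Device storage, then copy host -> device.
    float *device_grid = NULL;
    cudaMalloc((void **)&device_grid, bytes);
    cudaMemcpy(device_grid, cells, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels that read device_grid ...

    cudaFree(device_grid);
    cudaFreeHost(cells);
    return 0;
}
```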
That’s a bit poo :(. Glad I made this discovery anyhow. Is there something texture-like that can use mapped memory, or is that just not possible? My requirements are that the memory needs to be dynamic and large, which texture memory provides, but textures apparently can’t be backed by mapped memory?
Any ideas? Can I use a cuda array or something? Do you think mapped memory will help much in my case?
I’m creating a raytracer with triangles and a grid structure stored in textures which get updated every frame. Any help with the above questions would be greatly appreciated.
Well, you could just try reading the device pointer directly. Since this is mapped memory on the host, the normal coalescing rules don’t apply in quite the same way. Tim did mention in a previous post that you still want threads in a warp accessing nearby values in the array to get the most out of each PCI-e burst, but that is something you would have to do to get good performance from textures anyway.
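For reference, the setup for reading mapped (zero-copy) memory from a kernel looks roughly like this — a sketch assuming a device that supports host-memory mapping; the kernel name and sizes are made up:

```cuda
#include <cuda_runtime.h>

__global__ void read_kernel(const float *grid, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // neighbouring threads read neighbouring
        out[i] = grid[i];    // elements, so each PCI-e burst is fully used
}

int main(void)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    float *host_grid = NULL, *dev_view = NULL, *dev_out = NULL;

    // Pinned + mapped host allocation.
    cudaHostAlloc((void **)&host_grid, n * sizeof(float), cudaHostAllocMapped);

    // Device-side pointer aliasing the same host memory.
    cudaHostGetDevicePointer((void **)&dev_view, host_grid, 0);

    cudaMalloc((void **)&dev_out, n * sizeof(float));
    read_kernel<<<(n + 255) / 256, 256>>>(dev_view, dev_out, n);
    cudaDeviceSynchronize();

    cudaFree(dev_out);
    cudaFreeHost(host_grid);
    return 0;
}
```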
I don’t know whether mapped memory makes the most sense for your application. As I see it, the one “big win” situation for mapped memory is a massively compute-bound problem that only needs to slowly pull in hundreds of MiB to many GiB of data: you can run the kernel and read mapped memory as it executes, instead of wasting all that “start-up” time copying the data over. The other interesting application I see is letting kernels with small outputs (like a sum reduction) write their results directly into host memory, potentially removing the latency of a small 4-byte cudaMemcpy.
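The small-output idea can be sketched like this — a hypothetical kernel, deliberately naive (a real reduction would combine partial sums across many threads); only the write-back path is the point:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void sum_kernel(const float *data, int n, float *result)
{
    // Single-thread sum, just to show the direct write-back;
    // a real reduction would use shared memory and many threads.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += data[i];
        *result = s;    // lands directly in mapped host memory
    }
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 256;
    float *data = NULL;
    cudaMalloc((void **)&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // The 4-byte result lives in mapped host memory: no cudaMemcpy needed.
    float *host_result = NULL, *dev_result = NULL;
    cudaHostAlloc((void **)&host_result, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_result, host_result, 0);

    sum_kernel<<<1, 1>>>(data, n, dev_result);
    cudaDeviceSynchronize();   // result is now visible at *host_result

    printf("sum = %f\n", *host_result);
    cudaFree(data);
    cudaFreeHost(host_result);
    return 0;
}
```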
Interesting stuff. I have another problem, though: my kernel is getting very long, and it now takes about 20 minutes to build. Have you got any tips to make it build faster? God knows what nvcc is doing to make it take that long.