C++ DLL - Invalid Device Pointer


I’m new to CUDA programming, am having some difficulties, and don’t know exactly where they come from.

I’m trying to extend an image processing application with some CUDA-accelerated algorithms. Each “filter” is implemented as a derived class and compiled into a DLL. When you select a filter for execution it is created once, and afterwards the ExecuteFilter() method is called for each new processing element (image).

My idea was to call cudaMalloc once when the class is created and cudaFree once when the class is destroyed. Unfortunately that doesn’t work for me: I get an “invalid device pointer” error when cudaMemcpy is called.

It only works when I call cudaMalloc/cudaFree inside the ExecuteFilter() method itself, which is not very efficient, since this method is called very often.

I’ve already looked for solutions in this forum, but I couldn’t pinpoint the exact problem. Other threads even suggested doing the cudaMalloc/cudaFree in the constructor and destructor of a class.

Can someone tell me how I might solve this problem?

Thanks in advance,


I put some of the code online in order to better illustrate the problem.

SobelFilterCUDA::SobelFilterCUDA(STRING name) {
	sobel_output = new TIplImage();
	getOutputData(0).setData(sobel_output, "sobel filtered image");
	m_ClassName = name;
	// fixed-size device buffer (8294400 bytes = 1920 * 1080 * 4)
	CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, 8294400));
}

void SobelFilterCUDA::ExecuteFilter() {
	TIplImage* input = dynamic_cast<TIplImage*>(getInputData(0).getData());
	char* data = input->image()->imageData;
	int mem_size = input->image()->imageSize;

	CUDA_SAFE_CALL(cudaMemcpy(d_data, data, mem_size, cudaMemcpyHostToDevice));
	// ... kernel launch omitted ...
	CUDA_SAFE_CALL(cudaMemcpy(data, d_data, mem_size, cudaMemcpyDeviceToHost));
}



I am not sure if this is the problem, but just FYI: look for this scenario in your setup.

cudaMalloc() pointers are valid only for the thread of execution that allocated them. If you are trying to use them in a different thread of execution, it won’t work.

Thank you for your reply, Sarnath. You are right; I already assumed that it might be a threading problem.

Now I have talked to the main developer and he told me how the application works: the filter class is created by the main thread, but the ExecuteFilter() method is called by another thread.

There is, however, a possibility to run the filter in the main thread. That isn’t very clever, since the GUI is affected by the filter, but for now it is OK as a temporary solution.

You should look at Mr.Anderson’s multi-GPU idea/code/concept. You should be able to google it; it’s in this forum (use site:http://forums.nvidia.com plus your search keywords).

His idea was to dedicate a thread to doing all operations on a GPU. Different threads interested in the same GPU would just bind their functions and pass them on to this thread; that master thread would execute them and pass back the results. It requires the Boost library.
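The dedicated-GPU-thread idea above can be sketched with only the standard library (no Boost). `GpuWorker` and `submit` are hypothetical names for this sketch: one thread owns the device (so all cudaMalloc’d pointers live in its context), and other threads hand it work and wait for the result:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// One thread owns the GPU; other threads submit work to it.
class GpuWorker {
public:
    GpuWorker() : worker_([this] { run(); }) {}

    ~GpuWorker() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Submit a task; block until the GPU thread has executed it.
    template <class F>
    auto submit(F f) -> decltype(f()) {
        std::packaged_task<decltype(f())()> task(std::move(f));
        auto fut = task.get_future();
        { std::lock_guard<std::mutex> lk(m_); q_.emplace([&task] { task(); }); }
        cv_.notify_one();
        return fut.get();  // safe: task outlives the call because we wait here
    }

private:
    void run() {
        // cudaSetDevice()/cudaMalloc() would happen here, so every device
        // pointer belongs to this one thread's context.
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                job = std::move(q_.front());
                q_.pop();
            }
            job();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;  // declared last: starts after the other members exist
};
```

Usage: any thread calls `gpu.submit([&] { /* memcpy + kernel */ return result; });` and every submitted task runs on the same worker thread, so allocate-once-in-the-constructor works again.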

We have a library built without using Boost. I am hoping to publish it in the NV forums within a month or so.

Also, with the new CUDA 2.2 (I think), devices can be reserved for shared/exclusive access among threads using “cudaSetDeviceFlags” or some API like that. There are some NV utilities that set this up for you as well. Just be aware that someone could have set a device for exclusive access, and you may have to undo that to get your program working. A very, very remote possibility, but be aware of it.

Hello again,

is there a possibility of shared access among threads, using “cudaSetDeviceFlags” or some other way?

Unfortunately I can’t change the application’s execution model for my CUDA filter, so Mr.Anderson’s idea can’t be applied to my problem :-(.

Thanks in advance!!!