cudaMemcpy error invalid device pointer

I encountered a strange problem. I am compiling my CUDA application as a static library so that other applications can use it, so I created several functions to allocate and access device memory. Basically, my code looks like this:

float *result_dev;  // device buffer shared by both functions; size is defined elsewhere (not shown)

extern "C" {

void GPUinit()
{
    cudaError_t s = cudaMalloc((void**)&result_dev, size);
    cudaMemset((void*)result_dev, 16, size);
    float *hostbuffer = (float *)malloc(size);
    s = cudaMemcpy(hostbuffer, result_dev, size, cudaMemcpyDeviceToHost); // runs successfully
    // save hostbuffer to file
    free(hostbuffer);
}

void readResult()
{
    float *hostbuffer = (float *)malloc(size);
    cudaError_t s = cudaMemcpy(hostbuffer, result_dev, size, cudaMemcpyDeviceToHost); // cudaMemcpy error: invalid device pointer
    if (s != cudaSuccess)
    {
        printf("cudaMemcpy error %s\n", cudaGetErrorString(s));
    }
    // save hostbuffer to file
    free(hostbuffer);
}

}

When I save hostbuffer from GPUinit(), the result is correct. However, when I call readResult(), cudaMemcpy returns an error code that translates to "invalid device pointer".
I am wondering if anyone else has run into this issue, and whether it is caused by compiling the code into a static library and linking it against my application, or whether it is related to the CUDA release (I believe I am using 2.0).
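
To narrow down which call fails first, the return code of every runtime call can be checked, not only the final cudaMemcpy. This is a minimal sketch of that idea, not the code from my library: checkCuda and allocAndFill are hypothetical names, and bytes stands in for the size used above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: report a failed CUDA runtime call together with its name.
// Returns true when the call succeeded.
static bool checkCuda(cudaError_t s, const char *what)
{
    if (s != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(s));
        return false;
    }
    return true;
}

// Hypothetical allocation routine that stops at the first failing call.
// Returns the device pointer, or NULL if any step failed.
static float *allocAndFill(size_t bytes)
{
    float *dev = NULL;
    if (!checkCuda(cudaMalloc((void**)&dev, bytes), "cudaMalloc"))
        return NULL;
    if (!checkCuda(cudaMemset(dev, 16, bytes), "cudaMemset"))
    {
        cudaFree(dev);  // do not leak the buffer when the fill fails
        return NULL;
    }
    return dev;
}
```

With this in place, a NULL return from allocAndFill points at the allocation path, while a later failure in cudaMemcpy points at how the pointer is being used.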

Update: I wrote a simple test program and test library, and that seems to work, so I still need to figure out what is wrong.
BTW, I am wondering whether it depends on having control of the video card. I remember that remote desktop doesn't work. The application I am building is a server job invoked by a job scheduler, so I am wondering if that is the reason.

OK, I found out why. My application uses threads, and the initialization function runs in a different thread from the one that calls readResult().

From another discussion thread, http://forums.nvidia.com/lofiversion/index.php?t55838.html:

You cannot use Ad, Bd and Cd in a thread that did not CudaMalloc them, each thread has its own context and cannot exchange device-pointers with other threads.

This is exactly my issue.
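
One way around this is to make a single host thread own the GPU and route every CUDA call through it. The following is a minimal sketch of that pattern, not my actual library code. It assumes a C++11 compiler behind nvcc. GpuWorker, gpu() and kBytes are hypothetical names, and GPUinit/readResult are changed here to return the error code as an int. As far as I know, from CUDA 4.0 on the runtime shares one context per device among all host threads of a process, so the restriction mainly affects older releases such as 2.0.

```cuda
#include <cuda_runtime.h>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical helper: one thread that executes every task handed to it.
class GpuWorker {
public:
    GpuWorker() : stop_(false), thread_([this] { loop(); }) {}

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Run task on the worker thread and block until it has finished.
    void run(std::function<void()> task) {
        std::packaged_task<void()> job(std::move(task));
        std::future<void> done = job.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(job));
        }
        cv_.notify_one();
        done.get();
    }

private:
    void loop() {
        for (;;) {
            std::packaged_task<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested and nothing left to do
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // executed outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> tasks_;
    bool stop_;
    std::thread thread_;  // declared last so the other members exist before it starts
};

// Single worker shared by the whole library, created on first use.
static GpuWorker &gpu() {
    static GpuWorker worker;
    return worker;
}

static float *result_dev = NULL;
static const size_t kBytes = 1024 * sizeof(float);  // assumed buffer size

// Allocation happens on the worker thread, whichever thread calls GPUinit().
extern "C" int GPUinit() {
    cudaError_t s = cudaSuccess;
    gpu().run([&] {
        s = cudaMalloc((void**)&result_dev, kBytes);
        if (s == cudaSuccess) s = cudaMemset(result_dev, 16, kBytes);
    });
    return (int)s;
}

// The copy runs on the same worker thread that allocated result_dev.
// hostbuffer must hold at least kBytes bytes.
extern "C" int readResult(float *hostbuffer) {
    cudaError_t s = cudaSuccess;
    gpu().run([&] {
        s = cudaMemcpy(hostbuffer, result_dev, kBytes, cudaMemcpyDeviceToHost);
    });
    return (int)s;
}
```

The application can then call GPUinit() and readResult() from any of its threads, because the device pointer never leaves the worker thread that allocated it. The cost is that all GPU calls are serialized through that one thread.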