Mapped Memory on a Multi-Threaded Single GPU application

I have issues when using mapped memory within a multithreaded application with CUDA 4.0.

A kernel which previously worked fine in a single threaded application is now failing when it writes to mapped memory. The kernel returns cudaErrorInvalidValue and further debugging with Parallel Nsight & cuda-memcheck shows “error = access violation on store” when writing in an array of booleans and an array of pointers.

My mapped memory arrays are allocated in the main host thread but are passed to the kernel by different worker threads. I suspect this may be the source of the problem. I call cudaSetDeviceFlags(cudaDeviceMapHost); on the main thread also.

Can someone explain whether it is ok to allocate mapped memory in one thread and use it on the other? On which thread should cudaSetDeviceFlags be called?

Again I’m 99% sure this isn’t a problem with the kernel as it’s extensively tested to work in a single threaded app.

Thank you.

Update: I let each thread do its own allocations but the problem persists. It’s behaviour is random.

The arrays I’m trying to write to are allocated as members of a struct which is passed by value.

MTMPL struct MMarker{

public:

	Type *h_values, *d_values;

	bool *h_found, *d_found;

	MMarker()

	{	

		cudaHostAlloc((void**) &(h_values), MAX_KEYS*sizeof(Type), cudaHostAllocMapped);

		cudaHostGetDevicePointer((void**)&(d_values), (void*)(h_values), 0);

		cudaHostAlloc((void**) &(h_found), MAX_KEYS*sizeof(bool), cudaHostAllocMapped);

		cudaHostGetDevicePointer((void**)&(d_found), (void*)(h_found), 0);

	}

	~MMarker()

	{

		cudaFreeHost((void*) h_values);

		cudaFreeHost((void*) h_found);

	}

};

MTMPLR __global__

void multi_contains_kernel(Key* keys, int k, MParams<Key,Type>* prms, MMarker<Key,Type> marker)

{

//unrelated code...

	int key_index=tid/warpSize;	//index belonging to thread

	

	bool lFound;

	if ((tid&(warpSize-1))==0 && key_index < k)

	{

		Key key=keys[key_index];		//key belonging to thread

		int sMemKeyIndex=key_index%maxKeysPerBlock;

		lFound=find<Key,Type,reversed>(key, &preds[sMemKeyIndex*MAX_HEIGHT], &succs[sMemKeyIndex*MAX_HEIGHT], prms);

		marker.d_found[key_index]=lFound; //THIS CAN FAIL

		if (lFound)

		{

			marker.d_values[key_index]=succs[sMemKeyIndex*MAX_HEIGHT]->data; //THIS CAN ALSO FAIL

		}

	}

}

I check after allocations for any errors but there are none.