Concurrent kernel execution only working with mapped memory

Hello,

I have a data structure that is managed on the GPU, and each object of the structure carries out its operations in its own stream, which is a member variable. The goal is to allow operations performed on different objects to execute concurrently, because they are logically independent.

I have two versions of the insert() method:

The first takes page-locked input and copies it to the device with cudaMemcpyAsync() before launching the kernel. Calls to this method from different objects do not execute concurrently, and I cannot figure out why.

__host__ void insert(Key *key, Type *value, size_t k)	//key, value are allocated using pinned memory
{
	//Dynamically allocated shared memory size in bytes (per block)
	size_t smem_size = 2*MAX_KEYS*MAX_HEIGHT*sizeof(Node<Key,Type>*);

	cudaError_t error;
	error = cudaMemcpyAsync((void*)dev_keys,   (void*)key,   k*sizeof(Key),  cudaMemcpyHostToDevice, stream);
	error = cudaMemcpyAsync((void*)dev_values, (void*)value, k*sizeof(Type), cudaMemcpyHostToDevice, stream);

	//Launch kernel
	multi_insert_kernel<<<blocksPerGrid, threadsPerBlock, smem_size, stream>>>(dev_keys, dev_values, k, prms);
}

The second version takes pointers to mapped memory and passes them directly to the kernel. Calls to this version do execute concurrently on the device.

__host__ void insertM(Key *key, Type *value, size_t k)	//key, value are mapped to device
{
	//Dynamically allocated shared memory size in bytes (per block)
	size_t smem_size = 2*MAX_KEYS*MAX_HEIGHT*sizeof(Node<Key,Type>*);

	//Launch kernel
	multi_insert_kernel<<<blocksPerGrid, threadsPerBlock, smem_size, stream>>>(key, value, k, prms);
}

Calls are issued like this:

for(unsigned long i=0; i<=(samples-increment); i+=increment)
{
	gpulist1.insert(a+i, a+i, increment);
	gpulist2.insert(a+i, a+i, increment);
	gpulist3.insert(a+i, a+i, increment);
	gpulist4.insert(a+i, a+i, increment);
}

for(unsigned long i=0; i<=(samples-increment); i+=increment)
{
	gpulist1.insertM(d+i, d+i, increment);
	gpulist2.insertM(d+i, d+i, increment);
	gpulist3.insertM(d+i, d+i, increment);
	gpulist4.insertM(d+i, d+i, increment);
}
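
For reference, a is page-locked and d is mapped; the allocations look roughly like the following sketch (the exact flags may differ from my actual code):

cudaSetDeviceFlags(cudaDeviceMapHost);	//must precede context creation for mapping to work

Key *a;		//page-locked input used by insert()
Key *d_h, *d;	//mapped input used by insertM(); d is the device alias of d_h
cudaHostAlloc((void**)&a,   samples*sizeof(Key), cudaHostAllocDefault);
cudaHostAlloc((void**)&d_h, samples*sizeof(Key), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d, (void*)d_h, 0);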

The device is a GTX 560 Ti (compute capability 2.1). I can tell whether the kernels execute concurrently by comparing the measured execution times.

Does anyone have any suggestions?

Thank you.

An unrelated question: is cudaStreamCreate() thread-safe?

All CUDA API functions should be thread-safe as of CUDA 4.0.
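
e.g., a pattern like this sketch, with each host thread creating its own stream, is fine (illustrative only):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

//Each host thread creates, uses and destroys its own stream. Safe with
//CUDA >= 4.0, where runtime API calls are thread-safe and host threads
//share a single context per device.
void worker()
{
	cudaStream_t s;
	cudaStreamCreate(&s);
	//... enqueue work into s ...
	cudaStreamDestroy(s);
}

int main()
{
	std::vector<std::thread> threads;
	for (int i = 0; i < 4; ++i)
		threads.emplace_back(worker);
	for (std::thread &t : threads)
		t.join();
	return 0;
}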

The problem with concurrency and cudaMemcpyAsync() is an artifact of how concurrency works on Fermi. The only guaranteed way to get concurrency is to take a breadth-first approach: launch one piece of work per stream before launching any second piece of work into any stream.
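
In code, that means an issue order roughly like this (sketch; the array names are placeholders for per-object state):

//Breadth-first: first piece of work into every stream...
for (int s = 0; s < nStreams; ++s)
	cudaMemcpyAsync(dst[s], src[s], bytes, cudaMemcpyHostToDevice, streams[s]);
//...before the second piece of work goes into any stream
for (int s = 0; s < nStreams; ++s)
	kernel<<<grid, block, 0, streams[s]>>>(dst[s]);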

Do you mean I cannot issue more than one command in a given stream before moving on to the next one? I’m not sure I understand.

What puzzles me is that the pattern that I used for the above example is the same as in section 3.2.5.5.1 in the programming guide.

Ideally I’d like the function call pattern to be random (data-driven), and I’d like kernels to overlap where possible.

So isn’t it enough to avoid any implicit synchronization as described in the programming guide?

"All CUDA API functions should be thread-safe as of CUDA 4.0."

Then I like the direction the API is moving.

Thank you.

I think that currently CUDA does not reorder (even async) memcpys and kernel calls, which would explain what Tim said. So yes, issuing two commands to the same stream would block any further concurrency.

And I think you have indeed pointed out a weakness in the Programming Guide. As far as I remember, the origins of section 3.2.5.5.1 predate Fermi, at which time the pattern used in that example achieved the maximum concurrency possible. However, section 3.2.5.5.5 discusses the efficiency of different patterns on pre-Fermi and Fermi hardware in detail.

I will go with mapped memory then, since it proved faster thanks to the overlapping behavior.
Even with a single object and a single stream it was still faster, because memory transfers overlapped with kernel execution (I only read the keys once).
By the way, as an exercise: I don’t see any way to rewrite my methods so that they overlap. Do you?
Thanks to you both.

You would need to somehow batch several calls to insert() so that you can reorder their cudaMemcpyAsync() calls and kernel invocations manually. That does not seem like a simple exercise.
Using mapped memory indeed seems like the better option.
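
Roughly along these lines (sketch only; stage() and launch() are made-up names for the two halves of insert()):

__host__ void stage(Key *key, Type *value, size_t k)	//just the copies
{
	cudaMemcpyAsync((void*)dev_keys,   (void*)key,   k*sizeof(Key),  cudaMemcpyHostToDevice, stream);
	cudaMemcpyAsync((void*)dev_values, (void*)value, k*sizeof(Type), cudaMemcpyHostToDevice, stream);
}

__host__ void launch(size_t k)	//just the kernel
{
	size_t smem_size = 2*MAX_KEYS*MAX_HEIGHT*sizeof(Node<Key,Type>*);
	multi_insert_kernel<<<blocksPerGrid, threadsPerBlock, smem_size, stream>>>(dev_keys, dev_values, k, prms);
}

//caller issues breadth-first: all copies into all streams, then all kernels
gpulist1.stage(a+i, a+i, increment);
gpulist2.stage(a+i, a+i, increment);
gpulist3.stage(a+i, a+i, increment);
gpulist4.stage(a+i, a+i, increment);
gpulist1.launch(increment);
gpulist2.launch(increment);
gpulist3.launch(increment);
gpulist4.launch(increment);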