Copying Memory from Pinned Memory to Device

I am trying to copy memory from Pinned Memory to Device memory. The pinned memory is being modified by the CPU. In the device code, if I dont declare the Pinned memory as volatile, changes after memcopy are not picked up. If I declare the Pinned Memory as Volatile, memcopy won’t work. Is there a way I can force the copy from Pinned Memory to Device Memory (or shared memory) to not use cache?

Don’t really know what that means.

If you’re running into a compiler error, you can recast the pointer as needed.

If it is a runtime issue, you would need to provide an example.

This example shows roughly what is need to communicate data between a running kernel and host code:

note that this can be extraordinarily difficult if your GPU is running in WDDM mode.

I am running into a compile error. If I recast the pointer, memcopy is unaware that the CPU might have changed the contents.

This example works for 1 int, volatile int * progress. What if progress was an array? One option is to modify the contents of progress in a loop. But this is slow; each iteration of loop takes ~1us, which is the round trip latency of the PCIE bridge and hence expected.

Is there a way to modify the contents of an array in pinned memory so that it is visible to the CPU in ~1us?

Sorry, most of your statements don’t make any sense to me.

If you want to provide a short complete code, I’ll take a look as time permits. Otherwise, perhaps someone else will be able to help.

void addPersist(int n,   volatile rid * rid_h, volatile float * host_p,  volatile uint64_t * host_w, volatile uint64_t * host_r,  volatile uint64_t* kill, float * res) {
	float y = 0.0f;

	//only used by main thread
	static uint64_t readIndex;
	static rid rid_d;

	static volatile int doWork;
	static volatile float z;

	doWork = 0;

	while(doWork >= 0) {

		if(threadIdx.x ==0 ) {
			//memcpy(&rid_d, rid_h, sizeof(rid));
			rid_d.ri = rid_h->ri;
			rid_d.d = rid_h->d;
// here is where i am copying rid_h (in pinned memory) to rid_d
			if( rid_d.ri > readIndex)
// if rid_h->ri is incremented by host, read rid_d->d and do work.
				if( rid_d.ri > readIndex) {
					z = rid_d.d;
					doWork = 1;




		if( doWork== 1) {
// I the work done here is dumb...
			y = y  + tanh(z);


		if(threadIdx.x == 0) {
			if( doWork) {

				*host_r = *host_r + 1;
				doWork = 0;
// if fill is set to 1 by host, stop doing work
			        if (*kill == 1)
			        doWork = -1;


	int i = blockIdx.x*blockDim.x + threadIdx.x;
	res[i] = y;

The code above is an example of a persistent kernel. Once initialized, it busy polls rid_h (modified by the host) for work to do.
It is killed if *kill is set to 1 by the host.

Line 19 won’t work since rid_h has to be declared volatile for the device thread to be aware that the host may modify its contents.

Instead, I have to copy the struct rid_h to rid_d in two steps since struct rid has 2 variables. These 2 lines take 1us each. Is there a way to copy the contents of rid_h to rid_d faster?
Note: If struct rid had 10 variables, it would take 10us to copy rid_h to rid_d.

The reason why I’m doing this is to get around the overhead of launching a kernel.

Just to be clear, I asked for a complete code. If that is not clear, I’ll be happy to further explain what it is I was asking for. Again; do as you wish. Perhaps someone else will jump in based on what you have posted.