I am trying to copy memory from pinned memory to device memory. The pinned memory is being modified by the CPU. In the device code, if I don't declare the pinned memory as volatile, changes made by the CPU after the memcpy are not picked up. If I declare the pinned memory as volatile, memcpy won't compile. Is there a way I can force the copy from pinned memory to device memory (or shared memory) to not use the cache?
Don’t really know what that means.
If you’re running into a compiler error, you can recast the pointer as needed.
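For example (a minimal sketch with made-up names): memcpy() takes a const void* source, so the volatile qualifier has to be cast away for the call to compile at all:

// Sketch only: cast volatile away just for the duration of the copy.
// 'data_h' is assumed to point to pinned host memory that the kernel
// otherwise reads through a volatile-qualified pointer.
__global__ void grabSnapshot(volatile int *data_h, int *out, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // const_cast removes cv-qualifiers (volatile included); a C-style
        // cast works as well. Note the copy itself then carries no
        // volatile semantics.
        memcpy(out, const_cast<int *>(data_h), n * sizeof(int));
    }
}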
If it is a runtime issue, you would need to provide an example.
This example shows roughly what is needed to communicate data between a running kernel and host code:
Note that this can be extraordinarily difficult if your GPU is running in WDDM mode.
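(The example referred to above is not reproduced here; the following is only a rough sketch of that general pattern, using mapped pinned memory and a volatile flag. All names are made up, and it assumes a 64-bit/UVA system where the host pointer can be passed to the kernel directly; otherwise cudaHostGetDevicePointer() is needed.)

#include <cstdio>
#include <cuda_runtime.h>

// Kernel spins on a flag in pinned host memory; volatile keeps the compiler
// from caching the flag in a register across loop iterations.
__global__ void waitForHost(volatile int *progress)
{
    while (*progress == 0) { /* busy-wait on the host-written flag */ }
    printf("kernel saw progress = %d\n", *progress);
}

int main()
{
    int *progress;
    // pinned, mapped allocation: the same buffer is visible to host and device
    cudaHostAlloc((void **)&progress, sizeof(int), cudaHostAllocMapped);
    *progress = 0;

    waitForHost<<<1, 1>>>(progress);   // kernel starts polling
    *progress = 1;                     // host signals the running kernel
    cudaDeviceSynchronize();           // kernel exits once it sees the store
    cudaFreeHost(progress);
    return 0;
}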
I am running into a compile error. If I recast the pointer, memcpy is unaware that the CPU might have changed the contents.
This example works for a single int, volatile int * progress. What if progress were an array? One option is to modify the contents of progress in a loop, but this is slow; each iteration of the loop takes ~1us, which is the round-trip latency of the PCIe bridge and hence expected.
Is there a way to modify the contents of an array in pinned memory so that it is visible to the CPU in ~1us?
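(If the point here is the kernel making a whole array of results visible to the CPU, one common pattern is to write the payload first and publish it with a single counter, ordered by __threadfence_system(). The sketch below uses made-up names; out_h and ready_h are assumed to point into pinned, mapped host memory. Device writes to mapped host memory are generally posted, so unlike reads they do not each pay a full PCIe round trip.)

#include <cstdint>

// Hypothetical device-side producer: fill the pinned result buffer, then
// bump one counter so the host only has to watch a single word.
__device__ void publishResults(volatile float *out_h,       // pinned host buffer
                               volatile uint64_t *ready_h,  // pinned host counter
                               const float *vals, int n)
{
    for (int i = 0; i < n; ++i)
        out_h[i] = vals[i];        // payload first

    __threadfence_system();        // order the payload writes before the flag, as seen by the host

    *ready_h = *ready_h + 1;       // publish: the host polls this one word
}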
Sorry, most of your statements don’t make any sense to me.
If you want to provide a short complete code, I’ll take a look as time permits. Otherwise, perhaps someone else will be able to help.
// note: struct rid is not defined in this post; it has (at least) an index member ri and a data member d
__global__
void addPersist(int n, volatile rid * rid_h, volatile float * host_p, volatile uint64_t * host_w,
                volatile uint64_t * host_r, volatile uint64_t * kill, float * res) {
    float y = 0.0f;
    // only used by the main thread (threadIdx.x == 0)
    static uint64_t readIndex;
    static rid rid_d;
    static volatile int doWork;
    static volatile float z;
    doWork = 0;
    __syncthreads();
    while (doWork >= 0) {
        if (threadIdx.x == 0) {
            //memcpy(&rid_d, rid_h, sizeof(rid));  // won't compile: rid_h is volatile-qualified
            // here is where I am copying rid_h (in pinned memory) to rid_d, one member at a time
            rid_d.ri = rid_h->ri;
            rid_d.d  = rid_h->d;
            // if rid_h->ri has been incremented by the host, read rid_d.d and do work
            if (rid_d.ri > readIndex) {
                readIndex = rid_d.ri;   // remember the index already consumed, so each item is processed once
                z = rid_d.d;
                doWork = 1;
            }
        }
        //__threadfence();
        __syncthreads();
        if (doWork == 1) {
            // the work done here is just a placeholder
            y = y + tanh(z);
        }
        __syncthreads();
        if (threadIdx.x == 0) {
            if (doWork) {
                *host_r = *host_r + 1;  // tell the host this item has been processed
                doWork = 0;
            } else {
                // if kill is set to 1 by the host, stop doing work
                if (*kill == 1)
                    doWork = -1;
            }
        }
        __syncthreads();
    }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    res[i] = y;
}
The code above is an example of a persistent kernel. Once initialized, it busy polls rid_h (modified by the host) for work to do.
It is killed if *kill is set to 1 by the host.
The commented-out memcpy line won't work, since rid_h has to be declared volatile for the device thread to be aware that the host may modify its contents, and memcpy won't accept a volatile-qualified source pointer.
Instead, I have to copy the contents of rid_h into rid_d in two steps, since struct rid has 2 members. These 2 lines take ~1us each. Is there a way to copy the contents of rid_h to rid_d faster?
Note: if struct rid had 10 members, it would take ~10us to copy rid_h to rid_d.
The reason why I’m doing this is to get around the overhead of launching a kernel.
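(A sketch of one way this might be collapsed into a single transaction. The definition of struct rid is not shown above, so the 8-byte layout below is an assumption; the idea is simply that a struct that fits in one naturally aligned 64-bit word can be fetched with one volatile load instead of one ~1us round trip per member.)

#include <cstdint>
#include <cstring>

// Assumed layout; the real struct rid is not shown in this post.
struct rid { uint32_t ri; float d; };   // 8 bytes total

__device__ void read_rid(const volatile rid *rid_h, rid *rid_d)
{
    // one volatile 64-bit load fetches both members in a single PCIe
    // transaction; requires rid_h to be 8-byte aligned (cudaHostAlloc
    // allocations are) and relies on the usual type-punning caveats
    unsigned long long w =
        *reinterpret_cast<const volatile unsigned long long *>(rid_h);
    memcpy(rid_d, &w, sizeof(rid));     // local unpack, no further PCIe traffic
}

If the real struct is 16 bytes, the same idea needs a 128-bit load (for example via inline PTX), but either way the copy cost stops scaling with the number of members.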
Just to be clear, I asked for a complete code. If that is not clear, I’ll be happy to further explain what it is I was asking for. Again; do as you wish. Perhaps someone else will jump in based on what you have posted.