Hi,
I found that the memcpy from the CUDA buffer to the screen buffer takes too much time. How can I avoid the memcpy step, or make the memcpy run at the full DRAM bandwidth?
unsigned char *screen_buf = (unsigned char *)mmap(NULL, screenlen, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
cudaMallocManaged(&cuda_buf, size, cudaMemAttachGlobal);
…
for (i = 0; i < height; i++) {
    memcpy(screen_buf + i*width*4, cuda_buf + i*width*4, width*4);
}
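
To put a number on "too much time", here is a minimal sketch of how the copy rate could be measured, assuming tightly packed rows of 4 bytes per pixel as in the loop above. The helper names copy_rows and copy_rate are made up for illustration and are not part of my real program; ordinary host buffers can stand in for screen_buf and cuda_buf when trying it out.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Copy `height` rows of `width` 4-byte pixels, one memcpy per row,
   mirroring the loop above. Both buffers are assumed tightly packed
   (row stride == width * 4); that is an assumption, not a given. */
static void copy_rows(unsigned char *dst, const unsigned char *src,
                      size_t width, size_t height)
{
    size_t row_bytes = width * 4;
    for (size_t i = 0; i < height; i++)
        memcpy(dst + i * row_bytes, src + i * row_bytes, row_bytes);
}

/* Time one full-frame copy and return the rate in bytes per second.
   Returns 0.0 if the elapsed time is too small for the clock to resolve. */
static double copy_rate(unsigned char *dst, const unsigned char *src,
                        size_t width, size_t height)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    copy_rows(dst, src, width, height);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return sec > 0.0 ? (double)(width * height * 4) / sec : 0.0;
}
```

Calling copy_rate(screen_buf, cuda_buf, width, height) and comparing the result with the DRAM bandwidth of the board would show how far the copy is from the limit. A single frame is a noisy sample, so averaging over several frames is probably more reliable.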