How can I write CUDA data directly into the framebuffer? memcpy takes too much time

I found that the memcpy from the CUDA buffer to the screen buffer takes too much time. How can I avoid the memcpy step, or make the memcpy run at 100% of the DRAM bandwidth?

unsigned char *screen_buf = (unsigned char *)mmap(NULL, screenlen, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
cudaMallocManaged (&cuda_buf, size, cudaMemAttachGlobal);

for (i=0; i<height; i++) {
4 ,cuda_buf+i

Check sections 5.13ff of the CUDA Runtime API.

Use the CUDA/OpenGL interop.

You will still need to do a final device-side write into the “interop buffer” that is mapped to an OpenGL texture (depending on the method).

Mapping the texture to a surface and writing to it directly from your CUDA kernel is likely one of the more efficient methods AFAIK.
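To make the surface-write idea concrete, here is a minimal sketch of a kernel writing pixels through a surface object. It is not a full interop example: to stay self-contained it allocates a plain cudaArray with cudaMallocArray instead of mapping an OpenGL texture. In the interop case I would expect the array to come from cudaGraphicsGLRegisterImage (with cudaGraphicsRegisterFlagsSurfaceLoadStore), cudaGraphicsMapResources and cudaGraphicsSubResourceGetMappedArray, with the kernel side unchanged. The names fill_surface and run_surface_demo, the uchar4 RGBA pixel format and the test pattern are my own assumptions, not taken from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one RGBA pixel straight into the surface.
__global__ void fill_surface(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 px = make_uchar4(x & 0xff, y & 0xff, 0, 255);
    // Note: the x coordinate of surf2Dwrite is given in bytes, not in pixels.
    surf2Dwrite(px, surf, x * (int)sizeof(uchar4), y);
}

// Returns the number of mismatching pixels, or -1 on a CUDA error.
int run_surface_demo(int width, int height)
{
    // Stand-in for the mapped OpenGL texture (assumption: RGBA8 format).
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar4>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &desc, width, height,
                        cudaArraySurfaceLoadStore) != cudaSuccess)
        return -1;

    // Wrap the array in a surface object so the kernel can write to it.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    if (cudaCreateSurfaceObject(&surf, &res) != cudaSuccess) {
        cudaFreeArray(array);
        return -1;
    }

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill_surface<<<grid, block>>>(surf, width, height);
    cudaError_t err = cudaDeviceSynchronize();

    // Read back only to check the result; a display path would not need this copy.
    std::vector<uchar4> host((size_t)width * height);
    if (err == cudaSuccess)
        err = cudaMemcpy2DFromArray(host.data(), width * sizeof(uchar4), array,
                                    0, 0, width * sizeof(uchar4), height,
                                    cudaMemcpyDeviceToHost);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(array);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            uchar4 p = host[(size_t)y * width + x];
            if (p.x != (x & 0xff) || p.y != (y & 0xff) || p.z != 0 || p.w != 255)
                ++mismatches;
        }
    return mismatches;
}

int main()
{
    int mismatches = run_surface_demo(640, 480);
    printf("mismatching pixels: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The point of this pattern is that the kernel produces the pixels in their final location, so there is no separate per-row copy from a CUDA buffer into a screen buffer. Whether it is actually faster on a given platform is something you would have to measure.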