Is it possible for host code to write its results directly into cudaMalloc’ed GPU memory?
At the moment I’m writing the results into CPU memory and doing a cudaMemcpy afterwards…
I wanted to see whether it’s any faster to write directly into video memory instead of copying a whole matrix that contains more data than I actually need…
unsigned char* m_videoYGPU;
CUDA_SAFE_CALL( cudaMalloc( (void**) &m_videoYGPU, mem_size));
for ( int i = 0; i < g_height; i++){
    m_videoYGPU[g_width*i] = m_videoY[g_width*i];
}
As you can see, I only need certain positions in the array and not the whole array itself…
The code presented above gives an “Access violation writing location …” error.
Then cudaMemcpy just the values you need in a “packed” array that doesn’t have any unneeded elements, and run a quick little kernel to unpack it on the device. If your unpacking kernel has all reads/writes coalesced (read through tex1Dfetch if you can’t coalesce them), then you should see around 70 GiB/s in the unpacking.
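A rough sketch of that pack / copy / unpack idea for the loop in the question, where only the first element of each row is needed. The names unpackFirstColumn and uploadFirstColumn are made up for illustration, and the code assumes d_video already points to a cudaMalloc’ed buffer of at least width*height bytes. Note that in this particular layout the reads from the packed array are contiguous but the writes are strided by width, so the writes are not coalesced and the unpacking will probably not reach the bandwidth quoted above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Unpack kernel: thread i copies packed[i] to the first element of row i.
// Reads from `packed` are contiguous; writes to `video` are strided by `width`.
__global__ void unpackFirstColumn(const unsigned char* packed,
                                  unsigned char* video,
                                  int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < height) {
        video[(size_t)width * i] = packed[i];
    }
}

// Hypothetical helper (not from the original post): uploads only the first
// element of each row of hostVideo into the device buffer d_video.
cudaError_t uploadFirstColumn(const unsigned char* hostVideo,
                              unsigned char* d_video,
                              int width, int height)
{
    if (width <= 0 || height <= 0) return cudaSuccess;  // nothing to copy

    // 1. Pack the needed values into a dense host array.
    std::vector<unsigned char> packed(height);
    for (int i = 0; i < height; ++i) {
        packed[i] = hostVideo[(size_t)width * i];
    }

    // 2. One small host-to-device copy of the packed values.
    unsigned char* d_packed = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_packed, packed.size());
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_packed, packed.data(), packed.size(),
                     cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // 3. Scatter the packed values to their final positions on the GPU.
        const int threads = 256;
        const int blocks = (height + threads - 1) / threads;
        unpackFirstColumn<<<blocks, threads>>>(d_packed, d_video, width, height);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }

    cudaFree(d_packed);
    return err;
}
```

With the variables from the question, the call would look something like uploadFirstColumn(m_videoY, m_videoYGPU, g_width, g_height). In real code you would probably allocate the packed device buffer once and reuse it, rather than calling cudaMalloc/cudaFree on every upload.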
Thank you very much for your replies. I’ll have to redesign some things so that some extra code is executed on the GPU side, which would make these memcopies obsolete.