I’m wondering how cudaMemcpy3D behaves when it is asked to copy a volume of size W1xH1xD1 from the host (source volume size WxHxD, with W>W1, H>H1, D>D1) to the device (destination volume size W1xH1xD1), starting at a source offset of (Wo, Ho, Do) such that part of the volume to be copied lies outside the source.
Will cudaMemcpy3D detect this and copy only the relevant portion?
Will the part of the destination volume outside the copied portion be set to zero or left unchanged?
Or is it the developer's responsibility to make sure that the subvolume to be copied lies completely inside the source volume?
cudaMemcpy3D does not warn or complain in such a case, and the values that lie inside the source volume are copied correctly.
The values outside the volume are not set to anything in particular; they are presumably read from whatever memory lies beyond the bounds, which could be anything.
The question now is: is it safe to do such copies (in terms of system stability, etc.)?
I fell into this trap because I need to copy from a large volume using a subvolume size that will have to change later (to reduce the application's memory requirements).
It’s the programmer's responsibility to avoid out-of-bounds accesses, just like it is in C. The only time you get bounds checking in CUDA is when using texture fetch functions.
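Since the copy itself does no bounds checking, one option is to clamp the requested subvolume to the source bounds before filling in the copy parameters. The following is only a sketch: `Box3` and `clampToVolume` are hypothetical names for illustration, not part of CUDA, and all sizes are in elements. If no CUDA array takes part in the copy, the width and x offset passed to cudaMemcpy3D are expressed in bytes, so the clamped values would need to be multiplied by the element size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper type for illustration; not part of the CUDA API.
struct Box3 {
    std::size_t x, y, z;  // offset of the box origin, in elements
    std::size_t w, h, d;  // box size, in elements
};

// Clamp a requested subvolume to a source volume of size W x H x D.
// An axis gets size zero when the request starts at or beyond the
// source boundary on that axis.
inline Box3 clampToVolume(const Box3& req,
                          std::size_t W, std::size_t H, std::size_t D) {
    assert(W > 0 && H > 0 && D > 0);
    Box3 out;
    out.x = std::min(req.x, W);
    out.y = std::min(req.y, H);
    out.z = std::min(req.z, D);
    out.w = std::min(req.w, W - out.x);
    out.h = std::min(req.h, H - out.y);
    out.d = std::min(req.d, D - out.z);
    return out;
}
```

The part of the destination that the clamped copy no longer covers is left untouched, so it has to be cleared separately if zeros are expected there.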
Still on the subject of cudaMemcpy3D: I wish to copy memory from device to host. The device memory is allocated with cudaMalloc3D, so it might be padded for alignment, but the host memory is allocated as plain 1D data to keep further processing simple.
The copy fails as soon as the device memory is padded… Is there a solution?
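To make the mismatch concrete, here is a host-only sketch of what a copy from a padded layout to a packed one has to do: advance the source by the padded row stride (the pitch) and the destination by the unpadded row width. `copyPitchedToPacked` is a hypothetical helper for illustration, not a CUDA function, and it works on ordinary host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a CUDA API): copy a padded (pitched) volume into
// a tightly packed 1D array, one row at a time.
// srcPitch is the padded row stride in bytes; width/height/depth are in elements.
template <typename T>
void copyPitchedToPacked(const void* src, std::size_t srcPitch, T* dst,
                         std::size_t width, std::size_t height,
                         std::size_t depth) {
    const std::size_t rowBytes = width * sizeof(T);
    assert(srcPitch >= rowBytes);  // a row can be padded, never truncated
    const char* bytes = static_cast<const char*>(src);
    for (std::size_t z = 0; z < depth; ++z) {
        for (std::size_t y = 0; y < height; ++y) {
            const std::size_t row = z * height + y;  // linear row index
            std::memcpy(dst + row * width, bytes + row * srcPitch, rowBytes);
        }
    }
}
```

A flat copy of width*height*depth elements would instead pick up the padding bytes and shift every row after the first.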
Can you paste your copying code? The padding is handled by CUDA, so that shouldn’t be much of a big deal. Just make sure that you create a cudaPitchedPointer from the host memory pointer like this:
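As a rough sketch of what such a descriptor holds for a tightly packed host array: the pitch is simply the row width in bytes, because there is no padding. `PitchedView` below is a hypothetical stand-in that mirrors the four fields of cudaPitchedPointer (ptr, pitch, xsize, ysize) so the example builds without the CUDA headers; in real code the same four values would be passed to make_cudaPitchedPointer(ptr, pitch, xsize, ysize). The units used for xsize here (elements) are an assumption worth checking against the CUDA documentation for your kind of copy.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in mirroring the fields of CUDA's cudaPitchedPointer,
// so that this sketch compiles without the CUDA headers.
struct PitchedView {
    void*       ptr;    // start of the allocation
    std::size_t pitch;  // bytes from one row to the next, padding included
    std::size_t xsize;  // logical row width (assumed here: in elements)
    std::size_t ysize;  // number of rows per slice
};

// Describe a tightly packed host array with rows of `width` elements and
// `height` rows per slice. With no padding, pitch == width * sizeof(T).
template <typename T>
PitchedView describePackedHost(T* host, std::size_t width, std::size_t height) {
    assert(host != nullptr);
    return PitchedView{host, width * sizeof(T), width, height};
}
```

The device side keeps the pitched pointer returned by cudaMalloc3D, so the two descriptors carry different pitch values and the copy can account for the padding.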
I checked the memory copy and it actually seems to work properly. The problem comes from the addressing of the elements in the allocated memory (see post: http://forums.nvidia.com/index.php?showtopic=87178&hl=). I still haven’t found a solution, and using standard 1D allocation doubles the computation time…
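For reference, addressing an element in pitched 3D memory usually goes through byte offsets: a slice is pitch * height bytes, a row is pitch bytes, and only the x index is scaled by the element size. The sketch below shows that arithmetic on plain host memory; `pitchedElement` is a hypothetical helper name, and the same expression would be used inside a kernel on the pointer and pitch returned by cudaMalloc3D.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: address of element (x, y, z) in a pitched 3D
// allocation. pitch is the row stride in bytes (padding included) and
// height is the number of rows per slice.
template <typename T>
T* pitchedElement(void* base, std::size_t pitch, std::size_t height,
                  std::size_t x, std::size_t y, std::size_t z) {
    assert(base != nullptr);
    char* slice = static_cast<char*>(base) + z * pitch * height;  // skip z slices
    char* row   = slice + y * pitch;                              // skip y rows
    return reinterpret_cast<T*>(row) + x;                         // x-th element
}
```

Indexing the same memory as base[(z * height + y) * width + x] is only correct when the pitch happens to equal width * sizeof(T).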
Here is the code for the memcpy anyway, which works OK (it’s always nice to find code samples…):