Using memcpyDtoH for copying data from CUDA to a rendering buffer

I downloaded the PhysX 5 repo (git@github.com:NVIDIA-Omniverse/PhysX.git) and ran the snippet “SnippetPBF” to get an idea of how to render particle positions computed by PhysX 5 (with CUDA) without first copying them from the GPU to the CPU. I came across the following code (I have deleted unnecessary lines):

PxParticleAndDiffuseBuffer* userBuffer = getParticleBuffer();
PxVec4* positions = userBuffer->getPositionInvMasses();
PxCudaContextManager* cudaContextManager = scene->getCudaContextManager();
cudaContextManager->acquireContext();
PxCudaContext* cudaContext = cudaContextManager->getCudaContext();
cudaContext->memcpyDtoH(sPosBuffer.map(), CUdeviceptr(positions), sizeof(PxVec4) * numParticles);
cudaContextManager->releaseContext();

If we take a look at:
void* SharedGLBuffer::map()
{
if (devicePointer)
return devicePointer;

#if USE_CUDA_INTEROP
size_t numBytes;
cudaContextManager->acquireContext();
devicePointer = mapCudaGraphicsResource(reinterpret_cast<CUgraphicsResource*>(&vbo_res), numBytes, 0);
cudaContextManager->releaseContext();
#else
glBindBuffer(GL_ARRAY_BUFFER, vbo);
devicePointer = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
glBindBuffer(GL_ARRAY_BUFFER, 0);
#endif
return devicePointer;
}

and then at:

void* mapCudaGraphicsResource(CUgraphicsResource* vbo_resource, size_t& numBytes, CUstream stream = 0)
{
CUresult result0 = cuGraphicsMapResources(1, vbo_resource, stream);
PX_UNUSED(result0);
void* dptr;
CUresult result1 = cuGraphicsResourceGetMappedPointer((CUdeviceptr*)&dptr, &numBytes,
*vbo_resource);
PX_UNUSED(result1);
return dptr;
}

It seems that sPosBuffer.map() returns a pointer to a device address and not a CPU address. If so, why is memcpyDtoH used rather than memcpyDtoD, and why does it work?

The general suggestion for PhysX questions is to post them as issues or discussions on GitHub. You might want to try that if you don’t get an answer here.

Thank you for the clarification. I have now posted the question on the PhysX GitHub.

Have you tried whether both options work?

CUDA can determine the location (host or device) from the address, so perhaps it gracefully falls back to the correct copy operation in that case? I would not count on it, though.

If that is confirmed, then it is a bug in the code.

I have replaced cudaContext->memcpyDtoH(sPosBuffer.map(), CUdeviceptr(positions), sizeof(PxVec4) * numParticles);

with:

cudaContext->memcpyDtoD(CUdeviceptr(sPosBuffer.map()), CUdeviceptr(positions), sizeof(PxVec4) * numParticles);

I can confirm that they both work.

I am very new to CUDA, but I assume that somewhere in the memcpyDtoH implementation there is a call to:
__host__ cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )
and that cudaMemcpyDeviceToHost is used as the kind? Does the address resolution you mentioned also hold for this function (even if the wrong kind is used)?

See:

cudaMemcpyDefault = 4

Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing

So if there is unified virtual addressing (which there usually is on current architectures), the transfer direction can be deduced from the pointers.

One could argue that the copy function should return an error if the location does not match the parameter.

I don’t know the details of this, but I also assumed there is an underlying copy call like that, and I believe it is here:

	mLastResult = cuMemcpyDtoH(dstHost, srcDevice, ByteCount);

It’s using the driver API, and AFAIK that API call has no notion of a “default” transfer direction.

I haven’t studied it carefully/beyond that.