I rejigged my code to reduce the impact of this workaround and as a result smacked into another problem, which I am now working around again.
What I wanted to do was:
void CalculateSomethingSpectacular( unsigned int outputvboid )
{
    // Map the VBO so the kernel can write straight into it...
    void *p = MapVBOToCUDA( outputvboid );
    CalculateKernel( p );
    // ...and hand it back to OpenGL when we are done.
    UnmapVBOFromCUDA( outputvboid );
}
However, if I did this, future cudaMalloc calls etc. would return cudaError 10201 …
So instead, I need to do the compute into a temporary buffer and then copy it into the VBO:
void CalculateSomethingSpectacularButSlower( unsigned int outputvboid, void *pTempDeviceBuffer, size_t sz )
{
    // Run the kernel into a plain CUDA allocation first...
    CalculateKernel( pTempDeviceBuffer );
    // ...then map the VBO and copy the result across on-device.
    void *p = MapVBOToCUDA( outputvboid );
    cudaMemcpy( p, pTempDeviceBuffer, sz, cudaMemcpyDeviceToDevice );
    UnmapVBOFromCUDA( outputvboid );
}
OK, this works, but it uses extra memory, requires temporary-buffer management, and adds an extra memcpy. Yes, the copy is blindingly fast since it stays on the card, but it is still unnecessary.
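For what it's worth, the Map/Unmap helpers above are just thin wrappers over the cudaGL interop API; a minimal sketch of how they might be written, assuming the VBO was registered once at creation time with cudaGLRegisterBufferObject:

```cpp
#include <cuda_gl_interop.h>

// One-time setup after creating the VBO (assumed done elsewhere):
//     cudaGLRegisterBufferObject( outputvboid );

void *MapVBOToCUDA( unsigned int outputvboid )
{
    void *p = NULL;
    // Makes the VBO's storage addressable from CUDA kernels.
    cudaGLMapBufferObject( &p, outputvboid );
    return p;
}

void UnmapVBOFromCUDA( unsigned int outputvboid )
{
    // Releases the mapping so OpenGL can use the buffer again.
    cudaGLUnmapBufferObject( outputvboid );
}
```

(Error checking omitted for brevity; each call returns a cudaError_t that should really be checked, given the failure mode described above.)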
(!) As a side note, I imagine that cudaGLMapBufferObject actually copies the VBO contents to CUDA memory as part of the mapping. It would be nice if the function took an optional parameter “in_bPreserveOriginalData” that could be set to “false” when we don’t really care about the original contents. A post-blur effect would be able to set it to true, so the copy is done, and a kernel that purely generates data, overwriting everything, would just set it to false.
Similarly, the unmap should have such a parameter, so that two things can be done: 1) the caller can signal that although the data was mapped, it never changed and hence does not need to be copied back, and 2) if something failed, there is no point copying bogus data back.
Or is this latter optimization done automatically using some lower level mechanism we don’t see?
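To make the suggestion concrete, the flagged variants might look something like this — entirely hypothetical signatures, nothing like them exists in the current API:

```cpp
// Hypothetical extensions -- NOT part of the real CUDA API.
// bPreserveOriginalData == false: skip any GL->CUDA copy on map,
// because the kernel overwrites every byte anyway.
cudaError_t cudaGLMapBufferObjectEx( void **devPtr, GLuint bufObj,
                                     bool bPreserveOriginalData );

// bCopyBack == false: skip any CUDA->GL copy on unmap, e.g. when
// the mapped data never changed, or the computation failed and the
// buffer holds bogus results.
cudaError_t cudaGLUnmapBufferObjectEx( GLuint bufObj, bool bCopyBack );
```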
Thanks for any comments in advance,
Stewie