I want to replace the ‘cudaMallocPitch’ function with a custom routine (tailored for images) which internally employs a caching device allocator (like the one from the CUB library, cub::CachingDeviceAllocator). This is because the CUDA memory management functions are not the fastest …
In order to do that, I have to know how ‘cudaMallocPitch’ internally calculates the base address and the ‘pitch’, in bytes, between two consecutive image rows. Unfortunately, there is no official information on this, and no property of the ‘cudaDeviceProp’ object which I could query. So how can I find out?
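For reference, my working assumption is that the pitch is simply the row width in bytes rounded up to the next multiple of some device-specific alignment. This is only a guess at undocumented behavior; the sketch below (function names round_up and guess_pitch are my own) shows the assumed computation in plain host-side C++:

```cpp
#include <cstddef>

// Hypothetical sketch: one plausible way a pitched allocator could
// compute the pitch -- round the row width in bytes up to the next
// multiple of the device's pitch alignment. The real computation
// inside cudaMallocPitch() is undocumented and may differ.
static std::size_t round_up(std::size_t value, std::size_t alignment)
{
    return ((value + alignment - 1) / alignment) * alignment;
}

static std::size_t guess_pitch(std::size_t width_bytes,
                               std::size_t pitch_alignment)
{
    return round_up(width_bytes, pitch_alignment);
}
```

With a 512-byte alignment, a 1000-byte row would get a 1024-byte pitch under this assumption.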
(1) Regarding base address alignment:
The base address must be a multiple of ‘cudaDeviceProp::textureAlignment’, otherwise one cannot bind a texture to it. Are there any other requirements?
(2) Regarding pitch alignment:
I know that the pitch must be a multiple of ‘cudaDeviceProp::texturePitchAlignment’, otherwise one cannot bind a texture (or texture object) to it. According to http://stackoverflow.com/questions/12550927/pitch-alignment-for-2d-textures , the alignment currently seems to be 512 bytes. And (of course) it must be a multiple of the size (in bytes) of one pixel. Are there any other requirements?
(3) Regarding implications:
If I choose the alignment of the base address to be the same as ‘cudaDeviceProp::textureAlignment’ and the alignment of the pitch to be the same as ‘cudaDeviceProp::texturePitchAlignment’, will I get into trouble (crashes and/or major performance degradation of kernels working with such images)?
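To make strategy (3) concrete, here is a host-side sketch of what I have in mind (struct and function names are made up): compute the pitch and total size from the two queried alignment values, and leave the base address alignment to the sub-allocator. Whether these two alignments alone are sufficient is exactly my question.

```cpp
#include <cstddef>

// Hypothetical sketch of strategy (3): given the alignment values
// queried from cudaDeviceProp (texturePitchAlignment for rows,
// textureAlignment for the base pointer), plan a 2D image allocation.
struct PitchedExtent {
    std::size_t pitch;  // bytes between consecutive rows
    std::size_t bytes;  // total bytes to request from the sub-allocator
};

static PitchedExtent plan_pitched_alloc(std::size_t width_bytes,
                                        std::size_t height,
                                        std::size_t pitch_alignment)
{
    // Round each row up to a multiple of the pitch alignment.
    std::size_t pitch = (width_bytes + pitch_alignment - 1)
                        / pitch_alignment * pitch_alignment;
    // The base address alignment (textureAlignment) must be guaranteed
    // separately by whatever sub-allocator returns the pointer.
    return PitchedExtent{pitch, pitch * height};
}
```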
Internal details of memory allocators are not usually made publicly available, precisely because the intention is that they be treated as black boxes whose implementation details are subject to change. Documenting the internals (allowing them to be exploited by programmers) would defeat the purpose of the abstraction.
cudaMallocPitch() is no different in this regard, so I wouldn’t expect its implementation details to ever be documented. What you can do is reverse engineer the current behavior, which very likely differs by GPU architecture and presumably much less likely differs by CUDA version. It seems you are already successfully working along those lines, but of course the risk is that current behavior may change in the future. You may want to consider extracting relevant parameters dynamically at runtime, by performing “model allocations” with cudaMallocPitch(), and then doing all other allocations of the same size using your own allocator, with the same parameters as were chosen by cudaMallocPitch().
It is common that generic memory allocators aren’t very fast; entire companies justify their existence by offering faster replacements (I recall at least two in the CPU space). The first two rules of high-performance code (regardless of platform) in this regard are: (1) create new allocations as infrequently as possible, (2) re-use existing allocations if at all possible. Have all attempts in this regard been exhausted in your application, making you progress to rule (3): build your own custom sub-allocator?
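Rules (1) and (2) are also the core of what a caching sub-allocator does internally. The toy below (my own sketch, not CUB code) illustrates the basic idea with a size-bucketed free list; cub::CachingDeviceAllocator works on the same principle but with power-of-two bins and per-device state, and malloc/free here stand in for cudaMalloc/cudaFree:

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Toy caching allocator: freed blocks are kept in per-size buckets and
// handed back on the next request of the same size, so the underlying
// (slow) allocator is hit only on cache misses.
class CachingAllocator {
public:
    void* allocate(std::size_t bytes) {
        auto& bucket = free_blocks_[bytes];
        if (!bucket.empty()) {             // rule (2): re-use
            void* p = bucket.back();
            bucket.pop_back();
            return p;
        }
        return std::malloc(bytes);         // rule (1): allocate rarely
    }

    void deallocate(void* p, std::size_t bytes) {
        free_blocks_[bytes].push_back(p);  // keep for later re-use
    }

private:
    std::map<std::size_t, std::vector<void*>> free_blocks_;
};
```

A real implementation would additionally cap the cache size and release blocks back to the system under memory pressure.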
Well, creating new allocations as infrequently as possible and re-using existing allocations would mean that we would have to re-organize and modify our existing software framework significantly. The easiest thing would be to replace the underlying memory allocator for images, as this could be done ‘under the hood’, without significant changes in other places. Therefore, the only practical option for me is actually to replace the memory allocator.
Regarding the performance implications of my envisioned strategy (3) (using the alignment requirements specified for textures), I feel that any performance degradation will likely not be really significant on recent-generation (e.g. CC >= 3.0) GPUs, because the coalescing rules are more relaxed there. The posting http://stackoverflow.com/questions/14715343/is-cudamallocpitch-really-more-efficient-when-using-two-dimensional-arrays also points in this direction.
Kind of funny how refactoring is usually considered a good strategy – for other people’s code :-) But seriously: if performance matters to this code, being mindful of memory allocations would be a good long-term strategy, independent of any GPU acceleration.
If you find that specific kinds of memory allocations are particularly slow, you might want to consider filing a bug with NVIDIA.
As njuffa said already, if you’re looking for a specification of these items, I don’t think you’ll find it. You can arrive at experimental answers of course. Such answers may change depending on various factors. Those factors and specifications are not spelled out AFAIK, so the only option you have to determine what pitch will be returned for a specific allocation by cudaMallocPitch is to run the function call in the desired setting.
It’s understood that you have a use case for which this is non-optimal. If you desire changes in the functionality provided by CUDA or the CUDA documentation, you can file a bug at developer.nvidia.com - this is effectively an “RFE” or request for enhancement.