Hi all,
I am currently writing a class that calculates the occupancy of my kernels based on the resources they use. I am using CUDA 6.5 on a GTX 750 (CC = 5.0).
In my test cases I check whether the results match the occupancy values calculated with the "CUDA_Occupancy_Calculator.xls" spreadsheet.
In test cases that are not limited by shared memory the results are as expected, but as soon as the shared memory usage per block exceeds 32 KB the test fails.
At first I thought it had something to do with cudaDeviceSetCacheConfig(…) (which sets the global cache configuration; there is also an equivalent for configuring the cache of a specific kernel function: cudaFuncSetCacheConfig(…)).
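A minimal sketch of what I mean by that, assuming a placeholder kernel named myKernel:

#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel for illustration */ }

void setupCache() {
    // Device-wide cache preference for all kernels
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    // Per-kernel preference, overriding the device-wide setting for this kernel
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}

This is my test case: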
TEST(KernelResource, Occupancy_varThreads) {
    KernelResource resc = KernelResource();
    KernelResource::SetCacheConfig(resc, cudaOccCacheConfig::CACHE_PREFER_SHARED);
    resc.GridDim = dim3(100, 1, 1);
    resc.BlockDim = dim3(128, 1, 1);
    resc.FuncAttr.maxThreadsPerBlock = resc.BlockDim.x * resc.BlockDim.y * resc.BlockDim.z; // total threads per block
    resc.FuncAttr.numRegs = 63;
    resc.FuncAttr.sharedSizeBytes = 1024 * 42;
    // values taken from: "CUDA_Occupancy_Calculator.xls"
    float occ = KernelResource::GetOccupancy(resc);
    EXPECT_EQ((float)4 / 64 * 1.0f, occ); // <--- FAILS as actual is 0.0 !!!
...
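For reference, the expected value of 4/64 comes from the spreadsheet roughly as follows (assuming the CC 5.0 limits of the GTX 750: 64 KB of shared memory per multiprocessor, 2048 threads per multiprocessor and a warp size of 32): 42 KB of shared memory per block allows only one resident block per multiprocessor, a block of 128 threads is 4 warps, so 4 active warps out of a maximum of 64 give an occupancy of 4/64 = 6.25%.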
I found a similar question, but without a solution to this problem:
https://devtalk.nvidia.com/default/topic/487733/?comment=3498261#reply
My assumption is that the internal cudaDeviceProp structure is not updated when the user sets the cache configuration. The thing is that cudaOccMaxActiveBlocksPerMultiprocessor(…) reads directly from cudaDeviceProp.sharedMemPerBlock and sharedMemPerMultiprocessor.
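A minimal sketch of how this assumption could be checked, simply printing the fields the occupancy calculation reads before and after changing the cache configuration:

#include <cstdio>
#include <cuda_runtime.h>

void printSharedMemLimits(const char* label) {
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);
    printf("%s: sharedMemPerBlock = %zu, sharedMemPerMultiprocessor = %zu\n",
           label, props.sharedMemPerBlock, props.sharedMemPerMultiprocessor);
}

int main() {
    printSharedMemLimits("default");
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    printSharedMemLimits("prefer shared");
    return 0;
}

My GetOccupancy implementation currently looks like this: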
float KernelResource::GetOccupancy(KernelResource const& resc) {
    CUdev::CUDADevice& cuDev = CUdev::CUDADevice::GetInstance();
    cudaDeviceProp props = cuDev.GetDeviceProp();
    cudaOccDeviceProp occProp = props;
    uint32_t blockSize = resc.BlockDim.x * resc.BlockDim.y * resc.BlockDim.z;
    cudaOccResult res;
    cudaOccError occErr;
    occErr = cudaOccMaxActiveBlocksPerMultiprocessor(
        &res,                            // out
        &occProp,                        // in
        &resc.FuncAttr,                  // in
        &resc.DevState,                  // in
        blockSize,                       // in
        resc.FuncAttr.sharedSizeBytes);  // in
    if (occErr != CUDA_OCC_SUCCESS) return 0.0f;
    //
    // http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
    //
    float occupancy = (res.activeBlocksPerMultiprocessor * blockSize / props.warpSize) /
                      (float)(props.maxThreadsPerMultiProcessor / props.warpSize);
    return occupancy;
}
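For comparison, here is a minimal cross-check sketch against the runtime occupancy API from the Parallel Forall post linked above (cudaOccupancyMaxActiveBlocksPerMultiprocessor, available since CUDA 6.5); myKernel is only a placeholder, and the block size and shared memory match the test case above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() {
    extern __shared__ char smem[];  // dynamic shared memory, placeholder use
    smem[threadIdx.x] = 0;
}

void crossCheck() {
    int numBlocks = 0;
    // 128 threads per block and 42 KB of dynamic shared memory, as in the test above
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, 128, 1024 * 42);
    if (err == cudaSuccess)
        printf("active blocks per SM: %d\n", numBlocks);
}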
I hope someone has a workaround for this, and if it is a bug I would appreciate a fix ;)
KR,
Roland