[ON HOLD] Issue with cuda_occupancy and cudaDeviceSetCacheConfig(...)

Hi all,

I am currently writing a class for calculating the occupancy of my kernels depending on the resources used. I am using CUDA 6.5 on a GTX 750 (CC 5.0).

In my test cases I check whether the results match the occupancy values calculated with the “CUDA_Occupancy_Calculator.xls” spreadsheet.
In test cases where shared memory is not the limiting resource, the results are as expected, but as soon as the shared memory usage per block exceeds 32KB the test fails.

At first I thought it had something to do with cudaDeviceSetCacheConfig(…) (for the global cache configuration; there is also an equivalent for configuring the cache for a specific kernel function: cudaFuncSetCacheConfig(…)).

TEST(KernelResource, Occupancy_varThreads){  
  KernelResource resc = KernelResource();
  KernelResource::SetCacheConfig(resc, cudaOccCacheConfig::CACHE_PREFER_SHARED);

  resc.GridDim = dim3(100, 1, 1);
  resc.BlockDim = dim3(128, 1, 1);
  resc.FuncAttr.maxThreadsPerBlock = resc.BlockDim.x * resc.BlockDim.y * resc.BlockDim.z;
  resc.FuncAttr.numRegs = 63;
  resc.FuncAttr.sharedSizeBytes = 1024*42;

  // values taken from: "CUDA_Occupancy_Calculator.xls"
  float occ= KernelResource::GetOccupancy(resc);
  EXPECT_EQ( (float)4/64*1.0f, occ ); // <--- FAILS as actual is 0.0 !!! 

  ...

I found a similar question, but without a solution for that problem:
https://devtalk.nvidia.com/default/topic/487733/?comment=3498261#reply


My assumption is that the internal cudaDeviceProp structure is not updated when the user sets the cache configuration. The thing is that cudaOccMaxActiveBlocksPerMultiprocessor(…) reads directly from cudaDeviceProp.sharedMemPerBlock and sharedMemPerMultiprocessor.

float KernelResource::GetOccupancy(KernelResource const& resc) {
    CUdev::CUDADevice& cuDev = CUdev::CUDADevice::GetInstance();
    cudaDeviceProp props = cuDev.GetDeviceProp();
    cudaOccDeviceProp occProp = props;

    uint32_t blockSize = resc.BlockDim.x * resc.BlockDim.y * resc.BlockDim.z;
    cudaOccResult res;
    cudaOccError occErr;
    occErr = cudaOccMaxActiveBlocksPerMultiprocessor(
                &res,                             // out
                &occProp,                         // in
                &resc.FuncAttr,                   // in
                &resc.DevState,                   // in
                blockSize,                        // in
                resc.FuncAttr.sharedSizeBytes);   // in

    if (occErr != CUDA_OCC_SUCCESS) return 0.0f;
    // 
    // http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
    // 
    float occupancy = (res.activeBlocksPerMultiprocessor * blockSize / props.warpSize) /
                      (float)(props.maxThreadsPerMultiProcessor / props.warpSize);

    return occupancy;
  }

Hope someone has a workaround for that, and if it is a bug, I'd appreciate a fix ;)

KR,
Roland

Hi,

here is the test output for 42KB shared memory:

[ RUN      ] KernelResource.Occupancy_shm42KB
..\..\MetricMonitor\test\test_KernelResource.cpp(64): error: Value of: occ
  Actual: 0
Expected: (float)4/64*1.0f
Which is: 0.0625
<---------- Threads/Block = 128

..\..\MetricMonitor\test\test_KernelResource.cpp(69): error: Value of: occ
  Actual: 0
Expected: (float)8/64*1.0f
Which is: 0.125
<---------- Threads/Block = 256

..\..\MetricMonitor\test\test_KernelResource.cpp(75): error: Value of: occ
  Actual: 0
Expected: (float)13/64*1.0f
Which is: 0.203125
<---------- Threads/Block = 400

[  FAILED  ] KernelResource.Occupancy_shm42KB (5 ms)

and for 22KB shared memory (the actual value is half of the expected one^^):

[ RUN      ] KernelResource.Occupancy_shm22KB
..\..\MetricMonitor\test\test_KernelResource.cpp(92): error: Value of: occ
  Actual: 0.0625
Expected: (float)8/64*1.0f
Which is: 0.125
<---------- Threads/Block = 128

..\..\MetricMonitor\test\test_KernelResource.cpp(97): error: Value of: occ
  Actual: 0.125
Expected: (float)16/64*1.0f
Which is: 0.25
<---------- Threads/Block = 256

..\..\MetricMonitor\test\test_KernelResource.cpp(103): error: Value of: occ
  Actual: 0.1875
Expected: (float)26/64*1.0f
Which is: 0.40625
<---------- Threads/Block = 400

I can’t figure out what you are doing.

  1. You haven’t provided a complete code that someone else could test.

  2. Also, I don’t recognize this:

occErr = cudaOccMaxActiveBlocksPerMultiprocessor(

There is no documented function in the CUDA runtime API with that name, as far as I know:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html#group__CUDART__OCCUPANCY

There is a function called cudaOccupancyMaxActiveBlocksPerMultiprocessor, but it has a different name from what you are using and a different number of function parameters. (Yes, there is a cudaOccMax… in cuda_occupancy.h. Why are you using that? Its behavior is unspecified and may change from CUDA version to CUDA version, since it’s not part of the documented API.)

  3. The question that you linked “without solution” seems to be adequately explained. I’m not sure what you think the issue is there.

  4. I took the code demonstrating the occupancy API from the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#occupancy-calculator

and extended the provided kernel to use dynamically allocated shared memory of adjustable size. With this modification, the sample code seemed to give sensible output (the occupancy varied as I varied the shared memory size), and did not give zero output when I specified a shared memory size of 42KB per block.

If you decide to provide a complete code, please don’t provide a dozen functions embedded in a class. Develop your understanding, or demonstrate the issue, with an approach approximately as simple as what is shown in the programming guide, which I have already linked.

Note that the shared memory passed to the cuda occupancy API is for dynamically allocated shared memory only (since that is not discoverable by the compiler). If your kernel uses statically allocated shared memory, you do not pass the size of that shared memory in this parameter field to cudaOccupancyMaxActiveBlocksPerMultiprocessor.

Thx txbob for your fast reply.

Sorry for my imprecise description. The purpose of the whole thing is to have a way to get some metrics of the current kernel configuration. In the end there is a Visitor (I am using the Visitor design pattern) that visits all kernel functions and collects metrics such as host/device usage, min/max/avg execution time, grid/block dimensions, dynamic shared memory usage, texture usage, etc. The thing is that I am dealing with highly configurable algorithms and I want to show the user the influence of changing parameters. (It is similar to what the profiler does, but the metrics are only collected on user demand at runtime -> they can be turned on/off.)

Sorry for that, but I am not allowed to do so.

Thank you for that hint :) and you are right, I am currently using the “cuda_occupancy.h” header. I thought it was the proper one to use. (As I already mentioned, the CUDA Pro Tip article http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/ uses the cuda_occupancy.h module.)

Now I will test the occupancy calculator from the CUDA runtime API and hopefully the results will be correct then.

Long story short, the thread says that cudaDeviceSetCacheConfig(…) has no impact on the cudaDeviceProperties. But I cannot use the nvvp profiler to get the current ratio between L1 cache and shared memory to feed my occupancy calculator, so the nvvp profiler is not handy and not a solution for me, as you can imagine.

Sounds promising to me :D and next time I'll try harder to give a better insight into my code; maybe it would be best to extract the most important parts into a single source file, which can then be uploaded…

Thank you so much for your effort!!!
Roland

Hi again,

I have now adapted my code to the new functions and the results are correct now :), but there are some things that are not so comfortable:

  1. The function pointer to the kernel is required.
  2. Many of my kernel functions are templates, which causes a compile error with msvc v100 when the specialized function is cast to a (void*).
  3. I have to write a getter method for each kernel function to get the device pointer from the *.cu file into a *.cpp file, for example into the test sources.
  4. I have tried setting cudaDeviceSetCacheConfig(...) to all possible values, but it has no influence on the occupancy...

(1) Why is the pointer needed? For estimating the registers used per thread or for the static shared memory amount? Otherwise I would have estimated the register usage per thread on my own (with the ptxas info).

(2 + 3) my quick and dirty workaround:

__host__ void const* GPUcode::census_kernel_GetFpt() {
  // return (void*) censusTransform_new<unsigned short>;  <--- fails to compile
  typedef void (*my_ftp1)(unsigned int*, size_t, unsigned short*,
                          size_t, int, int, int, int, int, int, int, int);
  my_ftp1 ftp = censusTransform_new<unsigned short>;
  return (void*) ftp;
}

But this is only valid for one of many possible specializations^^

(4) I can live with that limitation (it definitely has something to do with the fact that the cudaDeviceProperties are not updated when the cache is configured…), but it may cause strange errors when the kernel call fails but the occupancy seems to be okay.

All in all I am not quite satisfied with that solution, especially with the required function pointer.

KR,
Roland

I wasn’t asking to see your code, and in general when people make this request on an internet help forum, that’s not what they are asking for. They are asking for something like this:

http://stackoverflow.com/help/mcve

or this:

http://sscce.org/

It’s such a common request (and so frequently misunderstood or objected to) that folks have written complete web pages to describe it. It’s unlikely that you’re “not allowed” to provide it. But it does take effort to create one. I’ve already indicated how to do so, though. Start with the published example in the programming guide, and modify it until it demonstrates the issue you are concerned about.

You don’t need an occupancy calculation to determine how to set cache config so that the kernel call won’t fail. And the occupancy API was not designed to serve this purpose.

And yes, the cache config doesn’t affect the properties returned by cudaGetDeviceProperties. You don’t need the profiler to determine the current setting of the cache config; you can get it with the appropriate runtime API function, cudaDeviceGetCacheConfig:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1gd9bf5eae6d464de05aa3840df9f5deeb

You can also set cache config individually for kernels, which should obviate the need for any of this, since you can take into account a specific kernel’s usage of shared memory when you make the setting.

And yes, the function pointer is needed at least so that the number of registers per thread can be factored into the calculation.

I had not previously noticed the comments in the Pro Tip you linked:

“The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack.”

It does seem like this might be a better fit for your use case, and my previous comment about why you are using it may be off-base. If you want to provide a complete sample, as I have already indicated, I will take another look as time permits.

Hi txbob,

thank you for the links and for helping me. And I know that you were not interested in my code; the thing is that I didn’t think of writing a small sample, but now I know how to do better in future posts.

I think we are talking at cross purposes. And yes, I know the API function. The thing is that I want to be able to get the proper occupancy at runtime, printed to the console. It should serve as an online profiler which gives a result whenever the user desires it, and I hoped that the CUDA runtime API would support me with an easier solution. And I am not willing to write switch statements or other constructs to get the proper amount of configured shared memory (I am sure this logic already exists somewhere…), because that would mean maintaining this part of the code every time the device changes, and so on.

For now I have put this part of my project on hold, as it has a low priority. But when I have the final solution, I’ll post it as a sample.

Thank you very much.
KR.

Hi,

I encountered exactly the same issue. As soon as the shared memory usage exceeds half of the amount available per block, the computed occupancy differs from the Excel sheet. I tested on a Kepler architecture with 48KB shared memory per block.

I understood from the previous replies that the standard API functions do not have this issue. However, since the resource usage is the only input I have, the standalone functions in cuda_occupancy.h are still the best option.

Is cuda_occupancy.h going to be updated, or has it already become obsolete? Did anybody keep looking into this issue?

Any help is appreciated. Thanks in advance.
Bo