A problem related to RT_CALLABLE_PROGRAM?

Hey, I found some strange behavior related to RT_CALLABLE_PROGRAM.
If I call markDirty() on an instance while an RT_CALLABLE_PROGRAM is in use, the next launch call becomes drastically slower than it is without the markDirty() call.

I am using the OptiX 3.0 SDK, and you can easily reproduce this.

I used the whirligig sample source and just added the code below to pinhole_camera.cu.

rtCallableProgram(float3, check, (float3));

RT_CALLABLE_PROGRAM float3 check_problem(float3 r)
{
   return make_float3(0.0f, 1.0f, 0.0f);
}

Then I added the code below to whirligig.cpp to load that program. With this change, the sample's FPS drops to almost a tenth.

Program check_program = m_context->createProgramFromPTXFile(ptxpath("whirligig", "pinhole_camera.cu"), "check_problem");
m_context["check"]->set(check_program);

Does anyone know how to resolve this problem?

I assume you used markDirty on an acceleration structure. In that case, the next launch call rebuilds the acceleration structure before the actual ray tracing begins, which makes the FPS drop. I don’t think this has anything to do with RT_CALLABLE_PROGRAM.

Look at page 27 of the OptiX 3.0 Programming Guide. The “refine” or “refit” property might be what you need if you don’t want a full rebuild of the acceleration structure.

I know what you mean, which is why I called it “strange behavior”. I also agree with you that rebuilding the acceleration structure has nothing to do with RT_CALLABLE_PROGRAM.

But it has a great impact on FPS. The original whirligig runs at about 50 FPS in my test environment, but with just the code above added, it runs at only about 4~5 FPS.

I already tested the code above with the “refine” and “refit” properties enabled, and with other acceleration structure types. It improved only a little.

Additionally, if I don’t call markDirty() while using RT_CALLABLE_PROGRAM, it works normally.

Is there anything I missed?

Hi haruband,

This may be a bug; I’ve managed to replicate your problem. I believe it has something to do with the optimization routine within OptiX at compilation time. The callable function might be removed when it is not used, leading to some kind of unknown complication. If you call the RT_CALLABLE_PROGRAM function somewhere within any RT_PROGRAM, the FPS is not affected.

As you said, after I called the RT_CALLABLE_PROGRAM function within an RT_PROGRAM, the FPS is no longer affected.

I hope this problem gets fixed as soon as possible.

Thank you very much!

Dear all,
I ran into another problem related to RT_CALLABLE_PROGRAM.
I tried to repeat haruband’s experiment with callable programs using the whitted sample.

  1. I inserted the code below into pinhole_camera.cu, just before the line “RT_PROGRAM void pinhole_camera()”:
rtCallableProgram(float3, check, (float3));

RT_CALLABLE_PROGRAM float3 check_problem(float3 r)
{
   return make_float3(0.0f, 1.0f, 0.0f);
}
  2. I put the analogous lines into whitted.cpp (after line 106):
Program check_program = m_context->createProgramFromPTXFile(ptx_path, "check_problem");
m_context["check"]->set(check_program);
  3. Compile and run. The message is:
    OptiX Error: Parse error (Details: Function “_rtProgramCreateFromPTXFile” caught exception: D:\CUDA_OPTIX\OptiX SDK 3.5.1\SDK\build\lib\ptx/whitted_generated_pinhole_camera.cu.ptx: error: Cannot find function “check_problem” in PTX [4850050], [4850050])
  4. Could somebody help me?
  5. My environment: desktop, Win 8.1, VS 2010, CUDA 5.5, OptiX 3.5.1, GeForce GTX 560 Ti.
  6. After installing OptiX 3.5.1, all precompiled sample EXEs from the SDK work. I also recompiled and built all the SDK samples, and they all work.

Thanks,
Victor

Hi Victor, it’s likely that you’re compiling your code targeting sm_1.x devices.

In that case, and with the function you’re inserting, there’s a small caveat in optix_device.h:

/* This is used to declare programs that can be attached to variables and called from
 * within other RT_PROGRAMS.
 *
 * There are some limitations with PTX that is targetted at sm_1x devices.
 *
 * 1. Functions declared with RT_CALLABLE_PROGRAM will not be emitted in the PTX unless
 *    another function calls it.  This can be fixed by declaring a __global__ helper
 *    function that calls the desired function.
 *
 *    RT_CALLABLE_PROGRAM
 *    float3 simple_shade(float multiplier,  float3 input_color)
 *    {
 *      return multiplier * input_color;
 *    }
 *
 *    #if __CUDA_ARCH__ < 200
 *    __global__ void stub() {
 *       (void) simple_shade( 0, make_float3(0,0,0) );
 *    }
 *    #endif
 *
 * 2. You can't pass pointers to functions or use integers for pointers.  In the first
 *    case CUDA will force the inline of the proxy function removing the call altogether,
 *    and in the case of passing pointers as integers, CUDA will assume that any pointer
 *    that was cast from an integer will point to global memory and could cause errors
 *    when loading from that pointer.  If you need to pass pointers, you should target
 *    sm_20.
 */

To summarize: your function is being elided from the generated ptx if you’re compiling with sm_1.x.
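One way to confirm this before touching the build is to look at the generated .ptx file and see whether it declares the function at all. Below is a minimal sketch of such a check in plain C++. The helper name ptxDeclaresFunction is hypothetical and not part of OptiX. It assumes that the .func or .entry keyword and the symbol appear on the same line, and it uses a substring match so that C++-mangled symbols also count.

```cpp
#include <sstream>
#include <string>

// Hypothetical diagnostic helper (not an OptiX API): returns true if the PTX
// text contains a .func or .entry declaration line whose symbol contains
// `name`. A substring match is used so that mangled symbols containing the
// function name are accepted as well.
// Assumption: the declaration keyword and the symbol share one line.
bool ptxDeclaresFunction(const std::string& ptx, const std::string& name)
{
    std::istringstream in(ptx);
    std::string line;
    while (std::getline(in, line)) {
        const bool isDeclaration =
            line.find(".func") != std::string::npos ||
            line.find(".entry") != std::string::npos;
        if (isDeclaration && line.find(name) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

To use it, read whitted_generated_pinhole_camera.cu.ptx into a std::string and call ptxDeclaresFunction(ptx, "check_problem"). If it returns false, the function was most likely dropped by the compiler, which matches the “Cannot find function” error above.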

Since your GeForce GTX 560 Ti has compute capability 2.1 (https://developer.nvidia.com/cuda-gpus), there are two ways to solve this:

  • Add an intermediate target (e.g. -arch=compute_20) to your NVCC compilation command. The script that calls NVCC is wherever_you_cmake’d_the_OptiX_SDK\whitted\CMakeFiles\whitted.dir\whitted_generated_pinhole_camera.cu.ptx.cmake. Find the line
set(CUDA_NVCC_FLAGS --use_fast_math;--compiler-options;/D_USE_MATH_DEFINES ;; ) # list

and add that option to it:

set(CUDA_NVCC_FLAGS -arch=compute_20;--use_fast_math;--compiler-options;/D_USE_MATH_DEFINES ;; ) # list
  • Add a helper function as described above to deal with the sm_1.x issue:
rtCallableProgram(float3, check, (float3));

RT_CALLABLE_PROGRAM float3 check_problem(float3 r)
{
   return make_float3(0.0f, 1.0f, 0.0f);
}

#if __CUDA_ARCH__
     __global__ void stub() {
       (void) check_problem( make_float3(0,0,0) );
     }
#endif

Hope it helps. Let me know if something else comes up.

Dear marknv,
Great! It works!

While reading the Programming Guide I did see that compute capability 1.x needs special handling. I did not pay attention because I have a different GPU.

However, after my next edit of the source I rebuilt whitted, and VS2010 asked to reload the projects of all SDK samples. I looked at the same place and the old version of the line was back. So each reload returns it to the initial state, which means I have to check whitted\CMakeFiles\whitted.dir\whitted_generated_pinhole_camera.cu.ptx.cmake every time.

I found that the original CMakeLists.txt (in the SDK dir) sets only “set(flag "--use_fast_math")”. It would be nice if CMakeLists.txt also set “-arch=compute_20;”. In that case I would need to run CMake only once for each project.

There is probably a global NVCC setting as well; that would also be a solution for me.
Could you help me?

Next, I would be happy if future versions of the OptiX setup took higher compute capabilities into account one way or another, e.g. with several CMakeLists.txt files.

Thank you,
Victor

Dear marknv,
Sorry for the trouble. I tried to understand CMakeLists.txt and it seems I have solved my problem.
I updated line 168, i.e. from

if(USING_WINDOWS_CL)
  list(APPEND CUDA_NVCC_FLAGS --compiler-options /D_USE_MATH_DEFINES)
endif()

to

if(USING_WINDOWS_CL)
  list(APPEND CUDA_NVCC_FLAGS -arch=compute_20 --compiler-options /D_USE_MATH_DEFINES)
endif()

Nevertheless, I think the OptiX setup should not ignore higher compute capabilities. For example, it could give advice in CMakeLists.txt or INSTALL-WIN.txt.
Best wishes,
Victor

Glad to read that, good job!

I agree with your concerns and, by the way, sm_10 devices are being deprecated by CUDA 6, so hopefully we’ll be able to solve this gracefully before long.