optiXTutorial 11 - remove (free)GLUT

Hi Detlef,

thank you for your answer. I was able to solve the last issue.

Could you help me with another issue? I’m trying to store
vectors in float3 format in an OptiX input buffer. I’d like to save about 200 float3
values. However, it doesn’t work; I get an exception error in MS Visual Studio 2015.

The strange behaviour is that I can store up to about 105 vectors without running into an
exception, and I can even read those rtBuffer values back, and they are correct.

  1. Is there a limit to how many items one can store in an rtBuffer<float3, 1>?
  2. If yes, how can I extend this limit, apart from defining another rtBuffer?
  3. Or is there another way to tell OptiX that I’d like to store a certain number of items
    in the buffer?
  4. Is there a way to analyze how much GPU memory is left, in order to tell whether
    that could be an issue as well?

Kind Regards,
Robert

1.) In principle, consider the limit to be the amount of memory on your board.
Keep in mind that many other tasks in your OS use graphics memory as well, and I doubt you can pin a VRAM limit down to 105 of 200 float3 vectors. It is more likely that something else is wrong in your code.

You want to write the data on the host into the input buffer, right?
Because writing input buffers on device side is not allowed.
How you write data into an output or input_output buffer on device side is also your responsibility.
If you just provided the code you think is not working correctly I might be able to tell…

2.) Could it be that you created the buffer in OptiX with the incorrect size?

2.) + 3.) You can call setSize() on a buffer to change its size if you need to adjust it dynamically.
With respect to performance, don’t overdo it; there shouldn’t be a need to resize buffers dynamically very often.
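For illustration only (a minimal sketch, not compiled, with made-up buffer and variable names), filling a 1D float3 input buffer of 200 elements on the host looks roughly like this:

#include <optix_world.h>
#include <cstring>
#include <vector>

// Assumes a matching device-side declaration: rtBuffer<float3, 1> ray_directions;
void setRayDirections(optix::Context context, const std::vector<optix::float3>& data)
{
  // Create the input buffer with exactly as many elements as you want to store.
  optix::Buffer buffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, data.size());

  // Input buffers are written on the host: map, copy, unmap.
  optix::float3* dst = static_cast<optix::float3*>(buffer->map());
  memcpy(dst, data.data(), data.size() * sizeof(optix::float3));
  buffer->unmap();

  context["ray_directions"]->set(buffer);

  // If the element count changes later, buffer->setSize(newCount) resizes the buffer,
  // but avoid resizing frequently.
}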

4.) One tool to analyze the GPU and VRAM usage is available for workstation class boards:
[url]https://developer.nvidia.com/nvidia-system-management-interface[/url]
That allows you to dump various information about the board, and it also has command-line options to do that at regular intervals.
I only use Quadros and don’t know if that is installed on GeForce consumer boards.
Under Windows 10 I’m not aware of other tools which work reliably.

Hi Detlef,

thank you very much for your answers.

A general question popped up which I’d like to resolve.

  1. Is it possible, as in the SDK example ‘optiXPrimitiveIndexOffsets’,
    to draw a line that follows exactly one ray? I would like to visualize
    the ray (e.g. as a blue line) that hits the first obstacle, then the second, and so on.

Since a line is not a 3D object, is it possible to do this with sutil::displayBufferGlut()
as well?

Or do you have another idea?

Thank you very much in advance.

With kind regards
Robert

Yes, you could visualize rays with OpenGL line primitives, for example.
If you want to do that on top of the ray traced image, you would need to match the camera projection exactly. That obviously won’t really work for the primary rays while your eye position is at the camera origin.

Check this earlier response to that question: [url]https://devtalk.nvidia.com/default/topic/956807/?comment=4949165[/url]
Note that the “collision” example mentioned in there is from OptiX 3.9.1. It’s not present in the OptiX 4.x SDK at this time.
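As a rough sketch of the OpenGL part (made-up types and function, not taken from the collision example), drawing the ordered hit points of one ray as a blue polyline could look like this. The projection and modelview matrices must already match your ray tracing camera:

#include <GL/gl.h>
#include <vector>

struct HitPoint { float x, y, z; }; // world-space hit positions of one ray, in order

// Draws the path of a single ray on top of the current frame.
void drawRayPath(const std::vector<HitPoint>& hits)
{
  glColor3f(0.0f, 0.0f, 1.0f); // blue
  glBegin(GL_LINE_STRIP);
  for (size_t i = 0; i < hits.size(); ++i)
    glVertex3f(hits[i].x, hits[i].y, hits[i].z);
  glEnd();
}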

Hi Detlef,

thank you!

I encountered another issue and I’m looking for a solution.

I’m trying to create an instance of the Quaternion class in the pinhole.cu file.
Remark:
I changed the original code a little, but in my host code (*.cpp) the changed code
does exactly what it is intended to do.

Quaternion *ptr_Example = new Quaternion();

When executing the context->launch([…]) 2D function I get the following error:
OptiX Error: […] Undefined Symbol: malloc Cannot compile function due to unresolved symbols.

I’m sure that this line of code is the problem, because in the case of

Quaternion ptr_Example;

I don’t get any error.
Furthermore, I included all required C and C++ headers for the malloc and new/delete symbols.
I also searched the internet, and I think it should be possible to use the
new/delete operators in device kernels, shouldn’t it?
Therefore, I also put __host__ __device__ in front of each Quaternion method, including the constructors.

Now I don’t know how to solve this. Or is it that CUDA can handle these operators
but OptiX can’t? Can I use my own classes in device code at all?

Or is the problem that the NVIDIA GeForce GTX 745 is not in the list:

Does this GPU not have the required compute capability 2.0?

And if that is the case, is there any way to work around it besides buying a
different GPU?

With kind regards,
Robert

The new operator is not supported in OptiX device code. Please read over the CUDA docs on basic memory management and usage.

“Likewise, I searched in the internet and I think it shall be possible to use the new/delete operator in device kernels, shouldn’t it?”
You should really read the whole OptiX programming guide. That you cannot use dynamic memory allocations in device code inside OptiX is mentioned in the caveats chapter.
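As a hedged sketch of the pattern that does work (the Quaternion below is a minimal stand-in, not your class): create such objects on the stack inside the OptiX program instead of with new:

#include <optix_world.h>

// Minimal stand-in for a small class usable in both host and device code.
struct Quaternion
{
  float x, y, z, w;
  __host__ __device__ Quaternion() : x(0.0f), y(0.0f), z(0.0f), w(1.0f) {}
};

RT_PROGRAM void pinhole_camera()
{
  Quaternion q;                        // fine: lives on the stack / in registers
  // Quaternion* p = new Quaternion(); // not allowed: no malloc/new inside OptiX device code
}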

Hi Detlef,

thank you for your information!

Can you please help me with the following issue:

I defined 100,000 rays stored in a std::vector< float3 >. Then I created a loop
that looks, in principle, like the following:

for( uint64 i = 0; i < 20; i++ )
{
 rtBuffer <- Store the i-th 5000 Rays

 context->validate();
 context->launch([...]);

 std::vector< float3 > <- Copy Data From Device Buffer
 
}

My problem is that this loop runs until about 35,000 rays have been processed. After that the display driver,
and therefore the execution of my program, crashes.

Then I used the nvidia-smi tool you mentioned to rule out the possibility of
a memory buffer overflow. However, after the first iteration about 1.4 GByte out of 4 GByte of
memory are used, and this value remained constant over the following iterations. So this cannot be the problem.

Do you have any idea what could lead to this reproducible behaviour?
By the way, if I additionally try to store more than 10,000 rays in an rtBuffer, the OptiX launch
leads to an immediate crash of the display driver. I adapted the stack size by calling the OptiX
API, but this had no effect on the issue.

UPDATE: I get the following error message from OptiX:

“Unknown error (Details: Function “_rtContextLaunch2D” caught exception: Encountered a
CUDA error: driver().cuMemcpyDtoHAsync( dstHost, srcDevice, byteCount, hStream.get() ) returned (999) Unknown)”

With kind regards
Robert

Sorry, not enough information. It’s not possible to tell what goes wrong from six lines of pseudo code.

Performance tip: there shouldn’t be a need to call context->validate() at that place. It’s fine for debugging purposes, but it shouldn’t appear inside the hot loop of your rendering in release mode.
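A minimal sketch of what I mean (not compiled, names made up): validate once after the scene is set up, then launch as often as needed:

#include <optix_world.h>

void runAllChunks(optix::Context context, unsigned int numChunks, RTsize width, RTsize height)
{
  context->validate(); // one-time sanity check after scene setup

  for (unsigned int chunk = 0; chunk < numChunks; ++chunk)
  {
    // ... write this chunk's rays into the input buffer via map()/unmap() ...
    context->launch(0, width, height); // entry point 0
    // ... read this chunk's results from the output buffer ...
  }
}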

Hi Detlef,

I resolved the issue. It had to do with the timeout interval imposed by
the Microsoft Windows operating system.

Now I’m facing the following issue.

I have, let’s say, 20,000 rays. If I use the builder “NoAccel”, the following loop
inside my ray generation program pinhole_camera.cu works correctly.
Likewise, the results I get out of my calculations make complete sense.

const uint64_t Upper_Bound_For_Loop = 20000;

for( uint64_t loop_i = 0; loop_i < Upper_Bound_For_Loop; loop_i++ )
{

//Defining Ray Payload and much more

optix::makeRay ray([...]);

rtTrace([0..2]);

// Defining new Ray variables

}

If I use any other builder, like “Bvh”, the for loop stops after a few (<10)
iterations, even though the limit ‘Upper_Bound_For_Loop’ has not been reached yet.
This leads to the problem that I get correct results for the first, let’s say, 10 rays,
but no results at all for the following rays.

Thank you very much in advance.

  1. Can you tell me whether it is possible to store one million rays in an rtBuffer, or is that
    far more than a GPU can store or process?

Kind Regards
Robert

1.) You have a loop inside the ray generation program(?) which issues 20,000 rtTrace calls?
What is your launch dimension and size?

I wouldn’t be surprised if that runs into a timeout.
Please read this thread and all the OptiX forum links in it for how to partition the workload for some common use cases.
[url]https://devtalk.nvidia.com/default/topic/1004932/optix/timeout-with-50-lights-on-the-scene/[/url]

I’d say you should re-architect your data flow to be able to launch multiple times (maybe even 20,000 times), do the necessary subset of work in each launch, and accumulate the results properly, whatever it is you’re computing.
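As a hedged host-side sketch of that partitioning (not compiled; the "ray_offset" variable and the sizes are made up), each launch processes one slice of the total workload:

#include <optix_world.h>

// The device code is assumed to read the "ray_offset" context variable
// to know which slice of the ray data this launch is responsible for.
void traceInSlices(optix::Context context, unsigned int totalRays, unsigned int raysPerLaunch)
{
  for (unsigned int offset = 0; offset < totalRays; offset += raysPerLaunch)
  {
    context["ray_offset"]->setUint(offset);  // which slice this launch works on
    context->launch(0, raysPerLaunch, 1);    // one thread per ray in the slice
    // ... copy or accumulate this slice's results from the output buffer here ...
  }
}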

2.) Yes, of course it’s possible to store millions of rays inside a buffer. There is nothing special about that when rendering full HD, for example. You just need to consider what you’re actually doing with them. Since you already mentioned that you’re running into timeouts, that indicates your workload is simply too high for your system setup.

Hi Detlef,

thank you for your quick response again.

First of all, my launch dimension is 2D, as it was originally in the
primitiveIndexOffset project. I didn’t change this, assuming it does not have any effect
on what I would like to do.

I currently set this size to width = 1, height = 1.

Please note: I don’t use OptiX for calculating anything computer-graphics related.
In particular, I don’t use OptiX for rendering pictures or anything similar.

A: That is the reason why I think it does not matter which dimension and size I’m using.
Am I right in this assumption?

I mean, you are right that this runs into a timeout. But I already deactivated the TDR
of Microsoft Windows, so this is no longer an issue for me. From that point of view I really can
process 20,000 rays per launch. However, I do have a slightly different problem.

Regarding your answer 2.): I assumed as much; however, if I try to store more than 10,000 rays (in my current case) in an rtBuffer and then launch, OptiX reports an error without calculating anything.

  1. Based on your answer, I guess you would recommend, as described in the links you sent me, using, let’s say, 500 rays per launch call, right?

  2. So if you wanted to render a 1920 x 1080 HD picture, which comes to 2,073,600 pixels, may I ask how many rays you would use per launch? I guess it should be fewer than 500 rays in that case, because I noticed that a launch call takes more time to process data than storing more rays on the GPU and looping over each ray as I described in my last post.

  3. One thing I don’t understand 100% is the following. I guess you need a lot more rays to
    render that HD picture. At the moment I need about 3 seconds per 10,000 rays, which is what I can store in my ray generation program per launch call. I loop over an array which contains a package of 10,000 rays,
    hence the for loop in the ray generation program. Could you please tell me how to change that logic to be able to process a few million rays per second, as I saw in a PowerPoint presentation from a GPU Technology Conference?

  4. I guess I cannot call the rtTrace function one million times simultaneously, can I?

Kind Regards
Robert

A) Wrong. You have completely misunderstood parallel computing with OptiX.
If you use a launch size of 1*1, you effectively don’t do any parallel work on the GPU!
The launch size determines how many threads are executed on the chip. A current GPU has over 3000 cores running in parallel. You used one.
You should use a launch size well over 64K cells to saturate a modern GPU.

  1. Let’s say you use a launch size of 1000*1000. Then a single rtTrace call inside your ray generation program would already be executed 1 million times, each one handled by one of thousands of threads running in parallel, and you could write those 1 million results into different buffer cells indexed by the variable with the rtLaunchIndex semantic (see the sketch after this list).
    That’s what I mean by there being no problem tracing millions of rays.
    Maybe that explains why it is confusing to see a loop of length 20,000 over an rtTrace inside the ray generation program. With a launch size of 1 million, that would have been 20 billion rays in a single launch. Timeouts are to be expected.
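To make that concrete, here is a hedged device-side sketch (not compiled; buffer and variable names are made up) where each thread of the launch traces exactly one ray and writes only its own output cell:

#include <optix_world.h>

using namespace optix;

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
rtDeclareVariable(rtObject, top_object, , );

rtBuffer<float3, 2> ray_origins;     // input:  one origin per launch cell
rtBuffer<float3, 2> ray_directions;  // input:  one direction per launch cell
rtBuffer<float, 2>  results;         // output: one result per launch cell

struct PerRayData
{
  float value;
};

RT_PROGRAM void ray_generation()
{
  PerRayData prd;
  prd.value = 0.0f;

  // One rtTrace per thread; a 1000*1000 launch traces one million rays in parallel.
  Ray ray = make_Ray(ray_origins[launch_index], ray_directions[launch_index],
                     0, 0.0f, RT_DEFAULT_MAX);
  rtTrace(top_object, ray, prd);

  results[launch_index] = prd.value; // indexed by the launch index, no write conflicts
}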

Again, it would have been less confusing if you had posted actual working code instead of pseudo code.

With that gap in your understanding of parallel computing cleared up, you now need to think about how you can calculate your desired results with thousands of threads running in parallel, all doing the same thing, though possibly with different start values per launch index.

It would have been easier to come up with a recommendation for a working algorithm if you had explained what calculation you really need to solve in the end.

Hi Detlef,

ok, yes, you’re totally right, I didn’t know that. Now everything is working as it should!

Currently I’m facing an issue which should be easy to resolve as well.
First of all, I added some code below which is based on my code and from which the problem arises.

To state the problem up front: I can’t free the memory which is allocated by the
map() API, which leads to a rising amount of RAM being reserved until no memory
is available on my system any more.

void foo1
(
Context context,
std::vector<float> *pt_OUT_Target_floats,
std::string        *OptiX_rtBuffer_Identifier
)
{
    Buffer     Local_Buffer_Floats = context[*OptiX_rtBuffer_Identifier]->getBuffer();
    float      *DATA_floats        = reinterpret_cast<float*>( Local_Buffer_Floats->map() ); <- This line lead to an amount of 15.625 KByte to be allocated in the Heap Storage according to the size I defined beforehand

    float      *ITERATOR_POINTER   = DATA_floats;

    { //%/----- PSEUDO CODE----------------------------------------------------------------
      //%|
      //%| Read each element in the valid interval range of 'ITERATOR_POINTER' and write it
      //%| to an std::vector<float>
      
      pt_OUT_Target_floats <- ALL_ELEMENTS_OF(ITERATOR_POINTER);

    }

    Local_Buffer_Floats->unmap();

}

int main()
{

// 1. Create a Context
// 2. Create an Output Buffer with Size 1 000 000 * 1 * sizeof(float);

std::vector<float> Local_DATA_Container_float;

for( uint64_t i = 0; i <= 5; i++ )
{
    foo1( context, &Local_DATA_Container_float, (&(std::string)("DATA_of_OUT_rtBuffer")) );
}

Local_DATA_Container_float.clear();

}

In function foo1 I’m mapping the rtBuffer with a local pointer variable. So far so good.
Now, if I loop over foo1, 15,625 KBytes of memory (RAM) are allocated
IN EACH ITERATION. Using the unmap() API has no effect on this; in particular, the 15,625 KBytes
are not released again. As you told me, free/delete must not be used. Anyway,
I tried it and of course it didn’t have any effect. The std::vector container is cleared after
each call of foo1 and does not have a negative effect on the heap memory. Now I’m wondering
what I could do.

Do you have any Idea?

Thank you very much in advance.

Kind Regards
Robert

You misunderstood again what people have been explaining to you.
Dynamic memory allocations are only unsupported inside OptiX device GPU code, that is, inside the CUDA code you use to program your OptiX shaders.
What you do on the host CPU is entirely your choice, and of course you can allocate and free memory there as needed.

Sorry, I have no idea where your potential memory leak comes from when all you provide is incomplete pseudo code. Once and for all, stop doing that!

If I understand your pseudo code correctly, you want to read a million float values from an output buffer.
Maybe better try something simple like this (not compiled myself):

void readFloats(Context& context,
                std::string const& bufferName,
                std::vector<float>& destination,
                const size_t sizeInFloats)
{
  destination.resize(sizeInFloats); // Resize to the number of floats to be read.
  
  optix::Buffer const& buffer = context[bufferName]->getBuffer(); // Get the buffer with bufferName.
  
  const float* source = reinterpret_cast<const float*>(buffer->map()); // Map the buffer with bufferName for reading.
  memcpy(destination.data(), source, destination.size() * sizeof(float)); // Copy sizeInFloats floats into the destination vector.
  buffer->unmap();
}

int main()
{
  // 1. Create a Context
  // 2. Create an Output Buffer with Size 1 000 000 * 1 * sizeof(float);

  size_t sizeInFloats = bufferWidth * bufferHeight; // The buffer size in floats
  ...
  
  std::vector<float> destination;
  
  // Resize the destination to the sizeInFloats and read the data from the buffer with the name "DATA_of_OUT_rtBuffer".
  readFloats(context, std::string("DATA_of_OUT_rtBuffer"), destination, sizeInFloats);
}

Hi Detlef,

thank you for your explanations. Actually I totally got your point regarding the
dynamic storage allocation. Granted my explanation was a little bit confusing in that context.
I changed the code to a similar one you proposed and it worked.

And by the way I didn’t know I should not use pseudo code here. My intention was just to reduce the amount of code for the admins here in order to really get an example as little as possible.
So sorry for that!

Can I ask you for another assessment, please?

Up to now I have only used very simple scenes with about 50 triangles.
Now that I am using a free car model with about 100,000 vertices, I’m facing a problem.
With my code, OptiX needs about 2 hours(!) before it starts to calculate anything. During
these 2 hours the GPU cores have an occupancy of about 25%. My assumption is that the
memcpy function could be causing the problem, because when I used Nsight earlier with another
scene this was actually the CUDA function that took most of the processing time.

  1. Is this 2-hour value realistic for 100,000 vertices? I guess this value is more than unrealistic, isn’t it?

After the two hours, the nvGPUUtilization tool shows a 100% occupancy rate which remains in this state.
I’m getting results, but roughly one result every 16 hours.

  2. Is it possible for you to say anything about this without me posting any code here?

Kind Regards
Robert

Without knowing exactly what you’re doing, what the results are, and how you calculate them, it’s hard to answer any questions about possible inefficiencies which could result in long startup times.

First, I’m assuming you’re still on the same system configuration:
Win-7 Enterprise, NVIDIA GeForce GTX 745, Driver 376.33, OptiX SDK 4.0.2, CUDA v8.0.
That is a really low-end GPU configuration for the professional rendering and graphics related tasks discussed on this forum.
Do you have access to a more high-end system to run your application on for comparison?

1.)
A two-hour startup time for a scene with 100,000(?) primitives (triangles?) is completely off the charts.
Depending on the scene structure, that should generate the acceleration structure in under a second, even on that small GM107. I have a 2 GB mobile version of that GPU in my laptop, and it can handle scenes with millions of triangles without problems.

What does the scene structure look like?
How many OptiX nodes are in it? (Geometry, GeometryInstance, Material, GeometryGroup, Acceleration, Transform, Group)

If that happens during the acceleration structure build, you used the Trbvh builder, and it consumed too much memory and fell back to a software build, maybe look at the chunk_size property (see the OptiX Programming Guide, Table 4, Acceleration Structure Properties).
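For reference, a hedged host-side snippet for setting that property (not compiled; the 64 MB value is only an example, see the Programming Guide for the exact semantics):

#include <optix_world.h>

void setupAcceleration(optix::Context context, optix::GeometryGroup group)
{
  optix::Acceleration accel = context->createAcceleration("Trbvh", "Bvh");
  accel->setProperty("chunk_size", "67108864"); // build in chunks of at most 64 MB
  group->setAcceleration(accel);
}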

Or it could be that your OptiX programs in the *.cu files are complex and make the PTX assembler and optimizer run into disk paging on a system with little system RAM. You would notice that when looking at the Windows Task Manager performance graphs.
How much RAM is installed in your host system?
How many lines of PTX code do the OptiX programs in your scene have?

Also, getting one result in 16 hours implies that either your workload is outrageously high, or your GPU is massively underpowered for the task at hand, or something else is not right.

2.)
Well, you could start with an OptiX API Capture trace of your application run (search for “OAC” on this forum for explanations of how to enable that) and look at the resulting trace.oac text file, which tracks all OptiX API calls.
If that looks correct to you and didn’t throw any errors, you might send an archive of that whole “oac_” folder (must be <10 MB) to OptiX-Help(at)nvidia.com and we could take a look.
Note that *.zip extensions won’t make it through our e-mail servers. Rename the extension to *.zi_ instead.
If it’s bigger than 10 MB, I can set up a temporary FTP account to upload bigger files.

PS: “And by the way, I didn’t know I should not use pseudo code here. My intention was just to reduce the amount of code for the admins here in order to give as small an example as possible.”
You need to ask yourself what other developers on a forum need in order to understand your programming problem. Pseudo code only works when explaining code structure or general algorithms, not when there are real bugs inside your actual code.