Launch dimensions in LaunchContextnD and optixLaunch

Hi,
I am trying to write a ray tracer using Optix.
I don’t understand the launch dimensions and the width and height inputs of the functions rtContextLaunch1/2/3D in V6.5. I believe in V7.2 they are replaced by optixLaunch, which also takes three inputs for width, height, and depth.
The specification of width in 1D, and of width and height in 2D, confuses me.
What do these values of width and height specify?
I saw someone interpret these values as the dimensions of the virtual screen grid for launching rays.
Even more confusing are the 1D/2D launch dimensions. What is a 1D context launch, or a 1D optixLaunch?

Thanks a lot!

Hi @JW_raytracing, welcome!

In any launch, the total number of threads that will be spawned is the product of all the launch dimensions, meaning `nThreads = (width * height * depth)`. The designation of a launch as 1D/2D/3D is purely a convenience that allows you to more naturally map your threads to whatever data structure you need to compute. In ray tracing, it’s very natural to use a 2D launch in order to render an image. But sometimes a launch is doing computation over a volume, and it is convenient to think of it as a 3D launch with one thread of computation per voxel. Or you might use the GPU to compute something over a linear structure that does not factor naturally into width * height or any 2D or 3D rectangular grid.

One example where I use a 1D launch is for curves: suppose you want to store a parameter value for every segment in a set of connected segments that form hair strands. The number of segments is determined solely by your input geometry, and is arbitrary. In this case, it makes the most sense to treat this as a 1D launch, and I will pass the number of segments to `optixLaunch` or `rtContextLaunch1D` as the width parameter, which means I get exactly one thread per curve segment.

Maybe it will help to think of all launches as 1D launches. The only two things that really matter from the perspective of launch size and shape are how many threads you want to spawn, and how each thread converts its thread ID into whatever index you need in order to handle I/O on your input and output buffers. The 2D and 3D launch types are just convenient ways to handle your indexing, because OptiX will automatically give you a “launch index” with dimensionality that matches your launch.

To summarize:

• For a 1D launch, you have `width` threads total, and you get a 1D launch index x in your raygen program, with x in [0, width-1].
• For a 2D launch, you have `(width*height)` threads total, and you get a 2D launch index (x, y), with x in [0, width-1] and y in [0, height-1].
• For a 3D launch, you have `(width*height*depth)` threads total, and you get a 3D launch index (x, y, z), with x in [0, width-1], y in [0, height-1], and z in [0, depth-1].

Note that (x, y, z) are not the actual names of the launch index: the launch index is whatever variable you bind to the `rtLaunchIndex` semantic in OptiX <= 6.5, and the value returned by `optixGetLaunchIndex()` in OptiX >= 7.0.

Also be aware that OptiX maps 2D launches in a way that keeps 2D tiles of threads together, assuming you’re tracing rays through pixels. This usually gives you better performance than using a 1D launch and mapping your threads across scanlines.

David.

Hi David,

Your answer is very clear. It was the number of threads I was looking at, but I was confusing it with the number of rays launched. Now I see that many programs use one thread per pixel of the rendered image, as in the starting example given on this page: https://developer.nvidia.com/blog/how-to-get-started-with-optix-7/
Choosing the threads this way maps the number of threads to the number of pixels, and often also to the number of rays launched.

I am new to GPU programming, so many of these concepts are new to me.
As I understand it, threads here are the same concept as in the CPU/multi-threading sense, while GPUs normally have a large number of parallel threads.

How do I determine the number of threads that I need?
Do more threads help accelerate the ray tracer?
Is there a cost for assigning many threads?
I see that determining the number of threads is a design decision made by the programmer.

As I understand it, setting the number of threads is independent from the rest of the ray tracer. For example, I can set the number of threads to an arbitrary integer in optixLaunch or rtContextLaunch1D, say either 1 or 10000. The rest of the ray tracer will still work either way, even executed on a single thread, though it will be inefficient on a GPU. But the ray tracer should work even on a single thread, if it is properly set up.

As I understand it, threads here are the same concept as in the CPU/multi-threading sense, while GPUs normally have a large number of parallel threads.

They are similar; just be aware of the differences too. With CPU threads, it’s common to start as many “worker” threads as you have processors and let them pull jobs from a priority queue. They’ll do a job, and when they finish they’ll start another. Starting 16 threads on a 16-core CPU is a good model. Starting a million CPU threads is not a good idea; that can slow the CPU down and consume too much memory.

With GPU threads, you shouldn’t do that. CPU threads have a (relatively) larger cost to create and destroy the thread itself, whereas GPU threads have almost no cost. With GPU threads, it’s more common to set the number of threads to the number of separate things you need to compute and save to memory, so it’s common to start kernels that have a million or even a billion threads.

How do I determine the number of threads that I need?

Normally, it’s the number of individual things you need to compute, e.g., the size of your output array. In a ray tracer, the most common setup is to compute a single color value per pixel, so you start (width*height) threads where width and height are your image resolution.

Do more threads help accelerate the ray tracer?

Yes, up to a point. Today’s GPUs have thousands of cores. If your per-thread workload is typical and small-ish, you need hundreds of thousands of threads to saturate the GPU and get the maximum throughput performance. Once the GPU is saturated, more threads stop helping, but they never hurt unless you have other factors like too much memory usage. If you have threads that do a lot of compute, then a billion of them won’t be any faster per-thread than a million of them. The bottom line is you can’t have too many threads, but you can have too few.

Is there a cost for assigning many threads?

Basically no. The main cost is whatever memory you use and/or computation you do per-thread. You might be able to speed things up by combining threads together that share memory lookups, or share computation, as long as your threads are still saturating the GPU and doing similar sized amounts of work.

But the ray tracer should work even on a single thread, if it is properly set up.

While this is true, a single-threaded GPU program usually performs far worse than a single-threaded CPU program; it will go much, much slower. In order to use a GPU effectively, you should be using a lot of small, bite-sized threads that each do as much uniformly similar work as possible.

David.

David,