[RayTracing] Smaller resolution - fewer rays per second (performance)

Hello, everyone! I have a problem with performance.

Kernel launch cfg:
blocks = dim3(VIEWPORT_W / 16, VIEWPORT_H / 16);
threads = dim3(16, 16);

As VIEWPORT_W and VIEWPORT_H decrease, the average performance decreases significantly too.

Performance is measured in 10^6 rays per second (mrays/sec).
Here are my test results:
768x768 - 4.6 mrays/sec
512x512 - 3.0 mrays/sec
256x256 - 1.1 mrays/sec
128x128 - 0.6 mrays/sec

Environment - CUDA 4.0, sm_10, GTS 450 (192 cores).

I admit that some of the decrease may be due to launch overhead and caching, but this is too much.
The same code running on the CPU always performs at 0.02 mrays/sec, independent of resolution.

Any suggestions?
(Thank you, and sorry for my English.)

If I understand you correctly, you are launching only 64 blocks of threads for the 128x128 pixel configuration. That may not be enough. However, given the relatively small size of your GPU, that would be surprising.

Is there an upper bound on the image size beyond which you get a constant speed in mrays/sec? At some point you should hit saturation and see no further gain past a certain number of threads. If you don't, I would investigate the measurement methodology, or the results.

Hope this helps!

Thank you, Ailleur!

Yes, you are correct. But as I understand it from the CUDA programming guide, the number of blocks is determined by the size of the data to compute, and this number does not affect performance (as long as the number of blocks is more than 2-3x the number of GPU multiprocessors).

The GTS 450 contains 4 multiprocessors, so it should not matter whether the block count is 64, 128, or 1024. Much more important is the number of threads per block, but that value is always 256. So I can't understand the reason for this behavior.

I've tested it on a GT730M (384 cores) at high resolutions and obtained the following results:


It looks like the problem is not in the measurement, also because the same methodology is used for CPU mode and its result is a constant mrays/sec.

Also, I have tested the launch overhead (by simply reducing the amount of computation in each thread while keeping the same launch configuration) - the overhead is insignificant, exactly as described in the programming guide.

So I have no idea how to fix this, or at least how to explain this behavior.

Thanks, and sorry again for my English.

I have some questions:

  • How many rays do you have per pixel, and is it constant every frame?
  • How do you count them? Are only primary (camera) rays counted, or are secondary/shadow/reflective/refractive rays also included?


Hello, cmaster.matso!

I have only one ray per pixel, and there are no secondary rays. Each thread processes only one pixel of the screen (which means it traces only the one ray for that pixel).

So the ray count is simply equal to the screen resolution (512x512 = 262144 rays). It is a constant value.

I'm working on a volume rendering module, if that is important information.

Here are some examples of my rendering results:

(By the way, these are exclusive shots of a rendering module from our workstation, which has not been published yet.)

This rendering is performed in high quality mode and took about 200ms at 912x848 (about 3.9 mrays/sec) on my GTS 450.

Rendering at 256x256 took 110ms (0.6 mrays/sec), but I expected something around 20ms (at a constant 3.9 mrays/sec).
How can this behavior be explained?

And do you get the same FPS for each resolution, say 30?


Sorry, I see the render times now - I got confused a bit…

Some ideas:
When you reduce the resolution, do you also reduce the field of view accordingly? If you don't, the following problems may cause your results:
- Smaller resolutions will result in more scattered memory accesses, reducing the hit rate of all caches (texture cache / L2 cache). Since volume rendering is a streaming application, you have to stream almost all of your volume texture through your SMs once every frame. Therefore your maximum frame rate will be TextureSize / DRAM-bandwidth. This can reduce ray throughput at lower resolutions and becomes even worse once your working set no longer fits in the L2 cache. You might try reducing the volume size (a 5x5x5 dummy texture works fine for this purpose), or profile DRAM bandwidth with the Visual Profiler to check this.

- You probably avoid shading empty space with a simple if; furthermore you will probably have some ifs for diffuse and specular shading. These only save unnecessary computation if no thread of a warp wants to take the branch. By lowering the resolution, a warp will cover a larger area of the volume, so this optimization won't work as well as before. This will also reduce your ray throughput. The same goes for early ray termination, if there is any.

No, the FPS isn't the same. As I said before, 912x848 takes 200ms and 256x256 takes 110ms. FPS is almost equal to 1000 / time, so I get 5 and 9 frames per second in these cases. The ray count was reduced by a factor of ~12, but that gives only double the FPS.

The FOV is constant.
I had thought about reduced locality of memory access. Can you explain why almost the whole volume texture should be streamed when rendering at low resolution? I would think we read a smaller part of the volume, just with a bigger step, and the only problem is worse caching. Am I right?
Also, I want to remind you that the CPU mode works well (without this strange performance reduction).

The second idea sounds very plausible. So I disabled (commented out) every branch (if-case) to check it, but the results stayed the same. That was really surprising; I thought we had almost found the explanation.

And what about uncoalesced memory access and bank conflicts? Do you use shared memory? Such things can have a significant influence on performance.

I'm using only texture memory for the data. Some constant values, like the camera position and direction, are placed in constant memory.

I'm not quite sure whether I understand this question correctly, so please re-ask it if my answer sounds strange.

Texture memory on the GPU is organized for local streaming accesses, probably in a Morton order. Thus, for a single-channel byte texture, every 8x4x4-voxel texture block corresponds to a single cache line. Therefore, if the distance between two rays is less than 4 voxels, all of the texture's cache lines will be needed to render the whole texture. Hence the optimal amount of required memory bandwidth (with perfect L2 caching) is constant and doesn't depend on the resolution. That's why I'd try to reduce the texture size.

Your CPU also has vastly different hardware compared to your GPU. Your CPU probably has 1-8 MByte of L3 cache, 16-30 GByte/s of memory bandwidth, and low computational throughput (0.02 mrays/sec), whereas your GPU probably has 256 KByte of L2 cache, ~30 GByte/s of memory bandwidth, and high computational throughput (4.6 mrays/sec). Furthermore, the GPU's hardware multithreading needs many active threads (up to 2048 per SM) - or rays, in this case - which increases the GPU's working set and worsens the L2 cache hit rate. On the CPU, depending on your (compiler's) vectorization, you've just got one or a few rays per core.
Thus a GPU is more susceptible to memory bandwidth limitations or a bad cache hit rate. That's why you probably only notice a performance penalty on your GPU.

Excuse my English. You have understood my question correctly.

OK, I see. Actually, I want to ray trace a 96x96 image as a null-phase rendering pass, and the distance between two rays will be slightly greater than or equal to 4 voxels. This can explain everything, but I still hope to increase the performance at low resolution. Maybe I should try a GPU performance profiler?

Also, I checked the occupancy with the CUDA Occupancy Calculator. It shows that occupancy is only 33%, limited by registers per SM. So I tried to reduce register usage by setting -maxrregcount 8 to achieve 100% occupancy, but this resulted in use of slow local memory and did not change the performance/resolution behavior.

I just tried rendering a smaller volume (50x128x128) and the performance drop became significantly smaller! That was a direct hit regarding memory throughput, Fiepchen, thank you very much.

Now I'm thinking about some kind of mipmap, but I'm not sure (because of the use of transfer functions for real-time conversion of raw signal data into color).

Thank you again! I will report back here on further optimizations of my low-res rendering phase.

Hello, everybody!

I just completed the mipmapping and finally enabled my optimization, and now I'm getting amazing results!

Here is an example image rendered with my volume rendering module:

Performance in high quality mode with my notebook’s GT730M:
Brute-force ray marching = 103ms, 8.7 mrays/sec
Optimized (using mipmaps) = 33ms, 27.3 mrays/sec

CPU mode in high quality with my notebook's Core i3-3120M:
Brute-force = 4714ms, 0.191 mrays/sec
Optimized = 1237ms, 0.728 mrays/sec

The only price for this optimization is rare artifacts, manifesting as the disappearance of narrow objects like the smallest vessels. This happens really rarely, and I don't mind it.

The image quality was already sufficient, and now the performance has become totally acceptable for real-time rendering on low-cost hardware.

Can somebody suggest what the performance would be on a GTX 580, 680, 770, 780, or Titan?

Thanks, everybody, especially Fiepchen, for the bright and clever ideas! The discussion helped me very much.

May the Force be with you.
Alexander Korolev.