3D launch using OptiX to obtain 3D complex data

I am trying to use a 3D launch in OptiX to build a radar simulator.
The launch dimensions are [512, 512, 512]; their product (2^27) does not exceed the 2^30 limit.
In the ray_gen program, I loop 64*64 times per pixel (launch index) to sample densely.
However, the Z dimension does not seem to run in parallel, and each optixLaunch is very slow.

My Settings

image

My optixLaunch

image

My ray_gen program

The size of SARCSARdistBuffer is sizeof(float) x 512 x 512 x 512.

Hope to hear from someone.

Ok, you’re saying you shoot 512 x 512 x 512 x 64 x 64 = 2^39 = 549,755,813,888 rays per launch.

That’s about 550 GRays per launch. Even assuming you have the highest-end RTX board, and that it handles around 10 GRays/second in your case (which it won’t, even if the theoretical limit is higher), a single launch would still take 55 seconds without doing anything else.
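To make the estimate above easy to redo for other launch sizes, here is a minimal C++ sketch of the same arithmetic. The function names are hypothetical, it assumes one ray per loop iteration, and the 10 GRays/second figure is only the optimistic assumption from above, not a measured value.

```cpp
#include <cstdint>

// Total rays traced by one launch, assuming one ray per loop iteration
// at every launch index (hypothetical helper, not part of OptiX).
constexpr std::uint64_t raysPerLaunch(std::uint64_t width,
                                      std::uint64_t height,
                                      std::uint64_t depth,
                                      std::uint64_t samplesPerIndex)
{
    return width * height * depth * samplesPerIndex;
}

// Lower bound on the launch time for an assumed sustained ray throughput.
constexpr double minLaunchSeconds(std::uint64_t rays, double raysPerSecond)
{
    return static_cast<double>(rays) / raysPerSecond;
}
```

Calling raysPerLaunch(512, 512, 512, 64 * 64) gives 2^39 rays, and minLaunchSeconds of that value at 10e9 rays per second is roughly 55 seconds.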

Now, what is your actual system configuration and how long does it really take?

Please always provide the following system configuration information when asking about OptiX issues:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the input PTX, host compiler version.

From the cropped code screenshots (note that the forum supports code blocks), it’s not apparent how your index_real and index_imag advance, or whether they are the same.

If index_real and index_imag are constant inside the loop, it would be faster not to accumulate the result into the two output buffers inside that loop. Instead, accumulate the results into local variables, which keeps them in registers, and write them only once at the end.
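A minimal sketch of that pattern, written as plain C++ rather than actual OptiX device code. Sample, traceSample, and accumulateLaunchIndex are hypothetical stand-ins, since the real ray generation program is only visible in the screenshots; the sketch assumes the buffer index does not change inside the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-sample result (real and imaginary contribution).
struct Sample { float real; float imag; };

// Hypothetical placeholder for tracing one ray; returns a fixed value here.
inline Sample traceSample(int i, int j)
{
    (void)i;
    (void)j;
    return Sample{1.0f, 0.5f};
}

// Accumulate all samples of one launch index in local variables and
// touch the output buffers exactly once, after the loop.
inline void accumulateLaunchIndex(std::vector<float>& realBuffer,
                                  std::vector<float>& imagBuffer,
                                  std::size_t index,
                                  int samplesPerSide)
{
    float sumReal = 0.0f; // local, so it can stay in a register
    float sumImag = 0.0f;
    for (int j = 0; j < samplesPerSide; ++j)
    {
        for (int i = 0; i < samplesPerSide; ++i)
        {
            const Sample s = traceSample(i, j);
            sumReal += s.real;
            sumImag += s.imag;
        }
    }
    realBuffer[index] += sumReal; // single write per buffer
    imagBuffer[index] += sumImag;
}
```

With samplesPerSide = 64, this replaces 2 x 4096 read-modify-write operations on global memory per launch index with two.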

Also note that it’s less efficient to write two individual floats to separate buffers, because they lie in different memory cache lines. The GPU microcode supports vectorized load and store instructions for 2- and 4-component data types; 3-component types are handled as three individual scalars. That means it would be faster to handle your complex numbers as float2 vectors and store them in one output buffer, if possible.
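As a rough illustration of the single-buffer layout, here is a plain C++ sketch. Float2 is a hypothetical stand-in that mimics the size and 8-byte alignment of CUDA's float2; in device code you would use float2 itself. storeComplex is likewise a made-up helper name.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float2: two floats, 8-byte aligned, so one element
// can be moved with a single 64-bit load or store.
struct alignas(8) Float2 { float x; float y; };
static_assert(sizeof(Float2) == 8, "real and imaginary part must be packed");

// Store one complex value (x = real, y = imaginary) into the one
// interleaved output buffer instead of two separate float buffers.
inline void storeComplex(std::vector<Float2>& out,
                         std::size_t index,
                         float real,
                         float imag)
{
    out[index] = Float2{real, imag};
}
```

For a 512 x 512 x 512 launch, such a buffer holds 2^27 elements of 8 bytes, i.e. 1 GiB, the same total as two separate float buffers.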

The 3D launch will be scheduled as 2D slices.
See this discussion about the order and potential access hazards: https://forums.developer.nvidia.com/t/optixlaunch-configuration-revisited/198275
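For reference, a minimal sketch of mapping a 3D launch index to a unique element of a linear output buffer, so that no two launch indices write to the same location. It assumes an x-fastest layout, which may not match your buffer. Index3 and linearIndex are hypothetical stand-ins; in device code the uint3 values would come from optixGetLaunchIndex() and optixGetLaunchDimensions().

```cpp
#include <cstddef>

// Stand-in for CUDA's uint3.
struct Index3 { unsigned x; unsigned y; unsigned z; };

// Linear element index for a 3D launch index, with x varying fastest,
// then y, then z (assumed layout).
inline std::size_t linearIndex(Index3 idx, Index3 dim)
{
    const std::size_t x = idx.x;
    const std::size_t y = idx.y;
    const std::size_t z = idx.z;
    return x + static_cast<std::size_t>(dim.x) *
                   (y + static_cast<std::size_t>(dim.y) * z);
}
```

With launch dimensions [512, 512, 512], the last launch index (511, 511, 511) maps to element 2^27 - 1.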

Thanks. You have explained all my problems. It is now clear to me how to fix it. Thanks for your reply.