[bugreport] Writing to CUDA SurfaceObjects produces no result

Hi,

My program has exactly the same setup as this one (using the “bindless” surface objects introduced with Kepler):

https://stackoverflow.com/questions/58303950/writing-to-cuda-surface-from-optix-kernel

I do understand that if I dispatch a 1280x720 OptiX raygen, it won’t actually dispatch a 1280x720 kernel but some group of persistent workgroups that iterate through ray packets, and hence more than one physical dispatch thread may end up writing to the surface.

However, even if my memory is 100% not coherent, I’d still expect the writes to go through. Is there some sort of memory barrier I should issue, like in OpenGL?

I’ve tried hand-massaging the PTX before it goes into OptiX, trying out different cache operators for the sust instruction, such as .wb, .wt and .cg.

I even added a stream synchronize and an OpenGL memory barrier:


		cuda::CCUDAHandler::cuda.pcuStreamSynchronize(stream);
		video::COpenGLExtensionHandler::extGlMemoryBarrier(GL_ALL_BARRIER_BITS);

Still to no avail.

It seems that OptiX is excising the sust instruction from the PTX during its compilation.

original CUDA C

original PTX (%r3 and %r4 contain the launch_index.xy)

BB0_3:
    ld.const.u64 	%rd8, [params];
    shl.b32     %r8, %r3, 4;
    mov.u32     %r9, 1065353216;
    mov.u32     %r10, 1132396544;
    sust.b.2d.cg.v4.b32.trap     [%rd8, {%r8, %r4}], {%r10, %r10, %r10, %r9};
    ret;
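For anyone decoding the PTX above, here is a minimal host-side C++ sketch (the helper names `bits_to_float` and `surface_byte_x` are hypothetical, not part of any SDK). It reinterprets the two integer immediates as IEEE-754 floats and reproduces the `shl.b32 ... 4`, on the assumption that the shift converts a pixel x index into the byte offset that `sust.b` expects for a 16-byte (float4) texel.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as an IEEE-754 single-precision float.
inline float bits_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Convert a pixel x index into a byte offset along x.
// texel_shift = 4 assumes a 16-byte texel (float4), matching `shl.b32 %r8, %r3, 4`.
inline std::uint32_t surface_byte_x(std::uint32_t pixel_x, std::uint32_t texel_shift = 4) {
    assert(texel_shift < 32);
    return pixel_x << texel_shift;
}
```

With these helpers, `bits_to_float(1065353216u)` is 1.0f and `bits_to_float(1132396544u)` is 255.0f, so the store writes (255, 255, 255, 1) to each texel, and pixel x = 3 maps to byte offset 48.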

disassembly (no sust instruction)

0x0000029dab64ba90  [296] shl.b32 	%r28, %r17, 4; 
0x0000029dab64ba90               IMAD.SHL.U32 R10, R16, 0x10, RZ  
0x0000029dab64baa0  [307] st.param.b8	[param0+3], %rs16; 
0x0000029dab64baa0               PRMT R4, R13, 0x654, R0  
0x0000029dab64bab0  [328] call.uni  
0x0000029dab64bab0               MOV R20, 0x0  
0x0000029dab64bac0               MOV R21, 0x0  
0x0000029dab64bad0               CALL.ABS.NOINC 0x0  

Hi devsh,

Which OptiX version and driver did you try this with? I haven’t actually tried writing to a surface in an OptiX program, it is possible there’s a bug. Do you have a complete & minimal reproducer you could share with us?

FWIW, if you launch a 1280x720 OptiX raygen, we choose the block size, but other than that it does equate to a kernel whose dimensions are 1280 * 720 threads. You can verify this in Nsight Compute, for example. The main thing you need to be aware of and careful with is that OptiX programs are not CUDA, even though we’re trying to make them as close as possible. OptiX shaders cannot use shared memory, synchronizations, barriers, or other SM-thread-specific programming constructs in device code.


David.

I’m using OptiX 7.0.0, latest one.

I could try to put together a minimal, complete reproducer, but my code is always NVRTC JIT-compiled and uses OpenGL interop where OpenGL owns the textures and buffers, so I don’t think it would be as nice for you to debug as the code from here:
https://stackoverflow.com/questions/58303950/writing-to-cuda-surface-from-optix-kernel

I’d much rather you patch the Hello World SDK sample with this guy’s changes (far better repro sample)
https://stackoverflow.com/questions/58303950/writing-to-cuda-surface-from-optix-kernel

Actually, you do use persistent threads (a common trick in ray tracing and HPC GPGPU). I can see that you’re launching 288 blocks of 64 invocations on my RTX 2070, which works out to 8 invocations per “CUDA core” (I have 2304 of those across 36 SMs, and Turing can do 64 SIMD in 2 warps of 32).

So I’d presume there’s some cooperative CUDA going on or a shared global atomic counter and a circular buffer work-list ;)
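The arithmetic behind those figures can be checked with a few lines of C++. This is only a sketch of the numbers quoted in this post (288 blocks of 64 threads, 2304 cores, a 1280x720 launch); the type and function names are hypothetical, and the pixels-per-thread figure only applies if the persistent-threads reading of the launch is correct.

```cpp
#include <cassert>
#include <cstdint>

// Observed launch shape: number of blocks and threads per block.
struct LaunchShape {
    std::uint64_t blocks;
    std::uint64_t threads_per_block;
};

// Total number of physical threads in the launch.
constexpr std::uint64_t total_threads(LaunchShape s) {
    return s.blocks * s.threads_per_block;
}

// Threads resident per "CUDA core", given the core count of the GPU.
constexpr std::uint64_t threads_per_core(LaunchShape s, std::uint64_t cores) {
    return total_threads(s) / cores;
}

// Pixels each thread would have to cover (rounded up) if the launch
// really were a fixed pool of persistent threads.
constexpr std::uint64_t pixels_per_thread(std::uint64_t width, std::uint64_t height,
                                          LaunchShape s) {
    return (width * height + total_threads(s) - 1) / total_threads(s);
}
```

For `LaunchShape{288, 64}` this gives 18432 threads, 8 per core on 2304 cores, and 50 pixels per thread for a 1280x720 launch.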

Yeah, noted, already knew that… but image storage from a kernel is not any of the above, right?

I asked around and discovered that we have an open bug report on surface writes; they are indeed not working correctly in OptiX. I’ll follow up here when it’s fixed. Thanks for the report.


David.

Hi, I ran into the same issue while investigating a related one: surf2Dread in an OptiX kernel.

I created a minimal reproducer for this issue.


I created two equivalent kernels: one is written as an OptiX raygen program, the other is a normal CUDA kernel.
The reproducer creates an array (512x256, float4) and fills every pixel with red.
Both kernels read the array via a surface object and add a blue gradient over the red image. If the kernels work as expected, the resulting image should be a red-to-purple gradient.

The reproducer is set to use the OptiX kernel by default. In my environment the result is a completely red image. The CUDA kernel, on the other hand (enabled by commenting out USE_OPTIX_KERNEL in test_shared.h), produces the gradient.
For validation purposes, I added a macro that switches from the surface object to a plain buffer when USE_SURFACE_OBJECT is commented out. In that case both kernels produce the gradient.
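To make the expected result concrete, here is a CPU-only C++ sketch of what the reproducer is described as doing. It is not the reproducer's code: the `Float4` type, the function names, and the gradient formula (blue ramps linearly from 0 to 1 along x) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float4 (hypothetical host-side type).
struct Float4 { float x, y, z, w; };

// Fill a width x height image with opaque red.
std::vector<Float4> make_red_image(std::size_t width, std::size_t height) {
    return std::vector<Float4>(width * height, Float4{1.0f, 0.0f, 0.0f, 1.0f});
}

// Read-modify-write every pixel, adding a horizontal blue ramp.
// The exact ramp is an assumption; the thread only says "blue gradient".
void add_blue_gradient(std::vector<Float4>& image, std::size_t width, std::size_t height) {
    assert(width > 1 && image.size() == width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            image[y * width + x].z += static_cast<float>(x) / static_cast<float>(width - 1);
        }
    }
}

// True if no pixel has any blue, i.e. the failure described above
// (a completely red image after the kernel has run).
bool is_still_pure_red(const std::vector<Float4>& image) {
    for (const Float4& p : image) {
        if (p.z != 0.0f) return false;
    }
    return true;
}
```

A check like `is_still_pure_red` on the downloaded image separates the failing case (surface writes dropped) from the working one (red-to-purple gradient).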

Thanks,

Environment:
Windows 10, 1909
NVIDIA Driver: 445.87
CUDA 10.1.243
OptiX 7.0, installed at the default location.
Visual Studio Community 2019, 16.5.4
RTX 2070

Thanks. I downloaded the reproducer project and filed another bug for investigation.

@devsh Just in case you hadn’t seen the message in the thread linked in comment 7, the R450 display drivers supporting OptiX 7.1.0 fixed the surface access for that case.
Please check whether you’re getting the expected results with drivers from that branch. Thanks.