My OpenGL GPGPU app uses a geometry shader with transform feedback (rasterization turned off, rendering from one VBO into another), so each pass can reduce or expand the data in the array. I also make use of the ability to read data from adjacent vertices.
I’m very new to CUDA. I want to like it (driver-wise, with multiple GPUs on Windows, it looks very attractive), but the programming model seems a little rigid - i.e. you set up your kernel function, and then you have to specify the size of the thread block (?). Is there a way of using CUDA more like rendering from one array to another, with the ability to reduce/expand the data?
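For context on what I mean by "rendering from one array to another": as far as I understand it, the closest CUDA equivalent is a kernel where each thread handles one element, with the block size chosen explicitly at launch. This is just a sketch with made-up names (`passKernel`, `d_in`, `d_out`), not code from my app:

```
// One thread per element: the CUDA analogue of a 1:1 pass
// from one VBO into another. Names here are illustrative.
__global__ void passKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        out[i] = in[i] * 2.0f;                      // the per-element "shader"
}

// Launch: the block size is an explicit tuning choice (256 is a
// common default); the grid size is derived from the array length.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// passKernel<<<blocks, threads>>>(d_in, d_out, n);
```

The fixed block size is what I mean by "rigid" - there's no built-in notion of a variable-length output stream the way transform feedback gives you.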
If not, maybe it’s something that could be incorporated in the future: shader-like functionality, where you specify an input buffer, an output buffer, a ‘shader’ function, and two structs describing how the input/output arrays are laid out.
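One workaround I've seen suggested for the reduce/expand part is stream compaction with a global atomic counter: each thread decides how many outputs it produces and reserves slots with `atomicAdd`, a bit like a geometry shader emitting a variable number of vertices. A hypothetical sketch (all names made up, and note the output order is not preserved):

```
// Variable output per element via a global counter, mimicking a
// geometry shader's 0..N emitted vertices. Illustrative only.
__global__ void expandKernel(const float *in, float *out,
                             unsigned int *outCount, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Decide how many outputs this element emits (0 or 2 here).
    int emit = (in[i] > 0.0f) ? 2 : 0;
    if (emit == 0) return;

    // Reserve 'emit' contiguous slots in the output array.
    unsigned int base = atomicAdd(outCount, emit);
    for (int k = 0; k < emit; ++k)
        out[base + k] = in[i];
}
```

A prefix-sum (scan) pass would give ordered output, but that's exactly the kind of plumbing I'd hope a shader-like API could hide.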