Hello!
I am implementing a custom rendering pipeline using CUDA. Let’s say we have an array AR of N = 65536 independent elements that we would like to process (subdividing until some criterion is met). The original implementation works fine:
<b> First case </b>
cudaMalloc(AR, N)
// ... and other cudaMallocs for temp buffers
// ... some OpenGL map/unmap of the pixel buffer
cudaThreadSynchronize()
// timer start
int QueueSize = N; // working set size
for (level = 0; level < 25; level++) // levels of data scanning
{
    process_data(AR, QueueSize); // computes per-element storage requirements (dataCount)
    cudppScan(plan, dataOffsets, dataCount, QueueSize + 1); // QueueSize + 1 varies between 10K and 100K
    scatter_data(AR); // build a compact representation from dataOffsets & dataCount
    cudaMemcpy(&QueueSize, dataOffsets + QueueSize, sizeof(int), cudaMemcpyDeviceToHost); // update working set size
}
cudaThreadSynchronize()
// timer end, print timings
// other stages
// ... some OpenGL unmap/map pixel-buffer
This version works fine: ~7 ms on a GTX 280, CUDA 2.2 with the corresponding driver, Windows XP SP3 or Vista, 32-bit, Core 2 Duo 3 GHz.
You may think of this algorithm as a graphics application with adaptive subdivision. All data buffers are stored on a single device; the CPU is responsible only for high-level logic and kernel invocation, not for data transfer inside the critical loops.
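For clarity, each level performs the standard count → exclusive-scan → scatter compaction pattern. A CPU reference of one level in plain C++ (Element, countChildren, and subdivideLevel are my illustrative names, not the actual kernels) would be:

```cpp
#include <vector>

// CPU reference of one "level": each element reports how many output
// slots it needs (dataCount), an exclusive scan turns the counts into
// offsets (dataOffsets), and the scatter writes the children compactly.
struct Element { int value; };

static int countChildren(const Element& e) {
    return e.value % 3; // dummy subdivision criterion: 0..2 children per element
}

std::vector<Element> subdivideLevel(const std::vector<Element>& AR, int& queueSize) {
    int n = queueSize;
    std::vector<int> dataCount(n + 1, 0), dataOffsets(n + 1, 0);
    for (int i = 0; i < n; ++i)              // what process_data does
        dataCount[i] = countChildren(AR[i]);
    for (int i = 1; i <= n; ++i)             // what cudppScan does (exclusive scan)
        dataOffsets[i] = dataOffsets[i - 1] + dataCount[i - 1];
    std::vector<Element> out(dataOffsets[n]);
    for (int i = 0; i < n; ++i)              // what scatter_data does
        for (int c = 0; c < dataCount[i]; ++c)
            out[dataOffsets[i] + c] = Element{ AR[i].value + c };
    queueSize = dataOffsets[n];              // the value read back via cudaMemcpy
    return out;
}
```

On the GPU the only difference is that the three steps are separate kernels plus a 4-byte device-to-host readback of the final offset.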
While the size of our array AR stays at N <= 65K, some user configurations increase the storage needed for other data: AR_dependent of size M = N * K (K = 32, 256, etc.). The elements of AR_dependent depend on AR’s elements, but they are also independent of each other. We would like to keep all of this data on the GPU. In fact, the data for AR and AR_dependent is generated and modified on the GPU without touching the CPU.
Sometimes the parameter K forces pre-allocating more memory for AR_dependent than is available on the GPU. That’s why I decided to split the array AR into several chunks (still large ones) and process them sequentially on the GPU. Such simple tiling should let me generate and keep all of AR and AR_dependent on the GPU without costly CPU-GPU transfers. Look at the following test:
<b> Second case </b>
int numStages = 4;
cudaMalloc(AR, N / numStages)
// ... and other cudaMallocs for temp buffers
// ... some OpenGL map/unmap of the pixel buffer
cudaThreadSynchronize()
// timer start
for (stage = 0; stage < numStages; stage++)
{
    int QueueSize = N / numStages; // working set size, <b>now it is smaller</b>
    for (level = 0; level < 25; level++) // levels of data scanning
    {
        // AR elements are generated on the GPU from shift = base + stage * N / numStages
        process_data(AR, QueueSize); // computes per-element storage requirements (dataCount)
        cudppScan(plan, dataOffsets, dataCount, QueueSize + 1); // QueueSize + 1 varies between 10K and 100K
        scatter_data(AR); // build a compact representation from dataOffsets & dataCount
        cudaMemcpy(&QueueSize, dataOffsets + QueueSize, sizeof(int), cudaMemcpyDeviceToHost); // update working set size
    }
}
cudaThreadSynchronize()
// timer end, print timings
// other stages
// ... some OpenGL unmap/map pixel-buffer
The frustrating thing is that the second case is 3x slower (22 ms) than the first. What happens? Why? The amount of data processed is the same in both cases, the intermediate results are correct, and the sums of the second case’s results match the corresponding data from the first case. I first tested on Vista and thought the WDDM driver model was at fault and that XP would be fine. But no: the relative behavior is the same. As I said, the amount of data processed is the same in both cases, and the chunks submitted sequentially to the GPU in the second case still contain enough work to occupy the device.
I don’t want to believe that 25 kernel iterations processing N elements beat 100 kernel iterations processing N / 4 elements purely due to kernel invocation overhead. It shouldn’t cost 15 ms! Maybe the issue is inside cudppScan; I have never looked into the details of that library (lots of code). process_data and scatter_data are my own kernels, and they are very simple.
Please, help me! I need to increase the number of kernel invocations even further for other stuff… I would be very grateful for any help and suggestions, as I have some deadlines…