Long custom pipeline bottleneck: array tiling, scan, compact...

Hello!

I am implementing a custom rendering pipeline using CUDA. Let's say we have an array AR of N = 65536 independent elements and we would like to process them (subdividing until some criterion is met). The original implementation works fine:

<b> First case </b>

cudaMalloc((void**)&AR, N * sizeof(*AR));       // N = 65536 independent elements
// ... other cudaMallocs for temp buffers (dataCount, dataOffsets, ...)

// ... some OpenGL map/unmap of the pixel buffer

cudaThreadSynchronize();
// timer start

int QueueSize = N;                              // working-set size
for (int level = 0; level < 25; level++)        // levels of data scanning
{
    process_data(AR, QueueSize);                // computes per-element storage requirements (dataCount)
    cudppScan(plan, dataOffsets, dataCount, QueueSize + 1);  // QueueSize + 1 varies between 10K and 100K
    scatter_data(AR);                           // create the compact representation according to dataOffsets & dataCount
    cudaMemcpy(&QueueSize, dataOffsets + QueueSize, sizeof(int),
               cudaMemcpyDeviceToHost);         // update working-set size
}

cudaThreadSynchronize();
// timer end, print timings

// other stages

// ... some OpenGL unmap/map of the pixel buffer

This version works fine: roughly 7 ms on a GTX 280 with CUDA 2.2 and the corresponding driver, Win XP SP3 or Vista, 32-bit, Core2Duo 3 GHz.
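To make the per-level bookkeeping concrete, here is what the scan/compact step computes, with made-up counts (the real dataCount comes from process_data; reading dataOffsets + QueueSize back as the new QueueSize implies an exclusive add-scan):

/* Host-side illustration only, QueueSize = 4 for readability. */
int dataCount[5] = { 3, 0, 2, 5, 0 };      /* storage needed per element; last slot is the +1 padding */
int dataOffsets[5];

dataOffsets[0] = 0;                        /* exclusive prefix sum, as the scan produces on the GPU   */
for (int i = 1; i < 5; i++)
    dataOffsets[i] = dataOffsets[i - 1] + dataCount[i - 1];

/* dataOffsets = { 0, 3, 3, 5, 10 }: element i scatters its output starting at dataOffsets[i], */
/* and dataOffsets[QueueSize] == 10 is the new QueueSize read back by the memcpy.              */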

You may think of this algorithm as a graphics application with adaptive subdivision. All data buffers are stored on a single device, and the CPU is only responsible for high-level logic and kernel invocation, not for data transfer inside the critical program loops.

Even with the size of our array AR fixed at N <= 65K, some user configurations increase the storage needed for other data: an array AR_dependent of size M = N * K (K = 32, 256, ... etc.). The elements of AR_dependent depend on AR's elements, but they are also independent of each other. We would like to keep all of this data on the GPU. In fact, the data for AR and AR_dependent is generated and modified on the GPU without touching the CPU.
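To make the storage pressure concrete, here is the rough arithmetic for one of those configurations (the 32-byte element size is my assumption for illustration only):

/* Illustration only: the element size is an assumption, not the real struct. */
size_t N = 65536;
size_t K = 256;                               /* one of the mentioned configurations */
size_t M = N * K;                             /* 16,777,216 dependent elements       */
size_t bytesDependent = M * 32;               /* ~512 MB for AR_dependent alone      */
/* Together with AR, the temp buffers and the OpenGL pixel buffer, this quickly */
/* approaches the 1 GB of a GTX 280, so pre-allocating for large K may not fit. */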

Sometimes the parameter K forces pre-allocating more memory for AR_dependent than is available on the GPU. That is why I decided to split the array AR into several (still large) chunks and process them sequentially on the GPU. Such simple tiling should let me generate and keep all of AR and AR_dependent on the GPU without costly CPU-GPU transfers. Look at the following test:

<b> Second case </b>

int numStages = 4;

cudaMalloc((void**)&AR, (N / numStages) * sizeof(*AR));
// ... other cudaMallocs for temp buffers (now sized for N / numStages)

// ... some OpenGL map/unmap of the pixel buffer

cudaThreadSynchronize();
// timer start

for (int stage = 0; stage < numStages; stage++)
{
    int QueueSize = N / numStages;              // working-set size, <b>now it is smaller</b>
    for (int level = 0; level < 25; level++)    // levels of data scanning
    {
        // AR = AR(shift = base + stage * N / numStages); AR elements are actually
        // generated on the GPU depending on this "shift".
        process_data(AR, QueueSize);            // computes per-element storage requirements (dataCount)
        cudppScan(plan, dataOffsets, dataCount, QueueSize + 1);  // QueueSize + 1 varies between 10K and 100K
        scatter_data(AR);                       // create the compact representation according to dataOffsets & dataCount
        cudaMemcpy(&QueueSize, dataOffsets + QueueSize, sizeof(int),
                   cudaMemcpyDeviceToHost);     // update working-set size
    }
}

cudaThreadSynchronize();
// timer end, print timings

// other stages

// ... some OpenGL unmap/map of the pixel buffer

The frustrating thing is that the second case is 3x slower (22 ms) than the first case. What is happening? Why? The amount of data processed is the same in both cases, the intermediate results are correct, and the sum of the results of the second case matches the corresponding data from the first case. I first tested on Vista and thought the WDDM driver model was to blame and that it would be fine on XP. But no, the relative behavior is the same. As I described, the amount of data processed is the same in both cases, and the chunks submitted sequentially to the GPU in the second case still carry a reasonable amount of work to occupy the device.
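A quick back-of-the-envelope check on that last point (assuming one thread per element and 256-thread blocks; the real mapping inside process_data may differ):

int threadsPerBlock  = 256;                       /* assumed mapping, for illustration */
int blocksFirstCase  = 65536 / threadsPerBlock;   /* 256 blocks                        */
int blocksSecondCase = 16384 / threadsPerBlock;   /*  64 blocks                        */
/* A GTX 280 has 30 SMs, so 64 blocks still cover the device (~2 blocks per SM),  */
/* just with less work per launch to hide latency and amortize fixed costs.       */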

I don't want to believe that 25 kernel iterations processing N elements are so much better than 100 kernels processing N / 4 elements... purely due to kernel invocation overhead. But that shouldn't amount to 15 ms! Maybe the issue is inside cudppScan, as I have never looked into the details of this library (lots of code). process_data and scatter_data are my own kernels, and they are very simple.
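As a sanity check on the launch-overhead theory, a throwaway benchmark like the following (my sketch, not part of the original code) gives a rough per-launch cost to plug into the 25-vs-100 comparison:

__global__ void emptyKernel() {}   /* does nothing: isolates launch cost */

/* host side, using the same timer as above */
cudaThreadSynchronize();
// timer start
for (int i = 0; i < 1000; i++)
    emptyKernel<<<1, 32>>>();
cudaThreadSynchronize();
// timer end, print elapsed / 1000 as the per-launch cost

If the per-launch cost comes out in the usual few-to-tens-of-microseconds range, the roughly 225 extra launches of the second case (75 extra levels times 3 calls, ignoring whatever cudppScan launches internally) account for a few milliseconds at most, not 15 ms.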

Please help me! I need to increase the number of kernel invocations even further for other stuff... I would be very grateful for help and suggestions, as I have some deadlines...

It is probably the additional memcpy() calls that are hurting the overall performance, rather than the kernel invocations.
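One way to put a number on that: each level ends with a synchronous 4-byte read-back, and a synchronous cudaMemcpy cannot return before every kernel launched earlier in that level has finished, so the second case pays that round trip 100 times instead of 25. A small sketch to isolate just the copy (the buffers are the ones from the pseudocode above):

/* Measure the read-back on its own. */
cudaThreadSynchronize();                      /* drain pending kernels first */
// timer start
cudaMemcpy(&QueueSize, dataOffsets + QueueSize, sizeof(int),
           cudaMemcpyDeviceToHost);
// timer end; multiply by 100 (second case) vs 25 (first case) read-backs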

I have tested without the memcpys in the inner loop by making the data static and precomputing QueueSize for all levels and stages. Overall performance for both cases improves by 1.5x, but the relative difference between them stays the same: 2 ms vs 4 ms. I observe that decreasing the number of inner levels reduces the relative difference between the two approaches (because the number of kernel invocations and cudppScans is reduced). See the attachments. I can also say that when the overall data size (array AR) is increased, say by a factor of 10, both approaches become slower by a factor of 10, but the relative difference between them remains: 20 ms vs 40 ms.
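Since the residual gap tracks the number of cudppScan calls, timing the scan on its own at the two working-set sizes might localize it. A sketch reusing the plan and buffers from the pseudocode above (assuming the plan was created for at least N + 1 elements):

/* 100 small scans (second-case pattern) vs 25 large scans (first-case pattern).  */
/* A multi-block scan issues several kernel launches per call, so it carries a    */
/* fixed per-call cost that the chunked version pays four times as often.         */
cudaThreadSynchronize();
// timer start
for (int i = 0; i < 100; i++)
    cudppScan(plan, dataOffsets, dataCount, N / 4 + 1);
cudaThreadSynchronize();
// timer end, print total

cudaThreadSynchronize();
// timer start
for (int i = 0; i < 25; i++)
    cudppScan(plan, dataOffsets, dataCount, N + 1);
cudaThreadSynchronize();
// timer end, print total; compare the two totals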

However, the timings for these stage passes are constant from frame to frame, as we launch thousands of kernels across many frames.

What else could influence this behavior? Any suggestions?