Need suggestions for a 4D data computation project

I’m doing 4D data computing and would appreciate any suggestions for an optimal parallelism framework using CUDA. The first three dimensions represent the x, y, z coordinates; the fourth is time (about 1000 timepoints). The data size is about 900 MB. For each spatial location, I need to extract the surrounding neighbors (around 40–90) and the time series of all of those neighbors for pattern extraction. Each location can be processed independently, as can each time point.

I’m currently using a parent kernel for each spatial location and then a child kernel for the time domain of each neighborhood, but the speed gain is not as large as I expected. One reason might be the frequent memory access (the 900 MB is loaded into unified memory). The code I’m using now splits the 3D space into a series of small blocks in a for loop and runs the kernel on them one by one to avoid overwhelming the GPU. My questions are:

1. Would pinned memory be a better choice?
2. Would creating multiple streams help improve the speed by avoiding the synchronization called after each loop iteration?
3. Any suggestions for the overall framework?
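In case it helps, my current structure looks roughly like this (a simplified sketch; the dimension names, neighbor layout, and processing bodies are placeholders, not my real code):

```cuda
// Rough sketch of the current parent/child (dynamic parallelism) layout.
// nloc = number of spatial locations, nt = timepoints (~1000).

__global__ void child_kernel(const float *series, int nt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nt) return;
    // ... per-timepoint work on series[t] for the neighborhood ...
}

__global__ void parent_kernel(const float *data, int nloc, int nt)
{
    int loc = blockIdx.x * blockDim.x + threadIdx.x;
    if (loc >= nloc) return;
    // one child launch for the time series of this location's neighborhood
    // (placeholder indexing: assumes time-contiguous storage per location)
    child_kernel<<<(nt + 255) / 256, 256>>>(data + (size_t)loc * nt, nt);
}

// host side: the 3D space is split into small blocks launched one by one,
// with a synchronization after each iteration of the loop:
// for (each spatial sub-block) {
//     parent_kernel<<<blocks, threads>>>(d_data, nloc_in_block, nt);
//     cudaDeviceSynchronize();
// }
```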

Thanks

The GPU does not get “overwhelmed” when a kernel with many blocks is launched. Quite to the contrary, this is the ideal scenario.
So do not split your kernel launches.
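For example, instead of looping over sub-blocks on the host, a single launch can cover the whole volume with a grid-stride loop (a sketch; the element type and per-location work are placeholders):

```cuda
// One launch over all spatial locations. The GPU's block scheduler handles
// an arbitrarily large grid, so there is no need to split the domain.
__global__ void process_all(const float *data, float *out,
                            int nloc, int nt)
{
    // grid-stride loop: correct for any grid size vs. problem size
    for (int loc = blockIdx.x * blockDim.x + threadIdx.x;
         loc < nloc;
         loc += gridDim.x * blockDim.x) {
        // ... gather the neighbors of `loc`, walk their time series ...
    }
}

// host side: one launch instead of a for loop of launches
// int threads = 256;
// int blocks  = (nloc + threads - 1) / threads;
// process_all<<<blocks, threads>>>(d_data, d_out, nloc, nt);
```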

If you haven’t yet, arrange your grid layout to optimize memory access patterns.

My concern is that each of the >400,000 threads will invoke 800 sub-threads (in the child kernel). Would that be too many for the GPU?

By “child kernel”, are you referring to the use of dynamic parallelism?

Yes. Each parent thread invokes a separate child kernel. Because there are about 800 independent processes, about 800 sub-threads would be invoked if resources are available.

I had understood your opening post to mean you are using dynamic parallelism to avoid launching too many blocks which would “overwhelm” the GPU.
Now that we have established that the “overwhelming” scenario does not exist, I expected you to remove dynamic parallelism from the code.
Where is my misunderstanding?
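To be concrete about the alternative: since every location does the same kind of work, one flat kernel can assign each thread a (location, timepoint) pair and drop the child launches entirely (a sketch; the pattern-extraction body is a placeholder):

```cuda
// No dynamic parallelism: a 2D grid covers (spatial location, timepoint)
// pairs directly, so there are no per-location child launches and no
// device-side synchronization of children.
__global__ void process_flat(const float *data, float *out,
                             int nloc, int nt)
{
    int t   = blockIdx.x * blockDim.x + threadIdx.x;
    int loc = blockIdx.y * blockDim.y + threadIdx.y;
    if (loc >= nloc || t >= nt) return;
    // ... visit the ~40-90 neighbors of `loc` at timepoint `t` ...
}

// host side:
// dim3 threads(128, 2);
// dim3 blocks((nt + 127) / 128, (nloc + 1) / 2);
// process_flat<<<blocks, threads>>>(d_data, d_out, nloc, nt);
```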

I was saying that the code (based on dynamic parallelism) is still not as fast as I expected, and my suspicion was that once the millions of parent threads launch, there might not be enough resources for the subsequent 800 × N child threads. Also, each parent thread needs all of its child threads to finish before moving on.

What is the reason for using dynamic parallelism at all?

Just a way to check whether the code could be further accelerated. It helped, but not significantly. That is why I’m wondering whether my overall implementation framework is sub-optimal. I haven’t used more than one stream, though (which may help with the “overwhelming” issue?).

One of the basic facts about launching kernels from the device is that the launch overhead is pretty much the same as launching a kernel from the host.

Given that, the only reason I see to use dynamic parallelism is that it provides flexibility (different responses at different points in the data space). If the same response is applied to all data points (e.g. following a particular stencil pattern), dynamic parallelism isn’t needed and is likely to harm performance due to launch overhead.

What does the CUDA profiler indicate as the most prominent bottleneck in your code? I am guessing it would be memory access.