Need suggestions for a 4D data computation project

I’m doing 4D data computing and would appreciate any suggestions for an optimal parallelism framework using CUDA. The first three dimensions represent the x, y, z coordinates; the fourth is time (about 1000 timepoints). The data size is about 900 MB. For each spatial location, I need to extract the surrounding neighbors (around 40–90) and the time series of all of those neighbors for pattern extraction. Each location can be processed independently, as can each time point.

I’m currently using a parent kernel for each spatial location and then a child kernel for the time domain of each neighborhood, but the speed gain is not as large as I expected. One reason might be the frequent memory access (the 900 MB is loaded into unified memory). The code I’m using now splits the 3D space into a series of small blocks in a for loop and runs the kernel on them one by one to avoid overwhelming the GPU. My questions are:

1. Would pinned memory be a better choice?
2. Would creating multiple streams help improve the speed by avoiding the synchronization called after each loop iteration?
3. Any suggestions for the overall framework?
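In case it helps, my current structure looks roughly like this (a simplified sketch; the dimension names, neighbor layout, and processing bodies are placeholders, not my real code):

```cuda
// Rough sketch of the current parent/child (dynamic parallelism) layout.
// nloc = number of spatial locations, nt = timepoints (~1000).

__global__ void child_kernel(const float *series, int nt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nt) return;
    // ... per-timepoint work on series[t] for the neighborhood ...
}

__global__ void parent_kernel(const float *data, int nloc, int nt)
{
    int loc = blockIdx.x * blockDim.x + threadIdx.x;
    if (loc >= nloc) return;
    // one child launch for the time series of this location's neighborhood
    // (placeholder indexing: assumes time-contiguous storage per location)
    child_kernel<<<(nt + 255) / 256, 256>>>(data + (size_t)loc * nt, nt);
}

// host side: the 3D space is split into small blocks launched one by one,
// with a synchronization after each iteration of the loop:
// for (each spatial sub-block) {
//     parent_kernel<<<blocks, threads>>>(d_data, nloc_in_block, nt);
//     cudaDeviceSynchronize();
// }
```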

Thanks

The GPU does not get “overwhelmed” when a kernel with many blocks is launched. Quite to the contrary, this is the ideal scenario.
So do not split your kernel launches.
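For example, instead of looping over sub-blocks on the host, a single launch can cover the whole volume with a grid-stride loop (a sketch; the element type and per-location work are placeholders):

```cuda
// One launch over all spatial locations. The GPU's block scheduler handles
// an arbitrarily large grid, so there is no need to split the domain.
__global__ void process_all(const float *data, float *out,
                            int nloc, int nt)
{
    // grid-stride loop: correct for any grid size vs. problem size
    for (int loc = blockIdx.x * blockDim.x + threadIdx.x;
         loc < nloc;
         loc += gridDim.x * blockDim.x) {
        // ... gather the neighbors of `loc`, walk their time series ...
    }
}

// host side: one launch instead of a for loop of launches
// int threads = 256;
// int blocks  = (nloc + threads - 1) / threads;
// process_all<<<blocks, threads>>>(d_data, d_out, nloc, nt);
```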

If you haven’t yet, arrange your grid layout to optimize memory access patterns.

My concern is that each of the >400,000 threads will invoke 800 sub-threads (in the child kernel). Would that be too many for the GPU?

By “child kernel”, are you referring to the use of dynamic parallelism?

Yes. Each parent thread invokes a separate child kernel. Because there are about 800 independent processes, about 800 sub-threads would be invoked if resources are available.

I had understood your opening post to mean you are using dynamic parallelism to avoid launching too many blocks which would “overwhelm” the GPU.
Now that we have established that the “overwhelming” scenario does not exist, I expected you to remove dynamic parallelism from the code.
Where is my misunderstanding?
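To be concrete about the alternative: since every location does the same kind of work, one flat kernel can assign each thread a (location, timepoint) pair and drop the child launches entirely (a sketch; the pattern-extraction body is a placeholder):

```cuda
// No dynamic parallelism: a 2D grid covers (spatial location, timepoint)
// pairs directly, so there are no per-location child launches and no
// device-side synchronization of children.
__global__ void process_flat(const float *data, float *out,
                             int nloc, int nt)
{
    int t   = blockIdx.x * blockDim.x + threadIdx.x;
    int loc = blockIdx.y * blockDim.y + threadIdx.y;
    if (loc >= nloc || t >= nt) return;
    // ... visit the ~40-90 neighbors of `loc` at timepoint `t` ...
}

// host side:
// dim3 threads(128, 2);
// dim3 blocks((nt + 127) / 128, (nloc + 1) / 2);
// process_flat<<<blocks, threads>>>(d_data, d_out, nloc, nt);
```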

I was saying that the code (based on dynamic parallelism) is still not as fast as I expected, and my suspicion was that once the millions of parent threads launch, there might not be enough resources for the subsequent 800 × N child threads. Also, each parent thread needs all of its child threads to finish before moving on.

What is the reason for using dynamic parallelism at all?

Just a way to check whether the code could be further accelerated. It helped, but not significantly. That is why I’m wondering whether my overall implementation framework is sub-optimal. I haven’t used more than one stream, though (which may help with the “overwhelming” issue?).

One of the basic facts about launching kernels from the device is that the launch overhead is pretty much the same as launching a kernel from the host.

Given that, the only reason I see to use dynamic parallelism is that it provides flexibility (different responses at different points in the data space). If the same response is applied to all data points (e.g. following a particular stencil pattern), dynamic parallelism isn’t needed and is likely to harm performance due to launch overhead.

What does the CUDA profiler indicate as the most prominent bottleneck in your code? I am guessing it would be memory access.