Decomposition of algorithm into CUDA blocks Looking to accelerate a custom mathematical formula

Caanon · September 12, 2011, 7:57am

Hello everyone

Disclaimer: I’m just getting started with CUDA, so if there’s a more appropriate forum for this question please feel free to move it.

I’ve got an algorithm I’ve designed and implemented in Matlab that does a type of windowed cross covariance measurement. Profiling the code has led me to optimize the loops/matrix multiplications, apply parallel computing toolboxes, use built-in multithreaded functions like bsxfun, etc. But even still, it takes ~60 hours per dataset to run and it needs to be run on hundreds of datasets. The algorithm itself is extremely parallelizable, so I figured what better place to start than on a GPU. I tried using conventionally built libraries like Jacket (CUDA-accelerated Matlab library), but they’re optimized for computing very large datasets a few tiems, not a small sample computed over and over again. So, I decided to dive in and try my hand at implementing the algorithm in CUDA.

I’ve wrapped my head around the concept of threads and blocks, looked into warps and their implementation in the thread scheduler, discovered the advantages of shared memory and built my first few kernels successfully. But I’m having a hard time picking out how to block my data appropriately to take advantage of the hardware while keeping in mind its limitations. The datasets I’m working with are variable length and definitely not power of 2s and most likely not easily divisble by the warp count of 48.

For instance: In one dataset, I take two signals of length 389475 and do a short-time cross covariance measure, resulting in a matrix that’s 401x389475. Neither of these dimensions fit nicely into any easily divisible power of 2. I’m working with a 2.x processor, so I can create 389475 blocks of length 401, but that’s not evenly divisible by the warp count, and if I happen to need longer cross-covariance measures (i.e. 501xN or 801xN), then it won’t fit into a single block. I realize the answer will probably vary significantly from algorithm to algorithm, but in this example would it generally be more efficient to have blocks of size 96 threads (leaving the 5th block with only 17/98 threads active) or have blocks of size 256, with one block full and one using only 145 of 256 threads?

In short, I’m not sure how to approach blocking datasets of non-multiples of threadcount, SM processor count, max thread limit, etc. Any pointers, guidance, or white papers that talk about decomposing strange dataset sizes?

tera · September 12, 2011, 10:05am

Indeed blocksize affects performance in many non-obvious ways. The common approach is to parametrize the kernel and benchmark different sizes.

melonakos · September 17, 2011, 3:49pm

Caanon, Jacket’s GFOR loop is exactly designed to handle batches of many small problems. Have you looked at that? Email me and we can get you on the right path.

-John

Topic		Replies	Views
Here are my timing results, not impressive. Help. CUDA Programming and Performance	5	7080	January 30, 2008
Designing a CUDA algo question Sort of a newbie question.... CUDA Programming and Performance	2	2397	December 9, 2011
thread / block allocation in function of data size CUDA Programming and Performance	5	4355	November 9, 2009
Scheduling thread blocks adaptively How to distribute thread blocks to arbitrary dimensions CUDA Programming and Performance	1	2576	March 25, 2010
parallelizing CUDA Programming and Performance	5	5399	March 13, 2007
Block size and grid size CUDA Programming and Performance	5	8480	April 27, 2009
using vectors in GPU kernel CUDA Programming and Performance	3	2579	March 24, 2007
Newbie help on thread blocks CUDA Programming and Performance	22	10881	December 24, 2008
Block size's effect on program performance, why does my program run faster at seemingly random sizes? CUDA Programming and Performance	5	4138	January 2, 2017
Grids and Threads question CUDA Programming and Performance	2	4475	August 7, 2007

Decomposition of algorithm into CUDA blocks Looking to accelerate a custom mathematical formula

Related topics