Decomposition of algorithm into CUDA blocks: looking to accelerate a custom mathematical formula

Hello everyone

Disclaimer: I’m just getting started with CUDA, so if there’s a more appropriate forum for this question please feel free to move it.

I’ve got an algorithm I designed and implemented in Matlab that does a type of windowed cross-covariance measurement. Profiling led me to optimize the loops and matrix multiplications, apply the Parallel Computing Toolbox, use built-in multithreaded functions like bsxfun, and so on. Even so, it takes ~60 hours per dataset, and it needs to run on hundreds of datasets. The algorithm itself is extremely parallelizable, so a GPU seemed like the natural next step. I tried off-the-shelf libraries like Jacket (a CUDA-accelerated Matlab library), but they’re optimized for computing on very large datasets a few times, not a small sample computed over and over again. So I decided to dive in and try my hand at implementing the algorithm in CUDA.

I’ve wrapped my head around the concept of threads and blocks, looked into warps and how the thread scheduler handles them, discovered the advantages of shared memory, and built my first few kernels successfully. But I’m having a hard time working out how to block my data to take advantage of the hardware while keeping its limitations in mind. The datasets I’m working with are variable length, definitely not powers of 2, and most likely not evenly divisible by the warp size of 32.

For instance: in one dataset, I take two signals of length 389475 and do a short-time cross-covariance measure, resulting in a matrix that’s 401x389475. Neither dimension is a power of 2 or otherwise convenient to tile. I’m working with a compute capability 2.x device, so I can create 389475 blocks of 401 threads each, but 401 isn’t evenly divisible by the warp size, and if I happen to need longer cross-covariance measures (i.e. 501xN or 801xN), the row won’t fit into a single block. I realize the answer will probably vary significantly from algorithm to algorithm, but in this example would it generally be more efficient to have blocks of 96 threads (leaving the fifth block with only 17 of 96 threads active) or blocks of 256 threads, with one block full and the other using only 145 of its 256 threads?
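For what it’s worth, a common pattern that sidesteps the divisibility question entirely is a bounds guard plus a stride loop, so the block size never has to divide the data size. Below is a hypothetical sketch, not the actual cross-covariance measure (no windowing, mean removal, or summation; `lagProduct` and its parameters are invented for illustration): one block per lag row, with threads striding across the signal.

```cuda
// Hypothetical sketch: computes only the per-lag elementwise products,
// not the full windowed cross-covariance. The point is the indexing:
// the bounds checks let n and blockDim.x be any sizes at all.
__global__ void lagProduct(const float *x, const float *y,
                           float *out, int n, int nLags)
{
    int lag = blockIdx.x;                 // one block per lag, e.g. gridDim.x == 401
    if (lag >= nLags) return;

    // Stride loop: each thread handles elements i, i+blockDim.x, i+2*blockDim.x, ...
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int j = i + lag;
        // Guard the ragged tail: out-of-range pairs are just zero-padded.
        out[lag * n + i] = (j < n) ? x[i] * y[j] : 0.0f;
    }
}
```

With this shape, a 501xN or 801xN measure needs no restructuring: only `nLags` (and hence the grid size) changes, and partially full warps at the tail simply do no work.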

In short, I’m not sure how to approach blocking datasets whose sizes aren’t multiples of the warp size, SM count, maximum threads per block, etc. Any pointers, guidance, or white papers that talk about decomposing awkward dataset sizes?

Indeed, block size affects performance in many non-obvious ways. The common approach is to parametrize the kernel and benchmark different block sizes.
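To make that concrete, here is a minimal sketch of the parametrized approach (the kernel body is a placeholder, not the poster's algorithm): the block size is a runtime argument, the grid size comes from ceiling division, and an in-kernel guard absorbs the ragged last block, so any block size can be benchmarked against any data length.

```cuda
// Placeholder kernel: the real work would go where the doubling is.
__global__ void process(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                  // threads past the end simply do nothing
        data[idx] *= 2.0f;
}

// blockSize is a tunable parameter; sweep it (e.g. 96, 128, 192, 256, ...)
// and keep whichever value times fastest on the actual hardware.
void launch(float *d_data, int n, int blockSize)
{
    int gridSize = (n + blockSize - 1) / blockSize;   // ceiling division
    process<<<gridSize, blockSize>>>(d_data, n);
}
```

For n = 389475 and blockSize = 96 this launches 4058 blocks, of which only the last is partially used; the cost of a few idle threads is usually negligible next to the win from picking a block size that keeps the SMs fully occupied.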

Caanon, Jacket’s GFOR loop is designed exactly for handling batches of many small problems. Have you looked at that? Email me and we can get you on the right path.