For my first real CUDA attempt, I would like to convert some Mathematica code using ListConvolve to CUDA.
[*]I have a 2D 8x256 kernel and would like to convolve it with a 9000x256 ‘movie’.
[*]The result of the convolution is a real vector of length 9000-8+1=8993, so no overhangs in the convolution.
[*]The movie will be fixed throughout but there will be batches of 50 kernels that will need to be convolved with the movie on each iteration, which are steps in a quasi-newton optimization with the kernel elements as parameters.
[*]The objective being optimized is a function of the sum of the 8993 elements in each convolution, so I have to be careful of accumulated errors.
Is this a problem that is a good candidate for CUDA?
I am adapting the sample convolutionFFT2D code to do this, though I don’t understand parts of the code (such as binding arrays to texture references) and am not sure if it’s all necessary.
I was hoping for some pointers and advice on implementing this. So far, I’ve tried some of the cuFFT functions from Mathematica but am seeing no performance improvements (possibly due to inept compile settings).
There are times it seems it’s not such a good idea to use FFTs in doing convolutions, such as when the kernel is just one element. What other convolution algorithms are used depending on argument sizes?
I adapted the simpleCUFFT (a 1D convolution) sample to Mathematica’s mathlink protocol. Even excluding the overhead of this protocol, the CUDA convolution is at least several times slower than the builtin Mathematica ListConvolve function on a MacBook Pro with a 9600M GT with 512mb memory.
I’ve attached the sample (mac os build and device is hard coded to 0 in main()) if anyone wants to try it.
Not sure what’s wrong. I get similary poor perfomance for a 1d FFT.
Does anyone know any resources on the web related to Mathematica and CUDA? Unfortunately, a google search only gives as results the very misleading recent news release by NVidia that Mathematica 7 supports CUDA natively.
Absolutely this can be done in CUDA, though the exact best strategy will require some experiments.
In particular, you may want to change your “view” of the problem as a 2D convolution… though this may depend on the convolution kernel itself.
But notice, horizontally, the 256 wide convolution is really a dot product since you don’t care about all the columns, just the center one.
If the convolution is seperable, you’d probably use two passes and create a 9000 long 1D array of the dot product results, then run a 1D convolution on that 1D array.
But now I’m also confused by your final output. You say “The objective being optimized is a function of the sum of the 8993 elements in each convolution” so it’s really just some function of the sum of the convolved array? But wouldn’t that just be the product of the trace of the data times the trace of the kernel? There’s no convolution needed at all. There’s boundary effects to worry about though I guess.
I probably don’t understand your problem enough.
Yes you are right, I am not really doing a 2D convolution and it doesn’t make sense to go the FFT route. I should be performing dot products as you say. I wonder if the reason Mathematica’s ListConvolve function is about 10 times faster than Matlab’s conv2 and scipy’s numpy.convolution for this case is that Mathematica recognizes the problem shouldn’t use FFT’s whereas the others don’t check the arguments as carefully.
I don’t know enough about CUDA to implement the dot product method, so any direction is appreciated. I guess I should look at the BLAS examples.
The convolution is not separable, but I could force that to be the case in the optimization at the cost that it is no longer convex, but thanks for pointing out how to implement it. I’ll think about it.
Regarding your point on product of traces, what I said was incorrect, the objective is not exactly the sum of the convolution but a sum over a function of the elements of the convolution. I was just wondering if single precision error becomes a big problem without compensated summation.
Ok, just quick brainstorming, but this probably requires a little more analysis.
I’d set up each of your convolution kernels as its own block. 50 blocks is a low though, but for a first pass design it’s a good place to start.
Now you have one block with a constant convolution kernel. Use 256 threads. (again this can be done with variations, but 256 is nice…).
Load the kernel’s 8*256 tap values… but load them so each thread holds those values in 8 registers.
The strategy is to load in the next “row” of your “movie” and save it into shared memory.
At any time, shared memory holds 8*256=2048 values. (Unfortunately this is just a tiny bit too big to run two blocks at once, but this may not be a
big deal.)
Your block’s main convolve loop will do:
Load the next row of movie data into shared memory
Each thread does 8 multiplies of it’s “column” of data with its stored convolution weights. These are summed and written to shared memory.
We now have 256 values in shared memory which have to be summed. We use reduction to sum the 256 values. This is an OUTPUT VALUE!
go back to step 1 until the whole movie has been computed.
There’s LOTS of small details, ranging from doing prefetching of the next movie row while you’re doing the computes, to doing the reduction per-warp with only a single syncthreads. If your final output compute itself is expensive, you may even make an "extra’ warp that is used JUST for doing reduction and your final output compute while the other warps are busy with the load/convolve.
You’ll probably be pretty close to a balance between bandwidth, compute, and instruction bottlenecks… you’re doing 8 multiplies and sums for each value loaded… it should be pretty efficient! The low block count will likely be a limit since ideally you want more like 300 blocks, not 50. You could optimize by breaking your 9000 long movie into two movies of 4508 (note the small overlap), or even 5-10 short movies, and combining the results, though that’s extra complexity.
This isn’t a minor amount of coding, though it’s pretty straightforward.
It will take a while for me to implement your idea. I have been looking up many of the terms you use in your suggestion, registers and shared memory vs the device memory and warps and syncthreads. I get the basic idea of how to split each kernel into blocks and the convolution into threads. But I need a better understanding of CUDA. The examples are excellent and the documentation is pretty clear if you read it several times. I think I will go through more of the CUDA samples before I do anything.