Optimization of kernel for batch convolution of many small matrices

First of all, please note: I am not asking for bug fixes.
I am new to CUDA programming (not a very good coder as it is), and I only wrote this code because I’m in desperate need of a fast code to convolve many small matrices with a few convolution masks.

All I ask for are suggestions on what changes I can make to my code to make it faster; it's a matter of approach. I assume my code is full of problems, which I can solve on my own. I am here to ask for help from those who have experience and know how to solve a problem in a manner suitable for the GPU.
If you have a different, faster approach to suggest, please do.

//matrix to process
    __device__ inline int mat2proc()
    {
        return blockIdx.y*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
    }

    __device__ inline int place(int k, int s, int i) //k is an index in the big matrix; s is the size of the small matrix, i is an index in the small matrix
    { //this calculates the index in the result matrix where we should add the product
     //of an element in line/row k of the big matrix and line/row i of the small matrix (size s)
        return k - i + s - 1; //full-convolution index, always in range [0, s1+s2-2]
    }

__global__ void batchConv (float *res, const float *mats, const float *masks, int s1,
    int s2, int matCount, int maskCount)
    //res = result matrix. mats = 3D array with matCount matrices of size s1xs1
    //masks = convolution masks. there are maskCount of them, all of size s2xs2
{
    extern __shared__ float mask [];
    int ResSize = s1+s2-1; //size of the convolution result
    int mNum = blockIdx.x; //the number of x blocks is the number of convolution masks
    float *tmpRes; //per-thread scratch of ResSize*ResSize floats; must point to allocated, zeroed memory

    if (threadIdx.x < s2 && threadIdx.y < s2)
        mask[threadIdx.x*s2 + threadIdx.y] = masks[mNum*s2*s2 + threadIdx.x*s2 + threadIdx.y]; //copy the mask to shared memory
    __syncthreads(); //all threads in the block share the same mask, so wait until it is fully loaded

    if (mat2proc() < matCount)
    {
        for (int k=0; k<s1; k++) //iterate over lines
            for (int l=0; l<s1; l++) //iterate over columns
            {
                float tmpVal = mats[s1*s1*mat2proc() + s1*k + l];
                for (int i=0; i<s2; i++) //iterate over lines
                    for (int j=0; j<s2; j++) //iterate over columns
                        tmpRes[ResSize*place(k,s2,i) + place(l,s2,j)] += tmpVal*mask[i*s2 + j];
            }

        for (int i=0; i<ResSize; i++)
            for (int j=0; j<ResSize; j++)
                res[ResSize*ResSize*mat2proc() + ResSize*i + j] = tmpRes[ResSize*i + j];
    }
}


Does this code work? I think I spotted cudaMalloc() and cudaMemset() calls inside the kernel, which is probably not what you want (if it even works at all).

If you have only a few matrix sizes to support, consider templating the kernel, making the matrix dimensions template parameters. The introduction of compile-time constants in this fashion will allow increased optimization by the compiler.

Use ‘const’ and ‘__restrict__’ on the pointer arguments to the kernel, as appropriate. See also the Best Practices Guide.
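Applied to the kernel above, that suggestion would look something like this (a sketch; `__restrict__` promises the compiler that the three pointers never alias one another, which enables more aggressive load scheduling):

```cuda
//same signature as above, with const and __restrict__ added
__global__ void batchConv(float * __restrict__ res,
                          const float * __restrict__ mats,
                          const float * __restrict__ masks,
                          int s1, int s2, int matCount, int maskCount);
```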

The code most likely does not work, but neither does my compiler, so I can't fix the problems I don't know about for now…
I was advised to use cudaMemset() in the kernel… oh well, guess I'll get rid of that

Thank you for your suggestions. I will learn more about templating and apply your suggestions tomorrow.

Meanwhile, any more suggestions about the algorithm I used for convolution?
I wrote it to require reading from global memory only once per element of the matrix with which we convolve, to cut down on memory calls. I haven't been able to find any implementations/suggestions for batch convolution of small matrices, only FFT-based methods for big matrices…

Okay, another question: how much of this code do you deem usable and efficient enough? Ignore the memory allocation mistake. I will learn how to fix it…

As njuffa said, making s1 and s2 compile-time constants will give a big speedup. The compiler can unroll all the loops (use “#pragma unroll” to help), pre-calculate most/all of the array offsets, and use more registers (if the matrix is small enough, it can put the entire thing into registers, which is VERY fast). In its current state, it is probably spending half the processing resources on calculating array offsets.

It’s very helpful to inspect the PTX file to see if it’s actually unrolling (the flag, I believe, is --keep). Optimally, it will mostly be FMA instructions between registers (float multiply-add). You may need #pragma unroll in front of every for loop if you want to get everything into registers. Example: a 5x5 matrix can be stored in 25 registers (out of 63 on most chips). If you want a 7x7 matrix, though, that’s getting a bit large since it needs scratch space. So it’s really important to know the matrix size to know the best approach.

The shared memory approach looks fine and is the most widely supported.

Global reads might be slowing things down since they don’t look coalesced. This should be helpful: http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf.

Thank you. The matrix sizes are known at compile time, though I plan on using the code for MATLAB (I'm not sure how exactly yet). I guess I'll pre-compile it for the small matrices (constant s2) and create a file for s2 = 2,…,9. As for the larger matrix… that's already too much.

To coalesce the memory accesses, I have to make sure all the 1st elements of the matrices I'm reading are stored together, then all the 2nd elements, etc., so that if threads i through i+31 take care of the 1st element of each matrix, they can read all of the 1st elements at once. Right?

As for njuffa's saying that I shouldn't use cudaMalloc() and cudaMemset() inside the kernel, should I do something like this instead?

__global__ void mallocTest()
{
    __shared__ int* data;
    if (threadIdx.x == 0) {
        size_t size = blockDim.x * 64;
        data = (int*)malloc(size);
        memset(data, 0, size);
    }
    __syncthreads(); //make the pointer visible to all threads before use
    // ... use data ...
    if (threadIdx.x == 0)
        free(data);
}
(Copied from nVidia guide)
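Or would it be better to allocate the scratch buffer from the host before launching? Something like this (my own sketch; names like tmpAll are made up):

```cuda
//allocate one big scratch buffer from the host with cudaMalloc,
//instead of calling malloc inside the kernel
int ResSize = s1 + s2 - 1;
float *tmpAll = 0; //one ResSize x ResSize scratch area per matrix
size_t bytes = (size_t)matCount * ResSize * ResSize * sizeof(float);
cudaMalloc((void**)&tmpAll, bytes);
cudaMemset(tmpAll, 0, bytes); //zeroed once from the host, not in the kernel
//pass tmpAll to the kernel as an extra float* argument; each thread then
//uses tmpAll + mat2proc()*ResSize*ResSize as its tmpRes
cudaFree(tmpAll); //after the kernel and the copy back are done
```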
I am finding out that I know less about programming than I thought…

Anyway, do you think the way I do the convolution itself is efficient?
With what little knowledge I have, I figured that since I have one matrix which needs to be used by all threads within a block, and many other matrices which are each only used once, I should copy the first matrix into the shared memory, and reduce the number of times I access the other matrices as much as possible.

Would you say the approach itself is an efficient one?