 # Performance of memory bound calculations

Hey there,
I have about 10,000 user-defined, arbitrary scalar-valued functions with vector inputs (each input vector has about 10,000 elements), and I want to evaluate all of them while maintaining good performance. The input vector is the same for all functions, so my idea for performance was to keep it in global memory. I tried that already, but the performance is so bad that I'm better off doing it on the CPU :(
Any ideas how to manage such a problem on the GPU?
thanks!

That is a rather unusual-sounding problem. Do I understand correctly that you actually have 10,000 different device functions or kernels to run on the same set of inputs?

Yes, that's correct. I have something like this:

``````
__device__ double myFunc(int function_index)
{
    // PNT is the shared input array in global memory
    switch (function_index)
    {
    case 1:
        return PNT[1]*PNT[1]*PNT[1]*PNT[1];
    case 2:
        return PNT[2]*PNT[2];
    case 3:
        return PNT[3]*PNT[3]*PNT[3]*PNT[3];
    default:
        return 0.0; // unknown function index
    }
}
``````

In this case I would have three different functions, and I pass an index to ‘myFunc()’ to select which function I want to evaluate. It could also look like this:

``````
__device__ double myFunc(int idxFunc)
{
    // example functions: f[i] = x[i]^2
    return PNT[idxFunc]*PNT[idxFunc];
}
``````

So when doing

``````myFunc(2)
``````

it actually returns the second coordinate of the array PNT, squared. The body of ‘myFunc’ is completely user-defined, so I don’t know in advance what the user implements there; I just need to evaluate every one of those functions (there could be 10,000 different ones) for the same input ‘vector’ (here managed as an ‘array’) PNT.
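If the user-supplied functions can be registered up front, one common alternative to a giant switch is a table of function pointers indexed the same way. Here is a minimal host-side C++ sketch of that idea (the names `Func`, `table`, and the example function bodies are illustrative, not code from this thread):

```cpp
// PNT stands in for the shared input vector; a small example here.
static double PNT[4] = {0.0, 2.0, 3.0, 4.0};

// Each user function is scalar-valued and reads the shared input.
using Func = double (*)(const double*);

static double f1(const double* p) { return p[1] * p[1] * p[1] * p[1]; } // x1^4
static double f2(const double* p) { return p[2] * p[2]; }               // x2^2
static double f3(const double* p) { return p[3] * p[3] * p[3] * p[3]; } // x3^4

// The index-to-function table replaces the switch in myFunc().
static Func table[] = {f1, f2, f3};

double myFunc(int idx) { return table[idx](PNT); }
```

On the GPU the same idiom exists as tables of `__device__` function pointers, but note that it does not by itself fix the divergence problem: neighbouring threads calling different functions still diverge.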

thanks!

That is going to be a very poor fit for the CUDA programming model and GPU architecture generally.

Functions modelled after the first one you posted will produce a lot of warp-level divergence if there is a lot of “scatter” in the input indices (on top of the fact that there are many conditionals and apparently few FLOPs or IOPs in the function itself), and with completely unpredictable memory access patterns thrown in as well, the code will likely be very slow. The input sizes you are working with are also pretty trivial: 10,000 scalar function evaluations offer few total FLOPs for the GPU to work on, and doing that 10,000 times where each function is different offers almost no parallelism that the GPU can easily leverage.

Are you sure there isn’t another way you could implement this?

Yeah, those are exactly the problems I was already thinking about :( But I just don’t know any other way of letting the user define thousands of different functions (mathematical functions like f = x^2*y^2) and numerically calculating the derivative of those functions at a point with thousands of coordinates. Therefore I absolutely need to evaluate every single function at this point via (f(PNT+h)-f(PNT))/h :( Do you have any ideas?

The obvious answer is: don’t use the GPU, or perhaps use both the GPU and CPU simultaneously and split the work across both. I am a little puzzled by one thing you said, though.

What does “a point with thousands of coordinates” mean in this context? Do you mean that the function is evaluated in some sort of vector space? And the derivative calculation is yielding a Jacobian matrix for the function about a given vector input?

Yes, you are right: I want to calculate the Jacobian matrix for a certain number of functions, which I implemented like this:

``````
__device__ double myFunc(int function_index)
{
    // PNT is the shared input array in global memory
    switch (function_index)
    {
    case 1:
        return PNT[1]*PNT[1]*PNT[1]*PNT[1];
    case 2:
        return PNT[2]*PNT[2];
    case 3:
        return PNT[3]*PNT[3]*PNT[3]*PNT[3];
    default:
        return 0.0; // unknown function index
    }
}
``````

representing a vector-valued function split into several scalar-valued functions. And yes, the ‘point’ represents a ‘vector’ with many, many coordinates (as in maths).

Thanks so far!