Is it possible to estimate the performance? 8500 GT (current) -> 9800xxx = ?

My current graphics card is for experiments only: it is an 8500 GT (16 stream processors, 128-bit memory bus; moreover, it sits in the second PCI-e slot, which is slower, etc.).

I’ve implemented a kernel for my needs that actually works (and, I hope, not too slowly - it does not diverge, it uses coalesced memory access, it works with shared memory, etc.). As my kernel uses a lot of shared memory per thread (for stacks), I can run only about 32-42 threads per block. No matter how many test data sets I provide, the pure kernel run time (without malloc/memcpy and other overhead) is always longer than the time of the same calculations on the CPU (including CPU malloc and other service work).
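For reference, a rough back-of-the-envelope sketch of where that limit comes from (assuming 16 KB of shared memory per multiprocessor and the ~90-entry float stack per thread mentioned further down; the exact numbers depend on the real layout):

#include <stdio.h>

int main(void)
{
    const int shared_mem_bytes = 16 * 1024;                           // per multiprocessor (compute 1.1)
    const int stack_depth      = 90;                                  // worst-case expression depth
    const int bytes_per_thread = stack_depth * (int)sizeof(float);    // 360 bytes per stack

    // Upper bound on threads per block if every thread keeps its own stack in shared memory
    printf("max threads per block ~ %d\n", shared_mem_bytes / bytes_per_thread);   // ~45
    return 0;
}

Static shared memory and alignment push the practical figure a bit below that, which is consistent with the 32-42 threads per block I see.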

When the workload is not huge (about 100,000 test cases) the GPU kernel is about 2.5 times slower than the CPU (Athlon X2 4800+, calculations done in a single thread for correct timing). When it becomes really huge (about 10 million test cases) the CPU is only about 20% faster than the GPU, but it is still faster. However, the typical workload for my tasks is about 10,000-100,000 test cases.

What I’m trying to figure out is an estimate of my GPU kernel’s performance on modern NVIDIA hardware. The specs of the 9800-series cards are known, but I’m not sure whether the kernel run time, compared to the 8500, will improve linearly, better than linearly, or otherwise … The main problem (I guess) is the low number of threads per block in my case, which is limited by the amount of shared memory.

I have to decide whether GPUs can speed my tasks up … maybe somebody has faced the same choice at some point; any opinion would be greatly appreciated.

Thanks in advance.

First, there is a huge performance difference between the 8500 GT and the 9800 GTX. I can’t give you numbers, but it should be a factor of well over 2x; still, it depends on your implementation.

But … why do you need so much shared memory? Maybe you can do it in a few loops: load a chunk into shared memory, work with it, and then load the next chunk. If you have enough threads and blocks, then while one is loading the others will do their calculations. Occupancy isn’t the most decisive factor in performance, but in your case it is so low that it might be.
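A minimal sketch of that chunking pattern, just to illustrate the idea (g_in, g_out, and the placeholder work are made-up names, not your code):

// Each block stages blockDim.x elements at a time into shared memory,
// works on them, then moves on to the next piece (grid-stride loop).
__global__ void chunked_kernel(const float *g_in, float *g_out, int n)
{
    extern __shared__ float s_buf[];   // blockDim.x floats, sized at launch

    for (int base = blockIdx.x * blockDim.x; base < n; base += gridDim.x * blockDim.x) {
        int i = base + threadIdx.x;

        if (i < n)
            s_buf[threadIdx.x] = g_in[i];      // coalesced load of one chunk
        __syncthreads();

        // ... real work on s_buf[] goes here; the placeholder just copies it back ...
        if (i < n)
            g_out[i] = s_buf[threadIdx.x];

        __syncthreads();                       // before the next chunk overwrites s_buf
    }
}

Launched with something like chunked_kernel<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);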

Cheers

Romant

I think it’s most likely your CUDA implementation that is at fault here, OR your application does NOT lend itself to data-parallel processing.

My task is simple - I have a number of expressions in reverse Polish notation (postfix notation), and each must be evaluated on a set of test cases. For example:

  1. Expression: X X + X X / * (this means (X + X) * (X / X)).

  2. Test cases: 10, 20, 30.

  3. The expression must be parsed; three results (one per test case) are expected.

My kernel receives an expression and a set of test cases and evaluates the expression for each test case in parallel. The number of test cases is about 1000 … 50000.

An expression can be very complicated and deep - that’s why I need a stack with a large maximum depth (<= 90); the stack size and the shared memory size limit the number of threads per block.
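To make the idea concrete, here is a simplified sketch of this kind of per-thread stack evaluation (the token encoding and all names are illustrative assumptions, not my actual code):

#define MAX_DEPTH 90   // worst-case expression depth

// Hypothetical token encoding: 0 = push X, 1 = add, 2 = multiply, 3 = divide.
// One thread evaluates the whole postfix expression for its own test case, so
// every thread in a warp takes the same branch for a given token (no divergence).
__global__ void eval_rpn(const int *tokens, int n_tokens,
                         const float *testcases, float *results, int n_cases)
{
    extern __shared__ float stacks[];                    // blockDim.x * MAX_DEPTH floats
    float *stack = &stacks[threadIdx.x * MAX_DEPTH];     // this thread's private stack

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_cases)
        return;

    float x = testcases[tid];
    int sp = 0;                                          // stack pointer

    for (int i = 0; i < n_tokens; ++i) {
        switch (tokens[i]) {
            case 0: stack[sp++] = x;                  break;   // operand X
            case 1: --sp; stack[sp - 1] += stack[sp]; break;   // +
            case 2: --sp; stack[sp - 1] *= stack[sp]; break;   // *
            case 3: --sp; stack[sp - 1] /= stack[sp]; break;   // /
            // ... other operators
        }
    }
    results[tid] = stack[0];                             // a single value remains
}

The per-block shared memory is then blockDim.x * MAX_DEPTH * sizeof(float), which is exactly what limits the number of threads per block.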

I don’t know if that’s a bad algorithm or a bad task for the GPU …

So the bulk of your time is spent doing calculations and reading/writing shared memory? In that case, the performance increase probably scales like (clock rate * # of stream processors).

How long does each kernel call take? The other possible bottleneck is the overhead of starting the kernel if each call is very short. That’s unlikely to get faster with a newer card.
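If it helps, the kernel alone can be timed with CUDA events, something like this (the launch itself is a placeholder):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block, smem_bytes>>>(/* arguments */);   // placeholder launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                 // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // milliseconds; excludes malloc/memcpy time

cudaEventDestroy(start);
cudaEventDestroy(stop);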

Hmm, replied in the other thread as well, but I have an idea:

Can you not change your grid & block dimensions & shared memory size according to the size of the expression? I guess you also know the size of the expression on the CPU.

__global__ void my_kernel(expression, testcases)
{
    extern __shared__ float stack[];    // size given by the third launch parameter
    .....
}

my_kernel<<<gridSize, blockSize, shared_mem_size>>>(expression, testcases);
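The shared_mem_size itself could be computed on the host per expression, for example like this (compute_stack_depth is a hypothetical helper, and the float stack entries are an assumption):

// Depth actually needed by this particular expression, found while parsing it on the CPU
int required_depth = compute_stack_depth(expression);     // hypothetical helper

int blockSize = 64;                                        // can also be chosen per expression
size_t shared_mem_size = (size_t)blockSize * required_depth * sizeof(float);

and then shared_mem_size is passed as the third launch parameter, as above.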

Otherwise, maybe you can use a templated kernel function, as in the reduction example.
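A rough sketch of that template route, by analogy with the reduction sample (the depth buckets, function names, and parameter types here are illustrative assumptions, not working code from this thread):

// The stack depth becomes a compile-time constant, like blockSize in the reduction sample.
template <int STACK_DEPTH>
__global__ void my_kernel_t(const int *expression, const float *testcases, float *results)
{
    extern __shared__ float stacks[];
    float *stack = &stacks[threadIdx.x * STACK_DEPTH];   // this thread's private stack
    // ... same evaluation body as before ...
}

// Host side: instantiate a few depths and pick the smallest one that is deep enough.
void launch(int depth, dim3 grid, dim3 block, size_t smem,
            const int *e, const float *t, float *r)
{
    if      (depth <= 16) my_kernel_t<16><<<grid, block, smem>>>(e, t, r);
    else if (depth <= 45) my_kernel_t<45><<<grid, block, smem>>>(e, t, r);
    else                  my_kernel_t<90><<<grid, block, smem>>>(e, t, r);
}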

Seibert:
Memory allocation/read/write/deallocation takes the same time for each call. I’m going to minimize the number of host<->device transfers; that’s why I talked about the pure kernel call time. Your performance forecast sounds reasonable - that’s the formula I’ve been thinking of … NVIDIA is going to announce GT200-based chips soon; maybe there will be a miracle and the shared memory will be increased, not just the number of multiprocessors.
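A rough host-side sketch of what I mean by minimizing transfers (names and sizes are illustrative): copy the whole batch of test cases up once, run the kernel, and copy all results back once.

float *d_cases, *d_results;
cudaMalloc((void**)&d_cases,   n_cases * sizeof(float));
cudaMalloc((void**)&d_results, n_cases * sizeof(float));

cudaMemcpy(d_cases, h_cases, n_cases * sizeof(float), cudaMemcpyHostToDevice);

int block = 64;
int grid  = (n_cases + block - 1) / block;
eval_kernel<<<grid, block, smem_bytes>>>(d_expr, d_cases, d_results, n_cases);   // placeholder kernel

cudaMemcpy(h_results, d_results, n_cases * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_cases);
cudaFree(d_results);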

E.D. Riedijk:
Great idea! That’s exactly what came to my mind today. Not all expressions require the deepest stacks; I can vary the amount of shared memory for each kernel call.

A couple of words about the implementation; your comments, guys, are extremely valuable. In order to parse even the simplest expression (say, X X *) I have to store the two arguments of the multiplication somewhere. Really, I don’t see how to implement it without a stack (or stack-like) memory structure at all. Also, maybe there are some clever algorithms that parse the prefix or postfix form of an expression without stacks and in a parallel manner? I have not found anything like that yet.

Thank you very much for your help.

Assuming linear scaling, jumping up to a 9800 GTX, for example, would give you a (1.68 GHz * 128 SP) / (0.9 GHz * 16 SP) = 15x improvement in speed over the 8500 GT. At that point, it sounds like you would get a net factor-of-6 improvement over your CPU implementation. Not as great as some algorithms, but not bad. :)

Heh …

My current CPU (Athlon X2 4800+) is 2.5 times faster than the GPU kernel when using only one core. If I use both cores, it will be 5 times faster. If I migrate to a top-end Core 2 Duo, it will be 10 times faster. If I migrate to a Core 2 Quad, it will be 20 times faster.

The GTX 280 is 21.6 times faster than the 8500 GT (going by the number of processors and the frequency). Well … a Core 2 Quad seems to be a much more flexible solution :-)

However, I’m not dropping the idea yet; something tells me that the kernel is, let’s say, “not optimal”.

Also keep memory speed in the equation. Your current card has much lower bandwidth than an 8800 GTX; accessing memory at 70 GB/s has quite an impact on overall performance.

What I don’t understand is: why do you need shared memory per thread? I guess each thread has to evaluate the same expression, or not?
Could you post some source code to comment on?

I’ve attached the code. It is not too short, but it is the minimal possible code to explain the details. There is one big comment inside that explains how the expression is represented and how it is parsed.

The main concept is simple: each thread parses the expression for a particular test case.

In general, I’m ready to rework everything from scratch if the current approach is not CUDA-friendly.

Code.cpp (4.2 KB)

Hi, well, here are my 2 cents: preprocessing. CUDA likes things to stream along, not pop. Since you run each expression a lot of times, spending some time preprocessing would seem to be worthwhile. I would arrange it so that my kernel receives as input one large array of the small arrays of variables. The rest of the data I would try to put in constant memory; if it’s too big, then you can pass it as arrays as well. One array is the array of operations, one is the array of constant values.

__global__ void kernel(const uint2 *g_op, const float *g_vars, const float *g_const,
                       float *g_res, int numOfOps)
{
    extern __shared__ float s_data[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load the variables array into shared mem (s_data) with a coalesced read
    // load the const values array into the same shared array
    ...

    float res = s_data[0];              // load the initial value into res
    for (int i = 0; i < numOfOps; ++i)
    {
        uint2 op = g_op[i];             // op.x = operation code, op.y = operand index
        switch (op.x)
        {
            case MUL:
                res *= s_data[op.y];
                break;
            case ADD:
                res += s_data[op.y];
                break;
            // ... for all other operations
        }
    }

    // write back the result
    g_res[tid] = res;
}

The thing is, you need to make sure all the arrays are in the right order before you launch this kernel. I think your problem should run amazingly fast on CUDA, as you have a lot of math and each evaluation is independent. The tricky bits are when you need to pass data from thread to thread and from block to block. Hope this makes sense.

good luck

Eri

To Eri Rubin:
Not sure I understand … preprocessing is always good, I agree, but I don’t understand how I can preprocess a postfix expression in order to make the ‘*=’ and ‘+=’ possible.

Say, I have the expression: X X + X X + * (in normal notation it is (X+X)*(X+X)).

How can I transform it to make += possible?
I go through the expression and see X, then one more X, then +. Also, before the multiplication I have to calculate the second (X+X) part of the expression … and keep the result somewhere.

Are you evaluating these expressions on arrays or on single values? (It was not clear at first glance and I am short on time ;))

If you are working on arrays:

__global__ void kernel()
{
    __shared__ expression_information;

    testcase = blockIdx.x;

    if (threadIdx.x == 0)
        decode_expression_information();
    __syncthreads();

    idx = threadIdx.x;
    while (idx < array_size) {
        perform expression for element idx;
        idx += blockDim.x;
    }
}

launch it with:

kernel<<<num_expressions, max_number_of_threads_possible>>>()

But it is very possible that I did not understand anything from what you are trying to do ;)

I understand your proposal - what I don’t understand is what can be done inside “decode_expression_information()” :-) That’s the problem.

And once again, for clarity: there is one expression and many test cases. Each kernel call should parse its ONE expression for MANY test cases …