Is GPU worth it? GPU currently too slow.

Hi,

I’m trying to implement the fitness function for a genetic programming system to run on the GPU because it’s the most computationally expensive part of the system and I’ll need to calculate about 1000 at a time and each one is independent. So, I recently got it working and have tried it out. For one test I tried, it takes 17 seconds for just one fitness function evaluation on the GPU, while when I run it in emulation mode, it takes about 0.04 seconds… so I’d be needing to run about 400 fitness evaluations at the same time to even break even with the GPU, meaning I’d need another video card. Plus, if I ran the program on the CPU, I could speed it up based on inefficiencies I have to take into account on the GPU (such as no recursion or function pointers).

Am I perhaps using memory inefficiently on the card? Here’s what I’m doing:

Most of the calculations involve data from a really big table of floating point numbers, which I pass in from MATLAB and move to the card like this:

[codebox] CUDA_SAFE_CALL(cudaMalloc((void**)&data_gpu, sizeof(double)numSymbolsnumDates*numIndicators));

checkCudaError("h1");

double *data_double = (double*)mxMalloc(sizeof(double)*numSymbols*numDates*numIndi

cators);

for (i = 0; i < numSymbols*numDates*numIndicators; i++) data_double[i] = NA; //initialize the data

for (i = 0; i < numRows; i++) {

	symbol = data[i];

	date = data[i+numRows];

	for (j = 0; j < numIndicators; j++) {

		data_double[symbol*(numDates*numIndicators) + date*(numIndicators) + j] = (double)data[i+numRows+numRows*(j+1)];

	}

}

CUDA_SAFE_CALL(cudaMemcpy(data_gpu, data_double, sizeof(double)*numSymbols*numDates*numIndicators, cudaMemcpyHostToDevice));

mxFree(data_double);

checkCudaError("h2");[/codebox]

And then, in the kernel, I have a loop that does all the calculations using this table, so the bottleneck is most likely due to memory accesses. Is there a better way to allocate the memory? Would using textures or shared memory or something else speed things up?

-Ryan

I’m interested in genetic algorithms/programming on the GPU as I worked with both. I am one of the (few) people on this forum who thinks GPUs are overrrated, but your latencies seem way too large (the 17sec is preposterous – even a cheap GPU should be faster than a typical CPU in anything highly parallel). What GPU are you using? Their performance depreciates really fast once you missed the sweet spot in terms of memory access coalescence, or if you have bank conflicts in shared memory accesses. To verify where you stand, use cuda Visual profiler to get stats on your kernel. Also, use cuda Occupancy calculator to see what your bottlenecks are (registers, warps, or shared memory).

Without seeing your kenrels, here’re some things that come to mind:

  • older GPUs are not that good with double precision floats, and even in newer ones, single precision gives you more throughput. Can you switch to single prec? Besides, it is harder to get coalesced access with a double prec array.

  • if your table acts as a LUT, then you definitely have uncoalesced access. LUTs are best input in the GPU as textures because they are cashed and you can interpolate through entries for free. However, there’s a limited texture memory available per multiprocessor.

  • How did you get your timings? Use the visual profiler to get timings on the kernels because a kernel is executed asynchronously and CPU timing methods are inaccurate.

Thanks for the reply. I’m using the GTX 280 card, so it’s not an old card.

I haven’t used the Visual profiler before because the CUDA code is for MATLAB and so I’m not sure how to run the program through the profiler instead of through MATLAB. Is there a way to do this?

Also, what is a LUT?

I got my timings using the tic/toc functions of MATLAB…

I don’t know, but if there isn’t, NVidia should make it a priority to develop one.

A Look-Up-Table, they are used a lot of times to pre-approximate complicated functions.

You’d do someting like this:

/* initialize LUT offline by calculating discrete values for complicated_function(); */

/* in your real time code, you’d have */

int value = some_number_from_somewhere;

answer = LUT[value];

It’d be faster in real time than

answer = complicated_function(value);

GPUs like sequential memory accesses where adjacent threads read adjacent memory locations. Since the LUT access is unpredictable, your coalescence will break;

It does act as a LUT, and so I changed it to a texture and also changed it to using single-precision and there was no speedup at all :no: . So that’s unfortunate, and I’m not sure what else to do. Maybe the issue is that, in this kernel function, I am doing a whole lot (hundreds) of accesses into the table and really not that complex of computation, so even though I’m making it parallel, is it possible that the program is just too I/O bound to be efficient on the GPU as compared to the CPU (maybe because the accesses still aren’t coalesced enough)? Is there a way to improve this any more? (BTW, this might not be the same test that I was doing before, but I tested it with just a single individual, and it took 67 seconds on the GPU and 0.27 seconds on the CPU, so

I would have to run at least 250 individuals in parallel to make it worth it at all…)

Thanks for your help.

After looking at these results, we’re actually considering just going to pure MATLAB and parallel CPUs, which is likely to be 1) much easier and 2) as far as this application has gone so far, much faster. I’m really not sure why I’m getting such bad timing, so if anyone has any suggestions for making more efficient use of memory, definitely let me know.

Thanks!

I don’t think there is ever any reason to use 1 D textures on GTX2xx cards. And even the usefulness of 2D/3D textures should be limited.

The memory bandwidth should be at least 10x that of the CPU, so unless almost all your access are uncoalesced it should still be faster. Obviously we can’t really help without knowing your kernel code.

“Individual”? Do you mean to say you are running only 1 thread?? 250 threads, besides not being divisible by 32 would be a joke. Unless you are doing some ridiculously compute-heavy processing I don’t think there is a point of even trying the GPU when you can not run at least 1000 - 2000 threads.

And sorry if I understood you wrong, but if not one advice: the GPU is very much special-purpose hardware, you will not get good performance with most stuff without knowing the architecture really well, and possibly changing your algorithm completely to match that architecture. In particularly that means make sure you have read and understood the programming guide.

Yes, I was testing it with a single individual thread. I did this to see how fast it would be, although for real computation I would be running many more threads than this. The point of this was that, if it takes 67 seconds for one individual, then even if I stepped it up to 1000 or 2000, it’s not going to get any faster, right?

Most likely not, but given the maximum possible parallelism of the GTX280 I can’t see how that is any useful information. It is possible that you could solve about 10000-20000 instances in the same 67 seconds.

So all you know now is that the GPU could be between 300 times slower and 100 times faster. Unfortunately benchmarking with artificial data usually is just a waste of time.

And that would be wonderful if that happened, but how could it? The card definitely doesn’t have 10,000-20,000 processors on it and wouldn’t it need to run that many threads at once for this to happen? So if each thread takes 67 seconds to run, in order to get 10,000 threads running at once I’d need a lot more cards and with the same money I could just get a few quad-core computers and run it parallel on the CPUs…

I’m really wondering if I’m just doing something wrong, because right now it seems like, unless I have several good cards, I couldn’t even begin to see any advantages for the GPU, but then why not just buy more CPUs. If it would help, in two days when I a chance I could post some of the code I’m working on…

Reimar is right, your experiment is not useful for a couple of reasons: There is overhead associated to a kernel call that would be shared by however many threads you are running in it; Also, a GT200 has 30 multiprocessors if I rememer correctly, each of which can run a warp full of threads at a time. There may also be some overhead at the interface between Matlab and cuda or the cuda drivers (I"m really clueless as to how it works from Matlab) that interferes with your timing. Something is strange about those 67 seconds vs 0.27 seconds.

I don’t mean to derail this thread but I’m curious about this comment. I haven’t yet used a GTX2xx myself. I presume you are talking about the relaxed coallescing rules but how exactly are they implemented? Is the texture cache effectively being used as a staging area for coallescing memory accesses?

Edit: Having read the appropriate section in the new programming guide it obviously doesn’t use the texture cache and coallescing is still only possible within half-warps. So there should still be advantages to using 1D, 2D and 3D textures. Suppose that you have a lookup table that fits entirely within the texture cache and you access this in a random (or more likely data-dependent) way. Suppose also that for each texture unit (i.e. pair or triplet of multiprocessors) you make many more accesses than there are elements in the lookup table. With textures the global memory bandwidth used by each texture unit will be simply the number of elements in the lookup table times the element size. Without textures the global memory bandwidth used could be as high as 32 times the number of accesses times the element size. The slight complication is that the texture unit has relatively low throughput compared to global memory bandwidth but if you are also making lots of coallesced reads and/or writes then using the texture cache this way can really be helpful. Correct me if I’m wrong.

Assuming liv remembers right, your cards has 30 processors, each of which can have 512 or 768 threads “active” on it, giving 15360-23040.

Now, only about 1/16th to 1/32th of those are actually running at one moment of time, but that does not matter since they can also run when some other

thread is waiting for the up to 400 cycles latency of the global memory without increasing total runtime.

We can not know if that will indeed be the case, nor do I know for sure how many processors your card has, nor do I know if your kernel is simple enough that 512

threads fit into one processor etc. pp.

But I can tell that your current testing methodology is close to pointless.

Personally for my problem (solving large linear equations, which mostly depends on memory bandwidth, someone else had been working on it a few months already)

I went from a factor 3 slower over same speed (eliminating card <-> host copies), over two times faster (avoid uncoalesced access, shared memory as cache -

these make little difference on GTX2xx hardware), over 10 times faster (a different algorithm), over 13 times faster (newer hardware, both CPU and GPU faster but the

GPU much more so), over 17 times faster (using native instead of emulated double precision), which stayed 17 times faster when comparing the MPI variant running on

4 cores to running on 2 GPUs (and it was unfair to the GPU, since with all the optimizations the problem was now so small that the 20us kernel call overhead is about 20%

of the total runtime of the GPU algorithm).

Also, about using quad-cores, have you tested that? For my problem using two cores was 25 % faster than using one, using four cores was 12 % faster than using one,

probably due to cache conflicts, so you will need a proper benchmark anyway, otherwise your cluster solution might work just as badly as your GPU solution.

A multi-socket board may ease that problem, but is much more expensive, and a cluster with many PCs has its own bunch of problems.

But if you are not willing to learn the GPU architecture in some details and think long and hard about both your code and algorithms you will have more success with

using more CPUs - but for most problems you will not get a speedup of 10 without doing a lot of hard thinking and learning the details with that either, even if you had thousands of CPUs available.

I can confirm this. Benchmarks of HOOMD, which performs many random memory reads, on a GTX 280 show a 3x speedup using 1D textures vs straight global memory reads.

Kenry, you didn’t give the exact details of your genetic algorithm, but I imagine you are trying to optimize some complicated non-linear function. If the GPU genetic algorithm doesn’t work out, perhaps it would be faster to do some sort of hybrid approach by doing an initial search via an iterative matrix method on the GPU (which should parallelize very nicely), then once you’ve reached a certain tolerance, load your current best solution from the GPU into your CPU-based genetic algorithm to fine-tune your results. Depending on how large your simulation is (and if you are able to use the matrix method as well) this might give you a nice speedup in your research.

Thanks for the replies. It will be a couple days before I have the time to really read and think about them…

Kenry,

I’m currently doing some work in Particle Swarm Optimization (Genetic Algorithm derivative) and I found it best to change how the algorithm maps to the architecture. Your kernels should be light, shared memory utilized, memory accesses coalesced including broadcasting, and branching avoided or minimized (for loops and if statements). You will benefit greatly by rather than calculating an individual’s fitness on a single thread, distribute the evaluation across the entire block. Thus the grid itself will make up the population giving you a higher ceiling limit to the number of individuals in the population. You will not benefit from evaluating all individuals in a single block since it will only be executed on a single multiprocessor - counter intuitive if you have a card like a GTX280. Instead of synchronizing everything, try an asynchronous approach which may reduce synchronization overheads and allow a higher throughput.