Newbie: loops performance issue

Hi All,

I am currently writing my first CUDA program. I have a working program, but the identical code completes faster on the CPU than it does on the GPU.

The project I have been asked to do is to compare two 2D arrays of 2-byte (short) integers.
Array A is effectively a dictionary of words to find, and B is huge and full of randomly generated integers.
In both cases the integers are numerical values of ASCII characters.
Array A measures 4x30 and B measures 100000x1200.

The code uses the CPU to compare the first int of each row in A to all ints in all rows of B.
If a match is found, it then checks the next int in A against the next int in B, and so on, until either the full length of the word from A has been matched in B, or there is a break in the chain of matching ints, in which case that row of B is skipped.
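The search described above can be sketched roughly like this. This is only an illustration of the matching logic as described, not the poster's actual code; the function name, the 0-terminator convention for words, and the (shrunken) dimensions are all assumptions:

```c
#define A_COLS 30   /* max word length; the post uses 4x30 for A   */
#define B_COLS 12   /* row length of B, shrunk from 1200 for demo  */

/* Returns 1 if the word in `word` (0-terminated, up to A_COLS shorts)
 * occurs contiguously anywhere in the B_COLS-long row `row`. */
static int row_contains_word(const short *word, const short *row)
{
    int word_len = 0;
    while (word_len < A_COLS && word[word_len] != 0)
        word_len++;

    for (int start = 0; start + word_len <= B_COLS; start++) {
        int k = 0;
        while (k < word_len && row[start + k] == word[k])
            k++;                 /* extend the match one int at a time */
        if (k == word_len)
            return 1;            /* the full word was matched          */
        /* chain broken: fall through and try the next starting offset */
    }
    return 0;
}
```

On the CPU this inner search runs once per (row of A, row of B) pair; on the GPU the outer loop over rows of B is what gets distributed across threads.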

The two 2D arrays are cudaMemcpy’ed to the GPU and the same code is run, except that the calculated thread index is used to decide which row of B is scanned by any given thread.

Since I haven’t got the hang of working with 2D arrays on the GPU, I am simply treating B as a flat array and starting each thread’s scan at B[threadIndex*1200] instead of indexing with B[row][col].
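For reference, that flat-array approach is just standard row-major indexing: element (row, col) of a ROWS x 1200 block lives at offset row*1200 + col. A minimal sketch (the helper name is made up):

```c
#define COLS 1200  /* row length of B, per the post */

/* Equivalent of B[row][col] for a B allocated as one flat block of
 * ROWS*COLS shorts: each row starts at offset row*COLS, so a thread
 * scanning row `threadIndex` walks B[threadIndex*COLS + col]. */
static short get_b(const short *B, int row, int col)
{
    return B[row * COLS + col];
}
```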

All is working fine and the expected result is copied back to the CPU. The only problem is that 5 iterations take 8 seconds on the GPU, whereas they take 6 seconds on the CPU.

Any ideas why this might be?

Posting the kernel is always a good thing, although a few things come to mind from your signature:

  • If I recall correctly, Visual Studio Express will only give you emulation mode - i.e. the code doesn’t run on the GPU. You should validate that.

  • Your 9600 GPU is quite old.

I think that can be safely ruled out. If the code were running in emulation, it wouldn’t be 25% slower, it would be 25,000 times slower…

I think the key pieces of information here are 9600GT and 2-byte integers. I don’t believe that compute 1.1 hardware can coalesce 2-byte loads and stores, which probably means the code is running at a very small fraction of peak global memory bandwidth. My suggestion is to do some profiling. Wordy descriptions, sans code or any serious details of how the code works, are generally not the way to get help with problems around here.

Seems logical indeed, but I recall seeing something about this in these newsgroups…

The 2.0 docs state it’s not supported :) but the 2.3 docs do: …2.3_Windows.pdf

“Microsoft Visual Studio 2005 or 2008, or the corresponding versions of Microsoft Visual C++ Express”

Which version of CUDA are you using?


The slowness might be due to excessive global memory accesses. Try copying the arrays to shared memory. You might also want to reconsider the algorithm.

For example, maybe it would be more effective if each thread checked for the word at a different offset within a line of B, instead of one thread per line. That way each thread can copy its own element of the line to shared memory, then call __syncthreads(), then check for the word once the line (or line segment, if your blocks are smaller than 1200 threads) is entirely in shared memory.
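A rough sketch of that idea: one block per row of B, threads cooperatively staging the row in shared memory, then each thread testing the word at its own starting offsets. This is a hedged illustration, not tested code; the names (d_word, d_B, d_found), the word_len parameter, and the one-word-at-a-time design are all assumptions:

```cuda
#define B_COLS 1200  /* row length of B, per the post */

// One block handles one row of B; d_found[row] is set to 1 if the
// word (word_len shorts in d_word) occurs anywhere in that row.
__global__ void find_word(const short *d_word, const short *d_B,
                          int word_len, int *d_found)
{
    __shared__ short line[B_COLS];        // one row of B staged on-chip
                                          // (2400 bytes, fits in 16 KB)
    int row = blockIdx.x;

    // Threads cooperatively copy the row from global to shared memory.
    for (int c = threadIdx.x; c < B_COLS; c += blockDim.x)
        line[c] = d_B[row * B_COLS + c];
    __syncthreads();                      // the whole row is now staged

    // Each thread checks its own candidate starting offsets.
    for (int start = threadIdx.x; start + word_len <= B_COLS;
         start += blockDim.x) {
        int k = 0;
        while (k < word_len && line[start + k] == d_word[k])
            k++;
        if (k == word_len)
            d_found[row] = 1;             // full word matched in this row
    }
}
```

The point of the staging step is that each short in the row is read from global memory exactly once, in a contiguous pattern, instead of being re-read from global memory by every thread that scans past it.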

hope this helps,