I am currently writing my first CUDA program. I have a working version, but identical code completes faster on the CPU than it does on the GPU.
The project I have been asked to do is to compare two 2D arrays of 2-byte (short) integers.
Array A is effectively a dictionary of words to find, and B is huge and full of randomly generated integers.
In both cases the integers are numerical values of ASCII characters.
Array A measures 4x30 and B measures 100000x1200.
The code uses the CPU to compare the first int of each row in A to all ints in all rows of B.
If a match is found, it then checks the next int in A against the next int in B, until EITHER the full expected word length from A is matched in B OR the chain of matching ints breaks, in which case that row of B is skipped.
The two 2D arrays are cudaMemcpy'ed to the GPU and the same code is run, except that the calculated thread index is used to decide which row of B is scanned in any given thread.
Since I haven't got the hang of working with 2D arrays on the GPU, I am simply indexing the flat array starting at B[threadIndex * 1200] instead of using B[row][col].
Everything works and the expected result is copied back to the CPU; the only problem is that 5 iterations take 8 seconds on the GPU, whereas the CPU takes 6 seconds.
I think that can be safely ruled out. If the code was running in emulation, it wouldn’t be 25% slower, it would be 25,000 times slower…
I think the key pieces of information here are 9600GT and 2-byte integers. I don't believe that compute 1.1 hardware can coalesce 2-byte loads and stores, which probably means the code is running at a very small fraction of peak global memory bandwidth. My suggestion is to do some profiling. Wordy descriptions sans code, or any serious details of how the code works, are generally not the way to get help with problems around here.
The slowness might be due to excessive global memory access. Try copying the arrays to shared memory. You might also want to reconsider the algorithm.
For example, it might be more effective if each thread checked the word at a different offset within a line instead of using one thread per line. That way each thread can copy its own element of the line to shared memory, then call __syncthreads(), then check the word once the line (or line segment, if your blocks are smaller than 1200 threads) is entirely in shared memory.
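A rough kernel sketch of that suggestion, assuming the flat row-major layout from the question; the names (matchRow, BLOCK, wordLen) and the one-segment-per-block decomposition are my assumptions, not the poster's code, and matches that straddle a segment boundary are not handled here:

```cuda
// Sketch (untested): stage one segment of a row of B in shared memory,
// then let every thread in the block check a word at its own offset.
#define BLOCK 256

__global__ void matchRow(const short *B, const short *A,
                         int cols, int wordLen)
{
    __shared__ short seg[BLOCK];
    int row  = blockIdx.y;          // one row of B per block row
    int base = blockIdx.x * BLOCK;  // this block's segment within the row

    // Each thread copies one element of the segment into shared memory.
    if (base + threadIdx.x < cols)
        seg[threadIdx.x] = B[row * cols + base + threadIdx.x];
    __syncthreads();                // segment is now entirely in shared memory

    // Each thread checks the word starting at its own offset.
    int start = threadIdx.x;
    if (start + wordLen <= BLOCK && base + start + wordLen <= cols) {
        int i = 0;
        while (i < wordLen && seg[start + i] == A[i])
            ++i;
        // i == wordLen means a full match at (row, base + start)
    }
}
```

This turns the per-row serial scan into many cheap shared-memory comparisons, and the global loads of B become one contiguous, per-thread-sequential copy per segment.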