I am writing my first CUDA program. I have it working, but the identical code completes faster on the CPU than it does on the GPU.
The project I have been asked to do is to compare two 2D arrays of short (2-byte) integers.
Array A is effectively a dictionary of words to find, and B is huge and full of randomly generated integers.
In both cases the integers are numerical values of ASCII characters.
Array A measures 4x30 and B measures 100000x1200.
The CPU code compares the first int of each row in A against every int in every row of B. When a match is found, it checks the next int in A against the next int in B, continuing until EITHER the expected length of A is met in B OR there is a “break” in the int chain, in which case that row of B is skipped.
The two 2D arrays are cudaMemcpy’ed to the GPU and the same code is run, except that the calculated thread index determines which row of B is scanned in any given thread.
Since I haven’t got the hang of working with 2D arrays on the GPU, I am simply indexing from B[threadIndex*1200] instead of using B[row][col].
All is working fine and the expected result is copied back to the CPU; the only problem is that 5 iterations take 8 seconds on the GPU whereas they take 6 seconds on the CPU.
Any ideas why this might be?