Your code cannot possibly be much faster than the CPU version. If your professor is too stubborn to accept that not every problem can be magically sped up 100x with CUDA, show him some calculations.
Your kernel basically does one global memory read, very little local arithmetic and shared-memory shuffling, and one global memory write. It is completely bandwidth bound. For a kernel to be compute bound you'd need on the order of 50 arithmetic operations per memory access, if not more. When a kernel is bandwidth bound, the only number that counts is GB/s. Forget about FLOPS; they only become important once your kernel is compute bound.
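To see which regime a kernel is in, compare its arithmetic intensity (operations per byte moved) against the hardware's balance point (peak FLOPS divided by peak bandwidth). A minimal sketch; the peak-FLOPS figure and the per-element counts for a copy-like kernel are illustrative assumptions, not measurements:

```python
# Rough roofline-style check: compute bound or bandwidth bound?

def regime(flops_per_element, bytes_per_element, peak_gflops, peak_gbs):
    """Compare a kernel's FLOP/byte ratio with the hardware balance point."""
    kernel_intensity = flops_per_element / bytes_per_element  # FLOP per byte the kernel does
    hw_balance = peak_gflops / peak_gbs                       # FLOP per byte the HW can feed
    return "compute bound" if kernel_intensity > hw_balance else "bandwidth bound"

# A copy-like kernel: one float read (4 B), one float write (4 B), ~2 ops.
print(regime(flops_per_element=2, bytes_per_element=8,
             peak_gflops=100.0, peak_gbs=8.0))  # prints "bandwidth bound"
```

Even a modest 100 GFLOPS peak against 8 GB/s gives a balance point of 12.5 FLOP/byte, so a kernel doing 0.25 FLOP/byte is nowhere near compute bound.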
Your GPU has a peak memory bandwidth of around 8 GB/s. My desktop PC's CPU with DDR2 RAM does around 5 GB/s - you can assume 5-10 GB/s as the "standard" bandwidth you get on a non-server CPU (more with an i7 and such).
Plus you need to copy the data over to the GPU and back through PCIe. PCIe 2.0 bandwidth with pageable memory is, on my machine, 1.5 GB/s, and around 3.5 GB/s with pinned memory - let's take the more optimistic variant.
Now, assume you do computations on an array of 256k floats, total 1MB of data.
On the CPU, the time to read 1 MB and then store 1 MB would be:
1 MB / (5 GB/s) + 1 MB / (5 GB/s) = 0.4 ms
The time for the shuffling in L1 cache and whatever incidental arithmetic ops you perform is negligible.
On the GPU, the total time is a PCIe transfer, RAM read, RAM write, PCIe transfer:
1 MB / (3.5 GB/s) + 1 MB / (8 GB/s) + 1 MB / (8 GB/s) + 1 MB / (3.5 GB/s) = 0.82 ms
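The two totals above are easy to sanity-check, working in MB and GB/s so that 1 MB / (1 GB/s) = 1 ms:

```python
def transfer_ms(mb, gb_per_s):
    """Time in ms to move `mb` megabytes at `gb_per_s` GB/s."""
    return mb / gb_per_s  # MB / (GB/s) = ms

# CPU: read 1 MB + write 1 MB at 5 GB/s.
cpu = transfer_ms(1, 5) + transfer_ms(1, 5)

# GPU: PCIe in, RAM read, RAM write, PCIe out.
gpu = (transfer_ms(1, 3.5) + transfer_ms(1, 8)
       + transfer_ms(1, 8) + transfer_ms(1, 3.5))

print(round(cpu, 2), round(gpu, 2))  # 0.4 0.82
```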
Now you say you have a cluster? So let's add the time to send and receive 1 MB through the network - say it's 100 Mb Fast Ethernet, effectively ~10 MB/s:
gpuTime + 1 MB / (10 MB/s) + 1 MB / (10 MB/s) = 0.82 ms + 200 ms ≈ 200 ms
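Same sanity check with the network legs added; 10 MB/s is 0.01 GB/s in the units used above:

```python
def transfer_ms(mb, gb_per_s):
    """Time in ms to move `mb` megabytes at `gb_per_s` GB/s."""
    return mb / gb_per_s  # MB / (GB/s) = ms

# GPU leg: PCIe both ways at 3.5 GB/s, RAM read + write at 8 GB/s.
gpu = 2 * transfer_ms(1, 3.5) + 2 * transfer_ms(1, 8)

# Network leg: send + receive 1 MB at ~10 MB/s Fast Ethernet.
net = 2 * transfer_ms(1, 0.01)

print(round(gpu + net, 1))  # 200.8 -- the 0.82 ms GPU part vanishes in the noise
```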
Amdahl’s law at its finest
See the problem here? It just doesn't make sense to send two vectors to a GPU just to do a glorified copy, much less over a network. It will never be as fast as the CPU, even if the problem is embarrassingly parallel and you exploit every last bit of parallelism there is.
Matrix multiplication is different: if the copy through PCIe is of size N, you do on the order of N^2 RAM accesses, so for big enough N it becomes profitable even with a slow PCIe link and only a slight edge in RAM bandwidth (assuming there's enough RAM on board). Plus there are enough arithmetic operations that, in some cases and with a proper implementation, this can actually achieve nice FLOPS.
So for matrices, the rough time calculation would go like:
N MB / (3.5 GB/s) + N^2 MB / (8 GB/s) + N MB / (3.5 GB/s)
The time is quadratic in N: for large N, the coefficient of the N^2 term (RAM bandwidth) matters much more than the coefficients of the linear terms (PCIe bandwidth).
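To make the crossover concrete, compare that GPU model against a CPU doing the same N^2 MB of RAM traffic at the 5 GB/s assumed earlier (a toy model with my numbers; the actual break-even point depends entirely on your hardware):

```python
def gpu_ms(n):
    """GPU model from the text: PCIe leg + N^2 MB of RAM traffic + PCIe leg."""
    return n / 3.5 + n * n / 8 + n / 3.5

def cpu_ms(n):
    """CPU does the same N^2 MB of RAM traffic at 5 GB/s, no PCIe."""
    return n * n / 5

# First integer N (MB transferred) where the GPU model wins.
n = next(n for n in range(1, 1000) if gpu_ms(n) < cpu_ms(n))
print(n)  # 8 -- below ~8 MB the PCIe legs eat the bandwidth advantage
```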