First of all, I want to thank you, MisterAnderson42, for helping me with this problem. I’m new to CUDA and these forums have been very helpful.
I’m getting back to this problem after a break of a couple of weeks. My GPU-extended simulator initially seemed to show a 3x - 5x speedup over the CPU simulation, and from a GPU perspective that really isn’t something to get excited about. However, I figured it could be sped up further with a more efficient algorithm in the Kernel, learning how to use shared memory, etc.
But now that I come back to this two weeks later and look at my code, I am shocked to find that I made the same silly mistake on the CPU version of the simulator: trying to “square” values in my distance formula by typing ^2 in my code. HAH! That’s what happens when you first type pseudocode and then quickly throw it all together.
So anyway, upon correcting that and actually performing proper “squares”, I come to find that my GPU simulator actually runs SLOWER than its CPU counterpart…SLOWER. Not a ton slower…but nevertheless, slower.
Could you please take a look at my Kernel and tell me if you see a reason why this multithreaded GPU version would be slower than the CPU version:
//Here's the Execution Configuration:
dim3 dimBlock(blockSize/16, blockSize/16);
dim3 dimGrid(numObjects/dimBlock.x + (numObjects%dimBlock.x == 0 ? 0:1), numObjects/dimBlock.y + (numObjects%dimBlock.y == 0 ? 0:1));
computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
//And here's the Kernel:
__global__ void computeDistanceArray (int *x_d, int *y_d, int *distance_d, int numObjects) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // index of object A
    int j = blockIdx.y*blockDim.y + threadIdx.y;   // index of object B
    // guard against the extra threads in the last row/column of blocks
    // BEFORE touching x_d/y_d, so nothing is read out of bounds
    if (i < numObjects && j < numObjects) {
        int xDiff = x_d[i] - x_d[j];
        int yDiff = y_d[i] - y_d[j];
        int idx = i*numObjects + j;
        distance_d[idx] = (int)sqrt((double)(xDiff*xDiff + yDiff*yDiff));
    }
}
I may have just answered my own question…to some extent. On the CPU simulator, I have a function that queries all objects. Basically it’s an O(N^2) function with two for loops that go through all the objects and calculate the distance between each and every pair. IMMEDIATELY upon calculating a distance, if the distance is less than the query range, object B is added to the neighbor list of object A.
Now, for the GPU-based simulator, I calculate the distances on the GPU…so we avoid the two for loops there and instead, ideally, use a bit of the parallel processing capability of the GPU. But then, AFTER all the distances are calculated, I return to the CPU-based program and run two for loops over this newly created distance array. Just like on the CPU simulator, if the distance is less than the query range, object B is added to the neighbor list of object A.
So, in answering my question, perhaps the GPU simulator is not faster because EITHER way I’m running an O(N^2) loop. On the CPU simulator, this loop (1) calculates the distance, (2) compares it to the query radius, and (3) then adds a neighbor if necessary. On the GPU simulator, the loops only do the last two steps, since the Kernel does the distance calculation.
So if that makes sense as to why it is slower, then I’m really hurting myself here. Basically, why use the GPU if I’m going to come back and run an O(N^2) loop over the results anyway? Is that right?
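One thing I could try (just a rough sketch, I haven’t tested it) is to fold the range comparison into the Kernel itself, so the host loops only read a 0/1 flag instead of repeating the test. Here queryRangeSq is just a placeholder for my query radius squared; comparing squared distances would also let me drop the sqrt entirely:

__global__ void computeWithinRange (int *x_d, int *y_d, unsigned char *withinRange_d, int numObjects, int queryRangeSq) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < numObjects && j < numObjects) {
        int xDiff = x_d[i] - x_d[j];
        int yDiff = y_d[i] - y_d[j];
        // compare squared values so no sqrt is needed
        withinRange_d[i*numObjects + j] = (xDiff*xDiff + yDiff*yDiff <= queryRangeSq) ? 1 : 0;
    }
}

The host loops would still be O(N^2), but each iteration becomes a single byte read and a branch, and the array copied back shrinks from 4 bytes per pair to 1.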
So, what I then did, just for testing purposes, was to just IGNORE the for loops on the GPU simulator. I just commented them out. Upon getting the distance array back from the GPU, I do nothing with it at all. So of course, the simulator is useless…but the point is to see if it is much faster.
Without having to run through those for loops, the GPU was faster…but not by much…maybe a whopping 25% or so.
So my first question is still valid. And that is, is there something I’m doing wrong with my Kernel that is not taking proper advantage of the inherent parallelism of the GPU?
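In case it helps to know what I had in mind with the shared memory idea, here is a sketch (again untested, and it assumes the block really is 16x16): each block stages its 16 row coordinates and 16 column coordinates once, instead of every thread fetching x_d[i], x_d[j], y_d[i], y_d[j] from global memory.

#define TILE 16

__global__ void computeDistanceTiled (int *x_d, int *y_d, int *distance_d, int numObjects) {
    __shared__ int xi[TILE], yi[TILE];   // coordinates for this block's rows (i)
    __shared__ int xj[TILE], yj[TILE];   // coordinates for this block's columns (j)

    int i = blockIdx.x*TILE + threadIdx.x;
    int j = blockIdx.y*TILE + threadIdx.y;

    // one thread per row/column fills the shared arrays
    if (threadIdx.y == 0 && i < numObjects) { xi[threadIdx.x] = x_d[i]; yi[threadIdx.x] = y_d[i]; }
    if (threadIdx.x == 0 && j < numObjects) { xj[threadIdx.y] = x_d[j]; yj[threadIdx.y] = y_d[j]; }
    __syncthreads();

    if (i < numObjects && j < numObjects) {
        int xDiff = xi[threadIdx.x] - xj[threadIdx.y];
        int yDiff = yi[threadIdx.x] - yj[threadIdx.y];
        distance_d[i*numObjects + j] = (int)sqrt((double)(xDiff*xDiff + yDiff*yDiff));
    }
}

I don’t know whether that alone buys much here, since the arithmetic per pair is so small, but it is the kind of change I was imagining.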
If you have the time, I would really appreciate if you could help.
Thanks again.
…
EDITing the post now. It’s about 20 minutes later, and perhaps I’m still answering my own question.
Right now, on my little example, the CPU simulator takes, let’s say, around 40 seconds, and the GPU version maybe 30 seconds. However, the number of objects is only 10,000. As a result, the CPU is actually able to run through this pretty quickly (it takes about 40 seconds for 20 iterations of moving the objects and then calculating the distances, etc.).
Since the time here (30-40 seconds) is so small, could the reason I don’t see a speedup be that I’m not giving the GPU a real chance to show what it can do? That’s not very clear, so let me put it another way: there is a COST associated with using the GPU, and that cost is transferring this distance array back and forth between the host and the device on each iteration. IDEALLY, this cost is offset by a MASSIVE speedup of the computation on the GPU. BUT in my case, since the number of mobile objects is small (only 10,000), the GPU really isn’t having to work that hard, and as a result the COST of communication isn’t really getting offset. I think that’s a little bit clearer now.
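To check that theory, I suppose I could time the Kernel and the copy separately with CUDA events, something like this (using my buffer names from this post, not tested):

cudaEvent_t start, afterKernel, afterCopy;
cudaEventCreate(&start);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterCopy);

cudaEventRecord(start, 0);
computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
cudaEventRecord(afterKernel, 0);
cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);
cudaEventRecord(afterCopy, 0);
cudaEventSynchronize(afterCopy);

float kernelMs = 0.0f, copyMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, afterKernel);
cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
printf("kernel: %.3f ms, copy back: %.3f ms\n", kernelMs, copyMs);

That would at least tell me whether the time is going into the computation or into shuttling the distance array (10,000 x 10,000 x 4 bytes, about 400 MB) back every iteration.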
So if this makes sense, I would LOVE to simply increase the mobile objects to 100,000, which takes the CPU WAAAAAAAAAY too long for even one iteration…so long that I won’t wait for it. That would be a PERFECT number of objects to test on the GPU. However, I can’t create a distance array that large on the device. The device has a space limitation:
int *distanceArray;
// With numObjects = 100,000 the full matrix is 100,000 * 100,000 * 4 bytes,
// roughly 40 GB; note also that this int arithmetic overflows at that size,
// so size_d doesn't even hold the real number (size_t would be the right type)
int size_d = numObjects*numObjects*sizeof(int);
// This allocation is possible on the host even with numObjects = 100,000
distanceArray = (int*)malloc(size_d);
int *distanceArray_onDevice;
// But this allocation fails on the device
cudaMalloc((void**)&distanceArray_onDevice, size_d);
So if you agree that I could indeed be on to something here, meaning that my GPU code may not be that slow after all and I’m just not giving it a chance to be tested properly, then could you please advise how one goes about handling that many objects and getting their distances on the GPU?
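The only idea I’ve come up with so far is to compute the matrix in horizontal strips, so that only a chunk of rows lives on the device (and gets copied back) at a time. Here is a sketch of what I mean; computeDistanceStrip, chunkRows, and the rest are names I just made up, and I haven’t tried any of this:

__global__ void computeDistanceStrip (int *x_d, int *y_d, int *strip_d, int numObjects, int rowStart, int rows) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // row within this strip
    int j = blockIdx.y*blockDim.y + threadIdx.y;   // column (object index)
    if (i < rows && j < numObjects) {
        int xDiff = x_d[rowStart + i] - x_d[j];
        int yDiff = y_d[rowStart + i] - y_d[j];
        strip_d[i*numObjects + j] = (int)sqrt((double)(xDiff*xDiff + yDiff*yDiff));
    }
}

void computeAllDistancesInStrips (int *x_onDevice, int *y_onDevice, int numObjects) {
    const int chunkRows = 1000;   // tune to whatever the card can hold
    size_t stripBytes = (size_t)chunkRows * numObjects * sizeof(int);

    int *strip_h = (int*)malloc(stripBytes);
    int *strip_d;
    cudaMalloc((void**)&strip_d, stripBytes);

    dim3 dimBlock(16, 16);
    for (int rowStart = 0; rowStart < numObjects; rowStart += chunkRows) {
        int rows = (numObjects - rowStart < chunkRows) ? numObjects - rowStart : chunkRows;
        dim3 dimGrid((rows + dimBlock.x - 1)/dimBlock.x, (numObjects + dimBlock.y - 1)/dimBlock.y);
        computeDistanceStrip<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, strip_d, numObjects, rowStart, rows);
        cudaMemcpy(strip_h, strip_d, (size_t)rows * numObjects * sizeof(int), cudaMemcpyDeviceToHost);
        // ...walk this strip on the host and add neighbors within the query range...
    }

    cudaFree(strip_d);
    free(strip_h);
}

The full matrix never has to exist on the device, and the per-strip memory (chunkRows x numObjects x 4 bytes) is something I can size to the card. Does that sound like a reasonable direction?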
Thanks.