Okay, so now let’s see if I can make sense of my original question…transforming those for loops, which were used to calculate distances between objects, into CUDA code to be used on the GPU. Hopefully you all can give me some pointers here…
For the sake of easy “coding” on this thread, let’s just say I don’t use a struct, and that I just have separate arrays for each of my important fields.
So, previously, I had an array of structs, where each struct had an ID, an x coordinate, a y coordinate, and other things.
Now, I will have several arrays: an array of IDs, a 2D array of x and y coordinates, and other arrays for the rest. Or I could even have a separate x-coordinate array and a y-coordinate array.
So now, in the main program, I will send pointers to the x and y coordinate arrays (or the one 2D array) to the kernel. So if there are 1000 objects, there will be an array of length 1000 for both the x and y coordinates. I will then need to use these values to calculate the distances between each and every object (saved as an int for space purposes).
A very humble, and perhaps naive, way would be to create a new “distance” array on both the host and device.
This distance array would be a 2D array of size N*N, where N is the number of objects. The indices of the array would simply refer to a specific object. So, for example, the value stored at distance(212,487) would refer to the distance between object #212 and object #487.
This distance array is what I would like the GPU to calculate for me. On the host, this boils down to two nested for loops, running in O(n^2) time. Considering that I keep reading that the GPU is best used for exactly this kind of huge, parallelizable for loop, I’m guessing this would be considerably sped up on the GPU?
So, now the question is: how do I make this happen?
Here’s a bit of code that I have NOT actually compiled or run. I literally just typed this up in Notepad for the purpose of pasting here, to see if I even remotely get this.
My confusion is with the execution configuration and the kernel.
#include <stdlib.h>
#include <cuda_runtime.h>

int numObjects = 100000;
int main() {
// calculate "size" to be used for allocating memory of x & y coord. arrays
int size = numObjects*sizeof(int);
// Also calculate "size" of the distance array
// (size_t with a cast, since numObjects*numObjects overflows a plain int;
// heads-up to myself: at 100000 objects this is ~40 GB, so N will likely need to be smaller)
size_t size_d = (size_t)numObjects*numObjects*sizeof(int);
// make arrays for x and y coordinates, as well as the "to be computed"
// distance array, and allocate necessary space on host
int *xArray, *yArray, *distanceArray;
xArray = (int*)malloc(size);
yArray = (int*)malloc(size);
distanceArray = (int*)malloc(size_d);
// distance array is a 2d array as described above
// now assume those x & y arrays have values in them
// and we want to calculate the "distance" array mentioned previously
// So we need to allocate space on the Device for the x & y arrays
// as well as the distance array we will calculate.
// So I start by making pointers to these arrays on the Device:
int *x_onDevice, *y_onDevice, *distanceArray_onDevice;
// the following just stores a result if i want to check what happened
cudaError_t result;
// And now we allocate memory on the Device for each of these arrays:
result = cudaMalloc((void**)&x_onDevice, size);
result = cudaMalloc((void**)&y_onDevice, size);
result = cudaMalloc((void**)&distanceArray_onDevice, size_d);
// Now I have to copy the x & y arrays to the device:
result = cudaMemcpy(x_onDevice, xArray, size, cudaMemcpyHostToDevice);
result = cudaMemcpy(y_onDevice, yArray, size, cudaMemcpyHostToDevice);
//*********************************************************
// Now I need to make the execution configuration which I am confused with
// I'm not sure as to how I choose the values for dimBlock and dimGrid
//*********************************************************
int blocksize = 16; // my guess: 16x16 = 256 threads per block
dim3 dimBlock( blocksize, blocksize );
// round up, so the leftover edge still gets a (partial) block
dim3 dimGrid( (numObjects + dimBlock.x - 1)/dimBlock.x,
              (numObjects + dimBlock.y - 1)/dimBlock.y );
computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
// Now I simply bring the result back to the Host
result = cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);
// Do other computation here as needed for the program
// Now I can free all allocated space on Host and on Device:
free(xArray); free(yArray); free(distanceArray);
cudaFree(x_onDevice); cudaFree(y_onDevice); cudaFree(distanceArray_onDevice);
return 0;
}
And here is the actual kernel:
__global__ void computeDistanceArray( int *x, int *y, int *d, int numObjects ) {
// I'm stuck here
// I need to somehow use the x and y coordinates, pair them up, and then calculate the distances
// between each and every object.
}
So the two main things, and yes they are big things, are:
(1) my confusion on what to specify for the execution configuration
(2) how i calculate the distances array on the GPU
For (1), I’m guessing I should make my blocksize equal to 16. Thus, dimBlock would be 16x16 = 256 threads per block… which seems to be a popular number people reference on these forums.
But for dimGrid, I really don’t get what I’m trying to accomplish there.
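If I’m reading other examples right, dimGrid is just “how many blocksize-by-blocksize blocks does it take to cover N in each dimension,” rounded up so the leftover edge still gets a block. The arithmetic by itself, in plain C:

```c
/* Grid sizing: how many blocks of width `blocksize` are needed to
 * cover n elements? Integer ceil(n / blocksize). */
int gridDimFor(int n, int blocksize) {
    return (n + blocksize - 1) / blocksize;
}
```

So for 1000 objects and a blocksize of 16, each grid dimension would be 63 blocks, and the threads in the last row/column of blocks whose i or j lands past N-1 would just do nothing (hence the bounds check in the kernel).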
And then comes (2), which is the biggie: how to make this happen in the first place.
Whew.
Whoever can help, thank you in advance.