# modifying simulator to work with CUDA.?.

Hi, I’m wondering if the following concept would be workable on a GPU with CUDA.

As of now, I have a simulator (in C) that does the following. I basically have a number of objects that constantly move around at random. At each cycle, the simulator calculates the neighboring objects (within a certain distance) to any given object and stores those neighboring objects in an array specific to that object.

If that doesn’t make sense, here’s the code:

``````void queryAllMOs(MobileObject MOs[]) {

int i, j;

double distance;

for (i = 0; i < numMobileObjects; ++i) {

for (j = 0; j < numMobileObjects; ++j) {

if (i != j) {

distance = (double) sqrt((double)((MOs[j].x - MOs[i].x)^2) + ((MOs[j].y - MOs[i].y)^2));

// So here we are caluclating the distance between object "i" and ALL other "j" objects,

// by computing the distance formula based on their x and y coordinates

//Here is where I now add this j object to the neighbor array of object i.

}

}

}

}

}
``````

Is this something I can do in CUDA? All of these distance calculations can happen in parallel; so that part makes sense.

But for me, i’m just flat out stumped on how to make it happen. I’ve been trying to look at example of matrices, as this should ideally (?) shed some light on how to work with two dimensions (two coordinates).?.

And also, all the basic examples I’ve seen on Dr. Dobbs and other tutorial websites always use a basic array on the host, whether it be a 1d, 2d, or m-d array.

In this example, on the host, i have an array of structs. Each struct has the x and y coordinate as well as other information for each object.

Is this workable?

Any ideas here? I’m really at a loss.

Is there any sample project that i can somewhat map to this problem?

It may be a bit overkill for you, but I’ve worked extensively with lists of neighbors on the GPU. All the details are in the paper here

If all you really need is an example code of how to efficiently calculate a neighbor list in O(N^2) time on a GPU, HOOMD’s code has one of these too:
http://trac2.assembla.com/hoomd/browser/tr…orListNsqGPU.cu

As for decoupling your data arrays from your structures, well, that is just something you have to do with CUDA. On GPUs, structures of arrays are much more efficient than arrays of structures because of the memory access patterns.

As of now, on my simulator, i have an array of structures. How do you suggest i work this so it is an easy transition to the GPU.

Is it perhaps easiest to just take out the information i need from the structures (coordinates) and make a seperate coordinate array. So i can have my array of structures. And then i can have my 2d array of coordinates. Thus, i can pass my array of coordinates to the kernel and process that way.

Is that what you were suggesting?

I didn’t understand the “structures of arrays” part.

Thanks.

It is standard language, but can be confusing when you first see it.

Array of structures:

``````struct my_object

{

float mass;

float x,y,z;   // position

float vx,vy,vz;  // velocity

float ax,zy,ax;  //acceleration

}

my_object data_array[100];

}
``````

Structure of arrays:

``````struct my_object

{

float *mass;

float4 *pos;   // position

float4 *vel;  // velocity

float4 *accel;  //acceleration

}

// each pointer is allocated to be of length N
``````

Let me start by expressing my thanks for spending the time to help. It is greatly appreciated.

To be honest, even after you posted this, i was confused and began just googling “structure of arrays”. But then it hit me (I think). Instead of an array of structures, you are just making seperate arrays for EACH “thing” (seperate array for mass, seperate array for pos, for vel, and a seperate array for accel).

Then, just to package it all together, you are putting all these arrays in a struct. This last part really isn’t necessary right? It’s just suggested to keep the feel of everything being “together”.?.

And to confirm my humble understanding of this, the entire purpose of writing it as a structure of arrays is because we need to be able to pass the entire array to the GPU and have the GPU work on the entire array…and this is because the GPU can’t decouple the information we need from an array of structs…or if it can, it can’t do it nearly as quicly.

Do I sort of have this.?.

And lastly, what did you mean, “each pointer is allocated to be of length N”? Was N referring to the 100 in the example on array of structures?

Yes, I believe that’s what he meant.

Okay, so now let’s see if I can make sense of my original question…transforming those for loops, which were used to calculate distances between objects, into CUDA code to be used on the GPU. Hopefully you all can give me some pointers here…

For the sake of easy “coding” on this thread, let’s just say i don’t use a struct, and that i just have seperate arrays for each of my important things.

So, previously, i had an array of structs, where each struct had an ID, an x coordinate, a y coordinate, and other things.

Now, i will have several arrays: an array of IDs, a 2d array of x and y coordinates, and other arrays for stuff. Or i could even have a seperate x-coord array and a y-coordinate array.

So now, in the main program, I will send a pointer of the x and y coordinate arrays (or the one 2d array) to the kernel. So if there are 1000 objects, there will an array of length 1000 for both the x and y coordinates. I will then need to use these values to calculate the distances between each and every object (saved as an int for space purposes).

A very humble, and perhaps naive, way would be to create a new “distance” array on both the host and device.

This distance array would be a 2d array of size N*N, where N is the # of Objects. The indeces of the array would simply refer to a specific object. So for example, the value stored at distance(212,487) would refer to the distance between object #212 and object # 487.

This distance array is what i would like the GPU to calculate for me. On the host, this would boil down to to two for loops, and would be a running time of O(n^2). Considering that I keep reading that the GPU is best used for most types of huge for loops that can be parallelized, I’m guessing this would be considerably sped up on the GPU.?.

So, now the question is, how do I make this happen.

Here’s a bit of code that I have NOT actually done anything with. I literally just typed this up in notepad for the purpose of pasting here to see if i even remotely get this.

My confusion is with the execution configuration and the kernel.

``````int numObjects = 100000;

int main() {

// calculate "size" to be used for allocating memory of x & y coord. arrays

int size = numObjects*sizeof(int);

// Also calculate "size" of the distance array

int size_d = numObjects*numObjects*sizeof(int);

// make arrays for x and y coordinates, as well as the "to be computed"

// distance array, and allocate necessary space on host

int *xArray, *yArray, *distanceArray;

xArray = malloc(size);

yArray = malloc(size);

distanceArray = malloc(size_d);

// distance array is a 2d array as described above

// now assume those x & y arrays have values in them

// and we want to calculate the "distance" array mentioned previously

// So we need to allocate space on the Device for the x & y arrays

// as well as the distance array we will calculate.

// So I start by making pointers to these arrays on the Device:

int *x_onDevice, *y_onDevice, *distanceArray_onDevice;

// the following just stores a result if i want to check what happened

cudaError_t result;

// And now we allocate memory on the Device for each of these arrays:

result = cudaMalloc((void**)&x_onDevice, size);

result = cudaMalloc((void**)&y_onDevice, size);

result = cudaMalloc((void**)&distanceArray_onDevice, size_d);

// Now I have to copy the x & y arrays to the device:

result = cudaMemcpy(x_onDevice, xArray, size, cudaMemcpyHostToDevice);

result = cudaMemcpy(y_onDevice, yArray, size, cudaMemcpyHostToDevice);

//*********************************************************

// Now I need to make the execution configuration which I am confused with

// I'm not sure as to how I choose the values for dimBlock and dimGrid

//*********************************************************

dim3 dimBlock( blocksize, blocksize );

dim3 dimGrid( numMobileObjects/dimBlock.x, numMobileObjects/dimBlock.y );

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

// Now I simply bring the result back to the Host

result = cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);

// Do other computation here as needed 4 program

// Now I can free all allocated space on Host and on Device:

free(xArray); free(xArray); free(distanceArray);

cudaFree(x_onDevice); cudaFree(y_onDevice); cudaFree(distanceArray_onDevice);

}
``````

And here is the actual kernel:

``````__global__ computeDistanceArray (int *x, int *y, int *d, int numMobileObjects ) {

// I'm stuck here

// I need to somehow use the x and y coordinates, pair them up, and then calculate the distances

// between each and every object.

}
``````

So the two main things, and yes they are big things, are:

(1) my confusion on what to specify for the execution configuration

(2) how i calculate the distances array on the GPU

For (1), I’m guessing I should make my blocksize equal to 16. Thus, this would result in a dimBlock of 256…which seems to be a popular number ppl reference on these forums.

But for dimGrid, I really don’t get what I’m trying to accomplish there.

And then comes (2), which is the biggee. How to make this happen in the first place.

Whew.

Whoever, can help, thank you in advance.

And just to help with the translation from C to CUDA, this is how I would calculate the distance array on the host:

``````int i, j, index;

for (int i=0; i < numObjects; ++i) {

for (int j=0; j < numObjects; ++j) {

// Just calculates the distance between objects "i" and "j"

// by referencing the indeces of the xArray and yArray and using the distance formula

distance[i][j] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));

// Now, couldn't I calculate the index of the storage mapping function directly:

// index = i + j*numObjects;

// distance[index] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));

// ??? But does this help me somehow in the translation

}

}
``````

On many of the 2 for-loop examples I’ve seen, they compute the storage mapping function and use that to reference the index of the arrays. I’m guessing this is to somewhat clarify the translation from C to CUDA.?.

I’m referring to the example shown on page 7 here:

http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf

My confusion is how to make this happen with my specific array to be calculated (distance), and then what to put in the execution configuration and how to use those values in the kernel.

Thanks.

Okay, I’m trying to make sense of things as I wait for your response…so this may be waaaaay off. How does this look:

Here is the Execution Configuration:

``````dim3 dimBlock(blockSize);

dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);
``````

Considering that my xArray and yArray are both 1d arrays, I really don’t need to specify more than one dimention for the dimBlock or dimGrid right? Is there any advantage of doing so?

So here, blockSize is 256, and then based on the example above with numObjects = 100,000, dimGrid would get the value of 391. So there would be 391 blocks, each with 256 threads…so a few more threads than needed, and we take care of this in the Kernel…

Here is the Kernel:

``````__global__ computeDistanceArray (int *x, int *y, int *d, int numObjects) {

int i = blockIdx.x*blockDim.x + threadIdx.x;

int j = blockIdx.x*blockDim.x + threadIdx.x;

int idx = i + j*numObjects;

if (i < numObjects && j < numObjects)

distance[idx] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));

}
``````

To be honest, I’m doubting myself on this one, as it just seems funky. I tried to write what made sense.

Of course, I need to get the id of the threads, and that is done with the int i variable. However, as you see, i simply repeated this again with the int j variable. Why? My logic was that when i compute the distance array, since the x and y coordinates pair up, meaning that xArray[412] must be paired with yArray[412], because of this, i needed to somehow show this in the equation.

But does this work? I’m not too confident.

And I’m not sure about the distance array. Can I use this instead:

``````distance[i][j] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));
``````

Thanks for the help!

Okay, I’ve now tried to implement this, and I got all excited thinking it was working. But something is definitely wrong. Here’s the gist of my code:

``````// ******* KERNEL *******//

__global__ computeDistanceArray (int *x, int *y, int *d, int numObjects) {

int i = blockIdx.x*blockDim.x + threadIdx.x;

int j = blockIdx.x*blockDim.x + threadIdx.x;

int idx = i + j*numObjects;

if (i < numObjects && j < numObjects)

distance[idx] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));

}

int blockSize = 256;

int numObjects = 100000;

int main() {

// calculate "size" to be used for allocating memory of x & y coord. arrays

int size = numObjects*sizeof(int);

// Also calculate "size" of the distance array

int size_d = numObjects*numObjects*sizeof(int);

// make arrays for x and y coordinates, as well as the "to be computed"

// distance array, and allocate necessary space on host

int *xArray, *yArray, *distanceArray;

xArray = malloc(size);

yArray = malloc(size);

distanceArray = malloc(size_d);

// distance array is a 2d array as described above

// now assume those x & y arrays have values in them

// and we want to calculate the "distance" array mentioned previously

// So we need to allocate space on the Device for the x & y arrays

// as well as the distance array we will calculate.

// So I start by making pointers to these arrays on the Device:

int *x_onDevice, *y_onDevice, *distanceArray_onDevice;

// the following just stores a result if i want to check what happened

cudaError_t result;

// And now we allocate memory on the Device for each of these arrays:

result = cudaMalloc((void**)&x_onDevice, size);

result = cudaMalloc((void**)&y_onDevice, size);

result = cudaMalloc((void**)&distanceArray_onDevice, size_d);

// Now I have to copy the x & y arrays to the device:

result = cudaMemcpy(x_onDevice, xArray, size, cudaMemcpyHostToDevice);

result = cudaMemcpy(y_onDevice, yArray, size, cudaMemcpyHostToDevice);

//*********************************************************

// Now I need to make the execution configuration which I am confused with

// I'm not sure as to how I choose the values for dimBlock and dimGrid

//*********************************************************

dim3 dimBlock(blockSize);

dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);

computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

// Now I simply bring the result back to the Host

result = cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);

// Do other computation here as needed 4 program

// Now I can free all allocated space on Host and on Device:

free(xArray); free(xArray); free(distanceArray);

cudaFree(x_onDevice); cudaFree(y_onDevice); cudaFree(distanceArray_onDevice);

}
``````

That is NOT exactly copied from my program, so expect there are perhaps some misspellings, etc.

But it is the meaning i’m looking for. Something I am doing, probably with the execution configuration or with the Kernel itself (or both) is way wrong.

Any help would be appreciated.

Thanks.

Ok, well, it is not all bad. I added many switch statements to the code, allowing me to check the value of “result” after each cuda reference.

Everything returned cudaSuccess EXCEPT when i tried to cudaMalloc the space for the distanceArray_onDevice. It’s the third one shown below.

``````result = cudaMalloc((void**)&x_onDevice, size);

result = cudaMalloc((void**)&y_onDevice, size);

result = cudaMalloc((void**)&distanceArray_onDevice, size_d);
``````

I don’t even know what error that returned because it wasn’t one of the two errors provided in the reference manual. So my switch statement just printed “other”.

I’m guessing this was just because it was a HUUUUUGE amount of space? I had the program set to 10,000 objects. size_d was equal to (10,000)*(10,000)*sizeof(int). Bottom line, a HUGE number. I then reduced the # of objects to 1,000 and it did give me a cudaSuccess on that particular malloc.

It’s still NOT WORKING…so i’m still back to looking at the execution configuration and the kernel. However, that initial problem seems to have been that i was asking to allocate too much space.?. Does that sound right?

Well, just b4 I hed off to sleep, let me say what is happening in case anyone reads this.

All of the cuda calls result in “cudaSuccess”. So those are working.

However, the distance_d array that is computed on the device is resulting in an array of all ZEROES.

Once the resulting array is copied back, which does copy successfully, for testing purposes, I printed out several array indeces, and all of the distance values are 0. So something is wrong with my coding of the kernel.

Ideas.?.

I would help more, but I’m at a conference all week (and had to prepare for the last couple days). Maybe someone else can step in and help…?

You are on the right track, though.

To avoid so much memory usage, you should consider not storing all the distance values. Instead, just run one thread per object in your simulator and have it build the neighbor list for that object. Did you look at the information I linked you to before? It documents a sliding window technique to maximize use of shared memory for these types of problems. The Nbody sample in the SDK uses the same technique.

Thanks. Hopefully someone else can help out a bit.

For now, i’m not worried about this most effective or efficient solution. I knew that storing all the distance values would be a problem and a bottleneck, and I will work on that later.

But for now, i’m just wanting to first start by at least having this trivial method work…heck, even if it is slower! I just want it to calculate the distances and know that the kernel is doing it properly. As mentioned, it is currently returning all zeroes for the distance values. So I am stumped here. But yes, once I get that fixed, then I can working on more effective techniques.

So can anyone take a look and explain what I’m doing wrong with my kernel or perhaps the execution configuration (or both)? Why is it returning all zeroes?

Thanks.

Can anyone take a look at my Kernel and shed some light as to what is wrong. Once I get the distance array back on the host side, I print out the values of a few random indeces, and they are all ZERO. Each and every index has the value 0 stored in it. Why? What am I doing wrong with the Kernel?

Regarding the coding of the Kernel, I explained my perhaps crazy thought process previously:

So, bottom line, it is not working. It is getting all zeroes. But why?

Looks like you’re missing a parenthesis here? Not sure how that compiled like that…

``````dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);
``````

I don’t really think that’s you problem though. Perhaps I will take a look tomorrow if no one else does before then.

Wow. GREAT eye.

Um, but as mentioned, I originally typed this code here on the forum. Afterwards, I then took it into my simulator, fixed any mistakes, typos, etc. (including the one you mentioned), and then tried to run it. The compiler did indeed yell at me regarding that missing parenthesis; I corrected it yesterday. But either way, good catch, as I’m not sure if too many ppl would have noticed.

If you do have the time tomorrow, please do see if you can assist with why the distance array is filling with all zeroes.

``````// ******* KERNEL *******//

__global__ computeDistanceArray (int *x, int *y, int *d, int numObjects) {

int i = blockIdx.x*blockDim.x + threadIdx.x;

int j = blockIdx.x*blockDim.x + threadIdx.x;

int idx = i + j*numObjects;

if (i < numObjects && j < numObjects)

d[idx] = (int) sqrt(((x[i] - x[j])^2) + ((y[i] - y[j])^2));

}
``````

Hope thats a typo, but you are using the same calculation for i and j.

So what you are basicly doing is something like sqrt((a-a)^2 + (b-B)^2) what hopefully always will be zero.

Second problem is, that you are launching about numObjects threads on the GPU which calculates one distance each and expect numObjects * numObjects results.

So you either have to implement a for-loop in your kernel or launch about numObjects^2 threads. Maybe someone can tell which would be more effective in terms of memory access, but i guess the version with more threads will be the better one.