Scalable models for simulations Allow a user to input the number of species to simulate

Hello, I will start this post as I always do. I am a noob at CUDA and relearning C++, so I may explain things in strange ways, but I hope you can get the idea I’m trying to get across. You have been warned :-)

I am working on a model similar to a common predator-prey model for plants. There are simple differential equations for each plant species, water, and nutrient that relate to one another and enforce competition and coexistence. So far I have been hardcoding the equations according to how many species I want to include, but I would like to avoid this and allow a user to input the number of species and have the equations and arrays take this into account “automatically”.

The problems I’m running into include allocating/populating arrays of proper size and efficiently modifying the dynamics. I have a parameter, say ‘g’ for growth coefficient, that each species will have. I pass into the kernel an unsigned int nSpecies and am trying to do something like

float * g[nSpecies];

g[0] = 123;

g[1] = 456;

...

but it complains that nSpecies is not constant. I then tried

float * g = new float [nSpecies];

g[0] = 123;

g[1] = 456;

...

which complained that new is a host function that cannot be called from a device kernel. I also tried defining a variable thinking it may be considered constant in the compiler’s eyes,

#define ArraySize nSpecies

__device__ FUNCTION(inputs)

{

     float * g[ArraySize];

     g[0] = 123;

     g[1] = 456;

     ...

}

but it complained that ArraySize is not constant either. So I settled for allocating with a number larger than I anticipate actually using, such as 10 species, then just populating the first nSpecies elements of the arrays. This is ok, but definitely not the best way to do this.

My next hurdle was adjusting the dynamics. Each species as well as nutrient and water have their own WxH grid and the equations operate on the cells of these grids. For the species I stack the grids on top of each other so I can pass in a single variable that contains all of the information, then I loop the equations such as

int x = blockIdx.x*blockDim.x + threadIdx.x;

int y = blockIdx.y*blockDim.y + threadIdx.y;

for(unsigned int sp = 0; sp < nSpecies; sp++)

{

     Species[sp*h + y*w + x] += [Species-specific equation];

     Water[y*w + x] += dt * [Species-dependent parts of the equation];

     Nutrient[y*w + x] += dt * [Species-dependent parts of the equation];

}

Water[y*w + x] += dt * [Species-independent parts of the equation];

Nutrient[y*w + x] += dt * [Species-independent parts of the equation];

this seems to work (I haven’t tested it too much) but the program takes a huge hit to its performance. For a single species (nSpecies = 1) the code used to run at ~70 fps, but with the loop it is running at ~45 fps, even though it only loops once. (EDIT: this decrease is the end result of several instances of this loop implementation in different functions. Each one is about a 4-10 fps decrease on its own.)

Does anybody know if it is the for-loop that costs so much overhead? If not, what else could it be? Is allocating arrays the way I do inefficient? Is there a better way to accomplish what I need?

For extra credit, you can read my other posts:

http://forums.nvidia.com/index.php?showtopic=202551

http://forums.nvidia.com/index.php?showtopic=202554

The goal is to eventually allocate all variables in a single place to help the user quickly and accurately adjust the model. The end goal is to allocate default values then read a text file that will update these variables at runtime to allow the user to iterate through several model setups without remaking the source each time. I see this as an extension of the first part, but perhaps it is completely independent.

Thank you for reading my post. I appreciate any and all help!

~Josh

Hello, I will start this post as I always do. I am a noob at CUDA and relearning C++, so I may explain things in strange ways, but I hope you can get the idea I’m trying to get across. You have been warned :-)

I am working on a model similar to a common predator-prey model for plants. There are simple differential equations for each plant species, water, and nutrient that relate to one another and enforce competition and coexistence. So far I have been hardcoding the equations according to how many species I want to include, but I would like to avoid this and allow a user to input the number of species and have the equations and arrays take this into account “automatically”.

The problems I’m running into include allocating/populating arrays of proper size and efficiently modifying the dynamics. I have a parameter, say ‘g’ for growth coefficient, that each species will have. I pass into the kernel an unsigned int nSpecies and am trying to do something like

float * g[nSpecies];

g[0] = 123;

g[1] = 456;

...

but it complains that nSpecies is not constant. I then tried

float * g = new float [nSpecies];

g[0] = 123;

g[1] = 456;

...

which complained that new is a host function that cannot be called from a device kernel. I also tried defining a variable thinking it may be considered constant in the compiler’s eyes,

#define ArraySize nSpecies

__device__ FUNCTION(inputs)

{

     float * g[ArraySize];

     g[0] = 123;

     g[1] = 456;

     ...

}

but it complained that ArraySize is not constant either. So I settled for allocating with a number larger than I anticipate actually using, such as 10 species, then just populating the first nSpecies elements of the arrays. This is ok, but definitely not the best way to do this.

My next hurdle was adjusting the dynamics. Each species as well as nutrient and water have their own WxH grid and the equations operate on the cells of these grids. For the species I stack the grids on top of each other so I can pass in a single variable that contains all of the information, then I loop the equations such as

int x = blockIdx.x*blockDim.x + threadIdx.x;

int y = blockIdx.y*blockDim.y + threadIdx.y;

for(unsigned int sp = 0; sp < nSpecies; sp++)

{

     Species[sp*h + y*w + x] += [Species-specific equation];

     Water[y*w + x] += dt * [Species-dependent parts of the equation];

     Nutrient[y*w + x] += dt * [Species-dependent parts of the equation];

}

Water[y*w + x] += dt * [Species-independent parts of the equation];

Nutrient[y*w + x] += dt * [Species-independent parts of the equation];

this seems to work (I haven’t tested it too much) but the program takes a huge hit to its performance. For a single species (nSpecies = 1) the code used to run at ~70 fps, but with the loop it is running at ~45 fps, even though it only loops once. (EDIT: this decrease is the end result of several instances of this loop implementation in different functions. Each one is about a 4-10 fps decrease on its own.)

Does anybody know if it is the for-loop that costs so much overhead? If not, what else could it be? Is allocating arrays the way I do inefficient? Is there a better way to accomplish what I need?

For extra credit, you can read my other posts:

http://forums.nvidia.com/index.php?showtopic=202551

http://forums.nvidia.com/index.php?showtopic=202554

The goal is to eventually allocate all variables in a single place to help the user quickly and accurately adjust the model. The end goal is to allocate default values then read a text file that will update these variables at runtime to allow the user to iterate through several model setups without remaking the source each time. I see this as an extension of the first part, but perhaps it is completely independent.

Thank you for reading my post. I appreciate any and all help!

~Josh

I’m not too familiar with the new cuda syntax that allows dynamic allocation (malloc) from within a kernel and I’m sure someone else here can explain it much better. What you can definitely do though is allocate your data on the host and pass the pointer to your kernel.
Typically what you do is you compute the size needed with the information you get at runtime and you declare a device pointer, which you initialize using cudaMalloc and the size you just computed.You then pass that pointer to your kernel.

The main difference with your solution is that the information will no longer be stored in quickly accessible registers but instead in the slow global memory.
I don’t think it is a good idea to try and solve your second problem now because it might be linked with the first one. It could be (please correct me if I’m wrong) that declaring big in register arrays reduces the number of concurrent threads thus making your whole computation slow.

I hope this makes sense,
regards,

Guillaume

I’m not too familiar with the new cuda syntax that allows dynamic allocation (malloc) from within a kernel and I’m sure someone else here can explain it much better. What you can definitely do though is allocate your data on the host and pass the pointer to your kernel.
Typically what you do is you compute the size needed with the information you get at runtime and you declare a device pointer, which you initialize using cudaMalloc and the size you just computed.You then pass that pointer to your kernel.

The main difference with your solution is that the information will no longer be stored in quickly accessible registers but instead in the slow global memory.
I don’t think it is a good idea to try and solve your second problem now because it might be linked with the first one. It could be (please correct me if I’m wrong) that declaring big in register arrays reduces the number of concurrent threads thus making your whole computation slow.

I hope this makes sense,
regards,

Guillaume

Thank you Guillaume,

The goal of this project is to make the simulation as fast as possible to allow, for example, high resolution weather forecasting for large areas in a short amount of time (hours instead of days). User convenience seems to be a competing force against computational speed.

I tried using cudaMalloc before but I could never make it work, the data types were the most common error message printed. I gave up on that because it was eating too much time just to debug something I didn’t fully understand. I may go back to it at some point to benchmark the performance.

As I understand it, big in register arrays does not reduce the number of concurrent threads, instead it increases the number of serial threads. I think it has the same effect, so maybe we are just talking syntax here.

For now I’m trying this approach:

__device__ FUNCTION(inputs, unsigned int nSpecies)

{

     if (nSpecies == 1)

     {

          float g = 123;

//do stuff with g

     }

else if (nSpecies == 2)

     {

          float g[2];

          g[0] = 123;

          g[1] = 456;

//do stuff with g[0]

          //do stuff with g[1]

     }

...

else if (nSpecies == N)

     {

          float g[N];

          g[0] = 123;

          g[1] = 456;

          ...

          g[N-1] = 789;

//do stuff with g[0]

          //do stuff with g[1]

          ...

          //do stuff with g[N-1]

     }

}

and the performance is going back up to what it was before I tried to generalize for arbitrary number of species. I just personally don’t like this coding style.

Thank you Guillaume,

The goal of this project is to make the simulation as fast as possible to allow, for example, high resolution weather forecasting for large areas in a short amount of time (hours instead of days). User convenience seems to be a competing force against computational speed.

I tried using cudaMalloc before but I could never make it work, the data types were the most common error message printed. I gave up on that because it was eating too much time just to debug something I didn’t fully understand. I may go back to it at some point to benchmark the performance.

As I understand it, big in register arrays does not reduce the number of concurrent threads, instead it increases the number of serial threads. I think it has the same effect, so maybe we are just talking syntax here.

For now I’m trying this approach:

__device__ FUNCTION(inputs, unsigned int nSpecies)

{

     if (nSpecies == 1)

     {

          float g = 123;

//do stuff with g

     }

else if (nSpecies == 2)

     {

          float g[2];

          g[0] = 123;

          g[1] = 456;

//do stuff with g[0]

          //do stuff with g[1]

     }

...

else if (nSpecies == N)

     {

          float g[N];

          g[0] = 123;

          g[1] = 456;

          ...

          g[N-1] = 789;

//do stuff with g[0]

          //do stuff with g[1]

          ...

          //do stuff with g[N-1]

     }

}

and the performance is going back up to what it was before I tried to generalize for arbitrary number of species. I just personally don’t like this coding style.

Some variables seem to be missing above like what is dt ? but ok…

To start with the index calculation seem to be a bit fuzzy;

int x = blockIdx.x*blockDim.x + threadIdx.x;

int y = blockIdx.y*blockDim.y + threadIdx.y;

Second these index calculations also seem a bit fuzzy:

Species[sph + yw + x]

 Water[y*w + x] += 

 Nutrient[y*w + x] += 

Third you could also try to seperate these 3 arrays each into it’s own for loop, perhaps that would help, not sure about that.

When calculating indexes I would recommend (though I am cuda noobie) to try and calculate the calculations into a linear variable, so a single variable.

So this code should look like:

Species[ vSomeIndex ] =

Water[ vSomeOtherIndex ] =

I am not going to try and make sense of your index calculations because I have to start coding myself/work on my own code.

But I suspect your code does something strange with the indexing/accessing of the memory. Perhaps the memory access is bad in combination with the threads. Perhaps each thread is doing wild random access instead of a more lineair access.

So my advice would be the following:

  1. First try to explain to yourself and us what your index calculations represent, what they are doing and what it means. All these little letters like w and h and sp what do they mean ?

  2. Then try to explain how you assured that memory access is linear, try to explain how each thread is accessing the memory.

  3. Perhaps also try to explain what your memory access strategy was in combination with the threading.

  4. Perhaps also explain what your “work distribution” strategy was… how should the memory be distributed for blocks/grids/multi processors etc.

These questions are probably hard for anybody but that’s probably what parallel computing is all about.

These questions need to be answered, a parallel strategy must be developed.

If you can explain your strategy then we can then analyze that strategy to see if it can be improved.

If you fail at explaining it then that’s not a shame or anything since many cuda noobs including me… but then at least this could be an indication of an unthought/unconcious strategy which might be the problem.

At least try to explain what the code/indexing is doing this might give you more insight, please do share, so then we can learn from it as well and perhaps offer different indexing strategies for you ;)

Some variables seem to be missing above like what is dt ? but ok…

To start with the index calculation seem to be a bit fuzzy;

int x = blockIdx.x*blockDim.x + threadIdx.x;

int y = blockIdx.y*blockDim.y + threadIdx.y;

Second these index calculations also seem a bit fuzzy:

Species[sph + yw + x]

 Water[y*w + x] += 

 Nutrient[y*w + x] += 

Third you could also try to seperate these 3 arrays each into it’s own for loop, perhaps that would help, not sure about that.

When calculating indexes I would recommend (though I am cuda noobie) to try and calculate the calculations into a linear variable, so a single variable.

So this code should look like:

Species[ vSomeIndex ] =

Water[ vSomeOtherIndex ] =

I am not going to try and make sense of your index calculations because I have to start coding myself/work on my own code.

But I suspect your code does something strange with the indexing/accessing of the memory. Perhaps the memory access is bad in combination with the threads. Perhaps each thread is doing wild random access instead of a more lineair access.

So my advice would be the following:

  1. First try to explain to yourself and us what your index calculations represent, what they are doing and what it means. All these little letters like w and h and sp what do they mean ?

  2. Then try to explain how you assured that memory access is linear, try to explain how each thread is accessing the memory.

  3. Perhaps also try to explain what your memory access strategy was in combination with the threading.

  4. Perhaps also explain what your “work distribution” strategy was… how should the memory be distributed for blocks/grids/multi processors etc.

These questions are probably hard for anybody but that’s probably what parallel computing is all about.

These questions need to be answered, a parallel strategy must be developed.

If you can explain your strategy then we can then analyze that strategy to see if it can be improved.

If you fail at explaining it then that’s not a shame or anything since many cuda noobs including me… but then at least this could be an indication of an unthought/unconcious strategy which might be the problem.

At least try to explain what the code/indexing is doing this might give you more insight, please do share, so then we can learn from it as well and perhaps offer different indexing strategies for you ;)

You are correct, I completely overlooked the fact that I’m using variables anybody else would have no clue what they were. Let me explain:

The w and h correspond to the width and height of the grid (number of cells in each direction). Each cell is mapped to a pixel in an openGL display window that shows how the biomass (plants) grow and spread over time. I have a fixed grid size for all variables, and for multiple species of the same type (biomass or nutrient) I am stacking the grids on top of each other. This was intended to allow me to pass in 1 grid or 10 grids wrapped up in a single variable, then I pull them apart inside the device function. Because the grid size is a fixed w x h, I don’t know if there is a better way to autonomously parse the calculations into individual threads, so I parse it based on a single species, then the variable sp is how I loop through the grids. The 2D grids are stored in a 1D array, so each consecutive grid is indexed by an offset of “h” from the previous, hence the sp*h where sp increases by 1 each iteration. It is supposed to mean “species” so you are calculating the next time step of “species 0”, then “species 1”, etc.

The equations are first-order linear approximations of time derivatives, thus the general form of the equation is Var(t + dt) = Var(t) + dt * (dVar / dt) where dt is some delta time step. The equations have some species-dependent parts that need to be modified for a given number of species, and some parts that are independent of the species. I tried to only loop through the parts that are species-dependent, then add the independent parts afterwards.

So to summarize, the variables for biomass, water and nutrient represent a 2D grid of size w x h and are stored in a 1D array of size (w * h) x 1. When I introduce more species of biomass (more plant species), the 2D grid is then of size w x (h * nSpecies), but the threading strategy is based on w x h grids, so I loop through using an offset sp * h where sp increments each iteration.

The problem is that even with 1 species, the for-loop took my performance down by about 30%.

Thank you for pointing this out. I spend so much time staring at the code that it all seems obvious to me, but I need to remember when I first saw the code and it wasn’t so obvious. Live and learn, right? :-)

You are correct, I completely overlooked the fact that I’m using variables anybody else would have no clue what they were. Let me explain:

The w and h correspond to the width and height of the grid (number of cells in each direction). Each cell is mapped to a pixel in an openGL display window that shows how the biomass (plants) grow and spread over time. I have a fixed grid size for all variables, and for multiple species of the same type (biomass or nutrient) I am stacking the grids on top of each other. This was intended to allow me to pass in 1 grid or 10 grids wrapped up in a single variable, then I pull them apart inside the device function. Because the grid size is a fixed w x h, I don’t know if there is a better way to autonomously parse the calculations into individual threads, so I parse it based on a single species, then the variable sp is how I loop through the grids. The 2D grids are stored in a 1D array, so each consecutive grid is indexed by an offset of “h” from the previous, hence the sp*h where sp increases by 1 each iteration. It is supposed to mean “species” so you are calculating the next time step of “species 0”, then “species 1”, etc.

The equations are first-order linear approximations of time derivatives, thus the general form of the equation is Var(t + dt) = Var(t) + dt * (dVar / dt) where dt is some delta time step. The equations have some species-dependent parts that need to be modified for a given number of species, and some parts that are independent of the species. I tried to only loop through the parts that are species-dependent, then add the independent parts afterwards.

So to summarize, the variables for biomass, water and nutrient represent a 2D grid of size w x h and are stored in a 1D array of size (w * h) x 1. When I introduce more species of biomass (more plant species), the 2D grid is then of size w x (h * nSpecies), but the threading strategy is based on w x h grids, so I loop through using an offset sp * h where sp increments each iteration.

The problem is that even with 1 species, the for-loop took my performance down by about 30%.

Thank you for pointing this out. I spend so much time staring at the code that it all seems obvious to me, but I need to remember when I first saw the code and it wasn’t so obvious. Live and learn, right? :-)

Ok, so if I understand correctly each species has it’s own 2D array.

So each specifies is build up on top of each other. So the entire species is a 3D array.

In that case I think your indexing calculation is wrong.

To calculate an index/pointer to the next species requires the following formula:

BaseOfSpecie = Width * Height * SpecieNumber;

Your index calculation was Height * SpecieNumber only which is probably wrong, you forget to multiple against the width as well.

After the base of the specie is calculated then the threads can do their work on each instance of the specie as follows:

MemoryCell[ BaseOfSpecie + Width * Y + X] = …;

So for example this line of yours:

Species[sph + yw + x] += [Species-specific equation];

Should probably be this:

Species[whsp + w*y + x] += [Species-specific equation];

If c/c++ had support for 3d arrays then this would almost be the same as:
(in reality/implemention the code below would be a pointer array to a pointer array to an array of species data, while the array above is a single 1d array)
(the adventage of a 1d array is it can be treated as a 3d array and element locations can be calculated immediatly, thus 1d arrays are faster as long as calculations can be done fast, while the pointer to array of pointer to array of elements is slower, because multiple pointers need to be retrieved from memory)

Species[sp][y]

3th dimension, 2th dimension, 1th dimension.

By putting the 1th dimension last, going through the array like for sp for y for x would lead to best memory traversal performance it would be sequential… ofcourse here we dealing with parallel access but still best to keep that order I think, until further notice ;) :)

Ok, so if I understand correctly each species has it’s own 2D array.

So each specifies is build up on top of each other. So the entire species is a 3D array.

In that case I think your indexing calculation is wrong.

To calculate an index/pointer to the next species requires the following formula:

BaseOfSpecie = Width * Height * SpecieNumber;

Your index calculation was Height * SpecieNumber only which is probably wrong, you forget to multiple against the width as well.

After the base of the specie is calculated then the threads can do their work on each instance of the specie as follows:

MemoryCell[ BaseOfSpecie + Width * Y + X] = …;

So for example this line of yours:

Species[sph + yw + x] += [Species-specific equation];

Should probably be this:

Species[whsp + w*y + x] += [Species-specific equation];

If c/c++ had support for 3d arrays then this would almost be the same as:
(in reality/implemention the code below would be a pointer array to a pointer array to an array of species data, while the array above is a single 1d array)
(the adventage of a 1d array is it can be treated as a 3d array and element locations can be calculated immediatly, thus 1d arrays are faster as long as calculations can be done fast, while the pointer to array of pointer to array of elements is slower, because multiple pointers need to be retrieved from memory)

Species[sp][y]

3th dimension, 2th dimension, 1th dimension.

By putting the 1th dimension last, going through the array like for sp for y for x would lead to best memory traversal performance it would be sequential… ofcourse here we dealing with parallel access but still best to keep that order I think, until further notice ;) :)

Hi,
What I meant by big in register arrays may reduce the amount of concurrent threads is the following :
big in register arrays mean more register used per thread, which means less “concurrent” threads ie threads will be serialized. So I guess we agreed all along :)

Concerning your first question again, correct me if I’m wrong, you have a variable number of species and a set of constants (such as growth rate etc.) for each of these species. For this, a good idea is to take advantage of the constant memory instead of having each thread initialize its own private (in register) variable. If you’re not already using all the constant memory, you can define the maximum number of species that your program will handle as MAX_SPECIES and the number of constants for each of them as CONSTANT_NB. Then you declare a constant array of size MAX_SPECIES*CONSTANT_NB, initialize it on the host, send it to the device and access it from within your kernel as if it were a normal global variable.

Hi,
What I meant by big in register arrays may reduce the amount of concurrent threads is the following :
big in register arrays mean more register used per thread, which means less “concurrent” threads ie threads will be serialized. So I guess we agreed all along :)

Concerning your first question again, correct me if I’m wrong, you have a variable number of species and a set of constants (such as growth rate etc.) for each of these species. For this, a good idea is to take advantage of the constant memory instead of having each thread initialize its own private (in register) variable. If you’re not already using all the constant memory, you can define the maximum number of species that your program will handle as MAX_SPECIES and the number of constants for each of them as CONSTANT_NB. Then you declare a constant array of size MAX_SPECIES*CONSTANT_NB, initialize it on the host, send it to the device and access it from within your kernel as if it were a normal global variable.

You caught my indexing error and when I fixed that things seemed to work as expected. I am still trying to improve on my methods, but you helped a lot to get me on the right path!

I’m trying to do like what you said, represent a 3D array as a 1D for speed, the weakness I have is that I don’t know any of the dimensions beforehand, and the equations change depending on the third dimension (number of species). I’m trying to find a general way to include all of forms of the equation without hardcoding if-else type switch statement.

Thank you again,

~Josh

P.S. I am having a new problem that I mentioned in a reply to Dude1205, but I’m not sure if you’d see it or not. I created a new topic here: http://forums.nvidia.com/index.php?showtopic=203467. If you could take a look and let me know what you think I’d really appreciate it, but I know you are busy so no hard feelings if you don’t have time :-)

I’m doing something similar to this now. Instead of declaring an array of size MAX_SPECIES * CONSTANT_NB, I’m declaring CONSTANT_NB arrays of size MAX_SPECIES just for readability for people I pass the code to. I agree that a single array containing everything would be better.

I am now having a new problem, that being that using cudamalloc is corrupting my graphics. I created a new topic here: http://forums.nvidia.com/index.php?showtopic=203467

If you could take a look and provide more of your helpful suggestions, I’d be much obliged :-)

I’d really like to help, but I have no experience with openGL and have only worked with graphics cards that were not used for display. It’s probably best if someone else can help you.

Good luck with that !

Regards,

Guillaume