Grain Growth Simulation unexpected behaviour

I made a simple grain growth simulation using JCuda/JOGL interoperability. The program works: in CPU mode everything is fine, but when I run the CUDA version the growth of the grains (their shape) isn't correct. The idea of grain growth is based on a cellular automaton. Grain growth simulations are used in metallurgy, but the rules in this version are extremely simple. I use a VBO to store the positions of the points (first half of the VBO) and the colors of the points (second half of the VBO). The positions are initialized once, and then the simulation only uses the second half of the VBO plus auxiliary objects (a buffer and an array) to store colors. Each position and color has x, y, z and w components. Each position has w = 1.0f all the time, but when a color has w = 0.0f there is no grain in that cell. I simplified the rules so that every grain has the same color, which saves me some computation.
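To make the layout concrete, this is roughly how I think of the indexing (the exact constants below are my assumption for this sketch, not copied from the real code): the VBO holds 1,000,000 float4 positions followed by 1,000,000 float4 colors, so on the Java side offsetColor would be 4,000,000 floats, while in the kernel the same color is reached by adding 1,000,000 to the float4 index:

// sketch only: hypothetical helper, assuming width = height = 1000 and
// the colors stored directly after the positions in the same VBO
__device__ float4 colorAt(const float4* vbo, int x, int y, int width)
{
    int cell = y + x * width;      // flat cell index, 0 .. width*width - 1
    return vbo[cell + 1000000];    // the color half starts after 1,000,000 positions
}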

At the beginning the program randomizes some initial grains which will then grow. The grain mesh dimensions are 1000 x 1000.

In the grain growth simulation cells can only be born. The main rule is: when a cell is empty (it has no color, i.e. if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)), it can take the color that is most frequent among its 8 surrounding neighbours. Since every initial grain has the same color, I don't have to check which color is the most frequent in the tab; I can just take the first one. I use if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999)) to avoid an index-out-of-bounds exception.
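Written compactly, the rule is essentially this (just a sketch using the colorAt helper from above; in the real code every one of the 8 neighbours is checked explicitly):

// sketch only: returns the color of the first non-empty neighbour of (x, y),
// which is enough here because every grain has the same color anyway
__device__ float4 firstNeighbourColor(const float4* vbo, int x, int y, int width)
{
    for(int dx = -1; dx <= 1; dx++)
    {
        for(int dy = -1; dy <= 1; dy++)
        {
            if(dx == 0 && dy == 0) continue;                  // skip the cell itself
            float4 c = colorAt(vbo, x + dx, y + dy, width);
            if(c.w != 0.0f) return c;                         // this neighbour carries a grain
        }
    }
    return make_float4(0.0f, 0.0f, 0.0f, 0.0f);               // no colored neighbour: stay empty
}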

The question is: where did I make a mistake when creating the kernel based on the CPU version?

CUDA settings:

cuFuncSetBlockShape(function, 10, 10, 1);

cuLaunchGrid(function, 100, 100);

The CPU version works correctly. Grains at some stage of growth: (screenshot)

private void cpu(GL gl)
{
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo);
    byteBuffer = gl.glMapBuffer(GL3.GL_ARRAY_BUFFER, GL3.GL_READ_WRITE);
    if (byteBuffer == null)
    {
        throw new RuntimeException("Unable to map buffer");
    }
    data = byteBuffer.asFloatBuffer();
    for(int i = 0; i < 1000; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            n = 0;
            if((i > 0) && (j > 0) && (i < 999) && (j < 999))
            {
                if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)
                {
                    if(data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get(((j + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + i * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get(((j + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + i * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                    }
                }

                if(neighbourColors[0].getW() != 0)  // here there should be a function that returns the most frequent color, but I simplified this
                {
                    colorsBuffer.put(((j + i * grainMeshWidth) * 4), neighbourColors[0].getX());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetY), neighbourColors[0].getY());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetZ), neighbourColors[0].getZ());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetW), neighbourColors[0].getW());
                    for(int n = 0; n < 8; n++)
                    {
                        neighbourColors[n].reset();
                    }
                }
            }
        }
    }

    for(int i = 0; i < 1000; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            if((i > 0) && (j > 0) && (i < 999) && (j < 999))
            {
                if(colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW) != 0)
                {
                    data.put((((j + i * grainMeshWidth) * 4) + offsetColor), colorsBuffer.get((j + i * grainMeshWidth) * 4));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetY), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetY));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetZ), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetZ));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW));
                }
            }
        }
    }
    gl.glUnmapBuffer(GL3.GL_ARRAY_BUFFER);
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
}

The CUDA version at the same stage of growth (the shape of the grains should be identical to the one above): (screenshot)

extern "C"

__global__ void graingrowth(float4* vbo, float4* colors, int width) 

{ 

	

	int tx = blockIdx.x * blockDim.x + threadIdx.x; 

	int ty = blockIdx.y * blockDim.y + threadIdx.y; 

	

	float4 tab[8];

	for(int i = 0; i < 8; i++)

	{

		tab[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

	}

	int x = 0;

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(vbo[(ty + tx * width) + 1000000].w == 0.0f)

	     {

			

		  if(vbo[((ty - 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

	

		       x++;

	          }

		  

	          if(vbo[(ty + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty + (tx - 1) * width) + 1000000]; 

			

		       x++;	

	          

		  }

		  

	          if(vbo[((ty + 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx - 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[((ty + 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + tx * width) + 1000000];

	               x++;

	          }

		  if(vbo[((ty + 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx + 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[(ty + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty  + (tx + 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[((ty - 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx + 1) * width) + 1000000];

		       x++;

	          }

	          if(vbo[((ty - 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + tx * width) + 1000000];

	          }

	      }

	      if(tab[0].w != 0)

	      {

	           colors[(ty + tx * width)] = tab[0];

		   

	      }

	   

	}

	

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(colors[(ty + tx * width)].w != 0.0f)

	     {

	          vbo[(ty + tx * width) + 1000000] = colors[(ty + tx * width)];

	     }

	}

		

}

It looks like maybe you're writing to the same array you're reading from? This works in the serial version, but in parallel you can get race conditions. Try double buffering the array, so you write to a different array than the one you read from, and then swap.
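Roughly like this, just as a sketch (not your exact rules, names or layout): keep two separate color arrays, read all neighbours only from the input array, write the decision only to the output array, and swap the two between steps.

// sketch of a double-buffered step: 'in' is only read, 'out' is only written
extern "C"
__global__ void graingrowth_step(const float4* in, float4* out, int width)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if(tx >= width || ty >= width) return;

    int cell = ty + tx * width;
    out[cell] = in[cell];                                   // carry the old state over by default

    bool border = (tx == 0 || ty == 0 || tx == width - 1 || ty == width - 1);
    if(border || in[cell].w != 0.0f) return;                // border cells and grown cells stay as they are

    for(int dx = -1; dx <= 1; dx++)
    {
        for(int dy = -1; dy <= 1; dy++)
        {
            if(dx == 0 && dy == 0) continue;
            float4 c = in[(ty + dy) + (tx + dx) * width];
            if(c.w != 0.0f)                                 // first colored neighbour wins, since
            {                                               // every grain has the same color
                out[cell] = c;
                return;
            }
        }
    }
}

After each launch the output buffer holds the new state; either swap the two pointers for the next step or copy it back into the color half of the VBO before drawing.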

Yes.

In the kernel, an expression like this:

tab[x] = make_float4((vbo[((ty - 1) + (tx - 1) * width) + 1000000].x),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].y),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].z),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].w));

can be written as:

tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

so I rewrote the kernel to make the code more compact. How about using __syncthreads()?

The kernel with __syncthreads() also doesn't work properly:

extern "C"

__global__ void graingrowth(float4* vbo, float4* colors, int width) 

{ 

	int tx = blockIdx.x * blockDim.x + threadIdx.x; 

	int ty = blockIdx.y * blockDim.y + threadIdx.y; 

	

	float4 tab[8];

	for(int i = 0; i < 8; i++)

	{

		tab[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

	}

	int x = 0;

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(vbo[(ty + tx * width) + 1000000].w == 0.0f)

	     {

			

		  if(vbo[((ty - 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

                       x++;

	          }

		  

	          if(vbo[(ty + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty + (tx - 1) * width) + 1000000]; 

                       x++;	

	          

		  }

		  

	          if(vbo[((ty + 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx - 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty + 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + tx * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty + 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx + 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[(ty + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty  + (tx + 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty - 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx + 1) * width) + 1000000];

                       x++;

	          }

	          if(vbo[((ty - 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + tx * width) + 1000000];

	          }

	      }

	      if(tab[0].w != 0)

	      {

	           colors[(ty + tx * width)] = tab[0];

		   

	      }

	}

	__syncthreads();

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(colors[(ty + tx * width)].w != 0.0f)

	     {

	          vbo[(ty + tx * width) + 1000000] = colors[(ty + tx * width)];

	     }

	}

}

Anyone?

Hmm, I misread the thread’s title as “groin growth simulation”…

Have you tried double buffering, as was suggested to you?

OK, now my program with double buffering works properly, but I can't understand what was wrong before, when I was trying to use __syncthreads(). Could you give me a more precise explanation?


__syncthreads() only synchronizes the threads of a block, not all threads of the grid. There is no inter-block synchronization in CUDA since usually not all blocks run concurrently.


Double buffering seems to be the only way to resolve problems like this. So without double buffering there was a race condition? How would you describe it? Some threads did their work faster than others and modified the vbo, so the "slower" threads ended up reading incorrect data…?


Blocks don't necessarily all run in parallel. If you invoke a kernel with more blocks than your hardware is capable of executing in parallel (which is the common use case), then the remaining blocks will be scheduled to run after the first wave of blocks finishes. In general, the order in which blocks execute is simply undefined.

So in your case, some of the blocks will see the original values in the array, some will see the modified content. Separating input and output solves this issue.
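Host-side, the pattern is just a ping-pong over two buffers, for example (plain CUDA runtime API for illustration; with JCuda and the driver API you would do the same thing by swapping the two CUdeviceptr handles between launches — the names here and the graingrowth_step kernel refer to the sketch posted above, not to your actual code):

#include <cuda_runtime.h>

// the double-buffered kernel sketched earlier in the thread
extern "C" __global__ void graingrowth_step(const float4* in, float4* out, int width);

void runGrowth(float4* colorsA, float4* colorsB, int width, int steps)
{
    // colorsA holds the initial grains, colorsB is a scratch buffer of the same size
    dim3 block(10, 10);
    dim3 grid(width / block.x, width / block.y);                      // 100 x 100 blocks for width = 1000

    for(int step = 0; step < steps; step++)
    {
        graingrowth_step<<<grid, block>>>(colorsA, colorsB, width);   // read A, write B
        float4* tmp = colorsA;                                        // swap: the last output becomes the next input
        colorsA = colorsB;
        colorsB = tmp;
    }
    cudaDeviceSynchronize();
    // after the loop, colorsA points at the most recent state; copy it into the color half of the VBO for drawing
}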
