Grain Growth Simulation unexpected behaviour

I made a simple grain growth simulation using JCuda/JOGL interoperability. The program works: in CPU mode everything is fine, but when I run the CUDA version the growth of the grains (their shape) isn't correct. The idea of grain growth is based on a cellular automaton. Grain growth simulations are used in metallurgy, but the rules in this version are extremely simple. I use a VBO to store the positions of the points (first half of the VBO) and the colors of the points (second half of the VBO). The positions are initialized once, and then the simulation only uses the second half of the VBO plus auxiliary objects (a buffer and an array) to store colors. Each position and color has x, y, z and w components. Each position has w = 1.0f all the time, but when a color has w = 0.0f there is no grain in that cell. I simplified the rules so that every grain has the same color, which saves me some computation.
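To make the layout concrete, this is roughly how I think of the indexing (the exact constants below are my assumption for this sketch, not copied from the real code): the VBO holds 1,000,000 float4 positions followed by 1,000,000 float4 colors, so on the Java side offsetColor would be 4,000,000 floats, while in the kernel the same color is reached by adding 1,000,000 to the float4 index:

// sketch only: hypothetical helper, assuming width = height = 1000 and
// the colors stored directly after the positions in the same VBO
__device__ float4 colorAt(const float4* vbo, int x, int y, int width)
{
    int cell = y + x * width;      // flat cell index, 0 .. width*width - 1
    return vbo[cell + 1000000];    // the color half starts after 1,000,000 positions
}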

At the beginning the program randomizes some initial grains which will then grow. The grain mesh dimensions are 1000 x 1000.

In the grain growth simulation cells can only be born. The main rule is: when a cell is empty (it has no color, i.e. if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)), it can take the color that is most frequent among its 8 surrounding neighbours. Since every initial grain has the same color, I don't have to check which color is the most frequent in the tab; I can just take the first one. I use if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999)) to avoid an index-out-of-bounds exception.
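Written compactly, the rule is essentially this (just a sketch using the colorAt helper from above; in the real code every one of the 8 neighbours is checked explicitly):

// sketch only: returns the color of the first non-empty neighbour of (x, y),
// which is enough here because every grain has the same color anyway
__device__ float4 firstNeighbourColor(const float4* vbo, int x, int y, int width)
{
    for(int dx = -1; dx <= 1; dx++)
    {
        for(int dy = -1; dy <= 1; dy++)
        {
            if(dx == 0 && dy == 0) continue;                  // skip the cell itself
            float4 c = colorAt(vbo, x + dx, y + dy, width);
            if(c.w != 0.0f) return c;                         // this neighbour carries a grain
        }
    }
    return make_float4(0.0f, 0.0f, 0.0f, 0.0f);               // no colored neighbour: stay empty
}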

The question is: where did I make a mistake when creating the kernel based on the CPU version?

CUDA settings:

cuFuncSetBlockShape(function, 10, 10, 1);

cuLaunchGrid(function, 100, 100);

The CPU version works correctly. Grains at some stage of growth: (screenshot)

private void cpu(GL gl)
{
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo);
    byteBuffer = gl.glMapBuffer(GL3.GL_ARRAY_BUFFER, GL3.GL_READ_WRITE);
    if (byteBuffer == null)
    {
        throw new RuntimeException("Unable to map buffer");
    }
    data = byteBuffer.asFloatBuffer();
    for(int i = 0; i < 1000; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            n = 0;
            if((i > 0) && (j > 0) && (i < 999) && (j < 999))
            {
                if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)
                {
                    if(data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get(((j + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + i * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get(((j + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                        n++;
                    }
                    if(data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                    {
                        neighbourColors[n] = new Float4(data.get((((j - 1) + i * grainMeshWidth) * 4) + offsetColor),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ),
                                                        data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                    }
                }

                if(neighbourColors[0].getW() != 0)  // here there should be a function that returns the most frequent color, but I simplified this
                {
                    colorsBuffer.put(((j + i * grainMeshWidth) * 4), neighbourColors[0].getX());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetY), neighbourColors[0].getY());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetZ), neighbourColors[0].getZ());
                    colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetW), neighbourColors[0].getW());
                    for(int n = 0; n < 8; n++)
                    {
                        neighbourColors[n].reset();
                    }
                }
            }
        }
    }

    for(int i = 0; i < 1000; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            if((i > 0) && (j > 0) && (i < 999) && (j < 999))
            {
                if(colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW) != 0)
                {
                    data.put((((j + i * grainMeshWidth) * 4) + offsetColor), colorsBuffer.get((j + i * grainMeshWidth) * 4));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetY), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetY));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetZ), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetZ));
                    data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW));
                }
            }
        }
    }
    gl.glUnmapBuffer(GL3.GL_ARRAY_BUFFER);
    gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
}

The CUDA version at the same stage of growth (the shape of the grains should be identical to the one above): (screenshot)

extern "C"

__global__ void graingrowth(float4* vbo, float4* colors, int width) 

{ 

	

	int tx = blockIdx.x * blockDim.x + threadIdx.x; 

	int ty = blockIdx.y * blockDim.y + threadIdx.y; 

	

	float4 tab[8];

	for(int i = 0; i < 8; i++)

	{

		tab[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

	}

	int x = 0;

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(vbo[(ty + tx * width) + 1000000].w == 0.0f)

	     {

			

		  if(vbo[((ty - 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

	

		       x++;

	          }

		  

	          if(vbo[(ty + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty + (tx - 1) * width) + 1000000]; 

			

		       x++;	

	          

		  }

		  

	          if(vbo[((ty + 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx - 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[((ty + 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + tx * width) + 1000000];

	               x++;

	          }

		  if(vbo[((ty + 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx + 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[(ty + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty  + (tx + 1) * width) + 1000000];

		       x++;

	          }

		  if(vbo[((ty - 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx + 1) * width) + 1000000];

		       x++;

	          }

	          if(vbo[((ty - 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + tx * width) + 1000000];

	          }

	      }

	      if(tab[0].w != 0)

	      {

	           colors[(ty + tx * width)] = tab[0];

		   

	      }

	   

	}

	

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(colors[(ty + tx * width)].w != 0.0f)

	     {

	          vbo[(ty + tx * width) + 1000000] = colors[(ty + tx * width)];

	     }

	}

		

}

It looks like maybe you're writing to the same array you're reading from? This works in the serial version, but in parallel you can get race conditions. Try double buffering the array, so you write to a different array than the one you read from, and then swap.
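Roughly like this, just as a sketch (not your exact rules, names or layout): keep two separate color arrays, read all neighbours only from the input array, write the decision only to the output array, and swap the two between steps.

// sketch of a double-buffered step: 'in' is only read, 'out' is only written
extern "C"
__global__ void graingrowth_step(const float4* in, float4* out, int width)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if(tx >= width || ty >= width) return;

    int cell = ty + tx * width;
    out[cell] = in[cell];                                   // carry the old state over by default

    bool border = (tx == 0 || ty == 0 || tx == width - 1 || ty == width - 1);
    if(border || in[cell].w != 0.0f) return;                // border cells and grown cells stay as they are

    for(int dx = -1; dx <= 1; dx++)
    {
        for(int dy = -1; dy <= 1; dy++)
        {
            if(dx == 0 && dy == 0) continue;
            float4 c = in[(ty + dy) + (tx + dx) * width];
            if(c.w != 0.0f)                                 // first colored neighbour wins, since
            {                                               // every grain has the same color
                out[cell] = c;
                return;
            }
        }
    }
}

After each launch the output buffer holds the new state; either swap the two pointers for the next step or copy it back into the color half of the VBO before drawing.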

Yes.

In the kernel, an expression like this:

tab[x] = make_float4((vbo[((ty - 1) + (tx - 1) * width) + 1000000].x),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].y),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].z),
                     (vbo[((ty - 1) + (tx - 1) * width) + 1000000].w));

can be written as:

tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

so I rewrote the kernel to make the code more compact. How about using __syncthreads()?

The kernel with __syncthreads() also doesn't work properly:

extern "C"

__global__ void graingrowth(float4* vbo, float4* colors, int width) 

{ 

	int tx = blockIdx.x * blockDim.x + threadIdx.x; 

	int ty = blockIdx.y * blockDim.y + threadIdx.y; 

	

	float4 tab[8];

	for(int i = 0; i < 8; i++)

	{

		tab[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);

	}

	int x = 0;

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(vbo[(ty + tx * width) + 1000000].w == 0.0f)

	     {

			

		  if(vbo[((ty - 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

                       x++;

	          }

		  

	          if(vbo[(ty + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty + (tx - 1) * width) + 1000000]; 

                       x++;	

	          

		  }

		  

	          if(vbo[((ty + 1) + (tx - 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx - 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty + 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + tx * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty + 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty + 1) + (tx + 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[(ty + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[(ty  + (tx + 1) * width) + 1000000];

                       x++;

	          }

		  if(vbo[((ty - 1) + (tx + 1) * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + (tx + 1) * width) + 1000000];

                       x++;

	          }

	          if(vbo[((ty - 1) + tx * width) + 1000000].w != 0.0f)

		  {

	               tab[x] = vbo[((ty - 1) + tx * width) + 1000000];

	          }

	      }

	      if(tab[0].w != 0)

	      {

	           colors[(ty + tx * width)] = tab[0];

		   

	      }

	}

	__syncthreads();

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))

	{

	     

             if(colors[(ty + tx * width)].w != 0.0f)

	     {

	          vbo[(ty + tx * width) + 1000000] = colors[(ty + tx * width)];

	     }

	}

}

Anyone?

Hmm, I misread the thread’s title as “groin growth simulation”…

Have you tried double buffering, as was suggested to you?

OK, now my program with double buffering works properly, but I can't understand what was wrong before, when I was trying to use __syncthreads(). Could you give me a more precise explanation?


__syncthreads() only synchronizes the threads of a block, not all threads of the grid. There is no inter-block synchronization in CUDA since usually not all blocks run concurrently.


Double buffering seems to be the only way to resolve problems like this. So without double buffering there was a race condition? How would you describe it? Some threads did their work faster than others and modified the vbo, so the "slower" threads ended up reading incorrect data…?


Blocks don't necessarily all run in parallel. If you invoke a kernel with more blocks than your hardware is capable of executing in parallel (which is the common use case), then the remaining blocks will be scheduled to run after the first wave of blocks finishes. In general, the order in which blocks execute is simply undefined.

So in your case, some of the blocks will see the original values in the array, some will see the modified content. Separating input and output solves this issue.
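Host-side, the pattern is just a ping-pong over two buffers, for example (plain CUDA runtime API for illustration; with JCuda and the driver API you would do the same thing by swapping the two CUdeviceptr handles between launches — the names here and the graingrowth_step kernel refer to the sketch posted above, not to your actual code):

#include <cuda_runtime.h>

// the double-buffered kernel sketched earlier in the thread
extern "C" __global__ void graingrowth_step(const float4* in, float4* out, int width);

void runGrowth(float4* colorsA, float4* colorsB, int width, int steps)
{
    // colorsA holds the initial grains, colorsB is a scratch buffer of the same size
    dim3 block(10, 10);
    dim3 grid(width / block.x, width / block.y);                      // 100 x 100 blocks for width = 1000

    for(int step = 0; step < steps; step++)
    {
        graingrowth_step<<<grid, block>>>(colorsA, colorsB, width);   // read A, write B
        float4* tmp = colorsA;                                        // swap: the last output becomes the next input
        colorsA = colorsB;
        colorsB = tmp;
    }
    cudaDeviceSynchronize();
    // after the loop, colorsA points at the most recent state; copy it into the color half of the VBO for drawing
}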
