Block execution

Hi All,

I am doing some matrix calculations that require me to move through the matrix column by column. The only way I could think of to do this might not be ideal, but it seems to work (mostly). I have the following kernel, but its output is inconsistent: when I print the result vector at the end, I often get different results, even though the matrix and vector are static and never change. Is this possibly due to the order in which blocks execute? I’ve tried hard to force CUDA to execute the way I need it to, but I’m at a loss. FYI, the code is not the most efficient.

__global__ void Matrix(complex* a, complex* b, complex* c){
	// set initial i value for navigating through matrix
	int i = blockIdx.x*blockDim.x+threadIdx.x;

	// set up column counter to move through matrix from left to right
	int colCtr = 0;

	while(colCtr < N)
	{
		if(blockIdx.x < colCtr){
			// do something above the diagonal
		}
		if(blockIdx.x == colCtr){
			// diagonal
			c[colCtr] = a[i] * b[colCtr];
		}
		else if(blockIdx.x > colCtr){
			// below diagonal
			c[blockIdx.x] = b[blockIdx.x] + a[i] * b[colCtr];
		}

		// advance i to the next column
		i += blockDim.x * N;

		// increment column counter
		colCtr++;

		// copy new results back to original vector
		b[blockIdx.x] = c[blockIdx.x];
	}
}

Here “a” is an NxN matrix, “b” is a vector of size N, and “c” is the output vector of size N.
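For readers following the index arithmetic: with one thread per block, i starts at blockIdx.x and grows by N per column, which suggests "a" is stored column-major. The helper below is only a sketch of that reading; the name flatIndex and the HD macro are made up for illustration and are not part of the posted code.

```cpp
#include <cassert>

// The qualifiers only exist under nvcc; a plain C++ compiler sees nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Flat offset of element (row, col) in an n x n column-major matrix.
// Mirrors the kernel: i starts at the row (blockIdx.x when blockDim.x == 1)
// and advances by n (blockDim.x * N) each time the column counter increments.
HD inline long long flatIndex(int row, int col, int n) {
    assert(row >= 0 && row < n && col >= 0 && col < n);
    return static_cast<long long>(col) * n + row;
}
```

If the matrix is actually stored row-major, the roles of row and col in this formula would be swapped.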

complex is a structure that I put together from two doubles, x and y. I also wrote device code to implement complex addition, subtraction, multiplication and division. I am calling the kernel like this:

Matrix<<<N,1>>>(d_a, d_b, d_c);

Where N is set to 6000.

Thoughts?
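The complex type itself was not posted. Going only by the description (two doubles x and y, plus device arithmetic), it might look roughly like the sketch below. Treating x as the real part and y as the imaginary part, the operator overloads, and the HD macro are all assumptions.

```cpp
#include <cassert>

// The qualifiers only exist under nvcc; a plain C++ compiler sees nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed layout: the post only says "two doubles - x & y".
struct complex {
    double x;  // real part (assumption)
    double y;  // imaginary part (assumption)
};

HD inline complex operator+(complex p, complex q) {
    return {p.x + q.x, p.y + q.y};
}

HD inline complex operator-(complex p, complex q) {
    return {p.x - q.x, p.y - q.y};
}

HD inline complex operator*(complex p, complex q) {
    // (px + i py)(qx + i qy) = (px qx - py qy) + i (px qy + py qx)
    return {p.x * q.x - p.y * q.y, p.x * q.y + p.y * q.x};
}

HD inline complex operator/(complex p, complex q) {
    double d = q.x * q.x + q.y * q.y;  // |q|^2
    assert(d != 0.0);                  // division by zero is the caller's bug
    // p / q = p * conj(q) / |q|^2
    return {(p.x * q.x + p.y * q.y) / d, (p.y * q.x - p.x * q.y) / d};
}
```

With overloads like these, an expression such as b[blockIdx.x] + a[i] * b[colCtr] compiles as written in the kernel.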

Why do you use blockIdx.x in so many places? This looks like you have a full block working on each matrix element, resulting in a race condition when all these threads write their data to the same place.

Thanks for the reply. The reason for so many blocks is that if I set up a block with multiple threads, I can’t seem to get the code to go through the matrix column by column. Instead, it executes all of the threads within the block at once, so that, say, the first 512 columns are processed simultaneously.

The matrix is fairly large - 6000x6000. The only way I’ve found to force the code to go column by column is to create nearly as many blocks as there are elements in a matrix column.

I did figure out the missing part of my code, and I will try to post it later tonight. My problem was the multiplication by b[colCtr]. I changed that to b[blockIdx.x] and started getting consistent output that also matches the CPU version of the algorithm. I will likely need to change that back, though, once I put real data into the matrix and vector.
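If the multiplication does go back to b[colCtr], every row again depends on a value owned by another block. CUDA gives no ordering guarantee between blocks of the same launch, so that read can see either the old or the new value. One common way to get a strict column order is to launch one kernel per column and keep two copies of the vector, reading one and writing the other. The sketch below assumes that each row should read the values from before the current column step (the thread does not say which is intended), uses double instead of the custom complex type for brevity, and assumes "a" is column-major. The names columnStep, runOnHost, columnStepKernel and runOnDevice are hypothetical. The GPU part is only compiled under nvcc and omits error checking.

```cpp
#include <cassert>
#include <utility>
#include <vector>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// One row's update for one column step. Reads only bIn and writes only bOut,
// so no row can observe another row's result from the same step.
// a is column-major: element (row, col) is a[col * n + row].
HD inline void columnStep(const double* a, const double* bIn, double* bOut,
                          int n, int col, int row) {
    double arc = a[col * n + row];
    if (row == col)      bOut[row] = arc * bIn[col];             // diagonal
    else if (row > col)  bOut[row] = bIn[row] + arc * bIn[col];  // below diagonal
    else                 bOut[row] = bIn[row];                   // above: unchanged here
}

// Host reference: columns strictly in order, rows in any order within a column.
inline std::vector<double> runOnHost(const std::vector<double>& a,
                                     std::vector<double> b) {
    int n = static_cast<int>(b.size());
    assert(a.size() == b.size() * b.size());
    std::vector<double> next(b.size());
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < n; ++row)
            columnStep(a.data(), b.data(), next.data(), n, col, row);
        std::swap(b, next);  // this step's output is the next step's input
    }
    return b;
}

#ifdef __CUDACC__
// GPU version of one column step: one thread per row, many threads per block.
__global__ void columnStepKernel(const double* a, const double* bIn,
                                 double* bOut, int n, int col) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) columnStep(a, bIn, bOut, n, col, row);
}

// Kernels issued to the same stream run in issue order, which is what
// enforces the column order. Returns the device buffer holding the result.
inline double* runOnDevice(const double* d_a, double* d_b, double* d_tmp, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int col = 0; col < n; ++col) {
        columnStepKernel<<<blocks, threads>>>(d_a, d_b, d_tmp, n, col);
        std::swap(d_b, d_tmp);
    }
    return d_b;
}
#endif
```

This trades one long-running kernel for N short launches, but each launch can use full-sized blocks across the rows, and the host version doubles as a CPU check for the GPU output.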
