need help urgently, inconsistent result? half the time the result is right and half the time wrong

xtiger1 · February 11, 2011, 10:08pm

Hi,

I have changed the SDK transpose kernel as follows to deal with non-square matrices:

extern "C"
__global__ void transposek(double* array, double* output, int size, int lda){
	__shared__ double tile[TILE_DIM][TILE_DIM+1];

	int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
	int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
	int index_in = xIndex + (yIndex)*lda;

	xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
	yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
	int index_out = xIndex + (yIndex)*size/lda;

	for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
		xIndex = index_in+i*size/lda;
		if(xIndex < size)
			tile[threadIdx.y+i][threadIdx.x] = array[xIndex];
	}

	__syncthreads();

	for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
		if(xIndex < size)
			output[index_out+i*lda] = tile[threadIdx.x][threadIdx.y+i];
	}

	__syncthreads(); //I tried with and without this line, same problem
}

And the test code is very simple, as follows:

#include <cstdlib>
#include <iostream>

#define TILE_DIM	32
#define BLOCK_ROWS	32
#define THREAD_BLOCK 512
#define REQ(size,tile) (size-1)/tile+1

void transpose(double* array, double* output, int size, int lda){
	dim3 grid(REQ(size/lda, TILE_DIM), REQ(lda, TILE_DIM)), threads(TILE_DIM,BLOCK_ROWS);
	transposek<<<grid, threads>>>(array, output, size, lda);
}

using namespace std;

void printArray(double* array, int size){
	for(int i = 0; i < size; i++) cout << array[i] << ", ";
	cout << endl;
}

void randomInit(double* array, int size){
	for(int i = 0; i< size; i++) array[i]=i+1;
}

void fixInit(double* array, int size, double value){
	for(int i = 0; i< size; i++) array[i]=value;
}

int main(int argc, char** argv){
	unsigned int uiWA = 4;
	unsigned int uiHA = 2;
	unsigned int size_A = uiWA * uiHA;
	unsigned int mem_size_A = sizeof(double) * size_A;

	double* h_A = (double*)malloc(mem_size_A);
	randomInit(h_A, size_A);

	double* d_A;
	cudaMalloc((void**) &d_A, mem_size_A);
	double* d_AT;
	cudaMalloc((void**) &d_AT, mem_size_A);

	// copy host memory to device
	cudaMemcpy(d_A, h_A, mem_size_A,cudaMemcpyHostToDevice);
	printArray(h_A, size_A);

	transpose(d_A, d_AT, size_A, uiHA);

	cudaMemcpy(h_A, d_AT, mem_size_A, cudaMemcpyDeviceToHost);
	printArray(h_A, size_A);

	cout << REQ(size_A/uiHA, TILE_DIM) << "," << REQ(uiHA, TILE_DIM) << endl;
	cout << TILE_DIM << "," << BLOCK_ROWS << endl;
}

So half the time the result is right: 1 3 5 7 2 4 6 8

But the other half of the time the result is 1 3 5 7 0 0 0 0 or 1 3 5 7 x x x x, where x is some junk value (like an uninitialized variable).

What is going on?

I have a GTX 460 and I compiled the code for both sm_20 and sm_21.

xtiger1 · February 11, 2011, 10:41pm

PLEASE? anyone?

xtiger1 · February 12, 2011, 1:06am

OKAY, I figured out there is a race condition.

KChou · February 23, 2011, 11:38pm

Did you ever solve this problem? If you did, can you post your working solution? Thanks

Gilles_C · February 24, 2011, 3:07pm

Hi,

I tried to figure out what you tried to do here, and it looks far from clear to me.

What does look clear is that output[index_out+i*lda] can write beyond the end of output.

The only check you make before writing to output is xIndex < size.

If I’m not mistaken, xIndex at that point still holds the last value from the load loop, index_in+i*size/lda. Since TILE_DIM and BLOCK_ROWS are both 32, i is only ever 0, so that is index_in, i.e. threadIdx.x + 2 * threadIdx.y.

size is 8, so the check passes only for threadIdx.y between 0 and 3, with threadIdx.x at most 7.

index_out is threadIdx.x + 4 * threadIdx.y, so for those threads it can be anything between 0 and 1+4*3=13.

Knowing that output is of size 8, you write out of bounds.

Now, I might have missed something…
