Debugging inside my kernel, I'm sure it's something small.

senorbum · June 18, 2008, 7:00pm

This is a bit messy, but I wanted to make sure it worked before I started using +=. Currently, tempSum = 0 after the loop is completed. However, I checked what cpuA[x1 + i*nColumns] yields, and it correctly yields a value (between 0 and 1 roughly).

const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
int x2, x1;
x = ix;
y = iy;
x1 = ix%nColumns;
x2 = iy%nColumns;
float temp, temp2;
float tempSum = 0;
float tempSum2 = 0;

nColumns is an int being passed in. Could this be a data type error ( i.e. float vs. int)? I’m quite confused.

for( i = 0; i < nColumns; i++){
    temp = tempSum + cpuA[x1 + i*nColumns];
    temp2 = tempSum2 + cpuA[x2 + i*nColumns];
    tempSum = temp;
    tempSum2 = temp2;
}

BarsMonster · June 18, 2008, 7:22pm

Could you provide more details about your problem? I’ve read your post 3 times and still have no idea what you are asking :)

senorbum · June 19, 2008, 12:30pm

Basically, temp sum should be adding up the total of some elements in an array. It currently is just being set to 0 after the loop is finished, which isn’t correct. I have no idea why it would be doing this, as I’ve checked as to whether or not cpuA is giving the correct elements and it is. So basically, I’m wondering if it could be set to 0 because of a data type error, or some weird CUDA syntax that I’m missing. I have not had this problem before with similar operations.

In Summary:

tempSum should not be 0 after the loop is finished but it is.

E.D_Riedijk · June 19, 2008, 1:25pm

please post your full kernel code & the calling code. It is impossible to see what is going on without this. Apart from that, to sum all the elements in an array, you usually do a reduction (example in the SDK)

+= is working just okay, I would not be surprised if the compiler converted your code to +=
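
For readers unfamiliar with the reduction mentioned above, here is a minimal sketch of a single-block sum. It is not the SDK example itself; the names sumReduce, sumOnDevice and BLOCK_SIZE are made up for illustration, and the kernel assumes it is launched with exactly BLOCK_SIZE threads (a power of two).

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed: power of two, must match the launch configuration

// Sums n floats from `in` into *out using a single thread block.
__global__ void sumReduce(const float *in, float *out, int n)
{
    __shared__ float partial[BLOCK_SIZE];

    // Each thread accumulates a strided slice of the input.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *out = partial[0];
}

// Hypothetical host helper: copies the data over, runs the kernel, returns the sum.
// Error checking is omitted to keep the sketch short.
float sumOnDevice(const float *host, int n)
{
    float *dIn = NULL, *dOut = NULL, result = 0.0f;
    cudaMalloc((void **)&dIn, n * sizeof(float));
    cudaMalloc((void **)&dOut, sizeof(float));
    cudaMemcpy(dIn, host, n * sizeof(float), cudaMemcpyHostToDevice);
    sumReduce<<<1, BLOCK_SIZE>>>(dIn, dOut, n);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

A single block keeps the sketch short; for large arrays the usual approach is one partial sum per block followed by a second pass over the partial sums.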

senorbum · June 19, 2008, 1:32pm

Hopefully this post will clarify things.

Here is some separate code that is being called right before this other code. It is behaving very similarly.
__global__ void normalizeMatrix(float* gpuA, float* meanVector, int nRows, int nColumns){
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    float temp = 0;
    float temp2 = 0;
    float final, final2;
    temp = gpuA[ix+iy*nColumns];
    final = temp - meanVector[threadIdx.x];
    gpuA[ix+iy*nColumns] = final;
}

This is an example of what is happening with specific numbers.

gpuA::
.538969 .608563 .489000
.249687 .169281 .855000
.755844 .152844 .494687

meanVector::
.514833 .310229 .612896

gpuA after function is called

.024135 .298333 -.123896
.249687 .169281 .855000
.755844 .152844 .494687

So what should be happening here is that gpuA[x, y] = gpuA[x,y] - meanVector. It is doing this correctly for the first row, but that is it. I have bolded this row to show that. However, for the other rows, it is just spitting out the original result. This leads me to believe that there is something wrong with how my threadID’s are done. I just can’t figure out what it is.

Edit: here is how the kernel is being called:
int nRows = 3;
int nColumns = 3;
cudaMalloc((void **)&meanVectorGPU, nColumns*sizeof(float));
cudaMalloc((void **)&gpuA, nRows*nColumns*sizeof(float));

…other non involved code here…

normalizeMatrix<<<nColumns, nColumns>>>(gpuA, meanVectorGPU, nRows, nColumns);

Edit: Optimization is occurring at a later date. I’m not worrying about that right now. Also, the matrix sizes will eventually be 100x100 up to 2000x2000 or so. Hence the use of cuda here.

senorbum · June 19, 2008, 1:48pm

As a side question, if you are calculating large matrices (over 768x768), how do you approach the matrix, as I believe the limit of threads per block is 768, correct? Say I have an 1152x1152 matrix. If I were to call a function that needs to perform a calculation for every element in the matrix, how would I access each element with my threadIdx’s no longer lining up with the matrix indices? Or is it possible to assign more threads to a block, but only the first 768 will run before the others can start?

Sorry for all the questions. Thank you a lot for your time.

Joe
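
One common way to cover a matrix larger than a single block is to launch a 2D grid of small blocks and have each thread check that its (row, col) is inside the matrix. The sketch below applies that to the normalization step. It assumes a row-major matrix and one mean per column (which is what the sample numbers above suggest); normalizeMatrix2D, blocksFor, launchNormalize and the 16x16 block shape are illustrative choices, not code from this thread. The actual per-block thread limit depends on the device and can be read from the maxThreadsPerBlock field returned by cudaGetDeviceProperties.

```cuda
#include <cuda_runtime.h>

// Subtracts the per-column mean from every element of a row-major matrix.
__global__ void normalizeMatrix2D(float *a, const float *meanVector,
                                  int nRows, int nColumns)
{
    const int col = blockDim.x * blockIdx.x + threadIdx.x;
    const int row = blockDim.y * blockIdx.y + threadIdx.y;

    // Threads in the last blocks may fall outside the matrix; they do nothing.
    if (row < nRows && col < nColumns)
        a[col + row * nColumns] -= meanVector[col];
}

// Hypothetical helper: number of blocks needed to cover n items (rounds up).
static unsigned int blocksFor(int n, int blockSize)
{
    return (unsigned int)((n + blockSize - 1) / blockSize);
}

// Hypothetical launcher: picks a grid that covers the whole matrix.
void launchNormalize(float *gpuA, const float *meanVectorGPU,
                     int nRows, int nColumns)
{
    dim3 threads(16, 16);  // 256 threads per block, an assumed block shape
    dim3 grid(blocksFor(nColumns, threads.x), blocksFor(nRows, threads.y));
    normalizeMatrix2D<<<grid, threads>>>(gpuA, meanVectorGPU, nRows, nColumns);
}
```

With this layout an 1152x1152 matrix is covered by a 72x72 grid of 16x16 blocks, and the block size never has to grow with the matrix.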

senorbum · June 19, 2008, 2:37pm

The problem is an access problem with gpuA. Even if I explicitly write gpuA[4] = 3; gpuA[4] does not equal 3 afterwards. I really am not sure why it does this. Right now I am copying gpuA to device memory and then using it in a separate kernel before this one. I made sure to call cudaThreadSynchronize(); before this, so there shouldn’t be a problem in that respect. I am not quite sure why I am only able to access the first row.

E.D_Riedijk · June 19, 2008, 2:42pm

iy is undefined. And also, please post your calling code. Do you check for errors after you called the kernel? It can be that your kernel actually did not run at all, and when you don’t check for errors you will never know, but just read back values that were not updated.
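
A minimal sketch of the error check suggested above, written against the plain runtime API rather than the SDK's cutil macros. The helper names checkCuda, fill and fillChecked are hypothetical. It uses cudaDeviceSynchronize(), which replaces the older cudaThreadSynchronize() in current toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call and returns true on success.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Trivial kernel used only to have something to launch.
__global__ void fill(float *a, int n, float value)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        a[i] = value;
}

// Launches the kernel and checks both kinds of failure.
bool fillChecked(float *gpuA, int n, float value)
{
    fill<<<(n + 255) / 256, 256>>>(gpuA, n, value);

    // Launch problems (for example an invalid configuration) are reported here.
    if (!checkCuda(cudaGetLastError(), "kernel launch"))
        return false;

    // Problems during execution only surface once the kernel has finished.
    return checkCuda(cudaDeviceSynchronize(), "kernel execution");
}
```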

senorbum · June 19, 2008, 2:49pm

Regarding iy being undefined, check my posts after and my edit if you did not. Even without using iy, I have this problem. I’ll check for errors in a few minutes and see if that’s the problem.

senorbum · June 19, 2008, 3:10pm

float *cpuA;
float *cpuB;
float *cpuC;
float *gpuA;
float *gpuB;
float *gpuC;
float *meanVectorGPU;
float *meanVectorCPU;
int i, j;
int nRows = 3;
int nColumns = 3;
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
unsigned int timer = 0;
unsigned int elapsed = 0;
//cublasStatus status;

CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
CUT_DEVICE_INIT();

cpuA = (float*) malloc (nRows*nColumns*sizeof(float));
cpuB = (float*) malloc (nColumns*nColumns*sizeof(float));
cpuC = (float*) malloc (sizeof(float));
cudaMalloc((void **)&gpuA, nRows*nColumns*sizeof(float));
cudaMalloc((void **)&gpuB, nColumns*nColumns*sizeof(float));
cudaMalloc((void **)&gpuC, nColumns*nColumns*sizeof(float));
cudaMalloc((void **)&meanVectorGPU, nColumns*sizeof(float));
initMatrix(cpuA,cpuB,cpuC,nRows,nColumns);
	for(i = 0; i < nRows; i++){
	for(j = 0; j < nColumns; j++){
		printf("%f  ", cpuA[j+i*nColumns]);
		}
	printf("\n");
	}
printf("\n");
cudaFree(gpuC);
free(cpuC);
cudaMemcpy(gpuA, cpuA, nRows*nColumns*sizeof(float), cudaMemcpyHostToDevice);
meanVectorCPU = (float*) malloc (nColumns*sizeof(float));
memset(meanVectorCPU, 0, nColumns*sizeof(float));


extractMeanVector<<<nColumns, nColumns>>>(meanVectorGPU, gpuA, nRows, nColumns);
CUT_CHECK_ERROR("Kernel execution failed");
cudaThreadSynchronize();
cudaMemcpy(meanVectorCPU, meanVectorGPU, nColumns*sizeof(float), cudaMemcpyDeviceToHost);	
	printf("MeanVector \n \n");
	for(i = 0; i < nColumns; i++){
	printf(" %f ", meanVectorCPU[i]);
	}
printf("\n \nNormalized Matrix \n \n");
normalizeMatrix<<<1, 9>>>(gpuA, meanVectorGPU, nRows, nColumns);
cudaThreadSynchronize();
CUT_CHECK_ERROR("Kernel execution failed");
cudaMemcpy(cpuA, gpuA, nColumns*sizeof(float), cudaMemcpyDeviceToHost);
for(i = 0; i < nRows; i++){
	for(j = 0; j < nColumns; j++){
		printf("%f  ", cpuA[j+i*nColumns]);
		}
	printf("\n");
	}

That is the code up to the problem point calling my kernel.

__global__ void normalizeMatrix(float* gpuA, float* meanVector, int nRows, int nColumns){
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
float temp = 0;
float temp2 = 0;
float final, final2;

gpuA[ix] = 2;

}

that is the modified normalizeMatrix, which still causes only the first row to be set to 0 and all other rows stay the same as they were before. There are no errors that appear up to this point, using CUT_CHECK_ERROR after my kernel executions. Currently I am using 2 kernels, but in the future I’ll hopefully be switching to 1 to minimize communication.

E.D_Riedijk · June 19, 2008, 3:33pm

You are not copying enough bytes back to the CPU:
cudaMemcpy(cpuA, gpuA, nColumns*sizeof(float), cudaMemcpyDeviceToHost);
needs to be
cudaMemcpy(cpuA, gpuA, nRows*nColumns*sizeof(float), cudaMemcpyDeviceToHost);
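
One way to make this kind of mismatch harder to write is to compute the byte count in a single place and reuse it for the allocation and for both copies. This is only a sketch; matrixBytes and copyMatrixToHost are hypothetical helpers, not part of the original program.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: size in bytes of an nRows x nColumns float matrix.
static size_t matrixBytes(int nRows, int nColumns)
{
    return (size_t)nRows * (size_t)nColumns * sizeof(float);
}

// Copies the whole matrix back to the host, not just its first row.
cudaError_t copyMatrixToHost(float *cpuA, const float *gpuA,
                             int nRows, int nColumns)
{
    return cudaMemcpy(cpuA, gpuA, matrixBytes(nRows, nColumns),
                      cudaMemcpyDeviceToHost);
}
```

For the 3x3 case in this thread the full copy moves 9 floats, while the original call copied only the first nColumns = 3 of them, which is why only the first row appeared to change.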

senorbum · June 19, 2008, 3:51pm

You sir, are awesome.

I knew it was something dumb like that. I just couldn’t find it. I wish I had somebody to pair program with, as this probably wouldn’t have been a problem.

Thanks a bunch for your time.

Joe

E.D_Riedijk · June 19, 2008, 6:03pm

No problem. I just found a somewhat similar bug in my code, that is why it stared right at me. For me the bug was like in the following simple code (for people to laugh & learn, I have been looking on & off for this bug for the last 3 days…):

__constant__ unsigned int num_elements;

__global__ void kernelcall(float *arr)
{
    unsigned int index = threadIdx.x;
    while (index < num_elements) {
        arr[index] = 1;
        index += blockDim.x;
    }
}

void mexFunction(....)
{
    <stuff that gets data from matlab, one of the variables is num_element>
    cudaMemcpyToSymbol(num_elements, &num_element, 1*sizeof(unsigned int));
    float *some_array_to_fill;
    cudaMalloc( (void *) some_array_to_fill, num_elements * sizeof(float));
    kernelcall<<<1,32>>>(some_array_to_fill);
}

I was getting unspecified launch failures on the kernel call. People that see the bug without looking further are heroes to me.

The bug turned out to be here: cudaMalloc( (void *) some_array_to_fill, num_elements * sizeof(float));

What was happening is that I was passing the address of num_elements to cudaMalloc instead of the value, as the value was called num_element…

So the lesson is to stick to a naming scheme (there is a post about it, called CUDA programming style I believe). If I had called my constant c_num_elements, this mistake would not have happened that easily.
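
As a sketch of what such a naming scheme could look like (the c_, h_ and d_ prefixes and the function names here are one possible convention, assumed for illustration and not taken from the post mentioned above):

```cuda
#include <cuda_runtime.h>

// c_ prefix: lives in device constant memory.
__constant__ unsigned int c_numElements;

// d_ prefix: pointer to device memory.
__global__ void fillOnes(float *d_arr)
{
    for (unsigned int i = threadIdx.x; i < c_numElements; i += blockDim.x)
        d_arr[i] = 1.0f;
}

// h_ prefix: ordinary host variable. Returns a device pointer, or NULL on failure.
float *allocAndFill(unsigned int h_numElements)
{
    float *d_arr = NULL;

    // Copy the host value into the constant symbol used by the kernel.
    cudaMemcpyToSymbol(c_numElements, &h_numElements, sizeof(unsigned int));

    // The size comes from the host value; the prefixes make a mix-up visible.
    if (cudaMalloc((void **)&d_arr, h_numElements * sizeof(float)) != cudaSuccess)
        return NULL;

    fillOnes<<<1, 32>>>(d_arr);
    return d_arr;
}
```

With the prefixes in place, a line such as cudaMalloc(..., c_numElements * sizeof(float)) reads as wrong at a glance, because a c_ symbol is being used in host-side size arithmetic.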
