Debugging inside my kernel: I'm sure it's something small.

This is a bit messy, but I wanted to make sure it worked before I started using +=. Currently, tempSum = 0 after the loop is completed. However, I checked what cpuA[x1 + i*nColumns] yields, and it correctly yields a value (between 0 and 1 roughly).

const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
int x2, x1;
x = ix;
y = iy;
x1 = ix%nColumns;
x2 = iy%nColumns;
float temp, temp2;
float tempSum = 0;
float tempSum2 = 0;

nColumns is an int being passed in. Could this be a data type error (i.e., float vs. int)? I’m quite confused.

for( i = 0; i < nColumns; i++){
temp = tempSum + cpuA[x1 + i*nColumns];
temp2 = tempSum2 + cpuA[x2 + i*nColumns];
tempSum = temp;
tempSum2 = temp2;
}

Could you provide more details about your problem? I’ve read your post three times and still have no idea what you are asking :)

Basically, tempSum should be adding up some of the elements in an array. Instead, it is 0 after the loop is finished, which isn’t correct. I have no idea why, since I’ve checked that cpuA is giving the correct elements. So I’m wondering whether it could end up as 0 because of a data type error, or some CUDA syntax that I’m missing. I have not had this problem before with similar operations.

In Summary:

tempSum should not be 0 after the loop is finished but it is.

Please post your full kernel code and the calling code. It is impossible to see what is going on without them. Apart from that, to sum all the elements in an array, you usually do a reduction (there is an example in the SDK).
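To make the idea of a reduction concrete, here is a minimal sketch of the pairwise (tree) summation pattern, written as plain host-side C++ so it runs without a GPU. It is not the SDK example, and treeSum is a hypothetical name. The inner loop stands in for the threads of one block that would do these additions in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU sketch of a tree reduction (not the SDK code).
// Each pass folds the upper half of the active range onto the lower half,
// which is the step that parallel threads would perform simultaneously.
float treeSum(const float* data, int n) {
    assert(n >= 0);
    if (n == 0) return 0.0f;
    std::vector<float> scratch(data, data + n);  // work on a copy
    std::size_t active = scratch.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i + half < active; ++i) {
            scratch[i] += scratch[i + half];
        }
        active = half;
    }
    return scratch[0];
}
```

The number of passes grows with log2(n) rather than n, which is why this shape suits a GPU better than a single running sum.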

+= works just fine. I would not be surprised if the compiler converted your code to += anyway.

Hopefully this post will clarify things.

Here is some separate code that is being called right before this other code. It is behaving very similarly.
__global__ void normalizeMatrix(float* gpuA, float* meanVector, int nRows, int nColumns){
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
float temp = 0;
float temp2 = 0;
float final, final2;
temp = gpuA[ix+iy*nColumns];
final = temp - meanVector[threadIdx.x];
gpuA[ix+iy*nColumns] = final;
}

This is an example of what is happening with specific numbers.

gpuA before function is called

.538969 .608563 .489000
.249687 .169281 .855000
.755844 .152844 .494687

meanVector

.514833 .310229 .612896

gpuA after function is called

.024135 .298333 -.123896
.249687 .169281 .855000
.755844 .152844 .494687

So what should be happening here is that gpuA[x, y] = gpuA[x, y] - meanVector. It is doing this correctly for the first row (the first row of the output above), but that is it. For the other rows, it just returns the original values. This leads me to believe that there is something wrong with how my thread IDs are set up. I just can’t figure out what it is.
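A CPU reference can help when checking a kernel like this against known-good values. The following is a minimal host-side C++ sketch, not the poster's code. The name normalizeMatrixCPU is hypothetical, and it assumes row-major storage (element [row, col] at index col + row*nColumns), as in the snippets above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not from the thread): subtract the per-column
// mean from every row of a row-major nRows x nColumns matrix, in place.
void normalizeMatrixCPU(std::vector<float>& a,
                        const std::vector<float>& meanVector,
                        int nRows, int nColumns) {
    assert(nRows >= 0 && nColumns >= 0);
    assert(a.size() == static_cast<std::size_t>(nRows) * static_cast<std::size_t>(nColumns));
    assert(meanVector.size() == static_cast<std::size_t>(nColumns));
    for (int row = 0; row < nRows; ++row) {
        for (int col = 0; col < nColumns; ++col) {
            a[col + row * nColumns] -= meanVector[col];
        }
    }
}
```

Applied to the 3x3 example, every row should change, and the first row should agree with the first row of the posted output to about five decimal places.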

Edit: here is how the kernel is being called:
int nRows = 3;
int nColumns = 3;
cudaMalloc((void **)&meanVectorGPU, nColumns*sizeof(float));
cudaMalloc((void **)&gpuA, nRows*nColumns*sizeof(float));

…other non involved code here…

normalizeMatrix<<<nColumns, nColumns>>>(gpuA, meanVectorGPU, nRows, nColumns);

Edit: Optimization is occurring at a later date. I’m not worrying about that right now. Also, the matrix sizes will eventually be 100x100 up to 2000x2000 or so. Hence the use of cuda here.

As a side question: if you are working with large matrices (over 768x768), how do you approach the matrix? I believe the limit of threads per block is 768, correct? Say I have a 1152x1152 matrix. If I call a kernel that needs to perform a calculation for every element in the matrix, how would I address each element when the thread indices no longer map one-to-one onto the matrix? Or is it possible to assign more threads to a block, with only the first 768 running before the others can start?
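One common way to cover a matrix larger than a single block allows is to launch many blocks, derive a global index from the block and thread indices, and skip indices past the end. The sketch below emulates that indexing on the CPU in plain C++ so the arithmetic can be checked without a GPU. blocksNeeded and emulateLaunch are hypothetical names, and the block size is a parameter chosen by the caller, not a device limit taken from this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: smallest block count such that
// blocks * threadsPerBlock >= nElements (ceiling division).
int blocksNeeded(int nElements, int threadsPerBlock) {
    assert(nElements >= 0 && threadsPerBlock > 0);
    return (nElements + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of a 1D launch. Each (block, thread) pair computes the index
// a kernel would compute as blockDim.x * blockIdx.x + threadIdx.x and skips
// indices beyond the matrix. Returns how many times each element was visited.
std::vector<int> emulateLaunch(int nRows, int nColumns, int threadsPerBlock) {
    const int nElements = nRows * nColumns;
    std::vector<int> visits(nElements, 0);
    const int nBlocks = blocksNeeded(nElements, threadsPerBlock);
    for (int blockIdxX = 0; blockIdxX < nBlocks; ++blockIdxX) {
        for (int threadIdxX = 0; threadIdxX < threadsPerBlock; ++threadIdxX) {
            const int idx = threadsPerBlock * blockIdxX + threadIdxX;
            if (idx >= nElements) continue;  // bounds guard for the last block
            visits[idx] += 1;
        }
    }
    return visits;
}
```

With a row-major layout, the row and column can be recovered from the flat index as idx / nColumns and idx % nColumns.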

Sorry for all the questions. Thank you a lot for your time.


The problem is an access problem with gpuA. Even if I explicitly write gpuA[4] = 3; in the kernel, gpuA[4] will not equal 3. I am really not sure why. Right now I am copying gpuA to device memory and then using it in a separate kernel before this one. I made sure to call cudaThreadSynchronize(); before this, so there shouldn’t be a problem in that respect. I am not sure why I am only able to access the first row.

iy is undefined. And also, please post your calling code. Do you check for errors after you called the kernel? It can be that your kernel actually did not run at all, and when you don’t check for errors you will never know, but just read back values that were not updated.

Regarding iy being undefined, check my posts after and my edit if you did not. Even without using iy, I have this problem. I’ll check for errors in a few minutes and see if that’s the problem.

float *cpuA;
float *cpuB;
float *cpuC;
float *gpuA;
float *gpuB;
float *gpuC;
float *meanVectorGPU;
float *meanVectorCPU;
int i, j;
int nRows = 3;
int nColumns = 3;
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
unsigned int timer = 0;
unsigned int elapsed = 0;
//cublasStatus status;


cpuA = (float*) malloc (nRows*nColumns*sizeof(float));
cpuB = (float*) malloc (nColumns*nColumns*sizeof(float));
cpuC = (float*) malloc (sizeof(float));
cudaMalloc((void **)&gpuA, nRows*nColumns*sizeof(float));
cudaMalloc((void **)&gpuB, nColumns*nColumns*sizeof(float));
cudaMalloc((void **)&gpuC, nColumns*nColumns*sizeof(float));
cudaMalloc((void **)&meanVectorGPU, nColumns*sizeof(float));
	for(i = 0; i < nRows; i++){
		for(j = 0; j < nColumns; j++){
			printf("%f  ", cpuA[j+i*nColumns]);
		}
	}
cudaMemcpy(gpuA, cpuA, nRows*nColumns*sizeof(float), cudaMemcpyHostToDevice);
meanVectorCPU = (float*) malloc (nColumns*sizeof(float));
memset(meanVectorCPU, 0, nColumns*sizeof(float));

extractMeanVector<<<nColumns, nColumns>>>(meanVectorGPU, gpuA, nRows, nColumns);
CUT_CHECK_ERROR("Kernel execution failed");
cudaMemcpy(meanVectorCPU, meanVectorGPU, nColumns*sizeof(float), cudaMemcpyDeviceToHost);	
	printf("MeanVector \n \n");
	for(i = 0; i < nColumns; i++){
		printf(" %f ", meanVectorCPU[i]);
	}
printf("\n \nNormalized Matrix \n \n");
normalizeMatrix<<<1, 9>>>(gpuA, meanVectorGPU, nRows, nColumns);
CUT_CHECK_ERROR("Kernel execution failed");
cudaMemcpy(cpuA, gpuA, nColumns*sizeof(float), cudaMemcpyDeviceToHost);
for(i = 0; i < nRows; i++){
	for(j = 0; j < nColumns; j++){
		printf("%f  ", cpuA[j+i*nColumns]);
	}
}

That is the code up to the problem point calling my kernel.

__global__ void normalizeMatrix(float* gpuA, float* meanVector, int nRows, int nColumns){
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
float temp = 0;
float temp2 = 0;
float final, final2;

gpuA[ix] = 2;
}


That is the modified normalizeMatrix, which still causes only the first row to be overwritten while all other rows stay the same as they were before. No errors are reported up to this point by CUT_CHECK_ERROR after my kernel launches. Currently I am using 2 kernels, but in the future I’ll hopefully be switching to 1 to minimize communication.

You are not copying enough bytes back to the CPU:
cudaMemcpy(cpuA, gpuA, nColumns*sizeof(float), cudaMemcpyDeviceToHost);
needs to be
cudaMemcpy(cpuA, gpuA, nRows*nColumns*sizeof(float), cudaMemcpyDeviceToHost);
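The effect of the short copy can be reproduced on the CPU. This plain C++ sketch uses std::memcpy as a stand-in for cudaMemcpy, an assumption made so that it runs without a GPU. matrixBytes and copyBack are hypothetical helper names, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: size in bytes of a full nRows x nColumns float matrix.
std::size_t matrixBytes(int nRows, int nColumns) {
    assert(nRows >= 0 && nColumns >= 0);
    return static_cast<std::size_t>(nRows) *
           static_cast<std::size_t>(nColumns) * sizeof(float);
}

// Stand-in for the device-to-host copy: copies `bytes` bytes from src to dst.
// std::memcpy takes the place of cudaMemcpy here.
void copyBack(std::vector<float>& dst, const std::vector<float>& src,
              std::size_t bytes) {
    assert(bytes <= dst.size() * sizeof(float));
    assert(bytes <= src.size() * sizeof(float));
    std::memcpy(dst.data(), src.data(), bytes);
}
```

Copying only nColumns*sizeof(float) bytes refreshes just the first row of the host buffer, so the remaining rows still show their old values even though the source buffer was fully updated. Computing the size once through a helper such as matrixBytes and reusing it for the allocation and both copies makes this kind of mismatch harder to introduce.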

You sir, are awesome.

I knew it was something dumb like that. I just couldn’t find it. I wish I had somebody to pair program with, as this probably wouldn’t have been a problem.

Thanks a bunch for your time.


No problem. I just found a somewhat similar bug in my code, that is why it stared right at me. For me the bug was like in the following simple code (for people to laugh & learn, I have been looking on & off for this bug for the last 3 days…):

__global__ void kernelcall(float *arr)
{
	index = threadIdx.x;

	while (index < num_elements)
		arr[index] = 1;
}

void mexFunction(....)
{
	<stuff that gets data from matlab, one of the variables is num_element>

	__constant__ unsigned int num_elements;

	cudaMemcpyToSymbol(num_elements, &num_element, 1*sizeof(unsigned int));

	float *some_array_to_fill;

	cudaMalloc( (void *) some_array_to_fill, num_elements * sizeof(float));
}



I was getting unspecified launch failures on the kernel call. People that see the bug without looking further are heroes to me.

The bug turned out to be here: cudaMalloc( (void *) some_array_to_fill, num_elements * sizeof(float));

What was happening is that I was passing the address of num_elements to cudaMalloc instead of the value, which was called num_element…

So the lesson is to stick to a naming scheme (there is a post about it, called CUDA programming style I believe). Had I called my constant c_num_elements, this mistake would not have happened so easily.