kernel launched but give wrong results strange problem small instance size works well, bigger someti

Hello everyone,

I started CUDA for few weeks and I’m trying to develop my own application for doing mathematics calculation. I have already a CPU version which works well and I try to implement a GPU version.

I have a specific kernel which do a calculation, and I need to call it n * (n - 1) /2 times where n is the size of my problem.

My GPU program works well when n <= 30, but when I increased the size of my problem (for example n = 35 or n = 50), it sometimes give me wrong results. I tried to debug and I noticed that sometimes arrays from global memory have inconsistent values. I also tried to catch errors with cudaThreadSynchronize() + cudaGetLastError() but nothing was detected. In fact, the kernel executed but give wrong results.

Here is the number of registers used by thread:

ptxas info : Compiling entry function ‘_Z12CalculationPKiS0_S0_iiPi’

ptxas info : Used 16 registers, 44+40 bytes smem, 4 bytes cmem[1]

I thought I used too much registers, but it is strange that it works well for n=30 and not for n=35.

Here is the code of my program:

// Allocate a device matrix of same size as A.

int* AllocateDeviceMatrix (const int* A, int n)


  int* Adevice;

  int size = n * n * sizeof(int);

  cudaMalloc((void**)&Adevice, size);

  return Adevice;


// Allocate a device vector of same size as V.

int* AllocateDeviceVector (const int* V, int n)


  int* Vdevice;

  int size = n * sizeof(int);

  cudaMalloc((void**)&Vdevice, size);

  return Vdevice;


// Copy a host vector to a device vector.

void CopyToDeviceMatrix (int* Adevice, const int* Ahost, int n)


  int size = n * n* sizeof(int);

  cudaMemcpy(Adevice, Ahost, size, 



// Copy a device matrix to a host matrix.

void CopyFromDeviceMatrix (int* Ahost, const int* Adevice, int n)


  int size = n * n * sizeof(int);

  cudaMemcpy(Ahost, Adevice, size, 



// Copy a host vector to a device vector.

void CopyToDeviceVector (int* Vdevice, const int* Vhost, int n)


  int size = n * sizeof(int);

  cudaMemcpy(Vdevice, Vhost, size, 




void ProcessingGPU(const int* A, const int* B, int n){

  int* S = (int*)malloc(sizeof(int)*n);


int best_value = pre_calculation(A, B, S, n); // CPU part

int* Ad = AllocateDeviceMatrix(A,n);

  CopyToDeviceMatrix(Ad, A, n);

int* Bd = AllocateDeviceMatrix(B,n);

  CopyToDeviceMatrix(Bd, B, n);

int* Sd = AllocateDeviceVector(S,n);

  CopyToDeviceVector(Sd, S, n);

int* new_values = (int*)malloc(sizeof(int)*n*(n-1)/2);

int* new_values_d = AllocateDeviceVector(new_values,n*(n-1)/2);

Calculation<<<ceil((n*(n-1)*1.0/2)/BLOCK_SIZE),BLOCK_SIZE>>>(Ad, Bd, Sd, n, best_value, new_values_d);


  CopyFromDeviceVector(new_values, new_values_d, n*(n-1)/2);

for (int i = 0; i < n*(n-1)/2; i++){

	if (new_values[i] < best_value)

	  best_value = new_values[i];	 


printf("best_value = %d\n",  best_value);









#define ID(i,j,ld) (((i)*(ld))+(j))

#define BLOCK_SIZE 256

__device__ int internal_difference(int n, const int* a, const int* b,

				 const int* p, int i, int j)


  int sum;  int k;

sum = (a[ID(i,i,n)]-a[ID(j,j,n)])*(b[ID(p[j],p[j],n)]-b[ID(p[i],p[i],n)]) +


for (k = 0; k < n; k++) 

	if (k!=i && k!=j)

	  sum = sum + (a[ID(k,i,n)]-a[ID(k,j,n)])*(b[ID(p[k],p[j],n)]-b[ID(p[k],p[i],n)]) + (a[ID(i,k,n)]-a[ID(j,k,n)])*(b[ID(p[j],p[k],n)]-b[ID(p[i],p[k],n)]);

  return sum;


__global__ void Calculation(const int* A, const int* B, const int* S, int n, const int value, int* new_values)


volatile int temp = blockIdx.x * blockDim.x + threadIdx.x;

  __shared__ int N;

N = n;

if (temp >= N*(N-1)/2)



	int first;

	int second;


	first = (N-1) - floor( (sqrtf( 8 * ((N*(N-1)/2) - temp - 1) + 1) -1 ) / 2 ) - 1;

	second = temp - first * (N-1)+ first * (first + 1)/2 + 1;

	new_values[temp] = value +  internal_difference(N,A,B,S,first,second);



Thanks for helping me. It is very annoying problem for me because speed-up improvements in GPU seems very significant.


I’m just a newbie, but I think your problem may be in the shared variable N.

Just remove it and use n instead.


Thanks for your reply, but in fact I’ve tried this morning to decrease the number of registers by the shared variable and the volatile variable. Originally my kernel uses 18 registers (without any shared variable or volatile variable), and by doing theses changes I could reduce to 16 registers.

But with or without theses changes, I have the same problem: it works for n=30 and sometimes doesn’t work for n=35.

Any other suggestion ?

I see a potencial problem here (but I might be wrong, since I’m a newbie):

I think all your threads will share a single variable N (common to all). Thus you have a bottleneck when you do N = n; This will slow your program a lot.

Thanks again, but I told you in my original program i did not used any shared variable. Using or not doesn’t affect performance.

The problem I encounter here is a computation limitation: I launch my kernel n * (n-1) / 2 times. For n <= 30 it works very well, for n >= 35 it sometimes give me wrong results.

Other suggestion ?

The shared variable is not necessary, but it won’t hurt anything.

Also, declaring temp to be volatile is not necessary, but it won’t hurt anything either.

My guess is the problem lies in

first = (N-1) - floor( (sqrtf( 8 * ((N*(N-1)/2) - temp - 1) + 1) -1 ) / 2 ) - 1;

because you are depending on the exactness of sqrtf. For example when n=35 and temp=33, you’re calculating sqrt(8*(595-33-1)+1) = sqrt(4489) = 67.00000, but if sqrtf were to return 66.99998, your function will calculate the wrong value.

That’s my guess. Everything else looks ok.

I found yesterday afternoon exactly the same problem with sqrtf !! And I have just read your post today ! I should have read your post first, it would have saved me time !

Yes indeed sqrtf doesn’t give the same precision as sqrt. I put under the sqrtf calculation inside a little + 0.1f and it works fine.

Thank you for everything.

Best regards.