Unusual behavior reading array


I’m new to this forum. I’m developing software for general-purpose integration (the Vegas method).
I’m having trouble reading the array xi: sometimes (apparently at random) the read returns bad values.

I’ve pasted my code below; I hope someone can help me.

__global__ void kernel(int* ndim, int *kg, double* dxg, int * ia, double * xi, double *regn, double *dx, double * di, int* mds, double * d, double* xjac, double* xnd, int* npg, curandState *  const rngState, double *ret_f, double * ret_f2){

	//Thread identity and offset
	int tid = threadIdx.x;
	long int offset = blockIdx.x*blockDim.x+tid;
	//Variables needed
	__shared__ double dev_f[THREADS];
	__shared__ double dev_f2[THREADS];
	double xn,xo,rc,wgt;
	double x[MXDIM];
	double f=0.,f2=0.,appo=0.;
	int a,b;
	curandState localState=rngState[offset];        
	while(offset < (*npg)){
		for (int j=0;j<(*ndim);j++) {
			xn=(kg[j]- curand_uniform_double(&localState))*(*dxg)+1.0;
			/*if(a>3*NDMX || a<0) {
			if (ia[j] > 1) { 
			} else {	
			wgt *= xo*(*xnd);
		appo=fxn3(x,*ndim)* wgt; //Function evaluation
		for (int j=0;j<(*ndim);j++) {
			atomicAdd(&di[(ia[j]-1)*MXDIM+j], appo);
	//Summing over each single block
	int i = blockDim.x/2;
			dev_f[tid] +=dev_f[tid+i];
			dev_f2[tid] +=dev_f2[tid+i];

    	i /=2;	
	if(tid  == 0){

a) check that you have allocated and initialized the array correctly
b) check that what is supposed to be in the array, is in the array
c) check that your index, is what your index is supposed to be

you can add a few lines of code to be able to check b) and c)
simply read the array out with no fancy indexing
then read out your indices, or at least add code to test that they are always within known limits
add breakpoints and use the debugger to check the values

i would surely check c), as you use a to index the mentioned array, and a is no small thing in calculation itself
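A minimal sketch of check c), assuming a simple gather-style read. The kernel name `checked_read`, the host helper `count_bad_indices`, and the test data are hypothetical and not taken from the posted program; the point is only that the index is tested before it is used, and violations are counted rather than dereferenced.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reads data[idx[t]] only after testing the index. Out-of-range indices are
// counted in *bad_count and never dereferenced.
__global__ void checked_read(const double *data, int n_data,
                             const int *idx, int n_idx,
                             double *out, int *bad_count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_idx) return;

    int a = idx[t];
    if (a < 0 || a >= n_data) {
        atomicAdd(bad_count, 1);   // record the violation
        out[t] = 0.0;
        return;
    }
    out[t] = data[a];
}

// Hypothetical host helper: returns the number of out-of-range indices,
// or -1 if any CUDA call fails.
int count_bad_indices(const std::vector<int> &h_idx, int n_data)
{
    int n_idx = (int)h_idx.size();
    std::vector<double> h_data(n_data);
    for (int i = 0; i < n_data; ++i) h_data[i] = 0.5 * i;   // dummy contents

    double *d_data = nullptr, *d_out = nullptr;
    int *d_idx = nullptr, *d_bad = nullptr;
    int h_bad = -1;

    bool ok =
        cudaMalloc(&d_data, n_data * sizeof(double)) == cudaSuccess &&
        cudaMalloc(&d_out, n_idx * sizeof(double)) == cudaSuccess &&
        cudaMalloc(&d_idx, n_idx * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_bad, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_data, h_data.data(), n_data * sizeof(double),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_idx, h_idx.data(), n_idx * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_bad, 0, sizeof(int)) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (n_idx + threads - 1) / threads;
        checked_read<<<blocks, threads>>>(d_data, n_data, d_idx, n_idx,
                                          d_out, d_bad);
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(&h_bad, d_bad, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_data);
    cudaFree(d_out);
    cudaFree(d_idx);
    cudaFree(d_bad);
    return ok ? h_bad : -1;
}
```

A non-zero count means the index calculation is at fault; a zero count only says the indices stayed in range, not that they were the intended ones.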

First, thanks for the answer.

a) The array is allocated and initialized correctly; before I pass it to the device it is definitely correct.
b) I did this check, and this is where the problem appears: reading some positions of the array returns wrong data. It happens randomly, and the wrong values are of the same order of magnitude as the right ones (which tells me I’m not reading out of range).
c) The index is fancy because I’ve flattened a matrix into an array, so the expression I use is the right way to access the right position.

You said “a is no small thing in calculation itself”. What are you referring to?

I’m still learning to use the debugger, so forgive me if I don’t see problems at first sight.

Thank you very much.

This conclusion sounds dangerous…

I think he wanted to ask you to do a simple indexing as a test. In your code the variable “a” is very complex…

little_jimmy proposed to make two tests:
to check b) check the values without this complex (fancy) indexing to ensure you have the correct values in the array.
to check c) check that your index is always in the correct range.

But do not try to check this by accessing the array with your complex index. Otherwise, you won’t be able to tell whether the index is wrong or the data is wrong…

In your kernel the xi values don’t change, right? Then, you can check if the values before and after the kernel call are still the same to ensure that other writes were not out of bounds.
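The before/after comparison could look roughly like this. It is a sketch under assumptions: `read_only_kernel` is a stand-in for the real kernel and `xi_unchanged_after_kernel` is a hypothetical helper name, neither of which comes from the posted code.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the real kernel (hypothetical): it only reads xi.
__global__ void read_only_kernel(const double *xi, int n, double *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = 2.0 * xi[t];
}

// Returns true if the device copy of xi is bit-identical to the host copy
// after the kernel has run; false if it changed or a CUDA call failed.
bool xi_unchanged_after_kernel(const std::vector<double> &h_xi)
{
    int n = (int)h_xi.size();
    size_t bytes = n * sizeof(double);
    std::vector<double> h_after(n);
    double *d_xi = nullptr, *d_out = nullptr;

    bool ok =
        cudaMalloc(&d_xi, bytes) == cudaSuccess &&
        cudaMalloc(&d_out, bytes) == cudaSuccess &&
        cudaMemcpy(d_xi, h_xi.data(), bytes,
                   cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        read_only_kernel<<<blocks, threads>>>(d_xi, n, d_out);
        // Copy xi back from the device and compare it with what was sent.
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h_after.data(), d_xi, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_xi);
    cudaFree(d_out);
    return ok && std::memcmp(h_xi.data(), h_after.data(), bytes) == 0;
}
```

This only detects stray writes that land in xi. It says nothing about a wrong index being used for a read, or about another array being modified concurrently.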

OK, you’ve clarified the problem.

I’ve done the checks you suggested.

  • The values in the array without the complex index are right
  • The array before and after the kernel call is the same
  • The range of the index is checked when ia[j] is constructed, in line 29, where the value is in the range 1 < ia[j] < NDMX. After that, the only thing I add to this value (which is chosen randomly, as you can see in line 28) is j*MXDIM, to select the part of the array that belongs to the dimension I’m working on (remember that I’m doing multi-dimensional integration, so a “dimension” here is one of the dimensions of the integral).
  • a is correct

I’ve made other checks:

  • I put the expression for a directly into the array access, xi[j*MXDIM+ia[j]-1], and I noticed that it fails (gets the wrong value) more often
  • I’ve tried to do every operation in the safest way I know, e.g. reading the value from xi into a double variable and taking the difference afterwards (this would tell me whether the subtraction was the problem), but this doesn’t solve the problem.

What other tests can I do?

Unfortunately, I haven’t managed to run it in cuda-gdb yet…

I do not understand how you define “wrong or correct value”. Do you mean the “wrong values” are not in the array at all?

xi is an array of size ndim*NDMX? For each j = 0…ndim you want to choose randomly (with some weight?) an element from xi[j*NDMX]…xi[(j+1)*NDMX-1]?

No. I’ll try to explain quickly how it works.
The random number generated is used for two things: 1) to select the bin and 2) to evaluate the function in that bin. Statistically, by generating more than 100000 uniform numbers I sample all the possible bins.

xi is an array of size MXDIM*NDMX and contains the bin boundaries (NDMX of them) for each dimension (MXDIM), so I have to take the difference between two elements to get the width of a given bin. For now I’ve fixed the width of each bin to check that everything works, but sometimes this difference gives me an unexpected value (even a negative one). When I talk about a wrong value I mean that the value is different from what I expect…
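To make the width calculation concrete, here is a small sketch. The memory layout is an assumption, not something confirmed by the posted code: xi[j*nbins + (k-1)] is taken to be the upper edge of bin k (1-based) in dimension j, and the first bin is taken to start at 0. The names `bin_width`, `count_bad_widths` and `bad_widths_on_uniform_grid` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Width of bin k (1-based) in dimension j, under the ASSUMED layout
// xi[j * nbins + (k - 1)] = upper edge of bin k, first bin starting at 0.
__host__ __device__ inline double bin_width(const double *xi, int nbins,
                                            int j, int k)
{
    double upper = xi[j * nbins + (k - 1)];
    double lower = (k > 1) ? xi[j * nbins + (k - 2)] : 0.0;
    return upper - lower;
}

// One thread per (dimension, bin) pair; counts widths that are not positive.
__global__ void count_bad_widths(const double *xi, int ndim, int nbins,
                                 int *bad)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= ndim * nbins) return;
    int j = t / nbins;
    int k = t % nbins + 1;
    if (!(bin_width(xi, nbins, j, k) > 0.0)) atomicAdd(bad, 1);
}

// Hypothetical host helper: builds a uniform grid on (0, 1] and returns the
// number of non-positive widths seen on the device, or -1 on a CUDA error.
int bad_widths_on_uniform_grid(int ndim, int nbins)
{
    int n = ndim * nbins;
    std::vector<double> h_xi(n);
    for (int j = 0; j < ndim; ++j)
        for (int i = 0; i < nbins; ++i)
            h_xi[j * nbins + i] = (i + 1) / (double)nbins;

    double *d_xi = nullptr;
    int *d_bad = nullptr;
    int h_bad = -1;
    bool ok =
        cudaMalloc(&d_xi, n * sizeof(double)) == cudaSuccess &&
        cudaMalloc(&d_bad, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_xi, h_xi.data(), n * sizeof(double),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_bad, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        int threads = 128;
        count_bad_widths<<<(n + threads - 1) / threads, threads>>>(
            d_xi, ndim, nbins, d_bad);
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(&h_bad, d_bad, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_xi);
    cudaFree(d_bad);
    return ok ? h_bad : -1;
}
```

With boundaries that increase within each dimension, every width computed this way is positive, so a negative width points at the bin index rather than at the contents of xi.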

Another thing I do not understand is the use of the “ia” array. It is an array in global memory of size MXDIM (or ndim?). How do you ensure that there is no race condition on that array?

MXDIM = ndim in that case.
ia is an array of three integers, each of which identifies the bin for one dimension. As I explained before, ia[j] is determined from xn, which is a floating-point value between 1 and NDMX; ia[j] is its int cast…

OK, I forgot to check for race conditions…

Shouldn’t you define ia in local memory? Or do other threads need to interact with ia?

Yes. In the current version of the program I can define ia in local memory, and this solves the problem. In the future I will have to get ia back from the kernel, but that isn’t a problem at the moment.
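For reference, a sketch of the thread-local approach. With a single ia in global memory shared by all threads, one thread can overwrite ia[j] between another thread's write and its later read, which would be consistent with the random, same-magnitude wrong values described in this thread. Everything named here is hypothetical: `pick_bin` is a deterministic stand-in for the random bin choice, and `ia_out` is one possible way to return ia later, with each sample point writing to its own slot.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define EX_NDIM 3      // hypothetical number of dimensions
#define EX_NBINS 50    // hypothetical number of bins per dimension

// Deterministic stand-in for the random bin choice; returns 1..EX_NBINS.
__host__ __device__ inline int pick_bin(int point, int j)
{
    return (point * 7 + j * 13) % EX_NBINS + 1;
}

__global__ void local_ia_kernel(int n_points, int *ia_out)
{
    int stride = gridDim.x * blockDim.x;
    int ia[EX_NDIM];   // private to this thread: no other thread can touch it

    for (int p = blockIdx.x * blockDim.x + threadIdx.x; p < n_points;
         p += stride) {
        for (int j = 0; j < EX_NDIM; ++j)
            ia[j] = pick_bin(p, j);

        // Each point owns its own slot in ia_out, so no atomics are needed.
        for (int j = 0; j < EX_NDIM; ++j)
            ia_out[p * EX_NDIM + j] = ia[j];
    }
}

// Hypothetical host helper: returns how many returned bins differ from the
// expected ones, or -1 if a CUDA call fails.
int local_ia_mismatches(int n_points)
{
    std::vector<int> h_ia(n_points * EX_NDIM, 0);
    size_t bytes = h_ia.size() * sizeof(int);
    int *d_ia = nullptr;

    bool ok = cudaMalloc(&d_ia, bytes) == cudaSuccess;
    if (ok) {
        local_ia_kernel<<<8, 128>>>(n_points, d_ia);
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h_ia.data(), d_ia, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_ia);
    if (!ok) return -1;

    int mismatches = 0;
    for (int p = 0; p < n_points; ++p)
        for (int j = 0; j < EX_NDIM; ++j)
            if (h_ia[p * EX_NDIM + j] != pick_bin(p, j)) ++mismatches;
    return mismatches;
}
```

The per-point output buffer costs n_points * EX_NDIM integers of global memory, which is the price of returning ia without reintroducing a shared location.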

Thank you very much.