"strange behavior" in a device function

I have a problem with arrays on the device and I can't see what I am doing wrong.
Below is a simplified version of my code that shows the problem.

In my __global__ function, I create a local array local_tableau_mots, and I define an integer k that depends on the thread ID.
If I set k = ThreadId % 623, the kernel takes a very long time (1 or 2 seconds).
If I set k = 0, or any fixed value, there is no problem (0.10 s):

In the function “__device__ void next_normal_MT”, I read an array element and then write the value I just read back to the same index. If I delete the write, everything is fine…

So something is really off… :( :( :(

Nothing shows up in EmuDebug mode.

Usually problems like this come from a bad memory access.
Is there a problem with declaring an array of 624 unsigned int on the device? I thought the register file held something like 8000 words, and that there was 16 kB of shared memory.

Here is my example. Thanks in advance.

__global__ void GaussKernelMT(int nb_trajectoires, float nb_jour, int index_premier_k, float *d_tableau_sommes_partielles, int nb_strike, float *pd_strike, float forward, float vol, float volvol, unsigned int *d_tableau_mots, int *d_tableau_k)
{
    float taux;
    unsigned int local_tableau_mots[624];
    int jj;
    int k;

    const int THREAD_N = blockDim.x * gridDim.x;
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    k = 0; // 1.7 s if this line is commented out, but 0.11 s if this line is present

    for (int traj = tid; traj < nb_trajectoires; traj += THREAD_N)
    {
        taux = taux_apres_trajectoire_gaussMT(nb_jour, forward, vol, volvol, local_tableau_mots, &k); // taux does not depend on k, see code

        remplit_tableau_somme_partielles(THREAD_N, tid, forward, taux, nb_strike, pd_strike, d_tableau_sommes_partielles); // does not depend on k
    }
}


__device__ float taux_apres_trajectoire_gaussMT(float nb_jour, float forward, float vol, float volvol, unsigned int local_tableau_n_mots[624], int *pk)
{
    float taux;
    float vol_actuelle;
    float valeur_alea1;
    float valeur_alea2;

    for (int j = 0; j < nb_jour; j++)
    {
        // valeur_alea1 and valeur_alea2 are always 0.5, so taux has the same value for every k.
        next_normal_MT(local_tableau_n_mots, pk, &valeur_alea1, &valeur_alea2);
        // valeur_alea1 and valeur_alea2 are always 0.5f, see code below
    }

    return taux;
}


__device__ void next_normal_MT(unsigned int ptableau_n_mots[mt_nn], int *pk, float *palea1, float *palea2)
{
    int k;
    unsigned int prochain_mot;

    ptableau_n_mots[k] = prochain_mot; // if I delete this line the problem disappears!
}


In fact, in EmuDebug I do get a crash, with this message:

First-chance exception at 0x7c812a5b in…exe: Microsoft C++ exception: cudaError_enum at memory location 0x021afe30…
The thread ‘Win32 Thread’ (0x58c) has exited with code 0 (0x0).
HEAP[sabr2.exe]: Invalid Address specified to RtlValidateHeap( 00370000, 0012E914 )


unsigned int local_tableau_mots[624];

… is this in device memory? All RAM used in a kernel must be device memory, for example shared memory or memory allocated with cudaMalloc(). I don’t see a cudaMalloc.

Yes, it is in device memory. Just as it is possible to declare simple local floats in a kernel, isn’t it possible to declare an array in a kernel in the same simple way?

I would like the array to be in registers (but I am not sure whether a size of 624 integers is possible). If I put this in my kernel:

__shared__ unsigned int local_list[624];

it is better, there is no crash, but I get different results on different launches. I need each thread, and only that thread, to read and write “its” own array. I suppose that is not the case when I declare it like that.

What can I do then?


There is no problem with:

unsigned int local_list[624];

inside a __global__ or __device__ function. The compiler will give every thread its own array, but the array is (behind the scenes) stored in global memory. The access time will be slower, like reading from a normal global array.
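To make the "every thread gets its own array" point concrete, here is a minimal sketch. The kernel name per_thread_array and the host helper run_per_thread_array_demo are hypothetical names chosen for this illustration; only the array length 624 and the index k = tid % 623 come from the thread. Each thread fills its private array with values derived from its own ID, updates one element at a runtime index, and the host checks that no thread saw another thread's data.

```cuda
#include <cuda_runtime.h>

#define WORDS 624  // array length used in the thread

// Each thread fills a private array, then does a read-modify-write at a
// thread-dependent index (k = tid % 623, as in the original question).
__global__ void per_thread_array(unsigned int *out, int n)
{
    unsigned int local_list[WORDS];  // private to this thread
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    for (int i = 0; i < WORDS; ++i)
        local_list[i] = (unsigned int)tid * 1000u + (unsigned int)i;

    const int k = tid % 623;         // index only known at run time
    local_list[k] += 1u;             // read and write at the same index
    out[tid] = local_list[k];
}

// Hypothetical host helper: returns true if every thread produced the value
// it would produce when working only on its own array.
bool run_per_thread_array_demo()
{
    const int n = 256;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess) return false;

    per_thread_array<<<(n + 63) / 64, 64>>>(d_out, n);

    unsigned int h_out[n];
    // cudaMemcpy waits for the kernel and also reports launch/execution errors.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < n; ++t) {
        const unsigned int expected = (unsigned int)t * 1000u + (unsigned int)(t % 623) + 1u;
        if (h_out[t] != expected) return false;
    }
    return true;
}
```

Calling run_per_thread_array_demo() from main should return true. The sketch only demonstrates correctness of per-thread arrays, not their speed.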

It is very slow (about 20x slower). Is there any way to go faster?

Is it possible to use shared memory?


You can use shared memory to hold your array, but by definition every thread in the block will see the same array. If every thread needs its own array in shared memory, you can declare the shared array with [# of threads] * [array length] elements and then have each thread calculate the offset it should use. With only 16 kB of shared memory, you will run out pretty quickly if you have a reasonable number of threads. You may be stuck using local memory (the name for the global memory assigned to each thread).
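The offset scheme described above can be sketched as follows. The names per_thread_shared_slice, run_shared_slice_demo, SLICE and THREADS_PER_BLOCK are hypothetical, and the slice length of 32 words is an assumption chosen so the pool fits: 64 threads * 32 words * 4 bytes = 8 kB. With the 624-word arrays from the question, one array takes 624 * 4 = 2496 bytes, so 16 kB would hold only 6 of them.

```cuda
#include <cuda_runtime.h>

#define SLICE 32              // hypothetical per-thread array length (words)
#define THREADS_PER_BLOCK 64  // 64 * 32 * 4 bytes = 8 kB of shared memory

// One shared pool per block; each thread works only inside its own slice.
__global__ void per_thread_shared_slice(unsigned int *out, int n)
{
    __shared__ unsigned int pool[THREADS_PER_BLOCK * SLICE];
    unsigned int *mine = &pool[threadIdx.x * SLICE];  // this thread's offset

    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    for (int i = 0; i < SLICE; ++i)
        mine[i] = (unsigned int)tid * 100u + (unsigned int)i;

    const int k = tid % SLICE;  // runtime index inside the slice
    mine[k] += 1u;              // read and write at the same index
    out[tid] = mine[k];
}

// Hypothetical host helper: returns true if no thread disturbed another's slice.
bool run_shared_slice_demo()
{
    const int n = 128;  // two blocks of THREADS_PER_BLOCK threads
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess) return false;

    per_thread_shared_slice<<<n / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_out, n);

    unsigned int h_out[n];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < n; ++t) {
        const unsigned int expected = (unsigned int)t * 100u + (unsigned int)(t % SLICE) + 1u;
        if (h_out[t] != expected) return false;
    }
    return true;
}
```

No __syncthreads() is needed here because threads never touch each other's slices. A contiguous slice per thread is the simplest layout but may cause shared-memory bank conflicts; an interleaved layout such as pool[i * THREADS_PER_BLOCK + threadIdx.x] is a possible alternative worth measuring.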

OK, so I need to use global memory for that.
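For completeness, here is a minimal sketch of what an explicit global-memory scratch buffer could look like, as an alternative to letting the compiler place the array in local memory. The names scratch_kernel and run_scratch_demo and the interleaved layout (element i of thread tid stored at i * total_threads + tid, so that neighbouring threads touch neighbouring words) are assumptions made for this illustration, not something taken from the thread. Whether this is any faster than the compiler-managed local array is something to measure on the actual hardware.

```cuda
#include <cuda_runtime.h>

#define WORDS 624  // array length used in the thread

// Each thread owns WORDS elements of a cudaMalloc'ed buffer, interleaved
// with the elements of the other threads.
__global__ void scratch_kernel(unsigned int *scratch, unsigned int *out, int total_threads)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= total_threads) return;

    for (int i = 0; i < WORDS; ++i)
        scratch[i * total_threads + tid] = (unsigned int)(tid + i);

    const int k = tid % 623;                 // runtime index
    scratch[k * total_threads + tid] += 1u;  // read and write at the same index
    out[tid] = scratch[k * total_threads + tid];
}

// Hypothetical host helper: allocates the scratch buffer, runs the kernel,
// and returns true if every thread read back its own data.
bool run_scratch_demo()
{
    const int total_threads = 256;
    unsigned int *d_scratch = nullptr;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_scratch, (size_t)WORDS * total_threads * sizeof(unsigned int)) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, total_threads * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_scratch);
        return false;
    }

    scratch_kernel<<<total_threads / 64, 64>>>(d_scratch, d_out, total_threads);

    unsigned int h_out[total_threads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_scratch);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < total_threads; ++t) {
        const unsigned int expected = (unsigned int)(t + t % 623) + 1u;
        if (h_out[t] != expected) return false;
    }
    return true;
}
```

The buffer size grows with the number of threads (here 624 * 256 * 4 bytes, about 624 kB), so the launch configuration and the allocation have to be kept in sync.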