Fastest way for each thread to work with 624 integers

I have 64*128=8192 threads.
For each thread, I need to pass in a list of 624 integers.
Each thread has to work (read/write) on its own list without interacting with
the lists of the other threads.

In the __global__ function I pass function(…, int *d_my_list),
where d_my_list is a cudaMemcpy HostToDevice copy of h_my_list,
set up as:

CUDA_SAFE_CALL( cudaMalloc((void **)&d_my_list, 8192*624*sizeof(int)) );
CUDA_SAFE_CALL( cudaMemcpy(d_my_list, h_my_list, 8192*624*sizeof(int), cudaMemcpyHostToDevice) );

8192*624*sizeof(int) = 20 MB. (Each thread works only with its own 624 integers,
and I want to use the fastest memory.)

In the __global__ function(…, int *d_my_list) I put:

unsigned int local_list[624];

const int THREAD_N = blockDim.x * gridDim.x; // 128 threads * 64 blocks = 8192
const int tid = blockDim.x * blockIdx.x + threadIdx.x;


and it crashes.
If I instead declare
__shared__ unsigned int local_list[624]
there is no crash, but local_list then seems to be really “shared”: I need each thread to work (read/write) independently of the others.
In EmuDebug I always see the same address for local_list, and in Release I get random results, so I think the threads are conflicting by each modifying the same local_list.

Can someone help me solve this? It would solve the problem in my previous post too.

I don’t understand your question. You have 8192 threads in total, and you have an array of 8192*624 uints?

What do you want? Should every thread have the same 624 uints, or does every thread need its own 624 uints? If you declare 624*8192 uints as shared memory you have a problem, because that is far too much: the maximum is 16 KB of shared memory.

But I don’t think that is your problem here. Maybe you can explain a little more?

No, each thread uses only its own 624 uints; each thread will read/modify only its own uints. I declared an array of 8192*624 uints just so that each thread can extract the 624 uints it needs.

If I do

__global__ void my_function(…, unsigned int *d_global_list)

    __shared__ unsigned int local_list[624];
    int jj;
    int k;
    const int THREAD_N = blockDim.x * gridDim.x;
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

…and each thread then does its calculation, changing the values in local_list.

I get a different result each time I launch, so I think local_list really is shared with the other threads of the block.

If I do not put the __shared__, it crashes; I think there is a memory problem…


Try to put this after your kernel call.


cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
    printf("error :%s\n", cudaGetErrorString(error));

This will give you more information about the error if it is your kernel that fails while running.

If I do not put the __shared__ in front of unsigned int local_list, then I get

“the launch timed out and was terminated”

If I put the __shared__, I get results in 0.25 sec, but they are always different.

I think that when I put the __shared__, my local_list really is “shared” by all the threads. Can you confirm that, and what can I do?

If you say

__shared__ myvar;

then myvar can be read and written from within all threads of the block. Also take a look at the programming guide, section 2.3 Memory model; it will tell you something about how the memory is organized.

The error you get probably occurs after about 5 seconds. That is the watchdog: your kernel takes more than 5 seconds to compute.

I agree the error is the watchdog, but I still do not understand why it is impossible to declare an array in a kernel. Is your answer that it is only possible with __shared__ or a cuMemAlloc()?

With cuMemAlloc it will go in global memory, so access will not be fast.

Considering the size is only 624 ints, I would like something better.

If in the kernel I declare “__shared__ unsigned int local_list[624]”

then I need a different local_list to be created for each thread. In EmuDebug I see only one address for local_list, and I think a separate, independent local_list is not being created for each thread (because in Release I get unstable results), which is what I need.

What can I do to get that?


Well, I think declaring an array inside the kernel is impossible; that is a kind of dynamic allocation. If you want to use it like that, you need to pass it to the kernel as an argument. So I think it is better to use shared memory.

Also take a look at this thread:

Whether use shared memory?

Yes, but I have 8192 threads and each thread needs an independent array of 624 ints (so 20 MB in total).

So it is impossible to put that in shared memory.

Even if I consider that I have 64 blocks of 128 threads, per block that comes to 128*624*sizeof(int) = 319,488 bytes, far more than the 16 KB of shared memory.

So I think I cannot do without global memory, or creating the array with a cuMemAlloc in the kernel, but I think that is roughly the same.

Do you agree with me?


Yes, I totally agree with you on that. I think you will have to use local or global memory on your device.
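To make the arithmetic in the last few posts easy to recheck, here is a small host-side sketch in plain C++. The function names stateBytes and fitsInShared are invented for illustration; the thread counts, the 624-word list and the 16 KB shared-memory limit are the figures quoted in this thread, and 32-bit integers are assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Figure quoted in the thread: 16 KB of shared memory per block.
constexpr std::size_t kSharedLimitBytes = 16 * 1024;

// Bytes needed to give each of `threads` threads its own `words` 32-bit integers.
// (Hypothetical helper, not part of any CUDA API.)
constexpr std::size_t stateBytes(std::size_t threads, std::size_t words) {
    return threads * words * sizeof(std::uint32_t);
}

// True if the per-thread state of one whole block fits in shared memory.
// (Hypothetical helper, not part of any CUDA API.)
constexpr bool fitsInShared(std::size_t threadsPerBlock, std::size_t words) {
    return stateBytes(threadsPerBlock, words) <= kSharedLimitBytes;
}
```

With these definitions, stateBytes(8192, 624) is 20,447,232 bytes for the whole grid and stateBytes(128, 624) is 319,488 bytes for one block, so fitsInShared(128, 624) is false. A 3-word state per thread, by contrast, needs only 1,536 bytes for a 128-thread block.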


Pardon me for asking, but 624 sounds like the size of the state vector for the Mersenne Twister random number generator. If that is indeed what you’re working on, my suggestion is that MT19937 is not an appropriate random number generator to use within individual CUDA threads; the state vector is way too big. I’d suggest using one of the other xorshift-type RNGs that have much smaller state vectors. If the fact that you’re storing 624 integers is just a random (pun intended) coincidence, then my only comment is that doing that sort of thing within individual CUDA threads is going to make your kernel memory-bandwidth bound, and you’ll have a lot of trouble with coalescing etc. unless you do things just right.
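On the coalescing remark: one arrangement that is often suggested for per-thread lists kept in global memory is to interleave them, so that neighbouring threads touch neighbouring addresses when they all read element i at the same time. The sketch below only compares the two index calculations on the host; the names contiguousIndex and interleavedIndex are made up for this example, and nothing in the thread says the original poster used either layout.

```cpp
#include <cstddef>

// Contiguous layout: thread `tid` owns elements [tid*words, (tid+1)*words).
// Neighbouring threads reading the same element i are `words` elements apart.
inline std::size_t contiguousIndex(std::size_t tid, std::size_t i,
                                   std::size_t words) {
    return tid * words + i;
}

// Interleaved layout: element i of every thread is stored side by side.
// Neighbouring threads reading the same element i are 1 element apart.
inline std::size_t interleavedIndex(std::size_t tid, std::size_t i,
                                    std::size_t threadCount) {
    return i * threadCount + tid;
}
```

In a kernel, the second form would correspond to something like d_global_list[i * THREAD_N + tid], using the THREAD_N and tid already computed in the code posted above. Both layouts use the same 8192*624 elements in total; only the order changes.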


Yes, you recognized it! In fact I came to the same conclusion after calculating the memory: the memory requirement is too large for the GPU.

There is also the TT800 (period 2^800), which needs only 25 uints, but I think that is still a little too big. The Taus113 and Taus88 from L’Ecuyer use only 4 and 3 uints. The Taus88 seems very nice: it has a period of 2^88, so more than 10^22, and shows no particular defects in the statistical tests.

With the TT800 I see a “bad” time factor of 3 compared to a 3-uint generator if I declare the local_list in the kernel. I will check the factor when I use shared memory to see if it is better. However, I will definitely use the Taus88.
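For reference, a minimal sketch of one Taus88 step, following the shift and mask constants from L’Ecuyer’s published generator. The struct name Taus88State, the function name taus88Next and the HOST_DEVICE macro are assumptions made for this example, and the code has not been tuned or benchmarked on a GPU. The three state words should be seeded with values of at least 2, 8 and 16 respectively.

```cpp
#include <cstdint>

// Hypothetical macro so the same source builds as plain C++ or under nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Per-thread generator state: three 32-bit words (12 bytes).
struct Taus88State {
    std::uint32_t s1, s2, s3;  // seeds must be >= 2, >= 8 and >= 16
};

// Advance the three Tausworthe components and return their XOR.
HOST_DEVICE inline std::uint32_t taus88Next(Taus88State &st) {
    std::uint32_t b;
    b     = ((st.s1 << 13) ^ st.s1) >> 19;
    st.s1 = ((st.s1 & 4294967294u) << 12) ^ b;
    b     = ((st.s2 << 2) ^ st.s2) >> 25;
    st.s2 = ((st.s2 & 4294967288u) << 4) ^ b;
    b     = ((st.s3 << 3) ^ st.s3) >> 11;
    st.s3 = ((st.s3 & 4294967280u) << 17) ^ b;
    return st.s1 ^ st.s2 ^ st.s3;
}
```

Because the state is three plain words rather than an indexed array, each thread can hold it in ordinary local variables, which avoids the per-thread array question discussed earlier in the thread. Each thread still needs its own distinct seeds; how to choose them is not covered here.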

Maybe the registers do not like arrays, and shared memory is better for that.

In section 3.2 Execution Kernel we can read: “How many blocks each multiprocessor can process in one batch depends on how many registers and how much shared memory per block are required for a given kernel …” It seems that in reality the threads can be launched without on-chip memory, using addresses in global memory. In that case, what I see beyond a certain level (the level where global memory starts being used) is a strong decrease in performance. I am not sure of what I say, but it seems consistent with what I see.

Yes, I think you’ll want to put your state vectors in shared memory if you can get them to fit. I don’t know what you’re planning to do with your per-thread RNGs, but you’ll be using a lot of your shared memory just for the RNG state vectors, so you won’t have much left for anything else. Hopefully you’re doing something that’s heavy on arithmetic, in which case you might get by OK with a fairly small thread block of, say, 64 threads or so.