Fastest way for each thread to work with 624 integers

Hello,
I have 64*128 = 8192 threads.
For each thread, I need to supply a list of 624 integers.
Each thread has to work (read/write) on its own list without interacting with
the lists of the other threads.

To the __global__ function I pass function(…, int *d_my_list),
where d_my_list is a host-to-device cudaMemcpy of h_my_list,
defined as:

CUDA_SAFE_CALL( cudaMalloc((void **)&d_my_list, 8192*624*sizeof(int)) );
CUDA_SAFE_CALL( cudaMemcpy(d_my_list, h_my_list, 8192*624*sizeof(int), cudaMemcpyHostToDevice) );

8192*624*sizeof(int) = 20 MB. (Each thread works only with its own 624 integers,
and I want to use the fastest memory.)

In the __global__ function(…, int *d_my_list) I put:

{…
unsigned int local_list[624];

const int THREAD_N = blockDim.x * gridDim.x; // 64 blocks * 128 threads = 8192 threads
const int tid = blockDim.x * blockIdx.x + threadIdx.x;

for (int jj = 0; jj < 624; jj++) //!!!
{
    local_list[jj] = d_my_list[624*tid + jj];
}

and it crashes with that.
If I put
__shared__ unsigned int local_list[624], there is no crash, but the
local_list really does seem to be “shared”: I need each thread to work (read/write) independently of the others.
In EmuDebug I always see the same address for my local_list, and in Release I get random results, so I think there are conflicts between the local_lists that each thread modifies.

Can someone help me solve this? It would solve my previous post too.
Thanks.

I don’t understand your question. You have 8192 threads total. And you have an array of 8192*624 uints?

What do you want: do you want each thread to have the same 624 uints, or does every thread need its own 624 uints? If you declare 624*8192 uints as shared memory you have a problem, because that is far too much; the maximum is 16 KB of shared memory.

But I don’t think that is your problem here. Maybe you can explain a little bit more?

No, each thread uses only its own 624 uints; each thread will read/modify only its own uints. I declared an array of 8192*624 uints just so that each thread can extract the 624 uints it needs.

If I do

__global__ void my_function(…, unsigned int *d_global_list)

{
    __shared__ unsigned int local_list[624];

    int jj;
    int k;

    const int THREAD_N = blockDim.x * gridDim.x;
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    for (jj = 0; jj < 624; jj++) //!!!!!!!!!!!!
    {
        local_list[jj] = d_global_list[624*tid + jj];
    }

…and each thread then does its calculation, changing the values in local_list.

I get a different result each time I launch it, so I think the local_list really is shared by the other threads of the block.

If I do not put the __shared__, I get a crash; I think there is a memory problem…


Try to put this after your kernel call.

cudaThreadSynchronize();
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
    printf("error: %s\n", cudaGetErrorString(error));

This will give you some more information about the error if it happens while your kernel is running.
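For example, placed right after the launch (the kernel name and launch configuration below are only placeholders, not from your code):

dim3 grid(64);
dim3 block(128);
my_kernel<<<grid, block>>>(d_global_list);   // your kernel and its real arguments
cudaThreadSynchronize();
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
    printf("error: %s\n", cudaGetErrorString(error));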

If I do not put the __shared__ unsigned int local_list, then I get the message:

“the launch timed out and was terminated”

If I put the __shared__: results in 0.25 s, but always different.

I think that when I put the __shared__, my local_list really is “shared” by all the threads. Can you confirm that, and what can I do?

If you say

__shared__ myvar;

then myvar can be read and written from within all threads. Also take a look at the programming guide, section 2.3 “Memory model”; it will tell you something about how the memory is organized.
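For illustration, a rough sketch of how each thread can still take its own slice of a __shared__ array by indexing with threadIdx.x; the sizes here are made up (624 uints per thread would not fit in 16 KB):

#define PER_THREAD 8            // illustrative only, not your 624
#define THREADS_PER_BLOCK 128

__global__ void example_kernel(unsigned int *d_global_list)
{
    // one copy of this array per block, shared by its 128 threads
    __shared__ unsigned int block_scratch[THREADS_PER_BLOCK * PER_THREAD];

    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // each thread works only on its own slice of the block's shared array
    unsigned int *my_slice = &block_scratch[threadIdx.x * PER_THREAD];

    for (int jj = 0; jj < PER_THREAD; jj++)
        my_slice[jj] = d_global_list[PER_THREAD * tid + jj];

    // ... per-thread read/write on my_slice[0..PER_THREAD-1] ...
}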

The error you get probably occurs after about 5 seconds; that is the watchdog, because your kernel takes more than 5 seconds to compute.

I agree the error is the watchdog, but I still do not understand why it is impossible to declare an array in a kernel. Is your answer that it is only possible with __shared__ or with a cuMemAlloc()?

With cuMemAlloc it will go into global memory, so not fast access.

Considering that the size is only 624 ints, I would like something better.

If in the kernel I declare “__shared__ unsigned int local_list[624]”,

then for each thread a different local_list needs to be created. In EmuDebug I see only one address for the local_list, and I think a separate, independent local_list is not created for each thread (because in Release I get unstable results), which is what I need.

What can I do to get that, then?

Thanks.

Well, I think declaring an array inside the kernel like that is impossible; that is some kind of dynamic allocation. If you want to use an array that way, you need to pass it to the kernel as an argument. So I think it is better to use shared memory.

Also take a look at this thread:

Whether use shared memory?

Yes, but I have 8192 threads and each thread needs an independent array of 624 ints… (so 20 MB total).

So it is impossible to put that in shared memory.

Even if I consider that I have 64 blocks of 128 threads, that gives per block:

128*624*4 ≈ 320 KB >> 16 KB.

So I think I cannot do without global memory, or I could create the array with a cuMemAlloc in the kernel, but I think it comes to roughly the same thing.
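Something like this is what I mean: each thread reads and writes only its own 624-uint slice of the global array, with no local or shared copy (a rough sketch; the update is just a placeholder):

__global__ void work_in_global(unsigned int *d_global_list)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // this thread's own slice of the 8192*624 array
    unsigned int *my_list = &d_global_list[624 * tid];

    for (int jj = 0; jj < 624; jj++)
        my_list[jj] += 1u;   // placeholder for the real per-thread update
}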

Do you agree with me?

Thanks.

Yes, I totally agree with you on that matter. I think you will have to use local or global memory on your device.

Hi,

Pardon me for asking, but 624 sounds like the size of the state vector for the Mersenne Twister random number generator. If that is indeed what you’re working on, my suggestion is that MT19937 is not an appropriate random number generator to use within individual CUDA threads; the state vector is way too big. I’d suggest using one of the other xorshift-type RNGs that have much smaller state vectors. If the fact that you’re storing 624 integers is just a random (pun intended) coincidence, then my only comment is that doing that sort of thing within individual CUDA threads is going to make your kernel memory-bandwidth bound, and you’ll have a lot of trouble with coalescing etc. unless you do things just right.

John

Yes, you guessed it! In fact I came to the same conclusion after calculating the memory: the amount of memory needed is too large to be usable on the GPU.

In fact there is TT800 (periodicity 2^800), which uses only 25 uints, but I think it is still a little too big; Taus113 and Taus88 from L’Ecuyer use only 4 and 3 uints. Taus88 seems very nice: it has a periodicity of 2^88, so more than 10^22, and shows no particular flaw in the statistical tests.

With TT800 I get a “bad” time factor of 3 compared to a 3-uint generator if I declare the local_list in the kernel. I will check the factor when using shared memory to see if it is better. However, I will definitely use Taus88.
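For reference, a rough sketch of one Taus88 step as a device function, following L’Ecuyer’s published recurrence; the pointer interface and keeping the three words of state in registers are my own choices, and the seeds must satisfy s1 > 1, s2 > 7, s3 > 15:

__device__ unsigned int taus88_step(unsigned int *s1,
                                    unsigned int *s2,
                                    unsigned int *s3)
{
    unsigned int b;
    b   = ((*s1 << 13) ^ *s1) >> 19;
    *s1 = ((*s1 & 0xFFFFFFFEu) << 12) ^ b;
    b   = ((*s2 << 2)  ^ *s2) >> 25;
    *s2 = ((*s2 & 0xFFFFFFF8u) << 4)  ^ b;
    b   = ((*s3 << 3)  ^ *s3) >> 11;
    *s3 = ((*s3 & 0xFFFFFFF0u) << 17) ^ b;
    return *s1 ^ *s2 ^ *s3;
}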

Maybe the registers don’t like arrays, and shared memory is better for that.

In section 3.2, “Execution Model”, we can read: “How many blocks each multiprocessor can process in one batch depends on how many registers and how much shared memory per block are required for a given kernel …”. It seems that in reality threads can be launched without memory on the chip, using addresses in global memory; in that case, at a certain level (the level where global memory is used), you see a strong decrease in performance. I am not sure of what I say, but it seems consistent with what I observe.

Yes, I think you’ll want to put your state vectors in shared memory if you can get them to fit. I don’t know what you’re planning to do with your per-thread RNGs, but you’ll be using a lot of your shared memory just for the RNG state vectors, so you won’t have much left for anything else. Hopefully you’re doing something that’s heavy on arithmetic, in which case you might get by ok with a fairly small sized thread block of say 64 threads or so.
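Something along these lines, a sketch with an assumed layout of 3 uints of state per thread (e.g. Taus88) and 64-thread blocks; the buffer name and the load/store scheme are only illustrative:

#define BLOCK_SIZE 64

__global__ void rng_kernel(unsigned int *d_seeds /* 3 uints per thread */)
{
    // 64 * 3 * 4 = 768 bytes of shared memory per block for the RNG state
    __shared__ unsigned int state[BLOCK_SIZE][3];

    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // load this thread's 3-word state from global memory
    for (int k = 0; k < 3; k++)
        state[threadIdx.x][k] = d_seeds[3 * tid + k];

    // ... generate numbers from state[threadIdx.x][0..2] and do the
    //     arithmetic-heavy work here ...

    // write the state back so the sequence can continue at the next launch
    for (int k = 0; k < 3; k++)
        d_seeds[3 * tid + k] = state[threadIdx.x][k];
}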

Cheers,

John