How to parallelize a loop? (NEWBIE QUESTION) Splitting a large loop across parallel threads

Hello,

please, tell me,

i have a function f(x), where x varies from 0x00000000 … 0xFFFFFFFF
the result of f(x) is compared with some constants and if it matches, i have to write it down to a file.

what is the best way to compute such a task in cuda?
how can i pass the x values to many threads?

thanks a lot.

do you just have a single “x” you want to check?

if you have more than one: put all of them into an array (int x[…]), copy it to the gpu, let every thread read its “x” (meaning x[threadIdx.x+blockDim.x*blockIdx.x] or something like that), compute the value, check it against the constants (which you also copied to the device) and write back its value and whether to save it or not.
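a minimal sketch of that array version, with a placeholder f() and an example target constant (all names here are illustrative, not a fixed API):

#include <cstdio>
#include <cuda_runtime.h>

// placeholder for the poster's f(x); an assumption, not the real function
__device__ unsigned int f(unsigned int x) { return x ^ 0xDEADBEEF; }

__global__ void check(const unsigned int *x, unsigned int target,
                      int *match, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        match[i] = (f(x[i]) == target);   // 1 if this x hits the constant
}

int main()
{
    const int n = 1024;
    unsigned int h_x[n];
    for (int i = 0; i < n; ++i) h_x[i] = i;   // fill with test values

    unsigned int *d_x; int *d_match;
    cudaMalloc(&d_x, n * sizeof(unsigned int));
    cudaMalloc(&d_match, n * sizeof(int));
    cudaMemcpy(d_x, h_x, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    check<<<n / 256, 256>>>(d_x, 0x1129F56Eu, d_match, n);

    int h_match[n];
    cudaMemcpy(h_match, d_match, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h_match[i]) printf("hit at x=0x%08X\n", h_x[i]);

    cudaFree(d_x); cudaFree(d_match);
    return 0;
}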

if it’s a single “x” and a function f() that’s taking quite a long time to compute, you will have to try to parallelize whatever f() does. (which in most cases will be a bit more tricky ;-))

thanks,

my X takes values from 0x00000000 to 0xFFFFFFFF

my idea was to divide the whole range into chunks of RANGE / NUM_THREADS and have each thread compute its own chunk

Your idea looks good! That's the way to go.

i.e.

Make x a function of (threadIdx, blockIdx)
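a minimal sketch of that mapping (the chunk size is just an example):

// every thread derives its own sub-range of x from its indices
__global__ void scan_chunks(unsigned int chunk)
{
    unsigned int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int start = idx * chunk;
    unsigned int end   = start + chunk;

    for (unsigned int x = start; x < end; x++) {
        // … compute f(x) and compare against the constants here
    }
}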

Such single-thread code hangs my machine… :-(

// DWORD is assumed to be: typedef unsigned int DWORD;
__global__ void DoFunc(DWORD *res, DWORD test)
{
    DWORD result;
    DWORD x;

    for (x = 0x87000000; x < 0x88000000; x++)
    {
        result = FX(x);   // was missing the semicolon

        if (result == 0x1129F56E)
        {
            res[0] = x;   // was res[0]=j, but j is never declared; x is the value we want
            break;
        }
    }
}
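for context, a host-side launch for this kernel might look like the sketch below (standard CUDA runtime calls; a <<<1,1>>> launch means one single thread grinds through all 16M values serially):

DWORD h_res = 0, *d_res;
cudaMalloc(&d_res, sizeof(DWORD));
cudaMemcpy(d_res, &h_res, sizeof(DWORD), cudaMemcpyHostToDevice);

DoFunc<<<1, 1>>>(d_res, 0);   // one block, one thread: fully serial

cudaMemcpy(&h_res, d_res, sizeof(DWORD), cudaMemcpyDeviceToHost);
cudaFree(d_res);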

sorry, it hangs only the host machine, the computer freezes until the computation is finished.
Then i can access my machine again.
Any way to avoid this?

finally i settled on this configuration:
num blocks = 16
num threads per block = 16

card is 9800GTX+

int idx = blockIdx.x * blockDim.x + threadIdx.x;

start = idx * 0x1000000;
end   = idx * 0x1000000 + 0x1000000;
__syncthreads();
for (j = start; j < end; j++)
{
    // … computation cycle
}
__syncthreads();

first, you are most likely doing too much work in one thread…

second, if it freezes, use a smaller range, optimize your code and then go up with the range again ;-)

here an example, how i would do it:

__global__ void kernel(int offset){

  int idx=threadIdx.x+blockIdx.x*blockDim.x+offset;

  //compute...

}

int main(...){

  //...

  kernel<<<4096,256>>>(0x87000000);

  //...

}

that’s 1M values you check per launch (4096 blocks * 256 threads). the block size is just a guess, see what’s most efficient for you. you can also use a 2-dimensional grid if you want to have more threads.
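a minimal sketch of the 2D-grid variant (the dimensions are just an illustration):

__global__ void kernel2d(unsigned int offset)
{
    // flatten the 2D grid of 1D blocks into one linear index
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int idx   = block * blockDim.x + threadIdx.x + offset;
    // compute…
}

// a 4096 x 256 grid of 256-thread blocks covers 2^28 values per launch:
// kernel2d<<<dim3(4096, 256), 256>>>(0x00000000);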

measure the time it takes to do this, optimize (keywords: coalescing, shared memory, texture memory, constant memory).
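as a taste of the constant-memory keyword: since the comparison constants never change during a run, they could sit in __constant__ memory (a sketch, with placeholder names and a placeholder f()):

__constant__ unsigned int targets[4];   // the constants to compare against

__device__ unsigned int f(unsigned int x) { return x ^ 0xDEADBEEF; }  // placeholder

__global__ void scan(unsigned int offset, int *hit)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int r = f(i + offset);
    for (int k = 0; k < 4; ++k)
        if (r == targets[k]) hit[i] = 1;   // mark this x as a match
}

// host side, once before launching:
// unsigned int h_targets[4] = { 0x1129F56E, /* … */ };
// cudaMemcpyToSymbol(targets, h_targets, sizeof(h_targets));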

once you’ve got an acceptable exec. time, crank the numbers up to cover the range you want.

thanks, it seems it freezes and doesn't work if the execution time of a single thread goes above ~5.5 seconds… with lower ranges, which lead to lower execution times, everything is ok,

so i have to divide the work into smaller pieces.
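a minimal sketch of that splitting, in the style of the earlier offset kernel: many short launches instead of one long one, so each stays well under the freeze threshold (the chunk size is an assumption):

__global__ void kernel(unsigned int offset)
{
    unsigned int x = threadIdx.x + blockIdx.x * blockDim.x + offset;
    // … compute f(x) and compare, as in the earlier posts
}

int main()
{
    // cover 0x00000000 … 0xFFFFFFFF as 4096 launches of 2^20 values each
    for (unsigned long long offset = 0; offset < 0x100000000ULL;
         offset += 0x100000ULL)
    {
        kernel<<<4096, 256>>>((unsigned int)offset);   // 4096*256 = 2^20 threads
        cudaDeviceSynchronize();   // finish this chunk before the next one
    }
    return 0;
}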