How to parallelize a loop? (NEWBIE QUESTION) Splitting a large loop across parallel threads

Hello,

please, tell me,

i have a function f(x), where x varies from 0x00000000 … 0xFFFFFFFF
the result of f(x) is compared with some constants and if it matches, i have to write it down to a file.

what is the best way to compute such a task in cuda?
how can i pass the x values to many threads?

thanks a lot.

do you just have a single “x” you want to check?

if you have more than one: put all of them into an array (int x[…]), copy it to the gpu, let every thread read its “x” (meaning x[threadIdx.x+blockDim.x*blockIdx.x] or something like that), compute the value, check it against the constants (which you also copied to the device) and write back its value and whether to save it or not.
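a minimal sketch of that array version, with a placeholder f() and an example target constant (all names here are illustrative, not a fixed API):

#include <cstdio>
#include <cuda_runtime.h>

// placeholder for the poster's f(x); an assumption, not the real function
__device__ unsigned int f(unsigned int x) { return x ^ 0xDEADBEEF; }

__global__ void check(const unsigned int *x, unsigned int target,
                      int *match, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        match[i] = (f(x[i]) == target);   // 1 if this x hits the constant
}

int main()
{
    const int n = 1024;
    unsigned int h_x[n];
    for (int i = 0; i < n; ++i) h_x[i] = i;   // fill with test values

    unsigned int *d_x; int *d_match;
    cudaMalloc(&d_x, n * sizeof(unsigned int));
    cudaMalloc(&d_match, n * sizeof(int));
    cudaMemcpy(d_x, h_x, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    check<<<n / 256, 256>>>(d_x, 0x1129F56Eu, d_match, n);

    int h_match[n];
    cudaMemcpy(h_match, d_match, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h_match[i]) printf("hit at x=0x%08X\n", h_x[i]);

    cudaFree(d_x); cudaFree(d_match);
    return 0;
}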

if it’s a single “x” and a function f() that’s taking quite a long time to compute, you will have to try to parallelize whatever f() does. (which in most cases will be a bit more tricky ;-))

thanks,

my X takes values from 0x00000000 to 0xFFFFFFFF

my idea was to divide the whole range into chunks of RANGE / NUM_THREADS and have each thread compute its own chunk

Your idea looks good! That's the way to go.

i.e.

Make x a function of (threadIdx, blockIdx)
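a minimal sketch of that mapping (the chunk size is just an example):

// every thread derives its own sub-range of x from its indices
__global__ void scan_chunks(unsigned int chunk)
{
    unsigned int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int start = idx * chunk;
    unsigned int end   = start + chunk;

    for (unsigned int x = start; x < end; x++) {
        // … compute f(x) and compare against the constants here
    }
}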

Such single-thread code hangs my machine… :-(

// DWORD is assumed to be: typedef unsigned int DWORD;
__global__ void DoFunc(DWORD *res, DWORD test)
{
    DWORD result;
    DWORD x;

    for (x = 0x87000000; x < 0x88000000; x++)
    {
        result = FX(x);   // was missing the semicolon

        if (result == 0x1129F56E)
        {
            res[0] = x;   // was res[0]=j, but j is never declared; x is the value we want
            break;
        }
    }
}
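for context, a host-side launch for this kernel might look like the sketch below (standard CUDA runtime calls; a <<<1,1>>> launch means one single thread grinds through all 16M values serially):

DWORD h_res = 0, *d_res;
cudaMalloc(&d_res, sizeof(DWORD));
cudaMemcpy(d_res, &h_res, sizeof(DWORD), cudaMemcpyHostToDevice);

DoFunc<<<1, 1>>>(d_res, 0);   // one block, one thread: fully serial

cudaMemcpy(&h_res, d_res, sizeof(DWORD), cudaMemcpyDeviceToHost);
cudaFree(d_res);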

sorry, it hangs only the host machine, the computer freezes until the computation is finished.
Then i can access my machine again.
Any way to avoid this?

finally i settled on this configuration:
num blocks = 16
num threads per block = 16

card is 9800GTX+

int idx = blockIdx.x * blockDim.x + threadIdx.x;

start = idx * 0x1000000;
end   = idx * 0x1000000 + 0x1000000;
__syncthreads();
for (j = start; j < end; j++)
{
    // … computation cycle
}
__syncthreads();

first, you are most likely doing too much work in one thread…

second, if it freezes, use a smaller range, optimize your code and then go up with the range again ;-)

here an example, how i would do it:

__global__ void kernel(int offset){

  int idx=threadIdx.x+blockIdx.x*blockDim.x+offset;

  //compute...

}

int main(...){

  //...

  kernel<<<4096,256>>>(0x87000000);

  //...

}

that’s 1M values you check per launch (4096 blocks * 256 threads). the block size is just a guess, see what’s most efficient for you. you can also use a 2-dimensional grid if you want to have more threads.
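a minimal sketch of the 2D-grid variant (the dimensions are just an illustration):

__global__ void kernel2d(unsigned int offset)
{
    // flatten the 2D grid of 1D blocks into one linear index
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int idx   = block * blockDim.x + threadIdx.x + offset;
    // compute…
}

// a 4096 x 256 grid of 256-thread blocks covers 2^28 values per launch:
// kernel2d<<<dim3(4096, 256), 256>>>(0x00000000);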

measure the time it takes to do this, optimize (keywords: coalescing, shared memory, texture memory, constant memory).
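as a taste of the constant-memory keyword: since the comparison constants never change during a run, they could sit in __constant__ memory (a sketch, with placeholder names and a placeholder f()):

__constant__ unsigned int targets[4];   // the constants to compare against

__device__ unsigned int f(unsigned int x) { return x ^ 0xDEADBEEF; }  // placeholder

__global__ void scan(unsigned int offset, int *hit)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int r = f(i + offset);
    for (int k = 0; k < 4; ++k)
        if (r == targets[k]) hit[i] = 1;   // mark this x as a match
}

// host side, once before launching:
// unsigned int h_targets[4] = { 0x1129F56E, /* … */ };
// cudaMemcpyToSymbol(targets, h_targets, sizeof(h_targets));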

once you’ve got an acceptable exec. time, crank the numbers up to cover the range you want.

thanks, it seems it freezes and doesn't work if the execution time of a single thread goes above ~5.5 seconds… with lower ranges, which lead to lower execution times, everything is ok,

so i have to divide the work into smaller pieces.
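a minimal sketch of that splitting, in the style of the earlier offset kernel: many short launches instead of one long one, so each stays well under the freeze threshold (the chunk size is an assumption):

__global__ void kernel(unsigned int offset)
{
    unsigned int x = threadIdx.x + blockIdx.x * blockDim.x + offset;
    // … compute f(x) and compare, as in the earlier posts
}

int main()
{
    // cover 0x00000000 … 0xFFFFFFFF as 4096 launches of 2^20 values each
    for (unsigned long long offset = 0; offset < 0x100000000ULL;
         offset += 0x100000ULL)
    {
        kernel<<<4096, 256>>>((unsigned int)offset);   // 4096*256 = 2^20 threads
        cudaDeviceSynchronize();   // finish this chunk before the next one
    }
    return 0;
}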