find minimum num in array #2

Hey There.

so i tried to write another algorithm for my min num in array , but something is still not working,

instead of getting the first number of each block , which should contain the min num in each block after the WHILE loop kicks off.

im getting the my blockDim.x * blockIdx.x + threadIdx.x = 0 thread

but after my kernel’s call , i still have nothing.

__global__ void findMin(float4 *inPoints,float4 *outPoint,int n)

{

extern __shared__ float4 sdata[];

// load shared mem

unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

float4 infinitePoint;

infinitePoint.w = 9999;

sdata[tid] = inPoints[i];

__syncthreads();

int nTotalThreads = blockDim.x;

while (nTotalThreads > 1)

{

  int halfPoint = (nTotalThreads >> 1);

  if (threadIdx.x < halfPoint)

  {

    float4 temp = sdata[threadIdx.x]; 

    if ( temp.w < sdata[threadIdx.x + halfPoint].w)

      sdata[threadIdx.x] = temp;

  }

  __syncthreads();

  nTotalThreads /=2;

}

if ( tid == 0)

*outPoint = sdata[0];

}

fixed a small bug , program still not working.

How do you think this code could possibly work? All blocks write to [font=“Courier New”]*outPoint[/font], so which one succeeds is entirely undefined.

Oh you are right , i changed it earlier so only the block that holds the blockIdx.x*blockDim.x + threadIdx.x = 0 would do *outPoint = sdata[0];

but it’s still not working and i dont understand how block X can access threads that belong to block Y.

how do i fix that annoying problem and make this program work… :(

edit : well i kinda reviewed that code several times , each block “sorts” its self.
and eventually the smallest number of each block will be in the first position of the shared memory of the current block,
but what remains now is a “list” of shared memory blocks , which each one holds in its first position the smallest number of the current block.
how do i find the smallest number among them?

thanks

It can’t. That’s one of the cornerstones of the CUDA programming model.

Run a second kernel with just one block that finds the minimum along those values. Or, if the number of blocks is fairly small, transfer the list to the CPU and find the minimum there.

i’m not quite sure i understood the second sentence. lets say that the original kernel does what it suppose to do , so if i have 512 blocks , then each block contains

a shared memory where the first address of that shared memory holds the smallest number,

from this point i should write everything back to a global memory which should be at size of 512 , when each cell represents another block?

Yes

Last question , once i copied back all my results to a 512 size arr , and run my kernel : foo <<<1,512>>> ,

is there going to damage my overall performance ? because if for example my Cuda Device has 8 SM , and each SM schedules 1 block ,

so running a kernel with 1 block will decrease performance ?

Thanks igal.

well i tried to do something else but it’s still not working,

this time i called a kernel with the following properties :

THREADS_IN_BLOCK = 512

numBlocks = MAXX * MAXY / THREADS_IN_BLOCK

MAXX = MAXY = 512

findMin <<<numBlocks,THREADS_IN_BLOCKS , NumBlocks>>> (inPoints , outPoints , MAXX * MAXY)

when inPoints is a [MAXX * MAXY = 512 * 512] array ,

outPoints is a [numBlocks = 512] Array ( should save the smallest number from each shared memory , block)

__global__ void findMin(float4 *inPoints,float4 *outPoints,int n)

{

extern __shared__ float4 sdata[];

// load shared mem

unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

float4 infinitePoint;

infinitePoint.w = 9999;

sdata[tid] = inPoints[i];

__syncthreads();

int nTotalThreads = blockDim.x;

while (nTotalThreads > 1)

{

  int halfPoint = (nTotalThreads >> 1);

  if (threadIdx.x < halfPoint)

  {

    float4 temp = sdata[threadIdx.x]; 

    if ( temp.w < sdata[threadIdx.x + halfPoint].w)

      sdata[threadIdx.x] = temp;

  }

  __syncthreads();

  nTotalThreads /=2;

}

if ( tid == 0)

outPoints[blockIdx.x] = sdata[0];

}

what’s wrong with that code? :\

this is f%cking theme that is in my topic

if (threadIdx.x < halfPoint)

  {

    float4 temp = sdata[threadIdx.x + halfPoint]; 

    if (temp.w < sdata[threadIdx.x].w)

      sdata[threadIdx.x] = temp;

  }

still not working… :\

I really dont know what’s wrong with the code , i even tried to copy everything to an array which is located at the CPU memory and

double checked that it actually contains relevant values which had calculated from previous CUDA functions.

i made sure that all the relevant would have a number ranging from 0 to 5± and the irrelevant number would have a value of 9999.

so when the kernel that finds the minimum number executes , it will find only the smallest number among the relevant numbers.

The amount of shared memory in your launch config is 512 bytes. Inside the kernel you mapped that as an array of float4. Those 512 bytes therefore would result in an sdata array of size 32. You are using your thread index to access that array, so it seems to me you are overrunning the array bounds, since your thread index should go from 0 to 511, right? Check for an error code, maybe it’s reporting something.

krinitsa - I wrote a sample kernel that finds the maximum value in a large array along with its index. It uses the scan primitives I discuss in great depth on the rest of my site:
http://www.moderngpu.com/mailbag/maxindex.html

sean

Hey Bunny , i sent you a message via the mail , but i will re-ask it so everybody will be able to see it.

inside my kernel i declared on an external shared memory of the size 512 so each cell’s size is sizeof(float4)

so each time i do : sdata , i should get the current cell’s location , shouldn’t i?

Thanks