I wrote a kernel function for counting prime numbers. Here is the code:

```
__global__ void PrimesKernel(int first, int last, int* result, int* synch)
{
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    int onoft = gridDim.x*blockDim.x;
    int n = last-first;
    int portion = n/onoft;
    int rest = n % onoft;
    int end;
    bool prime = false;
    int ij = 0, i = id*portion + (id < rest ? id : rest);
    end = i + portion + (id < rest ? 1 : 0);
    if (i % 2 == 0)
        i++;
    for ( ; i <= end; i += 2)
    {
        prime = true;
        for (int j = 3; j*j <= i && prime; j += 2)
        {
            if (i % j == 0)
                prime = false;
        }
        if (prime)
        {
            ij++;
            atomicCAS(result+(*synch)++, 0, i);
        }
    }
    atomicAdd(result, ij);
}
```

The synch variable is a counter shared by the threads, so that they know where in the result vector to put the next prime number.
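To illustrate the shared-counter idea, here is a minimal host-side C++ model, not CUDA code. `store_prime` and `next_slot` are hypothetical names, with `next_slot` standing in for `synch`. The point of the sketch is that the index is reserved with a single atomic fetch-and-add, which returns the value held before the increment; in a kernel the analogous call would be `atomicAdd(synch, 1)`. By contrast, `(*synch)++` is an ordinary read-modify-write, so two threads can read the same index. The sketch also assumes that `result[0]` is reserved for the count, as in the sequential version.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical host-side model of the slot counter.
// fetch_add returns the old value, so every caller gets a distinct index
// even when several callers race on the same counter.
void store_prime(std::vector<int>& result, std::atomic<int>& next_slot, int prime)
{
    int slot = next_slot.fetch_add(1);                    // reserve one index atomically
    assert(slot + 1 < static_cast<int>(result.size()));   // stay inside the array
    result[1 + slot] = prime;                             // result[0] is kept for the count
}
```

After all primes have been stored, the final value of the counter is the number of primes, so it can be copied into `result[0]` once at the end.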

I launch it with the parameters: first last number_of_blocks number_of_blocks, where

first - the beginning of the area

last - the end of the area

Here is a sequential version of the same thing:

```
void Primes(int first, int last, int* result)
{
    bool prime = false;
    int ij = 1, i = first;
    if (first % 2 == 0)
        i++;
    for ( ; i <= last; i += 2)
    {
        prime = true;
        for (int j = 3; j*j <= i && prime; j += 2)
        {
            if (i % j == 0)
                prime = false;
        }
        if (prime)
            result[ij++] = i;
    }
    result[0] = ij-1;
}
```

I wanted to write a kernel function in which each thread takes care of just part of the area. However, my CUDA solution doesn't work. It works for small numbers: when I calculate the primes between 1 and 100, it works fine. When I increase the first and last parameters to 1 and 100000, it throws error 6 and sometimes error 30.
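The per-thread split can be modelled on the host to check that the chunks really cover the requested range. The sketch below is plain C++ rather than CUDA, and `Chunk`, `chunk_for`, `is_prime`, and `count_in_chunk` are hypothetical names. It makes three assumptions that differ from the kernel as posted: it adds `first` to the start index, it uses half-open chunks so that neighbouring chunks do not overlap, and it treats 2 as prime and 1 as not prime.

```cpp
#include <cassert>

// Half-open range [begin, end) of candidates for one worker (hypothetical type).
struct Chunk { int begin; int end; };

// Splits the inclusive range [first, last] into `workers` contiguous chunks.
// The first `rest` workers get one extra candidate each.
Chunk chunk_for(int first, int last, int id, int workers)
{
    assert(workers > 0 && id >= 0 && id < workers && first <= last);
    int n = last - first + 1;                 // size of the inclusive range
    int portion = n / workers;
    int rest = n % workers;
    int begin = first + id * portion + (id < rest ? id : rest);
    int end = begin + portion + (id < rest ? 1 : 0);
    return Chunk{begin, end};
}

// Trial division. Assumption: 2 counts as prime, 0 and 1 do not.
bool is_prime(int v)
{
    if (v < 2) return false;
    if (v % 2 == 0) return v == 2;
    for (int j = 3; j * j <= v; j += 2)
        if (v % j == 0) return false;
    return true;
}

// Number of primes inside one chunk; this is the work one thread would do.
int count_in_chunk(Chunk c)
{
    int count = 0;
    for (int i = c.begin; i < c.end; ++i)
        if (is_prime(i)) ++count;
    return count;
}
```

Summing `count_in_chunk(chunk_for(first, last, id, workers))` over all values of `id` should give the same total as a single pass over the whole range, which is a quick way to test the index arithmetic before moving it into a kernel.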

Any ideas or suggestions? I attach example console listings for different parameters, in which I increased the search area.

"Liczba liczb pierwszych" (Polish) is the number of prime numbers found,

nop - the size of the array for primes which I pass to the kernel to have it filled

=============================================================================================================

I'm still struggling with the code and the problem. I thought that the reason for these errors was the atomic operations that I use inside the kernel function, but after I removed them not much changed. Namely, with code like this:

```
__global__ void PrimesKernel(int first, int last, int* result)
{
    __shared__ int count;
    count = 0;
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    int onoft = gridDim.x*blockDim.x;
    int n = last-first;
    int portion = n/onoft;
    int rest = n % onoft;
    int end;
    bool prime = false;
    int ij = 0, i = id*portion + (id < rest ? id : rest);
    end = i + portion + (id < rest ? 1 : 0);
    if (i % 2 == 0)
        i++;
    for ( ; i <= end; i += 2)
    {
        prime = true;
        for (int j = 3; j*j <= i && prime; j += 2)
        {
            if (i % j == 0)
                prime = false;
        }
        if (prime)
        {
            ij++;
            //atomicCAS(result+count++,0,i);
        }
    }
    //atomicAdd(result,ij);
}
```

the program still crashes with error 6.

I found information about the error in the Windows logs. At the moment the error occurs, my screen goes dark and then returns to its original state.

“Windows Vista and Windows Server 2008 can detect when the graphics hardware or device driver take longer than expected to complete an operation. When this happens, Windows attempts to preempt the operation, and restore the display system to a usable state by resetting the graphics adapter. Typically, the only noticeable effect from this is a flicker of the display due to the reset and subsequent screen redraw. For more information, see “Timeout Detection and Recovery of GPUs through WDDM” at http://go.microsoft.com/fwlink/?linkid=77531 on the Microsoft Web site.”

So why does the graphics driver take longer than expected to complete the operation? Can't I run a complicated loop inside a kernel function?
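The quoted message describes the WDDM watchdog, which resets a display GPU when a single operation keeps it busy for too long. If that is what is happening here, one commonly suggested workaround is to split the range into several shorter launches, so that no single launch runs long enough to trigger the reset. The sketch below shows only the host-side batching in plain C++. `for_each_batch` and `launch` are hypothetical names, and `launch(lo, hi)` stands in for starting one kernel on the inclusive sub-range and waiting for it to finish. A suitable batch size is an assumption that would have to be found by experiment.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Hypothetical helper: walks the inclusive range [first, last] in batches of
// at most `batch` numbers and calls launch(lo, hi) once per batch.
// Returns how many batches (launches) were issued.
int for_each_batch(int first, int last, int batch,
                   const std::function<void(int, int)>& launch)
{
    assert(batch > 0);
    int launches = 0;
    for (int lo = first; lo <= last; lo += batch) {
        int hi = std::min(last, lo + batch - 1);   // clamp the final batch
        launch(lo, hi);                            // one short piece of work
        ++launches;
    }
    return launches;
}
```

With this shape, the per-launch work is bounded by `batch` regardless of how large `last - first` becomes, at the cost of some launch overhead for each extra batch.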