Segmentation fault (urgent!)

Hi,

Can anybody help me solve this segmentation fault?

I am using NVIDIA’s Quadro FX5800.

I am trying to make some changes to the SDK's Binomial Options sample.

I need to increase the number of options; here, the number of options equals the number of blocks.

So I tried launching my kernel this way:

[codebox]int rootOptN = (int)sqrt((double)optN);

dim3 theGrid(rootOptN, (optN+rootOptN-1)/rootOptN);

binomialOptionsKernel<<<theGrid, CACHE_SIZE>>>(optN);[/codebox]

In the above code, "theGrid" works for up to 161000 options. If I try increasing it beyond that, I get a "Segmentation fault".

My kernel function is called in this way:

[codebox]static __global__ void binomialOptionsKernel(const unsigned int optN){

__shared__ real callA[CACHE_SIZE];

__shared__ real callB[CACHE_SIZE];

const int BlockIndex = blockIdx.y * gridDim.x + blockIdx.x;

//const int ThreadIndex = BlockIndex * blockDim.x + threadIdx.x;

//Global memory frame for current option (thread block)

if (BlockIndex >= optN) {

 return;

   }

real *const d_Call = &d_CallBuffer[BlockIndex * (NUM_STEPS + 16)];

const int tid = threadIdx.x;

//const int       thid= blockIdx.x*blockDim.x+threadIdx.x;

// const int bid = gridDim.x*blockDim.x;

//const int bd = blockDim.x*Grid;

//const int bd = blockDim.x*gridDim.x*gridDim.y;

const real      S = d_OptionData[BlockIndex].S;

const real      X = d_OptionData[BlockIndex].X;

const real    vDt = d_OptionData[BlockIndex].vDt;

const real puByDf = d_OptionData[BlockIndex].puByDf;

const real pdByDf = d_OptionData[BlockIndex].pdByDf;

// for(int thid=blockIdx.x*blockDim.x+threadIdx.x; thid <=bd ; thid += bid){

//Compute values at expiry date

for(int i = tid; i <= NUM_STEPS; i += CACHE_SIZE)

    d_Call[i] = expiryCallValue(S, X, vDt, i);

//Walk down binomial tree

//So double-buffer and synchronize to avoid read-after-write hazards.

for(int i = NUM_STEPS; i > 0; i -= CACHE_DELTA)

    for(int c_base = 0; c_base < i; c_base += CACHE_STEP){

        //Start and end positions within shared memory cache

        int c_start = min(CACHE_SIZE - 1, i - c_base);

        int c_end   = c_start - CACHE_DELTA;

//Read data(with apron) to shared memory

        __syncthreads();

        if(tid <= c_start)

            callA[tid] = d_Call[c_base + tid];

//Calculations within shared memory

        for(int k = c_start - 1; k >= c_end;){

            //Compute discounted expected value

            __syncthreads();

            if(tid <= k)

                callB[tid] = puByDf * callA[tid + 1] + pdByDf * callA[tid];

            k--;

//Compute discounted expected value

            __syncthreads();

            if(tid <= k)

                callA[tid] = puByDf * callB[tid + 1] + pdByDf * callB[tid];

            k--;

        }

//Flush shared memory cache

        __syncthreads();

        if(tid <= c_end)

            d_Call[c_base + tid] = callA[tid];

}

//Write the value at the top of the tree to destination buffer

if(threadIdx.x == 0) d_CallValue[BlockIndex] = (float)callA[0];

}[/codebox]

Can anybody help me find a solution for this?

Thanks in advance.

A segfault indicates a memory-protection violation in your host code; it has nothing to do with the CUDA side of things. Most likely it is something like using a device pointer as a host pointer, or using an uninitialized host pointer.
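For illustration, those two bug classes look something like this (a hypothetical sketch, not taken from the code above):

[codebox]#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    const int N = 1024;
    float *d_buf;
    cudaMalloc((void**)&d_buf, N * sizeof(float));

    // BUG 1: dereferencing a device pointer on the host -> segfault
    // d_buf[0] = 1.0f;

    // BUG 2: copying into an uninitialized host pointer -> segfault
    // float *h_bad;                                  // points nowhere
    // cudaMemcpy(h_bad, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Correct: allocate host memory before copying into it
    float *h_buf = (float*)malloc(N * sizeof(float));
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

    free(h_buf);
    cudaFree(d_buf);
    return 0;
}[/codebox]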

Most likely an out-of-bounds memory access.
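That said, it is worth first confirming that the launch itself succeeds, so the device side can be ruled in or out. A minimal sketch, assuming the standard CUDA runtime API (checkLaunch is just an illustrative helper name):

[codebox]#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: call immediately after a kernel launch.
static void checkLaunch(const char *label){
    cudaError_t err = cudaGetLastError();   // launch-configuration errors
    if(err == cudaSuccess)
        err = cudaThreadSynchronize();      // execution errors (cudaDeviceSynchronize in newer toolkits)
    if(err != cudaSuccess)
        printf("%s failed: %s\n", label, cudaGetErrorString(err));
}

// Usage:
//   binomialOptionsKernel<<<theGrid, CACHE_SIZE>>>(optN);
//   checkLaunch("binomialOptionsKernel");[/codebox]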

I am using a two-dimensional grid with xdim = 401, ydim = 402, and a block size of 256.

Do you think it is an out-of-bounds memory access?

How do I find out the memory requirement of my application, and how much the limit is?

There must be some data structure in the CPU part of your code that does not cope with 401*402 elements.

Somewhere an array is declared with a smaller size, and you are accessing it with up to 401*402 indices, hence out of bounds, hence the segfault. Something along those lines.
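In other words, a pattern like this hypothetical one (OLD_MAX_OPTIONS and h_CallValue are made-up names for illustration):

[codebox]#define OLD_MAX_OPTIONS 1024          // hypothetical: what the array was sized for

static float h_CallValue[OLD_MAX_OPTIONS];

int main(){
    // The grid was later grown to 401 * 402 = 161202 options...
    for(int i = 0; i < 401 * 402; i++)
        h_CallValue[i] = 0.0f;        // out of bounds past i = 1023 -> segfault
    return 0;
}[/codebox]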

I am using these data structures:

[codebox]static __device__ __TOptionData d_OptionData[MAX_OPTIONS];

static __device__ float d_CallValue[MAX_OPTIONS];

static __device__ real d_CallBuffer[MAX_OPTIONS * (NUM_STEPS + 16)];[/codebox]

(Does __device__ refer to global memory here? If so, what is its limit?)

where MAX_OPTIONS is the number of options (greater than 161000), and TOptionData is this structure:

[codebox]typedef struct{

    real S;

    real X;

    real vDt;

    real puByDf;

    real pdByDf;

} __TOptionData;[/codebox]

Do you think there is any array size problem?

d_CallBuffer alone turns out to be about 314 MB even for NUM_STEPS of 512 in single precision.

Check that out.
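For reference, the arithmetic behind that estimate, using the MAX_OPTIONS figure of 161000 quoted above (a sketch):

[codebox]#include <cstdio>

int main(){
    const long long MAX_OPTIONS = 161000;          // figure quoted above
    const long long NUM_STEPS   = 512;
    const long long elemSize    = sizeof(float);   // single precision

    // ~314 MB without the +16 apron per option, ~324 MB with it
    long long bytes = MAX_OPTIONS * (NUM_STEPS + 16) * elemSize;
    printf("d_CallBuffer: %lld bytes (~%.0f MB)\n",
           bytes, bytes / (1024.0 * 1024.0));
    return 0;
}[/codebox]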

How much memory do you have in your Quadro? Check that out.

The exact numbers for the Quadro FX5800:

Total amount of global memory: 4294246400 bytes (approx 4GB)

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes (64KB)

Total amount of shared memory per block: 16384 bytes (16KB)

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate (speed at which each processor runs): 1.30 GHz

I have kept my NUM_STEPS constant at 256, so the buffer won't exceed 4 GB.

What else could be the problem?
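One way to check the memory side of this directly is to ask the runtime how much device memory is actually free before and after your allocations. A minimal sketch using cudaMemGetInfo (a standard runtime call):

[codebox]#include <cstdio>
#include <cuda_runtime.h>

int main(){
    size_t freeBytes, totalBytes;

    // Reports current free and total device memory for the active context
    cudaMemGetInfo(&freeBytes, &totalBytes);

    printf("GPU memory: %zu bytes free of %zu bytes total\n",
           freeBytes, totalBytes);
    return 0;
}[/codebox]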