Problem with kernel (urgent)


I am using an NVIDIA Quadro FX 5800. I am trying to measure the GPU execution time on one core and on all 240 cores for the CUDA SDK's Monte Carlo sample. The problem is that in the Monte Carlo sample the grid size equals the number of options. I want to run this program with grid size = 1 and block size = 1, so that a single block handles multiple options.

I am calling the kernel function like this:

[codebox]MonteCarloOneBlockPerOption<<<optionCount, THREAD_N>>>(
[/codebox]

The kernel function is:

[codebox]static __global__ void MonteCarloOneBlockPerOption(
    float *d_Samples,
    int pathN
){
    const int SUM_N = THREAD_N;
    __shared__ real s_SumCall[SUM_N];
    __shared__ real s_Sum2Call[SUM_N];

    //One block per option: the block index selects the option
    const int optionIndex = blockIdx.x;
    const real        S = d_OptionData[optionIndex].S;
    const real        X = d_OptionData[optionIndex].X;
    const real    MuByT = d_OptionData[optionIndex].MuByT;
    const real VBySqrtT = d_OptionData[optionIndex].VBySqrtT;

    //Cycle through the entire samples array:
    //derive end stock price for each path
    //accumulate partial integrals into intermediate shared memory buffer
    for(int iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x){
        __TOptionValue sumCall = {0, 0};
        for(int i = iSum; i < pathN; i += SUM_N){
            real              r = d_Samples[i];
            real      callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
            sumCall.Expected   += callValue;
            sumCall.Confidence += callValue * callValue;
        }
        s_SumCall[iSum]  = sumCall.Expected;
        s_Sum2Call[iSum] = sumCall.Confidence;
    }

    //Reduce shared memory accumulators
    //and write final result to global memory
    sumReduce<real, SUM_N, THREAD_N>(s_SumCall, s_Sum2Call);
    if(threadIdx.x == 0){
        __TOptionValue t = {s_SumCall[0], s_Sum2Call[0]};
        d_CallValue[optionIndex] = t;
    }
}
[/codebox]

Can anyone help me?

Thank you in advance.

That will yield the slowest GPU code ever.

Actually, I'm an intern and I'm learning CUDA. I know it will be slow if I use a grid with one block and one thread, but I want to find the GPU time for <<<1,1>>> and see what the speedup is when I use all 240 cores of the GPU. Can you tell me how I can do that?
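
One possible way to approach this, as a rough sketch and not the SDK's actual code: instead of fixing the option to blockIdx.x, let each block loop over the options in steps of gridDim.x, and let each thread loop over the samples in steps of blockDim.x. The same kernel then gives correct results for <<<1,1>>> and for a large launch, and the two can be timed with CUDA events. Everything below is a toy stand-in: processOptions, runAndTime, the d_Scale array, the problem sizes and the weighted-average payoff are assumptions made for illustration, not names from the Monte Carlo sample. The final reduction is done serially by thread 0 so that it works for any block size; the SDK's sumReduce takes THREAD_N as a template argument, which suggests it expects a full block, so it would probably need a similar change.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_THREADS 256            // largest block size used here (assumption)
const int OPTION_COUNT = 64;       // toy number of options (assumption)
const int PATH_N       = 1 << 14;  // toy number of samples (assumption)

// Hypothetical stand-in for the Monte Carlo kernel. Works for any grid and
// block size up to MAX_THREADS, including <<<1,1>>>.
__global__ void processOptions(const float *d_Samples, const float *d_Scale,
                               float *d_Result, int pathN, int optionCount)
{
    __shared__ float s_Sum[MAX_THREADS];

    // Each block handles options blockIdx.x, blockIdx.x + gridDim.x, ...
    for (int opt = blockIdx.x; opt < optionCount; opt += gridDim.x) {
        // Each thread accumulates a strided subset of the samples
        float sum = 0.0f;
        for (int i = threadIdx.x; i < pathN; i += blockDim.x)
            sum += d_Samples[i] * d_Scale[opt];
        s_Sum[threadIdx.x] = sum;
        __syncthreads();

        // Simple serial reduction by thread 0 (valid for any block size)
        if (threadIdx.x == 0) {
            float total = 0.0f;
            for (int t = 0; t < (int)blockDim.x; ++t)
                total += s_Sum[t];
            d_Result[opt] = total / pathN;
        }
        __syncthreads();  // do not overwrite s_Sum before thread 0 has read it
    }
}

// Runs the kernel with the given configuration, copies the results back,
// and returns the kernel time in milliseconds measured with CUDA events.
float runAndTime(int blocks, int threads, std::vector<float> &h_Result)
{
    std::vector<float> h_Samples(PATH_N), h_Scale(OPTION_COUNT);
    for (int i = 0; i < PATH_N; ++i)       h_Samples[i] = (i % 100) / 100.0f;
    for (int o = 0; o < OPTION_COUNT; ++o) h_Scale[o]   = 1.0f + o;

    float *d_Samples, *d_Scale, *d_Result;
    cudaMalloc((void **)&d_Samples, PATH_N * sizeof(float));
    cudaMalloc((void **)&d_Scale,   OPTION_COUNT * sizeof(float));
    cudaMalloc((void **)&d_Result,  OPTION_COUNT * sizeof(float));
    cudaMemcpy(d_Samples, h_Samples.data(), PATH_N * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_Scale, h_Scale.data(), OPTION_COUNT * sizeof(float),
               cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timing
    processOptions<<<blocks, threads>>>(d_Samples, d_Scale, d_Result,
                                        PATH_N, OPTION_COUNT);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    processOptions<<<blocks, threads>>>(d_Samples, d_Scale, d_Result,
                                        PATH_N, OPTION_COUNT);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    h_Result.resize(OPTION_COUNT);
    cudaMemcpy(h_Result.data(), d_Result, OPTION_COUNT * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_Samples);
    cudaFree(d_Scale);
    cudaFree(d_Result);
    return ms;
}

int main()
{
    std::vector<float> serial, parallel;
    float msSerial   = runAndTime(1, 1, serial);
    float msParallel = runAndTime(OPTION_COUNT, MAX_THREADS, parallel);

    // Both configurations should agree up to floating-point rounding
    for (int o = 0; o < OPTION_COUNT; ++o)
        assert(fabsf(serial[o] - parallel[o]) <= 1e-2f * fabsf(serial[o]));

    printf("<<<1,1>>>     : %.3f ms\n", msSerial);
    printf("<<<%d,%d>>>  : %.3f ms\n", OPTION_COUNT, MAX_THREADS, msParallel);
    printf("speedup       : %.1fx\n", msSerial / msParallel);
    return 0;
}
```

Two caveats. The 240 cores on this card are grouped into 30 multiprocessors of 8 cores each, so "all cores" in practice means launching enough blocks and threads to keep every multiprocessor busy, not exactly 240 threads. And if the card also drives a display, a very long <<<1,1>>> run may be stopped by the driver's watchdog timer, so it is safer to start with a small number of options and paths.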