Poor performance in Monte Carlo due to time-step loop

Hi Everyone,

I have a Monte Carlo option-pricing code similar to the ‘MonteCarloMultiGPU’ example in CUDA SDK 5.5. The difference is that my code simulates each path over 250 time steps (total computations = paths × options × 250). As a result, performance is poor even on a K40.

Could you please suggest how I can parallelize the time-step loop? The code is below for reference:

Setup_Kernel()          // sets up the RNG states (one per path)
Random_Number_Kernel()  // generates 250*Paths random numbers and stores them in global memory
Compute_kernel()        // does the computations for N paths and M options
{
    // Each thread handles samples threadIdx.x, threadIdx.x + blockDim.x, ...
    for (int numSample = threadIdx.x; numSample < NUM_SAMPLES; numSample += blockDim.x)
    {
        getPath(path, numSample, random_Numbers, optionStructs[optionIndex]);
        price[GLobal_ID] = path[250-1];   // only the final value of the path is kept
    }
}

__device__ void getPath(dataType* path, int numSample, dataType* random_Numbers, MCStruct optionStruct)
{
    path[0] = process(optionStruct);

    // Step i depends on path[i-1], so this loop is sequential within one path.
    for (size_t i=1; i<250; i++)
    {
        dataType t = i*dt;
        int index = (i-1)*250 + numSample;
        dataType randVal = random_Numbers[index];
        path[i] = process(t, path[i-1], randVal, optionStruct);
    }
}
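
Since only path[250-1] ends up in price, the per-path work can be pictured with the minimal sketch below. It is not the original code and rests on several assumptions: `stepGbm` is a hypothetical stand-in for `process` (a single Euler step of geometric Brownian motion), the parameters `s0`, `r` and `sigma` are invented for illustration, and the random numbers are assumed to be stored step-major with a stride equal to the number of paths, so that neighbouring threads read neighbouring addresses. The kernel is compiled only under nvcc; the two helper functions also build as plain C++.

```cpp
#include <math.h>

// Let the helpers build as plain C++ as well as under nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for process(): one Euler step of geometric Brownian motion.
__host__ __device__ inline float stepGbm(float s, float r, float sigma, float dt, float z)
{
    return s * (1.0f + r * dt + sigma * sqrtf(dt) * z);
}

// Walks one path and returns its final value, keeping only a running value
// instead of a 250-element path array.
// Assumed layout: randomNumbers[(step - 1) * numSamples + numSample].
__host__ __device__ inline float terminalValue(const float* randomNumbers,
                                               int numSample, int numSamples, int numSteps,
                                               float s0, float r, float sigma, float dt)
{
    float s = s0;                         // value at step 0
    for (int i = 1; i < numSteps; ++i)    // sequential: step i needs step i-1
    {
        int index = (i - 1) * numSamples + numSample;
        s = stepGbm(s, r, sigma, dt, randomNumbers[index]);
    }
    return s;
}

#ifdef __CUDACC__
// One thread per path, with a grid-stride loop so any launch size covers all paths.
__global__ void terminalValues(const float* randomNumbers, float* price,
                               int numSamples, int numSteps,
                               float s0, float r, float sigma, float dt)
{
    for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < numSamples;
         n += blockDim.x * gridDim.x)
    {
        price[n] = terminalValue(randomNumbers, n, numSamples, numSteps, s0, r, sigma, dt);
    }
}
#endif
```

Because `terminalValue` is also callable on the host, the same function can be run in a plain CPU loop over the paths to cross-check the values the kernel writes to `price`.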

Thanks in advance.