Handling blocks and threads... urgent!

Hi,

I want to change the NVIDIA SDK's BINOMIAL OPTIONS kernel so that each thread handles one option.

In the original SDK version, one block calculates one option.

Can anybody help me?

[codebox]#define TIME_STEPS 16
#define CACHE_DELTA (2 * TIME_STEPS)
#define CACHE_SIZE (256)
#define CACHE_STEP (CACHE_SIZE - CACHE_DELTA)

#if NUM_STEPS % CACHE_DELTA
#error Bad constants
#endif

//Preprocessed input option data
typedef struct{
    real S;
    real X;
    real vDt;
    real puByDf;
    real pdByDf;
} __TOptionData;

static __device__ __TOptionData d_OptionData[MAX_OPTIONS];
static __device__ float d_CallValue[MAX_OPTIONS];
static __device__ real d_CallBuffer[MAX_OPTIONS * (NUM_STEPS + 16)];

////////////////////////////////////////////////////////////////////////////////
// Overloaded shortcut functions for different precision modes
////////////////////////////////////////////////////////////////////////////////
#ifndef DOUBLE_PRECISION
__device__ inline float expiryCallValue(float S, float X, float vDt, int i){
    float d = S * expf(vDt * (2.0f * i - NUM_STEPS)) - X;
    return (d > 0) ? d : 0;
}
#else
__device__ inline double expiryCallValue(double S, double X, double vDt, int i){
    double d = S * exp(vDt * (2.0 * i - NUM_STEPS)) - X;
    return (d > 0) ? d : 0;
}
#endif

////////////////////////////////////////////////////////////////////////////////
// GPU kernel
////////////////////////////////////////////////////////////////////////////////
static __global__ void binomialOptionsKernel(const unsigned int optN){
    __shared__ real callA[CACHE_SIZE];
    __shared__ real callB[CACHE_SIZE];

    const int blockIndex = blockIdx.y * gridDim.x + blockIdx.x;

    //The 2D grid may contain more blocks than options; excess blocks exit
    if(blockIndex >= optN)
        return;

    //Global memory frame for the current option (one thread block per option)
    real *const d_Call = &d_CallBuffer[blockIndex * (NUM_STEPS + 16)];

    const int tid = threadIdx.x;
    const real      S = d_OptionData[blockIndex].S;
    const real      X = d_OptionData[blockIndex].X;
    const real    vDt = d_OptionData[blockIndex].vDt;
    const real puByDf = d_OptionData[blockIndex].puByDf;
    const real pdByDf = d_OptionData[blockIndex].pdByDf;

    //Compute values at expiry date
    for(int i = tid; i <= NUM_STEPS; i += CACHE_SIZE)
        d_Call[i] = expiryCallValue(S, X, vDt, i);

    //Walk down the binomial tree in portions that fit into shared memory,
    //double-buffering and synchronizing to avoid read-after-write hazards
    for(int i = NUM_STEPS; i > 0; i -= CACHE_DELTA)
        for(int c_base = 0; c_base < i; c_base += CACHE_STEP){
            //Start and end positions within the shared memory cache
            int c_start = min(CACHE_SIZE - 1, i - c_base);
            int c_end   = c_start - CACHE_DELTA;

            //Read data (with apron) into shared memory
            __syncthreads();
            if(tid <= c_start)
                callA[tid] = d_Call[c_base + tid];

            //Calculations within shared memory
            for(int k = c_start - 1; k >= c_end;){
                //Compute discounted expected value
                __syncthreads();
                if(tid <= k)
                    callB[tid] = puByDf * callA[tid + 1] + pdByDf * callA[tid];
                k--;

                //Compute discounted expected value
                __syncthreads();
                if(tid <= k)
                    callA[tid] = puByDf * callB[tid + 1] + pdByDf * callB[tid];
                k--;
            }

            //Flush shared memory cache back to global memory
            __syncthreads();
            if(tid <= c_end)
                d_Call[c_base + tid] = callA[tid];
        }

    //Write the value at the top of the tree to the destination buffer
    if(threadIdx.x == 0)
        d_CallValue[blockIndex] = (float)callA[0];
}

////////////////////////////////////////////////////////////////////////////////
// Host-side interface to GPU binomialOptions
////////////////////////////////////////////////////////////////////////////////
static void binomialOptionsGPU(
    float *callValue,
    TOptionData *optionData,
    int optN
){
    printf("kernel execution..........");
    fflush(stdout);

    static __TOptionData h_OptionData[MAX_OPTIONS];
    for(int i = 0; i < optN; i++){
        const double T = optionData[i].T;
        const double R = optionData[i].R;
        const double V = optionData[i].V;

        const double  dt = T / (double)NUM_STEPS;
        const double vDt = V * sqrt(dt);
        const double rDt = R * dt;
        //Per-step interest and discount factors
        const double If = exp(rDt);
        const double Df = exp(-rDt);
        //Values and pseudoprobabilities of upward and downward moves
        const double      u = exp(vDt);
        const double      d = exp(-vDt);
        const double     pu = (If - d) / (u - d);
        const double     pd = 1.0 - pu;
        const double puByDf = pu * Df;
        const double pdByDf = pd * Df;

        h_OptionData[i].S      = (real)optionData[i].S;
        h_OptionData[i].X      = (real)optionData[i].X;
        h_OptionData[i].vDt    = (real)vDt;
        h_OptionData[i].puByDf = (real)puByDf;
        h_OptionData[i].pdByDf = (real)pdByDf;
    }

    cutilSafeCall(cudaThreadSynchronize());
    cutilSafeCall(cudaMemcpyToSymbol(d_OptionData, h_OptionData, optN * sizeof(__TOptionData)));

    //Lay the blocks out in a near-square 2D grid so optN can exceed
    //the 65535 limit on a single grid dimension
    int rootOptN = (int)sqrt((double)optN);
    dim3 theGrid(rootOptN, (optN + rootOptN - 1) / rootOptN);
    printf("xdim=%d, ydim=%d\n", rootOptN, (optN + rootOptN - 1) / rootOptN);
    fflush(stdout);

    binomialOptionsKernel<<<theGrid, CACHE_SIZE>>>(optN);
    printf("Done with kernel call\n");
    fflush(stdout);
    cutilCheckMsg("binomialOptionsKernel() execution failed.\n");

    cutilSafeCall(cudaThreadSynchronize());
    cutilSafeCall(cudaMemcpyFromSymbol(callValue, d_CallValue, optN * sizeof(float)));
}[/codebox]

You will get miserably bad performance due to uncoalesced access.

But yeah, if you are smart enough, you can avoid it. Good luck!

Yes, I know that. Actually, I wanted to compare the speed of one core against 240 cores; that's why I wanted help changing my kernel. I want to vary the size of my grid (the number of blocks) as well as the number of threads, to check the performance. I couldn't figure out how to change the number of threads, which is CACHE_SIZE in my code. Can you tell me how I can do it?
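[For later readers: the reason CACHE_SIZE is hard to vary is that it is a compile-time macro that sizes the __shared__ arrays as well as the launch configuration. One possible way around that, sketched under the assumption that you are willing to template the kernel on the block size; the kernel body and the names toyKernelT / launchWithBlockSize are illustrative placeholders, not SDK code:

[codebox]//Sketch only: CACHE_SIZE is a compile-time constant because it sizes the
//__shared__ arrays, so to benchmark several block sizes from one binary the
//kernel can be templated on the block size. The body below is a placeholder;
//in the real kernel every use of CACHE_SIZE would become BLOCK_SIZE and
//CACHE_STEP would become (BLOCK_SIZE - CACHE_DELTA).
template<int BLOCK_SIZE>
__global__ void toyKernelT(float *out, const float *in, int n){
    __shared__ float cache[BLOCK_SIZE];   //sized by the template parameter
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * BLOCK_SIZE + tid;
    if(gid < n) cache[tid] = in[gid];
    __syncthreads();
    if(gid < n) out[gid] = 2.0f * cache[tid];   //placeholder computation
}

//Host-side dispatch: select the instantiation matching the requested size
void launchWithBlockSize(int threads, dim3 grid, float *out, const float *in, int n){
    switch(threads){
        case 128: toyKernelT<128><<<grid, 128>>>(out, in, n); break;
        case 256: toyKernelT<256><<<grid, 256>>>(out, in, n); break;
        case 512: toyKernelT<512><<<grid, 512>>>(out, in, n); break;
    }
}[/codebox]

The alternative is simply recompiling with a different -DCACHE_SIZE value for each run.]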

First of all, do you understand how the parallel algorithm works?

Yes, it's calculating one option per block.

That's fine, but do you understand the parallel breakdown of the problem: how the load is shared among the threads of a block, and how the parallel algorithm actually works? Do you understand all of that?

We need to load all leaf values into a high-speed shared memory buffer and perform the calculations in shared memory. But since the size of shared memory on the GPU is limited, the intention is to keep as much data as possible in shared memory, while accounting for the fact that the working data can be much larger than the available shared memory. This forces us to spill to global memory at some steps of the computation, i.e. to process the nodes of the tree in portions that fit into shared memory.

Double buffering is then used in the reduction primitive, as follows:

[codebox]//Calculations within shared memory
for(int k = c_start - 1; k >= c_end;){
    //Compute discounted expected value
    __syncthreads();
    if(tid <= k)
        callB[tid] = puByDf * callA[tid + 1] + pdByDf * callA[tid];
    k--;

    //Compute discounted expected value
    __syncthreads();
    if(tid <= k)
        callA[tid] = puByDf * callB[tid + 1] + pdByDf * callB[tid];
    k--;
}[/codebox]

Here the threads are used to reduce the values.

What about the c_start / c_end logic? That's the crux of the parallel algorithm.

But I think your understanding of the cache is right.

So, when every thread processes one option, every thread would need its own shared memory space. Given the limited shared memory, you can only allocate very little space per thread.

Say there are 512 active threads in one active block: you can allocate at most 32 bytes of shared memory per thread (assuming the full 16 KB is available, which is obviously not the case), and that is 8 floats.
If you choose to use 256 threads, you get 16 floats, and that's all.

But then you won't need double buffering, and the algorithm can proceed much more like the CPU version.
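To make that concrete, here is a rough sketch of the one-option-per-thread layout being discussed. This is an illustration, not SDK code: it reuses the d_OptionData / d_CallBuffer / d_CallValue symbols and expiryCallValue from the posted kernel, keeps each thread's tree in a private slice of the global buffer, and does plain CPU-style backward induction with no shared memory or double buffering.

[codebox]//Sketch (not SDK code): one option per thread, reusing the symbols from
//the posted kernel. Each thread walks its own tree sequentially, CPU-style.
//Accesses to call[] are uncoalesced in this layout, as warned above.
static __global__ void binomialOptionsPerThreadKernel(const unsigned int optN){
    const int opt = blockIdx.x * blockDim.x + threadIdx.x;
    if(opt >= optN)
        return;

    const real      S = d_OptionData[opt].S;
    const real      X = d_OptionData[opt].X;
    const real    vDt = d_OptionData[opt].vDt;
    const real puByDf = d_OptionData[opt].puByDf;
    const real pdByDf = d_OptionData[opt].pdByDf;

    //Private slice of the global buffer holding this thread's tree level
    real *call = &d_CallBuffer[opt * (NUM_STEPS + 16)];

    //Leaf values at expiry
    for(int i = 0; i <= NUM_STEPS; i++)
        call[i] = expiryCallValue(S, X, vDt, i);

    //Sequential backward induction, exactly like the CPU reference:
    //ascending j reads call[j + 1] before it is overwritten this pass
    for(int i = NUM_STEPS; i > 0; i--)
        for(int j = 0; j < i; j++)
            call[j] = puByDf * call[j + 1] + pdByDf * call[j];

    d_CallValue[opt] = (float)call[0];
}[/codebox]

The coalescing problem mentioned earlier could then be attacked by interleaving the buffer, e.g. storing node j of option opt at d_CallBuffer[j * MAX_OPTIONS + opt], so that consecutive threads in a warp touch consecutive addresses at every step.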

I have attached a figure in which A = c_base and B = min(c_base + CACHE_SIZE - 1, i); here i holds the highest index of the entire prices array for the current group of CACHE_DELTA time steps. The [A .. B] range contains exactly CACHE_SIZE elements if the index B = c_base + CACHE_SIZE - 1 is "valid" (in other words, not greater than i), or fewer otherwise. On one hand, it is good to have a large CACHE_DELTA so that memory spills happen as rarely as possible; on the other hand, each invocation of the reduction primitive loads data with an apron of size CACHE_DELTA: to produce N output elements, N + CACHE_DELTA elements are always loaded. Memory read overhead is therefore inevitably increased. Having applied the primitive to the entire vector, we reduce it by CACHE_DELTA nodes, stepping back CACHE_DELTA time steps, as shown in the second attachment (img2).
img2.doc (154 KB)
image.doc (154 KB)
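[For concreteness, that [A .. B] window schedule can be traced on the host with a few lines of C. NUM_STEPS = 2048 is an assumed value here; the other constants match the posted kernel.

[codebox]//Host-side trace of the cache windowing described above. NUM_STEPS = 2048
//is an assumed value; the remaining constants match the posted kernel.
#include <stdio.h>

#define NUM_STEPS   2048
#define TIME_STEPS    16
#define CACHE_DELTA (2 * TIME_STEPS)
#define CACHE_SIZE   256
#define CACHE_STEP  (CACHE_SIZE - CACHE_DELTA)

int main(void){
    for(int i = NUM_STEPS; i > 0; i -= CACHE_DELTA)
        for(int c_base = 0; c_base < i; c_base += CACHE_STEP){
            //A = c_base, B = min(c_base + CACHE_SIZE - 1, i)
            int c_start = (i - c_base < CACHE_SIZE - 1) ? i - c_base : CACHE_SIZE - 1;
            int c_end   = c_start - CACHE_DELTA;
            printf("i=%4d  loads [%4d..%4d]  writes back [%4d..%4d]\n",
                   i, c_base, c_base + c_start, c_base, c_base + c_end);
        }
    return 0;
}[/codebox]

Each window loads c_start + 1 elements but writes back only c_end + 1, which is exactly the N + CACHE_DELTA read overhead described above.]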

Well, I understand that algorithm. I was just asking whether you have understood it correctly, because that comes first, before you even start tweaking things around.

I think I have given my two cents on what happens if each thread prices an option separately.

Thank you for your suggestion, it was a great help!