Finding an elegant solution for an O(n^4) problem

Hi guys. I’m trying to solve an O(n^4) problem (four nested loops) using CUDA. As you can see below, I have an original C version that runs in a reasonable time (50 seconds, without threading or splitting the data structures) for data structures of 100 elements each, but my CUDA version is not as fast as I need (in fact it is only about 50% faster than the C code), so I’m wondering if there are other approaches to this problem that I’m not aware of.

My server has two cards that support CUDA, but I’m using just the Tesla for now. Thanks for your time, any help will be highly appreciated.

Device 0: “Tesla C2050”

CUDA Driver Version: 3.20

CUDA Runtime Version: 3.20

CUDA Capability Major/Minor version number: 2.0

Total amount of global memory: 3220897792 bytes

Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Clock rate: 1.15 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: Yes

Device has ECC support enabled: No

Device is using TCC driver mode: No

Device 1: “Quadro FX 380”

CUDA Driver Version: 3.20

CUDA Runtime Version: 3.20

CUDA Capability Major/Minor version number: 1.1

Total amount of global memory: 267714560 bytes

Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.10 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: Yes

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

Concurrent kernel execution: No

Device has ECC support enabled: No

Device is using TCC driver mode: No

// host original version in C
static double CRH(double * fprS, double * fprJ, double * fprV,
		double * fprC, bool NO, double K, double T, unsigned int Q,
		double TP, unsigned short int SIZE_I, unsigned short int SIZE_J,
		unsigned short int SIZE_K, unsigned short int SIZE_L)
{
	const double MV = PG_H(NO, fprS[0], K, fprV[0], T, fprJ[0], fprC[0]) * TP;
	double min = MV;
	double dif = 0.0;
	// loop (O(n^4)) through all possible combinations
	for (unsigned short int i = 1; i < SIZE_I; i++)
		for (unsigned short int j = 1; j < SIZE_J; j++)
			for (unsigned short int k = 1; k < SIZE_K; k++)
				for (unsigned short int l = 1; l < SIZE_L; l++)
				{
					dif = (PG_H(NO, fprS[i], K, fprV[j], T, fprJ[k], fprC[l]) * TP) - MV;
					if (dif < min)
						min = dif;
				}
	return max(Q * min, 0.0);
}

// device version
static __global__ void CRD(double * R, double * fprS, double * fprJ,
		double * fprV, double * fprC, bool NO, double K, double T, unsigned short int Q,
		double TP, unsigned short int SIZE_J, unsigned short int SIZE_K, unsigned short int SIZE_L)
{
	// thread id (enough threads will be created to cover all elements of our data structure)
	const int TX = blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ volatile double MV;
	__shared__ volatile double min;
	__shared__ volatile double dif;
	// initializing shared data
	MV = PG_D(NO, fprS[0], K, fprV[0], T, fprJ[0], fprC[0]) * TP;
	min = MV;
	dif = 0.0;
	// loop (complexity O(n^3)) through all possible combinations;
	// they start at index 1, because index 0 was calculated in
	// the previous instruction
	for (unsigned short int j = 1; j < SIZE_J; j++)
		for (unsigned short int k = 1; k < SIZE_K; k++)
			// instruct the compiler to unroll the inner loop completely
			#pragma unroll
			for (unsigned short int l = 1; l < SIZE_L; l++)
			{
				dif = (PG_D(NO, fprS[TX], K, fprV[j], T, fprJ[k], fprC[l]) * TP) - MV;
				// sync all threads before the next instruction,
				// preventing unnecessarily overwritten values
				__syncthreads();
				// keep the smallest
				if (dif < min)
					min = dif;
			}
	// return the smallest value to the host
	R[0] = max(Q * min, 0.0);
}
  1. Sounds like you’re trying to run this with 100 threads. You need many more; try to do this with 10,000 threads.

  2. There’s no good reason to declare “MV” and “dif” as shared volatile. In fact, that may be counterproductive.

  3. The way you do it, there’s absolutely no guarantee that “min” will contain the correct value, because all your threads are trying to write to ‘min’ at the same time, even if you put __syncthreads in there.

Instead, create a global-memory array of SIZE_I * SIZE_J elements, and rewrite the kernel so that each thread calculates and writes one element. Then call __syncthreads() and have thread 0 in block 0 go through all elements and find the smallest. (There are much more efficient ways to calculate the minimum, but in your case having a single thread go through 10,000 values should suffice.)

  4. I’m not sure exactly what you’re doing, but I have a strong suspicion that all your loops should iterate starting with index 0, not 1, despite the comment “because index 0 was calculated on the previous instruction”.

  5. If rewriting everything the way I described does not result in satisfactory performance, I’d look at the implementation of PG_D.

Thank you so much hamster143 for your fast reply, I really appreciate it. I will work on the modifications you suggested and come back with news as soon as possible, probably next week. Below are my considerations about your points.

1 - You are right. I was trying to use the wrong number of threads to do the job;

2 - I will implement your suggestion;

3 - I will implement your suggestion;

4 - This algorithm is part of a risk management system for stock options, for stock exchanges;

5 - Find below the code of PG_D.

#define Y 0.2316419

#define A1 0.31938153

#define A2 -0.356563782

#define A3 1.781477937

#define A4 -1.821255978

#define A5 1.330274429

// device function: cumulative distribution function (CDF)

static __device__ double CDF(double X)
{
	if (X >= 0.0)
		return 1.0 - ((1 / sqrtf(2.0 * M_PI)) * expf((-powf(X, 2.0)) / 2.0)) * ((A1 * (1.0 / (1.0 + (Y * X)))) + (A2 * powf((1.0 / (1.0 + (Y * X))), 2.0)) +
				(A3 * powf((1.0 / (1.0 + (Y * X))), 3.0)) + (A4 * powf((1.0 / (1.0 + (Y * X))), 4.0)) + (A5 * powf((1.0 / (1.0 + (Y * X))), 5.0)));
	else
		return 1.0 - (1 - ((1 / sqrtf(2.0 * M_PI)) * expf((-powf(-X, 2.0)) / 2.0)) * ((A1 * (1.0 / (1.0 + (Y * -X)))) + (A2 * powf((1.0 / (1.0 + (Y * -X))), 2.0)) +
				(A3 * powf((1.0 / (1.0 + (Y * -X))), 3.0)) + (A4 * powf((1.0 / (1.0 + (Y * -X))), 4.0)) + (A5 * powf((1.0 / (1.0 + (Y * -X))), 5.0))));
}

// device function: Garman-Kohlhagen pricing model
static __device__ double PG_D(bool naturezaOpcao, double Sc, double K, double Vc, double T, double rc, double cc)
{
	// shared variable for fastest access speed
	__shared__ double d1;
	// inlined D1 Black-Scholes function
	d1 = (logf((Sc / K)) + ((rc - cc) + powf(Vc, 2.0) / 2.0) * T) / (Vc * sqrtf(T));
	if (!naturezaOpcao)
		return expf(-cc * T) * Sc * CDF(d1) - K * expf(-rc * T) * CDF(d1 - Vc * sqrtf(T));
	else
		return K * expf(-rc * T) * CDF(-(d1 - Vc * sqrtf(T))) - expf(-cc * T) * Sc * CDF(-d1);
}

One more time, thank you so much, I will come back next week with your suggestion implemented.

You have a mixture of single and double precision that causes a lot of conversions. Since you have a Tesla C2050, you don’t need to resort to single precision for speed, so you can decide either way: either append f to all constants to turn them into single precision (and don’t forget M_PI on the way), or remove the trailing f from the floating-point functions.

Also, using pow()/powf() for small integer exponents is a lot slower than direct multiplication. If you stick with them anyway, you can at least eliminate the expensive division in pow(1.0/…, 2.0) by just using a negative exponent.

However, all of Hamster’s remarks seem valid and should be addressed first, before trying to optimize the code. Did you check the GPU code produces the same results as the CPU version?

Hi Tera, thank you also for your reply. The GPU results do not match the CPU’s, but I believe this is because I was using the wrong number of threads, so only a few elements of my arrays entered the calculations. I hope that after Hamster’s suggestions the results will match, or at least differ only after the decimal point. Last but not least, I will also implement your suggestions after Hamster’s remarks. Thank you Tera.

Part of my remark #3 was inaccurate. It would be correct to make the function CRD write into a global array, then copy the array onto the host, and use host code to find the minimum value.

To expand on my #4, did you notice that the space over which you iterate in the C version is [1 … SIZE_I-1]x[1 … SIZE_J-1]x[1 … SIZE_K-1]x[1 … SIZE_L-1], but in the CUDA version it is [0 … SIZE_I-1]x[1 … SIZE_J-1]x[1 … SIZE_K-1]x[1 … SIZE_L-1]; and is it your intention to skip the element PG_H(NO, fprS[0], K, fprV[0], T, fprJ[0], fprC[1])?

Declaring the variable d1 in PG_D as shared is completely wrong. You’re making all threads in each block store intermediate results of computations in the same memory element! Are you under the impression that simply “double d1;” would put d1 in global memory? Not at all. It would go into a register, and it would be as fast or faster than shared memory.

Both for accuracy and speed, polynomials should never be evaluated by computing powers explicitly; instead, a Horner scheme should be used. The approximation to the CDF above looks like it is straight from Abramowitz & Stegun, section 26.2.17. Note that this approximation has an absolute error limit of 7.5e-8, so it delivers far from full double-precision accuracy. The code might look something like the following (note: I am on a machine without any compiler or interpreter, so I haven’t been able to test this; I hope I didn’t make any mistakes transcribing straight from A&S).

/* Approximate the CDF according to section 26.2.17 of Abramowitz & Stegun */
__device__ double CDF(double X)
{
   double y = 0.2316419;
   double rsqrt2pi = 0.3989422804;
   double x = fabs(X);
   double t, p, z, cdf;
   t = 1.0 / (y * x + 1.0);
   p =         1.330274429;
   p = p * t - 1.821255978;
   p = p * t + 1.781477937;
   p = p * t - 0.356563782;
   p = p * t + 0.319381530;
   p = p * t;
   z = rsqrt2pi * exp(-0.5 * x * x);
   cdf = (X >= 0) ? (1.0 - z * p) : (z * p);
   return cdf;
}

Given the low accuracy of the approximation one could also perform the internal computation in single precision. This would cause some additional loss of accuracy in the final result.

If I understand the article on Wolfram’s MathWorld correctly (http://mathworld.wolfram.com/NormalDistribution.html), a convenient way of computing the CDF for the normal distribution would be via the error function erf(), which is a C99 math function supported by CUDA.

__device__ double CDF(double X)
{
   double rsqrt2 = 0.7071067811865475244008;
   return 0.5 + 0.5 * erf(X * rsqrt2);
}

Obviously there are accuracy and performance trade-offs between this implementation and one custom tailored to the requirements of any particular application.

Interesting and good to know. I ran some quick tests on my card. The versions from posts #3, #7 and #8 produce approximately identical results. Versions #3 and #7 are slightly more accurate (using Excel’s built-in function NORMDIST as a gold standard), but #8 should be accurate enough for most purposes (the error of #8 ranges from 10^-8 to 10^-7 for most values).

Version #3 compiles into 2480 instructions, version #7 compiles into 240 instructions, and version #8 compiles into 208 instructions.

I just got to my desk :), I will implement all your remarks and then come back with news. One more time, thank you for your commitment to helping; I really appreciate it.

Thank you Njuffa for your remarks. I will take a look at them as soon as possible too. Thank you.

Thank you Hamster143. After I implement all the remarks in this post :), I will return as soon as I can with some results. Thanks.

I forgot to answer your question. Yes, I need to skip element 0 because it must be calculated just once for all combinations of the elements of my data structures fprS, fprV, fprJ, fprC. And that is why I wrote this post: I was trying to work with one big matrix of all element combinations, where each data structure is a column of this matrix, so with CUDA I would call my device function CRD just once for each row of this big matrix. Now I have changed my strategy and made separate arrays that at runtime act like an abstract matrix, so I do not need to create this big matrix.

This big matrix could be a real problem, because in the real production environment each of my data structures will have 100000 elements, not just 100 (and this raises another question: where would I keep a ~12GB matrix?). But maybe that problem will generate another post :). Thank you.

What do you think of the solution below? A friend of mine just came up with these block and thread parameters to eliminate all the for loops in our device function; pretty much the same as what Hamster143 said in his remark #1.

dim3 dimBlock(100, 100);

dim3 dimThreads(100, 100);

CRD<<<dimBlock, dimThreads>>>(…);

He claimed that I will be able to eliminate all 3 remaining for loops of my program (thanks to CUDA’s grid, block and thread organization) and call the PG_D function just once per thread in the CRD device function. I think it will work, at least I hope so :)), I will test it together with all your suggestions and come back in a couple of hours. Thanks.

...
int sizeC = 100 * 100 * 100 * 100;
double * C; // device array of sizeC elements (allocation elided)
...
int i = blockIdx.y;
int j = blockIdx.x;
int k = threadIdx.x;
int l = threadIdx.y;
...
C[((i * 100 + j) * 100 + k) * 100 + l] = (PG_D(NO, fprS[i], K, fprV[j], T, fprJ[k], fprC[l]) * TP) - MV;

Yes. But then you’d need an array of 100x100x100x100 elements, or 800 megabytes. And you have to go through this entire array to find the minimum. If you go this way, you may need to figure out how to do that on the device (because transferring the entire 800 megabytes to the host and then going through it with the CPU may be costly). The way I suggested above, you only transfer back 100x100 elements, or 80 KB.

Are you saying that you will need to go over 100,000^4 possibilities?

Yes, the business guys are out of this world man :)). The good news is that what we are doing is just a proof of concept (POC), and maybe the POC will serve to tell them that this cannot be done in a reasonable and not-so-expensive way.

I will stick with your suggestions. I’m having some difficulties setting up the kernel invocation now (to meet this 10,000-thread setup), but I will find a way around them :)). Thanks Hamster143.

Just turning TCC mode on should give you a nice boost in performance… (use nvidia-smi.exe)

Thanks Eldadk for the tip, I will take a look at it. Right now we are trying to change some aspects of the original algorithm. When I finish the program, I will come back to tell you what we did to get things working.

Guys, can you please help me understand how I can start my kernel with 10,000 threads? I’m using the parameters below, but it is not working; the final result is still different from the CPU version, so I’m very confused.

kernel<<<22, 512>>>

Thanks for your time.

Your execution configuration is correct. Also, if you tried to run with too many threads, the kernel just would not launch at all. If your kernel executes but produces different results, the problem lies elsewhere.

Thank you so much Tera, now I can focus on trying to find the problem elsewhere. Thanks.

Hopefully I have not misunderstood the problem, but the approach of removing all loops is the way to go.

One thread per dif computation.

The benefit is that the GPU will be fully utilized with parallel computation.

Looking at the host-code implementation, in the inner loop where dif is calculated, one can see that there is no dependency between iterations.

In the final stage, when all the dif elements have been evaluated, a new kernel should be launched on the GPU that runs a parallel scan/reduction with the min operator (just grab it from the CUDPP library and you are done).

When dealing with a data size that cannot fit in GPU memory, one has to do the computation in “chunks”.