Hi guys. I’m trying to solve an On^4 problem (with 4 nested loops) using CUDA. As you can see below, I have an original version in C that runs in a reasonable time (50 seconds without threading or spliting the data structures), for data structures of 100 elements, but as soon as I try to use my CUDA version it is not so fast (in fact it is 50% faster than C code, but I need it to be faster), so I’m wondering if there are other solutions to try to solve my problem that I’m not aware of.
My server has two cards that support CUDA, but I’m using just the Tesla for now. Thanks for your time, any help will be highly appreciated.
Device 0: “Tesla C2050”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 3220897792 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device 1: “Quadro FX 380”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 267714560 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
// host original version in C
static double CRH(double * fprS, double * fprJ, double * fprV,
double * fprC, bool NO, double K, double T, unsigned int Q,
double TP, unsigned short int SIZE_I, unsigned short int SIZE_J,
unsigned short int SIZE_K, unsigned short int SIZE_L)
{
const double MV = PG_H(NO, fprS[0], K, fprV[0], T, fprJ[0], fprC[0]) * TP;
double min = MV;
double dif = 0.0;
// loop (On^4) througth all possible combinations
for (unsigned short int i = 1; i < SIZE_I; i++)
for (unsigned short int j = 1; j < SIZE_J; j++)
for (unsigned short int k = 1; k < SIZE_K; k++)
for (unsigned short int l = 1; l < SIZE_L; l++)
{
dif = (PG_H(NO, fprS[i], K, fprV[j], T, fprJ[k], fprC[l]) * TP) - MV;
if (dif < min)
min = dif;
}
return max(Q * min, 0.0);
}
// device version
static __device__ __global__ void CRD(double * R, double * fprS, double * fprJ,
double * fprV, double * fprC, bool NO, double K, double T, unsigned short int Q,
double TP, unsigned short int SIZE_J, unsigned short int SIZE_K, unsigned short int SIZE_L)
{
// threads id's (enough threads will be created to hold all elements of our data structure)
const int TX = blockIdx.x * blockDim.x + threadIdx.x;
__device__ __shared__ volatile double MV;
__device__ __shared__ volatile double min;
__device__ __shared__ volatile double dif;
// initializing shared data
MV = PG_D(NO, fprS[0], K, fprV[0], T, fprJ[0], fprC[0]) * TP;
min = MV;
dif = 0.0;
// loop (Complexity On^3) througth all possible combinations
// they start at index 1, because index 0 was calculated on
// the previous instruction
for (unsigned short int j = 1; j < SIZE_J; j++)
for (unsigned short int k = 1; k < SIZE_K; k++)
// instruct the compiler to unroll the inner loop completly
#pragma unroll
for (unsigned short int l = 1; l < SIZE_L; l++)
{
dif = (PG_D(NO, fprS[TX], K, fprV[j], T, fprJ[k], fprC[l]) * TP) - MV;
// sync all threads before next instruction,
// preventing unnecessary overwritten values
__syncthreads;
// keep the smallest
if (dif < min)
min = dif;
}
// return the smallest value to host
R[0] = max(Q * min, 0.0);
}