parallelizing interconnected arrays

CUDman · December 25, 2017, 6:25am

Good day! I am new to CUDA. I have just read the NVIDIA tutorial and want to parallelize the following C code.

#define NUM_OF_ACCOMS 3360
#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256

//...some code...
	for (i = 1; i <= SIZE_RING; i++) {
		for (j = 1; j <= SIZE_RING; j++) {
			if (j == i) continue;
			for (k = 1; k <= SIZE_RING; k++) {
				if (k == j || k == i) continue;
				accoms_theta[indOfAccoms][0] = i - 1; accoms_theta[indOfAccoms][1] = j - 1; accoms_theta[indOfAccoms][2] = k - 1;
				accoms_thetaFix[indOfAccoms][0] = i - 1; accoms_thetaFix[indOfAccoms][1] = j - 1; accoms_thetaFix[indOfAccoms][2] = k - 1;
				results[indOfAccoms][0] = results[indOfAccoms][1] = results[indOfAccoms][2] = 0;
				indOfAccoms++;
			}
		}
	}	

	/* Enumerate all SIZE_RING * SIZE_RING bigrams (i, j). */
	for (i = 0; i < SIZE_RING; i++) {
		for (j = 0; j < SIZE_RING; j++) {
			bigramms[indOfBigramms][0] = i; bigramms[indOfBigramms][1] = j;
			indOfBigramms++;
		}
	}

	/* Compare every triple in accoms_theta with every triple in accoms_thetaFix. */
	for (i = 0; i < NUM_OF_ACCOMS; i++) {
		thetaArr[0] = accoms_theta[i][0]; thetaArr[1] = accoms_theta[i][1]; thetaArr[2] = accoms_theta[i][2];
		/* Differences modulo SIZE_RING for triple i. */
		d0 = thetaArr[2] - thetaArr[1]; d1 = thetaArr[2] - thetaArr[0];
		if (d0 < 0)
			d0 += SIZE_RING;
		if (d1 < 0)
			d1 += SIZE_RING;
		for (j = 0; j < NUM_OF_ACCOMS; j++) {
			theta_fixArr[0] = accoms_thetaFix[j][0]; theta_fixArr[1] = accoms_thetaFix[j][1]; theta_fixArr[2] = accoms_thetaFix[j][2];
			/* Differences modulo SIZE_RING for triple j. */
			d0_fix = theta_fixArr[2] - theta_fixArr[1]; d1_fix = theta_fixArr[2] - theta_fixArr[0];
			count = 0;
			if (d0_fix < 0)
				d0_fix += SIZE_RING;
			if (d1_fix < 0)
				d1_fix += SIZE_RING;
			/* Count the bigrams whose substituted differences match those of triple j. */
			for (k = 0; k < NUM_OF_BIGRAMMS; k++) {
				diff0 = subst[(d0 + bigramms[k][0]) % SIZE_RING] - subst[bigramms[k][0]];
				diff1 = subst[(d1 + bigramms[k][1]) % SIZE_RING] - subst[bigramms[k][1]];

				if (diff0 < 0)
					diff0 += SIZE_RING;
				if (diff1 < 0)
					diff1 += SIZE_RING;
				if (diff0 == d0_fix && diff1 == d1_fix)
					count++;
			}
			/* Record each new maximum together with the pair (i, j) that produced it. */
			if (max < count) {
				max = count;
				results[indResults][0] = max; results[indResults][1] = i; results[indResults][2] = j;
				count = 0;
				indResults++;
			}
		}
	}

As you can see, there are two main loops, over i and j. For each triple in accoms_theta, I need to check the condition against every triple in accoms_thetaFix. Checking all pairs takes about 3360 × 3360 × 256 ≈ 2.9 × 10^9 inner-loop iterations (roughly 2^31). Because I am new to CUDA, I need some help parallelizing this algorithm.
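One way to look at the loop nest above: the count for a pair (i, j) depends only on the two triples, subst, and the bigram table, so every pair can be evaluated independently. Only the running max ties the iterations together. The following is a minimal C sketch of that decomposition, not a finished CUDA port. It assumes subst is a SIZE_RING-entry int table with values in 0..SIZE_RING-1 (its definition is not shown in the post) and that indOfBigramms starts at 0, so bigram k is (k / SIZE_RING, k % SIZE_RING). The names ring_diff, count_pair, best_pair, and struct best are hypothetical helpers introduced for illustration.

```c
#include <assert.h>

#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256

/* Difference a - b reduced into the range 0..SIZE_RING-1. */
static int ring_diff(int a, int b)
{
	int d = a - b;
	return d < 0 ? d + SIZE_RING : d;
}

/*
 * Count for one (theta, thetaFix) pair: the body of the two outer loops.
 * It reads only its arguments, so pairs do not depend on each other.
 * Assumption: subst has SIZE_RING entries with values in 0..SIZE_RING-1.
 */
static int count_pair(const int theta[3], const int theta_fix[3],
                      const int subst[SIZE_RING])
{
	int d0 = ring_diff(theta[2], theta[1]);
	int d1 = ring_diff(theta[2], theta[0]);
	int d0_fix = ring_diff(theta_fix[2], theta_fix[1]);
	int d1_fix = ring_diff(theta_fix[2], theta_fix[0]);
	int count = 0;

	/* Assumption: bigram k is (k / SIZE_RING, k % SIZE_RING). */
	for (int k = 0; k < NUM_OF_BIGRAMMS; k++) {
		int a = k / SIZE_RING, b = k % SIZE_RING;
		int diff0 = ring_diff(subst[(d0 + a) % SIZE_RING], subst[a]);
		int diff1 = ring_diff(subst[(d1 + b) % SIZE_RING], subst[b]);
		if (diff0 == d0_fix && diff1 == d1_fix)
			count++;
	}
	return count;
}

/* Largest count found and the first pair (i, j) that produced it. */
struct best { int count, i, j; };

/*
 * Sequential reference over n * n pairs, flattened into one index p.
 * Simplification: keeps only the final maximum, not every improvement.
 */
static struct best best_pair(int n, int theta[][3], int theta_fix[][3],
                             const int subst[SIZE_RING])
{
	struct best r = { -1, -1, -1 };
	assert(n > 0);
	for (long p = 0; p < (long)n * n; p++) {
		int i = (int)(p / n), j = (int)(p % n);
		int c = count_pair(theta[i], theta_fix[j], subst);
		if (c > r.count) {
			r.count = c; r.i = i; r.j = j;
		}
	}
	return r;
}
```

In a CUDA version, the flattened index p would typically come from blockIdx.x * blockDim.x + threadIdx.x, each thread would call something like count_pair for its own pair, and the max would be computed afterwards as a reduction over the per-pair counts. Calling best_pair(NUM_OF_ACCOMS, accoms_theta, accoms_thetaFix, subst) on the CPU gives a reference result to compare against.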

Info about my device:

GeForce GT730M
Compute Capability 3.5
Global Memory 2 GB
Shared Memory Per Block 48 KB
Max Threads Per Block 1024
Number of multiprocessors 2
Max Threads Dim 1024 : 1024 : 64
Max Grid Dim 2*(10 ^ 9) : 65535 : 65535
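
To relate these limits to the problem size: with one thread per (i, j) pair there are 3360 × 3360 = 11,289,600 threads in total. The small sketch below computes how many blocks that needs for a given block size. The helper name blocks_needed and the block size of 256 are assumptions for illustration, not something taken from the post.

```c
/* Number of blocks needed to cover `total` work items, rounding up. */
static long blocks_needed(long total, long threads_per_block)
{
	return (total + threads_per_block - 1) / threads_per_block;
}
```

For example, blocks_needed(3360L * 3360L, 256) gives 44100 and blocks_needed(3360L * 3360L, 1024) gives 11025, so a one-dimensional grid fits comfortably within the grid limits listed above.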
