Parallelizing interconnected arrays in CUDA

Good day! I am new to CUDA. I have just read the NVIDIA tutorial and want to parallelize the following C code.

#define NUM_OF_ACCOMS 3360
#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256

//...some code...
	// Enumerate all ordered triples of distinct ring elements (stored 0-based).
	for (i = 1; i <= SIZE_RING; i++) {
		for (j = 1; j <= SIZE_RING; j++) {
			if (j == i) continue;
			for (k = 1; k <= SIZE_RING; k++) {
				if (k == j || k == i) continue;
				accoms_theta[indOfAccoms][0] = i - 1;
				accoms_theta[indOfAccoms][1] = j - 1;
				accoms_theta[indOfAccoms][2] = k - 1;
				accoms_thetaFix[indOfAccoms][0] = i - 1;
				accoms_thetaFix[indOfAccoms][1] = j - 1;
				accoms_thetaFix[indOfAccoms][2] = k - 1;
				results[indOfAccoms][0] = results[indOfAccoms][1] = results[indOfAccoms][2] = 0;
				indOfAccoms++;
			}
		}
	}

	// Enumerate all ordered pairs (bigrams) of ring elements.
	for (i = 0; i < SIZE_RING; i++) {
		for (j = 0; j < SIZE_RING; j++) {
			bigramms[indOfBigramms][0] = i;
			bigramms[indOfBigramms][1] = j;
			indOfBigramms++;
		}
	}
	// Compare every triple in accoms_theta with every triple in accoms_thetaFix.
	for (i = 0; i < NUM_OF_ACCOMS; i++) {
		thetaArr[0] = accoms_theta[i][0];
		thetaArr[1] = accoms_theta[i][1];
		thetaArr[2] = accoms_theta[i][2];
		// Differences inside the triple, reduced modulo SIZE_RING.
		d0 = thetaArr[2] - thetaArr[1];
		d1 = thetaArr[2] - thetaArr[0];
		if (d0 < 0)
			d0 += SIZE_RING;
		if (d1 < 0)
			d1 += SIZE_RING;
		for (j = 0; j < NUM_OF_ACCOMS; j++) {
			theta_fixArr[0] = accoms_thetaFix[j][0];
			theta_fixArr[1] = accoms_thetaFix[j][1];
			theta_fixArr[2] = accoms_thetaFix[j][2];
			d0_fix = theta_fixArr[2] - theta_fixArr[1];
			d1_fix = theta_fixArr[2] - theta_fixArr[0];
			count = 0;
			if (d0_fix < 0)
				d0_fix += SIZE_RING;
			if (d1_fix < 0)
				d1_fix += SIZE_RING;
			// Count the bigrams whose substitution differences match the fixed triple.
			for (k = 0; k < NUM_OF_BIGRAMMS; k++) {
				diff0 = subst[(d0 + bigramms[k][0]) % SIZE_RING] - subst[bigramms[k][0]];
				diff1 = subst[(d1 + bigramms[k][1]) % SIZE_RING] - subst[bigramms[k][1]];

				if (diff0 < 0)
					diff0 += SIZE_RING;
				if (diff1 < 0)
					diff1 += SIZE_RING;
				if (diff0 == d0_fix && diff1 == d1_fix)
					count++;
			}
			// Record every new running maximum together with the pair that produced it.
			if (max < count) {
				max = count;
				results[indResults][0] = max;
				results[indResults][1] = i;
				results[indResults][2] = j;
				count = 0;
				indResults++;
			}
		}
	}

As you can see, there are two main loops, over i and j. For each triple in accoms_theta I need to check the condition against each triple in accoms_thetaFix. Checking all pairs takes 3360 × 3360 × 256 ≈ 2.9 × 10^9 inner-loop iterations. Since I am new to CUDA, I need some help parallelizing this algorithm.
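To make the question more concrete, here is a minimal sketch in plain C of the work done for a single (i, j) pair, pulled out of the loops above. The names ring_diff, count_for_pair and pair_from_index are hypothetical helpers that do not exist in my code. I assume that one GPU thread could run count_for_pair for one flat index. The running maximum (max, results, indResults) is shared between iterations and is deliberately left out; I suppose it would need a separate reduction step.

```c
#include <assert.h>

#define NUM_OF_ACCOMS   3360
#define SIZE_RING       16
#define NUM_OF_BIGRAMMS 256

/* b - a, wrapped into 0..SIZE_RING-1 the same way the loops above do it. */
int ring_diff(int b, int a)
{
	int d = b - a;
	return d < 0 ? d + SIZE_RING : d;
}

/* Hypothetical helper: split a flat work-item index into the (i, j) pair. */
void pair_from_index(long idx, int *i, int *j)
{
	assert(idx >= 0 && idx < (long)NUM_OF_ACCOMS * NUM_OF_ACCOMS);
	*i = (int)(idx / NUM_OF_ACCOMS);
	*j = (int)(idx % NUM_OF_ACCOMS);
}

/* Hypothetical helper: the count for one pair of triples. It only reads
 * its arguments, so different pairs do not depend on each other. */
int count_for_pair(const int theta[3], const int theta_fix[3],
                   const int subst[SIZE_RING],
                   int bigramms[NUM_OF_BIGRAMMS][2])
{
	int d0 = ring_diff(theta[2], theta[1]);
	int d1 = ring_diff(theta[2], theta[0]);
	int d0_fix = ring_diff(theta_fix[2], theta_fix[1]);
	int d1_fix = ring_diff(theta_fix[2], theta_fix[0]);
	int count = 0;

	for (int k = 0; k < NUM_OF_BIGRAMMS; k++) {
		int a = bigramms[k][0];
		int b = bigramms[k][1];
		int diff0 = ring_diff(subst[(d0 + a) % SIZE_RING], subst[a]);
		int diff1 = ring_diff(subst[(d1 + b) % SIZE_RING], subst[b]);
		if (diff0 == d0_fix && diff1 == d1_fix)
			count++;
	}
	return count;
}
```

Each call touches only read-only data (subst, bigramms and the two triples), which is why I think the i and j loops are the natural place to parallelize.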

Info about my device:

GeForce GT730M
Compute Capability 3.5
Global Memory 2 GB
Shared Memory Per Block 48 KB
Max Threads Per Block 1024
Number of multiprocessors 2
Max Threads Dim 1024 : 1024 : 64
Max Grid Dim 2*(10 ^ 9) : 65535 : 65535
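
As a rough sanity check of these limits against the problem size, the arithmetic below assumes one thread per (i, j) pair and a block size chosen only for illustration. blocks_needed is a hypothetical helper, not part of my code or of the CUDA API.

```c
#define NUM_OF_ACCOMS 3360

/* Hypothetical helper: number of blocks needed to cover `total` work items
 * with `per_block` threads each (ceiling division). */
long blocks_needed(long total, long per_block)
{
	return (total + per_block - 1) / per_block;
}
```

With 3360 × 3360 = 11,289,600 pairs, blocks_needed gives 44,100 blocks at 256 threads per block and 11,025 blocks at 1024 threads per block, so a one-dimensional grid would fit within the limits listed above.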