Hello all,

This is just a spare-time project using the most accessible supercomputer :P (a GPU), so please excuse my amateurism. My current kernel code is:

```
__global__ void kernelone(float *ind1, float *ind2, float *ind3, float *hi, float *low, float *close, int *entrycombinations1d, int *r, int start, int end) {
    // Flat index of this thread across the whole 3D grid of 1D blocks.
    int idx = blockIdx.x * gridDim.y * gridDim.z * blockDim.x
            + blockIdx.y * gridDim.z * blockDim.x
            + blockIdx.z * blockDim.x
            + threadIdx.x;
    // Exit thresholds come from the block coordinates.
    int exitcondition1 = blockIdx.x;
    int exitcondition2 = blockIdx.y;
    // Entry thresholds: three consecutive ints per (blockIdx.z, threadIdx.x) pair.
    int entrycondition1 = entrycombinations1d[(blockIdx.z * blockDim.x + threadIdx.x) * 3];
    int entrycondition2 = entrycombinations1d[(blockIdx.z * blockDim.x + threadIdx.x) * 3 + 1];
    int entrycondition3 = entrycombinations1d[(blockIdx.z * blockDim.x + threadIdx.x) * 3 + 2];
    int exits1 = 0;
    int exits2 = 0;
    for (int n = start; n < end; n++) { // OUTER LOOP: look for an entry at n
        bool condition1 = entrycondition1 + 10 > ind1[n] && ind1[n] > entrycondition1;
        bool condition2 = entrycondition2 + 10 > ind2[n] && ind2[n] > entrycondition2;
        bool condition3 = entrycondition3 + 10 > ind3[n] && ind3[n] > entrycondition3;
        if (condition1 && condition2 && condition3) {
            bool innerloopon = true;
            for (int m = n + 1; m < end && innerloopon; m++) { // INNER LOOP: look for an exit at m
                if (close[n] - low[m] > exitcondition1) {
                    innerloopon = false;
                    n = m; // the outer loop resumes after the exit point
                    exits1++;
                }
                else if (m == end - 1) {
                    n = m; // no exit found before the end of the data
                }
                else if (hi[m] - close[n] > exitcondition2) {
                    innerloopon = false;
                    n = m; // the outer loop resumes after the exit point
                    exits2++;
                }
            }
        }
    }
    int result = exits2 - exits1;
    r[idx] = result;
}
```

I am running the code with:

```
dim3 numBlocks(16, 16, 4);
dim3 threadsPerBlock(1024, 1, 1);
kernelone<<<numBlocks, threadsPerBlock>>>(/* ind1, ind2, ind3, hi, low, close, entrycombinations1d, r, start, end */);
```

Inside the kernel I have a branching for loop over arrays that are currently 1024*1024 elements long, and I launch the kernel with 1024*1024 threads as well (16*16*4 blocks of 1024 threads each), because the entry and exit conditions change and there are that many combinations.
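
To make the launch geometry concrete, here is a small host-side sketch in plain C++ (no CUDA needed) that reproduces the `idx` arithmetic from the kernel and inverts it. It assumes the grid of (16, 16, 4) blocks with 1024 threads per block shown above. `Dim3`, `flat_index`, `total_threads`, `Combination` and `decode` are names invented for this illustration; they are not part of my actual code.

```cpp
// Hypothetical stand-in for CUDA's dim3 so that this compiles as plain C++.
struct Dim3 { int x, y, z; };

// Same arithmetic as `idx` in the kernel: a flat index over a 3D grid of 1D blocks.
inline int flat_index(Dim3 block, int thread_x, Dim3 grid, int block_dim_x) {
    return block.x * grid.y * grid.z * block_dim_x
         + block.y * grid.z * block_dim_x
         + block.z * block_dim_x
         + thread_x;
}

// Number of threads in the whole launch, i.e. the number of slots written in r.
inline long long total_threads(Dim3 grid, int block_dim_x) {
    return 1LL * grid.x * grid.y * grid.z * block_dim_x;
}

// Which exit thresholds and which entry row a given slot of r belongs to.
struct Combination {
    int exitcondition1;  // blockIdx.x in the kernel
    int exitcondition2;  // blockIdx.y in the kernel
    int entry_row;       // blockIdx.z * blockDim.x + threadIdx.x in the kernel
};

// Inverse of flat_index: recover the combination from a position in r.
inline Combination decode(int idx, Dim3 grid, int block_dim_x) {
    const int per_y = grid.z * block_dim_x;  // slots per blockIdx.y value
    const int per_x = grid.y * per_y;        // slots per blockIdx.x value
    Combination c;
    c.exitcondition1 = idx / per_x;
    c.exitcondition2 = (idx % per_x) / per_y;
    c.entry_row = idx % per_y;
    return c;
}
```

With this geometry `total_threads` gives 16 * 16 * 4 * 1024 = 1048576, so each thread handles exactly one combination. All threads of a block share the same two exit thresholds and differ only in their entry row.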

I need this kernel to run as fast as possible. How can I optimize it? Can I introduce parallelism into the outer and inner loops themselves, and if so, how? The main obstacle to parallelising the loops is the `n = m` assignment in the inner loop: its purpose is to prevent a new inner loop from starting before the previous one has finished. Any idea how to parallelise this? Or any other optimization that would make this kernel as fast as possible?
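
To show the dependency I mean, here is a sequential host-side sketch in plain C++ of what a single thread computes. It assumes the same loop structure as the kernel, with the entry and exit thresholds passed in directly instead of being read from `blockIdx` and `entrycombinations1d`. The name `reference_result` and its parameter list are made up for this illustration.

```cpp
// Sequential reference for ONE (entry, exit) combination, mirroring the kernel's loops.
// Returns exits2 - exits1, which is the value the kernel stores in r[idx].
inline int reference_result(const float* ind1, const float* ind2, const float* ind3,
                            const float* hi, const float* low, const float* close,
                            int entry1, int entry2, int entry3,
                            int exit1, int exit2, int start, int end) {
    int exits1 = 0;
    int exits2 = 0;
    for (int n = start; n < end; n++) {            // OUTER LOOP: look for an entry at n
        const bool c1 = entry1 + 10 > ind1[n] && ind1[n] > entry1;
        const bool c2 = entry2 + 10 > ind2[n] && ind2[n] > entry2;
        const bool c3 = entry3 + 10 > ind3[n] && ind3[n] > entry3;
        if (!(c1 && c2 && c3)) continue;
        bool open = true;
        for (int m = n + 1; m < end && open; m++) { // INNER LOOP: look for an exit at m
            if (close[n] - low[m] > exit1) {
                open = false;
                n = m;      // loop-carried dependency: entries between n and m are skipped
                exits1++;
            } else if (m == end - 1) {
                n = m;      // no exit before the end of the data
            } else if (hi[m] - close[n] > exit2) {
                open = false;
                n = m;      // same dependency on the other exit
                exits2++;
            }
        }
    }
    return exits2 - exits1;
}
```

For example, with six bars where the entry conditions hold at n = 0..3 and `hi[4]` is the only bar that triggers the second exit, this returns 1. Counting every entry independently would give 4, so where one inner loop ends decides which later entries count at all.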

Minimal functional code is at https://pastebin.com/kzfrYTtT (hopefully; I cannot check it because I am not at a CUDA-capable PC).

Thank you