Trouble getting nested for loop to work as a kernel. Help!

CUDA 7.5 toolkit, Tesla K40c. I would show what I tried, but it would fill the page and none of it works.

for (int i = 0; i < MP; i++)
{
  for (int j = 0; j < MH; j++)
  {
    VH[j*MP + i] = 1.0 / (1.0 + exp(-VHSUM[j*MP + i])); //sigmoid
    if (VH[j*MP + i] == 0.5) VH[j*MP + i] = 0.0;
  }
}

(Never mind the CUDA code; you haven’t even provided complete C code.)

Presumably your VH array is a floating point type (float or double), and testing a floating point value for exact equality is kind of a strange thing to do:

if (VH[j*MP + i] == 0.5)
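If the intent is to snap sigmoid outputs that land at (or near) 0.5 back to zero, the usual alternative is a toleranced comparison. A minimal sketch, where the 1e-6f tolerance is a hypothetical value you would tune for your data:

if (fabsf(VH[j*MP + i] - 0.5f) < 1e-6f) VH[j*MP + i] = 0.0f;  // toleranced, not exact, comparison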

However the structure of your code lends itself to trivial 2D parallelization:

$ cat t1032.cu
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define MP 1024
#define MH 1024
#define BSZ 16
#define TOL 0.00001
typedef float mytype;

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


void cpu_func(mytype *VH, const mytype *VHSUM){

for (int i = 0; i < MP; i++)
{
  for (int j = 0; j < MH; j++)
  {
    int pos = j*MP+i;
    VH[pos] = 1.0 / (1.0 + exp(-VHSUM[pos])); //sigmoid
    if (VH[pos] == 0.5) VH[pos] = 0.0;
  }

}
}

__global__ void gpu_func(mytype * __restrict__ VH, const mytype * __restrict__ VHSUM){


  int j = threadIdx.y+blockDim.y*blockIdx.y;
  int i = threadIdx.x+blockDim.x*blockIdx.x;
  if ((j < MH) && (i < MP)){
    int pos = j*MP+i;
    VH[pos] = 1.0 / (1.0 + exp(-VHSUM[pos])); //sigmoid
    if (VH[pos] == 0.5) VH[pos] = 0.0;
  }
}


int main(){

  mytype *h_VH, *h_VHSUM, *gpu_VH, *d_VH, *d_VHSUM;
  size_t dsize = MP*MH*sizeof(mytype);
  h_VH    = (mytype *)malloc(dsize);
  gpu_VH  = (mytype *)malloc(dsize);
  h_VHSUM = (mytype *)malloc(dsize);
  printf("\n");
  memset(h_VH, 0, dsize);
  memset(gpu_VH, 0, dsize);
  cudaMalloc(&d_VH,    dsize);
  cudaMalloc(&d_VHSUM, dsize);
  cudaCheckErrors("cudaMalloc fail");
  for (size_t i = 0; i < MP*MH; i++) h_VHSUM[i] = rand()/(float)RAND_MAX;
  for (size_t i = 0; i < 10; i++) printf("%f ", h_VHSUM[i]);
  printf("\n");
  cudaMemset(d_VH, 0, dsize);
  cudaMemcpy(d_VHSUM, h_VHSUM, dsize, cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy1 fail");
  dim3 block(BSZ, BSZ);
  dim3 grid((MP+block.x-1)/block.x, (MH+block.y-1)/block.y);
  gpu_func<<<grid,block>>>(d_VH, d_VHSUM);
  cudaMemcpy(gpu_VH, d_VH, dsize, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy2/kernel fail");
  cpu_func(h_VH, h_VHSUM);
  for (size_t i = 0; i < MP*MH; i++) if (fabsf(h_VH[i] - gpu_VH[i])>TOL) { printf("mismatch at %zu, GPU was %f, should be %f\n", i, gpu_VH[i], h_VH[i]); return 1;}
  for (size_t i = 0; i < 10; i++) printf("%f ", gpu_VH[i]);
  printf("\nSuccess!\n");
  return 0;
}
$ nvcc -o t1032 t1032.cu
$ ./t1032

0.840188 0.394383 0.783099 0.798440 0.911647 0.197551 0.335223 0.768230 0.277775 0.553970
0.698505 0.597337 0.686348 0.689641 0.713337 0.549228 0.583030 0.683138 0.569001 0.635056
Success!
$

Note that for such a trivial, independent function, the kernel code can look identical to the body of your original for-loop code.

First of all, I apologize for the incomplete code.

In my algorithm, the sigmoid of an input of 0 is 0.5, but I need it set back to 0 for my algorithm to work.

You’re freakin’ awesome, thanks!!!

Although it’s slower than the CPU, it works. Now I just have to go in and figure out why.

Thank you, again!!!

It may be slower than the CPU. I’m not sure your algorithm has enough compute intensity per byte of data loaded/stored to be compute bound (I haven’t analyzed it; njuffa may chime in), so it may be memory bound. A memory-bound code with no data re-use (true for your code), where that code is the only function you are performing on the GPU, is not a good fit for the GPU: the cost of transferring data to/from the GPU outweighs any speed-up from running such a code there.
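As a very rough illustration (assuming mytype is float and the 1024x1024 sizes from the example above): the kernel reads ~4 MB and writes ~4 MB, which at the K40c’s ~288 GB/s peak memory bandwidth is on the order of 30 microseconds, while copying 4 MB each way across PCIe at a realistic ~10 GB/s is closer to a millisecond. The transfers alone can cost 20-30x the kernel.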

You may have missed my point on floating point equality. The results of floating point computations are rarely numerically exact, so testing for exact equality (as opposed to a toleranced range about the desired number) is rarely a sensible idea. It may work in your case, though: I haven’t checked what the output is for an exactly-zero input; it may be exactly 0.5, in which case it will pass the equality test.

Some small additional optimizations may be possible for the code I posted. The main one I can think of would be to create a loop in the kernel to increase the work per thread (launch fewer threads, process multiple elements per thread). I wouldn’t expect more than a 20-25% speedup out of that, however, and it may be less, depending on the GPU you are running on. @njuffa may have some ideas about how to speed up 1/(1+exp()).
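For illustration, a minimal sketch of that idea: a hypothetical gpu_func_strided that keeps the x indexing but uses a grid-stride loop in y, launched with a grid whose y dimension is deliberately smaller than MH/BSZ so each thread processes several rows:

__global__ void gpu_func_strided(mytype * __restrict__ VH, const mytype * __restrict__ VHSUM){
  int i = threadIdx.x + blockDim.x*blockIdx.x;
  if (i < MP){
    // each thread walks down column i, one row per grid-height step
    for (int j = threadIdx.y + blockDim.y*blockIdx.y; j < MH; j += blockDim.y*gridDim.y){
      int pos = j*MP + i;                        // same coalesced indexing as before
      VH[pos] = 1.0 / (1.0 + exp(-VHSUM[pos])); //sigmoid
      if (VH[pos] == 0.5) VH[pos] = 0.0;
    }
  }
}
// e.g. gpu_func_strided<<<dim3((MP+BSZ-1)/BSZ, 4), dim3(BSZ, BSZ)>>>(d_VH, d_VHSUM);
// (4 y-blocks is an arbitrary starting point to tune)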

With such a small kernel, overall execution time may be dominated by GPU-CPU memory copying.
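One way to confirm that is to time the kernel and the copies separately. A minimal sketch using CUDA events (error checks omitted); the same bracketing, repeated around the cudaMemcpy calls, gives the transfer times for comparison:

cudaEvent_t t0, t1;
cudaEventCreate(&t0); cudaEventCreate(&t1);
cudaEventRecord(t0);
gpu_func<<<grid,block>>>(d_VH, d_VHSUM);  // time the kernel by itself
cudaEventRecord(t1);
cudaEventSynchronize(t1);
float kernel_ms;
cudaEventElapsedTime(&kernel_ms, t0, t1);
printf("kernel: %f ms\n", kernel_ms);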

exp() is about the fastest transcendental function there is, both in single precision and in double precision. The above code should be completely memory bandwidth limited, except possibly on Maxwell-family GPUs (with 1/32 DP speed) if ‘mytype’ is double precision. Certainly on a Tesla K40c it will be bound by memory throughput.

How about for three for loops?

/*
MI  = 5              number of inputs
MH  = 6              number of hidden nodes
MS  = 12             number of strings or solutions
MIB = MI + 1 = 6     inputs plus bias term
MHB = MH + 1 = 7     hidden nodes plus bias term
KWT = MIB * MH = 36  number of input weights in a solution
T[KWT x MS]  = T[36 x 12]   population of 12 solutions
XI[MIB x MP] = XI[6 x 100]  matrix of data: 5+1 inputs and 100 observations
VH[MP x MHB] = VH[100 x 7]  I only use 100 x 6; after this completes I put the bias in the last column
tsubjct = one of the MS, set by another loop around these three (0 - 11, i.e. the 12 solutions one at a time)

I didn’t know if you needed this information, but I thought I’d better include an example for you to go on.
*/

for (int i = 0; i < MP; i++)
{
  for (int j = 0; j < MH; j++)
  {
    SUM = 0.0;
    IN = j*MIB;
    for (int k = 0; k < MIB; k++)
    {
      SUM += T[tsubjct*KWT + (IN + k)] * XI[i*MIB + k];
    }
    VH[j*MP + i] = SUM;
  }
}

What about it? The above code is completely limited by memory accesses. Optimize your access patterns, check whether you can make use of shared memory, use __ldg() where practical.

A simple approach would be to replace the two outer for-loops with the kernel/grid as in the first example, and retain the inner for-loop in the body of your kernel code.
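To make the shape concrete, here is a minimal, untested sketch along those lines (hidden_sums is a hypothetical name; tsubjct is passed as a parameter, mytype is as in the earlier example, and __ldg() requires compute capability 3.5+, which the K40c has):

__global__ void hidden_sums(mytype * __restrict__ VH, const mytype * __restrict__ T,
                            const mytype * __restrict__ XI, int tsubjct){
  int i = threadIdx.x + blockDim.x*blockIdx.x;   // observation index
  int j = threadIdx.y + blockDim.y*blockIdx.y;   // hidden-node index
  if ((i < MP) && (j < MH)){
    mytype SUM = 0.0;
    int IN = j*MIB;
    for (int k = 0; k < MIB; k++)                // inner loop retained per thread
      SUM += __ldg(&T[tsubjct*KWT + IN + k]) * __ldg(&XI[i*MIB + k]);
    VH[j*MP + i] = SUM;
  }
}
// launch with the same 2D grid/block arithmetic as in the first example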

Although I like to help people, I’m not a code-writing service. Hopefully you can learn enough from the example already provided to try your hand at the 3-loop case.

And if you do a little bit of searching you’ll find other examples, like here:

http://stackoverflow.com/questions/34529387/kernel-for-processing-a-4d-tensor-in-cuda/34539019#34539019

That’s all I needed, thanks!