Effective Parallelisation of CUDA C code

user19110 your post is completely off-topic in this thread. Please start your own thread for a completely new topic.

@Robert_Crovella Hello Sir, I tried printing values both inside and outside the kernel by writing a simple CUDA program that adds two elements. The code is attached below.

%%cu
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <complex.h>
#include <cuComplex.h>
#include <time.h>

__global__ void fc(float fc_x11_1, float fc_x11_2, float *devdx){
  devdx[0] = fc_x11_1 + fc_x11_2;
  printf("devdx value after addition: %f\n", devdx[i]);
}

int main()
{
  printf("hello main\n");
  float *dx = (float *)malloc((1)*sizeof(float));
  float *devdx;
  cudaMalloc((void **)&devdx,(1)*sizeof(float));

  float x11[147];
  int h;
  FILE *fp;
  fp = fopen("XYZ1.txt", "r");
  for(h=0;h<147;h++)
  {
    fscanf(fp, "%f", &x11[h]);
  }
  printf("two nu. %f \t %f",x11[1],x11[5]);

  cudaMemcpy(devdx,dx, sizeof(float),cudaMemcpyHostToDevice);
  fc<<<1,1>>>(x11[1], x11[5],devdx);
  //printf("devdx value : %f\n",devdx);
  cudaMemcpy(dx, devdx, sizeof(float),cudaMemcpyDeviceToHost);
  printf("dx value : %f\n",dx);
}

But the devdx value is not printing correctly. I am running this CUDA code on the Google Colab platform.

The code you have posted won’t compile. So it is evidently not the code you are running.

@Robert_Crovella Sir, it was not showing any compilation error on Google Colab. Can you please tell me why it won’t compile… is there a problem with the CUDA syntax?

You can use a compiler to answer this question. After all, that is what I did. Take the code you have posted (only), and try to compile it.

Apart from the won’t-compile problem, one problem I see in your code is here:

printf("dx value : %f\n",dx);

That is host code. Understanding what is wrong there is simply a matter of C/C++ coding, not anything to do with CUDA. If you like, compare it with the way you wrote the printf statement in your kernel code, to see if any of the differences seem important to you. (Careful study of that will also expose the compile error issue.)

My goal here is not about teaching people C/C++ programming.

@Robert_Crovella Yes Sir, I know I should not ask C/C++ questions here, but I also ran this code using the statement below to print the value

printf("dx value : %f\n",*dx);

and this statement to print the address
printf("dx value : %f\n",dx);

Actually, that was an issue with Google Colab; the code is now running perfectly fine.
The nvcc compiler provided by Google Colab was not showing any error, which was the main issue.
Thank You
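
For reference, the two fixes discussed above amount to indexing the result array inside the kernel (the posted kernel used an undeclared variable i) and dereferencing the host pointer when printing. A minimal sketch of the corrected lines, assuming the rest of the posted program stays the same:

__global__ void fc(float fc_x11_1, float fc_x11_2, float *devdx){
  devdx[0] = fc_x11_1 + fc_x11_2;                        // store the sum at index 0
  printf("devdx value after addition: %f\n", devdx[0]);  // print the value just written, not devdx[i]
}

/* ... in main(), after copying the result back to the host ... */
printf("dx value : %f\n", dx[0]);                        // print the float value, not the pointer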

@Robert_Crovella @striker159 Hi, my previous two-element code runs perfectly fine, but when I run the whole loop I get all of the elements as 0 in the output.
Can you please tell me what you think might be wrong in my CUDA code for subtracting the x11 values (two values at a time)?

This is the code I wrote:

%%cu
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <complex.h>
#include <cuComplex.h>
#include <time.h>

__global__ void fc(float *x11, float *dx){
  int row = blockIdx.y*blockDim.y+threadIdx.y;
  int col = blockIdx.x*blockDim.x+threadIdx.x;
  if((row < 147) && (col < 147)){
    dx[(row*147)+col] = x11[row] - x11[col];
  }
}

int main()
{
  float *dx = (float*)malloc((147*147)*sizeof(float));
  float *x11 = (float*)malloc(147*sizeof(float));
  float *devdx, *d_x11;

  cudaMalloc(&devdx,(147*147)*sizeof(float));
  cudaMalloc(&d_x11,(147)*sizeof(float));

  int h;
  FILE *fp;
  if ((fp = fopen("XYZ1.txt","r")) == NULL){
    printf("Error! opening file");
    exit(1);
  }

  for(h=0;h<147;h++){
    fscanf(fp, "%f", (x11+h));
  }
  cudaMemcpy(d_x11, x11, (147*sizeof(float)), cudaMemcpyHostToDevice);
  dim3 blocksize(147,147);
  dim3 threadsize(1,1);
  fc<<<blocksize,threadsize>>>(d_x11,devdx);
  cudaMemcpy(dx, devdx,((147*147)*sizeof(float)),cudaMemcpyDeviceToHost);

  int i,j;
  for(i=0;i<147;i++){
    for(j=0;j<147;j++){
      printf("%f\t",dx[i*147+j]);
    }
    printf("\n");
  }
  return 0;
}

This is the output I was getting (every element printed as 0.000000):

What happens if you follow the instructions I gave you here? If no errors are reported, then my best guess is that your input file (which I don’t have) is all zeroes.

I’ve suggested to you twice now that you should do proper CUDA error checking (and demonstrate it in any code you post), and also run your code with one of the sanitizers and report the output of that tool when asking for help.

I probably won’t be able to help you further with questions where you don’t demonstrate those things.
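
One possible way to add that error checking to the posted code (a minimal sketch; the cudaCheck macro name here is only illustrative, not part of the CUDA API):

// Minimal error-checking helper: checks a CUDA runtime return code and aborts on failure.
// Needs <stdio.h> and <stdlib.h>, which the posted code already includes.
#define cudaCheck(call)                                              \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                \
              cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(1);                                                       \
    }                                                                \
  } while (0)

// Example usage around the launch in the posted code:
// cudaCheck(cudaMemcpy(d_x11, x11, 147*sizeof(float), cudaMemcpyHostToDevice));
// fc<<<blocksize,threadsize>>>(d_x11, devdx);
// cudaCheck(cudaGetLastError());       // reports launch/configuration errors
// cudaCheck(cudaDeviceSynchronize());  // reports errors that occur during kernel execution
// cudaCheck(cudaMemcpy(dx, devdx, 147*147*sizeof(float), cudaMemcpyDeviceToHost));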

If I run your code with synthetic data, the output looks sensible to me:

$ cat t1931.cu
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <complex.h>
#include <cuComplex.h>
#include <time.h>
__global__ void fc(float *x11,float *dx){
int row = blockIdx.y*blockDim.y+threadIdx.y;
int col = blockIdx.x*blockDim.x+threadIdx.x;
if((row < 147) && (col < 147)){
dx[(row*147)+col] = x11[row] - x11[col];
}
}
int main()
{
float *dx = (float*)malloc((147*147)*sizeof(float));
float *x11 = (float*)malloc(147*sizeof(float));
float *devdx, *d_x11;

cudaMalloc(&devdx,(147*147)*sizeof(float));
cudaMalloc(&d_x11,(147)*sizeof(float));

 for(int h=0;h<147;h++){
  x11[h] = h;
 }
cudaMemcpy(d_x11, x11, (147*sizeof(float)), cudaMemcpyHostToDevice);
dim3 blocksize(147,147);
dim3 threadsize(1,1);
fc<<<blocksize,threadsize>>>(d_x11,devdx);
cudaMemcpy(dx, devdx,((147*147)*sizeof(float)),cudaMemcpyDeviceToHost);

int i,j;
for(i=0;i<7;i++){
for(j=0;j<7;j++){
printf("%f\t",dx[i*147+j]);
}
printf("\n");
}
return 0;
}
$ nvcc -o t1931 t1931.cu
$ cuda-memcheck ./t1931
========= CUDA-MEMCHECK
0.000000        -1.000000       -2.000000       -3.000000       -4.000000       -5.000000       -6.000000
1.000000        0.000000        -1.000000       -2.000000       -3.000000       -4.000000       -5.000000
2.000000        1.000000        0.000000        -1.000000       -2.000000       -3.000000       -4.000000
3.000000        2.000000        1.000000        0.000000        -1.000000       -2.000000       -3.000000
4.000000        3.000000        2.000000        1.000000        0.000000        -1.000000       -2.000000
5.000000        4.000000        3.000000        2.000000        1.000000        0.000000        -1.000000
6.000000        5.000000        4.000000        3.000000        2.000000        1.000000        0.000000
========= ERROR SUMMARY: 0 errors
$

Don’t assume this means it should be OK in your setup. If you happen to get a Colab instance with a K80 GPU in it, the problem may be that you are not compiling the code correctly for that GPU. If you had done proper CUDA error checking, that error would be reported when you ran the code. But based on the code you have actually posted here, you wouldn’t get any indication of that sort of problem.
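
If the instance does turn out to have a K80 (compute capability 3.7), one way to rule out an architecture mismatch is to compile with an explicit target, assuming the CUDA toolkit on the instance still supports Kepler devices:

$ nvcc -arch=sm_37 -o t1931 t1931.cu

With the error checking shown earlier in place, a mismatch would show up as an error such as "no kernel image is available for execution on the device" instead of silently producing zeroes.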