Problem with summing arrays: the result array contains no values

Hi all!

I am new to CUDA programming.

I wrote a program that sums two arrays into a third array.

For some reason the target array C is always zero, even after the addition … can you tell me what I’m doing wrong?

The source code:

#include <iostream>

#include <cuda_runtime.h>

__global__ void sum(float *A, float *B, float *C)
{
    int n = blockDim.x * blockIdx.x + threadIdx.x;
    C[n] = A[n] + B[n];
}

void StartSum(float *A, float *B, float *C, int N)
{
    sum<<< N/64, 64 >>>(A, B, C);
}


The source code that initializes the arrays and calls the summation:

#include <windows.h>

#include <cuda.h>

#include <cuda_runtime.h>

#include <cuda_runtime_api.h>

#include <iostream>

#define N 5

void StartSum(float *A, float *B, float *C, int n);

int main()
{
    float a[N] = {1,2,3,4,5}, b[N] = {-2,-4,5,7,1}, c[N] = {0,0,0,0,0};

    cudaError_t err;

    float *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, sizeof(float)*N);
    cudaMalloc((void**)&dev_b, sizeof(float)*N);
    cudaMalloc((void**)&dev_c, sizeof(float)*N);

    err = cudaMemcpy(dev_a, a, sizeof(float)*N, cudaMemcpyHostToDevice);
    err = cudaMemcpy(dev_b, b, sizeof(float)*N, cudaMemcpyHostToDevice);
    err = cudaMemcpy(dev_c, c, sizeof(float)*N, cudaMemcpyHostToDevice);

    StartSum(dev_a, dev_b, dev_c, N);

    err = cudaMemcpy(c, dev_c, sizeof(float)*N, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        std::cout << c[i] << " ";

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}


When printing the results I always get zeros from std::cout<<c[i]<<" ";

Thank you in advance for your help.


Your problem (at least your first problem) is that your kernel is never called, since 5/64 = 0 in integer division, so you launch a grid of zero blocks…

If you checked the error status after the launch, you would get an “invalid configuration argument” error because of this. To convince yourself, just try this:

$ cat gridDim.cu

#include <stdio.h>

#include <cuda.h>

__global__ void foo() {

	if (threadIdx.x==0)

		printf("in kernel, gridDim and blockDim are %d %d\n", gridDim.x, blockDim.x);

}

int main() {

	foo<<< 5/64, 64 >>>();   // 5/64 == 0 blocks: invalid configuration

	printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	foo<<< 1, 1 >>>();       // valid configuration

	printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	cudaDeviceSynchronize(); // flush the kernel's printf

	foo<<< 5/64, 64 >>>();   // invalid again

	printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	return 0;

}

$ nvcc -arch=sm_21 -o gridDim gridDim.cu

$ ./gridDim 

invalid configuration argument

no error

in kernel, gridDim and blockDim are 1 1

invalid configuration argument

Then, I guess you’ll also have to add a test somewhere in your kernel to avoid accessing out-of-bounds data (like an “if (n < N)” test).

Thank you very much, dear Gilles_C!
It worked! :biggrin:
I will continue to study this interesting subject.

Sorry for the stupid question … summing two arrays now works.

How to find the sum of the elements of one array?

I tried to do this:

__global__ void Summation(float *A, float *C)
{
	int n = blockDim.x * blockIdx.x + threadIdx.x;
	*C += A[n];
}

but this option does not work …

I solved the problem this way:

for (int i = 0; i < count; i++)

    *C += A[i];

Do you think this approach is correct in terms of CUDA technology?

It does the job - the array is summed - but I doubt whether I approached the problem the right way.

Thank you.

That last approach won’t help you much - the elements of the array are summed by a single thread serially, not in parallel (and your kernel version has every thread updating *C concurrently without atomics, which is a race condition). Take a look at Parallel Reduction.