Problem with vector add: can't compute the sum of two vectors

Hello. I have not been working with CUDA for very long and have run into a problem.

I compute the sum of two vectors in one block, but if I change the number of elements in the vectors, the result becomes wrong.

My code:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <cuda_runtime_api.h>

// Evaluate the call once, then print the CUDA error string and abort on failure.
#define CHECK_CUDA_ERROR(err) { cudaError_t e_ = (err); \
    if (e_ != cudaSuccess) { printf("%s\n", cudaGetErrorString(e_)); exit(-1); } }

#define MAX_X (100)
#define MAX_Y (100)
#define MAX_Z (10)
// Total number of elements (assumed definition: one element per thread in the block).
#define MAX_ELEMENTS (MAX_X * MAX_Y * MAX_Z)

__global__ void addVector(float* a, float* b, float* c)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    int idx = y * MAX_X + x + z * MAX_X * MAX_Y + z;
    c[idx] = a[idx] + b[idx] + 3.0f;
}

int main()
{
    int byteSize = MAX_ELEMENTS * sizeof(float);
    float* vec = new float[MAX_ELEMENTS];
    float* devVec1;
    float* devVec2;
    float* devVec3;

    CHECK_CUDA_ERROR(cudaMalloc((void**)&devVec1, byteSize))
    CHECK_CUDA_ERROR(cudaMalloc((void**)&devVec2, byteSize))
    CHECK_CUDA_ERROR(cudaMalloc((void**)&devVec3, byteSize))

    cudaMemset(devVec1, 0, byteSize);
    cudaMemset(devVec2, 0, byteSize);

    // The event must be created before it can be recorded.
    cudaEvent_t syncEvent;
    cudaEventCreate(&syncEvent);

    dim3 blocks = dim3();
    dim3 threads = dim3(MAX_X, MAX_Y, MAX_Z);
    addVector<<<blocks, threads>>>(devVec1, devVec2, devVec3);

    // Wait until the kernel has finished.
    cudaEventRecord(syncEvent, 0);
    cudaEventSynchronize(syncEvent);

    cudaMemcpy(vec, devVec3, byteSize, cudaMemcpyDeviceToHost);

    printf("First element: %f\n", vec[0]);
    printf("Second element: %f\n", vec[MAX_ELEMENTS - 1]);

    cudaEventDestroy(syncEvent);
    cudaFree(devVec1);
    cudaFree(devVec2);
    cudaFree(devVec3);
    delete[] vec;
    return 0;
}[/codebox]

The maximum block size on my GeForce 9600M GS is 512x512x64. How can I change MAX_X, MAX_Y and MAX_Z? For example:

if MAX_X = 100, MAX_Y = 100 and MAX_Z = 10, then all elements of the result are zero. Any help is appreciated.

P.S.: How can I use all the threads in a block for the computation?

c[idx] = a[idx] + b[idx] + 3.0f;

I want to know why, when I change MAX_X and the other constants, some components of the vector are invalid (not 3).
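A side note on the index formula, offered as a sketch rather than a confirmed diagnosis: the usual way to flatten a 3D thread coordinate is x + y * dimX + z * dimX * dimY. The posted formula has an extra "+ z" term, which shifts every z layer by z elements, so one element is skipped between consecutive layers and the last layer runs past the end of the buffer. The helpers below are hypothetical (the names `flatIndex` and `postedIndex` are not from the post) and run on the host only, so the two formulas can be compared without a GPU.

```cpp
#include <cassert>

// Hypothetical helper: maps a 3D coordinate to a linear offset,
// with x varying fastest, then y, then z.
inline int flatIndex(int x, int y, int z, int dimX, int dimY, int dimZ)
{
    assert(x >= 0 && x < dimX);
    assert(y >= 0 && y < dimY);
    assert(z >= 0 && z < dimZ);
    return x + y * dimX + z * dimX * dimY;
}

// The formula from the posted kernel, kept unchanged for comparison.
// Note the trailing "+ z".
inline int postedIndex(int x, int y, int z, int dimX, int dimY)
{
    return y * dimX + x + z * dimX * dimY + z;
}
```

For a 100x100x10 layout, `flatIndex` maps the last coordinate (99, 99, 9) to 99999, the last valid element, while `postedIndex` maps it to 100008, which is outside a 100000-element array. With MAX_Z = 1 the two formulas agree, which may be why the problem only shows up for some sizes.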

Your block configuration is invalid.
The block size you are quoting, 512x512x64, is the upper limit in each dimension; the total number of threads per block must be at most 512.
So you can have configurations like 512x1x1, 1x512x1 or 8x1x64, but 100x100x10 is invalid.
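To make the rule concrete, here is a minimal host-side sketch. The limits are hard-coded to the values discussed in this thread (512 threads per block, 512x512x64 per dimension); that is an assumption for this GPU, and on other hardware they would normally be read with cudaGetDeviceProperties. The helper names `isValidBlock` and `blocksFor` are made up for illustration.

```cpp
#include <cassert>

// Limits as discussed in this thread (assumed for this GPU, not queried).
const int kMaxThreadsPerBlock = 512;
const int kMaxDimX = 512;
const int kMaxDimY = 512;
const int kMaxDimZ = 64;

// Hypothetical helper: true when a block shape respects both the
// per-dimension limits and the total thread count.
inline bool isValidBlock(int x, int y, int z)
{
    if (x < 1 || y < 1 || z < 1) return false;
    if (x > kMaxDimX || y > kMaxDimY || z > kMaxDimZ) return false;
    return x * y * z <= kMaxThreadsPerBlock;
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounded up so that a final partial block handles the remainder.
inline int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

With these, 100000 elements at 512 threads per block need 196 blocks. Inside the kernel, each thread would then typically compute its element as blockIdx.x * blockDim.x + threadIdx.x and skip the work when that index is not below the element count, since the last block is only partly used. It also helps to check cudaGetLastError() right after the launch: a launch with an invalid configuration fails without running the kernel, and nothing reports it unless the error is queried.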

Thanks. I understand that I can use only 512 threads per block. I