I am having trouble with a kernel for the element-wise product of two vectors (C[i] = A[i] * B[i], where A, B and C are the vectors).

My kernel code is:

__global__ void kernel(float* A, float* B, float* C) {
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    C[tid] = A[tid] * B[tid];
    __syncthreads();
}

I launch the kernel with n threads per block and size/n blocks, where size is the length of the vectors and n is a multiple of 32.

This kernel fails to give the right answers. The values of C[i] are valid up to a certain index, and all the remaining values are 0. The index up to which the product is valid also changes depending on the number of threads and the size. (For size = 4096 and threads = 64, the products are valid up to i = 1023, i.e. the first 1024 entries.)

I am using an 8500GT.

I am not having any trouble with any other program that uses CUBLAS.

Can you please provide a ready to compile source file (with makefile) that reproduces the error? And if you compile it with emulation, does it give the correct result?

So, only the first 1/4 of your results are returned. Why? Because that’s all you copy back from the device. That’s OK, though, since you only copied the first 1/4 of your inputs over to the device in the first place. :)

I find it helpful to use variable names like “sizeInBytes” or “sizeInFloats”. Keeps Mars probes from crashing, too.
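As a rough sketch of that naming habit: the host code from the original post is not shown in this thread, so the helper below (roundTrip, devBuf and its parameters) is hypothetical and only assumes the usual cudaMalloc/cudaMemcpy pattern. The point is that both calls take a byte count, so the element count has to be multiplied by sizeof(float) exactly once and kept under a name that says so.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the original post): copies sizeInFloats
// floats to the device and straight back again. Returns true on success.
bool roundTrip(const float* hostIn, float* hostOut, size_t sizeInFloats)
{
    // cudaMalloc and cudaMemcpy expect byte counts, not element counts.
    const size_t sizeInBytes = sizeInFloats * sizeof(float);

    float* devBuf = NULL;
    if (cudaMalloc((void**)&devBuf, sizeInBytes) != cudaSuccess)
        return false;

    // Passing sizeInFloats here instead of sizeInBytes would transfer only
    // the first quarter of the data, because sizeof(float) is 4.
    bool ok =
        cudaMemcpy(devBuf, hostIn, sizeInBytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(hostOut, devBuf, sizeInBytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(devBuf);
    return ok;
}
```

With size = 4096 floats, a copy of 4096 bytes covers only 1024 floats, which lines up with the first 1024 entries being the valid ones.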

By the way, your __syncthreads() call is completely pointless there, since no thread reads data written by another thread. Unless the compiler removes it (unlikely), it might well cost you about half your performance. Since your kernel is mostly bound by memory speed, though, it might not make much of a difference, especially on devices with slow memory.
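For illustration, here is a minimal sketch of the kernel without the barrier. The names multiplyKernel and launchMultiply, the extra length parameter n, and the bounds check are my additions and not part of the original post; the check is there only so that the block count can be rounded up safely when size is not a multiple of the block size.

```cuda
#include <cuda_runtime.h>

// Each thread reads and writes only its own element, so no barrier is needed.
// The parameter n and the bounds check are additions for illustration.
__global__ void multiplyKernel(const float* A, const float* B, float* C,
                               unsigned int n)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        C[tid] = A[tid] * B[tid];
}

// Hypothetical host wrapper: devA, devB and devC are device pointers that
// each hold at least n floats.
cudaError_t launchMultiply(const float* devA, const float* devB, float* devC,
                           unsigned int n, unsigned int threadsPerBlock)
{
    // Round up so that a partial last block still covers the tail elements.
    unsigned int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    multiplyKernel<<<blocks, threadsPerBlock>>>(devA, devB, devC, n);
    return cudaGetLastError();  // reports launch-time errors only
}
```

Called as launchMultiply(devA, devB, devC, size, 64), this launches the same number of blocks as size/n whenever size is an exact multiple of the block size.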