kernel problem

I am having trouble with a kernel that computes the element-wise product of two vectors (C[i] = A[i] * B[i], where A, B and C are the vectors).

My kernel code is:

__global__ void kernel (float* A, float* B, float* C){


unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

C[tid] = A[tid] * B[tid];

}

I launch the kernel with n threads per block and size/n blocks, where size is the length of the vectors and n is a multiple of 32.

This kernel fails to give the right answers. The values of C[i] are valid up to a certain index, and all the remaining values are 0. The index up to which the product is valid also changes depending on the thread count and size (for size = 4096 and threads = 64, the products are valid up to i = 1023, i.e. the first 1024 entries).

I am using an 8500 GT.

I am not having any trouble with other programs that use CUBLAS.

What is the problem? Any idea where I am going wrong?

Can you please provide a ready to compile source file (with makefile) that reproduces the error? And if you compile it with emulation, does it give the correct result?

This is the entire code

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include "cuda.h"

__global__ void kernel(float* A_d, float* B_d, float* C_d){

	int tid = blockIdx.x * blockDim.x + threadIdx.x;

	// C_d[tid] = A_d[tid] * B_d[tid]

	float a = A_d[tid];

	float b = B_d[tid];

	float product = a * b;

	C_d[tid] = product;

}

int main(int argc, char** argv) {


	int size = 2048;

	float* A = (float*) malloc(size * sizeof(float));	

	float* B = (float*) malloc(size * sizeof(float));	

	float* C = (float*) malloc(size * sizeof(float));	

	// Random initialization

	for(int i=0;i<size;i++){

		A[i] = rand() % 47;

		B[i] = rand() % 3451;

	}

	printf("\n Vectors initialized \n");


	float *A_d, *B_d, *C_d;


	cudaMalloc((void**)&A_d, size * sizeof(float));

	cudaMalloc((void**)&B_d, size * sizeof(float));

	cudaMalloc((void**)&C_d, size * sizeof(float));


	cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);

	cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

	cudaMemcpy(C_d, C, size, cudaMemcpyHostToDevice);

	cudaMemset(C_d, 0, size);


	int threads = 64;

	int blocks = size / threads;

	printf(" Threads : %d, \t Blocks : %d\n", threads, blocks);


	dim3 dimBlock(threads, 1);	

	dim3 dimGrid(blocks, 1);

	kernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d);


	cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);	


	float* cpuResult = (float*) malloc(size * sizeof(float));

	float difference = 0.0;

	FILE *fp;

	fp = fopen("Result.txt", "w");


	for(int i=0;i<size;i++){

		cpuResult[i] = A[i] * B[i];

		difference += C[i] - cpuResult[i];

		fprintf(fp,"%d \t %f \t %f\n", i, C[i], cpuResult[i]);

	}

	printf("Avg difference :\t %f\n", difference/size);	










	return 0;
}

There is no makefile for this code. I just compile it with nvcc and run the resulting a.out. Device emulation gives me the same (invalid) result.

EDIT : I am unable to attach files for some reason

So, only the first 1/4 of your results are returned. Why? Because that’s all you copy back from the device. That’s OK, though, since you only copied the first 1/4 of your inputs over to the device in the first place. :)

I find it helpful to use variable names like “sizeInBytes” or “sizeInFloats”. Keeps Mars probes from crashing, too.
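To make the naming suggestion concrete, here is a minimal sketch of the arithmetic behind the "first 1/4" symptom. The helper names floatsCovered and bytesNeeded are made up for illustration and are not part of the posted code.

```cuda
#include <cstddef>

// Hypothetical helper: how many whole floats a copy of sizeInBytes bytes covers.
static size_t floatsCovered(size_t sizeInBytes) {
    return sizeInBytes / sizeof(float);
}

// Hypothetical helper: the byte count needed to copy sizeInFloats elements.
static size_t bytesNeeded(size_t sizeInFloats) {
    return sizeInFloats * sizeof(float);
}
```

With 4-byte floats, passing 4096 as the byte count covers floatsCovered(4096) = 1024 elements, which matches the "valid up to i = 1023" observation in the question. The full vector needs bytesNeeded(4096) = 16384 bytes.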


Your cudaMemcpy calls should be given the number of bytes, not the number of elements.

For example:
cudaMemcpy(C, C_d, size*sizeof(float), cudaMemcpyDeviceToHost);
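The same applies to the host-to-device copies and to the cudaMemset call. As a rough sketch of the pattern (roundTrip is a hypothetical helper, not something from the posted program), this copies a buffer to the device and back using a byte count throughout, and returns the first CUDA error it sees:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: copies sizeInFloats elements host -> device -> host.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t roundTrip(const float* src, float* dst, size_t sizeInFloats) {
    // Every size passed to the runtime below is in bytes, not elements.
    const size_t sizeInBytes = sizeInFloats * sizeof(float);
    float* buf_d = NULL;

    cudaError_t err = cudaMalloc((void**)&buf_d, sizeInBytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(buf_d, src, sizeInBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, buf_d, sizeInBytes, cudaMemcpyDeviceToHost);

    cudaFree(buf_d);
    return err;
}
```

If dst comes back complete for every index, the transfer sizes are right. Checking the returned cudaError_t is optional here, but it tends to make this kind of mistake easier to narrow down.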

There is no need for the __syncthreads() call in your kernel (you are not using shared memory), nor for the temporary variables.
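A simplified kernel might look like the sketch below. The extra parameter n, the bounds check, and the launchMultiply wrapper are my own additions (assumptions, not in the original code); they only matter when the vector length is not an exact multiple of the thread count.

```cuda
#include <cuda_runtime.h>

// Element-wise product with no temporaries and no __syncthreads().
// n is an added parameter (assumption): the number of elements.
__global__ void multiplyElements(const float* A, const float* B, float* C, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // threads past the end of the vectors do nothing
        C[tid] = A[tid] * B[tid];
}

// Hypothetical launcher: rounds the block count up so every element is covered.
cudaError_t launchMultiply(const float* A_d, const float* B_d, float* C_d,
                           int n, int threads) {
    int blocks = (n + threads - 1) / threads;
    multiplyElements<<<blocks, threads>>>(A_d, B_d, C_d, n);
    return cudaGetLastError();  // reports launch configuration errors
}
```

For the sizes in the question (2048 or 4096 elements, 64 threads) the guard never triggers, so it behaves the same as the original kernel once the copies are fixed.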

aaaaarrgggh …

I feel like crying… :( :( :(

thanks guys, you guys are awesome.

Btw, your __syncthreads() is completely pointless there, and unless the compiler removes it (unlikely) it might well cost you about half your performance. Since your kernel is mostly bound by memory speed, though, it might not make much of a difference, especially on devices with slow memory.