kernel problem

maringanti · August 14, 2008, 10:57am

I am having trouble with a kernel for the element-wise product of two vectors (C[i] = A[i] * B[i], where A, B and C are the vectors).

my kernel code is

__global__ void kernel (float* A, float* B, float* C){

        

unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

C[tid] = A[tid] * B[tid];

__syncthreads();

}

I launch the kernel with n threads per block and size/n blocks, where size is the length of the vectors and n is a multiple of 32.

This kernel fails to give the right answers. The values of C[i] are valid up to a certain index and all the remaining values are 0. The index up to which the product is valid also changes depending on the thread count and size (for size = 4096 and threads = 64, the products are valid up to i = 1023, i.e. the first 1024 entries).

I am using an 8500GT.

I am not having any trouble with any other program that uses cublas.

What is the problem?

Any idea where I am going wrong?

Geli · August 14, 2008, 11:13am

Can you please provide a ready to compile source file (with makefile) that reproduces the error? And if you compile it with emulation, does it give the correct result?

maringanti · August 14, 2008, 7:52pm

This is the entire code

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include "cuda.h"

__global__ void kernel(float* A_d, float* B_d, float* C_d){

	int tid = blockIdx.x * blockDim.x + threadIdx.x;

	// C_d[tid] = A_d[tid] * B_d[tid]

	float a = A_d[tid];

	float b = B_d[tid];

	float product = a * b;

	C_d[tid] = product;

	__syncthreads();

}

int main(int argc, char** argv) {

	

	int size = 2048;

	float* A = (float*) malloc(size * sizeof(float));	

	float* B = (float*) malloc(size * sizeof(float));	

	float* C = (float*) malloc(size * sizeof(float));	

	// Random initialization

	for(int i=0;i<size;i++){

		A[i] = rand() % 47;

		B[i] = rand() % 3451;

	}

	

	printf("\n Vectors initialized \n");

	

	float *A_d, *B_d, *C_d;

	

	cudaMalloc((void**)&A_d, size * sizeof(float));

	cudaMalloc((void**)&B_d, size * sizeof(float));

	cudaMalloc((void**)&C_d, size * sizeof(float));

	

	cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);

	cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

	cudaMemcpy(C_d, C, size, cudaMemcpyHostToDevice);

	cudaMemset(C_d, 0, size);

	

	int threads = 64;

	int blocks = size / threads;

	printf(" Threads : %d, \t Blocks : %d\n", threads, blocks);

	

	dim3 dimBlock(threads, 1);	

	dim3 dimGrid(blocks, 1);

	kernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d);

	

	cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);	

	

	float* cpuResult = (float*) malloc(size * sizeof(float));

	float difference = 0.0;

	FILE *fp;

	fp = fopen("Result.txt", "w");

	

	for(int i=0;i<size;i++){

		cpuResult[i] = A[i] * B[i];

		difference += C[i] - cpuResult[i];

		fprintf(fp,"%d \t %f \t %f\n", i, C[i], cpuResult[i]);

	}

	printf("Avg difference :\t %f\n", difference/size);	

	cudaFree(A_d);	

	cudaFree(B_d);	

	cudaFree(C_d);	

	free(A);

	free(B);

	free(C);

	free(cpuResult);

	fclose(fp);

	return 0;

}

There is no makefile for this code. I just compile it with nvcc vector.cu and run the resulting a.out. Device emulation gives me the same (invalid) result.

EDIT : I am unable to attach files for some reason

redpill · August 14, 2008, 8:54pm

int size = 2048;

	cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);

	cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

	cudaMemcpy(C_d, C, size, cudaMemcpyHostToDevice);

	cudaMemset(C_d, 0, size);

	

...

	

	cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);


So, only the first 1/4 of your results are returned. Why? Because that’s all you copy back from the device. That’s OK, though, since you only copied the first 1/4 of your inputs over to the device in the first place. :)

I find it helpful to use variable names like “sizeInBytes” or “sizeInFloats”. Keeps Mars probes from crashing, too.

–redpill
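One way to follow the naming suggestion above is to keep the element count and the byte count in separately named variables, so that only the byte count ever reaches cudaMalloc, cudaMemcpy or cudaMemset. The sketch below assumes 4-byte floats; the helper name uploadFloats is hypothetical and exists only for illustration. With size = 2048, passing size as the byte count moves 2048 / 4 = 512 floats, which is the first quarter of each vector.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical helper (not from the thread): allocates a device buffer and
// copies sizeInFloats floats from host memory into it.
// Returns the device pointer, or NULL if the allocation or the copy fails.
float* uploadFloats(const float* host, size_t sizeInFloats)
{
    // cudaMalloc and cudaMemcpy take a size in BYTES, not in elements.
    size_t sizeInBytes = sizeInFloats * sizeof(float);

    float* device = NULL;
    if (cudaMalloc((void**)&device, sizeInBytes) != cudaSuccess)
        return NULL;

    if (cudaMemcpy(device, host, sizeInBytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(device);
        return NULL;
    }
    return device;
}
```

A call such as uploadFloats(A, 2048) would then replace a cudaMalloc/cudaMemcpy pair, and the element count never has to be converted by hand at the call site.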

mfatica · August 14, 2008, 8:58pm

Your cudaMemcpy calls should be given the number of bytes, not the number of elements.

For example:
cudaMemcpy(C, C_d, size*sizeof(float), cudaMemcpyDeviceToHost);

There is no need for the __syncthreads() call in your kernel (you are not using shared memory), nor for the temporary variables.
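Putting both points together, a minimal sketch of the corrected program might look like the following. The names multiplyKernel and multiplyOnDevice, the extra n parameter with its bounds check, and the rounded-up block count are assumptions added for illustration and are not part of the original post; the guard only matters when the vector length is not a multiple of the block size.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Element-wise product C[i] = A[i] * B[i]. No shared memory is used, so
// there is no __syncthreads(), and the temporaries are gone.
__global__ void multiplyKernel(const float* A, const float* B, float* C, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // bounds guard: an addition here, not in the original kernel
        C[tid] = A[tid] * B[tid];
}

// Hypothetical host wrapper: copies A and B to the device, runs the kernel,
// and copies the result back into C. All sizes passed to the CUDA runtime
// are in bytes. Returns the first error encountered, or cudaSuccess.
cudaError_t multiplyOnDevice(const float* A, const float* B, float* C, int n)
{
    if (n <= 0)
        return cudaErrorInvalidValue;

    size_t bytes = (size_t)n * sizeof(float);
    float *A_d = NULL, *B_d = NULL, *C_d = NULL;

    cudaError_t err = cudaMalloc((void**)&A_d, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&B_d, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&C_d, bytes);

    if (err == cudaSuccess) err = cudaMemcpy(A_d, A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(B_d, B, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 64;
        int blocks = (n + threads - 1) / threads;  // round up to cover all n
        multiplyKernel<<<blocks, threads>>>(A_d, B_d, C_d, n);
        err = cudaGetLastError();  // reports launch errors
    }

    if (err == cudaSuccess) err = cudaMemcpy(C, C_d, bytes, cudaMemcpyDeviceToHost);

    // cudaFree accepts NULL, so this is safe even after an early failure.
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    return err;
}
```

Checking the return value of every runtime call is a design choice in this sketch rather than something the thread asks for; it does not catch a wrong byte count, but it does surface failed allocations and launches that would otherwise pass silently.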

maringanti · August 14, 2008, 8:59pm

aaaaarrgggh …

I feel like crying… :( :( :(

thanks guys, you guys are awesome.

Reimar · August 15, 2008, 6:30am

my kernel code is

__global__ void kernel (float* A, float* B, float* C){

unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

C[tid] = A[tid] * B[tid];

__syncthreads();

}


Btw, your __syncthreads() is completely pointless there, and unless the compiler removes it (unlikely) it might well cost you about half the performance. Since your kernel is mostly bound by memory speed, though, it might not make much of a difference, especially on devices with slow memory.
