I started learning CUDA today, but I ran into a strange problem: the result of the kernel does not change.
My simple code calls a simple ‘add’ kernel, but the dev_c array is filled with 0 for every element.
Several articles said this is a matter of the driver version, so I updated the driver to the newest one.
If that had solved the problem, I would not be writing this post. I am still having it.
My current version info:
Driver Version: 512.96 (Studio)
CUDA Version: Build cuda_11.6.r11.6/compiler.31057947_0
Here is my code.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#define N 1000
__global__ void add(int* a, int* b, int* c) {
int tid = threadIdx.x;
if (tid < N) {
c[tid] = a[tid] + b[tid];
}
}
int main(void) {
int a[N];
int b[N];
int c[N];
int* dev_a;
int* dev_b;
int* dev_c;
for (int i = 0; i < N; ++i)
{
a[i] = -i;
b[i] = i * i;
}
int memSize = N * sizeof(int);
cudaMalloc((void**)&dev_a, memSize);
cudaMalloc((void**)&dev_b, memSize);
cudaMalloc((void**)&dev_c, memSize);
cudaMemcpy(dev_a, a, memSize, cudaMemcpyKind::cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, memSize, cudaMemcpyKind::cudaMemcpyHostToDevice);
add << <N, 1 >> > (dev_a, dev_b, dev_c);
cudaDeviceSynchronize();
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
{
fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(error));
exit(-1);
}
cudaMemcpy(c, dev_c, memSize, cudaMemcpyKind::cudaMemcpyDeviceToHost);
for (int i = 0; i < N; ++i)
{
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
}
You might have noticed that the CUDA error-checking code block is included. On my first try I got an "unsupported toolchain" error, so I downgraded CUDA from 11.7 to 11.6. However, I still get tons of zeros (the kernel does not seem to work).
Your kernel is not designed correctly for your grid definition:

add<<<N, 1>>>(dev_a, dev_b, dev_c);

That launches a grid of N blocks, each of which has 1 thread.
This:

int tid = threadIdx.x;

means that for every block, the single thread in that block will pick up an index of 0.
So every block is working on index 0 of your array; the rest is untouched. One possible fix would be to create a globally-unique index using the canonical method:
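int tid = threadIdx.x + blockIdx.x * blockDim.x;

With that definition (and the existing if (tid < N) bounds check left in place), each of the N single-thread blocks works on a distinct element, so all of dev_c gets written. The same indexing also works if you later switch to a more typical launch such as add<<<(N + 255) / 256, 256>>>(dev_a, dev_b, dev_c).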
I don’t know what book you are referring to, or what “old version” means exactly, but there has never been any change in CUDA that would affect the behavior of what we are discussing here. You would be able to run this experiment, using this code, with the two different definitions for tid, on the very first version of CUDA released around early 2007, and the latest one released today, and see the same contrast for the two different definitions.
Books could have mistakes or typos, of course. Just like anything I write could have a mistake or typo.
(It’s possible that if you went back to the first version of CUDA, you might have to replace cudaDeviceSynchronize() with cudaThreadSynchronize(), but that would have no material effect on code behavior or anything we are discussing here. Similarly, I don’t know if device_launch_parameters.h was a valid CUDA header file in the first version of CUDA, but it serves no important purpose here. And you would probably have to replace cudaMemcpyKind::cudaMemcpyHostToDevice with just cudaMemcpyHostToDevice, etc. None of these things would be significant.)
Is the exact code you have here found in that book? If so, where?
I was able to locate something very similar to the code you have posted in chapter 5, section 5.2.1, pages 61-63 of the English version of the book. However, the kernel launch there does not match what you have (and N is different, too):
add<<<1,N>>>( dev_a, dev_b, dev_c );
As far as that goes, there is no problem that I see with what is in the book. Furthermore, there are no published errata for this that I can see.
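That launch pairs correctly with the kernel’s tid = threadIdx.x: with a single block of N threads, threadIdx.x alone already covers every element (provided N does not exceed the per-block limit of 1024 threads on current GPUs). A minimal sketch of the two pairings, for contrast:

// book's version: 1 block of N threads; threadIdx.x spans 0..N-1,
// so tid = threadIdx.x indexes every element
add<<<1, N>>>(dev_a, dev_b, dev_c);

// posted version: N blocks of 1 thread; threadIdx.x is always 0,
// so a globally-unique index (threadIdx.x + blockIdx.x * blockDim.x) is needed
add<<<N, 1>>>(dev_a, dev_b, dev_c);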
It’s not obvious to me that there is any mistake of that kind in the book.