I started learning CUDA today, but I ran into a strange problem: the result of the kernel does not change.
My simple code calls a simple ‘add’ kernel, but the dev_c array is filled with 0 for every element.
Several articles said this is a matter of the driver version, so I updated the driver to the newest one.
If that had solved the problem, I would not be writing this post. I am still having it.
My current version info:
Driver Version: 512.96 (Studio)
CUDA Version: Build cuda_11.6.r11.6/compiler.31057947_0
Here is my code.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#define N 1000
__global__ void add(int* a, int* b, int* c) {
int tid = threadIdx.x;
if (tid < N) {
c[tid] = a[tid] + b[tid];
}
}
int main(void) {
int a[N];
int b[N];
int c[N];
int* dev_a;
int* dev_b;
int* dev_c;
for (int i = 0; i < N; ++i)
{
a[i] = -i;
b[i] = i * i;
}
int memSize = N * sizeof(int);
cudaMalloc((void**)&dev_a, memSize);
cudaMalloc((void**)&dev_b, memSize);
cudaMalloc((void**)&dev_c, memSize);
cudaMemcpy(dev_a, a, memSize, cudaMemcpyKind::cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, memSize, cudaMemcpyKind::cudaMemcpyHostToDevice);
add << <N, 1 >> > (dev_a, dev_b, dev_c);
cudaDeviceSynchronize();
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
{
fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(error));
exit(-1);
}
cudaMemcpy(c, dev_c, memSize, cudaMemcpyKind::cudaMemcpyDeviceToHost);
for (int i = 0; i < N; ++i)
{
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
}
You might have noticed that the CUDA error-checking code block is included. On my first try I got an "unsupported toolchain" error, so I downgraded CUDA from 11.7 to 11.6. However, I still get tons of zeros (the kernel does not seem to work).
Your kernel is not designed correctly for your grid definition:

add<<<N, 1>>>(dev_a, dev_b, dev_c);

That launches a grid of N blocks, each of which has 1 thread.
This:

int tid = threadIdx.x;

means that for every block, the single thread in that block will pick up an index of 0.
So every block is working on index 0 of your array; the rest is untouched. One possible fix would be to create a globally-unique index using the canonical method:
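int tid = threadIdx.x + blockIdx.x * blockDim.x;

With that definition (and the existing if (tid < N) bounds check left in place), each of the N single-thread blocks works on a distinct element, so all of dev_c gets written. The same indexing also works if you later switch to a more typical launch such as add<<<(N + 255) / 256, 256>>>(dev_a, dev_b, dev_c).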
I don’t know what book you are referring to, or what “old version” means exactly, but there has never been any change in CUDA that would affect the behavior of what we are discussing here. You would be able to run this experiment, using this code, with the two different definitions for tid, on the very first version of CUDA released around early 2007, and the latest one released today, and see the same contrast for the two different definitions.
Books could have mistakes or typos, of course. Just like anything I write could have a mistake or typo.
(It’s possible that if you went back to the first version of CUDA, you might have to replace cudaDeviceSynchronize() with cudaThreadSynchronize(), but that would have no material effect on code behavior or anything we are discussing here. Similarly, I don’t know if device_launch_parameters.h was a valid CUDA header file in the first version of CUDA, but it serves no important purpose here. And you would probably have to replace cudaMemcpyKind::cudaMemcpyHostToDevice with just cudaMemcpyHostToDevice, etc. None of these things would be significant.)
Is the exact code you have here found in that book? If so, where?
I was able to locate something very similar to the code you have posted in chapter 5, section 5.2.1, pages 61-63 of the English version of the book. However, the kernel launch there does not match what you have (and N is different, too):
add<<<1,N>>>( dev_a, dev_b, dev_c );
As far as that goes, there is no problem that I see with what is in the book. Furthermore, there are no published errata for this that I can see.
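That launch pairs correctly with the kernel’s tid = threadIdx.x: with a single block of N threads, threadIdx.x alone already covers every element (provided N does not exceed the per-block limit of 1024 threads on current GPUs). A minimal sketch of the two pairings, for contrast:

// book's version: 1 block of N threads; threadIdx.x spans 0..N-1,
// so tid = threadIdx.x indexes every element
add<<<1, N>>>(dev_a, dev_b, dev_c);

// posted version: N blocks of 1 thread; threadIdx.x is always 0,
// so a globally-unique index (threadIdx.x + blockIdx.x * blockDim.x) is needed
add<<<N, 1>>>(dev_a, dev_b, dev_c);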
It’s not obvious to me that there is any mistake of that kind in the book.