strange behavior of data size in cudaMalloc or cudaMemcpy

pwang · February 9, 2009, 9:52pm

Hi all:

I am new to CUDA, so maybe I am just making a trivial mistake, but I am seeing strange behavior from cudaMalloc and/or cudaMemcpy. In the program attached below, I allocate a large array in GPU global memory, set every element to a fixed value using one thread per array element, and then copy the array back to the CPU. The problem is that whenever the array size is larger than 128^3, e.g. 128^3+64 (to keep the size a multiple of CUDA_BLOCK_SIZE), the elements after the first 128^3 have strange values and the test fails. In this test the memory required is about 8 MB and the grid size is 32769 blocks, so neither seems to be the limiting factor. Furthermore, there is no such problem if I run this code in deviceemu mode.

I tested this routine on both a Quadro 5600 under Red Hat Linux 5 and a GeForce 8600 in a MacBook Pro. They gave the same results. Any thoughts are highly appreciated!
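For reference, the launch arithmetic quoted above can be checked on the host. This is only a minimal sketch in plain C++ (it should also compile as host code under nvcc). The names `kBlockSize`, `elementCount`, `blockCount` and `byteCount` are hypothetical helpers that restate the numbers in this post, and the byte count assumes a 4-byte float.

```cpp
#include <cstddef>

// Block size used in the posted program (CUDA_BLOCK_SIZE).
constexpr int kBlockSize = 64;

// Number of array elements: 128^3 plus one extra block's worth.
constexpr int elementCount() { return 128 * 128 * 128 + kBlockSize; }

// Number of thread blocks launched: size / CUDA_BLOCK_SIZE.
constexpr int blockCount() { return elementCount() / kBlockSize; }

// Bytes requested from cudaMalloc for a float array of that size.
constexpr std::size_t byteCount() {
    return sizeof(float) * static_cast<std::size_t>(elementCount());
}
```

With these values the array holds 2097216 elements spread over 32769 blocks and occupies roughly 8 MB, matching the figures in the post.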

To compile the following program, type “nvcc -lcudart main.cu”.

main.cu:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CUDA_BLOCK_SIZE 64

__global__ void myKernel(float *ap, int size);

int main()
{
    const int size = 128*128*128+64;

    /* Device buffer. */
    float *ap;
    if (cudaMalloc((void**)&ap, sizeof(float)*size) != cudaSuccess) {
        printf("cudaMalloc fail.\n");
        exit(0);
    }

    /* One thread per array element. */
    myKernel<<<size/CUDA_BLOCK_SIZE, CUDA_BLOCK_SIZE>>>(ap, size);

    /* Host buffer for the result. */
    float *bp = (float*)malloc(sizeof(float)*size);
    if (!bp) {
        printf("malloc fail.\n");
        exit(0);
    }

    if (cudaMemcpy(bp, ap, sizeof(float)*size, cudaMemcpyDeviceToHost) != cudaSuccess) {
        printf("cudaMemcpy failed.\n");
        exit(0);
    }
    for (int i = 0; i < size; i++)
        if (fabs(bp[i] - 0.1) > 1e-5) printf("fail at %d, ", i);
    printf("\n");

    free(bp);
    cudaFree(ap);
}

__global__ void myKernel(float *ap, int size)
{
    const int tx = threadIdx.x;
    const int bx = blockIdx.x;

    int igrid = bx*CUDA_BLOCK_SIZE + tx;
    if (igrid >= size)
        return;

    ap[igrid] = 0.1;
}

MisterAnderson42 · February 9, 2009, 10:17pm

1) You know that the maximum grid dimension in x (or y) is 65535, right? This isn't the cause of your problem, since you are only launching 32769 blocks, but keep it in mind.
2) There was a post on this exact issue a few weeks back. nvcc does strange things when you assign blockIdx.x or threadIdx.x to an "int". It effectively demotes that int to a short, and you run into issues when you try to multiply numbers that result in totals over 128^3.

Change to:

const unsigned int tx = threadIdx.x;

const unsigned int bx = blockIdx.x;

or just use the threadIdx/blockIdx variables directly and the problem will go away.
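To illustrate why the failures would begin right after the first 128^3 elements, here is a small host-side sketch in plain C++. It assumes, as described in 2), that the block index ends up being treated as a signed 16-bit value. That is an assumption about the compiler's behavior taken from this thread, not something verified here, and `expectedIndex` and `demotedIndex` are hypothetical names.

```cpp
// Block size used in the posted program (CUDA_BLOCK_SIZE).
constexpr int kBlockSize = 64;

// Global index computed with full-width arithmetic (the intended behavior).
inline long long expectedIndex(unsigned int block, unsigned int thread) {
    return static_cast<long long>(block) * kBlockSize + thread;
}

// Global index if the block index is reinterpreted as a signed 16-bit
// value (ASSUMPTION: this models the "demoted to a short" description).
inline long long demotedIndex(unsigned int block, unsigned int thread) {
    int narrow = static_cast<int>(block & 0xFFFFu);  // keep the low 16 bits
    if (narrow >= 0x8000) {
        narrow -= 0x10000;  // two's-complement sign wrap, done explicitly
    }
    return static_cast<long long>(narrow) * kBlockSize + thread;
}
```

Under this assumption, blocks 0 through 32767 produce the same index either way, covering elements 0 through 128^3 - 1. Block 32768, the first one past 128^3, maps to a negative index, so the elements it was meant to fill are never written.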

pwang · February 9, 2009, 11:39pm

Indeed 2) is the problem. Thanks very much!
