Odc
October 15, 2009, 2:53pm
1
I’ve just started to learn CUDA and have written a simple program that creates a 2D array of int on the host, allocates memory on the device, and then copies the array to the device. Eventually I want to expand this into a graph-searching algorithm. However, when using an array with 1,000 vertices (indices) it simply crashes. As far as I can tell, it’s populating the array on the host that causes the crash.
Call me a noob, but I thought that an array of this size was perfectly acceptable?
Here’s my code anyway:
[codebox]#include <stdio.h>
#include <cuda.h>

__global__ void myKernel(int* deviceArrayPtr, int pitch)
{
}

int main()
{
    int* deviceArrayPtr;
    size_t devicePitch, hostPitch, width, height;
    int hostArray[1000][1000];

    width = 1000;
    height = 1000;

    for(int i = 0; i < 1000; i ++)
        for(int j = 0; j < 1000; j ++)
            hostArray[i][j] = 20; //20 is an arbitrary number

    //Allocates memory on the device
    cudaMallocPitch((void**)&deviceArrayPtr, &devicePitch, width * sizeof(int), height);
    hostPitch = devicePitch;

    //Copies hostArray onto the pre-allocated device memory
    cudaMemcpy2D(deviceArrayPtr, devicePitch, &hostArray, hostPitch, width * sizeof(int), height, cudaMemcpyHostToDevice);

    myKernel <<< 100, 512 >>> (deviceArrayPtr, devicePitch);
}[/codebox]
Anyone have any ideas about this?
Hi!
I ran into the same problem two hours ago, though not with arrays this big!
The only solution I found was to use sizes that are a multiple of 2 and to define a sensible grid size and number of threads per block!
I tried it on your program. I did not check whether it prints the right values, but there is no more “segmentation fault”!
Here is the modified code:
[codebox]#include <stdio.h>
#include <cuda.h>

__global__ void myKernel(int* deviceArrayPtr, int pitch)
{
}

int main()
{
    int* deviceArrayPtr;
    size_t devicePitch, hostPitch, width, height;
    int hostArray[1024][1024];

    width = 1024;
    height = 1024;

    for(int i = 0; i < 1024; i ++)
        for(int j = 0; j < 1024; j ++)
            hostArray[i][j] = 20; //20 is an arbitrary number

    //Allocates memory on the device
    cudaMallocPitch((void**)&deviceArrayPtr, &devicePitch, width * sizeof(int), height);
    hostPitch = devicePitch;

    //Copies hostArray onto the pre-allocated device memory
    cudaMemcpy2D(deviceArrayPtr, devicePitch, &hostArray, hostPitch, width * sizeof(int), height, cudaMemcpyHostToDevice);

    //One 16x16 block of threads per 16x16 tile of the array
    dim3 threadPerBlock(16,16);
    dim3 dimGrid(width/threadPerBlock.x , height/threadPerBlock.y);
    myKernel <<< dimGrid, threadPerBlock >>> (deviceArrayPtr, devicePitch);
}[/codebox]
Odc
October 15, 2009, 4:03pm
3
Thanks for the quick reply :).
I have just used your modifications but I’m still getting the same problems. When the console opens, it just stops responding straight away with no error codes.
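One possible explanation, offered as a guess rather than anything confirmed in this thread: `int hostArray[1000][1000]` is a local variable of roughly 4 MB (1000 × 1000 × 4 bytes, assuming a 4-byte int). That can exceed the default stack size on some platforms (commonly 1 MB for Windows executables), so the program may die before any CUDA call runs. Going to 1024 × 1024 would make the array slightly bigger, not smaller. Separately, `cudaMemcpy2D` takes the source pitch as the byte stride of the host rows, which for a tightly packed host array is `width * sizeof(int)` and not the device pitch. The host-side sketch below keeps the array on the heap and computes that stride. `HostGrid`, `makeHostGrid` and `hostPitchBytes` are hypothetical names made up for illustration, and the CUDA calls appear only in comments.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type (not from the original post): a row-major 2D grid
// stored contiguously on the heap instead of on the stack.
struct HostGrid {
    std::size_t width;
    std::size_t height;
    std::vector<int> cells;  // heap storage, so it is not limited by stack size
};

// Builds a width x height grid with every cell set to `value`.
HostGrid makeHostGrid(std::size_t width, std::size_t height, int value) {
    return HostGrid{width, height, std::vector<int>(width * height, value)};
}

// Byte stride between consecutive host rows. The host rows are tightly
// packed, so this is just the row width in bytes (no padding).
std::size_t hostPitchBytes(const HostGrid& grid) {
    return grid.width * sizeof(int);
}

// In a .cu file the allocation and copy would then look roughly like this
// (assumed usage, not compiled as part of this sketch):
//
//   HostGrid grid = makeHostGrid(1000, 1000, 20);
//   cudaMallocPitch((void**)&deviceArrayPtr, &devicePitch,
//                   hostPitchBytes(grid), grid.height);
//   cudaMemcpy2D(deviceArrayPtr, devicePitch,
//                grid.cells.data(), hostPitchBytes(grid),
//                hostPitchBytes(grid), grid.height,
//                cudaMemcpyHostToDevice);
```

Checking the `cudaError_t` returned by `cudaMallocPitch` and `cudaMemcpy2D` (for example with `cudaGetErrorString`) would also show whether a CUDA call is failing, or whether the crash happens earlier on the host.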