This code runs fine on Ocelot (http://code.google.com/p/gpuocelot/), although it uses a huge amount of memory (6-8 GB on my machine). Ocelot under Valgrind reports 8,000,393,726 bytes allocated. My guess is that cudaMalloc is failing on your card. Try allocating less memory.
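To put a number on "huge", here is a rough sketch of the size calculation, assuming the dimensions used in the code in this thread (n = 1000, k = 1000000) and a 4-byte float. The helper names bytesNeeded and toGiB are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed for an n-by-k array of float.
// The arguments are 64-bit, so the product is computed in 64-bit
// arithmetic and cannot wrap around the way an int product could.
std::uint64_t bytesNeeded(std::uint64_t n, std::uint64_t k) {
    return n * k * sizeof(float);
}

// Hypothetical helper: convert a byte count to GiB (2^30 bytes).
double toGiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With n = 1000 and k = 1000000, bytesNeeded gives 4,000,000,000 bytes, a bit over 3.7 GiB, for a single buffer. Two buffers of that size (the host copy plus an emulated device copy) would come to about 8,000,000,000 bytes, which is in line with the Valgrind figure, although that breakdown is my assumption. Many cards simply do not have that much memory.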
It fails at cudaFree. Also, it works when I run it in -deviceemu mode but fails in normal mode. It looks as if one operation starts before a previous one has finished.
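#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

int main(){
    int n = 1000;
    int k = 1000000;
    // Compute the size once in size_t: 1000 * 1000000 floats.
    size_t bytes = (size_t)n * k * sizeof(float);

    // Host buffer; malloc returns NULL if the allocation fails.
    float *X_h = (float *)malloc(bytes);
    if( X_h == NULL )
    {
        std::cout << "Failed at host malloc.\n";
        return 1;
    }
    int count = 0;
    for (int i = 0; i < n; i++){
        for (int j = 0; j < k; j++){
            X_h[count] = 0.0f;
            count++;
        }
    }

    // Device buffer; check every CUDA call before using its result.
    float *X_d;
    cudaError_t error = cudaMalloc((void **) &X_d, bytes);
    if( error != cudaSuccess )
    {
        std::cout << "Failed at malloc.\n";
        free(X_h);
        return 1;
    }
    error = cudaMemcpy(X_d, X_h, bytes, cudaMemcpyHostToDevice);
    if( error != cudaSuccess )
    {
        std::cout << "Failed at memcpy.\n";
        cudaFree(X_d);
        free(X_h);
        return 1;
    }
    error = cudaFree(X_d);
    if( error != cudaSuccess )
    {
        std::cout << "Failed at free.\n";
        free(X_h);
        return 1;
    }
    free(X_h);
    std::cout << "Passed\n";
    return 0;
}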
It fails in cudaMalloc on my card. The problem you were probably having before is that cudaMalloc failed and left you with a garbage pointer, which you then passed to cudaMemcpy or cudaFree, causing a segfault. In general, you should check every CUDA API call to make sure it succeeded.
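One way to make that checking less tedious is a small helper that reports which step failed and why. This is only a sketch: cudaOk is a made-up name, not part of the CUDA runtime, while cudaGetErrorString is the runtime call that turns an error code into a readable message.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): returns true when a
// CUDA runtime call succeeded. Otherwise it prints the name of the failing
// step and the runtime's description of the error, then returns false.
bool cudaOk(cudaError_t err, const char *what) {
    if (err == cudaSuccess) {
        return true;
    }
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}
```

It could be used as `if (!cudaOk(cudaMalloc((void **) &X_d, bytes), "cudaMalloc")) return 1;`, where `bytes` is assumed to hold the allocation size. The failure then shows up at the call that caused it rather than as a segfault later on.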