Hi all,
I guess I’m misunderstanding something basic here about which variable is allowed to live in which scope. As an example I took the matrix multiplication code from the docs (which works fine) and tried to modify it so that it computes not the matrix product itself but the matrix product multiplied by a scalar.
The code is left as in the example; the only addition is a new variable, x, which multiplies the product. x lives and is set on the host, while cuda_x lives on the device and is set equal to x.
This is what I thought would be a good approach:
float x;                    // host-side copy of the scalar
__device__ float * cuda_x;  // device-side pointer, meant to point at a copy of x

__global__ void Muld( float*, float*, int, int, float* );

void Mul( const float* A, const float* B, int hA, int wA, int wB, float* C )
{
    ........... same as in the docs, plus the following .............

    // allocate device storage for the scalar and copy x into it
    cudaMalloc( (void**)&cuda_x, sizeof( float ) );
    cudaMemcpy( cuda_x, &x, sizeof( float ), cudaMemcpyHostToDevice );
}
__global__ void Muld( float* A, float* B, int wA, int wB, float* C )
{
    ........... same as in the docs, but the last line is changed to ................

    // scale the accumulated result by the scalar stored on the device
    C[c + wB * ty + tx] = (*cuda_x) * Csub;
}
int main( void )
{
    x = 2.0f;

    .......... create the matrices on the host ...............
    .......... call Mul( ) ...........
    .......... calculate the product on the host too and compare ......
}
Without the modification everything works perfectly, but as soon as I introduce x and cuda_x I start getting ‘unspecified launch failure’ from cudaGetLastError.
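In case it matters, this is roughly how I check for the error (variable names as in the docs example; the cudaThreadSynchronize is there because the launch itself is asynchronous):

Muld<<< dimGrid, dimBlock >>>( Ad, Bd, wA, wB, Cd );
cudaThreadSynchronize();
cudaError_t err = cudaGetLastError();
if ( err != cudaSuccess )
    printf( "Muld: %s\n", cudaGetErrorString( err ) );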
What am I doing wrong?
While trying to debug this I compiled the above code in emulation mode, and there the strange thing was this: if the declaration of cuda_x is left as it is,

__device__ float * cuda_x;

the emulated version gives a seg fault, but after changing the declaration to

float * cuda_x;

it runs fine. Isn’t the point of emulation mode that I don’t have to change declarations and such, because the compiler figures it out? Probably I’m missing some basic thing here too.
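In case it matters, by “emulation turned on” I mean building with nvcc’s device-emulation flag, roughly like this (file names are just placeholders):

nvcc -o matmul matmul.cu
nvcc -deviceemu -o matmul_emu matmul.cu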