Copying 2D array from host to device

I can compile successfully with "nvcc -deviceemu *.cu" on Ubuntu, but when I run *.out, the following error occurs:

pitch = 64
Segmentation fault

I checked the values of matrices a and b inside the kernel, and it clearly shows that the data is not transferred from host to device by the call to cudaMemcpy2D.
Then I made a change to the code in main(): I changed the declaration of a from "float *a[n]" to "float a[n][n]", and did the same for b and c. After compiling and running again, everything worked.

I don't understand why this is, and I hope you can help me make it clear.

What's wrong with my code? The source code follows.

#include "cuda.h"
#include "stdio.h"

#define n 5

__global__ void MatAdd( float *a, float *b, float *c, int w, int h, int pitch )
{
    for ( int i=0;i<h;i++ )
    {
        float *a_row = (float*)((char*)a + i*pitch);
        float *b_row = (float*)((char*)b + i*pitch);
        float *c_row = (float*)((char*)c + i*pitch);

        for ( int j=0;j<w;j++ )
        {
            c_row[j] = a_row[j] + b_row[j];
            printf ( "%f ", c_row[j] );
            printf ( "%f ", a_row[j] );
            printf ( "%f ", b_row[j] );
        }
    }
}
int main(int argc, char* argv[])
{
    float *a[n],*b[n],*c[n];
    float *da,*db,*dc;

    int i,j;
    for ( i=0;i<n;i++ )
    {
        a[i] = (float*)malloc ( n*sizeof(float) );
        b[i] = (float*)malloc ( n*sizeof(float) );
        c[i] = (float*)malloc ( n*sizeof(float) );
        for ( j=0;j<n;j++)
        {
            a[i][j] = 1;
            b[i][j] = 1;
            c[i][j] = 0;
        }
    }

    size_t pitch = 0;
    cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );
    cudaMallocPitch ( (void**)&db, &pitch, n*sizeof(float), n );
    cudaMallocPitch ( (void**)&dc, &pitch, n*sizeof(float), n );

    printf ( "pitch = %d\n", pitch );

   // dim3 dimBlock(16,16);
    //dim3 dimGrid ( (n+dimBlock.x-1)/dimBlock.x,(n+dimBlock.y-1)/dimBlock.y );

    cudaMemcpy2D ( da, pitch, a, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );
    cudaMemcpy2D ( db, pitch, b, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );

    //MatAdd<<<dimGrid, dimBlock>>>( da,db,dc,n,n,pitch );
    MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

    cudaMemcpy2D ( c,  n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

    for ( i=0;i<n;i++ )
    {
            printf ( "\n\n" );
            for ( j=0;j<n;j++ )
                    printf ( "c[%d][%d] = %f ",i,j,c[i][j] );
    }

    cudaFree ( da );
    cudaFree ( db );
    cudaFree ( dc );

    /* free each row; a, b, and c themselves are stack arrays */
    for ( i=0;i<n;i++ )
    {
            free ( a[i] );
            free ( b[i] );
            free ( c[i] );
    }

}

Could anyone give me some advice?

Hi,

Try this.

I think your problem is in assigning a proper pitch size.

Since you have 5 rows and 5 columns, your per-row width in bytes is 5*4 = 20 B (since your data type is float). The nearest memory segment size for proper coalescing is 32 B, so you have to pad each row by another 12 B (3 more floats) to fit that segment size. Hence, your pitch in bytes must be 32 B.

size_t pitch = 32;
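
For illustration, the general round-up arithmetic looks like this (ALIGN is just an assumed segment size for this sketch, not a value the CUDA API hands you):

/* Sketch: round a row width up to the next multiple of an
   assumed alignment. With 5 floats = 20 B and ALIGN = 32,
   this gives 32 B, i.e. 12 B (3 floats) of padding per row. */
#define ALIGN 32

size_t padded_pitch ( size_t width_in_bytes )
{
    return (width_in_bytes + ALIGN - 1) / ALIGN * ALIGN;
}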

Hope this helps !!

Sabkalyan

Hi Sabkalyan,

Thanks for your reply.

The pitch is assigned automatically by the call to cudaMallocPitch().

I tried assigning 32 to the pitch when calling cudaMemcpy2D(), but that does not seem to work. The problem is still there.

Yes, exactly. Don't change the pitch! As I am sure you have found, and as stated below:

the GPU tells you the pitch it used, based on how much memory you are requesting. To index into the next "row" of the array you must, as I believe you have, multiply the row index by the pitch size.

float* a[n] is an array of pointers to floats, while float a[n][n] is an n x n array of floats, which is what I believe you want. It sounds like you found the answer. "If it ain't broke, don't fix it." Or maybe I still do not understand your problem.
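
To make the difference concrete, here is a minimal host-side sketch (reusing the n from your code, error checking omitted) of a dynamically allocated but still contiguous buffer that cudaMemcpy2D can actually read, since it expects its source rows packed back to back in one block:

/* One malloc for all n*n floats: rows are guaranteed to sit
   back to back, so a host pitch of n*sizeof(float) is valid.
   n separately malloc'd rows carry no such guarantee. */
float *ha = (float*)malloc ( n*n*sizeof(float) );
for ( int i=0;i<n;i++ )
    for ( int j=0;j<n;j++ )
        ha[i*n + j] = 1;                /* row-major indexing */

float *da;
size_t pitch;
cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );
cudaMemcpy2D ( da, pitch,               /* padded device rows */
               ha, n*sizeof(float),     /* packed host rows   */
               n*sizeof(float), n, cudaMemcpyHostToDevice );

Inside the kernel the rows are then found at (char*)da + i*pitch, exactly as your MatAdd already does.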

Hi Bitminer,

I just want to know whether a dynamically allocated array can be used in that program when copying an array from host to device. It seems it cannot.

It should be able to. However, I do not create memory using multi-dimensional array syntax, as you do with a and the following:

int i,j;
for ( i=0;i<n;i++ )
{
    a[i] = (float*)malloc ( n*sizeof(float) );
    b[i] = (float*)malloc ( n*sizeof(float) );
    c[i] = (float*)malloc ( n*sizeof(float) );
    for ( j=0;j<n;j++)
    {
        a[i][j] = 1;
        b[i][j] = 1;
        c[i][j] = 0;
    }
}

I am not sure, but I think you should get a contiguous block of memory; this uncertainty is exactly why I do not use that syntax. I use malloc of the flavor:

float* ptr = (float*)malloc ( sizeof(float)*n*n );

// and access using
ptr[width*row_index + col_index] = a_value;

Your call to:

cudaMemcpy2D ( da, pitch, a, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );

looks OK, if I am reading this right, since pitch comes from the GPU via cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );

Short answer: you should be able to use the dynamically allocated array as you have... so I still cannot tell, just from looking at the code, what the issue is. I have wrapped the CUDA linear, 2D, and 3D memory creation and copying into three C++ template classes, and it took me a while to debug each one, so I feel your pain in trying to get this code correct. I would recommend doing the same so that you do not have to repeat the cudaMemcpy calls and the other calls that are prone to error... at least they were for me, with multiple memory transfers of varying types going on in my code. I would also recommend stepping through the code until the segfault is found, or until your NVIDIA driver restarts itself (sometimes it kills the entire machine, quite spectacular), as mine does when I have kernel memory access issues.
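
As an aside, even short of full wrapper classes, a small checking macro catches these silent failures early. A minimal sketch (CUDA_CHECK is just a name I made up for this example):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Wrap any runtime call, report where it failed, and abort. */
#define CUDA_CHECK(call)                                         \
    do {                                                         \
        cudaError_t err = (call);                                \
        if ( err != cudaSuccess ) {                              \
            fprintf ( stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                      cudaGetErrorString(err) );                 \
            exit(1);                                             \
        }                                                        \
    } while (0)

/* usage: */
CUDA_CHECK( cudaMemcpy2D ( da, pitch, a, n*sizeof(float),
                           n*sizeof(float), n,
                           cudaMemcpyHostToDevice ) );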

Sorry for not replying sooner… just saw that you posted again.

Just thought of this after I posted: Also try changing:

MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

to

MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

cudaThreadSynchronize();

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

Adding cudaThreadSynchronize() may help. Others have told me recently that it is not needed, but I thought it might be, given what the developer guide implies: your data may still be in the process of being written by the kernel when the CPU call tries to copy it out, or free it.

What I have heard is that the memory-transfer calls will block and possibly synchronize threads, but this is not documented anywhere I can find, and it should be met with skepticism, as I could not find it in the programming guide or the CUDA reference manual. It is an answer I am currently seeking.
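
If you want to be explicit about it while testing, here is a sketch of the launch-then-sync pattern (cudaThreadSynchronize itself returns an error code you can inspect):

/* Force the host to wait for the kernel and surface any
   deferred launch error before copying the result back.
   (Per the discussion above, the host destination must be
   one contiguous block.) */
MatAdd<<<1,1>>>( da, db, dc, n, n, pitch );

cudaError_t err = cudaThreadSynchronize();
if ( err != cudaSuccess )
    fprintf ( stderr, "kernel failed: %s\n", cudaGetErrorString(err) );

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch,
               n*sizeof(float), n, cudaMemcpyDeviceToHost );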