Copying 2D array from host to device

I can compile successfully with "nvcc -deviceemu *.cu" on Ubuntu, but when I run *.out, the following error occurs:

pitch = 64
Segmentation fault

I checked the values of matrices a and b inside the kernel, and it clearly shows that the data is not transferred from host to device by the call to cudaMemcpy2D.
Then I made a change to the code in main(): I changed the declaration of a from "float *a[n]" to "float a[n][n]", and did the same for b and c. After compiling and running again, everything worked.

I don't understand why this is, and I hope you can help me make it clear.

What's wrong with my code? The source code follows.

#include "cuda.h"
#include "stdio.h"

#define n 5

__global__ void MatAdd( float *a, float *b, float *c, int w, int h, int pitch )
{
    for ( int i=0;i<h;i++ )
    {
        float *a_row = (float*)((char*)a + i*pitch);
        float *b_row = (float*)((char*)b + i*pitch);
        float *c_row = (float*)((char*)c + i*pitch);

        for ( int j=0;j<w;j++ )
        {
            c_row[j] = a_row[j] + b_row[j];
            printf ( "%f ", c_row[j] );
            printf ( "%f ", a_row[j] );
            printf ( "%f ", b_row[j] );
        }
    }
}
int main(int argc, char* argv[])
{
    float *a[n],*b[n],*c[n];
    float *da,*db,*dc;

    int i,j;
    for ( i=0;i<n;i++ )
    {
        a[i] = (float*)malloc ( n*sizeof(float) );
        b[i] = (float*)malloc ( n*sizeof(float) );
        c[i] = (float*)malloc ( n*sizeof(float) );
        for ( j=0;j<n;j++)
        {
            a[i][j] = 1;
            b[i][j] = 1;
            c[i][j] = 0;
        }
    }

    size_t pitch = 0;
    cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );
    cudaMallocPitch ( (void**)&db, &pitch, n*sizeof(float), n );
    cudaMallocPitch ( (void**)&dc, &pitch, n*sizeof(float), n );

    printf ( "pitch = %d\n", pitch );

   // dim3 dimBlock(16,16);
    //dim3 dimGrid ( (n+dimBlock.x-1)/dimBlock.x,(n+dimBlock.y-1)/dimBlock.y );

    cudaMemcpy2D ( da, pitch, a, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );
    cudaMemcpy2D ( db, pitch, b, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );

    //MatAdd<<<dimGrid, dimBlock>>>( da,db,dc,n,n,pitch );
    MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

    cudaMemcpy2D ( c,  n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

    for ( i=0;i<n;i++ )
    {
            printf ( "\n\n" );
            for ( j=0;j<n;j++ )
                    printf ( "c[%d][%d] = %f ",i,j,c[i][j] );
    }

    cudaFree ( da );
    cudaFree ( db );
    cudaFree ( dc );

    /* free each row; a, b, and c themselves are stack arrays */
    for ( i=0;i<n;i++ )
    {
            free ( a[i] );
            free ( b[i] );
            free ( c[i] );
    }

}

Could anyone give me some advice?

Hi,

Try this.

I think your problem is in assigning a proper pitch size.

Since you have 5 rows and 5 columns, your per-row width in bytes is 5*4 = 20 B (since your data type is float). The nearest memory segment size for proper coalescing is 32 B, so you have to pad each row by another 12 B (3 more floats) to fit that segment size. Hence, your pitch in bytes must be 32 B.

size_t pitch = 32;
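
For illustration, the general round-up arithmetic looks like this (ALIGN is just an assumed segment size for this sketch, not a value the CUDA API hands you):

/* Sketch: round a row width up to the next multiple of an
   assumed alignment. With 5 floats = 20 B and ALIGN = 32,
   this gives 32 B, i.e. 12 B (3 floats) of padding per row. */
#define ALIGN 32

size_t padded_pitch ( size_t width_in_bytes )
{
    return (width_in_bytes + ALIGN - 1) / ALIGN * ALIGN;
}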

Hope this helps !!

Sabkalyan

Hi Sabkalyan,

Thanks for your reply.

The pitch is assigned automatically by the call to cudaMallocPitch().

I tried assigning 32 to the pitch when calling cudaMemcpy2D(), but that does not seem to work. The problem is still there.

Yes, exactly. Don't change the pitch! As I am sure you have found, and as stated below:

the GPU tells you the pitch it used, based on how much memory you are requesting. To index into the next "row" of the array you must, as I believe you have, multiply the row index by the pitch size.

float* a[n] is an array of pointers to floats, while float a[n][n] is an n x n array of floats, which is what I believe you want. It sounds like you found the answer. "If it ain't broke, don't fix it." Or maybe I still do not understand your problem.
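
To make the difference concrete, here is a minimal host-side sketch (reusing the n from your code, error checking omitted) of a dynamically allocated but still contiguous buffer that cudaMemcpy2D can actually read, since it expects its source rows packed back to back in one block:

/* One malloc for all n*n floats: rows are guaranteed to sit
   back to back, so a host pitch of n*sizeof(float) is valid.
   n separately malloc'd rows carry no such guarantee. */
float *ha = (float*)malloc ( n*n*sizeof(float) );
for ( int i=0;i<n;i++ )
    for ( int j=0;j<n;j++ )
        ha[i*n + j] = 1;                /* row-major indexing */

float *da;
size_t pitch;
cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );
cudaMemcpy2D ( da, pitch,               /* padded device rows */
               ha, n*sizeof(float),     /* packed host rows   */
               n*sizeof(float), n, cudaMemcpyHostToDevice );

Inside the kernel the rows are then found at (char*)da + i*pitch, exactly as your MatAdd already does.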

Hi Bitminer,

I just want to know whether a dynamically allocated array can be used in that program when copying an array from host to device. It seems it cannot.

It should be able to. However, I do not create memory using multi-dimensional array syntax, as you do with a and the following:

int i,j;
for ( i=0;i<n;i++ )
{
    a[i] = (float*)malloc ( n*sizeof(float) );
    b[i] = (float*)malloc ( n*sizeof(float) );
    c[i] = (float*)malloc ( n*sizeof(float) );
    for ( j=0;j<n;j++)
    {
        a[i][j] = 1;
        b[i][j] = 1;
        c[i][j] = 0;
    }
}

I am not sure, but I think you should get a contiguous block of memory; this uncertainty is exactly why I do not use that syntax. I use malloc of the flavor:

float* ptr = (float*)malloc ( sizeof(float)*n*n );

// and access using
ptr[width*row_index + col_index] = a_value;

Your call to:

cudaMemcpy2D ( da, pitch, a, n*sizeof(float), n*sizeof(float), n, cudaMemcpyHostToDevice );

looks OK, if I am reading this right, since pitch comes from the GPU via cudaMallocPitch ( (void**)&da, &pitch, n*sizeof(float), n );

Short answer: you should be able to use the dynamically allocated array as you have... so I still cannot tell, just from looking at the code, what the issue is. I have wrapped the CUDA linear, 2D, and 3D memory creation and copying into three C++ template classes, and it took me a while to debug each one, so I feel your pain in trying to get this code correct. I would recommend doing the same so that you do not have to repeat the cudaMemcpy calls and the other calls that are prone to error... at least they were for me, with multiple memory transfers of varying types going on in my code. I would also recommend stepping through the code until the segfault is found, or until your NVIDIA driver restarts itself (sometimes it kills the entire machine, quite spectacular), as mine does when I have kernel memory access issues.
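
As an aside, even short of full wrapper classes, a small checking macro catches these silent failures early. A minimal sketch (CUDA_CHECK is just a name I made up for this example):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Wrap any runtime call, report where it failed, and abort. */
#define CUDA_CHECK(call)                                         \
    do {                                                         \
        cudaError_t err = (call);                                \
        if ( err != cudaSuccess ) {                              \
            fprintf ( stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                      cudaGetErrorString(err) );                 \
            exit(1);                                             \
        }                                                        \
    } while (0)

/* usage: */
CUDA_CHECK( cudaMemcpy2D ( da, pitch, a, n*sizeof(float),
                           n*sizeof(float), n,
                           cudaMemcpyHostToDevice ) );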

Sorry for not replying sooner… just saw that you posted again.

Just thought of this after I posted: Also try changing:

MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

to

MatAdd<<<1,1>>>( da,db,dc,n,n,pitch );

cudaThreadSynchronize();

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch, n*sizeof(float), n, cudaMemcpyDeviceToHost );

Adding cudaThreadSynchronize() may help. Others have told me recently that it is not needed, but I thought it might be, given what the developer guide implies: your data may still be in the process of being written by the kernel when the CPU call tries to copy it out, or free it.

What I have heard is that the memory-transfer calls will block and possibly synchronize threads, but this is not documented anywhere I can find, and it should be met with skepticism, as I could not find it in the programming guide or the CUDA reference manual. It is an answer I am currently seeking.
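
If you want to be explicit about it while testing, here is a sketch of the launch-then-sync pattern (cudaThreadSynchronize itself returns an error code you can inspect):

/* Force the host to wait for the kernel and surface any
   deferred launch error before copying the result back.
   (Per the discussion above, the host destination must be
   one contiguous block.) */
MatAdd<<<1,1>>>( da, db, dc, n, n, pitch );

cudaError_t err = cudaThreadSynchronize();
if ( err != cudaSuccess )
    fprintf ( stderr, "kernel failed: %s\n", cudaGetErrorString(err) );

cudaMemcpy2D ( c, n*sizeof(float), dc, pitch,
               n*sizeof(float), n, cudaMemcpyDeviceToHost );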