CUDA 2.2 / Zero-copy access

Sometimes I need to do the following.

For example:

You have a source matrix in host memory, source[Width * Height] (in this case Width % 2 != 0).

After the computation, the results are saved in a new matrix target[Width + 1][Height] (so that (Width + 1) % 2 == 0).

When the source matrix is copied from host to device without the device memory being zeroed first, the final element of each row does not get a zero value.

I expected a zero value in the final element of each row; this value will be used in my calculation.

I know that inside the kernel function I could handle this with some extra logic, but then my kernel becomes complicated and slow.

After the calculation I copy the data back from the device to the target matrix in host memory.

I am sorry if my understanding of “zero-copy memory” is wrong. :)

Zero-copy in the CUDA 2.2 beta means it becomes unnecessary to copy data from the computer’s RAM into device memory before accessing it, whereas I think you are talking about setting the data in memory to the value zero. So, yes, you still need to clear the memory with memset even if you use zero-copy.
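For the padded-matrix case you describe, a minimal CUDA C sketch of that advice (Width, Height, and the host array source are placeholders from your question): zero the whole padded device buffer once with cudaMemset, then copy the Width-element rows into the (Width + 1)-wide layout with cudaMemcpy2D, so the last element of each row keeps its zero value and the kernel stays simple.

// Sketch only: zero the padded buffer, then copy unpadded rows into it.
float *d_target;
size_t dpitch = (Width + 1) * sizeof(float);   // padded row size in bytes
size_t spitch = Width * sizeof(float);         // source row size in bytes

cudaMalloc((void **)&d_target, dpitch * Height);
cudaMemset(d_target, 0, dpitch * Height);      // padding elements become 0.0f
cudaMemcpy2D(d_target, dpitch,                 // dst and its pitch
             source,   spitch,                 // src and its pitch
             spitch, Height,                   // row width in bytes, row count
             cudaMemcpyHostToDevice);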

Sorry, where can I find a list of which boards are MCP7x?

My board has a 9500GT chip; does it also support zero-copy?

The MCP7x parts are all integrated GPUs, so they’d be on your motherboard, not on an add-in board.

Of the discrete boards, only G200-based ones support zero-copy, so unfortunately your 9500GT won’t support it.

MCP79 supports zero-copy. MCP73 doesn’t.
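If it helps, rather than matching chip names you can query the device directly: since CUDA 2.2 the runtime exposes a canMapHostMemory field in cudaDeviceProp that reports whether zero-copy (mapped memory) is supported. A small sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // inspect device 0
    printf("%s %s map host memory (zero-copy)\n",
           prop.name, prop.canMapHostMemory ? "can" : "cannot");
    return 0;
}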

As far as I understand, the zero-copy feature will allow me to read host memory from a kernel and drop the cudaMalloc and cudaMemcpy H->D calls before the kernel launch?

Thanks in advance!

Yes, but only from memory explicitly allocated to support zero-copy.
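Roughly, the sequence looks like the sketch below (the kernel name, launch configuration, and N are placeholders, and error checking is omitted). The mapping flag must be set before the context is created, the buffer is allocated with cudaHostAllocMapped, and the kernel receives the device-side alias of the host pointer instead of a cudaMalloc’d buffer.

// Sketch only: myKernel, grid, block, and N are placeholders.
float *h_data, *d_data;

cudaSetDeviceFlags(cudaDeviceMapHost);            // before any CUDA context exists
cudaHostAlloc((void **)&h_data, N * sizeof(float),
              cudaHostAllocMapped);               // pinned, mapped host memory
cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

myKernel<<<grid, block>>>(d_data, N);             // kernel reads/writes host RAM directly
cudaThreadSynchronize();                          // results now visible through h_data

cudaFreeHost(h_data);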

Where can I find more details on zero-copy in CUDA? For example, how do I allocate zero-copy memory? The 2.2 beta programming guide doesn’t seem to cover this.

It is all documented very well, although the programming guide calls it mapped memory, not “zero-copy” as Tim refers to it.

Below is a simple program that does vector addition. It is written in Python, but the API calls are similar to the original CUDA API.

#!/usr/bin/env python
# coding: utf-8
# © Arno Pähler, 2007-09

from ctypes import *
from time import time

from cuda_defs import *
from cuda_api import *
from cuda_utils import *
from gpuFunctions import gpuVADD

BLOCK_SIZE = 320
GRID_SIZE  = 1024

## demo zero-copy of CUDA 2.2

def hostAlloc(n,dtype=t_si32):
    ## allocate mapped (zero-copy) host memory; returns the host-side
    ## array and the device-side pointer aliasing the same memory
    flags1 = cudaHostAllocMapped #|cudaHostAllocPortable #|cudaHostAllocWriteCombined
    flags2 = 0
    p = p_void()
    size = n*dtype().itemsize
    c_type = numpy_to_ctypes[dtype]
    cudaHostAlloc(byref(p),size,flags1)
    getLastError()
    r = nc_a((c_type*n).from_address(p.value))
    d = p_void()
    status = cudaHostGetDevicePointer(byref(d),p,flags2)
    getLastError()
    return r,d.value

def main(vlength = 128,loops = 1):
    n2 = vlength ## Vector length
    h_X = (c_float*n2)()
    h_Y = (c_float*n2)()
    h_X,d_X = hostAlloc(n2,t_fp32)
    h_Y,d_Y = hostAlloc(n2,t_fp32)
    h_X.fill(1)
    h_Y.fill(loops)
    print '%6.0f%6.0f' % (h_X[0],h_Y[0]),

    blockDim = dim3(BLOCK_SIZE,1,1)
    gridDim  = dim3(GRID_SIZE,1,1)

    t0 = time()
    cudaThreadSynchronize()
    for i in range(loops):
        cudaConfigureCall(gridDim,blockDim,0,0)
        ## d_Y = d_Y + d_X
        ## note that neither d_Y nor d_X has ever been
        ## set directly: the addition takes place on the
        ## GPU with the data residing in main memory
        gpuVADD(d_X,d_Y,n2)
    cudaThreadSynchronize()
    t0 = time()-t0

    flops = (1.e-9*n2)*float(loops)
    cudaThreadSynchronize()
    ## h_Y (aka d_Y) has been altered
    ## without device-to-host copy
    v2MB = float(vlength)/float(1<<20)
    print '%10.3f%10.3f%8.3f%6.0f%6.0f' % (v2MB,t0,flops/t0,h_X[0],h_Y[0])
    freeHost(h_X)
    freeHost(h_Y)

if __name__ == '__main__':
    import sys
    cudaSetDevice(0)
    cudaSetDeviceFlags(cudaDeviceMapHost) ## enable mapping of host memory
    xmax = 26
    LOOP = 2048
    lmin,lmax = 18,xmax
    if len(sys.argv) > 1:
        lmin = lmax = int(sys.argv[1])
    loopx = -1
    if len(sys.argv) > 2:
        loopx = int(sys.argv[2])
    lmax = min(max(0,lmax),xmax)
    lmin = min(max(0,lmin),lmax)
    if lmin == lmax:
        loopx = LOOP >> (lmin-18)
    for l in range(lmin,lmax+1):
        loops = max(LOOP >> (l-lmin),1)
        vlength = 1 << l
        if loopx > 0:
            loops = loopx
        print '%5d %5d' % (l,loops),
        main(vlength,loops)
    cudaThreadExit()
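For reference, the driver block at the bottom takes an optional log2 vector length and an optional loop count on the command line; assuming the file were saved as, say, vadd_zerocopy.py, a run over 2^20 elements with 256 iterations would be:

python vadd_zerocopy.py 20 256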

Do the MCP7x IGPs use a PCIe bus for interaction with the system memory?

No; since the IGP is part of the chipset and thus sits right next to the MCU, it has direct access to system memory through the MCU.

In that case, does anyone have some numbers on transfer rates?

It’s as fast as reading memory normally… Read over my explanation of copy-elimination on the first page again.