How is 4GB addressable on 32bit?

Hi,

If global memory is a 32-bit addressable space, how can one get to use anything more than 2^32 addresses? When I try, my kernel fails and I get the error 'Memory value too large'. The 4GB of the Tesla is then of no use!
I am trying to create two 1D arrays of 1.5GB each and to address them. I have a Tesla C1060 and am using CUDA 2.3.

Last time I checked, 2^32 = 4,294,967,296, or 4 GiB.

Thanks for rechecking that :) I am sorry if my statement conveyed otherwise, but let me clarify my query further:

I create two 1D float arrays on the GPU. Each thread writes to an element in each of the two arrays; the element is indexed by the position ID (= blockDim.x * blockIdx.x + threadIdx.x). I need 906572 threads to run, with 3553 bytes of global memory per thread (both arrays combined), and I am running them in 2024 blocks of 448 threads each, which gives a total of ~3GB across all the threads. A memalloc of each of these 1D arrays then comes to ~1.5GB on the GPU. What I want to ask is:

Why is this memalloc failing? (Error: out of memory.) If the GPU uses a 32-bit address space, then I should be able to access more than 3GB, and thus there should be no 'out of memory' error.
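Roughly, what I am doing looks like this (a simplified sketch, not the exact code; the array names and the per-thread element size are just illustrative stand-ins that add up to ~1.5GB per array):

// Simplified sketch of the setup described above; names and the exact
// per-thread element size are illustrative, not the real code.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 906572
#define FLOATS_PER_THREAD 444   // ~1776 bytes per thread per array, ~3.5 KB combined

__global__ void writeArrays(float *a, float *b, size_t n)
{
    size_t tid = (size_t)blockDim.x * blockIdx.x + threadIdx.x;   // position ID
    if (tid < n) {
        for (int k = 0; k < FLOATS_PER_THREAD; ++k) {
            a[tid * FLOATS_PER_THREAD + k] = 1.0f;   // this thread's slot in array A
            b[tid * FLOATS_PER_THREAD + k] = 2.0f;   // ...and in array B
        }
    }
}

int main()
{
    const size_t bytes = (size_t)NUM_THREADS * FLOATS_PER_THREAD * sizeof(float); // ~1.5 GB

    float *dA = 0, *dB = 0;
    cudaError_t errA = cudaMalloc((void**)&dA, bytes);
    cudaError_t errB = cudaMalloc((void**)&dB, bytes);   // one of these reports 'out of memory' in my case
    if (errA != cudaSuccess || errB != cudaSuccess) {
        printf("cudaMalloc failed: %s\n",
               cudaGetErrorString(errA != cudaSuccess ? errA : errB));
        return 1;
    }

    writeArrays<<<2024, 448>>>(dA, dB, NUM_THREADS);
    cudaThreadSynchronize();   // CUDA 2.3-era synchronization call

    cudaFree(dA);
    cudaFree(dB);
    return 0;
}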

What platform are you running on? Did you read the release notes?

One issue that springs to my mind immediately is the limitation on the cudaMalloc allocation size in Vista and Server 2008:

See this in context with the full release notes here: http://developer.download.nvidia.com/compu…tes_windows.txt

If you are on a 32-bit system, the same address space will have to cover the host RAM as well…

Host and device pointers are in separate memory spaces. Problem solved.

I.e. both the host and the device have the full 4GB of RAM available. (Well, the host maybe just 3.4 or 3.75GB when you use Windows, but that's a different issue.)
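A minimal sketch of what I mean (buffer names and sizes are made up):

// Minimal sketch (not from the original code): the host buffer and the device
// buffer are separate allocations in separate address spaces.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 * 1024 * 1024;      // 256 MB, purely for illustration

    float *hBuf = (float*)malloc(bytes);         // lives in host (CPU) RAM
    memset(hBuf, 0, bytes);

    float *dBuf = 0;
    cudaMalloc((void**)&dBuf, bytes);            // lives in device (GPU) global memory

    // Data only moves between the two spaces through an explicit copy.
    cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);

    printf("host ptr %p, device ptr %p (not interchangeable)\n", (void*)hBuf, (void*)dBuf);

    cudaFree(dBuf);
    free(hBuf);
    return 0;
}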

Christian

You need to split your computation into many kernels…

Each kernel would calculate F(portion of 1D array, whole of 2nd array).

And then, you need to write a bunch of kernels that would REDUCE the results…
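Something along these lines (just a sketch of the idea; F here is a trivial stand-in and all the names are made up):

// Self-contained sketch of the "split into many kernels, then reduce" idea.
// F is just a stand-in (per-chunk sums); kernel names and sizes are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256

// Stage 1: each launch handles one chunk of A; every thread folds its stride into one partial.
__global__ void chunkKernel(const float *aChunk, size_t count, float *partials)
{
    float s = 0.0f;
    for (size_t i = threadIdx.x; i < count; i += blockDim.x)
        s += aChunk[i];              // stand-in for F(this chunk of A, whole of B); B omitted for brevity
    partials[threadIdx.x] += s;      // one partial per thread, accumulated across chunk launches
}

// Stage 2: a tiny reduction kernel collapses the per-thread partials.
__global__ void reduceKernel(const float *partials, float *result)
{
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < THREADS; ++i)
            s += partials[i];
        *result = s;
    }
}

int main()
{
    const size_t totalElems = 8 * 1024 * 1024;      // stand-in size
    const size_t chunkElems = 1024 * 1024;          // process A in chunks this big

    float *dA, *dPartials, *dResult;
    cudaMalloc((void**)&dA, totalElems * sizeof(float));
    cudaMalloc((void**)&dPartials, THREADS * sizeof(float));
    cudaMalloc((void**)&dResult, sizeof(float));
    cudaMemset(dA, 0, totalElems * sizeof(float));
    cudaMemset(dPartials, 0, THREADS * sizeof(float));

    for (size_t off = 0; off < totalElems; off += chunkElems) {
        size_t count = (totalElems - off < chunkElems) ? totalElems - off : chunkElems;
        chunkKernel<<<1, THREADS>>>(dA + off, count, dPartials);
    }
    reduceKernel<<<1, 1>>>(dPartials, dResult);

    float result;
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    printf("reduced result = %f\n", result);

    cudaFree(dA); cudaFree(dPartials); cudaFree(dResult);
    return 0;
}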

I did not read your posting fully… I think I spoke a bit early.

Consider one big FAT cudaMalloc as opposed to many small cudaMallocs… This will reduce fragmentation of GPU memory.

Later, you can carve this pointer up however you like.
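Something like this, roughly (the sizes are just the ones from this thread, and the dPool/dA/dB names are made up):

// Sketch of one big allocation carved into the two arrays.
#include <cuda_runtime.h>

int main()
{
    const size_t bytesA = (size_t)1536 * 1024 * 1024;   // ~1.5 GB for array A
    const size_t bytesB = (size_t)1536 * 1024 * 1024;   // ~1.5 GB for array B

    char *dPool = 0;
    if (cudaMalloc((void**)&dPool, bytesA + bytesB) != cudaSuccess)
        return 1;                                        // one allocation, one failure point to check

    float *dA = (float*)dPool;                           // first half of the pool
    float *dB = (float*)(dPool + bytesA);                // second half; bytesA is a large power-of-two multiple, so dB stays aligned

    // ... launch kernels with dA and dB exactly as before ...

    cudaFree(dPool);                                     // a single free releases both
    return 0;
}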

I did break one big cudaMalloc down into smaller cudaMallocs and then used an array of these pointers as I needed them, but doing this reduces the instruction throughput of my spin, and the overall speedup gain (compared to the OpenMP-enabled code on a Core Quad) takes a hit.

Thanks for pointing this out.

I am running XP 64-bit with SP2. I read the release notes, and since I am using a Tesla with 4GB, going by

MIN( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE )

I get MIN( (4096 - 512) / 2, 2048 ) = 1792 MB (all calculations in MB), which is still greater than 1676 MB (~1.5 GB). But I think I am skating on thin ice. So long to one large memalloc :(
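For what it's worth, I can at least probe what the runtime thinks is free right before the big allocation goes in, roughly like this (a small sketch; if cudaMemGetInfo isn't available in this toolkit version, the driver API's cuMemGetInfo reports the same numbers):

// Sketch: report free/total device memory, then attempt the ~1.6 GB allocation.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %lu MB, total: %lu MB\n",
           (unsigned long)(freeBytes >> 20), (unsigned long)(totalBytes >> 20));

    const size_t bytes = (size_t)1676 * 1024 * 1024;   // the allocation in question
    float *d = 0;
    cudaError_t err = cudaMalloc((void**)&d, bytes);
    printf("cudaMalloc(%lu MB): %s\n",
           (unsigned long)(bytes >> 20), cudaGetErrorString(err));
    if (err == cudaSuccess)
        cudaFree(d);
    return 0;
}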
