Failure of Fortran unformatted reads of files over 2 GB

We have been writing large binary files of less than 2GB using C++ and reading them successfully with Fortran programs compiled with pgf95 using UNFORMATTED reads on the same Linux computer.

We have now exceeded the 2GB limit (more correctly, 2 to the power 31 bytes) and find that the read fails in the Fortran programs with the message “attempt to read past end of file”.

The C++ program writes the data as described in the following pseudo code:

int nloci = 45000;
int nanis = 3000;
int bytesInArray = nloci * nanis * 8;
double cz (nloci,nanis);

write (bostream, bytesInArray);
write (bostream, cz);
write (bostream, bytesInArray);


The Fortran programs read the file in the following way:

double precision, dimension(:,:), allocatable :: z
integer :: nloci = 45000
integer :: nanis = 3000
allocate(z(nloci,nanis))
open(32,file=infile,status='old',form='unformatted')
read(32) z
close(32)

The Fortran code is compiled with the -Mlarge_arrays option.

The focus of our search for a solution has been the type and value of the first record in the file (i.e., bytesInArray).

If bytesInArray is less than 2 to the power 31, then it can be stored in a variable of type integer and correctly specifies the number of bytes to follow in the binary array.

If bytesInArray is 2 to the power 31 or more, then it will overflow an integer and no longer specify the number of bytes in the array.
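
To make the overflow concrete, here is a minimal C++ sketch that computes the byte count in 64-bit arithmetic and checks whether it still fits in a 4-byte signed int. The helper names arrayBytes and fitsInInt32 are made up for illustration, and 8 bytes per double is assumed, as in the pseudo code above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: size in bytes of an nloci x nanis array of doubles,
// computed in 64-bit arithmetic so the product itself cannot overflow here.
// Assumes 8 bytes per double, as in the pseudo code above.
inline std::int64_t arrayBytes(std::int64_t nloci, std::int64_t nanis) {
    assert(nloci >= 0 && nanis >= 0);
    return nloci * nanis * 8;
}

// Hypothetical helper: true if the byte count can be stored in a 4-byte
// signed int, i.e. it is at most 2^31 - 1 = 2147483647.
inline bool fitsInInt32(std::int64_t nbytes) {
    return nbytes >= 0 &&
           nbytes <= std::numeric_limits<std::int32_t>::max();
}
```

For example, arrayBytes(100000, 3000) is 2400000000, which fitsInInt32 rejects, so a single 4-byte length cannot describe that array.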

What does a Fortran UNFORMATTED read expect to find at the beginning of a file if the file is more than 2 GB?

Any suggestions that will help us solve our problem will be greatly appreciated.

Hi kb64,

The problem is that a FORTRAN variable length unformatted file is presented as a sequence of records, where each record has the layout

<NB>  <DATA>  <NB>

where,
<NB> is an int indicating the number of bytes in <DATA>
<DATA> is a sequence of bytes of data

So, if the record is ‘large’ (>2GB), the size of the record overflows.

To accommodate this situation, we actually split up the record into multiple unformatted records and use <NB> to indicate split records. So your C++ program needs to accommodate this.

Since the maximum value of an int is 2147483647 (0x7fffffff), a record larger than this value must be written in chunks. A continued chunk carries 2147483639 (0x7fffffff-8) bytes of data. The record length before and after those 2147483639 bytes of data has the value (2147483639|0x80000000), i.e., an int with the sign bit on, which indicates that the chunk is continued.
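
As a rough sketch of that arithmetic in C++ (the names kChunkBytes, continuedMarker and continuedChunks are hypothetical, and a two's complement 4-byte int is assumed):

```cpp
#include <cassert>
#include <cstdint>

// Payload of a continued chunk, as described above: 0x7fffffff - 8.
constexpr std::int32_t kChunkBytes = 2147483639;

// Length marker for a continued chunk: the chunk size with the sign bit on.
// Assumes the usual two's complement conversion back to a signed int.
inline std::int32_t continuedMarker() {
    const std::uint32_t bits =
        static_cast<std::uint32_t>(kChunkBytes) | 0x80000000u;
    return static_cast<std::int32_t>(bits);
}

// Number of continued chunks written before the final, ordinary chunk
// for a record of recsize bytes.
inline std::int64_t continuedChunks(std::int64_t recsize) {
    assert(recsize >= 0);
    std::int64_t n = 0;
    while (recsize > kChunkBytes) {
        recsize -= kChunkBytes;
        ++n;
    }
    return n;
}
```

Read as a signed int, the marker 0xfffffff7 is -9, so a reader can tell a continued chunk from an ordinary one by the sign of the length.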

To write a large unformatted record, your C++ file output code should look something like:

int64 recsize                      // number of bytes in the record
char *p                            // pointer to data to be written
int csz = 2147483639 | 0x80000000  // continued-chunk length (sign bit on)
while (recsize > 2147483639) {
   write(bostream, &csz, 4)        // write 4 bytes
   write(bostream, p, 2147483639)  // write 2147483639 bytes of p
   write(bostream, &csz, 4)
   p += 2147483639
   recsize -= 2147483639
}

int last = recsize                 // the remainder now fits in an int
write(bostream, &last, 4)
write(bostream, p, last)
write(bostream, &last, 4)
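
For reference, the loop above could be fleshed out along the following lines. This is only a sketch under some assumptions: the function names (writeMarker, writeLargeRecord, recordToString, readMarker) are made up for illustration, the markers are written in the host byte order, and the chunk parameter exists only so the splitting can be tried on small data. For a real file it should stay at its default of 2147483639.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>

// Writes one 4-byte length marker in host byte order.
inline void writeMarker(std::ostream& out, std::int32_t marker) {
    out.write(reinterpret_cast<const char*>(&marker), sizeof marker);
}

// Writes recsize bytes starting at p as one logical unformatted record,
// split into chunks as described above. Continued chunks are framed by
// (chunk | 0x80000000); the final chunk is framed by its plain length.
inline void writeLargeRecord(std::ostream& out, const char* p,
                             std::int64_t recsize,
                             std::int32_t chunk = 2147483639) {
    assert(recsize >= 0 && chunk > 0);
    const std::int32_t cont = static_cast<std::int32_t>(
        static_cast<std::uint32_t>(chunk) | 0x80000000u);
    while (recsize > chunk) {
        writeMarker(out, cont);
        out.write(p, chunk);
        writeMarker(out, cont);
        p += chunk;
        recsize -= chunk;
    }
    const std::int32_t last = static_cast<std::int32_t>(recsize);
    writeMarker(out, last);
    out.write(p, last);
    writeMarker(out, last);
}

// Test helper: returns the bytes that would be written, for inspection.
inline std::string recordToString(const char* p, std::int64_t recsize,
                                  std::int32_t chunk) {
    std::ostringstream os;
    writeLargeRecord(os, p, recsize, chunk);
    return os.str();
}

// Test helper: reads the 4-byte marker stored at the given offset.
inline std::int32_t readMarker(const std::string& bytes, std::size_t offset) {
    std::int32_t marker = 0;
    std::memcpy(&marker, bytes.data() + offset, sizeof marker);
    return marker;
}
```

When writing to disk, open the std::ofstream with std::ios::binary and pass the array as reinterpret_cast<const char*>(cz) together with its size in bytes held in a 64-bit integer.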

Hope this helps,
Mat