On the very top of the Zip-Ada to-do list, for a while, was the support for archive containing large (more than 4 GiB data) or numerous (more than 65535) files – the so-called "Zip64" extensions.
It's now gone from the to-do list and done in the project.
The Zip archive format was defined with 32-bit file sizes and 16-bit number of files. At the time (1989) it was a perfectly sound choice, especially given that the software, PKZIP, was written for 16-bit PC's running on MS-DOS. At some point, these limits began to be problematic for archiving or backup tasks. Although the original Zip format is still the most widely used (including many file types without the .zip name extension, such as .docx, .xlsx, .pptx, .jar), the limits become, slowly, more and more frequently an issue over time. Consequently, PKWARE (the firm behind PKZIP) has introduced, around year 2000, a set of format extensions to overcome those limitations. Given the omnipresence of Zip archives and the difficulty to innovate in software, PKWARE did not opt for a fresh, new, simple, but incompatible format. Instead, they designed a set of extensions, allowed by the flexibility of the Zip format. This seems clumsy and over-engineered, but the design is actually quite clever: a program can create a Zip archive, without size limitations, and without checking in advance whether the Zip64 extensions are needed or not. When the archive is finished, either the Zip64 extensions were not needed and the archive is conform to the year 1989 original format, or they were – then, an unpacker for that archive file has to know about Zip64. Thus, the Zip64 design allowed for a progressive adoption of the new format, since the actual need went progressively from very rare to a bit less rare.
How to implement Zip64 in a new software? There are a few options, which you can combine:
1. Follow the documentation, appnote.txt
2. Take inspiration from some open-source software like Info-Zip or 7-Zip
3. Reverse-engineer data (hex dumps)
All three are very time-consuming (especially, the documentation, aimed at maintaining the openness of the Zip format, is "legally" correct but lacks some indications and practical details).
Even widespread, commercial software (with paying customers and many paid programmers behind it) does not support Zip64 (e.g. Microsoft Word for .docx), or took almost two decades to do so (Microsoft Windows for instance). It reveals something about the time and costs related to the implementation…
Fortunately, a very helpful person called Yaakov not only did an implementation from scratch, but also was kind enough to share his experience in an explanatory and colorful way, with a simple example as an illustration. The article is available here: ZIP64 - Go Big Or Go Home.
I show here (with permission) the key diagram of his article:
|Click to enlarge|
The Zip64 extensions are in dark background and white letters.
The part with the archived (eventually compressed) files is in pink and brown. In that example, there is only one file in the archive.
As you see, an implementation of Zip64 may be quite tricky, but it is feasible, if we have "the big picture" before our eyes.
Alas, things are a little bit more complicated than it appears when reading the article too fast...
We want, with Zip-Ada, to decode and unpack correctly archives not only made by Zip-Ada, but also by Info-Zip's Zip, 7-Zip, WinZip, and so on (the converse is also true: we want the other guys to be able to unpack our Zip-Ada archives).
For that reason, we need to anticipate all possible ways the data headers have been written, not only our choices. In that respect, we have had some headache with the per-entry "size" Zip64 extensions (the parts in brown and dark green in the chart).
The documentation (4.5.3) describes, for that, a record with a variable number of values:
8 bytes [A] Original uncompressed file size
8 bytes [B] Size of compressed data
8 bytes [C] Offset of local header record
4 bytes [D] Number of the disk on which this file starts
and defines in plain English some rules about the order and in which circumstances the values are present or not.
The documentation requires both [A] and [B] for the Local Header (part in brown in the chart). Let us figure out the explanation to that rule. For a program that creates the archive, it would not be practical to put only the uncompressed size before compressing the data, because the program would have to shift all the compressed content by 8 bytes in case the compressed size (which is known only after the compression) happened to also exceed 2**32 - 2. Hence, exactly and always two values if the uncompressed size (which is usually known in advance) exceeds the limit.
When it is time for the archiving program to write the Central Directory (the part that concludes the archive, in green, light and dark, in the chart), all information about sizes is known. An archiving program may choose to put only the necessary values in 64 bits. However, traditionally, you would expect that the variable record's contents are self-describing and would contain [A], or: [A] and [B], or: [A] and [B] and [C], or: [A] and [B] and [C] and [D], given the record's size.
With Zip64, it is not the case: each field is optional. This micro-optimization is ridiculous given the fact that the size of archive is breaking the 4 GiB limit. To make things worse, the documentation is not clear about that full optionality.
An archiver (7-Zip does so for instance) may put just [A], the uncompressed size of one first large file that is well compressible, or just [C], the offset of one small file following a file with a large size in the archive. Or it could put [A], [B], [C] in both cases (as we do for Zip-Ada's archive creation). So, we've adapted the archive reader to take all possible cases into consideration.
Thanks to Nicolas Brunot for pointing that (hopefully) ultimate difficulty with Zip64.
Link 1: https://github.com/zertovitch/zip-ada/
With git: git clone https://github.com/zertovitch/zip-ada.git
As Zip-ball: green button "Code", choice: "Download ZIP"
Link 2: https://sourceforge.net/p/unzip-ada/code/HEAD/tree/
With subversion: svn checkout https://svn.code.sf.net/p/unzip-ada/code/ za
As Zip-ball: button: "Download Snapshot"