Zip-Ada: Zip64 extensions

By: [email protected] (GdM)

1 June 2022 at 18:47

At the very top of the Zip-Ada to-do list, for a while, was the support for archives containing large (more than 4 GiB of data) or numerous (more than 65535) files: the so-called "Zip64" extensions.

It's now gone from the to-do list and done in the project.

The Zip archive format was defined with 32-bit file sizes and a 16-bit number of files. At the time (1989) it was a perfectly sound choice, especially given that the software, PKZIP, was written for 16-bit PCs running MS-DOS. At some point, these limits began to be problematic for archiving or backup tasks. Although the original Zip format is still the most widely used (including many file types without the .zip name extension, such as .docx, .xlsx, .pptx, .jar), the limits have slowly become an issue more and more often. Consequently, PKWARE (the firm behind PKZIP) introduced, around the year 2000, a set of format extensions to overcome those limitations.

Given the omnipresence of Zip archives and the difficulty of innovating in software, PKWARE did not opt for a fresh, new, simple, but incompatible format. Instead, they designed a set of extensions, allowed by the flexibility of the Zip format. This seems clumsy and over-engineered, but the design is actually quite clever: a program can create a Zip archive, without size limitations, and without checking in advance whether the Zip64 extensions are needed or not. When the archive is finished, either the Zip64 extensions were not needed and the archive conforms to the original 1989 format, or they were needed, and then an unpacker for that archive file has to know about Zip64. Thus, the Zip64 design allowed for a progressive adoption of the new format, as the actual need went progressively from very rare to a bit less rare.

How do you implement Zip64 in new software? There are a few options, which you can combine:

1.   Follow the documentation, appnote.txt
2.   Take inspiration from some open-source software like Info-Zip or 7-Zip
3.   Reverse-engineer data (hex dumps)

All three are very time-consuming (in particular, the documentation, which aims at maintaining the openness of the Zip format, is "legally" correct but lacks some indications and practical details).
Even widespread, commercial software (with paying customers and many paid programmers behind it) does not support Zip64 (e.g. Microsoft Word for .docx), or took almost two decades to do so (Microsoft Windows for instance). It reveals something about the time and costs related to the implementation…

Fortunately, a very helpful person called Yaakov not only did an implementation from scratch, but also was kind enough to share his experience in an explanatory and colorful way, with a simple example as an illustration. The article is available here: ZIP64 - Go Big Or Go Home.
I show here (with permission) the key diagram of his article:

The Zip64 extensions are in dark background and white letters.
The part with the archived (possibly compressed) files is in pink and brown. In that example, there is only one file in the archive.

As you see, an implementation of Zip64 may be quite tricky, but it is feasible, if we have "the big picture" before our eyes.

Alas, things are a little bit more complicated than they appear when reading the article too fast...

We want, with Zip-Ada, to decode and unpack correctly archives not only made by Zip-Ada, but also by Info-Zip's Zip, 7-Zip, WinZip, and so on (the converse is also true: we want the other guys to be able to unpack our Zip-Ada archives).
For that reason, we need to anticipate all possible ways the data headers may have been written, not only our own choices. In that respect, we have had some headaches with the per-entry "size" Zip64 extensions (the parts in brown and dark green in the chart).
The documentation (4.5.3) describes, for that, a record with a variable number of values:

8 bytes    [A] Original uncompressed file size
8 bytes    [B] Size of compressed data
8 bytes    [C] Offset of local header record
4 bytes    [D] Number of the disk on which this file starts

and defines in plain English some rules about the order and in which circumstances the values are present or not.

The documentation requires both [A] and [B] for the Local Header (part in brown in the chart). Let us figure out the explanation to that rule. For a program that creates the archive, it would not be practical to put only the uncompressed size before compressing the data, because the program would have to shift all the compressed content by 8 bytes in case the compressed size (which is known only after the compression) happened to also exceed 2**32 - 2. Hence, exactly and always two values if the uncompressed size (which is usually known in advance) exceeds the limit.

When it is time for the archiving program to write the Central Directory (the part that concludes the archive, in green, light and dark, in the chart), all information about sizes is known. An archiving program may choose to put only the necessary values in 64 bits. However, traditionally, you would expect the variable record's contents to be self-describing, that is, given the record's size, to contain [A], or [A] and [B], or [A], [B] and [C], or [A], [B], [C] and [D].
With Zip64, that is not the case: each field is optional. This micro-optimization is ridiculous given that the size of the archive is breaking the 4 GiB limit. To make things worse, the documentation is not clear about that full optionality.
An archiver (7-Zip does so, for instance) may put just [A], the uncompressed size of a first large file that compresses well, or just [C], the offset of a small file following a large file in the archive. Or it could put [A], [B], [C] in both cases (as we do for Zip-Ada's archive creation). So, we've adapted the archive reader to take all possible cases into consideration.
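
The reading rule can be sketched as follows, in standard Ada. This is only a minimal illustration, not Zip-Ada's actual code: the names Decode, Entry_Sizes and Byte_Array are hypothetical. It assumes the rule of appnote.txt's section 4.5.3 that a 64-bit value is present in the record only when the corresponding 32-bit field (16-bit for the disk number) of the classic header holds the all-ones marker, always in the order [A], [B], [C], [D]. Payload is assumed to be the content of the Zip64 record without its 4-byte header (ID and size).

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Zip64_Extra_Demo is

  type Byte_Array is array (Positive range <>) of Unsigned_8;

  Marker_32 : constant Unsigned_32 := 16#FFFF_FFFF#;
  Marker_16 : constant Unsigned_16 := 16#FFFF#;

  --  Hypothetical record holding the final values for one entry.
  type Entry_Sizes is record
    Uncompressed : Unsigned_64;  --  [A]
    Compressed   : Unsigned_64;  --  [B]
    Offset       : Unsigned_64;  --  [C]
    Disk         : Unsigned_32;  --  [D]
  end record;

  procedure Decode
    (Size_U_32 : in  Unsigned_32;
     Size_C_32 : in  Unsigned_32;
     Offset_32 : in  Unsigned_32;
     Disk_16   : in  Unsigned_16;
     Payload   : in  Byte_Array;
     Result    : out Entry_Sizes)
  is
    Pos     : Positive := Payload'First;
    Disk_64 : Unsigned_64;

    --  Read Count bytes, little-endian, and advance the position.
    procedure Take (Count : in Positive; Value : out Unsigned_64) is
    begin
      if Pos + Count - 1 > Payload'Last then
        raise Constraint_Error with "Zip64 record is too short";
      end if;
      Value := 0;
      for I in reverse 0 .. Count - 1 loop
        Value := Shift_Left (Value, 8) or Unsigned_64 (Payload (Pos + I));
      end loop;
      Pos := Pos + Count;
    end Take;

  begin
    --  Start with the values of the classic header...
    Result :=
      (Uncompressed => Unsigned_64 (Size_U_32),
       Compressed   => Unsigned_64 (Size_C_32),
       Offset       => Unsigned_64 (Offset_32),
       Disk         => Unsigned_32 (Disk_16));
    --  ...then replace only the values flagged by a marker.
    if Size_U_32 = Marker_32 then
      Take (8, Result.Uncompressed);
    end if;
    if Size_C_32 = Marker_32 then
      Take (8, Result.Compressed);
    end if;
    if Offset_32 = Marker_32 then
      Take (8, Result.Offset);
    end if;
    if Disk_16 = Marker_16 then
      Take (4, Disk_64);
      Result.Disk := Unsigned_32 (Disk_64);
    end if;
  end Decode;

  R : Entry_Sizes;

begin
  --  Case "just [C]": a small file whose local header is at 5 GiB.
  Decode
    (Size_U_32 => 1_000,
     Size_C_32 => 400,
     Offset_32 => Marker_32,
     Disk_16   => 0,
     Payload   => (16#00#, 16#00#, 16#00#, 16#40#,
                   16#01#, 16#00#, 16#00#, 16#00#),
     Result    => R);
  Ada.Text_IO.Put_Line ("Offset:" & Unsigned_64'Image (R.Offset));
end Zip64_Extra_Demo;
```

With this approach the reader never relies on the record's size to guess which fields are present: only the markers in the classic header decide.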

Thanks to Nicolas Brunot for pointing out that (hopefully) ultimate difficulty with Zip64.

Latest sources:

Link 1: https://github.com/zertovitch/zip-ada/
With git: git clone https://github.com/zertovitch/zip-ada.git
As Zip-ball: green button "Code", choice: "Download ZIP"

Link 2: https://sourceforge.net/p/unzip-ada/code/HEAD/tree/
With subversion: svn checkout https://svn.code.sf.net/p/unzip-ada/code/ za
As Zip-ball: button: "Download Snapshot"

QOI, the Quite OK Image Format, added to GID, the Generic Image Decoder

Gautier's blog

By: [email protected] (GdM)

28 February 2022 at 22:51

QOI (the Quite OK Image Format, home page here) is a very simple raster image format with a decent lossless compression and an extremely good performance - a direct consequence of its simplicity and its compression features.

It was clear at first sight of that format that it urgently needed to be added to GID, the Generic Image Decoder 😎. GID is free, open-source, available on SourceForge here and GitHub here.

Here, a few examples of QOI test images decoded by GID (with its default background for transparency):

The cool ideas in the QOI format are:

- a "moving palette": a list of recently shown colours that is updated during the encoding or decoding of the pixels; the indexing is done with a hash function
- a shortcut encoding for slightly different colours from a pixel to the next
- a shortcut encoding for slightly different brightness from a pixel to the next.

Transparency (the alpha channel), in the form of levels from 0 to 255, is supported.

Add to it a run-length encoding that suits surfaces with identical colours and transparency, and the format remains so simple that you can squeeze its detailed specification onto a single, readable, A4 page!
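
As an illustration of the "moving palette", here is a small sketch in Ada of the indexing. It is not GID's code, and the names (Pixel, Hash, Palette) are hypothetical. The hash formula is the one given in the QOI specification, (r * 3 + g * 5 + b * 7 + a * 11) mod 64, for a palette of 64 entries that starts filled with zeros.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure QOI_Hash_Demo is

  type Pixel is record
    R, G, B, A : Unsigned_8;
  end record;

  subtype Palette_Index is Natural range 0 .. 63;

  --  Position of a colour in the moving palette (QOI specification).
  function Hash (P : Pixel) return Palette_Index is
  begin
    return
      (Natural (P.R) * 3 + Natural (P.G) * 5 +
       Natural (P.B) * 7 + Natural (P.A) * 11) mod 64;
  end Hash;

  Palette : array (Palette_Index) of Pixel := (others => (0, 0, 0, 0));

  Red : constant Pixel := (255, 0, 0, 255);

begin
  --  Encoder and decoder both store each pixel they see at its slot.
  Palette (Hash (Red)) := Red;
  Ada.Text_IO.Put_Line ("Slot:" & Integer'Image (Hash (Red)));
  --  Later, a pixel found again at its slot can be coded as an index.
  if Palette (Hash (Red)) = Red then
    Ada.Text_IO.Put_Line ("Palette hit");
  end if;
end QOI_Hash_Demo;
```

Since both sides update the palette in the same way, only the 6-bit index needs to be stored when a colour is found again.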

A curiosity with LZMA data compression

Gautier's blog

By: [email protected] (GdM)

10 August 2021 at 09:42

Uncompressed file: 1'029'744 bytes.

Compressed size (excluding Zip or 7z archive metadata; data is not preprocessed):

Bytes     Compressed / Uncompressed ratio   Format   Software
172'976   16.80%                            PPMd     7-Zip 21.02 alpha
130'280   12.65%                            BZip2    Zip 3.0
119'327   11.59%                            BZip2    7-Zip 21.02 alpha
 61'584    5.98%                            LZMA     Zip-Ada v.57
 50'398    4.89%                            LZMA2    7-Zip 21.02 alpha
 50'396    4.89%                            LZMA     7-Zip 21.02 alpha
 42'439    4.12%                            LZMA     Zip-Ada v.58 (preview)
 41'661    4.05%                            LZMA     Zip-Ada (current research branch)

Conclusion: Zip-Ada (current research branch) compresses that data 17.3% better than 7-Zip v.21.02 (41'661 bytes versus 50'396 bytes for LZMA, that is 1 - 41'661 / 50'396 ≈ 17.3%)!

The file (zipped to its smallest compressed size, 4.05%) can be downloaded here. It is part of the old Canterbury corpus benchmark file collection (file name: kennedy.xls).

Please don't draw any conclusion: the test data is a relatively small, special binary file with lots of redundancy.
But that result is a hint that some more juice can be extracted from the LZMA format.

The open-source Zip-Ada project can be found here and here.

The HAC scripts invasion (follow-up)

Gautier's blog

By: [email protected] (GdM)

26 July 2021 at 18:18

As a follow-up to another post about converting bash (Linux) or cmd (Windows) scripts to HAC scripts, here is a fresh example.

I needed to improve an existing cmd script for benchmarking compression software, with the possibility of switching various subsets separately: that is re-run this subset of methods, or that other subset, or the full tests (long!), etc.

Of course, if you use a real language instead of a command-line interpreter's language, there is an obvious solution: you can define a set and programmatically flip the membership switches.

In Ada, it looks like this:

type Category is (
    Reduce_Shrink,
    Deflate,
    Deflate_External,
    BZip2_External,
    PPMd_External,
    LZMA_7z,
    LZMA_3,
    TAR,
    Preselection
);

cat_set : array (Category) of Boolean;
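
To make the idea concrete, here is a minimal, self-contained sketch in standard Ada showing how the switches can be flipped and used. It is not the actual bench.adb script: the procedure name and the chosen subsets are made up for the illustration.

```ada
with Ada.Text_IO;

procedure Category_Set_Demo is

  type Category is
    (Reduce_Shrink, Deflate, Deflate_External, BZip2_External,
     PPMd_External, LZMA_7z, LZMA_3, TAR, Preselection);

  --  The set: one membership switch per category, all off at start.
  cat_set : array (Category) of Boolean := (others => False);

begin
  --  Re-run only the LZMA-related subsets.
  cat_set (LZMA_7z) := True;
  cat_set (LZMA_3)  := True;
  --  Changed our mind: flip one switch.
  cat_set (LZMA_3) := not cat_set (LZMA_3);

  for c in Category loop
    if cat_set (c) then
      Ada.Text_IO.Put_Line ("Running tests for " & Category'Image (c));
    end if;
  end loop;
end Category_Set_Demo;
```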

The good news is that you can run an Ada program exactly like a script by using HAC (the HAC Ada Compiler). That is, it runs immediately, and HAC doesn't leave behind .ali, .o, .bexch, .tmp or .exe files, which are too much clutter for the sake of running a small script-like job.

Below are screenshots of the quick development of bench.adb using the LEA editor, where you can press F4 to check for possible errors. If there is one, you are taken instantly to the offending line and column.

This script is part of the Zip-Ada project and is very helpful for developing and testing new compression methods.

Some research with LZMA...

Gautier's blog

By: [email protected] (GdM)

28 November 2020 at 19:54

A rare case where Zip-Ada's LZMA encoder is much better than LZMA SDK's. Rare but still interesting, and with standard LZMA parameters (no specific tuning for that file):

The compressed size with current revision (rev.#882) of Zip-Ada is slightly worse (42,559 bytes).

The file is part of the classic Canterbury Corpus compression benchmark data set.

HAC v.0.075: time functions: goodies for scripting tasks

Gautier's blog

By: [email protected] (GdM)

20 October 2020 at 19:44

Today, HAC has a few more functions, from Ada.Calendar. I have added them in order to translate a Windows cmd script to Ada (with HAC_Pack). More precisely, it's "save.cmd", which takes a snapshot of the sources of the HAC system and other key files. This snapshot is a Zip archive and has a time stamp in its name, like "hac-2020-10-20-20-27-36-.zip". Hence the addition of standard functions like Year, Month, etc. The script is very practical for making backups between commits via subversion or git, and for other purposes.

Now the script is called "save.adb" and does the same job, not only on Windows, but also on Linux and other operating systems. Since the Zip compression is also programmed in Ada (Zip-Ada), you have in that script example a cool situation of Ada invading your computer 😊.
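
As an illustration, a time-stamped file name of that kind can be built with Ada.Calendar as sketched below. This is standard Ada written for this explanation, not the contents of save.adb; the names Time_Stamp_Demo and Two_Digits are hypothetical, and the "hac-" prefix and "-.zip" suffix simply mimic the example name quoted above.

```ada
with Ada.Calendar;
with Ada.Text_IO;

procedure Time_Stamp_Demo is
  use Ada.Calendar;

  --  Two-digit image with a leading zero, e.g. 7 -> "07".
  function Two_Digits (N : Natural) return String is
    S : constant String := Integer'Image (N + 100);  --  e.g. " 107"
  begin
    return S (S'Last - 1 .. S'Last);
  end Two_Digits;

  Now   : constant Time := Clock;
  Y     : Year_Number;
  Mo    : Month_Number;
  D     : Day_Number;
  Secs  : Day_Duration;
  Total : Natural;  --  Whole seconds elapsed since midnight

begin
  Split (Now, Y, Mo, D, Secs);
  --  Truncate to whole seconds and stay below 24:00:00.
  Total :=
    Natural'Min
      (86_399, Natural (Long_Float'Floor (Long_Float (Secs))));
  declare
    Year_Img : constant String := Integer'Image (Y);  --  e.g. " 2020"
    Stamp    : constant String :=
      Year_Img (Year_Img'First + 1 .. Year_Img'Last) & '-' &
      Two_Digits (Mo) & '-' &
      Two_Digits (D) & '-' &
      Two_Digits (Total / 3600) & '-' &
      Two_Digits ((Total / 60) mod 60) & '-' &
      Two_Digits (Total mod 60);
  begin
    Ada.Text_IO.Put_Line ("hac-" & Stamp & "-.zip");
  end;
end Time_Stamp_Demo;
```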

Here, a screenshot of the added functions running from the LEA editor (hum, also a full Ada software, by the way!):

HAC 0.075 running from LEA.

More to come soon with some subprograms stemming from Ada.Directories.

HAC is free and open-source, you can find it here and here.

Zip-Ada v.57

Gautier's blog

By: [email protected] (GdM)

3 October 2020 at 20:55

New in v.57 [rev. 799]:

- UnZip: fixed bad decoding case for the Shrink (LZW) format,
    on some data compressed only by PKZIP up to v.1.10,
    release date 1990-03-15.
- Zip.Create: added Zip_Entry_Stream_Type for doing output
    streaming into Zip archives.
- Zip.Compress: Preselection method detects Audacity files (.aup, .au)
    and compresses them better.

***

Zip-Ada is a pure Ada library for dealing with the Zip compressed
archive file format. It supplies:
- compression with the following sub-formats ("methods"):
    Store, Reduce, Shrink (LZW), Deflate and LZMA
- decompression for the following sub-formats ("methods"):
    Store, Reduce, Shrink (LZW), Implode, Deflate, Deflate64,
    BZip2 and LZMA
- encryption and decryption (portable Zip 2.0 encryption scheme)
- unconditional portability - within limits of compiler's provided
    integer types and target architecture capacity
- input archive to decompress can be any kind of indexed data stream
- output archive to build can be any kind of indexed data stream
- input data to compress can be any kind of data stream
- output data to extract can be any kind of data stream
- cross format compatibility with the most various tools and file formats
    based on the Zip format: 7-zip, Info-Zip's Zip, WinZip, PKZip,
    Java's JARs, OpenDocument files, MS Office 2007+,
    Google Chrome extensions, Mozilla extensions, E-Pub documents
    and many others
- task safety: this library can be used ad libitum in parallel processing
- endian-neutral I/O

***

Main site & contact info:
http://unzip-ada.sf.net
Project site & subversion repository:
https://sf.net/projects/unzip-ada/
GitHub clone with git repository:
https://github.com/zertovitch/zip-ada

Enjoy!

AZip 2.40 - Windows Explorer context menus

Gautier's blog

By: [email protected] (GdM)

3 October 2020 at 18:00

New release (2.40) of AZip.

The long-awaited Windows Explorer integration is there:


Context menu for a file

Context menu for a folder

This integration is activated upon installation or on demand via the Manage button:

Configuration

This new version is based on the Zip-Ada library v.57 and includes its recent developments.

Enjoy!

Zip-Ada for Audacity backups

Gautier's blog

By: [email protected] (GdM)

22 September 2020 at 18:21

Audacity is a free, open source, audio editor, available here.

If you want to back up your Audacity project, you can manually do it with "Save Lossless Copy of Project..." with the name, say, X, which will create X.aup (project file), a folder X_data, and, in there, a file called "Audio Track.wav".

Some drawbacks:

- It is a manual operation.
- It is blocked during playback.
- Envelopes are applied to the "Audio Track.wav" data. So the data is altered and is no longer a real lossless copy of the project. Actually this operation is something between a backup and an export of the project to a foreign format.

A solution: Zip-Ada.

The latest commit (rev. 796) adds to the Preselection method a specific configuration for detecting Audacity files, so they are compressed better than with default settings.

Funny detail: that configuration makes, in most cases, the compression better than the best available compression with 7-Zip (v.19.00, "ultra" mode, .7z archive).

The compressing process is also around twice as fast as 7-Zip in "ultra" mode. This is no magic, since the "LZ" part of the LZMA compression scheme spends less time finding matches, in the chosen configuration for Zip-Ada.

A backup script could look like this (here for Windows' cmd):

rem --------------------------
rem Nice date YYYY-MM-DD_HH.MM
rem Assumes that %date% ends with DD?MM?YYYY (locale-dependent)
rem --------------------------

set year=%date:~-4,4%

set month=%date:~-7,2%
if "%month:~0,1%" equ " " set month=0%month:~1,1%

set day=%date:~-10,2%
if "%day:~0,1%" equ " " set day=0%day:~1,1%

set hour=%time:~0,2%
if "%hour:~0,1%" equ " " set hour=0%hour:~1,1%

set min=%time:~3,2%

set nice_date=%year%-%month%-%day%_%hour%.%min%

rem --------------------------

set audacity_project=The Cure - A Forest

zipada -ep2 "%audacity_project%_%nice_date%" "%audacity_project%.aup" "%audacity_project%_data\e08\d08\*.au"

AZip in action for duplicating a Thunderbird profile

Gautier's blog

By: [email protected] (GdM)

18 September 2020 at 07:16

You want to copy your Thunderbird profile from machine A to machine B (with all mail accounts, passwords, settings, feeds, newsgroups, ...)? Actually it is very easy. From the user storage (on Windows, %appdata%; you get there by pressing Windows key+R and typing %appdata%), you copy the entire Thunderbird folder of machine A to the equivalent location on machine B, and that's it. The new active profile will be automatically selected since the file profiles.ini will be overwritten on the way.

Now, if you want or need to use a cloud drive or a USB stick for the operation, it's better to wrap everything in a Zip file (a single file instead of hundreds) to save time. Plus, you can store the Zip file in case of an emergency (losing data on both A and B machines).

With AZip, it's pretty easy:

1.   Shut down Thunderbird on both machines.
2.   On machine A: drag & drop the Thunderbird folder on an empty AZip window.
3.   Copy or move the Zip file.
4.   On machine B: extract everything with another drag & drop, from AZip to the Explorer window with the %appdata% path. When asked "Use archive's folder names for output", say "Yes". When asked "Do you want to replace this file ?", say "All".

That's it!

Here are a few screenshots:

Folder tree view

You can squeeze the data to a smaller size (the LZMA format will be most of the time chosen over Deflate) with the "Recompress" button (third from the right).

After recompression

Zip-Ada: the new Zip_Entry_Stream output stream

Gautier's blog

By: [email protected] (GdM)

2 September 2020 at 13:05

The latest addition to Zip-Ada (commits #792 to 794) is the possibility of writing contents to a Zip file (or more generally, a Zip stream) as an output stream. You (the programmer) don't need to store contents into some buffer and design an input stream of the Zip_Streams.Root_Zipstream_Type'Class type to read that buffer, as is the case for Add_Stream in the Zip.Create package.

How does the new output stream work in practice? The best way is to show an example. Here is a reduced version of Test_Zip_Entry_Stream (you can find the full version in the test directory of the Zip-Ada project's sources):

      with Zip.Create;
      with Ada.Command_Line, Ada.Text_IO;

      procedure Test_Zip_Entry_Stream is
        use Zip.Create;
        use Ada.Command_Line, Ada.Text_IO;

        Archive_Info  : Zip_Create_Info;
        Archive_File  : aliased Zip_File_Stream;
        Archive_Entry : aliased Zip_Entry_Stream_Type;
        Text : File_Type;
      begin
        Create_Archive (Archive_Info, Archive_File'Unchecked_Access, "test_zes.zip");
        for I in 1 .. Argument_Count loop
          Open (Archive_Entry);
          Open (Text, In_File, Argument (I));
          while not End_Of_File (Text) loop
            String'Write (Archive_Entry'Access, Get_Line (Text));
            Character'Write (Archive_Entry'Access, ASCII.LF);  --  UNIX end-of-line
          end loop;
          Close (Text);
          Close (Archive_Entry, "zes_" & Argument (I), use_clock, Archive_Info);
        end loop;
        Finish (Archive_Info);
      end Test_Zip_Entry_Stream;

Enjoy!

NB: this addition has been sponsored. It is used in industrial software (robotics).

The open-source Zip-Ada project can be found at the following places:

AZip 2.38

Gautier's blog

By: [email protected] (GdM)

1 August 2020 at 07:21

AZip can now install itself (if requested)!

AZip is a free, open-source Zip Archive Manager.

You can download the Windows version from here: https://azip.sourceforge.io/

Using Ada LZMA to compress and decompress LZMA files

blog.vacs.fr - Tag Ada

By: Stephane Carrez

16 December 2015 at 10:25

Setup of Ada LZMA binding

First download the Ada LZMA binding at http://download.vacs.fr/ada-lzma/ada-lzma-1.0.0.tar.gz or at [email protected]:stcarrez/ada-lzma.git, then configure, build and install the library with the following commands:

./configure
make
make install

After these steps, you are ready to use the binding and you can add the following line at the beginning of your GNAT project file:


with "lzma";

Import Declaration

To use the Ada LZMA packages, you will first import the following packages in your Ada source code:


with Lzma.Base;
with Lzma.Container;
with Lzma.Check;

LZMA Stream Declaration and Initialization

The liblzma library uses the lzma_stream type to hold and control the data for the lzma operations. The lzma_stream must be initialized at the beginning of the compression or decompression and must be kept until the compression or decompression is finished. To use it, you must declare the LZMA stream as follows:


Stream  : aliased Lzma.Base.lzma_stream := Lzma.Base.LZMA_STREAM_INIT;

Most of the liblzma functions return a status value of type lzma_ret, so you may declare a result variable like this:


Result : Lzma.Base.lzma_ret;

Initialization of the lzma_stream

After the lzma_stream is declared, you must configure it either for compression or for decompression.

Initialize for compression

To configure the lzma_stream for compression, you will use the lzma_easy_encoder function. The Preset parameter controls the compression level. Higher values provide better compression but are slower and require more memory for the program.


Result := Lzma.Container.lzma_easy_encoder (Stream'Unchecked_Access, Lzma.Container.LZMA_PRESET_DEFAULT,
                                            Lzma.Check.LZMA_CHECK_CRC64);
if Result /= Lzma.Base.LZMA_OK then
  Ada.Text_IO.Put_Line ("Error initializing the encoder");
end if;

Initialize for decompression

For the decompression, you will use the lzma_stream_decoder:


Result := Lzma.Container.lzma_stream_decoder (Stream'Unchecked_Access,
                                              Long_Long_Integer'Last,
                                              Lzma.Container.LZMA_CONCATENATED);

Compress or decompress the data

The compression and decompression are done by the lzma_code function, which is called several times until it returns the LZMA_STREAM_END code. Set up the stream's 'next_out', 'avail_out', 'next_in' and 'avail_in' fields and call the lzma_code operation with the action (Lzma.Base.LZMA_RUN or Lzma.Base.LZMA_FINISH):


Result := Lzma.Base.lzma_code (Stream'Unchecked_Access, Action);

Release the LZMA stream

Close the LZMA stream:


    Lzma.Base.lzma_end (Stream'Unchecked_Access);

Sources

To better understand and use the library, use the source, Luke.
