AZip, GWindows, the Windows API and another surprise!

 

Note for subscribers: if you are interested in my Ada programming articles only, you can use this RSS feed link.


The last post was about an improvement to the GWindows.Common_Controls.Ex_List_View widget, which is distributed with the GWindows framework. Without changing the specification, it was possible to avoid Windows API calls during the performance-sensitive comparison function, leading to impressive speedup factors: from 3.3x with 4096 items to 16.5x with 32768 items.

But actually, it was only the prelude!

If you can live with data duplication, which is not nice and not always practical, it is possible to do much better. Concretely, you store the same strings in the cells of the List_View widget and in the payload data associated with each row. Then, the sorting for all columns (text, dates, numbers, ...) uses the payload data. One addition to the Ex_List_View package later (this time, enriching the specification a bit), the comparison function can use exclusively the payload of both compared rows, directly, thus avoiding any API call during the lifespan of the comparison function.
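To make the idea concrete, here is a minimal sketch in plain Ada, independent of GWindows. The names Row_Payload, Column, Sort_Column and Before are hypothetical and are not taken from the Ex_List_View package; the standard Ada.Containers.Vectors sort stands in for the sorting done by Windows. The only point shown is that the comparison reads nothing but the two payloads.

```ada
with Ada.Containers.Vectors;
with Ada.Strings.Unbounded;  use Ada.Strings.Unbounded;
with Ada.Text_IO;            use Ada.Text_IO;

procedure Payload_Sort_Sketch is

  --  Hypothetical payload attached to each row: it duplicates what the
  --  cells display, but in a form that can be compared directly.
  type Row_Payload is record
    Name : Unbounded_String;
    Size : Natural;
  end record;

  type Column is (By_Name, By_Size);

  Sort_Column : Column := By_Size;

  --  The comparison reads only the payloads: no call to the GUI layer.
  function Before (Left, Right : Row_Payload) return Boolean is
  begin
    case Sort_Column is
      when By_Name => return Left.Name < Right.Name;
      when By_Size => return Left.Size < Right.Size;
    end case;
  end Before;

  package Row_Vectors is new Ada.Containers.Vectors (Positive, Row_Payload);
  package Row_Sorting is new Row_Vectors.Generic_Sorting ("<" => Before);

  Rows : Row_Vectors.Vector;

begin
  Rows.Append (Row_Payload'(To_Unbounded_String ("readme.txt"), 1_200));
  Rows.Append (Row_Payload'(To_Unbounded_String ("data.bin"), 98_000));
  Rows.Append (Row_Payload'(To_Unbounded_String ("notes.md"), 300));
  Row_Sorting.Sort (Rows);
  for R of Rows loop
    Put_Line (To_String (R.Name) & Natural'Image (R.Size));
  end loop;
end Payload_Sort_Sketch;
```

The price is the duplication mentioned above: each payload has to be kept in sync with the text shown in the cells.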

Without going into too many details, here are the performance gains in charts:



This represents a factor of ~100x to ~300x on top of the previous improvement, depending on the test machine.

Here are links to the sources:

AZip:

Project site & Subversion repository:
https://sf.net/projects/azip/
GitHub clone with Git repository:
https://github.com/zertovitch/azip

GWindows:

Project site & Subversion repository:
https://sf.net/projects/gnavi/
GitHub clone with Git repository:
https://github.com/zertovitch/gwindows

AZip, GWindows, the Windows API and a good surprise!



Programming user interfaces is often a bizarre experience where you program only one side and the other side is done by other people (for instance the programmers of the Windows components) whom you cannot contact to ask questions. Even if it were possible (like working at Microsoft in Redmond), those people probably retired long ago. So the only possibility is to experiment with the black box and search the Internet in the hope that someone else has encountered the same issues and found a solution. Fortunately, as time goes by, more and more solutions appear.

Here is a typical example.

The AZip Zip archive manager is designed to operate with different user interface systems, like native Windows (it is done through the GWindows framework), or the multi-platform Gtk system (there is a draft version of that).

AZip uses the Ex_List_View_Control_Type widget (package GWindows.Common_Controls.Ex_List_View) which is an extension of List_View_Control_Type (package GWindows.Common_Controls), developed by the company KonAd GmbH, with cool features like individually coloured cells, and sorting.

A hiccup was the performance of the sorting. On Zip archives with many entries (say, 10,000 or more), the sorting became frustratingly slow. In other software, sorting is MUCH faster, so there was an issue to solve.

Sorting a column in AZip (screenshot)

Fortunately GWindows and its extensions are completely open-source, so you can inspect everything. By the way, it is also the case for GNAT (the open-source Ada compiler): you can inspect the entire run-time library. All in all, AZip is a rare piece of software whose entire source (the program itself, the user interface, the data compression library, and the run-time library) can be comfortably browsed with mouse clicks from GNAT Studio. This amounts to more than 112,000 lines and 572 units (mostly packages), completely in Ada. You see here a slide from a FOSDEM presentation about AZip.

Break-down of the Ada source code of AZip

Wait, the "entire source"? Not really, since Windows is neither open-source, nor in Ada. That's where the fun begins. Back to our sorting issue (sorting is too slow). GWindows (the nice object-oriented framework) provides a Sort method, which sends behind the scenes a Windows message, LVM_SORTITEMS, which calls back a comparison function provided by GWindows, Compare_Internal, which in turn calls an On_Compare method - built-in, or derived by the programmer, for instance for sorting the columns of a Zip archive where some columns are numerical and others are text. This mechanism is convoluted, but it is how it works if you want to use the List_View widget provided by Windows. It is a case where you really have to dance with the Windows API (of course, you would have the same situation with other user interface systems: Mac, Gtk, ...).

After a certain amount of Internet searching, through forums, blogs, etc., it appears that there is an alternative way, with a Windows message called LVM_SORTITEMSEX. Note the "EX" at the end: it is for "extended". You have probably noticed the last three letters instead, but anyway... This alternative way provides the comparison call-back directly with the indices of the rows that Windows wants to compare. With the initial approach (using LVM_SORTITEMS, without EX), two LVM_FINDITEM messages have to be sent in order to get those row indices. So you can save those two calls and hope for a small speedup.

Now comes the surprise. The gain in terms of time is astounding!

On a certain computer, with 4096 items to be sorted, both calls (which can be skipped with the new approach) consume 70% of the entire sorting time - including the comparison itself, the sorting effort on Windows' side, the object-oriented dispatching, etc. With 8192 items, it is 81%. With 16'384 items, it is 88%. With 32'768 items, 94% of the time is consumed by the extra calls. Seen differently, with 32'768 items, the old approach takes 14 seconds for sorting, and the new approach takes 0.85 seconds for the same job!
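The percentages and the timings above tell the same story. If the extra calls consume a share f of the old sorting time, removing them leaves 1 - f of that time, so the expected speedup is 1 / (1 - f): about 3.3 for f = 0.70 and about 16.7 for f = 0.94, close to the measured 14 s / 0.85 s ≈ 16.5. The small sketch below (type and variable names are only illustrative) does that arithmetic for the four measurements:

```ada
with Ada.Text_IO;       use Ada.Text_IO;
with Ada.Float_Text_IO; use Ada.Float_Text_IO;

procedure Speedup_Sketch is

  type Measurement is record
    Items       : Positive;
    Extra_Share : Float;  --  Share of the sorting time spent in the extra calls
  end record;

  --  Figures quoted in the text above.
  Measurements : constant array (1 .. 4) of Measurement :=
    ((4_096, 0.70), (8_192, 0.81), (16_384, 0.88), (32_768, 0.94));

begin
  for M of Measurements loop
    Put (Positive'Image (M.Items) & " items: expected speedup about ");
    --  Removing a share f of the work leaves 1 - f, hence 1 / (1 - f).
    Put (1.0 / (1.0 - M.Extra_Share), Fore => 1, Aft => 1, Exp => 0);
    New_Line;
  end loop;
end Speedup_Sketch;
```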

Here are performance charts on two different computers.


 

Interestingly, the performance stays linear with the new approach.

Here are links to the sources:

AZip:

Project site & subversion repository:
https://sf.net/projects/azip/
GitHub clone with git repository:
https://github.com/zertovitch/azip

GWindows:

Project site & subversion repository:
https://sf.net/projects/gnavi/
GitHub clone with git repository:
https://github.com/zertovitch/gwindows

Zip-Ada: Zip64 extensions

At the very top of the Zip-Ada to-do list, for a while, was the support for archives containing large (more than 4 GiB of data) or numerous (more than 65535) files – the so-called "Zip64" extensions.

It's now gone from the to-do list and done in the project.

The Zip archive format was defined with 32-bit file sizes and a 16-bit number of files. At the time (1989) it was a perfectly sound choice, especially given that the software, PKZIP, was written for 16-bit PCs running MS-DOS. At some point, these limits began to be problematic for archiving or backup tasks. Although the original Zip format is still the most widely used (including many file types without the .zip name extension, such as .docx, .xlsx, .pptx, .jar), the limits have slowly become an issue more and more frequently. Consequently, PKWARE (the firm behind PKZIP) introduced, around the year 2000, a set of format extensions to overcome those limitations. Given the omnipresence of Zip archives and the difficulty of innovating in software, PKWARE did not opt for a fresh, new, simple, but incompatible format. Instead, they designed a set of extensions, allowed by the flexibility of the Zip format. This seems clumsy and over-engineered, but the design is actually quite clever: a program can create a Zip archive, without size limitations, and without checking in advance whether the Zip64 extensions are needed or not. When the archive is finished, either the Zip64 extensions were not needed and the archive conforms to the original 1989 format, or they were needed, and then an unpacker for that archive file has to know about Zip64. Thus, the Zip64 design allowed for a progressive adoption of the new format, since the actual need went progressively from very rare to a bit less rare.

How do you implement Zip64 in new software? There are a few options, which you can combine:

1.    Follow the documentation, appnote.txt
2.    Take inspiration from some open-source software like Info-Zip or 7-Zip
3.    Reverse-engineer data (hex dumps)

All three are very time-consuming (especially, the documentation, aimed at maintaining the openness of the Zip format, is "legally" correct but lacks some indications and practical details).
Even widespread, commercial software (with paying customers and many paid programmers behind it) does not support Zip64 (e.g. Microsoft Word for .docx), or took almost two decades to do so (Microsoft Windows for instance). It reveals something about the time and costs related to the implementation…

Fortunately, a very helpful person called Yaakov not only did an implementation from scratch, but also was kind enough to share his experience in an explanatory and colorful way, with a simple example as an illustration. The article is available here: ZIP64 - Go Big Or Go Home.
I show here (with permission) the key diagram of his article:



The Zip64 extensions are in dark background and white letters.
The part with the archived (possibly compressed) files is in pink and brown. In that example, there is only one file in the archive.

As you see, an implementation of Zip64 may be quite tricky, but it is feasible, if we have "the big picture" before our eyes.

Alas, things are a little more complicated than they appear when reading the article too fast...

We want, with Zip-Ada, to decode and unpack correctly archives not only made by Zip-Ada, but also by Info-Zip's Zip, 7-Zip, WinZip, and so on (the converse is also true: we want the other guys to be able to unpack our Zip-Ada archives).
For that reason, we need to anticipate all possible ways the data headers may have been written, not only our choices. In that respect, we have had some headaches with the per-entry "size" Zip64 extensions (the parts in brown and dark green in the chart).
The documentation (4.5.3) describes, for that, a record with a variable number of values:

8 bytes    [A] Original uncompressed file size
8 bytes    [B] Size of compressed data
8 bytes    [C] Offset of local header record
4 bytes    [D] Number of the disk on which this file starts


and defines in plain English some rules about the order and in which circumstances the values are present or not.

The documentation requires both [A] and [B] for the Local Header (part in brown in the chart). Let us figure out the explanation to that rule. For a program that creates the archive, it would not be practical to put only the uncompressed size before compressing the data, because the program would have to shift all the compressed content by 8 bytes in case the compressed size (which is known only after the compression) happened to also exceed 2**32 - 2. Hence, exactly and always two values if the uncompressed size (which is usually known in advance) exceeds the limit.

When it is time for the archiving program to write the Central Directory (the part that concludes the archive, in green, light and dark, in the chart), all information about sizes is known. An archiving program may choose to put only the necessary values in 64 bits. However, traditionally, you would expect that the variable record's contents are self-describing and would contain [A], or: [A] and [B], or: [A] and [B] and [C], or: [A] and [B] and [C] and [D], given the record's size.
With Zip64, it is not the case: each field is optional. This micro-optimization is ridiculous given that the size of the archive is breaking the 4 GiB limit. To make things worse, the documentation is not clear about that full optionality.
An archiver (7-Zip does so for instance) may put just [A], the uncompressed size of one first large file that is well compressible, or just [C], the offset of one small file following a file with a large size in the archive. Or it could put [A], [B], [C] in both cases (as we do for Zip-Ada's archive creation). So, we've adapted the archive reader to take all possible cases into consideration.
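A reader that copes with all these cases can be sketched as follows. This is not the actual Zip-Ada code: Entry_Values, Apply_Zip64 and the other names are hypothetical. The sketch relies on the rule of section 4.5.3 of appnote.txt that a value is present in the Zip64 record only when the corresponding classic header field is saturated (16#FFFF_FFFF#, or 16#FFFF# for the disk number), and that the values present appear in the order [A], [B], [C], [D].

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure Zip64_Extra_Sketch is

  type Byte_Array is array (Positive range <>) of Unsigned_8;

  --  Values read from the classic header fields, widened to 64 bits.
  type Entry_Values is record
    Uncompressed_Size : Unsigned_64;  --  [A]
    Compressed_Size   : Unsigned_64;  --  [B]
    Local_Offset      : Unsigned_64;  --  [C]
    Disk_Number       : Unsigned_32;  --  [D]
  end record;

  Saturated_32 : constant := 16#FFFF_FFFF#;
  Saturated_16 : constant := 16#FFFF#;

  --  Replaces each saturated value by the next value of the Zip64 record;
  --  Payload is the record's data, without its header ID and size.
  procedure Apply_Zip64 (E : in out Entry_Values; Payload : Byte_Array) is
    Pos : Positive := Payload'First;

    function Read (Bytes : Positive) return Unsigned_64 is
      Result : Unsigned_64 := 0;
    begin
      if Pos + Bytes - 1 > Payload'Last then
        raise Constraint_Error with "Zip64 record too short";
      end if;
      for I in reverse 0 .. Bytes - 1 loop  --  Little-endian storage
        Result := Shift_Left (Result, 8) or Unsigned_64 (Payload (Pos + I));
      end loop;
      Pos := Pos + Bytes;
      return Result;
    end Read;

  begin
    if E.Uncompressed_Size = Saturated_32 then
      E.Uncompressed_Size := Read (8);
    end if;
    if E.Compressed_Size = Saturated_32 then
      E.Compressed_Size := Read (8);
    end if;
    if E.Local_Offset = Saturated_32 then
      E.Local_Offset := Read (8);
    end if;
    if E.Disk_Number = Saturated_16 then
      E.Disk_Number := Unsigned_32 (Read (4));
    end if;
  end Apply_Zip64;

  --  A small file located beyond the 4 GiB mark: only [C] is in the record.
  Info : Entry_Values :=
    (Uncompressed_Size => 1_000,
     Compressed_Size   => 400,
     Local_Offset      => Saturated_32,
     Disk_Number       => 0);
  Only_C : constant Byte_Array := (16#10#, 0, 0, 0, 16#01#, 0, 0, 0);

begin
  Apply_Zip64 (Info, Only_C);
  Put_Line ("Offset:" & Unsigned_64'Image (Info.Local_Offset));
end Zip64_Extra_Sketch;
```

With that rule, a record holding just [A], just [C], or [A], [B] and [C] is decoded by the same code, because the header fields, not the record's size, tell which values to expect.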

Thanks to Nicolas Brunot for pointing out that (hopefully) ultimate difficulty with Zip64.

Latest sources:
 
Link 1: https://github.com/zertovitch/zip-ada/ 
  With git: git clone https://github.com/zertovitch/zip-ada.git
  As Zip-ball: green button "Code", choice: "Download ZIP"
 
Link 2: https://sourceforge.net/p/unzip-ada/code/HEAD/tree/
  With subversion: svn checkout https://svn.code.sf.net/p/unzip-ada/code/ za
  As Zip-ball: button: "Download Snapshot"

A curiosity with LZMA data compression

Uncompressed file: 1'029'744 bytes.

Compressed size (excluding Zip or 7z archive metadata; data is not preprocessed):

    Bytes   Compressed / Uncompressed ratio   Format   Software
  172'976   16.80%                            PPMd     7-Zip 21.02 alpha
  130'280   12.65%                            BZip2    Zip 3.0
  119'327   11.59%                            BZip2    7-Zip 21.02 alpha
   61'584    5.98%                            LZMA     Zip-Ada v.57
   50'398    4.89%                            LZMA2    7-Zip 21.02 alpha
   50'396    4.89%                            LZMA     7-Zip 21.02 alpha
   42'439    4.12%                            LZMA     Zip-Ada v.58 (preview)
   41'661    4.05%                            LZMA     Zip-Ada (current research branch)

Conclusion: the Zip-Ada (current research branch) compresses that data 17.3% better than 7-Zip v.21.02!

The file (zipped to its smallest compressed size, 4.05%) can be downloaded here. It is part of the old Canterbury corpus benchmark file collection (file name: kennedy.xls).

Please don't draw any conclusion: the test data is a relatively small, special binary file with lots of redundancy.
But that result is a hint that some more juice can be extracted from the LZMA format.

The open-source Zip-Ada project can be found here and here.

The HAC scripts invasion (follow-up)

As a follow-up of another post about converting bash (Linux) or cmd (Windows) scripts to HAC scripts, here is a fresh example.

I needed to improve an existing cmd script for benchmarking compression software, with the possibility of switching various subsets separately: that is re-run this subset of methods, or that other subset, or the full tests (long!), etc.

Of course, if you use a real language instead of command-line interpreter ones, there is an obvious solution: you can define a set and you can programmatically flip the membership switches.

In Ada, it looks like

  type Category is (
    Reduce_Shrink,
    Deflate,
    Deflate_External,
    BZip2_External,
    PPMd_External,
    LZMA_7z,
    LZMA_3,
    TAR,
    Preselection
  );

  cat_set : array (Category) of Boolean;
The good news is that you can run an Ada program exactly like a script by using HAC (the HAC Ada Compiler). That is, it runs immediately (with HAC), and HAC doesn't drop .ali, .o, .bexch, .tmp, .exe files which are too much waste for the sake of running a small script-like job.
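As an illustration of flipping the membership switches, here is a minimal sketch in full Ada. The named type Category_Set and the presets Full_Test and LZMA_Only are hypothetical additions, and the real bench.adb may organise things differently. HAC supports a subset of Ada, so it is an assumption that it accepts every construct used here (the Boolean array operators, for instance).

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Category_Set_Sketch is

  type Category is
    (Reduce_Shrink, Deflate, Deflate_External, BZip2_External,
     PPMd_External, LZMA_7z, LZMA_3, TAR, Preselection);

  type Category_Set is array (Category) of Boolean;

  --  Hypothetical presets, for illustration only.
  Full_Test : constant Category_Set := (others => True);
  LZMA_Only : constant Category_Set :=
    (LZMA_7z | LZMA_3 => True, others => False);

  --  Set difference: everything except the LZMA variants.
  cat_set : Category_Set := Full_Test and not LZMA_Only;

begin
  cat_set (TAR) := False;                   --  Switch one member off
  cat_set (LZMA_3) := not cat_set (LZMA_3); --  Flip one switch
  for c in Category loop
    if cat_set (c) then
      Put_Line ("Would run: " & Category'Image (c));
    end if;
  end loop;
end Category_Set_Sketch;
```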

Below are screenshots of the quick development of bench.adb using the LEA editor, where you can press F4 to check for possible errors. If there is one, you are taken instantly to the offending line and column.

This script is part of the Zip-Ada project and is very helpful for developing and testing new compression methods.

 



Some research with LZMA...

A rare case where Zip-Ada's LZMA encoder is much better than LZMA SDK's. Rare but still interesting, and with standard LZMA parameters (no specific tuning for that file):


The compressed size with current revision (rev.#882) of Zip-Ada is slightly worse (42,559 bytes).

The file is part of the classic Canterbury Corpus compression benchmark data set.

The HAC scripts invasion

Perhaps you have already been confronted with this problem:

  • You have "housekeeping" shell scripts for cleaning files, building tools, listing results, etc.
  • You would like to have the project, that these scripts are serving, running on multiple systems: Linux, MacOS, Windows, ... Especially for highly portable Ada projects, it is almost a must.
  • But in the end, you have lots of duplicate scripts: for each script one version for Linux, one version for Windows.
The solution: use HAC (the HAC Ada Compiler).

One practical example of script simplification can be found in the Zip-Ada project.
In the test directory, there were test_za.cmd and test_za.sh, meant to do the same thing: testing the compression side of the library. But the scripts were out of sync, and it was a pain to make them converge. So it was a perfect opportunity to switch to HAC, which, since its version 0.076, has standard subprograms for file management. The unified script is test_za.adb and can be run with the hac test_za.adb command.
Now test_rz.cmd and test_rz.sh (for testing the Zip archive recompression tool, ReZip) are also unified, and so are make_za.cmd and make_za.sh for building everything.
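To give an idea of what such a unified housekeeping script can look like, here is a minimal sketch written for a full Ada compiler with the standard Ada.Directories package. The file names are made up, and the form in which HAC exposes its file management subprograms is not shown here: the sketch only illustrates that one source does the same job on every system.

```ada
with Ada.Directories; use Ada.Directories;
with Ada.Text_IO;     use Ada.Text_IO;

procedure Clean_Sketch is

  --  Deletes a file if it is there; same behaviour on Linux, macOS, Windows.
  procedure Remove_If_Present (File_Name : String) is
  begin
    if Exists (File_Name) and then Kind (File_Name) = Ordinary_File then
      Delete_File (File_Name);
      Put_Line ("Deleted " & File_Name);
    end if;
  end Remove_If_Present;

begin
  --  Hypothetical leftovers of a test run.
  Remove_If_Present ("test_output.zip");
  Remove_If_Present ("test_output.tmp");
end Clean_Sketch;
```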

More generally HAC scripts have tremendous advantages:
  • They are plain Ada, so there is no need to learn a new language.
  • If you need it, they can be compiled with an Ada compiler. It can be for different reasons:
    • You need performance (nested loops, for instance). You'll get C-level performance, at least with the GNAT compiler.
    • You need more functionalities that are not present in HAC.
    • You are afraid HAC is not developed or supported further.
Here are a few screenshots:



The screenshots are taken from TeXCAD and LEA - other Ada projects 😏...

HAC v.0.075: time functions: goodies for scripting tasks

Today, HAC has a few more functions, from Ada.Calendar. I have added them in order to translate a Windows cmd script to Ada (with HAC_Pack). More precisely, it's "save.cmd", which takes a snapshot of the sources of the HAC system and other key files. This snapshot is a Zip archive and has a time stamp in its name, like "hac-2020-10-20-20-27-36-.zip". Hence the addition of standard functions like Year, Month, etc. The script is very practical for making backups between commits via subversion or git, and for other purposes. Now the script is called "save.adb" and does the same job, not only on Windows but also on Linux or other operating systems. Since the Zip compression is also programmed in Ada (Zip-Ada), you have in that script example a cool situation of Ada invading your computer 😊.
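Building such a time stamp from the Ada.Calendar functions could look like the sketch below. It is written for a full Ada compiler and is not the actual save.adb: the helper names Image_2 and Time_Stamp are hypothetical, and the exact layout of the archive name is only imitated.

```ada
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;  use Ada.Text_IO;

procedure Time_Stamp_Sketch is

  --  Decimal image without the leading blank, padded to two digits.
  function Image_2 (N : Natural) return String is
    S           : constant String := Natural'Image (N);  --  Starts with a blank
    Digits_Only : constant String := S (S'First + 1 .. S'Last);
  begin
    if N < 10 then
      return '0' & Digits_Only;
    else
      return Digits_Only;
    end if;
  end Image_2;

  function Time_Stamp (T : Time) return String is
    --  Whole seconds elapsed since midnight (fraction dropped).
    Secs : constant Natural :=
      Natural (Long_Float'Floor (Long_Float (Seconds (T))));
  begin
    return
      Image_2 (Year (T)) & '-' &
      Image_2 (Month (T)) & '-' &
      Image_2 (Day (T)) & '-' &
      Image_2 (Secs / 3600) & '-' &
      Image_2 ((Secs / 60) mod 60) & '-' &
      Image_2 (Secs mod 60);
  end Time_Stamp;

begin
  Put_Line ("hac-" & Time_Stamp (Clock) & ".zip");
end Time_Stamp_Sketch;
```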

Here, a screenshot of the added functions running from the LEA editor (hum, also a full Ada software, by the way!):

HAC 0.075 running from LEA.

More to come soon with some subprograms stemming from Ada.Directories.

HAC is free and open-source, you can find it here and here.

Zip-Ada v.57

 New in v.57 [rev. 799]:

  - UnZip: fixed bad decoding case for the Shrink (LZW) format,
        on some data compressed only by PKZIP up to v.1.10,
        release date 1990-03-15.
  - Zip.Create: added Zip_Entry_Stream_Type for doing output
        streaming into Zip archives.
  - Zip.Compress: Preselection method detects Audacity files (.aup, .au)
        and compresses them better.

***

Zip-Ada is a pure Ada library for dealing with the Zip compressed
archive file format. It supplies:
 - compression with the following sub-formats ("methods"):
     Store, Reduce, Shrink (LZW), Deflate and LZMA
 - decompression for the following sub-formats ("methods"):
     Store, Reduce, Shrink (LZW), Implode, Deflate, Deflate64,
     BZip2 and LZMA
 - encryption and decryption (portable Zip 2.0 encryption scheme)
 - unconditional portability - within limits of compiler's provided
     integer types and target architecture capacity
 - input archive to decompress can be any kind of indexed data stream
 - output archive to build can be any kind of indexed data stream
 - input data to compress can be any kind of data stream
 - output data to extract can be any kind of data stream
 - cross format compatibility with the most various tools and file formats
     based on the Zip format: 7-zip, Info-Zip's Zip, WinZip, PKZip,
     Java's JARs, OpenDocument files, MS Office 2007+,
     Google Chrome extensions, Mozilla extensions, E-Pub documents
     and many others
 - task safety: this library can be used ad libitum in parallel processing
 - endian-neutral I/O

***

Main site & contact info:
  http://unzip-ada.sf.net
Project site & subversion repository:
  https://sf.net/projects/unzip-ada/
GitHub clone with git repository:
  https://github.com/zertovitch/zip-ada

Enjoy!

Zip-Ada for Audacity backups

Audacity is a free, open source, audio editor, available here.

If you want to back up your Audacity project, you can do it manually with "Save Lossless Copy of Project..." with the name, say, X, which will create X.aup (project file), a folder X_data, and, in there, a file called "Audio Track.wav".

Some drawbacks:

  • It is a manual operation.
  • It is blocked during playback.
  • Envelopes are applied to the "Audio Track.wav" data. So the data is altered and is no longer a real lossless copy of the project. Actually, this operation is something between a backup and an export of the project to a foreign format.

A solution: Zip-Ada.

The latest commit (rev. 796) adds to the Preselection method a specific configuration for detecting Audacity files, so they are compressed better than with default settings.

Funny detail: that configuration makes, in most cases, the compression better than the best available compression with 7-Zip (v.19.00, "ultra" mode, .7z archive).

The compressing process is also around twice as fast as 7-Zip in "ultra" mode. This is no magic, since the "LZ" part of the LZMA compression scheme spends less time finding matches, in the chosen configuration for Zip-Ada.


A backup script could look like this (here for Windows' cmd):

rem --------------------------
rem Nice date YYYY-MM-DD_HH.MM
rem --------------------------

set year=%date:~-4,4%

set month=%date:~-7,2%
if "%month:~0,1%" equ " " set month=0%month:~1,1%

set day=%date:~-10,2%
if "%day:~0,1%" equ " " set day=0%day:~1,1%

set hour=%time:~0,2%
if "%hour:~0,1%" equ " " set hour=0%hour:~1,1%

set min=%time:~3,2%

set nice_date=%year%-%month%-%day%_%hour%.%min%

rem --------------------------

set audacity_project=The Cure - A Forest

zipada -ep2 "%audacity_project%_%nice_date%" "%audacity_project%.aup" "%audacity_project%_data\e08\d08\*.au"

AZip in action for duplicating a Thunderbird profile

You want to copy your Thunderbird profile from machine A to machine B (with all mail accounts, passwords, settings, feeds, newsgroups, ...)? Actually it is very easy. From the user storage (on Windows, %appdata%; you get there with Windows key+R and typing %appdata%), you copy the entire Thunderbird folder of machine A to the equivalent location on machine B, and that's it. The new active profile will be automatically selected since the file profiles.ini will be overwritten on the way.

Now, if you want or need to use a cloud drive or a USB stick for the operation, it's better to wrap everything in a Zip file (a single file instead of hundreds) to save time. Plus, you can store the Zip file in case of an emergency (losing data on both A and B machines).

With AZip, it's pretty easy: 

  • Shut down Thunderbird on both machines.
  • On machine A: drag & drop the Thunderbird folder on an empty AZip window.
  • Copy or move the Zip file.
  • On machine B: extract everything with another drag & drop, from AZip to the Explorer window with the %appdata% path. When asked "Use archive's folder names for output", say "Yes". When asked "Do you want to replace this file ?", say "All".

That's it!

Here are a few screenshots:

Folder tree view

You can squeeze the data to a smaller size (the LZMA format will be most of the time chosen over Deflate) with the "Recompress" button (third from the right).

After recompression


Zip-Ada: the new Zip_Entry_Stream output stream

The latest addition to Zip-Ada (commits #792 to 794) is the possibility of writing contents to a Zip file (or more generally, a Zip stream) as an output stream. You (the programmer) don't need to store contents into some buffer and design an input stream in the Zip_Streams.Root_Zipstream_Type'Class type class to read that buffer, as is the case for Add_Stream in the Zip.Create package.

How does the new output stream work in practice? The best way is to show an example. Here is a reduced version of Test_Zip_Entry_Stream (you can find the full version in the test directory of the Zip-Ada project's sources):

  with Zip.Create;
  with Ada.Command_Line, Ada.Text_IO;

  procedure Test_Zip_Entry_Stream is
    use Zip.Create;
    use Ada.Command_Line, Ada.Text_IO;

    Archive_Info  : Zip_Create_Info;
    Archive_File  : aliased Zip_File_Stream;
    Archive_Entry : aliased Zip_Entry_Stream_Type;
    Text          : File_Type;
  begin
    Create_Archive (Archive_Info, Archive_File'Unchecked_Access, "test_zes.zip");
    --  Each command-line argument is a text file, stored as one Zip entry.
    for I in 1 .. Argument_Count loop
      Open (Archive_Entry);
      Open (Text, In_File, Argument (I));
      while not End_Of_File (Text) loop
        String'Write (Archive_Entry'Access, Get_Line (Text));
        Character'Write (Archive_Entry'Access, ASCII.LF);  --  UNIX end-of-line
      end loop;
      Close (Text);
      --  Closing the entry stream adds its contents to the archive.
      Close (Archive_Entry, "zes_" & Argument (I), use_clock, Archive_Info);
    end loop;
    Finish (Archive_Info);
  end Test_Zip_Entry_Stream;

Enjoy!

NB: this addition has been sponsored. It is used in an industrial software (robotics).

The open-source Zip-Ada project can be found at the following places:
