Incorrect filename in zip archive

Enternal · December 14, 2018, 2:48am

Not exactly the same issue, but it's also related to how DOpus reads some of these archives. There is an HDD testing utility that is provided by http://hdd.by/victoria. They provide a portable version of it here, which is a zip file. Inside it, there is a readme file named Изменения.txt that is read correctly by WinRAR. However, when reading the archive through DOpus, the filename comes out garbled. It doesn't really bother me, but I wonder if it's related to the problem here, with something wrong in the plug-ins and how they read the file.

Leo · December 16, 2018, 9:45am

There's something weird about that one but I haven't worked out what yet.

If you recompress it with just about anything, it works properly in Opus.

Zip files potentially have two copies of every filename, so it's possible it has not encoded one of the copies correctly and Opus is using that while other tools favor the other copy, but that's just a guess. Proper investigation will take some debugging. I think the archive is incorrect in some way, though.

Do you run into that often or is it a one-off?

Enternal · December 16, 2018, 10:11am

Ah, that makes sense.

It's a one-off thing, but I still thought it was worth mentioning in case someone comes across it again in the future, since it is a bit peculiar.

Leo · September 9, 2019, 8:41pm

We've investigated this and found the archive uses a fairly rare extension to the Zip format which stores the filename as both ANSI and Unicode/UTF8 at the same time. (So there are up to four copies of each filename in the archive: two next to each file's data, and two more in the "central directory" listing at the end of the archive.)

Tools which don't support Unicode at all store the names in a format which only works when the person reading the archive is using the same codepage as the person who created it.
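
As a rough illustration of the codepage problem, here is a minimal Python sketch. It assumes the name was written in the Russian DOS codepage (cp866) and read back as cp437; the thread does not say which codepages were actually involved, so treat both as hypothetical.

```python
# Hypothetical illustration: both codepages are assumptions, not taken from the archive.
name = "Изменения.txt"

raw = name.encode("cp866")      # bytes as a non-Unicode tool might store them
garbled = raw.decode("cp437")   # what a reader assuming cp437 would display

# The bytes themselves are intact, so undoing the wrong guess recovers the name.
recovered = garbled.encode("cp437").decode("cp866")

print(garbled)
print(recovered)
```

The ASCII part (".txt") survives because both codepages agree on it; only the Cyrillic letters are misread.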

Most tools which do support Unicode will store the names as UTF8 (only), and set a flag saying they are UTF8.
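
For anyone wanting to check an archive themselves, the flag in question is bit 11 (0x800) of each entry's general purpose flags. A small sketch using Python's standard zipfile module follows; the helper name `utf8_flag_set` is made up for this example, and the archive is built in memory rather than being the one from this thread.

```python
import io
import zipfile

def utf8_flag_set(info: zipfile.ZipInfo) -> bool:
    """True if general purpose bit 11 (0x800) marks the name as UTF-8."""
    return bool(info.flag_bits & 0x800)

# Build a small archive in memory. Python's zipfile sets the flag itself
# when a name cannot be stored as plain ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Изменения.txt", "changes")
    zf.writestr("readme.txt", "ascii name")

# Re-open the archive so the flags are read back from the central directory.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: utf8_flag_set(info) for info in zf.infolist()}

print(flags)
```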

But there are some rare tools which use a format extension to store names in both formats at once: one for backward compatibility (when in the same codepage) and the other for Unicode support (independent of codepages).
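
The extension is not named above. Assuming it is the Info-ZIP Unicode Path extra field (header ID 0x7075) described in the Zip APPNOTE, the second copy of the name could be pulled out of an entry's extra data roughly like this. This is only a sketch in Python, not how Opus does it; `unicode_path_from_extra` is a hypothetical helper and the sample record is hand-built.

```python
import struct
import zlib
from typing import Optional

UNICODE_PATH_ID = 0x7075  # assumed: Info-ZIP Unicode Path extra field

def unicode_path_from_extra(extra: bytes, legacy_name: bytes) -> Optional[str]:
    """Return the UTF-8 name from the extra data, or None if absent or stale."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra record is: header ID (2 bytes), data size (2 bytes), data.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        # Record data is: version (1 byte), CRC-32 of the legacy name (4 bytes), UTF-8 name.
        version, name_crc = struct.unpack_from("<BI", data, 0)
        # The CRC ties the Unicode name to the legacy name it shadows.
        if version == 1 and name_crc == zlib.crc32(legacy_name):
            return data[5:].decode("utf-8")
    return None

# Hand-built sample record; cp866 for the legacy copy is an assumption.
legacy = "Изменения.txt".encode("cp866")
utf8 = "Изменения.txt".encode("utf-8")
sample = struct.pack("<HHBI", UNICODE_PATH_ID, 5 + len(utf8), 1, zlib.crc32(legacy)) + utf8

print(unicode_path_from_extra(sample, legacy))
```

The CRC check matters when an archive has been modified by a tool that does not know about the extension: if the legacy name was changed but the extra data was left alone, the Unicode copy no longer matches and should be ignored. With Python's zipfile, the central directory's raw extra bytes are available as `ZipInfo.extra`.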

We'll add support for this in 12.17.1 beta, if not before. Code for this has been written but it's a question of when it makes sense to release it, since affected archives are obscure enough that we don't want to rush it out and risk introducing bugs into the zip code.

Enternal · September 12, 2019, 10:01pm

Thank you for looking into this again.

Leo · September 13, 2019, 9:33pm

This is looking like it will take a while longer, unfortunately. Support for reading archives that use the extension was easy, but after testing we found issues if any changes were made to such archives, and making that work correctly is both more complex and more risky.

We don't want to risk corrupting archives as a side effect of adding support for an extension that is rarely used, as that would be catastrophic.

We are still aiming to add support for this, but will need to do some refactoring first.