Issue with ZIP File Filename Encoding

In my work, I often receive ZIP files from clients that display garbled filenames.

Currently, I have to adjust the setting "Infer UTF-8 filenames (requires restart)" under "ZIP Files" depending on the client's operating system:
When the client uses Windows, I must enable this option so that viewing or extracting the ZIP file displays filenames correctly.
When the client uses macOS, I must disable this option to avoid garbled filenames.

Would it be possible to implement automatic encoding detection in a future update?
The built-in ZIP tool in Opus is extremely convenient—especially the ability to browse ZIP archives directly as folders. Automatic handling of filename encodings would make this feature even better.

In general, no, that isn't possible. There is no reliable way to determine a character encoding unless something explicitly says which one to use. Any algorithm that tries to guess will go wrong in a lot of situations, especially with short strings like filenames.
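
To illustrate why guessing is fragile, here is a minimal Python sketch. The helper name `plausible_encodings` and the candidate list are made up for illustration and are not anything Opus uses. It takes the raw bytes of a filename and reports every candidate encoding that decodes them without error. A short name can pass under more than one encoding, so "decodes cleanly" cannot identify the right one by itself.

```python
# Candidate encodings to try; this list is an assumption for the demo.
CANDIDATES = ["utf-8", "gbk", "big5", "shift_jis", "euc_kr", "cp437"]

def plausible_encodings(raw, candidates=CANDIDATES):
    """Return the candidate encodings that decode `raw` without raising."""
    accepted = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        accepted.append(enc)
    return accepted

# A short GBK-encoded filename, as it would be stored in a ZIP without the UTF-8 flag.
raw_name = "中文测试文件.txt".encode("gbk")
print(plausible_encodings(raw_name))
```

Strict UTF-8 rejects these bytes, but cp437 accepts any byte sequence at all, so a validity check can rule encodings out but cannot confirm one.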

I'm surprised the macOS archiver is still using anything other than UTF-8 these days. I would look for a better one, as it doesn't make sense to use anything else; the world has settled on UTF-8 for things like this.

Thank you for your prompt reply.

I currently use NanaZIP as my primary tool for handling archives. It’s a 7-Zip–based derivative that can automatically detect the correct filename encoding in ZIP files without manual configuration.

This leads me to wonder: is this capability only possible because it’s a standalone application, or is Opus’s built-in ZIP functionality independently developed in a way that currently limits such features?

Would it be feasible for Directory Opus to integrate proven open-source libraries (such as those used by 7-Zip or NanaZIP) — perhaps via plugins or external components — to enable smarter, automatic encoding detection while retaining the convenience of browsing archives as folders?

I greatly appreciate the seamless archive integration in Opus and believe this enhancement would significantly improve the user experience when dealing with cross-platform ZIP files.

Could you possibly provide an example file so we can experiment? (Even if you have to ask your client to make one specifically with a dummy file in it)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Generate Test ZIP Archives with Different Encodings
====================================================

This script creates test ZIP files with various character encodings to test
encoding detection and code page switching using 7-Zip command line.

Supported Encodings:
- GBK (Simplified Chinese, Code Page 936)
- Big5 (Traditional Chinese, Code Page 950)
- Shift-JIS (Japanese, Code Page 932)
- EUC-KR (Korean, Code Page 949)

For each encoding, three sample sizes are generated:
- "many": 20 files with long filenames (30-60 chars) - Auto detection should work
- "medium": 10 files with medium filenames (15-30 chars) - Auto detection may work
- "few": 3 files with short filenames (5-15 chars) - Auto detection may fail

Usage:
    Simply run this script in the directory where you want the test files:
    python generate_test_zips.py

The script will generate 12 ZIP files (4 encodings x 3 sample sizes).

How to extract with 7-Zip command line (with correct code page):

    7z x test_gbk_many.zip -mcp=936
    7z x test_big5_many.zip -mcp=950
    7z x test_sjis_many.zip -mcp=932
    7z x test_euckr_many.zip -mcp=949

    Where:
        x      = extract with full paths
        -mcp=N = specify code page N for filename encoding

Common Code Pages:
    936 = Simplified Chinese (GBK)
    950 = Traditional Chinese (Big5)
    932 = Japanese (Shift-JIS)
    949 = Korean (EUC-KR)
    65001 = UTF-8 (default in modern 7-Zip)

Note: These ZIP files do NOT have the UTF-8 flag set (Bit 11 = 0), so they can
be repaired by specifying the correct code page with -mcp. This is different
from Python's zipfile module, which stores non-ASCII filenames as UTF-8 and
sets the UTF-8 flag for them.

Why use -mcp?
    7-Zip GUI does not allow changing code page. You MUST use command line with
    the -mcp parameter to specify the correct encoding for filenames.
"""

import os
import struct
import io
import zlib

# Get the directory where this script is located
script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir)
print(f"Output directory: {script_dir}")

def create_zip_manual(filename, files_dict, encoding='gbk', use_utf8_flag=False):
    """
    Manually construct a ZIP file with precise control over filename encoding.
    
    Args:
        filename: Output ZIP filename
        files_dict: Dictionary mapping {filename_in_encoding: file_content}
        encoding: Character encoding for filenames ('gbk', 'big5', 'shift_jis', 'euc_kr')
        use_utf8_flag: If True, sets UTF-8 flag (Bit 11) - makes file non-repairable by code page switching
    
    Returns:
        Full path to the created ZIP file
    """
    filepath = os.path.join(script_dir, filename)
    buffer = io.BytesIO()
    central_dir = io.BytesIO()
    local_header_offset = 0
    file_count = 0
    
    for fname, content in files_dict.items():
        # Encode filename using specified encoding (not UTF-8)
        fname_bytes = fname.encode(encoding)
        content_bytes = content.encode('utf-8')
        crc = zlib.crc32(content_bytes) & 0xffffffff
        
        # Bit 11 (0x800) = UTF-8 flag. We set this to 0 to allow code page switching.
        flags = 0x800 if use_utf8_flag else 0
        
        # Local File Header (30 bytes + filename length)
        local_header = struct.pack('<IHHHHHIIIHH', 
            0x04034b50,         # Local file header signature
            20,                 # Version needed (2.0)
            flags,              # General purpose bit flag
            0,                  # Compression method: 0 = stored (no compression)
            0,                  # File last modification time
            0,                  # File last modification date
            crc,                # CRC-32 checksum
            len(content_bytes), # Compressed size
            len(content_bytes), # Uncompressed size
            len(fname_bytes),   # Filename length
            0                   # Extra field length
        )
        local_header += fname_bytes
        
        buffer.write(local_header)
        buffer.write(content_bytes)
        
        # Central Directory Entry (46 bytes + filename length)
        central_entry = struct.pack('<IHHHHHHIIIHHHHHII',
            0x02014b50,         # Central directory signature
            20,                 # Version made by
            20,                 # Version needed to extract
            flags,              # General purpose bit flag
            0,                  # Compression method: 0 = stored
            0,                  # File last modification time
            0,                  # File last modification date
            crc,                # CRC-32 checksum
            len(content_bytes), # Compressed size
            len(content_bytes), # Uncompressed size
            len(fname_bytes),   # Filename length
            0,                  # Extra field length
            0,                  # File comment length
            0,                  # Disk number where file starts
            0,                  # Internal file attributes
            0,                  # External file attributes
            local_header_offset # Relative offset of local header
        )
        central_entry += fname_bytes
        central_dir.write(central_entry)
        
        local_header_offset += len(local_header) + len(content_bytes)
        file_count += 1
    
    # Write Central Directory
    central_dir_size = central_dir.tell()
    central_dir_offset = buffer.tell()
    buffer.write(central_dir.getvalue())
    
    # End of Central Directory Record (22 bytes)
    end_record = struct.pack('<IHHHHIIH',
        0x06054b50,         # End of central directory signature
        0,                  # Number of this disk
        0,                  # Disk with central directory
        file_count,         # Number of entries on this disk
        file_count,         # Total number of entries
        central_dir_size,   # Size of central directory
        central_dir_offset, # Offset of start of central directory
        0                   # ZIP file comment length
    )
    buffer.write(end_record)
    
    # Write the complete ZIP file to disk
    with open(filepath, 'wb') as f:
        f.write(buffer.getvalue())
    return filepath

# ============================================================================
# Filename Templates for Different Encodings and Sample Sizes
# Note: These are REAL filenames in each language (not English translations)
# They will appear garbled when opened with wrong encoding, allowing you to
# test code page switching with 7-Zip -mcp parameter.
# ============================================================================

templates = {
    'gbk': {
        # Simplified Chinese filenames - will show garbled text if opened with wrong encoding
        'many': [
            "这是一个非常长的中文文件名用于测试自动检测功能.txt",
            "项目文档说明包含详细的技术规范和实施方案.doc",
            "财务报告二零二四年度第一季度汇总数据分析表.xlsx",
            "人力资源部门员工培训计划及考核标准文档.pdf",
            "市场营销策略推广方案及客户反馈调查报告.ppt",
            "产品研发部门新技术开发进度跟踪记录表.docx",
            "客户服务部门投诉处理流程及解决方案汇总.txt",
            "供应链管理部门采购订单及供应商评估报告.pdf",
            "法务部门合同审查意见及风险提示说明文档.doc",
            "信息技术部门系统维护日志及故障处理记录.txt",
            "行政管理部办公用品采购清单及费用报销单据.xlsx",
            "质量管理部产品检测报告及合格证书扫描件.pdf",
            "生产管理部车间排班表及设备维护计划表.doc",
            "仓储物流部库存盘点表及出入库登记记录.xlsx",
            "研发实验室测试数据记录及分析报告文档.txt",
            "客户服务热线录音转写文本及满意度调查.doc",
            "品牌推广部广告投放计划及效果评估报告.ppt",
            "财务审计部内部审计报告及整改建议书.pdf",
            "人力资源部招聘面试记录及录用审批流程.docx",
            "总经理办公室会议纪要及决策事项跟踪表.txt",
        ],
        'medium': [
            "项目文档说明包含详细技术规范.txt",
            "财务报告第一季度汇总数据分析表.xlsx",
            "人力资源部门员工培训计划文档.pdf",
            "市场营销策略推广方案报告.ppt",
            "产品研发新技术开发进度记录表.docx",
            "客户服务投诉处理流程汇总.txt",
            "供应链管理采购订单评估报告.pdf",
            "法务合同审查意见说明文档.doc",
            "信息技术系统维护日志记录.txt",
            "行政管理办公用品采购清单.xlsx",
        ],
        'few': [
            "中文测试文件.txt",
            "项目文档说明.doc",
            "财务报告数据表.xlsx",
        ]
    },
    'big5': {
        # Traditional Chinese filenames - will show garbled text if opened with wrong encoding
        'many': [
            "這是一個非常長的繁體中文檔案名稱用於測試自動偵測功能.txt",
            "專案文件說明包含詳細的技術規格和實施方案計畫書.doc",
            "財務報告二零二四年度第一季匯總數據分析表暨圖表.xlsx",
            "人力資源部門員工教育訓練計畫及考核標準程序文件.pdf",
            "市場行銷策略推廣方案及客戶滿意度調查報告彙整.ppt",
            "產品研發部門新技術開發進度追蹤記錄表及評估報告.docx",
            "客戶服務部門客訴處理流程及解決方案彙整資料庫.txt",
            "供應鏈管理部門採購訂單及供應商評鑑報告彙整表.pdf",
            "法務部門合約審查意見及法律風險提示說明備忘錄.doc",
            "資訊技術部門系統維護日誌及故障排除處理記錄.txt",
            "行政管理部辦公用品採購清單及費用申請核銷單據.xlsx",
            "品質管理部產品檢測報告及合格證明書掃描存檔.pdf",
            "生產管理部工廠排班表及設備定期保養計畫表.doc",
            "倉儲物流部庫存盤點表及進出貨登記追蹤記錄.xlsx",
            "研發實驗室測試數據記錄及分析報告技術文件.txt",
            "客戶服務專線錄音轉寫文字及滿意度問卷調查.doc",
            "品牌推廣部廣告投放計畫及成效評估分析報告.ppt",
            "財務稽核部內部稽核報告及改善建議追蹤事項.pdf",
            "人力資源部招募面談記錄及錄用核准流程文件.docx",
            "總經理辦公室會議記錄及決議事項執行追蹤管制表.txt",
        ],
        'medium': [
            "專案文件說明包含詳細技術規格.txt",
            "財務報告第一季匯總數據分析表.xlsx",
            "人力資源部門員工教育訓練計畫.pdf",
            "市場行銷策略推廣方案報告.ppt",
            "產品研發新技術開發進度記錄表.docx",
            "客戶服務客訴處理流程彙整.txt",
            "供應鏈管理採購訂單評鑑報告.pdf",
            "法務合約審查意見說明文件.doc",
            "資訊技術系統維護日誌記錄.txt",
            "行政管理辦公用品採購清單.xlsx",
        ],
        'few': [
            "繁體中文測試.txt",
            "專案文件說明.doc",
            "財務報告數據表.xlsx",
        ]
    },
    'sjis': {
        # Japanese filenames - will show garbled text if opened with wrong encoding
        'many': [
            "これは非常に長い日本語ファイル名で自動検出機能をテストします.txt",
            "プロジェクト文書詳細な技術仕様と実施方案を含む説明書.doc",
            "財務報告書二零二四年度第一四半期集計データ分析表.xlsx",
            "人事部門社員教育訓練計画及び考核基準文書ファイル.pdf",
            "市場営業戦略推進方案及び顧客フィードバック調査報告.ppt",
            "製品研究開発部門新技術開発進度追跡記録表と評価報告.docx",
            "顧客サービス部門苦情処理流程及び解決方案まとめデータ.txt",
            "サプライチェーン管理部門購買発注及び供給者評価報告書.pdf",
            "法務部門契約審査意見及びリスク警告説明メモランダム.doc",
            "情報技術部門システム保守ログ及び故障処理記録ファイル.txt",
            "総務管理部門事務用品購入リスト及び経費申請書類.xlsx",
            "品質管理部門製品検査報告書及び合格証明書スキャン.pdf",
            "生産管理部門工場シフト表及び設備定期保守計画表.doc",
            "倉庫物流部門在庫棚卸表及び出入庫登録追跡記録.xlsx",
            "研究実験室テストデータ記録及び分析報告技術文書.txt",
            "顧客サービスホットライン録音文字起こし及び満足度調査.doc",
            "ブランド推進部門広告投放計画及び効果評価分析報告.ppt",
            "財務監査部門内部監査報告書及び改善提案追跡事項.pdf",
            "人事部門採用面接記録及び採用承認流程文書ファイル.docx",
            "社長室会議議事録及び決議事項実行追跡管理表.txt",
        ],
        'medium': [
            "プロジェクト文書詳細技術仕様.txt",
            "財務報告第一四半期データ分析表.xlsx",
            "人事部門社員教育訓練計画.pdf",
            "市場営業戦略推進方案報告.ppt",
            "製品研究開発進度追跡記録表.docx",
            "顧客サービス苦情処理データ.txt",
            "サプライチェーン購買発注報告書.pdf",
            "法務契約審査意見メモ.doc",
            "情報技術システム保守ログ.txt",
            "総務管理事務用品購入リスト.xlsx",
        ],
        'few': [
            "日本語テスト.txt",
            "プロジェクト文書.doc",
            "財務報告表.xlsx",
        ]
    },
    'euckr': {
        # Korean filenames - will show garbled text if opened with wrong encoding
        'many': [
            "이것은매우긴한국어파일이름으로자동감지기능을테스트합니다.txt",
            "프로젝트문서상세한기술사양과실시방안을포함한설명서.doc",
            "재무보고서이천이십사년도제일분기집계데이터분석표.xlsx",
            "인사부서사원교육훈련계획및평가기준문서파일입니다.pdf",
            "마케팅전략추진방안및고객피드백조사보고서요약본.ppt",
            "제품연구개발부서신기술개발진도추적기록표및평가보고.docx",
            "고객서비스부서불만처리프로세스및해결방안정리데이터.txt",
            "공급망관리부서구매발주및공급자평가보고서정리표.pdf",
            "법무부서계약심사의견및위험경고설명메모랜덤입니다.doc",
            "정보기술부서시스템유지보수로그및고장처리기록파일.txt",
            "행정관리부서사무용품구매목록및비용신청서류정리표.xlsx",
            "품질관리부서제품검사보고서및합격증명서스캔파일.pdf",
            "생산관리부서공장교대표및설비정기보수계획표문서.doc",
            "창고물류부서재고실사표및입출고등록추적기록정리.xlsx",
            "연구실험실테스트데이터기록및분석보고기술문서입니다.txt",
            "고객서비스핫라인녹음전사텍스트및만족도설문조사.doc",
            "브랜드추진부서광고투입계획및효과평가분석보고서.ppt",
            "재무감사부서내부감사보고서및개선제안추적사항파일.pdf",
            "인사부서채용면접기록및채용승인프로세스문서파일.docx",
            "총경리실회의회의록및결의사항실행추적관리표문서.txt",
        ],
        'medium': [
            "프로젝트문서상세기술사양설명.txt",
            "재무보고서제일분기데이터분석표.xlsx",
            "인사부서사원교육훈련계획.pdf",
            "마케팅전략추진방안보고서.ppt",
            "제품연구개발진도추적기록표.docx",
            "고객서비스불만처리데이터.txt",
            "공급망관리구매발주보고서.pdf",
            "법무부서계약심사의견문서.doc",
            "정보기술시스템유지보수로그.txt",
            "행정관리사무용품구매목록.xlsx",
        ],
        'few': [
            "한국어테스트.txt",
            "프로젝트문서.doc",
            "재무보고표.xlsx",
        ]
    }
}

# ============================================================================
# Encoding Configuration
# ============================================================================

# Python encoding names for each encoding type
encoding_map = {
    'gbk': 'gbk',
    'big5': 'big5',
    'sjis': 'shift_jis',
    'euckr': 'euc_kr'
}

# File content templates in each language (real text, not English)
content_templates = {
    'gbk': "这是文件{i}的内容,用于测试编码检测功能。",
    'big5': "這是檔案{i}的內容,用於測試編碼偵測功能。",
    'sjis': "これはファイル{i}の内容で、エンコーディング検出機能をテストします。",
    'euckr': "이것은파일{i}의내용으로인코딩감지기능을테스트합니다."
}

# ============================================================================
# Main Execution
# ============================================================================

print("=" * 70)
print("Generating Test ZIP Archives with Different Encodings")
print("=" * 70)
print("Purpose: Test encoding detection with 7-Zip -mcp parameter")
print("Note: Filenames are in REAL Chinese/Japanese/Korean (not English)")
print("      They will appear garbled when opened without correct code page.")
print("Output: 12 ZIP files (4 encodings x 3 sample sizes)")
print("=" * 70)

created_files = []

for enc_name, levels in templates.items():
    encoding = encoding_map[enc_name]
    content_template = content_templates[enc_name]
    print(f"\n[{enc_name.upper()} Encoding]")
    print("-" * 70)
    
    for level, names in levels.items():
        # Create file dictionary with sequential content
        files_dict = {}
        for i, name in enumerate(names, 1):
            content = content_template.format(i=i)
            files_dict[name] = content
        
        # Generate ZIP filename (e.g., test_gbk_many.zip)
        filename = f"test_{enc_name}_{level}.zip"
        filepath = create_zip_manual(filename, files_dict, encoding=encoding, use_utf8_flag=False)
        size = os.path.getsize(filepath)
        
        print(f"  {filename:<25} ({len(names):>2} files, {size:>6,} bytes)")
        created_files.append((filename, len(names), size, enc_name, level))

print("\n" + "=" * 70)
print("Generation Complete! File List:")
print("=" * 70)
print(f"{'No.':<4} {'Filename':<25} {'Encoding':<10} {'Sample':<8} {'Files':<6} {'Size':<12}")
print("-" * 70)

for i, (fname, count, size, enc, level) in enumerate(created_files, 1):
    size_str = f"{size:,} bytes"
    print(f"{i:<4} {fname:<25} {enc:<10} {level:<8} {count:<6} {size_str:<12}")

print("=" * 70)
print("\nHow to Extract with 7-Zip Command Line:")
print("-" * 70)
print("Basic syntax:")
print("    7z x <archive.zip> -mcp=<codepage>")
print("")
print("Examples:")
print("    7z x test_gbk_many.zip -mcp=936")
print("    7z x test_big5_many.zip -mcp=950")
print("    7z x test_sjis_many.zip -mcp=932")
print("    7z x test_euckr_many.zip -mcp=949")
print("")
print("Code Page Reference:")
print("    936  = Simplified Chinese (GBK)")
print("    950  = Traditional Chinese (Big5)")
print("    932  = Japanese (Shift-JIS)")
print("    949  = Korean (EUC-KR)")
print("    65001 = UTF-8")
print("")
print("Sample Size Guide:")
print("    many:   20 files, long filenames (30-60 chars) - Auto detection SHOULD work")
print("    medium: 10 files, medium filenames (15-30 chars) - Auto detection MAY work")
print("    few:    3 files, short filenames (5-15 chars) - Auto detection may FAIL")
print("")
print("Important Notes:")
print("    • These files have NO UTF-8 flag (Bit 11 = 0)")
print("    • They CAN be repaired using 7-Zip -mcp parameter")
print("    • 7-Zip GUI does NOT support code page switching - use command line")
print("    • Filenames are REAL CJK text (not English translations)")
print("=" * 70)

I have also encountered this issue, and you can use the above script to generate test ZIP files.

I would also like automatic encoding detection. Of course, when an archive holds only a few files with short filenames, the sample is small and accuracy will be low, so there should be a fallback that lets the user choose the encoding. Even if automatic detection is never implemented, users should still be able to select the encoding manually.

All filenames in a single archive normally share the same encoding, so aggregating them into one sample would improve accuracy. For automatic detection, libraries such as compact_enc_det or uchardet could serve as references.
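
The "aggregate, then fall back" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `decode_archive_names` and its parameters are hypothetical, and the only "detection" it performs is a strict UTF-8 validity check over all names joined together. A real detector such as compact_enc_det or uchardet would sit where the fallback is chosen.

```python
def decode_archive_names(raw_names, utf8_flag_set, fallback_encoding):
    """Decode every filename of one archive with a single, shared encoding.

    raw_names:         filenames as raw bytes, exactly as stored in the ZIP
    utf8_flag_set:     True if the entries carry the UTF-8 flag (bit 11)
    fallback_encoding: encoding chosen by the user, e.g. "gbk" or "cp437"
    Returns (decoded_names, encoding_used).
    """
    if utf8_flag_set:
        chosen = "utf-8"  # the archive says so explicitly
    else:
        # Treat all names as one sample: a single invalid name rules out UTF-8.
        sample = b"\n".join(raw_names)
        try:
            sample.decode("utf-8")
            chosen = "utf-8"
        except UnicodeDecodeError:
            chosen = fallback_encoding  # the user's manual choice
    # errors="replace" keeps the listing usable even if the choice is wrong.
    return [name.decode(chosen, errors="replace") for name in raw_names], chosen

# Unflagged GBK names: UTF-8 is rejected, so the user's choice is applied.
gbk_names = [n.encode("gbk") for n in ["中文测试文件.txt", "项目文档说明.doc"]]
print(decode_archive_names(gbk_names, False, "gbk"))
```

The same function also covers archives whose names are valid UTF-8 but lack the flag: the aggregated sample passes the UTF-8 check and the fallback is never used.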

The ZIP format does not mandate a field for filename encoding. Although mainstream tools now default to UTF-8 and set the corresponding flag, I cannot guarantee that ZIP files sent to me by others will use UTF-8 and have the flag set.
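
For anyone who wants to check whether a given archive sets that flag, here is a minimal Python sketch using the standard `zipfile` module. The flag is bit 11 (0x800) of each entry's general purpose bit flag; the helper name `utf8_flag_report` is hypothetical.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

def utf8_flag_report(zip_source):
    """Map each entry name to True/False depending on its UTF-8 flag."""
    with zipfile.ZipFile(zip_source) as zf:
        # Note: Python itself decodes unflagged names as cp437 by default,
        # so non-ASCII keys may look garbled when the flag is False.
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}

# Demo on an in-memory archive written by Python's own zipfile module.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "a")      # ASCII name: flag not needed
    zf.writestr("日本語.txt", "b")      # non-ASCII name: stored as UTF-8 with the flag
buf.seek(0)
report = utf8_flag_report(buf)
print(report)
```

Passing the path of a received archive instead of `buf` shows whether its entries declare UTF-8 or leave the encoding unspecified.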

Sorry, I’m unable to share a ZIP sample—both for client data confidentiality and because I usually delete the original archive after extraction.

I noticed the method mentioned by user wincitng might be helpful. Thanks to @wincitng!

Without a sample I don't think we'll be able to look at it unfortunately.

CN_2.12_src.zip (10.0 MB)

The 'Guess UTF-8' feature is currently enabled. If you open the archive directly, the filenames will appear as garbled text (mojibake). I turned this setting on because a client previously sent me a ZIP file created on a Mac.

There's no option called 'Guess UTF-8' - do you mean 'Assume UTF-8 filenames'?

If that's turned on and gives garbled filenames, the correct procedure is to turn the option off, as it presumably means the filenames aren't stored in UTF-8. This isn't a bug; you're basically telling Opus the filenames are UTF-8 when they're not.