PDF metadata salad

mazeckenrode · May 30, 2023, 5:54pm

Any idea why Directory Opus would display certain PDF metadata as lengthy nonsensical text strings for some files? I’ve been trying to determine what the common factors are, without success so far. Unfortunately, I can’t share any of the metadata that gets mangled when displayed by DOpus because all instances that I’ve noticed so far contain sensitive data. To give you some idea what I’m seeing, I have a PDF with metadata that was initially written using PDF-XChange Editor, then modified using ExifTool, then linearized using QPDF. The following are sanitized versions of various metadata values written to it:

Authors: Goofy [Google Sheet]; Mickey Mouse [PDF]

Subject: Financial transactions between Mickey Mouse & Donald Duck, 1923–2023, via Google Sheets spreadsheet by Goofy, 30 May 2023, ~12:00:00; File as downloaded: “20230530 - Goofy - Mickey & Donald Transactions 1923–2023.pdf” (742,895) [157 pp, 223309 w, 105121483/1210375631, 26025 l]; Source: <https://docs.google.com/spreadsheets/d/xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxy/edit>

Title: Financial transactions between Mickey Mouse & Donald Duck, 1923–2023, via Google Sheets spreadsheet by Goofy, 30 May 2023, ~12:00:00

When I write the values above to a PDF using the software outlined above, DOpus correctly displays all three fields via the Metadata pane or Set Metadata panel. But when I wrote very similarly-constructed values of similar respective lengths to an actual PDF of importance, DOpus displays them as such:

Authors: eff004b0065007600690[…] (231 characters total)

Subject: eff00460069006e00610[…] (1547 characters total)

Title: eff00460069006e00610[…] (567 characters total)

Any suggestions?

Leo · May 30, 2023, 6:49pm

What does File Explorer or other software show for the same files?

mazeckenrode · May 30, 2023, 7:12pm

Both PDF-XChange Editor and ExifTool display/extract the metadata as originally written.

Re: File Explorer… I assume you mean Windows’ built-in Properties dialog > Details tab > Description section? Looks like for any PDF I try, no metadata is displayed at all. I’m currently on a Windows 7 laptop, by the way. I can try these files on my Windows 10 laptop later, if needed, but I can tell you that I’ve seen this kind of PDF metadata salad displayed by DOpus on that laptop as well.

Leo · May 30, 2023, 7:15pm

File Explorer may show the same details in columns if they aren't in the Properties dialog, although I don't know what Windows 7 did with PDFs.

Hard to say much else without example files to look at.

roirraWedorehT · May 30, 2023, 8:32pm

I confirm that on Windows 10 21H2, File Explorer doesn't show Authors or Title (I didn't find or set a PDF to have Subject) - neither via the Properties dialog on the Details tab, nor in File Explorer columns, and there were multiple example PDFs with Authors and Titles in the folder - upwards of several dozen.

I even copied the PDFs elsewhere, first to a simple, short folder structure (E:\Test), and then to a third location on another drive (S:\Test), and File Explorer still never shows anything for Authors or Title, whether in Properties or in File Explorer columns. I double-checked, and the metadata was still present in Directory Opus 12.31. All drives are formatted in standard NTFS.

I don't normally set MetaData fields for PDFs (or most file types, other than music), so I don't have any insight on the issue presented here.

Leo · May 30, 2023, 8:36pm

It could depend on the PDF software (last) installed, since metadata shell extensions can provide details like that, but maybe not all software does. Not something I'm expert on though.

roirraWedorehT · May 30, 2023, 9:04pm

Just for more data relating to this puzzle, I don't have any PDF software installed at all, not counting Google Chrome (which is my default PDF viewer) and Microsoft Edge that comes with Windows. But, as you say, maybe whether File Explorer would show that information or not would depend on some correct software being installed, but my guess is simply that File Explorer doesn't support PDF metadata at all.

mazeckenrode · May 31, 2023, 1:37pm

I still haven’t managed to produce an example PDF that I can share, but in an attempt to do so, I’ve made copies of several existing problem PDFs, added blank pages to them, deleted all other pages, then viewed the metadata in Directory Opus, which displayed all metadata correctly at that point. Not sure if that tells us anything useful. The PDFs I have this problem with are often business invoices, which typically range from a few pages to, say, 20 pages in length. They’re not all produced by the same software, and they’re not all the same PDF standard version.

mazeckenrode · May 31, 2023, 3:56pm

Woo hoo! Finally did it! See the attached PDF, which, for me, displays both SUBJECT and TITLE as lengthy nonsensical strings.

For what it’s worth, my process to create this PDF was:

1. Made a copy, with custom name, of an existing 2-page invoice already populated with metadata.
2. Loaded into PDF-XChange Editor.
3. Inserted 2 blank pages.
4. Deleted all other pages.
5. Replaced existing metadata with new dummy metadata (but note that the old metadata still exists at this point).
6. Used ExifTool to import PDF creation and modification dates from filename.
7. Used ExifTool to export all current metadata to JSON file.
8. Because PDF-XChange Editor appears to write metadata field PDF:Keywords incorrectly (based on ExifTool-exported JSON showing ;"," between keywords instead of ","), loaded the JSON into a text editor and replaced ;"," with ",".
9. Used ExifTool to re-import the edited data from the exported JSON. NOTE that it was at this point that the PDF’s metadata first started being displayed incorrectly by DOpus, but I use this same process on practically ALL PDFs I add or modify metadata in, and they don’t all have that problem.
10. Used QPDF to linearize the PDF (deletes all old metadata and keeps only the most recently added or modified).
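
For anyone wanting to script the text-editor step above rather than doing it by hand, here is a minimal sketch in Python. The function name and the sample keywords are hypothetical, and it assumes the exported JSON really does contain the literal sequence ;"," between keywords, as described; it is not a description of how ExifTool or PDF-XChange Editor work internally.

```python
def fix_keyword_separators(json_text: str) -> str:
    """Replace the stray ;"," separator between keywords with a plain ",".

    Assumption: json_text is the raw text of the exported JSON file,
    and ;"," only ever appears between keyword entries.
    """
    return json_text.replace(';","', '","')


# Hypothetical sample line, shaped like the problem described above.
sample = '"Keywords": ["alpha;","beta;","gamma"]'
print(fix_keyword_separators(sample))
```

Reading the JSON file into a string, passing it through this function, and writing it back out would stand in for the manual search-and-replace.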

Attached: “2023-05-31 10;00;00 - Test.7z” (3,094)

Contents:

“2023-05-31 10;00;00 - Test\”
“2023-05-31 10;00;00 - Test.pdf” (10,706)
“2023-05-31 10;00;00 - Test.pdf.json” (3,175)


Leo · May 31, 2023, 4:25pm

Many thanks, we'll take a look.

mazeckenrode · July 2, 2023, 2:51pm

So, did anything ever come of this?

Leo · July 2, 2023, 8:25pm

We worked out what the issue is (some PDF software using an absolutely insane way to double-encode text-as-text) and have it on the list to see if we can support what they're doing. I don't have an ETA because we're very busy finishing other work at the moment.

mazeckenrode · July 2, 2023, 8:47pm

Very curious, then, that not all of the PDFs that I use the same process on have that same problem. Seems like something I should maybe bring up in ExifTool’s forum, since that’s the tool that does the last metadata update before the problem rears its ugly head, when it does.

mazeckenrode · July 27, 2023, 6:33pm

@Leo

I started a topic about this issue at ExifTool’s forum. ExifTool author Phil Harvey had this to say:

DOpus is probably having trouble with Subject because it is stored as a hex string… But this is a perfectly valid storage format. DOpus should support this.
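
To illustrate what a hex string means here, the following is a rough Python sketch of how a reader could turn a PDF hex string token back into text. The function name and the sample value are made up for illustration. It assumes the common case of UTF-16BE text with a leading FEFF byte order mark, and falls back to Latin-1 as a simplified stand-in for PDFDocEncoding; it is not the method used by DOpus or ExifTool.

```python
import re


def decode_pdf_hex_string(token: str) -> str:
    """Decode a PDF hex string token such as '<FEFF00540065>' into text."""
    body = token.strip()
    # Drop the enclosing angle brackets if they are present.
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    # White space inside a hex string is ignored.
    digits = re.sub(r"\s+", "", body)
    # With an odd number of digits, the last digit is treated as if
    # it were followed by 0.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    # FE FF at the start marks the text as UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Simplification (assumption): treat everything else as Latin-1,
    # which only approximates PDFDocEncoding.
    return raw.decode("latin-1")


# Hypothetical value, not taken from any file in this thread.
print(decode_pdf_hex_string("<FEFF0054006500730074>"))
```

A viewer that shows the hex digits themselves instead of decoding them would produce the kind of long nonsensical string described at the top of this thread.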

If that explanation doesn’t mesh with your assessment, please feel free to add to the ExifTool thread, or tell me something useful here that I can pass along.

Of course, everything I’ve read here and at ExifTool still doesn’t explain why it only happens for some PDFs, and not always the same metadata fields.

Leo · July 27, 2023, 6:37pm

We should support it, I agree. (It's also an incredibly strange and inefficient way to store information. But if it's some kind of standard then we will support it, now we're aware of it, and when we have time after finishing other work.)

mazeckenrode · July 31, 2023, 8:03pm

Not that this should affect your plans to eventually support hex-string PDF metadata, but just FYI, Phil Harvey (author of ExifTool) posted last week:

I replied that I routinely use certain extended ASCII characters in metadata, and almost never use UTF-specific characters, and not all of my PDF metadata is garbled when viewed in DOpus. He then posted:

The jury’s still out on whether any of those control characters actually exist in my metadata, or how they could have been introduced, though.

Leo · August 8, 2023, 6:10pm

We have a fix for this coming in the next update.

I also now understand why that string format is used sometimes, as we've had to do the same when strings contain certain characters that have special meaning in PDF files. I could not find a way to avoid it that didn't cause one program or another to misinterpret the strings.

None of this is ExifTool's fault; what it's doing makes sense. It's the fault of the PDF format itself, which has at least three ways to represent a text string, where none of them are actually good. Not that anyone can fix that now.

mazeckenrode · August 8, 2023, 9:06pm

@Leo

Cool, thanks.

Just some additional FYI, in case anybody cares: Part of my inquiry in the ExifTool support forum was focused on attempting to learn why my use of certain extended ASCII characters (“” ‘’ – — © ® é and possibly others) in PDF metadata does not always result in a garbled display by DOpus, despite being subjected to the same workflow and processing (including by ExifTool) as other PDFs which do result in a garbled display, but it appears that an answer is not forthcoming as of right now.

Leo · August 8, 2023, 9:09pm

It may be those combined with the ) character, although I am just guessing from what I ran into today while testing the changes on our side. As far as I can tell, if there are both non-ASCII characters and a ) character in a string, it has to be encoded in a special way.
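
As a rough illustration of the two string forms being discussed, here is a small Python sketch that writes the same kind of text either as a PDF literal string (with backslash escapes for parentheses) or as a hex string holding UTF-16BE text. The function names and sample values are hypothetical, and the choice of which form to use for which input is only a guess at the kind of decision a writer might make; it is not the actual logic of ExifTool or Directory Opus.

```python
def to_literal_string(text: str) -> str:
    """Write ASCII-only text as a PDF literal string, escaping \\, ( and )."""
    # Raises UnicodeEncodeError if the text is not pure ASCII; this sketch
    # does not attempt non-ASCII literal strings.
    text.encode("ascii")
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"


def to_hex_string(text: str) -> str:
    """Write any text as a PDF hex string: FEFF marker plus UTF-16BE bytes."""
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + raw.hex().upper() + ">"


# Hypothetical sample values.
print(to_literal_string("Invoice (copy)"))
print(to_hex_string("é)"))
```

The hex form needs no escaping at all, which may be why a writer would fall back to it when a string mixes characters that are awkward to escape.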

mazeckenrode · August 9, 2023, 5:53pm

My attempts to confirm that here have failed, for what it’s worth.