Determine if PDF is searchable?

Hi,

Is Directory Opus able to set a marker so that the user can distinguish searchable from non-searchable PDF files?
For example, by creating a column "searchable" and putting an S (or any marker) on each line for searchable PDF files?
(and not setting a marker for non-searchable ones)

Explanation:
I have lots of PDF files, some of them are searchable, some not.
I want to get an overview of searchable/non-searchable files without opening each file manually to check whether it's searchable.

The main goal behind this question is that I have lots of PDF files and I want to make them all searchable. To do that, I first need an overview of which PDF files are already searchable and which are not.

Is Directory Opus able to help me?

Thanks,
Mathijs

Anyone?

Opus can search inside PDF documents if the right software is installed, but Opus itself isn't what does the actual content searching: it relies on the IFilters which PDF software installs to make that possible (the same as Windows Search etc. also does).

There might be a way to find out if an IFilter is installed and able to search the contents of a particular file, but I don't know of it. Which is why I didn't reply; I don't know.

If you find a solution, I'd be interested myself, but this seems like a "when you have DOpus everything looks like a nail" problem to me :) If you can find a command line tool which can extract this info (neither cpdf nor mutool can do it from what I see) or find the flag somewhere in the file, then people could help you with extracting the info via scripting.

Checking the mere existence of embedded fonts in a PDF file does not indicate that the file is 100% searchable (the file could have 1 line of text and the rest of the text as images). The lack of embedded fonts might indicate that the file has no text layer at all, but that's just a hunch; I'm no PDF expert. In that case both cpdf & mutool can help. They can list the embedded fonts on the command line, i.e. their output can be parsed via scripting, but you'll have to experiment with the 2 tools and your files to verify the hunch.
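To try the font hunch without any extra tool, here is a very rough Python sketch. It only scans the raw bytes of the file for the `/Font` name token, so it is not a substitute for what cpdf or mutool report. The function names are made up for illustration, and there is a big caveat: newer PDFs can keep their dictionaries inside compressed object streams, in which case the token won't be visible and the result says nothing either way.

```python
import re
from pathlib import Path

# Matches "/Font" only as a complete name token (so not "/FontDescriptor").
# Assumption: the font dictionaries are stored uncompressed in the file.
FONT_KEY = re.compile(rb"/Font(?![A-Za-z0-9])")


def mentions_fonts(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF data contains a /Font name token."""
    return FONT_KEY.search(pdf_bytes) is not None


def file_mentions_fonts(path: Path) -> bool:
    """Convenience wrapper: read the whole file and scan it."""
    return mentions_fonts(path.read_bytes())
```

A `False` here is only a hint that the file may be image-only, and a `True` does not prove that all of the text is searchable, for the reason given above.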


Has support for the original question in this thread improved in the latest version of Opus?

There's nothing built-in that can tell you if a PDF is "searchable" or not. What that means will depend on which IFilters you have installed anyway. Some IFilters can do OCR and would make almost any PDF searchable.

Similar questions:
On the XYplorer forum (by mgroen)
On the Total Commander forum (by mgroen)

I once wrote something using pdftotext.exe. This was at its core (CMD/batch):

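rem OUT-FOLDER is assumed to be set earlier in the script.
if not exist "%OUT-FOLDER%" md "%OUT-FOLDER%"

for %%X in (*.pdf) do (
	echo.    [%%X]
	rem Extract whatever text the PDF contains into a temporary file.
	pdftotext.exe -simple "%%X" .\checkthis.txt
	rem Less than 25 bytes of text: treat the PDF as not searchable and move it.
	for %%C in (checkthis.txt) do if %%~zC LSS 25 ( move "%%X" "%OUT-FOLDER%" )
	del checkthis.txt
)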
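
For anyone who prefers a script that only reports instead of moving files, the same idea can be sketched in Python. This is a rough sketch under a few assumptions: `pdftotext` is on the PATH and writes to standard output when the output file is given as `-`, and the function names are my own. The 25-character threshold is borrowed from the batch file above, but it counts non-whitespace characters rather than bytes, so the form feeds that pdftotext emits between pages don't count as text.

```python
import subprocess
from pathlib import Path


def looks_searchable(extracted_text: str, min_chars: int = 25) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters."""
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible >= min_chars


def extract_text(pdf_path: Path) -> str:
    """Run pdftotext on one file and return its output ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")


def report(folder: Path) -> None:
    """Print one line per PDF in folder: 'S' for searchable, '-' otherwise."""
    for pdf in sorted(folder.glob("*.pdf")):
        marker = "S" if looks_searchable(extract_text(pdf)) else "-"
        print(f"{marker}  {pdf.name}")
```

Calling `report(Path("."))` in a folder of PDFs prints one marker line per file. The result still only says whether pdftotext found a text layer; it says nothing about what an OCR-capable IFilter could find.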

Has support for the original question in this thread improved in the latest version of Opus?

No change on our side.

It will still depend on external software what “searchable PDF” means, since OCR is possible even with PDFs which contain only image data.

You could display the PDF metadata with ExifTool Custom Columns and determine the searchability indirectly from fields like Producer.

"Searchable" means it contains text (not images with text, but actual text), right?
So why not just perform a regex search to find matches for any character/word?

Correct me if I'm wrong, but the Producer field can be set even if the PDF is just images.

Yes, it's just an educated guess.
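
To make the Producer guess a bit more concrete, here is a minimal Python sketch of the matching step only. It assumes the Producer value has already been read (for example with something like `exiftool -s3 -Producer file.pdf`), and the function name and the hint strings are hypothetical placeholders, not a list of real producer values. As noted above, this stays a guess: a matching Producer doesn't prove there is a text layer.

```python
from typing import Sequence


def producer_suggests_ocr(producer: str, hints: Sequence[str]) -> bool:
    """Return True if any hint occurs in the Producer string (case-insensitive)."""
    lowered = producer.casefold()
    return any(hint.casefold() in lowered for hint in hints)


# Hypothetical hint list; replace it with producer names seen in your own files.
EXAMPLE_HINTS = ("ocr",)
```

The hint list is the part that needs work: you would have to look at the Producer values of a few files you know to be searchable and build the list from those.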

Just my 2ct: Everything can index a PDF's content (must be properly configured, though) and search with keyword content:

Maybe one can use this as a basis for implementing an Evaluator column or a JScript column to show a flag for a searchable PDF?