Can Directory Opus set a marker so that the user can distinguish searchable from non-searchable PDF files?
For example, by creating a column "Searchable" and putting an S (or any marker) on each line for searchable PDF files?
(and setting no marker for non-searchable ones)
Explanation:
I have lots of PDF files, some of them are searchable, some not.
I want to get an overview of searchable/non-searchable files without opening each file manually to check whether it's searchable.
The main goal behind this question is that I want to make all of my PDF files searchable, but to do that I first need an overview of which ones are already searchable and which are not.
Opus can search inside PDF documents if the right software is installed, but Opus itself isn't what does the actual content searching: it relies on the IFilters which PDF software installs to make that possible (the same way Windows Search etc. do).
There might be a way to find out if an IFilter is installed and able to search the contents of a particular file, but I don't know of it. Which is why I didn't reply; I don't know.
If you find a solution I'd be interested myself, but this seems like a "when you have DOpus everything looks like a nail" problem to me. If you can find a command-line tool which can extract this info (neither cpdf nor mutool can do it, from what I see) or find the flag somewhere in the file, then people could help you with extracting the info via scripting.
Checking the mere existence of embedded fonts in a PDF file does not indicate that the file is 100% searchable (the file could have 1 line of text and the rest of the text as images), but the lack of embedded fonts might indicate the file is not indexed. It's just a hunch, though; I'm no PDF expert. In that case both cpdf and mutool can help: they can list the embedded fonts on the command line, i.e. their output can be parsed via scripting, but you'll have to experiment with the two tools and your files to verify the hunch.
There's nothing built-in that can tell you if a PDF is "searchable" or not. What that means will depend on which IFilters you have installed anyway. Some IFilters can do OCR and would make almost any PDF searchable.
I once wrote something using pdftotext.exe. This was its core (CMD/batch):
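rem OUT-FOLDER must be set beforehand: it receives the PDFs without a text layer
if not exist "%OUT-FOLDER%" md "%OUT-FOLDER%"
for %%X in (*.pdf) do (
    echo. [%%X]
    rem Extract the text layer of the current PDF to a temporary file
    pdftotext.exe -simple "%%X" .\checkthis.txt
    rem Under 25 bytes of extracted text: treat the PDF as non-searchable and move it
    for %%C in (checkthis.txt) do if %%~zC LSS 25 ( move "%%X" "%OUT-FOLDER%" )
    del checkthis.txt
)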
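If an overview is preferable to moving files around, the same idea could be sketched in Python roughly like this. It is only a sketch under a few assumptions: pdftotext (Xpdf or Poppler) is on the PATH, the 25-byte threshold from the batch script is kept as an arbitrary cut-off, and the function names (extract_text, marker_for, overview) are made up for illustration. Unlike the batch version it strips whitespace and page breaks before measuring, so results near the threshold can differ.

```python
import subprocess
from pathlib import Path

MIN_BYTES = 25  # same threshold as the batch script; an arbitrary cut-off


def extract_text(pdf_path: Path) -> str:
    """Return the text layer of a PDF via pdftotext ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=False,  # a failed extraction simply yields empty text
    )
    return result.stdout.decode("utf-8", errors="replace")


def marker_for(text: str, min_bytes: int = MIN_BYTES) -> str:
    """Return "S" when the extracted text reaches the threshold, else ""."""
    return "S" if len(text.strip().encode("utf-8")) >= min_bytes else ""


def overview(folder: Path) -> dict[str, str]:
    """Map each PDF file name in the folder to its marker ("S" or "")."""
    return {p.name: marker_for(extract_text(p)) for p in sorted(folder.glob("*.pdf"))}
```

Calling overview(Path(".")) in a folder of PDFs gives a name-to-marker mapping that can be printed or written to a CSV file. The same caveat applies as with the batch script: a PDF with a single line of real text and the rest as images can still pass the threshold.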
"Searchable" means it contains text (not images with text, but actual text), right?
So why not just perform a regex search to find matches for any character/word?
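One way this could look in practice is sketched below. It assumes the regex is run on text already extracted with pdftotext rather than on the raw PDF file, since page content inside a PDF is usually stored compressed and a plain regex over the file bytes would tend to miss it. It also assumes pdftotext's default behaviour of ending each page with a form feed, which allows a per-page check. The names WORD, pages_with_text and text_page_ratio are hypothetical.

```python
import re

WORD = re.compile(r"\w{2,}")  # at least two consecutive word characters


def pages_with_text(extracted: str) -> list[bool]:
    """For each page, report whether the regex finds any word-like match."""
    pages = extracted.split("\f")
    # pdftotext ends every page with a form feed, so drop the empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [bool(WORD.search(page)) for page in pages]


def text_page_ratio(extracted: str) -> float:
    """Fraction of pages with at least one match (0.0 for empty input)."""
    flags = pages_with_text(extracted)
    return sum(flags) / len(flags) if flags else 0.0
```

A ratio of 1.0 would suggest every page has some text, 0.0 none, and anything in between a mixed file where only some pages carry a text layer.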
Correct me if I'm wrong, but you can set the Producer field even if the PDF is just images.