I was wondering whether it is possible to search images by the text shown in them. I need to search a hard drive full of files for those (text files or images) that contain certain text, regardless of whether they are text files, documents or images (screenshots etc.).
Not in general, although it could be done with an OCR IFilter. Maybe one already exists, since IFilters are also supported by parts of Windows and Office.
So... the answer is no. I haven't found any tool that could index my NAS and search my gazillion screenshot files by the text they contain. Phew, I'm in trouble...
Tesseract and Opus let you quickly build a text mirror of your image collection which can easily be searched with Opus or any convenient search tool.
A button for Tesseract can be as simple as
@nodeselect
@nofilenamequoting
"/programfiles\Tesseract-OCR\tesseract.exe" "{filepath}" "{filepath}.txt"
Thanks for the reply. If I understand correctly, that OCRs one selected file to text when the button is pressed? How can I extract those .txt files from the whole NAS into a certain folder and then just search that folder?
I asked AI about that and it suggested:
# Define paths
$srcFolder = '\\NAS\MyPictures\'
$dstFolder = 'C:\OCR-Results\'
# Full Windows path; the /programfiles alias only works inside Opus
$tesseract = 'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Make sure the destination folder exists
New-Item -ItemType Directory -Path $dstFolder -Force | Out-Null
# Get all jpg, jpeg and png files in source directory and subdirectories
$files = Get-ChildItem -Path $srcFolder -Recurse -Include *.jpg, *.jpeg, *.png
# Loop through all files
foreach ($file in $files) {
    # Define output base name; Tesseract appends ".txt" to it itself
    $outputBase = Join-Path $dstFolder $file.BaseName
    # Run Tesseract OCR one file at a time and write the result to the output base
    Start-Process -FilePath $tesseract -ArgumentList "`"$($file.FullName)`" `"$outputBase`"" -Wait -NoNewWindow
}
Same principle in Opus. Get a list of files via Find or Flatview/Filter and run Tesseract on it. You can use codes in the button to rebuild the folder tree.
@nodeselect
@nofilenamequoting
"/programfiles\Tesseract-OCR\tesseract.exe" "{filepath}" "C:\OCR-Results\{filepath|noroot}.txt"
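Nothing wrong with PowerShell, of course. Just note that outside Opus you need the Windows notation for the path to tesseract.exe (e.g. C:\Program Files\Tesseract-OCR\tesseract.exe), since /programfiles is an Opus folder alias.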
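For anyone who prefers a script outside Opus, the same idea (OCR every image and rebuild the folder tree under a separate mirror folder) could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function names, the extension list and the example paths are made up for illustration, and it relies on Tesseract appending ".txt" to the output base it is given.

```python
import subprocess
from pathlib import Path

# Hypothetical choice of image types; extend as needed.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def plan_ocr_jobs(src_root, dst_root):
    """Return (image, output_base) pairs that mirror src_root's tree under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    jobs = []
    for image in sorted(src_root.rglob("*")):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTS:
            # Keep the full file name (shot.png) so shot.png and shot.jpg
            # end up as shot.png.txt and shot.jpg.txt instead of colliding.
            output_base = dst_root / image.relative_to(src_root)
            jobs.append((image, output_base))
    return jobs


def run_ocr_jobs(jobs, tesseract_exe):
    """Run Tesseract once per job, creating mirror folders as needed."""
    for image, output_base in jobs:
        output_base.parent.mkdir(parents=True, exist_ok=True)
        # Tesseract writes its result to "<output_base>.txt".
        subprocess.run([tesseract_exe, str(image), str(output_base)], check=False)


if __name__ == "__main__":
    # Example paths only; adjust to your own setup.
    planned = plan_ocr_jobs(r"\\NAS\MyPictures", r"C:\OCR-Results")
    run_ocr_jobs(planned, r"C:\Program Files\Tesseract-OCR\tesseract.exe")
```

Splitting the planning from the running makes it easy to print the planned pairs first and check that the mirror paths look right before any OCR is started.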
I ended up with this. Does it look usable and correct to you?
@nodeselect
@nofilenamequoting
@runonce:@set dirname={filepath|root}\OCR-files
@runonce:@set relpath={filepath|..}
@runonce:@set relpath={$relpath|noterm}
CreateFolder NAME "{$dirname}{$relpath}" READAUTO=no
"path\to\tesseract.exe" "{filepath}" "{$dirname}{$relpath}\{file|noext}.txt"
Try
@nodeselect
@nofilenamequoting
@runonce:@set dirname={filepath|\}OCR-files
@runonce:@set relpath={filepath|..|noroot|noterm}
CreateFolder NAME="{$dirname}\{$relpath}" READAUTO=no
"path\to\tesseract.exe" "{filepath}" "{$dirname}\{$relpath}\{file|noext}.txt"
Using this now:
@nodeselect
@nofilenamequoting
@runonce:@set dirname={filepath|\}OCR-files
@runonce:@set relpath={filepath|..|noroot|noterm}
CreateFolder NAME="{$dirname}\{$relpath}" READAUTO=no
"C:\Program Files\Tesseract-OCR\tesseract.exe" "{filepath}" "{$dirname}\{$relpath}\{file}"
I noticed that the OCR makes quite a lot of mistakes, which is a bit surprising. When there is clear text that says for example "Liike" it outputs "Like" and on some words it is missing multiple characters. I also noticed that it won't output umlauts. I guess it requires configuration somewhere. Have two kids to burn my time, so don't have the time to dig deeper.
Adding languages to the command line should help, at least when handling special characters. It's been a while since I last used Tesseract, so my knowledge is not current.
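As a rough illustration of what adding languages looks like, here is a small Python sketch that builds a Tesseract argument list with the -l option. The helper name is made up, and "fin" (Finnish) is only an assumption based on the "Liike" example; whichever codes you use need their matching .traineddata files installed.

```python
def tesseract_args(tesseract_exe, image, output_base, languages=("eng",)):
    """Build a Tesseract command line; several languages are joined with '+' for -l."""
    return [tesseract_exe, image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Assumed example: Finnish first, English as a second language.
    print(tesseract_args("tesseract.exe", "shot.png", "shot", ("fin", "eng")))
```

In an Opus button the equivalent would be appending something like -l fin+eng after the output argument of the tesseract.exe line.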
Awesome! Thanks for this, guys!
Yeah, it makes a lot of mistakes... but the 3 other programs I've been using do as well. Each one comes up with different misspellings or random symbols. Haha.
It's nice having this automated so I don't have to spend time opening other software!
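Since the OCR output contains misspellings, a plain text search over the mirror folder can miss files. One possible workaround, sketched here in Python, is to fall back to a fuzzy word comparison. Everything in it is an assumption for illustration: the function names, the 0.8 similarity threshold, and the idea that each OCR result sits next to a name like shot.png.txt.

```python
import re
from difflib import SequenceMatcher
from pathlib import Path


def text_matches(text, needle, threshold=1.0):
    """True if needle occurs in text; with threshold < 1.0, near-miss words also count."""
    text, needle = text.casefold(), needle.casefold()
    if needle in text:
        return True
    if threshold >= 1.0:
        return False
    # Compare the needle with every word and accept close-enough spellings.
    return any(
        SequenceMatcher(None, word, needle).ratio() >= threshold
        for word in re.findall(r"\w+", text)
    )


def search_mirror(mirror_root, needle, threshold=1.0):
    """Return mirror paths (with the trailing '.txt' removed) whose OCR text matches."""
    hits = []
    for txt in sorted(Path(mirror_root).rglob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        if text_matches(text, needle, threshold):
            hits.append(txt.with_suffix(""))  # shot.png.txt -> shot.png
    return hits
```

With the default threshold the search is a strict case-insensitive substring match. Lowering it to around 0.8 lets a search for "Liike" also find a file where the OCR produced "Like", at the cost of more false positives.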
I used the top two buttons for a long time, but now they have stopped working. All I added to the system (after updating Tesseract to 5.4) are the two Windows environment variables "Path" and "TESSDATA_PREFIX". Why doesn't it work now? Is this related to the current version of DOpus? Is there any test I can run?
If the command isn't working and you've modified it, showing us the command you're using is best.
No, Leo, sorry. I use the same buttons as above, but now they have stopped working. Who knows why?
Either the button is generating the wrong command line, or the command itself isn't working for some reason.
I don't know anything about Tesseract, but if I were trying to work out what was wrong, I'd look at that.
See what command the button generates (the button editor's Run button has some things in its menu to help here) and, if that looks correct, try running that command from a Command Prompt without Opus being involved.
Oh! That's a great help! I set the wrong path for TESSDATA_PREFIX. Now it works!
"Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory."
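For anyone hitting the same message, a quick sanity check of the variable could look like the Python sketch below. It assumes, as the quoted message suggests, that TESSDATA_PREFIX should point at the tessdata directory itself, i.e. the folder that directly contains the *.traineddata files. The function names are made up for illustration.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir):
    """List language codes that have a .traineddata file directly in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


def check_tessdata_prefix(env=os.environ):
    """Return a short status message about the TESSDATA_PREFIX setting in env."""
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return "TESSDATA_PREFIX is not set"
    langs = installed_languages(prefix)
    if not langs:
        return f"No .traineddata files found in {prefix}"
    return "OK: " + ", ".join(langs)


if __name__ == "__main__":
    print(check_tessdata_prefix())
```

If the check reports no .traineddata files, the variable is probably pointing one level too high or too low, or at an old install folder.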