Is it possible to search files by text shown in images?

tunttunen · July 30, 2023, 8:02pm

I was wondering whether it is possible to search images by the text shown in them. I need to search a hard drive full of files for the ones (text files or images) that contain certain text, regardless of whether they are text files, documents, or images (screenshots etc.).

Leo · July 30, 2023, 8:15pm

Not in general, although it could be done with an OCR IFilter. Maybe one already exists, since IFilters are also supported by parts of Windows and Office.

tunttunen · August 1, 2023, 5:53pm

So... the answer is no. I haven't found any tool that could index my NAS and let me search my gazillion screenshot files by the text they contain. Phew, I'm in trouble...

lxp · August 1, 2023, 6:37pm

Tesseract and Opus let you quickly build a text mirror of your image collection, which can then easily be searched with Opus or any convenient search tool.

A button for Tesseract can be as simple as

@nodeselect
@nofilenamequoting
"/programfiles\Tesseract-OCR\tesseract.exe" "{filepath}" "{filepath}.txt"
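
Once the .txt files exist, "any convenient search tool" could be as small as the following Python sketch. It is only an illustration: the function name `find_text_files`, the folder and the search term are made up here, and it assumes the OCR output is UTF-8 text.

```python
from pathlib import Path


def find_text_files(root, term):
    """Yield (path, line number, line) for every line under root containing term."""
    needle = term.casefold()
    for txt in sorted(Path(root).rglob("*.txt")):
        # Assumes UTF-8 output; undecodable bytes are skipped instead of aborting.
        with open(txt, encoding="utf-8", errors="ignore") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.casefold():
                    yield txt, number, line.strip()


if __name__ == "__main__":
    # Hypothetical folder and search term, for illustration only.
    for path, number, line in find_text_files(r"C:\OCR-Results", "invoice"):
        print(f"{path}:{number}: {line}")
```

The match is case-insensitive and reports the text file, so the image it belongs to can be found from the file name.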

tunttunen · August 1, 2023, 7:09pm

Thanks for the reply. If I understand correctly, that OCRs one selected file to text when the button is pressed? How can I extract those .txt files from the whole NAS into a certain folder and then just search that folder?

I asked AI about that and it suggested:

# Define paths
$srcFolder = '\\NAS\MyPictures\' 
$dstFolder = 'C:\OCR-Results\'

# Get all jpg, jpeg and png files in source directory and subdirectories
$files = Get-ChildItem -Path $srcFolder -Recurse -Include *.jpg, *.jpeg, *.png

# Loop through all files
foreach ($file in $files) {
    # Define output file name
    $outputFile = Join-Path $dstFolder ($file.BaseName + ".txt")

    # Run Tesseract OCR and output result to defined file
    Start-Process -FilePath "/programfiles\Tesseract-OCR\tesseract.exe" -ArgumentList "`"$($file.FullName)`" `"$outputFile`""
}

lxp · August 1, 2023, 7:29pm

Same principle in Opus. Get a list of files via Find or Flatview/Filter and run Tesseract on it. You can use codes in the button to rebuild the folder tree.

@nodeselect
@nofilenamequoting
"/programfiles\Tesseract-OCR\tesseract.exe" "{filepath}" "C:\OCR-Results\{filepath|noroot}.txt"

Nothing wrong with PowerShell, of course. You will probably need to use the regular Windows notation for the path to tesseract.exe, since /programfiles is an Opus folder alias.
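
For anyone who prefers a script, the same idea (walk the tree, run Tesseract on each image, rebuild the folder tree in the output folder) might look roughly like this in Python. It is a sketch under assumptions: the Tesseract path, the two folders and the helper names `plan_jobs` and `run_jobs` are placeholders, not something from this thread. It runs the images one at a time and keeps the subfolders, so two screenshots with the same name in different folders do not overwrite each other.

```python
import subprocess
from pathlib import Path

# Assumed install location; adjust to wherever tesseract.exe actually lives.
TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_jobs(src_root, dst_root):
    """Return (image, output base) pairs that mirror src_root under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    jobs = []
    for image in sorted(src_root.rglob("*")):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # Keep the full file name in the output base so a.png and a.jpg
            # in one folder get separate text files. Tesseract appends
            # ".txt" to the output base by itself.
            jobs.append((image, dst_root / image.relative_to(src_root)))
    return jobs


def run_jobs(jobs):
    """Run Tesseract on each job, one at a time, creating folders as needed."""
    for image, out_base in jobs:
        out_base.parent.mkdir(parents=True, exist_ok=True)
        # check=False: a single unreadable image should not stop the batch.
        subprocess.run([TESSERACT, str(image), str(out_base)], check=False)


if __name__ == "__main__":
    # Placeholder folders, matching the ones in the PowerShell script above.
    run_jobs(plan_jobs(r"\\NAS\MyPictures", r"C:\OCR-Results"))
```

Splitting the planning from the running makes it easy to print the job list first and check the output paths before starting a long OCR run.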

tunttunen · August 1, 2023, 8:09pm

I ended up with this. Does it look usable and correct to you?

@nodeselect 
@nofilenamequoting 
@runonce:@set dirname={filepath|root}\OCR-files
@runonce:@set relpath={filepath|..}
@runonce:@set relpath={$relpath|noterm}

CreateFolder NAME "{$dirname}{$relpath}" READAUTO=no 

"path\to\tesseract.exe" "{filepath}" "{$dirname}{$relpath}\{file|noext}.txt"

lxp · August 1, 2023, 8:34pm

Try

@nodeselect 
@nofilenamequoting 
@runonce:@set dirname={filepath|\}OCR-files
@runonce:@set relpath={filepath|..|noroot|noterm}

CreateFolder NAME="{$dirname}\{$relpath}" READAUTO=no 

"path\to\tesseract.exe" "{filepath}" "{$dirname}\{$relpath}\{file|noext}.txt"
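
To make the codes easier to follow, here is a rough Python illustration of the path this button hands to Tesseract. It reflects my reading of the modifiers (`|\` as the drive root, `|..|noroot|noterm` as the parent folder without root and trailing slash), not the Opus documentation, and the function name `ocr_output_base` is made up.

```python
from pathlib import PureWindowsPath


def ocr_output_base(filepath):
    """Mimic the button's output argument for one file (assumed semantics)."""
    path = PureWindowsPath(filepath)
    # {filepath|\}OCR-files: the drive root plus "OCR-files"
    dirname = PureWindowsPath(path.anchor) / "OCR-files"
    # {filepath|..|noroot|noterm}: parent folder, relative to the root
    relpath = path.parent.relative_to(path.anchor)
    # {file|noext}.txt: file name without extension, plus ".txt"
    return dirname / relpath / (path.stem + ".txt")


print(ocr_output_base(r"D:\Pics\2023\shot.png"))
# D:\OCR-files\Pics\2023\shot.txt
```

As far as I know, Tesseract appends .txt to the output base on its own, so with this argument the file on disk would end up as shot.txt.txt.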

tunttunen · August 2, 2023, 5:20pm

Using this now:

@nodeselect 
@nofilenamequoting 
@runonce:@set dirname={filepath|\}OCR-files
@runonce:@set relpath={filepath|..|noroot|noterm}

CreateFolder NAME="{$dirname}\{$relpath}" READAUTO=no 

"C:\Program Files\Tesseract-OCR\tesseract.exe" "{filepath}" "{$dirname}\{$relpath}\{file}"

I noticed that the OCR makes quite a lot of mistakes, which is a bit surprising. When there is clear text that says, for example, "Liike", it outputs "Like", and in some words it is missing multiple characters. I also noticed that it won't output umlauts. I guess it requires configuration somewhere. I have two kids to burn my time, so I don't have the time to dig deeper.

lxp · August 2, 2023, 6:18pm

Adding languages to the command line should help, at least when handling special characters. It's been a while since I last used Tesseract, so my knowledge is not current.
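
As a rough sketch of what "adding languages" means: Tesseract takes a `-l` option with one or more language codes joined by `+`. The Python snippet below only builds such a command line. The Tesseract path is the one from the button above, and Finnish (`fin`) is my assumption based on the "Liike" example; each language also needs its traineddata file installed.

```python
# Install location taken from the button earlier in the thread.
TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def tesseract_command(image, out_base, languages=("fin", "eng")):
    """Build a Tesseract command line that requests specific languages."""
    # -l takes language codes joined with "+". "fin" is an assumption
    # about the screenshots; replace it with the languages you need.
    return [TESSERACT, image, out_base, "-l", "+".join(languages)]


# Hypothetical file names, for illustration only.
print(tesseract_command("shot.png", "shot"))
```

In the Opus button this would amount to appending `-l fin+eng` to the end of the Tesseract line. Running `tesseract --list-langs` should show which languages are installed.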

xavierarmand · April 15, 2024, 10:14pm

Awesome! Thanks for this, guys!

Yeah, it makes a lot of mistakes... but the 3 other programs I've been using do as well. Each one comes up with different misspellings or random symbols. Haha.

It's nice having this automated so I don't have to spend time opening other software!