Tokenizer script - request for comments/ideas

I have been using this tokenizer-based script, which parses filenames and automatically creates Everything Search strings. Most of my music, videos, and documents follow similar naming schemes, and this script helps me find, say, any album from the same artist, movies from the same year, files with the same tags, etc. Now that v13 and ES are tightly integrated, it has become even more powerful. Because of my work I am very well versed in regexps, but I know they can be a bit scary for other people.

Basically, one tokenizer (red) splits the filename into tokens (outer parts), and a second subtokenizer (yellow) picks certain portions of each token as smaller, inner parts; these are shown as the Tx and Tx_Sy columns in the screenshot, respectively. (Technically only the outer one is a tokenizer; the inner one uses match groups instead, which is why the second one must always have /g.) The subtokens are then assembled into an "OR"-style regexp and passed on to ES (blue). If multiple files are selected, the subtokens across all files are collected. If all arguments are set to "" or 0, the whole filename stem is treated as a single token.

This has become a major part of my workflow: I select files related to one customer project (all files for the same project share a prefix), find e-books on a certain topic, and so on. In the screenshot, I select 2 files, press Alt-S, and it auto-starts an ES global regexp search. The reason there are also "final prefix" and "final suffix" arguments is that when I was using external ES before v13, I would add e.g. ".exe$" at the end to filter further by exe files, or "^M:" in front to filter further by drive, etc. Of course these are available as command arguments.

I see enormous potential for other people as well: instead of dictating a certain naming scheme for a specific purpose, anybody can define their own and use different regexps for different purposes. For example, tags in metadata are nice, but not all file formats support them, and they require a deeper layer of access and actually reading the files, whereas tags in filenames are blazing fast, since they're already read into memory, and the rest is ES magic running at the speed of light.
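To make the flow concrete, here is a stripped-down TypeScript sketch of the two stages and the final assembly. The regexes, names, and defaults below are illustrative assumptions only; the actual script takes them from settings or command arguments:

```typescript
// Outer tokenizer (red): splits the filename stem into tokens.
const outerTokenizer = /[ _.\-]+/;            // assumption: split on separators
// Inner subtokenizer (yellow): collects match groups, so it must have /g.
const innerSubtokenizer = /([A-Za-z0-9]+)/g;  // assumption: grab word runs

function collectSubtokens(stem: string): string[] {
  const subtokens: string[] = [];
  for (const token of stem.split(outerTokenizer)) {
    for (const m of token.matchAll(innerSubtokenizer)) {
      if (m[1]) subtokens.push(m[1]);
    }
  }
  return subtokens;
}

// Merge subtokens across all selected files into one "OR"-style regexp,
// optionally wrapped with the final prefix/suffix arguments.
function buildSearchRegex(stems: string[], prefix = "", suffix = ""): string {
  const all = new Set<string>();
  for (const stem of stems) for (const t of collectSubtokens(stem)) all.add(t);
  return prefix + "(" + [...all].join("|") + ")" + suffix;
}

// e.g. buildSearchRegex(["ProjectX_Invoice_2023", "ProjectX_Quote_2023"])
//   => "(ProjectX|Invoice|2023|Quote)"
```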

My personal version uses the "tokennum" argument (default 0, but it can be overridden via settings or command arguments) for picking the outer token number. But I decided to extend it and create visualization columns, since I already have this information and thought it might help others create rename scripts (using the script columns), get more familiar with regexps, or serve some other purpose. It's getting into gold-plating territory, though (basically going way overboard).

Is there anybody interested in the columns version for whatever reason, or anybody who can come up with genius application ideas? Maybe people with less regexp experience? Or should I remove the columns and release the leaner script?

You are not until you write a regex HTML parser.

:smiley: Well, funny you say that, because I actually created the TypeScript DOpus helpers on my GitHub from the decompiled DOpus .chm files using regexps :smiley: But true, HTML parsing can be a real PITA.

Looks promising!
Two suggestions I can give you:

  1. The greatest use I see for it is in splitting filenames based on certain characters. Since not many users are familiar with regular expressions at all, you could add an option in the script configuration/command arguments so the user only has to write the delimiter chars (e.g. " #.-"). In that case, the regex assembly would be transparent to the user.
  2. If you want to invest your time in this project, and since the user input will change constantly, you could consider making use of the new FAYT scripts: then your syntax could allow referencing the tokens as well as adding extra text to the search (see the sketch after this list).
    e.g. $1|$2 ext:exe: the command would internally transform $1 and $2 into the first 2 tokens of each filename (use negative numbers to refer to the last tokens). The resulting search string would be sent to Everything at the end using the FIND command.
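A rough sketch of that expansion in TypeScript (the helper name and the exact $-handling are assumptions, just to illustrate the proposal):

```typescript
// Expand token references like "$1|$2 ext:exe" against one filename's tokens.
// Hypothetical helper; tokens are whatever the outer tokenizer produced.
function expandTokenRefs(query: string, tokens: string[]): string {
  return query.replace(/\$(-?\d+)/g, (_match, numStr: string) => {
    const n = parseInt(numStr, 10);
    // $1 is the first token; negative numbers count from the last token.
    const idx = n > 0 ? n - 1 : tokens.length + n;
    return tokens[idx] ?? "";
  });
}

// e.g. with tokens ["ProjectX", "Invoice", "2023"]:
// expandTokenRefs("$1|$2 ext:exe", tokens) => "ProjectX|Invoice ext:exe"
// expandTokenRefs("$-1", tokens)           => "2023"
```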

Oooh nice. Thanks for the ideas!

#1 is easily doable with another argument, SPLITCHARS or some such. If it exists, it overrides the regex tokenizer version, roughly as sketched below. Subtokens are a bit trickier, though, since they select subgroups instead of splitting further; maybe I'd disable them when SPLITCHARS is given.
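Something like this sketch (the argument name and the escaping are assumptions, nothing final):

```typescript
// Turn a plain delimiter string like " #.-" into the outer split regex,
// escaping characters that are special inside a character class.
function splitCharsToRegex(splitChars: string): RegExp {
  const escaped = splitChars.replace(/[\\\]\[\^\-]/g, "\\$&");
  return new RegExp("[" + escaped + "]+");
}

// e.g. splitCharsToRegex(" #.-") => /[ #.\-]+/
// "Artist - Album #03.flac".split(splitCharsToRegex(" #.-"))
//   => ["Artist", "Album", "03", "flac"]
```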

#2: I am not familiar with the FAYT scripts; I'll have to look into them. But what you suggest looks very, very interesting.

The FIND command, though, does not support ES from what I saw, only Windows (or DOpus) search so far; that's why I'm triggering the search via CLI QUICKSEARCHENGINE everythingglobal.

It does. Check, as an example, this FAYT script I made using Everything. It's like the built-in one but with more options, and it doesn't interrupt you while you're still typing :smile:
The relevant args for the command are QUERYENGINE and QUERY,
e.g. FIND CLEAR QUERYENGINE=everythingglobal QUERY path:opus ext:exe
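So on the script side, the change could boil down to building that string (a sketch; searchRegex stands in for whatever regexp your script assembled, and regex: is Everything's prefix for regexp queries):

```typescript
// Plug the assembled OR-regexp into the FIND syntax above.
// The query is passed through to Everything, so its regex: prefix applies.
function buildFindCommand(searchRegex: string): string {
  return "FIND CLEAR QUERYENGINE=everythingglobal QUERY regex:" + searchRegex;
}

// e.g. buildFindCommand("(ProjectX|Invoice|2023)")
//   => "FIND CLEAR QUERYENGINE=everythingglobal QUERY regex:(ProjectX|Invoice|2023)"
```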

Nice, good to know. I was looking at the FIND help and it isn't listed there, but the DOpus devs are working overtime already, so I guess it'll be added in time.

I'll have a look at your script as well. FAYT scripts look interesting.