Regular expression modes & syntax

MrC · June 12, 2011, 5:01pm

Thread split: This thread used to be on the end of the Wildcards thread.

Leo and I had a nice discussion in another thread. I'd like to suggest that this is a case where, for average users, sed- or Perl-like substitutions are easier to grasp and linguistically more natural.

For example, in addition to the existing Regular Expression mode, there could be something like a new pull-down option:

Substitute mode RE's

in which the Old Name / New Name fields are replaced by a single Pattern field.

A user here would simply enter:

/([^)]*)//

and not have to think about the more complex clustering/capturing issues.
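
As a rough illustration of the "cut this piece out" idea, here is a minimal sketch in Python (not Opus syntax). It assumes the pattern above is meant to match one parenthesized group, with the parentheses escaped; `strip_first_parenthesized` and the sample filename are hypothetical.

```python
import re

# Assumed intent: a literal "(", any run of characters other than ")", a literal ")".
PAREN_GROUP = re.compile(r"\([^)]*\)")

def strip_first_parenthesized(name: str) -> str:
    """Cut out the first parenthesized group, like sed's s/PATTERN// without the g flag."""
    return PAREN_GROUP.sub("", name, count=1)

print(strip_first_parenthesized("Report(draft).txt"))  # Report.txt
```

Nothing before or after the match has to be described; only the piece being removed appears in the pattern.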

I suggest using only one text edit area so that the entire find/replace pattern can be copied into the paste buffer easily for modification or external testing. When two text areas are presented, this becomes more cumbersome.

Would something like this have a chance for implementation?

Leo · June 12, 2011, 5:49pm

I think you're confusing what you personally are used to with what people in general will find easier.

For anyone familiar with regexp it makes virtually no difference either way. For people trying to learn it, having two ways to do the same thing, and some examples using one while others use the other, plus the extra slashes and lack of clarity (unless you already know the syntax) that one string has both find and replace strings... Sorry, but it doesn't make sense to me. I don't think anyone else has ever requested it, either. People either seem happy with the current regexp method (which matches what most other Windows tools use if they support regexp at all) or they're unfamiliar with them and possibly already find them daunting enough without merging the two parts with slashes and having two ways to do the same thing.

David · June 13, 2011, 12:11am

Leo, perhaps part of this discussion can be split off here into a new topic?
What MrC has to offer on regular expressions is at least worth discussion.

Regular Expressions in Directory Opus are targeted to an expression that completes an entire Windows filename.
Yes, it is sometimes a bit more cumbersome to write an expression that completes the entire filename, often at the cost of extra Capture Groups.
The maximum number of capture groups is nine last I tried.
Windows scripting can accomplish the more common simple pattern replace though.
In light of this, it does seem valid to ask the kind of question MrC has asked even though the target is not long strings of text.
In PHP, there are simple functions such as preg_replace for this.
php.net/manual/en/function.preg-replace.php
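
To make the comparison concrete, here is a small Python sketch (not Opus or PHP code) of the two styles. It assumes whole-name matching behaves like `re.fullmatch` and that a preg_replace-style substitution behaves like `re.sub`; the function names and the filename are made up for illustration.

```python
import re

def rename_whole_name(name: str, old: str, new: str) -> str:
    """Whole-name style: the pattern must cover the entire name, so the parts
    to keep are captured and reassembled. Unmatched names are left unchanged."""
    m = re.fullmatch(old, name)
    return m.expand(new) if m else name

def rename_partial(name: str, old: str, new: str) -> str:
    """Substitution style: only the first matched piece is replaced."""
    return re.sub(old, new, name, count=1)

name = "Report(draft).txt"
# Two extra capture groups are needed to carry the rest of the name across.
print(rename_whole_name(name, r"(.*)\([^)]*\)(.*)", r"\1\2"))  # Report.txt
# The partial form only describes the piece being removed.
print(rename_partial(name, r"\([^)]*\)", ""))                  # Report.txt
```

Both calls give the same result here; the difference is only in how much of the name the pattern has to describe.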

I'm still with GPSoft on this though; they taught me several lessons in using Henry Spencer's old library.
Hey Leo, the old thread on "Dots to spaces except in numbers (for files and folders)" seems to have changed. That IS MY CODE in the toolbar button, helped a little by you after an almost-solve by me, isn't it?
I tend to think the thread is shorter now.

Dave

MrC · June 13, 2011, 1:50am

[quote="leo"]I think you're confusing what you personally are used to with what people in general will find easier.

For anyone familiar with regexp it makes virtually no difference either way. For people trying to learn it, having two ways to do the same thing, and some examples using one while others use the other, plus the extra slashes and lack of clarity (unless you already know the syntax) that one string has both find and replace strings... Sorry, but it doesn't make sense to me. I don't think anyone else has ever requested it, either. People either seem happy with the current regexp method (which matches what most other Windows tools use if they support regexp at all) or they're unfamiliar with them and possibly already find them daunting enough without merging the two parts with slashes and having two ways to do the same thing.[/quote]

I'd like to have an open discussion about this if you are willing. I get the sense the door is closed, and the arguments against are ad-hoc.

Regardless, I'd like to ask that you refrain from the argument strategy that others are confused or incorrect because they have different view points. Having studied computer languages and tools, having taught them, developed them, and having been in the biz for 30 years, there might be a valid discussion here.

If there is no room for a dialog, I'll let it rest. I do appreciate your time, and continued, awesome engagement.

Leo · June 13, 2011, 7:37am

I'm open to discussion but so far I don't find any of the arguments compelling. If I disagree with something it's not because I'm close-minded. Far from it. It's because you haven't convinced me.

You can't just assert that one method is easier than the other; you need to back it up with evidence. The evidence you've given so far has backfired, at least in my eyes.

In the other thread you jumped on my regexp not doing something as an example of how the current regexp system was over-complicated. It was only an example of me, in my haste, not considering a case that was implied but not explicitly mentioned in the question. You gave your version that didn't work either, and then fixed it with something that was far more complex than the simple amendment required to make my original regexp handle the extra case. My example worked using the basic regexp building blocks while yours required people to use and understand an advanced, additional concept and the syntax for it. If we're really talking about simplifying things for the average user then any argument involving the phrase negative lookbehind assertions is not helping your case. That gave me the impression you were asking for what you're used to, and trying to find evidence to back it up whether or not the evidence was really there, rather than looking at the situation objectively.

There's nothing wrong with asking for something because you are used to it, of course. We don't set out to create a program that people find alien. It's just less likely to gain traction unless lots of other people also ask for the same thing; enough people to justify either changing things or adding complexity (to development, testing and end-users) by having two modes for the same thing. And remember that the way it works now is what a lot of other people are used to (not just from Opus but from other programs), if they are used to regexps at all.

I'm also not sure why we have conflated the syntax issue (wanting to be able to use /find/replace/) with the issue of whether or not the expression has to match the whole name. While, so far, I am not personally convinced there is value in changing either of them, they are independent and one could change without the other.

SED-style syntax:

I disagree with the assertion that one method is much easier than the other. They are both about the same as each other, and in some ways the SED-style method you're advocating is more complicated, not less. Even in the situations where it is less complicated, it isn't a lot less complicated. I've yet to see an example that justifies adding complexity to the UI by having two methods for the same thing and adding confusion to regexp learners. Especially the people who feel intimidated by what looks, at first, like a load of random symbols; they aren't helped by joining the find & replace parts by yet more symbols (which also make it hard to see where the separation is, if you even know the expression is meant to have two parts to begin with). If you want to deny that those people exist then you haven't read these forums for very long. We've taught a bunch of people regexps over the years, and also had a bunch who were too intimidated to even try to learn them.

If we can make them less intimidating, great. Or if we can make them significantly more powerful, also great. But all SED-style syntax seems to do is let you push the slash key (three times) instead of the tab key (once), while making the result harder to parse and more error-prone (e.g. if you get the slashes wrong or don't escape things properly parts of your Find and Replace expressions end up in the other expression).
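
The escaping concern can be illustrated with a toy parser. This is a hedged sketch in Python of how a single `/find/replace/` string might be split into its two parts; `split_sed_expression` is a hypothetical helper, not something Opus or sed provides, and it deliberately ignores trailing flags such as `g`.

```python
from typing import Tuple

def split_sed_expression(expr: str) -> Tuple[str, str]:
    """Split 's/find/replace/' (or '/find/replace/') into (find, replace).

    A backslash escapes the next character, so an escaped slash stays
    inside the part it belongs to instead of ending it.
    """
    if expr.startswith("s"):
        expr = expr[1:]
    if not expr.startswith("/"):
        raise ValueError("expected '/find/replace/' or 's/find/replace/'")
    parts, current = [], []
    i = 1  # skip the opening slash
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            current.append(expr[i:i + 2])  # keep the escape pair intact
            i += 2
            continue
        if ch == "/":
            parts.append("".join(current))  # an unescaped slash ends a part
            current = []
        else:
            current.append(ch)
        i += 1
    if current or len(parts) != 2:
        raise ValueError("expected exactly two parts; is a '/' unescaped or missing?")
    return parts[0], parts[1]

find, replace = split_sed_expression(r"s/\([^)]*\)//")
print(repr(find), repr(replace))
```

An expression such as `s/a/b/c/` raises `ValueError` in this sketch, which is the kind of slip described above: one stray slash moves text from one half of the expression to the other.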

Matching the whole name:

My argument against changing this boils down to similar themes. It doesn't make things much easier. People who understand regexps can easily add (.*) at the start and end when needed. People who are learning regexps aren't going to be helped by adding complexity, in the form of an additional mode/checkbox that has to be set the right way, to the UI and examples. It also makes that one mode work contrary to how filename wildcards usually work, and makes it easy for people to create an expression that seems to work on some files, and then ends up dropping half the filename on others.

IMO, it is a benefit not a deficiency that regexp renaming skips filenames that the 'find' expression does not match entirely. Explicitly typing (.*) every so often is a small price to pay to avoid unexpected mishaps that mangle filenames in a batch rename.
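
A minimal Python sketch of that skipping behaviour, assuming whole-name matching works like `re.fullmatch`. The helper `preview_rename` and the filenames are hypothetical, and this is not how Opus itself is implemented.

```python
import re
from typing import Dict, Iterable

def preview_rename(names: Iterable[str], old: str, new: str) -> Dict[str, str]:
    """Map each name the pattern matches in full to its new name.
    Names the pattern does not cover entirely are skipped, not altered."""
    result = {}
    for name in names:
        m = re.fullmatch(old, name)
        if m:
            result[name] = m.expand(new)
    return result

batch = ["notes - Copy.txt", "notes.txt"]
print(preview_rename(batch, r"(.*) - Copy(\..*)", r"\1\2"))
# {'notes - Copy.txt': 'notes.txt'}
```

Only the file that fits the whole pattern appears in the preview; the other one is left alone.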

MrC · June 13, 2011, 4:12pm

First, and foremost, thanks for taking the time.

I would think an optimal opening response would be "Hey, can you explain fully what you have in mind?" rather than assuming I had made my entire argument and then dissecting it based upon that and the invalid assumptions made thereafter. For the record, I've been intentionally terse and brief in order to test the waters and because I know you are versed in the nomenclature and concepts.

And finally on the meta-discussion, let's avoid the "You can't just assert that one method is easier than the other; you need to back it up with evidence", as the remainder of your argument has plenty of assertions and beliefs sans evidence.

So let's step back.

Some classes of problems want a tool that operates as close to natural language as possible. In these cases, some of which we've commented on, users are asking the basic question "How do I cut out this piece?".

So the tool really wants to be a cutting or excision tool. If the problem's subject was a banana, I'd want a tool to cut out the bruised spot. Being that the tool doesn't know exactly what the shape of a bruise looks like, the tool user describes it as precisely as possible so that the cutting tool can do its job correctly.

But the tool currently offered is not a cutting tool. In fact, it is a segmenting and reassembling tool, where not only does one have to describe the precise shape of the middle segment, the user also must describe the shape of the segment fore and the segment aft, and then further needs to know how to reference these segments to reassemble them, and then reassemble them. Cut banana into three segments, precisely describing the mid (target) segment first, but do it in such a way that you don't accidentally consume the fore- or aft-segments, and then describe the fore-segment, and then describe the aft-segment, now take fore-segment and aft-segment and join them together to create your new, unbruised banana. (I hope you like bananas).

It IS easier to describe and initiate users into pattern matching / replacement concepts by starting with basic examples, and building from there. So the problem "How do I remove XYZ from a file name?" (assume here that XYZ requires an RE) really and naturally wants to be, and is in its simplest form 1) describe a pattern to match XYZ, 2) excise it. That's why users ask "How do I remove ..." rather than "How do I break this apart into 3 pieces and recombine them into 2 such that it doesn't include the middle part."

In one of your previous solutions to avoid initial whitespace that might precede some uppercase letters, the solution requires adding an RE component entirely unnatural and counter-intuitive to the problem at hand -- (.*[^ ]) -- which says gobble everything, then back up to a non-space. This requires a-priori knowledge of how the RE engine works, and how to make it behave in just the correct way. And this by definition is far more complex than describing how to cut out XYZ. This is why these concepts are always in later chapters in RE books, and the simplest concepts (e.g. replace x with y) are taught first.

My use of negative lookbehind assertions as an example did not backfire as an argument. In fact, these were developed and incorporated into the RE language precisely because there are certain classes of problems which cannot be solved otherwise, and because linguistically they are more intuitive (natural way: match X, so long as it is not preceded by exactly Y, where one can focus first on X and then tack on the restriction Y).

The use of .* at the beginning is often incorrect, and produces unexpected results due to greediness and the exact implementation of how the particular RE engine works. This is one of the most baffling concepts to RE newbs. The .* works in Opus because the string is anchored. Users have trouble grasping, for example, that q[^u] does not mean "a q not followed by a u", and that instead it means "a q followed by any non-u". And similarly, the requirement from the problem we discussed elsewhere to use (.*[^ ]) says "anything, including nothing, followed by a non-space" but the subsequent part of the RE was [A-Z], so it "feels" to folks like they've already matched the uppercase letter with the [^ ], so how can it then match the following [A-Z]? This is complex stuff. The OP added a new restriction, which was no spaces at the beginning - the negative look-behind solves that problem directly and naturally.

You've switched the argument from the natural way of thinking about the problem to one of difficult sounding words such as negative look-behind and admittedly obtuse character sequences (prove this to yourself - explain in detail exactly why and how (.*[^ ])([A-Z].*) works and why and how the (.*[^ ]) does its job; the negative assertion I added was simple - don't do it at the beginning of a file name, which is precisely what was asked: language translated into syntax, directly).
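
The distinctions in this post can be checked with a few lines of Python. This is a sketch using Python's `re` module rather than the Opus engine, and the patterns are illustrative; they are not necessarily the exact expressions from the other thread. All function names and sample strings are hypothetical.

```python
import re
from typing import Optional, Tuple

def q_then_non_u(word: str) -> bool:
    """q[^u]: a q FOLLOWED BY some character that is not u."""
    return re.search(r"q[^u]", word) is not None

def q_not_followed_by_u(word: str) -> bool:
    """q(?!u): a q that is not followed by u (end of string is fine)."""
    return re.search(r"q(?!u)", word) is not None

def split_before_capital(name: str) -> Optional[Tuple[str, str]]:
    """Whole-name match: the greedy .* backs off until [^ ] sits on a
    non-space character that is immediately followed by a capital."""
    m = re.fullmatch(r"(.*[^ ])([A-Z].*)", name)
    return (m.group(1), m.group(2)) if m else None

def capital_not_after_space(name: str) -> bool:
    """Negative lookbehind: a capital not preceded by a space. Note that
    this also accepts a capital at the very start of the name."""
    return re.search(r"(?<! )[A-Z]", name) is not None

print(q_then_non_u("Iraq"), q_not_followed_by_u("Iraq"))  # False True
print(split_before_capital("fooBar"))                      # ('foo', 'Bar')
print(split_before_capital("foo Bar"))                     # None
print(capital_not_after_space("foo Bar"))                  # False
```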

It was not my primary intent to suggest using sed-style expressions complete with their s/// form, or that a single text edit box be a necessity. This was a secondary feature to allow rapid testing by way of copy/paste. It is exceptionally cumbersome to actually test an RE using the GUI, esp. when adding new test cases (as the dialog must be dismissed, new file selection made, dialog re-opened, re-selection of Regular Expression, re-entry of Old name, re-entry of New name, re-select various checkboxes). This is like typing with a chopstick. So, my suggestions here were to support efficiency of operations for these processes for those users who wanted it (and not eliminating or replacing existing functionality). I found I could much more rapidly test out RE matches for complex patterns by using Cygwin tools and a shell, and then copying/pasting the final winning RE back into Opus. And the rapidity is not because I'm used to it... rather, it is because the steps involved are reduced.

So, to simplify, the suggestions / requests were, in order of priority:

1. Support a partial name RE Substitution mode (Old name, New Name is fine), which is essentially Find And Replace w/RE (and you already have Find And Replace!)

and distant seconds, ...:

2. Support a (additional) single text edit area to allow copy/paste of REs (perhaps this can be in Script mode)
3. Support 2 above, including the nearly 40-year old de facto standard s/// (sed, awk, perl, and many more)
4. Support a refresh or new file inclusion (via drag-n-drop?) in the Preview area of the Rename tool, so that new cases can be added quickly without dismissing the dialog (because re-entering/re-building the dialog is cumbersome). Rapid testing makes for faster learning.

Again, I very much appreciate your time, willingness to engage, and constructiveness.

Leo · June 19, 2011, 6:38am

[quote="MrC"]Some classes of problems want a tool that operates as close to natural language as possible. In these cases, some of which we've commented on, users are asking the basic question "How do I cut out this piece?".

So the tool really wants to be a cutting or excision tool.[/quote]

It's a renaming tool, not a cutting tool. Only some renames involve cutting parts out. Those are handled perfectly well by the current system. Adding another system that handles them slightly better, but is worse at (or even the same as) the other problems, isn't worth the complexity, IMO. When people want to batch-rename things they already have to decide between wildcards, find & replace, scripting and regular expressions. Adding a second regexp mode is overkill.

But not every rename is about removing bits of filenames. Even with those that are, you sometimes need to be explicit about which XYZ you remove (e.g. there may be several that match, but you only want to remove the last one). The concepts required to remove things using the current system are essential concepts to using regexp in the first place, so there's no loss in learning them. As regexps go, they are simple concepts, too, so they are not hard to learn.
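
As a sketch of the "only remove the last one" case, in Python and assuming whole-name matching behaves like `re.fullmatch`: the greedy leading group is what selects the last occurrence. `remove_last_parenthesized` and the filename are hypothetical.

```python
import re

def remove_last_parenthesized(name: str) -> str:
    """The greedy leading (.*) grabs as much as it can, so the middle of the
    pattern lands on the LAST parenthesized group in the name."""
    m = re.fullmatch(r"(.*)\([^)]*\)(.*)", name)
    return m.group(1) + m.group(2) if m else name

print(remove_last_parenthesized("Song(live)(2009).mp3"))  # Song(live).mp3
```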

It's common in regexps to say "give me any string that ends (or doesn't end) in a particular character." That is basic stuff.

Maybe you're used to doing things differently, fair enough, but I disagree that it's unnatural or unintuitive (as much as anything to do with regexps is natural or intuitive in the first place).

It doesn't require such knowledge. I do not have that knowledge myself, yet I was able to write the regexp.

All it requires is knowing how regexps work in the first place. You don't need to know how the engine is implemented. I only think in terms of what the expressions say, not how the engine is going to apply them.

.*[^ ] means find me anything that doesn't end in a space; how the regexp engine arrives at a match, I have only the vaguest of ideas.

Yet it wasn't needed in that case (and was only added after I pointed out that your "more simple" version didn't actually work). It made things more complicated by introducing an unnecessary concept when the basic building blocks easily did the job.

I'm sure those conditions, and the regexp style you're advocating in general, are extremely useful when doing a search & replace in a huge string/document, but filenames are not huge strings/documents.

As you say, it works in Opus. We're only talking about Opus here.

So people need to learn when to use ? + and * which are basic, essential parts of regexp and useful in lots of situations. I would teach people those, and when to use them, very early in any regexp lesson. (Certainly long before any kind of assertions.)

Maybe regexps themselves are complex -- I'd say they are, compared to most things people do on their computers outside of actual scripting/programming -- but (.*[^ ]) is not a complex regexp. It just isn't.

If people see something like [^ ][A-Z] and wonder why it would match "xY" then they just haven't understood regexps yet. Fair enough, not everyone gets them straight away (or at all), but if that's the case then they're probably also not going to understand the distinctions between the two slightly different regexp modes we're talking about, and definitely not stuff like lookbehind assertions.

I guess this is the heart of why I disagree with what you're saying: I feel that once someone understands the basics of regexp, they can use the regexp stuff in Opus to tackle all the problems we've talked about. You're saying they can't, that it's too complicated, while advocating a system that requires learning even more concepts in order to solve the same problems in only slightly different ways. (And where solving other problems that don't involve cutting things out are often made more difficult, not less.)

You've re-written the history of the thread a little there but, anyway, I think we'll have to agree that what each of us considers natural is different.

You need to use test cases with either style of regexp, to make sure they don't do anything unexpected, so I don't see the relevance that has to which mode(s) of regexp Opus supports.

(e.g. It took you two attempts to get the CamelCase regexp to work properly, and you'd started after the extra condition was made explicit. Took me two attempts as well, except the first attempt was made before the extra condition was made explicit and the second attempt was in response to that condition. So the need for testing doesn't go away with either regexp style.)

FWIW, you can use presets (or apply the rename, undo it, then use the Last Rename preset to recall it) to make things easier there.

Being able to add test filenames within the rename dialog would be useful, I agree, but it's a side issue to this discussion; non-trivial regexps in either style will require testing against different filenames.

myarmor · June 19, 2011, 7:47pm

Btw, while on the topic of regexes: thanks for the TR1 implementation of regexes in DO. Finally a real engine (albeit not really equal to PCRE).
Because DO10 supports both TR1 ECMAScript and DO9 regexes, it really should have two instances of the regex syntax help (the old one, and the new, quite a bit more advanced one).

At least with DO10 you can use RegexBuddy or just about any online regex tester (which supports ECMAScript/JavaScript style) to test the regexes you make before actually using them.
If you're still using # to repeat it, I guess it should more or less equal the g (global/all) flag in those testers.

Regarding .*:

Not sure if DO9 could do it, but with DO10 you can make an expression non-greedy by using "?", e.g. .*?, .+?, etc.
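
A short Python sketch of greedy versus non-greedy matching, plus replace-once versus replace-all. Python's `count` argument is used here only as a loose analogy to the g flag mentioned above; whether `#` in Opus maps onto it exactly is an assumption. The function names and sample strings are made up.

```python
import re

def text_before_dash(name: str, lazy: bool) -> str:
    """Greedy (.*) runs to the LAST dash; lazy (.*?) stops at the FIRST."""
    pattern = r"(.*?)-" if lazy else r"(.*)-"
    m = re.match(pattern, name)
    return m.group(1) if m else ""

def dots_to_spaces(name: str, replace_all: bool) -> str:
    """count=0 replaces every match (like a global flag); count=1 only the first."""
    return re.sub(r"\.", " ", name, count=0 if replace_all else 1)

print(text_before_dash("a-b-c", lazy=False))        # a-b
print(text_before_dash("a-b-c", lazy=True))         # a
print(dots_to_spaces("a.b.c", replace_all=False))   # a b.c
print(dots_to_spaces("a.b.c", replace_all=True))    # a b c
```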

MrC · June 19, 2011, 8:41pm

I appreciate the reply, and yet I do feel it is a bit pedantic and argumentative, so I'll leave this with an open thought.

If the dopus rename mechanisms are easy enough for folks, why has this essentially trivial problem sat for several days unanswered?

[url]Remove all brackets and what it contains except]

Regards.

Leo · June 19, 2011, 10:49pm

It's been unanswered because nobody had the time or inclination to read & understand the question and then answer it. That doesn't mean it's hard to solve, nor that it would be any easier to solve if your pet regexp syntax was implemented (which seems a strange assumption for you to have made, especially given the vast majority of regexp questions that do get answered; you've focused on an anomalous thread and jumped to an erroneous conclusion).

After already answering several lengthy regexp questions (the questions were lengthy, not the regexps that solved them) from the same person, and feeling like I had done enough while I've got a huge to-do list to get through, I decided to get on with some other work in the hope that someone else would answer their latest question (or, since they were asking for lots of regexps, the poster might have a go themselves and discover it's not that hard; sometimes the most helpful thing to do is not give out every answer on a plate).

In fact, regexp questions usually do get an answer here from someone else if I leave them, far more so than any other type of question, I'd say. (As a result, I'm more likely to leave a regexp question unanswered than most other types of questions, although I do still answer many of them, especially if one pops up while I'm waiting for code to compile...) Regardless of that, the forum is something people (including myself) answer in their spare time. This isn't the official support channel and definitely isn't a guaranteed "write any regexp you ask for" service (although most such questions do get answered, and usually quickly).

michaelkenward · June 20, 2011, 9:38am

This raises something that has puzzled me for a while. Is anyone else reluctant to throw regex questions into this place without even trying to work them out for themselves?

I don't do it because I have never considered the forum to be a place to go for some unpaid programming.

When I want to do something, I look at the forum for earlier examples, I also consult the samples that Leo and others have posted in the past. Over the years I have solved a lot of the problems I wanted sorted, and have learned something in the process.

If you just ask a "how do I do this?" question you may get an answer, but what do you learn in the process? After all, the people who can quickly sort out things in their heads don't have time to write detailed explanations of what their solutions achieve. Without that, you can't learn much.

I can usually work out what I want to do, but sometimes fail completely. That's when I might come and ask for help. I notice that other people seem to have the same tactic, turning up with half complete solutions that fall over for some reason.

I have also wondered why there isn't a charity box somewhere. A place where someone who has had help to solve a problem can make a donation, to the helper or a charity of their choice.

Apologies for stepping into this discussion if it is inappropriate, but, as I said, it made me think.

Leo · June 20, 2011, 2:36pm

I like the charity box idea, although I'm not sure how to set one up (PayPal I guess? can't escape them!) and I wonder how many people would use it for real.

I feel the same (when asking questions elsewhere) about trying stuff myself and then presenting that as a starting point. At the same time, I don't mind other people asking for help to do something even if they don't know where to even start themselves. (As long as everyone understands there are no guaranteed answers here, especially when the other forum members (me included) are busy.)

David · June 21, 2011, 2:40am

I'm very glad this thread has become a good discussion.
I'm still looking for time to dissect MrC's Perl script reply to the other thread linked to in this.
I was going to write something myself, but other matters became more important.

The felt and plastic drum front bearing surface on my trusty clothes dryer failed.
I was able to fix it for $24.60, getting the parts on Ebay.
Works like new now after painting the scraped drum bearing surface with some epoxy primer and baking it in the sun half a day.

Well, the idea of setting up a charity box is a nice gesture, but seriously, to what end?
Leo, just how much money did you actually collect when the PayPal account was set up to help you buy a new notebook computer after yours was stolen in Portugal?

My favorite charity would be to establish a Richard Nixon School of Integrity or perhaps a George W. Bush foundation for a Kinder and more Gentle America.
Honestly, I think the best that may be possible here is a charity box for Directory Opus Resource Centre computing hardships.

Regards,
Dave