Regex (everything after second " - ")

RogerP · November 18, 2012, 2:53pm

Hi

I'm having some trouble getting this to work. I have files which look like this:

20121009 1928 - SF zwei HD - Factory Made - So wird's gebaut.ts
20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts

What I want now is to strip away the first part, which is the date, the time and the TV channel name. Put differently, I want everything after the second occurrence of " - ".

Factory Made - So wird's gebaut.ts
Alpha Centauri - Distanzen 1.ts

I came up with the following expression (added some more groups to be able to quickly check the result):
^(\d{8}[ ]\d{4})(\s-\s)(.*)(\s-\s)(.*)$

But this doesn't work for the second case and I don't understand why...

Any help is greatly appreciated.

Cheers
Roger

David · November 18, 2012, 6:10pm

It may not be the best form, but this is working for me.

Rename Regexp Pattern="[^-]*-[^-]*-(.*\..*)" to "\1"

This results in :
Factory Made - So wird's gebaut.ts
alpha - Alpha Centauri - Distanzen 1.ts

Note that there is a space before Factory.

This will eliminate one space character after the second - character if it is there.

Rename Regexp Pattern="[^-]*-[^-]*-[ ]?(.*\..*)" to "\1"
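
For anyone who wants to check these two patterns outside Opus, here is a minimal sketch in Python. It assumes Python's re module behaves like the Opus engine for these particular patterns, and it uses re.search because the pattern is not anchored. The variable names are made up for illustration.

```python
import re

names = [
    "20121009 1928 - SF zwei HD - Factory Made - So wird's gebaut.ts",
    "20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts",
]

# First pattern: skip to the second plain dash, capture the rest.
plain = re.compile(r"[^-]*-[^-]*-(.*\..*)")
# Second pattern: additionally swallow one optional space after that dash.
trimmed = re.compile(r"[^-]*-[^-]*-[ ]?(.*\..*)")

plain_results = [plain.search(n).group(1) for n in names]
trimmed_results = [trimmed.search(n).group(1) for n in names]

print(plain_results)
# [" Factory Made - So wird's gebaut.ts", 'alpha - Alpha Centauri - Distanzen 1.ts']
print(trimmed_results)
# ["Factory Made - So wird's gebaut.ts", 'alpha - Alpha Centauri - Distanzen 1.ts']
```

The second file still comes out wrong in both cases, because the dash in "BR-alpha" counts as the second dash.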

David

playful · November 18, 2012, 6:30pm

Hi Roger,

The smallest modification to your regex to make it work is to add a ? in the first dot-star:
^(\d{8}[ ]\d{4})(\s-\s)(.*?)(\s-\s)(.*)$
Then replace with \5

This is because dot-star (.*) is greedy. In your original expression, it matches everything to the end of the file name. Then, to produce a match, the engine backtracks one character at a time. So the last capture group ends up only containing what is after the last space-dash-space.

When you add the question mark, the dot-star becomes ungreedy, and only matches until the next space-dash-space, which is what you intended.
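
The difference between the greedy and the ungreedy version can be tried out with a small Python sketch. This is only an approximation of what Opus does: it assumes Python's re module treats these two patterns the same way as the Opus engine, and the helper name last_group is invented for the example.

```python
import re

names = [
    "20121009 1928 - SF zwei HD - Factory Made - So wird's gebaut.ts",
    "20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts",
]

# Original pattern: the first dot-star is greedy.
greedy = re.compile(r"^(\d{8}[ ]\d{4})(\s-\s)(.*)(\s-\s)(.*)$")
# Modified pattern: the first dot-star is made ungreedy with "?".
lazy = re.compile(r"^(\d{8}[ ]\d{4})(\s-\s)(.*?)(\s-\s)(.*)$")

def last_group(pattern, name):
    """Return capture group 5 (what \\5 would insert), or None if no match."""
    m = pattern.match(name)
    return m.group(5) if m else None

greedy_results = [last_group(greedy, n) for n in names]
lazy_results = [last_group(lazy, n) for n in names]

print(greedy_results)
# ["So wird's gebaut.ts", 'Distanzen 1.ts']
print(lazy_results)
# ["Factory Made - So wird's gebaut.ts", 'Alpha Centauri - Distanzen 1.ts']
```

With the greedy version, group 5 only holds what follows the last " - "; with the ungreedy version it holds everything after the second " - ".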

playful · November 18, 2012, 6:36pm

p.s.:
The last post was the "why".
For my taste, I would probably go for something like:
pattern: (?:.*?\s-\s){2}(.*)
replace: \1

Please let me know if you have any questions.

David · November 18, 2012, 6:47pm

Very good Playful !

Thanks for a great example of the Regexp power we now have with Dopus 10.

playful · November 18, 2012, 7:08pm

Hi David, yes, that's killer, just love having that power under the hood.
Quietly crossing my fingers for a regex mode in the quick filter.

RogerP · November 18, 2012, 8:53pm

Hi

Thanks to both of you for your help!

Playful, your "compact" solution works great. I was checking the documentation to find out about the first part of your pattern "(?:.*?" but didn't find an explanation for the colon. What does that mean? However, I do understand that you check for that specific " - " pattern 2 times, that's a great idea!

Thanks as well for the explanation of my pattern problem, which also explains how the regex engine works.

Roger

Leo · November 18, 2012, 8:57pm

From the Opus regexp documentation:

There are many different variants of regular expression; by default Opus uses what's called TR1 ECMAScript. Microsoft has a page on TR1 that goes into far more detail than this help file can.

playful · November 18, 2012, 11:01pm

Hi Roger,
The pages Leo linked to should have the information, but I'll explain it to complete the thread (and because it's fun).

So we are looking at the first set of parentheses: (?:.*?\s-\s)
(?: means that this set of parentheses is non-capturing. That means that whatever it matches will not go into Group 1, which you would later refer to as \1. This is why in the replace, we can use \1, as it refers to the only capturing parentheses in the pattern: the final dot-star.
After that, the dot-star-question mark ungreedily matches everything until we meet (and match) a space-dash-space.
After the closing parenthesis, you are quite right that we repeat this pattern thanks to the {2}
Once we're past the second occurrence of space-dash-space, we're free to capture everything (dot-star) into Group 1.
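
To see what the non-capturing parentheses change, here is a small Python sketch. It is an illustration under the assumption that Python's re module numbers groups the same way as the Opus engine for this pattern; the variable names are invented.

```python
import re

name = "20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts"

# (?: ... ) groups without capturing, so the final (.*) is Group 1.
compact = re.compile(r"(?:.*?\s-\s){2}(.*)")
m = compact.match(name)
group_count = compact.groups      # number of capturing groups in the pattern
new_name = m.group(1)

# Same pattern with ordinary parentheses: the repeated part becomes Group 1
# (holding its last repetition) and the wanted text moves to Group 2.
capturing = re.compile(r"(.*?\s-\s){2}(.*)")
mc = capturing.match(name)

print(group_count)    # 1
print(new_name)       # Alpha Centauri - Distanzen 1.ts
print(mc.group(1))    # "BR-alpha - " (with a trailing space)
print(mc.group(2))    # Alpha Centauri - Distanzen 1.ts
```

So without the ?: the replace string would have to be \2 instead of \1.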

Of course this is only one way---there are a number of other ways of writing patterns to match these strings.

Wishing you all a fun week!

David · November 18, 2012, 11:37pm

Ooops,

I thought we were after the second -.
What was wanted was the second " - ".

I have my answer working again, but I'm kind of groping over just why it really works.
I wouldn't have expected this result.

Rename Regexp Pattern="[^-]*-[^-]*-[ ]+(.*\..*)" to "\1"

playful · November 19, 2012, 4:43am

Hi David,
Indeed, at first glance it can seem surprising that the pattern would work on the second file:
20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts
We're matching no-dashes then a dash, no-dashes then a dash, then space. If the renamer was trying to match the entire file name, the regex would fail, because the second dash (the one in BR-alpha) is not followed by a space.

But that's not what the renamer does. It looks for files where the pattern matches somewhere, but the pattern doesn't have to match the entire file name. If you want the pattern to match the entire name, you use anchors (^ and $). For instance, for a pattern you could just have "alpha", and for rename you could have "beta", and the entire long file name above would get renamed to "beta".

With your pattern, the renamer is able to match BR-alpha - Alpha Centauri - Distanzen 1.ts
It may not be the string you expected to match, but it conforms to no-dashes then a dash, no-dashes then a dash, then space, then anything up to a dot, then anything.
And the parentheses correctly capture what you want.

So you got it right!
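
The "matches somewhere" behaviour described above can be imitated in Python, as a rough sketch. The assumptions: re.search stands in for the renamer's unanchored matching, re.fullmatch stands in for adding ^ and $, and Python's re module handles this pattern like the Opus engine. The variable names are made up.

```python
import re

pattern = re.compile(r"[^-]*-[^-]*-[ ]+(.*\..*)")
name = "20120805 0219 - BR-alpha - Alpha Centauri - Distanzen 1.ts"

# Forcing the pattern to cover the whole name fails: the second dash
# (the one in BR-alpha) is not followed by a space.
whole = pattern.fullmatch(name)

# Searching for a match anywhere succeeds, starting after the first dash.
found = pattern.search(name)
matched_text = found.group(0)
captured = found.group(1)

print(whole)                # None
print(repr(matched_text))   # ' BR-alpha - Alpha Centauri - Distanzen 1.ts'
print(captured)             # Alpha Centauri - Distanzen 1.ts
```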

Leo · November 19, 2012, 8:39am

I think it'd go wrong if there were dashes in the words on the right.

michaelkenward · November 19, 2012, 9:30am

[quote="playful"]Hi Roger,
The pages Leo linked to should have the information, but I'll explain it to complete the thread (and because it's fun).[/quote]
Many thanks for this.

These "worked examples" are an excellent way of working out the complexities of these regex things. While it is wonderful when people throw in answers to particular challenges, you can't always take away any wider lessons.

This one message has taught me more than dozens of other "problem solved" replies.

David · November 19, 2012, 9:58am

Thanks very much Playful !

Before DOpus 10, at least at some point in time, the pattern had to match the entire filename.
So yes, naturally I had thought this pattern had to match the entire filename.
This being the case I have almost never used or needed anchors.

Regexps in DOpus 10 behave more like they do in PHP or VBScript.

Almost 4 am here, got to go !

playful · November 19, 2012, 10:10am

I nearly always bow to your wisdom, Leo, but David's pattern
[^-]*-[^-]*-[ ]+(...)
though it is not on my top-ten list for regex style, does work even for the following mouthful of dashes:
20120805 0219 - BR-alp-ha - Alp-ha Cen-tauri - Dist-anzen 1.ts
(We jump to the first dash, which happens to be the first space-dash-space component, then we jump to the next dash-space (just before Alp-ha), then we capture everything after that, so any dashes on the right shouldn't matter.)

Maybe I misunderstood and you had something else in mind. For instance, the pattern would break if there were a dash (with or without space) in the date on the left, because the first "dash test" is based on a plain dash, not space-dash-space, so that a plain dash on the left would be a "false positive".

Michael, thank you for your kind post. Sometimes I wonder if I've gone into my own little world and vomited a thousand keystrokes on a topic that's only interesting to me. So it's a treat to know you didn't find the details boring.

Wishing you all a beautiful week

Leo · November 19, 2012, 10:24am

My mistake, you're quite right. It's skipping the first dash, then finding the second dash with a space after it, which is fine for the specified inputs.

It'd only go wrong if there was an extra dash somewhere in the date at the start (not part of the specification, so that's okay) or if a word ended in a dash, e.g.

20120805 0219 - BR- alp-ha - Alp-ha Cen-tauri - Dist-anzen 1.ts

which is probably fine as well.
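
The two dash-heavy file names from this part of the thread can be run through both patterns with a short Python sketch. It assumes Python's re module matches these patterns the same way as the Opus engine and uses re.search as a stand-in for the renamer; the labels and the capture helper are invented for the example.

```python
import re

# David's pattern and playful's compact pattern from earlier in the thread.
david = re.compile(r"[^-]*-[^-]*-[ ]+(.*\..*)")
compact = re.compile(r"(?:.*?\s-\s){2}(.*)")

names = {
    "dashes inside words":
        "20120805 0219 - BR-alp-ha - Alp-ha Cen-tauri - Dist-anzen 1.ts",
    "word ending in a dash":
        "20120805 0219 - BR- alp-ha - Alp-ha Cen-tauri - Dist-anzen 1.ts",
}

def capture(pattern, name):
    """Return Group 1 of the first match found anywhere in name, or None."""
    m = pattern.search(name)
    return m.group(1) if m else None

results = {label: (capture(david, n), capture(compact, n))
           for label, n in names.items()}

for label, (d, c) in results.items():
    print(label, "|", d, "|", c)
# dashes inside words | Alp-ha Cen-tauri - Dist-anzen 1.ts | Alp-ha Cen-tauri - Dist-anzen 1.ts
# word ending in a dash | alp-ha - Alp-ha Cen-tauri - Dist-anzen 1.ts | Alp-ha Cen-tauri - Dist-anzen 1.ts
```

In this sketch, the pattern based on plain dashes stops too early when a word ends in a dash, while the pattern based on space-dash-space is not affected by it.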

There are loads of regular expression tutorials, and interactive learning tools, on the web that walk you through step-by-step examples. If you want to learn regular expressions, the information is out there. (In addition to the beginner-level guide I wrote for the rename scripting area here.)

I agree it can be helpful to explain things step-by-step, but it's also very time consuming and often seems redundant when there are tutorials out there for people who want them. (Explaining things is definitely useful when some of the newer or more esoteric regexp features are used, as in Playful's example, of course.)

RogerP · November 19, 2012, 2:26pm

Hi all

Very interesting thread indeed.

I usually use Expresso to develop and test expressions, but the best tool is worthless if you don't know what you are doing.

Thanks to all of you!
Roger

michaelkenward · November 19, 2012, 2:32pm

[quote="leo"]
There are loads of regular expression tutorials, and interactive learning tools, on the web that walk you through step-by-step examples. If you want to learn regular expressions, the information is out there. (In addition to the beginner-level guide I wrote for the rename scripting area here.)[/quote]

I have looked at many of those tutorials.

The message I fingered for its usefulness was particularly helpful because it addresses a specific problem with a good example and a clear explanation of what each bit does.

Most tutorials start with very general issues and then provide comprehensive solutions. They also mostly come from people writing for readers like themselves, rather than for someone who wants to just deal with an issue rather than complete a PhD.

As in many walks of life, the worst explanations often come from the most knowledgeable people. This is why people like me can make a living translating technical stuff for readers who start with very little knowledge.