Regex - library order title fix

penguinaka · July 25, 2011, 9:36pm

I've been trying to come up with a regex that will fix titles in file names that are in library order, and I can't seem to accomplish it. I was hoping someone with more experience would take a crack at it.

Examples:
George Martin - Ice & Fire 01 - Game of Thrones, The.epub
George Martin - Ice & Fire 01 - Game of Thrones, An.epub
George Martin - Ice & Fire 01 - Game of Thrones, A.epub
George Martin - Game of Thrones, The.epub
George Martin - Game of Thrones, An.epub
George Martin - Game of Thrones, A.epub

Fixed:
George Martin - Ice & Fire 01 - The Game of Thrones.epub
George Martin - Ice & Fire 01 - An Game of Thrones.epub
George Martin - Ice & Fire 01 - A Game of Thrones.epub
George Martin - The Game of Thrones.epub
George Martin - An Game of Thrones.epub
George Martin - A Game of Thrones.epub

MrC · July 25, 2011, 11:24pm

Perhaps you can explain what you understand and what you don't. Since these questions are essentially the same with only minor variations, it seems regular expressions, and how to go about breaking these problems into smaller, more manageable pieces, is presenting some challenges.

Suggestion... break these things down into simpler components, and then reconstruct from the pieces you have. Since you have a monster script already, you know how to pull out pieces and put them back together, right?

penguinaka · July 27, 2011, 8:14pm

[quote="MrC"]Perhaps you can explain what you understand and what you don't. Since these questions are essentially the same with only minor variations, it seems regular expressions, and how to go about breaking these problems into smaller, more manageable pieces, is presenting some challenges.

Suggestion... break these things down into simpler components, and then reconstruct from the pieces you have. Since you have a monster script already, you know how to pull out pieces and put them back together, right?[/quote]

That's where I'm having the trouble. I can't seem to separate the ending where the title is with regex and get it to recognize either side of the comma as two parts and swap them... that's why I'm asking in here.

Jon · July 27, 2011, 9:04pm

Old name: (.*)- (.*), (.*)\.(.*)
New name: \1- \3 \2.\4
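
For anyone who wants to try this outside Opus, here is a minimal sketch in Python. It assumes the old-name pattern reads `(.*)- (.*), (.*)\.(.*)` (the quantifiers and the escaped dot are an assumed reading of the post), that the whole name has to match for a rename to apply, and that Python's `re` module is an acceptable stand-in for the Opus engine, which may differ in details. The helper name `swap_article` is made up for illustration.

```python
import re

# Assumed reading of the pattern above; Python's re stands in for the Opus engine.
OLD_NAME = re.compile(r"(.*)- (.*), (.*)\.(.*)")
NEW_NAME = r"\1- \3 \2.\4"

def swap_article(name: str) -> str:
    """Move a trailing ', The' / ', An' / ', A' to the front of the title."""
    m = OLD_NAME.fullmatch(name)
    # Only rename when the whole name matches; otherwise leave it untouched.
    return m.expand(NEW_NAME) if m else name

print(swap_article("George Martin - Ice & Fire 01 - Game of Thrones, The.epub"))
# George Martin - Ice & Fire 01 - The Game of Thrones.epub
print(swap_article("George Martin - Game of Thrones, A.epub"))
# George Martin - A Game of Thrones.epub
```

Because the first `(.*)` is greedy, it swallows everything up to the last "- ", so the series part ("Ice & Fire 01") ends up in \1 along with the author.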

JohnZeman · July 27, 2011, 9:15pm

Jon beat me to it, but since I had worked out a solution as well I'll go ahead and post it just for reference.

MrC · July 27, 2011, 9:24pm

Here's a trick that might help.

Build in pieces...

Select your files, and click the Rename button and select Type: Regular Expressions; make sure the preview window is open.

Start by entering (.*) as the pattern (old name) and \1 as the replacement (new name). Thus, you should see that the Original Name and New Name are the same... no change.

Now, notice that you want essentially: A - B.suffix. And it is the B part you want modified - everything else remains the same. But let's start easier, instead just A - B. This is an easy enough pattern, right? It is just:

(.*) - (.*)

Now, in the New Name, enter \1 and watch the New Name in the preview window. Now, instead, enter \2 and also watch the New Name. You should be able to see what is captured in the first set and in the second set. And \1 \2 gives you back the original, without the " - " in between.

So you know you want to change the part B expression. Let's start by splitting off the suffix. The suffix is matched by a dot followed by anything, so that would be: \.(.*) (note: we have to backslash-escape the dot, because it would otherwise match any character, and we only want it to match a literal dot). So let's add the suffix part, capturing it in capture group 3.

(.*) - (.*)\.(.*)

Now, do the same trick. Enter \1 as the new name, see the results. Do likewise for just \2. And finally, likewise for \3.

You should see that the A part is in \1, the B part is in \2, and the suffix is in \3.
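
The same "look at one group at a time" trick can be reproduced in Python as a rough cross-check. This is only a sketch: it assumes the pattern is `(.*) - (.*)\.(.*)` and uses Python's `re` in place of the Opus preview window.

```python
import re

name = "George Martin - Ice & Fire 01 - Game of Thrones, The.epub"

# A - B.suffix, with the dot escaped so it only matches a literal dot.
m = re.fullmatch(r"(.*) - (.*)\.(.*)", name)

# Print each capture on its own line, like cycling \1, \2, \3 in the New name box.
for i in (1, 2, 3):
    print(f"\\{i} = {m.group(i)!r}")
# \1 = 'George Martin - Ice & Fire 01'
# \2 = 'Game of Thrones, The'
# \3 = 'epub'
```

Note that the greedy first group takes everything up to the last " - ", so for the series examples the A part includes the series name as well as the author.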

So, now, you know you need to work on only the B part, which is the second set of parens:

(.*) - (.*)\.(.*)

Working on JUST this part, the pattern you want to match is : C, D. That's easy, right? It is just: all things not a comma, followed by a comma, space and the remainder:

([^,]*)(, (.*))

So let's now substitute our new pattern for what was part B. We need to group all of this:

(.*) - ([^,]*)(, (.*))\.(.*)

Now, do the same trick as above, cycling through, one at a time, \1, \2, \3, and then finally \4 in the New name to see what has been captured.

It should now be obvious that you can just piece them all together in any order you want using \1, etc.

One modification - take a look at the results of \3. You really don't want to capture this group. We've captured 5 things, but only want 4. So, we can place a ?: just inside the parens, which prevents capture, but still groups the items:

(.*) - ([^,]*)(?:, (.*))\.(.*)

Now, you have a \1, \2, \3 and \4 you can stitch back together in any order you want, and put text in between as you want. Try it!
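
As a rough illustration of the stitching step, here is a small Python sketch. It assumes the final pattern is `(.*) - ([^,]*)(?:, (.*))\.(.*)` and that the new name is `\1 - \3 \2.\4`; that replacement string is one possible way to stitch the pieces together, not something given in the post, and Python's `re` is only a stand-in for the Opus engine.

```python
import re

# Assumed final pattern: A - C, D.suffix, where the ", D" wrapper is non-capturing.
PATTERN = re.compile(r"(.*) - ([^,]*)(?:, (.*))\.(.*)")

name = "George Martin - Ice & Fire 01 - Game of Thrones, An.epub"
m = PATTERN.fullmatch(name)

print(m.groups())
# ('George Martin - Ice & Fire 01', 'Game of Thrones', 'An', 'epub')

# Hypothetical new name: put the article (\3) in front of the title (\2).
print(m.expand(r"\1 - \3 \2.\4"))
# George Martin - Ice & Fire 01 - An Game of Thrones.epub
```

The `(?:...)` wrapper keeps the ", D" pieces grouped without spending a capture number on them, which is why there are four groups here instead of five.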

MrC · July 27, 2011, 9:27pm

[quote="jon"]Old name: (.*)- (.*), (.*)\.(.*)
New name: \1- \3 \2.\4[/quote]

This won't work when the article is missing, such as:

George Martin - Ice & Fire 01 - Game of Thrones.epub

MrC · July 27, 2011, 9:44pm

[quote="MrC"][quote="jon"]Old name: (.*)- (.*), (.*)\.(.*)
New name: \1- \3 \2.\4[/quote]

This won't work when the article is missing, such as:

George Martin - Ice & Fire 01 - Game of Thrones.epub[/quote]
Bah, neither does mine.

Jon · July 27, 2011, 9:49pm

I don't think regex is flexible enough to cover all cases, so all you can do is assume it will only be used on filenames in the format the OP provided. Since his examples all did have articles in them, I don't think you need to worry about cases where they don't.

MrC · July 27, 2011, 10:24pm

We can safely ignore my last two posts, as they are the rambling nonsense of a sleep-deprived loony!

Of course, there is nothing to do for the non-article case, so it works. But it is always good to be thinking of how the REs will break with minor, expected changes to input. Recalling the monster script, we can foresee all sorts of ways for problems to arise.
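
A quick way to see both halves of this point is a small Python sketch. It assumes the pattern `(.*) - ([^,]*)(?:, (.*))\.(.*)` with a new name of `\1 - \3 \2.\4`, treats a non-matching name as "leave it alone", and uses Python's `re` as a stand-in for Opus. The `rename` helper and the "Some Author" file name are made up for illustration.

```python
import re

PATTERN = re.compile(r"(.*) - ([^,]*)(?:, (.*))\.(.*)")

def rename(name: str) -> str:
    """Apply the swap when the whole name matches; otherwise return it unchanged."""
    m = PATTERN.fullmatch(name)
    return m.expand(r"\1 - \3 \2.\4") if m else name

# No comma, so no match: the name comes back untouched.
no_article = "George Martin - Ice & Fire 01 - Game of Thrones.epub"
print(rename(no_article))
# George Martin - Ice & Fire 01 - Game of Thrones.epub

# A comma that is part of the title (hypothetical file) gets swapped anyway.
comma_in_title = "Some Author - Hello, World.epub"
print(rename(comma_in_title))
# Some Author - World Hello.epub
```

The second case is the kind of minor, expected input change that breaks the expression: the pattern has no idea whether the text after the comma is really an article.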

@JohnZeman - I think you want to remove a space in your New name: \1\3\2\4

penguinaka - Hopefully you've picked up some of the ideas. One thing you'll want to consider - each of these REs you've been importing into your script generally are designed to work on only the small subset of input files you've provided. There are many ways for things to fall apart. This is why you want to consider a more "holistic" approach to the problem, rather than cut/paste.

JohnZeman · July 27, 2011, 10:39pm

The space between \3 and \2 was intentional MrC.

MrC · July 27, 2011, 10:47pm

Why was that? Replace it with a % and you'll see there are one too many spaces, in between, for example, "A" and "Game".

JohnZeman · July 27, 2011, 11:22pm

It doesn't do that on my machine so apparently Opus RegEx on your machine is being interpreted a little differently?
I added the space intentionally because without it there wouldn't be any space after the words "The" or "An" etc.

Jon · July 27, 2011, 11:25pm

If you look carefully at the screenshot you posted you can see there are two spaces following the final hyphen in each name (e.g. "George Martin -  A Game of Thrones.epub") - I believe this is what MrC is referring to.

JohnZeman · July 27, 2011, 11:25pm

Oh I see what you mean and you're right. Only the extra space is before the words "The", "An", etc.

I'll cobble up a fix for that shortly.

JohnZeman · July 27, 2011, 11:28pm

The fix was easy enough, I just added a space after the comma in the old name.

penguinaka · July 27, 2011, 11:40pm

Thank you so much everyone for responding to this post... and the explanations alongside the solutions are extremely helpful since I'm still just a beginner at regex. I actually bought RegexBuddy and RegexMagic, but even those have a huge learning curve, so I'm still wading my way through it.

The fix was only needed for swapping the article back to the front. There are some ebook management tools that flip the article to the back in what they call "library order"; apparently it helps in looking up books when you can ignore the article in an alphabetical search. I prefer it not in library order, and I have been editing a large collection by hand where some are and some are not.

I don't think it would fit well into bookcase because there could be instances where a comma was part of the title and supposed to be that way. So I was thinking it would be better as a stand-alone. Thank you very much everyone.

penguinaka · July 28, 2011, 12:07am

both work great ...what about this as a curve ball...

could a regex take into account something in brackets at the end and leave it in place?

for example:
from:
George RR Martin - Ice 01 - Game of Thrones, the (v5.0).epub
to:
George RR Martin - Ice 01 - The Game of Thrones (v5.0).epub

and still handle a case without brackets at the end?

MrC · July 28, 2011, 1:07am

This is why you want to consider ALL of your input samples at the same time... otherwise, you end up having to rewrite your expressions over and over. The regular expression engine has to work hard to determine what can match. So every little hint you can give it, every constraint, makes its job easier, and more importantly, allows it to work.

Since your input is of a different form now, we have to handle this one a little differently.

(.* - ){1,2}([\w ]+), (\w*)(.*)\.(\w+)

\1\3 \2\4.\5

I've helped you... now perhaps you can help others here by explaining how the above works. Your reference is here:

msdn.microsoft.com/en-us/library/bb982727.aspx
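
As one possible way to explore how this works, here is a Python sketch. It assumes the dot before the extension is escaped, i.e. `(.* - ){1,2}([\w ]+), (\w*)(.*)\.(\w+)`, and it uses Python's `re` rather than the engine documented at the link above, so small differences are possible. The `rename` helper is made up for illustration.

```python
import re

# Assumed reading of the pattern, with the dot before the extension escaped.
PATTERN = re.compile(r"(.* - ){1,2}([\w ]+), (\w*)(.*)\.(\w+)")
NEW_NAME = r"\1\3 \2\4.\5"

def rename(name: str) -> str:
    """Swap the article to the front, keeping any trailing '(...)' part in place."""
    m = PATTERN.fullmatch(name)
    return m.expand(NEW_NAME) if m else name

with_brackets = "George RR Martin - Ice 01 - Game of Thrones, the (v5.0).epub"
without_brackets = "George RR Martin - Ice 01 - Game of Thrones, the.epub"

print(PATTERN.fullmatch(with_brackets).groups())
# ('George RR Martin - Ice 01 - ', 'Game of Thrones', 'the', ' (v5.0)', 'epub')
print(rename(with_brackets))
# George RR Martin - Ice 01 - the Game of Thrones (v5.0).epub
print(rename(without_brackets))
# George RR Martin - Ice 01 - the Game of Thrones.epub
```

Reading the groups: \1 is everything up to and including the last " - ", \2 is the title (word characters and spaces only, so it stops at the comma), \3 is the article, \4 is whatever sits between the article and the final dot (empty when there are no brackets), and \5 is the extension. The article keeps whatever case it had in the file name, so a lowercase "the" stays lowercase.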

MrC · July 28, 2011, 1:29am

Exactly... and this was the point I was hoping to draw attention to.

Since whitespace is so difficult to "see" in the input/output, it might be nice if there was an option in the dialog to show a single space as a small light-grey middle-dot (or something), so that one could quickly count them. Since you can't select text in the preview window, there's no rapid, easy way to determine length of space runs.

Something like: a · b or a ·· b. I've made them colored so they are more easily read here in the forum.
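
Until something like that exists in the dialog, the same idea can be approximated outside Opus. A minimal Python sketch, where the `show_spaces` helper is made up for illustration and the sample name is built with two spaces after the hyphen:

```python
def show_spaces(name: str, marker: str = "\u00b7") -> str:
    """Replace each space with a visible middle dot so runs of spaces can be counted."""
    return name.replace(" ", marker)

# Build a name with two spaces after the hyphen, spelled out so they can't get lost.
sample = "George Martin -" + " " * 2 + "A Game of Thrones.epub"
print(show_spaces(sample))
# George·Martin·-··A·Game·of·Thrones.epub
```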