Regex help request - transposing numbers (or text)

I have a zillion PDFs (magazines mostly but a lot of eBooks too).

The dates on these have evolved over the years but I have finally settled on the following format:

magazine name yyyy-mm

A lot (thousands) of my older file names are in various other formats, like:

magazine name month yyyy....or...magazine name mmyy

How can I help to automate the renames with regex or a script (I understand scripting a little better, but not by much)?

I'd like to take mmyy and turn it into yymm at the very least, and hopefully handle mmyyyy as well. I can handle a laborious script to handle months in long format (unless someone can offer a better solution).

BTW, this is so the magazines will sort in date order in case anyone was wondering why :slight_smile:

Scott
Sarasota FL

Please post some example filenames so the date formats are clear.

Existing:
Road and Track April 2015
Road and Track 0415
Road and Track 042014
Woodworking April-May 2015

What I want:
Road and Track 2015-04
Woodworking 2015-04-05

I guess what I really need is to learn how to take the last xx characters of a file name and move them to another position...?

Thanks for helping!

I'd convert the names of all months into numbers first and then apply multiple regex renames, each matching one of your possible naming schemes. I guess you won't need a script for this; just set up and run the built-in rename tool multiple times to reshape your filenames.

Replacing April with 04 is rather basic; you can use the find and replace mode for this, so there's probably no need for regexes. To give you a start on the numbers: create regex groups by enclosing parts of your "filename" in parentheses. The new name uses \x to reference the regex group with index x. This is how you swap a 2-digit month and a 4-digit year (\1 is the magazine name, ..., \4 the extension).



Old name: (.*?) ([0-9][0-9])([0-9][0-9][0-9][0-9])(\..*)
New name: \1 \3-\2\4

Perhaps easier to read:

Old name: (.*?) ([0-9]{2})([0-9]{4})(\..*)
New name: \1 \3-\2\4
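
For anyone who wants to try the swap outside the rename dialog first, here is a minimal sketch of the same capture-group idea as a plain JavaScript function. The name swapMonthYear and the sample filenames are made up for illustration, and the pattern assumes a space before the six digits and a dot-extension after them.

```javascript
// Sketch only: swapMonthYear is a hypothetical helper, not part of Opus.
// Assumes names shaped like "<title> mmyyyy.<ext>".
function swapMonthYear(filename) {
  // $1 = title, $2 = two-digit month, $3 = four-digit year, $4 = extension
  return filename.replace(/^(.*?) ([0-9]{2})([0-9]{4})(\..*)$/, "$1 $3-$2$4");
}

// swapMonthYear("Road and Track 042014.pdf") gives "Road and Track 2014-04.pdf"
// swapMonthYear("Road and Track 0415.pdf") is returned unchanged (only four digits)
```

Names that do not fit the pattern come back untouched, so the mmyy files would need their own pass.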

As you probably know, you can save regex patterns in the advanced rename dialog. You can also assign rename patterns to a button.

Here is one regex to convert
Nat Geo 0415.pdf
Nat Geo 042015.pdf

to

Nat Geo 1504.pdf

Search:

^(.*)(^|\D)(\d{2})(?:\d{2})?(\d{2})(\.[^.]+)$

Replace:

\1\2\4\3\5

Is this one of the things you want?

Note:
The (^|\D) serves as a boundary hack.
It ensures that what precedes is not a digit.
Normally you would use (?<!\d) but TR1 regex does not support lookbehinds.

Would LOVE a real regex flavor in Opus at some stage. :slight_smile:

Explanation
^ Assert beginning of string
(.*) Capture any chars to group 1
(^|\D) Capture the zero-length beginning of string position or a non-digit to Group 2
(\d{2}) Capture two digits (the month) to Group 3
(?:\d{2})? Optionally match two digits (the century) without capturing
(\d{2}) Capture two digits (the year) to Group 4
(\.[^.]+) Capture a dot and non-dot chars (the extension) to Group 5
$ Assert end of string

\1\2\4\3\5 Replace with the content of groups 1, 2, 4, 3, 5
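
The breakdown above can be checked quickly in plain JavaScript. This is only a sketch: toYYMM is a hypothetical name, and the sample filenames are the ones from this post plus one prefix-less name added to exercise the (^|\D) group.

```javascript
// Sketch only: toYYMM is a hypothetical name; the pattern is the one explained above.
var datePattern = /^(.*)(^|\D)(\d{2})(?:\d{2})?(\d{2})(\.[^.]+)$/;

function toYYMM(filename) {
  // Re-emit groups 1, 2, 4, 3, 5: the year (4) now precedes the month (3).
  return filename.replace(datePattern, "$1$2$4$3$5");
}

// toYYMM("Nat Geo 0415.pdf")   gives "Nat Geo 1504.pdf"
// toYYMM("Nat Geo 042015.pdf") gives "Nat Geo 1504.pdf"
// toYYMM("0415.pdf")           gives "1504.pdf" (group 2 matched the start of the string)
```

One caveat: the pattern is not idempotent. A second pass would turn 1504 back into 0415, and a name ending in a bare year such as 2015 would also be rearranged, so it is safest to run it once, on files known to be in mmyy or mmyyyy form.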

TR1 is a "real" regex flavor. No regex flavor supports all the features of all the others, and very few support lookbehinds (which are explicitly considered a bad idea by many).

I'm not sure a lookbehind is needed or sensible there. You could use one of the asserts that TR1 includes for the same thing, or more simply start with (.*\D) since none of the inputs are just a date with no prefix.
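
A rough sketch of that (.*\D) variant in JavaScript, for comparison. The function name is hypothetical, and the rest of the pattern is assumed to stay as posted earlier in the thread.

```javascript
// Sketch only: toYYMMWithPrefix is a hypothetical name.
// (.*\D) requires at least one non-digit before the date, so no (^|\D) group is needed.
function toYYMMWithPrefix(filename) {
  // $1 = prefix ending in a non-digit, $2 = month, $3 = year, $4 = extension
  return filename.replace(/^(.*\D)(\d{2})(?:\d{2})?(\d{2})(\.[^.]+)$/, "$1$3$2$4");
}

// toYYMMWithPrefix("Nat Geo 0415.pdf") gives "Nat Geo 1504.pdf"
// toYYMMWithPrefix("0415.pdf") is returned unchanged, since there is no prefix to match
```

The trade-off is one fewer group to carry through the replacement, at the cost of skipping any file whose name is nothing but a date.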

Hi Leo,

I didn't want to assume that — that's why I put that in.

When you say "explicitly considered a bad idea by many"… it seems to me that they are considered a brilliant idea by many too, and heavily relied upon by people who use regex all the time. I won't get into who is right or wrong, but it's a valuable tool. Yes, any tool can be misused: you know the famous regex quote about "now you have two problems".

And I do have to disagree about "very few support lookbehinds"…
Among others, there is lookbehind in:

  • PCRE (C, PHP, R, and probably available to you guys in C++, with a recent new version called PCRE2 that includes a substitution function)
  • Perl
  • .NET
  • Java
  • Objective C
  • Python
  • Ruby
  • TRegEx
  • even Powershell

JavaScript is one of the few flavors that doesn't support lookbehind.
Is it possible that the situation may have changed since you last researched it?

Opus is such a powerful beast, it would be lovely if you guys had a way to package a better regex library with it.

I think it's a safe assumption in this case. :slight_smile:

Maybe attitudes to lookbehinds have changed since I last looked, but until they are added to native C++ regex we're unlikely to change the whole regex system for a third party one and introduce compatibility issues, just for one controversial feature which can be handled in various other ways (asserts, or simple changes to the regex itself, or rename scripts which are often a lot easier and more expressive than trying to cram everything into one big regex once the task moves from the trivial).

I'd say lookbehind seems to be one of the more common features among engines today, and I can't
really think of a situation where it is needed but would be a bad idea.
I tried to do a search for where someone said something like that, but I couldn't find anything.

I would stay away from PCRE2 for at least a couple of versions (bugs and so on), but PCRE is highly recommended.

The Important notes about lookbehind part of this suggests support/abilities vary and is quite limited in many cases, also with poor performance for the fully featured lookbehind versions, everywhere except in .Net and one other proprietary library.

So I remain unconvinced, and think there are better, clearer and faster ways to achieve the same results. But it's a bit moot until/unless C++ STL gets the feature added, anyway.

If I'm looking at the right section of the page, Jan is talking about the lack of support of infinite lookbehind. This is rarely much of a limitation.
The usual workarounds for the lack of infinite lookbehind are \K (PCRE, Perl, Ruby) and capture groups.
By the way yes it is in .NET, but that's not the only place: it's also in Matthew Barnett's excellent alternate regex module for Python.

Oh, and here is a longer page about lookarounds, which shows a number of other use cases. Full disclosure, I wrote this some time last year. :slight_smile:

[quote="leo"]The Important notes about lookbehind part of this suggests support/abilities vary and is quite limited in many cases, also with poor performance for the fully featured lookbehind versions, everywhere except in .Net and one other proprietary library.

So I remain unconvinced, and think there are better, clearer and faster ways to achieve the same results. But it's a bit moot until/unless C++ STL gets the feature added, anyway.[/quote]
Well, support varies, especially when comparing 100+ flavors/versions, but PCRE has full support for it
except for infinite lookbehind like (?<=\ba+)text
It has quite good performance too.

Poor performance in lookbehinds doesn't mean the engine in general has poor performance. If the
alternative is no lookbehind support, well.
Btw, his tools have the most extensive regex support I've found anywhere.

Amen. Huge fan of RegexBuddy and EditPad Pro. @myarmor do you have other favorite regex tools not by Jan? Sharing mine: ABA Replace, TextDistil. And of course home-made ones. :slight_smile:

Sorry about the off-topic.

Don't the regexes already posted cover just about all the formats he/she listed?

@playful, just some self-made ones, such as a PCRE-centered regex editor made pre-RB4 (usually no need for it after v4).
You forgot to mention PowerGrep, the big "boss" when it comes to search and replace in files. It is somewhat expensive, but
easily pays for itself if you regularly do s&r.

Yeah, I guess we went quite off-topic.

Here's a rename script for:
Woodworking April-May 2015
Road and Track April 2015

(Updated code in next post.)

Open the advanced rename dialog, enable the script mode and paste the code above.
I tried to upload the .orp, but it was rejected as a possible attack vector (huh??).

I overlooked something in the code. It allowed "name year" in the second regex.
This one checks for valid monthnames in both formats.

@script jscript
function MonthToNumber(s){
  var m;
  switch(s.toLowerCase()){
    case "jan": /*falls through*/
    case "january": m=1;break;
    case "feb": /*falls through*/
    case "february": m=2;break;
    case "mar": /*falls through*/
    case "march": m=3;break;
    case "apr": /*falls through*/
    case "april": m=4;break;
    case "may": m=5;break;
    case "jun": /*falls through*/
    case "june": m=6;break;
    case "jul": /*falls through*/
    case "july": m=7;break;
    case "aug": /*falls through*/
    case "august": m=8;break;
    case "sep": /*falls through*/
    case "september": m=9;break;
    case "oct": /*falls through*/
    case "october": m=10;break;
    case "nov": /*falls through*/
    case "november": m=11;break;
    case "dec": /*falls through*/
    case "december": m=12;break;
    default: return undefined;
  }
  if (m<10){
    return '0'+m;
  } else {
    return m;
  }
}
function FixDate(s){
  var m1,m2;
  // Handle Month1-Month2 Year
  s=s.replace(/\b(\w{3,})[\s-]+(\w{3,})\s+(\d{4})$/i,function(match,month1,month2,year){
    m1=MonthToNumber(month1);
    m2=MonthToNumber(month2);
    // check that both captures are valid month names (to skip "name month year")
    if ((m1) && (m2)){
      return year+'-'+m1+'-'+m2;
    } else {
      // if not, return original text.
      return match;
    }
  });

  // Handle Month Year
  s=s.replace(/\b(\w{3,})\s+(\d{4})$/i,function(match,month1,year){
    m1=MonthToNumber(month1);
    // check to see if we have a valid month (to skip "name year")
    if (m1){
      return year+'-'+m1;
    } else {
      return match;
    }
  });
  return s;
}

// Called to retrieve the new name of a file during rename
function OnGetNewName(getNewNameData){
  var name;
  var ext;
  var fnamere=/^(.*)(\.[^\.]+)$/;
  var filename=getNewNameData.item.name;
  if (filename.search(fnamere)>=0){
    name=RegExp.$1;
    ext=RegExp.$2;
  } else{
    name=filename;
    ext='';
  }
  name=FixDate(name);
  return name+ext;
}

The Dynamic Renamer supports date guessing and reformatting. It didn't handle some of the cases here, but I've been modifying the code to support some more incomplete dates (and fixed a few typos). Example:

Road and Track April 2015 --> Road and Track 2015_04
Road and Track 0415 issue --> Road and Track 2015_04 issue
Road and Track 042014 --> Road and Track 2014_04

The one it won't handle fully is this one (it detects only the May 2015 as part of the date):
Woodworking April-May 2015

And you want the format to be:
Woodworking 2015-04-05

I wouldn't advise using that format, as it then becomes an ambiguous date (it looks like April 5th or May 4th of 2015). It would be better to name those files something like 2015-04,2015-05 or 2015-04,05. By using a comma as a range separator instead of a dash, you remove the built-in ambiguity and give yourself flexibility for later re-renaming or identification.
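
As a small illustration of the comma-separated form, here is a sketch of a formatter that could be called from a rename script. formatIssueDate is a hypothetical helper, not Dynamic Renamer code, and it assumes the year and month numbers have already been parsed out of the name.

```javascript
// Sketch only: formatIssueDate is a hypothetical helper, not Dynamic Renamer code.
function formatIssueDate(year, months) {
  var parts = [];
  for (var i = 0; i < months.length; i++) {
    // Zero-pad each month to two digits.
    parts.push((months[i] < 10 ? "0" : "") + months[i]);
  }
  // A comma separates the months, so the dash only ever separates year from month.
  return year + "-" + parts.join(",");
}

// formatIssueDate(2015, [4, 5]) gives "2015-04,05"
// formatIssueDate(2015, [4])    gives "2015-04"
```

Because the year still comes first and the months are zero-padded, names in this form should sort in date order alongside the plain yyyy-mm ones.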

I'll post an update shortly, in case you're interested.