Regular expressions in Notepad++

One of the features of the great old programming editors (with legendary Unix names such as Vi and Emacs) was their ability to use regular expressions (aka regex) in search and replace operations. One of the great things about the Notepad++ editor is that it matches the regular expression strengths of these old veterans without hiding them in a forest of mysterious commands.

Suppose your HTML code includes width=120px. You want to update it to the XHTML-compliant attribute width='120px'. (In HTML, quotation marks are rarely needed; in XHTML they are required.) A simple search/replace would do the trick for that one value, but you also have width=60px and many other widths. You need a regular expression.

This is the most common use of a regular expression. You need to find some text, part of which is fixed and part of which varies. You need to change the fixed part but keep the variable part. So, here's a tiny summary of regular expressions (there are entire books written on the subject!) to get you started.

Regular expression patterns and what they match

  • a the letter "a" matches itself; most characters match only themselves
  • . a period matches any character at all
  • \s (two characters, read as "escape s") matches any whitespace character (space, tab, carriage return or newline)
  • \d matches any digit, from 0 to 9
  • \w (w means "word") matches any letter (uppercase or lowercase), digit (0 to 9) or underscore
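To see these matchers at work outside the editor, here is a minimal sketch using Python's re module as a stand-in. That is an assumption: Notepad++ uses its own regex engine, and the two agree on these basics but are not identical (in Python, for example, "." does not match a newline by default). The sample string is made up for illustration.

```python
import re

# Hypothetical sample text: letters, digits, a space, an underscore and a tab.
sample = "a1 b_2\tc"

letters_a = re.findall(r"a", sample)    # a literal "a" matches only itself
digits = re.findall(r"\d", sample)      # any digit 0-9
spaces = re.findall(r"\s", sample)      # whitespace: here a space and a tab
word_chars = re.findall(r"\w", sample)  # letters, digits and underscore
any_chars = re.findall(r".", sample)    # every character in this sample

print(letters_a, digits, spaces, word_chars, len(any_chars))
```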

Punctuation marks sometimes have a special meaning. Whether they're special or not, if you want the literal character itself, precede it with a backslash. (If you're not sure, which is often the case, "escape" the punctuation mark: putting a backslash in front of it is the safe choice.)

  • \. dot
  • \+ plus sign
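The difference between an escaped and an unescaped dot is easy to check. This is a small sketch in Python's re module (an assumption; it is not the Notepad++ engine), with a made-up sample string. Python's re.escape adds the backslashes for you when you are not sure which characters are special.

```python
import re

# Hypothetical sample containing a literal dot and a plus sign.
text = "v1.2+x1y2"

unescaped = re.findall(r"1.2", text)   # "." matches any character
escaped = re.findall(r"1\.2", text)    # "\." matches only a literal dot
plus_signs = re.findall(r"\+", text)   # "\+" matches a literal plus sign

# re.escape backslash-escapes the characters that are special in a pattern.
safe_pattern = re.escape("1.2+")

print(unescaped, escaped, plus_signs, safe_pattern)
```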

Single lowercase letters identifying special character classes can be inverted in meaning using a capital letter:

  • \s any whitespace character
  • \S any character that is not whitespace
  • \d any decimal digit
  • \D any character EXCEPT a decimal digit
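The inverted classes are handy for stripping out everything you do not want. A minimal sketch, again assuming Python's re module and invented sample strings:

```python
import re

# \D matches every non-digit, so replacing it with "" keeps only the digits.
digits_only = re.sub(r"\D", "", "width=120px")

# \S+ matches runs of non-whitespace, which splits text into "words".
chunks = re.findall(r"\S+", "a  b\tc")

print(digits_only, chunks)
```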

Multipliers

 

Some regular expression symbols specify a repetition count and are called "multipliers." The most common are:

  • + (one or more) \d+ is "one or more digits", a pattern that matches any unsigned decimal integer
  • * (zero or more) .* is "zero or more of any character"
  • ? (zero or one) the previous character is optional
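Here is a short sketch of the three multipliers, assuming Python's re module in place of the editor. The sample strings are invented for illustration.

```python
import re

# "+" means one or more: each run of digits is one match.
numbers = re.findall(r"\d+", "width=120 height=60")

# "*" means zero or more: "a.*z" accepts nothing at all between a and z.
zero_between = re.fullmatch(r"a.*z", "az") is not None
many_between = re.fullmatch(r"a.*z", "a123z") is not None

# "?" makes the previous character optional.
spellings = re.findall(r"colou?r", "color colour colr")

print(numbers, zero_between, many_between, spellings)
```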

Alternatives and grouping

The | operator means "or". Parentheses group things together. (a|b) means "a" or "b".
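A quick sketch of alternation inside a group, assuming Python's re module. The words are invented examples.

```python
import re

# "gr(a|e)y" matches "gray" or "grey" but not "groy".
found = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey groy")]

# The bare group from the text: "a" or "b", nothing else.
a_ok = re.fullmatch(r"(a|b)", "a") is not None
c_ok = re.fullmatch(r"(a|b)", "c") is not None

print(found, a_ok, c_ok)
```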

Classes

Enclosing a list of characters in square brackets means: "Match exactly one of these characters." Example: [iou]. It is an abbreviation for (i|o|u).

The pattern d[iou]g matches "dig" or "dog" or "dug". It does not match "drag" or "dragon".

You can use hyphens to denote ranges of characters. [A-Z] matches any capital letter.

A caret ("^") in the first position of a character class negates the class. It means "match any character EXCEPT one of these." d[^iou]g matches any string d.g ("d" followed by any character, then "g") except "dig", "dog" or "dug". [^A-Z] matches any character EXCEPT a capital letter.
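The class examples above can be checked in a few lines. This is a sketch assuming Python's re module; the word list and sample sentence are made up.

```python
import re

words = ["dig", "dog", "dug", "dag", "drag", "dragon"]

# d[iou]g: "d", then exactly one of i, o, u, then "g".
in_class = [w for w in words if re.fullmatch(r"d[iou]g", w)]

# d[^iou]g: "d", then one character that is NOT i, o or u, then "g".
negated = [w for w in words if re.fullmatch(r"d[^iou]g", w)]

# [A-Z] is a range; [^A-Z] is everything outside that range.
capitals = re.findall(r"[A-Z]", "Notepad++ on Windows")
capitals_only = re.sub(r"[^A-Z]", "", "Notepad++ on Windows")

print(in_class, negated, capitals, capitals_only)
```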

Backreferences

If the regular expression contains parentheses, the characters matched inside those parentheses can be used later. This is called a "backreference." The characters matched by the group that begins with the first "(" are called "$1" (in Perl and languages that closely copy Perl), "RegExp.$1" in JavaScript, or "\1" in the Replace field of an editor such as Notepad++.
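A minimal sketch of reusing matched groups in a replacement, assuming Python's re module, which also writes them as \1, \2 and so on. The sample strings are invented.

```python
import re

# Two groups: \1 is the first word, \2 the second. Swap them around.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doe, Jane")

# One group: wrap every run of digits in brackets, keeping the digits.
bracketed = re.sub(r"(\d+)", r"[\1]", "a1b22")

print(swapped, bracketed)
```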

Using Regular Expressions with Backreferences

 

Let's get back to our problem: we want to find every occurrence of "width=" followed by a width specification. We want to replace it with the same specification, but with the width value enclosed in single quotation marks. To keep things simple, let's assume that each width specification sits on a single line and doesn't include embedded quotation marks.

The simplest part is matching "width=". The regular expression for this is simply width=. If you're not sure whether the equals sign has a special meaning in regular expressions (it doesn't), the safer version is width\=.

But what comes next? You will have some digits followed by the letters "px". That is \d+px (assuming all of your widths are in pixels). It's time to create a test file and try it out.

Test data

Test One: Flush Left

<div width=100px>

Test Two: Not Flush Left, Other Attributes

other HTML here <tag other=thing1 width=200px that=thing2>

Test Three: Different Width Unit

<div width=50%>

Fixing Pixel Widths

This specification fixes widths given in pixels:

Find what: width\=(\d+px)
Replace with: width='\1'
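If you would like to check the pixel-only pattern before trying it in the editor, here is a minimal sketch. It assumes Python's re module as a stand-in for the Notepad++ search engine and uses three sample lines modelled on the test data.

```python
import re

# Sample lines modelled on the three test cases.
test_lines = [
    "<div width=100px>",
    "other HTML here <tag other=thing1 width=200px that=thing2>",
    "<div width=50%>",
]

# Same find/replace pair: group 1 captures the digits plus "px".
pixel_fixed = [re.sub(r"width\=(\d+px)", r"width='\1'", line)
               for line in test_lines]

for line in pixel_fixed:
    print(line)
```

The third line comes back unchanged, because "50%" does not end in "px".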

Try it yourself. It should correctly convert the first two test cases to quoted widths. (Yes, this is a hard way to fix two minor problems. But imagine applying it to a large HTML file with a huge number of widths.)

Fixing All Widths

What if some widths are expressed in pixels and others in ems, percentages or points (and yes, there are other possibilities)? We could try to make an exhaustive list: (px|pt|%|...). But that is a tough road.

The width specification must end either at the ">" that closes the tag or at a space preceding another attribute. width\=([^\s>]+) should work. This is "width=" followed by one or more characters, EXCEPT those specified in the class. The class specifies any whitespace character or the ">" character.

Regular expressions are both powerful and mysterious. Here we have only a simple example, and you can already see both. So, let's fix all three test cases at once:

Find what: width\=([^\s>]+)
Replace with: width='\1'

Test One: Flush Left

<div width='100px'>

Test Two: Not Flush Left, Other Attributes

other HTML here <tag other=thing1 width='200px' that=thing2>

Test Three: Different Width Unit

<div width='50%'>
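The same result can be reproduced outside the editor. This is a sketch assuming Python's re module in place of Notepad++; the sample lines are modelled on the three test cases.

```python
import re

test_lines = [
    "<div width=100px>",
    "other HTML here <tag other=thing1 width=200px that=thing2>",
    "<div width=50%>",
]

# Group 1 captures everything after "width=" up to whitespace or ">".
all_fixed = [re.sub(r"width\=([^\s>]+)", r"width='\1'", line)
             for line in test_lines]

for line in all_fixed:
    print(line)
```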

Review

Our search pattern is width.... This finds the string "width" (the "..." is part of this explanation rather than part of the regular expression; it means we will get to the rest in a moment). Next came an escaped punctuation mark: width\=.... This simply means "width" followed by "an equals sign with no special meaning."

We then used parentheses to group a subexpression: width\=(...). This makes the subexpression available in the "Replace with" string as \1. (If we had more groups, the second would be \2, and so on. We could even nest them if we wanted to. If you're not sure which group is which, count the opening parentheses.)

Now let's look inside this subexpression: (...). We used a character class followed by a plus sign: ([...]+). A character class matches a single character, one of the characters in the class. The plus sign is a multiplier that says, "Use one or more of the previous item."

Now we will dive into the character class itself, shown above as [...]. It begins with a caret, [^...], which means "use any character EXCEPT those specified in this class." It continues with an escaped "s", \s, which is any whitespace character, and then with ">", which simply matches itself, the "greater than" sign. Thus, the negated class will match any character EXCEPT whitespace or ">".
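The walkthrough above can be written down as an annotated pattern. This sketch assumes Python's re module and its re.VERBOSE flag, which lets a pattern span several lines with comments; that layout is a Python convenience, not something the Notepad++ Find dialog offers. The name WIDTH is made up for illustration.

```python
import re

# The same pattern as width\=([^\s>]+), one piece per line.
WIDTH = re.compile(r"""
    width      # the literal text: width
    \=         # an escaped equals sign (it has no special meaning)
    (          # open group 1, available as \1 in the replacement
      [^\s>]   # one character that is NOT whitespace and NOT a greater-than sign
      +        # one or more of the preceding class
    )          # close group 1
""", re.VERBOSE)

match = WIDTH.search("<tag other=thing1 width=200px that=thing2>")
captured = match.group(1)
print(captured)
```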

Conclusion

A lot of work? Yes. But if you follow the examples above, you will see that it is much less work than making all these changes one by one.

Now imagine this regular expression-based search/replace used in conjunction with the Find in Files feature. You test on a snippet, then on a small file. It works. You make a backup of your files, open the "Find in Files" tab, tick "In all sub-folders", click "Replace in Files" and bingo! You have converted the entire site.
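For comparison, the same site-wide conversion can be sketched as a script. This is only an illustration under stated assumptions: it uses Python rather than Notepad++, it assumes the files are UTF-8 and end in .html, and the function name quote_widths_in_tree is hypothetical. It writes a .bak copy beside each file it changes.

```python
import re
import shutil
from pathlib import Path

WIDTH = re.compile(r"width\=([^\s>]+)")


def quote_widths_in_tree(root, pattern="*.html"):
    """Quote width values in matching files under root; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        original = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        updated = WIDTH.sub(r"width='\1'", original)
        if updated != original:
            # Keep a backup copy next to the file before overwriting it.
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

As with the editor, try it on a copy of a small folder first. Running it a second time on already converted files would add another pair of quotes, since the pattern does not check for existing ones.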