Regex

Version 1.1
by Lummox JR

Regex is a library for searching and replacing text using regular expressions. Most standard regular expression functionality is included. To learn more about what regular expressions are and how to use them, please see Appendix A.

Regex quick reference
/pattern/flags
/pattern/replacement/flags

a|b                 Find a or b
^                   Beginning of line
$                   End of line
\A                  Beginning of text (single-line mode)
\Z                  End of text (single-line mode)
\b                  Word break (see \w); \B is non-break
.                   Any character
[abc0-9]            Any of these characters; use - for range of characters
[^abc0-9]           Any character except one of these
( )                 Treat contents as a group
\n                  Find the same match found in nth group
$n                  Replace with match found in nth group
$name               Find value of named var name
[proc(arg,...)]     Call proc and use result (replacement only)

Modifiers
*                   Match 0 or more times
+                   Match 1 or more times
?                   Match 0 or 1 time
                    OR make last modifier non-greedy
{n}                 Match n times
{n,}                Match at least n times
{n,m}               Match n to m times

Special characters
\                   Escape next character if it is not one of the following
\0nnn               Octal character nnn
\d                  Digit 0-9; \D is non-digit
\l                  Lowercase letter; \L is non-lower or non-letter
\n                  Line break
\s                  Whitespace character; \S is non-space
\t                  Tab
\u                  Uppercase letter; \U is non-upper or non-letter
\w                  Word character 0-9, A-Z, a-z, or _
                    \W is non-word character
\xnn                Hexadecimal character nn

Flags
i                   Case-insensitive
s                   Treat as single line
                    \A and \Z find start and end of entire text only
                    \s includes \n
g                   Global; replace all matches
e                   Allow expressions (proc calls)

Using Regex

To use regular expressions, first you have to compile the pattern you will use to search or replace. To do this you create a new /regex datum.

var/regex/R = new("/time/")

This will create a new regular expression which searches for the word "time" in lowercase. With this expression you can use Find() to find "time" and FindNext() to keep searching.

If the expression fails to compile, the error var will tell you the reason why. Always check error before searching with the expression.
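
A minimal sketch of that check follows. It assumes error is empty (null) when the pattern compiles cleanly, which the text above implies but does not state outright; the verb name TestCompile is made up for illustration.

```dm
mob/verb/TestCompile()
  var/regex/R = new("/\\btime\\b/")
  // error is only set if the pattern failed to compile (assumption: null otherwise)
  if(R.error)
    usr << "Regex error: [R.error]"
    return
  // Safe to search now; match holds the position where the match begins
  if(R.Find("What time is it?"))
    usr << "Found a match starting at position [R.match]"
```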

The / character is a delimiter which says where the pattern begins and ends, which is important for later examples. You can use almost any character as a delimiter. If you want to use that same character in the pattern, you must escape it with a backslash; in a BYOND string, you'll need to use \\ for that.

var/txt = "It's time to start the timer."
if(R.Find(txt))
  do
    world << "[copytext(txt,1,R.match)]\
              <b>[copytext(txt,R.match,R.index)]</b>\
              [copytext(txt,R.index)]"
  while(R.FindNext(txt))
Result:
It's <b>time</b> to start the timer.
It's time to start the <b>time</b>r.

If you had chosen /\btime\b/ as a pattern instead, "time" would only be found as a whole word, so the second line would not appear.

var/regex/R = new("/\\btime\\b/")

Since this is written in DM, where a backslash has special meaning inside a string, every backslash in the pattern must be escaped with another backslash so BYOND passes it through. Likewise, DM treats [...] inside a string as an embedded expression, so if your pattern uses brackets you'll have to write the opening bracket as \[.

Replacing text

Replacement is about as easy as searching. Create a find-and-replace pattern and use Replace() to perform the replacement. Replace() returns the changed text if it does anything, and null if the pattern was not found.

var/regex/R = new("/\\bred\\b/blue/")
var/txt = "Fred threw the red ball to Jane. Jane caught the red ball."
var/newtxt = R.Replace(txt)
while(newtxt)
  txt = newtxt
  newtxt = R.ReplaceNext(txt)
Result:
Fred threw the blue ball to Jane. Jane caught the blue ball.

After Replace() runs and has made a replacement successfully, R.index points just past the end of the replaced text. ReplaceNext() picks up from there and replaces the next match found.

Of course you can replace everything in one fell swoop using the g flag. This will make Replace() replace everything it can all at once.

var/regex/R = new("/\\bred\\b/blue/g")
var/txt = "Fred threw the red ball to Jane. Jane caught the red ball."
txt = R.Replace(txt) || txt
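
The "|| txt" idiom works because Replace() returns null when nothing matches. One thing to keep in mind is that DM also treats an empty string as false, so if a replacement could ever delete all of the text, the shortcut would hand back the original. A small wrapper can sidestep that; this is only a sketch, and regex_replace is a hypothetical helper name, not part of the library.

```dm
// Hypothetical helper: returns txt with the pattern applied,
// or txt unchanged if the pattern is bad or nothing matched
proc/regex_replace(pattern, txt)
  var/regex/R = new(pattern)
  if(R.error)
    return txt
  var/result = R.Replace(txt)
  // Only fall back when Replace() really returned null
  return isnull(result) ? txt : result

mob/verb/TestReplaceAll()
  usr << regex_replace("/\\bred\\b/blue/g", "The red ball hit the red wall.")
```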

Using groups

If your pattern includes groups using parentheses, you can read the text that was captured from that group by using GroupText(). This is the equivalent of $1, $2, etc. in a language like Perl or PHP.

var/regex/digits = new("/(\\d+)/")
var/txt = "3 pigs and 4 chickens walk into a bar and order 50 punchlines."
if(digits.Find(txt))
  do
    world << digits.GroupText(1)
  while(digits.FindNext(txt))
Result:
3
4
50
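
Patterns can contain more than one group, and each is read by its number. The sketch below uses made-up sample text and a made-up verb name, and assumes groups are numbered left to right starting at 1, as the $1 and $2 comparison above suggests.

```dm
mob/verb/TestGroups()
  // Group 1 captures the name, group 2 captures the number
  var/regex/pair = new("/(\\w+)=(\\d+)/")
  var/txt = "hp=40 mp=12 gold=300"
  if(pair.Find(txt))
    do
      usr << "[pair.GroupText(1)] is [pair.GroupText(2)]"
    while(pair.FindNext(txt))
```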

Named variables

You can insert your own variables by name into a search by using the namedvars associative list.

var/regex/R = new("/^$player logs (in|out) .*$/")
R.namedvars["player"] = player
if(R.Find(gamelog))
  do
    admin << copytext(gamelog,R.match,R.index)
  while(R.FindNext(gamelog))

Calling procs in replacement expressions

By using the e flag, you can allow your replacement expression to call a global proc, and include arguments. To do this you must put the proc name inside brackets [], and if any arguments are included, they go inside parentheses and are separated by commas. (Do not put any spaces between the proc name and the parentheses.) Remember, proc names are always case-sensitive.

/(\d+)\+(\d+)/[sum($1,$2)]/e

This expression will search for a number, a + sign, and another number, then add the two numbers using the sum() proc.

// a and b are strings
proc/sum(a, b)
  return text2num(a) + text2num(b)

mob/verb/TestAdd()
  // \[ keeps DM from reading the bracket as an embedded expression
  var/regex/add = new("/(\\d+)\\+(\\d+)/\[sum($1,$2)]/e")
  var/txt = "What's 11+23?"
  txt = add.Replace(txt) || txt
  usr << txt
Result:
What's 34?

sum() is a global proc, and the arguments it receives are text strings, in this case "11" and "23". You can do anything with these strings that you want. Here they were numbers to be added, but you might want such a proc to calculate other values or format text in particular ways.
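
As an illustration of the formatting case, here is a sketch that upper-cases a word wrapped in asterisks. The proc shout() and the verb are hypothetical names; uppertext() is the built-in DM proc.

```dm
// Hypothetical global proc: receives the captured text as a string
proc/shout(word)
  return uppertext(word)

mob/verb/TestShout()
  // Raw pattern: /\*(\w+)\*/[shout($1)]/e
  // In a DM string the backslashes are doubled and the bracket becomes \[
  var/regex/emph = new("/\\*(\\w+)\\*/\[shout($1)]/e")
  var/txt = "That was *really* close."
  usr << (emph.Replace(txt) || txt)
```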

Splitting text

The Split() proc will take text and split it up according to this pattern. For example if your pattern is /,\s*/, you can split up a list separated by commas with optional spaces.

var/regex/commasplit = new("/,\\s*/")
var/list/items = commasplit.Split("trees, rocks, water, fire, air")
for(var/thing in items)
  world << thing
Result:
trees
rocks
water
fire
air

There's a second argument to Split() which can make the split inclusive. If this is nonzero, then the pattern matches themselves (in this case ", ") will be included in the list.
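
A short sketch of an inclusive split, with each list entry wrapped in quotes so the separator entries are visible. The verb name is made up, and the exact order of entries in the returned list is not spelled out above, so no output is shown here.

```dm
mob/verb/TestSplit()
  var/regex/commasplit = new("/,\\s*/")
  // Nonzero second argument: the matched separators are kept in the list
  var/list/parts = commasplit.Split("trees, rocks, water", 1)
  for(var/part in parts)
    usr << "'[part]'"
```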

A Few Examples

Add commas to numbers

var/regex/commafy = new("/(\\d+)(\\d{3})/$1,$2/g")
var/txt = "9000 peaches and 12345678 pretzels"
var/newtxt = commafy.Replace(txt)
while(newtxt)
  txt = newtxt
  newtxt = commafy.Replace(txt)
Result:
9,000 peaches and 12,345,678 pretzels

Reformat paragraphs

var/regex/unpara = new("/(\[^\\n])\\n(\[^\\n])/$1$2/gs")
var/regex/repara = new("/^(.{,77}|\\S{78,})\\s+/$1\\n/g")
txt = unpara.Replace(txt) || txt
txt = repara.Replace(txt) || txt

Datum Reference

This section explains the layout of the /regex datum.

/regex public vars

error
The error, if any, that occurred when this pattern was first created
match
The index where the last match begins in a searched text string; 0 if none found
index
The index after the last match found (or replaced) in a text string; 0 if none
namedvars
An associative list of named variables which can be set before a search
In a pattern these appear as $name

/regex public procs

Find(text, start=1)
Find a match in text. Return src.match, which is nonzero if found.
FindNext(text)
Repeat the last search starting at src.index.
Replace(text, start=1)
Find a match in text and replace. Return modified text, or null if no match found.
ReplaceNext(text)
Repeat the last replacement starting at src.index.
GroupText(group_number)
Return the last match found for parentheses group #group_number, or null if none found.
Split(text, inclusive)
Return a list of sections of text separated by this pattern. If inclusive, include the actual pattern matches in the list.

Appendix A: Regular Expression Basics

If you've never used regular expressions before, this section will give you a brief overview. Using the handy quick reference above as well as the explanations in this section, you should have enough info to build your own patterns.

A regular expression pattern may take two forms:

/pattern/flags
/pattern/replacement/flags

A basic pattern may contain something as simple as a single word or phrase. Literal text is easy to find. If you want to find it without worrying about upper- or lowercase, use the i flag at the end of the pattern.

/a basic pattern/i

The / character is a delimiter, which marks where the pattern begins and ends. Almost any character may be used as a delimiter, including # or ! or letters. Any character you use this way, though, can't have a special meaning in the pattern, so if you use |, you can't use a|b to mean "find a or b". If you want to use that same character as literal text in the pattern, you can escape it with a backslash \. For example, you can use /\d+\/\d+\/\d{2,4}/ to read a date.

For special characters, backslashes are very important. Any special character you want to use like a normal one, including a backslash, needs a backslash in front of it. As in regular DM, some things like \n or \t have special meaning. But there are more special escape sequences, including \d which means digits 0-9. If you use a backslash in your pattern, you'll need to use two when writing it as a BYOND string. The example above would have to be written as "/\\d+\\/\\d+\\/\\d{2,4}/". For the rest of this appendix, it's assumed that you know that; the patterns you see will be the "raw" pattern like the one above, not in a BYOND string form.
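
To make the raw-versus-string distinction concrete, here is a sketch of the date pattern written as DM strings, once with / as the delimiter and once with #, which avoids escaping the slashes. The sample text and the verb name are made up.

```dm
mob/verb/TestDate()
  // Raw pattern: /\d+\/\d+\/\d{2,4}/
  var/regex/slashes = new("/\\d+\\/\\d+\\/\\d{2,4}/")
  // Same search with # as the delimiter, so / needs no escaping
  var/regex/hashes = new("#\\d+/\\d+/\\d{2,4}#")
  var/txt = "The party is on 12/25/2005."
  if(slashes.Find(txt))
    usr << copytext(txt, slashes.match, slashes.index)
  if(hashes.Find(txt))
    usr << copytext(txt, hashes.match, hashes.index)
```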

Some characters are easier to include using an ASCII code than the actual character itself. \xnn will produce a character from a hexadecimal code; e.g. \x41 is A. You can do a similar thing with octal, where \0101 is A, but the leading 0 is required.

Simple operators

Finding just text you can already do with BYOND's built-in procs, though, so let's move on to the meat of regular expressions and some of the things that patterns can include.

It's possible to search along word breaks using \b, or non-breaks using \B. Thus, /\bbrick\b/i will match "brick" but it will not match "bricklayer".

You can match just at the beginning of a line by using ^ and at the end of a line using $. \A and \Z mean the same things, except that if you use the s flag to treat your text like a single line, they'll only match the beginning and end of the entire text.

/^you can/i
/using \$\.$/

If you want to find one of several possible choices, then | which means "or" is the right tool for the job.

/banana|raspberry|grape/i

There are many other things you can find in a regular expression. A single period . will match any character. Brackets [] will find any character that matches what's between them; e.g. [AEIOUaeiou] will match a vowel, or [a-z] will match any lowercase letter. Or you can use the opposite of this by putting ^ at the beginning of your list of characters, so [^AEIOUaeiou] matches any character that is not a vowel. There are other kinds of character matches you can do easily without brackets:

\d   Any digit 0-9               \D   Anything but 0-9
\w   0-9, A-Z, a-z, or _         \W   Anything but \w
\s   Whitespace                  \S   Non-whitespace
\l   Any lowercase letter        \L   Any uppercase or non-letter
\u   Any uppercase letter        \U   Any lowercase or non-letter

Modifiers

Of course, it's not as much use to find any old character as to find it a certain number of times. That's where modifiers come in. Modifiers go right after part of the pattern and say how many times you want to find that part.

*       Match 0 or more times
+       Match 1 or more times
?       Match 0 or 1 time
{n}     Match exactly n times
{n,}    Match at least n times
{n,m}   Match n to m times

A modifier after a bit of text applies only to the last character. That is, clue+ will look for clue followed by any more e's that immediately follow it. To apply + to the entire block of text you need a group, which is done by putting parentheses around the text. (clue)+ will match clue repeated one or more times in a row.

One important thing to know about modifiers is that they're greedy. That is, they'll find as much as they possibly can. /ed.*ed/i will match all of "Ed jumped on the bed", because .* will keep searching for more as long as there's an ed to follow it. To make the modifier non-greedy, use a ? after it. Searching the same text with /ed.*?ed/i will result in "Ed jumped".
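
The two searches described above can be tried side by side. This sketch uses only Find(), match, and index; the verb name is made up.

```dm
mob/verb/TestGreedy()
  var/txt = "Ed jumped on the bed"
  var/regex/greedy = new("/ed.*ed/i")
  var/regex/lazy = new("/ed.*?ed/i")
  // Greedy: .* runs as far as it can while an "ed" still follows
  if(greedy.Find(txt))
    usr << copytext(txt, greedy.match, greedy.index)
  // Non-greedy: .*? stops at the first "ed" it can reach
  if(lazy.Find(txt))
    usr << copytext(txt, lazy.match, lazy.index)
```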

Groups

Using parentheses to group things together is very useful. For one thing, you can use | inside a group. Try out this expression:

/(gold|silver|copper) coins?/i

That will match "copper coins" or "Silver coin".

When you use a group to match some text, a backreference will match that same text later on in the pattern. A backreference looks like \n, where it matches the nth group. Say for example that you want to match "Gold coins made of gold" but not "brass knuckles made of wood". A backreference will do the trick.

/([a-z]+) [a-z'\-]+ made of \1/i

The group will capture the material like gold or brass, so it can be matched again in the backreference \1.
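
Written as a DM string, the same pattern needs \[ for each opening bracket and a doubled backslash for \- and \1. A sketch that tries it on the two sample phrases, with a made-up verb name:

```dm
mob/verb/TestBackref()
  // Raw pattern: /([a-z]+) [a-z'\-]+ made of \1/i
  var/regex/same = new("/(\[a-z]+) \[a-z'\\-]+ made of \\1/i")
  var/list/samples = list("Gold coins made of gold", "brass knuckles made of wood")
  for(var/sample in samples)
    if(same.Find(sample))
      usr << "[sample]: match"
    else
      usr << "[sample]: no match"
```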

In a replacement pattern, you'd use $n to replace part of a match with what was found in a group. Here's a simple replacement that will do that:

/([a-z]+) ([a-z'\-]+)/$2 of $1/

Here, "crystal jars" becomes "jars of crystal".
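
In DM string form the replacement looks like this. The sketch assumes nothing beyond Replace() as documented earlier; the verb name is made up.

```dm
mob/verb/TestSwap()
  // Raw pattern: /([a-z]+) ([a-z'\-]+)/$2 of $1/
  var/regex/swap = new("/(\[a-z]+) (\[a-z'\\-]+)/$2 of $1/")
  var/txt = "crystal jars"
  // Replace() returns null when nothing matches, so fall back to txt
  usr << (swap.Replace(txt) || txt)
```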

In both search and replacement patterns, you can use named variables by including them as $name. The value of this can be set at any time by setting regex.namedvars[name]=value.

Replacement

As you can see, replacing text isn't all that hard. Obviously replacements won't use most of the search operators like brackets, parentheses, or modifiers, so they're usually interpreted literally. There's one important thing to know about replacements, though: If you use the g flag at the end of your pattern, any replacement will be done on every single match. With this you could change all coconuts to raisins.

/coconut/raisin/ig

The search here is case-insensitive, but the replacement text is used exactly as written, so even an uppercase Coconut will be replaced with a lowercase raisin.

You may also call procs in replacement patterns by using the e flag as described above.

Version History

Version 1.1: May 2005

Version 1.0: May 2005