Version 1.1
by Lummox JR
Regex is a library for searching and replacing text using regular expressions. Most standard regular expression functionality is included. To learn more about what regular expressions are and how to use them, please see Appendix A.
| Regex quick reference | |
|---|---|
| /pattern/flags /pattern/replacement/flags | |
| a\|b | Find a or b |
| ^ | Beginning of line |
| $ | End of line |
| \A | Beginning of text (single-line mode) |
| \Z | End of text (single-line mode) |
| \b | Word break (see \w); \B is non-break |
| . | Any character |
| [abc0-9] | Any of these characters; use - for range of characters |
| [^abc0-9] | Any character except one of these |
| ( ) | Treat contents as a group. |
| \n | Find the same match found in nth group |
| $n | Replace with match found in nth group |
| $name | Find value of named var name |
| [proc(arg,...)] | Call proc and use result (replacement only) |
| Modifiers | |
| * | Match 0 or more times |
| + | Match 1 or more times |
| ? | Match 0 or 1 time OR make last modifier non-greedy |
| {n} | Match n times |
| {n,} | Match at least n times |
| {n,m} | Match n to m times |
| Special characters | |
| \ | Escape next character if it is not one of the following |
| \0nnn | Octal character nnn |
| \d | Digit 0-9; \D is non-digit |
| \l | Lowercase letter; \L is non-lower or non-letter |
| \n | Line break |
| \s | Whitespace character; \S is non-space |
| \t | Tab |
| \u | Uppercase letter; \U is non-upper or non-letter |
| \w | Word character 0-9, A-Z, a-z, or _; \W is non-word character |
| \xnn | Hexadecimal character nn |
| Flags | |
| i | Case-insensitive |
| s | Treat as single line; \A and \Z find start and end of entire text only; \s includes \n |
| g | Global; replace all matches |
| e | Allow expressions (proc calls) |
To use regular expressions, first you have to compile the pattern you will use to search or replace. To do this you create a new /regex datum.
var/regex/R = new("/time/")
This will create a new regular expression which searches for the word "time" in lowercase. With this expression you can use Find() to find "time" and FindNext() to keep searching.
If the expression fails to compile, the error var will tell you why. Always check error before searching with the expression.
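A minimal sketch of that check follows. It assumes error is null or empty when the pattern compiles cleanly; the text above only says that it holds the reason for a failure. find_time() is a hypothetical helper name, not part of the library.

```dm
// Requires the Regex library to be included in the project.
// Hypothetical helper: returns the position of the first match, or 0.
proc/find_time(txt)
	var/regex/R = new("/time/")
	// Assumption: error is null or empty when compilation succeeded.
	if(R.error)
		world.log << "Regex failed to compile: [R.error]"
		return 0
	if(R.Find(txt))
		return R.match	// start of the match, as used with copytext() in this document
	return 0
```

Logging the reason and bailing out early keeps a bad pattern from being searched with.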
The / character is a delimiter which says where the pattern begins and ends, which is important for later examples. You can use almost any character as a delimiter. If you want to use that same character in the pattern, you must escape it with a backslash; in a BYOND string, you'll need to use \\ for that.
var/txt = "It's time to start the timer."
if(R.Find(txt))
	do
		world << "[copytext(txt,1,R.match)]\
		<b>[copytext(txt,R.match,R.index)]</b>\
		[copytext(txt,R.index)]"
	while(R.FindNext(txt))
Result:
It's <b>time</b> to start the timer.
It's time to start the <b>time</b>r.
If you had chosen /\btime\b/ as a pattern instead, "time" would only be found as a whole word, so the second line would not appear.
var/regex/R = new("/\\btime\\b/")
Since this is written in DM, where a backslash has special meaning, every backslash in the pattern must itself be escaped with another backslash so BYOND can read it. Likewise, DM treats brackets [...] inside a string as an embedded expression, so you'll have to write \[ for a literal opening bracket.
Replacement is about as easy as searching. Create a find-and-replace pattern and use Replace() to perform the replacement. Replace() returns the changed text if it does anything, and null if the pattern was not found.
var/regex/R = new("/\\bred\\b/blue/")
var/txt = "Fred threw the red ball to Jane. Jane caught the red ball."
var/newtxt = R.Replace(txt)
while(newtxt)
	txt = newtxt
	newtxt = R.ReplaceNext(txt)
Result:
Fred threw the blue ball to Jane. Jane caught the blue ball.
After Replace() runs and has found a replacement successfully, R.index points to the end of the replaced text. In the same way, ReplaceNext() will replace the next match found.
Of course you can replace everything in one fell swoop using the g flag. This will make Replace() replace everything it can all at once.
var/regex/R = new("/\\bred\\b/blue/g")
var/txt = "Fred threw the red ball to Jane. Jane caught the red ball."
txt = R.Replace(txt) || txt
If your pattern includes groups using parentheses, you can read the text that was captured from that group by using GroupText(). This is the equivalent of $1, $2, etc. in a language like Perl or PHP.
var/regex/digits = new("/(\\d+)/")
var/txt = "3 pigs and 4 chickens walk into a bar and order 50 punchlines."
if(digits.Find(txt))
	do
		world << digits.GroupText(1)
	while(digits.FindNext(txt))
Result:
3
4
50
You can insert your own variables by name into a search by using the namedvars associative list.
var/regex/R = new("/^$player logs (in|out) .*$/")
R.namedvars["player"] = player
// Search the same text throughout: gamelog is what Find() started on
if(R.Find(gamelog))
	do
		admin << copytext(gamelog,R.match,R.index)
	while(R.FindNext(gamelog))
By using the e flag, you can allow your replacement expression to call a global proc, and include arguments. To do this you must put the proc name inside brackets [], and if any arguments are included, they go inside parentheses and are separated by commas. (Do not put any spaces between the proc name and the parentheses.) Remember, proc names are always case-sensitive.
/(\d+)\+(\d+)/[sum($1,$2)]/e
This expression will search for a number, a + sign, and another number, then add the two numbers using the sum() proc.
// a and b are strings
proc/sum(a, b)
	return text2num(a) + text2num(b)
mob/verb/TestAdd()
	// \[ keeps DM from treating [sum(...)] as an embedded expression
	var/regex/add = new("/(\\d+)\\+(\\d+)/\[sum($1,$2)]/e")
	var/txt = "What's 11+23?"
	txt = add.Replace(txt) || txt
Result:
What's 34?
sum() is a global proc, and the arguments it receives are text strings, in this case "11" and "23". You can do anything with these strings that you want. Here they were numbers to be added, but you might want such a proc to calculate other values or format text in particular ways.
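As a sketch of the text-formatting case, the example below uppercases a word wrapped in asterisks. shout() and shout_demo() are hypothetical names chosen for illustration; uppertext() is a built-in DM proc. If the library behaves as described above, the result should be "Watch out for the DRAGON!".

```dm
// Requires the Regex library to be included in the project.
// Hypothetical global proc; it receives the captured group as a text string.
proc/shout(word)
	return uppertext(word)

proc/shout_demo()
	// Raw pattern: /\*(\w+)\*/[shout($1)]/e
	// \[ keeps DM from treating the brackets as an embedded expression.
	var/regex/emph = new("/\\*(\\w+)\\*/\[shout($1)]/e")
	var/txt = "Watch out for the *dragon*!"
	// Replace() returns null when nothing matched, so fall back to txt.
	return emph.Replace(txt) || txt
```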
The Split() proc will take text and split it up wherever the pattern matches. For example, if your pattern is /,\s*/, you can split up a list separated by commas with optional spaces.
var/regex/commasplit = new("/,\\s*/")
var/list/items = commasplit.Split("trees, rocks, water, fire, air")
for(var/thing in items)
	world << thing
Result:
trees
rocks
water
fire
air
There's a second argument to Split() which can make the split inclusive. If this is nonzero, then the pattern matches themselves (in this case ", ") will be included in the list.
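A minimal sketch of an inclusive split, reusing the comma pattern from the previous example. split_inclusive_demo() is a hypothetical name. Going by the description above, the returned list should hold the items with the ", " separators kept as entries of their own; the exact ordering of those entries is an assumption here.

```dm
// Requires the Regex library to be included in the project.
proc/split_inclusive_demo()
	var/regex/commasplit = new("/,\\s*/")
	// A nonzero second argument asks Split() to keep the pattern matches.
	var/list/items = commasplit.Split("trees, rocks, water", 1)
	for(var/thing in items)
		// Quote each entry so the separators' spaces are visible.
		world << "\"[thing]\""
	return items
```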
// Each pass inserts one comma per number, so repeat until nothing changes
var/regex/commafy = new("/(\\d+)(\\d{3})/$1,$2/g")
var/txt = "9000 peaches and 12345678 pretzels"
var/newtxt = commafy.Replace(txt)
while(newtxt)
	txt = newtxt
	newtxt = commafy.Replace(txt)
Result:
9,000 peaches and 12,345,678 pretzels
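// unpara removes single line breaks, leaving blank lines between paragraphs alone
// repara then re-breaks each line at whitespace (\[ and \\S are escaped for the DM string)
var/regex/unpara = new("/(\[^\\n])\\n(\[^\\n])/$1$2/gs")
var/regex/repara = new("/^(.{,77}|\\S{78,})\\s+/$1\\n/g")
txt = unpara.Replace(txt) || txt
txt = repara.Replace(txt) || txt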
This section explains the layout of the /regex datum.
Appendix A: If you've never used regular expressions before, this appendix will give you a brief overview. Using the handy quick reference above as well as the explanations here, you should have enough info to build your own patterns.
A regular expression pattern may take two forms:
/pattern/flags
/pattern/replacement/flags
A basic pattern may contain something as simple as a single word or phrase. Literal text is easy to find. If you want to find it without worrying about upper- or lowercase, use the i flag at the end of the pattern.
/a basic pattern/i
The / character is a delimiter, which marks where the pattern ends. Almost any character may be used as a delimiter, including # or ! or letters. Any character you use this way, though, can't have a special meaning in the pattern, so if you use |, you can't use a|b to mean "find a or b". If you want to use that same character as literal text in the pattern, you should be able to escape it with a backslash \. For example, you can use /\d+\/\d+\/\d{2,4}/ to read a date.
For special characters, backslashes are very important. Any special character you want to use like a normal one, including a backslash, needs a backslash in front of it. As in regular DM, some things like \n or \t have special meaning. But there are more special escape sequences, including \d, which means digits 0-9. If you use a backslash in your pattern, you'll probably need to use two when writing it as a BYOND string. The example above would have to be written as "/\\d+\\/\\d+\\/\\d{2,4}/". For the rest of this appendix, it's assumed that you know that; the patterns you see will be the "raw" pattern like the one above, not in BYOND string form.
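The sketch below writes the date pattern as a BYOND string in two ways: once with / as the delimiter, and once with # so the slashes need no escape. date_patterns_demo() and the sample text are made up for illustration. Both searches should print 5/14/2005 if the library behaves as described.

```dm
// Requires the Regex library to be included in the project.
proc/date_patterns_demo()
	// Raw pattern /\d+\/\d+\/\d{2,4}/ with every backslash doubled for DM.
	var/regex/slashed = new("/\\d+\\/\\d+\\/\\d{2,4}/")
	// Same search with # as the delimiter, so / is ordinary text.
	var/regex/hashed = new("#\\d+/\\d+/\\d{2,4}#")
	var/txt = "Due 5/14/2005 at noon"
	if(slashed.Find(txt))
		world << copytext(txt, slashed.match, slashed.index)
	if(hashed.Find(txt))
		world << copytext(txt, hashed.match, hashed.index)
```

Picking a delimiter that does not occur in the pattern keeps the DM string shorter and easier to read.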
Some characters are easier to include using an ASCII code than the actual character itself. \xnn will produce a character from a hexadecimal code; for example, \x41 is A. You can do a similar thing with octal, where \0101 is A, but the leading 0 is needed.
Finding just text you can already do with BYOND's built-in procs, though, so let's move on to the meat of regular expressions and some of the things that patterns can include.
It's possible to search along word breaks using \b, or non-breaks using \B. Thus, /\bbrick\b/i will match "brick" but it will not match "bricklayer".
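As a small sketch of the word-break search in DM, the hypothetical helper has_brick() below reports whether the text contains brick as a whole word. Note the doubled backslashes needed in the BYOND string.

```dm
// Requires the Regex library to be included in the project.
// Hypothetical helper: returns 1 if "brick" appears as a whole word, else 0.
proc/has_brick(txt)
	// Raw pattern: /\bbrick\b/i
	var/regex/R = new("/\\bbrick\\b/i")
	return R.Find(txt) ? 1 : 0
```

With this, has_brick("A red Brick wall") should be true, while has_brick("the bricklayer") should not.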
You can match just at the beginning of a line by using ^ and at the end of a line using $. \A and \Z mean the same things, except that if you use the s flag to treat your text like a single line, they'll only match the beginning and end of the entire text.
/^you can/i
/using \$\.$/
If you want to find one of several possible choices, then | which means "or" is the right tool for the job.
/banana|raspberry|grape/i
There are many other things you can find in a regular expression. A single period . will match any character. Brackets [] will find any character that matches what's between them; e.g. [AEIOUaeiou] will match a vowel, or [a-z] will match any lowercase letter. Or you can use the opposite of this by putting ^ at the beginning of your list of characters, so [^AEIOUaeiou] matches any character that is not a vowel. There are other kinds of character matches you can do easily without brackets:
| Sequence | Matches |
|---|---|
| \d | Any digit 0-9 |
| \D | Anything but 0-9 |
| \w | 0-9, A-Z, a-z, or _ |
| \W | Anything but \w |
| \s | Whitespace |
| \S | Non-whitespace |
| \l | Any lowercase letter |
| \L | Any uppercase or non-letter |
| \u | Any uppercase letter |
| \U | Any lowercase or non-letter |
Of course, it's not as much use to find any old character as to find it a certain number of times. That's where modifiers come in. Modifiers go right after part of the pattern and say how many times you want to find that part.
| Modifier | Meaning |
|---|---|
| * | Match 0 or more times |
| + | Match 1 or more times |
| ? | Match 0 or 1 time |
| {n} | Match exactly n times |
| {n,} | Match at least n times |
| {n,m} | Match n to m times |
A modifier placed after a bit of text applies only to the last character. That is, clue+ will look for clue followed by any more e's that immediately follow it. To use + for the entire block of text you need a group, which is made by putting parentheses around the text. (clue)+ will match clue repeated one or more times in a row.
One important thing to know about modifiers is that they're greedy. That is, they'll find as much as they possibly can. /ed.*ed/i will match all of "Ed jumped on the bed", because .* will keep searching for more as long as there's an ed to follow it. To make the modifier non-greedy, use a ? after it. Searching the same text with /ed.*?ed/i will result in "Ed jumped".
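The difference can be seen by running both patterns over the same text. In this sketch, first_match() and greedy_demo() are hypothetical helpers; they rely only on Find(), match, and index as used earlier in this document. Per the explanation above, the first line printed should be the whole sentence and the second should be "Ed jumped".

```dm
// Requires the Regex library to be included in the project.
// Hypothetical helper: returns the text of the first match, or null.
proc/first_match(pattern, txt)
	var/regex/R = new(pattern)
	if(R.Find(txt))
		// match is the start of the match; index is just past its end.
		return copytext(txt, R.match, R.index)
	return null

proc/greedy_demo()
	var/txt = "Ed jumped on the bed"
	world << first_match("/ed.*ed/i", txt)	// greedy: runs to the last "ed"
	world << first_match("/ed.*?ed/i", txt)	// non-greedy: stops at the first "ed"
```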
Using parentheses to group things together is very useful. For one thing, you can use | inside a group. Try out this expression:
/(gold|silver|copper) coins?/i
That will match "copper coins" or "Silver coin".
When you use a group to match some text, a backreference will match that same text later on in the pattern. A backreference looks like \n, where it matches the nth group. Say for example that you want to match "Gold coins made of gold" but not "brass knuckles made of wood". A backreference will do the trick.
/([a-z]+) [a-z'\-]+ made of \1/i
The group will capture the material like gold or brass, so it can be matched again in the backreference \1.
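Written as a BYOND string, this pattern needs both kinds of escaping: \[ for each opening bracket and a doubled backslash for \- and \1. The helper name made_of_itself() below is hypothetical, and the sketch assumes the backreference is compared case-insensitively under the i flag, as the "Gold coins made of gold" example suggests.

```dm
// Requires the Regex library to be included in the project.
// Hypothetical helper: returns 1 if the material named first is repeated
// after "made of", else 0.
proc/made_of_itself(txt)
	// Raw pattern: /([a-z]+) [a-z'\-]+ made of \1/i
	var/regex/R = new("/(\[a-z]+) \[a-z'\\-]+ made of \\1/i")
	return R.Find(txt) ? 1 : 0
```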
In a replacement pattern, you'd use $n to replace part of a match with what was found in a group. Here's a simple replacement that will do that:
/([a-z]+) ([a-z'\-]+)/$2 of $1/
Here, "crystal jars" becomes "jars of crystal".
In both search and replacement patterns, you can use named variables by including them as $name. The value of this can be set at any time by setting regex.namedvars[name]=value.
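A short sketch of a named variable in a replacement: the placeholder word PLAYER is swapped for whatever is stored under namedvars\["who"]. The names greet(), who, and the sample text are made up for illustration.

```dm
// Requires the Regex library to be included in the project.
// Hypothetical helper: fills a player's name into a message template.
proc/greet(who)
	// $who in the replacement is looked up in namedvars.
	var/regex/R = new("/PLAYER/$who/g")
	R.namedvars["who"] = who
	var/txt = "Welcome, PLAYER! Good luck, PLAYER."
	// Replace() returns null when nothing matched, so fall back to txt.
	return R.Replace(txt) || txt
```

Because the value can be set at any time, the same compiled regex can be reused for a different name by assigning namedvars again before the next Replace().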
As you can see, replacing text isn't all that hard. Obviously replacements won't use most of the search operators like brackets, parentheses, or modifiers, so they're usually interpreted literally. There's one important thing to know about replacements, though: If you use the g flag at the end of your pattern, any replacement will be done on every single match. With this you could change all coconuts to raisins.
/coconut/raisin/ig
The search is case-insensitive, but the replacement text is used as written: if a capitalized Coconut is found, it will still be replaced with a lowercase raisin.
You may also call procs in replacement patterns by using the e flag as described above.
Version 1.1: May 2005
Version 1.0: May 2005