Introducing Tcl 8.7 Part 1: regsub enhancements
The long-awaited alpha release of Tcl 8.7 calls for a series of posts summarizing the enhancements in this release. The first in this series is about the new -command
option to the regsub command. I have wished for this feature many times and, thanks to DKF, it is now available in Tcl.
To take Tcl 8.7 for a spin, you can download a pre-alpha binary for your platform. Alternatively, you can build it yourself from the
core-8-branch
branch in the Tcl fossil repository.
The regsub command before Tcl 8.7 allowed for transforming strings by substitution, where the replacement could be composed of literal strings and matched patterns from the string. What Tcl 8.7 adds is the ability to substitute an arbitrary computed value, which may depend on the matched patterns.
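As a reminder of the template-style substitution described above, here is a minimal sketch that runs on any Tcl version. The sample string and patterns are made up for illustration; & stands for the whole match and \1, \2 for the parenthesised subexpressions.

```tcl
# Template-style substitution, available before Tcl 8.7 as well.
# The input string is a made-up example.
set s "width=10 height=20"

# \1 and \2 refer to the first and second subexpressions.
set swapped [regsub -all {(\w+)=(\d+)} $s {\2:\1}]
puts $swapped

# & refers to the entire match.
set bracketed [regsub -all {\d+} $s {[&]}]
puts $bracketed
```

A template like this can only rearrange or decorate what was matched. It cannot, say, convert the matched number to hexadecimal, since no code runs on the match.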
A first example
Consider the problem of URL encoding a given string. A simple but still conforming method is to replace every non-alphanumeric character with its hexadecimal value preceded by a %
character. The following does the job by invoking the enc
procedure on every character matching the regular expression and replacing the character with the returned value.
% proc enc {ch} {format %%%02X [scan $ch %c]}
% regsub -command -all {[^0-9A-Za-z]} some-random+string enc
some%2Drandom%2Bstring
Or if you have an aversion to one-use procedures and are willing to tolerate readability issues with apply,
% regsub -command -all {[^0-9A-Za-z]} some-random+string {
apply {ch {format %%%02X [scan $ch %c]}}
}
some%2Drandom%2Bstring
Or if you keep the Tcllib lambda
package handy,
% package require lambda
% regsub -command -all {[^0-9A-Za-z]} some-random+string [
lambda ch {format %%%02X [scan $ch %c]}
]
some%2Drandom%2Bstring
But I digress. The original point was to illustrate the utility of this new feature. So for comparison, you might want to look at the implementation of the ncgi::encode
command from Tcllib's ncgi
module that provides the functionality for Tcl 8.6 and earlier.
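For a rough sense of what doing this without -command can look like, here is a minimal sketch that walks the string one character at a time. It is not the ncgi::encode implementation, and encode_by_loop is a hypothetical name chosen for this example. It assumes ASCII input, as the examples above do.

```tcl
# Hypothetical helper (not from Tcllib): percent-encode every
# non-alphanumeric character without using regsub -command.
proc encode_by_loop {str} {
    set out ""
    foreach ch [split $str ""] {
        if {[regexp {^[0-9A-Za-z]$} $ch]} {
            # Alphanumerics pass through unchanged.
            append out $ch
        } else {
            # Everything else becomes % followed by two hex digits.
            append out [format %%%02X [scan $ch %c]]
        }
    }
    return $out
}

puts [encode_by_loop some-random+string]
```

The matching, the iteration, and the replacement are three separate pieces here, whereas the regsub -command version expresses them in a single call.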
Syntax
The syntax of regsub
takes the form
regsub ?SWITCHES? REGEX STRING SUBSTITUTIONSPEC ?VARNAME?
Note from the earlier example that -command
is a binary switch and not an option that takes the command as an argument. Rather, it causes the SUBSTITUTIONSPEC
argument to regsub
to be treated as a command and not the actual substitution. When the command is invoked, it is passed one or more arguments. The first is the value of the matched expression. Subsequent arguments, if present, correspond to matched subexpressions specified in REGEX.
For illustrative purposes,
% proc print {whole first second} {puts $whole,$first,$second}
% regsub -all -command {(.)(.)} abcd print
ab,a,b
cd,c,d
The convenience of the command arises from its succinct combination of iteration (with the -all
option), selective matching of the iteration operand, and execution of code of any complexity.
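To round out the syntax description, here is a small sketch showing two details: SUBSTITUTIONSPEC may be a command prefix with leading arguments of its own, to which regsub appends the matched text, and the optional VARNAME receives the result while regsub returns the substitution count. The wrap procedure and the sample string are hypothetical, made up for this illustration.

```tcl
# Requires Tcl 8.7 or later for -command.
# wrap is a hypothetical helper: its first argument comes from the
# command prefix, the second is the matched text appended by regsub.
proc wrap {tag whole} {
    return "<$tag>$whole</$tag>"
}

# With VARNAME given, regsub stores the transformed string there and
# returns the number of substitutions made.
set count [regsub -all -command {\d+} "a 12 b 345" {wrap b} result]
puts "$count substitutions: $result"
```

Binding leading arguments in the prefix this way is the same mechanism that makes the apply and lambda forms work.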
More examples
Ensure first character after punctuation is upper case:
% set text "First sentence. second sentence? third sentence."
% regsub -command -all {([.!?])\s+(.)} $text {
apply {{- punc ch} {return "$punc [string toupper $ch]"}}
}
First sentence. Second sentence? Third sentence.
Converting Markdown headings to HTML:
% set md_line "# First level heading\n## A second level heading"
% regsub -all -line -command {^(#+)\s+(.*)$} $md_line [
lambda {- level text} {
set h h[string length $level]; return "<$h>$text</$h>"
}
]
<h1>First level heading</h1>
<h2>A second level heading</h2>
Sometimes you may not care about transforming the string at all and use regsub purely for iteration. To find the length of the longest word in a sentence (assumed here to be held in the variable sentence):
set maxlen 0
regsub -command -all {\w+} $sentence [
lambda word {
set len [string length $word]
if {$len > $::maxlen} {set ::maxlen $len}
}
]
set maxlen
One last point about using regsub. Consider changing all numerals in a sentence to their Devanagari versions. Whip out your regsub hammer!
% regsub -all -command {\d} "42 is the answer." [
lambda {ch} {format %c [expr {[scan $ch %c]+0x966-0x30}]}
]
४२ is the answer.
Realize, if you haven't already, that string map
could do this just as well using a character map and likely be faster and clearer. Where regsub
should be used is when the map is not static, or is too big, or depends on the context as in the earlier examples.
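For comparison, a minimal sketch of the string map approach to the same digit conversion might look like this. The variable names are made up for the example. The map is built once and works on Tcl 8.6 as well.

```tcl
# Build a static map from ASCII digits to Devanagari digits
# (U+0966 is DEVANAGARI DIGIT ZERO).
set digit_map {}
for {set i 0} {$i < 10} {incr i} {
    lappend digit_map $i [format %c [expr {0x966 + $i}]]
}

# No regular expression and no per-match command invocation needed.
set converted [string map $digit_map "42 is the answer."]
puts $converted
```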
See the command reference for additional examples.