An implementation of the Ruby regex flavor.
This implementation supports translating all Ruby regular expressions to ECMAScript regular
expressions with the exception of the following features:
- case-insensitive matching: Ruby regular expressions allow both case-sensitive matching and
case-insensitive matching within the same regular expression. Also, Ruby's notion of
case-insensitivity differs from the one in ECMAScript. For that reason, we would have to
translate all Ruby regular expressions to case-sensitive ECMAScript regular expressions and we
would support case-insensitivity by case-folding any character matchers in the Ruby regular
expression. However, Ruby has a more sophisticated notion of case-insensitivity than ECMAScript,
which can lead to, e.g., two characters such as "ss" matching a single character such as
"ß", meaning there is no longer a 1-to-1 correspondence between character matchers. In order
to support this, we would have to replicate the same case-folding behavior in the Ruby flavor
implementation.
- case-insensitive backreferences: As stated above, case-insensitive matching has to be
implemented by case-folding. However, there is no way we can case-fold a backreference, since we
don't know which string it will match.
- \G escape sequence: In Ruby regular expressions, \G can be used to assert that we are at some
special position that was marked by a previous execution of the regular expression on that input.
ECMAScript doesn't support assertions which check the current index against some reference
value.
- \K keep command: This command can be used in Ruby regular expressions to modify the matcher's
state so that it deletes any characters matched so far and considers the current position as the
start of the reported match. There is no operator like this in ECMAScript that would allow one to
tinker with the matcher's state.
- named capture groups with the same name: Ruby admits regular expressions with named capture
groups that share the same name. These situations can't be handled by replacing those capture
groups with regular numbered capture groups and then mapping the capture group names to lists of
capture group indices as we wouldn't know which of the homonymous capture groups was matched last
and therefore which value should be used.
- Unicode character properties not supported by ECMAScript and not covered by the POSIX
character classes: Ruby regular expressions use the syntax \p{...} for Unicode character
properties. Similar to ECMAScript, they offer access to Unicode Scripts, General Categories and
some character properties. Ruby also allows access to character properties that refer to POSIX
character classes (e.g. \p{Alnum} for [[:alnum:]]). We support all of the above, including any
character properties specified by Ruby's documentation. However, Ruby regular expressions still
have access to extra Unicode character properties (e.g. Age) that we do not support. We could
dive through Ruby's implementation to find out which other properties might be used and try
providing them too.
- \g<...> subexpression calls and \k<...+-x> backreferences to other levels: Ruby
allows recursive calls into subexpressions of the regular expression. There is nothing like this
in ECMAScript or in the TRegex engine. Furthermore, Ruby allows backreferences to access captured
groups on different levels (of the call stack), so as we don't support subexpression calls, we
also don't support those backreferences.
- (?>....) atomic groups: This construct allows control over the matcher's backtracking by
making committed choices which can't be undone. This is not something we can support using
ECMAScript regexes.
- \X extended grapheme cluster escapes: This is just syntactic sugar for a certain expression
which uses atomic groups, and it is therefore not supported.
- \R line break escapes: These are also translated by Joni to atomic groups, which we do not
support.
- possessive quantifiers, e.g. a*+: Possessive quantifiers are quantifiers which consume
greedily and also do not allow backtracking, so they are another example of the atomic groups
that we do not support (a*+ is equivalent to (?>a*)).
- (?~...) absent expressions: These constructs can be used in Ruby regular expressions to match
strings that do not contain a match for a given expression. ECMAScript doesn't offer a similar
operation.
- quantifiers on lookaround assertions: We translate the Ruby regular expressions to
Unicode-mode ECMAScript regular expressions. Among other reasons, this lets us assume that a
single character matcher will match a single Unicode code point, not just a UTF-16 code unit, as
would be the case in non-Unicode ECMAScript regular expressions. Unicode-mode ECMAScript regular
expressions do not allow quantifiers on lookaround assertions, as they rarely make any sense. One
would hope to implement this by dropping any lookaround assertions that have a quantifier on them
that makes them optional. However, this is not correct as the lookaround assertion might contain
capture groups and thus have visible side effects.
- conditional backreferences (?(group)then|else): There is no counterpart to this in ECMAScript
regular expressions.
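Several of the ECMAScript-side constraints mentioned in the list above can be observed directly in a JavaScript engine. The sketch below is illustrative only: the helper compiles and all variable names are made up for this example and are not part of TRegex.

```javascript
// Illustrative only: `compiles` and the variable names are hypothetical.

// Returns true if the engine accepts the pattern, false on a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// One code point outside the BMP, i.e. two UTF-16 code units.
const astral = "\u{1F600}";

// In Unicode mode "." consumes the whole code point; in legacy mode only one code unit.
const dotMatchesCodePoint = /^.$/u.test(astral);
const dotMatchesInLegacyMode = /^.$/.test(astral);

// Unicode mode rejects a quantifier on a lookahead assertion.
const quantifiedLookaheadInUnicodeMode = compiles("(?=a)?b", "u");

// ECMAScript case-insensitivity compares single characters, so "ß" never matches "ss".
const sharpSMatchesDoubleS = /^ß$/iu.test("ss");
```

Here dotMatchesCodePoint is true, while the other three values are false.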
Beyond these unsupported features, there are subtle differences in how some fundamental constructs behave in ECMAScript
regular expressions and Ruby regular expressions. This concerns core concepts like loops and
capture groups and their interactions. These issues cannot be handled by transpiling alone and
they require extra care on the side of TRegex. The issues and the solutions are listed below.
- backreferences to unmatched capture groups should fail: In ECMAScript, when a backreference
is made to a capture group which hasn't been matched, such a backreference is ignored and
matching proceeds. If this happens in Ruby, the backreference will fail to match and the search
will stop and backtrack.
Node.js (ECMAScript):
> /(?:(a)|(b))\1/.exec("b")
[ 'b', undefined, 'b', index: 0, input: 'b', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(?:(a)|(b))\1/.match("b")
=> nil
This is solved in TRegexBacktrackingNFAExecutorNode, by the introduction of the
backrefWithNullTargetSucceeds field, which controls how backreferences to unmatched
capture groups are resolved. Also, in RegexParser, an optimization that drops forward
references and nested references from ECMAScript regular expressions is turned off for Ruby
regular expressions.
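The two ways of resolving such a backreference can be sketched as a single function with a switch. This is a simplified illustration, not the TRegex code: resolveBackref and its parameters are hypothetical names, and the boolean parameter only mimics the role that the text attributes to backrefWithNullTargetSucceeds.

```javascript
// Hypothetical helper, not part of TRegex.
// `captured` is the text captured by the referenced group, or undefined if the
// group has not matched. Returns the index after the backreference, or -1 on failure.
function resolveBackref(input, index, captured, nullTargetSucceeds) {
  if (captured === undefined) {
    // ECMAScript-style: match the empty string. Ruby-style: fail and backtrack.
    return nullTargetSucceeds ? index : -1;
  }
  return input.startsWith(captured, index) ? index + captured.length : -1;
}

// The ECMAScript behavior from the example above, observed in the host engine:
// group 1 is unmatched, yet \1 succeeds and the overall match is "b".
const backrefResult = /(?:(a)|(b))\1/.exec("b");
```

With nullTargetSucceeds set to false, the same call fails at the backreference, which corresponds to the nil result reported by MRI.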
- re-entering a loop should not reset enclosed capture groups: In ECMAScript, when a group is
re-entered while looping, all of the capture groups contained within the looping group are reset.
On the other hand, in Ruby, their contents are preserved from one iteration of the loop to the
next. As we see in the example below, ECMAScript drops the contents of the (a) capture
group, while Ruby keeps it.
Node.js (ECMAScript):
> /((a)|(b))+/.exec("ab")
[ 'ab', 'b', undefined, 'b', index: 0, input: 'ab', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /((a)|(b))+/.match("ab")
=> #<MatchData "ab" 1:"b" 2:"a" 3:"b">
This is solved in NFATraversalRegexASTVisitor. The method getGroupBoundaries is
modified so that the instructions for clearing enclosed capture groups are omitted from generated
NFA transitions when processing Ruby regular expressions.
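The effect of omitting the clearing step can be mimicked with a small hand-written simulation of the pattern ((a)|(b))+ anchored at the start of the input. This is only a sketch under that assumption: simulateLoop and its flag are hypothetical and bear no relation to the actual getGroupBoundaries code.

```javascript
// Hypothetical simulation of /((a)|(b))+/ matched at index 0; not TRegex code.
// groups[0] is the whole match, groups[1] the outer group, groups[2] is (a), groups[3] is (b).
function simulateLoop(input, resetEnclosedGroups) {
  const groups = [undefined, undefined, undefined, undefined];
  let i = 0;
  while (i < input.length && (input[i] === "a" || input[i] === "b")) {
    if (resetEnclosedGroups) {
      // ECMAScript-style: clear every group enclosed by the loop on re-entry.
      groups[1] = groups[2] = groups[3] = undefined;
    }
    const ch = input[i];
    groups[ch === "a" ? 2 : 3] = ch;
    groups[1] = ch;
    i++;
  }
  if (i === 0) return null; // the loop needs at least one iteration
  groups[0] = input.slice(0, i);
  return groups;
}

const withReset = simulateLoop("ab", true);     // group 2 ends up undefined
const withoutReset = simulateLoop("ab", false); // group 2 keeps "a"
```

For the input "ab", withReset agrees with the Node.js result shown above and withoutReset agrees with the MatchData printed by MRI.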
- loops should be repeated as long as the state of capture groups evolves: In ECMAScript, when
a loop matches the minimum required number of iterations, any further iterations are only matched
provided they consume some characters from the input. This is a measure intended to stop infinite
loops once they no longer consume any input. Ruby has a similar guard, but it admits extra
iterations if they either consume characters or change the state of capture groups. Thus it is
possible to have extra iterations that don't consume any characters but that store empty strings
as matches of capture groups. In the example below, ECMAScript executes the outer ? loop
zero times, since executing it once would consume no characters. As a result, the contents of
capture group 1 are null. On the other hand, Ruby executes the loop once, because the execution
modifies the contents of capture group 1 so that it contains the empty string.
Node.js (ECMAScript):
> /(a*)?/.exec("")
[ '', undefined, index: 0, input: '', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(a*)?/.match("")
=> #<MatchData "" 1:"">
This is solved by permitting one extra empty iteration of a loop when traversing the AST and
generating the NFA. In the absence of backreferences, an extra empty iteration is sufficient,
because any other iteration on top of that will retread the same path and have no further
effects. With backreferences (or more specifically, forward references), it is possible to create
situations where several empty iterations are required, sometimes even in the middle of a loop,
as in the example below.
irb(main):001:0> / (a|\2b|\3()|())* /x.match("aaabbb")
=> #<MatchData "aaabbb" 1:"" 2:"" 3:"">
In NFATraversalRegexASTVisitor, we let NFA transitions pass through one empty iteration
of a loop (extraEmptyLoopIterations in NFATraversalRegexASTVisitor#doAdvance).
This generates an extra empty iteration at the end of loops and it also gives correct behavior on
constructions such as the one given above, as it lets us generate transitions that use an extra
empty iteration through the loop to populate some new capture group and then arrive at a new
backreference node. Since a single NFA transition can now correspond to more complex paths
through the AST, we also need to change the way we check the guards that the transitions are
annotated with by interleaving the state changes and assertions (see the use of
TRegexBacktrackingNFAExecutorNode#transitionMatchesStepByStep). We also need to implement
the empty check, by verifying the state of the capture groups on top of verifying the current
index (see TRegexBacktrackingNFAExecutorNode#monitorCaptureGroupsInEmptyCheck). For that,
we need fine-grained information about capture group updates and so we include this information
in the transition guards by QuantifierGuard.createUpdateCG(int).
In unrolled loops, we disable empty checks altogether (in RegexParser, in the calls to
RegexParser#createOptional). This is correct since Ruby's empty checks terminate a loop
only when it reaches a fixed point w.r.t. any observable state. Finally, also in
RegexParser, we disable an optimization that drops zero-width groups and lookaround
assertions with optional quantifiers.
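The difference between the two empty checks can be sketched as a predicate over the matcher state before and after one loop iteration. The state layout (an index plus a flat array of group boundaries, with -1 meaning unset) and the name iterationMadeProgress are assumptions made for this example; they are not the representation used by TRegexBacktrackingNFAExecutorNode.

```javascript
// Hypothetical sketch of the empty check; not TRegex code.
// A state is { index, groups } where groups is a flat array of start/end
// positions and -1 marks an unset boundary. Both arrays have the same length.
function iterationMadeProgress(before, after, monitorCaptureGroups) {
  if (after.index !== before.index) {
    return true; // the iteration consumed input
  }
  if (!monitorCaptureGroups) {
    return false; // ECMAScript-style check: only the index counts
  }
  // Ruby-style check: a change in any capture group boundary also counts.
  return after.groups.some((boundary, i) => boundary !== before.groups[i]);
}

// One empty iteration of (a*)? on the empty input: group 1 goes from unset to [0, 0].
const stateBefore = { index: 0, groups: [-1, -1] };
const stateAfter = { index: 0, groups: [0, 0] };

const progressIndexOnly = iterationMadeProgress(stateBefore, stateAfter, false);
const progressWithGroups = iterationMadeProgress(stateBefore, stateAfter, true);
```

In this sketch progressIndexOnly is false, so the iteration is rejected as in ECMAScript, whereas progressWithGroups is true, so the iteration is kept and group 1 holds the empty string as in Ruby.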
- failing the empty check should lead to matching the sequel of the quantified expression
instead of backtracking: In ECMAScript, when a loop fails the empty check (an iteration matches
only the empty string), the engine terminates the loop by rejecting this branch and backtracking
to another alternative (eventually backtracking to the point where it chooses not to re-enter the
loop and consider it finished). On the other hand, in Ruby, when a loop fails the empty check (an
iteration matches only the empty string and it does not modify the state of the capture groups),
the engine continues with the current branch by proceeding to the continuation of the loop. Most
notably, it doesn't try to backtrack and alter decisions made inside the loop until some future
failure forces it to. This can be illustrated on the following example, where ECMAScript will
backtrack into the loop and choose the second alternative, whereas Ruby will proceed with the
empty match.
Node.js (ECMAScript):
> /(?:|a)?/.exec('a')
[ 'a', index: 0, input: 'a', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(?:|a)?/.match('a')
=> #<MatchData "">
We implement this in NFATraversalRegexASTVisitor by introducing two transitions whenever
we leave a loop, one leading to the start of the loop (empty check passes) and one escaping past
the loop (empty check fails). The two transitions are then annotated with complementary guards
(QuantifierGuard.createExitZeroWidth(Token.Quantifier) and QuantifierGuard.createEscapeZeroWidth(Token.Quantifier),
respectively), so that at runtime, only one of the two transitions will be admissible.
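The two policies for a failed empty check can be contrasted with a hand-written matcher for the pattern (?:|a)? anchored at index 0. This is a simplified sketch: matchOptionalGroup and its policy strings are hypothetical, and the function models only this one pattern rather than the guard mechanism in NFATraversalRegexASTVisitor.

```javascript
// Hypothetical simulation of /(?:|a)?/ matched at index 0; not TRegex code.
// `onEmptyCheckFailure` is "backtrack" (ECMAScript-style) or "escape" (Ruby-style).
function matchOptionalGroup(input, onEmptyCheckFailure) {
  // The alternatives of the group are tried in order: the empty one first, then "a".
  for (const alternative of ["", "a"]) {
    if (!input.startsWith(alternative, 0)) {
      continue; // this alternative does not match, try the next one
    }
    if (alternative.length === 0) {
      // The iteration matched only the empty string, so the empty check fails.
      if (onEmptyCheckFailure === "escape") {
        return ""; // proceed past the loop, keeping the empty match
      }
      continue; // reject this branch and backtrack into the group
    }
    return alternative;
  }
  return ""; // no alternative survived: skip the optional group
}

const backtrackingResult = matchOptionalGroup("a", "backtrack");
const escapingResult = matchOptionalGroup("a", "escape");
```

For the input "a", backtrackingResult is "a", as in the Node.js transcript above, and escapingResult is the empty string, as in the MRI transcript.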