public final class RubyFlavor extends RegexFlavor
This implementation supports all Ruby regular expressions with the exception of the following features:
However, there are subtle differences in how some fundamental constructs behave in ECMAScript regular expressions and Ruby regular expressions. This concerns core concepts like loops and capture groups and their interactions. These issues cannot be handled by transpiling alone and they require extra care on the side of TRegex. The issues and the solutions are listed below.
Node.js (ECMAScript):
> /(?:(a)|(b))\1/.exec("b")
[ 'b', undefined, 'b', index: 0, input: 'b', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(?:(a)|(b))\1/.match("b")
=> nil
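The difference shown in the two transcripts above can be modelled in a few lines of Java. This is only an illustrative sketch, not TRegex code: the class `BackrefSketch`, the method `matchBackref`, and the convention of using -1 for an unmatched group are assumptions made for this example. Only the flag name echoes the `backrefWithNullTargetSucceeds` field.

```java
/** Hypothetical model of backreference resolution; not the TRegex implementation. */
final class BackrefSketch {

    /**
     * Tries to match a backreference at {@code index}.
     * {@code groupStart} and {@code groupEnd} are -1 when the referenced group is unmatched.
     * Returns the index after the backreference, or -1 if the match fails.
     */
    static int matchBackref(String input, int index, int groupStart, int groupEnd,
                            boolean backrefWithNullTargetSucceeds) {
        if (groupStart < 0 || groupEnd < 0) {
            // Unmatched target: the flag decides between "matches the empty string"
            // (ECMAScript) and "fails" (Ruby).
            return backrefWithNullTargetSucceeds ? index : -1;
        }
        int length = groupEnd - groupStart;
        // Compare the captured text with the input at the current position.
        return input.regionMatches(index, input, groupStart, length) ? index + length : -1;
    }

    public static void main(String[] args) {
        // /(?:(a)|(b))\1/ on "b": group 1 is still unmatched when \1 is reached at index 1.
        System.out.println(matchBackref("b", 1, -1, -1, true));  // ECMAScript-like: 1
        System.out.println(matchBackref("b", 1, -1, -1, false)); // Ruby-like: -1
    }
}
```

With the flag set, the backreference matches the empty string and the overall match succeeds. With the flag cleared, the branch fails, which corresponds to the `nil` result in MRI.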
This is solved in TRegexBacktrackingNFAExecutorNode by introducing the
backrefWithNullTargetSucceeds field, which controls how backreferences to unmatched
capture groups are resolved. In addition, an optimization in JSRegexParser that drops forward
references and nested references from ECMAScript regular expressions is turned off for Ruby
regular expressions.
In the example below, ECMAScript resets the contents of the (a) capture
group when the loop is re-entered, while Ruby keeps it.
Node.js (ECMAScript):
> /((a)|(b))+/.exec("ab")
[ 'ab', 'b', undefined, 'b', index: 0, input: 'ab', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /((a)|(b))+/.match("ab")
=> #<MatchData "ab" 1:"b" 2:"a" 3:"b">
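The two results above differ only in whether the enclosed groups are cleared when the loop body is entered again. The following sketch hard-codes the pattern `((a)|(b))+` to make that visible. Everything in it (the class `LoopReentrySketch`, the method `run`, and the flag `clearEnclosedGroupsOnReentry`) is hypothetical and does not mirror the actual NFA transitions generated by TRegex.

```java
import java.util.Arrays;

/** Hypothetical simulation of /((a)|(b))+/; not the TRegex implementation. */
final class LoopReentrySketch {

    /** Returns the text of groups 1 to 3 after the loop finishes (null means unmatched). */
    static String[] run(String input, boolean clearEnclosedGroupsOnReentry) {
        String[] groups = new String[4]; // slot 0 (the whole match) is not tracked here
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != 'a' && c != 'b') {
                break; // the loop body no longer matches
            }
            if (clearEnclosedGroupsOnReentry) {
                // ECMAScript-like: each iteration starts with groups 2 and 3 reset.
                groups[2] = null;
                groups[3] = null;
            }
            groups[1] = String.valueOf(c);
            groups[c == 'a' ? 2 : 3] = String.valueOf(c);
        }
        return Arrays.copyOfRange(groups, 1, 4);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run("ab", true)));  // [b, null, b]
        System.out.println(Arrays.toString(run("ab", false))); // [b, a, b]
    }
}
```

Omitting the clearing step is the only change needed to go from the ECMAScript result to the Ruby result in this model.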
This is solved in NFATraversalRegexASTVisitor. The method getGroupBoundaries is
modified so that the instructions for clearing enclosed capture groups are omitted from generated
NFA transitions when processing Ruby regular expressions.
In the example below, ECMAScript executes the ? loop
zero times, since executing it once would consume no characters. As a result, the contents of
capture group 1 are null. Ruby, on the other hand, executes the loop once, because the execution
modifies the contents of capture group 1 so that it contains the empty string.
Node.js (ECMAScript):
> /(a*)?/.exec("")
[ '', undefined, index: 0, input: '', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(a*)?/.match("")
=> #<MatchData "" 1:"">
This is solved by permitting one extra empty iteration of a loop when traversing the AST and
generating the NFA. In the absence of backreferences, an extra empty iteration is sufficient,
because any other iteration on top of that will retread the same path and have no further
effects. With backreferences (or more specifically, forward references), it is possible to create
situations where several empty iterations are required, sometimes even in the middle of a loop,
as in the example below.
irb(main):001:0> / (a|\2b|\3()|())* /x.match("aaabbb")
=> #<MatchData "aaabbb" 1:"" 2:"" 3:"">
In NFATraversalRegexASTVisitor, we let NFA transitions pass through one empty iteration
of a loop (extraEmptyLoopIterations in NFATraversalRegexASTVisitor#doAdvance).
This generates an extra empty iteration at the end of loops, and it also gives correct behavior on
constructions such as the one given above: it lets us generate transitions that use an extra
empty iteration through the loop to populate some new capture group and then arrive at a new
backreference node. Since a single NFA transition can now correspond to more complex paths
through the AST, we also need to change the way we check the guards that the transitions are
annotated with by interleaving the state changes and assertions (see the use of
TRegexBacktrackingNFAExecutorNode#transitionMatchesStepByStep). We also need to implement
the empty check by verifying the state of the capture groups in addition to the current
index (see TRegexBacktrackingNFAExecutorNode#monitorCaptureGroupsInEmptyCheck). For that,
we need fine-grained information about capture group updates, so we include this information
in the transition guards via QuantifierGuard.createUpdateCG(int).
In unrolled loops, we disable empty checks altogether (in JSRegexParser, in the calls to
RegexParser#createOptional). This is correct since Ruby's empty checks terminate a loop
only when it reaches a fixed point with respect to any observable state. Finally, also in
JSRegexParser, we disable an optimization that drops zero-width groups and
lookaround assertions with optional quantifiers.
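The idea of an empty check that looks at capture groups as well as the current index can be sketched as a single predicate. This is a simplified model under stated assumptions: the class `EmptyCheckSketch`, the method `isEmptyIteration`, the parameter `monitorCaptureGroups`, and the representation of capture groups as an `int[]` of boundaries (with -1 for unset) are all invented for this illustration.

```java
import java.util.Arrays;

/** Hypothetical model of the empty check at the end of a loop iteration; not TRegex code. */
final class EmptyCheckSketch {

    /** Returns true when the iteration changed no observable state and must be rejected. */
    static boolean isEmptyIteration(int indexBefore, int indexAfter,
                                    int[] groupsBefore, int[] groupsAfter,
                                    boolean monitorCaptureGroups) {
        if (indexBefore != indexAfter) {
            return false; // characters were consumed, so the iteration made progress
        }
        if (!monitorCaptureGroups) {
            return true; // ECMAScript-like: only the index counts
        }
        // Ruby-like: the iteration is empty only if the capture groups are unchanged too.
        return Arrays.equals(groupsBefore, groupsAfter);
    }

    public static void main(String[] args) {
        // /(a*)?/ on "": one iteration keeps the index at 0 but sets group 1 to [0, 0).
        int[] before = {-1, -1};
        int[] after = {0, 0};
        System.out.println(isEmptyIteration(0, 0, before, after, false)); // true
        System.out.println(isEmptyIteration(0, 0, before, after, true));  // false
        // A second such iteration changes nothing, so the loop has reached a fixed point.
        System.out.println(isEmptyIteration(0, 0, after, after, true));   // true
    }
}
```

In this model the Ruby-like check accepts the first iteration, because it updates capture group 1, and rejects the second, which is the fixed-point behavior described above.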
Node.js (ECMAScript):
> /(?:|a)?/.exec('a')
[ 'a', index: 0, input: 'a', groups: undefined ]
MRI (Ruby):
irb(main):001:0> /(?:|a)?/.match('a')
=> #<MatchData "">
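To make the two results above easier to compare, here is a small sketch of an optional group whose alternatives are tried in order. It is a deliberately simplified model, not how TRegex executes the expression: the class `FailingEmptyCheckSketch` and the method `matchOptionalGroup` are hypothetical, the match is only attempted at index 0, and the sequel of the loop is assumed to be empty. The flag name is borrowed from the `failingEmptyChecksDontBacktrack` method listed for RegexFlavor.

```java
/** Hypothetical model of /(?:alt1|alt2|...)?/ matched at index 0; not TRegex code. */
final class FailingEmptyCheckSketch {

    /** Returns the text matched by the optional group followed by an empty sequel. */
    static String matchOptionalGroup(String[] alternatives, String input,
                                     boolean failingEmptyChecksDontBacktrack) {
        for (String alternative : alternatives) {
            if (!input.startsWith(alternative)) {
                continue; // this alternative does not match at index 0
            }
            if (alternative.isEmpty()) {
                // The iteration consumed nothing, so the empty check fails.
                if (failingEmptyChecksDontBacktrack) {
                    return ""; // Ruby-like: leave the loop and match the (empty) sequel
                }
                continue; // ECMAScript-like: backtrack and try the next alternative
            }
            return alternative;
        }
        return ""; // zero iterations of the optional loop
    }

    public static void main(String[] args) {
        String[] alternatives = {"", "a"}; // models /(?:|a)?/
        System.out.println("[" + matchOptionalGroup(alternatives, "a", false) + "]"); // [a]
        System.out.println("[" + matchOptionalGroup(alternatives, "a", true) + "]");  // []
    }
}
```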
We implement this in NFATraversalRegexASTVisitor by introducing two transitions whenever
we leave a loop, one leading to the start of the loop (empty check passes) and one escaping past
the loop (empty check fails). The two transitions are then annotated with complementary guards
(see QuantifierGuard.createEscapeZeroWidth(com.oracle.truffle.regex.tregex.parser.Token.Quantifier)),
so that at runtime, only one of the two transitions will be admissible.

| Modifier and Type | Field and Description |
|---|---|
| static RubyFlavor | INSTANCE |

Fields inherited from class RegexFlavor: BACKREFERENCES_TO_UNMATCHED_GROUPS_FAIL, EMPTY_CHECKS_MONITOR_CAPTURE_GROUPS, FAILING_EMPTY_CHECKS_DONT_BACKTRACK, LOOKBEHINDS_RUN_LEFT_TO_RIGHT, NESTED_CAPTURE_GROUPS_KEPT_ON_LOOP_REENTRY, USES_LAST_GROUP_RESULT_FIELD

| Modifier and Type | Method and Description |
|---|---|
| RegexParser | createParser(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer) |
| RegexValidator | createValidator(RegexSource source) |

Methods inherited from class RegexFlavor: backreferencesToUnmatchedGroupsFail, canHaveEmptyLoopIterations, emptyChecksMonitorCaptureGroups, failingEmptyChecksDontBacktrack, lookBehindsRunLeftToRight, nestedCaptureGroupsKeptOnLoopReentry, usesLastGroupResultField
public static final RubyFlavor INSTANCE
public RegexValidator createValidator(RegexSource source)
createValidator in class RegexFlavor
public RegexParser createParser(RegexLanguage language, RegexSource source, CompilationBuffer compilationBuffer)
createParser in class RegexFlavor