Class RE
- All Implemented Interfaces:
Serializable
To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b");                   // Compile expression
boolean matched = r.match("xaaaab");      // Match against "xaaaab"
String wholeExpr = r.getParen(0);         // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1);      // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0);  // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0);      // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0);   // lenWholeExpr will be 5
int startInside = r.getParenStart(1);     // startInside will be index 1
int endInside = r.getParenEnd(1);         // endInside will be index 5
int lenInside = r.getParenLength(1);      // lenInside will be 4
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
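As a rough, runnable illustration of the group accessors and the backreference described above, the sketch below uses the JDK's java.util.regex package as a stand-in for RE, so that it runs without any extra library. That substitution is an assumption made purely for runnability: the class GroupDemo and its helper methods are hypothetical, and the comments only indicate which RE call each line is meant to mirror.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for org.apache.regexp.RE.
class GroupDemo {
    // Returns {whole match, group 1, start, end, length} as strings,
    // or null when the pattern is not found anywhere in the input.
    static String[] describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {                  // comparable to r.match(input)
            return null;
        }
        return new String[] {
            m.group(0),                             // comparable to r.getParen(0)
            m.group(1),                             // comparable to r.getParen(1)
            String.valueOf(m.start(0)),             // comparable to r.getParenStart(0)
            String.valueOf(m.end(0)),               // comparable to r.getParenEnd(0)
            String.valueOf(m.end(0) - m.start(0))   // comparable to r.getParenLength(0)
        };
    }

    // True when the input contains text of the form n=n, found via backreference \1.
    static boolean hasRepeatedNumber(String input) {
        return Pattern.compile("([0-9]+)=\\1").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", describe("(a*)b", "xaaaab")));
        System.out.println(hasRepeatedNumber("2=2"));
        System.out.println(hasRepeatedNumber("1=2"));
    }
}
```

With java.util.regex the first line printed is "aaaab, aaaa, 1, 6, 5", which agrees with the values listed in the comments of the RE example.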
The full regular expression syntax accepted by RE is described here:
Characters
  unicodeChar    Matches any identical unicode character
  \              Used to quote a meta-character (like '*')
  \\             Matches a single '\' character
  \0nnn          Matches a given octal character
  \xhh           Matches a given 8-bit hexadecimal character
  \\uhhhh        Matches a given 16-bit hexadecimal character
  \t             Matches an ASCII tab character
  \n             Matches an ASCII newline character
  \r             Matches an ASCII return character
  \f             Matches an ASCII form feed character
Character Classes
  [abc]          Simple character class
  [a-zA-Z]       Character class with ranges
  [^abc]         Negated character class
NOTE: Incomplete ranges will be interpreted as "starts from zero" or "ends with last character".
I.e. [-a] is the same as [\\u0000-a], and [a-] is the same as [a-\\uFFFF], [-] means "all characters".
Standard POSIX Character Classes
  [:alnum:]      Alphanumeric characters.
  [:alpha:]      Alphabetic characters.
  [:blank:]      Space and tab characters.
  [:cntrl:]      Control characters.
  [:digit:]      Numeric characters.
  [:graph:]      Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
  [:lower:]      Lower-case alphabetic characters.
  [:print:]      Printable characters (characters that are not control characters.)
  [:punct:]      Punctuation characters (characters that are not letter, digits, control characters, or space characters).
  [:space:]      Space characters (such as space, tab, and formfeed, to name a few).
  [:upper:]      Upper-case alphabetic characters.
  [:xdigit:]     Characters that are hexadecimal digits.
Non-standard POSIX-style Character Classes
  [:javastart:]  Start of a Java identifier
  [:javapart:]   Part of a Java identifier
Predefined Classes
  .              Matches any character other than newline
  \w             Matches a "word" character (alphanumeric plus "_")
  \W             Matches a non-word character
  \s             Matches a whitespace character
  \S             Matches a non-whitespace character
  \d             Matches a digit character
  \D             Matches a non-digit character
Boundary Matchers
  ^              Matches only at the beginning of a line
  $              Matches only at the end of a line
  \b             Matches only at a word boundary
  \B             Matches only at a non-word boundary
Greedy Closures
  A*             Matches A 0 or more times (greedy)
  A+             Matches A 1 or more times (greedy)
  A?             Matches A 1 or 0 times (greedy)
  A{n}           Matches A exactly n times (greedy)
  A{n,}          Matches A at least n times (greedy)
  A{n,m}         Matches A at least n but not more than m times (greedy)
Reluctant Closures
  A*?            Matches A 0 or more times (reluctant)
  A+?            Matches A 1 or more times (reluctant)
  A??            Matches A 0 or 1 times (reluctant)
Logical Operators
  AB             Matches A followed by B
  A|B            Matches either A or B
  (A)            Used for subexpression grouping
  (?:A)          Used for subexpression clustering (just like grouping but no backrefs)
Backreferences
  \1             Backreference to 1st parenthesized subexpression
  \2             Backreference to 2nd parenthesized subexpression
  \3             Backreference to 3rd parenthesized subexpression
  \4             Backreference to 4th parenthesized subexpression
  \5             Backreference to 5th parenthesized subexpression
  \6             Backreference to 6th parenthesized subexpression
  \7             Backreference to 7th parenthesized subexpression
  \8             Backreference to 8th parenthesized subexpression
  \9             Backreference to 9th parenthesized subexpression
All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.
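The difference between a greedy and a reluctant closure is easiest to see on a small input. The following is a minimal sketch that uses java.util.regex from the JDK rather than RE (an assumption made so the example runs without the library; the two engines are expected to agree on a pattern this simple). ClosureDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for RE here.
class ClosureDemo {
    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: ".+" takes as much as it can while still letting ">" match.
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: ".+?" takes as little as it can before ">" matches.
        System.out.println(firstMatch("<.+?>", input));
    }
}
```

The greedy form consumes the whole input "<a><b>", while the reluctant form stops at the first closing bracket and yields "<a>".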
Line terminators
A line terminator is a one- or two-character sequence that marks
the end of a line of the input character sequence. The following
are recognized as line terminators:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
RE runs programs compiled by the RECompiler class. But the RE matcher class does not include the actual regular expression compiler for reasons of efficiency. In fact, if you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:
// Pre-compiled regular expression "a*b"
char[] re1Instructions = {
    0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
    0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
    0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
    0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
    0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not threadsafe (for efficiency reasons, and because requiring thread safety in this class is deemed to be a rare requirement), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once an expression has been compiled into an REProgram object, that REProgram can be safely shared across multiple threads and RE objects.
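The sharing rule above (one immutable compiled program, one matcher per thread) can be sketched with the JDK's java.util.regex, where Pattern plays roughly the role of REProgram and Matcher the role of RE. This mapping is an analogy and an assumption, not part of the RE API; SharedProgramDemo and countMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern ~ REProgram (shareable), Matcher ~ RE (per thread).
class SharedProgramDemo {
    // Compiled once; this immutable object is the only regex state shared by threads.
    static final Pattern SHARED = Pattern.compile("a*b");

    // Counts how many inputs contain a match, checking them on a thread pool.
    static int countMatches(List<String> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String input : inputs) {
                // Each task builds its own matcher, so no matcher crosses threads.
                Callable<Boolean> task = () -> SHARED.matcher(input).find();
                results.add(pool.submit(task));
            }
            int count = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        count++;
                    }
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return count;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(Arrays.asList("aaab", "xyz", "b", "caab"), 2));
    }
}
```

With RE the same shape would presumably be one REProgram compiled up front and a new RE constructed from it inside each task.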
ISSUES:
- Version:
- $Id: RE.java 518156 2007-03-14 14:31:26Z vgritsenko $
- Author:
- Jonathan Locke, Tobias Schäfer
Field Summary
Fields
Modifier and Type, Field, Description
(package private) static final char E_ALNUM
(package private) static final char E_BOUND
(package private) static final char E_DIGIT
(package private) static final char E_NALNUM
(package private) static final char E_NBOUND
(package private) static final char E_NDIGIT
(package private) static final char E_NSPACE
(package private) static final char E_SPACE
(package private) int end0
(package private) int end1
(package private) int end2
(package private) int[] endBackref
(package private) int[] endn
static final int MATCH_CASEINDEPENDENT
    Flag to indicate that matching should be case-independent (folded)
static final int MATCH_MULTILINE
    Newlines should match as BOL/EOL (^ and $)
static final int MATCH_NORMAL
    Specifies normal, case-sensitive matching behaviour.
static final int MATCH_SINGLELINE
    Consider all input a single body of text - newlines are matched by .
(package private) int matchFlags
(package private) static final int MAX_PAREN
(package private) static final int maxNode
(package private) int maxParen
(package private) static final int nodeSize
(package private) static final int offsetNext
(package private) static final int offsetOpcode
(package private) static final int offsetOpdata
(package private) static final char OP_ANY
(package private) static final char OP_ANYOF
(package private) static final char OP_ATOM
(package private) static final char OP_BACKREF
(package private) static final char OP_BOL
(package private) static final char OP_BRANCH
(package private) static final char OP_CLOSE
(package private) static final char OP_CLOSE_CLUSTER
(package private) static final char OP_CONTINUE
(package private) static final char OP_END
    The format of a node in a program is: [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ]
    char OPCODE - instruction; char OPDATA - modifying data; char OPNEXT - next node (relative offset)
(package private) static final char OP_EOL
(package private) static final char OP_ESCAPE
(package private) static final char OP_GOTO
(package private) static final char OP_MAYBE
(package private) static final char OP_NOTHING
(package private) static final char OP_OPEN
(package private) static final char OP_OPEN_CLUSTER
(package private) static final char OP_PLUS
(package private) static final char OP_POSIXCLASS
(package private) static final char OP_RELUCTANTMAYBE
(package private) static final char OP_RELUCTANTPLUS
(package private) static final char OP_RELUCTANTSTAR
(package private) static final char OP_STAR
(package private) int parenCount
(package private) static final char POSIX_CLASS_ALNUM
(package private) static final char POSIX_CLASS_ALPHA
(package private) static final char POSIX_CLASS_BLANK
(package private) static final char POSIX_CLASS_CNTRL
(package private) static final char POSIX_CLASS_DIGIT
(package private) static final char POSIX_CLASS_GRAPH
(package private) static final char POSIX_CLASS_JPART
(package private) static final char POSIX_CLASS_JSTART
(package private) static final char POSIX_CLASS_LOWER
(package private) static final char POSIX_CLASS_PRINT
(package private) static final char POSIX_CLASS_PUNCT
(package private) static final char POSIX_CLASS_SPACE
(package private) static final char POSIX_CLASS_UPPER
(package private) static final char POSIX_CLASS_XDIGIT
(package private) REProgram program
static final int REPLACE_ALL
    Flag bit that indicates that subst should replace all occurrences of this regular expression.
static final int REPLACE_BACKREFERENCES
    Flag bit that indicates that subst should replace backreferences
static final int REPLACE_FIRSTONLY
    Flag bit that indicates that subst should only replace the first occurrence of this regular expression.
(package private) CharacterIterator search
(package private) int start0
(package private) int start1
(package private) int start2
(package private) int[] startBackref
(package private) int[] startn
-
Constructor Summary
Constructors
Constructor, Description
RE()
    Constructs a regular expression matcher with no initial program.
RE(String pattern)
    Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler.
RE(String pattern, int matchFlags)
    Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler.
RE(REProgram program)
    Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
RE(REProgram program, int matchFlags)
    Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
-
Method Summary
Modifier and Type, Method, Description
private void allocParens()
    Performs lazy allocation of subexpression arrays
private int compareChars(char c1, char c2, boolean caseIndependent)
    Compares two characters.
int getMatchFlags()
    Returns the current match behaviour flags.
String getParen(int which)
    Gets the contents of a parenthesized subexpression after a successful match.
int getParenCount()
    Returns the number of parenthesized subexpressions available after a successful match.
final int getParenEnd(int which)
    Returns the end index of a given paren level.
final int getParenLength(int which)
    Returns the length of a given paren level.
final int getParenStart(int which)
    Returns the start index of a given paren level.
REProgram getProgram()
    Returns the current regular expression program in use by this matcher object.
String[] grep(Object[] search)
    Returns an array of Strings, whose toString representation matches a regular expression.
protected void internalError(String s)
    Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption).
private boolean isNewline(int i)
boolean match(String search)
    Matches the current regular expression program against a String.
boolean match(String search, int i)
    Matches the current regular expression program against a character array, starting at a given index.
boolean match(CharacterIterator search, int i)
    Matches the current regular expression program against a character array, starting at a given index.
protected boolean matchAt(int i)
    Match the current regular expression program against the current input string, starting at index i of the input string.
protected int matchNodes(int firstNode, int lastNode, int idxStart)
    Try to match a string against a subset of nodes in the program
void setMatchFlags(int matchFlags)
    Sets match behaviour flags which alter the way RE does matching.
protected final void setParenEnd(int which, int i)
    Sets the end of a paren level
protected final void setParenStart(int which, int i)
    Sets the start of a paren level
void setProgram(REProgram program)
    Sets the current regular expression program used by this matcher object.
static String simplePatternToFullRegularExpression(String pattern)
    Converts a 'simplified' regular expression to a full regular expression
String[] split(String s)
    Splits a string into an array of strings on regular expression boundaries.
String subst(String substituteIn, String substitution)
    Substitutes a string for this regular expression in another string.
String subst(String substituteIn, String substitution, int flags)
    Substitutes a string for this regular expression in another string.
-
Field Details
-
MATCH_NORMAL
public static final int MATCH_NORMAL
Specifies normal, case-sensitive matching behaviour.
-
MATCH_CASEINDEPENDENT
public static final int MATCH_CASEINDEPENDENT
Flag to indicate that matching should be case-independent (folded)
-
MATCH_MULTILINE
public static final int MATCH_MULTILINE
Newlines should match as BOL/EOL (^ and $)
-
MATCH_SINGLELINE
public static final int MATCH_SINGLELINE
Consider all input a single body of text - newlines are matched by .
-
OP_END
static final char OP_END
The format of a node in a program is: [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ]
char OPCODE - instruction
char OPDATA - modifying data
char OPNEXT - next node (relative offset)
-
OP_BOL
static final char OP_BOL
-
OP_EOL
static final char OP_EOL
-
OP_ANY
static final char OP_ANY
-
OP_ANYOF
static final char OP_ANYOF
-
OP_BRANCH
static final char OP_BRANCH
-
OP_ATOM
static final char OP_ATOM
-
OP_STAR
static final char OP_STAR
-
OP_PLUS
static final char OP_PLUS
-
OP_MAYBE
static final char OP_MAYBE
-
OP_ESCAPE
static final char OP_ESCAPE
-
OP_OPEN
static final char OP_OPEN
-
OP_OPEN_CLUSTER
static final char OP_OPEN_CLUSTER
-
OP_CLOSE
static final char OP_CLOSE
-
OP_CLOSE_CLUSTER
static final char OP_CLOSE_CLUSTER
-
OP_BACKREF
static final char OP_BACKREF
-
OP_GOTO
static final char OP_GOTO
-
OP_NOTHING
static final char OP_NOTHING
-
OP_CONTINUE
static final char OP_CONTINUE
-
OP_RELUCTANTSTAR
static final char OP_RELUCTANTSTAR
-
OP_RELUCTANTPLUS
static final char OP_RELUCTANTPLUS
-
OP_RELUCTANTMAYBE
static final char OP_RELUCTANTMAYBE
-
OP_POSIXCLASS
static final char OP_POSIXCLASS
-
E_ALNUM
static final char E_ALNUM
-
E_NALNUM
static final char E_NALNUM
-
E_BOUND
static final char E_BOUND
-
E_NBOUND
static final char E_NBOUND
-
E_SPACE
static final char E_SPACE
-
E_NSPACE
static final char E_NSPACE
-
E_DIGIT
static final char E_DIGIT
-
E_NDIGIT
static final char E_NDIGIT
-
POSIX_CLASS_ALNUM
static final char POSIX_CLASS_ALNUM
-
POSIX_CLASS_ALPHA
static final char POSIX_CLASS_ALPHA
-
POSIX_CLASS_BLANK
static final char POSIX_CLASS_BLANK
-
POSIX_CLASS_CNTRL
static final char POSIX_CLASS_CNTRL
-
POSIX_CLASS_DIGIT
static final char POSIX_CLASS_DIGIT
-
POSIX_CLASS_GRAPH
static final char POSIX_CLASS_GRAPH
-
POSIX_CLASS_LOWER
static final char POSIX_CLASS_LOWER
-
POSIX_CLASS_PRINT
static final char POSIX_CLASS_PRINT
-
POSIX_CLASS_PUNCT
static final char POSIX_CLASS_PUNCT
-
POSIX_CLASS_SPACE
static final char POSIX_CLASS_SPACE
-
POSIX_CLASS_UPPER
static final char POSIX_CLASS_UPPER
-
POSIX_CLASS_XDIGIT
static final char POSIX_CLASS_XDIGIT
-
POSIX_CLASS_JSTART
static final char POSIX_CLASS_JSTART
-
POSIX_CLASS_JPART
static final char POSIX_CLASS_JPART
-
maxNode
static final int maxNode
-
MAX_PAREN
static final int MAX_PAREN
-
offsetOpcode
static final int offsetOpcode
-
offsetOpdata
static final int offsetOpdata
-
offsetNext
static final int offsetNext
-
nodeSize
static final int nodeSize
-
program
REProgram program
-
search
transient CharacterIterator search
-
matchFlags
int matchFlags
-
maxParen
int maxParen
-
parenCount
transient int parenCount
-
start0
transient int start0
-
end0
transient int end0
-
start1
transient int start1
-
end1
transient int end1
-
start2
transient int start2
-
end2
transient int end2
-
startn
transient int[] startn
-
endn
transient int[] endn
-
startBackref
transient int[] startBackref
-
endBackref
transient int[] endBackref
-
REPLACE_ALL
public static final int REPLACE_ALL
Flag bit that indicates that subst should replace all occurrences of this regular expression.
-
REPLACE_FIRSTONLY
public static final int REPLACE_FIRSTONLY
Flag bit that indicates that subst should only replace the first occurrence of this regular expression.
-
REPLACE_BACKREFERENCES
public static final int REPLACE_BACKREFERENCES
Flag bit that indicates that subst should replace backreferences
-
Constructor Details
-
RE
public RE(String pattern) throws RESyntaxException
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
Parameters:
pattern - The regular expression pattern to compile.
Throws:
RESyntaxException - Thrown if the regular expression has invalid syntax.
-
RE
public RE(String pattern, int matchFlags) throws RESyntaxException
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
Parameters:
pattern - The regular expression pattern to compile.
matchFlags - The matching style
Throws:
RESyntaxException - Thrown if the regular expression has invalid syntax.
-
RE
public RE(REProgram program, int matchFlags)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data. Permits special flags to be passed in to modify matching behaviour.
Parameters:
program - Compiled regular expression program (see RECompiler and/or recompile)
matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
-
RE
public RE(REProgram program)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
Parameters:
program - Compiled regular expression program
-
RE
public RE()
Constructs a regular expression matcher with no initial program. This is likely to be an uncommon practice, but is still supported.
-
Method Details
-
simplePatternToFullRegularExpression
public static String simplePatternToFullRegularExpression(String pattern)
Converts a 'simplified' regular expression to a full regular expression
Parameters:
pattern - The pattern to convert
Returns:
The full regular expression
-
setMatchFlags
public void setMatchFlags(int matchFlags)
Sets match behaviour flags which alter the way RE does matching.
Parameters:
matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
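The phrase "one or more" suggests that the flags are bit values combined with bitwise OR. As a hedged illustration of that idea, the sketch below combines the corresponding java.util.regex flags (Pattern.CASE_INSENSITIVE and Pattern.MULTILINE, which are assumed here to be the closest JDK counterparts of MATCH_CASEINDEPENDENT and MATCH_MULTILINE). It does not call RE itself, and FlagDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; JDK flags stand in for the RE.MATCH_* flags.
class FlagDemo {
    // True if regex, compiled with the given flag bits, is found in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "bar\nFOO\nbaz";
        // Two independent behaviours switched on with a single int.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^foo$", 0, input));      // no flags set
        System.out.println(found("^foo$", flags, input));  // both flags set
    }
}
```

Only the combined setting finds the middle line: case folding alone cannot anchor at an interior line, and multiline anchoring alone does not match "FOO" against "foo".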
-
getMatchFlags
public int getMatchFlags()
Returns the current match behaviour flags.
Returns:
Current match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
-
setProgram
public void setProgram(REProgram program)
Sets the current regular expression program used by this matcher object.
Parameters:
program - Regular expression program compiled by RECompiler.
-
getProgram
public REProgram getProgram()
Returns the current regular expression program in use by this matcher object.
Returns:
Regular expression program
-
getParenCount
public int getParenCount()
Returns the number of parenthesized subexpressions available after a successful match.
Returns:
Number of available parenthesized subexpressions
-
getParen
public String getParen(int which)
Gets the contents of a parenthesized subexpression after a successful match.
Parameters:
which - Nesting level of subexpression
Returns:
String
-
getParenStart
public final int getParenStart(int which)
Returns the start index of a given paren level.
Parameters:
which - Nesting level of subexpression
Returns:
String index
-
getParenEnd
public final int getParenEnd(int which)
Returns the end index of a given paren level.
Parameters:
which - Nesting level of subexpression
Returns:
String index
-
getParenLength
public final int getParenLength(int which)
Returns the length of a given paren level.
Parameters:
which - Nesting level of subexpression
Returns:
Number of characters in the parenthesized subexpression
-
setParenStart
protected final void setParenStart(int which, int i)
Sets the start of a paren level
Parameters:
which - Which paren level
i - Index in input array
-
setParenEnd
protected final void setParenEnd(int which, int i)
Sets the end of a paren level
Parameters:
which - Which paren level
i - Index in input array
-
internalError
protected void internalError(String s) throws Error
Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). In practice, this should be very rare.
Parameters:
s - Error description
Throws:
Error
-
allocParens
private void allocParens()
Performs lazy allocation of subexpression arrays
-
matchNodes
protected int matchNodes(int firstNode, int lastNode, int idxStart)
Try to match a string against a subset of nodes in the program
Parameters:
firstNode - Node to start at in program
lastNode - Last valid node (used for matching a subexpression without matching the rest of the program as well).
idxStart - Starting position in character array
Returns:
Final input array index if match succeeded. -1 if not.
-
matchAt
protected boolean matchAt(int i)
Match the current regular expression program against the current input string, starting at index i of the input string. This method is only meant for internal use.
Parameters:
i - The input string index to start matching at
Returns:
True if the input matched the expression
-
match
public boolean match(String search, int i)
Matches the current regular expression program against a character array, starting at a given index.
Parameters:
search - String to match against
i - Index to start searching at
Returns:
True if string matched
-
match
public boolean match(CharacterIterator search, int i)
Matches the current regular expression program against a character array, starting at a given index.
Parameters:
search - String to match against
i - Index to start searching at
Returns:
True if string matched
-
match
public boolean match(String search)
Matches the current regular expression program against a String.
Parameters:
search - String to match against
Returns:
True if string matched
-
split
public String[] split(String s)
Splits a string into an array of strings on regular expression boundaries. This function works the same way as the Perl function of the same name. Given a regular expression of "[ab]+" and a string to split of "xyzzyababbayyzabbbab123", the result would be the array of Strings "[xyzzy, yyz, 123]".
Please note that the first string in the resulting array may be an empty string. This happens when the very first character of the input string is matched by the pattern.
Parameters:
s - String to split on this regular expression
Returns:
Array of strings
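The example from the description, including the leading-empty-string case, can be reproduced with a small sketch. It relies on Pattern.split from java.util.regex as a stand-in for RE.split (an assumption made for runnability; the two are expected to agree on these inputs but are not guaranteed to agree everywhere, for instance on trailing empty strings). SplitDemo is a hypothetical name.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern.split stands in for RE.split.
class SplitDemo {
    // Splits input around every match of regex.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // The example from the method description.
        System.out.println(Arrays.toString(split("[ab]+", "xyzzyababbayyzabbbab123")));
        // The input starts with a match, so the first element is an empty string.
        System.out.println(Arrays.toString(split("[ab]+", "abxyz")));
    }
}
```

Callers that do not want the leading empty element would need to check for it explicitly.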
-
subst
public String subst(String substituteIn, String substitution)
Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".
Parameters:
substituteIn - String to substitute within
substitution - String to substitute for all matches of this regular expression.
Returns:
The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).
-
subst
public String subst(String substituteIn, String substitution, int flags)
Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".
It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", a String to substituteIn of "visit us: http://www.apache.org!" and the substitution String "<a href=\"$0\">$0</a>", the resulting String returned by subst would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".
Note: $0 represents the whole match.
Parameters:
substituteIn - String to substitute within
substitution - String to substitute for matches of this regular expression
flags - One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY flag bit is set, only the first occurrence of this regular expression is replaced. If the bit is not set (REPLACE_ALL), all occurrences of this pattern will be replaced. If the flag REPLACE_BACKREFERENCES is set, all backreferences will be processed.
Returns:
The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).
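Both substitution examples above can be checked with a short sketch. It uses Matcher.replaceAll and Matcher.replaceFirst from java.util.regex as stand-ins for subst with REPLACE_ALL and REPLACE_FIRSTONLY; that mapping is an assumption. One difference to keep in mind: the JDK always expands $0 ... $9 in the replacement, whereas the description above indicates that subst does so only when REPLACE_BACKREFERENCES is set. SubstDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for RE.subst.
class SubstDemo {
    // Replaces every match; roughly subst(..., REPLACE_ALL | REPLACE_BACKREFERENCES).
    static String replaceAll(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Replaces only the first match; roughly subst(..., REPLACE_FIRSTONLY).
    static String replaceFirst(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String text = "aaaabfooaaabgarplyaaabwackyb";
        System.out.println(replaceAll("a*b", text, "-"));
        System.out.println(replaceFirst("a*b", text, "-"));
        // $0 in the replacement refers to the whole match.
        System.out.println(replaceAll("http://[\\.\\w\\-\\?/~_@&=%]+",
                "visit us: http://www.apache.org!",
                "<a href=\"$0\">$0</a>"));
    }
}
```

The first call reproduces "-foo-garply-wacky-" from the description, and the last one wraps the URL in an anchor element while leaving the trailing "!" outside the link.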
-
grep
public String[] grep(Object[] search)
Returns an array of Strings for those objects whose toString() representation matches this regular expression. This method works like the Perl function of the same name. Given a regular expression of "a*b" and an array of String objects of [foo, aab, zzz, aaaab], the array of Strings returned by grep would be [aab, aaaab].
Parameters:
search - Array of Objects to search
Returns:
Array of Strings whose toString() value matches this regular expression.
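The filtering behaviour described for grep is simple to sketch by hand. The version below is an illustrative re-implementation on top of java.util.regex, not the library's own code; GrepDemo and its grep method are hypothetical names, and "matches" is interpreted as "the pattern is found somewhere in the string", which is an assumption consistent with the examples elsewhere on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; an illustrative grep built on java.util.regex.
class GrepDemo {
    // Keeps the toString() value of every object in which regex is found.
    static String[] grep(String regex, Object[] search) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Object candidate : search) {
            String text = candidate.toString();
            if (pattern.matcher(text).find()) {
                hits.add(text);
            }
        }
        return hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Object[] input = {"foo", "aab", "zzz", "aaaab"};
        System.out.println(Arrays.toString(grep("a*b", input)));
    }
}
```

For the input from the description this keeps "aab" and "aaaab" and drops the two strings that contain no "b".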
-
isNewline
private boolean isNewline(int i)
Returns:
true if the character at the i-th position in the search string is a newline
-
compareChars
private int compareChars(char c1, char c2, boolean caseIndependent)
Compares two characters.
Parameters:
c1 - first character to compare.
c2 - second character to compare.
caseIndependent - whether the comparison is case insensitive or not.
Returns:
negative, 0, or positive integer as the first character is less than, equal to, or greater than the second.