
Regexp

(Regular Expressions) An expansion of the list in xcoco-regular expressions (regex) (Bash dictionary), taken from the GNU website

  • Broadly speaking, GNU programs use BRE by default and switch to ERE when given -E. Prominent examples: sed, grep

My introduction

A regular expression
is a way to formally describe state-transition diagrams for automata ("robots")...
The Unix folks turned this into pattern matching.
There is a whole topic of deterministic versus non-deterministic matching (how do you move on from a disjunction with two states to a split into two possible continuation states?).
ed takes the non-deterministic code and "completes" it with assembly snippets that it compiles on the spot!
https://www.youtube.com/watch?v=6pqhDjQKWng
https://www.youtube.com/watch?v=528Jc3q86F8

Kleene invented regular expressions.
Ken Thompson is the Unix guy who wrote ed
(the predecessor of sed, which apparently also came before grep).

grep is literally named after an ed command: g/re/p (for example, g/^boy/p)

A lot of database libraries, SQL and so on are built on this stuff.

Perl came later with a regex syntax somewhat different from BRE and ERE; many high-level languages such as Python, Java, etc. use Perl-style syntax.


2 Basic (BRE) and extended (ERE) regular expression

Basic and extended regular expressions are two variations on the syntax of the specified pattern. Basic Regular Expression (BRE) syntax is the default in sed (and similarly in grep). Use the POSIX-specified -E option (-r, --regexp-extended) to enable Extended Regular Expression (ERE) syntax.

In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

With basic (BRE) syntax, these characters do not have special meaning unless prefixed with a backslash (‘\’); with extended (ERE) syntax it is reversed: these characters are special unless they are prefixed with a backslash (‘\’).

Desired pattern: literal ‘+’ (plus sign)

  Basic (BRE) syntax:
    $ echo 'a+b=c' > foo
    $ sed -n '/a+b/p' foo
    a+b=c

  Extended (ERE) syntax:
    $ echo 'a+b=c' > foo
    $ sed -E -n '/a\+b/p' foo
    a+b=c

Desired pattern: one or more ‘a’ characters followed by ‘b’ (plus sign as special meta-character)

  Basic (BRE) syntax:
    $ echo aab > foo
    $ sed -n '/a\+b/p' foo
    aab

  Extended (ERE) syntax:
    $ echo aab > foo
    $ sed -E -n '/a+b/p' foo
    aab
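
Grouping and alternation follow the same escaping rule as ‘+’. The following is a minimal sketch, assuming GNU sed (‘\|’ in BRE is a GNU extension); the variable names are only illustrative:

```shell
# BRE: the group and alternation operators must be backslash-escaped.
bre_out=$(printf 'cat\ndog\nbird\n' | sed -n '/^\(cat\|dog\)$/p')

# ERE (-E): the same operators are written without backslashes.
ere_out=$(printf 'cat\ndog\nbird\n' | sed -E -n '/^(cat|dog)$/p')

printf '%s\n' "$bre_out"   # prints "cat" and "dog", one per line
printf '%s\n' "$ere_out"   # prints the same two lines
```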

3 Overview of bre syntax

Here is a brief description of regular expression syntax as used in sed.

char

A single ordinary character matches itself.

*

Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by \, a ., a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by *; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use \* in these contexts.

.

Matches any character, including newline.

^

Matches the null string at beginning of the pattern space, i.e. what appears after the circumflex must appear at the beginning of the pattern space.

In most scripts, pattern space is initialized to the content of each line (see How sed works). So, it is a useful simplification to think of ^#include as matching only lines where ‘#include’ is the first thing on line—if there are spaces before, for example, the match fails. This simplification is valid as long as the original content of pattern space is not modified, for example with an s command.

^ acts as a special character only at the beginning of the regular expression or subexpression (that is, after \( or \|). Portable scripts should avoid ^ at the beginning of a subexpression, though, as POSIX allows implementations that treat ^ as an ordinary character in that context.

$

It is the same as ^, but refers to end of pattern space. $ also acts as a special character only at the end of the regular expression or subexpression (that is, before \) or \|), and its use at the end of a subexpression is not portable.
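
As a small illustration of the two anchors, here is a sketch that assumes the default behaviour where each input line becomes the pattern space; the sample input and variable names are made up for the example:

```shell
# Only the second line has '#include' at the very start of the pattern space.
start_out=$(printf '  #include <a.h>\n#include <b.h>\n' | sed -n '/^#include/p')

# Only the first line ends with a semicolon.
end_out=$(printf 'x = 1;\ny = 2\n' | sed -n '/;$/p')

printf '%s\n' "$start_out"   # #include <b.h>
printf '%s\n' "$end_out"     # x = 1;
```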

[list] & [^list]

[list] matches any single character in list: for example, [aeiou] matches all vowels. A list may include sequences like char1-char2, which matches any character between (inclusive) char1 and char2. A leading ^ reverses the meaning of list, so that [^list] matches any single character not in list. See Character Classes and Bracket Expressions.

\+

As *, but matches one or more. It is a GNU extension.

\?

As *, but only matches zero or one. It is a GNU extension.

\{i\}

As *, but matches exactly i sequences (i is a decimal integer; for portability, keep it between 0 and 255 inclusive).

\{i,j\}

Matches between i and j, inclusive, sequences.

\{i,\}

Matches more than or equal to i sequences.
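
The three interval forms can be compared on the same input. This is only a sketch with illustrative variable names, using BRE syntax as in the rest of this section:

```shell
# \{2\}: exactly two 'a's are replaced; the rest is left alone.
exact=$(echo aaaa | sed 's/a\{2\}/X/')      # Xaa

# \{2,3\}: between two and three; the longest match (three) is taken.
range=$(echo aaaa | sed 's/a\{2,3\}/X/')    # Xa

# \{2,\}: two or more; all four 'a's are consumed.
open=$(echo aaaa | sed 's/a\{2,\}/X/')      # X

printf '%s\n' "$exact" "$range" "$open"
```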

\(regexp\)

Groups the inner regexp as a whole, this is used to:

  • Apply postfix operators, like \(abcd\)*: this will search for zero or more whole sequences of ‘abcd’, while abcd* would search for ‘abc’ followed by zero or more occurrences of ‘d’. Note that support for \(abcd\)* is required by POSIX 1003.1-2001, but many non-GNU implementations do not support it and hence it is not universally portable.
  • Use back references (see below).
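
The effect of grouping on a postfix operator can be seen in a short sketch (assuming GNU sed, since the text above notes that \(...\)* is not universally supported; the variable names are illustrative):

```shell
# \(ab\)* repeats the whole group 'ab', so the entire input is matched.
grouped=$(echo ababab | sed 's/\(ab\)*/X/')      # X

# ab* repeats only the 'b', so just the first 'ab' is matched.
ungrouped=$(echo ababab | sed 's/ab*/X/')        # Xabab

printf '%s\n' "$grouped" "$ungrouped"
```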

regexp1\|regexp2

Matches either regexp1 or regexp2. Use parentheses to use complex alternative regular expressions. The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. It is a GNU extension.

regexp1regexp2

Matches the concatenation of regexp1 and regexp2. Concatenation binds more tightly than \|, ^, and $, but less tightly than the other regular expression operators.

\digit

Matches the digit-th \(…\) parenthesized subexpression in the regular expression. This is called a back reference. Subexpressions are implicitly numbered by counting occurrences of \( left-to-right.

\n

Matches the newline character.

\char

Matches char, where char is one of $, *, ., [, \, or ^. Note that the only C-like backslash sequences that you can portably assume to be interpreted are \n and \\; in particular \t is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.
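
Since \t is not portable, one possible workaround is to put a literal tab character into the script through a shell variable. This is a sketch of that idea, not something taken from the sed manual; the variable name tab is arbitrary:

```shell
# Store a literal tab character; printf '\t' is understood by POSIX shells.
tab=$(printf '\t')

# Interpolate the literal tab into the sed script instead of writing \t.
out=$(printf 'a\tb\n' | sed "s/${tab}/_/")

printf '%s\n' "$out"   # a_b
```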

Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.
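
The greedy behaviour is easiest to see with ‘.*’. The sketch below uses made-up input and illustrative variable names, and shows one common way to limit the match with a negated bracket expression:

```shell
# Leftmost: the match starts at the first '<'.
# Longest: '.*' runs to the last '>', swallowing both tags.
greedy=$(echo 'x <a> y <b> z' | sed 's/<.*>/T/')        # x T z

# Excluding '>' from the repeated part keeps the match inside one tag.
bounded=$(echo 'x <a> y <b> z' | sed 's/<[^>]*>/T/')    # x T y <b> z

printf '%s\n' "$greedy" "$bounded"
```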


Examples:

‘abcdef’

Matches ‘abcdef’.

‘a*b’

Matches zero or more ‘a’s followed by a single ‘b’. For example, ‘b’ or ‘aaaaab’.

‘a\?b’

Matches ‘b’ or ‘ab’.

‘a\+b\+’

Matches one or more ‘a’s followed by one or more ‘b’s: ‘ab’ is the shortest possible match, but other examples are ‘aaaab’ or ‘abbbbb’ or ‘aaaaaabbbbbbb’.

‘.*’

‘.\+’

These two both match all the characters in a string; however, the first matches every string (including the empty string), while the second matches only strings containing at least one character.

‘^main.*(.*)’

This matches a string starting with ‘main’, followed by an opening and closing parenthesis. The ‘n’, ‘(’ and ‘)’ need not be adjacent.

‘^#’

This matches a string beginning with ‘#’.

‘\\$’

This matches a string ending with a single backslash. The regexp contains two backslashes for escaping.

‘\$’

Instead, this matches a string consisting of a single dollar sign, because it is escaped.

‘[a-zA-Z0-9]’

In the C locale, this matches any ASCII letters or digits.

‘[^ TAB]\+’

(Here TAB stands for a single tab character.) This matches a string of one or more characters, none of which is a space or a tab. Usually this means a word.

‘^\(.*\)\n\1$’

This matches a string consisting of two equal substrings separated by a newline.

‘.\{9\}A$’

This matches nine characters followed by an ‘A’ at the end of a line.

‘^.\{15\}A’

This matches the start of a string that contains 16 characters, the last of which is an ‘A’.


4 Overview of ere syntax

The only difference between basic and extended regular expressions is in the behavior of a few characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’. While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character. ‘|’ is special here because ‘\|’ is a GNU extension – standard basic regular expressions do not provide its functionality.

Examples:

abc?

becomes ‘abc\?’ when using extended regular expressions. It matches the literal string ‘abc?’.

c\+

becomes ‘c+’ when using extended regular expressions. It matches one or more ‘c’s.

a\{3,\}

becomes ‘a{3,}’ when using extended regular expressions. It matches three or more ‘a’s.

\(abc\)\{2,3\}

becomes ‘(abc){2,3}’ when using extended regular expressions. It matches either ‘abcabc’ or ‘abcabcabc’.

\(abc*\)\1

becomes ‘(abc*)\1’ when using extended regular expressions. Backreferences must still be escaped when using extended regular expressions.

a\|b

becomes ‘a|b’ when using extended regular expressions. It matches ‘a’ or ‘b’.
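
To check that the BRE and ERE spellings really are equivalent, each pair below applies the same pattern in both syntaxes. This is a sketch assuming GNU sed; inputs and variable names are illustrative:

```shell
# One or more 'c's.
bre_plus=$(echo 'abccc' | sed 's/c\+/X/')                     # abX
ere_plus=$(echo 'abccc' | sed -E 's/c+/X/')                   # abX

# Two or three repetitions of the group 'abc'.
bre_rep=$(echo 'abcabcabc!' | sed 's/\(abc\)\{2,3\}/X/')      # X!
ere_rep=$(echo 'abcabcabc!' | sed -E 's/(abc){2,3}/X/')       # X!

# A literal '?': bare in BRE, escaped in ERE.
bre_lit=$(echo 'abc?' | sed 's/abc?/X/')                      # X
ere_lit=$(echo 'abc?' | sed -E 's/abc\?/X/')                  # X

printf '%s\n' "$bre_plus" "$ere_plus" "$bre_rep" "$ere_rep" "$bre_lit" "$ere_lit"
```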


5 Character Classes and Bracket Expressions

A bracket expression is a list of characters enclosed by ‘[’ and ‘]’. It matches any single character in that list; if the first character of the list is the caret ‘^’, then it matches any character not in the list. For example, the following command replaces the words ‘gray’ or ‘grey’ with ‘blue’:

$ sed 's/gr[ae]y/blue/'

Bracket expressions can be used in both basic and extended regular expressions (that is, with or without the -E/-r options).

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive. In the default C locale, the sorting sequence is the native character order; for example, ‘[a-d]’ is equivalent to ‘[abcd]’.
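
A quick sketch of a range expression, with LC_ALL=C set explicitly so that the native character order described above applies (the input string is just an example):

```shell
# LC_ALL=C pins the sorting sequence to native byte order,
# so [a-d] is exactly the lower-case letters a, b, c and d.
out=$(echo 'abcdef ABCDEF' | LC_ALL=C sed 's/[a-d]/X/g')

printf '%s\n' "$out"   # XXXXef ABCDEF
```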

Finally, certain named classes of characters are predefined within bracket expressions, as follows.

These named classes must be used inside brackets themselves. Correct usage:

$ echo 1 | sed 's/[[:digit:]]/X/'
X

Incorrect usage is rejected by newer sed versions. Older versions accepted it but treated it as a single bracket expression (which is equivalent to ‘[dgit:]’, that is, only the characters d/g/i/t/:):

# current GNU sed versions - incorrect usage rejected
$ echo 1 | sed 's/[:digit:]/X/'
sed: character class syntax is [[:space:]], not [:space:]

# older GNU sed versions
$ echo 1 | sed 's/[:digit:]/X/'
1

‘[:alnum:]’

Alphanumeric characters: ‘[:alpha:]’ and ‘[:digit:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[0-9A-Za-z]’.

‘[:alpha:]’

Alphabetic characters: ‘[:lower:]’ and ‘[:upper:]’; in the ‘C’ locale and ASCII character encoding, this is the same as ‘[A-Za-z]’.

‘[:blank:]’

Blank characters: space and tab.

‘[:cntrl:]’

Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.

‘[:digit:]’

Digits: 0 1 2 3 4 5 6 7 8 9.

‘[:graph:]’

Graphical characters: ‘[:alnum:]’ and ‘[:punct:]’.

‘[:lower:]’

Lower-case letters; in the ‘C’ locale and ASCII character encoding, this is a b c d e f g h i j k l m n o p q r s t u v w x y z.

‘[:print:]’

Printable characters: ‘[:alnum:]’, ‘[:punct:]’, and space.

‘[:punct:]’

Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

‘[:space:]’

Space characters: in the ‘C’ locale, this is tab, newline, vertical tab, form feed, carriage return, and space.

‘[:upper:]’

Upper-case letters: in the ‘C’ locale and ASCII character encoding, this is A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.

‘[:xdigit:]’

Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.

Most meta-characters lose their special meaning inside bracket expressions:

‘]’

ends the bracket expression if it’s not the first list item. So, if you want to make the ‘]’ character a list item, you must put it first.

‘-’

represents the range if it’s not first or last in a list or the ending point of a range.

‘^’

represents the characters not in the list. If you want to make the ‘^’ character a list item, place it anywhere but first.
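
The three placement rules can be combined in a single bracket expression. The sketch below uses a made-up input string and puts ‘]’ first, ‘^’ in the middle, and the hyphen last, so that all three are taken literally:

```shell
# ']' first, '^' not first, '-' last: all three are literal list items.
out=$(echo 'a]b-c^d' | sed 's/[]^-]/X/g')

printf '%s\n' "$out"   # aXbXcXd
```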

TODO: incorporate this paragraph (copied verbatim from BRE section).

The characters $, *, ., [, and \ are normally not special within list. For example, [\*] matches either ‘\’ or ‘*’, because the \ is not special here. However, strings like [.ch.], [=a=], and [:space:] are special within list and represent collating symbols, equivalence classes, and character classes, respectively, and [ is therefore special within list when it is followed by ., =, or :. Also, when not in POSIXLY_CORRECT mode, special escapes like \n and \t are recognized within list. See Escapes.

‘[.’

represents the open collating symbol.

‘.]’

represents the close collating symbol.

‘[=’

represents the open equivalence class.

‘=]’

represents the close equivalence class.

‘[:’

represents the open character class symbol, and should be followed by a valid character class name.

‘:]’

represents the close character class symbol.


6 regular expression extensions

The following sequences have special meaning inside regular expressions (used in addresses and the s command).

These can be used in both basic and extended regular expressions (that is, with or without the -E/-r options).

\w

Matches any “word” character. A “word” character is any letter or digit or the underscore character.

$ echo "abc %-= def." | sed 's/\w/X/g'
XXX %-= XXX.

\W

Matches any “non-word” character.

$ echo "abc %-= def." | sed 's/\W/X/g'
abcXXXXXdefX

\b

Matches a word boundary; that is it matches if the character to the left is a “word” character and the character to the right is a “non-word” character, or vice-versa.

$ echo "abc %-= def." | sed 's/\b/X/g'
XabcX %-= XdefX.

\B

Matches everywhere but on a word boundary; that is it matches if the character to the left and the character to the right are either both “word” characters or both “non-word” characters.

$ echo "abc %-= def." | sed 's/\B/X/g'
aXbXc X%X-X=X dXeXf.X

\s

Matches whitespace characters (spaces and tabs). Newlines embedded in the pattern/hold spaces will also match:

$ echo "abc %-= def." | sed 's/\s/X/g'
abcX%-=Xdef.

\S

Matches non-whitespace characters.

$ echo "abc %-= def." | sed 's/\S/X/g'
XXX XXX XXXX

\<

Matches the beginning of a word.

$ echo "abc %-= def." | sed 's/\</X/g'
Xabc %-= Xdef.

\>

Matches the end of a word.

$ echo "abc %-= def." | sed 's/\>/X/g'
abcX %-= defX.
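
A typical use of \< and \> together is whole-word replacement. This is a sketch assuming GNU sed, with an example sentence chosen so that the word also occurs inside a longer word:

```shell
# Without word anchors, 'cat' inside 'concatenate' is replaced too.
plain=$(echo 'cat concatenate cat' | sed 's/cat/dog/g')          # dog condogenate dog

# \< and \> restrict the match to 'cat' standing as a whole word.
whole=$(echo 'cat concatenate cat' | sed 's/\<cat\>/dog/g')      # dog concatenate dog

printf '%s\n' "$plain" "$whole"
```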

\`

Matches only at the start of pattern space. This is different from ^ in multi-line mode.

Compare the following two examples:

$ printf "a\nb\nc\n" | sed 'N;N;s/^/X/gm'
Xa
Xb
Xc

$ printf "a\nb\nc\n" | sed 'N;N;s/\`/X/gm'
Xa
b
c

\'

Matches only at the end of pattern space. This is different from $ in multi-line mode.
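
The same comparison can be sketched for $ and \'. This assumes GNU sed (the M flag and \' are GNU extensions). The second script is double-quoted only so that it can contain a single quote; the shell turns \\ into one backslash before sed sees it:

```shell
# With the M flag, $ matches at the end of every line in the pattern space.
dollar=$(printf 'a\nb\nc\n' | sed 'N;N;s/$/X/gm')

# \' matches only at the very end of the pattern space.
endonly=$(printf 'a\nb\nc\n' | sed "N;N;s/\\'/X/gm")

printf '%s\n' "$dollar"    # aX, bX, cX on separate lines
printf '%s\n' "$endonly"   # a, b, cX on separate lines
```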


7 Back-references and Subexpressions

Back-references are regular expression commands which refer to a previous part of the matched regular expression. Back-references are specified with backslash and a single digit (e.g. ‘\1’). The part of the regular expression they refer to is called a subexpression, and is designated with parentheses.

Back-references and subexpressions are used in two cases: in the regular expression search pattern, and in the replacement part of the s command (see Regular Expression Addresses and The "s" Command).

In a regular expression pattern, back-references are used to match the same content as a previously matched subexpression. In the following example, the subexpression is ‘.’ - any single character (being surrounded by parentheses makes it a subexpression). The back-reference ‘\1’ asks to match the same content (same character) as the sub-expression.

The command below matches words starting with any character, followed by the letter ‘o’, followed by the same character as the first.

$ sed -E -n '/^(.)o\1$/p' /usr/share/dict/words
bob
mom
non
pop
sos
tot
wow

Multiple subexpressions are automatically numbered from left-to-right. This command searches for 6-letter palindromes (the first three letters are 3 subexpressions, followed by 3 back-references in reverse order):

$ sed -E -n '/^(.)(.)(.)\3\2\1$/p' /usr/share/dict/words
redder

In the s command, back-references can be used in the replacement part to refer back to subexpressions in the regexp part.

The following example uses two subexpressions in the regular expression to match two space-separated words. The back-references in the replacement part print the words in a different order:

$ echo "James Bond" | sed -E 's/(.*) (.*)/The name is \2, \1 \2./'
The name is Bond, James Bond.

When used with alternation, if the group does not participate in the match then the back-reference makes the whole match fail. For example, ‘a(.)|b\1’ will not match ‘ba’. When multiple regular expressions are given with -e or from a file (‘-f file’), back-references are local to each expression.


8 Escape Sequences - specifying special characters

Until this chapter, we have only encountered escapes of the form ‘\^’, which tell sed not to interpret the circumflex as a special character, but rather to take it literally. For example, ‘\*’ matches a single asterisk rather than zero or more backslashes.

This chapter introduces another kind of escape: escapes that are applied to a character or sequence of characters that ordinarily are taken literally, and that sed replaces with a special character. This provides a way of encoding non-printable characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters in a sed script, but when a script is being prepared in the shell or by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents.

The list of these escapes is:

\a

Produces or matches a BEL character, that is an “alert” (ASCII 7).

\f

Produces or matches a form feed (ASCII 12).

\n

Produces or matches a newline (ASCII 10).

\r

Produces or matches a carriage return (ASCII 13).

\t

Produces or matches a horizontal tab (ASCII 9).

\v

Produces or matches a so called “vertical tab” (ASCII 11).

\cx

Produces or matches CONTROL-x, where x is any character. The precise effect of ‘\cx’ is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus ‘\cz’ becomes hex 1A, but ‘\c{’ becomes hex 3B, while ‘\c;’ becomes hex 7B.

\dxxx

Produces or matches a character whose decimal ASCII value is xxx.

\oxxx

Produces or matches a character whose octal ASCII value is xxx.

\xxx

Produces or matches a character whose hexadecimal ASCII value is xx.

‘\b’ (backspace) was omitted because of the conflict with the existing “word boundary” meaning.
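
A few of these escapes in use, as a sketch that assumes GNU sed (they are GNU extensions); the inputs and variable names are illustrative:

```shell
# \t matches a tab character.
tab_out=$(printf 'a\tb\n' | sed 's/\t/_/')      # a_b

# Three spellings of the space character: decimal 032, octal 040, hex 20.
dec_out=$(echo 'a b' | sed 's/\d032/_/')        # a_b
oct_out=$(echo 'a b' | sed 's/\o040/_/')        # a_b
hex_out=$(echo 'a b' | sed 's/\x20/_/')         # a_b

# Escapes also "produce" characters on the replacement side:
# here the comma is replaced with a tab.
tsv_out=$(echo 'a,b' | sed 's/,/\t/')

printf '%s\n' "$tab_out" "$dec_out" "$oct_out" "$hex_out" "$tsv_out"
```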

8.1 Escaping Precedence

GNU sed processes escape sequences before passing the text onto the regular-expression matching of the s/// command and address matching. Thus the following two commands are equivalent (‘0x5e’ is the hexadecimal ASCII value of the character ‘^’):

$ echo 'a^c' | sed 's/^/b/'
ba^c

$ echo 'a^c' | sed 's/\x5e/b/'
ba^c

As are the following (‘0x5b’,‘0x5d’ are the hexadecimal ASCII values of ‘[’,‘]’, respectively):

$ echo abc | sed 's/[a]/x/'
xbc
$ echo abc | sed 's/\x5ba\x5d/x/'
xbc

However it is recommended to avoid such special characters due to unexpected edge-cases. For example, the following are not equivalent:

$ echo 'a^c' | sed 's/\^/b/'
abc

$ echo 'a^c' | sed 's/\\\x5e/b/'
a^c

9 Multibyte characters and Locale Considerations

GNU sed processes valid multibyte characters in multibyte locales (e.g. UTF-8).

The following example uses the Greek letter Capital Sigma (Σ, Unicode code point 0x03A3). In a UTF-8 locale, sed correctly processes the Sigma as one character despite it being 2 octets (bytes):

$ locale | grep LANG
LANG=en_US.UTF-8

$ printf 'a\u03A3b'
aΣb

$ printf 'a\u03A3b' | sed 's/./X/g'
XXX

$ printf 'a\u03A3b' | od -tx1 -An
 61 ce a3 62

To force sed to process octets separately, use the C locale (also known as the POSIX locale):

$ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
XXXX

9.1 Invalid multibyte characters

sed’s regular expressions do not match invalid multibyte sequences in a multibyte locale.

In the following examples, the ascii value 0xCE is an incomplete multibyte character (shown here as �). The regular expression ‘.’ does not match it:

$ printf 'a\xCEb\n'
a�b

$ printf 'a\xCEb\n' | sed 's/./X/g'
X�X

$ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
  58  ce  58  0a
   X      X   \n

Similarly, the ’catch-all’ regular expression ‘.*’ does not match the entire line:

$ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
  ce  63  0a
       c  \n

GNU sed offers the special z command to clear the current pattern space regardless of invalid multibyte characters (i.e. it works like s/.*// but also removes invalid multibyte characters):

$ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
   0a
   \n

Alternatively, force the C locale to process each octet separately (every octet is a valid character in the C locale):

$ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
  0a
  \n

sed’s inability to process invalid multibyte characters can be used to detect such invalid sequences in a file. In the following examples, the \xCE\xCE is an invalid multibyte sequence, while \xCE\xA3 is a valid multibyte sequence (of the Greek Sigma character).

The following sed program removes all valid characters using s/.//g. Any content left in the pattern space (the invalid characters) is added to the hold space using the H command. On the last line ($), the hold space is retrieved (x), newlines are removed (s/\n//g), and any remaining octets are printed unambiguously (l). Thus, any invalid multibyte sequences are printed as octal values:

$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt

$ cat invalid.txt
ab
c
��de
Σf

$ sed -n 's/.//g ; H ; ${x;s/\n//g;l}' invalid.txt
\316\316$

With a few more commands, sed can print the exact line number corresponding to each invalid character (line 3). These characters can then be removed by forcing the C locale and using octal escape sequences:

$ sed -n 's/.//g;=;l' invalid.txt | paste - -  | awk '$2!="$"'
3       \316\316$

$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt

9.2 Upper/Lower case conversion

GNU sed’s substitute command (s) supports upper/lower case conversions using \U,\L codes. These conversions support multibyte characters:

$ printf 'ABC\u03a3\n'
ABCΣ

$ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
abcσ

See The "s" Command.
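
For plain ASCII input the same codes can be sketched as follows, assuming GNU sed; the third command switches case between two back-references, and the variable names are illustrative:

```shell
# \U upper-cases and \L lower-cases the text that follows in the replacement.
upper=$(echo 'Hello World' | sed 's/.*/\U&/')                      # HELLO WORLD
lower=$(echo 'Hello World' | sed 's/.*/\L&/')                      # hello world

# \L applies until \U takes over, so each word gets its own case.
mixed=$(echo 'Hello World' | sed -E 's/(\w+) (\w+)/\L\1 \U\2/')    # hello WORLD

printf '%s\n' "$upper" "$lower" "$mixed"
```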

9.3 Multibyte regexp character classes

In other locales, the sorting sequence is not specified, and ‘[a-d]’ might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any character, or the set of characters that it matches might even be erratic. To obtain the traditional interpretation of bracket expressions, you can use the ‘C’ locale by setting the LC_ALL environment variable to the value ‘C’.

# TODO: is there any real-world system/locale where 'A'
#       is replaced by '-' ?
$ echo A | sed 's/[a-z]/-/'
A

The interpretation of named character classes depends on the LC_CTYPE locale; for example, ‘[[:alnum:]]’ means the character class of numbers and letters in the current locale.

TODO: show example of collation

# TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
$ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
clichX