Saturday, January 23, 2021

Regular expression 正则表达式

以下摘自 Wikipedia 的 Regular expression 条目。

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

正则表达式(缩写为 regex 或 regexp; 也有称之为有理表达式)是定义搜索模式的一系列字符。通常,字符串搜索算法会将此类模式用于字符串的 “查找” 或 “查找并替换” 操作,或用于输入验证。它是在理论计算机科学和形式语言理论中开发出来的技术。

The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

这个概念出现在 1950 年代,当时美国数学家 Stephen Cole Kleene 对常规语言的描述进行了正规化(regular 的来源)。该概念后来普遍用于 Unix 文本处理实用程序。自 1980 年代以来,存在编写正则表达式的不同语法,一种是 POSIX 标准,另一种是被广泛使用的 Perl 语法。

Regular expressions are used in search engines, search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities either built-in or via libraries.

正则表达式用于搜索引擎,文字处理器和文本编辑器的搜索和替换对话框,sed 和 AWK 等文本处理实用程序以及词法分析中。许多编程语言通过内置或通过库提供正则表达式功能。

Syntax

语法

A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters. Metacharacters help form: atoms; quantifiers telling how many atoms (and whether it is a greedy quantifier or not); a logical OR character, which offers a set of alternatives, and a logical NOT character, which negates an atom's existence; and backreferences to refer to previous atoms of a completing pattern of atoms. A match is made, not when all the atoms of the string are matched, but rather when all the pattern atoms in the regex have matched. The idea is to make a small pattern of characters stand for a large number of possible strings, rather than compiling a large list of all the literal possibilities.

正则表达式模式(regex pattern)匹配目标字符串(string)。模式由一系列原子(atoms)组成。原子是正则表达式模式中的单个点,它试图与目标字符串匹配。最简单的原子是文字(literal),但是将模式的多个部件组合起来以形成一个原子将需要使用元字符()。元字符帮助形成:原子;量词(quantifiers)告诉我们有多少原子(以及它是否是贪婪的量词);逻辑“或”字符(提供一组不同选择)和逻辑“非”字符(使某个原子不存在);和反向引用是指先前的完整模式的原子。进行匹配不是字符串的所有原子都匹配,而是在正则表达式中的所有模式原子都匹配。这样做的想法是使小的字符模式代表大量可能的字符串,而不是编译所有所有可能出现的字符的列表。

Depending on the regex processor there are about fourteen metacharacters, characters that may or may not have their literal character meaning, depending on context, or whether they are "escaped", i.e. preceded by an escape sequence, in this case, the backslash \. Modern and POSIX extended regexes use metacharacters more often than their literal meaning, so to avoid "backslash-osis" or leaning toothpick syndrome it makes sense to have a metacharacter escape to a literal mode; but starting out, it makes more sense to have the four bracketing metacharacters ( ) and { } be primarily literal, and "escape" this usual meaning to become metacharacters. Common standards implement both. The usual metacharacters are {}[]()^$.|*+? and \. The usual characters that become metacharacters when escaped are dswDSW and N.

不同的正则表达式处理器支持的元字符个数不一同,大约有十四个元字符,根据上下文或是否被 “转义”(即在前面加上转义序列,在本例中为反斜杠 \),字符可能具有或不具有其文字字符含义。现代和 POSIX 扩展正则表达式更多地使用元字符,而不是其字面含义,这样避免了 “反斜线病” 或 “牙签综合症”,使元字符转义为字面模式是有意义的,但发明正则表示式时,将四个括号元字符()和 {} 设为字面意义,并 “转义” 这些通常的含义成为元字符更有意义。通用标准同时使用了这两种做法。通常元字符是 { } [ ] ( ) ^ $ . | * + ? 和 \。转义变成元字符的常见字符是 dswDSW 和 N。

Delimiters

定界符

When entering a regex in a programming language, they may be represented as a usual string literal, hence usually quoted; this is common in C, Java, and Python for instance, where the regex re is entered as "re". However, they are often written with slashes as delimiters, as in /re/ for the regex re. This originates in ed, where / is the editor command for searching, and an expression /re/ can be used to specify a range of lines (matching the pattern), which can be combined with other commands on either side, most famously g/re/p as in grep ("global regex print"), which is included in most Unix-based operating systems, such as Linux distributions. A similar convention is used in sed, where search and replace is given by s/re/replacement/ and patterns can be joined with a comma to specify a range of lines as in /re1/,/re2/. This notation is particularly well known due to its use in Perl, where it forms part of the syntax distinct from normal string literals. In some cases, such as sed and Perl, alternative delimiters can be used to avoid collision with contents, and to avoid having to escape occurrences of the delimiter character in the contents. For example, in sed the command s,/,X, will replace a / with an X, using commas as delimiters.

以编程语言输入正则表达式时,它们可以表示为通常的字符串文字,因此通常用引号表示;例如,这在 C,Java 和 Python 中很常见,其中 regex re 输入为 “re”。但是,它们经常以斜杠作为分隔符,如 /re/ 中的 regex re。这起源于 ed,其中 / 是用于搜索的编辑器命令,表达式 /re/ 可用于指定行范围(与模式匹配),该行可与任一侧的其他命令结合使用,最著名的是 g /re/p 如 grep(“全局正则表达式打印”)中所述,grep 包含在大多数基于 Unix 的操作系统中,例如 Linux 发行版。sed 中使用了类似的约定,其中搜索和替换由 s /re/replacement/ 给出,并且模式可以用逗号连接以指定行范围,如 /re1 /,/re2 /。由于在 Perl 中使用了这种表示法,因此它是众所周知的,在 Perl 中,这种表示法构成了不同于普通字符串文字的语法的一部分。在某些情况下,例如 sed 和 Perl,可以使用其他分隔符来避免与内容冲突,并避免必须避免内容中出现分隔符。例如,在 sed 中,命令 s,/,X, 将使用逗号作为定界符将 X 替换为 /。

No comments:

Post a Comment