
- Number of slides: 27
Regular Expressions
Providing a Search Pattern and Description of Classes of Strings
Copyright © 2008-2011 Curt Hill
What are they?
• A special text pattern for describing a string
  – Frequently used in searches
  – One regular expression often describes an infinite number of similar strings
• This text pattern allows special sequences to have special meaning
• Any other character simply matches itself in the searched string
Disclaimer
• Standardization: there is no single standard syntax
  – Different applications may use subsets or extensions of what will be described
• Many languages have a facility for using regular expressions
  – Built in: Perl
  – An object or function: C++, JavaScript
Specials
• The special characters include
  – [ ] ^ | * $ . ? + ( ) { }
  – The braces may be literal or special depending on their usage
• Any other character just matches itself
• Thus Hello as a pattern just matches the obvious string
• Since many of these characters are valuable in strings, the escape is used to match them
Escape
• The backslash character is the escape
• Thus to look for an asterisk (a special) in a string it must be escaped: \*
  – This allows a search to find the asterisk
• The C family uses some of the same escape sequences:
  – \n newline or linefeed
  – \t tab
  – \r carriage return
Coded escapes
• An x and two hexadecimal digits may also follow the backslash
• Thus \x4E gives the character with hexadecimal value 4E (an N in ASCII)
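The escapes above can be tried out in code. This is a minimal sketch, assuming Python's standard re module (the slides do not name a language for their examples); the variable names are illustrative only.

```python
import re

# A backslash makes the special character * match a literal asterisk.
star = re.compile(r"\*")
found = star.search("3 * 4")
print(found.start())  # position of the asterisk in the string

# \t is the tab escape; \x4E is the character with hexadecimal value 4E.
tab_match = re.search(r"a\tb", "a\tb")
hex_match = re.search(r"\x4E", "NO")
print(tab_match is not None, hex_match.group())
```

The raw-string prefix r"..." is a Python convenience so that the backslash reaches the regular expression engine unchanged; other languages handle this differently.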
Positioning
• There are two specials that force a position
• ^ matches the beginning of the line
• $ matches the end of the line
• Both of these match a position rather than a character
• Without these a pattern could match anywhere within a string
Positioning examples
• The pattern ^Hi will match any line that starts with the two characters H and i
• The pattern ,$ will match any line that ends with a comma
• The pattern ^Hello$ will match only a line that has Hello as its only content
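The three positioning patterns above can be checked against a few sample lines. This is a sketch assuming Python's re module; the sample lines are made up for illustration.

```python
import re

lines = ["Hi there", "Oh, Hi", "one, two,", "Hello", "Hello world"]

# ^ anchors the match to the start of the line, $ to the end.
starts_with_hi = [s for s in lines if re.search(r"^Hi", s)]
ends_with_comma = [s for s in lines if re.search(r",$", s)]
only_hello = [s for s in lines if re.search(r"^Hello$", s)]

print(starts_with_hi)   # ['Hi there']
print(ends_with_comma)  # ['one, two,']
print(only_hello)       # ['Hello']
```

Note that "Oh, Hi" is rejected by ^Hi even though it contains Hi, because the anchor fixes the position of the match.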
Wildcards
• The dot will match any one character
  – Except end-of-line control characters
• Thus A.B could match ABB, ACB, A.B or any other three-character sequence starting with A and ending with B
• DOS/Windows file name wildcards use a ? for this
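A short sketch of the dot wildcard, assuming Python's re module. Here fullmatch is used so that the whole candidate string must fit the pattern; the candidate strings are illustrative.

```python
import re

candidates = ["ABB", "ACB", "A.B", "AB", "A\nB"]

# The dot stands for exactly one character, but not a newline by default.
matches = [s for s in candidates if re.fullmatch(r"A.B", s)]

print(matches)  # ['ABB', 'ACB', 'A.B']
```

"AB" fails because the dot must consume one character, and "A\nB" fails because the dot does not match the end-of-line character unless the engine is told otherwise.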
Repetition
• It is often desirable to repeat a pattern a fixed number of times
• This is done by following the pattern with a set of braces with an integer inside
• Thus abbbc is the same as ab{3}c
Repetition
• There are three repetition characters which are more general
• Closure is the *
  – It represents zero or more repetitions of the previous item
  – Also known as the Kleene star
• The + represents one or more repetitions of the previous item
• The ? represents zero or one occurrences of the previous item
Examples
• ~* matches any number (including zero) of successive tildes
• -* matches zero or more dashes
• .+ matches one or more of any character
• hats? matches either hat or hats
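The repetition operators above can be exercised as follows. This is a sketch assuming Python's re module; the test strings are chosen for illustration.

```python
import re

# A brace count repeats the previous item a fixed number of times.
fixed = re.fullmatch(r"ab{3}c", "abbbc") is not None

# * allows zero repetitions, so ~* matches the empty string.
zero_tildes = re.fullmatch(r"~*", "") is not None

# + requires at least one repetition, so .+ rejects the empty string.
plus_rejects_empty = re.fullmatch(r".+", "") is None

# ? makes the previous item optional.
hats = [w for w in ["hat", "hats", "hatss"] if re.fullmatch(r"hats?", w)]

print(fixed, zero_tildes, plus_rejects_empty, hats)
```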
Grouping
• So far the repetitions could only be applied to a single character
• What is needed next is some type of grouping
• This is provided by parentheses
• Enclosing a pattern in parentheses makes it a group
• This group can then be followed by a repetition character
Examples
• (\*-)* will match
  – *-
  – *-*-
  – *-*-*- etc.
• The * is greedy: it will try to match as many of these as possible
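Grouping and greediness can be observed directly. This is a sketch assuming Python's re module, with the asterisk inside the group escaped so that it is a literal character; the input strings are illustrative.

```python
import re

group_star = re.compile(r"(\*-)*")

# The whole group "*-" is repeated zero or more times.
samples = ["*-", "*-*-", "*-*-*-", "*-*", ""]
matched = [s for s in samples if group_star.fullmatch(s)]
print(matched)  # ['*-', '*-*-', '*-*-*-', '']

# Greedy: starting at the front, the * takes as many groups as it can.
greedy = group_star.match("*-*-*-x").group()
print(greedy)  # *-*-*-
```

The empty string is accepted because zero repetitions are allowed, while "*-*" is rejected because the trailing asterisk is only half of a group.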
More interesting patterns
• A number is pretty easy to understand from our perspective but not so easy to describe
  – Except in regular expressions
• An integer is a string of digits
  – Possibly preceded by a plus or minus
• So how is this done?
• With sets and repetition
A set
• A pair of brackets may be filled with characters
• This will match any one of them
• Thus the digits could be done with: [0123456789]
• An integer could then be: [-+]?[0123456789]+
• Any single vowel is: [aeiouAEIOU]
Ranges in sets
• Listing all the letters is somewhat more than we want to type
• The range is handled by a dash: [0-9] is the same as [0123456789]
• The letters are then: [a-zA-Z]
• If you want a dash in a set, place it first
Complement or Negation
• You may place a caret ^ at the beginning of a set to ask for any character but those present
• Thus [^0-9] is any character but a digit
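Sets, ranges and negation combine as described above. The following is a sketch assuming Python's re module; the sample strings are made up for illustration.

```python
import re

# An optional sign followed by one or more digits.
integer = re.compile(r"[-+]?[0-9]+")
samples = ["42", "-7", "+15", "4x", "--3", ""]
ints = [s for s in samples if integer.fullmatch(s)]
print(ints)  # ['42', '-7', '+15']

# A set matches any one of its members.
vowels = re.findall(r"[aeiouAEIOU]", "Regular Expression")
print(vowels)  # ['e', 'u', 'a', 'E', 'e', 'i', 'o']

# A leading caret negates the set.
non_digits = re.findall(r"[^0-9]", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

# A dash placed first in the set is an ordinary character.
dash_or_digit = re.findall(r"[-0-9]", "7-3")
print(dash_or_digit)  # ['7', '-', '3']
```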
Shortcut sets
• Several classes are so commonly used that a shortcut exists
• Each is an escaped letter
• \d is a digit [0-9]
• \D is not a digit [^0-9]
• \w is an alphanumeric [a-zA-Z0-9_]
• \W is not an alphanumeric [^a-zA-Z0-9_]
• \s is whitespace [ \r\n\t\f\v]
  – \f is formfeed, \v is vertical tab
• \S is not whitespace [^ \r\n\t\f\v]
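The shortcut sets can be tried on a short string. This is a sketch assuming Python's re module. In Python 3 the shortcuts also match non-ASCII digits and letters by default, so the re.ASCII flag is passed here to get the sets as listed on the slide; the sample text is illustrative.

```python
import re

text = "room 101,\tfloor_2"

# re.ASCII restricts \d, \w and \s to their ASCII definitions.
digits = re.findall(r"\d+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
spaces = re.findall(r"\s", text, flags=re.ASCII)
non_spaces = re.findall(r"\S+", text, flags=re.ASCII)

print(digits)      # ['101', '2']
print(words)       # ['room', '101', 'floor_2']
print(spaces)      # [' ', '\t']
print(non_spaces)  # ['room', '101,', 'floor_2']
```

The underscore counts as part of \w, which is why floor_2 comes back as a single word.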
Specials
• In some sense the right parenthesis, right bracket and dash are ambiguous as specials
• In certain contexts they are regular characters and in others they are specials
• The right parenthesis and right bracket are only special if there is a leading left one
• Dash is only special in a set and following another character
Alternation
• A set provides intuitive alternation
• The match process may choose any character within the set to use
• This alternation applies only among single characters
• There is also an alternation character
  – The vertical bar |
• This allows either simple or complicated patterns to alternate
Alternation
• Thus: A|E|I|O|U is equivalent to [AEIOU]
• However, more interesting alternations are possible and useful
  – (abc)|(123) will match either of the two strings
  – ([-+]?\d+)|(\w+) will match any string of characters that looks like a number or word
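The alternation patterns above can be sketched as follows, assuming Python's re module. The tokens and the "number"/"word" labels are illustrative, not part of the slides.

```python
import re

# A|E|I|O|U and [AEIOU] accept exactly the same single characters.
vowel_alt = re.compile(r"A|E|I|O|U")
vowel_set = re.compile(r"[AEIOU]")
equivalent = all(
    (vowel_alt.fullmatch(c) is None) == (vowel_set.fullmatch(c) is None)
    for c in "ABCDEIOUXYZ"
)

# Alternation between two groups.
either = [s for s in ["abc", "123", "abc123", "ab"]
          if re.fullmatch(r"(abc)|(123)", s)]

# Group 1 is set when the number branch matched, group 2 for the word branch.
number_or_word = re.compile(r"([-+]?\d+)|(\w+)")
kinds = {}
for token in ["-42", "hello", "+7", "x1"]:
    m = number_or_word.fullmatch(token)
    kinds[token] = "number" if m.group(1) is not None else "word"

print(equivalent, either, kinds)
```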
Example
• Suppose the following expression: ^ab(cde)*f$
• Which of the following match this:
  – abf
  – abcdecdef
  – abcdeaf
  – abcdecdef
  – acdef
  – abcdefa
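One way to check answers to the exercise above is to run the expression against each candidate. This is a sketch assuming Python's re module; it is only a checking aid, and the dictionary name is illustrative.

```python
import re

pattern = re.compile(r"^ab(cde)*f$")
candidates = ["abf", "abcdecdef", "abcdeaf", "acdef", "abcdefa"]

# True when the whole line fits: ab, then zero or more cde groups, then f.
results = {s: pattern.search(s) is not None for s in candidates}

for s, ok in results.items():
    print(s, "matches" if ok else "does not match")
```

The anchors matter here: abcdefa contains a matching prefix, but the $ requires the f to be the last character of the line.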
Languages
• A regular expression may define something more substantial than a simple string
• It defines a class of strings
  – Perhaps of infinite size
• In effect this class is a regular language
• The regular expression is one way to define the grammar of that language
• Regular languages are simple
  – Many interesting languages are not regular
Practice
• There is a web site that allows you to try out regular expressions: https://regex101.com/
• It gives you the ability to test out patterns on text
• See the next screen for an example
Example
[Screenshot of a pattern being tested on the web site; image not reproduced]
Finally
• We have a strange thing here
• Regular expressions are important to several diverse areas:
  – Language theory, including automata and parsing
  – Practical searching
• For us they just tend to be a very useful way to express the regularity we see