At the more complex end of things, Scanner is geared towards helping you tokenize programming languages or parse complex data representation formats (e.g. JSON, XML, YAML, Markdown).
The code is hosted on GitHub:
Both of these examples are pretty trivial to do with only one regular expression; for the sake of demonstration pretend that's not the case. If you want to see more complex examples, see the JSON parser and/or JavaScript tokenizer, which are included with the code.
Example 1: Say you wish to parse a date in some known format (D[D]/M[M]/YY[YY] HH:mm:ss).
    scanner = Scanner('09/3/2011 12:15:45')
    day = scanner.scan(r'\d+')      # -> '09'
    scanner.get()                   # -> /
    month = scanner.scan(r'\d+')    # -> '3'
    scanner.get()                   # -> /
    year = scanner.scan(r'\d+')     # -> '2011'
    scanner.scan(r'\s+')            # -> ' '
    hour = scanner.scan(r'\d+')     # -> '12'
    scanner.get()                   # -> :
    minute = scanner.scan(r'\d+')   # -> '15'
    scanner.get()                   # -> :
    second = scanner.scan(r'\d+')   # -> '45'

    # The inner parentheses build a tuple, so the output is the same on Python 2 and 3
    print((day, month, year, hour, minute, second))
    # -> ('09', '3', '2011', '12', '15', '45')
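To make the mechanics concrete, here is a minimal sketch of the same walk using only the standard `re` module and an explicit position variable. The helper `scan_at` is hypothetical and not part of Scanner; it only imitates what `scan` is described as doing (match at the pointer, then advance). Stepping over one separator character stands in for `scanner.get()`, which assumes the date contains exactly one space.

```python
import re

def scan_at(text, pos, pattern):
    """Hypothetical helper: match `pattern` exactly at `pos`.

    Returns (matched_string, new_pos), or (None, pos) if nothing matches.
    """
    m = re.compile(pattern).match(text, pos)
    if m is None:
        return None, pos
    return m.group(), m.end()

text = '09/3/2011 12:15:45'
pos = 0
fields = []
# Alternate between a run of digits and a single separator character.
for _ in range(6):
    value, pos = scan_at(text, pos, r'\d+')
    fields.append(value)
    pos += 1  # step over '/', ' ' or ':' (the role played by scanner.get())

print(tuple(fields))  # -> ('09', '3', '2011', '12', '15', '45')
```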
Example 2: Validating that a numeric expression is a valid base-10 number (allow leading +-, disallow leading 0s unless preceding a point, don't require any numbers before the point, allow exponent and +-, require numbers after a point, disallow anything else)
    import re

    def is_number(num):
        has_exp = False
        s = Scanner(num)
        s.scan(r'[+\-]?')  # consume +- but don't require it
        # consume everything up to the point, if it exists
        if not s.scan(r'\.|0\.|[1-9]\d*\.?|0$'):
            return False
        has_point = s.match().count('.') > 0
        # if the number is just a straight integer, we might have reached the end here
        if s.eos() and not has_point:
            return True
        # we now expect to see more numbers or an exponent
        s.scan(r'\d+')
        # if we're following a point, we need that to have matched
        if has_point and not s.matched():
            return False
        # now consume the exponent and if it's present, require trailing numbers
        if s.scan(r'[e][+\-]?', re.I):
            if s.scan(r'\d+') is None:
                return False
        # we should now be at the end of the string
        return s.eos()

    >>> is_number('0')
    True
    >>> is_number('0.0')
    True
    >>> is_number('0.000')
    True
    >>> is_number('0.000e100')
    True
    >>> is_number('.123e-100')
    True
    >>> is_number('3.14159')
    True
    >>> is_number('i')
    False
    >>> is_number('0xFACE')
    False
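For comparison, the text notes that these examples can be handled with a single regular expression. The sketch below is one possible translation of the rules listed for Example 2 into one verbose pattern. The names `NUMBER_RE` and `is_number_re` are illustrative only, and the pattern is my reading of the stated rules, not code shipped with Scanner.

```python
import re

# Assumed translation of the Example 2 rules into a single pattern.
NUMBER_RE = re.compile(
    r"""
    [+-]?                        # optional sign
    (?:
        (?: 0?\.\d+              # '.5' or '0.5': digits required after the point
          | [1-9]\d*(?:\.\d+)?   # no leading zeros; optional fractional part
        )
        (?: e[+-]?\d+ )?         # optional exponent with its own optional sign
      | 0                        # a bare zero
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)

def is_number_re(num):
    """Return True if the whole string is a valid base-10 number."""
    return NUMBER_RE.fullmatch(num) is not None

for candidate in ('0', '0.000e100', '.123e-100', '3.14159', 'i', '0xFACE'):
    print(candidate, is_number_re(candidate))
```

The scanner version spreads the same decisions over several small steps, which is easier to extend when the grammar grows beyond what one pattern can express comfortably.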
A simple class to aid in lexical analysis of a string. Styled after, but not entirely the same as, Ruby's StringScanner class.

The basic philosophy is simple: Scanner traverses a string left to right, consuming the string as it goes. The current position is the string pointer, Scanner.pos. At each iteration, the caller uses the scanning methods to determine what the current piece of string actually is.

Scanning methods:

With the exception of get and peek, all scanning methods take a pattern and (optionally) flags (e.g. re.X). The patterns are assumed to be either strings or compiled regular expression objects (i.e. the result of re.compile, or equivalent). If a pattern is not a string but does not implement match or search (whichever is being used), a ValueError is raised. String patterns are compiled and cached internally.

The check, scan and skip methods all try to match *at* the current scan pointer. check_to, scan_to and skip_to all try to find a match somewhere beyond the scan pointer and jump *to* that position. check_until, scan_until and skip_until are like *_to, but also consume the match (so they jump to the *end* of the match).

Lookahead:
    check()
    check_to()
    check_until()
    peek()

Consume:
    get()
    scan()
    scan_to()
    scan_until()
    skip()
    skip_to()
    skip_until()
    skip_bytes() (convenience wrapper)
    skip_lines() (convenience wrapper)
    skip_whitespace() (convenience wrapper)

Note that scan* and check* both return either a string, in the case of a match, or None, in the case of no match. If the match exists but is zero length, the empty string is returned. Be careful handling this, as both None and the empty string evaluate to False but mean very different things. peek and get also return the empty string when the end of the stream is reached.

Most recent match data:
    matched() -- True/False - was the most recent match a success?

The following methods all throw Exception if not matched():
    match() -- matched string
    match_len() -- matched string length
    match_pos() -- offset of match
    Wrappers around re.*:
        match_info() -- the re.MatchObject
        match_group()
        match_groups()
        match_groupdict()
    pre_match() -- string preceding the match
    post_match() -- string following the match

Misc:
    pos -- get/set current scan pointer position
    bol() -- beginning of line? (DOS/Unix/Mac aware)
    eol() -- end of line? (DOS/Unix/Mac aware)
    eos() -- end of string?
    rest() -- remaining (unconsumed) string
    rest_len() -- length of remaining string
    unscan() -- revert to previous state

Setup:
    string -- get/set current source string
    reset() -- reset the scanner ready to start again
    terminate() -- trigger premature finish
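The difference between matching *at* the pointer, jumping *to* a match, and consuming *until* the end of a match can be sketched with plain functions over the standard `re` module. These three functions are illustrative stand-ins that return `(text, new_position)` pairs; they are not Scanner's implementation, and the return value of the `scan_until` stand-in (everything through the end of the match) is an assumption based on the description of check_until.

```python
import re

def scan(text, pos, pattern):
    """Match AT pos; return (matched text, new pos) or (None, pos)."""
    m = re.compile(pattern).match(text, pos)
    return (m.group(), m.end()) if m else (None, pos)

def scan_to(text, pos, pattern):
    """Find a match at or beyond pos; return the text before it, stop at its start."""
    m = re.compile(pattern).search(text, pos)
    return (text[pos:m.start()], m.start()) if m else (None, pos)

def scan_until(text, pos, pattern):
    """Find a match at or beyond pos; return the text through its end, stop there."""
    m = re.compile(pattern).search(text, pos)
    return (text[pos:m.end()], m.end()) if m else (None, pos)

text = 'key = value; next'
print(scan(text, 0, r'\w+'))      # -> ('key', 3)
print(scan(text, 0, r'='))        # -> (None, 0)   '=' is not at position 0
print(scan_to(text, 0, r'='))     # -> ('key ', 4) pointer stops before '='
print(scan_until(text, 0, r'='))  # -> ('key =', 5) pointer stops after '='
```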
The current string pointer position.
The source string
Constructor Arguments: src -- a string to scan. This can be set later by string()
Return whether or not the scan pointer is immediately after a newline character (DOS/Unix/Mac aware), or at the start of the string.
Return a match for the pattern (or None) at the scan pointer without actually consuming the string. If the pattern matched but was zero length, the empty string is returned. If the pattern did not match, None is returned.
Return all text up until the beginning of the first match for the pattern after the scan pointer, without consuming the string. If the pattern matched but was zero length, the empty string is returned. If the pattern did not match, None is returned.
Return all text up until the end of the first match for the pattern after the scan pointer, without consuming the string. If the pattern matched but was zero length, the empty string is returned. If the pattern did not match, None is returned.
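The distinction between a zero-length match and no match is easy to get wrong, so here is a small sketch of the pitfall. The `check` function below is a hypothetical stand-in written with the standard `re` module; it only mirrors the documented return values (string, empty string, or None).

```python
import re

def check(text, pos, pattern):
    """Hypothetical lookahead: return the match at pos, or None; never advances."""
    m = re.compile(pattern).match(text, pos)
    return m.group() if m else None

optional_sign = check('42', 0, r'[+-]?')  # matches, but with zero length
missing_digit = check('abc', 0, r'\d')    # does not match at all

# Both values are falsy, so a plain truth test cannot tell them apart.
assert not optional_sign and not missing_digit
# An explicit identity test against None can.
assert optional_sign is not None
assert missing_digit is None
```

In practice this means writing `if result is None:` rather than `if not result:` whenever the pattern is allowed to match nothing.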
Return whether or not the scan pointer is immediately before a newline character (DOS/Unix/Mac aware) or at the end of the string.
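One way to read "DOS/Unix/Mac aware" for the line-boundary tests is sketched below as two standalone functions. This is an assumption about the intended behavior, not Scanner's code: in particular, the position between the `\r` and `\n` of a DOS line ending is treated here as neither the beginning nor the end of a line.

```python
def bol(text, pos):
    """True at the start of the text or just after a line terminator."""
    if pos == 0:
        return True
    prev = text[pos - 1]
    if prev == '\n':
        return True
    # A lone '\r' (classic Mac) ends a line, but the '\r' of a '\r\n' pair does not.
    return prev == '\r' and text[pos:pos + 1] != '\n'

def eol(text, pos):
    """True at the end of the text or just before a line terminator."""
    if pos >= len(text):
        return True
    cur = text[pos]
    if cur == '\r':
        return True
    # The '\n' of a '\r\n' pair is the middle of a terminator (assumption).
    return cur == '\n' and text[pos - 1:pos] != '\r'

sample = 'a\r\nb\nc\rd'  # DOS, Unix and classic Mac line endings in one string
print([i for i in range(len(sample) + 1) if bol(sample, i)])  # -> [0, 3, 5, 7]
print([i for i in range(len(sample) + 1) if eol(sample, i)])  # -> [1, 4, 6, 8]
```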
Return True iff we are at the end of the string, else False.
Return True if the given pattern matches ANYWHERE after the scan pointer. The scan pointer is not advanced.
Return the given number of characters from the current string pointer and consume them. If we reach the end of the stream, the empty string is returned.
Return as a tuple: (linenumber, bytenumber)
Return the last matching string. Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return the contents of the given group in the most recent match. This is a wrapper to re.MatchObject.group(). Raise IndexError if the match exists but the group does not. Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return a dict containing group_name => match. This is a wrapper to re.MatchObject.groupdict() and as such it only works for named groups. Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return the most recent match's groups. This is a wrapper to re.MatchObject.groups(). Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return the most recent match's MatchObject. This is what's returned by the re module. Use this if the other methods here don't expose what you need. Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return the length of the last matching string. This is equivalent to len(scanner.match()). Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return the offset into the string of the last match. Raise Exception if no match attempts have been recorded. Raise Exception if the most recent match failed.
Return True if the last match was successful, else False. Raise Exception if no match attempts have been recorded.
Return the given number of characters from the current string pointer without consuming them. If we reach the end of the stream, the empty string is returned
Return the string following the last match, or None. This is equivalent to: scanner.string[scanner.match_pos() + scanner.match_len():]. Raise Exception if no match attempts have been recorded.
Return the string preceding the last match, or None. This is equivalent to: scanner.string[:scanner.match_pos()]. Raise Exception if no match attempts have been recorded.
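The two slicing equivalences above can be checked against a plain `re` match object. This sketch uses only the standard library; the local variables `match_pos` and `match_len` merely stand in for the values the corresponding Scanner methods are documented to return.

```python
import re

source = 'name: value'
m = re.search(r':\s*', source)

match_pos = m.start()        # offset of the match
match_len = len(m.group())   # length of the matched string

pre = source[:match_pos]                # text preceding the match
post = source[match_pos + match_len:]   # text following the match
print(repr(pre), repr(m.group()), repr(post))  # -> 'name' ': ' 'value'
```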
Reset the scanner's state including string pointer and match history.
Return the string from the current pointer onwards, i.e. the segment of string which has not yet been consumed.
Return the length of string remaining. This is equivalent to len(rest())
Return a match for the pattern at the scan pointer and consume the string. Return None if no match is found.
Return all text up until the beginning of the first match for the pattern after the scan pointer. The pattern is not included in the match. The scan pointer will be moved such that it immediately precedes the pattern. Return None if no match is found.
Return the first match for the pattern after the scan pointer and consume the string up until the end of the match. Return None if no match is found.
Scan ahead over the given pattern and return how many characters were consumed, or None. Similar to scan, but does not return the string or record the match.
Skip the given number of bytes and return the number of bytes consumed.
Skip the given number of lines and return the number of lines consumed.
Scan ahead until the beginning of the first occurrence of the given pattern and return how many characters were skipped, or None if the match failed. The match is not recorded.
Scan ahead until the end of the first occurrence of the given pattern and return how many characters were consumed, or None if the match failed. The match is not recorded.
Skip over whitespace characters and return the number of characters consumed.
Arguments:
    n -- maximum number of characters to consume (default None)
    multiline -- whether or not to consume newline characters (default True)
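A rough sketch of how the `n` and `multiline` arguments might interact is shown below as a standalone function built on the standard `re` module. It is an illustration of the documented arguments, not Scanner's code. Treating "not multiline" as "any whitespace except `\r` and `\n`" is an assumption, as is returning the new position alongside the count.

```python
import re

def skip_whitespace(text, pos, n=None, multiline=True):
    """Return (new_pos, count) after skipping whitespace starting at pos."""
    # r'[^\S\r\n]' means "whitespace, but not a newline character".
    chars = r'\s' if multiline else r'[^\S\r\n]'
    quantifier = '*' if n is None else '{0,%d}' % n
    # The pattern can match zero characters, so match() never returns None.
    m = re.compile(chars + quantifier).match(text, pos)
    return m.end(), m.end() - pos

text = '  \t\n  x'
print(skip_whitespace(text, 0))                   # -> (6, 6)
print(skip_whitespace(text, 0, multiline=False))  # -> (3, 3)
print(skip_whitespace(text, 0, n=2))              # -> (2, 2)
```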
Set the string pointer to the end of the input and clear the match history.
Revert the scanner's state to that of the previous match. Only one previous state is remembered. Throw Exception if there is no previous known state to restore.
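The "only one previous state" rule can be illustrated with a very small class. `OneStepUndo` is a hypothetical name used purely for demonstration. It tracks just a position, whereas the real scanner presumably also restores its match data.

```python
class OneStepUndo:
    """Hypothetical pointer holder that remembers a single previous position."""

    def __init__(self):
        self.pos = 0
        self._previous = None  # at most one saved state

    def advance(self, count):
        self._previous = self.pos  # overwrite any older saved state
        self.pos += count

    def unscan(self):
        if self._previous is None:
            raise Exception('no previous state to restore')
        self.pos = self._previous
        self._previous = None  # a second unscan in a row has nothing to restore

pointer = OneStepUndo()
pointer.advance(3)
pointer.advance(2)
pointer.unscan()
print(pointer.pos)  # -> 3, only the most recent advance is undone
```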