String Scanning class for Python

About

Ruby has a nice class called StringScanner. It is a minimal abstraction for performing lexical analysis on a string. Python has no such class built in, so I ended up implementing something similar. It is not an exact port of StringScanner to Python, but it's pretty close.

Use case/When would I need this?

The use case is pretty broad. At the simple end, you want to validate that a string conforms to some specification, extract data from it, or both at once, but the task is pushing the bounds of what is sensible with a single regular expression. Perhaps you want to match the string in parts and do a bit of processing between each match, or perhaps your specification just isn't a regular language and needs multiple regular expressions with some logic in between them. Basically, if you're thinking about iterating over your string with a loop, Scanner will probably make your life a lot easier!

At the more complex end of things, Scanner is geared towards helping you tokenize programming languages or parse complex data representation formats (e.g. JSON, XML, YAML, Markdown).
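As a rough illustration of the tokenizing use case, the loop below splits a small arithmetic expression into tokens using only the standard re module. The token table and the tokenize function are invented for this sketch and are not part of Scanner; they only show the kind of hand-written loop that Scanner is meant to tidy up.

```python
import re

# Hypothetical token table for a tiny arithmetic language.
TOKEN_PATTERNS = [
    ('NUMBER', re.compile(r'\d+(?:\.\d+)?')),
    ('NAME',   re.compile(r'[A-Za-z_]\w*')),
    ('OP',     re.compile(r'[-+*/()=]')),
    ('SPACE',  re.compile(r'\s+')),
]

def tokenize(src):
    """Return a list of (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(src):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(src, pos)   # match only at the current position
            if m:
                if kind != 'SPACE':
                    tokens.append((kind, m.group(0)))
                pos = m.end()             # consume the matched text
                break
        else:
            # no pattern matched at this position
            raise ValueError('unexpected character %r at offset %d' % (src[pos], pos))
    return tokens

print(tokenize('x = 3.5 * (y + 2)'))
```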

Download

The code is hosted on GitHub:

Simple Examples

The idea is that a string is scanned left to right. At each point, we check whether the start of the remaining string matches some pattern. The string is consumed as matches succeed. We can also look ahead into the string without consuming it.
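The scanning model can be sketched in a few lines using only the standard re module. MiniScanner below is a hypothetical stand-in written for illustration, not the library's actual implementation; only the behaviour of scan (match and consume) and check (look ahead) follows the description here.

```python
import re

class MiniScanner:
    """Hypothetical stand-in for illustration; not the real Scanner class."""

    def __init__(self, src):
        self.string = src
        self.pos = 0  # scan pointer: everything before it has been consumed

    def check(self, pattern):
        # Look ahead: match at the scan pointer without moving it.
        m = re.compile(pattern).match(self.string, self.pos)
        return m.group(0) if m else None

    def scan(self, pattern):
        # Match at the scan pointer and, on success, consume the match.
        m = re.compile(pattern).match(self.string, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(0)


s = MiniScanner('key = 42')
name = s.scan(r'\w+')       # consumes 'key'
s.scan(r'\s*=\s*')          # consumes ' = '
peeked = s.check(r'\d+')    # sees '42' but leaves the pointer alone
value = s.scan(r'\d+')      # now consumes '42'
print(name, peeked, value, s.pos == len(s.string))
```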

Both of the examples below (parsing a date and validating a number) are pretty trivial to do with only one regular expression; for the sake of demonstration, pretend that's not the case. If you want to see more complex examples, see the JSON parser and/or JavaScript tokenizer, which are included with the code.

Example 1: Parsing a date in some known format (D[D]/M[M]/YY[YY] HH:mm:ss)

scanner = Scanner('09/3/2011 12:15:45')
day = scanner.scan(r'\d+')     # -> '09'
scanner.get()                  # -> '/'
month = scanner.scan(r'\d+')   # -> '3'
scanner.get()                  # -> '/'
year = scanner.scan(r'\d+')    # -> '2011'
scanner.scan(r'\s+')           # -> ' '

hour = scanner.scan(r'\d+')    # -> '12'
scanner.get()                  # -> ':'
minute = scanner.scan(r'\d+')  # -> '15'
scanner.get()                  # -> ':'
second = scanner.scan(r'\d+')  # -> '45'

# print the values as one tuple (the double parentheses behave the same in Python 2 and 3)
print((day, month, year, hour, minute, second)) # -> ('09', '3', '2011', '12', '15', '45')

Example 2: Validating that a numeric expression is a valid base-10 number (allow leading +-, disallow leading 0s unless preceding a point, don't require any numbers before the point, allow exponent and +-, require numbers after a point, disallow anything else)

import re

def is_number(num):
    s = Scanner(num)
    s.scan(r'[+\-]?')  # consume +- but don't require it
    # consume everything up to the point, if it exists
    if not s.scan(r'\.|0\.|[1-9]\d*\.?|0$'):
        return False
    has_point = '.' in s.match()
    # if the number is just a straight integer, we might have reached the end here
    if s.eos() and not has_point:
        return True

    # we now expect to see more numbers or an exponent
    s.scan(r'\d+')
    # if we're following a point, we need that to have matched
    if has_point and not s.matched():
        return False

    # now consume the exponent and if it's present, require trailing numbers
    if s.scan(r'[e][+\-]?', re.I):
        if s.scan(r'\d+') is None:
            return False

    # we should now be at the end of the string
    return s.eos()

>>> is_number('0')
True  
>>> is_number('0.0') 
True
>>> is_number('0.000')
True
>>> is_number('0.000e100')
True
>>> is_number('.123e-100')
True
>>> is_number('3.14159')
True
>>> is_number('i')
False
>>> is_number('0xFACE')
False
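The same checks can be reproduced with re alone, which may help when experimenting without the library to hand. The sketch below approximates the function above rather than replacing it: is_number_re is a name made up here, and it tracks the position by hand where Scanner would track it for you.

```python
import re

def is_number_re(num):
    pos = 0

    # optional sign (this pattern always matches, possibly with zero length)
    m = re.compile(r'[+\-]?').match(num, pos)
    pos = m.end()

    # integer part, optionally ending in a point
    m = re.compile(r'\.|0\.|[1-9]\d*\.?|0$').match(num, pos)
    if m is None:
        return False
    pos = m.end()
    has_point = '.' in m.group(0)
    if pos == len(num) and not has_point:
        return True

    # digits after the point (required if there was a point)
    m = re.compile(r'\d+').match(num, pos)
    if m:
        pos = m.end()
    elif has_point:
        return False

    # optional exponent, which must be followed by digits
    m = re.compile(r'e[+\-]?', re.I).match(num, pos)
    if m:
        pos = m.end()
        m = re.compile(r'\d+').match(num, pos)
        if m is None:
            return False
        pos = m.end()

    # anything left over makes the input invalid
    return pos == len(num)
```

Every step repeats the same "match at pos, then advance pos" bookkeeping, which is the part Scanner takes over.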

API docs

class Scanner

About

A simple class to aid in lexical analysis of a string.
Styled after, but not entirely the same as, Ruby's StringScanner class.

Basic philosophy is simple: Scanner traverses a string left to right,
consuming the string as it goes. The current position is the string pointer
Scanner.pos. At each iteration, the caller uses the scanning methods to 
determine what the current piece of string actually is.

Scanning methods:
  With the exception of get and peek, all scanning methods take a pattern and
  (optionally) flags (e.g. re.X). The patterns are assumed to be either 
  strings or compiled regular expression objects (i.e. the result of 
  re.compile, or equivalent). If a pattern is not a string but does not 
  implement match or search (whichever is being used), a ValueError is raised.
  String patterns are compiled and cached internally.
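One plausible way to accept both kinds of pattern is sketched below. This is a guess at the general approach, not the library's code: the _cache dict and the to_regex helper are names invented here.

```python
import re

_cache = {}  # hypothetical cache: (pattern string, flags) -> compiled regex

def to_regex(pattern, flags=0, method='match'):
    """Return an object usable as a compiled pattern, or raise ValueError."""
    if isinstance(pattern, str):
        key = (pattern, flags)
        if key not in _cache:
            _cache[key] = re.compile(pattern, flags)  # compile once, reuse later
        return _cache[key]
    # Not a string: accept anything that offers the method we need.
    if not callable(getattr(pattern, method, None)):
        raise ValueError('pattern must be a string or provide %s()' % method)
    return pattern
```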

  The check, scan and skip methods all try to match *at* the current scan
  pointer. check_to, scan_to and skip_to all try to find a match somewhere
  beyond the scan pointer and jump *to* that position. check_until, scan_until,
  and skip_until are like *_to, but also consume the match (so they jump to
  the *end* of the match).
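In terms of the re module, the three families differ roughly as follows. The function names mirror the check variants, but this is only a sketch of the described behaviour written against re directly; the library may implement it differently.

```python
import re

def check(string, pos, pattern):
    # Match *at* pos: returns the matched text.
    m = re.compile(pattern).match(string, pos)
    return m.group(0) if m else None

def check_to(string, pos, pattern):
    # Search beyond pos: returns the text *before* the match.
    m = re.compile(pattern).search(string, pos)
    return string[pos:m.start()] if m else None

def check_until(string, pos, pattern):
    # Search beyond pos: returns the text up to the *end* of the match.
    m = re.compile(pattern).search(string, pos)
    return string[pos:m.end()] if m else None

text = 'name: value'
print(check(text, 0, r':'))        # None, because ':' is not at offset 0
print(check_to(text, 0, r':'))     # name
print(check_until(text, 0, r':'))  # name:
```

The scan and skip variants would additionally move the scan pointer to the corresponding offset.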

  Lookahead:
    check()
    check_to()
    check_until()
    peek()

  Consume:
    get()
    scan()
    scan_to()
    scan_until()      
    skip()
    skip_to()
    skip_until()
    skip_bytes()      (convenience wrapper)
    skip_lines()      (convenience wrapper)
    skip_whitespace() (convenience wrapper)

  Note that scan* and check* both return either a string, in the case of a
  match, or None, in the case of no match. If the match exists but is zero
  length, the empty string is returned. Be careful handling this as both 
  None and the empty string evaluate to False, but mean very different things.

  peek and get also return the empty string when the end of the stream is 
  reached.
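The difference matters most with patterns that can match nothing, such as an optional sign. The sketch uses re directly, with a hypothetical check helper standing in for the method of the same name:

```python
import re

def check(string, pos, pattern):
    # Hypothetical stand-in: matched text at pos, or None if there is no match.
    m = re.compile(pattern).match(string, pos)
    return m.group(0) if m else None

sign = check('42', 0, r'[+\-]?')    # matches, but with zero length
letter = check('42', 0, r'[a-z]')   # does not match at all

print(repr(sign), repr(letter))     # '' None

# Both values are falsy, so compare against None explicitly.
print(not sign, not letter)         # True True
print(sign is not None)             # True: the optional sign did match
print(letter is not None)           # False: no match
```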


Most recent match data:

  matched() -- True/False - was the most recent match a success?

  The following methods all raise Exception if not matched()

  match() -- matched string
  match_len() -- matched string length
  match_pos() -- offset of match

  Wrappers around re.*
  match_info()  -- the re.MatchObject
  match_group()
  match_groups()
  match_groupdict()

  pre_match() -- string preceding the match
  post_match() -- string following the match

Misc:
  pos -- get/set current scan pointer position

  bol() -- beginning of line? (DOS/Unix/Mac aware)
  eol() -- end of line? (DOS/Unix/Mac aware)
  eos() -- end of string?
  rest() -- remaining (unconsumed) string
  rest_len() -- length of remaining string
  unscan() -- revert to previous state

Setup:
  string -- get/set current source string

  reset() -- reset the scanner ready to start again
  terminate() -- trigger premature finish  

Properties

pos
The current string pointer position.
string
The source string

Methods

__init__(src=None)
Constructor 

Arguments:
src -- a string to scan. This can be set later via the string property.
bol()
Return whether or not the scan pointer is immediately after a newline
character (DOS/Unix/Mac aware), or at the start of the string. 
check(pattern, flags=0)
Return a match for the pattern (or None) at the scan pointer without 
actually consuming the string
If the pattern matched but was zero length, the empty string is returned
If the pattern did not match, None is returned
check_to(pattern, flags=0)
Return all text up until the beginning of the first match for the pattern 
after the scan pointer without consuming the string
If the pattern matched but was zero length, the empty string is returned
If the pattern did not match, None is returned
check_until(pattern, flags=0)
Return all text up until the end of the first match for the pattern 
after the scan pointer without consuming the string
If the pattern matched but was zero length, the empty string is returned
If the pattern did not match, None is returned
eol()
Return whether or not the scan pointer is immediately before a newline 
character (DOS/Unix/Mac aware) or at the end of the string.
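A sketch of what line-ending-aware bol() and eol() checks could look like, written as plain functions over a string and an offset. The treatment of an offset that falls between the '\r' and '\n' of a DOS line ending is an assumption made here; the source does not say how the library handles that case.

```python
def bol(string, pos):
    # True at the start of the string, or just after a line terminator.
    if pos == 0:
        return True
    prev = string[pos - 1]
    if prev == '\n':
        return True
    if prev == '\r':
        # Assumption: between '\r' and '\n' we are still inside the terminator.
        return string[pos:pos + 1] != '\n'
    return False

def eol(string, pos):
    # True at the end of the string, or just before '\r' or '\n'.
    if pos >= len(string):
        return True
    return string[pos] in '\r\n'

text = 'a\r\nb\nc\rd'   # DOS, Unix and old Mac line endings in one string
print([i for i in range(len(text) + 1) if bol(text, i)])
print([i for i in range(len(text) + 1) if eol(text, i)])
```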
eos()
Return True iff we are at the end of the string, else False.
exists(pattern, flags=0)
Return True if the given pattern matches ANYWHERE after the scan 
pointer. Don't advance the scan pointer
get(length=1)
Return the given number of characters from the current string pointer 
and consume them
If we reach the end of the stream, the empty string is returned
location()
Return the current location as a tuple: (linenumber, bytenumber) 
match()
Return the last matching string
Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed
match_group(*args)
Return the contents of the given group in the most recent match.
This is a wrapper to re.MatchObject.group()
raise IndexError if the match exists but the group does not
raise Exception if no match attempts have been recorded
raise Exception if most recent match failed
match_groupdict(default=None)
Return a dict containing group_name => match. This is a wrapper to
re.MatchObject.groupdict() and as such it only works for named groups

Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed
match_groups(default=None)
Return the most recent match's groups. This is a wrapper to 
re.MatchObject.groups()

Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed
match_info()
Return the most recent match's MatchObject. This is what's returned by
the re module. Use this if the other methods here don't expose what you 
need.
Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed
match_len()
Return the length of the last matching string
This is equivalent to len(scanner.match()).

Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed    
match_pos()
Return the offset into the string of the last match
Raise Exception if no match attempts have been recorded.
Raise Exception if most recent match failed    
matched()
Return True if the last match was successful, else False.
Raise Exception if no match attempts have been recorded.
peek(length=1)
Return the given number of characters from the current string pointer
without consuming them.
If we reach the end of the stream, the empty string is returned
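Both peek and get boil down to slicing, which is also why the empty string falls out naturally at the end of the input. The functions below are a sketch with the pointer passed in and returned explicitly; in the real class the pointer lives on the scanner object.

```python
def peek(string, pos, length=1):
    # Look at the next characters without moving the pointer.
    return string[pos:pos + length]

def get(string, pos, length=1):
    # Return the next characters together with the advanced pointer.
    text = string[pos:pos + length]
    return text, pos + len(text)

s = 'ab'
print(repr(peek(s, 0)))   # 'a'
print(get(s, 0))          # ('a', 1)
print(repr(peek(s, 2)))   # '' because we are at the end of the string
print(get(s, 2))          # ('', 2)
```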
post_match()
Return the string following the last match or None. This is equivalent 
to:  scanner.string[scanner.match_pos() + scanner.match_len() : ]

raise Exception if no match attempts have been recorded
pre_match()
Return the string preceding the last match or None. This is equivalent 
to:  scanner.string[:scanner.match_pos()]

raise Exception if no match attempts have been recorded
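The two slicing equivalences given for pre_match() and post_match() can be tried out on a plain re match object. The variable names below echo the Scanner methods but are ordinary local variables in this sketch.

```python
import re

string = 'width=640;height=480'
m = re.compile(r'\d+').search(string)

match_pos = m.start()            # offset of the match
match_len = len(m.group(0))      # length of the matched text

pre = string[:match_pos]                 # text preceding the match
post = string[match_pos + match_len:]    # text following the match

print(repr(pre), repr(m.group(0)), repr(post))
```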
reset()
Reset the scanner's state including string pointer and match history.
rest()
Return the string from the current pointer onwards, i.e. the segment of 
string which has not yet been consumed.
rest_len()
Return the length of string remaining. 
This is equivalent to len(rest())
scan(pattern, flags=0)
Return a match for the pattern at the scan pointer and consume the 
string.
Return None if no match is found
scan_to(pattern, flags=0)
Return all text up until the beginning of the first match for the pattern
after the scan pointer.
The pattern is not included in the match.
The scan pointer will be moved such that it immediately precedes the pattern
Return None if no match is found
scan_until(pattern, flags=0)
Return the first match for the pattern after the scan pointer and 
consume the string up until the end of the match.
Return None if no match is found
skip(pattern, flags=0)
Scan ahead over the given pattern and return how many characters were
consumed, or None.
Similar to scan, but does not return the string or record the match 
skip_bytes(n)
Skip the given number of bytes and return the number of bytes consumed
skip_lines(n=1)
Skip the given number of lines and return the number of lines consumed 
skip_to(pattern, flags=0)
Scan ahead until the beginning of the first occurrence of the given pattern
and return how many characters were skipped, or None if the match
failed
The match is not recorded.
skip_until(pattern, flags=0)
Scan ahead until the end of the first occurrence of the given pattern and 
return how many characters were consumed, or None if the match failed
The match is not recorded
skip_whitespace(n=None, multiline=True)
Skip over whitespace characters and return the number of characters 
consumed

Arguments: 
n -- maximum number of characters to consume (default None)
multiline -- whether or not to consume newline characters (default True)
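A sketch of how the two arguments might translate into a regular expression. It is written as a free function that returns the new position alongside the count, which is an assumption of this example; the method itself returns only the number of characters consumed.

```python
import re

def skip_whitespace(string, pos, n=None, multiline=True):
    """Return (new_pos, count) after skipping whitespace starting at pos."""
    # With multiline off, exclude '\r' and '\n' from the whitespace class.
    chars = r'\s' if multiline else r'[^\S\r\n]'
    # With a limit, cap the repetition at n characters.
    quantifier = '*' if n is None else '{0,%d}' % n
    m = re.compile(chars + quantifier).match(string, pos)  # always matches
    return m.end(), m.end() - pos

text = '  \n x'
print(skip_whitespace(text, 0))                   # (4, 4)
print(skip_whitespace(text, 0, multiline=False))  # (2, 2)
print(skip_whitespace(text, 0, n=1))              # (1, 1)
```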
terminate()
Set the string pointer to the end of the input and clear the match 
history.
unscan()
Revert the scanner's state to that of the previous match. Only one 
previous state is remembered
Raise Exception if there is no previous known state to restore
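Single-level undo of this kind can be sketched as follows. UndoScanner is a hypothetical class written for illustration, and it assumes that only successful scans save a state to return to; the source does not say whether failed matches do so as well.

```python
import re

class UndoScanner:
    """Hypothetical sketch of single-level undo; not the real Scanner."""

    def __init__(self, src):
        self.string = src
        self.pos = 0
        self._previous = None   # only one earlier position is kept

    def scan(self, pattern):
        m = re.compile(pattern).match(self.string, self.pos)
        if m is None:
            return None
        self._previous = self.pos   # remember where we were
        self.pos = m.end()
        return m.group(0)

    def unscan(self):
        if self._previous is None:
            raise Exception('no previous state to restore')
        self.pos = self._previous
        self._previous = None       # a second unscan in a row fails


u = UndoScanner('abc123')
u.scan(r'[a-z]+')   # pos is now 3
u.scan(r'\d+')      # pos is now 6
u.unscan()          # back to 3; the scan before that cannot be undone
print(u.pos)
```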
Last edited by mark at 23:23:35 09/03/11 -- "create page"
