BeautifulSoup.BeautifulSoup Class Reference
Inheritance diagram for BeautifulSoup.BeautifulSoup:
[Inheritance diagram; classes shown: BeautifulSoup.BeautifulStoneSoup, BeautifulSoup.Tag, BeautifulSoup.PageElement, BeautifulSoup.ICantBelieveItsBeautifulSoup, BeautifulSoup.MinimalSoup, BeautifulSoup.RobustHTMLParser, BeautifulSoup.RobustWackAssHTMLParser, BeautifulSoup.RobustInsanelyWackAssHTMLParser]

Public Member Functions

def __init__

def extractCharsetFromMeta

- Public Member Functions inherited from BeautifulSoup.BeautifulStoneSoup
def __init__

def endData

def extractCharsetFromMeta

def handle_data

def isSelfClosingTag

def popTag

def pushTag

def reset

def unknown_endtag

def unknown_starttag

- Public Member Functions inherited from BeautifulSoup.PageElement
def append

def extract

def findAllNext

def findAllPrevious

def findNext

def findNextSibling

def findNextSiblings

def findParent

def findParents

def findPrevious

def findPreviousSibling

def findPreviousSiblings

def insert

def nextGenerator

def nextSiblingGenerator

def parentGenerator

def previousGenerator

def previousSiblingGenerator

def replaceWith

def setup

def substituteEncoding

def toEncoding


Public Attributes

 declaredHTMLEncoding
 
 originalEncoding
 
- Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 builder
 
 convertEntities
 
 convertHTMLEntities
 
 convertXMLEntities
 
 currentData
 
 currentTag
 
 declaredHTMLEncoding
 
 escapeUnrecognizedEntities
 
 fromEncoding
 
 hidden
 
 instanceSelfClosingTags
 
 literal
 
 markup
 
 markupMassage
 
 originalEncoding
 
 parseOnlyThese
 
 previous
 
 quoteStack
 
 smartQuotesTo
 
 tagStack
 
- Public Attributes inherited from BeautifulSoup.PageElement
 next
 
 nextSibling
 
 parent
 
 previous
 
 previousSibling
 

Static Public Attributes

tuple CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
 
list NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
 
list NESTABLE_INLINE_TAGS
 
dictionary NESTABLE_LIST_TAGS
 
dictionary NESTABLE_TABLE_TAGS
 
tuple NESTABLE_TAGS
 
list NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
 
tuple PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
 
dictionary QUOTE_TAGS = {'script' : None, 'textarea' : None}
 
tuple RESET_NESTING_TAGS
 
tuple SELF_CLOSING_TAGS
 
- Static Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 ALL_ENTITIES = XHTML_ENTITIES
 
string HTML_ENTITIES = "html"
 
list MARKUP_MASSAGE
 
dictionary NESTABLE_TAGS = {}
 
list PRESERVE_WHITESPACE_TAGS = []
 
dictionary QUOTE_TAGS = {}
 
dictionary RESET_NESTING_TAGS = {}
 
string ROOT_TAG_NAME = u'[document]'
 
dictionary SELF_CLOSING_TAGS = {}
 
dictionary STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
 
string XHTML_ENTITIES = "xhtml"
 
string XML_ENTITIES = "xml"
 
- Static Public Attributes inherited from BeautifulSoup.PageElement
 fetchNextSiblings = findNextSiblings
 
 fetchParents = findParents
 
 fetchPrevious = findAllPrevious
 
 fetchPreviousSiblings = findPreviousSiblings
 

Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.

Definition at line 1447 of file BeautifulSoup.py.
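
The nesting rules described above can be tried out with a small model. The sketch below is not the BeautifulSoup parser: it assumes Python 3 and uses the standard library's html.parser only as a tokenizer, and the names ImplicitCloser, close_implicitly, NESTABLE and RESET_NESTING are illustrative. The two rule tables are trimmed subsets of the class attributes listed on this page.

```python
from html.parser import HTMLParser

# Assumed, trimmed rule tables. A tag maps to the tags that reset its nesting.
NESTABLE = {"blockquote": [], "div": [], "table": [], "tr": ["table"], "td": ["tr"]}
RESET_NESTING = {"blockquote", "div", "p", "table", "tr", "td"}


class ImplicitCloser(HTMLParser):
    """Re-emit markup, inserting the end tags implied by the nesting rules."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of currently open tags
        self.out = []    # pieces of the rewritten markup

    def handle_starttag(self, tag, attrs):
        self._smart_pop(tag)
        self.stack.append(tag)
        self.out.append("<%s>" % tag)  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._pop_to(tag, inclusive=True)

    def handle_data(self, data):
        self.out.append(data)

    def _pop_to(self, tag, inclusive):
        # Close open tags down to `tag`; close `tag` itself only if inclusive.
        while self.stack and self.stack[-1] != tag:
            self.out.append("</%s>" % self.stack.pop())
        if inclusive and self.stack:
            self.out.append("</%s>" % self.stack.pop())

    def _smart_pop(self, tag):
        triggers = NESTABLE.get(tag)  # None means the tag is not nestable
        for open_tag in reversed(self.stack):
            if triggers is None and open_tag == tag:
                # Non-nestable: a new <p> closes the previous <p>.
                self._pop_to(open_tag, inclusive=True)
                return
            if triggers is not None and open_tag in triggers:
                # Nestable with reset: a new <tr> closes back to its <table>.
                self._pop_to(open_tag, inclusive=False)
                return
            if triggers is None and tag in RESET_NESTING and open_tag in RESET_NESTING:
                self._pop_to(open_tag, inclusive=False)
                return


def close_implicitly(markup):
    parser = ImplicitCloser()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(close_implicitly("<p>Para1<p>Para2"))
print(close_implicitly("<table><tr>Blah<tr>Blah"))
```

Tags that are still open at the end of the input are left open, which matches the way the examples above are written.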

Constructor & Destructor Documentation

def BeautifulSoup.BeautifulSoup.__init__ (self, *args, **kwargs)

Definition at line 1495 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES.

Referenced by BeautifulSoup.BeautifulSoup.__init__().

    def __init__(self, *args, **kwargs):
        if not kwargs.has_key('smartQuotesTo'):
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
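
The constructor fills in a default for smartQuotesTo only when the caller did not pass one, forces isHTML to True, and then delegates to the parent class. The listing is Python 2 code; dict.has_key no longer exists in Python 3. As a rough Python 3 sketch of the same pattern, with BaseParser and HtmlParser as hypothetical stand-ins rather than the real classes:

```python
class BaseParser(object):
    """Hypothetical stand-in for the parent parser class."""

    def __init__(self, markup="", **kwargs):
        self.markup = markup
        self.options = kwargs


class HtmlParser(BaseParser):
    HTML_ENTITIES = "html"  # same value as the HTML_ENTITIES attribute on this page

    def __init__(self, *args, **kwargs):
        # Only set the default when the caller did not supply the key at all.
        kwargs.setdefault("smartQuotesTo", self.HTML_ENTITIES)
        kwargs["isHTML"] = True  # always overridden, whatever the caller passed
        BaseParser.__init__(self, *args, **kwargs)


print(HtmlParser("<p>").options)
print(HtmlParser("<p>", smartQuotesTo=None).options)
```

An explicit smartQuotesTo=None is kept as given, because the default is applied on key absence, not on a falsy value.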

Member Function Documentation

def BeautifulSoup.BeautifulSoup.extractCharsetFromMeta (self, attrs)
Beautiful Soup can detect a charset included in a META tag,
try to convert the document to that charset, and re-parse the
document from the beginning.

Definition at line 1553 of file BeautifulSoup.py.

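References BeautifulSoup.BeautifulStoneSoup._feed(), BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding, BeautifulSoup.BeautifulStoneSoup.fromEncoding, BeautifulSoup.BeautifulStoneSoup.originalEncoding, and BeautifulSoup.BeautifulStoneSoup.unknown_starttag().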

Referenced by BeautifulSoup.BeautifulSoup.extractCharsetFromMeta().

    def extractCharsetFromMeta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
                    pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True

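
The first half of the method (scan the attribute list, then pull a charset out of the content value) can be isolated as a plain function. This is a minimal sketch for Python 3, not part of the library: declared_charset is a hypothetical helper, and it leaves out the re-parse and meta-rewrite steps that depend on parser state.

```python
import re

# Same pattern as CHARSET_RE on this page, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def declared_charset(attrs):
    """Return the charset named by a meta tag's attributes, or None.

    `attrs` is a list of (name, value) pairs, as in the listing above.
    """
    http_equiv = None
    content = None
    for key, value in attrs:
        key = key.lower()  # attribute names are compared case-insensitively
        if key == "http-equiv":
            http_equiv = value
        elif key == "content":
            content = value

    # Only a meta tag with both attributes is treated as interesting.
    if http_equiv and content:
        match = CHARSET_RE.search(content)
        if match:
            return match.group(3) or None
    return None


attrs = [("HTTP-EQUIV", "Content-Type"), ("content", "text/html; charset=utf-8")]
print(declared_charset(attrs))
```

A meta tag without http-equiv, such as a description tag, yields None even if its content happens to contain a charset parameter.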

Member Data Documentation

tuple BeautifulSoup.BeautifulSoup.CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
static

Definition at line 1551 of file BeautifulSoup.py.
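
To see what the three groups of this pattern capture, the expression can be run on its own. The sketch below assumes Python 3 and the standard re module; rewrite_charset is a hypothetical helper that mirrors the substitution step of extractCharsetFromMeta with the same placeholder string.

```python
import re

CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def rewrite_charset(content, placeholder="%SOUP-ENCODING%"):
    """Replace the charset value, keeping the '; charset=' prefix (group 1)."""
    return CHARSET_RE.sub(lambda m: m.group(1) + placeholder, content)


match = CHARSET_RE.search("text/html; charset=ISO-8859-1")
print(match.group(1))  # the prefix, including the separator and whitespace
print(match.group(3))  # the charset value, up to the next ';' or end of string
print(rewrite_charset("text/html; charset=ISO-8859-1"))

# The pattern is compiled without re.I, so the parameter name must be lowercase.
print(CHARSET_RE.search("text/html; Charset=utf-8"))
```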

BeautifulSoup.BeautifulSoup.declaredHTMLEncoding

Definition at line 1592 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._detectEncoding().

list BeautifulSoup.BeautifulSoup.NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
static

Definition at line 1518 of file BeautifulSoup.py.

list BeautifulSoup.BeautifulSoup.NESTABLE_INLINE_TAGS
static
Initial value:
= ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
   'center']

Definition at line 1512 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.NESTABLE_LIST_TAGS
static
Initial value:
= { 'ol' : [],
    'ul' : [],
    'li' : ['ul', 'ol'],
    'dl' : [],
    'dd' : ['dl'],
    'dt' : ['dl'] }

Definition at line 1521 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS
static
Initial value:
= {'table' : [],
   'tr' : ['table', 'tbody', 'tfoot', 'thead'],
   'td' : ['tr'],
   'th' : ['tr'],
   'thead' : ['table'],
   'tbody' : ['table'],
   'tfoot' : ['table'],
   }

Definition at line 1529 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.NESTABLE_TAGS
static
Initial value:
= buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
              NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

Definition at line 1547 of file BeautifulSoup.py.
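
The body of buildTagMap is not shown on this page; only its calls are. The calls mix lists, dictionaries and a single string, so a plausible reading is that it merges all of them into one dictionary, mapping bare names to the default given as the first argument. The sketch below is written under that assumption for Python 3; build_tag_map_sketch is a hypothetical name and the inputs are shortened.

```python
def build_tag_map_sketch(default, *portions):
    """Merge dicts, lists and single tag names into one dict (assumed behaviour)."""
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)      # a map: keep its own values
        elif isinstance(portion, (list, tuple, set)):
            for name in portion:       # a list: map each name to the default
                built[name] = default
        else:
            built[portion] = default   # a scalar such as 'noscript'
    return built


SELF_CLOSING = build_tag_map_sketch(None, ['br', 'hr', 'input', 'img', 'meta'])
NESTABLE = build_tag_map_sketch([], ['blockquote', 'div'],
                                {'tr': ['table'], 'td': ['tr']})
RESET = build_tag_map_sketch(None, ['blockquote', 'div'], 'noscript',
                             ['address', 'form', 'p', 'pre'])

print('br' in SELF_CLOSING, 'p' in SELF_CLOSING)
print(NESTABLE['div'], NESTABLE['tr'])
print(sorted(RESET))
```

Strings are handled as scalars explicitly here, because a Python 3 str is iterable and would otherwise be split into single characters.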

list BeautifulSoup.BeautifulSoup.NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
static

Definition at line 1538 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.originalEncoding

Definition at line 1575 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._convertFrom().

tuple BeautifulSoup.BeautifulSoup.PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
static

Definition at line 1505 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.QUOTE_TAGS = {'script' : None, 'textarea' : None}
static

Definition at line 1507 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.RESET_NESTING_TAGS
static
Initial value:
= buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
              NON_NESTABLE_BLOCK_TAGS,
              NESTABLE_LIST_TAGS,
              NESTABLE_TABLE_TAGS)

Definition at line 1542 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.SELF_CLOSING_TAGS
static
Initial value:
= buildTagMap(None,
              ['br' , 'hr', 'input', 'img', 'meta',
               'spacer', 'link', 'frame', 'base'])

Definition at line 1501 of file BeautifulSoup.py.