BeautifulSoup.BeautifulSoup Class Reference
Inheritance diagram for BeautifulSoup.BeautifulSoup (classes shown): BeautifulSoup.BeautifulStoneSoup, BeautifulSoup.Tag, BeautifulSoup.PageElement, BeautifulSoup.ICantBelieveItsBeautifulSoup, BeautifulSoup.MinimalSoup, BeautifulSoup.RobustHTMLParser, BeautifulSoup.RobustWackAssHTMLParser, BeautifulSoup.RobustInsanelyWackAssHTMLParser

Public Member Functions

def __init__
 
def start_meta
 
- Public Member Functions inherited from BeautifulSoup.BeautifulStoneSoup
def __getattr__
 
def __init__
 
def convert_charref
 
def endData
 
def handle_charref
 
def handle_comment
 
def handle_data
 
def handle_decl
 
def handle_entityref
 
def handle_pi
 
def isSelfClosingTag
 
def parse_declaration
 
def popTag
 
def pushTag
 
def reset
 
def unknown_endtag
 
def unknown_starttag
 
- Public Member Functions inherited from BeautifulSoup.Tag
def __call__
 
def __contains__
 
def __delitem__
 
def __eq__
 
def __getattr__
 
def __getitem__
 
def __init__
 
def __iter__
 
def __len__
 
def __ne__
 
def __nonzero__
 
def __repr__
 
def __setitem__
 
def __str__
 
def __unicode__
 
def childGenerator
 
def clear
 
def decompose
 
def fetchText
 
def find
 
def findAll
 
def firstText
 
def get
 
def getString
 
def getText
 
def has_key
 
def index
 
def prettify
 
def recursiveChildGenerator
 
def renderContents
 
def setString
 

Public Attributes

 declaredHTMLEncoding
 
 originalEncoding
 
- Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 convertEntities
 
 convertHTMLEntities
 
 convertXMLEntities
 
 currentData
 
 currentTag
 
 declaredHTMLEncoding
 
 escapeUnrecognizedEntities
 
 fromEncoding
 
 hidden
 
 instanceSelfClosingTags
 
 literal
 
 markup
 
 markupMassage
 
 originalEncoding
 
 parseOnlyThese
 
 previous
 
 quoteStack
 
 smartQuotesTo
 
 tagStack
 
- Public Attributes inherited from BeautifulSoup.Tag
 attrMap
 
 attrs
 
 containsSubstitutions
 
 contents
 
 convertHTMLEntities
 
 convertXMLEntities
 
 escapeUnrecognizedEntities
 
 hidden
 
 isSelfClosing
 
 name
 
 parserClass
 

Static Public Attributes

tuple CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
 
tuple NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
 
tuple NESTABLE_INLINE_TAGS
 
dictionary NESTABLE_LIST_TAGS
 
dictionary NESTABLE_TABLE_TAGS
 
tuple NESTABLE_TAGS
 
tuple NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
 
tuple PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
 
dictionary QUOTE_TAGS = {'script' : None, 'textarea' : None}
 
tuple RESET_NESTING_TAGS
 
tuple SELF_CLOSING_TAGS
 
- Static Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 ALL_ENTITIES = XHTML_ENTITIES
 
string HTML_ENTITIES = "html"
 
list MARKUP_MASSAGE
 
dictionary NESTABLE_TAGS = {}
 
list PRESERVE_WHITESPACE_TAGS = []
 
dictionary QUOTE_TAGS = {}
 
dictionary RESET_NESTING_TAGS = {}
 
string ROOT_TAG_NAME = u'[document]'
 
dictionary SELF_CLOSING_TAGS = {}
 
dictionary STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
 
string XHTML_ENTITIES = "xhtml"
 
string XML_ENTITIES = "xml"
 
- Static Public Attributes inherited from BeautifulSoup.Tag
 fetch = findAll
 
 findChild = find
 
 findChildren = findAll
 
 first = find
 

Additional Inherited Members

- Properties inherited from BeautifulSoup.Tag
 string = property(getString, setString)
 
 text = property(getText)
 

Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup does not
treat a tag as nestable but your page author does, try
ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup
before writing your own subclass.
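
The nesting rules above can be illustrated with a small standalone sketch. It is not the parser's implementation: the function name `implicitly_closed`, the trimmed tag tables, and the use of a plain list of tag names as the open-tag stack are all assumptions made for illustration, loosely modeled on the NESTABLE_TAGS and RESET_NESTING_TAGS attributes listed on this page.

```python
# Hypothetical, trimmed-down tag tables (assumption: not the full library tables).
# A tag listed here may nest; its value names the ancestors that reset its nesting.
NESTABLE = {
    "blockquote": [],
    "div": [],
    "table": [],
    "tr": ["table", "tbody", "tfoot", "thead"],
    "td": ["tr"],
}
# Tags that shield an outer non-nestable tag from being closed.
RESET_NESTING = {"blockquote", "div", "p", "table", "tr", "td"}


def implicitly_closed(open_tags, new_tag):
    """Return the open tags (innermost first) that new_tag closes implicitly."""
    triggers = NESTABLE.get(new_tag)
    nestable = triggers is not None
    for depth in range(len(open_tags) - 1, -1, -1):
        current = open_tags[depth]
        if current == new_tag and not nestable:
            # Non-nestable tag: close everything through its previous occurrence.
            return open_tags[depth:][::-1]
        if nestable:
            resets = current in triggers
        else:
            resets = new_tag in RESET_NESTING and current in RESET_NESTING
        if resets:
            # Reset point: close everything inside it, but not the reset tag itself.
            return open_tags[depth + 1:][::-1]
    return []


print(implicitly_closed(["p"], "p"))                    # <p>Para1<p>Para2
print(implicitly_closed(["blockquote"], "blockquote"))  # nested blockquotes
print(implicitly_closed(["table", "tr"], "tr"))         # second row, same table
print(implicitly_closed(["tr", "table"], "tr"))         # first row of an inner table
```

The four calls mirror the four markup examples in the description: only the repeated `<p>` and the second `<tr>` in the same table close an earlier tag.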

Definition at line 1470 of file BeautifulSoup.py.

Constructor & Destructor Documentation

def BeautifulSoup.BeautifulSoup.__init__(self, *args, **kwargs)

Definition at line 1518 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES.

    def __init__(self, *args, **kwargs):
        if not kwargs.has_key('smartQuotesTo'):
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
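
The constructor fills in one keyword argument only when the caller omitted it and forces another unconditionally. A minimal sketch of that pattern follows, written for current Python. The classes `BaseParserSketch` and `HtmlParserSketch` are hypothetical stand-ins, not part of the library, and `dict.setdefault` is used here in place of the `has_key` test shown in the listing.

```python
class BaseParserSketch(object):
    """Hypothetical stand-in for the base class; it only records its options."""
    HTML_ENTITIES = "html"

    def __init__(self, markup="", **kwargs):
        self.markup = markup
        self.options = kwargs


class HtmlParserSketch(BaseParserSketch):
    def __init__(self, *args, **kwargs):
        # Supply a default only when the caller did not pass a value.
        kwargs.setdefault("smartQuotesTo", self.HTML_ENTITIES)
        # Always force HTML mode, whatever the caller passed.
        kwargs["isHTML"] = True
        BaseParserSketch.__init__(self, *args, **kwargs)


default = HtmlParserSketch("<p>hi</p>")
custom = HtmlParserSketch("<p>hi</p>", smartQuotesTo=None, isHTML=False)
print(default.options)
print(custom.options)
```

An explicit `smartQuotesTo=None` survives, because the key is present, while `isHTML=False` is overwritten.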

Member Function Documentation

def BeautifulSoup.BeautifulSoup.start_meta(self, attrs)
Beautiful Soup can detect a charset included in a META tag,
try to convert the document to that charset, and re-parse the
document from the beginning.

Definition at line 1576 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding.

    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
                    pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True
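
The two uses of CHARSET_RE in start_meta (reading the declared charset from group 3, and rewriting the value while keeping group 1) can be tried in isolation. The sketch below reuses the pattern from this page as a raw string. The helper names `declared_charset` and `mask_charset` are hypothetical and do not exist in the library.

```python
import re

# Same pattern as CHARSET_RE on this page, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def declared_charset(content):
    """Return the charset named in a META content value, or None."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None


def mask_charset(content):
    """Keep the 'charset=' prefix (group 1) and replace the value with a placeholder."""
    return CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)


content = "text/html; charset=ISO-8859-1"
print(declared_charset(content))  # ISO-8859-1
print(mask_charset(content))      # text/html; charset=%SOUP-ENCODING%
```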

Member Data Documentation

tuple BeautifulSoup.BeautifulSoup.CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
static

Definition at line 1574 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.declaredHTMLEncoding

Definition at line 1615 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

tuple BeautifulSoup.BeautifulSoup.NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
static

Definition at line 1541 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.NESTABLE_INLINE_TAGS
static
Initial value:
    = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
       'center')

Definition at line 1535 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.NESTABLE_LIST_TAGS
static
Initial value:
    = { 'ol' : [],
        'ul' : [],
        'li' : ['ul', 'ol'],
        'dl' : [],
        'dd' : ['dl'],
        'dt' : ['dl'] }

Definition at line 1544 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS
static
Initial value:
    = {'table' : [],
       'tr' : ['table', 'tbody', 'tfoot', 'thead'],
       'td' : ['tr'],
       'th' : ['tr'],
       'thead' : ['table'],
       'tbody' : ['table'],
       'tfoot' : ['table'],
       }

Definition at line 1552 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.NESTABLE_TAGS
static
Initial value:
    = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                  NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

Definition at line 1570 of file BeautifulSoup.py.
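
The body of buildTagMap is not shown on this page. Judging only from the call sites here, it appears to merge tuples of tag names, single tag names, and partial dictionaries into one dictionary, mapping bare names to the default given as its first argument. The sketch below works under that assumption. The name `build_tag_map_sketch` is hypothetical, and the real function may differ in detail.

```python
def build_tag_map_sketch(default, *portions):
    """Merge dicts, sequences of names, and single names into one dict."""
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)       # keep the dictionary's own values
        elif isinstance(portion, str):
            built[portion] = default    # a single tag name
        else:
            for name in portion:        # a tuple or list of tag names
                built[name] = default
    return built


inline = ("span", "font")
block = ("blockquote", "div")
lists = {"ol": [], "li": ["ul", "ol"]}
nestable = build_tag_map_sketch([], inline, block, lists)
print(nestable["span"], nestable["li"])
print(build_tag_map_sketch(None, block, "noscript"))
```

Under this reading, every name in NESTABLE_INLINE_TAGS and NESTABLE_BLOCK_TAGS maps to an empty list (nestable with no reset triggers), while the list and table dictionaries keep their own trigger lists.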

tuple BeautifulSoup.BeautifulSoup.NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
static

Definition at line 1561 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.originalEncoding

Definition at line 1598 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
static

Definition at line 1528 of file BeautifulSoup.py.

dictionary BeautifulSoup.BeautifulSoup.QUOTE_TAGS = {'script' : None, 'textarea' : None}
static

Definition at line 1530 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.RESET_NESTING_TAGS
static
Initial value:
    = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                  NON_NESTABLE_BLOCK_TAGS,
                  NESTABLE_LIST_TAGS,
                  NESTABLE_TABLE_TAGS)

Definition at line 1565 of file BeautifulSoup.py.

tuple BeautifulSoup.BeautifulSoup.SELF_CLOSING_TAGS
static
Initial value:
    = buildTagMap(None,
                  ('br' , 'hr', 'input', 'img', 'meta',
                   'spacer', 'link', 'frame', 'base', 'col'))

Definition at line 1524 of file BeautifulSoup.py.
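
As the detailed description notes, self-closing tags are treated as closed as soon as they are encountered, so they never stay on the stack of open tags. A minimal sketch of that effect follows, assuming a plain list of start-tag names as input. The function `open_tags_after` and the `SELF_CLOSING` name are hypothetical, and the sketch ignores end tags and nesting rules entirely.

```python
# The tag names from SELF_CLOSING_TAGS on this page, each mapped to None.
SELF_CLOSING = dict.fromkeys(
    ("br", "hr", "input", "img", "meta",
     "spacer", "link", "frame", "base", "col"))


def open_tags_after(start_tags):
    """Return the tags still open after seeing the given start tags in order."""
    stack = []
    for name in start_tags:
        if name in SELF_CLOSING:
            continue  # closed immediately, so never pushed
        stack.append(name)
    return stack


print(open_tags_after(["div", "br", "img", "b"]))  # ['div', 'b']
```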