BeautifulSoup.BeautifulSoup Class Reference
Inheritance diagram for BeautifulSoup.BeautifulSoup:
Base classes: BeautifulSoup.BeautifulStoneSoup, BeautifulSoup.Tag, BeautifulSoup.PageElement
Derived classes: BeautifulSoup.ICantBelieveItsBeautifulSoup, BeautifulSoup.MinimalSoup, BeautifulSoup.RobustHTMLParser, BeautifulSoup.RobustWackAssHTMLParser, BeautifulSoup.RobustInsanelyWackAssHTMLParser

Public Member Functions

def __init__ (self, *args, **kwargs)
 
def start_meta (self, attrs)
 
- Public Member Functions inherited from BeautifulSoup.BeautifulStoneSoup
def __getattr__ (self, methodName)
 
def __init__ (self, markup="", parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo=XML_ENTITIES, convertEntities=None, selfClosingTags=None, isHTML=False)
 
def convert_charref (self, name)
 
def endData (self, containerClass=NavigableString)
 
def handle_charref (self, ref)
 
def handle_comment (self, text)
 
def handle_data (self, data)
 
def handle_decl (self, data)
 
def handle_entityref (self, ref)
 
def handle_pi (self, text)
 
def isSelfClosingTag (self, name)
 
def parse_declaration (self, i)
 
def popTag (self)
 
def pushTag (self, tag)
 
def reset (self)
 
def unknown_endtag (self, name)
 
def unknown_starttag (self, name, attrs, selfClosing=0)
 
- Public Member Functions inherited from BeautifulSoup.Tag
def __call__ (self, *args, **kwargs)
 
def __contains__ (self, x)
 
def __delitem__ (self, key)
 
def __eq__ (self, other)
 
def __getattr__ (self, tag)
 
def __getitem__ (self, key)
 
def __init__ (self, parser, name, attrs=None, parent=None, previous=None)
 
def __iter__ (self)
 
def __len__ (self)
 
def __ne__ (self, other)
 
def __nonzero__ (self)
 
def __repr__ (self, encoding=DEFAULT_OUTPUT_ENCODING)
 
def __setitem__ (self, key, value)
 
def __str__ (self, encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0)
 
def __unicode__ (self)
 
def childGenerator (self)
 
def clear (self)
 
def decompose (self)
 
def fetchText (self, text=None, recursive=True, limit=None)
 
def find (self, name=None, attrs={}, recursive=True, text=None, **kwargs)
 
def findAll (self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
 
def firstText (self, text=None, recursive=True)
 
def get (self, key, default=None)
 
def getString (self)
 
def getText (self, separator=u"")
 
def has_key (self, key)
 
def index (self, element)
 
def prettify (self, encoding=DEFAULT_OUTPUT_ENCODING)
 
def recursiveChildGenerator (self)
 
def renderContents (self, encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0)
 
def setString (self, string)
 
- Public Member Functions inherited from BeautifulSoup.PageElement
def append (self, tag)
 
def extract (self)
 
def findAllNext (self, name=None, attrs={}, text=None, limit=None, **kwargs)
 
def findAllPrevious (self, name=None, attrs={}, text=None, limit=None, **kwargs)
 
def findNext (self, name=None, attrs={}, text=None, **kwargs)
 
def findNextSibling (self, name=None, attrs={}, text=None, **kwargs)
 
def findNextSiblings (self, name=None, attrs={}, text=None, limit=None, **kwargs)
 
def findParent (self, name=None, attrs={}, **kwargs)
 
def findParents (self, name=None, attrs={}, limit=None, **kwargs)
 
def findPrevious (self, name=None, attrs={}, text=None, **kwargs)
 
def findPreviousSibling (self, name=None, attrs={}, text=None, **kwargs)
 
def findPreviousSiblings (self, name=None, attrs={}, text=None, limit=None, **kwargs)
 
def insert (self, position, newChild)
 
def nextGenerator (self)
 
def nextSiblingGenerator (self)
 
def parentGenerator (self)
 
def previousGenerator (self)
 
def previousSiblingGenerator (self)
 
def replaceWith (self, replaceWith)
 
def replaceWithChildren (self)
 
def setup (self, parent=None, previous=None)
 
def substituteEncoding (self, str, encoding=None)
 
def toEncoding (self, s, encoding=None)
 

Public Attributes

 declaredHTMLEncoding
 
 originalEncoding
 
- Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 convertEntities
 
 convertHTMLEntities
 
 convertXMLEntities
 
 currentData
 
 currentTag
 
 declaredHTMLEncoding
 
 escapeUnrecognizedEntities
 
 fromEncoding
 
 hidden
 
 instanceSelfClosingTags
 
 literal
 
 markup
 
 markupMassage
 
 originalEncoding
 
 parseOnlyThese
 
 previous
 
 quoteStack
 
 smartQuotesTo
 
 tagStack
 
- Public Attributes inherited from BeautifulSoup.Tag
 attrMap
 
 attrs
 
 containsSubstitutions
 
 contents
 
 convertHTMLEntities
 
 convertXMLEntities
 
 escapeUnrecognizedEntities
 
 hidden
 
 isSelfClosing
 
 name
 
 parserClass
 
- Public Attributes inherited from BeautifulSoup.PageElement
 next
 
 nextSibling
 
 parent
 
 previous
 
 previousSibling
 

Static Public Attributes

 CHARSET_RE
 
 NESTABLE_BLOCK_TAGS
 
 NESTABLE_INLINE_TAGS
 
 NESTABLE_LIST_TAGS
 
 NESTABLE_TABLE_TAGS
 
 NESTABLE_TAGS
 
 NON_NESTABLE_BLOCK_TAGS
 
 PRESERVE_WHITESPACE_TAGS
 
 QUOTE_TAGS
 
 RESET_NESTING_TAGS
 
 SELF_CLOSING_TAGS
 
- Static Public Attributes inherited from BeautifulSoup.BeautifulStoneSoup
 ALL_ENTITIES
 
 HTML_ENTITIES
 
 MARKUP_MASSAGE
 
 NESTABLE_TAGS
 
 PRESERVE_WHITESPACE_TAGS
 
 QUOTE_TAGS
 
 RESET_NESTING_TAGS
 
 ROOT_TAG_NAME
 
 SELF_CLOSING_TAGS
 
 STRIP_ASCII_SPACES
 
 XHTML_ENTITIES
 
 XML_ENTITIES
 
- Static Public Attributes inherited from BeautifulSoup.Tag
 fetch
 
 findChild
 
 findChildren
 
 first
 
- Static Public Attributes inherited from BeautifulSoup.PageElement
 BARE_AMPERSAND_OR_BRACKET
 
 fetchNextSiblings
 
 fetchParents
 
 fetchPrevious
 
 fetchPreviousSiblings
 
 XML_ENTITIES_TO_SPECIAL_CHARS
 
 XML_SPECIAL_CHARS_TO_ENTITIES
 

Additional Inherited Members

- Properties inherited from BeautifulSoup.Tag
 string = property(getString, setString)
 
 text = property(getText)
 

Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.

Definition at line 1470 of file BeautifulSoup.py.
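
The three nesting rules described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the parser's actual implementation: the names FREELY_NESTABLE, RESET_BY and insert_implied_end_tags are hypothetical, the rule tables hold just the tags used in the examples, and the input is a pre-tokenized list rather than real markup.

```python
# Hypothetical, heavily reduced rule tables (the real class keeps larger ones).
FREELY_NESTABLE = {"blockquote", "table"}   # may nest inside themselves
RESET_BY = {"tr": {"table"}}                # an enclosing <table> resets <tr> nesting


def insert_implied_end_tags(tokens):
    """Return markup with implied end tags inserted.

    tokens is a list of ("start", tag_name) or ("text", string) pairs.
    """
    stack = []  # names of currently open tags, innermost last
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
            continue
        if value not in FREELY_NESTABLE:
            resets = RESET_BY.get(value, set())
            # Walk outward from the innermost open tag.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth] in resets:
                    break  # a reset tag shields everything outside it
                if stack[depth] == value:
                    # Close the earlier tag and everything opened inside it.
                    while len(stack) > depth:
                        out.append("</%s>" % stack.pop())
                    break
        stack.append(value)
        out.append("<%s>" % value)
    return "".join(out)


paragraphs = [("start", "p"), ("text", "Para1"), ("start", "p"), ("text", "Para2")]
print(insert_implied_end_tags(paragraphs))
```

In this model a second <p> closes the first, a second <blockquote> nests, and a <tr> closes an earlier <tr> only when no <table> has been opened in between.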

Constructor & Destructor Documentation

def BeautifulSoup.BeautifulSoup.__init__ (self, *args, **kwargs)

Definition at line 1518 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES.

1518      def __init__(self, *args, **kwargs):
1519          if not kwargs.has_key('smartQuotesTo'):
1520              kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1521          kwargs['isHTML'] = True
1522          BeautifulStoneSoup.__init__(self, *args, **kwargs)
1523 
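
The constructor above supplies a default for one keyword argument only when the caller did not pass it, and always forces another. A minimal sketch of that pattern follows. It assumes two hypothetical stand-in classes (StoneSoupSketch and HtmlSoupSketch) that only record their options, and it uses dict.setdefault in place of has_key so that it also runs on Python 3.

```python
class StoneSoupSketch(object):
    """Hypothetical stand-in for the base parser; it only records its options."""

    def __init__(self, markup="", smartQuotesTo="xml", isHTML=False):
        self.markup = markup
        self.smartQuotesTo = smartQuotesTo
        self.isHTML = isHTML


class HtmlSoupSketch(StoneSoupSketch):
    """Hypothetical stand-in for the HTML-aware subclass."""

    HTML_ENTITIES = "html"

    def __init__(self, *args, **kwargs):
        # Supply the default only when the caller did not pass the option.
        kwargs.setdefault("smartQuotesTo", self.HTML_ENTITIES)
        # This option is always overridden, whatever the caller passed.
        kwargs["isHTML"] = True
        StoneSoupSketch.__init__(self, *args, **kwargs)


default_soup = HtmlSoupSketch("<p>hi")
explicit_soup = HtmlSoupSketch("<p>hi", smartQuotesTo=None)
print(default_soup.smartQuotesTo, explicit_soup.smartQuotesTo)
```

Because the check is on the presence of the key rather than on its value, an explicit smartQuotesTo=None from the caller is preserved instead of being replaced by the default.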

Member Function Documentation

def BeautifulSoup.BeautifulSoup.start_meta (self, attrs)
Beautiful Soup can detect a charset included in a META tag,
try to convert the document to that charset, and re-parse the
document from the beginning.

Definition at line 1576 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding.

1576      def start_meta(self, attrs):
1577          """Beautiful Soup can detect a charset included in a META tag,
1578          try to convert the document to that charset, and re-parse the
1579          document from the beginning."""
1580          httpEquiv = None
1581          contentType = None
1582          contentTypeIndex = None
1583          tagNeedsEncodingSubstitution = False
1584 
1585          for i in range(0, len(attrs)):
1586              key, value = attrs[i]
1587              key = key.lower()
1588              if key == 'http-equiv':
1589                  httpEquiv = value
1590              elif key == 'content':
1591                  contentType = value
1592                  contentTypeIndex = i
1593 
1594          if httpEquiv and contentType: # It's an interesting meta tag.
1595              match = self.CHARSET_RE.search(contentType)
1596              if match:
1597                  if (self.declaredHTMLEncoding is not None or
1598                      self.originalEncoding == self.fromEncoding):
1599                      # An HTML encoding was sniffed while converting
1600                      # the document to Unicode, or an HTML encoding was
1601                      # sniffed during a previous pass through the
1602                      # document, or an encoding was specified
1603                      # explicitly and it worked. Rewrite the meta tag.
1604                      def rewrite(match):
1605                          return match.group(1) + "%SOUP-ENCODING%"
1606                      newAttr = self.CHARSET_RE.sub(rewrite, contentType)
1607                      attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
1608                                                 newAttr)
1609                      tagNeedsEncodingSubstitution = True
1610                  else:
1611                      # This is our first pass through the document.
1612                      # Go through it again with the encoding information.
1613                      newCharset = match.group(3)
1614                      if newCharset and newCharset != self.originalEncoding:
1615                          self.declaredHTMLEncoding = newCharset
1616                          self._feed(self.declaredHTMLEncoding)
1617                          raise StopParsing
1618                      pass
1619          tag = self.unknown_starttag("meta", attrs)
1620          if tag and tagNeedsEncodingSubstitution:
1621              tag.containsSubstitutions = True
1622 
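
The two branches of start_meta both depend on the groups of CHARSET_RE: group 1 is kept when the charset is masked, and group 3 is read as the declared charset. The sketch below shows both operations in isolation. The pattern is an assumption, since this page does not show the definition of CHARSET_RE; it is chosen only so that groups 1 and 3 behave as the listing uses them. The helper names declared_charset and mask_charset are hypothetical.

```python
import re

# Assumed pattern: group 1 is the "charset=" prefix (with any leading ";"),
# group 3 is the charset value. The real CHARSET_RE may differ.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def declared_charset(content):
    """Return the charset named in a META content value, or None."""
    match = CHARSET_RE.search(content)
    return match.group(3) if match else None


def mask_charset(content):
    """Replace the charset value with the %SOUP-ENCODING% placeholder."""
    return CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)


content = "text/html; charset=ISO-8859-1"
print(declared_charset(content))
print(mask_charset(content))
```

On a first pass the declared charset would drive a re-parse; on later passes the placeholder lets the output encoding be substituted when the tag is rendered.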

Member Data Documentation

BeautifulSoup.BeautifulSoup.CHARSET_RE
static

Definition at line 1574 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.declaredHTMLEncoding

Definition at line 1615 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

BeautifulSoup.BeautifulSoup.NESTABLE_BLOCK_TAGS
static

Definition at line 1541 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.NESTABLE_INLINE_TAGS
static

Definition at line 1535 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.NESTABLE_LIST_TAGS
static

Definition at line 1544 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.NESTABLE_TABLE_TAGS
static

Definition at line 1552 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.NESTABLE_TAGS
static

Definition at line 1570 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.NON_NESTABLE_BLOCK_TAGS
static

Definition at line 1561 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.originalEncoding

Definition at line 1598 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.PRESERVE_WHITESPACE_TAGS
static

Definition at line 1528 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.QUOTE_TAGS
static

Definition at line 1530 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.RESET_NESTING_TAGS
static

Definition at line 1565 of file BeautifulSoup.py.

BeautifulSoup.BeautifulSoup.SELF_CLOSING_TAGS
static

Definition at line 1524 of file BeautifulSoup.py.