Public Member Functions | |
def | __init__ |
def | extractCharsetFromMeta |
Public Attributes | |
declaredHTMLEncoding | |
originalEncoding | |
Static Public Attributes | |
tuple | CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M) |
list | NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] |
list | NESTABLE_INLINE_TAGS |
dictionary | NESTABLE_LIST_TAGS |
dictionary | NESTABLE_TABLE_TAGS |
tuple | NESTABLE_TAGS |
list | NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre'] |
tuple | PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) |
dictionary | QUOTE_TAGS = {'script' : None, 'textarea' : None} |
tuple | RESET_NESTING_TAGS |
tuple | SELF_CLOSING_TAGS |
This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which are not really part of the document and which should be parsed as text, not tags. If you want to parse the text as tags, you can always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.

    <p>Para1<p>Para2
  should be transformed into:
    <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should _not_ implicitly close the previous <blockquote> tag.

    Alice said: <blockquote>Bob said: <blockquote>Blah
  should NOT be transformed into:
    Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.

    <table><tr>Blah<tr>Blah
  should be transformed into:
    <table><tr>Blah</tr><tr>Blah
  but,
    <tr>Blah<table><tr>Blah
  should NOT be transformed into
    <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup before writing your own subclass.
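The nesting rules above can be sketched as a small decision function. This is an illustrative toy, not the parser's own code: the function name `tags_to_close`, the trimmed-down `NESTABLE` and `RESET_NESTING` tables, and the list-of-names representation of the open-tag stack are all assumptions made for the example.

```python
# Toy tables, loosely modeled on the class attributes documented on this page.
# A tag present in NESTABLE may nest; its value lists the ancestors that
# reset its nesting (an empty list means it nests arbitrarily).
NESTABLE = {
    "blockquote": [],
    "table": [],
    "tr": ["table", "tbody", "tfoot", "thead"],
    "td": ["tr"],
}
# Tags that stop the search for an earlier occurrence of a non-nestable tag.
RESET_NESTING = {"blockquote", "p", "table", "tr", "td"}


def tags_to_close(open_tags, new_tag):
    """Return the open tags (innermost first) to close before opening new_tag.

    open_tags is the stack of currently open tag names, outermost first.
    """
    triggers = NESTABLE.get(new_tag)
    is_nestable = triggers is not None
    closed = []
    for name in reversed(open_tags):
        if name == new_tag and not is_nestable:
            # Non-nestable tag: close everything up to and including
            # its previous occurrence.
            return closed + [name]
        if is_nestable and name in triggers:
            # Reached an ancestor that resets nesting for this tag:
            # close what lies inside it, but not the ancestor itself.
            return closed
        if not is_nestable and new_tag in RESET_NESTING and name in RESET_NESTING:
            # Another reset-nesting tag intervenes: stop without closing it.
            return closed
        closed.append(name)
    # No earlier occurrence and no reset trigger found: close nothing.
    return []


print(tags_to_close(["p"], "p"))                    # ['p']
print(tags_to_close(["blockquote"], "blockquote"))  # []
print(tags_to_close(["table", "tr"], "tr"))         # ['tr']
print(tags_to_close(["tr", "table"], "tr"))         # []
```

The four calls correspond to the four examples in the description: a second `<p>` closes the first, a nested `<blockquote>` closes nothing, a second `<tr>` in the same table closes the first, and a `<tr>` in an inner table leaves the outer `<tr>` open.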
Definition at line 1447 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulSoup::__init__(self, args, kwargs)
Definition at line 1495 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulSoup::extractCharsetFromMeta(self, attrs)
Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Definition at line 1553 of file BeautifulSoup.py.
    def extractCharsetFromMeta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
                    pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True
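To make the two branches of this method concrete, here is a standalone sketch of what CHARSET_RE does to a typical META content value. It uses only the standard `re` module and the pattern documented on this page. The sample `content` string and the variable names are assumptions for illustration, and the pattern is written as a raw string for Python 3.

```python
import re

# Same pattern as the CHARSET_RE static attribute documented on this page.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

# Hypothetical value of a <meta http-equiv="Content-Type" content="..."> tag.
content = "text/html; charset=ISO-8859-1"

match = CHARSET_RE.search(content)

# First pass: group 3 is the charset the document declares for itself.
declared = match.group(3)
print(declared)   # ISO-8859-1

# Later pass: keep the "; charset=" prefix (group 1) and swap the value
# for the placeholder string used in the listing above.
rewritten = CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)
print(rewritten)  # text/html; charset=%SOUP-ENCODING%
```

In the method itself, the first branch corresponds to the `else` clause (re-feed the document with the declared encoding), and the second to the `rewrite` helper, which leaves a placeholder in the attribute.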
tuple BeautifulSoup::BeautifulSoup::CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M) [static] |
Definition at line 1551 of file BeautifulSoup.py.
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Definition at line 1555 of file BeautifulSoup.py.
list BeautifulSoup::BeautifulSoup::NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] [static] |
Definition at line 1518 of file BeautifulSoup.py.
list BeautifulSoup::BeautifulSoup::NESTABLE_INLINE_TAGS [static] |
['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center']
Definition at line 1512 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulSoup::NESTABLE_LIST_TAGS [static] |
{ 'ol' : [], 'ul' : [], 'li' : ['ul', 'ol'], 'dl' : [], 'dd' : ['dl'], 'dt' : ['dl'] }
Definition at line 1521 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulSoup::NESTABLE_TABLE_TAGS [static] |
{'table' : [], 'tr' : ['table', 'tbody', 'tfoot', 'thead'], 'td' : ['tr'], 'th' : ['tr'], 'thead' : ['table'], 'tbody' : ['table'], 'tfoot' : ['table'], }
Definition at line 1529 of file BeautifulSoup.py.
tuple BeautifulSoup::BeautifulSoup::NESTABLE_TAGS [static] |
buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS, NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Reimplemented in BeautifulSoup::ICantBelieveItsBeautifulSoup, and BeautifulSoup::MinimalSoup.
Definition at line 1547 of file BeautifulSoup.py.
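This page shows only the calls to buildTagMap, not its body. The following is a minimal sketch of a helper with the same call shape, under the assumption (not confirmed by this page) that dictionaries are merged as they are and that lists or single tag names are mapped to the default value. The name `build_tag_map` and the shortened tag lists are hypothetical.

```python
def build_tag_map(default, *portions):
    """Merge dicts, lists, and single names into one tag -> value map.

    Assumed behavior: dict portions are copied in as-is; every name from a
    list portion, and every bare name, is mapped to `default`.
    """
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)
        elif isinstance(portion, (list, tuple, set)):
            for name in portion:
                built[name] = default
        else:
            # A single tag name such as 'noscript'.
            built[portion] = default
    return built


# Shortened stand-ins for the static attributes documented on this page.
inline_tags = ["span", "font"]
block_tags = ["blockquote", "div"]
table_tags = {"table": [], "tr": ["table", "tbody", "tfoot", "thead"]}

nestable = build_tag_map([], inline_tags, block_tags, table_tags)
reset_nesting = build_tag_map(None, block_tags, "noscript", table_tags)

print(nestable["span"])           # []
print(nestable["tr"])             # ['table', 'tbody', 'tfoot', 'thead']
print(reset_nesting["noscript"])  # None
```

Under this reading, a tag whose value is an empty list may nest without restriction, while a tag such as 'tr' carries the list of ancestors that reset its nesting.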
list BeautifulSoup::BeautifulSoup::NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre'] [static] |
Definition at line 1538 of file BeautifulSoup.py.
tuple BeautifulSoup::BeautifulSoup::PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) [static] |
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Definition at line 1505 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulSoup::QUOTE_TAGS = {'script' : None, 'textarea' : None} [static] |
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Definition at line 1507 of file BeautifulSoup.py.
tuple BeautifulSoup::BeautifulSoup::RESET_NESTING_TAGS [static] |
buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript', NON_NESTABLE_BLOCK_TAGS, NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Reimplemented in BeautifulSoup::MinimalSoup.
Definition at line 1542 of file BeautifulSoup.py.
tuple BeautifulSoup::BeautifulSoup::SELF_CLOSING_TAGS [static] |
buildTagMap(None, ['br' , 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'])
Reimplemented from BeautifulSoup::BeautifulStoneSoup.
Definition at line 1501 of file BeautifulSoup.py.