This class contains the basic parser and search code. It defines a parser that knows nothing about tag behavior except for the following: You can't close a tag without closing all the tags it encloses. That is, "<foo><bar></foo>" actually means "<foo><bar></bar></foo>". [Another possible explanation is "<foo><bar /></foo>", but since this class defines no SELF_CLOSING_TAGS, it will never use that explanation.] This class is useful for parsing XML or made-up markup languages, or when BeautifulSoup makes an assumption counter to what you were expecting.
Definition at line 1120 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::__init__ | ( | self, | |
markup = "" , |
|||
parseOnlyThese = None , |
|||
fromEncoding = None , |
|||
markupMassage = True , |
|||
smartQuotesTo = XML_ENTITIES , |
|||
convertEntities = None , |
|||
selfClosingTags = None , |
|||
isHTML = False , |
|||
builder = HTMLParserBuilder |
|||
) |
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser. HTMLParser will process most bad HTML, and the BeautifulSoup class has some tricks for dealing with some HTML that kills HTMLParser, but Beautiful Soup can nonetheless choke or lose data if your data uses self-closing tags or declarations incorrectly. By default, Beautiful Soup uses regexes to sanitize input, avoiding the vast majority of these problems. If the problems don't apply to you, pass in False for markupMassage, and you'll get better performance. The default parser massage techniques fix the two most common instances of invalid HTML that choke HTMLParser: <br/> (No space between name of closing tag and tag close) <! --Comment--> (Extraneous whitespace in declaration) You can pass in a custom list of (RE object, replace method) tuples to get Beautiful Soup to scrub your input the way you want.
Definition at line 1164 of file BeautifulSoup.py.
01168 : 01169 """The Soup object is initialized as the 'root tag', and the 01170 provided markup (which can be a string or a file-like object) 01171 is fed into the underlying parser. 01172 01173 HTMLParser will process most bad HTML, and the BeautifulSoup 01174 class has some tricks for dealing with some HTML that kills 01175 HTMLParser, but Beautiful Soup can nonetheless choke or lose data 01176 if your data uses self-closing tags or declarations 01177 incorrectly. 01178 01179 By default, Beautiful Soup uses regexes to sanitize input, 01180 avoiding the vast majority of these problems. If the problems 01181 don't apply to you, pass in False for markupMassage, and 01182 you'll get better performance. 01183 01184 The default parser massage techniques fix the two most common 01185 instances of invalid HTML that choke HTMLParser: 01186 01187 <br/> (No space between name of closing tag and tag close) 01188 <! --Comment--> (Extraneous whitespace in declaration) 01189 01190 You can pass in a custom list of (RE object, replace method) 01191 tuples to get Beautiful Soup to scrub your input the way you 01192 want.""" 01193 01194 self.parseOnlyThese = parseOnlyThese 01195 self.fromEncoding = fromEncoding 01196 self.smartQuotesTo = smartQuotesTo 01197 self.convertEntities = convertEntities 01198 # Set the rules for how we'll deal with the entities we 01199 # encounter 01200 if self.convertEntities: 01201 # It doesn't make sense to convert encoded characters to 01202 # entities even while you're converting entities to Unicode. 01203 # Just convert it all to Unicode. 01204 self.smartQuotesTo = None 01205 if convertEntities == self.HTML_ENTITIES: 01206 self.convertXMLEntities = False 01207 self.convertHTMLEntities = True 01208 self.escapeUnrecognizedEntities = True 01209 elif convertEntities == self.XHTML_ENTITIES: 01210 self.convertXMLEntities = True 01211 self.convertHTMLEntities = True 01212 self.escapeUnrecognizedEntities = False 01213 elif convertEntities == self.XML_ENTITIES: 01214 self.convertXMLEntities = True 01215 self.convertHTMLEntities = False 01216 self.escapeUnrecognizedEntities = False 01217 else: 01218 self.convertXMLEntities = False 01219 self.convertHTMLEntities = False 01220 self.escapeUnrecognizedEntities = False 01221 01222 self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags) 01223 self.builder = builder(self) 01224 self.reset() 01225 01226 if hasattr(markup, 'read'): # It's a file-type object. 01227 markup = markup.read() 01228 self.markup = markup 01229 self.markupMassage = markupMassage 01230 try: 01231 self._feed(isHTML=isHTML) 01232 except StopParsing: 01233 pass 01234 self.markup = None # The markup can now be GCed. 01235 self.builder = None # So can the builder.
def BeautifulSoup::BeautifulStoneSoup::__init__ | ( | self, | |
markup = "" , |
|||
parseOnlyThese = None , |
|||
fromEncoding = None , |
|||
markupMassage = True , |
|||
smartQuotesTo = XML_ENTITIES , |
|||
convertEntities = None , |
|||
selfClosingTags = None , |
|||
isHTML = False , |
|||
builder = HTMLParserBuilder |
|||
) |
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser. HTMLParser will process most bad HTML, and the BeautifulSoup class has some tricks for dealing with some HTML that kills HTMLParser, but Beautiful Soup can nonetheless choke or lose data if your data uses self-closing tags or declarations incorrectly. By default, Beautiful Soup uses regexes to sanitize input, avoiding the vast majority of these problems. If the problems don't apply to you, pass in False for markupMassage, and you'll get better performance. The default parser massage techniques fix the two most common instances of invalid HTML that choke HTMLParser: <br/> (No space between name of closing tag and tag close) <! --Comment--> (Extraneous whitespace in declaration) You can pass in a custom list of (RE object, replace method) tuples to get Beautiful Soup to scrub your input the way you want.
Definition at line 1164 of file BeautifulSoup.py.
01168 : 01169 """The Soup object is initialized as the 'root tag', and the 01170 provided markup (which can be a string or a file-like object) 01171 is fed into the underlying parser. 01172 01173 HTMLParser will process most bad HTML, and the BeautifulSoup 01174 class has some tricks for dealing with some HTML that kills 01175 HTMLParser, but Beautiful Soup can nonetheless choke or lose data 01176 if your data uses self-closing tags or declarations 01177 incorrectly. 01178 01179 By default, Beautiful Soup uses regexes to sanitize input, 01180 avoiding the vast majority of these problems. If the problems 01181 don't apply to you, pass in False for markupMassage, and 01182 you'll get better performance. 01183 01184 The default parser massage techniques fix the two most common 01185 instances of invalid HTML that choke HTMLParser: 01186 01187 <br/> (No space between name of closing tag and tag close) 01188 <! --Comment--> (Extraneous whitespace in declaration) 01189 01190 You can pass in a custom list of (RE object, replace method) 01191 tuples to get Beautiful Soup to scrub your input the way you 01192 want.""" 01193 01194 self.parseOnlyThese = parseOnlyThese 01195 self.fromEncoding = fromEncoding 01196 self.smartQuotesTo = smartQuotesTo 01197 self.convertEntities = convertEntities 01198 # Set the rules for how we'll deal with the entities we 01199 # encounter 01200 if self.convertEntities: 01201 # It doesn't make sense to convert encoded characters to 01202 # entities even while you're converting entities to Unicode. 01203 # Just convert it all to Unicode. 01204 self.smartQuotesTo = None 01205 if convertEntities == self.HTML_ENTITIES: 01206 self.convertXMLEntities = False 01207 self.convertHTMLEntities = True 01208 self.escapeUnrecognizedEntities = True 01209 elif convertEntities == self.XHTML_ENTITIES: 01210 self.convertXMLEntities = True 01211 self.convertHTMLEntities = True 01212 self.escapeUnrecognizedEntities = False 01213 elif convertEntities == self.XML_ENTITIES: 01214 self.convertXMLEntities = True 01215 self.convertHTMLEntities = False 01216 self.escapeUnrecognizedEntities = False 01217 else: 01218 self.convertXMLEntities = False 01219 self.convertHTMLEntities = False 01220 self.escapeUnrecognizedEntities = False 01221 01222 self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags) 01223 self.builder = builder(self) 01224 self.reset() 01225 01226 if hasattr(markup, 'read'): # It's a file-type object. 01227 markup = markup.read() 01228 self.markup = markup 01229 self.markupMassage = markupMassage 01230 try: 01231 self._feed(isHTML=isHTML) 01232 except StopParsing: 01233 pass 01234 self.markup = None # The markup can now be GCed. 01235 self.builder = None # So can the builder.
def BeautifulSoup::BeautifulStoneSoup::_feed | ( | self, | |
inDocumentEncoding = None , |
|||
isHTML = False |
|||
) | [private] |
Definition at line 1236 of file BeautifulSoup.py.
01237 : 01238 # Convert the document to Unicode. 01239 markup = self.markup 01240 if isinstance(markup, unicode): 01241 if not hasattr(self, 'originalEncoding'): 01242 self.originalEncoding = None 01243 else: 01244 dammit = UnicodeDammit\ 01245 (markup, [self.fromEncoding, inDocumentEncoding], 01246 smartQuotesTo=self.smartQuotesTo, isHTML=isHTML) 01247 markup = dammit.unicode 01248 self.originalEncoding = dammit.originalEncoding 01249 self.declaredHTMLEncoding = dammit.declaredHTMLEncoding 01250 if markup: 01251 if self.markupMassage: 01252 if not isList(self.markupMassage): 01253 self.markupMassage = self.MARKUP_MASSAGE 01254 for fix, m in self.markupMassage: 01255 markup = fix.sub(m, markup) 01256 # TODO: We get rid of markupMassage so that the 01257 # soup object can be deepcopied later on. Some 01258 # Python installations can't copy regexes. If anyone 01259 # was relying on the existence of markupMassage, this 01260 # might cause problems. 01261 del(self.markupMassage) 01262 self.builder.reset() 01263 01264 self.builder.feed(markup) 01265 # Close out any unfinished strings and close all the open tags. 01266 self.endData() 01267 while self.currentTag.name != self.ROOT_TAG_NAME: 01268 self.popTag()
def BeautifulSoup::BeautifulStoneSoup::_feed | ( | self, | |
inDocumentEncoding = None , |
|||
isHTML = False |
|||
) | [private] |
Definition at line 1236 of file BeautifulSoup.py.
01237 : 01238 # Convert the document to Unicode. 01239 markup = self.markup 01240 if isinstance(markup, unicode): 01241 if not hasattr(self, 'originalEncoding'): 01242 self.originalEncoding = None 01243 else: 01244 dammit = UnicodeDammit\ 01245 (markup, [self.fromEncoding, inDocumentEncoding], 01246 smartQuotesTo=self.smartQuotesTo, isHTML=isHTML) 01247 markup = dammit.unicode 01248 self.originalEncoding = dammit.originalEncoding 01249 self.declaredHTMLEncoding = dammit.declaredHTMLEncoding 01250 if markup: 01251 if self.markupMassage: 01252 if not isList(self.markupMassage): 01253 self.markupMassage = self.MARKUP_MASSAGE 01254 for fix, m in self.markupMassage: 01255 markup = fix.sub(m, markup) 01256 # TODO: We get rid of markupMassage so that the 01257 # soup object can be deepcopied later on. Some 01258 # Python installations can't copy regexes. If anyone 01259 # was relying on the existence of markupMassage, this 01260 # might cause problems. 01261 del(self.markupMassage) 01262 self.builder.reset() 01263 01264 self.builder.feed(markup) 01265 # Close out any unfinished strings and close all the open tags. 01266 self.endData() 01267 while self.currentTag.name != self.ROOT_TAG_NAME: 01268 self.popTag()
def BeautifulSoup::BeautifulStoneSoup::_popToTag | ( | self, | |
name, | |||
inclusivePop = True |
|||
) | [private] |
Pops the tag stack up to and including the most recent instance of the given tag. If inclusivePop is false, pops the tag stack up to but *not* including the most recent instqance of the given tag.
Definition at line 1329 of file BeautifulSoup.py.
01330 : 01331 """Pops the tag stack up to and including the most recent 01332 instance of the given tag. If inclusivePop is false, pops the tag 01333 stack up to but *not* including the most recent instqance of 01334 the given tag.""" 01335 #print "Popping to %s" % name 01336 if name == self.ROOT_TAG_NAME: 01337 return 01338 01339 numPops = 0 01340 mostRecentTag = None 01341 for i in range(len(self.tagStack)-1, 0, -1): 01342 if name == self.tagStack[i].name: 01343 numPops = len(self.tagStack)-i 01344 break 01345 if not inclusivePop: 01346 numPops = numPops - 1 01347 01348 for i in range(0, numPops): 01349 mostRecentTag = self.popTag() 01350 return mostRecentTag
def BeautifulSoup::BeautifulStoneSoup::_popToTag | ( | self, | |
name, | |||
inclusivePop = True |
|||
) | [private] |
Pops the tag stack up to and including the most recent instance of the given tag. If inclusivePop is false, pops the tag stack up to but *not* including the most recent instqance of the given tag.
Definition at line 1329 of file BeautifulSoup.py.
01330 : 01331 """Pops the tag stack up to and including the most recent 01332 instance of the given tag. If inclusivePop is false, pops the tag 01333 stack up to but *not* including the most recent instqance of 01334 the given tag.""" 01335 #print "Popping to %s" % name 01336 if name == self.ROOT_TAG_NAME: 01337 return 01338 01339 numPops = 0 01340 mostRecentTag = None 01341 for i in range(len(self.tagStack)-1, 0, -1): 01342 if name == self.tagStack[i].name: 01343 numPops = len(self.tagStack)-i 01344 break 01345 if not inclusivePop: 01346 numPops = numPops - 1 01347 01348 for i in range(0, numPops): 01349 mostRecentTag = self.popTag() 01350 return mostRecentTag
def BeautifulSoup::BeautifulStoneSoup::_smartPop | ( | self, | |
name | |||
) | [private] |
We need to pop up to the previous tag of this type, unless one of this tag's nesting reset triggers comes between this tag and the previous tag of this type, OR unless this tag is a generic nesting trigger and another generic nesting trigger comes between this tag and the previous tag of this type. Examples: <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
Definition at line 1351 of file BeautifulSoup.py.
01352 : 01353 01354 """We need to pop up to the previous tag of this type, unless 01355 one of this tag's nesting reset triggers comes between this 01356 tag and the previous tag of this type, OR unless this tag is a 01357 generic nesting trigger and another generic nesting trigger 01358 comes between this tag and the previous tag of this type. 01359 01360 Examples: 01361 <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. 01362 <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. 01363 <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. 01364 01365 <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. 01366 <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' 01367 <td><tr><td> *<td>* should pop to 'tr', not the first 'td' 01368 """ 01369 01370 nestingResetTriggers = self.NESTABLE_TAGS.get(name) 01371 isNestable = nestingResetTriggers != None 01372 isResetNesting = self.RESET_NESTING_TAGS.has_key(name) 01373 popTo = None 01374 inclusive = True 01375 for i in range(len(self.tagStack)-1, 0, -1): 01376 p = self.tagStack[i] 01377 if (not p or p.name == name) and not isNestable: 01378 #Non-nestable tags get popped to the top or to their 01379 #last occurance. 01380 popTo = name 01381 break 01382 if (nestingResetTriggers != None 01383 and p.name in nestingResetTriggers) \ 01384 or (nestingResetTriggers == None and isResetNesting 01385 and self.RESET_NESTING_TAGS.has_key(p.name)): 01386 01387 #If we encounter one of the nesting reset triggers 01388 #peculiar to this tag, or we encounter another tag 01389 #that causes nesting to reset, pop up to but not 01390 #including that tag. 01391 popTo = p.name 01392 inclusive = False 01393 break 01394 p = p.parent 01395 if popTo: 01396 self._popToTag(popTo, inclusive)
def BeautifulSoup::BeautifulStoneSoup::_smartPop | ( | self, | |
name | |||
) | [private] |
We need to pop up to the previous tag of this type, unless one of this tag's nesting reset triggers comes between this tag and the previous tag of this type, OR unless this tag is a generic nesting trigger and another generic nesting trigger comes between this tag and the previous tag of this type. Examples: <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
Definition at line 1351 of file BeautifulSoup.py.
01352 : 01353 01354 """We need to pop up to the previous tag of this type, unless 01355 one of this tag's nesting reset triggers comes between this 01356 tag and the previous tag of this type, OR unless this tag is a 01357 generic nesting trigger and another generic nesting trigger 01358 comes between this tag and the previous tag of this type. 01359 01360 Examples: 01361 <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. 01362 <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. 01363 <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. 01364 01365 <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. 01366 <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' 01367 <td><tr><td> *<td>* should pop to 'tr', not the first 'td' 01368 """ 01369 01370 nestingResetTriggers = self.NESTABLE_TAGS.get(name) 01371 isNestable = nestingResetTriggers != None 01372 isResetNesting = self.RESET_NESTING_TAGS.has_key(name) 01373 popTo = None 01374 inclusive = True 01375 for i in range(len(self.tagStack)-1, 0, -1): 01376 p = self.tagStack[i] 01377 if (not p or p.name == name) and not isNestable: 01378 #Non-nestable tags get popped to the top or to their 01379 #last occurance. 01380 popTo = name 01381 break 01382 if (nestingResetTriggers != None 01383 and p.name in nestingResetTriggers) \ 01384 or (nestingResetTriggers == None and isResetNesting 01385 and self.RESET_NESTING_TAGS.has_key(p.name)): 01386 01387 #If we encounter one of the nesting reset triggers 01388 #peculiar to this tag, or we encounter another tag 01389 #that causes nesting to reset, pop up to but not 01390 #including that tag. 01391 popTo = p.name 01392 inclusive = False 01393 break 01394 p = p.parent 01395 if popTo: 01396 self._popToTag(popTo, inclusive)
def BeautifulSoup::BeautifulStoneSoup::endData | ( | self, | |
containerClass = NavigableString |
|||
) |
Definition at line 1306 of file BeautifulSoup.py.
01307 : 01308 if self.currentData: 01309 currentData = u''.join(self.currentData) 01310 if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and 01311 not set([tag.name for tag in self.tagStack]).intersection( 01312 self.PRESERVE_WHITESPACE_TAGS)): 01313 if '\n' in currentData: 01314 currentData = '\n' 01315 else: 01316 currentData = ' ' 01317 self.currentData = [] 01318 if self.parseOnlyThese and len(self.tagStack) <= 1 and \ 01319 (not self.parseOnlyThese.text or \ 01320 not self.parseOnlyThese.search(currentData)): 01321 return 01322 o = containerClass(currentData) 01323 o.setup(self.currentTag, self.previous) 01324 if self.previous: 01325 self.previous.next = o 01326 self.previous = o 01327 self.currentTag.contents.append(o) 01328
def BeautifulSoup::BeautifulStoneSoup::endData | ( | self, | |
containerClass = NavigableString |
|||
) |
Definition at line 1306 of file BeautifulSoup.py.
01307 : 01308 if self.currentData: 01309 currentData = u''.join(self.currentData) 01310 if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and 01311 not set([tag.name for tag in self.tagStack]).intersection( 01312 self.PRESERVE_WHITESPACE_TAGS)): 01313 if '\n' in currentData: 01314 currentData = '\n' 01315 else: 01316 currentData = ' ' 01317 self.currentData = [] 01318 if self.parseOnlyThese and len(self.tagStack) <= 1 and \ 01319 (not self.parseOnlyThese.text or \ 01320 not self.parseOnlyThese.search(currentData)): 01321 return 01322 o = containerClass(currentData) 01323 o.setup(self.currentTag, self.previous) 01324 if self.previous: 01325 self.previous.next = o 01326 self.previous = o 01327 self.currentTag.contents.append(o) 01328
def BeautifulSoup::BeautifulStoneSoup::extractCharsetFromMeta | ( | self, | |
attrs | |||
) |
Reimplemented in BeautifulSoup::BeautifulSoup, and BeautifulSoup::BeautifulSoup.
Definition at line 1443 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::extractCharsetFromMeta | ( | self, | |
attrs | |||
) |
Reimplemented in BeautifulSoup::BeautifulSoup, and BeautifulSoup::BeautifulSoup.
Definition at line 1443 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::handle_data | ( | self, | |
data | |||
) |
Definition at line 1440 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::handle_data | ( | self, | |
data | |||
) |
Definition at line 1440 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::isSelfClosingTag | ( | self, | |
name | |||
) |
Returns true iff the given string is the name of a self-closing tag according to this parser.
Definition at line 1269 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::isSelfClosingTag | ( | self, | |
name | |||
) |
Returns true iff the given string is the name of a self-closing tag according to this parser.
Definition at line 1269 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::popTag | ( | self | ) |
Reimplemented in BeautifulSoup::BeautifulSOAP, and BeautifulSoup::BeautifulSOAP.
Definition at line 1285 of file BeautifulSoup.py.
01286 : 01287 tag = self.tagStack.pop() 01288 # Tags with just one string-owning child get the child as a 01289 # 'string' property, so that soup.tag.string is shorthand for 01290 # soup.tag.contents[0] 01291 if len(self.currentTag.contents) == 1 and \ 01292 isinstance(self.currentTag.contents[0], NavigableString): 01293 self.currentTag.string = self.currentTag.contents[0] 01294 01295 #print "Pop", tag.name 01296 if self.tagStack: 01297 self.currentTag = self.tagStack[-1] 01298 return self.currentTag
def BeautifulSoup::BeautifulStoneSoup::popTag | ( | self | ) |
Reimplemented in BeautifulSoup::BeautifulSOAP, and BeautifulSoup::BeautifulSOAP.
Definition at line 1285 of file BeautifulSoup.py.
01286 : 01287 tag = self.tagStack.pop() 01288 # Tags with just one string-owning child get the child as a 01289 # 'string' property, so that soup.tag.string is shorthand for 01290 # soup.tag.contents[0] 01291 if len(self.currentTag.contents) == 1 and \ 01292 isinstance(self.currentTag.contents[0], NavigableString): 01293 self.currentTag.string = self.currentTag.contents[0] 01294 01295 #print "Pop", tag.name 01296 if self.tagStack: 01297 self.currentTag = self.tagStack[-1] 01298 return self.currentTag
def BeautifulSoup::BeautifulStoneSoup::pushTag | ( | self, | |
tag | |||
) |
Definition at line 1299 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::pushTag | ( | self, | |
tag | |||
) |
Definition at line 1299 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::reset | ( | self | ) |
Definition at line 1275 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::reset | ( | self | ) |
Definition at line 1275 of file BeautifulSoup.py.
def BeautifulSoup::BeautifulStoneSoup::unknown_endtag | ( | self, | |
name | |||
) |
Definition at line 1427 of file BeautifulSoup.py.
01428 : 01429 #print "End tag %s" % name 01430 if self.quoteStack and self.quoteStack[-1] != name: 01431 #This is not a real end tag. 01432 #print "</%s> is not real!" % name 01433 self.handle_data('</%s>' % name) 01434 return 01435 self.endData() 01436 self._popToTag(name) 01437 if self.quoteStack and self.quoteStack[-1] == name: 01438 self.quoteStack.pop() 01439 self.literal = (len(self.quoteStack) > 0)
def BeautifulSoup::BeautifulStoneSoup::unknown_endtag | ( | self, | |
name | |||
) |
Definition at line 1427 of file BeautifulSoup.py.
01428 : 01429 #print "End tag %s" % name 01430 if self.quoteStack and self.quoteStack[-1] != name: 01431 #This is not a real end tag. 01432 #print "</%s> is not real!" % name 01433 self.handle_data('</%s>' % name) 01434 return 01435 self.endData() 01436 self._popToTag(name) 01437 if self.quoteStack and self.quoteStack[-1] == name: 01438 self.quoteStack.pop() 01439 self.literal = (len(self.quoteStack) > 0)
def BeautifulSoup::BeautifulStoneSoup::unknown_starttag | ( | self, | |
name, | |||
attrs, | |||
selfClosing = 0 |
|||
) |
Definition at line 1397 of file BeautifulSoup.py.
01398 : 01399 #print "Start tag %s: %s" % (name, attrs) 01400 if self.quoteStack: 01401 #This is not a real tag. 01402 #print "<%s> is not real!" % name 01403 attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs)) 01404 self.handle_data('<%s%s>' % (name, attrs)) 01405 return 01406 self.endData() 01407 01408 if not self.isSelfClosingTag(name) and not selfClosing: 01409 self._smartPop(name) 01410 01411 if self.parseOnlyThese and len(self.tagStack) <= 1 \ 01412 and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)): 01413 return 01414 01415 tag = Tag(self, name, attrs, self.currentTag, self.previous) 01416 if self.previous: 01417 self.previous.next = tag 01418 self.previous = tag 01419 self.pushTag(tag) 01420 if selfClosing or self.isSelfClosingTag(name): 01421 self.popTag() 01422 if name in self.QUOTE_TAGS: 01423 #print "Beginning quote (%s)" % name 01424 self.quoteStack.append(name) 01425 self.literal = 1 01426 return tag
def BeautifulSoup::BeautifulStoneSoup::unknown_starttag | ( | self, | |
name, | |||
attrs, | |||
selfClosing = 0 |
|||
) |
Definition at line 1397 of file BeautifulSoup.py.
01398 : 01399 #print "Start tag %s: %s" % (name, attrs) 01400 if self.quoteStack: 01401 #This is not a real tag. 01402 #print "<%s> is not real!" % name 01403 attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs)) 01404 self.handle_data('<%s%s>' % (name, attrs)) 01405 return 01406 self.endData() 01407 01408 if not self.isSelfClosingTag(name) and not selfClosing: 01409 self._smartPop(name) 01410 01411 if self.parseOnlyThese and len(self.tagStack) <= 1 \ 01412 and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)): 01413 return 01414 01415 tag = Tag(self, name, attrs, self.currentTag, self.previous) 01416 if self.previous: 01417 self.previous.next = tag 01418 self.previous = tag 01419 self.pushTag(tag) 01420 if selfClosing or self.isSelfClosingTag(name): 01421 self.popTag() 01422 if name in self.QUOTE_TAGS: 01423 #print "Beginning quote (%s)" % name 01424 self.quoteStack.append(name) 01425 self.literal = 1 01426 return tag
Definition at line 1156 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1275 of file BeautifulSoup.py.
Definition at line 1275 of file BeautifulSoup.py.
Reimplemented in BeautifulSoup::BeautifulSoup.
Definition at line 1236 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1275 of file BeautifulSoup.py.
string BeautifulSoup::BeautifulStoneSoup::HTML_ENTITIES = "html" [static] |
Definition at line 1152 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
Definition at line 1397 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
list BeautifulSoup::BeautifulStoneSoup::MARKUP_MASSAGE [static] |
[(re.compile('(<[^<>]*)/>'), lambda x: x.group(1) + ' />'), (re.compile('<!\s+([^<>]*)>'), lambda x: '<!' + x.group(1) + '>') ]
Definition at line 1144 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulStoneSoup::NESTABLE_TAGS = {} [static] |
Reimplemented in BeautifulSoup::BeautifulSoup, BeautifulSoup::ICantBelieveItsBeautifulSoup, and BeautifulSoup::MinimalSoup.
Definition at line 1139 of file BeautifulSoup.py.
Reimplemented in BeautifulSoup::BeautifulSoup.
Definition at line 1236 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
list BeautifulSoup::BeautifulStoneSoup::PRESERVE_WHITESPACE_TAGS = [] [static] |
Reimplemented in BeautifulSoup::BeautifulSoup.
Definition at line 1142 of file BeautifulSoup.py.
Reimplemented from BeautifulSoup::PageElement.
Definition at line 1306 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulStoneSoup::QUOTE_TAGS = {} [static] |
Reimplemented in BeautifulSoup::BeautifulSoup.
Definition at line 1141 of file BeautifulSoup.py.
Definition at line 1275 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulStoneSoup::RESET_NESTING_TAGS = {} [static] |
Reimplemented in BeautifulSoup::BeautifulSoup, and BeautifulSoup::MinimalSoup.
Definition at line 1140 of file BeautifulSoup.py.
string BeautifulSoup::BeautifulStoneSoup::ROOT_TAG_NAME = u'[document]' [static] |
Definition at line 1150 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulStoneSoup::SELF_CLOSING_TAGS = {} [static] |
Reimplemented in BeautifulSoup::BeautifulSoup.
Definition at line 1138 of file BeautifulSoup.py.
Definition at line 1187 of file BeautifulSoup.py.
dictionary BeautifulSoup::BeautifulStoneSoup::STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, } [static] |
Definition at line 1162 of file BeautifulSoup.py.
Definition at line 1275 of file BeautifulSoup.py.
string BeautifulSoup::BeautifulStoneSoup::XHTML_ENTITIES = "xhtml" [static] |
Definition at line 1154 of file BeautifulSoup.py.
string BeautifulSoup::BeautifulStoneSoup::XML_ENTITIES = "xml" [static] |
Definition at line 1153 of file BeautifulSoup.py.