
BeautifulSoup.py
1 """Beautiful Soup
2 Elixir and Tonic
3 "The Screen-Scraper's Friend"
4 http://www.crummy.com/software/BeautifulSoup/
5 
6 Beautiful Soup parses a (possibly invalid) XML or HTML document into a
7 tree representation. It provides methods and Pythonic idioms that make
8 it easy to navigate, search, and modify the tree.
9 
10 A well-formed XML/HTML document yields a well-formed data
11 structure. An ill-formed XML/HTML document yields a correspondingly
12 ill-formed data structure. If your document is only locally
13 well-formed, you can use this library to find and process the
14 well-formed part of it.
15 
16 Beautiful Soup works with Python 2.2 and up. It has no external
17 dependencies, but you'll have more success at converting data to UTF-8
18 if you also install these three packages:
19 
20 * chardet, for auto-detecting character encodings
21  http://chardet.feedparser.org/
22 * cjkcodecs and iconv_codec, which add more encodings to the ones supported
23  by stock Python.
24  http://cjkpython.i18n.org/
25 
26 Beautiful Soup defines classes for two main parsing strategies:
27 
28  * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
29  language that kind of looks like XML.
30 
31  * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
32  or invalid. This class has web browser-like heuristics for
33  obtaining a sensible parse tree in the face of common HTML errors.
34 
35 Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
36 the encoding of an HTML or XML document, and converting it to
37 Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
38 
39 For more than you ever wanted to know about Beautiful Soup, see the
40 documentation:
41 http://www.crummy.com/software/BeautifulSoup/documentation.html
42 
43 Here, have some legalese:
44 
45 Copyright (c) 2004-2010, Leonard Richardson
46 
47 All rights reserved.
48 
49 Redistribution and use in source and binary forms, with or without
50 modification, are permitted provided that the following conditions are
51 met:
52 
53  * Redistributions of source code must retain the above copyright
54  notice, this list of conditions and the following disclaimer.
55 
56  * Redistributions in binary form must reproduce the above
57  copyright notice, this list of conditions and the following
58  disclaimer in the documentation and/or other materials provided
59  with the distribution.
60 
61  * Neither the name of the Beautiful Soup Consortium and All
62  Night Kosher Bakery nor the names of its contributors may be
63  used to endorse or promote products derived from this software
64  without specific prior written permission.
65 
66 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
67 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
68 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
69 A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
70 CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
71 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
72 PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
73 PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
74 LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
75 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
76 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
77 
78 """
79 from __future__ import generators
80 from __future__ import print_function
81 
82 __author__ = "Leonard Richardson (leonardr@segfault.org)"
83 __version__ = "3.2.1"
84 __copyright__ = "Copyright (c) 2004-2012 Leonard Richardson"
85 __license__ = "New-style BSD"
86 
87 from sgmllib import SGMLParser, SGMLParseError
88 import codecs
89 import markupbase
90 import types
91 import re
92 import sgmllib
93 try:
94  from htmlentitydefs import name2codepoint
95 except ImportError:
96  name2codepoint = {}
97 try:
98  set
99 except NameError:
100  from sets import Set as set
101 
102 #These hacks make Beautiful Soup able to parse XML with namespaces
103 sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
104 markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
105 
106 DEFAULT_OUTPUT_ENCODING = "utf-8"
107 
108 def _match_css_class(str):
109  """Build a RE to match the given CSS class."""
110  return re.compile(r"(^|.*\s)%s($|\s)" % str)
111 
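For illustration (a sketch, not part of the original file), the regular expression built by _match_css_class matches a class name anywhere inside a whitespace-separated class attribute value:

import re

matcher = re.compile(r"(^|.*\s)%s($|\s)" % "nav")   # what _match_css_class("nav") builds
print(bool(matcher.match("nav")))            # True
print(bool(matcher.match("menu nav top")))   # True
print(bool(matcher.match("navbar")))         # False -- the whole class name must match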
112 # First, the classes that represent markup elements.
113 
114 class PageElement(object):
115  """Contains the navigational information for some part of the page
116  (either a tag or a piece of text)"""
117 
118  def _invert(h):
119  "Cheap function to invert a hash."
120  i = {}
121  for k,v in h.items():
122  i[v] = k
123  return i
124 
125  XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
126  "quot" : '"',
127  "amp" : "&",
128  "lt" : "<",
129  "gt" : ">" }
130 
131  XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
132 
133  def setup(self, parent=None, previous=None):
134  """Sets up the initial relations between this element and
135  other elements."""
136  self.parent = parent
137  self.previous = previous
138  self.next = None
139  self.previousSibling = None
140  self.nextSibling = None
141  if self.parent and self.parent.contents:
142  self.previousSibling = self.parent.contents[-1]
143  self.previousSibling.nextSibling = self
144 
145  def replaceWith(self, replaceWith):
146  oldParent = self.parent
147  myIndex = self.parent.index(self)
148  if hasattr(replaceWith, "parent")\
149  and replaceWith.parent is self.parent:
150  # We're replacing this element with one of its siblings.
151  index = replaceWith.parent.index(replaceWith)
152  if index and index < myIndex:
153  # Furthermore, it comes before this element. That
154  # means that when we extract it, the index of this
155  # element will change.
156  myIndex = myIndex - 1
157  self.extract()
158  oldParent.insert(myIndex, replaceWith)
159 
160  def replaceWithChildren(self):
161  myParent = self.parent
162  myIndex = self.parent.index(self)
163  self.extract()
164  reversedChildren = list(self.contents)
165  reversedChildren.reverse()
166  for child in reversedChildren:
167  myParent.insert(myIndex, child)
168 
169  def extract(self):
170  """Destructively rips this element out of the tree."""
171  if self.parent:
172  try:
173  del self.parent.contents[self.parent.index(self)]
174  except ValueError:
175  pass
176 
177  #Find the two elements that would be next to each other if
178  #this element (and any children) hadn't been parsed. Connect
179  #the two.
180  lastChild = self._lastRecursiveChild()
181  nextElement = lastChild.next
182 
183  if self.previous:
184  self.previous.next = nextElement
185  if nextElement:
186  nextElement.previous = self.previous
187  self.previous = None
188  lastChild.next = None
189 
190  self.parent = None
191  if self.previousSibling:
192  self.previousSibling.nextSibling = self.nextSibling
193  if self.nextSibling:
194  self.nextSibling.previousSibling = self.previousSibling
195  self.previousSibling = self.nextSibling = None
196  return self
197 
198  def _lastRecursiveChild(self):
199  "Finds the last element beneath this object to be parsed."
200  lastChild = self
201  while hasattr(lastChild, 'contents') and lastChild.contents:
202  lastChild = lastChild.contents[-1]
203  return lastChild
204 
205  def insert(self, position, newChild):
206  if isinstance(newChild, str) \
207  and not isinstance(newChild, NavigableString):
208  newChild = NavigableString(newChild)
209 
210  position = min(position, len(self.contents))
211  if hasattr(newChild, 'parent') and newChild.parent is not None:
212  # We're 'inserting' an element that's already one
213  # of this object's children.
214  if newChild.parent is self:
215  index = self.index(newChild)
216  if index > position:
217  # Furthermore we're moving it further down the
218  # list of this object's children. That means that
219  # when we extract this element, our target index
220  # will jump down one.
221  position = position - 1
222  newChild.extract()
223 
224  newChild.parent = self
225  previousChild = None
226  if position == 0:
227  newChild.previousSibling = None
228  newChild.previous = self
229  else:
230  previousChild = self.contents[position-1]
231  newChild.previousSibling = previousChild
232  newChild.previousSibling.nextSibling = newChild
233  newChild.previous = previousChild._lastRecursiveChild()
234  if newChild.previous:
235  newChild.previous.next = newChild
236 
237  newChildsLastElement = newChild._lastRecursiveChild()
238 
239  if position >= len(self.contents):
240  newChild.nextSibling = None
241 
242  parent = self
243  parentsNextSibling = None
244  while not parentsNextSibling:
245  parentsNextSibling = parent.nextSibling
246  parent = parent.parent
247  if not parent: # This is the last element in the document.
248  break
249  if parentsNextSibling:
250  newChildsLastElement.next = parentsNextSibling
251  else:
252  newChildsLastElement.next = None
253  else:
254  nextChild = self.contents[position]
255  newChild.nextSibling = nextChild
256  if newChild.nextSibling:
257  newChild.nextSibling.previousSibling = newChild
258  newChildsLastElement.next = nextChild
259 
260  if newChildsLastElement.next:
261  newChildsLastElement.next.previous = newChildsLastElement
262  self.contents.insert(position, newChild)
263 
264  def append(self, tag):
265  """Appends the given tag to the contents of this tag."""
266  self.insert(len(self.contents), tag)
267 
268  def findNext(self, name=None, attrs={}, text=None, **kwargs):
269  """Returns the first item that matches the given criteria and
270  appears after this Tag in the document."""
271  return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
272 
273  def findAllNext(self, name=None, attrs={}, text=None, limit=None,
274  **kwargs):
275  """Returns all items that match the given criteria and appear
276  after this Tag in the document."""
277  return self._findAll(name, attrs, text, limit, self.nextGenerator,
278  **kwargs)
279 
280  def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
281  """Returns the closest sibling to this Tag that matches the
282  given criteria and appears after this Tag in the document."""
283  return self._findOne(self.findNextSiblings, name, attrs, text,
284  **kwargs)
285 
286  def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
287  **kwargs):
288  """Returns the siblings of this Tag that match the given
289  criteria and appear after this Tag in the document."""
290  return self._findAll(name, attrs, text, limit,
291  self.nextSiblingGenerator, **kwargs)
292  fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
293 
294  def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
295  """Returns the first item that matches the given criteria and
296  appears before this Tag in the document."""
297  return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
298 
299  def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
300  **kwargs):
301  """Returns all items that match the given criteria and appear
302  before this Tag in the document."""
303  return self._findAll(name, attrs, text, limit, self.previousGenerator,
304  **kwargs)
305  fetchPrevious = findAllPrevious # Compatibility with pre-3.x
306 
307  def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
308  """Returns the closest sibling to this Tag that matches the
309  given criteria and appears before this Tag in the document."""
310  return self._findOne(self.findPreviousSiblings, name, attrs, text,
311  **kwargs)
312 
313  def findPreviousSiblings(self, name=None, attrs={}, text=None,
314  limit=None, **kwargs):
315  """Returns the siblings of this Tag that match the given
316  criteria and appear before this Tag in the document."""
317  return self._findAll(name, attrs, text, limit,
318  self.previousSiblingGenerator, **kwargs)
319  fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
320 
321  def findParent(self, name=None, attrs={}, **kwargs):
322  """Returns the closest parent of this Tag that matches the given
323  criteria."""
324  # NOTE: We can't use _findOne because findParents takes a different
325  # set of arguments.
326  r = None
327  l = self.findParents(name, attrs, 1)
328  if l:
329  r = l[0]
330  return r
331 
332  def findParents(self, name=None, attrs={}, limit=None, **kwargs):
333  """Returns the parents of this Tag that match the given
334  criteria."""
335 
336  return self._findAll(name, attrs, None, limit, self.parentGenerator,
337  **kwargs)
338  fetchParents = findParents # Compatibility with pre-3.x
339 
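The finders above search forwards, backwards, and upwards from an element rather than within its own children. A brief sketch of typical use (illustrative, not part of the original file; assumes the module is importable as BeautifulSoup):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>a</td><td>b</td></tr></table>")
cell = soup.find('td')
print(cell.findNextSibling('td'))        # <td>b</td>
print(cell.findParent('table').name)     # u'table'
print(cell.findNext(text='b'))           # u'b'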
340  #These methods do the real heavy lifting.
341 
342  def _findOne(self, method, name, attrs, text, **kwargs):
343  r = None
344  l = method(name, attrs, text, 1, **kwargs)
345  if l:
346  r = l[0]
347  return r
348 
349  def _findAll(self, name, attrs, text, limit, generator, **kwargs):
350  "Iterates over a generator looking for things that match."
351 
352  if isinstance(name, SoupStrainer):
353  strainer = name
354  # (Possibly) special case some findAll*(...) searches
355  elif text is None and not limit and not attrs and not kwargs:
356  # findAll*(True)
357  if name is True:
358  return [element for element in generator()
359  if isinstance(element, Tag)]
360  # findAll*('tag-name')
361  elif isinstance(name, str):
362  return [element for element in generator()
363  if isinstance(element, Tag) and
364  element.name == name]
365  else:
366  strainer = SoupStrainer(name, attrs, text, **kwargs)
367  # Build a SoupStrainer
368  else:
369  strainer = SoupStrainer(name, attrs, text, **kwargs)
370  results = ResultSet(strainer)
371  g = generator()
372  while True:
373  try:
374  i = next(g)
375  except StopIteration:
376  break
377  if i:
378  found = strainer.search(i)
379  if found:
380  results.append(found)
381  if limit and len(results) >= limit:
382  break
383  return results
384 
385  #These Generators can be used to navigate starting from both
386  #NavigableStrings and Tags.
387  def nextGenerator(self):
388  i = self
389  while i is not None:
390  i = i.next
391  yield i
392 
393  def nextSiblingGenerator(self):
394  i = self
395  while i is not None:
396  i = i.nextSibling
397  yield i
398 
399  def previousGenerator(self):
400  i = self
401  while i is not None:
402  i = i.previous
403  yield i
404 
405  def previousSiblingGenerator(self):
406  i = self
407  while i is not None:
408  i = i.previousSibling
409  yield i
410 
411  def parentGenerator(self):
412  i = self
413  while i is not None:
414  i = i.parent
415  yield i
416 
417  # Utility methods
418  def substituteEncoding(self, str, encoding=None):
419  encoding = encoding or "utf-8"
420  return str.replace("%SOUP-ENCODING%", encoding)
421 
422  def toEncoding(self, s, encoding=None):
423  """Encodes an object to a string in some encoding, or to Unicode."""
424 
425  if isinstance(s, unicode):
426  if encoding:
427  s = s.encode(encoding)
428  elif isinstance(s, str):
429  if encoding:
430  s = s.encode(encoding)
431  else:
432  s = unicode(s)
433  else:
434  if encoding:
435  s = self.toEncoding(str(s), encoding)
436  else:
437  s = unicode(s)
438  return s
439 
440  BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
441  + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
442  + ")")
443 
444  def _sub_entity(self, x):
445  """Used with a regular expression to substitute the
446  appropriate XML entity for an XML special character."""
447  return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
448 
449 
450 class NavigableString(unicode, PageElement):
451 
452  def __new__(cls, value):
453  """Create a new NavigableString.
454 
455  When unpickling a NavigableString, this method is called with
456  the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
457  passed in to the superclass's __new__ or the superclass won't know
458  how to handle non-ASCII characters.
459  """
460  if isinstance(value, unicode):
461  return unicode.__new__(cls, value)
462  return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
463 
464  def __getnewargs__(self):
465  return (NavigableString.__str__(self),)
466 
467  def __getattr__(self, attr):
468  """text.string gives you text. This is for backwards
469  compatibility for Navigable*String, but for CData* it lets you
470  get the string without the CData wrapper."""
471  if attr == 'string':
472  return self
473  else:
474  raise AttributeError("'%s' object has no attribute '%s'" % (self.__class__.__name__, attr))
475 
476  def __unicode__(self):
477  return str(self).decode(DEFAULT_OUTPUT_ENCODING)
478 
479  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
480  # Substitute outgoing XML entities.
481  data = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, self)
482  if encoding:
483  return data.encode(encoding)
484  else:
485  return data
486 
487 class CData(NavigableString):
488 
489  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
490  return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)
491 
492 class ProcessingInstruction(NavigableString):
493  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
494  output = self
495  if "%SOUP-ENCODING%" in output:
496  output = self.substituteEncoding(output, encoding)
497  return "<?%s?>" % self.toEncoding(output, encoding)
498 
499 class Comment(NavigableString):
500  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
501  return "<!--%s-->" % NavigableString.__str__(self, encoding)
502 
503 class Declaration(NavigableString):
504  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
505  return "<!%s>" % NavigableString.__str__(self, encoding)
506 
507 class Tag(PageElement):
508 
509  """Represents a found HTML tag with its attributes and contents."""
510 
511  def _convertEntities(self, match):
512  """Used in a call to re.sub to replace HTML, XML, and numeric
513  entities with the appropriate Unicode characters. If HTML
514  entities are being converted, any unrecognized entities are
515  escaped."""
516  x = match.group(1)
517  if self.convertHTMLEntities and x in name2codepoint:
518  return unichr(name2codepoint[x])
519  elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
520  if self.convertXMLEntities:
521  return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
522  else:
523  return u'&%s;' % x
524  elif len(x) > 0 and x[0] == '#':
525  # Handle numeric entities
526  if len(x) > 1 and x[1] == 'x':
527  return unichr(int(x[2:], 16))
528  else:
529  return unichr(int(x[1:]))
530 
531  elif self.escapeUnrecognizedEntities:
532  return u'&amp;%s;' % x
533  else:
534  return u'&%s;' % x
535 
536  def __init__(self, parser, name, attrs=None, parent=None,
537  previous=None):
538  "Basic constructor."
539 
540  # We don't actually store the parser object: that lets extracted
541  # chunks be garbage-collected
542  self.parserClass = parser.__class__
543  self.isSelfClosing = parser.isSelfClosingTag(name)
544  self.name = name
545  if attrs is None:
546  attrs = []
547  elif isinstance(attrs, dict):
548  attrs = attrs.items()
549  self.attrs = attrs
550  self.contents = []
551  self.setup(parent, previous)
552  self.hidden = False
553  self.containsSubstitutions = False
554  self.convertHTMLEntities = parser.convertHTMLEntities
555  self.convertXMLEntities = parser.convertXMLEntities
556  self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
557 
558  # Convert any HTML, XML, or numeric entities in the attribute values.
559  convert = lambda k_val: (k_val[0],
560  re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
561  self._convertEntities,
562  k_val[1]))
563  self.attrs = map(convert, self.attrs)
564 
565  def getString(self):
566  if (len(self.contents) == 1
567  and isinstance(self.contents[0], NavigableString)):
568  return self.contents[0]
569 
570  def setString(self, string):
571  """Replace the contents of the tag with a string"""
572  self.clear()
573  self.append(string)
574 
575  string = property(getString, setString)
576 
577  def getText(self, separator=u""):
578  if not len(self.contents):
579  return u""
580  stopNode = self._lastRecursiveChild().next
581  strings = []
582  current = self.contents[0]
583  while current is not stopNode:
584  if isinstance(current, NavigableString):
585  strings.append(current.strip())
586  current = current.next
587  return separator.join(strings)
588 
589  text = property(getText)
590 
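Illustrative behaviour of the string and text properties defined above (a sketch, not part of the original file):

from BeautifulSoup import BeautifulSoup   # assumed import path

soup = BeautifulSoup("<p>Hello <b>world</b></p>")
print(soup.b.string)   # u'world' -- exactly one NavigableString child
print(soup.p.string)   # None     -- more than one child, so no single string
print(soup.p.text)     # u'Helloworld' -- child strings stripped and joined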
591  def get(self, key, default=None):
592  """Returns the value of the 'key' attribute for the tag, or
593  the value given for 'default' if it doesn't have that
594  attribute."""
595  return self._getAttrMap().get(key, default)
596 
597  def clear(self):
598  """Extract all children."""
599  for child in self.contents[:]:
600  child.extract()
601 
602  def index(self, element):
603  for i, child in enumerate(self.contents):
604  if child is element:
605  return i
606  raise ValueError("Tag.index: element not in tag")
607 
608  def has_key(self, key):
609  return key in self._getAttrMap()
610 
611  def __getitem__(self, key):
612  """tag[key] returns the value of the 'key' attribute for the tag,
613  and throws an exception if it's not there."""
614  return self._getAttrMap()[key]
615 
616  def __iter__(self):
617  "Iterating over a tag iterates over its contents."
618  return iter(self.contents)
619 
620  def __len__(self):
621  "The length of a tag is the length of its list of contents."
622  return len(self.contents)
623 
624  def __contains__(self, x):
625  return x in self.contents
626 
627  def __nonzero__(self):
628  "A tag is non-None even if it has no contents."
629  return True
630 
631  def __setitem__(self, key, value):
632  """Setting tag[key] sets the value of the 'key' attribute for the
633  tag."""
634  self._getAttrMap()
635  self.attrMap[key] = value
636  found = False
637  for i in range(0, len(self.attrs)):
638  if self.attrs[i][0] == key:
639  self.attrs[i] = (key, value)
640  found = True
641  if not found:
642  self.attrs.append((key, value))
643  self._getAttrMap()[key] = value
644 
645  def __delitem__(self, key):
646  "Deleting tag[key] deletes all 'key' attributes for the tag."
647  for item in self.attrs:
648  if item[0] == key:
649  self.attrs.remove(item)
650  #We don't break because bad HTML can define the same
651  #attribute multiple times.
652  self._getAttrMap()
653  if key in self.attrMap:
654  del self.attrMap[key]
655 
656  def __call__(self, *args, **kwargs):
657  """Calling a tag like a function is the same as calling its
658  findAll() method. Eg. tag('a') returns a list of all the A tags
659  found within this tag."""
660  return self.findAll(*args, **kwargs)
661 
662  def __getattr__(self, tag):
663  #print "Getattr %s.%s" % (self.__class__, tag)
664  if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
665  return self.find(tag[:-3])
666  elif tag.find('__') != 0:
667  return self.find(tag)
668  raise AttributeError("'%s' object has no attribute '%s'" % (self.__class__, tag))
669 
670  def __eq__(self, other):
671  """Returns true iff this tag has the same name, the same attributes,
672  and the same contents (recursively) as the given tag.
673 
674  NOTE: right now this will return false if two tags have the
675  same attributes in a different order. Should this be fixed?"""
676  if other is self:
677  return True
678  if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
679  return False
680  for i in range(0, len(self.contents)):
681  if self.contents[i] != other.contents[i]:
682  return False
683  return True
684 
685  def __ne__(self, other):
686  """Returns true iff this tag is not identical to the other tag,
687  as defined in __eq__."""
688  return not self == other
689 
690  def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
691  """Renders this tag as a string."""
692  return self.__str__(encoding)
693 
694  def __unicode__(self):
695  return self.__str__(None)
696 
697  def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
698  prettyPrint=False, indentLevel=0):
699  """Returns a string or Unicode representation of this tag and
700  its contents. To get Unicode, pass None for encoding.
701 
702  NOTE: since Python's HTML parser consumes whitespace, this
703  method is not certain to reproduce the whitespace present in
704  the original string."""
705 
706  encodedName = self.toEncoding(self.name, encoding)
707 
708  attrs = []
709  if self.attrs:
710  for key, val in self.attrs:
711  fmt = '%s="%s"'
712  if isinstance(val, str):
713  if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
714  val = self.substituteEncoding(val, encoding)
715 
716  # The attribute value either:
717  #
718  # * Contains no embedded double quotes or single quotes.
719  # No problem: we enclose it in double quotes.
720  # * Contains embedded single quotes. No problem:
721  # double quotes work here too.
722  # * Contains embedded double quotes. No problem:
723  # we enclose it in single quotes.
724  # * Embeds both single _and_ double quotes. This
725  # can't happen naturally, but it can happen if
726  # you modify an attribute value after parsing
727  # the document. Now we have a bit of a
728  # problem. We solve it by enclosing the
729  # attribute in single quotes, and escaping any
730  # embedded single quotes to XML entities.
731  if '"' in val:
732  fmt = "%s='%s'"
733  if "'" in val:
734  # TODO: replace with apos when
735  # appropriate.
736  val = val.replace("'", "&squot;")
737 
738  # Now we're okay w/r/t quotes. But the attribute
739  # value might also contain angle brackets, or
740  # ampersands that aren't part of entities. We need
741  # to escape those to XML entities too.
742  val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
743 
744  attrs.append(fmt % (self.toEncoding(key, encoding),
745  self.toEncoding(val, encoding)))
746  close = ''
747  closeTag = ''
748  if self.isSelfClosing:
749  close = ' /'
750  else:
751  closeTag = '</%s>' % encodedName
752 
753  indentTag, indentContents = 0, 0
754  if prettyPrint:
755  indentTag = indentLevel
756  space = (' ' * (indentTag-1))
757  indentContents = indentTag + 1
758  contents = self.renderContents(encoding, prettyPrint, indentContents)
759  if self.hidden:
760  s = contents
761  else:
762  s = []
763  attributeString = ''
764  if attrs:
765  attributeString = ' ' + ' '.join(attrs)
766  if prettyPrint:
767  s.append(space)
768  s.append('<%s%s%s>' % (encodedName, attributeString, close))
769  if prettyPrint:
770  s.append("\n")
771  s.append(contents)
772  if prettyPrint and contents and contents[-1] != "\n":
773  s.append("\n")
774  if prettyPrint and closeTag:
775  s.append(space)
776  s.append(closeTag)
777  if prettyPrint and closeTag and self.nextSibling:
778  s.append("\n")
779  s = ''.join(s)
780  return s
781 
782  def decompose(self):
783  """Recursively destroys the contents of this tree."""
784  self.extract()
785  if len(self.contents) == 0:
786  return
787  current = self.contents[0]
788  while current is not None:
789  next = current.next
790  if isinstance(current, Tag):
791  del current.contents[:]
792  current.parent = None
793  current.previous = None
794  current.previousSibling = None
795  current.next = None
796  current.nextSibling = None
797  current = next
798 
799  def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
800  return self.__str__(encoding, True)
801 
802  def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
803  prettyPrint=False, indentLevel=0):
804  """Renders the contents of this tag as a string in the given
805  encoding. If encoding is None, returns a Unicode string."""
806  s=[]
807  for c in self:
808  text = None
809  if isinstance(c, NavigableString):
810  text = c.__str__(encoding)
811  elif isinstance(c, Tag):
812  s.append(c.__str__(encoding, prettyPrint, indentLevel))
813  if text and prettyPrint:
814  text = text.strip()
815  if text:
816  if prettyPrint:
817  s.append(" " * (indentLevel-1))
818  s.append(text)
819  if prettyPrint:
820  s.append("\n")
821  return ''.join(s)
822 
823  #Soup methods
824 
825  def find(self, name=None, attrs={}, recursive=True, text=None,
826  **kwargs):
827  """Return only the first child of this Tag matching the given
828  criteria."""
829  r = None
830  l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
831  if l:
832  r = l[0]
833  return r
834  findChild = find
835 
836  def findAll(self, name=None, attrs={}, recursive=True, text=None,
837  limit=None, **kwargs):
838  """Extracts a list of Tag objects that match the given
839  criteria. You can specify the name of the Tag and any
840  attributes you want the Tag to have.
841 
842  The value of a key-value pair in the 'attrs' map can be a
843  string, a list of strings, a regular expression object, or a
844  callable that takes a string and returns whether or not the
845  string matches for some custom definition of 'matches'. The
846  same is true of the tag name."""
847  generator = self.recursiveChildGenerator
848  if not recursive:
849  generator = self.childGenerator
850  return self._findAll(name, attrs, text, limit, generator, **kwargs)
851  findChildren = findAll
852 
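As the docstring above notes, both the tag name and attribute values can be matched by a string, a list, a regular expression, or a callable. A sketch (illustrative, not part of the original file):

import re
from BeautifulSoup import BeautifulSoup   # assumed import path

soup = BeautifulSoup('<a href="http://x.example/">x</a><a id="top">y</a><b>z</b>')
print(soup.findAll('a'))                              # both <a> tags
print(soup.findAll(['a', 'b']))                       # <a>, <a> and <b>
print(soup.findAll('a', href=re.compile('^http')))    # only the first <a>
print(soup.findAll(lambda tag: len(tag.attrs) == 1))  # tags with exactly one attribute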
853  # Pre-3.x compatibility methods
854  first = find
855  fetch = findAll
856 
857  def fetchText(self, text=None, recursive=True, limit=None):
858  return self.findAll(text=text, recursive=recursive, limit=limit)
859 
860  def firstText(self, text=None, recursive=True):
861  return self.find(text=text, recursive=recursive)
862 
863  #Private methods
864 
865  def _getAttrMap(self):
866  """Initializes a map representation of this tag's attributes,
867  if not already initialized."""
868  if not getattr(self, 'attrMap'):
869  self.attrMap = {}
870  for (key, value) in self.attrs:
871  self.attrMap[key] = value
872  return self.attrMap
873 
874  #Generator methods
875  def childGenerator(self):
876  # Just use the iterator from the contents
877  return iter(self.contents)
878 
879  def recursiveChildGenerator(self):
880  if not len(self.contents):
881  raise StopIteration
882  stopNode = self._lastRecursiveChild().next
883  current = self.contents[0]
884  while current is not stopNode:
885  yield current
886  current = current.next
887 
888 
889 # Next, a couple classes to represent queries and their results.
890 class SoupStrainer:
891  """Encapsulates a number of ways of matching a markup element (tag or
892  text)."""
893 
894  def __init__(self, name=None, attrs={}, text=None, **kwargs):
895  self.name = name
896  if isinstance(attrs, str):
897  kwargs['class'] = _match_css_class(attrs)
898  attrs = None
899  if kwargs:
900  if attrs:
901  attrs = attrs.copy()
902  attrs.update(kwargs)
903  else:
904  attrs = kwargs
905  self.attrs = attrs
906  self.text = text
907 
908  def __str__(self):
909  if self.text:
910  return self.text
911  else:
912  return "%s|%s" % (self.name, self.attrs)
913 
914  def searchTag(self, markupName=None, markupAttrs={}):
915  found = None
916  markup = None
917  if isinstance(markupName, Tag):
918  markup = markupName
919  markupAttrs = markup
920  callFunctionWithTagData = callable(self.name) \
921  and not isinstance(markupName, Tag)
922 
923  if (not self.name) \
924  or callFunctionWithTagData \
925  or (markup and self._matches(markup, self.name)) \
926  or (not markup and self._matches(markupName, self.name)):
927  if callFunctionWithTagData:
928  match = self.name(markupName, markupAttrs)
929  else:
930  match = True
931  markupAttrMap = None
932  for attr, matchAgainst in self.attrs.items():
933  if not markupAttrMap:
934  if hasattr(markupAttrs, 'get'):
935  markupAttrMap = markupAttrs
936  else:
937  markupAttrMap = {}
938  for k,v in markupAttrs:
939  markupAttrMap[k] = v
940  attrValue = markupAttrMap.get(attr)
941  if not self._matches(attrValue, matchAgainst):
942  match = False
943  break
944  if match:
945  if markup:
946  found = markup
947  else:
948  found = markupName
949  return found
950 
951  def search(self, markup):
952  #print 'looking for %s in %s' % (self, markup)
953  found = None
954  # If given a list of items, scan it for a text element that
955  # matches.
956  if hasattr(markup, "__iter__") \
957  and not isinstance(markup, Tag):
958  for element in markup:
959  if isinstance(element, NavigableString) \
960  and self.search(element):
961  found = element
962  break
963  # If it's a Tag, make sure its name or attributes match.
964  # Don't bother with Tags if we're searching for text.
965  elif isinstance(markup, Tag):
966  if not self.text:
967  found = self.searchTag(markup)
968  # If it's text, make sure the text matches.
969  elif isinstance(markup, NavigableString) or \
970  isinstance(markup, str):
971  if self._matches(markup, self.text):
972  found = markup
973  else:
974  raise Exception("I don't know how to match against a %s" \
975  % markup.__class__)
976  return found
977 
978  def _matches(self, markup, matchAgainst):
979  #print "Matching %s against %s" % (markup, matchAgainst)
980  result = False
981  if matchAgainst is True:
982  result = markup is not None
983  elif callable(matchAgainst):
984  result = matchAgainst(markup)
985  else:
986  #Custom match methods take the tag as an argument, but all
987  #other ways of matching match the tag name as a string.
988  if isinstance(markup, Tag):
989  markup = markup.name
990  if markup and not isinstance(markup, str):
991  markup = unicode(markup)
992  #Now we know that chunk is either a string, or None.
993  if hasattr(matchAgainst, 'match'):
994  # It's a regexp object.
995  result = markup and matchAgainst.search(markup)
996  elif hasattr(matchAgainst, '__iter__'): # list-like
997  result = markup in matchAgainst
998  elif hasattr(matchAgainst, 'items'):
999  result = matchAgainst in markup
1000  elif matchAgainst and isinstance(markup, str):
1001  if isinstance(markup, unicode):
1002  matchAgainst = unicode(matchAgainst)
1003  else:
1004  matchAgainst = str(matchAgainst)
1005 
1006  if not result:
1007  result = matchAgainst == markup
1008  return result
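A SoupStrainer built this way can be handed straight to findAll, or to the parser constructor as parseOnlyThese to restrict which tags are built at all. A small sketch (illustrative, not part of the original file):

from BeautifulSoup import BeautifulSoup, SoupStrainer   # assumed import path

soup = BeautifulSoup('<div class="x"><a id="1">a</a></div><a id="2">b</a>')
links = SoupStrainer('a')
print(soup.findAll(links))                 # both <a> tags
print(soup.findAll(SoupStrainer(id='2')))  # only the second <a>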
1009 
1010 class ResultSet(list):
1011  """A ResultSet is just a list that keeps track of the SoupStrainer
1012  that created it."""
1013  def __init__(self, source):
1014  list.__init__([])
1015  self.source = source
1016 
1017 # Now, some helper functions.
1018 
1019 def buildTagMap(default, *args):
1020  """Turns a list of maps, lists, or scalars into a single map.
1021  Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
1022  NESTING_RESET_TAGS maps out of lists and partial maps."""
1023  built = {}
1024  for portion in args:
1025  if hasattr(portion, 'items'):
1026  #It's a map. Merge it.
1027  for k,v in portion.items():
1028  built[k] = v
1029  elif hasattr(portion, '__iter__'): # is a list
1030  #It's a list. Map each item to the default.
1031  for k in portion:
1032  built[k] = default
1033  else:
1034  #It's a scalar. Map it to the default.
1035  built[portion] = default
1036  return built
1037 
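For illustration (not part of the original file), buildTagMap merges maps, maps list items to the default, and maps scalars to the default:

from BeautifulSoup import buildTagMap   # assumed import path

print(buildTagMap(None, ('br', 'hr'), {'li': ['ul', 'ol']}, 'img'))
# => {'br': None, 'hr': None, 'li': ['ul', 'ol'], 'img': None}  (key order may vary)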
1038 # Now, the parser classes.
1039 
1040 class BeautifulStoneSoup(Tag, SGMLParser):
1041 
1042  """This class contains the basic parser and search code. It defines
1043  a parser that knows nothing about tag behavior except for the
1044  following:
1045 
1046  You can't close a tag without closing all the tags it encloses.
1047  That is, "<foo><bar></foo>" actually means
1048  "<foo><bar></bar></foo>".
1049 
1050  [Another possible explanation is "<foo><bar /></foo>", but since
1051  this class defines no SELF_CLOSING_TAGS, it will never use that
1052  explanation.]
1053 
1054  This class is useful for parsing XML or made-up markup languages,
1055  or when BeautifulSoup makes an assumption counter to what you were
1056  expecting."""
1057 
1058  SELF_CLOSING_TAGS = {}
1059  NESTABLE_TAGS = {}
1060  RESET_NESTING_TAGS = {}
1061  QUOTE_TAGS = {}
1062  PRESERVE_WHITESPACE_TAGS = []
1063 
1064  MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
1065  lambda x: x.group(1) + ' />'),
1066  (re.compile('<!\s+([^<>]*)>'),
1067  lambda x: '<!' + x.group(1) + '>')
1068  ]
1069 
1070  ROOT_TAG_NAME = u'[document]'
1071 
1072  HTML_ENTITIES = "html"
1073  XML_ENTITIES = "xml"
1074  XHTML_ENTITIES = "xhtml"
1075  # TODO: This only exists for backwards-compatibility
1076  ALL_ENTITIES = XHTML_ENTITIES
1077 
1078  # Used when determining whether a text node is all whitespace and
1079  # can be replaced with a single space. A text node that contains
1080  # fancy Unicode spaces (usually non-breaking) should be left
1081  # alone.
1082  STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
1083 
1084  def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
1085  markupMassage=True, smartQuotesTo=XML_ENTITIES,
1086  convertEntities=None, selfClosingTags=None, isHTML=False):
1087  """The Soup object is initialized as the 'root tag', and the
1088  provided markup (which can be a string or a file-like object)
1089  is fed into the underlying parser.
1090 
1091  sgmllib will process most bad HTML, and the BeautifulSoup
1092  class has some tricks for dealing with some HTML that kills
1093  sgmllib, but Beautiful Soup can nonetheless choke or lose data
1094  if your data uses self-closing tags or declarations
1095  incorrectly.
1096 
1097  By default, Beautiful Soup uses regexes to sanitize input,
1098  avoiding the vast majority of these problems. If the problems
1099  don't apply to you, pass in False for markupMassage, and
1100  you'll get better performance.
1101 
1102  The default parser massage techniques fix the two most common
1103  instances of invalid HTML that choke sgmllib:
1104 
1105  <br/> (No space between name of closing tag and tag close)
1106  <! --Comment--> (Extraneous whitespace in declaration)
1107 
1108  You can pass in a custom list of (RE object, replace method)
1109  tuples to get Beautiful Soup to scrub your input the way you
1110  want."""
1111 
1112  self.parseOnlyThese = parseOnlyThese
1113  self.fromEncoding = fromEncoding
1114  self.smartQuotesTo = smartQuotesTo
1115  self.convertEntities = convertEntities
1116  # Set the rules for how we'll deal with the entities we
1117  # encounter
1118  if self.convertEntities:
1119  # It doesn't make sense to convert encoded characters to
1120  # entities even while you're converting entities to Unicode.
1121  # Just convert it all to Unicode.
1122  self.smartQuotesTo = None
1123  if convertEntities == self.HTML_ENTITIES:
1124  self.convertXMLEntities = False
1125  self.convertHTMLEntities = True
1126  self.escapeUnrecognizedEntities = True
1127  elif convertEntities == self.XHTML_ENTITIES:
1128  self.convertXMLEntities = True
1129  self.convertHTMLEntities = True
1130  self.escapeUnrecognizedEntities = False
1131  elif convertEntities == self.XML_ENTITIES:
1132  self.convertXMLEntities = True
1133  self.convertHTMLEntities = False
1134  self.escapeUnrecognizedEntities = False
1135  else:
1136  self.convertXMLEntities = False
1137  self.convertHTMLEntities = False
1138  self.escapeUnrecognizedEntities = False
1139 
1140  self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
1141  SGMLParser.__init__(self)
1142 
1143  if hasattr(markup, 'read'): # It's a file-type object.
1144  markup = markup.read()
1145  self.markup = markup
1146  self.markupMassage = markupMassage
1147  try:
1148  self._feed(isHTML=isHTML)
1149  except StopParsing:
1150  pass
1151  self.markup = None # The markup can now be GCed
1152 
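A sketch of the constructor options described in the docstring above (illustrative, not part of the original file; assumes the module is importable as BeautifulSoup):

from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup, SoupStrainer

# Skip the regex clean-up pass when the markup is known to be well formed.
xml = BeautifulStoneSoup("<doc><a>1</a></doc>", markupMassage=False)

# Convert HTML entities and character references to Unicode while parsing.
page = BeautifulSoup("S&eacute;bastien &amp; friends",
                     convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

# Build only the <a> tags, ignoring everything else in the document.
links = BeautifulSoup('<p>x</p><a href="/y">y</a>',
                      parseOnlyThese=SoupStrainer('a'))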
1153  def convert_charref(self, name):
1154  """This method fixes a bug in Python's SGMLParser."""
1155  try:
1156  n = int(name)
1157  except ValueError:
1158  return
1159  if not 0 <= n <= 127 : # ASCII ends at 127, not 255
1160  return
1161  return self.convert_codepoint(n)
1162 
1163  def _feed(self, inDocumentEncoding=None, isHTML=False):
1164  # Convert the document to Unicode.
1165  markup = self.markup
1166  if isinstance(markup, unicode):
1167  if not hasattr(self, 'originalEncoding'):
1168  self.originalEncoding = None
1169  else:
1170  dammit = UnicodeDammit\
1171  (markup, [self.fromEncoding, inDocumentEncoding],
1172  smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
1173  markup = dammit.unicode
1174  self.originalEncoding = dammit.originalEncoding
1175  self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
1176  if markup:
1177  if self.markupMassage:
1178  if not hasattr(self.markupMassage, "__iter__"):
1179  self.markupMassage = self.MARKUP_MASSAGE
1180  for fix, m in self.markupMassage:
1181  markup = fix.sub(m, markup)
1182  # TODO: We get rid of markupMassage so that the
1183  # soup object can be deepcopied later on. Some
1184  # Python installations can't copy regexes. If anyone
1185  # was relying on the existence of markupMassage, this
1186  # might cause problems.
1187  del(self.markupMassage)
1188  self.reset()
1189 
1190  SGMLParser.feed(self, markup)
1191  # Close out any unfinished strings and close all the open tags.
1192  self.endData()
1193  while self.currentTag.name != self.ROOT_TAG_NAME:
1194  self.popTag()
1195 
1196  def __getattr__(self, methodName):
1197  """This method routes method call requests to either the SGMLParser
1198  superclass or the Tag superclass, depending on the method name."""
1199  #print "__getattr__ called on %s.%s" % (self.__class__, methodName)
1200 
1201  if methodName.startswith('start_') or methodName.startswith('end_') \
1202  or methodName.startswith('do_'):
1203  return SGMLParser.__getattr__(self, methodName)
1204  elif not methodName.startswith('__'):
1205  return Tag.__getattr__(self, methodName)
1206  else:
1207  raise AttributeError
1208 
1209  def isSelfClosingTag(self, name):
1210  """Returns true iff the given string is the name of a
1211  self-closing tag according to this parser."""
1212  return name in self.SELF_CLOSING_TAGS \
1213  or name in self.instanceSelfClosingTags
1214 
1215  def reset(self):
1216  Tag.__init__(self, self, self.ROOT_TAG_NAME)
1217  self.hidden = 1
1218  SGMLParser.reset(self)
1219  self.currentData = []
1220  self.currentTag = None
1221  self.tagStack = []
1222  self.quoteStack = []
1223  self.pushTag(self)
1224 
1225  def popTag(self):
1226  tag = self.tagStack.pop()
1227 
1228  #print "Pop", tag.name
1229  if self.tagStack:
1230  self.currentTag = self.tagStack[-1]
1231  return self.currentTag
1232 
1233  def pushTag(self, tag):
1234  #print "Push", tag.name
1235  if self.currentTag:
1236  self.currentTag.contents.append(tag)
1237  self.tagStack.append(tag)
1238  self.currentTag = self.tagStack[-1]
1239 
1240  def endData(self, containerClass=NavigableString):
1241  if self.currentData:
1242  currentData = u''.join(self.currentData)
1243  if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
1244  not set([tag.name for tag in self.tagStack]).intersection(
1245  self.PRESERVE_WHITESPACE_TAGS)):
1246  if '\n' in currentData:
1247  currentData = '\n'
1248  else:
1249  currentData = ' '
1250  self.currentData = []
1251  if self.parseOnlyThese and len(self.tagStack) <= 1 and \
1252  (not self.parseOnlyThese.text or \
1253  not self.parseOnlyThese.search(currentData)):
1254  return
1255  o = containerClass(currentData)
1256  o.setup(self.currentTag, self.previous)
1257  if self.previous:
1258  self.previous.next = o
1259  self.previous = o
1260  self.currentTag.contents.append(o)
1261 
1262 
1263  def _popToTag(self, name, inclusivePop=True):
1264  """Pops the tag stack up to and including the most recent
1265  instance of the given tag. If inclusivePop is false, pops the tag
1266  stack up to but *not* including the most recent instance of
1267  the given tag."""
1268  #print "Popping to %s" % name
1269  if name == self.ROOT_TAG_NAME:
1270  return
1271 
1272  numPops = 0
1273  mostRecentTag = None
1274  for i in range(len(self.tagStack)-1, 0, -1):
1275  if name == self.tagStack[i].name:
1276  numPops = len(self.tagStack)-i
1277  break
1278  if not inclusivePop:
1279  numPops = numPops - 1
1280 
1281  for i in range(0, numPops):
1282  mostRecentTag = self.popTag()
1283  return mostRecentTag
1284 
1285  def _smartPop(self, name):
1286 
1287  """We need to pop up to the previous tag of this type, unless
1288  one of this tag's nesting reset triggers comes between this
1289  tag and the previous tag of this type, OR unless this tag is a
1290  generic nesting trigger and another generic nesting trigger
1291  comes between this tag and the previous tag of this type.
1292 
1293  Examples:
1294  <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
1295  <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
1296  <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
1297 
1298  <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
1299  <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
1300  <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
1301  """
1302 
1303  nestingResetTriggers = self.NESTABLE_TAGS.get(name)
1304  isNestable = nestingResetTriggers != None
1305  isResetNesting = name in self.RESET_NESTING_TAGS
1306  popTo = None
1307  inclusive = True
1308  for i in range(len(self.tagStack)-1, 0, -1):
1309  p = self.tagStack[i]
1310  if (not p or p.name == name) and not isNestable:
1311  #Non-nestable tags get popped to the top or to their
1312  #last occurrence.
1313  popTo = name
1314  break
1315  if (nestingResetTriggers is not None
1316  and p.name in nestingResetTriggers) \
1317  or (nestingResetTriggers is None and isResetNesting
1318  and p.name in self.RESET_NESTING_TAGS):
1319 
1320  #If we encounter one of the nesting reset triggers
1321  #peculiar to this tag, or we encounter another tag
1322  #that causes nesting to reset, pop up to but not
1323  #including that tag.
1324  popTo = p.name
1325  inclusive = False
1326  break
1327  p = p.parent
1328  if popTo:
1329  self._popToTag(popTo, inclusive)
1330 
1331  def unknown_starttag(self, name, attrs, selfClosing=0):
1332  #print "Start tag %s: %s" % (name, attrs)
1333  if self.quoteStack:
1334  #This is not a real tag.
1335  #print "<%s> is not real!" % name
1336  attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
1337  self.handle_data('<%s%s>' % (name, attrs))
1338  return
1339  self.endData()
1340 
1341  if not self.isSelfClosingTag(name) and not selfClosing:
1342  self._smartPop(name)
1343 
1344  if self.parseOnlyThese and len(self.tagStack) <= 1 \
1345  and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
1346  return
1347 
1348  tag = Tag(self, name, attrs, self.currentTag, self.previous)
1349  if self.previous:
1350  self.previous.next = tag
1351  self.previous = tag
1352  self.pushTag(tag)
1353  if selfClosing or self.isSelfClosingTag(name):
1354  self.popTag()
1355  if name in self.QUOTE_TAGS:
1356  #print "Beginning quote (%s)" % name
1357  self.quoteStack.append(name)
1358  self.literal = 1
1359  return tag
1360 
1361  def unknown_endtag(self, name):
1362  #print "End tag %s" % name
1363  if self.quoteStack and self.quoteStack[-1] != name:
1364  #This is not a real end tag.
1365  #print "</%s> is not real!" % name
1366  self.handle_data('</%s>' % name)
1367  return
1368  self.endData()
1369  self._popToTag(name)
1370  if self.quoteStack and self.quoteStack[-1] == name:
1371  self.quoteStack.pop()
1372  self.literal = (len(self.quoteStack) > 0)
1373 
1374  def handle_data(self, data):
1375  self.currentData.append(data)
1376 
1377  def _toStringSubclass(self, text, subclass):
1378  """Adds a certain piece of text to the tree as a NavigableString
1379  subclass."""
1380  self.endData()
1381  self.handle_data(text)
1382  self.endData(subclass)
1383 
1384  def handle_pi(self, text):
1385  """Handle a processing instruction as a ProcessingInstruction
1386  object, possibly one with a %SOUP-ENCODING% slot into which an
1387  encoding will be plugged later."""
1388  if text[:3] == "xml":
1389  text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
1390  self._toStringSubclass(text, ProcessingInstruction)
1391 
1392  def handle_comment(self, text):
1393  "Handle comments as Comment objects."
1394  self._toStringSubclass(text, Comment)
1395 
1396  def handle_charref(self, ref):
1397  "Handle character references as data."
1398  if self.convertEntities:
1399  data = unichr(int(ref))
1400  else:
1401  data = '&#%s;' % ref
1402  self.handle_data(data)
1403 
1404  def handle_entityref(self, ref):
1405  """Handle entity references as data, possibly converting known
1406  HTML and/or XML entity references to the corresponding Unicode
1407  characters."""
1408  data = None
1409  if self.convertHTMLEntities:
1410  try:
1411  data = unichr(name2codepoint[ref])
1412  except KeyError:
1413  pass
1414 
1415  if not data and self.convertXMLEntities:
1416  data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
1417 
1418  if not data and self.convertHTMLEntities and \
1419  not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
1420  # TODO: We've got a problem here. We're told this is
1421  # an entity reference, but it's not an XML entity
1422  # reference or an HTML entity reference. Nonetheless,
1423  # the logical thing to do is to pass it through as an
1424  # unrecognized entity reference.
1425  #
1426  # Except: when the input is "&carol;" this function
1427  # will be called with input "carol". When the input is
1428  # "AT&T", this function will be called with input
1429  # "T". We have no way of knowing whether a semicolon
1430  # was present originally, so we don't know whether
1431  # this is an unknown entity or just a misplaced
1432  # ampersand.
1433  #
1434  # The more common case is a misplaced ampersand, so I
1435  # escape the ampersand and omit the trailing semicolon.
1436  data = "&amp;%s" % ref
1437  if not data:
1438  # This case is different from the one above, because we
1439  # haven't already gone through a supposedly comprehensive
1440  # mapping of entities to Unicode characters. We might not
1441  # have gone through any mapping at all. So the chances are
1442  # very high that this is a real entity, and not a
1443  # misplaced ampersand.
1444  data = "&%s;" % ref
1445  self.handle_data(data)
1446 
1447  def handle_decl(self, data):
1448  "Handle DOCTYPEs and the like as Declaration objects."
1449  self._toStringSubclass(data, Declaration)
1450 
1451  def parse_declaration(self, i):
1452  """Treat a bogus SGML declaration as raw data. Treat a CDATA
1453  declaration as a CData object."""
1454  j = None
1455  if self.rawdata[i:i+9] == '<![CDATA[':
1456  k = self.rawdata.find(']]>', i)
1457  if k == -1:
1458  k = len(self.rawdata)
1459  data = self.rawdata[i+9:k]
1460  j = k+3
1461  self._toStringSubclass(data, CData)
1462  else:
1463  try:
1464  j = SGMLParser.parse_declaration(self, i)
1465  except SGMLParseError:
1466  toHandle = self.rawdata[i:]
1467  self.handle_data(toHandle)
1468  j = i + len(toHandle)
1469  return j
1470 
1472 
1473  """This parser knows the following facts about HTML:
1474 
1475  * Some tags have no closing tag and should be interpreted as being
1476  closed as soon as they are encountered.
1477 
1478  * The text inside some tags (ie. 'script') may contain tags which
1479  are not really part of the document and which should be parsed
1480  as text, not tags. If you want to parse the text as tags, you can
1481  always fetch it and parse it explicitly.
1482 
1483  * Tag nesting rules:
1484 
1485  Most tags can't be nested at all. For instance, the occurrence of
1486  a <p> tag should implicitly close the previous <p> tag.
1487 
1488  <p>Para1<p>Para2
1489  should be transformed into:
1490  <p>Para1</p><p>Para2
1491 
1492  Some tags can be nested arbitrarily. For instance, the occurrence
1493  of a <blockquote> tag should _not_ implicitly close the previous
1494  <blockquote> tag.
1495 
1496  Alice said: <blockquote>Bob said: <blockquote>Blah
1497  should NOT be transformed into:
1498  Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
1499 
1500  Some tags can be nested, but the nesting is reset by the
1501  interposition of other tags. For instance, a <tr> tag should
1502  implicitly close the previous <tr> tag within the same <table>,
1503  but not close a <tr> tag in another table.
1504 
1505  <table><tr>Blah<tr>Blah
1506  should be transformed into:
1507  <table><tr>Blah</tr><tr>Blah
1508  but,
1509  <tr>Blah<table><tr>Blah
1510  should NOT be transformed into
1511  <tr>Blah<table></tr><tr>Blah
1512 
1513  Differing assumptions about tag nesting rules are a major source
1514  of problems with the BeautifulSoup class. If BeautifulSoup is not
1515  treating as nestable a tag your page author treats as nestable,
1516  try ICantBelieveItsBeautifulSoup, MinimalSoup, or
1517  BeautifulStoneSoup before writing your own subclass."""
1518 
1519  def __init__(self, *args, **kwargs):
1520  if 'smartQuotesTo' not in kwargs:
1521  kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1522  kwargs['isHTML'] = True
1523  BeautifulStoneSoup.__init__(self, *args, **kwargs)
1524 
1525  SELF_CLOSING_TAGS = buildTagMap(None,
1526  ('br' , 'hr', 'input', 'img', 'meta',
1527  'spacer', 'link', 'frame', 'base', 'col'))
1528 
1529  PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
1530 
1531  QUOTE_TAGS = {'script' : None, 'textarea' : None}
1532 
1533  #According to the HTML standard, each of these inline tags can
1534  #contain another tag of the same type. Furthermore, it's common
1535  #to actually use these tags this way.
1536  NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
1537  'center')
1538 
1539  #According to the HTML standard, these block tags can contain
1540  #another tag of the same type. Furthermore, it's common
1541  #to actually use these tags this way.
1542  NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
1543 
1544  #Lists can contain other lists, but there are restrictions.
1545  NESTABLE_LIST_TAGS = { 'ol' : [],
1546  'ul' : [],
1547  'li' : ['ul', 'ol'],
1548  'dl' : [],
1549  'dd' : ['dl'],
1550  'dt' : ['dl'] }
1551 
1552  #Tables can contain other tables, but there are restrictions.
1553  NESTABLE_TABLE_TAGS = {'table' : [],
1554  'tr' : ['table', 'tbody', 'tfoot', 'thead'],
1555  'td' : ['tr'],
1556  'th' : ['tr'],
1557  'thead' : ['table'],
1558  'tbody' : ['table'],
1559  'tfoot' : ['table'],
1560  }
1561 
1562  NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
1563 
1564  #If one of these tags is encountered, all tags up to the next tag of
1565  #this type are popped.
1566  RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
1567  NON_NESTABLE_BLOCK_TAGS,
1568  NESTABLE_LIST_TAGS,
1569  NESTABLE_TABLE_TAGS)
1570 
1571  NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
1572  NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
1573 
1574  # Used to detect the charset in a META tag; see start_meta
1575  CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
1576 
1577  def start_meta(self, attrs):
1578  """Beautiful Soup can detect a charset included in a META tag,
1579  try to convert the document to that charset, and re-parse the
1580  document from the beginning."""
1581  httpEquiv = None
1582  contentType = None
1583  contentTypeIndex = None
1584  tagNeedsEncodingSubstitution = False
1585 
1586  for i in range(0, len(attrs)):
1587  key, value = attrs[i]
1588  key = key.lower()
1589  if key == 'http-equiv':
1590  httpEquiv = value
1591  elif key == 'content':
1592  contentType = value
1593  contentTypeIndex = i
1594 
1595  if httpEquiv and contentType: # It's an interesting meta tag.
1596  match = self.CHARSET_RE.search(contentType)
1597  if match:
1598  if (self.declaredHTMLEncoding is not None or
1599  self.originalEncoding == self.fromEncoding):
1600  # An HTML encoding was sniffed while converting
1601  # the document to Unicode, or an HTML encoding was
1602  # sniffed during a previous pass through the
1603  # document, or an encoding was specified
1604  # explicitly and it worked. Rewrite the meta tag.
1605  def rewrite(match):
1606  return match.group(1) + "%SOUP-ENCODING%"
1607  newAttr = self.CHARSET_RE.sub(rewrite, contentType)
1608  attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
1609  newAttr)
1610  tagNeedsEncodingSubstitution = True
1611  else:
1612  # This is our first pass through the document.
1613  # Go through it again with the encoding information.
1614  newCharset = match.group(3)
1615  if newCharset and newCharset != self.originalEncoding:
1616  self.declaredHTMLEncoding = newCharset
1617  self._feed(self.declaredHTMLEncoding)
1618  raise StopParsing
1619  pass
1620  tag = self.unknown_starttag("meta", attrs)
1621  if tag and tagNeedsEncodingSubstitution:
1622  tag.containsSubstitutions = True
1623 
1624 class StopParsing(Exception):
1625  pass
1626 
1627 class ICantBelieveItsBeautifulSoup(BeautifulSoup):
1628 
1629  """The BeautifulSoup class is oriented towards skipping over
1630  common HTML errors like unclosed tags. However, sometimes it makes
1631  errors of its own. For instance, consider this fragment:
1632 
1633  <b>Foo<b>Bar</b></b>
1634 
1635  This is perfectly valid (if bizarre) HTML. However, the
1636  BeautifulSoup class will implicitly close the first b tag when it
1637  encounters the second 'b'. It will think the author wrote
1638  "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
1639  there's no real-world reason to bold something that's already
1640  bold. When it encounters '</b></b>' it will close two more 'b'
1641  tags, for a grand total of three tags closed instead of two. This
1642  can throw off the rest of your document structure. The same is
1643  true of a number of other tags, listed below.
1644 
1645  It's much more common for someone to forget to close a 'b' tag
1646  than to actually use nested 'b' tags, and the BeautifulSoup class
1647  handles the common case. This class handles the not-so-common
1648  case: where you can't believe someone wrote what they did, but
1649  it's valid HTML and BeautifulSoup screwed up by assuming it
1650  wouldn't be."""
1651 
1652  I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
1653  ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
1654  'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
1655  'big')
1656 
1657  I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)
1658 
1659  NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
1660  I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
1661  I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)
1662 
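# A rough sketch of the difference described above (illustrative example):
#
#   >>> fragment = "<b>Foo<b>Bar</b></b>"
#   >>> BeautifulSoup(fragment)                 # second <b> implicitly closes the first
#   >>> ICantBelieveItsBeautifulSoup(fragment)  # second <b> stays nested inside the first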
1663 class MinimalSoup(BeautifulSoup):
1664  """The MinimalSoup class is for parsing HTML that contains
1665  pathologically bad markup. It makes no assumptions about tag
1666  nesting, but it does know which tags are self-closing, that
1667  <script> tags contain Javascript and should not be parsed, that
1668  META tags may contain encoding information, and so on.
1669 
1670  This also makes it better for subclassing than BeautifulStoneSoup
1671  or BeautifulSoup."""
1672 
1673  RESET_NESTING_TAGS = buildTagMap('noscript')
1674  NESTABLE_TAGS = {}
1675 
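# A rough sketch (illustrative example): even with the nesting heuristics
# removed, MinimalSoup still treats <script> contents as literal text, so
# markup-looking strings inside a script are not turned into tags:
#
#   >>> MinimalSoup("<script>document.write('<p>hi</p>');</script>")
#   # the script tag contains a single text node, not a nested <p> tag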
1676 class BeautifulSOAP(BeautifulStoneSoup):
1677  """This class will push a tag with only a single string child into
1678  the tag's parent as an attribute. The attribute's name is the tag
1679  name, and the value is the string child. An example should give
1680  the flavor of the change:
1681 
1682  <foo><bar>baz</bar></foo>
1683  =>
1684  <foo bar="baz"><bar>baz</bar></foo>
1685 
1686  You can then access fooTag['bar'] instead of fooTag.barTag.string.
1687 
1688  This is, of course, useful for scraping structures that tend to
1689  use subelements instead of attributes, such as SOAP messages. Note
1690  that it modifies its input, so don't print the modified version
1691  out.
1692 
1693  I'm not sure how many people really want to use this class; let me
1694  know if you do. Mainly I like the name."""
1695 
1696  def popTag(self):
1697  if len(self.tagStack) > 1:
1698  tag = self.tagStack[-1]
1699  parent = self.tagStack[-2]
1700  parent._getAttrMap()
1701  if (isinstance(tag, Tag) and len(tag.contents) == 1 and
1702  isinstance(tag.contents[0], NavigableString) and
1703  tag.name not in parent.attrMap):
1704  parent[tag.name] = tag.contents[0]
1705  BeautifulStoneSoup.popTag(self)
1706 
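# A rough usage sketch of the transformation described above (illustrative
# example):
#
#   >>> soup = BeautifulSOAP("<foo><bar>baz</bar></foo>")
#   >>> soup.foo['bar']   # u'baz', lifted from the single string child of <bar>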
1707 #Enterprise class names! It has come to our attention that some people
1708 #think the names of the Beautiful Soup parser classes are too silly
1709 #and "unprofessional" for use in enterprise screen-scraping. We feel
1710 #your pain! For such-minded folk, the Beautiful Soup Consortium And
1711 #All-Night Kosher Bakery recommends renaming this file to
1712 #"RobustParser.py" (or, in cases of extreme enterprisiness,
1713 #"RobustParserBeanInterface.class") and using the following
1714 #enterprise-friendly class aliases:
1715 class RobustXMLParser(BeautifulStoneSoup):
1716  pass
1717 class RobustHTMLParser(BeautifulSoup):
1718  pass
1719 class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
1720  pass
1721 class RobustInsanelyWackAssHTMLParser(MinimalSoup):
1722  pass
1723 class SimplifyingSOAPParser(BeautifulSOAP):
1724  pass
1725 
1726 ######################################################
1727 #
1728 # Bonus library: Unicode, Dammit
1729 #
1730 # This class forces XML data into a standard format (usually to UTF-8
1731 # or Unicode). It is heavily based on code from Mark Pilgrim's
1732 # Universal Feed Parser. It does not rewrite the XML or HTML to
1733 # reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
1734 # (XML) and BeautifulSoup.start_meta (HTML).
1735 
1736 # Autodetects character encodings.
1737 # Download from http://chardet.feedparser.org/
1738 try:
1739  import chardet
1740 # import chardet.constants
1741 # chardet.constants._debug = 1
1742 except ImportError:
1743  chardet = None
1744 
1745 # cjkcodecs and iconv_codec make Python know about more character encodings.
1746 # Both are available from http://cjkpython.i18n.org/
1747 # They're built in if you use Python 2.4.
1748 try:
1749  import cjkcodecs.aliases
1750 except ImportError:
1751  pass
1752 try:
1753  import iconv_codec
1754 except ImportError:
1755  pass
1756 
1757 class UnicodeDammit:
1758  """A class for detecting the encoding of a *ML document and
1759  converting it to a Unicode string. If the source encoding is
1760  windows-1252, can replace MS smart quotes with their HTML or XML
1761  equivalents."""
1762 
1763  # This dictionary maps commonly seen values for "charset" in HTML
1764  # meta tags to the corresponding Python codec names. It only covers
1765  # values that aren't in Python's aliases and can't be determined
1766  # by the heuristics in find_codec.
1767  CHARSET_ALIASES = { "macintosh" : "mac-roman",
1768  "x-sjis" : "shift-jis" }
1769 
1770  def __init__(self, markup, overrideEncodings=[],
1771  smartQuotesTo='xml', isHTML=False):
1772  self.declaredHTMLEncoding = None
1773  self.markup, documentEncoding, sniffedEncoding = \
1774  self._detectEncoding(markup, isHTML)
1775  self.smartQuotesTo = smartQuotesTo
1776  self.triedEncodings = []
1777  if markup == '' or isinstance(markup, unicode):
1778  self.originalEncoding = None
1779  self.unicode = unicode(markup)
1780  return
1781 
1782  u = None
1783  for proposedEncoding in overrideEncodings:
1784  u = self._convertFrom(proposedEncoding)
1785  if u: break
1786  if not u:
1787  for proposedEncoding in (documentEncoding, sniffedEncoding):
1788  u = self._convertFrom(proposedEncoding)
1789  if u: break
1790 
1791  # If no luck and we have auto-detection library, try that:
1792  if not u and chardet and not isinstance(self.markup, unicode):
1793  u = self._convertFrom(chardet.detect(self.markup)['encoding'])
1794 
1795  # As a last resort, try utf-8 and windows-1252:
1796  if not u:
1797  for proposed_encoding in ("utf-8", "windows-1252"):
1798  u = self._convertFrom(proposed_encoding)
1799  if u: break
1800 
1801  self.unicode = u
1802  if not u: self.originalEncoding = None
1803 
1804  def _subMSChar(self, orig):
1805  """Changes a MS smart quote character to an XML or HTML
1806  entity."""
1807  sub = self.MS_CHARS.get(orig)
1808  if isinstance(sub, tuple):
1809  if self.smartQuotesTo == 'xml':
1810  sub = '&#x%s;' % sub[1]
1811  else:
1812  sub = '&%s;' % sub[0]
1813  return sub
1814 
1815  def _convertFrom(self, proposed):
1816  proposed = self.find_codec(proposed)
1817  if not proposed or proposed in self.triedEncodings:
1818  return None
1819  self.triedEncodings.append(proposed)
1820  markup = self.markup
1821 
1822  # Convert smart quotes to HTML if coming from an encoding
1823  # that might have them.
1824  if self.smartQuotesTo and proposed.lower() in("windows-1252",
1825  "iso-8859-1",
1826  "iso-8859-2"):
1827  markup = re.compile("([\x80-\x9f])").sub \
1828  (lambda x: self._subMSChar(x.group(1)),
1829  markup)
1830 
1831  try:
1832  # print "Trying to convert document to %s" % proposed
1833  u = self._toUnicode(markup, proposed)
1834  self.markup = u
1835  self.originalEncoding = proposed
1836  except Exception as e:
1837  # print "That didn't work!"
1838  # print e
1839  return None
1840  #print "Correct encoding: %s" % proposed
1841  return self.markup
1842 
1843  def _toUnicode(self, data, encoding):
1844  '''Given a string and its encoding, decodes the string into Unicode.
1845  %encoding is a string recognized by encodings.aliases'''
1846 
1847  # strip Byte Order Mark (if present)
1848  if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
1849  and (data[2:4] != '\x00\x00'):
1850  encoding = 'utf-16be'
1851  data = data[2:]
1852  elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
1853  and (data[2:4] != '\x00\x00'):
1854  encoding = 'utf-16le'
1855  data = data[2:]
1856  elif data[:3] == '\xef\xbb\xbf':
1857  encoding = 'utf-8'
1858  data = data[3:]
1859  elif data[:4] == '\x00\x00\xfe\xff':
1860  encoding = 'utf-32be'
1861  data = data[4:]
1862  elif data[:4] == '\xff\xfe\x00\x00':
1863  encoding = 'utf-32le'
1864  data = data[4:]
1865  newdata = unicode(data, encoding)
1866  return newdata
1867 
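 # For illustration (example only): a leading byte-order mark both selects
 # the codec and is stripped before decoding, so the BOM never appears in
 # the resulting Unicode string.
 #
 #   >>> UnicodeDammit('\xef\xbb\xbfhello').originalEncoding
 #   'utf-8'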
1868  def _detectEncoding(self, xml_data, isHTML=False):
1869  """Given a document, tries to detect its XML encoding."""
1870  xml_encoding = sniffed_xml_encoding = None
1871  try:
1872  if xml_data[:4] == '\x4c\x6f\xa7\x94':
1873  # EBCDIC
1874  xml_data = self._ebcdic_to_ascii(xml_data)
1875  elif xml_data[:4] == '\x00\x3c\x00\x3f':
1876  # UTF-16BE
1877  sniffed_xml_encoding = 'utf-16be'
1878  xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
1879  elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
1880  and (xml_data[2:4] != '\x00\x00'):
1881  # UTF-16BE with BOM
1882  sniffed_xml_encoding = 'utf-16be'
1883  xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
1884  elif xml_data[:4] == '\x3c\x00\x3f\x00':
1885  # UTF-16LE
1886  sniffed_xml_encoding = 'utf-16le'
1887  xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
1888  elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
1889  (xml_data[2:4] != '\x00\x00'):
1890  # UTF-16LE with BOM
1891  sniffed_xml_encoding = 'utf-16le'
1892  xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
1893  elif xml_data[:4] == '\x00\x00\x00\x3c':
1894  # UTF-32BE
1895  sniffed_xml_encoding = 'utf-32be'
1896  xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
1897  elif xml_data[:4] == '\x3c\x00\x00\x00':
1898  # UTF-32LE
1899  sniffed_xml_encoding = 'utf-32le'
1900  xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1901  elif xml_data[:4] == '\x00\x00\xfe\xff':
1902  # UTF-32BE with BOM
1903  sniffed_xml_encoding = 'utf-32be'
1904  xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1905  elif xml_data[:4] == '\xff\xfe\x00\x00':
1906  # UTF-32LE with BOM
1907  sniffed_xml_encoding = 'utf-32le'
1908  xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1909  elif xml_data[:3] == '\xef\xbb\xbf':
1910  # UTF-8 with BOM
1911  sniffed_xml_encoding = 'utf-8'
1912  xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1913  else:
1914  sniffed_xml_encoding = 'ascii'
1915  pass
1916  except:
1917  xml_encoding_match = None
1918  xml_encoding_match = re.compile(
1919  '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
1920  if not xml_encoding_match and isHTML:
1921  regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
1922  xml_encoding_match = regexp.search(xml_data)
1923  if xml_encoding_match is not None:
1924  xml_encoding = xml_encoding_match.groups()[0].lower()
1925  if isHTML:
1926  self.declaredHTMLEncoding = xml_encoding
1927  if sniffed_xml_encoding and \
1928  (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1929  'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1930  'utf-16', 'utf-32', 'utf_16', 'utf_32',
1931  'utf16', 'u16')):
1932  xml_encoding = sniffed_xml_encoding
1933  return xml_data, xml_encoding, sniffed_xml_encoding
1934 
1935 
1936  def find_codec(self, charset):
1937  return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1938  or (charset and self._codec(charset.replace("-", ""))) \
1939  or (charset and self._codec(charset.replace("-", "_"))) \
1940  or charset
1941 
1942  def _codec(self, charset):
1943  if not charset: return charset
1944  codec = None
1945  try:
1946  codecs.lookup(charset)
1947  codec = charset
1948  except (LookupError, ValueError):
1949  pass
1950  return codec
1951 
1952  EBCDIC_TO_ASCII_MAP = None
1953  def _ebcdic_to_ascii(self, s):
1954  c = self.__class__
1955  if not c.EBCDIC_TO_ASCII_MAP:
1956  emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1957  16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1958  128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1959  144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1960  32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
1961  38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
1962  45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
1963  186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
1964  195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
1965  201,202,106,107,108,109,110,111,112,113,114,203,204,205,
1966  206,207,208,209,126,115,116,117,118,119,120,121,122,210,
1967  211,212,213,214,215,216,217,218,219,220,221,222,223,224,
1968  225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
1969  73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
1970  82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
1971  90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
1972  250,251,252,253,254,255)
1973  import string
1974  c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
1975  ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
1976  return s.translate(c.EBCDIC_TO_ASCII_MAP)
1977 
1978  MS_CHARS = { '\x80' : ('euro', '20AC'),
1979  '\x81' : ' ',
1980  '\x82' : ('sbquo', '201A'),
1981  '\x83' : ('fnof', '192'),
1982  '\x84' : ('bdquo', '201E'),
1983  '\x85' : ('hellip', '2026'),
1984  '\x86' : ('dagger', '2020'),
1985  '\x87' : ('Dagger', '2021'),
1986  '\x88' : ('circ', '2C6'),
1987  '\x89' : ('permil', '2030'),
1988  '\x8A' : ('Scaron', '160'),
1989  '\x8B' : ('lsaquo', '2039'),
1990  '\x8C' : ('OElig', '152'),
1991  '\x8D' : '?',
1992  '\x8E' : ('#x17D', '17D'),
1993  '\x8F' : '?',
1994  '\x90' : '?',
1995  '\x91' : ('lsquo', '2018'),
1996  '\x92' : ('rsquo', '2019'),
1997  '\x93' : ('ldquo', '201C'),
1998  '\x94' : ('rdquo', '201D'),
1999  '\x95' : ('bull', '2022'),
2000  '\x96' : ('ndash', '2013'),
2001  '\x97' : ('mdash', '2014'),
2002  '\x98' : ('tilde', '2DC'),
2003  '\x99' : ('trade', '2122'),
2004  '\x9a' : ('scaron', '161'),
2005  '\x9b' : ('rsaquo', '203A'),
2006  '\x9c' : ('oelig', '153'),
2007  '\x9d' : '?',
2008  '\x9e' : ('#x17E', '17E'),
2009  '\x9f' : ('Yuml', ''),}
2010 
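# A rough usage sketch (illustrative example): UnicodeDammit can be used on
# its own, without the parser classes.
#
#   >>> converted = UnicodeDammit("Sacr\xe9 bleu!", ["windows-1252"])
#   >>> converted.unicode
#   u'Sacr\xe9 bleu!'
#   >>> converted.originalEncoding
#   'windows-1252'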
2011 #######################################################################
2012 
2013 
2014 #By default, act as an HTML pretty-printer.
2015 if __name__ == '__main__':
2016  import sys
2017  soup = BeautifulSoup(sys.stdin)
2018  print(soup.prettify())