BeautifulSoup.UnicodeDammit Class Reference

Public Member Functions

def __init__
def find_codec

Public Attributes

declaredHTMLEncoding
markup
originalEncoding
smartQuotesTo
triedEncodings
unicode

Static Public Attributes

dictionary CHARSET_ALIASES
EBCDIC_TO_ASCII_MAP = None
dictionary MS_CHARS

Private Member Functions

def _codec
def _convertFrom
def _detectEncoding
def _ebcdic_to_ascii
def _subMSChar
def _toUnicode

Detailed Description

A class for detecting the encoding of a *ML document (HTML, XML, or
similar markup) and converting it to a Unicode string. If the source
encoding is windows-1252, it can also replace MS smart quotes with
their HTML or XML entity equivalents.

Definition at line 1756 of file BeautifulSoup.py.

Constructor & Destructor Documentation

def BeautifulSoup.UnicodeDammit.__init__(self, markup, overrideEncodings = [], smartQuotesTo = 'xml', isHTML = False)

Definition at line 1770 of file BeautifulSoup.py.

1770      def __init__(self, markup, overrideEncodings=[],
1771                   smartQuotesTo='xml', isHTML=False):
1772          self.declaredHTMLEncoding = None
1773          self.markup, documentEncoding, sniffedEncoding = \
1774                       self._detectEncoding(markup, isHTML)
1775          self.smartQuotesTo = smartQuotesTo
1776          self.triedEncodings = []
1777          if markup == '' or isinstance(markup, unicode):
1778              self.originalEncoding = None
1779              self.unicode = unicode(markup)
1780              return
1781
1782          u = None
1783          for proposedEncoding in overrideEncodings:
1784              u = self._convertFrom(proposedEncoding)
1785              if u: break
1786          if not u:
1787              for proposedEncoding in (documentEncoding, sniffedEncoding):
1788                  u = self._convertFrom(proposedEncoding)
1789                  if u: break
1790
1791          # If no luck and we have auto-detection library, try that:
1792          if not u and chardet and not isinstance(self.markup, unicode):
1793              u = self._convertFrom(chardet.detect(self.markup)['encoding'])
1794
1795          # As a last resort, try utf-8 and windows-1252:
1796          if not u:
1797              for proposed_encoding in ("utf-8", "windows-1252"):
1798                  u = self._convertFrom(proposed_encoding)
1799                  if u: break
1800
1801          self.unicode = u
1802          if not u: self.originalEncoding = None
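
The constructor tries candidate encodings in a fixed priority order: caller overrides, then the declared and sniffed encodings, then (if available) chardet, and finally utf-8 and windows-1252. The following is a minimal Python 3 sketch of that cascade, not the class itself. The function name `decode_with_fallbacks` and its parameters are hypothetical, the chardet step is left out, and smart-quote handling is omitted.

```python
def decode_with_fallbacks(data, override_encodings=(), declared=None, sniffed=None):
    """Try candidate encodings in priority order; return (text, encoding_used).

    Hypothetical helper mirroring the order used by the constructor above,
    minus the optional chardet step.
    """
    if isinstance(data, str):
        # Already text: nothing to detect.
        return data, None
    if data == b"":
        return "", None

    candidates = [*override_encodings, declared, sniffed, "utf-8", "windows-1252"]
    tried = set()
    for name in candidates:
        # Skip missing names and names that were already attempted.
        if not name or name.lower() in tried:
            continue
        tried.add(name.lower())
        try:
            return data.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec or undecodable bytes: move on to the next candidate.
            continue
    return None, None
```

For example, `decode_with_fallbacks(b"caf\xe9")` fails on utf-8 and falls through to windows-1252, while an unknown override name is simply skipped.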

Member Function Documentation

def BeautifulSoup.UnicodeDammit._codec(self, charset)
private

Definition at line 1941 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.find_codec().

1942      def _codec(self, charset):
1943          if not charset: return charset
1944          codec = None
1945          try:
1946              codecs.lookup(charset)
1947              codec = charset
1948          except (LookupError, ValueError):
1949              pass
1950          return codec

def BeautifulSoup.UnicodeDammit._convertFrom(self, proposed)
private

Definition at line 1814 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._subMSChar(), BeautifulSoup.UnicodeDammit._toUnicode(), BeautifulSoup.UnicodeDammit.find_codec(), BeautifulSoup.BeautifulStoneSoup.markup, BeautifulSoup.UnicodeDammit.markup, BeautifulSoup.BeautifulStoneSoup.smartQuotesTo, BeautifulSoup.UnicodeDammit.smartQuotesTo, and BeautifulSoup.UnicodeDammit.triedEncodings.

1815      def _convertFrom(self, proposed):
1816          proposed = self.find_codec(proposed)
1817          if not proposed or proposed in self.triedEncodings:
1818              return None
1819          self.triedEncodings.append(proposed)
1820          markup = self.markup
1821
1822          # Convert smart quotes to HTML if coming from an encoding
1823          # that might have them.
1824          if self.smartQuotesTo and proposed.lower() in("windows-1252",
1825                                                        "iso-8859-1",
1826                                                        "iso-8859-2"):
1827              markup = re.compile("([\x80-\x9f])").sub \
1828                       (lambda(x): self._subMSChar(x.group(1)),
1829                        markup)
1830
1831          try:
1832              # print "Trying to convert document to %s" % proposed
1833              u = self._toUnicode(markup, proposed)
1834              self.markup = u
1835              self.originalEncoding = proposed
1836          except Exception, e:
1837              # print "That didn't work!"
1838              # print e
1839              return None
1840          #print "Correct encoding: %s" % proposed
1841          return self.markup

def BeautifulSoup.UnicodeDammit._detectEncoding(self, xml_data, isHTML = False)
private
Given a document, tries to detect its XML encoding.

Definition at line 1867 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._ebcdic_to_ascii(), BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding, BeautifulSoup.BeautifulSoup.declaredHTMLEncoding, BeautifulSoup.UnicodeDammit.declaredHTMLEncoding, and BeautifulSoup.UnicodeDammit.unicode.

1868      def _detectEncoding(self, xml_data, isHTML=False):
1869          """Given a document, tries to detect its XML encoding."""
1870          xml_encoding = sniffed_xml_encoding = None
1871          try:
1872              if xml_data[:4] == '\x4c\x6f\xa7\x94':
1873                  # EBCDIC
1874                  xml_data = self._ebcdic_to_ascii(xml_data)
1875              elif xml_data[:4] == '\x00\x3c\x00\x3f':
1876                  # UTF-16BE
1877                  sniffed_xml_encoding = 'utf-16be'
1878                  xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
1879              elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
1880                       and (xml_data[2:4] != '\x00\x00'):
1881                  # UTF-16BE with BOM
1882                  sniffed_xml_encoding = 'utf-16be'
1883                  xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
1884              elif xml_data[:4] == '\x3c\x00\x3f\x00':
1885                  # UTF-16LE
1886                  sniffed_xml_encoding = 'utf-16le'
1887                  xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
1888              elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
1889                       (xml_data[2:4] != '\x00\x00'):
1890                  # UTF-16LE with BOM
1891                  sniffed_xml_encoding = 'utf-16le'
1892                  xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
1893              elif xml_data[:4] == '\x00\x00\x00\x3c':
1894                  # UTF-32BE
1895                  sniffed_xml_encoding = 'utf-32be'
1896                  xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
1897              elif xml_data[:4] == '\x3c\x00\x00\x00':
1898                  # UTF-32LE
1899                  sniffed_xml_encoding = 'utf-32le'
1900                  xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1901              elif xml_data[:4] == '\x00\x00\xfe\xff':
1902                  # UTF-32BE with BOM
1903                  sniffed_xml_encoding = 'utf-32be'
1904                  xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1905              elif xml_data[:4] == '\xff\xfe\x00\x00':
1906                  # UTF-32LE with BOM
1907                  sniffed_xml_encoding = 'utf-32le'
1908                  xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1909              elif xml_data[:3] == '\xef\xbb\xbf':
1910                  # UTF-8 with BOM
1911                  sniffed_xml_encoding = 'utf-8'
1912                  xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1913              else:
1914                  sniffed_xml_encoding = 'ascii'
1915                  pass
1916          except:
1917              xml_encoding_match = None
1918          xml_encoding_match = re.compile(
1919              '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
1920          if not xml_encoding_match and isHTML:
1921              regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
1922              xml_encoding_match = regexp.search(xml_data)
1923          if xml_encoding_match is not None:
1924              xml_encoding = xml_encoding_match.groups()[0].lower()
1925              if isHTML:
1926                  self.declaredHTMLEncoding = xml_encoding
1927              if sniffed_xml_encoding and \
1928                 (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1929                                   'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1930                                   'utf-16', 'utf-32', 'utf_16', 'utf_32',
1931                                   'utf16', 'u16')):
1932                  xml_encoding = sniffed_xml_encoding
1933          return xml_data, xml_encoding, sniffed_xml_encoding
1934
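
After byte-pattern sniffing, the method looks for an encoding the document declares about itself: first an XML declaration at the very start, then (for HTML only) a `<meta ... charset=...>` tag. The sketch below isolates that second stage in Python 3, reusing the two regular expressions from the listing as bytes patterns. The function name `declared_encoding` is hypothetical, and the sketch does not reproduce the final step that swaps a generic declaration such as `utf-16` for the sniffed variant.

```python
import re

# Same patterns as in the listing above, written as bytes patterns for Python 3.
XML_DECL = re.compile(rb'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')
META_CHARSET = re.compile(rb'<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)


def declared_encoding(data, is_html=False):
    """Return the lower-cased encoding name a document declares, or None."""
    # The XML declaration must sit at the very start of the document.
    match = XML_DECL.match(data)
    if not match and is_html:
        # HTML may declare its charset anywhere, inside a <meta> tag.
        match = META_CHARSET.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()
```

Note that the `<meta>` fallback is only consulted when the caller flags the input as HTML, which matches the `isHTML` guard in the original method.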

def BeautifulSoup.UnicodeDammit._ebcdic_to_ascii(self, s)
private

Definition at line 1952 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

1953      def _ebcdic_to_ascii(self, s):
1954          c = self.__class__
1955          if not c.EBCDIC_TO_ASCII_MAP:
1956              emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1957                      16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1958                      128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1959                      144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1960                      32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
1961                      38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
1962                      45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
1963                      186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
1964                      195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
1965                      201,202,106,107,108,109,110,111,112,113,114,203,204,205,
1966                      206,207,208,209,126,115,116,117,118,119,120,121,122,210,
1967                      211,212,213,214,215,216,217,218,219,220,221,222,223,224,
1968                      225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
1969                      73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
1970                      82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
1971                      90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
1972                      250,251,252,253,254,255)
1973              import string
1974              c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
1975              ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
1976          return s.translate(c.EBCDIC_TO_ASCII_MAP)
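
The method builds a 256-entry translation table once, caches it on the class, and applies it with `translate`. A minimal Python 3 sketch of the same technique follows, using `bytes.maketrans` in place of Python 2's `string.maketrans`. To stay short it uses only four entries copied from the `emap` tuple above (the bytes of the EBCDIC signature `'\x4c\x6f\xa7\x94'` that `_detectEncoding` looks for) and leaves every other byte unchanged, so it is an illustration rather than a usable converter. The names `EXCERPT`, `build_table`, and `ebcdic_to_ascii` are hypothetical.

```python
# Four entries taken from the emap tuple in the listing; all other bytes
# are mapped to themselves in this sketch (the real table remaps all 256).
EXCERPT = {0x4C: ord("<"), 0x6F: ord("?"), 0xA7: ord("x"), 0x94: ord("m")}

_TABLE = None  # cached translation table, built on first use


def build_table(mapping):
    """Build a 256-byte translation table from a {source: target} mapping."""
    target = bytearray(range(256))  # start from the identity mapping
    for src, dst in mapping.items():
        target[src] = dst
    return bytes.maketrans(bytes(range(256)), bytes(target))


def ebcdic_to_ascii(data):
    """Translate bytes through the lazily built table."""
    global _TABLE
    if _TABLE is None:
        _TABLE = build_table(EXCERPT)
    return data.translate(_TABLE)
```

With these entries, the four signature bytes translate to `<?xm`, the start of an XML declaration, which is why that byte pattern is treated as an EBCDIC marker.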

def BeautifulSoup.UnicodeDammit._subMSChar(self, orig)
private
Changes a MS smart quote character to an XML or HTML
entity.

Definition at line 1803 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.smartQuotesTo, and BeautifulSoup.UnicodeDammit.smartQuotesTo.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1804      def _subMSChar(self, orig):
1805          """Changes a MS smart quote character to an XML or HTML
1806          entity."""
1807          sub = self.MS_CHARS.get(orig)
1808          if isinstance(sub, tuple):
1809              if self.smartQuotesTo == 'xml':
1810                  sub = '&#x%s;' % sub[1]
1811              else:
1812                  sub = '&%s;' % sub[0]
1813          return sub
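
Each tuple in MS_CHARS is read as (entity name, hexadecimal code point): index 1 feeds the numeric XML form and index 0 the named HTML form. The Python 3 sketch below shows the substitution end to end, combined with the `[\x80-\x9f]` regular expression that `_convertFrom` uses to find candidate bytes. The four table entries are an illustrative subset chosen for this example, since the page does not list the contents of MS_CHARS, and unlike the original, bytes missing from the subset are left untouched. The function names are hypothetical.

```python
import re

# Illustrative subset only (assumed values); the real MS_CHARS table is
# defined in BeautifulSoup.py and is not reproduced on this page.
MS_CHARS = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def sub_ms_char(orig, smart_quotes_to="xml"):
    """Map one windows-1252 byte to an XML or HTML entity (as bytes)."""
    sub = MS_CHARS.get(orig, orig)  # sketch: unknown bytes pass through
    if isinstance(sub, tuple):
        name, codepoint = sub
        if smart_quotes_to == "xml":
            sub = ("&#x%s;" % codepoint).encode("ascii")
        else:
            sub = ("&%s;" % name).encode("ascii")
    return sub


def replace_smart_quotes(markup, smart_quotes_to="xml"):
    """Replace every byte in the 0x80-0x9f range found in markup."""
    return re.sub(rb"[\x80-\x9f]",
                  lambda m: sub_ms_char(m.group(0), smart_quotes_to),
                  markup)
```

The numeric form is the safer default because entities such as `&ldquo;` are not predefined in plain XML.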

def BeautifulSoup.UnicodeDammit._toUnicode(self, data, encoding)
private
Given a string and its encoding, decodes the string into Unicode.
%encoding is a string recognized by encodings.aliases

Definition at line 1842 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit.unicode.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1843      def _toUnicode(self, data, encoding):
1844          '''Given a string and its encoding, decodes the string into Unicode.
1845          %encoding is a string recognized by encodings.aliases'''
1846
1847          # strip Byte Order Mark (if present)
1848          if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
1849                 and (data[2:4] != '\x00\x00'):
1850              encoding = 'utf-16be'
1851              data = data[2:]
1852          elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
1853                   and (data[2:4] != '\x00\x00'):
1854              encoding = 'utf-16le'
1855              data = data[2:]
1856          elif data[:3] == '\xef\xbb\xbf':
1857              encoding = 'utf-8'
1858              data = data[3:]
1859          elif data[:4] == '\x00\x00\xfe\xff':
1860              encoding = 'utf-32be'
1861              data = data[4:]
1862          elif data[:4] == '\xff\xfe\x00\x00':
1863              encoding = 'utf-32le'
1864              data = data[4:]
1865          newdata = unicode(data, encoding)
1866          return newdata
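
A byte order mark, when present, overrides the encoding the caller proposed and is stripped before decoding. The Python 3 sketch below expresses the same idea with the BOM constants from the standard `codecs` module. It is a simplification under one stated assumption: instead of the `data[2:4] != '\x00\x00'` guard used in the listing, it checks the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, since the UTF-32LE mark begins with the UTF-16LE mark. The function name `to_unicode` is hypothetical.

```python
import codecs

# Longer marks first: BOM_UTF32_LE (ff fe 00 00) starts with BOM_UTF16_LE (ff fe).
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def to_unicode(data, encoding):
    """Decode bytes, letting a leading BOM override the proposed encoding."""
    for bom, bom_encoding in _BOMS:
        if data.startswith(bom):
            # Strip the mark and trust it over the caller's proposal.
            data, encoding = data[len(bom):], bom_encoding
            break
    return data.decode(encoding)
```

For instance, UTF-8 bytes with a BOM decode correctly even when the caller proposes `ascii`.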

def BeautifulSoup.UnicodeDammit.find_codec(self, charset)

Definition at line 1935 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._codec().

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1936      def find_codec(self, charset):
1937          return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1938                 or (charset and self._codec(charset.replace("-", ""))) \
1939                 or (charset and self._codec(charset.replace("-", "_"))) \
1940                 or charset

Member Data Documentation

dictionary BeautifulSoup.UnicodeDammit.CHARSET_ALIASES
static
Initial value:
{ "macintosh" : "mac-roman",
  "x-sjis" : "shift-jis" }

Definition at line 1766 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.declaredHTMLEncoding

Definition at line 1771 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

BeautifulSoup.UnicodeDammit.EBCDIC_TO_ASCII_MAP = None
static

Definition at line 1951 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.markup

Definition at line 1833 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

dictionary BeautifulSoup.UnicodeDammit.MS_CHARS
static

Definition at line 1977 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.originalEncoding

Definition at line 1777 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.smartQuotesTo

Definition at line 1774 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom(), and BeautifulSoup.UnicodeDammit._subMSChar().

BeautifulSoup.UnicodeDammit.triedEncodings

Definition at line 1775 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

BeautifulSoup.UnicodeDammit.unicode

Definition at line 1778 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding(), and BeautifulSoup.UnicodeDammit._toUnicode().