
BeautifulSoup.UnicodeDammit Class Reference

Public Member Functions

def __init__ (self, markup, overrideEncodings=[], smartQuotesTo='xml', isHTML=False)
 
def find_codec (self, charset)
 

Public Attributes

 declaredHTMLEncoding
 
 markup
 
 originalEncoding
 
 smartQuotesTo
 
 triedEncodings
 
 unicode
 

Static Public Attributes

 CHARSET_ALIASES
 
 EBCDIC_TO_ASCII_MAP
 
 MS_CHARS
 

Private Member Functions

def _codec (self, charset)
 
def _convertFrom (self, proposed)
 
def _detectEncoding (self, xml_data, isHTML=False)
 
def _ebcdic_to_ascii (self, s)
 
def _subMSChar (self, orig)
 
def _toUnicode (self, data, encoding)
 

Detailed Description

A class for detecting the encoding of an *ML (HTML or XML) document and
converting it to a Unicode string. If the source encoding is
windows-1252, it can also replace MS smart quotes with their HTML or XML
equivalents.

Definition at line 1756 of file BeautifulSoup.py.

Constructor & Destructor Documentation

◆ __init__()

def BeautifulSoup.UnicodeDammit.__init__ (   self,
  markup,
  overrideEncodings = [],
  smartQuotesTo = 'xml',
  isHTML = False 
)

Definition at line 1770 of file BeautifulSoup.py.

    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If no luck and we have auto-detection library, try that:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None
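The constructor tries candidate encodings in a fixed order: caller overrides, then the declared and sniffed encodings, then (optionally) chardet, and finally utf-8 and windows-1252. The following is a minimal Python 3 sketch of that ordering, not the library's code. The function name `decode_with_fallbacks` and its parameters are hypothetical, and the chardet step is left out to keep the example dependency-free.

```python
def decode_with_fallbacks(data, override_encodings=(), declared=None, sniffed=None):
    """Return (text, encoding) for the first candidate that decodes `data`.

    Hypothetical helper: mirrors only the ordering of the attempts.
    """
    candidates = list(override_encodings) + [declared, sniffed, "utf-8", "windows-1252"]
    tried = []
    for encoding in candidates:
        if not encoding or encoding in tried:
            continue  # skip missing or already-attempted candidates
        tried.append(encoding)
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # this candidate failed; move on to the next one
    return None, None


text, used = decode_with_fallbacks(b"caf\xe9", override_encodings=["ascii"])
print(text, used)
```

Here the override (`ascii`) and `utf-8` both fail on the byte `0xE9`, so the last-resort `windows-1252` is the encoding that succeeds.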

Member Function Documentation

◆ _codec()

def BeautifulSoup.UnicodeDammit._codec (   self,
  charset 
)
private

Definition at line 1941 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.find_codec().

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

◆ _convertFrom()

def BeautifulSoup.UnicodeDammit._convertFrom (   self,
  proposed 
)
private

Definition at line 1814 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._subMSChar(), BeautifulSoup.UnicodeDammit._toUnicode(), BeautifulSoup.UnicodeDammit.find_codec(), BeautifulSoup.UnicodeDammit.markup, BeautifulSoup.UnicodeDammit.smartQuotesTo, and BeautifulSoup.UnicodeDammit.triedEncodings.

    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to HTML if coming from an encoding
        # that might have them.
        if self.smartQuotesTo and proposed.lower() in("windows-1252",
                                                      "iso-8859-1",
                                                      "iso-8859-2"):
            markup = re.compile("([\x80-\x9f])").sub \
                     (lambda(x): self._subMSChar(x.group(1)),
                      markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception, e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

◆ _detectEncoding()

def BeautifulSoup.UnicodeDammit._detectEncoding (   self,
  xml_data,
  isHTML = False 
)
private
Given a document, tries to detect its XML encoding.

Definition at line 1867 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._ebcdic_to_ascii() and BeautifulSoup.UnicodeDammit.declaredHTMLEncoding.

    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
                pass
        except:
            xml_encoding_match = None
        xml_encoding_match = re.compile(
            '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        if not xml_encoding_match and isHTML:
            regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
                                 'utf16', 'u16')):
                xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding
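The detection combines two signals: the byte pattern of the leading `<?` (which reveals the code-unit width and byte order) and the encoding named in the XML declaration. When the declaration names only a generic family such as `utf-16`, the sniffed value is used instead. Below is a rough Python 3 sketch of that interplay for BOM-less input only. The names `detect`, `_SIGNATURES` and `_GENERIC` are hypothetical, `_GENERIC` is an abbreviated set, and EBCDIC, BOMs and the HTML `meta` fallback are deliberately omitted.

```python
import re

# Byte patterns of a leading "<?" or "<" in BOM-less UTF-16/UTF-32 documents.
_SIGNATURES = [
    (b"\x00\x3c\x00\x3f", "utf-16be"),
    (b"\x3c\x00\x3f\x00", "utf-16le"),
    (b"\x00\x00\x00\x3c", "utf-32be"),
    (b"\x3c\x00\x00\x00", "utf-32le"),
]
# Declared names that do not pin down byte order (abbreviated for this sketch).
_GENERIC = {"utf-16", "utf-32", "utf_16", "utf_32", "utf16", "u16", "ucs-2", "ucs-4"}
_XML_DECL = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')


def detect(data):
    """Return (declared_or_resolved_encoding, sniffed_encoding) for bytes."""
    sniffed = "ascii"
    for prefix, encoding in _SIGNATURES:
        if data.startswith(prefix):
            sniffed = encoding
            break
    # Decode just to read the declaration; latin-1 never fails on raw bytes.
    codec = "latin-1" if sniffed == "ascii" else sniffed
    text = data.decode(codec, errors="replace")
    match = _XML_DECL.match(text)
    declared = match.group(1).lower() if match else None
    if declared in _GENERIC:
        declared = sniffed  # the sniffed value is more specific
    return declared, sniffed


doc = '<?xml version="1.0" encoding="UTF-16"?><a/>'.encode("utf-16le")
print(detect(doc))
```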
◆ _ebcdic_to_ascii()

def BeautifulSoup.UnicodeDammit._ebcdic_to_ascii (   self,
  s 
)
private

Definition at line 1952 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit.EBCDIC_TO_ASCII_MAP.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
            ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)
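The method builds a 256-entry translation table once, caches it on the class, and applies it with `translate`. In Python 3 the same idea can be sketched with `bytes.maketrans`. The example below is an illustration only: `EBCDIC_SAMPLE` is a hypothetical partial table holding just four entries of the `emap` tuple (indices 0x4C, 0x6F, 0xA7 and 0x94), enough to translate the EBCDIC signature that `_detectEncoding` looks for.

```python
# Hypothetical partial table: four entries copied from the emap tuple,
# chosen because together they spell the start of an XML declaration.
EBCDIC_SAMPLE = {0x4C: 0x3C, 0x6F: 0x3F, 0xA7: 0x78, 0x94: 0x6D}


def make_table(mapping):
    """Build a translation table; bytes absent from `mapping` map to themselves."""
    src = bytes(mapping.keys())
    dst = bytes(mapping.values())
    return bytes.maketrans(src, dst)


table = make_table(EBCDIC_SAMPLE)
print(b"\x4c\x6f\xa7\x94".translate(table))  # b'<?xm'
```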

◆ _subMSChar()

def BeautifulSoup.UnicodeDammit._subMSChar (   self,
  orig 
)
private
Changes an MS smart quote character to an XML or HTML entity.

Definition at line 1803 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit.MS_CHARS and BeautifulSoup.UnicodeDammit.smartQuotesTo.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

    def _subMSChar(self, orig):
        """Changes a MS smart quote character to an XML or HTML
        entity."""
        sub = self.MS_CHARS.get(orig)
        if isinstance(sub, tuple):
            if self.smartQuotesTo == 'xml':
                sub = '&#x%s;' % sub[1]
            else:
                sub = '&%s;' % sub[0]
        return sub
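The substitution picks between a numeric character reference (`smartQuotesTo == 'xml'`) and a named entity (any other truthy value) for each byte in the range 0x80 to 0x9F. The Python 3 sketch below shows the idea on raw bytes. `SMART_QUOTES` is an illustrative four-entry table assumed for this example (the contents of `MS_CHARS` are not listed on this page), and leaving unmapped bytes untouched is a choice made for the sketch, not documented behavior.

```python
import re

# Illustrative subset: windows-1252 byte -> (HTML entity name, code point in hex).
SMART_QUOTES = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}
_HIGH_BYTES = re.compile(rb"([\x80-\x9f])")


def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Replace mapped bytes with an XML numeric reference or an HTML entity."""
    def substitute(match):
        original = match.group(1)
        entry = SMART_QUOTES.get(original)
        if entry is None:
            return original  # unmapped byte: left as-is in this sketch
        name, codepoint = entry
        if smart_quotes_to == "xml":
            return b"&#x" + codepoint.encode("ascii") + b";"
        return b"&" + name.encode("ascii") + b";"
    return _HIGH_BYTES.sub(substitute, data)


print(replace_smart_quotes(b"\x93hi\x94"))
print(replace_smart_quotes(b"\x93hi\x94", smart_quotes_to="html"))
```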

◆ _toUnicode()

def BeautifulSoup.UnicodeDammit._toUnicode (   self,
  data,
  encoding 
)
private
Given a string and its encoding, decodes the string into Unicode.
The encoding argument is a string recognized by encodings.aliases.

Definition at line 1842 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit.unicode.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''

        # strip Byte Order Mark (if present)
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata
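A byte order mark, when present, overrides the caller's proposed encoding and is stripped before decoding. A minimal Python 3 sketch of that rule follows, using the BOM constants from the standard `codecs` module. The name `to_text` is hypothetical. The UTF-32 marks are checked before the UTF-16 ones, which plays the role of the `data[2:4] != '\x00\x00'` guard in the listing; unlike the listing, the sketch does not require at least four bytes for a UTF-16 BOM.

```python
import codecs

# Longest marks first, so a UTF-32LE BOM is not mistaken for a UTF-16LE one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def to_text(data, encoding):
    """Decode bytes, letting a leading BOM override `encoding`."""
    for bom, bom_encoding in _BOMS:
        if data.startswith(bom):
            encoding = bom_encoding
            data = data[len(bom):]  # strip the mark itself
            break
    return data.decode(encoding)


print(to_text(b"\xef\xbb\xbfhi", "latin-1"))  # hi
```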

◆ find_codec()

def BeautifulSoup.UnicodeDammit.find_codec (   self,
  charset 
)

Definition at line 1935 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._codec(), and BeautifulSoup.UnicodeDammit.CHARSET_ALIASES.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset
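The lookup tries, in order, the alias table, the name with hyphens removed, the name with hyphens turned into underscores, and finally gives the name back unchanged. The Python 3 sketch below reproduces that chain as free functions. `ALIASES` is a hypothetical one-entry stand-in, since the contents of `CHARSET_ALIASES` are not shown on this page.

```python
import codecs

# Hypothetical alias table used only for this illustration.
ALIASES = {"x-sjis": "shift-jis"}


def _codec(charset):
    """Return `charset` if Python knows a codec by that name, else None."""
    if not charset:
        return charset
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        return None
    return charset


def find_codec(charset):
    """Try the alias, then two hyphen rewrites, then give the name back as-is."""
    return (_codec(ALIASES.get(charset, charset))
            or (charset and _codec(charset.replace("-", "")))
            or (charset and _codec(charset.replace("-", "_")))
            or charset)


print(find_codec("x-sjis"))
print(find_codec("no-such-charset"))
```

Because the final fallback returns the input unchanged, an unknown name is not rejected here; in the listing above it is `_toUnicode` that raises, and `_convertFrom` that catches the exception.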

Member Data Documentation

◆ CHARSET_ALIASES

BeautifulSoup.UnicodeDammit.CHARSET_ALIASES
static

Definition at line 1766 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.find_codec().

◆ declaredHTMLEncoding

BeautifulSoup.UnicodeDammit.declaredHTMLEncoding

Definition at line 1771 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

◆ EBCDIC_TO_ASCII_MAP

BeautifulSoup.UnicodeDammit.EBCDIC_TO_ASCII_MAP
static

Definition at line 1951 of file BeautifulSoup.py.

◆ markup

BeautifulSoup.UnicodeDammit.markup

Definition at line 1833 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

◆ MS_CHARS

BeautifulSoup.UnicodeDammit.MS_CHARS
static

Definition at line 1977 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._subMSChar().

◆ originalEncoding

BeautifulSoup.UnicodeDammit.originalEncoding

Definition at line 1777 of file BeautifulSoup.py.

◆ smartQuotesTo

BeautifulSoup.UnicodeDammit.smartQuotesTo

◆ triedEncodings

BeautifulSoup.UnicodeDammit.triedEncodings

Definition at line 1775 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

◆ unicode

BeautifulSoup.UnicodeDammit.unicode