BeautifulSoup.UnicodeDammit Class Reference

Public Member Functions

def __init__
 
def find_codec
 

Public Attributes

 declaredHTMLEncoding
 
 markup
 
 originalEncoding
 
 smartQuotesTo
 
 triedEncodings
 
 unicode
 

Static Public Attributes

dictionary CHARSET_ALIASES
 
 EBCDIC_TO_ASCII_MAP = None
 
dictionary MS_CHARS
 

Private Member Functions

def _codec
 
def _convertFrom
 
def _detectEncoding
 
def _ebcdic_to_ascii
 
def _subMSChar
 
def _toUnicode
 

Detailed Description

A class for detecting the encoding of a *ML (HTML or XML) document and
converting it to a Unicode string. If the source encoding is
windows-1252, it can also replace MS smart quotes with their HTML or XML
entity equivalents.

Definition at line 1734 of file BeautifulSoup.py.

Constructor & Destructor Documentation

def BeautifulSoup.UnicodeDammit.__init__ (   self,
  markup,
  overrideEncodings = [],
  smartQuotesTo = 'xml',
  isHTML = False 
)

Definition at line 1748 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._convertFrom(), BeautifulSoup.UnicodeDammit._detectEncoding(), BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding, BeautifulSoup.BeautifulSoup.declaredHTMLEncoding, BeautifulSoup.UnicodeDammit.declaredHTMLEncoding, BeautifulSoup.BeautifulStoneSoup.markup, BeautifulSoup.UnicodeDammit.markup, BeautifulSoup.BeautifulStoneSoup.originalEncoding, BeautifulSoup.BeautifulSoup.originalEncoding, BeautifulSoup.UnicodeDammit.originalEncoding, BeautifulSoup.BeautifulStoneSoup.smartQuotesTo, BeautifulSoup.UnicodeDammit.smartQuotesTo, BeautifulSoup.UnicodeDammit.triedEncodings, and BeautifulSoup.UnicodeDammit.unicode.

1748      def __init__(self, markup, overrideEncodings=[],
1749                   smartQuotesTo='xml', isHTML=False):
1750          self.declaredHTMLEncoding = None
1751          self.markup, documentEncoding, sniffedEncoding = \
1752                       self._detectEncoding(markup, isHTML)
1753          self.smartQuotesTo = smartQuotesTo
1754          self.triedEncodings = []
1755          if markup == '' or isinstance(markup, unicode):
1756              self.originalEncoding = None
1757              self.unicode = unicode(markup)
1758              return
1759
1760          u = None
1761          for proposedEncoding in overrideEncodings:
1762              u = self._convertFrom(proposedEncoding)
1763              if u: break
1764          if not u:
1765              for proposedEncoding in (documentEncoding, sniffedEncoding):
1766                  u = self._convertFrom(proposedEncoding)
1767                  if u: break
1768
1769          # If no luck and we have auto-detection library, try that:
1770          if not u and chardet and not isinstance(self.markup, unicode):
1771              u = self._convertFrom(chardet.detect(self.markup)['encoding'])
1772
1773          # As a last resort, try utf-8 and windows-1252:
1774          if not u:
1775              for proposed_encoding in ("utf-8", "windows-1252"):
1776                  u = self._convertFrom(proposed_encoding)
1777                  if u: break
1778
1779          self.unicode = u
1780          if not u: self.originalEncoding = None
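
The constructor tries encodings in a fixed priority order: caller overrides, then the declared and sniffed encodings, then utf-8 and windows-1252. The following is a minimal Python 3 sketch of that ordering only, not the BeautifulSoup code. The name decode_with_fallbacks and its parameters are hypothetical, the chardet step is left out, and bytes.decode stands in for _convertFrom.

```python
def decode_with_fallbacks(data, override_encodings=(), declared=None, sniffed=None):
    """Return (text, encoding) for the first candidate that decodes, else (None, None)."""
    tried = []  # plays the role of triedEncodings: never retry a name
    candidates = list(override_encodings) + [declared, sniffed, "utf-8", "windows-1252"]
    for name in candidates:
        if not name or name in tried:
            continue
        tried.append(name)
        try:
            return data.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or undecodable bytes: move on to the next candidate.
            continue
    return None, None


if __name__ == "__main__":
    print(decode_with_fallbacks(b"caf\xc3\xa9"))
    print(decode_with_fallbacks(b"caf\xe9"))
```

Because windows-1252 comes last and accepts most byte values, it acts as the catch-all; an override supplied by the caller is always consulted first.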

Member Function Documentation

def BeautifulSoup.UnicodeDammit._codec (   self,
  charset 
)
private

Definition at line 1924 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.find_codec().

1925      def _codec(self, charset):
1926          if not charset: return charset
1927          codec = None
1928          try:
1929              codecs.lookup(charset)
1930              codec = charset
1931          except (LookupError, ValueError):
1932              pass
1933          return codec
def BeautifulSoup.UnicodeDammit._convertFrom (   self,
  proposed 
)
private

Definition at line 1795 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._subMSChar(), BeautifulSoup.UnicodeDammit._toUnicode(), BeautifulSoup.UnicodeDammit.find_codec(), BeautifulSoup.BeautifulStoneSoup.markup, BeautifulSoup.UnicodeDammit.markup, BeautifulSoup.BeautifulStoneSoup.originalEncoding, BeautifulSoup.BeautifulSoup.originalEncoding, BeautifulSoup.UnicodeDammit.originalEncoding, BeautifulSoup.BeautifulStoneSoup.smartQuotesTo, BeautifulSoup.UnicodeDammit.smartQuotesTo, and BeautifulSoup.UnicodeDammit.triedEncodings.

Referenced by BeautifulSoup.UnicodeDammit.__init__().

1796      def _convertFrom(self, proposed):
1797          proposed = self.find_codec(proposed)
1798          if not proposed or proposed in self.triedEncodings:
1799              return None
1800          self.triedEncodings.append(proposed)
1801          markup = self.markup
1802
1803          # Convert smart quotes to HTML if coming from an encoding
1804          # that might have them.
1805          if self.smartQuotesTo and proposed.lower() in("windows-1252",
1806                                                        "iso-8859-1",
1807                                                        "iso-8859-2"):
1808              smart_quotes_re = "([\x80-\x9f])"
1809              smart_quotes_compiled = re.compile(smart_quotes_re)
1810              markup = smart_quotes_compiled.sub(self._subMSChar, markup)
1811
1812          try:
1813              # print "Trying to convert document to %s" % proposed
1814              u = self._toUnicode(markup, proposed)
1815              self.markup = u
1816              self.originalEncoding = proposed
1817          except Exception, e:
1818              # print "That didn't work!"
1819              # print e
1820              return None
1821          #print "Correct encoding: %s" % proposed
1822          return self.markup
def BeautifulSoup.UnicodeDammit._detectEncoding (   self,
  xml_data,
  isHTML = False 
)
private
Given a document, tries to detect its XML encoding.

Definition at line 1848 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._ebcdic_to_ascii(), BeautifulSoup.BeautifulStoneSoup.declaredHTMLEncoding, BeautifulSoup.BeautifulSoup.declaredHTMLEncoding, BeautifulSoup.UnicodeDammit.declaredHTMLEncoding, and BeautifulSoup.UnicodeDammit.unicode.

Referenced by BeautifulSoup.UnicodeDammit.__init__().

1849      def _detectEncoding(self, xml_data, isHTML=False):
1850          """Given a document, tries to detect its XML encoding."""
1851          xml_encoding = sniffed_xml_encoding = None
1852          try:
1853              if xml_data[:4] == '\x4c\x6f\xa7\x94':
1854                  # EBCDIC
1855                  xml_data = self._ebcdic_to_ascii(xml_data)
1856              elif xml_data[:4] == '\x00\x3c\x00\x3f':
1857                  # UTF-16BE
1858                  sniffed_xml_encoding = 'utf-16be'
1859                  xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
1860              elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
1861                       and (xml_data[2:4] != '\x00\x00'):
1862                  # UTF-16BE with BOM
1863                  sniffed_xml_encoding = 'utf-16be'
1864                  xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
1865              elif xml_data[:4] == '\x3c\x00\x3f\x00':
1866                  # UTF-16LE
1867                  sniffed_xml_encoding = 'utf-16le'
1868                  xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
1869              elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
1870                       (xml_data[2:4] != '\x00\x00'):
1871                  # UTF-16LE with BOM
1872                  sniffed_xml_encoding = 'utf-16le'
1873                  xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
1874              elif xml_data[:4] == '\x00\x00\x00\x3c':
1875                  # UTF-32BE
1876                  sniffed_xml_encoding = 'utf-32be'
1877                  xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
1878              elif xml_data[:4] == '\x3c\x00\x00\x00':
1879                  # UTF-32LE
1880                  sniffed_xml_encoding = 'utf-32le'
1881                  xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
1882              elif xml_data[:4] == '\x00\x00\xfe\xff':
1883                  # UTF-32BE with BOM
1884                  sniffed_xml_encoding = 'utf-32be'
1885                  xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
1886              elif xml_data[:4] == '\xff\xfe\x00\x00':
1887                  # UTF-32LE with BOM
1888                  sniffed_xml_encoding = 'utf-32le'
1889                  xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
1890              elif xml_data[:3] == '\xef\xbb\xbf':
1891                  # UTF-8 with BOM
1892                  sniffed_xml_encoding = 'utf-8'
1893                  xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
1894              else:
1895                  sniffed_xml_encoding = 'ascii'
1896                  pass
1897          except:
1898              xml_encoding_match = None
1899          xml_encoding_re = '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode()
1900          xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
1901          if not xml_encoding_match and isHTML:
1902              meta_re = '<\s*meta[^>]+charset=([^>]*?)[;\'">]'.encode()
1903              regexp = re.compile(meta_re, re.I)
1904              xml_encoding_match = regexp.search(xml_data)
1905          if xml_encoding_match is not None:
1906              xml_encoding = xml_encoding_match.groups()[0].decode(
1907                  'ascii').lower()
1908              if isHTML:
1909                  self.declaredHTMLEncoding = xml_encoding
1910              if sniffed_xml_encoding and \
1911                 (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
1912                                   'iso-10646-ucs-4', 'ucs-4', 'csucs4',
1913                                   'utf-16', 'utf-32', 'utf_16', 'utf_32',
1914                                   'utf16', 'u16')):
1915                  xml_encoding = sniffed_xml_encoding
1916          return xml_data, xml_encoding, sniffed_xml_encoding
1917
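
The second half of _detectEncoding reads the encoding that the document declares about itself: first an XML declaration at the very start, then, for HTML only, a meta tag carrying a charset. The Python 3 sketch below isolates that step. The two patterns are taken from the listing above, while the function name declared_encoding and the module-level constants are hypothetical; the byte-pattern sniffing and the UTF-16/UTF-32 override are not reproduced.

```python
import re

# Patterns as in the listing, written as bytes patterns for Python 3.
XML_DECL_RE = re.compile(rb'''^<\?.*encoding=['"](.*?)['"].*\?>''')
META_RE = re.compile(rb'''<\s*meta[^>]+charset=([^>]*?)[;'">]''', re.I)


def declared_encoding(data, is_html=False):
    """Return the lower-cased encoding name a document declares, or None."""
    match = XML_DECL_RE.match(data)
    if match is None and is_html:
        # Only HTML input is searched for a <meta ... charset=...> tag.
        match = META_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

The XML declaration is anchored at the first byte, so leading whitespace or a byte order mark that has not been stripped will defeat it; the meta pattern is searched anywhere in the input.
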
def BeautifulSoup.UnicodeDammit._ebcdic_to_ascii (   self,
  s 
)
private

Definition at line 1935 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit._detectEncoding().

1936      def _ebcdic_to_ascii(self, s):
1937          c = self.__class__
1938          if not c.EBCDIC_TO_ASCII_MAP:
1939              emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
1940                      16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
1941                      128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
1942                      144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
1943                      32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
1944                      38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
1945                      45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
1946                      186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
1947                      195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
1948                      201,202,106,107,108,109,110,111,112,113,114,203,204,205,
1949                      206,207,208,209,126,115,116,117,118,119,120,121,122,210,
1950                      211,212,213,214,215,216,217,218,219,220,221,222,223,224,
1951                      225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
1952                      73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
1953                      82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
1954                      90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
1955                      250,251,252,253,254,255)
1956              import string
1957              c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
1958              ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
1959          return s.translate(c.EBCDIC_TO_ASCII_MAP)
def BeautifulSoup.UnicodeDammit._subMSChar (   self,
  match 
)
private
Changes a MS smart quote character to an XML or HTML
entity.

Definition at line 1781 of file BeautifulSoup.py.

References BeautifulSoup.BeautifulStoneSoup.smartQuotesTo, and BeautifulSoup.UnicodeDammit.smartQuotesTo.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1782      def _subMSChar(self, match):
1783          """Changes a MS smart quote character to an XML or HTML
1784          entity."""
1785          orig = match.group(1)
1786          sub = self.MS_CHARS.get(orig)
1787          if type(sub) == types.TupleType:
1788              if self.smartQuotesTo == 'xml':
1789                  sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
1790              else:
1791                  sub = '&'.encode() + sub[0].encode() + ';'.encode()
1792          else:
1793              sub = sub.encode()
1794          return sub
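
_subMSChar is used as a regular-expression substitution callback: each byte in the range 0x80 to 0x9f is replaced by a numeric XML reference or a named HTML entity, depending on smartQuotesTo. The sketch below shows that mechanism in Python 3. The contents of MS_CHARS are not shown on this page, so MS_CHARS_SUBSET is a hypothetical four-entry stand-in (the windows-1252 curly quotes with their standard entity names and code points), and leaving unlisted bytes untouched is a choice made for this example.

```python
import re

SMART_QUOTE_BYTES = re.compile(rb"([\x80-\x9f])")

# Hypothetical subset: byte -> (HTML entity name, hexadecimal code point).
MS_CHARS_SUBSET = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Replace known windows-1252 quote bytes with XML or HTML entities."""

    def substitute(match):
        entry = MS_CHARS_SUBSET.get(match.group(1))
        if entry is None:
            return match.group(1)  # not in the subset: keep the byte as it is
        name, hex_code = entry
        if smart_quotes_to == "xml":
            return b"&#x" + hex_code.encode("ascii") + b";"
        return b"&" + name.encode("ascii") + b";"

    return SMART_QUOTE_BYTES.sub(substitute, data)


if __name__ == "__main__":
    print(replace_smart_quotes(b"\x93hi\x94"))
    print(replace_smart_quotes(b"\x93hi\x94", smart_quotes_to="html"))
```

The substitution runs on the raw bytes before decoding, which is why _convertFrom applies it only for single-byte encodings where these byte values can be smart quotes.
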
def BeautifulSoup.UnicodeDammit._toUnicode (   self,
  data,
  encoding 
)
private
Given a string and its encoding, decodes the string into Unicode.
encoding is a string recognized by encodings.aliases.

Definition at line 1823 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit.unicode.

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1824      def _toUnicode(self, data, encoding):
1825          '''Given a string and its encoding, decodes the string into Unicode.
1826          %encoding is a string recognized by encodings.aliases'''
1827
1828          # strip Byte Order Mark (if present)
1829          if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
1830                 and (data[2:4] != '\x00\x00'):
1831              encoding = 'utf-16be'
1832              data = data[2:]
1833          elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
1834                   and (data[2:4] != '\x00\x00'):
1835              encoding = 'utf-16le'
1836              data = data[2:]
1837          elif data[:3] == '\xef\xbb\xbf':
1838              encoding = 'utf-8'
1839              data = data[3:]
1840          elif data[:4] == '\x00\x00\xfe\xff':
1841              encoding = 'utf-32be'
1842              data = data[4:]
1843          elif data[:4] == '\xff\xfe\x00\x00':
1844              encoding = 'utf-32le'
1845              data = data[4:]
1846          newdata = unicode(data, encoding)
1847          return newdata
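
In _toUnicode a byte order mark, when present, overrides the proposed encoding and is stripped before decoding. A compact Python 3 sketch of the same idea follows; to_text and _BOMS are hypothetical names, and the BOM constants come from the standard codecs module. Instead of the explicit data[2:4] != '\x00\x00' guard in the listing, this version checks the four-byte UTF-32 marks before the two-byte UTF-16 marks, which is a different way of resolving the same ambiguity.

```python
import codecs

# Longest marks first: the UTF-32LE mark begins with the UTF-16LE mark.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def to_text(data, encoding):
    """Decode bytes, letting a leading byte order mark override `encoding`."""
    for bom, bom_encoding in _BOMS:
        if data.startswith(bom):
            data = data[len(bom):]   # strip the mark itself
            encoding = bom_encoding  # trust the mark over the caller's guess
            break
    return data.decode(encoding)


if __name__ == "__main__":
    print(to_text(b"\xef\xbb\xbfhi", "windows-1252"))
```

As in the listing, a decoding failure is not handled here; the caller (_convertFrom in the original) is expected to catch it and try the next candidate.
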
def BeautifulSoup.UnicodeDammit.find_codec (   self,
  charset 
)

Definition at line 1918 of file BeautifulSoup.py.

References BeautifulSoup.UnicodeDammit._codec().

Referenced by BeautifulSoup.UnicodeDammit._convertFrom().

1919      def find_codec(self, charset):
1920          return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
1921                 or (charset and self._codec(charset.replace("-", ""))) \
1922                 or (charset and self._codec(charset.replace("-", "_"))) \
1923                 or charset
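
find_codec normalizes a charset label in stages: alias lookup, then the label with hyphens removed, then with hyphens turned into underscores, and finally the label unchanged. The Python 3 sketch below mirrors that chain as free functions. The alias table is the CHARSET_ALIASES value documented on this page; known_codec is a hypothetical stand-in for _codec.

```python
import codecs

CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def known_codec(charset):
    """Return charset if Python has a codec registered for it, else None."""
    if not charset:
        return charset
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        return None
    return charset


def find_codec(charset):
    """Try progressively looser spellings; fall back to the input label."""
    return (known_codec(CHARSET_ALIASES.get(charset, charset))
            or (charset and known_codec(charset.replace("-", "")))
            or (charset and known_codec(charset.replace("-", "_")))
            or charset)


if __name__ == "__main__":
    print(find_codec("macintosh"))
    print(find_codec("x-sjis"))
```

Note that an unrecognized label is returned as is, so the caller still has to cope with a name that no codec accepts.
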

Member Data Documentation

dictionary BeautifulSoup.UnicodeDammit.CHARSET_ALIASES
static
Initial value:
= { "macintosh" : "mac-roman",
    "x-sjis" : "shift-jis" }

Definition at line 1744 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.declaredHTMLEncoding

Definition at line 1749 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._detectEncoding().

BeautifulSoup.UnicodeDammit.EBCDIC_TO_ASCII_MAP = None
static

Definition at line 1934 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.markup

Definition at line 1814 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._convertFrom().

dictionary BeautifulSoup.UnicodeDammit.MS_CHARS
static

Definition at line 1960 of file BeautifulSoup.py.

BeautifulSoup.UnicodeDammit.originalEncoding

Definition at line 1755 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._convertFrom().

BeautifulSoup.UnicodeDammit.smartQuotesTo

Definition at line 1752 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), BeautifulSoup.UnicodeDammit._convertFrom(), and BeautifulSoup.UnicodeDammit._subMSChar().

BeautifulSoup.UnicodeDammit.triedEncodings

Definition at line 1753 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), and BeautifulSoup.UnicodeDammit._convertFrom().

BeautifulSoup.UnicodeDammit.unicode

Definition at line 1756 of file BeautifulSoup.py.

Referenced by BeautifulSoup.UnicodeDammit.__init__(), BeautifulSoup.UnicodeDammit._detectEncoding(), and BeautifulSoup.UnicodeDammit._toUnicode().