Unicode is a system to represent characters from all the world's different languages. When Python parses an XML
document, all data is stored in memory as unicode.
You'll get to all that in a minute, but first, some background.
Historical note. Before unicode, there were separate character encoding systems for each language, each using the
same numbers (0-255) to represent that language's characters. Some languages (like Russian) have multiple
conflicting standards about how to represent the same characters; other languages (like Japanese) have so many
characters that they require multiple-byte character sets. Exchanging documents between systems was difficult
because there was no way for a computer to tell for certain which character encoding scheme the document author had
used; the computer only saw numbers, and the numbers could mean different things. Then think about trying to store
these documents in the same place (like in the same database table); you would need to store the character encoding
alongside each piece of text, and make sure to pass it around whenever you passed the text around. Then think about
multilingual documents, with characters from multiple languages in the same document. (They typically used escape
codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac
Greek mode, so character 241 means something else. And so on.) These are the problems which unicode was designed
to solve.
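To make the "character 241" problem concrete, here is a minimal sketch (written for Python 3, and assuming the standard library's koi8_r, mac_greek, and iso-8859-1 codecs stand in for the legacy systems described above). It decodes the same single byte three ways; the variable names are purely illustrative.

```python
# One byte with the value 241, with no record of which encoding produced it.
raw = b'\xf1'

# The same byte interpreted under three different legacy encodings.
as_koi8r = raw.decode('koi8_r')         # Russian KOI8-R
as_mac_greek = raw.decode('mac_greek')  # Mac OS Greek
as_latin1 = raw.decode('iso-8859-1')    # Western European

for name, text in [('koi8-r', as_koi8r),
                   ('mac-greek', as_mac_greek),
                   ('latin-1', as_latin1)]:
    # Show the unicode number each interpretation maps to.
    print(name, hex(ord(text)))
```

Each interpretation yields a different character, which is why the encoding had to travel with the text.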
To solve these problems, unicode assigns each character its own number. In the original design this was a 2-byte
number, from 0 to 65535;[5] the standard has since grown past that range, and code points now run up to 1114111
(0x10FFFF). Each number represents a unique character used in at least one of the world's languages. (Characters
that are used in multiple languages have the same numeric code.) There is exactly 1 number per character, and
exactly 1 character per number. Unicode data is never ambiguous.
Of course, there is still the matter of all these legacy encoding systems. Take 7-bit ASCII, for instance, which stores
English characters as numbers ranging from 0 to 127. (65 is capital "A", 97 is lowercase "a", and so forth.) English
has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like
French, Spanish, and German all use an encoding system called ISO-8859-1 (also called "latin-1"), which uses the
7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like
n-with-a-tilde-over-it (241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit
ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there
into characters for other languages with the remaining numbers, 256 and up.
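The overlap between ASCII, ISO-8859-1, and unicode can be checked directly. The following is a small sketch, not anything from the original text: the variable names are illustrative, and the u'' literals are used so that it reads the same under Python 2 and Python 3.3 or later.

```python
# Characters in the ASCII and ISO-8859-1 ranges keep the same numbers in unicode.
capital_a = u'A'
n_tilde = u'\xf1'    # n-with-a-tilde-over-it
u_umlaut = u'\xfc'   # u-with-two-dots-over-it

print(ord(capital_a))  # 65
print(ord(n_tilde))    # 241
print(ord(u_umlaut))   # 252

# Encoding to ISO-8859-1 gives one byte per character, carrying that same value.
latin1_bytes = n_tilde.encode('iso-8859-1')
print(latin1_bytes == b'\xf1')  # True

# Past 255, unicode continues into characters that ISO-8859-1 cannot represent.
cyrillic_ya = u'\u042f'
print(ord(cyrillic_ya))  # 1071
```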
When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy
encoding systems. For instance, to integrate with some other computer system which expects its data in a specific
1-byte encoding scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document
which explicitly specifies the encoding scheme.
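As a rough illustration of that conversion, the sketch below encodes a unicode string into ISO-8859-1 and then into ASCII, where one character has no equivalent. The sample string and variable names are assumptions made for the example; the 'replace' and 'xmlcharrefreplace' error handlers are standard options of the encode method.

```python
# A unicode string containing one non-ASCII character (n with a tilde).
text = u'La Pe\xf1a'

# ISO-8859-1 can represent every character here: one byte per character.
latin1_bytes = text.encode('iso-8859-1')
print(latin1_bytes == b'La Pe\xf1a')  # True

# ASCII cannot represent the n-with-tilde, so a strict encode fails.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True

# Error handlers trade fidelity for a result that always fits the target.
ascii_fallback = text.encode('ascii', 'replace')          # '?' substituted
xml_fallback = text.encode('ascii', 'xmlcharrefreplace')  # numeric XML reference
print(ascii_fallback == b'La Pe?a')      # True
print(xml_fallback == b'La Pe&#241;a')   # True
```

The XML character reference form is one way to keep the data lossless when the target document declares an encoding that lacks the character.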
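You can check the default encoding Python uses for these conversions with sys.getdefaultencoding(). The value
depends on how the installation is configured; the session below comes from one that has been set to iso-8859-1:
>>> import sys
>>> sys.getdefaultencoding()
'iso-8859-1'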
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each
individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to
be UTF-8:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-