Instructional Module W22f

What is a character?
Since computers were originally designed for numeric calculation, they don't (technically speaking) store "characters" or "letters". They only store numbers. In order to deal with letters and other characters, the pioneers of computer science set up tables giving each necessary letter a numeric code.
At first, there were many of these tables, or character sets, because of this significant fact: it makes very little difference which number is used to represent a character.
Early codes were very simple. One example is the Baudot code, which used only 5 bits (at most 32 code values) but was able to represent 55 distinct characters by using shift codes to switch between a set of letters and a set of figures. As time went on, more character sets came into use. This turned out to be a real hassle, because of this equally significant fact: when different computers use different codes for characters, they can't "talk" to one another without translation.
This caused endless trouble, and was the reason for the development of ASCII.

Seven-bit code

ASCII
The first character set to be used widely on many different types of computers was ASCII, the American Standard Code for Information Interchange. It was developed between 1963 and 1968 by Robert Bemer and computer scientists from many computer organizations, and has been the mainstay of information interchange for four decades. Designed during a time when computer memory (RAM) was very expensive, it used only the low 7 of the 8 bits in a standard byte. This was adequate for the English-speaking people who developed the computers and the code, and allowed the leftover bit to be used (occasionally) for other purposes.
ASCII characters are divided into two main regions, each within a range of numbers: control characters (codes 0 to 31, plus 127) and visible characters (codes 32 to 126).
Here's a table of the visible characters:
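The visible region can be regenerated with a few lines of code. The Python sketch below is only an illustration: the helper name ascii_visible_table is made up for this example, and it assumes the visible characters are codes 32 (the space) through 126.

```python
def ascii_visible_table(columns: int = 16) -> str:
    """Lay out the visible ASCII characters (codes 32-126) in rows.

    Each row starts with the decimal code of its first character.
    """
    rows = []
    for start in range(32, 127, columns):
        codes = range(start, min(start + columns, 127))
        cells = " ".join(chr(code) for code in codes)
        rows.append(f"{start:3d}  {cells}")
    return "\n".join(rows)


print(ascii_visible_table())
```

With the default of 16 columns, each row begins at a multiple of 16, which makes it easy to read a character's code as the row number plus its position in the row.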
Although there are characters that look like accent marks (` ~ ^), these are not designed to be placed over letters. As a result, only English, Dutch, Latin, Indonesian/Malay, and a handful of other languages can be written correctly using ASCII.

ANSI
There is a variant known as ANSI, in which one or two of the less-used characters are different.

Eight-bit codes

EBCDIC
The second most widely used character set (after ASCII) was IBM's proprietary "Extended Binary-Coded Decimal Interchange Code". It was developed for use on the 1964 System/360 computers, and used all 8 bits in each byte. EBCDIC was used in all IBM mainframe computers for three decades, and on mainframes of several other companies as well. You may want to follow this link to a table of EBCDIC characters.
Like ASCII, the basic code table didn't have provisions for any non-English languages. But there were many numbers that weren't assigned to any characters - these are the blank squares in the table illustration. IBM's approach to international languages was to provide slightly different character sets to customers in different language-regions by adding any necessary characters where the basic table was blank, and changing existing character assignments as needed. This resulted in many incompatible versions of EBCDIC, which made it difficult to exchange data between computers in different language-regions. For that reason, EBCDIC has never been used on the Internet, and is no longer being incorporated in new operating systems for computers of any size.

Extended ASCII
By 1970, the vast majority of computers used 8-bit bytes, so ASCII character data always had one bit unused. Using that one extra bit would make it possible to double the number of characters in the set. So most computer systems used the extra bit - the only difficulty being that for decades there was no standard way to use the extra characters. Each operating system had its own set of extended ASCII characters.
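To see why incompatible 8-bit sets cause trouble, it helps to decode one and the same byte under several of them. The Python sketch below is an illustration, not part of the original module: the codec names are Python's own labels, the descriptions beside them are informal, and the helper name decode_everywhere is made up for this example.

```python
# One byte value, read under four different character sets.
BYTE = bytes([0xE9])

# Keys are Python codec names; values are informal descriptions.
CODE_PAGES = {
    "latin-1": "ISO 8859-1",
    "cp437": "original IBM PC",
    "mac_roman": "classic Mac OS",
    "cp037": "EBCDIC, US/Canada",
}


def decode_everywhere(data: bytes) -> dict[str, str]:
    """Decode the same bytes under each code page listed above."""
    return {codec: data.decode(codec) for codec in CODE_PAGES}


for codec, text in decode_everywhere(BYTE).items():
    print(f"{codec:10s} ({CODE_PAGES[codec]}): {text!r}")
```

Each line of output shows a different character for the same stored number, which is exactly the situation that forces translation when data moves between systems.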

About 8 and 16 bit codes
An 8-bit byte can be used to hold up to 256 distinct characters. The problem is that there are languages with many more than 256 characters. Languages like Chinese and Japanese use thousands of ideographs - symbols that represent entire words, concepts, or names.
Theoretically, a 16-bit sequence, which can represent up to 65,536 unique characters, might be enough for all the world's languages. However, this point has not been accepted by all interested parties in Asia, so the jury is still out on the best way to provide a coding system for all languages.
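The capacity of a code follows directly from its width: n bits give 2 to the power n distinct values. A minimal Python sketch of that arithmetic (the function name capacity is made up for this example):

```python
def capacity(bits: int) -> int:
    """Number of distinct codes that fit in the given number of bits."""
    return 2 ** bits


for bits in (5, 7, 8, 16):
    print(f"{bits:2d} bits -> {capacity(bits):,} distinct codes")
```

Note that the count includes code 0, so the highest code number is always one less than the capacity.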

ISO Codes

ISO 646: 7-bit Code Standard
In order to make computers more friendly to other languages without using more than 7 bits, the International Organization for Standardization (ISO) approved a use of ASCII which allowed flexibility. ISO 646 specifies that 12 of the US ASCII characters can be used for nationalization:
# $ @ [ \ ] { | } ^ ` ~
This has been used in many language-regions; for example, Italy replaces the characters above with this set, registered as ISO-IR-15 (the Italian version of ISO 646):
£ $ § ° ç é à ò è ¬ ù ì
The need for these 7-bit codes is largely past, as practically all computers now are equipped to handle at least 8-bit codes. All operating systems released in the last few years for personal computers and up have also been designed to handle 16-bit codes.

ISO 2022 Procedural Standard
Attempting to bring some order to the chaotic use of the extended character area, the International Organization for Standardization approved a standard, ISO 2022, simply to organize how to use the extended character space. It was based on these principles:

ISO 8859: 8-bit code Standard
Following most of the principles of ISO 2022, a number of 8-bit codes were suggested for the alphabetic writing systems. These were brought together in the ISO 8859 series.
There are 10+ defined character sets based on ISO 8859. The items in this list are linked to tables in Roman Czyborra's Alphabet Soup page:

ISO 10646: Universal Character Set and Unicode
In the late 1980s, two groups separately began planning a coding system to encompass all the writing systems of the world. Those groups were:
By the early 1990s, these two groups agreed to cooperate, though they kept their separate identities. Their first task was to agree on the principles for creating a Universal Character Set (UCS). These are the principles they arrived at:

UTF-8
ISO 10646 and Unicode are systems for assigning characters to numeric values. In order to actually store these on a computer in a practical way, Ken Thompson and Rob Pike of the AT&T Unix team developed the UTF-8 system - a practical way of representing the Universal Character Set. UTF-8 works by allowing all characters to be represented in the smallest number of bytes necessary to represent their code number.
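The variable length is easy to observe in practice. The Python sketch below is an illustration only; the four sample characters and the helper name utf8_bytes are choices made for this example.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 byte sequence for a single character."""
    return ch.encode("utf-8")


# A plain letter, an accented letter, the euro sign, and an emoji,
# written as escapes so the script itself stays pure ASCII.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in SAMPLES:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

The ASCII letter takes a single byte identical to its ASCII code, while characters with larger code numbers take two, three, or four bytes.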

Encoding Types

Characters are encoded in many different ways, as the previous section shows. In order to give some measure of flexibility and independence to characters in (X)(HT)ML, there are several ways to encode them in a file:
Common codes are entered simply by typing them in. Any character on your keyboard is automatically encoded as soon as the computer receives it. Named and numeric entity codes are used to enter characters that either:
For example, when you want an angle bracket < or > to be displayed in a browser, the named or numeric entity must be used, since < in an XML/HTML file marks the beginning of a markup tag. Character entities are delimited by & at the beginning and ; at the end. In the following section, we'll discuss these codes.
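If you generate pages from a program, the conversion of markup characters to entities can be automated. Here is a minimal sketch using Python's standard html module; the sample text is made up for this example.

```python
import html

# Text that contains characters with special meaning in markup.
raw = 'if a < b & b > c: print("ok")'

# quote=False leaves quotation marks alone and converts only & < >.
escaped = html.escape(raw, quote=False)
print(escaped)

# unescape() reverses the conversion.
print(html.unescape(escaped) == raw)
```

The & is converted as well as the angle brackets, because a bare & would otherwise be read as the start of an entity.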

Named Entities

There are scores of named entities; here's a list of the most commonly used ones:
Named entities exist for many types of characters:
Names are case-sensitive: most are all lower-case, while the names for accented capital letters have only their first letter capitalized. If a browser doesn't recognize a character name, it shows the name like this: &xyz;
Link to a table of named character entities.
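Python's standard library carries a table of the HTML 4 entity names, which makes it easy to look names up and to see the case-sensitivity in action. The sketch below is an illustration; the selection of names is arbitrary.

```python
import html
from html.entities import name2codepoint

# A few common names; note that eacute and Eacute are different entities.
for name in ("lt", "gt", "amp", "nbsp", "copy", "eacute", "Eacute"):
    code = name2codepoint[name]
    print(f"&{name};  U+{code:04X}  {chr(code)!r}")

# An unknown name is left untouched, much as a browser would display it.
print(html.unescape("&xyz;"))
```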

Numeric Codes

Software doesn't recognize names for all character codes, but it does recognize their code numbers. (Of course, numbers are harder for most people to remember, so you may often have to look up the code for a character.) The number represents the position of the character in the document character set, which for HTML and XML is the Universal Character Set (ISO 10646/Unicode), whatever encoding the file itself is saved in. After the &, a number sign # is followed by the number itself, and a ; ends the reference. The number is read as decimal (base 10), or as hexadecimal (base 16) if you put the letter "x" in front of it. Here are some examples:
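A short Python sketch can build both forms of a numeric code for any character and check that they decode back correctly. The helper name numeric_references is made up for this example.

```python
import html


def numeric_references(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal numeric codes for one character."""
    code = ord(ch)  # position in the Universal Character Set
    return f"&#{code};", f"&#x{code:X};"


for ch in ("<", "\u00e9", "\u20ac"):
    decimal, hexadecimal = numeric_references(ch)
    # Both forms decode to the same character.
    same = html.unescape(decimal) == html.unescape(hexadecimal) == ch
    print(decimal, hexadecimal, same)
```

For the accented letter é this gives &#233; in decimal and &#xE9; in hexadecimal.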
For more information:

Making It All Work

OK, that's a lot of facts. Here's how you put it all together:

References
Character Sets

Audience
This module is for people who know the basics of XML/HTML code (see module W22c) and are ready to learn about character entities.

Objectives

Module W22f: Odd Characters in HTML
This document is part of a modular instruction
series in Computer Instruction. For more information, see the overview
or the list of modules in this series, W: World Wide
Web. This document has been used in the following classes: INP
150.

History
Original: 3 October 2003, by Laurence J. Krieg
Last modification: Monday, 18-Sep-2006 16:04:27 EDT

Copyright
Copyright © 2003, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice. Students: You are welcome to make a copy for your personal use. All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.