Instructional Module W22f

Odd Characters in HTML


Overview

We call this the "World Wide Web", but around here we sometimes seem to think of it as the "world English web". How can the Web be truly world-wide if it's not accessible to the billions of people who don't understand English? Clearly, the Web needs to handle more than just English - at least in writing.

The Internet was developed primarily by native speakers of English, and for the first decades of its life, used mainly the characters of English. But English is a language that uses relatively few characters. We have:

  • 26 uppercase letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
  • 26 lowercase letters: abcdefghijklmnopqrstuvwxyz
  • 10 digits: 0 1 2 3 4 5 6 7 8 9
  • A handful of punctuation marks: ! @ # & * ( ) : ; " ' , . ?

Other languages of Europe and America use more characters, including:

  • Extra punctuation marks: ¿ ¡ °
  • Different types of quotation marks: « »
  • Various currency symbols:  £ ¥ € ƒ
  • Letters with various marks over them (Â É ã ö ñ) or through them (Ø ø Ð), joined-together letters (Æ œ ß), and a few no longer used in English (Þ þ Ð ð Æ æ)

Languages in many parts of the world use alphabets with entirely different characters, such as:

  • Russian, Ukrainian, Bulgarian, and central Asian Cyrillic
  • Greek
  • Arabic, Farsi, and Urdu
  • Hebrew
  • Coptic
  • Cree and Eskimo syllabary
  • Cherokee syllabary
  • Devanagari-based syllabaries of south Asia
  • Thai and Lao
  • Japanese kana writing
  • Korean hangul writing
  • ...and others not so well known

Some languages don't use alphabets at all. They use thousands of ideographs - characters that represent ideas. The best known of these are:

  • Chinese
  • Japanese kanji

Since the Internet was originally designed to handle only the characters of English, it has taken quite a lot of work to expand the World Wide Web to handle the characters of all languages. This module introduces the systems used on the Internet as a whole and in XML/HTML to represent both the familiar letters of English and the many "odd" characters of world languages.



Character Sets
Character Representation

What is a character?
A character is any written symbol used to communicate meaning.


Since computers were originally designed for numeric calculation, they don't (technically speaking) store "characters" or "letters". They only store numbers.

In order to deal with letters and other characters, the pioneers of computer science set up tables giving each necessary letter a numeric code.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

At first, there were many of these tables, or character sets, because of this significant fact:

It makes very little difference which number is used to represent a character.

Code used on hypothetical computer X
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Code used on hypothetical computer Y
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

Early codes were very simple. One example is the Baudot code, which used only 5 bits (32 possible codes) but, by using shift codes to switch between a letters set and a figures set, was able to represent 55 distinct characters. As time went on, more character sets came into use. This turned out to be a real hassle, because of this equally significant fact:

It makes very little difference which number is used to represent a character,
as long as everyone uses the same number for the same character.

When different computers use different codes for characters, they can't "talk" to one another without translation.

Computer X says:         HELLO
In Computer X's code:    8-5-12-12-15
        (sent from X to Y)
In Computer Y's code:    8-5-12-12-15
Computer Y hears:        SVOOL

This causes endless trouble, and was the reason for the development of ASCII.
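
The HELLO/SVOOL mix-up can be reproduced in a few lines of Python. This is only a sketch of the two hypothetical code tables shown above; the names CODE_X, CODE_Y, encode, and decode are made up for the illustration.

```python
import string

# Hypothetical code tables from the example above:
# computer X counts A=1 .. Z=26, computer Y counts backwards, A=26 .. Z=1.
LETTERS = string.ascii_uppercase
CODE_X = {letter: i + 1 for i, letter in enumerate(LETTERS)}
CODE_Y = {letter: 26 - i for i, letter in enumerate(LETTERS)}

def encode(text, table):
    """Turn letters into their numeric codes using the given table."""
    return [table[ch] for ch in text]

def decode(codes, table):
    """Turn numeric codes back into letters using the given table."""
    reverse = {number: letter for letter, number in table.items()}
    return "".join(reverse[n] for n in codes)

sent = encode("HELLO", CODE_X)      # the numbers X puts on the wire
heard = decode(sent, CODE_Y)        # Y reads them with its own table
translated = decode(sent, CODE_X)   # Y reads them with X's table instead

print(sent)         # [8, 5, 12, 12, 15]
print(heard)        # SVOOL
print(translated)   # HELLO
```

The numbers arrive unchanged; only the table used to read them differs, which is why a shared table removes the need for translation.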

Seven-bit code

ASCII

The first character set to be used widely on many different types of computers was ASCII, the American Standard Code for Information Interchange. It was developed between 1963 and 1968 by Robert Bemer and computer scientists from many computer organizations, and has been the mainstay of information interchange for four decades. Designed during a time when computer memory (RAM) was very expensive, it used only the low 7 of the 8 bits in a standard byte. This was adequate for the English-speaking people who developed the computers and the code, and allowed the leftover bit to be used (occasionally) for other purposes.

7-bit ASCII and 8-bit standard bytes illustrated

ASCII characters are divided into two main regions, each within a range of numbers:

  • Numbers 0 - 31: Non-visible Control characters. These are used to send simple control signals to devices. The most commonly used are:
    • 9: horizontal (normal) tab
    • 10: line feed. On Unix/Linux and Mac OS X systems, this alone marks the end of every line of text.
    • 13: carriage return. On older Apple systems (before Mac OS X), this alone marked the end of every line of text.
      13 and 10 are used together on DOS, Windows, and other systems to mark the ends of lines.
  • Numbers 32 - 127: Visible characters (except 127 itself, which is the non-visible Delete control character)
    • 32 - 63: Space, punctuation marks, and digits 0 - 9
    • 64 - 95: Capital letters A - Z (plus a few punctuation marks)
    • 96 - 127: Lower-case letters a - z (plus a few punctuation marks)
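
These ranges can be checked with Python's built-in ord() and chr() functions. The sketch below uses a made-up helper named region; it treats code 127 (Delete) as a control character, which is how ASCII defines it, and counts the space (32) among the visible characters as the list above does.

```python
def region(code):
    """Name the ASCII region a code number falls in (hypothetical helper)."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code < 32 or code == 127:
        return "control"
    return "visible"

print(ord("\t"), ord("\n"), ord("\r"))   # 9 10 13
print(ord("0"), ord("A"), ord("a"))      # 48 65 97
print(region(10), region(65), region(127))

# Line endings seen as code numbers
lf_line = "text\n".encode("ascii")       # line feed alone
crlf_line = "text\r\n".encode("ascii")   # carriage return + line feed (DOS/Windows)
print(list(lf_line[-1:]), list(crlf_line[-2:]))   # [10] [13, 10]
```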

Here's a table of the visible characters:

US-ASCII table, courtesy of Roman Czyborra
Table of 7-bit US ASCII from Roman Czyborra's Alphabet Soup page
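
If the table image is not at hand, a rough text version can be generated. This is a minimal sketch, not a reproduction of the original table; the function name ascii_table and the 16-column layout are choices made for the illustration.

```python
def ascii_table(columns=16):
    """Return the visible ASCII characters (codes 32-126) as rows of text."""
    rows = []
    for start in range(32, 127, columns):
        codes = range(start, min(start + columns, 127))
        cells = " ".join(chr(code) for code in codes)
        # Each row starts with the code number of its first character
        rows.append(f"{start:3d}  {cells}")
    return rows

for row in ascii_table():
    print(row)
```

Each printed row begins with the code of its first character, so the capital letters start in the row labelled 64 and the lower-case letters in the row labelled 96.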

Although there are characters that look like accent marks (` ~ ^), these are not designed to be placed over letters. As a result, only English, Dutch, Latin, Indonesian/Malay, and a handful of other languages can be written correctly using ASCII.

ANSI

The name ANSI (the American National Standards Institute, which publishes the ASCII standard) is also used loosely for variants in which a few of the less-used characters are different.

Eight-bit codes

EBCDIC

The second most widely used character set (after ASCII) was IBM's proprietary "Extended Binary-Coded Decimal Interchange Code". It was developed for use on the 1964 System/360 computers, and used all 8 bits in each byte. EBCDIC was used in all IBM mainframe computers for three decades, and on mainframes of several other companies as well.

You may want to follow this link to a table of EBCDIC characters.

Like ASCII, the basic code table didn't have provisions for any non-English languages. But there were many numbers that weren't assigned to any characters - these are the blank squares in the table illustration.

IBM's approach to international languages was to provide slightly different character sets to customers in different language-regions by adding any necessary characters where the basic table was blank, and changing existing character assignments as needed. This resulted in many incompatible versions of EBCDIC, which made it difficult to exchange data between computers in different language-regions. For that reason, EBCDIC never became a standard for exchanging data on the Internet, and outside IBM's own mainframe and midrange systems it is not being incorporated in new operating systems.
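
The incompatibility between EBCDIC versions can be seen with two of the EBCDIC tables that ship with Python. The sketch assumes Python's codec names "cp037" (US/Canada) and "cp500" (an international variant); the variable names are made up for the illustration.

```python
# Letters have the same code in both EBCDIC tables, but not the ASCII code.
letter_ascii = "A".encode("ascii")
letter_037 = "A".encode("cp037")
letter_500 = "A".encode("cp500")
print(letter_ascii[0], letter_037[0], letter_500[0])   # 65 193 193

# Some punctuation marks have different codes in the two tables.
bracket_037 = "[".encode("cp037")
bracket_500 = "[".encode("cp500")
print(bracket_037[0], bracket_500[0])

# Reading cp500 data with the cp037 table turns the bracket into something else.
misread = bracket_500.decode("cp037")
print(misread)
```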

Extended ASCII

By 1970, the vast majority of computers used 8-bit bytes, so ASCII character data always had one bit unused. Using that one extra bit would make it possible to double the number of characters in the set.

So most computer systems used the extra bit - the only difficulty being that for decades there was no standard way to use the extra characters. Each operating system had its own set of extended ASCII characters.

About 8 and 16 bit codes

An 8-bit byte can be used to hold up to 256 distinct characters. Problem is, there are languages with many more than 256 characters. Languages like Chinese and Japanese use thousands of ideographs - symbols that represent entire words, concepts, or names.

  • 2,000 - 3,000 characters are needed for very basic communication;
  • 7,000 - 8,000 are needed for daily news and business;
  • 40,000 - 50,000 are needed to represent family and place names, and to print or store the ancient and classical literature of the languages.

Theoretically, a 16-bit sequence, which can represent up to 65,536 unique characters, might be enough for all the world's languages. However, not everyone with a stake in the Asian writing systems has accepted this point, so the jury is still out on the best way to provide a coding system for all languages.
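
The capacity figures follow from simple arithmetic: n bits can form 2 to the power n distinct codes. A minimal sketch, with the made-up helper name code_count:

```python
def code_count(bits):
    """Number of distinct codes that a given number of bits can represent."""
    return 2 ** bits

# 5 bits (Baudot), 7 bits (ASCII), 8 bits (one byte), 16 bits (two bytes)
for bits in (5, 7, 8, 16):
    print(f"{bits:2d} bits -> {code_count(bits):,} codes")
```

A 16-bit code therefore has room for the 7,000 - 8,000 characters of everyday use, but is a tight fit once the 40,000 - 50,000 characters of names and classical literature are counted for more than one language.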

ISO Codes

ISO 646: 7-bit Code Standard

In order to make computers more friendly to other languages without using more than 7 bits, the International Organization for Standardization (ISO) approved a version of ASCII that allowed some flexibility. ISO 646 specifies that 12 of the US ASCII characters can be used for nationalization:

#   $   @   [   \   ]   {   |   }   ^   `   ~

This has been used in many language-regions; for example, Italy replaces the characters above with this set, registered as ISO 646-15:

£   $   §   °   ç   é   à   ò   è   ¬   ù    ì

The need for these 7-bit codes is largely past, as practically all computers now are equipped to handle at least 8-bit codes. All operating systems released in the last few years for personal computers and up have also been designed to handle 16-bit codes.

ISO 2022 Procedural Standard

Attempting to bring some order to the chaotic use of the extended character area, the International Organization for Standardization approved a standard, ISO 2022, simply to organize how the extended character space is used. It was based on these principles:

  • The lower half - the original part used in 7-bit codes - would remain the same as what was described in ISO 646.
  • The upper half consists of two parts:
    • The first 32 characters are control codes
    • The rest are graphic codes
  • Those who wished to create character sets using the upper half would register with ISO's International Register of Coded Character Sets.
  • An escape sequence could be used to distinguish which character coding set was in use. Escape sequences in this case are the ESC character (code 27) followed by one or two other characters.
  • In addition to 8-bit codes, ISO 2022 provides for 16-bit extensions. These are necessary for languages that use ideographs.

ISO 8859: 8-bit code Standard

Following most of the principles of ISO 2022, a number of 8-bit codes were suggested for the alphabetic writing systems. These were brought together in the ISO 8859 series.

  • The first 128 characters, as usual, are the same as US ASCII.
  • The second 128 characters contain characters useful to the languages of a region.

ISO-8859-1, from Roman Czyborra
Table of ISO 8859-1 from Roman Czyborra's Alphabet Soup page

There are 10+ defined character sets based on ISO 8859. The items in this list are linked to tables in Roman Czyborra's Alphabet Soup page:

  1. Latin1 (West European)
  2. Latin2 (East European)
  3. Latin3 (South European)
  4. Latin4 (North European)
  5. Cyrillic
  6. Arabic
  7. Greek
  8. Hebrew
  9. Latin5 (Turkish)
  10. Latin6 (Nordic)
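
The effect of the shared lower half and the region-specific upper half can be tried out in Python, which includes tables for the ISO 8859 series. In this sketch the codec names "iso8859-1", "iso8859-5", and "iso8859-7" are Python's names for parts 1, 5, and 7 in the list above; the dictionary and variable names are made up for the illustration.

```python
# Python codec names for three parts of ISO 8859
PARTS = {"Latin1": "iso8859-1", "Cyrillic": "iso8859-5", "Greek": "iso8859-7"}

lower = bytes([0x41])   # code 65, in the lower (US ASCII) half
upper = bytes([0xE9])   # code 233, in the upper half

decoded_lower = {name: lower.decode(codec) for name, codec in PARTS.items()}
decoded_upper = {name: upper.decode(codec) for name, codec in PARTS.items()}

for name in PARTS:
    print(name, decoded_lower[name], decoded_upper[name])
```

Code 65 is the letter A in every part, while code 233 comes out as a different letter in each one. This is why a document has to state which part of ISO 8859 it uses.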

ISO 10646: Universal Character Set and Unicode

In the late 1980s, two groups separately began planning a coding system to encompass all the writing systems of the world. Those groups were:

  • Unicode Consortium: a group of software developing companies, mainly American.
  • ISO 10646: technical committees of ISO, mainly European and Asian.

By the early 1990s, these two groups agreed to cooperate, though they kept their separate identities. Their first task was to agree on the principles for creating a Universal Character Set (UCS). These are the principles they arrived at:

  1. The total system encompasses 31 bits, enabling over two billion (2,147,483,648) possible character codes.
  2. The first 128 characters are identical to the ANSI and ISO 646 character set.
  3. The first 256 characters are identical to the ISO 8859-1 "Latin 1" set.
  4. Other character sets are assigned codes above 255 in groups of 256. There are over eight million groups of 256 available, and many are still unassigned.
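
These principles can be spot-checked in Python, whose strings are sequences of Unicode code numbers, so ord() and chr() can stand in for a lookup in the Universal Character Set. The variable names below are made up for the illustration.

```python
# Codes 0-127 match ASCII and codes 0-255 match ISO 8859-1:
# encoding each character in the older set gives back the same number.
ascii_same = all(chr(n).encode("ascii") == bytes([n]) for n in range(128))
latin1_same = all(chr(n).encode("iso8859-1") == bytes([n]) for n in range(256))

# Number of 256-character groups that fit in 31 bits
groups_of_256 = 2 ** 31 // 256

# A character above 255: which group is it in, and where in that group?
cyrillic_i = ord("И")                      # Cyrillic capital letter I
group, position = divmod(cyrillic_i, 256)

print(ascii_same, latin1_same)             # True True
print(f"{groups_of_256:,}")                # 8,388,608
print(cyrillic_i, group, position)         # 1048 4 24
```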

UTF-8

ISO 10646 and Unicode are systems for assigning characters to numeric values. In order to actually store these on a computer in a practical way, Ken Thompson and Rob Pike of the AT&T Unix team developed the UTF-8 system - a practical way of representing the Universal Character Set.

UTF-8 works by allowing all characters to be represented in the smallest number of bytes necessary to represent their code number.

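  • The original 128 ASCII characters can still be represented in one 8-bit byte;
  • The upper half of ISO 8859-1 and many of the world's alphabetic writing systems (such as Greek, Cyrillic, Hebrew, and Arabic) can be represented in two bytes;
  • Most other writing systems, including the commonly used Chinese, Japanese, and Korean characters, can be represented in three bytes;
  • The least common Unicode characters require four bytes. (The original design allowed sequences of up to six bytes, but the current standard limits UTF-8 to four.)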
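
The byte counts can be measured directly. The sketch below uses Python's built-in UTF-8 encoder, which follows the current definition of UTF-8; the sample characters and the names samples and utf8_lengths are choices made for the illustration.

```python
samples = {
    "A": "ASCII letter",
    "å": "Latin 1 letter a with ring",
    "И": "Cyrillic capital letter I",
    "水": "Chinese character for water",
    "\U00010348": "Gothic letter hwair, a rarely used character",
}

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {description}: {len(encoded)} byte(s), {encoded.hex(' ')}")
```

Plain English text costs no more in UTF-8 than in ASCII, and the longer sequences are paid for only by the characters that need them.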
 
Character Entities
Encoding Types

Characters are encoded in many different ways, as the previous section shows. In order to give some measure of flexibility and independence to characters in (X)(HT)ML, there are several ways to encode them in a file:

  1. The common 7- or 8-bit code, usually ISO 8859-1, though this can be varied by changing the character set declaration in the prolog or the head of the file.
  2. A named entity-code
  3. A numeric entity-code

Common codes are entered simply by typing them in. Any character on your keyboard is automatically encoded as soon as the computer receives it.

Named and numeric entity codes are used to enter characters that either:

  • represent delimiters in XML/HTML; delimiters are characters that mark the beginning and end of tags, quoted values, or character entities themselves;
    or
  • are not easily entered with your keyboard or text editor.

For example, when you want an angle bracket < or > to be displayed in a browser, the named or numeric entity must be used, since < in an XML/HTML file marks the beginning of a markup tag.

Character entities are delimited by & at the beginning and ; at the end. In the following section, we'll discuss these codes.
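
When page text is produced by a program, the delimiter characters can be replaced automatically. A minimal sketch using the escape and unescape functions of Python's standard html module; the sample text is made up.

```python
import html

source_text = 'if a < b & b > c then print "ok"'

escaped = html.escape(source_text)    # replaces < > & and quotation marks
restored = html.unescape(escaped)     # turns the entities back into characters

print(escaped)
print(restored == source_text)        # True
```

The escaped form is what belongs in the XML/HTML file; the browser shows the reader the restored form.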

Named Entities

There are scores of named entities; here's a list of the most commonly used ones:

Symbol   Name      Meaning
<        &lt;      less than
>        &gt;      greater than
&        &amp;     ampersand ("and" symbol)
         &nbsp;    non-breaking space
"        &quot;    quotation mark
©        &copy;    copyright
®        &reg;     registered

Named entities exist for many types of characters:

  • accented characters of many languages; note that they are case-sensitive: if the name begins with a capital letter, it represents the capital letter symbol:
    • &Eacute; is É (capital E with acute accent mark)
    • &eacute; is é (lower-case e with acute accent mark)
  • currency symbols such as £, ¥, €
  • special punctuation marks like ¡ ¿ « » §
  • fractions ½, ¼, ¾
  • math symbols like ± ¬ ÷ °

Names are case-sensitive: most are written entirely in lower case, and the names of accented capital letters differ only in that their first letter is capitalized. If a browser doesn't recognize a character name, it displays the name as typed, like this: &xyz;
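
The case rule can be checked against the table of entity names in Python's standard library. This sketch assumes only the html module and its html.entities.name2codepoint table; the variable names are made up, and Python's handling of an unknown name happens to mirror the browser behavior described above.

```python
import html
from html.entities import name2codepoint

# The two names differ only in the case of the first letter
print(name2codepoint["Eacute"], name2codepoint["eacute"])   # 201 233

capital = html.unescape("&Eacute;")
small = html.unescape("&eacute;")
unknown = html.unescape("&xyz;")    # not a defined name, so it is left unchanged

print(capital, small, unknown)
```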

Link to a table of named character entities.

Numeric Codes

Software doesn't recognize names for all character codes, but it does recognize their code-numbers. (Of course, numbers are harder for most people to remember, so you may often have to look up the code for a character.)

The numbers represent the position of the character in the Universal Character Set (ISO 10646/Unicode), whatever character set the document itself is saved in. After the &, a number sign # is followed by the number itself. The number is read as decimal (base 10) unless you put the letter "x" in front of it, in which case it is read as hexadecimal (base 16). Here are some examples:

  • &#229; (in decimal) represents å, the letter "a" with a small circle above it (used, for example, in Norwegian).
  • &#xE5; (in hexadecimal) represents å, the same character.
  • &#Xe5; (in hexadecimal) represents å, the same character as well. (HTML accepts the capital X; XML accepts only the lower-case x.)
  • &#1048; (in decimal) represents И, the Cyrillic capital letter "I".
  • &#x6C34; (in hexadecimal) represents 水 (Unicode 6C34), the Chinese character for water.
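
Looking up a code by hand can be avoided with a few lines of Python. This is a sketch only: numeric_references is a made-up helper, and it relies on Python's ord(), which returns the Unicode code number of a character.

```python
import html

def numeric_references(character):
    """Return the decimal and hexadecimal entity codes for one character."""
    code = ord(character)
    return f"&#{code};", f"&#x{code:X};"

print(numeric_references("å"))    # ('&#229;', '&#xE5;')
print(numeric_references("И"))    # ('&#1048;', '&#x418;')
print(numeric_references("水"))   # ('&#27700;', '&#x6C34;')

# The three spellings of the first example all decode to the same character
decoded = [html.unescape(ref) for ref in ("&#229;", "&#xE5;", "&#Xe5;")]
print(decoded)
```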


Making It All Work

OK, that's a lot of facts. Here's how you put it all together:

  1. Choose a character set.
    • By default, in the US and most of Western Europe ISO 8859-1 is used.
    • UTF-8 is the most versatile and practical for representing multiple languages, but older browsers may not be able to handle it.
  2. Choose an editor that knows how to save in that character set.
    • Notepad in Windows 2000, XP, and later can handle ASCII, Unicode, and UTF-8
    • Recent HTML and programming editors allow you to choose character sets
  3. Choose a font-family that has the characters you need. Remember that fonts around the world can display the US ASCII characters, and most can handle ISO 8859-1, but beyond that users will need a versatile font to be able to see characters in multiple languages.
    • Arial Unicode MS is the font that is most widely available and has the greatest variety of characters.
    • Lucida Sans Unicode is distributed fairly widely and has a large character set.
    • For Asian languages, it is often necessary to use a special font.
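
The choice in step 1 can be tried out before committing to it. The sketch below encodes the same made-up sample text both ways with Python's built-in encoders; the "xmlcharrefreplace" option is Python's way of substituting numeric entity codes for characters the chosen set cannot hold.

```python
page_text = "Français, Русский, 水"

# Saved as UTF-8, every character is stored directly.
as_utf8 = page_text.encode("utf-8")

# Saved as ISO 8859-1, characters outside that set cannot be stored directly,
# so numeric entity codes are substituted for them.
as_latin1 = page_text.encode("iso8859-1", errors="xmlcharrefreplace")

print(len(as_utf8), len(as_latin1))
print(as_latin1.decode("iso8859-1"))
```

With ISO 8859-1 the French word is stored as typed, while the Russian and Chinese characters turn into numeric codes. A browser still displays them correctly, but the source file becomes hard to read, which is one practical argument for UTF-8 in multilingual pages.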

About This Document

Audience

This module is for people who know the basics of XML/HTML code (see module W22c) and are ready to learn about character entities.

Objectives

On successful completion of this module, you will be able to:

  1. Define character encoding
  2. Identify character entities and incorporate them in your documents
Module w22f: Odd Characters in HTML
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, W: World Wide Web. This document has been used in the following classes: INP 150.
History
Original: 3 October 2003, by Laurence J. Krieg
Last modification: Monday, 18-Sep-2006 16:04:27 EDT
Copyright
Copyright © 2003, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.