Unicode - Misc Notes

BuddhaSasana
Vietnamese Buddhist Page, with Unicode Times font


Additional notes on Unicode-based documents


[I]. Set up

1) Outlook-98 in Win-NT

Tools -- Options -- Mail Format

* Message Format: HTML
* Stationery & Fonts: Character Set - Universal Alphabet (UTF-8) - Set as Default

2) For IE 6.x:

View -- Font / Encoding -- Universal Alphabet (UTF-8)
or Right-click the mouse, then: Language -- Universal Alphabet (UTF-8)

3) For Netscape:

* View -- Encoding (or Character Set) -- Unicode (UTF-8)
* Edit -- Preferences -- Appearance-Fonts -- Use document-specified fonts

[II]. Printers:

1) HP Laser printers: the printer settings may need adjusting, as follows:

1.a) Models HP-III, HP-4M, HP-5Si

File -- Print -- Properties -- Advanced -- Documents Options --
Print Text as Graphics: ON

1.b) Model HP-5M

File -- Print -- Properties -- Advanced -- Options --
Graphic Mode: HP-GL/2
Laser III compatible: ENABLED

1.c) Model HP-8000, HP-4MP

File -- Print -- Properties -- Finishing -- Details --
Font Settings: Send True Type as Bitmaps.

1.d) Other models: try one of the above procedures.

2) HP Inkjet printers:

- HP Inkjet 2500C: cannot print Unicode pages, either from a browser or from Word-97
- HP Inkjet 721C: can print Unicode in Word-97
- HP 970 Deskjet: can print in Word-97 and in Netscape 4.x (but not IE 5.x)

3) Other printers:

- CANON Bubblejet BJC: can print Unicode in Win-98/Word-2000
- PANASONIC Laser printer KX series: can print Unicode with both browsers
- RICOH Aficio 270: can print Unicode only in Word
- EPSON Color Stylus series: can print Unicode documents from both browsers and from Word

[III]. Resources:

1) Alan Wood's Unicode Resources: https://www.alanwood.net/unicode/
2) Unicode for Vietnamese: https://www.vovisoft.com/vovisoft/UnicodeChoVN.htm  
3) Unicode consortium: https://www.unicode.org/
4) See also links and information on Viet Unicode: https://vietunicode.sourceforge.net/

[IV]. Fonts:

1) The basic fonts come with Office-2000, Windows-98 SE, Windows-Me, Windows-2000 and Windows XP. On older systems, check these fonts:

- Core fonts: Arial, Courier New, Times New Roman, version 2.76 or later. If your copies are older, download and install the newer versions.
- Not all WGL-4 fonts supplied by Microsoft contain VN characters.

2) A larger set: Arial Unicode MS by Microsoft and CN-Times by Chan-Nguyen include Chinese-Japanese-Korean characters (15 Mb, zipped), for Viet-Han texts.

3) VU-Times by Ho Phuoc Hung for Viet-Pali texts.

[V]. Software and Hardware

The following is a list of the common software and hardware I use for our web site.

Keyboard programs:

1) VPS-Keys 4.3 (freeware): https://www.hcgvn.net/software/
2) WinVNKey 4.0 (freeware): https://sourceforge.net/projects/winvnkey
3) UniKey 3.55 (freeware): https://sourceforge.net/projects/unikey

Document and graphics preparation:

1) MS Word-2000, -XP
2) MS Image Composer 1.5
3) Corel Draw and Corel PhotoPaint, versions 9 & 11

Document conversion programs:

1) Convert2anything (freeware), by Cafe68T https://cafe68t.multimania.com/content/unicode/download.html
2) VoviSoft (freeware), https://www.vovisoft.com/vovisoft/UnicodeChoVN.htm  
3) VPSKeys 4.3 (freeware), https://www.hcgvn.net/software/
4) UniKey 3.55 (freeware), https://sourceforge.net/projects/unikey
5) WinVNKey 4.0 (freeware), https://sourceforge.net/projects/winvnkey

Web page set up:

1) MS Frontpage-2000, -XP (commercial)
2) Arachnophilia 4.0 (freeware): https://www.arachnoid.com/arachnophilia/

Operating systems:

1) Windows 2000
2) Windows XP

Browser: IE 6.x

System hardware:

1) PC Pentium-IV 1.6 GHz, 512 Mb RAM with Win-XP
2) PC Pentium Celeron 1.6 GHz, 256 Mb RAM, with Win-XP
3) PC Pentium Xeon 2.8 GHz, 2Gb RAM, with Win 2000

Printers:

1) Epson Stylus series (color inkjet)
2) HP Laser 5L
3) Many networked HP Laser printers (4x, 5x) and Inkjet printers.

[VI]. Mac machines:

I have no experience with Mac machines and Mac-OS. You might like to consult Alan Wood's website at: https://www.alanwood.net/unicode


[VII]. UTF-8

UTF-8 (UTF: Unicode Transformation Format) has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values.

This section is only an illustration of how you can encode a Unicode character in UTF-8.  

1) Take the Unicode value of the character to find out how many bytes you need.  Unicode values are given in hexadecimal & decimal numbers:

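Hex Range             Dec Range                        Bytes
0000 - 007F           0 - 127                          1 byte
0080 - 07FF           128 - 2,047                      2 bytes
0800 - FFFF           2,048 - 65,535                   3 bytes
10000 - 1FFFFF        65,536 - 2,097,151               4 bytes
200000 - 3FFFFFF      2,097,152 - 67,108,863           5 bytes
4000000 - 7FFFFFFF    67,108,864 - 2,147,483,647 (*)   6 bytes

(*) A maximum of 2,147,483,648 (2**31) characters could be encoded this way. The current Unicode standard stops at 10FFFF, so in practice no character needs more than 4 bytes; the 5- and 6-byte forms belong to the original UTF-8 definition.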
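
As a rough illustration of step 1, the lookup can be written in a few lines of Python. The function name utf8_length is made up for this note, and the limits are simply the upper ends of the hex ranges listed in the table.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes the table assigns to a Unicode value."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Inclusive upper end of each hex range, in order of byte count (1 to 6).
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("value does not fit in 31 bits")


print(utf8_length(0x41))    # 1 (US-ASCII letter 'A')
print(utf8_length(0x1EA0))  # 3 (a Vietnamese letter from the 1E00-1EFF block)
```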

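2) Convert the hex code to binary form and copy its bits into the x positions of the matching template, working from the right; any x positions left over on the left are set to 0: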

1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Example:

The Unicode value of the Han character for 'tea' (茶) is hex 8336 (dec: 33,590), so you need 3 bytes.  The binary form of hexadecimal 8336 is:

10000011 00110110

Fill the empty slots of the three-byte template with the binary value of 'tea' and you will get:

11101000 10001100 10110110

Thus you have converted 0x8336 to 3 bytes: 0xE8 0x8C 0xB6. 
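
The two steps can also be sketched in Python. This is only an illustration of the scheme described in this section, not a replacement for a real library; the name utf8_encode is an assumption of this note. For values up to 10FFFF, Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode value with the 1- to 6-byte templates of this section."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point <= 0x7F:
        return bytes([code_point])  # 1 byte: 0xxxxxxx
    # (upper end of range, fixed high bits of the first byte) for 2..6 bytes
    rows = [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
            (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]
    for n_trailing, (limit, lead_marker) in enumerate(rows, start=1):
        if code_point <= limit:
            break
    out = []
    value = code_point
    for _ in range(n_trailing):
        out.append(0x80 | (value & 0x3F))  # trailing byte: 10xxxxxx
        value >>= 6
    out.append(lead_marker | value)        # first byte carries the remaining bits
    return bytes(reversed(out))


print(utf8_encode(0x8336).hex())  # e88cb6
```

The printed result matches the three bytes worked out by hand above.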


[VIII]. UTF-16 Conversion

UTF-16 definition

Each character is assigned a number, which Unicode calls the Unicode scalar value. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers. The rules for how characters are encoded in UTF-16 are:

- Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number.

- Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area).

- Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.

Note: Values between 0xD800 and 0xDFFF are specifically reserved for use with UTF-16, and don't have any characters assigned to them.

Encoding UTF-16

Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.

1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.

3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.

4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.

Graphically, steps 2 through 4 look like:

U' = yyyyyyyyyyxxxxxxxxxx (binary, 20 bits)
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
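
Steps 1 through 4 can be sketched in Python as follows. The function name utf16_encode is invented for this note, and the result is returned as a list of 16-bit integers rather than as bytes, so byte order is left aside. The extra check for the reserved surrogate range follows the note in the definition above.

```python
from typing import List


def utf16_encode(u: int) -> List[int]:
    """Return the one or two 16-bit integers that represent character number U."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("cannot be encoded in UTF-16")
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("reserved for surrogates, no character assigned")
    if u < 0x10000:
        return [u]                    # step 1: a single 16-bit integer
    u_prime = u - 0x10000             # step 2: fits in 20 bits
    w1 = 0xD800 | (u_prime >> 10)     # steps 3-4: 10 high-order bits into W1
    w2 = 0xDC00 | (u_prime & 0x3FF)   # steps 3-4: 10 low-order bits into W2
    return [w1, w2]


print([hex(w) for w in utf16_encode(0x8336)])   # ['0x8336']
print([hex(w) for w in utf16_encode(0x1F600)])  # ['0xd83d', '0xde00']
```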

-ooOoo-



Last updated: 16-04-2004

Web master: [email protected]