4.4.4 Double-Byte Character Handling

Asian character sets contain large numbers of ideographic characters, each of which represents all or part of a word or concept. They may also contain interspersed phonetic characters. A single character set can therefore consist of tens of thousands of characters. Because one 8-bit byte can hold only 256 unique codes, these character sets require at least two bytes per character to accommodate the full range.

Most double-byte characters occupy two full character screen positions (each byte corresponds to one screen position). Such data may be entered into and displayed from USAGE DISPLAY data items. Most COBOL applications can therefore accept and store double-byte data without modification.

Problems can arise when double-byte data is displayed on the screen. For example, during an ACCEPT, one byte of a double-byte character may be deleted or overwritten. When a window is displayed, the edge of the window might cover one byte of a double-byte character. In these circumstances, the pairing of bytes can change, and the resulting codes may represent entirely different characters. On most machines this confuses the operating system's display driver. To overcome these potential problems, the runtime must follow two rules:

1. Always display both bytes of a double-byte character together (never display only part of a double-byte character).

2. Always overwrite, or change the attributes of, both bytes of a double-byte character together (never overwrite, or change the attributes of, only part of a double-byte character).

These rules must be obeyed when an ACCEPT handles cursor movement, cursor placement, text selection, delete, backspace, and character overtyping.
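As an illustration only, the effect of the two rules on cursor placement and backspace can be sketched in Python. This is a minimal sketch, not the runtime's actual implementation. The names char_start and backspace are hypothetical, and the sketch assumes that the byte offsets at which characters begin are already known and that the cursor rests on a character boundary before a backspace.

```python
from bisect import bisect_right

def char_start(starts, pos):
    """Return the offset of the first byte of the character covering pos.

    starts is a sorted list of the byte offsets at which characters begin
    (assumed to be known already; hypothetical input for this sketch).
    """
    return starts[bisect_right(starts, pos) - 1]

def backspace(buf, starts, cursor):
    """Delete the whole character that ends just before cursor.

    Returns the new buffer and the new cursor offset. Both bytes of a
    double-byte character are removed together, never just one of them.
    """
    if cursor == 0:
        return buf, cursor
    begin = char_start(starts, cursor - 1)
    return buf[:begin] + buf[cursor:], begin

# "A", one double-byte character (2 bytes), then "B"
buf = b"A\xb0\xa1B"
starts = [0, 1, 3]

print(char_start(starts, 2))       # 1: offset 2 is the second byte, so snap back
print(backspace(buf, starts, 3))   # (b'AB', 1): both bytes are removed
```

A cursor request that lands on the second byte of a character is moved back to the first byte, and a backspace removes the character as a unit, so the remaining bytes keep their original pairing.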

The rules must also be followed when the edges of windows are displayed, to avoid covering parts of double-byte characters. To implement these rules, the runtime needs to know which of several double-byte character encoding schemes is being used.

The CODE-SYSTEM runtime configuration variable tells the runtime whether double-byte character data is being accepted or displayed, and which code system (that is, which standard for encoding Japanese and other Asian character sets) is being used. Each code system allows only a certain range of values within each byte of a two-byte character, so identifying the code system allows the runtime to recognize character boundaries when it processes double-byte data for ACCEPT and DISPLAY statements.

Setting CODE-SYSTEM to the proper value allows your COBOL applications to handle input and display of Asian character data without source program changes. The syntax is:

CODE-SYSTEM  setting

The table below shows the possible settings of the CODE-SYSTEM variable, the code system to which each setting refers, and some examples of operating systems to which the particular code system applies:

Setting   Code System                      Op. System Examples

BIG5      Big Five (Taiwan)                Chinese DOS, Windows

DBC       Acucorp Generic Double-byte      other double-byte machines
          Coding Scheme

EUC       Extended UNIX                    Most UNIX machines

GB        Code of Chinese Graphic          Chinese DOS, Windows
          Character Set (People's
          Republic of China)

KSC       Korean Character Standard        Korean DOS

SJC       Shift JIS Code (Japanese         DOS/V, Windows,
          Industrial Standard)             some UNIX machines

The default setting, "0", indicates single-byte ASCII or EBCDIC characters.

The following table shows the decimal values that the respective code systems allow for each byte of the two-byte character:

Code System Setting   1st byte     2nd byte

BIG5                  161 - 254    64 - 126
  (second format)     161 - 254    161 - 254

DBC                   128 - 255    128 - 255

EUC                   142          161 - 223
  (second format)     161 - 254    161 - 254

GB and KSC            161 - 254    161 - 254

SJC                   129 - 159    64 - 252 (not 127)
  (second format)     224 - 239    64 - 252 (not 127)

Note that the first and second byte values are co-dependent; that is, both values must fall within the respective ranges shown for the same format in the table. If either value is not within its allowable range, then each byte is treated as a single-byte character.
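The range check described above can be sketched in Python. This is a minimal illustration built from the values in the table, not the runtime's actual code. The names RANGES, is_pair, and split_characters are hypothetical. The sketch also assumes that when a pair fails the check, the second byte is examined again as a possible first byte of the next character; the text above does not state this explicitly.

```python
# (first-byte range, second-byte range) for each format, copied from the table.
RANGES = {
    "BIG5": [((161, 254), (64, 126)), ((161, 254), (161, 254))],
    "DBC":  [((128, 255), (128, 255))],
    "EUC":  [((142, 142), (161, 223)), ((161, 254), (161, 254))],
    "GB":   [((161, 254), (161, 254))],
    "KSC":  [((161, 254), (161, 254))],
    "SJC":  [((129, 159), (64, 252)), ((224, 239), (64, 252))],
}

def is_pair(code_system, first, second):
    """True if the two byte values form one double-byte character."""
    if code_system == "SJC" and second == 127:
        return False  # the table excludes 127 as an SJC second byte
    # Unknown settings, including the default "0", have no double-byte ranges.
    for (f_lo, f_hi), (s_lo, s_hi) in RANGES.get(code_system, []):
        if f_lo <= first <= f_hi and s_lo <= second <= s_hi:
            return True
    return False

def split_characters(code_system, data):
    """Split a bytes object into single-byte and double-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and is_pair(code_system, data[i], data[i + 1]):
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Assumption: the next byte is re-examined as a possible first byte.
            chars.append(data[i:i + 1])
            i += 1
    return chars

text = b"\xb0\xa1\xb1\xa2"                 # two double-byte characters under GB
print(split_characters("GB", text))        # [b'\xb0\xa1', b'\xb1\xa2']
print(split_characters("GB", text[1:]))    # [b'\xa1\xb1', b'\xa2']
```

The second call shows the problem described earlier in this section: once the first byte is lost, the remaining bytes pair up differently and no longer represent the original characters.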

Example (for most UNIX machines):

CODE-SYSTEM   EUC