OnWorks Linux and Windows Online WorkStations


POSIX Character Classes

The traditional character ranges are an easily understood and effective way to handle the problem of quickly specifying sets of characters. Unfortunately, they don’t always work. While we have not encountered any problems with our use of grep so far, we might run into problems using other programs.

Back in Chapter 4, we looked at how wildcards are used to perform pathname expansion. In that discussion, we said that character ranges could be used in a manner almost identical to the way they are used in regular expressions, but here’s the problem:



[me@linuxbox ~]$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*

/usr/sbin/MAKEFLOPPIES

/usr/sbin/NetworkManagerDispatcher

/usr/sbin/NetworkManager



(Depending on the Linux distribution, we will get a different list of files, possibly an empty list. This example is from Ubuntu.) This command produces the expected result, a list of only the files whose names begin with an uppercase letter, but:


[me@linuxbox ~]$ ls /usr/sbin/[A-Z]*

/usr/sbin/biosdecode

/usr/sbin/chat

/usr/sbin/chgpasswd

/usr/sbin/chpasswd

/usr/sbin/chroot

/usr/sbin/cleanup-info

/usr/sbin/complain

/usr/sbin/console-kit-daemon



With this command, we get an entirely different result (only a partial listing of the results is shown). Why is that? It’s a long story, but here’s the short version:

Back when Unix was first developed, it only knew about ASCII characters, and this feature reflects that fact. In ASCII, the first 32 characters (numbers 0-31) are control codes (things like tabs, backspaces, and carriage returns). The next 32 (32-63) contain printable characters, including most punctuation characters and the numerals zero through nine. The next 32 (numbers 64-95) contain the uppercase letters and a few more punctuation symbols. The final 32 (numbers 96-127) contain the lowercase letters and yet more punctuation symbols. Based on this arrangement, systems using ASCII used a collation order that looked like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which is like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
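To make the ASCII arrangement concrete, here is a minimal sketch that prints the code numbers of a few letters and then sorts some words by raw character value. The sample words are made up for illustration, and setting LC_ALL=C for the single sort command is our assumption about how to request the traditional ordering on a modern system.

```shell
# A leading single quote in the argument asks printf for the
# numeric code of the character that follows it.
codes=$(for ch in A Z a z; do
    printf '%s = %d\n' "$ch" "'$ch"
done)
echo "$codes"

# Sort a few made-up words using the traditional (ASCII) order.
# LC_ALL=C applies only to this one sort command.
c_sorted=$(printf '%s\n' banana Apple cherry Banana | LC_ALL=C sort)
echo "$c_sorted"
```

Because every uppercase letter has a lower code number than every lowercase letter, the capitalized words sort ahead of the lowercase ones.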

As the popularity of Unix spread beyond the United States, there grew a need to support characters not found in U.S. English. The ASCII table was expanded to use a full eight bits, adding character numbers 128-255, which accommodated many more languages. To support this ability, the POSIX standards introduced a concept called a locale, which could be adjusted to select the character set needed for a particular location. We can see the language setting of our system using this command:



[me@linuxbox ~]$ echo $LANG

en_US.UTF-8



With this setting, POSIX-compliant applications will use a dictionary collation order rather than ASCII order. This explains the behavior of the commands above. A character range of [A-Z] when interpreted in dictionary order includes all of the alphabetic characters except the lowercase “a”, hence our results.

To partially work around this problem, the POSIX standard includes a number of character classes which provide useful ranges of characters. They are described in the table below:

Table 19-2: POSIX Character Classes


Character Class   Description
[:alnum:]         The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]          The same as [:alnum:], with the addition of the underscore (_) character.
[:alpha:]         The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]         Includes the space and tab characters.
[:cntrl:]         The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]         The numerals zero through nine.
[:graph:]         The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]         The lowercase letters.
[:punct:]         The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]_`{|}~]
[:print:]         The printable characters. All the characters in [:graph:] plus the space character.
[:space:]         The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]         The uppercase characters.
[:xdigit:]        Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]


Even with the character classes, there is still no convenient way to express partial ranges, such as [A-M].
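As a rough illustration of how the classes in Table 19-2 are written in practice, the sketch below filters a few made-up input lines with grep. A class name always sits inside an outer pair of brackets, and it can share those brackets with literal characters. The sample text and variable names are ours, chosen only for the demonstration.

```shell
# Made-up sample input, one item per line.
sample='42
0x1F
hello world
under_score
3.14'

# Lines consisting only of digits.
digits_only=$(printf '%s\n' "$sample" | grep '^[[:digit:]][[:digit:]]*$')

# Lines containing a space or tab.
with_blank=$(printf '%s\n' "$sample" | grep '[[:blank:]]')

# A class combined with a literal underscore in one bracket expression:
# lines made up only of alphanumerics and underscores.
word_like=$(printf '%s\n' "$sample" | grep '^[[:alnum:]_][[:alnum:]_]*$')

echo "$digits_only"
echo "$with_blank"
echo "$word_like"
```

Note the doubled brackets: [:digit:] on its own is only the name of the class, while [[:digit:]] is a bracket expression that contains it.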

Using character classes, we can repeat our directory listing and see an improved result:


[me@linuxbox ~]$ ls /usr/sbin/[[:upper:]]*

/usr/sbin/MAKEFLOPPIES

/usr/sbin/NetworkManagerDispatcher

/usr/sbin/NetworkManager


Remember, however, that this is not an example of a regular expression; rather, it is the shell performing pathname expansion. We show it here because POSIX character classes can be used for both.
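Since the contents of /usr/sbin vary from system to system, a self-contained way to try both uses is to work in a scratch directory. The following is a minimal sketch under that assumption; the file names are hypothetical, and mktemp is assumed to be available. The same class appears once in a shell wildcard and once in a grep regular expression.

```shell
# Create a scratch directory holding a few hypothetical file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/Makefile" "$demo_dir/README" "$demo_dir/chat" "$demo_dir/notes"

# Pathname expansion: the shell expands the wildcard before printf runs.
upper_glob=$(cd "$demo_dir" && printf '%s\n' [[:upper:]]*)

# Regular expression: grep applies the class to each line of ls output.
upper_regex=$(ls "$demo_dir" | grep '^[[:upper:]]')

echo "$upper_glob"
echo "$upper_regex"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

Both approaches select Makefile and README, regardless of the collation order in effect.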


Reverting To Traditional Collation Order

You can opt to have your system use the traditional (ASCII) collation order by changing the value of the LANG environment variable. As we saw above, the LANG variable contains the name of the language and character set used in your locale. This value was originally determined when you selected an installation language while your Linux system was being installed.

To see the locale settings, use the locale command:

[me@linuxbox ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

To change the locale to use the traditional Unix behaviors, set the LANG variable to POSIX:

[me@linuxbox ~]$ export LANG=POSIX

Note that this change converts the system to use U.S. English (more specifically, ASCII) for its character set, so be sure that this is really what you want.



You can make this change permanent by adding this line to your .bashrc file:

export LANG=POSIX
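If the traditional ordering is needed only occasionally, a narrower alternative (not part of the procedure above, and offered here only as a sketch) is to set a locale variable on a single command line. The example uses LC_ALL=C, which takes precedence over LANG and the other LC_ variables for that one command; C is another name for the POSIX locale. The list of names is sample data echoing the earlier listings.

```shell
# Sample names, mixing uppercase and lowercase initials.
names='MAKEFLOPPIES
NetworkManager
biosdecode
chroot'

# LC_ALL=C affects only this grep invocation, so the range [A-Z]
# is interpreted in ASCII order and covers uppercase letters only.
matches=$(printf '%s\n' "$names" | LC_ALL=C grep '^[A-Z]')
echo "$matches"
```

The rest of the session keeps its original locale, so programs that rely on UTF-8 are unaffected.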

