UTF-8 and Unicode FAQ for Unix/Linux

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.

An important note for developers of UTF-8 decoding routines: For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. For example, the character U+000A (line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:

  0xC0 0x8A
  0xE0 0x80 0x8A
  0xF0 0x80 0x80 0x8A
  0xF8 0x80 0x80 0x80 0x8A
  0xFC 0x80 0x80 0x80 0x80 0x8A

Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests that look only for the shortest possible encoding. All overlong UTF-8 sequences start with one of the following byte patterns:

  1100000x (10xxxxxx)
  11100000 100xxxxx (10xxxxxx)
  11110000 1000xxxx (10xxxxxx 10xxxxxx)
  11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
  11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed or overlong sequences for safety reasons. Markus Kuhn's UTF-8 decoder stress test file contains a systematic collection of malformed and overlong UTF-8 sequences and will help you verify the robustness of your decoder.
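The shortest-form rule can be checked after decoding: a code point obtained from an n-byte sequence must be at least as large as the smallest value that actually needs n bytes. The following is a minimal sketch of such a check (illustrative only, not taken from the stress test file; the function name is mine, and a real decoder must also validate the continuation bytes themselves):

  #include <stdint.h>

  /* Return 1 if a code point decoded from a UTF-8 sequence of 'len'
   * bytes (1..6, old-style UTF-8) is acceptable; return 0 if the
   * sequence was overlong, a UTF-16 surrogate, or U+FFFE/U+FFFF. */
  static int utf8_decoded_ok(uint32_t cp, int len)
  {
      static const uint32_t min_for_len[7] =
          { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

      if (len < 1 || len > 6)
          return 0;
      if (cp < min_for_len[len])          /* overlong form */
          return 0;
      if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogates */
          return 0;
      if (cp == 0xFFFE || cp == 0xFFFF)   /* non-characters named above */
          return 0;
      return 1;
  }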

Who invented UTF-8?

The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history). It replaced an earlier attempt to design FSS/UTF (file system safe UCS transformation format) that was circulated in an X/Open working document in August 1992 by Gary Miller (IBM), Greger Leijonhufvud and John Entenmann (SMI) as a replacement for the division-heavy UTF-1 encoding from the first edition of ISO 10646-1. By the end of the first week of September 1992, Pike and Thompson had turned AT&T Bell Labs' Plan 9 into the world's first operating system to use UTF-8. They reported about their experience at the USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50. FSS/UTF was briefly also known as UTF-2 and later renamed into UTF-8, and pushed through the standards process by the X/Open Joint Internationalization Group XOJIG.

Where do I find nice UTF-8 example files?

A few interesting UTF-8 example files for tests and demonstrations are:

- the UTF-8 Sampler web page by the Kermit project
- Markus Kuhn's example plain-text files, including among others the classic demo, decoder test, TeX repertoire, WGL4 repertoire, euro test pages, and Robert Brady's IPA lyrics
- the Unicode Transcriptions
- a generator for Indic Unicode test files

What different encodings are there?

Both the UCS and Unicode standards are first of all large tables that assign to every character an integer number. If you use the term "UCS", "ISO 10646", or "Unicode", this just refers to a mapping between characters and integers. This does not yet specify how to store these integers as a sequence of bytes in memory. ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are sequences of two bytes and four bytes per character, respectively. ISO 10646 was from the beginning designed as a 31-bit character set (with possible code positions ranging from U-00000000 to U-7FFFFFFF), but it took until 2001 for the first characters to be assigned beyond the Basic Multilingual Plane (BMP), that is, beyond the first 2^16 character positions (see ISO 10646-2 and Unicode 3.1). UCS-4 can represent all UCS and Unicode characters, UCS-2 can represent only those from the BMP (U+0000 to U+FFFF). "Unicode" originally implied that the encoding was UCS-2 and it initially made no provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended "21-bit" Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to describe a 4-byte encoding of the extended "21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF. The ISO 10646 working group has agreed to modify their standard to exclude code positions beyond U-0010FFFF, in order to turn the new UCS-4 and UTF-32 into practically the same thing. In addition to all that, UTF-8 was introduced to provide an ASCII backwards compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode originally differed slightly, because in UCS up to 6-byte long UTF-8 sequences were possible to represent characters up to U-7FFFFFFF, while in Unicode only up to 4-byte long UTF-8 sequences are defined to represent characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.) No endianness is implied by the encoding names UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. It has become customary to append the letters "BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order. In order to allow the automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, therefore it helps to unambiguously distinguish the Bigendian and Littleendian variants of UTF-16 and UTF-32.
A full featured character encoding converter will have to offer the following 13 encoding variants of Unicode and UCS: UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE. Where no byte order is explicitly specified, use the byte order of the CPU on which the conversion takes place and in an input stream swap the byte order whenever U+FFFE is encountered. The difference between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in the handling of out-of-range characters. The fallback mechanism for non-representable characters has to be activated in UTF-32 (for characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where UCS-4 or UTF-16 respectively would offer a representation. Really just of historic interest are UTF-1, UTF-7, SCSU and a dozen other less widely publicised UCS encoding proposals with various properties, none of which ever enjoyed any significant use. Their use should be avoided. A good encoding converter will also offer options for adding or removing the BOM:

- Unconditionally prefix the output text with U+FEFF.
- Prefix the output text with U+FEFF unless it is already there.
- Remove the first character if it is U+FEFF.

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the start of a UTF-8 file. This practice should definitely not be used on POSIX systems, for several reasons:

- On POSIX systems, the locale (and not a magic file-type code) defines the encoding of plain text files. Mixing the two concepts would add a lot of complexity and break existing functionality.
- Adding a UTF-8 signature at the start of a file would interfere with many established conventions, such as the kernel looking for "#!" at the beginning of a plaintext executable to locate the appropriate interpreter.
- Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one.

In addition to the encoding alternatives, Unicode also specifies various Normalization Forms, which provide reasonable subsets of Unicode, especially to remove encoding ambiguities caused by the presence of precomposed and compatibility characters:

- Normalization Form D (NFD): Split up (decompose) precomposed characters into combining sequences where possible, e.g. use U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS) instead of U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). Also avoid deprecated characters, e.g. use U+0041 U+030A (LATIN CAPITAL LETTER A, COMBINING RING ABOVE) instead of U+212B (ANGSTROM SIGN).
- Normalization Form C (NFC): Use precomposed characters instead of combining sequences where possible, e.g. use U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) instead of U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS). Also avoid deprecated characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) instead of U+212B (ANGSTROM SIGN). NFC is the preferred form for Linux and WWW.
- Normalization Form KD (NFKD): Like NFD, but avoid in addition the use of compatibility characters, e.g. use "fi" instead of U+FB01 (LATIN SMALL LIGATURE FI).
- Normalization Form KC (NFKC): Like NFC, but avoid in addition the use of compatibility characters, e.g. use "fi" instead of U+FB01 (LATIN SMALL LIGATURE FI).

A full-featured character encoding converter should also offer conversion between normalization forms.
Care must be taken with mapping to NFKD or NFKC, as semantic information might be lost (for instance U+00B2 (SUPERSCRIPT TWO) maps to "2") and extra mark-up information might have to be added to preserve it (e.g., <sup>2</sup> in HTML).
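As a small illustration of the BOM handling and byte-order detection described above (a sketch only; the helper names are mine and not from any particular library):

  #include <stddef.h>

  /* Return the number of bytes to skip if the buffer starts with a
   * UTF-8 encoded BOM (0xEF 0xBB 0xBF), i.e. the "remove the first
   * character if it is U+FEFF" option. */
  static size_t skip_utf8_bom(const unsigned char *buf, size_t len)
  {
      if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
          return 3;
      return 0;
  }

  /* Guess the byte order of a UTF-16 buffer from a leading BOM:
   * 1 = Bigendian, -1 = Littleendian, 0 = no BOM present. */
  static int utf16_byte_order(const unsigned char *buf, size_t len)
  {
      if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return 1;
      if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return -1;
      return 0;
  }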

What programming languages support Unicode?

More recent programming languages that were developed after around 1993 already have special data types for Unicode/ISO 10646-1 characters. This is the case with Ada95, Java, TCL, Perl, Python, C# and others. ISO C 90 specifies mechanisms to handle multi-byte encodings and wide characters. These facilities were improved with Amendment 1 to ISO C 90 in 1994 and even further improvements were made in the ISO C 99 standard. These facilities were designed originally with various East-Asian encodings in mind. They are on one side slightly more sophisticated than what would be necessary to handle UCS (handling of "shift sequences"), but also lack support for more advanced aspects of UCS (combining characters, etc.). UTF-8 is an example of what the ISO C standard calls a multi-byte encoding. The type wchar_t, which in modern environments is usually a signed 32-bit integer, can be used to hold Unicode characters. (Since wchar_t has ended up being a 16-bit type on some platforms and a 32-bit type on others, additional types char16_t and char32_t have been proposed in ISO TR 19769 for future revisions of the C language, to give application programmers more control over the representation of such wide strings.) Unfortunately, wchar_t was already widely used for various Asian 16-bit encodings throughout the 1990s. Therefore, the ISO C 99 standard was bound by backwards compatibility. It could not be changed to require wchar_t to be used with UCS, like Java and Ada95 managed to do. However, the C compiler can at least signal to an application that wchar_t is guaranteed to hold UCS values in all locales. To do so, it defines the macro __STDC_ISO_10646__ to be an integer constant of the form yyyymmL. The year and month refer to the version of ISO/IEC 10646 and its amendments that have been implemented. For example, __STDC_ISO_10646__ == 200009L if the implementation covers ISO/IEC 10646-1:2000.
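A program can test this macro to find out whether it may safely treat wchar_t values as ISO 10646 code points; a minimal sketch:

  #include <stdio.h>

  int main(void)
  {
  #ifdef __STDC_ISO_10646__
      /* wchar_t holds ISO 10646 code points in every locale; the macro
       * value encodes the supported standard version as yyyymmL. */
      printf("wchar_t is UCS (ISO/IEC 10646 version %ld)\n",
             (long) __STDC_ISO_10646__);
  #else
      printf("wchar_t encoding is unspecified on this platform\n");
  #endif
      return 0;
  }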

How should Unicode be used under Linux?

Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII. Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, BIG5 in Taiwan, etc. This made the exchange of files difficult and application software had to worry about various small differences between these encodings. Support for these encodings was usually incomplete, untested, and unsatisfactory, because the application developers rarely used all these encodings themselves. Because of these difficulties, major Linux distributors and application developers are now phasing out these older legacy encodings in favour of UTF-8. UTF-8 support has improved dramatically over the last few years and many people now use UTF-8 every day in

- text files (source code, HTML files, email messages, etc.)
- file names
- standard input and standard output, pipes
- environment variables
- cut and paste selection buffers
- telnet, modem, and serial port connections to terminal emulators

and in any other places where byte sequences used to be interpreted in ASCII. In UTF-8 mode, terminal emulators such as xterm or the Linux console driver transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process. Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font. Full Unicode functionality with all bells and whistles (e.g. high-quality typesetting of the Arabic and Indic scripts) can only be expected from sophisticated multi-lingual word-processing packages. What Linux supports today on a broad base is much simpler and mainly aimed at replacing the old 8- and 16-bit character sets. Linux terminal emulators and command line tools usually only support a Level 1 implementation of ISO 10646-1 (no combining characters), and only scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and many scientific symbols are supported that need no further processing support. At this level, UCS support is very comparable to ISO 8859 support and the only significant difference is that we now have thousands of different characters available, that characters can be represented by multibyte sequences, and that ideographic Chinese/Japanese/Korean characters require two terminal character positions (double-width). Level 2 support in the form of combining characters for selected scripts (in particular Thai) and Hangul Jamo is in parts also available (i.e., some fonts, terminal emulators and editors support it via simple overstriking), but precomposed characters should be preferred over combining character sequences where available. More formally, the preferred way of encoding text in Unicode under Linux should be Normalization Form C as defined in Unicode Technical Report #15. One influential non-POSIX PC operating system vendor (whom we shall leave unnamed here) suggested that all Unicode files should start with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which is in this role also referred to as the "signature" or "byte-order mark (BOM)", in order to identify the encoding and byte-order used in a file. Linux/Unix does not use any BOMs and signatures.
They would break far too many existing ASCII syntax conventions (such as scripts beginning with #!). On POSIX systems, the selected locale already identifies the encoding expected in all input and output files of a process. (It has also been suggested to call UTF-8 files without a signature "UTF-8N" files, but this non-standard term is usually not used in the POSIX world.) Before you switch to UTF-8 under Linux, update your installation to a recent distribution with up-to-date UTF-8 support. This is particularly the case if you use an installation older than SuSE 9.1 or Red Hat 8.0. Before these, UTF-8 support was not yet mature enough to be recommendable for everyday use. Red Hat Linux 8.0 (September 2002) was the first distribution to take the leap of switching to UTF-8 as the default encoding for most locales. The only exceptions were Chinese/Japanese/Korean locales, for which there were at the time still too many specialized tools available that did not yet support UTF-8. This first mass deployment of UTF-8 under Linux caused most remaining issues to be ironed out rather quickly during 2003. SuSE Linux then switched its default locales to UTF-8 as well, as of version 9.1 (May 2004). It was followed by Ubuntu Linux, the first Debian-derivative that switched to UTF-8 as the system-wide default encoding. With the migration of the three most popular Linux distributions, UTF-8 related bugs have now been fixed in practically all well-maintained Linux tools. Other distributions can be expected to follow soon.

How do I have to modify my software?

If you are a developer, there are several approaches to add UTF-8 support. We can split them into two categories, which I will call soft and hard conversion. In soft conversion, data is kept in its UTF-8 form everywhere and only very few software changes are necessary. In hard conversion, any UTF-8 data that the program reads will be converted into wide-character arrays and will be handled as such everywhere inside the application. Strings will only be converted back to UTF-8 at output time. Internally, a character remains a fixed-size memory object. We can also distinguish hard-wired and locale-dependent approaches to supporting UTF-8, depending on how much the string processing relies on the standard library. C offers a number of string processing functions designed to handle arbitrary locale-specific multibyte encodings. An application programmer who relies completely on these can remain unaware of the actual details of the UTF-8 encoding. Chances are then that by simply changing the locale setting, several other multi-byte encodings (such as EUC) will automatically be supported as well. The other way a programmer can go is to hardcode knowledge about UTF-8 into the application. This may in some situations lead to significant performance improvements. It may be the best approach for applications that will only be used with ASCII and UTF-8. Even where support for every multi-byte encoding supported by libc is desired, it may well be worth adding additional code optimized for UTF-8. Thanks to UTF-8's self-synchronizing features, it can be processed very efficiently. The locale-dependent libc string functions can be two orders of magnitude slower than equivalent hardwired UTF-8 procedures. A bad teaching example was GNU grep 2.5.1, which relied entirely on locale-dependent libc functions such as mbrlen() for its generic multi-byte encoding support. This made it about 100× slower in multibyte mode than in single-byte mode! Other applications with hardwired support for UTF-8 regular expressions (e.g., Perl 5.8) do not suffer this dramatic slowdown. Most applications can do very fine with just soft conversion. This is what makes the introduction of UTF-8 on Unix possible at all. To name two trivial examples, applications such as cat and echo do not have to be modified at all. They can remain completely ignorant as to whether their input and output is ISO 8859-2 or UTF-8, because they handle just byte streams without processing them. They only recognize ASCII characters and control codes such as '\n',
which do not change in any way under UTF-8. Therefore the UTF-8 encoding and decoding is done for these applications completely in the terminal emulator. A small modification will be necessary for any program that determines the number of characters in a string by counting the bytes. With UTF-8, as with other multi-byte encodings, where the length of a text string is of concern, programmers have to distinguish clearly between

1. the number of bytes,
2. the number of characters,
3. the display width (e.g., the number of cursor position cells in a VT100 terminal emulator)

of a string. C's strlen(s) function always counts the number of bytes. This is the quantity relevant, for example, for memory management (determination of string buffer sizes). Where the output of strlen is used for such purposes, no change will be necessary. The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80-0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications. In applications written for ASCII or ISO 8859, a much more common use of strlen is to predict the number of columns that the cursor of the terminal will advance if a string is printed. With UTF-8, neither a byte nor a character count will predict the display width, because ideographic characters (Chinese, Japanese, Korean) will occupy two column positions, whereas control and combining characters occupy none. To determine the width of a string on the terminal screen, it is necessary to decode the UTF-8 sequence and then use the wcwidth function to test the display width of each character, or wcswidth to measure the entire string (see the example sketch at the end of this answer). For instance, the ls program had to be modified, because without knowing the column widths of filenames, it cannot format the table layout in which it presents directories to the user. Similarly, all programs that assume in some way that the output is presented in a fixed-width font and format it accordingly have to learn to count columns in UTF-8 text. Editor functions such as deleting a single character have to be slightly modified to delete all bytes that might belong to one character. Affected were for example editors (vi, emacs, readline, etc.) as well as programs that use the ncurses library. Any Unix-style kernel can do fine with soft conversion and needs only very minor modifications to fully support UTF-8. Most kernel functions that handle strings (e.g. file names, environment variables, etc.) are not affected at all by the encoding. Modifications were necessary in Linux in the following places:

- The console display and keyboard driver (another VT100 emulator) have to encode and decode UTF-8 and should support at least some subset of the Unicode character set. This had already been available in Linux as early as kernel 1.2 (send ESC %G to the console to activate UTF-8 mode).
- External file system drivers such as VFAT and WinNT have to convert file name character encodings. UTF-8 is one of the available conversion options, and the mount command has to tell the kernel driver that user processes shall see UTF-8 file names.
Since VFAT and WinNT already use Unicode anyway, UTF-8 is the only available encoding that guarantees a lossless conversion here.
- The tty driver of any POSIX system supports a "cooked" mode, in which some primitive line editing functionality is available. In order to allow the character-erase function (which is activated when you press backspace) to work properly with UTF-8, it needs to be told not to count continuation bytes in the range 0x80-0xBF as characters, but to delete them as part of a UTF-8 multi-byte sequence. Because the kernel is ignorant of the libc locale mechanics, another mechanism is needed to tell the tty driver that UTF-8 is being used. Linux kernel versions 2.6 or newer support a bit IUTF8 in the c_iflag member variable of struct termios. If it is set, the "cooked" mode line editor will treat UTF-8 multi-byte sequences correctly. This mode can be set from the command shell with "stty iutf8". Xterm and friends should set this bit automatically when called in a UTF-8 locale.
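The three quantities discussed above (bytes, characters, display columns) can be measured as in the following sketch (assuming the program runs in a UTF-8 locale and the source file itself is UTF-8 encoded):

  #define _XOPEN_SOURCE 700   /* for wcswidth() */
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  /* Hard-wired character count: every byte except the continuation
   * bytes 0x80-0xBF starts a new character. */
  static size_t utf8_count_chars(const char *s)
  {
      size_t n = 0;
      for (; *s; s++)
          if (((unsigned char) *s & 0xC0) != 0x80)
              n++;
      return n;
  }

  int main(void)
  {
      const char *s = "Grüße";          /* UTF-8 encoded in the source */
      wchar_t w[64];

      setlocale(LC_CTYPE, "");           /* select the UTF-8 locale */

      printf("bytes      = %zu\n", strlen(s));
      printf("characters = %zu\n", mbstowcs(NULL, s, 0));
      printf("hardwired  = %zu\n", utf8_count_chars(s));

      mbstowcs(w, s, 64);
      printf("columns    = %d\n", wcswidth(w, 64));
      return 0;
  }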

C support for Unicode and UTF-8

Starting with GNU glibc 2.2, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. This is signalled to applications by the definition of the __STDC_ISO_10646__ macro as required by ISO C 99. The ISO C multi-byte conversion functions (mbsrtowcs(), wcsrtombs(), etc.) are fully implemented in glibc 2.2 or higher and can be used to convert between wchar_t and any locale-dependent multibyte encoding, including UTF-8, ISO 8859-1, etc. For example, you can write

  #include <stdio.h>
  #include <locale.h>

  int main()
  {
    if (!setlocale(LC_CTYPE, "")) {
      fprintf(stderr, "Can't set the specified locale! "
              "Check LANG, LC_CTYPE, LC_ALL.\n");
      return 1;
    }
    printf("%ls\n", L"Schöne Grüße");
    return 0;
  }

Call this program with the locale setting LANG=de_DE and the output will be in ISO 8859-1. Call it with LANG=de_DE.UTF-8 and the output will be in UTF-8. The %ls format specifier in printf calls wcsrtombs in order to convert the wide character argument string into the locale-dependent multi-byte encoding. Many of C's string functions are locale-independent and they just look at zero-terminated byte sequences:

  strcpy strncpy strcat strncat strcmp strncmp strdup strchr strrchr
  strcspn strspn strpbrk strstr strtok

Some of these (e.g. strcpy) can equally be used for single-byte (ISO 8859-1) and multi-byte (UTF-8) encoded character sets, as they need no notion of how many bytes long a character is, while others (e.g., strchr) depend on one character being encoded in a single char value and are of less use for UTF-8 (strchr still works fine if you just search for an ASCII character in a UTF-8 string). Other C functions are locale dependent and work in UTF-8 locales just as well:

  strcoll strxfrm

How should the UTF-8 mode be activated?

If your application is soft converted and does not use the standard locale-dependent C multibyte routines (mbsrtowcs(), wcsrtombs(), etc.) to convert everything into wchar_t for processing, then it might have to find out in some other way whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Once everyone uses only UTF-8, you can simply make it the default, but until then both the classical 8-bit sets and UTF-8 may still have to be supported. The first wave of applications with UTF-8 support used a whole lot of different command line switches to activate their respective UTF-8 modes, for instance the famous xterm -u8. That turned out to be a really bad idea. Having to remember a special command line option or other configuration mechanism for every application is very tedious, which is why command line options are not the proper way of activating a UTF-8 mode. The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that contains information about culture-specific conventions of software behaviour, including the character encoding, the date/time notation, alphabetic sorting rules, the measurement system and common office paper size, etc. The names of locales usually consist of ISO 639-1 language and ISO 3166-1 alpha-2 country codes, sometimes with additional encoding names or other qualifiers. You can get a list of all locales installed on your system (usually in /usr/lib/locale/) with the command locale -a. Set the environment variable LANG to the name of your preferred locale. When a C program executes the setlocale(LC_CTYPE, "") function, the library will test the environment variables LC_ALL, LC_CTYPE, and LANG in that order, and the first one of these that has a value will determine which locale data is loaded for the LC_CTYPE category (which controls the multibyte conversion functions). The locale data is split up into separate categories. For example, LC_CTYPE defines the character encoding and LC_COLLATE defines the string sorting order. The LANG environment variable is used to set the default locale for all categories, but the LC_* variables can be used to override individual categories. Do not worry too much about the country identifiers in the locales. Locales such as en_GB (English in Great Britain) and en_AU (English in Australia) usually differ only in the LC_MONETARY category (name of currency, rules for printing monetary amounts), which practically no Linux application ever uses. LC_CTYPE=en_GB and LC_CTYPE=en_AU have exactly the same effect. Effect of locale on sorting order: If you had not set a locale before, you may quickly notice that setting one (e.g., LANG=en_US.UTF-8 or LANG=en_GB.UTF-8) also changes the sorting order used by some tools: the "ls" command now sorts filenames with uppercase and lowercase first characters next to each other (like in a dictionary), and file globbing no longer uses the ASCII order either (e.g. "echo [a-z]*" also lists filenames starting with uppercase). To get back the old ASCII sorting order that you are used to, simply also set LC_COLLATE=POSIX (or equivalently LC_COLLATE=C), and you will quickly feel at home again.
You can query the name of the character encoding in your current locale with the command locale charmap. This should say UTF-8 if you successfully picked a UTF-8 locale in the LC_CTYPE category. The command locale -m provides a list with the names of all installed character encodings. If you use exclusively C library multibyte functions to do all the conversion between the external character encoding and the wchar_t encoding that you use internally, then the C library will take care of using the right encoding according to LC_CTYPE for you and your program does not even have to know explicitly what the current multibyte encoding is. However, if you prefer not to do everything using the libc multi-byte functions (e.g., because you think this would require too many changes in your software or is not efficient enough), then your application has to find out for itself when to activate the UTF-8 mode. To do this, on any X/Open compliant system, where <langinfo.h> is available, you can use a line such as

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);

in order to detect whether the current locale uses the UTF-8 encoding. You have of course to add a setlocale(LC_CTYPE, "") at the beginning of your application to set the locale according to the environment variables first. The standard function call nl_langinfo(CODESET) is also what locale charmap calls to find the name of the encoding specified by the current locale for you. It is available on pretty much every modern Unix now. FreeBSD added nl_langinfo(CODESET) support with version 4.6 (2002-06). If you need an autoconf test for the availability of nl_langinfo(CODESET), here is the one Bruno Haible suggested:

  ======================== m4/codeset.m4 ================================
  #serial AM1

  dnl From Bruno Haible.

  AC_DEFUN([AM_LANGINFO_CODESET],
  [
    AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
      [AC_TRY_LINK([#include <langinfo.h>],
        [char* cs = nl_langinfo(CODESET);],
        am_cv_langinfo_codeset=yes,
        am_cv_langinfo_codeset=no)
      ])
    if test $am_cv_langinfo_codeset = yes; then
      AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
        [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
    fi
  ])
  =======================================================================

[You could also try to query the locale environment variables yourself without using setlocale(). In the sequence LC_ALL, LC_CTYPE, LANG, look for the first of these environment variables that has a value. Make the UTF-8 mode the default (still overridable by command line switches) when this value contains the substring UTF-8, as this indicates reasonably reliably that the C library has been asked to use a UTF-8 locale. An example code fragment that does this is

  char *s;
  int utf8_mode = 0;

  if (((s = getenv("LC_ALL"))   && *s) ||
      ((s = getenv("LC_CTYPE")) && *s) ||
      ((s = getenv("LANG"))     && *s)) {
    if (strstr(s, "UTF-8"))
      utf8_mode = 1;
  }

This relies of course on all UTF-8 locales having the name of the encoding in their name, which is not always the case, therefore the nl_langinfo() query is clearly the better method.
If you are really concerned that calling nl_langinfo() might not be portable enough, there is also Markus Kuhn's portable public domain nl_langinfo(CODESET) emulator for systems that do not have the real thing (and another one from Bruno Haible), and you can use the norm_charmap() function to standardize the output of nl_langinfo(CODESET) on different platforms.]
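Putting the pieces of this answer together, a minimal self-contained detection routine might look as follows (a sketch, assuming an X/Open system that provides <langinfo.h>; the helper name is mine):

  #include <langinfo.h>
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>

  /* Returns 1 if the locale selected via the environment variables
   * (LC_ALL, LC_CTYPE, LANG) uses the UTF-8 encoding, 0 otherwise. */
  static int in_utf8_locale(void)
  {
      if (!setlocale(LC_CTYPE, ""))     /* honour the environment */
          return 0;
      return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
  }

  int main(void)
  {
      printf("UTF-8 mode: %s\n", in_utf8_locale() ? "on" : "off");
      return 0;
  }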

How do I get a UTF-8 version of xterm?

The xterm version that comes with XFree86 4.0 or higher (maintained by Thomas Dickey) includes UTF-8 support. To activate it, start xterm in a UTF-8 locale and use a font with iso10646-1 encoding, for instance with

  LC_CTYPE=en_GB.UTF-8 xterm \
    -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'

and then cat some example file, such as UTF-8-demo.txt, in the newly started xterm and enjoy what you see. If you are not using XFree86 4.0 or newer, then you can alternatively download the latest xterm development version separately and compile it yourself with "./configure --enable-wide-chars ; make" or alternatively with "xmkmf; make Makefiles; make; make install; make install.man". If you do not have UTF-8 locale support available, use the command line option -u8 when you invoke xterm to switch input and output to UTF-8.

How much of Unicode does xterm support?

Xterm in XFree86 4.0.1 only supported Level 1 (no combining characters) of ISO 10646-1 with fixed character width and left-to-right writing direction. In other words, the terminal semantics were basically the same as for ISO 8859-1, except that it can now decode UTF-8 and can access 16-bit characters. With XFree86 4.0.3, two important functions were added:

- automatic switching to a double-width font for CJK ideographs
- simple overstriking combining characters

If the selected normal font is X × Y pixels large, then xterm will attempt to load in addition a 2X × Y pixels large font (same XLFD, except for a doubled value of the AVERAGE_WIDTH property). It will use this font to represent all Unicode characters that have been assigned the East Asian Wide (W) or East Asian FullWidth (F) property in Unicode Technical Report #11. The following fonts coming with XFree86 4.x are suitable for display of Japanese and Korean Unicode text with terminal emulators and editors:

  6x13     -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  6x13B    -Misc-Fixed-Bold-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  6x13O    -Misc-Fixed-Medium-O-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  12x13ja  -Misc-Fixed-Medium-R-Normal-ja-13-120-75-75-C-120-ISO10646-1
  9x18     -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
  9x18B    -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
  18x18ja  -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
  18x18ko  -Misc-Fixed-Medium-R-Normal-ko-18-120-100-100-C-180-ISO10646-1

Some simple support for nonspacing or enclosing combining characters (i.e., those with general category code Mn or Me in the Unicode database) is now also available, which is implemented by just overstriking (logical OR-ing) a base-character glyph with up to two combining-character glyphs. This produces acceptable results for accents below the base line and accents on top of small characters. It also works well, for example, for Thai and Korean Hangul Conjoining Jamo fonts that were specifically designed for use with overstriking. However, the results might not be fully satisfactory for combining accents on top of tall characters in some fonts, especially with the fonts of the "fixed" family. Therefore precomposed characters will continue to be preferable where available. The following fonts coming with XFree86 4.x are suitable for display of Latin etc. combining characters (extra head-space). Other fonts will only look nice with combining accents on small x-high characters.

  6x12     -Misc-Fixed-Medium-R-Semicondensed--12-110-75-75-C-60-ISO10646-1
  9x18     -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
  9x18B    -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1

The following fonts coming with XFree86 4.x are suitable for display of Thai combining characters:

  6x13     -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
  9x15     -Misc-Fixed-Medium-R-Normal--15-140-75-75-C-90-ISO10646-1
  9x15B    -Misc-Fixed-Bold-R-Normal--15-140-75-75-C-90-ISO10646-1
  10x20    -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1
  9x18     -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1

The fonts 18x18ko, 18x18Bko, 16x16Bko, and 16x16ko are suitable for displaying Hangul Jamo (using the same simple overstriking character mechanism used for Thai).
A note for programmers of text mode applications: With support for CJK ideographs and combining characters, the output of xterm behaves a little bit more like with a proportional font, because a Latin/Greek/Cyrillic/etc. character requires one column position, a CJK ideograph two, and a combining character zero. The Open Group's Single UNIX Specification specifies the two C functions wcwidth() and wcswidth() that allow an application to test how many column positions a character will occupy:

  #include <wchar.h>
  int wcwidth(wchar_t wc);
  int wcswidth(const wchar_t *pwcs, size_t n);

Markus Kuhn's free wcwidth() implementation can be used by applications on platforms where the C library does not yet provide a suitable function. Xterm will for the foreseeable future probably not support the following functionality, which you might expect from a more sophisticated full Unicode rendering engine:

- bidirectional output of Hebrew and Arabic characters
- substitution of Arabic presentation forms
- substitution of Indic/Syriac ligatures
- arbitrary stacks of combining characters

Hebrew and Arabic users will therefore have to use application programs that reverse and left-pad Hebrew and Arabic strings before sending them to the terminal. In other words, the bidirectional processing has to be done by the application and not by xterm. The situation for Hebrew and Arabic improves over ISO 8859 at least in the form of the availability of precomposed glyphs and presentation forms. It is far from clear at the moment whether bidirectional support should really go into xterm and how exactly this should work. Both ISO 6429 = ECMA-48 and the Unicode bidi algorithm provide alternative starting points. See also ECMA Technical Report TR/53. If you plan to support bidirectional text output in your application, have a look at either Dov Grobgeld's FriBidi or Mark Leisher's Pretty Good Bidi Algorithm, two free implementations of the Unicode bidi algorithm. Xterm currently does not support the Arabic, Syriac, or Indic text formatting algorithms, although Robert Brady has published some experimental patches towards bidi support. It is still unclear whether it is feasible or preferable to do this in a VT100 emulator at all. Applications can apply the Arabic and Hangul formatting algorithms themselves easily, because xterm allows them to output the necessary presentation forms. For Hangul, Unicode contains the presentation forms needed for modern (post-1933) Korean orthography. For Indic scripts, the X font mechanism at the moment does not even support the encoding of the necessary ligature variants, so there is little xterm could offer anyway. Applications requiring Indic or Syriac output should better use a proper Unicode X11 rendering library such as Pango instead of a VT100 emulator like xterm.

Where do I find ISO 10646-1 X11 fonts?

Quite a number of Unicode fonts have become available for X11 over the past few months, and the list is growing quickly:

- Markus Kuhn together with a number of other volunteers has extended the old -misc-fixed-*-iso8859-1 fonts that come with X11 towards a repertoire that covers all European characters (Latin, Greek, Cyrillic, intl. phonetic alphabet, mathematical and technical symbols, in some fonts even Armenian, Georgian, Katakana, Thai, and more). For more information see the Unicode fonts and tools for X11 page. These fonts are now also distributed with XFree86 4.0.1 or higher.
- Markus has also prepared ISO 10646-1 versions of all the Adobe and B&H BDF fonts in the X11R6.4 distribution. These fonts already contained the full PostScript font repertoire (around 30 additional characters, mostly those used also by CP1252 MS-Windows, e.g. smart quotes, dashes, etc.), which were however not accessible under the ISO 8859-1 encoding. They are now all available in the ISO 10646-1 version, along with many more precomposed characters covering ISO 8859-1,2,3,4,9,10,13,14,15. These fonts are now also distributed with XFree86 4.1 or higher.
- XFree86 4.0 comes with an integrated TrueType font engine that can make any Apple/Microsoft font available to your X application in the ISO 10646-1 encoding.
- Some future XFree86 release might also remove most old BDF fonts from the distribution and replace them with ISO 10646-1 encoded versions. The X server will be extended with an automatic encoding converter that creates other font encodings such as ISO 8859-* from the ISO 10646-1 font file on-the-fly when such a font is requested by old 8-bit software. Modern software should ideally use the ISO 10646-1 font encoding directly.
- ClearlyU (cu12) is a 12 point, 100 dpi proportional ISO 10646-1 BDF font for X11 with over 3700 characters by Mark Leisher (example images).
- The Electronic Font Open Laboratory in Japan is also working on a family of Unicode bitmap fonts.
- Dmitry Yu. Bolkhovityanov created a Unicode VGA font in BDF for use by text mode IBM PC emulators etc.
- Roman Czyborra's GNU Unicode font project works on collecting a complete and free 8×16/16×16 pixel Unicode font. It currently covers over 34000 characters.
- etl-unicode is an ISO 10646-1 BDF font prepared by Primoz Peterlin. Primoz Peterlin has also started the freefont project, which extends towards better UCS coverage some of the 35 core PostScript outline fonts that URW++ donated to the ghostscript project, with the help of pfaedit.
- George Williams has created a Type1 Unicode font family, which is also available in BDF. He also developed the PfaEdit PostScript and bitmap font editor.
- Everson Mono is a shareware monospaced font with over 3000 European glyphs, also available from the DKUUG server.
- Birger Langkjer has prepared a Unicode VGA Console Font for Linux.
- Alan Wood has a list of Microsoft fonts that support various Unicode ranges.
- CODE2000 is a Unicode font by James Kass.

Unicode X11 font names end with -ISO10646-1. This is now the officially registered value for the X Logical Font Descriptor (XLFD) fields CHARSET_REGISTRY and CHARSET_ENCODING for all Unicode and ISO 10646-1 16-bit fonts.
The *-ISO10646-1 fonts contain some unspecified subset of the entire Unicode character set, and users have to make sure that whatever font they select covers the subset of characters needed by them. The *-ISO10646-1 fonts usually also specify a DEFAULT_CHAR value that points to a special non-Unicode glyph for representing any character that is not available in the font (usually a dashed box, the size of an H, located at 0x00). This ensures that users at least see clearly that there is an unsupported character. The smaller fixed-width fonts such as 6x13 etc. for xterm will never be able to cover all of Unicode, because many scripts such as Kanji can only be represented in significantly larger pixel sizes than those widely used by European users. Typical Unicode fonts for European usage will contain only subsets of between 1000 and 3000 characters, such as the CEN MES-3 repertoire. You might notice that in the *-ISO10646-1 fonts the shapes of the ASCII quotation marks have slightly changed to bring them in line with the standards and practice on other platforms.

What are the issues related to UTF-8 terminal emulators?

VT100 terminal emulators accept ISO 2022 (= ECMA-35) ESC sequences in order to switch between different character sets. UTF-8 is in the sense of ISO 2022 an "other coding system" (see section 15.4 of ECMA 35). UTF-8 is outside the ISO 2022 SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8, all SS2/SS3/G0/G1/G2/G3 states become meaningless until you leave UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding, i.e. a self-terminating short byte sequence determines completely which character is meant, independent of any switching state. G0 and G1 in ISO 10646-1 are those of ISO 8859-1, and G2/G3 do not exist in ISO 10646, because every character has a fixed position and no switching takes place. With UTF-8, it is not possible that your terminal remains switched to strange graphics-character mode after you accidentally dumped a binary file to it. This makes a terminal in UTF-8 mode much more robust than with ISO 2022 and it is therefore useful to have a way of locking a terminal into UTF-8 mode such that it cannot accidentally go back to the ISO 2022 world. The ISO 2022 standard specifies a range of ESC % sequences for leaving the ISO 2022 world (designation of other coding system, DOCS), and a number of such sequences have been registered for UTF-8 in section 2.8 of the ISO 2375 International Register of Coded Character Sets:

- ESC %G activates UTF-8 with an unspecified implementation level from ISO 2022 in a way that allows to go back to ISO 2022 again.
- ESC %@ goes back from UTF-8 to ISO 2022 in case UTF-8 had been entered via ESC %G.
- ESC %/G switches to UTF-8 Level 1 with no return.
- ESC %/H switches to UTF-8 Level 2 with no return.
- ESC %/I switches to UTF-8 Level 3 with no return.

While a terminal emulator is in UTF-8 mode, any ISO 2022 escape sequences such as for switching G2/G3 etc. are ignored. The only ISO 2022 sequence on which a terminal emulator might act in UTF-8 mode is ESC %@ for returning from UTF-8 back to the ISO 2022 scheme. UTF-8 still allows you to use C1 control characters such as CSI, even though UTF-8 also uses bytes in the range 0x80-0x9F. It is important to understand that a terminal emulator in UTF-8 mode must apply the UTF-8 decoder to the incoming byte stream before interpreting any control characters. C1 characters are UTF-8 decoded just like any other character above U+007F. Many text-mode applications available today expect to speak to the terminal using a legacy encoding or to use ISO 2022 sequences for switching terminal fonts. In order to use such applications within a UTF-8 terminal emulator, it is possible to use a conversion layer that will translate between ISO 2022 and UTF-8 on the fly. Examples of such utilities are Juliusz Chroboczek's luit and pluto. If all you need is ISO 8859 support in a UTF-8 terminal, you can also use screen (version 4.0 or newer) by Michael Schröder and Jürgen Weigert. As implementation of ISO 2022 is a complex and error-prone task, better avoid implementing ISO 2022 yourself. Implement only UTF-8 and point users who need ISO 2022 at luit (or screen).
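For illustration, the two reversible DOCS sequences mentioned above can be emitted from C as in the following sketch (the helper names are mine; this is not something well-behaved applications should normally do behind the user's back):

  #include <stdio.h>

  /* ESC % G  switches a terminal from ISO 2022 to UTF-8 mode,
   * ESC % @  switches it back, as registered in ISO 2375. */
  static void term_enter_utf8(FILE *tty) { fputs("\033%G", tty); }
  static void term_leave_utf8(FILE *tty) { fputs("\033%@", tty); }

  int main(void)
  {
      term_enter_utf8(stdout);
      fputs("\xC3\xA9\n", stdout);   /* U+00E9 (é) encoded in UTF-8 */
      term_leave_utf8(stdout);
      return 0;
  }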

What UTF-8 enabled applications are available?

Warning: As of mid-2003, this section is becoming more and more incomplete. UTF-8 support is now a pretty standard feature for most well-maintained packages. This list will soon have to be converted into a list of the most popular programs that still have problems with UTF-8.

Terminal emulation and communication

- xterm as shipped with XFree86 4.0 or higher works correctly in UTF-8 locales if you use a *-iso10646-1 font. Just try it with for example LC_CTYPE=en_GB.UTF-8 xterm -fn '-Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1'.
- C-Kermit has supported UTF-8 as the transfer, terminal, and file character set since version 7.0.
- mlterm is a multi-lingual terminal emulator that supports UTF-8 among many other encodings, combining characters, and XIM.
- Edmund Grimley Evans extended the BOGL Linux framebuffer graphics library with UCS font support and built a simple UTF-8 console terminal emulator called bterm with it.
- Uterm purports to be a UTF-8 terminal emulator for the Linux framebuffer console.
- Pluto, Juliusz Chroboczek's paranormal Unicode converter, can guess which encoding is being used in a terminal session, and converts it on-the-fly to UTF-8. (Wonderful for reading IRC channels with mixed ISO 8859 and UTF-8 messages!)

Editing and word processing

- Vim (the popular clone of the classic vi editor) supports UTF-8 with wide characters and up to two combining characters starting from version 6.0.
- Emacs has quite good basic UTF-8 support starting from version 21.3. Emacs 23 changed the internal encoding to UTF-8.
- Yudit is Gaspar Sinai's free X11 Unicode editor.
- MinEd by Thomas Wolff is a very nice UTF-8 capable text editor, ahead of the competition with features such as not only support of double-width and combining characters, but also bidirectional scripts, keyboard mappings for a wide range of scripts, script-dependent highlighting, etc.
- JOE is a popular WordStar-like editor that supports UTF-8 as of version 3.0.
- Cooledit offers UTF-8 and UCS support beginning with version 3.15.0.
- QEmacs is a small editor for use on UTF-8 terminals.
- less is a popular plain-text file viewer that has had UTF-8 support since version 348. (Version 358 had a bug related to the handling of UTF-8 characters and backspace underlining/boldification as used by nroff/man, for which a patch is available; version 381 still has problems with UTF-8 characters in the search-mode input line.)
- GNU bash and readline provide single-line editors and they introduced support for multi-byte character encodings, such as UTF-8, with versions bash 2.05b and readline 4.3.
- gucharmap and UMap are tools to pick and paste any Unicode character into your application.
- LaTeX has supported UTF-8 in its base package since March 2004 (still experimental). You can simply write \usepackage[utf8]{inputenc} and then encode at least some of TeX's standard character repertoire in UTF-8 in your LaTeX sources. (Before that, UTF-8 was already available in the form of Dominique Unruh's package, which covered far more characters and was somewhat resource hungry.)
- XeTeX is a reengineered version of TeX that reads and understands (UTF-8 encoded) Unicode text.
- AbiWord.

Programming

- Perl provides usable Unicode and UTF-8 support starting with version 5.8.1. Strings are now tagged in memory as either byte strings or character strings, and the latter are stored internally as UTF-8 but appear to the programmer simply as sequences of UCS characters. There is now also comprehensive support for encoding conversion and normalization included. Read "man perluniintro" for details.
- Python got Unicode support added in version 1.6.
- Tcl/Tk started using Unicode as its base character set with version 8.1. ISO10646-1 fonts are supported in Tk from version 8.3.3 or newer.
- CLISP can work with all multi-byte encodings (including UTF-8) and with the functions char-width and string-width there is an API comparable to wcwidth() and wcswidth() available.

Mail and Internet

- The Mutt email client has worked since version 1.3.24 in UTF-8 locales. When compiled and linked with ncursesw (ncurses built with wide-character support), Mutt 1.3.x works decently in UTF-8 locales under UTF-8 terminal emulators such as xterm.
- Exmh is a GUI frontend for the MH or nmh mail system and partially supports Unicode beginning with version 2.1.1 if Tcl/Tk 8.3.3 or newer is used. To enable displaying UTF-8 email, make sure you have the *-iso10646-1 fonts installed and add to .Xdefaults the line "exmh.mimeUCharsets: utf-8". Much of the Exmh-internal MIME charset mechanics however still dates from the times before Tcl 8.1, therefore ignores Tcl/Tk's more recent Unicode support, and should now be simplified and improved significantly. In particular, writing or replying to UTF-8 mail is still broken.
- Most modern web browsers such as Mozilla Firefox have fairly decent UTF-8 support today.
- The popular Pine email client lacks UTF-8 support and is no longer maintained. Switch to its successor Alpine, a complete reimplementation by the same authors, which has excellent UTF-8 support.

Printing

- Cedilla is Juliusz Chroboczek's best-effort Unicode to PostScript text printer.
- Markus Kuhn's hpp is a very simple plain text formatter for HP PCL printers that supports the repertoire of characters covered by the standard PCL fixed-width fonts in all the character encodings for which your C library has a locale mapping.
- Markus Kuhn's utf2ps is an early quick-and-dirty proof-of-concept UTF-8 formatter for PostScript, which was only written to show which character repertoire can easily be printed using only the standard PostScript fonts and was never meant to be actually used.
- Some post-2004 HP printers have UTF-8 PCL firmware support (more). The related PCL5 commands appear to be "␛&t1008P" (encoding method: UTF-8) and "␛(18N" (Unicode code page). Recent PCL printers from other manufacturers (e.g., Kyocera) also advertise UTF-8 support (for SAP compatibility).
- The Common UNIX Printing System comes with a texttops tool that converts plaintext UTF-8 to PostScript.
- txtbdf2ps by Serge Winitzki is a Perl script to print UTF-8 plaintext to PostScript using BDF pixel fonts.

Misc

- The PostgreSQL DBMS has had support for UTF-8 since version 7.1, both as the frontend encoding and as the backend storage encoding. Data conversion between frontend and backend encodings is performed automatically.
- FIGlet is a tool to output banner text in large letters using monospaced characters as block graphics elements; it added UTF-8 support in version 2.2.
- Charlint is a character normalization tool for the W3C character model.
- The first available UTF-8 tools for Unix came out of the Plan 9 project, Bell Labs' Unix successor and the world's first operating system using UTF-8. Plan 9's Sam editor and 9term terminal emulator have also been ported to Unix. Wily started out as a Unix implementation of the Plan 9 Acme editor and is a mouse-oriented, text-based working environment for programmers. More recently the Plan 9 from User Space (aka plan9port) package has emerged, a port of many Plan 9 programs from their native Plan 9 environment to Unix-like operating systems.
- The Gnumeric spreadsheet is fully Unicode based from version 1.1.
- The Heirloom Toolchest is a collection of standard Unix utilities derived from original Unix material released as open source by Caldera, with support for multibyte character sets, especially UTF-8.
- convmv is a tool to convert the filenames in whole directory trees from a legacy encoding to UTF-8.

What patches to improve UTF-8 support are available?

Many of these have already been included in the respective main distribution.
- The Advanced Utility Development subgroup of the OpenI18N (formerly Li18nux) project has prepared various internationalization patches for tools such as cut, fold, glibc, join, sed, uniq, xterm, etc. that improve UTF-8 support.
- A collection of UTF-8 patches for various tools, as well as a UTF-8 support status list, is in Bruno Haible’s Unicode-HOWTO.
- Bruno Haible has also prepared various patches for stty, the Linux kernel tty, etc.
- The multilingualization patch (w3m-m17n) for the text-mode web browser w3m allows you to view documents in all the common encodings on a UTF-8 terminal such as xterm (also switch the option "Use alternate expression with ASCII for entity" to OFF after pressing "o"). Another multilingual version (w3mmee) is available as well (not yet tried).

Are there free libraries for handling Unicode available?

- Ulrich Drepper’s GNU C library glibc has featured since version 2.2 full multi-byte locale support for UTF-8 and an ISO 14651 sorting-order algorithm, and it can recode into many other encodings. All current Linux distributions come with glibc 2.2 or newer, so you should definitely upgrade now if you are still using an earlier Linux C library.
- The International Components for Unicode (ICU) (formerly IBM Classes for Unicode) have become probably the most powerful cross-platform standard library for more advanced Unicode character processing functions.
- X.Net’s xIUA is a package designed to retrofit existing code for ICU support by providing locale management, so that users do not have to change internal calling interfaces to pass locale parameters. It uses more familiar APIs, for example to collate you use xiua_strcoll, and it is thread safe.
- Mark Leisher’s UCData Unicode character property and bidi library, as well as his wchar_t support test code.
- Bruno Haible’s libiconv character-set conversion library provides an iconv() implementation, for use on systems that do not have one, or whose implementation cannot convert from/to Unicode. It also contains the libcharset character-encoding query library, which allows applications to determine in a highly portable way the character encoding of the current locale, avoiding the portability problems of using nl_langinfo(CODESET) directly. (A small iconv() sketch follows this list.)
- Bruno Haible’s libutf8 provides various functions for handling UTF-8 strings, especially for platforms that do not yet offer proper UTF-8 locales.
- Tom Tromey’s libunicode library is part of the GNOME desktop project, but can be built independently of GNOME. It contains various character class and conversion functions.
- FriBidi is Dov Grobgeld’s free implementation of the Unicode bidi algorithm.
- Markus Kuhn’s free wcwidth() implementation can be used by applications on platforms where the C library does not yet provide an equivalent function to find out how many column positions a character or string will occupy on a UTF-8 terminal emulator screen.
- Markus Kuhn’s transtab is a transliteration table for applications that have to make a best-effort conversion from Unicode to ASCII or some 8-bit character set. It contains a comprehensive list of substitution strings for Unicode characters, comparable to the fallback notations that people commonly use in email and on typewriters to represent unavailable characters. The table comes in ISO/IEC TR 14652 format, to allow easy inclusion into POSIX locale definition files.
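As an illustration of the iconv() interface mentioned above (whether provided by glibc 2.2+ or by libiconv), here is a minimal sketch that queries the locale encoding and converts one ISO 8859-1 string to UTF-8. The sample string and buffer size are arbitrary; real code would loop over larger inputs and handle E2BIG and EILSEQ properly.

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>
    #include <langinfo.h>
    #include <iconv.h>

    int main(void)
    {
        /* report the character encoding of the current locale */
        setlocale(LC_CTYPE, "");
        printf("locale encoding: %s\n", nl_langinfo(CODESET));

        /* convert one ISO 8859-1 string (arbitrary example) to UTF-8 */
        char in[]  = "caf\xe9";              /* "café" in ISO 8859-1 */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t) -1) {
            perror("iconv_open");
            return 1;
        }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
            perror("iconv");
        *outp = '\0';
        printf("UTF-8 result: %s\n", out);

        iconv_close(cd);
        return 0;
    }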

What is the status of Unicode support for various X widget libraries?

- The Pango - Unicode and Complex Text Processing project added full-featured Unicode support to GTK+.
- Qt has supported the use of *-ISO10646-1 fonts since version 2.0.
- A UTF-8 extension for the Fast Light Tool Kit (FLTK) was prepared by Jean-Marc Lienher, based on his Xutf8 Unicode display library.

What packages with UTF-8 support are currently under development?

- Native Unicode support is planned for Emacs 23. If you are interested in contributing or testing, please join the emacs-devel@gnu.org mailing list.
- The Linux Console Project is working on a complete revision of the VT100 emulator built into the Linux kernel, which will improve the simplistic UTF-8 support already there.

How does UTF-8 support work under Solaris?

Starting with Solaris 2.8, UTF-8 is at least partially supported. To use it, just set one of the UTF-8 locales, for instance by typing setenv LANG en_US.UTF-8 in a C shell. Now the dtterm terminal emulator can be used to input and output UTF-8 text, and the mp print filter will print UTF-8 files on PostScript printers. The en_US.UTF-8 locale is at the moment supported by Motif and CDE desktop applications and libraries, but not by OpenWindows, XView, and OPENLOOK DeskSet applications and libraries. For more information, read Sun’s Overview of en_US.UTF-8 Locale Support web page.

Can I use UTF-8 on the Web?

Yes. There are two ways in which an HTTP server can indicate to a client that a document is encoded in UTF-8:

- Make sure that the HTTP header of a document contains the line

    Content-Type: text/html; charset=utf-8

  if the file is HTML, or the line

    Content-Type: text/plain; charset=utf-8

  if the file is plain text. How this is achieved depends on your web server (for dynamically generated pages, the program itself can emit this header; see the sketch at the end of this answer). If you use Apache and you have a subdirectory in which all *.html or *.txt files are encoded in UTF-8, then create there a file .htaccess and add to it the two lines

    AddType text/html;charset=UTF-8 html
    AddType text/plain;charset=UTF-8 txt

  A webmaster can modify /etc/httpd/mime.types to make the same change for all subdirectories simultaneously.

- If you cannot influence the HTTP headers that the web server prefixes to your documents automatically, then add in the HTML document under HEAD the element <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, which usually has the same effect. This obviously works only for HTML files, not for plain text. It also announces the encoding of the file to the parser only after the parser has already started to read the file, so it is clearly the less elegant approach.

The most widely used browsers today support UTF-8 well enough that UTF-8 can generally be recommended for use on web pages. The old Netscape 4 browser used an annoyingly large single font for displaying any UTF-8 document; best upgrade to Mozilla, Netscape 6 or any other recent browser (Netscape 4 is generally very buggy and no longer maintained). There is also the question of how non-ASCII characters entered into HTML forms are encoded in the subsequent HTTP GET or POST request that transfers the field contents to a CGI script on the server. Unfortunately, both standardization and implementation are still a big mess here, as discussed in the Form submission and i18n tutorial by Alan Flavell. We can only hope that a practice of doing all this in UTF-8 will eventually emerge. See also the discussion about Mozilla bug 18643.
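For dynamically generated pages, the Content-Type header shown in the first method can be emitted by the program itself. Below is a minimal, hypothetical CGI sketch in C: it prints the header with the charset parameter, then a blank line, then a small UTF-8 body; this achieves the same effect that the AddType directives configure for static files.

    #include <stdio.h>

    /* Minimal CGI program: declare the output as UTF-8 encoded HTML
     * via the Content-Type header.  The blank line separates the
     * header block from the document body. */
    int main(void)
    {
        printf("Content-Type: text/html; charset=utf-8\r\n\r\n");
        printf("<html><head><title>UTF-8 test</title></head>\n");
        /* "café" and the euro sign, written as UTF-8 byte sequences */
        printf("<body><p>caf\xc3\xa9 \xe2\x82\xac</p></body></html>\n");
        return 0;
    }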

How are PostScript glyph names related to UCS codes?

See Adobe’s Unicode and Glyph Names guide.

Are there any well-defined UCS subsets?

With over 40,000 characters, the design of a font that covers every single Unicode character is an enormous project, not just in terms of the number of glyphs that have to be created, but also in terms of the calligraphic expertise required to do an adequate job for each script. As a result, there are hardly any fonts that attempt to cover "all of Unicode". While a few projects
