source: rtems/cpukit/libmisc/utf8proc/README @ 46b7f921

Last change on this file since 46b7f921 was 46b7f921, checked in by Ralf Kirchner <ralf.kirchner@…>, on 02/26/13 at 11:00:34

libmisc: Add utf8proc-v1.1.5

utf8proc is a small library for processing UTF-8 encoded Unicode strings.
Some features are Unicode normalization, stripping of default ignorable characters, case folding and detection of grapheme cluster boundaries.
For the time being, utf8proc is intended to be used for normalizing and folding UTF-8 strings
for comparison purposes when adding UTF-8 support to the FAT file system.

Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the ruby
library, call "make ruby-library"; and to compile the PostgreSQL
extension, call "make pgsql-library".

For ruby you can also create a gem file by calling "make ruby-gem".

"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory
under the names "libutf8proc.a" and "libutf8proc.so". The ruby library
consists of the files "utf8proc.rb" and "utf8proc_native.so", which are
found in the subdirectory "ruby/". If you chose to create a gem file, it is
placed in the "ruby/gem" directory. The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 had not been available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT
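
As an illustration of how these option sets combine, the following C sketch
maps each normalization form to a bit mask. The EX_* names, their numeric
values and the helper ex_options_for_form are hypothetical stand-ins made up
for this example; real code would presumably use the option constants
documented in utf8proc.h instead.

```c
#include <assert.h>
#include <string.h>

/* Stand-in flags named after the options listed in this README.
 * The numeric values are placeholders chosen for this sketch only;
 * they are NOT taken from utf8proc.h. */
enum {
  EX_STABLE    = 1 << 0,
  EX_COMPAT    = 1 << 1,
  EX_COMPOSE   = 1 << 2,
  EX_DECOMPOSE = 1 << 3
};

/* Return the option set for a normalization form given as "C", "D",
 * "KC" or "KD", or -1 if the name is not recognized. */
int ex_options_for_form(const char *form)
{
  assert(form != NULL);
  if (strcmp(form, "C") == 0)  return EX_STABLE | EX_COMPOSE;
  if (strcmp(form, "D") == 0)  return EX_STABLE | EX_DECOMPOSE;
  if (strcmp(form, "KC") == 0) return EX_STABLE | EX_COMPOSE | EX_COMPAT;
  if (strcmp(form, "KD") == 0) return EX_STABLE | EX_DECOMPOSE | EX_COMPAT;
  return -1;
}
```

The two compatibility forms differ from their canonical counterparts only by
the additional COMPAT option.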


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the
Unicode character given by the code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
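
To illustrate what such a conversion involves, here is a minimal C sketch
that encodes a single code point as UTF-8. It is not the code used by
utf8proc or by the ruby binding; the function name ex_encode_utf8 is made up
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into dst, which must provide
 * room for at least 4 bytes. Returns the number of bytes written, or 0
 * if cp is a surrogate or lies beyond U+10FFFF. */
size_t ex_encode_utf8(uint32_t cp, unsigned char *dst)
{
  assert(dst != NULL);
  if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
    dst[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
    dst[0] = (unsigned char)(0xC0 | (cp >> 6));
    dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp >= 0xD800 && cp <= 0xDFFF) {   /* surrogates are not encodable */
    return 0;
  }
  if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    dst[0] = (unsigned char)(0xE0 | (cp >> 12));
    dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx + 3 continuation bytes */
    dst[0] = (unsigned char)(0xF0 | (cp >> 18));
    dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For U+2028 this yields the three bytes 0xE2 0x80 0xA8, which is the string
"\342\200\250" shown above in octal notation.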


*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
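
To picture what folding means for the two comparisons mentioned above, here
is a toy C sketch. It only lowercases ASCII letters and drops the soft hyphen
(U+00AD, encoded in UTF-8 as the bytes C2 AD); it is a deliberately
simplified stand-in and not the algorithm behind "unifold" or utf8proc. The
function name ex_toy_fold is made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy fold: copy src to dst, lowercasing ASCII letters and skipping every
 * U+00AD SOFT HYPHEN (UTF-8 bytes 0xC2 0xAD). All other bytes are copied
 * unchanged. dst must hold at least strlen(src) + 1 bytes. Returns the
 * length of the folded, NUL-terminated string. */
size_t ex_toy_fold(const char *src, char *dst)
{
  size_t n = 0;
  assert(src != NULL && dst != NULL);
  while (*src != '\0') {
    unsigned char c = (unsigned char)*src;
    if (c == 0xC2 && (unsigned char)src[1] == 0xAD) {
      src += 2;                 /* drop the soft hyphen */
      continue;
    }
    dst[n++] = (c < 0x80) ? (char)tolower(c) : (char)c;
    src++;
  }
  dst[n] = '\0';
  return n;
}
```

With this sketch, "bath<soft hyphen>tub" and "bathtub" fold to the same
bytes, as do "Hello World" and "hello world". Real Unicode case folding and
the handling of other ignorable characters are considerably more involved.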

NOTICE: The outputs of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc