Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the Ruby
library, call "make ruby-library"; and to compile the PostgreSQL
extension, call "make pgsql-library".

For Ruby you can also create a gem file by calling "make ruby-gem".

"make all" builds everything, but requires both Ruby and PostgreSQL to be
installed.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" and "libutf8proc.so". The Ruby library consists of the
files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem file, it is placed in
the "ruby/gem" directory. The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not available at the time of implementation.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT
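
To get a feel for what the four forms produce, the following sketch uses
Ruby's built-in String#unicode_normalize as a stand-in. It does not call
utf8proc; the option lists in the comments are copied from the table above.
Ruby's built-in data may track a different Unicode version than 5.0.0, so
results for other inputs may differ.

```ruby
# Stand-in illustration with Ruby's standard library (not utf8proc).
# Comments name the utf8proc options that select the same form.

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI

nfc  = decomposed.unicode_normalize(:nfc)   # STABLE, COMPOSE
nfd  = nfc.unicode_normalize(:nfd)          # STABLE, DECOMPOSE
nfkc = ligature.unicode_normalize(:nfkc)    # STABLE, COMPOSE, COMPAT
nfkd = ligature.unicode_normalize(:nfkd)    # STABLE, DECOMPOSE, COMPAT

# Print the code points of the canonical forms and the compat results.
puts nfc.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts nfd.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts nfkc
puts nfkd
```

The canonical forms only compose or decompose the accent, while the COMPAT
forms additionally replace the ligature with the plain letters "fi".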


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will use for mapping UTF-8
strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in Ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
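
The octal escapes in the second example are the three UTF-8 bytes of
U+2028. The sketch below reproduces them with Ruby's standard Array#pack,
without loading utf8proc, so it only mirrors the output shown above rather
than exercising Integer#utf8 itself.

```ruby
# Encode code points as UTF-8 with the standard library (not utf8proc).
line_feed      = [0x000A].pack("U")
line_separator = [0x2028].pack("U")

# Render each byte in octal, matching the "\342\200\250" notation.
octal_bytes = line_separator.bytes.map { |b| format("%o", b) }

puts line_feed == "\n"
puts octal_bytes.join(" ")
```

The second line printed is "342 200 250", i.e. the bytes 0xE2 0x80 0xA8.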


*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
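
The difference between the two functions can be approximated in plain
Ruby. The helpers fold_approx and strip_approx below are hypothetical
stand-ins built on the standard library. They are not the SQL functions,
and the set of transformations they apply (case folding, compatibility
composition, dropping the soft hyphen and, for the strip variant, removing
nonspacing marks) is an assumption made for illustration only.

```ruby
SOFT_HYPHEN = "\u00AD"

# Hypothetical approximation of "unifold": drop soft hyphens, fold case,
# and normalize to NFKC. Accents are kept.
def fold_approx(text)
  text.delete(SOFT_HYPHEN).downcase(:fold).unicode_normalize(:nfkc)
end

# Hypothetical approximation of "unistrip": like fold_approx, but also
# removes nonspacing marks (accents, diaereses) after decomposing.
def strip_approx(text)
  decomposed = fold_approx(text).unicode_normalize(:nfd)
  decomposed.gsub(/\p{Mn}/, "").unicode_normalize(:nfkc)
end

puts fold_approx("bath\u00ADtub") == fold_approx("bathtub")      # true
puts fold_approx("Hello World")   == fold_approx("hello world")  # true
puts fold_approx("Caf\u00E9")     == fold_approx("cafe")         # false
puts strip_approx("Caf\u00E9")    == strip_approx("cafe")        # true
```

Only the strip variant treats the accented and unaccented spellings as
equal, which is the distinction described above.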

NOTICE: The outputs of the functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc