README @ 46b7f921

Last change on this file since 46b7f921 was 46b7f921, checked in by Ralf Kirchner <ralf.kirchner@…>, on 02/26/13 at 11:00:34

libmisc: Add utf8proc-v1.1.5

utf8proc is a small library for processing UTF-8 encoded Unicode strings.
Some features are Unicode normalization, stripping of default ignorable characters, case folding and detection of grapheme cluster boundaries.
For the time being utf8proc is intended to be used for normalizing and folding UTF-8 strings
for comparison purposes when adding UTF-8 support to the FAT file system.


Please read the LICENSE file, which is shipped with this software.


* QUICK START *

For compilation of the C library call "make c-library", for compilation of
the Ruby library call "make ruby-library" and for compilation of the
PostgreSQL extension call "make pgsql-library".

For Ruby you can also create a gem-file by calling "make ruby-gem".

"make all" can be used to build everything, but both Ruby and PostgreSQL
installations are required in this case.


* GENERAL INFORMATION *

The C library is found in this directory after successful compilation and
is named "libutf8proc.a" and "libutf8proc.so". The Ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem-file it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and are therefore not dependent on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
version 5.0.0 had not been available at the time of implementation.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT


* C LIBRARY *

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


* RUBY API *

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in Ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
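
As an illustration of what Integer#utf8 computes, the following C sketch
encodes a single code point as UTF-8. The function name "encode_utf8" is
hypothetical and not part of utf8proc; rejecting surrogates is an assumption
of this sketch, not documented behaviour of the Ruby method.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8proc): encode one code point as UTF-8.
 * Writes up to 4 bytes into dst and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t dst[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        dst[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xC0 | (cp >> 6));
        dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xE0 | (cp >> 12));
        dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx followed by 3 x 10xxxxxx */
        dst[0] = (uint8_t)(0xF0 | (cp >> 18));
        dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode code space */
}
```

With this helper, 0x2028 yields the three bytes E2 80 A8, which is the
"\342\200\250" shown above in octal notation.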


* POSTGRESQL API *

For PostgreSQL there are two SQL functions supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

  CREATE TABLE people (
    id serial8 primary key,
    name text,
    CHECK (unifold(name) NOTNULL)
  );
  CREATE INDEX name_idx ON people (unifold(name));
  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
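
The kind of equivalence that "unifold" aims for can be illustrated with a
deliberately simplified C sketch. It only lowercases ASCII letters and skips
U+00AD SOFT HYPHEN; the case folding and normalization done by utf8proc cover
far more of Unicode. The names "folded_equal", "fold_byte" and
"skip_soft_hyphens" are hypothetical and not part of utf8proc.

```c
#include <stddef.h>

/* Skip any run of U+00AD SOFT HYPHEN, encoded in UTF-8 as the bytes C2 AD.
 * Reading s[1] is safe: s[0] == 0xC2 means the terminator is further on. */
static const unsigned char *skip_soft_hyphens(const unsigned char *s)
{
    while (s[0] == 0xC2 && s[1] == 0xAD)
        s += 2;
    return s;
}

/* Lowercase ASCII letters only; all other bytes pass through unchanged. */
static unsigned char fold_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical, simplified stand-in for comparing unifold(a) = unifold(b).
 * Both arguments must be valid, NUL-terminated UTF-8 strings.
 * Returns 1 if they are equal after folding, 0 otherwise. */
int folded_equal(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (;;) {
        p = skip_soft_hyphens(p);
        q = skip_soft_hyphens(q);
        if (fold_byte(*p) != fold_byte(*q))
            return 0;
        if (*p == '\0')   /* both strings ended at the same point */
            return 1;
        p++;
        q++;
    }
}
```

Under these assumptions, folded_equal("bathtub", "bath\xC2\xADtub") and
folded_equal("Hello World", "hello world") both report equality, matching
the two examples given for "unifold".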

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.

NOTICE: The outputs of the functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


* TODO *

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


* CONTACT *

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc



source: rtems/cpukit/libmisc/utf8proc/README @ 46b7f921
