Re: Random string from selected Unicode character set (test data) [message #177386 is a reply to message #177359] |
Fri, 23 March 2012 05:36 |
gordonb.9boif
> I am implementing the script from generatedata.com
>
> But I would like for it to also display Unicode chars, so I can test
> other languages.
>
> I have looked everywhere, but can't seem to find a PHP function that
> lets me do something like
>
> $outstring .= $make_unicode_random('katakana');
>
> thereby selecting things from the code points U+30A0 .. U+30FF
First, get a function that takes a code point and generates the UTF-8
bytes for that code point. See below. I chose to build the result in
urlencode format (e.g. "%E3%82%A0" for U+30A0) and then urldecode()
it to a raw byte sequence. It may not be super-efficient, but it
works. Note that the code point must be passed as an integer. A hex
literal (e.g. 0x30A0) is the most convenient form because it matches
the U+ notation, but the equivalent decimal value works just as well.
UTF-8 encoding is essentially a lot of bit shifting and masking.
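
To make the shifting and masking concrete, here is a hand-worked sketch for the single code point U+30A0, the example mentioned above. It only covers the three-byte case (U+0800 .. U+FFFF); the variable names are made up for illustration.

```php
<?php
// Worked example: encode U+30A0 by hand.
// Code points in U+0800..U+FFFF use three bytes: 1110xxxx 10xxxxxx 10xxxxxx
$code = 0x30A0;

$byte1 = (($code >> 12) & 0x0f) | 0xe0; // top 4 bits    -> 0xE3
$byte2 = (($code >> 6) & 0x3f) | 0x80;  // middle 6 bits -> 0x82
$byte3 = ($code & 0x3f) | 0x80;         // low 6 bits    -> 0xA0

// Same urlencode-style intermediate form as described above.
$encoded = sprintf("%%%02X%%%02X%%%02X", $byte1, $byte2, $byte3);
echo $encoded, "\n";      // %E3%82%A0

// urldecode() turns the %XX escapes into the three raw bytes.
$raw = urldecode($encoded);
echo strlen($raw), "\n";  // 3
```
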
This function is great as a base for generating code charts (that
is, *ALL* of the characters in a certain range, in a table labelled
with the codepoint, to test out your fonts).
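
A code chart generator along those lines might look roughly like the sketch below. The names chart_encode() and code_chart() are hypothetical, and chart_encode() is a cut-down stand-in encoder (U+0000 .. U+FFFF only) so that the example runs on its own.

```php
<?php
// Stand-in encoder for this sketch only; handles U+0000..U+FFFF.
function chart_encode($code)
{
    if ($code <= 0x7f) {
        return chr($code);
    }
    if ($code <= 0x7ff) {
        return chr(0xc0 | ($code >> 6)) . chr(0x80 | ($code & 0x3f));
    }
    return chr(0xe0 | ($code >> 12))
         . chr(0x80 | (($code >> 6) & 0x3f))
         . chr(0x80 | ($code & 0x3f));
}

// Build a plain-text chart with 16 code points per row. Each row is
// labelled with the code point minus its last hex digit (e.g. "10_"),
// and the columns are labelled _0 .. _F. Whole rows are always printed,
// even if $first or $last is not a multiple of 16.
function code_chart($first, $last)
{
    $out = sprintf("UTF-8 Character set (U+%04X - U+%04X)\n", $first, $last);

    $cols = array();
    for ($c = 0; $c < 16; $c++) {
        $cols[] = sprintf("_%X", $c);
    }
    $out .= implode(' ', $cols) . "\n";

    for ($row = $first & ~0x0f; $row <= $last; $row += 16) {
        $cells = array(sprintf("%X_", $row >> 4));
        for ($c = 0; $c < 16; $c++) {
            $cells[] = chart_encode($row + $c);
        }
        $out .= implode(' ', $cells) . "\n";
    }
    return $out;
}

echo code_chart(0x0100, 0x01FF);
```

For a web page you would probably wrap the rows in an HTML table instead of plain text, and skip control characters and unassigned code points.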
Second, make some tables so that, given a language, you can determine
some code point range(s) which contain characters from that language.
The unicode.org tables of blocks of characters and what scripts
they belong to may be useful here.
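
Such a table could be sketched as a plain array. The key names and the ranges_for() helper are invented for illustration. The ranges are Unicode block boundaries (or the assigned part of a block), and blocks can contain unassigned code points, so check them against the unicode.org data before relying on them.

```php
<?php
// Hypothetical lookup table: name => list of array(first, last) ranges.
$unicode_ranges = array(
    'greek'    => array(array(0x0370, 0x03FF)),  // Greek and Coptic block
    'katakana' => array(array(0x30A0, 0x30FF)),  // Katakana block
    'hiragana' => array(array(0x3040, 0x309F)),  // Hiragana block
    'hangul'   => array(array(0xAC00, 0xD7A3)),  // assigned Hangul syllables
    'european_accents' => array(
        array(0x00C0, 0x00FF),  // Latin-1 letters (also contains U+00D7, U+00F7)
        array(0x0100, 0x017F),  // Latin Extended-A block
    ),
    // A broader group is just several ranges under one name.
    'japanese' => array(
        array(0x3040, 0x309F),  // Hiragana
        array(0x30A0, 0x30FF),  // Katakana
        array(0x4E00, 0x9FFF),  // CJK Unified Ideographs block
    ),
);

// Return the ranges for a name, or an empty array if the name is unknown.
function ranges_for($name, $table)
{
    return isset($table[$name]) ? $table[$name] : array();
}

printf("katakana starts at U+%04X\n", $unicode_ranges['katakana'][0][0]);
```
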
Third, make a function that picks some random code points from a
particular language range and outputs them.
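
One possible shape for that function is sketched below. random_unicode_string() and rnd_encode() are hypothetical names, rnd_encode() is a stand-in encoder limited to U+0000 .. U+FFFF, and mt_rand() is an arbitrary choice of random source. Each code point in the supplied ranges is equally likely, including any unassigned ones.

```php
<?php
// Stand-in encoder for this sketch only; handles U+0000..U+FFFF.
function rnd_encode($code)
{
    if ($code <= 0x7f) {
        return chr($code);
    }
    if ($code <= 0x7ff) {
        return chr(0xc0 | ($code >> 6)) . chr(0x80 | ($code & 0x3f));
    }
    return chr(0xe0 | ($code >> 12))
         . chr(0x80 | (($code >> 6) & 0x3f))
         . chr(0x80 | ($code & 0x3f));
}

// Return $length random characters drawn from a list of
// array(first, last) code point ranges.
function random_unicode_string($ranges, $length)
{
    // Count the code points across all ranges so that each one
    // has the same chance of being picked.
    $total = 0;
    foreach ($ranges as $r) {
        $total += $r[1] - $r[0] + 1;
    }
    if ($total < 1) {
        return '';
    }

    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $pick = mt_rand(0, $total - 1);
        // Walk the ranges to find the one that contains index $pick.
        foreach ($ranges as $r) {
            $size = $r[1] - $r[0] + 1;
            if ($pick < $size) {
                $out .= rnd_encode($r[0] + $pick);
                break;
            }
            $pick -= $size;
        }
    }
    return $out;
}

// 1..10 random characters from U+30A0 .. U+30FF
$katakana = array(array(0x30A0, 0x30FF));
echo random_unicode_string($katakana, mt_rand(1, 10)), "\n";
```
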
Be sure to declare the character set of your web page (if that's
where the output is going) as UTF-8 in the HTTP headers:
header('Content-type: text/html; charset=UTF-8');
> The strings could be defined more broadly like "japanese" etc for
> language separation. Not important.
>
> Point is to get a string of 1..n chars of a certain language group, be
> it Greek, European accents, Japanese, Mandarin, Hangul etc
<?php
/* Convert an integer code point to its UTF-8 byte sequence. */
function codepoint_to_utf8($code)
{
    if ($code < 0 || $code > 0x7fffffff) {
        /* cannot be encoded: return an empty string */
        $s = '';
    } else if ($code <= 0x7f) {
        $s = sprintf("%%%02X", $code);
    } else if ($code <= 0x7ff) {
        $s = sprintf("%%%02X%%%02X",
            ((($code >> 6) & 0x1f)) | 0xc0,
            ((($code >> 0) & 0x3f)) | 0x80
        );
    } else if ($code <= 0xffff) {
        $s = sprintf("%%%02X%%%02X%%%02X",
            ((($code >> 12) & 0x0f)) | 0xe0,
            ((($code >> 6) & 0x3f)) | 0x80,
            ((($code >> 0) & 0x3f)) | 0x80
        );
    } else if ($code <= 0x1fffff) {
        /* legal Unicode ends at U+10FFFF */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 18) & 0x07)) | 0xf0,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >> 6) & 0x3f)) | 0x80,
            ((($code >> 0) & 0x3f)) | 0x80
        );
    } else if ($code <= 0x3ffffff) {
        /* actually, this is beyond legal Unicode (5 bytes) */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 24) & 0x03)) | 0xf8,
            ((($code >> 18) & 0x3f)) | 0x80,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >> 6) & 0x3f)) | 0x80,
            ((($code >> 0) & 0x3f)) | 0x80
        );
    } else {
        /* actually, this is beyond legal Unicode (6 bytes) */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X%%%02X",
            ((($code >> 30) & 0x01)) | 0xfc,
            ((($code >> 24) & 0x3f)) | 0x80,
            ((($code >> 18) & 0x3f)) | 0x80,
            ((($code >> 12) & 0x3f)) | 0x80,
            ((($code >> 6) & 0x3f)) | 0x80,
            ((($code >> 0) & 0x3f)) | 0x80
        );
    }
    return urldecode($s);
}
?>
UTF-8 Character set (U+0100 - U+01FF)
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
10_ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď
11_ Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ
12_ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į
13_ İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ
14_ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ
15_ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş
16_ Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů
17_ Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ
18_ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə
19_ Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ
1A_ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ Ʈ Ư
1B_ ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ
1C_ ǀ ǁ ǂ ǃ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ Ǎ ǎ Ǐ
1D_ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ
1E_ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ
1F_ ǰ Ǳ ǲ ǳ Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ