FUDforum: comp.lang.php » Random string from selected Unicode character set (test data)

Re: Random string from selected Unicode character set (test data) [message #177386 is a reply to message #177359]

Fri, 23 March 2012 05:36

gordonb.9boif

> I am implementing the script from generatedata.com
>
> But I would like for it to also display Unicode chars, so I can test
> other languages.
>
> I have looked everywhere, but can't seem to find a PHP function that
> lets me do something like
>
> $outstring .= $make_unicode_random('katakana');
>
> thereby selecting things from the code points U+30A0 .. U+30FF

First, get a function that takes a code point and generates the UTF-8
bytes for that code point. See below. I chose to build the bytes in
urlencode format (e.g. "%E3%82%A0" for U+30A0) and then urldecode()
them to a raw byte sequence. It may not be super-efficient but it
works. Note that the code point must be passed as an integer. A hex
literal (e.g. 0x30A0) matches the U+ notation most directly, although
a decimal literal works also. UTF-8 encoding is essentially a lot of
bit shifting and masking.
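
To make the shifting and masking concrete, here is a minimal sketch that works through the three-byte case for U+30A0 by hand. The variable names are only for illustration. The arithmetic mirrors the three-byte branch of the function posted further down, but uses chr() directly instead of the urlencode round trip.

```php
<?php
// Worked example: U+30A0 lies in 0x0800..0xFFFF, so it needs three bytes
// laid out as 1110xxxx 10xxxxxx 10xxxxxx.
$code = 0x30A0;

$b1 = 0xE0 | (($code >> 12) & 0x0F); // top 4 bits    -> 0xE3
$b2 = 0x80 | (($code >> 6) & 0x3F);  // middle 6 bits -> 0x82
$b3 = 0x80 | ($code & 0x3F);         // low 6 bits    -> 0xA0

$utf8 = chr($b1) . chr($b2) . chr($b3);

echo strtoupper(bin2hex($utf8)), "\n"; // prints E382A0
echo rawurlencode($utf8), "\n";        // prints %E3%82%A0
```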

This function is great as a base for generating code charts (that
is, *ALL* of the characters in a certain range, in a table labelled
with the codepoint, to test out your fonts).
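
A code chart along those lines might be generated roughly as follows. This is only a sketch: code_chart() is a made-up name, and it assumes PHP 7.2 or later with the mbstring extension, so that mb_chr() can stand in for the encoder function. It prints every code point as-is, so control characters and unassigned code points will show up as gaps or boxes.

```php
<?php
// Hypothetical helper: build a plain-text chart of every code point in
// [$first, $last], 16 per row, each row labelled with its leading hex digits.
// Assumes PHP >= 7.2 with mbstring (for mb_chr).
function code_chart(int $first, int $last): string
{
    $out = sprintf("UTF-8 Character set (U+%04X - U+%04X)\n", $first, $last);

    // Column header: the last hex digit of the code point.
    $out .= '    ';
    for ($col = 0; $col < 16; $col++) {
        $out .= sprintf(' _%X', $col);
    }
    $out .= "\n";

    // Start at a multiple of 16 so columns line up with the last hex digit.
    for ($row = $first & ~0xF; $row <= $last; $row += 16) {
        $out .= sprintf('%3X_', $row >> 4);
        for ($col = 0; $col < 16; $col++) {
            $cp = $row + $col;
            $inRange = ($cp >= $first && $cp <= $last);
            $out .= '  ' . ($inRange ? mb_chr($cp, 'UTF-8') : ' ');
        }
        $out .= "\n";
    }
    return $out;
}

echo code_chart(0x0100, 0x01FF);
```

For a web page, the same loops could emit table cells instead of padded text.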

Second, make some tables so that, given a language, you can determine
some code point range(s) which contain characters from that language.
The unicode.org tables of character blocks (Blocks.txt) and of the
scripts that code points belong to (Scripts.txt) may be useful here.
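
Such a table might start out like the sketch below. The array name $script_ranges, the keys, and range_size() are assumptions for illustration. The ranges follow Unicode block boundaries, and most blocks contain a few unassigned code points, so expect the occasional missing glyph. Each entry is a list of [first, last] pairs because some languages need more than one range.

```php
<?php
// Illustrative lookup table: name => list of [first, last] code point ranges.
// Ranges follow Unicode block boundaries; some code points inside a block
// may be unassigned.
$script_ranges = [
    'latin-accents' => [[0x00C0, 0x00FF], [0x0100, 0x017F]], // Latin-1 upper half (includes × and ÷), Latin Extended-A
    'greek'         => [[0x0370, 0x03FF]],                   // Greek and Coptic
    'hiragana'      => [[0x3040, 0x309F]],
    'katakana'      => [[0x30A0, 0x30FF]],
    'hangul'        => [[0xAC00, 0xD7A3]],                   // precomposed Hangul syllables
    'cjk'           => [[0x4E00, 0x9FFF]],                   // CJK Unified Ideographs
];

// A language can combine several scripts.
$script_ranges['japanese'] = array_merge(
    $script_ranges['hiragana'],
    $script_ranges['katakana'],
    $script_ranges['cjk']
);

// Count how many code points a list of ranges covers.
function range_size(array $ranges): int
{
    $n = 0;
    foreach ($ranges as [$first, $last]) {
        $n += $last - $first + 1;
    }
    return $n;
}

printf("katakana covers %d code points\n", range_size($script_ranges['katakana'])); // 96
```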

Third, make a function that picks some random code points from a
particular language range and outputs them.
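
The third step could look something like this sketch. random_unicode_string() and its parameters are hypothetical names. It takes the ranges as [first, last] pairs and assumes PHP 7.2 or later with the mbstring extension, using mb_chr() as the encoder and random_int() as the random source. Ranges are weighted by size so that every code point is equally likely.

```php
<?php
// Hypothetical helper: return $length random characters drawn uniformly
// from the given list of [first, last] code point ranges.
// Assumes PHP >= 7.2 with mbstring (mb_chr); random_int needs PHP >= 7.0.
function random_unicode_string(array $ranges, int $length): string
{
    // Total number of code points across all ranges.
    $total = 0;
    foreach ($ranges as [$first, $last]) {
        $total += $last - $first + 1;
    }
    if ($total < 1) {
        throw new InvalidArgumentException('no code points to choose from');
    }

    $out = '';
    for ($i = 0; $i < $length; $i++) {
        // Pick an index into the concatenation of all ranges, then walk
        // the ranges to find the one that index falls into.
        $pick = random_int(0, $total - 1);
        foreach ($ranges as [$first, $last]) {
            $size = $last - $first + 1;
            if ($pick < $size) {
                $out .= mb_chr($first + $pick, 'UTF-8');
                break;
            }
            $pick -= $size;
        }
    }
    return $out;
}

// 1..10 random katakana characters (U+30A0 .. U+30FF).
echo random_unicode_string([[0x30A0, 0x30FF]], random_int(1, 10)), "\n";
```

For repeatable test data, mt_srand() with mt_rand() could replace random_int(), at the cost of weaker randomness.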

Be sure to declare the character set of your web page (if that's
where the output is going) as UTF-8 in the HTTP headers:

header('Content-type: text/html; charset=UTF-8');

> The strings could be defined more broadly like "japanese" etc for
> language separation. Not important.
>
> Point is to get a string of 1..n chars of a certain language group, be
> it Greek, European accents, Japanese, Mandarin, Hangul etc

<?php

/* Encode an integer code point as a UTF-8 byte string. The bytes are
   built in urlencode form (e.g. "%E3%82%A0") and then urldecode()d.
   Returns '' for values outside 0 .. 0x7fffffff. */
function codepoint_to_utf8($code)
{
    if ($code < 0 || $code > 0x7fffffff) {
        return '';
    } else if ($code <= 0x7f) {
        /* 1 byte: 0xxxxxxx */
        $s = sprintf("%%%02X", $code);
    } else if ($code <= 0x7ff) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        $s = sprintf("%%%02X%%%02X",
            (($code >> 6) & 0x1f) | 0xc0,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0xffff) {
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        $s = sprintf("%%%02X%%%02X%%%02X",
            (($code >> 12) & 0x0f) | 0xe0,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0x1fffff) {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X",
            (($code >> 18) & 0x07) | 0xf0,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0x3ffffff) {
        /* 5 bytes: actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X",
            (($code >> 24) & 0x03) | 0xf8,
            (($code >> 18) & 0x3f) | 0x80,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else {
        /* 6 bytes: actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X%%%02X",
            (($code >> 30) & 0x01) | 0xfc,
            (($code >> 24) & 0x3f) | 0x80,
            (($code >> 18) & 0x3f) | 0x80,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    }
    return urldecode($s);
}
?>

UTF-8 Character set (U+0100 - U+01FF)
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F

10_ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď
11_ Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ
12_ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į
13_ İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ
14_ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ
15_ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş
16_ Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů
17_ Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ
18_ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə
19_ Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ
1A_ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ Ʈ Ư
1B_ ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ
1C_ ǀ ǁ ǂ ǃ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ Ǎ ǎ Ǐ
1D_ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ
1E_ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ
1F_ ǰ Ǳ ǲ ǳ Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ

[Message index]

		Random string from selected Unicode character set (test data) By: Horst Lemminger on Sat, 17 March 2012 19:50
		Re: Random string from selected Unicode character set (test data) By: The Natural Philosoph on Sun, 18 March 2012 00:30
		Re: Random string from selected Unicode character set (test data) By: gordonb.9boif on Fri, 23 March 2012 05:36
