Random string from selected Unicode character set (test data) [message #177359]
Sat, 17 March 2012 19:50
Horst Lemminger
I am implementing the script from generatedata.com
But I would like for it to also display Unicode chars, so I can test
other languages.
I have looked everywhere, but can't seem to find a PHP function that
lets me do something like
$outstring .= $make_unicode_random('katakana');
thereby selecting things from the code points U+30A0 .. U+30FF
etc
The strings could be defined more broadly like "japanese" etc for
language separation. Not important.
Point is to get a string of 1..n chars of a certain language group, be
it Greek, European accents, Japanese, Mandarin, Hangul etc
I have found a project "babel" which is a .NET application. Not sure
if it's open sourced.
Anyone have some pointers to this project ?
Re: Random string from selected Unicode character set (test data) [message #177361 is a reply to message #177359]
Sun, 18 March 2012 00:30
The Natural Philosoph
Horst Lemminger wrote:
> I am implementing the script from generatedata.com
>
> But I would like for it to also display Unicode chars, so I can test
> other languages.
>
> I have looked everywhere, but can't seem to find a PHP function that
> lets me do something like
>
> $outstring .= $make_unicode_random('katakana');
>
> thereby selecting things from the code points U+30A0 .. U+30FF
>
> etc
>
I would simply make an indexed array of all the characters you want to
select from, and then randomly index into that.
I've done similar in the past by writing a program to write the source
code of a large lookup table.
Sometimes the 'table' approach is just simpler (and much faster) than an
algorithmic approach unless you are REALLY strapped for code/static
memory, which is unlikely to be the case with a typical LAMP type
installation.
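A minimal sketch of the table approach, for illustration only. The table contents and the name random_from_table() are made up for this example, and the source file is assumed to be saved as UTF-8 so that each array element holds one complete character:

```php
<?php
// Hypothetical hand-made table: a few Greek lowercase letters.
// Extend it with whatever characters you want to draw from.
$greek = ['α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ'];

// Build a string of $length characters by indexing randomly into $table.
function random_from_table(array $table, int $length): string
{
    $out  = '';
    $last = count($table) - 1;
    for ($i = 0; $i < $length; $i++) {
        $out .= $table[mt_rand(0, $last)];
    }
    return $out;
}

echo random_from_table($greek, 8), "\n";
```

Because every element is already a complete UTF-8 string, no encoding step is needed at run time; the cost is keeping the table in memory.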
> The strings could be defined more broadly like "japanese" etc for
> language separation. Not important.
>
Use separate tables for each character set.
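One way to organise that, sketched under assumptions: the set names, the sample characters, and the function name random_char() are all invented for this example, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical per-set tables, keyed by a name of your own choosing.
$tables = [
    'greek'    => ['α', 'β', 'γ', 'δ', 'ε'],
    'katakana' => ['ア', 'イ', 'ウ', 'エ', 'オ'],
    'accents'  => ['é', 'ü', 'ñ', 'ø', 'ç'],
];

// Return one random character from the named table.
function random_char(array $tables, string $set): string
{
    if (!array_key_exists($set, $tables)) {
        throw new InvalidArgumentException("Unknown character set: $set");
    }
    // array_rand() returns a random key of the chosen table.
    return $tables[$set][array_rand($tables[$set])];
}

echo random_char($tables, 'katakana'), "\n";
```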
> Point is to get a string of 1..n chars of a certain language group, be
> it Greek, European accents, Japanese, Mandarin, Hangul etc
>
> I have found a project "babel" which is a .NET application. Not sure
> if it's open sourced.
>
> Anyone have some pointers to this project ?
>
Not me
--
To people who know nothing, anything is possible.
To people who know too much, it is a sad fact
that they know how little is really possible -
and how hard it is to achieve it.
Re: Random string from selected Unicode character set (test data) [message #177386 is a reply to message #177359]
Fri, 23 March 2012 05:36
gordonb.9boif
> I am implementing the script from generatedata.com
>
> But I would like for it to also display Unicode chars, so I can test
> other languages.
>
> I have looked everywhere, but can't seem to find a PHP function that
> lets me do something like
>
> $outstring .= $make_unicode_random('katakana');
>
> thereby selecting things from the code points U+30A0 .. U+30FF
First, get a function that takes a code point and generates the UTF-8
bytes for that code point. See below. I chose to put the data in
urlencode format (e.g. "%E3%82%A0" for U+30A0) and then urldecode()
it to a raw byte sequence. It may not be super-efficient but it
works. Note that the code point has to be passed as an integer: a hex
literal (e.g. 0x30A0) is the most convenient, although decimal works
also. UTF-8 encoding is essentially a lot of bit shifting and masking.
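To make the shifting and masking concrete, here is a worked example for the three-byte case only, using U+30A0 from the question. The variable names are just for illustration:

```php
<?php
$code = 0x30A0;                          // 0011 0000 1010 0000 in binary

$b1 = (($code >> 12) & 0x0F) | 0xE0;     // top 4 bits    -> 1110xxxx
$b2 = (($code >> 6)  & 0x3F) | 0x80;     // middle 6 bits -> 10xxxxxx
$b3 = ( $code        & 0x3F) | 0x80;     // low 6 bits    -> 10xxxxxx

printf("%%%02X%%%02X%%%02X\n", $b1, $b2, $b3);   // prints %E3%82%A0
```

The lead byte carries the length marker (0xE0 for three bytes) and every continuation byte carries the 0x80 marker plus six payload bits.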
This function is great as a base for generating code charts (that
is, *ALL* of the characters in a certain range, in a table labelled
with the codepoint, to test out your fonts).
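A code chart generator along those lines might look like the sketch below. It is not the code that produced the chart in this post: the name code_chart() is invented, and it assumes mb_chr() from the mbstring extension (PHP 7.2 or later) as the code point encoder, which did not exist when this thread was written.

```php
<?php
// Return a plain-text chart of the code points $first..$last, 16 per row.
// Rows are always printed whole, so the range is widened to multiples of 16.
// Assumption: mb_chr() (mbstring, PHP 7.2+) does the UTF-8 encoding.
function code_chart(int $first, int $last): string
{
    // Column header: _0 _1 ... _F
    $out = '   ';
    for ($col = 0; $col < 16; $col++) {
        $out .= sprintf(' _%X', $col);
    }
    $out .= "\n";

    for ($row = $first & ~0xF; $row <= $last; $row += 16) {
        // Row label: the code point without its last hex digit, e.g. "10_".
        $out .= sprintf('%X_', $row >> 4);
        for ($col = 0; $col < 16; $col++) {
            $out .= '  ' . mb_chr($row + $col, 'UTF-8');
        }
        $out .= "\n";
    }
    return $out;
}

echo code_chart(0x0100, 0x01FF);
```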
Second, make some tables so that, given a language, you can determine
some code point range(s) which contain characters from that language.
The unicode.org tables of blocks of characters and what scripts
they belong to may be useful here.
Third, make a function that picks some random code points from a
particular language range and outputs them.
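The second and third steps could be sketched together like this. Everything named here is an assumption for illustration: the keys of SCRIPT_RANGES, the function name random_script_string(), and the use of mb_chr() (mbstring, PHP 7.2+) in place of a hand-written encoder. The ranges are a small sample; check them against the unicode.org block list before relying on them.

```php
<?php
// Illustrative code point ranges per "language" key (inclusive bounds).
const SCRIPT_RANGES = [
    'greek'    => [[0x0391, 0x03A1], [0x03A3, 0x03A9], [0x03B1, 0x03C9]],
    'katakana' => [[0x30A0, 0x30FF]],
    'hangul'   => [[0xAC00, 0xD7A3]],
];

// Return $length random characters from the ranges listed for $script.
// Note: a range is picked first, then a code point inside it, so the
// distribution is not uniform when the ranges differ in size.
function random_script_string(string $script, int $length): string
{
    if (!array_key_exists($script, SCRIPT_RANGES)) {
        throw new InvalidArgumentException("Unknown script: $script");
    }
    $ranges = SCRIPT_RANGES[$script];
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        [$lo, $hi] = $ranges[array_rand($ranges)];
        $out .= mb_chr(random_int($lo, $hi), 'UTF-8');
    }
    return $out;
}

echo random_script_string('katakana', 10), "\n";
```

Listing several sub-ranges per key, as done for Greek, is one way to skip unassigned code points inside a block.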
Be sure to declare the character set of your web page (if that's
where the output is going) as UTF-8 in the HTTP headers:
header('Content-type: text/html; charset=UTF-8');
> The strings could be defined more broadly like "japanese" etc for
> language separation. Not important.
>
> Point is to get a string of 1..n chars of a certain language group, be
> it Greek, European accents, Japanese, Mandarin, Hangul etc
<?php
/*
 * Return the UTF-8 byte sequence for an integer code point.
 * The bytes are first written in urlencode format (e.g. "%E3%82%A0"
 * for U+30A0) and then urldecode()d to raw bytes.
 * Returns an empty string for values that cannot be encoded.
 */
function codepoint_to_utf8($code)
{
    if ($code < 0) {
        return '';
    }
    if ($code <= 0x7f) {
        /* 1 byte: 0xxxxxxx */
        $s = sprintf("%%%02X", $code);
    } else if ($code <= 0x7ff) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        $s = sprintf("%%%02X%%%02X",
            (($code >> 6) & 0x1f) | 0xc0,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0xffff) {
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        $s = sprintf("%%%02X%%%02X%%%02X",
            (($code >> 12) & 0x0f) | 0xe0,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0x1fffff) {
        /* 4 bytes; legal Unicode stops at 0x10ffff */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X",
            (($code >> 18) & 0x07) | 0xf0,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0x3ffffff) {
        /* 5 bytes; actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X",
            (($code >> 24) & 0x03) | 0xf8,
            (($code >> 18) & 0x3f) | 0x80,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else if ($code <= 0x7fffffff) {
        /* 6 bytes; actually, this is beyond legal Unicode */
        $s = sprintf("%%%02X%%%02X%%%02X%%%02X%%%02X%%%02X",
            (($code >> 30) & 0x01) | 0xfc,
            (($code >> 24) & 0x3f) | 0x80,
            (($code >> 18) & 0x3f) | 0x80,
            (($code >> 12) & 0x3f) | 0x80,
            (($code >> 6) & 0x3f) | 0x80,
            (($code >> 0) & 0x3f) | 0x80
        );
    } else {
        return '';
    }
    return urldecode($s);
}
?>
UTF-8 Character set (U+0100 - U+01FF)
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
10_ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď
11_ Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ
12_ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į
13_ İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ
14_ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ
15_ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş
16_ Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů
17_ Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ
18_ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə
19_ Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ
1A_ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ Ʈ Ư
1B_ ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ
1C_ ǀ ǁ ǂ ǃ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ Ǎ ǎ Ǐ
1D_ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ
1E_ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ
1F_ ǰ Ǳ ǲ ǳ Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ