n.n. #1
Unicode characters above FFFF
How can I generate and display on a page Unicode characters based on their Unicode
scalar values?
I know the ColdFusion function chr() works very well for characters up to 65535
(U+FFFF), but how can I render Unicode characters that have a scalar value
greater than FFFF, for example U+20000?
The Unicode website says U+20000 must be represented in UTF-8 by the byte sequence
F0 A0 80 80. I tried applying that to the BinaryEncode and CharsetEncode functions, but
can't make it display the correct ideograph.
Please help!
n.n. Guest
-
PaulH #2
Re: Unicode characters above FFFF
in general the int/long codepoint works w/chr (in mx). i can for instance do
something like the code below. what gets rendered depends on the font (boxes vs
chars). what gets garbaged depends (i think) on the version of unicode in the
JDK cf is using.
what subrange are you after? do you know what version of unicode it's in? is
it the private use stuff?
<cfoutput>
<cfloop index="i" from="194586" to="194686">
<font face="Arial Unicode MS">#chr(i)#</font>
</cfloop>
</cfoutput>
PaulH Guest
-
n.n. #3
Re: Unicode characters above FFFF
I am trying to display all Chinese characters, including the ranges that are above
U+FFFF:
U+2F800-U+2FA1F (CJK compatibility ideographs Supplement);
U+20000-U+2A6DF (CJK Unified ideographs Extension B)
The version of Unicode is 4.1
Also, I noticed that using a number above 65535 as the argument to chr() does
not throw an error and even displays characters, but it actually displays a
character from inside the 0-65535 range. So if n = 131073, chr(n) will show the
same as chr(1); it appears the logic in this case is: chr(n)=chr(n-65535*2).
In other words, if the argument to chr() is greater than 65535, the function
deducts 65535 from the argument until it falls into the range 0-65535. This can be
verified by calling asc(chr(i)) in your example.
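One possible explanation for the wrap-around, sketched in Java because ColdFusion MX runs on the JVM: if chr() ends up storing its result in a single 16-bit Java char, only the low 16 bits of the argument survive, which is a reduction modulo 65536 (not 65535). That matches chr(131073) showing the same character as chr(1), since 131073 - 2*65536 = 1. The class and method names below are made up for illustration, and the idea that chr() narrows to a char is an assumption, not something the documentation confirms.

```java
// Sketch: narrowing an int code point to a 16-bit Java char keeps only the low 16 bits.
class CharNarrowingDemo {
    // Hypothetical stand-in for a chr() that stores its result in one 16-bit char.
    static char narrow(int codePoint) {
        return (char) codePoint; // drops everything above bit 15, i.e. codePoint mod 65536
    }

    public static void main(String[] args) {
        int n = 131073; // 0x20001
        System.out.println((int) narrow(n));        // 1
        System.out.println(narrow(n) == narrow(1)); // true
        System.out.println(n % 65536);              // 1
    }
}
```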
n.n. Guest
-
PaulH #4
Re: Unicode characters above FFFF
sorry i'm not seeing that. the snippet below shows different chars not the same
ones. but i see your point, the docs for asc and chr both point to ucs-2
(65535).
are you saying that those chars are new to unicode version 4.1? i'll double
check but sounds strange.
i guess my next question is: is there any reason why you're doing it this way?
seems rather clumsy. for i18n app text (chinese or other languages) we always
use resource bundles created using ibm's rb manager. can you explain your logic?
<font face="Arial Unicode MS">
<cfoutput>
<cfloop index="i" from="194586" to="194686">
<cfset n=i-65535*2>
#chr(i)# ::: #chr(n)#<br>
</cfloop>
</cfoutput>
</font>
PaulH Guest
-
PaulH #5
Re: Unicode characters above FFFF
i've been poking around a bit more. see if the output from this code makes any
sense to you.
<cfset ub=createObject("java","java.lang.Character$UnicodeBlock")>
<cfset charObj=createObject("java","java.lang.Character")>
<table border="1" cellspacing="2" cellpadding="2">
<tr align="center" bgcolor="#E8EFCB">
<td bgcolor="#E8EFCB"><b>codepoint</b></td>
<td bgcolor="#E8EFCB"><b>java char</b></td>
<td bgcolor="#E8EFCB"><b>cf char</b></td>
<td bgcolor="#E8EFCB">unicode block</td>
</tr>
<cfoutput>
<cfloop index="i" from="65535" to="300000" step="1000">
<cfset thisChar=charObj.init(javacast("int",i))>
<cfset cj=thisChar.toString()>
<cfset cf=chr(javacast("int",i))>
<tr><td>#i# </td><td>#cj# </td><td>#cf# </td><td>#ub.of(thisChar)# </td></tr>
</cfloop>
</cfoutput>
</table>
PaulH Guest
-
n.n. #6
Re: Unicode characters above FFFF
Paul,
Thanks for your help. I am trying to build a dictionary where you can search for a
character by number of strokes, radical, etc., based on the Unicode database.
This database ( ) references characters by scalar value only.
Attached code shows the main ranges in question, and the limitation of chr()
function, or rather its compliance with a Unicode standard, unfortunately not
4.1.
One thing I discovered is that Arial Unicode MS font does not include some of
the CJK ranges, specifically Extension B, and some parts of other extensions.
In MSWord help it is said to be Unicode 2.0 compliant, and some CJK ranges were
not added till version 3.0.
I also dumped the java.lang.Character$UnicodeBlock object and I see that CJK
Extension B is not there either.
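For reference, a minimal Java sketch of a block lookup that takes the code point as an int instead of a java.lang.Character object. A Character wraps a single 16-bit char, so it cannot hold anything above U+FFFF. The int overload of Character.UnicodeBlock.of and the Extension B block constant were added in JDK 1.5, so this assumes a newer JVM than the one discussed here; the class and method names are made up for illustration.

```java
// Sketch: look up the Unicode block of a code point given as an int (needs JDK 1.5 or later).
class BlockLookupDemo {
    // Returns the block's identifier name, or a placeholder when no block is defined.
    static String blockName(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "(no block)" : block.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x5400, 0x20000, 0x2F800};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase() + " -> " + blockName(cp));
        }
    }
}
```

On such a JVM, U+20000 should report the CJK Unified Ideographs Extension B block and U+2F800 the CJK Compatibility Ideographs Supplement block.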
The Unicode website has very nice Adobe Acrobat charts of all characters, but
the contents cannot be copied or extracted. The font used to create these
charts contains up to 100,000 characters, is available from a Chinese company,
and costs a small fortune.
I guess the real limitation is not even the chr function, but the available
fonts. So I will have to go with characters available in Arial or use images
instead of chr().
<cfoutput>
<table border="2">
<tr>
<th colspan="6">CJK Unified ideographs, Range: 4E00-9FBF</th>
</tr>
<tr>
<th>Hex</th>
<th>Dec</th>
<th>Dec Chr</th>
<th>Dec + 65536</th>
<th>Dec + 65536*2</th>
<th>Dec + 65536*3</th>
</tr>
<cfloop from="#inputBaseN('5400',16)#" to="#inputbaseN('5fbe',16)#"
step="501" index="a">
<tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
<td>#formatBaseN(a,16)#</td>
<td>#a#</td>
<td>chr(#a#)::#chr(a)#</td>
<td>chr(#val(a+65536)#)::#chr(val(a+65536))#</td>
<td>chr(#val(a+65536*2)#)::#chr(val(a+65536*2))# </td>
<td>chr(#val(a+65536*3)#)::#chr(val(a+65536*3))# </td>
</tr>
</cfloop>
<tr>
<th colspan="6">CJK Unified ideographs Extension A, Range: 3400-4DBF</th>
</tr>
<tr>
<th>Hex</th>
<th>Dec</th>
<th>Dec Chr</th>
<th>Dec + 65536</th>
<th>Dec + 65536*2</th>
<th>Dec + 65536*3</th>
</tr>
<cfloop from="#inputBaseN('3400',16)#" to="#inputbaseN('3bd4',16)#"
step="501" index="a">
<tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
<td>#formatBaseN(a,16)#</td>
<td>#a#</td>
<td>chr(#a#)::#chr(a)#</td>
<td>chr(#val(a+65536)#)::#chr(val(a+65536))#</td>
<td>chr(#val(a+65536*2)#)::#chr(val(a+65536*2))# </td>
<td>chr(#val(a+65536*3)#)::#chr(val(a+65536*3))# </td>
</tr>
</cfloop>
<tr>
<th colspan="6">CJK Unified ideographs Extension B, Range: 20000 - 2A6DF</th>
</tr>
<tr>
<th>Hex</th>
<th>Dec</th>
<th>Dec Chr</th>
<th>MOD 65536</th>
<th></th>
<th></th>
</tr>
<cfloop from="#inputBaseN('24E48',16)#" to="#inputbaseN('24f47',16)#"
step="51" index="a">
<tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
<td>#formatBaseN(a,16)#</td>
<td>#a#</td>
<td>chr(#a#)::#chr(a)#</td>
<td>chr(#val(a MOD 65536)#)::#chr(val(a MOD 65536))#</td>
<td></td>
<td></td>
</tr>
</cfloop>
<tr>
<th colspan="6">CJK Compatibility Ideographs Supplement, Range: 2F800 - 2FA1F</th>
</tr>
<tr>
<th>Hex</th>
<th>Dec</th>
<th>Dec Chr</th>
<th>MOD 65536</th>
<th></th>
<th></th>
</tr>
<cfloop from="#inputBaseN('2f932',16)#" to="#inputbaseN('2FA1F',16)#"
step="51" index="a">
<tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
<td>#formatBaseN(a,16)#</td>
<td>#a#</td>
<td>chr(#a#)::#chr(a)#</td>
<td>chr(#val(a MOD 65536)#)::#chr(val(a MOD 65536))#</td>
<td></td>
<td></td>
</tr>
</cfloop>
</table>
</cfoutput>
n.n. Guest
-
PaulH #7
Re: Unicode characters above FFFF
sounds like an interesting app. the db contains strokes, etc, metadata? i
wonder if icu4j has anything that might help (i use it a lot for i18n work).
let me dig around some more.
unfortunately even JDK 1.5 (which i think mx 7 doesn't quite support yet) is
only at unicode 4.0. are you sure those chars are at 4.1?
PaulH Guest
-
PaulH #8
Re: Unicode characters above FFFF
might as well have a look at this: http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/lang/UCharacter.html
PaulH Guest
-
n.n. #9
Re: Unicode characters above FFFF
Yes, UCharacter is a cool thing! It has all the metadata for these characters.
I am going to play with it.
Though now I realize that the real issue is displaying the characters
that are not included in Arial Unicode MS and SimSun.
I can find out how they are pronounced, their byte sequence, their meaning,
etc., but without the font by "Beijing Zhong Yi (Zheng Code)" it does not make
sense, because it appears to be the only font that can render those.
(http://www.unicode.org/charts/fonts.html)
(http://www.china-e.com.cn/en/zyfont/zyfont.htm)
Yes, Unicode 4.1 is the latest version, but it does not differ significantly
from 4.0.
Most or all of CJK Extension B was added in 3.1, I think.
n.n. Guest
-
PaulH #10
Re: Unicode characters above FFFF
i'm not 100% sure its the fonts. i'm getting ? in some subranges rather than
any rendering problems (boxes, mojibake, etc) which usually means garbaged data
(which might be unicode version differences) . i'm not too familiar w/that font
except maybe having read it was quite expensive.
maybe you should put this question to the unicode list
(http://www.unicode.org/consortium/distlist.html)? lots of smart people there
(well smarter than me anyways).
PaulH Guest
-
n.n. #11
Re: Unicode characters above FFFF
You are right, it is not just the fonts.
What I meant is that even if there is a way in ColdFusion to generate a 4-byte
sequence that will be correctly interpreted by the browser as a Unicode character,
say U+2000A, we can't be sure we did it, because there's no font that has that
character and it won't be rendered anyway.
But of course the primary issue is: does CF allow generating correct byte
sequences?
Trying to generate the byte sequence directly, I was not very successful.
I create a binary object with the BinaryDecode function and dump it; it looks
somewhat different from what I'd expect. For example, for U+5400 the UTF-8 byte
sequence is E5 90 80, but the dump of binaryDecode('E5 90 80','hex') looks like "840".
Then, using CharsetEncode(binaryobject, "utf-8") on that binary object does not
produce good output, even though chr(21504) renders the correct
character, i.e. the font has it.
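As a cross-check on the byte sequences themselves, here is a small Java sketch that decodes a hex string as UTF-8 and encodes a code point back to hex. It says nothing about how BinaryDecode or CharsetEncode behave inside ColdFusion; the class and helper names are invented for the example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Sketch: round-trip between UTF-8 byte sequences (written as hex) and code points.
class Utf8BytesDemo {
    // Parse a hex string such as "E5 90 80" into bytes; spaces are ignored.
    static byte[] hexToBytes(String hex) {
        String clean = hex.replace(" ", "");
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decode a hex-encoded UTF-8 byte sequence into a Java string.
    static String decodeUtf8(String hex) {
        return new String(hexToBytes(hex), StandardCharsets.UTF_8);
    }

    // Encode one code point as UTF-8 and return the bytes as space-separated hex.
    static String encodeUtf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeUtf8("E5 90 80").codePointAt(0)));    // 5400
        System.out.println(Integer.toHexString(decodeUtf8("F0 A0 80 80").codePointAt(0))); // 20000
        System.out.println(encodeUtf8Hex(0x20000));                                        // F0 A0 80 80
    }
}
```

Note that the decoded U+20000 occupies two Java chars (a surrogate pair), so a string length of 2 is expected for that one character.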
n.n. Guest
-
PaulH #12
Re: Unicode characters above FFFF
i'm still getting hung up on the 4 byte sequence. can't you just shove the
codepoint at the browser? maybe as an NCR (hex)? that just leaves the font as
the issue (which somebody else can throw money at ;-).
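The NCR idea can be sketched in a few lines of Java: format the scalar value as a hexadecimal numeric character reference and let the browser resolve it, so no 16-bit char is ever involved on the server. The class and method names are made up for illustration.

```java
// Sketch: turn a Unicode scalar value into an HTML hexadecimal numeric character reference.
class NcrDemo {
    static String toHexNcr(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    public static void main(String[] args) {
        System.out.println(toHexNcr(0x20000)); // &#x20000;
        System.out.println(toHexNcr(0x2FA1D)); // &#x2FA1D;
    }
}
```

In CFML the same text could presumably be produced inside cfoutput with something like &##x#formatBaseN(a,16)#; where the doubled ## emits a literal # sign. That one-liner is untested here.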
PaulH Guest
-
n.n. #13
Re: Unicode characters above FFFF
Right, it's a tough one. I think that CharsetEncode and BinaryDecode functions
are either not fully documented or not fully implemented; we are fumbling in
the dark here.
Good idea, though, about throwing hex at the browser - I will try to generate
a file as sequence of bytes and see what I can get there.
n.n. Guest
-
n.n. #14
Re: Unicode characters above FFFF
Paul,
Thank you for all your help.
I guess I will go ahead with what is renderable now and let Unicode worry about the fonts.
n.n. Guest
-
PaulH #15
Re: Unicode characters above FFFF
if you're not in a big rush, i asked somebody at mm to have a look into this. might offer some better ideas (they usually do).
PaulH Guest
-
hokugawa #17
Re: Unicode characters above FFFF
I think the U+20000 form corresponds to the UTF-32 encoding, not UTF-16. Probably
you should use a UTF-16 surrogate pair instead... but I haven't succeeded yet.
BTW, which OS are you using? I couldn't figure out which OS can handle the 'CJK
Unified Ideographs Extension B' area.
Did you try to display these characters using plain HTML, like the following
code?
Thanks,
-- Hiroshi
U+20000 : &#x20000; <br>
D840 DC00 : &#xD840;&#xDC00; <br>
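For anyone checking the pair D840 DC00 by hand, the UTF-16 surrogate arithmetic can be sketched in Java as follows: subtract 0x10000, then put the top 10 bits on 0xD800 and the bottom 10 bits on 0xDC00. The class and method names are made up for illustration.

```java
// Sketch: compute the UTF-16 surrogate pair for a supplementary code point by hand.
class SurrogatePairDemo {
    // Returns {high surrogate, low surrogate} for a code point in U+10000..U+10FFFF.
    static int[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int v = codePoint - 0x10000;     // 20-bit offset into the supplementary planes
        int high = 0xD800 + (v >> 10);   // top 10 bits
        int low = 0xDC00 + (v & 0x3FF);  // bottom 10 bits
        return new int[] {high, low};
    }

    public static void main(String[] args) {
        int[] pair = toSurrogates(0x20000);
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1])); // d840 dc00
    }
}
```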
hokugawa Guest
-
n.n. #18
Re: Unicode characters above FFFF
Hiroshi,
I am using Windows XP.
I agree, none of the Windows-installed fonts contain Unicode characters above
U+FFFF.
Though I would not say that this OS does not support CJK Unified Ideographs
Extension B. I can see them in the Adobe Acrobat document generated by Unicode.org
( http://www.unicode.org/charts/PDF/U20000.pdf ), thus my OS supports it, right?
Yes, I tried plain HTML code, but it is hard to tell how good it is for Extension B,
since Extension B characters are not included in Arial Unicode MS.
I do not quite understand your point about UTF-32.
The Unicode website assures that U+20000 (et al.) can be represented in UTF-8
encoding as well as in UTF-16 and UTF-32.
Thank you.
n.n. Guest
-
PaulH #19
Re: Unicode characters above FFFF
it's neither here nor there that adobe can display those chars; it's all custom encoding & embedded fonts. which actually might be telling us something??
PaulH Guest
-
hokugawa #20
Re: Unicode characters above FFFF
> Though I would not say that this OS does not support CJK Unified Ideographs
Extension B. I can see them in adobe acrobat document generated by Unicode.org
( CJK Unified Ideographs Extension B in PDF format ), thus my OS supports it,
right?
It doesn't mean your OS supports CJK Ext B. Acrobat can display any vector
data, and you can also make any font character glyph yourself (this is called Gaiji).
But yes, Windows can handle CJK Ext B area characters. The font 'Simsun
(Founder Extended)' (sursong.ttf) contains CJK Ext B area data. So, if you
have this font, you can display CJK Ext B characters on your PC.
> I do not quite understand your point about UTF-32.
I mean you can't use \u20000 in Java. You should use \ud840\udc00 instead.
Anyway, if you have the font (probably you can google it), you can try the
attached code.
Thanks,
-- Hiroshi
<html>
<head>
<title></title>
</head>
<body>
<span style="font-family: SimSun (Founder Extended)">
U+20000 : &#x20000; <br>
D840 DC00 : &#xD840;&#xDC00; <br>
#chr(inputbasen("d840", 16))##chr(inputbasen("dc00", 16))# :
<cfoutput>#chr(inputbasen("d840", 16))##chr(inputbasen("dc00",
16))#</cfoutput><br>
</span>
<p>
<a href="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20000">http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20000</a>
</p>
</body>
</html>
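The point about \u20000 versus \ud840\udc00 can be illustrated with a short Java sketch: a Java string holds 16-bit units, so U+20000 is stored as two chars that together count as one code point. Character.toChars (JDK 1.5 or later) builds the pair from the scalar value. The class and method names are made up for illustration.

```java
// Sketch: build a string from a supplementary code point and inspect its UTF-16 form.
class SupplementaryStringDemo {
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint)); // one or two chars, as needed
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x20000);
        System.out.println(s.equals("\ud840\udc00"));               // true
        System.out.println(s.length());                             // 2 (UTF-16 units)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (actual character)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 20000
    }
}
```

The two chr() calls in the attached CFML aim at the same result by concatenating the high and low surrogate by hand.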
hokugawa Guest




