Unicode characters above FFFF


  1. #1

    Unicode characters above FFFF

    How can I generate and display on a page Unicode characters based on their
    Unicode scalar values?

    I know the ColdFusion function chr() works very well for characters up to 65535
    (U+FFFF), but how can I render Unicode characters that have a scalar value
    greater than FFFF, for example U+20000?

    The Unicode website says U+20000 must be represented in UTF-8 by the byte
    sequence F0 A0 80 80. I tried applying that to the BinaryEncode and
    CharsetEncode functions, but can't make it display the correct ideograph.

    Please help!


    n.n. Guest
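
A quick way to double-check which UTF-8 bytes belong to a scalar value is to ask Java, which ColdFusion MX runs on. The sketch below is only an illustration: the names `Utf8Demo` and `utf8Hex` are made up for this note, and it assumes Java 7 or later for `StandardCharsets`.

```java
// Sketch only: Utf8Demo and utf8Hex are hypothetical names, not part of ColdFusion.
final class Utf8Demo {
    /** Returns the UTF-8 bytes of one Unicode scalar value as spaced hex. */
    static String utf8Hex(int codePoint) {
        // Character.toChars yields one char for the BMP and a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x20000)); // F0 A0 80 80
        System.out.println(utf8Hex(0x5400));  // E5 90 80
    }
}
```

Anything above U+FFFF comes out as four bytes; characters in the 0800-FFFF range take three.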

  3. #2

    Re: Unicode characters above FFFF

    in general the int/long codepoint works w/chr (in mx). i can for instance do
    something like the code below. what gets rendered depends on the font (boxes vs
    chars). what gets garbaged depends (i think) on the version of unicode in the
    JDK cf is using.

    what subrange are you after? do you know what version of unicode it's in? is
    it the private use stuff?




    <cfoutput>
    <cfloop index="i" from="194586" to="194686">
    <font face="Arial Unicode MS">#chr(i)#</font>
    </cfloop>
    </cfoutput>

    PaulH Guest

  4. #3

    Re: Unicode characters above FFFF

    I am trying to display all Chinese characters, including the ranges that are
    above U+FFFF:
    U+2F800-U+2FA1F (CJK Compatibility Ideographs Supplement);
    U+20000-U+2A6DF (CJK Unified Ideographs Extension B)
    The version of Unicode is 4.1.

    Also, I noticed that passing a number above 65535 to chr() does not throw an
    error and even displays a character, but the character comes from inside the
    0-65535 range. So if n = 131073, chr(n) shows the same as chr(1); it appears
    the logic in this case is chr(n) = chr(n - 65536*2). In other words, if the
    argument to chr() is greater than 65535, the function deducts 65536 from the
    argument until it falls into the range 0-65535. This can be verified by
    calling asc(chr(i)) in your example.

    n.n. Guest
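
One way to read the wraparound described above: if chr() narrows its argument to a single 16-bit UTF-16 code unit, only the low 16 bits survive, which is the same as taking the value modulo 65536. That chr() works this way internally is an assumption; neither the thread nor the documentation confirms it. The Java sketch below (hypothetical names) only shows what such a narrowing cast does.

```java
// Sketch only: ChrWrapDemo is a hypothetical name. It models the ASSUMED behaviour
// of chr(), namely an int narrowed to a 16-bit char.
final class ChrWrapDemo {
    /** Keeps only the low 16 bits, as a Java narrowing cast from int to char does. */
    static int lowSixteenBits(int n) {
        // char is an unsigned 16-bit type, so for n >= 0 this equals n % 65536.
        return (char) n;
    }

    public static void main(String[] args) {
        System.out.println(lowSixteenBits(131073));  // 1
        System.out.println(lowSixteenBits(0x20000)); // 0
        System.out.println(lowSixteenBits(21504));   // 21504, unchanged below 65536
    }
}
```

If that reading is right, every code point above U+FFFF collapses onto some BMP character, so chr() alone cannot reach Extension B.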

  5. #4

    Re: Unicode characters above FFFF

    sorry i'm not seeing that. the snippet below shows different chars, not the same
    ones. but i see your point, the docs for asc and chr both point to ucs-2
    (65535).

    are you saying that those chars are new to unicode version 4.1? i'll double
    check but that sounds strange.

    i guess my next question is: is there any reason why you're doing it this way?
    seems rather clumsy. for i18n app text (chinese or other languages) we always
    use resource bundles created using ibm's rb manager. can you explain your logic?

    <font face="Arial Unicode MS">
    <cfoutput>
    <cfloop index="i" from="194586" to="194686">
    <cfset n=i-65535*2>
    #chr(i)# ::: #chr(n)#<br>
    </cfloop>
    </cfoutput>
    </font>

    PaulH Guest

  6. #5

    Re: Unicode characters above FFFF

    i've been poking around a bit more. see if the output from this code makes any
    sense to you.

    <cfset ub=createObject("java","java.lang.Character$UnicodeBlock")>
    <cfset charObj=createObject("java","java.lang.Character")>
    <table border="1" cellspacing="2" cellpadding="2">
    <tr align="center" bgcolor="#E8EFCB">
    <td bgcolor="#E8EFCB"><b>codepoint</b></td>
    <td bgcolor="#E8EFCB"><b>java char</b></td>
    <td bgcolor="#E8EFCB"><b>cf char</b></td>
    <td bgcolor="#E8EFCB">unicode block</td>
    </tr>
    <cfoutput>
    <cfloop index="i" from="65535" to="300000" step="1000">
    <cfset thisChar=charObj.init(javacast("int",i))>
    <cfset cj=thisChar.toString()>
    <cfset cf=chr(javacast("int",i))>

    <tr><td>#i#&nbsp;</td><td>#cj#&nbsp;</td><td>#cf#&nbsp;</td><td>#ub.of(thisChar)#&nbsp;</td></tr>
    </cfloop>
    </cfoutput>
    </table>

    PaulH Guest
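
For comparison, this is roughly what the same lookup looks like in plain Java when the int-based (code point) methods are used instead of the char-based ones. It assumes Java 5 or later, where `Character.toChars(int)` and `Character.UnicodeBlock.of(int)` exist; the JDK under a given ColdFusion install may be older. The class and method names are hypothetical.

```java
// Sketch only: BlockLookupDemo is a hypothetical name. Requires Java 5+.
final class BlockLookupDemo {
    /** Looks up the Unicode block by code point (int), not by 16-bit char. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        // Returns null when the code point is not in a block known to this JDK.
        return Character.UnicodeBlock.of(codePoint);
    }

    /** Builds a String for any code point; above U+FFFF it holds a surrogate pair. */
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = { 0x5400, 0x20000, 0x2F800 };
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + blockOf(cp)
                    + ", UTF-16 length " + asString(cp).length());
        }
    }
}
```

A `java.lang.Character` object wraps a single 16-bit char, so it cannot hold a supplementary character by itself; the int overloads avoid that limit.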

  7. #6

    Re: Unicode characters above FFFF

    Paul,
    Thanks for your help. I am trying to build a dictionary where you can search
    characters by number of strokes, radical, etc., based on the Unicode database.
    This database ( ) references characters by scalar value only.

    The attached code shows the main ranges in question and the limitation of the
    chr() function, or rather its compliance with a version of the Unicode
    standard that is unfortunately not 4.1.

    One thing I discovered is that the Arial Unicode MS font does not include some
    of the CJK ranges, specifically Extension B and some parts of other extensions.
    In MS Word help it is said to be Unicode 2.0 compliant, and some CJK ranges were
    not added until version 3.0.

    I also dumped the java.lang.Character$UnicodeBlock object and I see that CJK
    Extension B is not there either.

    The Unicode website has very nice Adobe Acrobat charts of all characters, but
    their contents cannot be copied or extracted. The font Unicode used to create
    these charts contains up to 100,000 characters, is available from a Chinese
    company, and costs a small fortune.

    I guess the real limitation is not even the chr function but the available
    fonts. So I will have to go with the characters available in Arial or use images
    instead of chr().





    <cfoutput>
    <table border="2">
    <tr>
    <th colspan="6">CJK Unified Ideographs, Range: 4E00-9FBF</th>
    </tr>
    <tr>
    <th>Hex</th>
    <th>Dec</th>
    <th>Dec Chr</th>
    <th>Dec + 65536</th>
    <th>Dec + 65536*2</th>
    <th>Dec + 65536*3</th>
    </tr>
    <cfloop from="#inputBaseN('5400',16)#" to="#inputbaseN('5fbe',16)#"
    step="501" index="a">
    <tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
    <td>#formatBaseN(a,16)#</td>
    <td>#a#</td>
    <td>chr(#a#)::#chr(a)#</td>
    <td>chr(#val(a+65536)#)::#chr(val(a+65536))#</td>
    <td>chr(#val(a+65536*2)#)::#chr(val(a+65536*2))# </td>
    <td>chr(#val(a+65536*3)#)::#chr(val(a+65536*3))# </td>
    </tr>
    </cfloop>
    <tr>
    <th colspan="6">CJK Unified Ideographs Extension A, Range: 3400-4DBF</th>
    </tr>
    <tr>
    <th>Hex</th>
    <th>Dec</th>
    <th>Dec Chr</th>
    <th>Dec + 65536</th>
    <th>Dec + 65536*2</th>
    <th>Dec + 65536*3</th>
    </tr>
    <cfloop from="#inputBaseN('3400',16)#" to="#inputbaseN('3bd4',16)#"
    step="501" index="a">
    <tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
    <td>#formatBaseN(a,16)#</td>
    <td>#a#</td>
    <td>chr(#a#)::#chr(a)#</td>
    <td>chr(#val(a+65536)#)::#chr(val(a+65536))#</td>
    <td>chr(#val(a+65536*2)#)::#chr(val(a+65536*2))# </td>
    <td>chr(#val(a+65536*3)#)::#chr(val(a+65536*3))# </td>
    </tr>
    </cfloop>
    <tr>
    <th colspan="6">CJK Unified Ideographs Extension B, Range: 20000 - 2A6DF</th>
    </tr>
    <tr>
    <th>Hex</th>
    <th>Dec</th>
    <th>Dec Chr</th>
    <th>MOD 65536</th>
    <th></th>
    <th></th>
    </tr>
    <cfloop from="#inputBaseN('24E48',16)#" to="#inputbaseN('24f47',16)#"
    step="51" index="a">
    <tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
    <td>#formatBaseN(a,16)#</td>
    <td>#a#</td>
    <td>chr(#a#)::#chr(a)#</td>
    <td>chr(#val(a MOD 65536)#)::#chr(val(a MOD 65536))#</td>
    <td></td>
    <td></td>
    </tr>
    </cfloop>
    <tr>
    <th colspan="6">CJK Compatibility Ideographs Supplement, Range: 2F800 - 2FA1F</th>
    </tr>
    <tr>
    <th>Hex</th>
    <th>Dec</th>
    <th>Dec Chr</th>
    <th>MOD 65536</th>
    <th></th>
    <th></th>
    </tr>
    <cfloop from="#inputBaseN('2f932',16)#" to="#inputbaseN('2FA1F',16)#"
    step="51" index="a">
    <tr <cfif a mod 2 eq 0> bgcolor="##CCFFCC"</cfif>>
    <td>#formatBaseN(a,16)#</td>
    <td>#a#</td>
    <td>chr(#a#)::#chr(a)#</td>
    <td>chr(#val(a MOD 65536)#)::#chr(val(a MOD 65536))#</td>
    <td></td>
    <td></td>
    </tr>
    </cfloop>
    </table>
    </cfoutput>

    n.n. Guest

  8. #7

    Re: Unicode characters above FFFF

    sounds like an interesting app. the db contains strokes, etc, metadata? i
    wonder if icu4j has anything that might help (i use it a lot for i18n work).
    let me dig around some more.

    unfortunately even JDK 1.5 (which i think mx 7 doesn't quite support yet) is
    only at unicode 4.0. are you sure those chars are at 4.1?

    PaulH Guest

  9. #8

    Re: Unicode characters above FFFF

    might as well have a look at this: http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/lang/UCharacter.html
    PaulH Guest

  10. #9

    Re: Unicode characters above FFFF

    Yes, UCharacter is a cool thing! It has all the metadata for these characters.
    I am going to play with it.

    Though now I realize that the real issue is displaying the characters
    that are not included in Arial Unicode and SimSun.

    I can find out how they are pronounced, their byte sequences, their meaning,
    etc., but without the font by "Beijing Zhong Yi (Zheng Code)" it does not make
    sense, because it appears to be the only font that can render them.
    (http://www.unicode.org/charts/fonts.html)
    (http://www.china-e.com.cn/en/zyfont/zyfont.htm)

    Yes, Unicode 4.1 is the latest version, but it does not differ significantly
    from 4.0.
    Most or all of CJK Extension B was added in 3.1, I think.


    n.n. Guest

  11. #10

    Re: Unicode characters above FFFF

    i'm not 100% sure it's the fonts. i'm getting ? in some subranges rather than
    any rendering problems (boxes, mojibake, etc) which usually means garbaged data
    (which might be unicode version differences). i'm not too familiar w/that font
    except maybe having read it was quite expensive.

    maybe you should put this question to
    http://www.unicode.org/consortium/distlist.html ? lots of smart people there
    (well smarter than me anyways).


    PaulH Guest

  12. #11

    Re: Unicode characters above FFFF

    You are right, it is not just the fonts.
    What I meant is that even if there is a way in ColdFusion to generate a 4-byte
    sequence that the browser will correctly interpret as a Unicode character,
    say U+2000A, we can't be sure we did it, because there's no font that has that
    character, so it won't be rendered anyway.
    But of course the primary issue is: does CF allow us to generate correct byte
    sequences?
    Trying to generate the byte sequence directly, I was not very successful.
    I create a binary object with the BinaryDecode function and dump it; it looks
    somewhat different from what I'd expect. For example, for U+5400 the UTF-8 byte
    sequence is E5 90 80, but the dump of binaryDecode('E5 90 80','hex') looks like
    "840". Then using CharsetEncode(binaryobject, "utf-8") on that binary object
    does not produce good output, even though chr(21504) renders the correct
    character, i.e. the font has it.


    n.n. Guest
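
As a cross-check on the byte sequences themselves, here is a small Java sketch that turns a hex string into bytes and decodes them as UTF-8. It strips whitespace before parsing; whether the spaces in 'E5 90 80' are what upsets BinaryDecode is not established in this thread, so treat that as a guess. `HexUtf8Demo` and `decodeUtf8Hex` are hypothetical names.

```java
// Sketch only: HexUtf8Demo is a hypothetical helper, not a ColdFusion API.
final class HexUtf8Demo {
    /** Parses hex digits (whitespace ignored) into bytes and decodes them as UTF-8. */
    static String decodeUtf8Hex(String hex) {
        String digits = hex.replaceAll("\\s+", "");
        if (digits.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Two hex digits per byte; values above 7F wrap into the signed byte range.
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeUtf8Hex("E5 90 80");
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 5400
        String t = decodeUtf8Hex("F0 A0 80 80");
        System.out.println(Integer.toHexString(t.codePointAt(0))); // 20000
    }
}
```

If the same bytes decode correctly here but not through BinaryDecode and CharsetEncode, the problem is in how the binary object is built rather than in the byte values.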

  13. #12

    Re: Unicode characters above FFFF

    i'm still getting hung up on the 4 byte sequence. can't you just shove the
    codepoint at the browser? maybe as an NCR (hex)? that just leaves the font as
    the issue (which somebody else can throw money at ;-).


    PaulH Guest
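
The NCR idea can be sketched like this: format the code point as a hexadecimal numeric character reference and let the browser do the decoding, so no byte sequences are built on the server at all. The Java below is a minimal illustration with made-up names (`NcrDemo`, `toHexNcr`); in CFML something built from formatBaseN(codepoint, 16) would presumably do the same job.

```java
// Sketch only: NcrDemo and toHexNcr are hypothetical names.
final class NcrDemo {
    /** Formats a code point as an HTML hexadecimal numeric character reference. */
    static String toHexNcr(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        String hex = Integer.toHexString(codePoint).toUpperCase(java.util.Locale.ROOT);
        return "&#x" + hex + ";";
    }

    public static void main(String[] args) {
        System.out.println(toHexNcr(0x20000)); // &#x20000;
        System.out.println(toHexNcr(0x2F800)); // &#x2F800;
    }
}
```

The output is plain ASCII, so it survives any page encoding; whether a glyph appears still depends on the fonts installed on the client.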

  14. #13

    Re: Unicode characters above FFFF

    Right, it's a tough one. I think that CharsetEncode and BinaryDecode functions
    are either not fully documented or not fully implemented; we are fumbling in
    the dark here.
    Good idea, though, about throwing hex at the browser - I will try to generate
    a file as sequence of bytes and see what I can get there.

    n.n. Guest

  15. #14

    Re: Unicode characters above FFFF

    Paul,
    Thank you for all your help.
    I guess I will go ahead with what is renderable now and let Unicode worry about the fonts.

    n.n. Guest

  16. #15

    Re: Unicode characters above FFFF

    if you're not in a big rush, i asked somebody at mm to have a look into this. might offer some better ideas (they usually do).
    PaulH Guest

  17. #16

    Re: Unicode characters above FFFF

    Great!
    I am in no rush.
    n.n. Guest

  18. #17

    Re: Unicode characters above FFFF

    I think the U+20000 form is the UTF-32 representation, not UTF-16. Probably
    it should use a UTF-16 surrogate pair instead... but I haven't succeeded yet.

    BTW, which OS are you using? I couldn't figure out which OS can handle the 'CJK
    Unified Ideographs Extension B' area.
    Did you try to display these characters using plain HTML, like the following
    code?

    Thanks,
    -- Hiroshi



    U+20000 : &#x20000; <br>
    D840 DC00 : &#xd840;&#xdc00; <br>

    hokugawa Guest
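
The surrogate pair for a code point above U+FFFF follows from simple arithmetic: subtract 0x10000, then put the upper 10 bits on top of D800 and the lower 10 bits on top of DC00. The Java sketch below spells that out (the class and method names are hypothetical) and compares the result with the JDK's own `Character.toChars`, which needs Java 5 or later.

```java
// Sketch only: SurrogateDemo is a hypothetical name.
final class SurrogateDemo {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x20000);
        System.out.println(Integer.toHexString(pair[0])); // d840
        System.out.println(Integer.toHexString(pair[1])); // dc00
        // The hand-rolled result should agree with the JDK.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x20000))); // true
    }
}
```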

  19. #18

    Re: Unicode characters above FFFF

    Hiroshi,
    I am using Windows XP.
    I agree, none of the Windows-installed fonts contain Unicode characters above
    U+FFFF.

    Though I would not say that this OS does not support CJK Unified Ideographs
    Extension B. I can see them in the Adobe Acrobat document generated by Unicode.org
    ( http://www.unicode.org/charts/PDF/U20000.pdf ), thus my OS supports it, right?

    Yes, I tried plain HTML code, but it is hard to tell how well it works for
    Extension B, since Extension B characters are not included in Arial Unicode MS.

    I do not quite understand your point about UTF-32.
    The Unicode website assures that U+20000 (et al.) can be represented in UTF-8
    encoding as well as in UTF-16 and UTF-32.

    Thank you.


    n.n. Guest

  20. #19

    Re: Unicode characters above FFFF

    it's neither here nor there that adobe can display those chars, it's all custom encoding & embedded fonts. which actually might be telling us something??
    PaulH Guest

  21. #20

    Re: Unicode characters above FFFF

    > Though I would not say that this OS does not support CJK Unified Ideographs
    > Extension B. I can see them in adobe acrobat document generated by Unicode.org
    > ( CJK Unified Ideographs Extension B in PDF format ), thus my OS supports it,
    > right?

    That doesn't mean your OS supports CJK Ext B. Acrobat can display any vector
    data, and you can also create any font character glyph yourself (this is
    called Gaiji).

    But yes, Windows can handle CJK Ext B characters. The font 'Simsun
    (Founder Extended)' (sursong.ttf) contains the CJK Ext B data. So if you
    have this font, you can display CJK Ext B characters on your PC.

    > I do not quite understand your point about UTF-32.

    I mean you can't use \u20000 in Java. You should use \ud840\udc00 instead.

    Anyway, if you have the font (you can probably google it), you can try the
    attached code.

    Thanks,
    -- Hiroshi

    <html>
    <head>
    <title></title>
    </head>
    <body>
    <span style="font-family: SimSun (Founder Extended)">
    U+20000 : &#x20000; <br>
    D840 DC00 : &#xd840;&#xdc00; <br>
    #chr(inputbasen("d840", 16))##chr(inputbasen("dc00", 16))# :
    <cfoutput>#chr(inputbasen("d840", 16))##chr(inputbasen("dc00",
    16))#</cfoutput><br>
    </span>
    <p>
    <a href="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20000">http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20000</a>
    </body>
    </html>

    hokugawa Guest
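
The point about Java escapes can be seen directly: a \u escape takes exactly four hex digits, so writing \u20000 gives U+2000 followed by the digit 0, while the surrogate pair gives one code point stored in two chars. A minimal sketch with made-up names, assuming Java 5 or later for the code point methods:

```java
// Sketch only: LiteralDemo is a hypothetical name. Requires Java 5+.
final class LiteralDemo {
    // U+20000 written as its UTF-16 surrogate pair.
    static final String EXT_B_FIRST = "\ud840\udc00";

    // The escape consumes only four hex digits: this is U+2000 plus the digit '0'.
    static final String NOT_EXT_B = "\u20000";

    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(firstCodePoint(EXT_B_FIRST))); // 20000
        System.out.println(EXT_B_FIRST.length());                             // 2
        System.out.println(codePointLength(EXT_B_FIRST));                     // 1
        System.out.println(Integer.toHexString(firstCodePoint(NOT_EXT_B)));   // 2000
    }
}
```

This matches the chr(d840) plus chr(dc00) approach in the page above: two 16-bit units that together stand for one character.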
