CGI.pm: encoding problems

  1. #1

    CGI.pm: encoding problems

    I have a problem with inputting UTF-8 via a textarea using CGI.pm. This
    problem concerns UTF-8, so apologies for posting something with Chinese
    characters in it.

    The following code is a minimal working example of the problem with a lot of
    extraneous material removed. It needs to be run under a web server to see
    the problem. When the text is submitted using the form, the default text of
    Chinese characters (the numbers from one to four) is munged into
    gibberish, and the test of the input, which checks whether the input
    consists of valid Chinese numerals, fails:

    Input text:

    一二三四

    Output of program:

    Input 一二三四 was not a valid number

    Thank you very much for any assistance, suggestions or advice about this
    problem.
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Begin script (to end of message)
    #!/usr/bin/perl
    use warnings;
    use strict;
    use CGI;
    use utf8;
    binmode (STDOUT, ":utf8");
    my $query = CGI->new();
    $query->charset('UTF-8');
    print $query->header();
    my $kanji;
    if ($query->param('kanji')) {
        my $inputnumber = $query->param('kanji');
        # Accept only the kanji numerals one to ten
        if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
            $kanji = $1;
        } else {
            print "<p>Input $inputnumber was not a valid number</p>";
            $kanji = "";
        }
    } else {
        $kanji = "一二三四";
    }
    print $query->start_form(-method => 'POST',-action => $query->url());
    print $query->textarea(-name => 'kanji',
                           -default => $kanji);
    print $query->submit();
    print $query->endform();
    print "<table><tr>\n<th>Value</th><td>",
        $kanji, "</td></tr>\n", "</table>\n</form>\n<p>\n";
    print $query->end_html();

    Ben Bullock Guest

  3. #2

    Re: CGI.pm: encoding problems

    Ben Bullock schreef:
    > use warnings;
    > use strict;
    > use CGI;
    > use utf8;
    > binmode (STDOUT, ":utf8");
    Try to replace those 5 lines with these (reordered) 4:

    use strict;
    use warnings;
    use encoding 'utf8' ;
    use CGI;

    This would also set the PerlIO layer of STDIN to ':utf8'.

    See perldoc encoding.

    --
    Affijn, Ruud

    "Gewoon is een tijger."


    Dr.Ruud Guest

  4. #3

    Re: CGI.pm: encoding problems

    Ben Bullock wrote:
    > I have a problem with inputing utf-8 via a text window using CGI.pm.
    > This problem concerns UTF8 so apologies for posting something with
    > Chinese characters in it.
    >
    > The following code is a minimal working example of the problem with a
    > lot of extraneous material removed. It needs to be run under a web
    > server to see the problem. When the text is submitted using the form,
    > the default text of Chinese characters (they are the numbers from one to
    > four) are munged into some gibberish stuff, and the test of the input,
    > which checks whether the input is valid Chinese numerals, fails:
    >
    > Input text:
    >
    > 一二三四
    >
    > Output of program:
    >
    > Input 一二三四 was not a valid number
    >
    > Thank you very much for any assistance, suggestions or advice about this
    > problem.
    >
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Begin script (to end of message)
    >
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    > use CGI;
    > use utf8;
    > binmode (STDOUT, ":utf8");
    > my $query = CGI->new();
    > $query->charset('UTF-8');
    > print $query->header();
    > my $kanji;
    > if ($query->param('kanji')) {
    > my $inputnumber = $query->param('kanji');
    > if ($inputnumber =~ /^([一二三四五*七八九十]+)$/) {
    > $kanji = $1;
    > } else {
    > print "<p>Input $inputnumber was not a valid number</p>";
    > $kanji = "";
    > }
    > } else {
    > $kanji = "一二三四";
    > }
    > print $query->start_form(-method => 'POST',-action => $query->url());
    > print $query->textarea(-name => 'kanji',
    > -default => $kanji);
    > print $query->submit();
    > print $query->endform();
    > print "<table><tr>\n<th>Value</th><td>",
    > $kanji, "</td></tr>\n", "</table>\n</form>\n<p>\n";
    > print $query->end_html();
    >
    I made a few changes to your program. I don't know exactly what the
    problem is, but I hope that this sheds some light on it:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use CGI;
    use utf8;
    use Encode (); # changed
    binmode (STDOUT, ":utf8");
    my $query = CGI->new();
    $query->charset('UTF-8');
    print $query->header('-cache-control' => 'no-cache'); # changed

    my $kanji;
    if ($query->param('kanji')) {
        my $inputnumber = $query->param('kanji');

        print <<EOF;
    <p> Interesting decodings of
    &quot;$inputnumber&quot; <br>
    UTF-8: @{[ Encode::decode('utf8', $inputnumber) ]} <br>
    </p>
    <hr>

    EOF

        # Add this to decode the number:
        $inputnumber = Encode::decode('utf8', $inputnumber);

        if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
            $kanji = $1;
        } else {
            print "<p>Input $inputnumber was not a valid number</p>";
            $kanji = "";
        }
    } else {
        $kanji = "一二三四";
    }

    print <<EOF;
    <p> The value of \$kanji is: $kanji
    </p>

    EOF

    print $query->start_form(
        -method => 'POST',
        -action => $query->url()
    );
    print $query->textarea(-name => 'kanji',
                           -default => $kanji);

    print <<EOF;
    <textarea name=alternate>
    DATA = $kanji
    </textarea>
    EOF

    print $query->submit();
    print $query->endform();
    print "<table><tr>\n<th>Value</th><td>",
        $kanji, "</td></tr>\n", "</table>\n</form>\n<p>\n";
    print $query->end_html();
    Mumia W. Guest

  5. #4

    Re: CGI.pm: encoding problems

    Dr.Ruud wrote:
    > Ben Bullock schreef:
    >
    >> use warnings;
    >> use strict;
    >> use CGI;
    >> use utf8;
    >> binmode (STDOUT, ":utf8");
    >
    > Try to replace those 5 lines with these (reordered) 4:
    >
    > use strict;
    > use warnings;
    > use encoding 'utf8' ;
    > use CGI;
    >
    > This would also set the PerlIO layer of STDIN to ':utf8'.
    >
    > See perldoc encoding.
    >
    I still get the problem when running Ben's program. The problem is that
    using the CGI module to initialize the textarea works the first time and
    not the second; however, bypassing CGI.pm and writing the textarea
    directly using print seems to work consistently.

    The bug might be logic related, but it's more likely CGI.pm-related.

    There is a "hint" that the CGI.pm on my Sarge system is not UTF-8 ready.
    This appears at the top of every page of output:
    <?xml version="1.0" encoding="iso-8859-1"?>

    This happens even when the HTTP header says utf8.

    Mumia W. Guest

  6. #5

    Re: CGI.pm: encoding problems

    Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I was
    able to get this working, but I also noticed a couple of interesting
    phenomena in debugging this program. As Mumia W. says, the text in the box is
    rendered incorrectly. Also, if I use my own "<input" box the input is mangled,
    and if I use the "straight" function calls of CGI.pm rather than the
    object-oriented ones, things stop working again, so it does look rather like
    there is something wrong inside CGI.pm. If anyone is interested, let me know
    and I'll post example code.

    Thanks again.

    Ben Bullock Guest

  7. #6

    Re: CGI.pm: encoding problems

    Ben Bullock wrote:
    > Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
    > was able to get this working, but I also noticed a couple of interesting
    > phenomena in debugging this program. As Mumia W. says the text in the
    > box is done incorrectly. Also, if I use my own "<input" box the input is
    > mangled, and if I use the "straight" function calls of CGI.pm rather
    > than the object-oriented ones, things stop working again, so it does
    > look rather like there is something wrong inside CGI.pm. If anyone is
    > interested, let me know and I'll post example code.
    >
    > Thanks again.
    >
    How were you able to get it working? Re-ordering the prologue and using
    utf8 didn't work for me.

    Mumia W. Guest

  8. #7

    Re: CGI.pm: encoding problems

    Ben Bullock wrote:
    > Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
    > was able to get this working, but I also noticed a couple of interesting
    > phenomena in debugging this program. As Mumia W. says the text in the
    > box is done incorrectly. Also, if I use my own "<input" box the input is
    > mangled, and if I use the "straight" function calls of CGI.pm rather
    > than the object-oriented ones, things stop working again, so it does
    > look rather like there is something wrong inside CGI.pm. If anyone is
    > interested, let me know and I'll post example code.
    >
    > Thanks again.
    >
    It's not a bug; it's a feature ;)

    For whatever reason, on my system, CGI.pm always interprets the STDIN
    data in raw mode, regardless of the script encoding, so form elements
    have to be explicitly decoded.

    And CGI.pm has a nifty feature that allows the programmer to
    automatically create forms with the same values that were in the posted
    data.

    These two behaviors combine to create the problems you had. The
    workarounds are to explicitly decode the form elements and to delete the
    old form element before creating another one with the same name.
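
    The first workaround can be seen in isolation, without a web server. This
    is only a minimal sketch: it assumes the browser sends UTF-8, it fakes the
    posted data with Encode::encode, and the helper name is_kanji_number is
    made up for illustration.

    ```perl
    use strict;
    use warnings;
    use utf8;                          # the regex below is written in UTF-8
    use Encode qw(decode encode);

    # Character class for the kanji numerals one to ten.
    my $KANJI_NUMBER = qr/^[一二三四五六七八九十]+$/;

    # Hypothetical helper: true if the string consists only of kanji numerals.
    sub is_kanji_number {
        my ($string) = @_;
        return $string =~ $KANJI_NUMBER ? 1 : 0;
    }

    # Simulate what the script receives when the data is read in raw mode:
    # UTF-8 bytes, three per kanji.
    my $raw_bytes = encode('UTF-8', "一二三四");

    # Decoding turns those bytes back into characters.
    my $decoded = decode('UTF-8', $raw_bytes);

    binmode STDOUT, ':encoding(UTF-8)';
    printf "raw:     length %d, matches: %d\n",
        length($raw_bytes), is_kanji_number($raw_bytes);
    printf "decoded: length %d, matches: %d\n",
        length($decoded), is_kanji_number($decoded);
    ```

    The raw string is 12 units long and fails the test because the regex sees
    each byte as a separate character; the decoded string is 4 characters long
    and passes.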

    This program should demonstrate the issue and workarounds:

    #!/usr/bin/perl
    # kanji-2.cgi
    use strict;
    use warnings;
    use encoding 'utf8';
    use CGI ();
    use CGI::Carp 'fatalsToBrowser';

    $\ = "\n";

    # Invoke this script without a query string to
    # get the default (broken) behavior.
    #
    # Invoke this script with a query string of 'recode'
    # to get the 'kanji' form element recoded into
    # utf8. Example:
    #
    # http://server.com/kanji-2.cgi?recode
    #
    # Or, if you want the old textarea data deleted
    # upon successive invocations of the form, add
    # a query string of 'delete' like so:
    #
    # http://server.com/kanji-2.cgi?delete
    my $RECODE_QUERY = 0;
    my $DELETE_QUERY = 0;
    $RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
    $DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;

    my $kanji;
    my $text;
    my $query = new CGI;

    print $query->header(
        -type => 'text/html',
        -charset => 'utf8',
    );

    print $query->start_html(
        -title => 'Kanji Test',
        -head => CGI::meta ({-http_equiv => 'Content-Type',
                             -content => 'text/html; charset=utf8' ,
                            }),
    ),
    $query->h1('Kanji Test');

    print <<EOF;
    <p> Let's see if it's possible to send
    and receive kanji numeric characters.
    </p>
    EOF

    if (! defined $query->param('kanji')) {

        $kanji = "一二三四";

    } else {

        $kanji = $query->param('kanji');
        $kanji = Encode::decode('utf8', $kanji);
        my $old_kanji = $query->param('kanji');

        if ($RECODE_QUERY) {
            $query->param('kanji', $kanji);
        }

        if ($DELETE_QUERY) {
            $query->delete('kanji');
        }

        ($text = <<EOF) =~ s/^\s*//mg;
    <pre> The data received was:
    ORIGINAL: $old_kanji
    DECODED: $kanji
    </pre>
    EOF


        print $text;
    }

    my $qs = '' eq $ENV{QUERY_STRING} ? '' :
        "?$ENV{QUERY_STRING}" ;

    print $query->start_form(
        -method => 'POST',
        -action => $query->url() . $qs );

    print $query->textarea(
        -name => 'kanji',
        -default => $kanji,
    );

    print $query->submit();

    print $query->end_form();


    print $query->end_html;

    Mumia W. Guest

  9. #8

    Re: CGI.pm: encoding problems

    Mumia W. wrote:
    > Ben Bullock wrote:
    >
    >> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud
    >> I was able to get this working, but I also noticed a couple of
    >> interesting phenomena in debugging this program. As Mumia W. says the
    >> text in the box is done incorrectly. Also, if I use my own "<input"
    >> box the input is mangled, and if I use the "straight" function calls
    >> of CGI.pm rather than the object-oriented ones, things stop working
    >> again, so it does look rather like there is something wrong inside
    >> CGI.pm. If anyone is interested, let me know and I'll post example code.
    >>
    >> Thanks again.
    >>
    >
    > It's not a bug; it's a feature ;)
    >
    > For whatever reason, on my system, CGI.pm always interprets the STDIN
    > data in raw mode, regardless of the script encoding, so form elements
    > have to be explicitly decoded.
    >
    > And CGI.pm has a nifty feature that allows the programmer to
    > automatically create forms with the same values that were in the posted
    > data.
    >
    > These two behaviors combine to create the problems you had. The
    > workarounds are to explicitly decode the form elements and to delete the
    > old form element before creating another one with the same name.
    >
    > This program should demonstrate the issue and workarounds:
    Interesting. I found that the following program blew up on the
    Encode::decode, but that $old_kanji appeared to display correctly.
    Also, the 'kanji' element displayed correctly even if I did not specify
    a query string. Do we have a version problem? I'm running:

    Perl 5.8.6
    CGI.pm 3.20
    OS: Darwin 7.9.0 (a.k.a. Mac OS X)
    Server: Apache 1.3.33
    Browser: Firefox 1.5.0.4 (though I doubt this has anything to do with it).
    >
    #!/usr/local/bin/perl
    > # kanji-2.cgi
    > use strict;
    > use warnings;
    > use encoding 'utf8';
    > use CGI ();
    > use CGI::Carp 'fatalsToBrowser';
    >
    > $\ = "\n";
    >
    > # Invoke this script without a query string to
    > # get the default (broken) behavior.
    > #
    > # Invoke this script with a query string of 'recode'
    > # to get the 'kanji' form element recoded into
    > # utf8. Example:
    > #
    > # http://server.com/kanji-2.cgi?recode
    > #
    > # Or, if you want the old textarea data deleted
    > # upon successive invocations of the form, add
    > # a query string of 'delete' like so:
    > #
    > # http://server.com/kanji-2.cgi?delete
    > my $RECODE_QUERY = 0;
    > my $DELETE_QUERY = 0;
    > $RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
    > $DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;
    >
    > my $kanji;
    > my $text;
    > my $query = new CGI;
    >
    > print $query->header(
    > -type => 'text/html',
    > -charset => 'utf8',
    > );
    >
    # I found I got redundant meta headers with the original
    # script, so:
    > print $query->start_html(
    > -title => 'Kanji Test',
    ## -head => CGI::meta ({-http_equiv => 'Content-Type',
    ## -content => 'text/html; charset=utf8' ,
    ## }),
    > ),
    > $query->h1('Kanji Test');
    >
    > print <<EOF;
    > <p> Let's see if it's possible to send
    > and receive kanji numeric characters.
    > </p>
    > EOF
    >
    > if (! defined $query->param('kanji')) {
    >
    > $kanji = "一二三四";
    >
    > } else {
    >
    > $kanji = $query->param('kanji');
    eval {$kanji = Encode::decode('utf8', $kanji)};
    $@ and $kanji = $@;
    > my $old_kanji = $query->param('kanji');
    >
    > if ($RECODE_QUERY) {
    > $query->param('kanji', $kanji);
    > }
    >
    > if ($DELETE_QUERY) {
    > $query->delete('kanji');
    > }
    >
    > ($text = <<EOF) =~ s/^\s*//mg;
    > <pre> The data received was:
    > ORIGINAL: $old_kanji
    > DECODED: $kanji
    > </pre>
    > EOF
    >
    >
    > print $text;
    > }
    >
    > my $qs = '' eq $ENV{QUERY_STRING} ? '' :
    > "?$ENV{QUERY_STRING}" ;
    >
    > print $query->start_form(
    > -method => 'POST',
    > -action => $query->url() . $qs );
    >
    > print $query->textarea(
    > -name => 'kanji',
    > -default => $kanji,
    > );
    >
    > print $query->submit();
    >
    > print $query->end_form();
    >
    >
    > print $query->end_html;
    >
    Tom Wyant
    harryfmudd [AT] comcast [DOT] net Guest

  10. #9

    Re: CGI.pm: encoding problems

    harryfmudd [AT] comcast [DOT] net wrote:
    > Mumia W. wrote:
    >> [...]
    >> This program should demonstrate the issue and workarounds:
    >
    > Interesting. I found that the following program blew up on the
    > Encode::decode, but that $kanji_orig appeared to display correctly.
    > Also, the 'kanji' element displayed correctly even if I did not specify
    > a query string. Do we have a version problem? [...]
    Quite likely. I have perl 5.8.4 and CGI.pm 3.04 (old). That's probably
    why Dr. Ruud's advice of moving the "use" statements around didn't work
    for me.

    So it seems that re-decoding the data is a bad idea with newer versions
    of the module. As you were, everybody.
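
    One way to cope with both situations is to decode only when the value
    still looks like raw bytes. This is a rough sketch, not something tested
    against either CGI.pm version: decode_once is a made-up helper name, and
    utf8::is_utf8 only reports Perl's internal flag, so treat the check as a
    heuristic.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical helper: decode a form value unless it is already flagged
    # as a character string. This is a heuristic, not a guarantee.
    sub decode_once {
        my ($value) = @_;
        return $value if !defined $value;
        return $value if utf8::is_utf8($value);   # already decoded, leave alone
        # FB_CROAK makes malformed input die instead of being silently replaced
        return decode('UTF-8', $value, Encode::FB_CROAK());
    }

    my $bytes = "\xe4\xb8\x80\xe4\xba\x8c";       # UTF-8 bytes for U+4E00 U+4E8C
    my $once  = decode_once($bytes);
    my $twice = decode_once($once);               # second call changes nothing

    printf "%d bytes -> %d characters -> %d characters\n",
        length($bytes), length($once), length($twice);
    ```

    Calling the helper a second time on its own result is harmless, which is
    the property that matters when it is unclear whether the module has
    already decoded the parameter.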

    Mumia W. Guest

  11. #10

    Re: CGI.pm: encoding problems

    If anyone cares, the original program is on the web as follows:

    http://www.sljfaq.org/cgi/numbers.cgi
    http://www.sljfaq.org/cgi/kanjinumbers.cgi

    The bottom one was the one with the problems.

    Ordering the statements correctly solved the problem with the encoding, but
    some problems remained.

    Thanks for the help.

    Ben Bullock Guest

  12. #11

    Re: CGI.pm: encoding problems

    Ben Bullock wrote:
    > If anyone cares, the original program is on the web as follows:
    > [...]
    > http://www.sljfaq.org/cgi/kanjinumbers.cgi
    >
    > [...]
    I'm not having any problems with it. Am I supposed to?


    Mumia W. Guest

  13. #12

    Re: CGI.pm: encoding problems

    "Mumia W." <mumia.w.18.spam+nospam.usenet@earthlink.net> wrote in message
    news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
    > Ben Bullock wrote:
    >> If anyone cares, the original program is on the web as follows:
    >> [...]
    >> http://www.sljfaq.org/cgi/kanjinumbers.cgi
    >>
    >> [...]
    >
    > I'm not having any problems with it. Am I supposed to?
    No, not really. But one interesting problem occurs if you type in numbers
    like this:

    一ニ三四五xyz

    then the xyz is preserved after you convert. If you go the other way round,

    12345xyz

    then the xyz disappears. The code is exactly the same going either way, so
    you tell me why that should be.

    Ben Bullock Guest

  14. #13

    Re: CGI.pm: encoding problems

    Ben Bullock wrote:
    > "Mumia W." <mumia.w.18.spam+nospam.usenet@earthlink.net> wrote in
    > message news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
    >> Ben Bullock wrote:
    >>> If anyone cares, the original program is on the web as follows:
    >>> [...]
    >>> http://www.sljfaq.org/cgi/kanjinumbers.cgi
    >>>
    >>> [...]
    >>
    >> I'm not having any problems with it. Am I supposed to?
    >
    > No, not really. But one interesting problem occurs if you type in
    > numbers like this:
    >
    > 一ニ三四五xyz
    >
    > then the xyz is preserved after you convert. If you go the other way round,
    >
    > 12345xyz
    >
    > then the xyz disappears. The code is exactly the same going either way,
    > so you tell me why that should be.
    I don't know, but perhaps you can create your own character class that
    matches only numbers from the various languages you're using.
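
    Something along these lines, perhaps. It is only a sketch: which
    characters count as digits is an assumption, split_number is a made-up
    helper name, and it has nothing to do with how kanjinumbers.cgi is
    actually written.

    ```perl
    use strict;
    use warnings;
    use utf8;                          # the character classes are in UTF-8

    # One class per script; extend them as needed.
    my $KANJI_DIGITS  = qr/[一二三四五六七八九十]/;
    my $ARABIC_DIGITS = qr/[0-9]/;

    # Hypothetical helper: split the input into a leading run of digits from
    # one script and whatever trails it, so trailing text is treated the same
    # way in both directions.
    sub split_number {
        my ($input) = @_;
        if ($input =~ /^($KANJI_DIGITS+|$ARABIC_DIGITS+)(.*)$/s) {
            return ($1, $2);
        }
        return ('', $input);           # no leading number at all
    }

    binmode STDOUT, ':encoding(UTF-8)';
    for my $input ("一二三四五xyz", "12345xyz", "xyz") {
        my ($number, $rest) = split_number($input);
        print "input=$input number=$number rest=$rest\n";
    }
    ```

    With the split done up front, the conversion code only ever sees the
    number part, and the script can decide explicitly whether to keep or drop
    the rest.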


    Mumia W. Guest
