Reading contents of an excel file from a test file - PERL Modules

  1. #1

    Reading contents of an excel file from a test file

    Hi

    I am writing a script to read various file types (doc, xls, pdf, html
    etc.) and search for certain keywords. Without caring about the file
    formats, I used a 'findstr' system call from Perl with the keyword for
    each file and directed the output to a text file. The results were not
    as bad as I had expected :)
    In the text file created, I have the list of all the files and the
    original text from each file where the search string occurs, but
    this file has some Unicode characters which prevent it from being read
    and processed properly. :(
    I basically get a lot of "" in the result text file, which makes Perl
    act weird.

    Can you please suggest a way to read a text file which has Unicode
    characters??
    I do NOT want to create separate parsers for the different file types
    (things like ParseExcel) as it will increase the complexity and will
    need a lot of effort.

    Cheeeeers!!

    KRN!!?!

    Mick Guest

  2. #2

    Re: Reading contents of an excel file from a test file

    Sorry guys, but the Unicode characters, which all appear as identical
    rectangles in my text file, are not printed when posting a message on
    this forum!!

    On May 15, 11:49 am, Mick wrote:


    Mick Guest

  3. #3

    Re: Reading contents of an excel file from a test file

    Top-posting corrected. Please don't top-post.

    Mick wrote:

    I'm pretty sure that current versions of Perl are happy to process Unicode.
    See `perldoc perlunicode`.

    If I wanted to ignore characters that are outside the ASCII printable
    set then I'd investigate Perl's 'tr'. `perldoc perlop` suggests
    tr/a-zA-Z/ /cs; # change non-alphas to single space
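
    A minimal sketch of that idea, assuming the findstr output is read as raw
    bytes and that anything outside printable ASCII (plus tab and newline) can
    simply be discarded. The helper name `printable_only` and the file path
    taken from the command line are illustrative, not something from the
    original post.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: replace every run of bytes outside printable ASCII
    # (tab and newline are kept) with a single space.
    sub printable_only {
        my ($text) = @_;
        # /c complements the search list, /s squeezes each replaced run.
        (my $clean = $text) =~ tr/\t\n\x20-\x7E/ /cs;
        return $clean;
    }

    # Usage: perl script.pl results.txt   (the file name is an assumption)
    if (@ARGV) {
        open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
        while (my $line = <$fh>) {
            print printable_only($line);
        }
        close $fh;
    }
    ```

    This throws information away, so it only suits the case where the odd
    characters really are noise rather than part of the keywords.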

    I suspect there's no guarantee that arbitrary file types will store your
    keywords in a recognisable form. A file might store "KEYWORD" as
    "KExxYWxxOxRD" for example. I'd guess this is particularly
    likely in PDF, especially if it is kerning text. Some might use UTF-8
    encoding, others might use UTF-16 or some non-Unicode encoding. Some might
    compress or encode the text so it no longer appears in ASCII.
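
    To illustrate the encoding point, here is a small sketch using the core
    Encode module. It assumes, purely for the example, that a file stores the
    keyword as UTF-16LE; the helper name `contains_keyword` is made up for
    illustration.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Hypothetical helper: decode $bytes from $encoding, then look for $keyword.
    sub contains_keyword {
        my ($bytes, $encoding, $keyword) = @_;
        my $text = decode($encoding, $bytes);
        return index($text, $keyword) >= 0 ? 1 : 0;
    }

    my $keyword = 'KEYWORD';
    my $utf16   = encode('UTF-16LE', $keyword);   # two bytes per letter: K\0E\0Y\0...

    # A plain byte search misses the keyword because of the interleaved NUL bytes.
    print "raw byte search: ",
        (index($utf16, $keyword) >= 0 ? "found" : "not found"), "\n";

    # Decoding with the right encoding first lets the match succeed.
    print "after decoding:  ",
        (contains_keyword($utf16, 'UTF-16LE', $keyword) ? "found" : "not found"), "\n";
    ```

    This only helps when the encoding is known or can be guessed; compressed
    or otherwise encoded text still needs a format-specific reader.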
     

    You are using Google Groups and it seems to think your character set is
    Latin1 not Unicode. Your posting has this header:
    Content-Type: text/plain; charset="iso-8859-1"

    Possibly you are viewing your "text file" in an application that is not
    Unicode aware or is not using a font that has glyphs for the particular
    Unicode characters in the file.
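
    One way to find out what the rectangles actually are, sketched under the
    assumption that the file is small enough to read in one go, is to count
    the byte values that fall outside printable ASCII. The helper name
    `odd_bytes` and the file path argument are hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: count each byte value outside printable ASCII
    # (tab, CR and LF are treated as ordinary text).
    sub odd_bytes {
        my ($bytes) = @_;
        my %count;
        while ($bytes =~ /([^\t\r\n\x20-\x7E])/g) {
            $count{ sprintf('0x%02X', ord($1)) }++;
        }
        return \%count;
    }

    # Usage: perl script.pl results.txt   (the file name is an assumption)
    if (@ARGV) {
        open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
        local $/;                       # slurp the whole file
        my $seen = odd_bytes(scalar <$fh>);
        close $fh;
        printf "%s x %d\n", $_, $seen->{$_} for sort keys %$seen;
    }
    ```

    If most of the reported bytes turn out to be 0x00, the text is probably
    UTF-16 rather than damaged, and decoding it is likely a better fix than
    stripping characters.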

    Ian Guest
