Convert pdf page to text (set of textboxes)


  1. #1

    Convert pdf page to text (set of textboxes)

    Hi!

    My problem is this:
    I have a client (a media observer corp).
    They store their media content as PDF files.
    When they need the text of a PDF file, they copy and paste it from Reader.
    But this is very hard with many PDFs, and sometimes the text is split into columns, and the information is encoded in ISO-8859-2... So it is horrible sometimes.
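
    To illustrate why the encoding matters, here is a minimal sketch in Python. The two bytes are made-up example data, not taken from the client's files: read as ISO-8859-2 they give the intended letters, while a tool that assumes ISO-8859-1 produces different characters.

```python
# Two letters stored as ISO-8859-2 bytes (hypothetical example data).
raw = b"\xf5\xfb"

# Reading the bytes with the encoding they were written in.
as_latin2 = raw.decode("iso-8859-2")

# Reading the same bytes as ISO-8859-1, as a tool with the wrong assumption would.
as_latin1 = raw.decode("iso-8859-1")

print(as_latin2)
print(as_latin1)
```

    The bytes are identical in both cases; only the table used to interpret them differs, so the copied text has to be decoded with the encoding that was actually used.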

    They need a utility, but I don't know whether it is possible to write one.
    The utility would work like Recognita (but without OCR): it recognizes the blocks of text and numbers them, the user can change this order, and then the program copies the correctly ordered text to the clipboard or to a file.

    For this I need to:
    1.) Get all text from the PDF page.
    2.) Get all information from text blocks.

    I tried all of the commercial and freeware tools on the net, but they do not work well. They get only the text, or they have problems reading the PDF. Adobe Acrobat is also not good for this; its export function does not have a suitable filter for me.
    The freeware pdftohtml does something close to what I need: it creates absolutely positioned divs that contain the text.
    This is what I need: the text boxes with their coordinates and the text.
    Example:
    page1
    textbox1{10,10,200,27,"Thisisatext/xfc/xfa"};
    textbox2{10,30,200,27,"2"};
    page2
    textbox1{00,30,200,27,"Thisisatext/xfc/xfa"};
    ....
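
    The record format above, and the "number the boxes, let the user reorder them, export the text" workflow, could be modeled roughly like this in Python. This is only a sketch: `TextBox`, `number_boxes` and `export_text` are hypothetical names, and it assumes y grows downward (as in HTML output), which is not the native PDF coordinate system.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    page: int
    x: float
    y: float   # top edge; assumed to grow downward
    w: float
    h: float
    text: str


def number_boxes(boxes):
    """Propose a default order: by page, then top to bottom, then left to right."""
    return sorted(boxes, key=lambda b: (b.page, b.y, b.x))


def export_text(boxes, order):
    """Join the box texts following `order`, a list of indexes chosen by the user."""
    return "\n".join(boxes[i].text for i in order)


boxes = number_boxes([
    TextBox(1, 10, 30, 200, 27, "second block"),
    TextBox(1, 10, 10, 200, 27, "first block"),
    TextBox(2, 0, 30, 200, 27, "third block"),
])

text = export_text(boxes, [0, 1, 2])      # default order
swapped = export_text(boxes, [1, 0, 2])   # order after the user swaps two boxes
print(text)
```

    Getting the boxes out of the PDF is the hard part and is not covered here; the sketch only shows what the program could do once it has them.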

    I have many questions, but one matters more than all the others:
    Can I get these text boxes from a PDF page/file?
    1.) Can the scripting/COM objects provide this information?
    2.) What product do I need to install to get this information?
    3.) How do I do it? (A very simple example.)

    Thanks for your help:
    dd
    durumdara@adobeforums.com Guest

  3. #2

    Re: Convert pdf page to text (set of textboxes)

    It's just software - therefore it is certainly possible to write...

    However, what you are trying to accomplish is complex and difficult, as PDF isn't a file format that (necessarily!) lends itself to this type of operation. Some PDFs WILL contain the necessary information (called Tagging), but others will not :(. In those cases, you will need to write complex heuristics to determine the "blocks".
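
    As a rough idea of what such a heuristic might look like, here is a deliberately simple Python sketch. It assumes the text has already been extracted as lines with bounding boxes (the hypothetical `Line` type) with y growing downward, and it puts a line into an existing block when the two overlap horizontally and the vertical gap is small. Real documents need much more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    x: float   # left edge
    y: float   # top edge, assumed to grow downward
    w: float
    h: float
    text: str


@dataclass
class Block:
    lines: list = field(default_factory=list)

    @property
    def left(self):
        return min(l.x for l in self.lines)

    @property
    def right(self):
        return max(l.x + l.w for l in self.lines)

    @property
    def bottom(self):
        return max(l.y + l.h for l in self.lines)


def group_into_blocks(lines, max_gap=0.5):
    """Greedy grouping; max_gap is the allowed vertical gap as a fraction of line height."""
    blocks = []
    for line in sorted(lines, key=lambda l: (l.y, l.x)):
        for block in blocks:
            overlaps = line.x < block.right and line.x + line.w > block.left
            close = line.y - block.bottom <= max_gap * line.h
            if overlaps and close:
                block.lines.append(line)
                break
        else:
            # No existing block fits, so this line starts a new one.
            blocks.append(Block([line]))
    return blocks


# Two columns of two lines each, plus a separate paragraph further down.
sample = [
    Line(10, 10, 200, 27, "a1"), Line(250, 10, 200, 27, "b1"),
    Line(10, 40, 200, 27, "a2"), Line(250, 40, 200, 27, "b2"),
    Line(10, 200, 200, 27, "c1"),
]
grouped = [[l.text for l in b.lines] for b in group_into_blocks(sample)]
print(grouped)
```

    The threshold `max_gap` is an arbitrary choice for illustration; headings, tables and captions would all need their own rules.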

    Leonard
    Leonard_Rosenthol@adobeforums.com Guest

  4. #3

    Re: Convert pdf page to text (set of textboxes)

    Hi!

    It is interesting that pdftohtml can do this: it converts the PDF text to absolutely positioned divs, but it loses some international characters.

    My idea of PDF is that it uses the same technique as WMF/EMF.
    The simple way to write a text block is to define it as {x,y,w,h,text}.
    Like WinAPI TextRect().

    But I don't know whether it works this way in Adobe. Does Adobe store the text in the same way?

    Once I have this information, I can do anything with it.
    But when the coordinates are missing, the text is only flowing text...

    Do you know anything about this text information?

    Thanks for it:
    dd
    durumdara@adobeforums.com Guest

  5. #4

    Re: Convert pdf page to text (set of textboxes)

    You should read the PDF Reference, especially the chapter on text if
    you want to be able to understand what text extraction is conceptually
    possible, and what complexities can arise.

    To oversimplify... Each text run is identified by font, matrix,
    spacing parameters and text; the text is subject to interpretation by
    the encoding defined for the font. The matrix defines the scaling
    (and perhaps skewing) of the font, and its origin. In many cases,
    the PDF file does not include spaces, and the text runs may not be in
    reading order.
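
    The last point (no spaces, runs out of order) can be illustrated with a small Python sketch. It assumes the runs have already been extracted with a position, a width and a font size (the hypothetical `Run` type), and that coordinates have been flipped so that y grows downward. Runs on the same baseline are sorted left to right, and a space is inserted wherever the gap between two runs exceeds a fraction of the font size.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Run:
    x: float       # start of the run
    y: float       # baseline, assumed to grow downward
    width: float
    size: float    # font size
    text: str


def rebuild_lines(runs, space_ratio=0.25, y_tolerance=2.0):
    """Sort runs into reading order and guess where spaces belong."""
    def row(run):
        # Bucket nearly equal baselines into the same row.
        return round(run.y / y_tolerance)

    ordered = sorted(runs, key=lambda r: (row(r), r.x))
    lines = []
    for _, group in groupby(ordered, key=row):
        parts, prev = [], None
        for run in group:
            if prev is not None:
                gap = run.x - (prev.x + prev.width)
                if gap > space_ratio * run.size:
                    parts.append(" ")
            parts.append(run.text)
            prev = run
        lines.append("".join(parts))
    return lines


# Runs stored out of order and without any space characters.
runs = [
    Run(60, 100, 30, 10, "world"),
    Run(10, 100, 25, 10, "Hel"),
    Run(35, 100, 20, 10, "lo"),
    Run(10, 120, 40, 10, "next"),
]
lines = rebuild_lines(runs)
print(lines)
```

    Both `space_ratio` and `y_tolerance` are guesses that would have to be tuned per document; this is the kind of heuristic mentioned earlier in the thread, not an exact method.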

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest

  6. #5

    Re: Convert pdf page to text (set of textboxes)



    >To oversimplify...

    Yes, I do it.
    But what I don't know is this: is the text drawn into an area (clip!), or does the size of the text (font, style, etc.) determine the full width/height?

    I must oversimplify this, because PDF/PS parsing is a very complex thing. I don't want to rewrite the Adobe Acrobat Viewer; I only want to render the PDF into a pages/textboxes format.

    The simplified working method is this:
    I want to hook into the "printing" of the PDF.
    When some text area is drawn virtually, it calls back my routine with x, y, w, h and the text.
    Then I decide what to do and handle everything. I only need the bounds and the text.

    Does Adobe provide an interface to do this?
    Like MS WMF Play where I can catch every operation.

    Thanks for your help:
    dd
    durumdara@adobeforums.com Guest

  7. #6

    Re: Convert pdf page to text (set of textboxes)

    >But what I don't know is this: is the text drawn into an area (clip!), or does the size of the text (font, style, etc.) determine the full width/height?

    The input is: matrix, font, spacing parameters, text. The text is laid
    out from the starting point (determined by the matrix). For most fonts
    it continues left to right for the width determined by the font
    itself, the matrix, and the spacing.
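
    As a sketch of that calculation in Python: the advance of each glyph is its width from the font (in thousandths of the font size) plus the character spacing, plus the word spacing for a space, all multiplied by the horizontal scaling. The function `run_box` and the width table are made up for illustration, the matrix is assumed to contain only scaling and translation, and the height is only approximated by the font size.

```python
def run_box(text, glyph_widths, font_size, matrix=(1, 0, 0, 1, 0, 0),
            char_spacing=0.0, word_spacing=0.0, h_scale=1.0):
    """Return an approximate (x, y, width, height) for one horizontal text run."""
    a, b, c, d, e, f = matrix
    if b != 0 or c != 0:
        raise ValueError("rotated or skewed text is outside this sketch")

    advance = 0.0
    for ch in text:
        # Glyph widths are given in thousandths of the font size.
        step = glyph_widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            step += word_spacing
        advance += step * h_scale

    # (e, f) is the starting point; a and d scale the run horizontally and vertically.
    return (e, f, advance * a, font_size * d)


# Hypothetical glyph widths for three characters of some font.
widths = {"H": 722, "i": 278, " ": 278}

box = run_box("Hi", widths, font_size=10.0, matrix=(2, 0, 0, 2, 72, 700))
print(box)
```

    So the text is not drawn into a predefined rectangle; the {x,y,w,h} of a run has to be derived from these inputs.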
    >
    >I must oversimplify this, because PDF/PS parsing is a very complex thing. I don't want to rewrite the Adobe Acrobat Viewer; I only want to render the PDF into a pages/textboxes format.
    The Acrobat SDK offers various APIs for getting text, either "raw"
    (PDFEdit for plug-ins) or partly processed (e.g. using fuzzy logic to
    divide into "words").
    >Does Adobe provide an interface to do this?
    >Like MS WMF Play where I can catch every operation.
    There's nothing to intercept drawing, but PDFEdit provides an
    abstraction of the page contents including all text.

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest
