Ask a Question related to Adobe Acrobat SDK, Design and Development.
-
durumdara@adobeforums.com #1
Convert pdf page to text (set of textboxes)
Hi!
My problem is this:
I have a client (a media observer corp).
They have their media content as PDF files.
When they need the text of a PDF file, they copy and paste it from Reader.
But this is very hard with many PDFs, and sometimes the text is split into columns, and the text is encoded in ISO-8859-2... So it is horrible sometimes.
They need a utility, but I don't know whether it is possible to write one.
The utility would work like Recognita (but without OCR): it recognizes the blocks of text and numbers them, the user can change this order, and then the program copies the correctly ordered text to the clipboard or to a file.
For this I need to:
1.) Get all text from the PDF page.
2.) Get all information about the text blocks.
I tried all the commercial and freeware tools on the net, but they do not work well. They get only the text, or they have problems reading the PDF. Adobe Acrobat is also not good for this; its export function does not have a suitable filter for me.
The freeware pdftohtml does something close to what I need: it makes divs with absolute positions, and the divs contain the text.
This is what I need: the text boxes with coordinates and the text.
Example:
page1
textbox1{10,10,200,27,"Thisisatext/xfc/xfa"};
textbox2{10,30,200,27,"2"};
page2
textbox1{00,30,200,27,"Thisisatext/xfc/xfa"};
....
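A minimal sketch in Python of the record format shown in the example above, together with the user-controlled reordering step. The names `TextBox` and `join_in_order` are hypothetical, and the sketch assumes that `/xfc/xfa` in the example stands for the bytes 0xFC 0xFA in ISO-8859-2.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """One text block: position, size, and the text bytes as stored."""
    x: float
    y: float
    w: float
    h: float
    raw: bytes  # assumed to be ISO-8859-2 encoded

    @property
    def text(self) -> str:
        # Decode the stored bytes into a Unicode string.
        return self.raw.decode("iso-8859-2")


def join_in_order(boxes, order):
    """Concatenate the box texts in a user-chosen order (list of indices)."""
    return "\n".join(boxes[i].text for i in order)


page1 = [
    TextBox(10, 10, 200, 27, b"This is a text \xfc\xfa"),
    TextBox(10, 30, 200, 27, b"2"),
]

# The user decided that the second box should be read first.
result = join_in_order(page1, [1, 0])
```

Keeping the raw bytes and decoding only on output means the encoding assumption lives in one place and can be changed per document.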
I have many questions, but one stands above all of them:
Can I get these textboxes from a PDF page/file?
1.) Can the scripting/COM objects provide this information?
2.) What product do I need to install to get this information?
3.) How do I do it? (a very simple example).
Thanks for your help:
dd
durumdara@adobeforums.com Guest
-
Leonard_Rosenthol@adobeforums.com #2
Re: Convert pdf page to text (set of textboxes)
It's just software - therefore it is certainly possible to write...
However, what you are trying to accomplish is complex and difficult, as PDF isn't a file format that (necessarily!) lends itself to this type of operation. Some PDFs WILL contain the necessary information (called Tagging), but others will not :(. In those cases, you will need to write complex heuristics to determine the "blocks".
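As a rough illustration of what such a heuristic could look like, here is a deliberately simple sketch in Python. It assumes the text runs are already available as `(x, y, w, h)` rectangles with `y` growing downward, and it merges a run into a block when the two overlap horizontally and the vertical gap is small. The function name `group_into_blocks` and the `max_gap` threshold are hypothetical; real documents need far more careful rules.

```python
def group_into_blocks(boxes, max_gap=5.0):
    """Greedily group (x, y, w, h) rectangles into column-like blocks."""
    blocks = []
    # Visit runs top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        x, y, w, h = box
        for block in blocks:
            bx, by, bw, bh = block["bounds"]
            overlaps_x = x < bx + bw and bx < x + w
            close_y = y - (by + bh) <= max_gap
            if overlaps_x and close_y:
                # Add the run and grow the block's bounding rectangle.
                block["members"].append(box)
                left, top = min(bx, x), min(by, y)
                right = max(bx + bw, x + w)
                bottom = max(by + bh, y + h)
                block["bounds"] = (left, top, right - left, bottom - top)
                break
        else:
            # No existing block is close enough: start a new one.
            blocks.append({"bounds": box, "members": [box]})
    return blocks


# Two columns, two lines each.
runs = [(10, 10, 200, 27), (10, 40, 200, 27),
        (300, 10, 200, 27), (300, 40, 200, 27)]
blocks = group_into_blocks(runs)
```

With the sample input, the two left-hand runs end up in one block and the two right-hand runs in another.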
Leonard
Leonard_Rosenthol@adobeforums.com Guest
-
durumdara@adobeforums.com #3
Re: Convert pdf page to text (set of textboxes)
Hi!
Interesting that pdftohtml can do the same thing: it converts the PDF text to absolutely positioned divs, but it loses some international characters.
My idea about PDF is that it uses the same technique as WMF/EMF.
The simple way to write a text block is to define it as {x,y,w,h,text}.
Like WinAPI TextRect().
But I don't know whether it works this way in Adobe. Does Adobe store the text in the same way?
Once I have this information, I can do anything with it.
But when the coordinates are missing, the text is only flowing text...
Do you know anything about this text information?
Thanks for it:
dd
durumdara@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #4
Re: Convert pdf page to text (set of textboxes)
You should read the PDF Reference, especially the chapter on text if
you want to be able to understand what text extraction is conceptually
possible, and what complexities can arise.
To oversimplify... Each text run is identified by font, matrix,
spacing parameters and text; the text is subject to interpretation by
the encoding defined for the font. The matrix defines the scaling
(and perhaps skewing) of the font, and its origin. In many cases,
the PDF file does not include spaces, and the text runs may not be in
reading order.
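
To make the role of the matrix concrete, here is a small Python sketch. It assumes the usual six-number PDF matrix form `(a, b, c, d, e, f)`, where a point `(x, y)` maps to `(a*x + c*y + e, b*x + d*y + f)`. The function name `apply_matrix` and the sample values are made up for illustration.

```python
def apply_matrix(m, x, y):
    """Transform the point (x, y) by a matrix m = (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# Hypothetical text run: scaled by 12, origin at (72, 720) on the page.
matrix = (12, 0, 0, 12, 72, 720)

origin = apply_matrix(matrix, 0, 0)          # where the run starts
one_unit_right = apply_matrix(matrix, 1, 0)  # one text-space unit along the baseline
```

The run's origin is simply the `(e, f)` part of the matrix; the other four numbers carry the scaling and any skew or rotation.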
Aandi Inston
Aandi_Inston@adobeforums.com Guest
-
durumdara@adobeforums.com #5
Re: Convert pdf page to text (set of textboxes)
To oversimplify...
Yes, I do that.
But what I don't know is: is the text drawn into an area (clip!), or does the size of the text (font, style, etc.) determine the full width/height?
I must oversimplify this, because PDF/PS parsing is a very complex thing. I don't want to rewrite the Adobe Acrobat viewer; I only want to render the PDF into a pages/textboxes format.
The simplified working method is this:
I want to hook into the "printing" of the PDF.
When some text area is virtually drawn, it calls back my routine with x, y, w, h and the text.
Then I decide what to do and handle everything. I only need the bounds and the text.
Does Adobe provide an interface to do this?
Like MS WMF Play, where I can catch every operation.
Thanks for your help:
dd
durumdara@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #6
Re: Convert pdf page to text (set of textboxes)
>But I don't know that: is text drawed to an area (clip!) or the size of the text (font, style, etc) is determine the full width/height?
The input is: matrix, font, spacing parameters, text. The text is laid
out from the starting point (determined by the matrix). For most fonts
it continues left to right for the width determined by the font
itself, the matrix, and the spacing.
>I must oversimplify this, because the pdf/ps parsing is very complex thing, I don't want to rewrite the Adobe Acrobat Viewer, only I want to render the pdf to pages/textboxes format.
The Acrobat SDK offers various APIs for getting text, either "raw"
(PDFEdit for plug-ins) or partly processed (e.g. using fuzzy logic to
divide into "words").
>Is Adobe provides me an interface to do this?
>Like MS WMF Play where I can catch every operation.
There's nothing to intercept drawing, but PDFEdit provides an
abstraction of the page contents including all text.
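
A hedged Python sketch of how the width of a run follows from the font and the spacing parameters. It assumes glyph widths are given in thousandths of a text-space unit and that word spacing applies only to the space character, which is my reading of the text chapter of the PDF Reference. The `run_width` function and the width table are hypothetical; a real font supplies its own widths, and the result still has to be passed through the matrix to get page coordinates.

```python
def run_width(text, glyph_widths, font_size,
              char_spacing=0.0, word_spacing=0.0, h_scale=1.0):
    """Sum the per-glyph advances of a left-to-right text run."""
    total = 0.0
    for ch in text:
        # Glyph width scaled by font size, plus per-character spacing.
        advance = glyph_widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            # Word spacing is added on spaces only (an assumption, see above).
            advance += word_spacing
        total += advance * h_scale
    return total


# Made-up width table for three glyphs.
widths = {"H": 722, "i": 278, " ": 278}
width = run_width("Hi", widths, font_size=10)
```

Together with the run's origin and the font height, this gives an approximate {x,y,w,h} rectangle for the run.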
Aandi Inston
Aandi_Inston@adobeforums.com Guest




