Identifying image-only vs. image+text PDF files

Ask a Question related to Adobe Acrobat SDK, Design and Development.

  1. #1

    Identifying image-only vs. image+text PDF files

    I have a large collection of PDFs in a large directory/subdirectory structure: about 366K PDFs in total.

    I need to find some way of distinguishing the image-only ones from all the others.

    I've tried various things, and I have both Windows and Linux at my disposal. I was trying to parse the PDF source and search for "Font" or "Annot", which I believe would identify something other than an image-only PDF. Barring that, I'm hoping someone knows of a better way, or of a tool that already does this.

    Any help would be greatly appreciated. Thanks.
    Zamdrist Guest
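
For readers who want to try the raw-source search described in the post above, here is a minimal sketch in Python. It is only a heuristic under stated assumptions: the function names are made up for illustration, and it assumes the page dictionaries are stored uncompressed so that the `/Font` and `/Annot` names are visible in the raw bytes. PDFs that pack their objects into compressed object streams will hide those names, binary image data can contain the same bytes by chance, and a font resource does not prove that any text is actually drawn.

```python
# Name tokens whose presence suggests content beyond plain images.
# Assumption: these names appear uncompressed in the file's raw bytes.
MARKERS = (b"/Font", b"/Annot")


def looks_image_only(data: bytes) -> bool:
    """Return True when none of the marker names occur in the raw PDF bytes."""
    return not any(marker in data for marker in MARKERS)


def file_looks_image_only(path: str) -> bool:
    """Read one file from disk and apply the same byte-level check."""
    with open(path, "rb") as handle:
        return looks_image_only(handle.read())
```

A file flagged as image-only by this check is a candidate for closer inspection rather than a confirmed result.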

  3. #2

    Re: Identifying image-only vs. image+text PDF files

    Does anyone have any thoughts on this? The need is urgent and I would greatly appreciate any feedback. Thank you.
    Zamdrist Guest

  4. #3

    Re: Identifying image-only vs. image+text PDF files

    There are a number of 3rd party solutions that offer this functionality for both Windows and Linux. A web search should turn them up.

    If you want to use Acrobat, you would need to write a custom plugin, or you could try using either IAC or JavaScript to do text extraction (though that's not 100% reliable, since a file with no extractable text is not necessarily image-only: it could be, for example, a map without text).

    Leonard
    Leonard_Rosenthol@adobeforums.com Guest

  5. #4

    Re: Identifying image-only vs. image+text PDF files

    Thank you, Leonard, for your reply. Despite my reasonably good Googling skills, I've not been able to find anything in the way of a plug-in or third-party tool.

    Ironically, I keep finding my own posted question, for which no one seems to have an answer.

    Keep in mind that I need a tool which will search a large directory/sub-directory structure and return ALL matching PDFs along with their path names.

    I'm not looking for a tool that will just tell me whether the currently open PDF is image-only or image+text, which is quite easy to ascertain.

    Thank you.
    Zamdrist Guest

  6. #5

    Re: Identifying image-only vs. image+text PDF files

    I don't know of a company with an "off the shelf" directory walker, but a variety of companies have the necessary component to do the check on a single PDF (which you could then connect to a folder walker).

    Check with companies such as Apago (http://www.apago.com), Traction Software (http://www.traction-software.co.uk/) and Glyph and Cog (http://www.glyphandcog.com).

    Leonard
    Leonard_Rosenthol@adobeforums.com Guest
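
The "folder walker" part mentioned above is easy to sketch separately from the per-file check. The following Python outline is an illustration only: `find_pdfs` and `select_pdfs` are hypothetical names, and the `is_image_only` callback stands in for whatever single-PDF checker is eventually chosen (a vendor component, a text extractor, or a home-grown heuristic).

```python
import os
from typing import Callable, Iterator, List


def find_pdfs(root: str) -> Iterator[str]:
    """Yield the full path of every *.pdf file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def select_pdfs(root: str, is_image_only: Callable[[str], bool]) -> List[str]:
    """Return the sorted paths of all PDFs for which the checker returns True.

    The checker is supplied by the caller; this function only walks folders.
    """
    return sorted(path for path in find_pdfs(root) if is_image_only(path))
```

Keeping the checker as a parameter means the walk over a very large tree does not have to change if the per-file test is swapped for a more reliable one later.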

  7. #6

    Re: Identifying image-only vs. image+text PDF files

    You could have a look at the Adobe IFilter plugin for Windows. It allows
    PDFs to be indexed. The Windows Platform SDK contains sample binaries that
    show how to use this Windows API. It is very easy to examine the pages of
    PDF files using the samples in the SDK. I would guess that a PDF page
    containing only images returns no text.

    Something like


    filtdump testfile.pdf


    should extract the text once you have installed the IFilter plugin. filtdump
    is part of the Platform SDK.

    http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/ixrefint_9sfm.asp



    Zamdrist wrote:
    > Thank you, Leonard, for your reply. Despite my reasonably good Googling skills, I've not been able to find anything in the way of a plug-in or third-party tool.
    >
    > Ironically, I keep finding my own posted question, for which no one seems to have an answer.
    >
    > Keep in mind I need a tool which will search and return ALL PDFs in a large directory/sub-directory structure and their path names.
    >
    > Not looking for a tool that will just let me know if the open PDF is image only or image+text, which is quite easy to ascertain.
    >
    > Thank you.
    Carsten Hammer Guest
