PDWordFinder - Setting characters used to break up words


  1. #1

    Default PDWordFinder - Setting characters used to break up words

    I am using PDWordFinderEnumWords with a PDWordFinder and a callback function to iterate through all the words in my PDF document. I notice that whenever a slash, hyphen, or backslash is encountered, the callback function is called again. This means that, for example, the word
    "multi-functional"
    will be treated as though "multi-" and "functional" are two separate words.

    I want the behaviour to be different in that I only want the space character to be used as a word separator.

    How can I achieve this functionality? I have tried looking at settings in the PDWordFinder, but have not found anything useful.

    Thanks for your help!
    Eliott_Hayut@adobeforums.com Guest
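
To make the requested behaviour concrete, here is a minimal sketch in plain C++ that does not use the Acrobat SDK at all. The function name `splitWords` is made up for this illustration. It only shows the difference between breaking on spaces alone and breaking on spaces plus hyphen, slash, and backslash. Unlike the word finder described above, which reports "multi-" with its hyphen, this sketch simply drops the separator.

```cpp
#include <string>
#include <vector>

// Split `text` into words, treating every character in `separators` as a break.
// The separator characters themselves are discarded.
std::vector<std::string> splitWords(const std::string& text,
                                    const std::string& separators) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (separators.find(c) != std::string::npos) {
            // Hit a separator: close the current word, if there is one.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

Calling `splitWords("a multi-functional tool", " ")` keeps "multi-functional" as one word, while passing `" -/\\"` as the separator set splits it in two.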

  3. #2

    Default Re: PDWordFinder - Setting characters used to break up words

    There is a version of WordFinder (WordFinderCreateEx?) that lets you specify the word break tables.
    Leonard_Rosenthol@adobeforums.com Guest

  4. #3

    Default Re: PDWordFinder - Setting characters used to break up words

    I create my word finder, and set up the charTypeTbl like so:
    const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};
    m_wfConfig->charTypeTbl = myCharTypeTbl;
    m_wfConfig->charTypeTblSize = sizeof(ASUns16)*6;

    However, I get an error when creating the word finder using
    PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
    Eliott_Hayut@adobeforums.com Guest

  5. #4

    Default Re: PDWordFinder - Setting characters used to break up words



    > However, I get an error when creating the word finder

    And what error is that?
    PDL@adobeforums.com Guest

  6. #5

    Default Re: PDWordFinder - Setting characters used to break up words

    An application crash (the worst kind of error!)
    Eliott_Hayut@adobeforums.com Guest

  7. #6

    Default Re: PDWordFinder - Setting characters used to break up words

    Is this in a DURING/HANDLER block?
    PDL@adobeforums.com Guest

  8. #7

    Default Re: PDWordFinder - Setting characters used to break up words

    Aandi, I have tried creating a character type table using the documentation's example as well:
    const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};
    and I got the same results (i.e. a crash).

    PDL, if I put it in a DURING/HANDLER block, the application doesn't crash, but the word finder is not created since an error is caught
    Eliott_Hayut@adobeforums.com Guest

  9. #8

    Default Re: PDWordFinder - Setting characters used to break up words



    > if I put it in a DURING/HANDLER block, the application doesn't crash,
    > but the word finder is not created since an error is caught

    Yes ... and what is that error?
    PDL@adobeforums.com Guest

  10. #9

    Default Re: PDWordFinder - Setting characters used to break up words

    The error is "Bad Parameter."
    Eliott_Hayut@adobeforums.com Guest

  11. #10

    Default Re: PDWordFinder - Setting characters used to break up words

    Sure!

    /*GLOBAL VARIABLE*/
    const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};

    //------------------------------------------------------------------------------
    // Init
    //------------------------------------------------------------------------------
    void gveDoc::Init(gveDocType in_docType )
    {
        //Irrelevant code....

        // Set up m_WordFinder creation options record
        m_wfConfig = static_cast<PDWordFinderConfig>(ASmalloc(sizeof(PDWordFinderConfigRec)));

        memset(m_wfConfig, 0, sizeof(PDWordFinderConfigRec));

        m_wfConfig->recSize = sizeof(PDWordFinderConfigRec);
        m_wfConfig->ignoreCharGaps = false;
        m_wfConfig->ignoreLineGaps = false;
        m_wfConfig->noAnnots = true;
        m_wfConfig->noEncodingGuess = true; // leave non-Roman single-byte font alone

        // Std Roman treatment for custom encoding; overrides the noEncodingGuess option
        m_wfConfig->unknownToStdEnc = false;

        m_wfConfig->disableTaggedPDF = false; // legacy mode m_WordFinder creation
        m_wfConfig->noXYSort = false;
        m_wfConfig->preserveSpaces = false;
        m_wfConfig->noLigatureExp = false;
        m_wfConfig->noHyphenDetection = false;
        m_wfConfig->trustNBSpace = false;
        m_wfConfig->noExtCharOffset = false; // text extraction efficiency
        m_wfConfig->noStyleInfo = false; // text extraction efficiency
        m_wfConfig->decomposeTbl = NULL; // Unicode character replacement
        m_wfConfig->decomposeTblSize = 0;

        m_wfConfig->charTypeTbl = NULL; // Custom char type table
        m_wfConfig->charTypeTblSize = 0;
        //m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
        //m_wfConfig->charTypeTblSize = sizeof(ASUns16)*5;
    }

    GVE_RESULT gveDoc::ExtractText()
    {
        PDWordFinder pdm_WordFinder = NULL;
        gveList* l_lstWordTables = new gveList;

        gveBool l_fIsUnicode = false;
    #ifdef _UNICODE
        l_fIsUnicode = true;
    #endif

        DURING
            pdm_WordFinder = PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
        HANDLER
            char buf[256];
            ASGetErrorString(ERRORCODE, buf, sizeof(buf));
            int x = 4;
        END_HANDLER

        // More code...

    }
    Eliott_Hayut@adobeforums.com Guest

  12. #11

    Default Re: PDWordFinder - Setting characters used to break up words

    > const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};
    This does not seem to match the documentation of what a character type
    table should look like. Not even the length is right.



    Aandi Inston
    Aandi_Inston@adobeforums.com Guest
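
A minimal sketch of the kind of shape check implied here, assuming the table is a flat run of (first code, last code, type flags) triples, which is what the documentation example quoted in this thread suggests. The helper `looksLikeCharTypeTable` and the `kFake*` flag values are invented for this illustration. The real `W_*` constants and the `ASUns16` typedef come from the SDK headers, and the SDK may apply further checks that are not modelled here.

```cpp
#include <cstddef>
#include <cstdint>

using ASUns16 = std::uint16_t;  // stand-in for the SDK typedef

// Returns true when `table` can be read as whole (first, last, flags) triples
// with first <= last in every triple. `count` is the number of ASUns16 entries.
bool looksLikeCharTypeTable(const ASUns16* table, std::size_t count) {
    if (table == nullptr || count == 0 || count % 3 != 0) {
        return false;  // not a whole number of triples
    }
    for (std::size_t i = 0; i < count; i += 3) {
        if (table[i] > table[i + 1]) {
            return false;  // range is reversed
        }
    }
    return true;
}

// Placeholder flag values, NOT the real SDK constants.
constexpr ASUns16 kFakeCntl = 0x0001;
constexpr ASUns16 kFakeDigit = 0x0008;
constexpr ASUns16 kFakeWordBreak = 0x0100;

// Five entries, like the first attempt: not a whole number of triples.
constexpr ASUns16 kFiveEntryTable[] = {32, 9, 13, 3, kFakeWordBreak};

// Six entries, shaped like the documentation example: two triples.
constexpr ASUns16 kSixEntryTable[] = {0x0082, 0x0082, kFakeCntl | kFakeWordBreak,
                                      0x00b2, 0x00b3, kFakeDigit};
```

Under that assumption the five-entry table fails the check on length alone, while the six-entry table passes.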

  13. #12

    Default Re: PDWordFinder - Setting characters used to break up words

    Ok, can you post a complete fragment that generates a Bad Parameter
    error with the character type table from the example...

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest

  14. #13

    Default Re: PDWordFinder - Setting characters used to break up words

    I apologize, the code I posted was a little unclear.

    If I comment out:
    m_wfConfig->charTypeTbl = NULL; // Custom char type table
    m_wfConfig->charTypeTblSize = 0;

    And uncomment:
    //m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
    //m_wfConfig->charTypeTblSize = sizeof(ASUns16)*5;

    I get a bad parameter error. If I leave the code as I pasted in my previous post, the code executes without a problem.
    Eliott_Hayut@adobeforums.com Guest

  15. #14

    Default Re: PDWordFinder - Setting characters used to break up words

    Ok, and does that exact code give Bad Parameter for you?

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest

  16. #15

    Default Re: PDWordFinder - Setting characters used to break up words

    I arrived at this value just by playing around. If I replace this value with the one used as an example in the documentation:

    const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};

    I get the exact same error.
    Eliott_Hayut@adobeforums.com Guest

  17. #16

    Default Re: PDWordFinder - Setting characters used to break up words

    >I arrived at this value just by playing around.

    How about by reading the documentation??

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest

  18. #17

    Default Re: PDWordFinder - Setting characters used to break up words

    As I said in my previous post, even by using the character type table sample provided in the documentation, I get the same "Bad Parameter" error. I would assume that the example that Adobe provides in their documentation should function correctly, so I am inclined to believe that the character type table definition is not the problem.

    Do you know what else may be causing the error?

    Thanks :)
    Eliott_Hayut@adobeforums.com Guest

  19. #18

    Default Re: PDWordFinder - Setting characters used to break up words

    Well, you've posted an example that you say works, and you say that by
    changing a line you can get a parameter error, which I would expect
    given that there is clearly a parameter error.

    Could I suggest you post an example which
    * produces the exception
    * uses a correct character table definition such as the one in the
    documentation?

    Aandi Inston
    Aandi_Inston@adobeforums.com Guest

  20. #19

    Default Re: PDWordFinder - Setting characters used to break up words

    /*GLOBAL VARIABLE*/
    const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};

    //------------------------------------------------------------------------------
    // Init
    //------------------------------------------------------------------------------
    void gveDoc::Init(gveDocType in_docType )
    {
        //Irrelevant code....

        // Set up m_WordFinder creation options record
        m_wfConfig = static_cast<PDWordFinderConfig>(ASmalloc(sizeof(PDWordFinderConfigRec)));

        memset(m_wfConfig, 0, sizeof(PDWordFinderConfigRec));

        m_wfConfig->recSize = sizeof(PDWordFinderConfigRec);
        m_wfConfig->ignoreCharGaps = false;
        m_wfConfig->ignoreLineGaps = false;
        m_wfConfig->noAnnots = true;
        m_wfConfig->noEncodingGuess = true; // leave non-Roman single-byte font alone

        // Std Roman treatment for custom encoding; overrides the noEncodingGuess option
        m_wfConfig->unknownToStdEnc = false;

        m_wfConfig->disableTaggedPDF = false; // legacy mode m_WordFinder creation
        m_wfConfig->noXYSort = false;
        m_wfConfig->preserveSpaces = false;
        m_wfConfig->noLigatureExp = false;
        m_wfConfig->noHyphenDetection = false;
        m_wfConfig->trustNBSpace = false;
        m_wfConfig->noExtCharOffset = false; // text extraction efficiency
        m_wfConfig->noStyleInfo = false; // text extraction efficiency
        m_wfConfig->decomposeTbl = NULL; // Unicode character replacement
        m_wfConfig->decomposeTblSize = 0;
        m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
        m_wfConfig->charTypeTblSize = sizeof(ASUns16)*7;
    }

    GVE_RESULT gveDoc::ExtractText()
    {
        PDWordFinder pdm_WordFinder = NULL;
        gveList* l_lstWordTables = new gveList;

        gveBool l_fIsUnicode = false;
    #ifdef _UNICODE
        l_fIsUnicode = true;
    #endif

        DURING
            /****THIS LINE THROWS AN EXCEPTION "Bad Parameter."****/
            pdm_WordFinder = PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
        HANDLER
            char buf[256];
            ASGetErrorString(ERRORCODE, buf, sizeof(buf));
            int x = 4;
        END_HANDLER

        // More code...

        }

    The above code throws an exception - I just tested it. Note that I am 100% positive that m_pdfDoc (from the line of code that throws the exception) is valid.

    Thanks again for your help, and I'm sorry if my previous posts were a little scattered.
    Eliott_Hayut@adobeforums.com Guest
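
The thread does not record what the mistake turned out to be. One detail that stands out in the fragment above is that the table has six entries while the size is written by hand as `sizeof(ASUns16)*7`. The sketch below, in plain C++ with no SDK dependency, shows one way to let the compiler derive the size from the array so the two cannot drift apart. The helpers `arrayBytes` and `arrayCount` are made up for this illustration and the flag values in the table are placeholders. Whether `charTypeTblSize` expects a byte count or an entry count should be confirmed against the SDK reference.

```cpp
#include <cstddef>
#include <cstdint>

using ASUns16 = std::uint16_t;  // stand-in for the SDK typedef

// Size in bytes of a real array, computed by the compiler.
// Passing a pointer instead of an array does not compile, which is intentional.
template <typename T, std::size_t N>
constexpr std::size_t arrayBytes(const T (&)[N]) {
    return N * sizeof(T);
}

// Number of entries in a real array.
template <typename T, std::size_t N>
constexpr std::size_t arrayCount(const T (&)[N]) {
    return N;
}

// Same shape as the documentation example; the flag values are placeholders.
constexpr ASUns16 kTable[] = {0x0082, 0x0082, 0x0101, 0x00b2, 0x00b3, 0x0008};

constexpr std::size_t kDerivedBytes = arrayBytes(kTable);   // six entries
constexpr std::size_t kHandBytes = sizeof(ASUns16) * 7;     // as written in the post

// The hand-written size claims more bytes than the table actually holds.
static_assert(kHandBytes > kDerivedBytes, "hand-counted size overruns the table");
```

With a helper like this, the assignment in `Init` could use the derived value for `myCharTypeTbl` instead of a hand-counted multiple.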

  21. #20

    Default Re: PDWordFinder - Setting characters used to break up words

    Aandi, you've saved me hours of trying to debug something which in the end was just a silly mistake on my part.

    Thanks for your time!
    Eliott_Hayut@adobeforums.com Guest
