Ask a Question related to Adobe Acrobat SDK, Design and Development.
-
Eliott_Hayut@adobeforums.com #1
PDWordFinder - Setting characters used to break up words
I am using PDWordFinderEnumWords with a PDWordFinder and a callback function to iterate through all the words in my PDF document. I notice that whenever a slash, hyphen, or backslash is encountered, the callback function is called again. What this means is that, for example, the word
"multi-functional"
will be treated as though "multi-" and "functional" are two separate words.
I want the behaviour to be different in that I only want the space character to be used as a word separator.
How can I achieve this functionality? I have tried looking at settings in the PDWordFinder, but have not found anything useful.
Thanks for your help!
Eliott_Hayut@adobeforums.com Guest
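
To make the difference concrete, here is a minimal sketch in plain C++ of the two splitting behaviours described above. It is not PDWordFinder code and does not use the Acrobat SDK; both function names are made up for illustration, and the second function only mimics the observed behaviour, keeping the break character attached to the word it ends.

```cpp
#include <string>
#include <vector>

// Append the pending word (if any) to the list and reset it.
static void flushWord(std::vector<std::string>& words, std::string& current) {
    if (!current.empty()) {
        words.push_back(current);
        current.clear();
    }
}

// Desired behaviour: only the space character separates words.
std::vector<std::string> splitOnSpaces(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (c == ' ') {
            flushWord(words, current);
        } else {
            current += c;
        }
    }
    flushWord(words, current);
    return words;
}

// Observed behaviour (mimicked): a word also ends after '-', '/' or '\\',
// and the break character stays attached to the word it ends.
std::vector<std::string> splitOnSpacesAndBreaks(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (c == ' ') {
            flushWord(words, current);
            continue;
        }
        current += c;
        if (c == '-' || c == '/' || c == '\\') {
            flushWord(words, current);
        }
    }
    flushWord(words, current);
    return words;
}
```

With the input "a multi-functional tool", the first function keeps "multi-functional" as one word, while the second yields "multi-" and "functional" as separate words.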
-
Leonard_Rosenthol@adobeforums.com #2
Re: PDWordFinder - Setting characters used to break up words
There is a version of WordFinder (WordFinderCreateEx?) that lets you specify the word break tables.
Leonard_Rosenthol@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #3
Re: PDWordFinder - Setting characters used to break up words
I create my word finder, and set up the charTypeTbl like so:
const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};
m_wfConfig->charTypeTbl = myCharTypeTbl;
m_wfConfig->charTypeTblSize = sizeof(ASUns16)*6;
However, I get an error when creating the word finder using
PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
Eliott_Hayut@adobeforums.com Guest
-
PDL@adobeforums.com #4
Re: PDWordFinder - Setting characters used to break up words
>However, I get an error when creating the word finder
And what error is that?
PDL@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #5
Re: PDWordFinder - Setting characters used to break up words
An application crash (the worst kind of error!)
Eliott_Hayut@adobeforums.com Guest
-
PDL@adobeforums.com #6
Re: PDWordFinder - Setting characters used to break up words
Is this in a DURING/HANDLER block?
PDL@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #7
Re: PDWordFinder - Setting characters used to break up words
Aandi, I have tried creating a character type table using the documentation's example as well:
const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};
and I got the same results (i.e. a crash).
PDL, if I put it in a DURING/HANDLER block, the application doesn't crash, but the word finder is not created since an error is caught
Eliott_Hayut@adobeforums.com Guest
-
PDL@adobeforums.com #8
Re: PDWordFinder - Setting characters used to break up words
>if I put it in a DURING/HANDLER block, the application doesn't crash,
>but the word finder is not created since an error is caught
Yes ... and what is that error?
PDL@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #9
Re: PDWordFinder - Setting characters used to break up words
The error is "Bad Parameter."
Eliott_Hayut@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #10
Re: PDWordFinder - Setting characters used to break up words
Sure!
/*GLOBAL VARIABLE*/
const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};
//------------------------------------------------------------------------------
// Init
//------------------------------------------------------------------------------
void gveDoc::Init(gveDocType in_docType )
{
//Irrelevant code....
// Set up m_WordFinder creation options record
m_wfConfig = static_cast<PDWordFinderConfig>(ASmalloc(sizeof(PDWordFinderConfigRec)));
memset(m_wfConfig, 0, sizeof(PDWordFinderConfigRec));
m_wfConfig->recSize = sizeof(PDWordFinderConfigRec);
m_wfConfig->ignoreCharGaps = false;
m_wfConfig->ignoreLineGaps = false;
m_wfConfig->noAnnots = true;
m_wfConfig->noEncodingGuess = true; // leave non-Roman single-byte font alone
// Std Roman treatment for custom encoding; overrides the noEncodingGuess option
m_wfConfig->unknownToStdEnc = false;
m_wfConfig->disableTaggedPDF = false; // legacy mode m_WordFinder creation
m_wfConfig->noXYSort = false;
m_wfConfig->preserveSpaces = false;
m_wfConfig->noLigatureExp = false;
m_wfConfig->noHyphenDetection = false;
m_wfConfig->trustNBSpace = false;
m_wfConfig->noExtCharOffset = false; // text extraction efficiency
m_wfConfig->noStyleInfo = false; // text extraction efficiency
m_wfConfig->decomposeTbl = NULL; // Unicode character replacement
m_wfConfig->decomposeTblSize = 0;
m_wfConfig->charTypeTbl = NULL; // Custom char type table
m_wfConfig->charTypeTblSize = 0;
//m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
//m_wfConfig->charTypeTblSize = sizeof(ASUns16)*5;
}
GVE_RESULT gveDoc::ExtractText()
{
PDWordFinder pdm_WordFinder = NULL;
gveList* l_lstWordTables = new gveList;
gveBool l_fIsUnicode = false;
#ifdef _UNICODE
l_fIsUnicode = true;
#endif
DURING
pdm_WordFinder = PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
HANDLER
char buf[256];
ASGetErrorString(ERRORCODE, buf, sizeof(buf));
int x = 4;
END_HANDLER
// More code...
}
Eliott_Hayut@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #11
Re: PDWordFinder - Setting characters used to break up words
>const ASUns16 myCharTypeTbl[] = {32, 9, 13, 3, W_WORD_BREAK};
This does not seem to match the documentation of what a character type
table should look like. Not even the length is right.
Aandi Inston
Aandi_Inston@adobeforums.com Guest
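
The point about the table's length can be sketched as follows. This assumes, going by the documentation example quoted in this thread, that each table entry is a triple of region start, region end, and type flags. The typedef and the numeric flag values are stand-ins so the sketch compiles without the Acrobat SDK headers (they are placeholders, not the SDK's real values), and `tableIsWellFormed` is a hypothetical helper, not an SDK call.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins so this compiles without the Acrobat SDK. In a real plug-in,
// ASUns16 and the W_* flags come from the SDK headers; the numeric values
// below are placeholders only.
typedef std::uint16_t ASUns16;
const ASUns16 W_CNTL = 0x0001;
const ASUns16 W_DIGIT = 0x0008;
const ASUns16 W_WORD_BREAK = 0x2000;

// Table from the documentation example: two entries, each assumed to be
// a triple {region start, region end, type flags}.
const ASUns16 myCharTypeTbl[] = {
    0x0082, 0x0082, W_CNTL + W_WORD_BREAK,
    0x00b2, 0x00b3, W_DIGIT
};

// Derive both sizes from the array itself instead of a hand-counted
// multiplier such as sizeof(ASUns16)*5 or sizeof(ASUns16)*7.
const std::size_t kTblCount = sizeof(myCharTypeTbl) / sizeof(myCharTypeTbl[0]);
const std::size_t kTblBytes = sizeof(myCharTypeTbl);

// Hypothetical sanity check: a table made of triples must have a
// non-zero element count that is a multiple of three.
bool tableIsWellFormed(std::size_t elementCount) {
    return elementCount > 0 && elementCount % 3 == 0;
}
```

A five-element table such as {32, 9, 13, 3, W_WORD_BREAK} fails this check, while the six-element documentation example passes. Whether charTypeTblSize expects a byte count or an element count is not settled in this thread, so it is worth confirming against the SDK header before choosing between `kTblBytes` and `kTblCount`.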
-
Aandi_Inston@adobeforums.com #12
Re: PDWordFinder - Setting characters used to break up words
Ok, can you post a complete fragment that generates a Bad Parameter
error with the character type table from the example...
Aandi Inston
Aandi_Inston@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #13
Re: PDWordFinder - Setting characters used to break up words
I apologize, the code I posted was a little unclear.
If I comment out:
m_wfConfig->charTypeTbl = NULL; // Custom char type table
m_wfConfig->charTypeTblSize = 0;
And uncomment:
//m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
//m_wfConfig->charTypeTblSize = sizeof(ASUns16)*5;
I get a bad parameter error. If I leave the code as I pasted in my previous post, the code executes without a problem.
Eliott_Hayut@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #14
Re: PDWordFinder - Setting characters used to break up words
Ok, and does that exact code give Bad Parameter for you?
Aandi Inston
Aandi_Inston@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #15
Re: PDWordFinder - Setting characters used to break up words
I arrived at this value just by playing around. If I replace this value with the one used as an example in the documentation:
const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};
I get the exact same error.
Eliott_Hayut@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #16
Re: PDWordFinder - Setting characters used to break up words
>I arrived at this value just by playing around.
How about by reading the documentation??
Aandi Inston
Aandi_Inston@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #17
Re: PDWordFinder - Setting characters used to break up words
As I said in my previous post, even by using the character type table sample provided in the documentation, I get the same "Bad Parameter" error. I would assume that the example that Adobe provides in their documentation should function correctly, so I am inclined to believe that the character type table definition is not the problem.
Do you know what else may be causing the error?
Thanks :)
Eliott_Hayut@adobeforums.com Guest
-
Aandi_Inston@adobeforums.com #18
Re: PDWordFinder - Setting characters used to break up words
Well, you've posted an example that you say works, and you say that by
changing a line you can get a parameter error, which I would expect
given that there is clearly a parameter error.
Could I suggest you post an example which
* produces the exception
* uses a correct character table definition such as the one in the
documentation?
Aandi Inston
Aandi_Inston@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #19
Re: PDWordFinder - Setting characters used to break up words
/*GLOBAL VARIABLE*/
const ASUns16 myCharTypeTbl[] = {0x0082, 0x0082, W_CNTL+W_WORD_BREAK, 0x00b2, 0x00b3, W_DIGIT};
//------------------------------------------------------------------------------
// Init
//------------------------------------------------------------------------------
void gveDoc::Init(gveDocType in_docType )
{
//Irrelevant code....
// Set up m_WordFinder creation options record
m_wfConfig = static_cast<PDWordFinderConfig>(ASmalloc(sizeof(PDWordFinderConfigRec)));
memset(m_wfConfig, 0, sizeof(PDWordFinderConfigRec));
m_wfConfig->recSize = sizeof(PDWordFinderConfigRec);
m_wfConfig->ignoreCharGaps = false;
m_wfConfig->ignoreLineGaps = false;
m_wfConfig->noAnnots = true;
m_wfConfig->noEncodingGuess = true; // leave non-Roman single-byte font alone
// Std Roman treatment for custom encoding; overrides the noEncodingGuess option
m_wfConfig->unknownToStdEnc = false;
m_wfConfig->disableTaggedPDF = false; // legacy mode m_WordFinder creation
m_wfConfig->noXYSort = false;
m_wfConfig->preserveSpaces = false;
m_wfConfig->noLigatureExp = false;
m_wfConfig->noHyphenDetection = false;
m_wfConfig->trustNBSpace = false;
m_wfConfig->noExtCharOffset = false; // text extraction efficiency
m_wfConfig->noStyleInfo = false; // text extraction efficiency
m_wfConfig->decomposeTbl = NULL; // Unicode character replacement
m_wfConfig->decomposeTblSize = 0;
m_wfConfig->charTypeTbl = myCharTypeTbl; // Custom char type table
m_wfConfig->charTypeTblSize = sizeof(ASUns16)*7;
}
GVE_RESULT gveDoc::ExtractText()
{
PDWordFinder pdm_WordFinder = NULL;
gveList* l_lstWordTables = new gveList;
gveBool l_fIsUnicode = false;
#ifdef _UNICODE
l_fIsUnicode = true;
#endif
DURING
/****THIS LINE THROWS AN EXCEPTION "Bad Parameter."****/
pdm_WordFinder = PDDocCreateWordFinderEx( m_pdfDoc, WF_LATEST_VERSION, l_fIsUnicode, m_wfConfig );
HANDLER
char buf[256];
ASGetErrorString(ERRORCODE, buf, sizeof(buf));
int x = 4;
END_HANDLER
// More code...
}
The above code throws an exception - I just tested it. Note that I am 100% positive that m_pdfDoc (from the line of code that throws the exception) is valid.
Thanks again for your help, and I'm sorry if my previous posts were a little scattered.
Eliott_Hayut@adobeforums.com Guest
-
Eliott_Hayut@adobeforums.com #20
Re: PDWordFinder - Setting characters used to break up words
Aandi, you've saved me hours of trying to debug something which in the end was just a silly mistake on my part.
Thanks for your time!
Eliott_Hayut@adobeforums.com Guest