Hello,
I'm new to the community. My apologies if this would best be asked elsewhere.
I'm a second-year CS student and am currently doing some studies in document analysis using the (somewhat terrible) vector document comparison method.
The method requires trimming words down to their "base" words. In many cases this is as simple as removing the "s" from plural words or removing the "ing" from present participles, etc. But in many others (e.g. cactus, which ends in "s" without being a plural), it's not so simple.
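To illustrate the problem, here is a minimal sketch of the naive suffix-stripping approach. The class name `NaiveStemmer`, the method `stem`, and the length thresholds are all made up for this example; it is not meant as a real solution.

```java
import java.util.Locale;

class NaiveStemmer {
    // Hypothetical helper: strips a trailing "ing" or "s" and nothing else.
    // The length checks are arbitrary guards so that very short words
    // such as "sing" or "is" are left alone.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 5 && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        if (w.length() > 3 && w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] samples = {"cats", "walking", "cactus", "cacti", "running"};
        for (String word : samples) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

This handles "cats" and "walking" fine, but it turns "cactus" into "cactu", leaves "cacti" untouched, and turns "running" into "runn", which is why I'm hoping for a ready-made map instead.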
Does anybody know where I could acquire a more-or-less comprehensive "word -> base-word" map? Writing one from scratch is way beyond the scope of the project, and I'm not even sure what to Google for (although I definitely looked for a while before posting here).
(This project will be done in Java, for what it's worth, but I'm fluent enough in regexen to handle almost any format.)
Any advice would be great! Thanks in advance!
