+ 18

Working with natural languages

I would like to analyze large amounts of text for the vocabulary they contain. I'm looking for a tool that recognizes the different inflected forms of a word and maps them back to the base word, so that each word is counted only once. For example, the words "counting", "count", "counted", and "counts" would all be recognized as "count" and... counted only once. Is there a framework with the appropriate databases that can do this, preferably an easy-to-use one?

python languages vocabulary

25th Dec 2021, 7:31 PM

HonFu

10 Answers

+ 13

https://code.sololearn.com/cUNN85EmXzRN/?ref=app

26th Dec 2021, 2:49 AM

Vitaly Sokol

+ 9

So you have a text and want to extract the word stems from it (from its sentences)? Did you try NLTK (Python)? It should let you do something like that, for English at least...
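
For illustration, here is a minimal sketch of what that could look like with NLTK's Snowball stemmer. The helper name `stem_counts` and the simple regex tokenizer are assumptions made for this example, not something from the thread, and real text would likely need a better tokenizer.

```python
import re
from collections import Counter

from nltk.stem.snowball import SnowballStemmer


def stem_counts(text, language="english"):
    """Count words by their stem (hypothetical helper for this sketch)."""
    stemmer = SnowballStemmer(language)
    # Naive tokenizer: runs of letters only, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(stemmer.stem(word) for word in words)


counts = stem_counts("Counting counts: she counted, and they count.")
print(counts["count"])  # all four forms are grouped under the stem "count"
```

Note that a stem is not always a dictionary word; if you need real base forms, lemmatization is the alternative to look at.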

25th Dec 2021, 8:22 PM

Lisa

+ 7

Simon Sauter, the reason is that Snowball can be used for different languages.
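
A small sketch of how that might be checked with NLTK, assuming the `SnowballStemmer` class from `nltk.stem.snowball`. The helper `is_supported` and the sample words are made up for this example. As far as I know, German is in the supported list but Japanese is not, so Japanese would need a different tool.

```python
from nltk.stem.snowball import SnowballStemmer


def is_supported(language):
    """Check whether NLTK ships a Snowball stemmer for this language."""
    return language.lower() in SnowballStemmer.languages


print(SnowballStemmer.languages)  # tuple of supported language names

if is_supported("german"):
    stemmer = SnowballStemmer("german")
    # Sample words chosen for illustration only.
    for word in ["Haus", "Hauses", "Häuser"]:
        print(word, "->", stemmer.stem(word))

print(is_supported("japanese"))
```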

26th Dec 2021, 3:18 AM

Vitaly Sokol

+ 5

I've never used it myself, but this looks like it might do what you're looking for: https://machinelearningknowledge.ai/learn-lemmatization-in-ntlk-with-examples/

25th Dec 2021, 8:38 PM

Simon Sauter

+ 4

Vitaly Sokol, wow, thank you, the example shows it clearly!

26th Dec 2021, 8:56 AM

HonFu

+ 4

Arif Dastager That's exactly the code posted above, isn't it?

27th Dec 2021, 12:00 PM

Lisa

+ 2

Hm, cool, that does look like the general thing I need... It would be great if it worked for other languages, above all German and Japanese. Thanks, I'll check that out!

25th Dec 2021, 8:43 PM

HonFu

+ 1

Vitaly Sokol, is there a reason why you used stemming instead of lemmatization?

26th Dec 2021, 3:15 AM

Simon Sauter

+ 1

That's real nice 👍

27th Dec 2021, 2:49 PM

₿ig Ray