+ 1

How to process the text in a document

I want to tokenize a text document. I have the file, but I cannot get a result from it in PyCharm.

1st Jun 2018, 7:40 AM

reza

3 Answers

+ 5

Do you simply want to break the text into words, or do you need a more complex analysis? In the first case, you can use the split() method to break a string into a list of strings. You can also get rid of punctuation with replace(). If you need more advanced text-processing tools, Python probably has them in some module. Maybe this helps: http://www.nltk.org
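A minimal sketch of the split() and replace() approach, using only the standard library. The function name simple_tokenize and the sample sentence are made up for illustration, and string.punctuation covers ASCII punctuation only.

```python
import string


def simple_tokenize(text):
    """Split text into words after removing ASCII punctuation.

    simple_tokenize is a hypothetical helper name, not a built-in.
    """
    # Replace each punctuation character with a space, so that
    # "end.Next" still separates into two words.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # split() with no arguments splits on any run of whitespace.
    return text.split()


print(simple_tokenize("Hello, world! This is a test."))
# ['Hello', 'world', 'This', 'is', 'a', 'test']
```

Note that this also breaks contractions apart ("don't" becomes "don" and "t"), which is one reason a dedicated tokenizer may be worth it for anything beyond a quick word list.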

1st Jun 2018, 7:52 AM

Pedro Demingos

+ 5

Exactly what Pedro said. NLTK's tokenizer is really good, and the library itself can take you through the whole process. Plus, if you want to do a semantic analysis, you can use word2vec, which works smoothly with the NLTK corpora.
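To give a feel for what a tokenizer does beyond split(), here is a rough standard-library sketch. It is not NLTK itself: it uses a regular expression that, as far as I know, matches the pattern behind NLTK's wordpunct_tokenize, so words and punctuation come out as separate tokens. The names TOKEN_PATTERN and regex_tokenize are made up for this example.

```python
import re

# Assumed pattern: a run of word characters, or a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_tokenize(text):
    """Return words and punctuation runs as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Good muffins cost $3.88 in New York."))
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```

Unlike the split() approach, punctuation is kept as its own token instead of being thrown away, which matters if you later want sentence boundaries or prices.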

1st Jun 2018, 8:06 AM

Kuba Siekierzyński

+ 2

Please explain; your question is a bit vague. Otherwise, I hope this helps:

    with open('text_file.txt') as f:
        file_contents = f.readlines()

    # This should print the contents of the file named 'text_file.txt'
    # as a list of lines
    print(file_contents)
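Building on the file-reading snippet above, this is a small sketch that goes from a file to a list of words. The helper name tokenize_file, the UTF-8 encoding, and the sample text are assumptions for illustration; the demo writes its own temporary 'text_file.txt' so it runs anywhere, including from PyCharm.

```python
import os
import tempfile


def tokenize_file(path, encoding="utf-8"):
    """Return all whitespace-separated tokens from a text file.

    tokenize_file is a hypothetical helper; the encoding is an assumption.
    """
    tokens = []
    with open(path, encoding=encoding) as f:
        # Iterate line by line so large files are not loaded all at once.
        for line in f:
            tokens.extend(line.split())
    return tokens


# Demo: create a small sample file, then tokenize it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "text_file.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("first line of text\nsecond line\n")
    print(tokenize_file(sample_path))
    # ['first', 'line', 'of', 'text', 'second', 'line']
```

If the script cannot find your own file in PyCharm, a likely cause is that the working directory is not the folder containing the file; passing an absolute path to open() is a simple way to check.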

1st Jun 2018, 7:49 AM

Mpho Mphego