Fri 20 June 2014

Filed under python

Tags text python

This script takes a webpage (in this case, the Wikipedia page for Machine Learning), does a quick and dirty scrape, and then counts the words on the page. It is not particularly sophisticated and can be customized and improved for whatever your purpose may be.

import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
from collections import Counter

# read contents from webpage
f = urllib.request.urlopen('http://en.wikipedia.org/wiki/Machine_learning')
contents = f.read()

# create BS object (name the parser explicitly so bs4 does not have to guess)
soup = BeautifulSoup(contents, 'html.parser')

# clean text: lower case, replace commas with spaces, drop words of 2 characters or fewer
mytext = soup.get_text()
mytext = mytext.lower()
mytext = mytext.replace(",", " ")
mytext = ' '.join(word for word in mytext.split() if len(word) > 2)

# remove stopwords (build the set once instead of reloading the list for every word)
english_stopwords = set(stopwords.words('english'))
filtered_words = [w for w in mytext.split() if w not in english_stopwords]

# return counts using counter object
mycounts = Counter(filtered_words)
print(mycounts.most_common(10))
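
One possible refinement: the script above only strips commas, so tokens such as "learning." and "learning" are counted as different words. Below is a minimal sketch of one way to handle that, assuming a simple regular-expression tokenizer is good enough for your purpose. The function name count_words and its parameters are hypothetical and not part of the original script, and the pattern only keeps ASCII letters and apostrophes.

```python
import re
from collections import Counter


# Hypothetical helper: the name and parameters are illustrative only.
def count_words(text, stop_words=(), min_length=3, top_n=10):
    """Return the top_n most common words in text, ignoring punctuation."""
    # Keep runs of ASCII letters and apostrophes, so "learning." and
    # "learning" become the same token. Accented letters are dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Trim stray quotes left at the edges of a token.
    tokens = [t.strip("'") for t in tokens]
    stop_words = set(stop_words)
    kept = [t for t in tokens if len(t) >= min_length and t not in stop_words]
    return Counter(kept).most_common(top_n)


sample = "Machine learning is fun. Learning, in machines, is the study of learning!"
print(count_words(sample, stop_words={"the"}, top_n=3))
```

To use it with the script above, you could pass soup.get_text() as text and stopwords.words('english') as stop_words.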

Matt O'Brien (dot) Me © Matt O'Brien