Monday, August 11, 2014

Corpus from Google News

Package used: http://cran.r-project.org/web/packages/tm.plugin.webmining/
Package ‘tm.plugin.webmining’
Google News usage:

GoogleNewsSource(query, params = list(hl = "en", q = query, ie = "utf-8", num = 100, output = "rss"), ...)
Practical example:
corpus <- Corpus(GoogleNewsSource("Microsoft"))
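The params argument in the signature above is an ordinary named list, so the feed query can presumably be adjusted by passing a modified list. Below is a minimal sketch: news_params is a hypothetical helper (not part of tm.plugin.webmining) that rebuilds the list with the same field names as the default, and the values "ro" and 20 are illustrative assumptions. The call that actually contacts Google News is left as a comment because it needs network access, and the feed may not honor every field.

```r
# Hypothetical helper: builds a params list with the same fields as the
# default shown in the GoogleNewsSource signature.
news_params <- function(query, hl = "en", num = 100) {
  list(hl = hl, q = query, ie = "utf-8", num = num, output = "rss")
}

# Assumed example values: Romanian interface language, 20 results.
params <- news_params("Microsoft", hl = "ro", num = 20)
str(params)

# With network access, the list would be passed on like this:
# library(tm.plugin.webmining)
# corpus <- WebCorpus(GoogleNewsSource("Microsoft", params = params))
```

Keeping the list construction separate from the download makes it easy to check the query fields before any request is sent.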
Test scenario:
> library(tm)
> library(tm.plugin.webmining)
> googlenews <- WebCorpus(GoogleNewsSource("Stiri"))
> googlenews
<<WebCorpus (documents: 100, metadata (corpus/indexed): 3/0)>>
> googlenews <- corpus.update(googlenews)
> inspect(googlenews)
> VCorpus(VectorSource(googlenews))
> dtm <- DocumentTermMatrix(googlenews)
> findFreqTerms(dtm, 5)
> inspect(removeSparseTerms(dtm, 0.4))
> writeCorpus(googlenews, path = "C:/R", filenames = NULL)
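
The last steps of the scenario (document-term matrix, frequent terms, sparse-term removal, writing to disk) can also be tried offline. The sketch below assumes only the tm package. The three sentences are made-up stand-ins for downloaded news items, and the output goes to a temporary directory instead of C:/R.

```r
library(tm)

# Three made-up documents standing in for downloaded news items.
docs <- c("microsoft releases new windows update",
          "google news reports on microsoft earnings",
          "new update for google search")
toy <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms.
dtm <- DocumentTermMatrix(toy)

# Terms whose total count across the corpus is at least 2.
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(frequent)

# Keep only terms whose sparsity (share of documents lacking the term)
# is below 0.4.
dense <- removeSparseTerms(dtm, sparse = 0.4)
inspect(dense)

# writeCorpus does not create the target directory, so create it first,
# then write one text file per document.
out_dir <- file.path(tempdir(), "toy_corpus")
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(toy, path = out_dir)
print(list.files(out_dir))
```

With a threshold as low as 0.4, only terms that occur in most documents survive, which is why the scenario's inspect(removeSparseTerms(dtm, 0.4)) call tends to return a small matrix.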