A Comparison of Event Models for Naive Bayes Text Classification


(Jurnal Teknik Informatika)


Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a unigram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
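To make the distinction between the two event models concrete, here is a toy sketch (with a made-up four-word vocabulary and invented per-class word probabilities, not values from the paper): the multi-variate Bernoulli model scores a document through one binary present/absent feature per vocabulary word, so absent words contribute (1 - p) factors and repeated words count once, while the multinomial model scores each token occurrence, so word counts matter.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary and per-class word probabilities (assumed values).
vocab = ["ball", "game", "election", "vote"]
p_word = {
    "sports":   {"ball": 0.8, "game": 0.7, "election": 0.1, "vote": 0.1},
    "politics": {"ball": 0.1, "game": 0.2, "election": 0.8, "vote": 0.7},
}

def bernoulli_log_likelihood(doc_words, cls):
    """Multi-variate Bernoulli: one binary feature per vocabulary word.
    Words absent from the document contribute log(1 - p) factors."""
    present = set(doc_words)  # duplicates collapse to a single occurrence
    ll = 0.0
    for w in vocab:
        p = p_word[cls][w]
        ll += math.log(p) if w in present else math.log(1.0 - p)
    return ll

def multinomial_log_likelihood(doc_words, cls):
    """Multinomial: a unigram language model; every token occurrence
    contributes a factor, so integer word counts matter."""
    # Normalize the per-class weights into a distribution over the vocabulary.
    total = sum(p_word[cls][w] for w in vocab)
    ll = 0.0
    for w, n in Counter(doc_words).items():
        ll += n * math.log(p_word[cls][w] / total)
    return ll

doc = ["ball", "game", "game"]
# The Bernoulli model ignores the repeated "game"; the multinomial does not.
```

Note that `bernoulli_log_likelihood` returns the same value whether "game" appears once or three times, which is exactly the modeling difference the paper examines.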


Simple Bayesian classifiers have been gaining popularity lately, and have been found to perform surprisingly well (Friedman 1997; Friedman et al. 1997; Sahami 1996; Langley et al. 1992). These probabilistic approaches make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions; then they use a collection of labeled training examples to estimate the parameters of the generative model. Classification on new examples is performed with Bayes' rule by selecting the class that is most likely to have generated the example.
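The estimate-then-classify pipeline described above can be sketched for the multinomial event model (a minimal illustration with toy documents; add-one smoothing is one common way to estimate the parameters and is an assumption here, not the paper's exact estimator):

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate the generative model's parameters from labeled examples:
    class priors P(c) and word probabilities P(w | c) with add-one smoothing."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    cond = {c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab}
            for c in classes}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Bayes' rule: select argmax_c [ log P(c) + sum_w log P(w | c) ],
    the class most likely to have generated the document."""
    def log_posterior(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in vocab)
    return max(prior, key=log_posterior)
```

For example, training on three tiny documents labeled "sports" and "politics" and then classifying an unseen document picks whichever class assigns it the higher posterior.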

The naive Bayes classifier is the simplest of these models, in that it assumes that all attributes of the examples are independent of each other given the context of the class. This is the so-called "naive Bayes assumption." While this assumption is clearly false in most real-world tasks, naive Bayes often performs classification very well. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation; the function approximation can still be poor while classification accuracy remains high (Friedman 1997; Domingos and Pazzani 1997).

For the full paper, you can download the journal at the following link:

  A Comparison of Event Models for Naive Bayes Text Classification

Keywords: International Journal, Informatics Engineering Journal, Thesis Journal, Journal, Sample Journal, Informatics Engineering Thesis, Sample Thesis, Thesis.

