# Bayes Basics
Understand how the Bayesian classifier works and when to use it.
The Bayesian classifier uses Bayes’ theorem to calculate the probability that a piece of text belongs to each category. It’s simple, fast, and surprisingly effective for many text classification tasks.
## How It Works
Naive Bayes classification works in three steps (sketched in code after this list):

1. **Training:** Count word frequencies for each category
2. **Classification:** Calculate the probability of each category given the words
3. **Decision:** Return the category with the highest probability
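Here is that loop in miniature, using the API covered in the rest of this page (the categories and training text are toy examples):

```ruby
require 'classifier'

# Step 1: train word counts for each category
classifier = Classifier::Bayes.new 'Spam', 'Ham'
classifier.train(spam: 'Buy cheap pills now')
classifier.train(ham: 'Meeting notes for Tuesday')

# Steps 2 and 3: score every category and return the most likely one
classifier.classify 'cheap pills'
# => "Spam"
```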
## The Math (Simplified)
For a document with words w1, w2, w3, the probability of category C is:
```
P(C | w1, w2, w3) ∝ P(C) × P(w1 | C) × P(w2 | C) × P(w3 | C)
```
Where:

- `P(C)` is the prior probability of category C
- `P(w|C)` is the probability of seeing word w in category C
The “naive” assumption is that words are conditionally independent of each other given the category. That isn’t true of natural language, but it works well in practice.
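To make the formula concrete, here is the same arithmetic done by hand on made-up word counts. This is only an illustration of the math above, not how the gem stores its data:

```ruby
# Toy training counts: how often each word was seen per category
counts = {
  'Tech'   => { 'iphone' => 4, 'game' => 1 },
  'Sports' => { 'iphone' => 0, 'game' => 5 }
}
priors = { 'Tech' => 0.5, 'Sports' => 0.5 }

words  = %w[iphone game]
scores = counts.to_h do |category, word_counts|
  total = word_counts.values.sum.to_f
  # P(C) x P(w1|C) x P(w2|C), with +1 smoothing so unseen words don't zero it out
  prob = words.reduce(priors[category]) do |p, w|
    p * ((word_counts[w] + 1) / (total + words.size))
  end
  [category, prob]
end

scores.max_by { |_, p| p }.first
# => "Tech"
```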
## Creating a Classifier
```ruby
require 'classifier'

# Create with any number of categories
classifier = Classifier::Bayes.new 'Tech', 'Sports', 'Politics'
```
## Training
Train the classifier by providing examples for each category:
```ruby
# Keyword arguments (recommended)
classifier.train(tech: 'New JavaScript framework released')
classifier.train(sports: 'Team wins championship game')
classifier.train(politics: 'Senate passes new legislation')

# Batch training with arrays
classifier.train(
  tech: ['Apple announces new MacBook', 'Python 4.0 features announced'],
  sports: ['Soccer player signs new contract', 'Team wins finals']
)

# Legacy APIs (still work)
classifier.train :Tech, 'Example text'
classifier.train_tech 'Example text'
```
### Training Tips

- **More data is better:** Accuracy improves significantly with more training examples
- **Balance categories:** Try to provide similar amounts of data for each category (see the sketch after this list)
- **Use representative examples:** Train with text similar to what you’ll classify
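A low-tech way to keep categories balanced is to count examples yourself while training. This sketch is plain Ruby around the training API shown above; the `corpus` hash is a made-up placeholder:

```ruby
corpus = {
  tech:   ['New JavaScript framework released', 'Apple announces new MacBook'],
  sports: ['Team wins championship game']
}

example_counts = Hash.new(0)
corpus.each do |category, examples|
  classifier.train(category => examples)
  example_counts[category] += examples.size
end

example_counts
# => {:tech=>2, :sports=>1}  (sports is under-represented)
```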
## Classification
```ruby
# Get the best category
result = classifier.classify 'The new iPhone has amazing features'
# => "Tech"

# Get scores for all categories
scores = classifier.classifications 'Congress debates tax reform'
# => {"Tech" => -15.2, "Sports" => -18.4, "Politics" => -8.1}
```
## Understanding Scores
The classifier returns log probabilities:
- Scores are always negative
- Higher (less negative) = more likely
- Differences matter more than absolute values
To convert to relative probabilities:
```ruby
scores = classifier.classifications(text)

# Normalize to get percentages
max_score = scores.values.max
normalized = scores.transform_values { |s| Math.exp(s - max_score) }
total = normalized.values.sum
percentages = normalized.transform_values { |v| (v / total * 100).round(1) }
```
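If you convert scores often, the snippet above can live in a small helper. The method name here is ours, not part of the gem:

```ruby
# Turn the gem's log scores into rough percentage estimates
def score_percentages(scores)
  max = scores.values.max
  exp = scores.transform_values { |s| Math.exp(s - max) }
  total = exp.values.sum
  exp.transform_values { |v| (v / total * 100).round(1) }
end

score_percentages(classifier.classifications('Congress debates tax reform'))
# => {"Tech" => 0.1, "Sports" => 0.0, "Politics" => 99.9}
```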
## When to Use Bayes
**Good for:**
- Spam detection
- Sentiment analysis (positive/negative)
- Topic categorization
- Language detection
- Any task with clear category boundaries
**Not ideal for:**
- Finding related documents (use LSI instead)
- Semantic similarity
- When word order matters significantly
## Configuration Options
```ruby
# Enable automatic stemming (on by default)
classifier = Classifier::Bayes.new :a, :b, enable_stemmer: true

# Use custom language for stemming
classifier = Classifier::Bayes.new :a, :b, language: 'fr'

# Disable threshold (classify everything, even low confidence)
classifier = Classifier::Bayes.new :a, :b, enable_threshold: false
```
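These are ordinary Ruby keyword arguments, so combining them in one call should work as usual (an assumption on our part; check the gem's docs if a combination misbehaves):

```ruby
# A French-language classifier that always returns its best guess
classifier = Classifier::Bayes.new(
  'Positif', 'Negatif',
  language: 'fr',          # French stemming rules
  enable_threshold: false  # classify everything, even low confidence
)
```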
## Streaming & Batch Training
For large datasets, use batch training to reduce lock contention and track progress:
```ruby
classifier = Classifier::Bayes.new 'Spam', 'Ham'

# Batch training with progress callback
classifier.train_batch(:spam, spam_documents, batch_size: 1000) do |progress|
  puts "#{progress.percent}% complete (#{progress.rate.round} docs/sec)"
end

# Train multiple categories at once
classifier.train_batch(
  spam: spam_documents,
  ham: ham_documents,
  batch_size: 500
)
```
For files too large to load into memory, stream line-by-line:
```ruby
File.open('spam_corpus.txt', 'r') do |file|
  classifier.train_from_stream(:spam, file, batch_size: 1000) do |progress|
    puts "Processed #{progress.completed} lines"
  end
end
```
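To experiment without a real file, any line-oriented IO should behave the same way; here we assume `train_from_stream` accepts a `StringIO`, which this page only shows used with `File`:

```ruby
require 'stringio'

fake_corpus = StringIO.new("Buy cheap pills\nFree money now\n")
classifier.train_from_stream(:spam, fake_corpus, batch_size: 10)
```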
See the Streaming Training Tutorial for checkpoints and resumable training.
## Example: Sentiment Analyzer
```ruby
sentiment = Classifier::Bayes.new 'Positive', 'Negative'

# Train with examples
sentiment.train(positive: "I love this product!")
sentiment.train(positive: "Excellent service, highly recommend")
sentiment.train(positive: "Best purchase I've ever made")
sentiment.train(negative: "Terrible experience, avoid")
sentiment.train(negative: "Waste of money")
sentiment.train(negative: "Disappointing and frustrating")

# Or batch train
sentiment.train(
  positive: ["Amazing quality!", "Highly recommended"],
  negative: ["Total disappointment", "Don't waste your money"]
)

# Classify new reviews
sentiment.classify "This is amazing!"
# => "Positive"

sentiment.classify "Complete garbage, don't buy"
# => "Negative"
```
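For production use you may not want to act on borderline results. Because `classifications` returns log scores, the gap between the top two categories is a crude confidence signal. The `2.0` cutoff below is an arbitrary value to tune, not a gem default:

```ruby
scores = sentiment.classifications("It was fine, I guess")
best, runner_up = scores.values.sort.reverse.first(2)

if best - runner_up > 2.0
  puts "Confident: #{scores.key(best)}"
else
  puts "Too close to call; queue for human review"
end
```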
## Next Steps
- Streaming Training - Train on large datasets with progress tracking
- Persistence - Save and load trained classifiers
- Real-time Pipeline - Build a production-ready classification pipeline