Inspiration

Proofreading long texts for minor inconsistencies that a spell checker won't find can be quite cumbersome. Searching for possible typos one at a time takes a lot of time, and some inconsistencies may still be overlooked. The aim of this project is to provide a word counter that lists word occurrences in files. It can either search for several selected words in one go, or gather every word in a text into a summary list that tells you how many times each word occurs.

What it does

Counts word occurrences in text files and saves the results in alphabetical order in a text file. The results are summarized separately for each source text file.
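The counting and sorting step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `count_words` and `format_report`, the use of `collections.Counter`, the tab-separated output and the case-insensitive sort order are all assumptions made for the example.

```python
from collections import Counter

def count_words(text, selected=None):
    """Count whitespace-separated words; optionally report only `selected` words."""
    counts = Counter(text.split())
    if selected is not None:
        # Report every selected word, even one that never occurs (count 0).
        return {word: counts.get(word, 0) for word in selected}
    return dict(counts)

def format_report(source_name, counts):
    """Build one summary block for a source file, words in alphabetical order."""
    lines = [source_name]
    # Hypothetical layout: one "word<TAB>count" line per word.
    for word in sorted(counts, key=str.lower):
        lines.append(f"{word}\t{counts[word]}")
    return "\n".join(lines)

sample = "the disk and the disc"
print(format_report("sample.txt", count_words(sample)))
```

Passing a list such as `["disc", "disk"]` as `selected` corresponds to searching for several chosen words in one go, while omitting it gathers every word in the text.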

Effects

  • Typos and inconsistencies in the terms used in large files can be found more easily, such as:
  • inconsistent hyphen usage,
  • reference signs for figures, as in patent applications or manuals,
  • similar terms used for the same object, such as both "disc" and "disk".

Some characters, especially commas, periods and parentheses after a word, are removed before a word is registered. This smooths the vocabulary somewhat. Punctuation inside a word, if any, is kept (as in "2.0"), so typos such as a missing space after a comma or after a closing bracket are also registered. URLs and other compositions with slashes are not split up, so that low-level domains or word endings (such as the "s" in "file(s)") are not counted as separate words.
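One way to get this behaviour is to split on whitespace only and then trim punctuation from the edges of each token. The sketch below is an assumption about how such a rule could look, not the tool's actual implementation: the names `normalize` and `tokenize`, the exact set of stripped characters, and the rule of dropping a closing parenthesis only when it is unmatched are all illustrative choices.

```python
def normalize(token):
    """Trim edge punctuation from one token but keep punctuation inside it."""
    # Drop opening parentheses in front of the word, e.g. "(new" -> "new".
    token = token.lstrip("(")
    while token:
        last = token[-1]
        if last in ",.;:":
            # Assumed set of trailing characters to remove.
            token = token[:-1]
        elif last == ")" and token.count(")") > token.count("("):
            # Remove only an unmatched closing parenthesis, so "file(s)" survives.
            token = token[:-1]
        else:
            break
    return token

def tokenize(text):
    """Split on whitespace only, so URLs and slashed compounds stay in one piece."""
    words = (normalize(token) for token in text.split())
    return [word for word in words if word]

print(tokenize("See file(s), version 2.0 (new)."))
print(tokenize("one,two three)four"))
```

Because only the edges are trimmed, "one,two" and "three)four" remain single entries in the list and stand out as likely typos.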

How I built it

Built with Python.

What's next for WordChecker

  • Some syntax highlighting for similar words
  • Other programming languages

Requirements

Written in Python 3 (works with, e.g., version 3.8.1). The following modules are imported: sys, re, glob, os.

Works best if the text files are saved as UTF-8 (with or without BOM), for example with Windows Notepad or Notepad++. The 3-byte BOM, which can occur at the beginning of a UTF-8 file, is skipped so that the first word of the file is registered as a "normal word".
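A simple way to skip the BOM in Python is the standard "utf-8-sig" codec, which removes a leading BOM if there is one and otherwise behaves like plain UTF-8. The helper names `decode_utf8` and `read_words` below are hypothetical, and the project itself may handle the BOM differently; this is only a sketch of the idea.

```python
def decode_utf8(raw):
    """Decode UTF-8 bytes, dropping a leading BOM (EF BB BF) if present."""
    return raw.decode("utf-8-sig")

def read_words(path):
    """Read a text file as UTF-8 (with or without BOM) and split it into words."""
    with open(path, "rb") as handle:
        return decode_utf8(handle.read()).split()

with_bom = b"\xef\xbb\xbfHello world"
print(decode_utf8(with_bom).split())
```

With the plain "utf-8" codec the first word would instead carry an invisible U+FEFF character in front and be counted separately from other occurrences of the same word.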
