Data Compression Coursework

Project grade: 76% (1st class)

Repository Structure

.
├── README.md                    # This file
├── src/
│   ├── compression_utils.py     # Main compression/decompression functions
│   ├── dictionary_struc.py      # Dictionary data structures (list and trie)
│   ├── nb_analysis.py           # Benchmark analysis and plotting
│   └── literature_results.csv   # Reference results from literature
└── test_data/
    ├── artificl/                # Small artificial test files
    ├── calgary/                 # Calgary corpus benchmark files
    ├── large/                   # Large test files
    └── misc/                    # Miscellaneous test files

Instructions on how to run code

How to use

Download the canterbury corpus into the root in a folder called 'canterbury'.

To compress and decompress data, import from compression_utils.py. Trie and simple array data structures are stored in dictionary_struc.py (imported from within compression_utils.py).

You can choose between two dictionary structures: struct='list' for simple array of strings, or struct='trie' for trie.

Literature results used in the report are saved in literature_results.csv. Sourced from https://corpus.canterbury.ac.nz/details/cantrbry/

Note: The modules need to be run from the src/ directory due to relative imports.

Example usage

from compression_utils import compress, decompress
import json

MAX_DICT_SIZE = 4096 #set max dictionary size (must match for compress/decompress)
SAVE_JSON = True #optionally save/load compressed data to/from JSON

#example usage (inputs and outputs in bytes!)
input_data = b"sir sid eastman easily"
compressed = compress(input_data, max_dict_size=MAX_DICT_SIZE)

#optionally save/load compressed data to/from JSON
compressed_filename = 'compressed.json'
if SAVE_JSON:
    with open(compressed_filename, 'w') as f:
        json.dump(compressed, f)

    with open(compressed_filename, 'r') as f:
        compressed = json.load(f)

decompressed = decompress(compressed, max_dict_size=MAX_DICT_SIZE)

#verify output matches input
assert decompressed == input_data

Running benchmarks

To generate Canterbury benchmark results and plots from the report, run nb_analysis.py:

python nb_analysis.py

This will:

Run example compressions with various test inputs
Generate data for Canterbury benchmark
Save results to results.csv
Generate swarm plots comparing with literature results

Code Assessment

Criterion	Score
Correctness	20/20
Robustness	4/5
Efficiency	3/5
Code quality and comments	4/5
Total	31/35

Code Feedback

The code achieves the minimum correctness requirement by compressing and decompressing all the files in the Canterbury benchmark without any loss of information.

The code is robust. Appreciate the explicit discussion of the KwKwK edge case, even though the relevant testing was somewhat limited.

The employed data structures are fit for purpose. In many cases it seems that in the management of the dictionary some safer but sub-optimal implementation choices were made.

Clean code structure. Well-written and informative README file, the code is well-commented. Some room for improvement in runtime user-messaging.

Report Assessment

Criterion	Score
Introduction	3/5
The LZW algorithm	7/10
Implementation and testing	10/15
Results and analysis	11/15
Conclusions	3/5
References	7/10
Writing and presentation quality	4/5
Total	45/65

Report Feedback

Introduction: Clear introduction to the area of data compression and clear summary of the rest of the paper. A high-level description of LZW motivating the work would have been appreciated.

The LZW algorithm: Clear description of LZW, informative discussion of its properties and its variants, well supported by references to the literature.

Implementation and testing: A clear discussion of implementation issue. The implementation choices are reasonable but needed a little more context through a comparative discussion of more complex approaches to dictionary management.

Results and analysis: Appropriate analysis of the results, based on extensive testing, but a little more discussion on their dependence on implementation decisions would have been appreciated. Suitable comparison with previous methods, nicely visualised with a swarm plot.

Conclusions: Well-written but brief summary of the main findings. Appropriate conclusions are drawn but should had been qualified as dependent on several implementation choices. Appropriate recommendations for future investigations.

References: Well-chosen set of references, well-integrated into the main text to support your observations and analysis.

Writing and presentation quality: Generally, a very well-written paper, the material is well-structured and well-balanced across section. Appropriate use of figures, tables and pseudocode.

genAI Usage Statement

genAI was used to help write the plotting code for figures in the report (marked in comment). It wasn't used for anything else - the report and the rest of the code was written solely by myself.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
src		src
test_data		test_data
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
report.pdf		report.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Compression Coursework

Repository Structure

Instructions on how to run code

How to use

Example usage

Running benchmarks

Code Assessment

Code Feedback

Report Assessment

Report Feedback

genAI Usage Statement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Compression Coursework

Repository Structure

Instructions on how to run code

How to use

Example usage

Running benchmarks

Code Assessment

Code Feedback

Report Assessment

Report Feedback

genAI Usage Statement

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages