Skip to content

Theosdoor/Data-Compression-Coursework

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Compression Coursework

Project grade: 76% (1st class)

Repository Structure

.
├── README.md                    # This file
├── src/
│   ├── compression_utils.py     # Main compression/decompression functions
│   ├── dictionary_struc.py      # Dictionary data structures (list and trie)
│   ├── nb_analysis.py           # Benchmark analysis and plotting
│   └── literature_results.csv   # Reference results from literature
└── test_data/
    ├── artificl/                # Small artificial test files
    ├── calgary/                 # Calgary corpus benchmark files
    ├── large/                   # Large test files
    └── misc/                    # Miscellaneous test files

Instructions on how to run code

How to use

Download the canterbury corpus into the root in a folder called 'canterbury'.

To compress and decompress data, import from compression_utils.py. Trie and simple array data structures are stored in dictionary_struc.py (imported from within compression_utils.py).

You can choose between two dictionary structures: struct='list' for simple array of strings, or struct='trie' for trie.

Literature results used in the report are saved in literature_results.csv. Sourced from https://corpus.canterbury.ac.nz/details/cantrbry/

Note: The modules need to be run from the src/ directory due to relative imports.

Example usage

from compression_utils import compress, decompress
import json

MAX_DICT_SIZE = 4096 #set max dictionary size (must match for compress/decompress)
SAVE_JSON = True #optionally save/load compressed data to/from JSON

#example usage (inputs and outputs in bytes!)
input_data = b"sir sid eastman easily"
compressed = compress(input_data, max_dict_size=MAX_DICT_SIZE)

#optionally save/load compressed data to/from JSON
compressed_filename = 'compressed.json'
if SAVE_JSON:
    with open(compressed_filename, 'w') as f:
        json.dump(compressed, f)

    with open(compressed_filename, 'r') as f:
        compressed = json.load(f)

decompressed = decompress(compressed, max_dict_size=MAX_DICT_SIZE)

#verify output matches input
assert decompressed == input_data

Running benchmarks

To generate Canterbury benchmark results and plots from the report, run nb_analysis.py:

python nb_analysis.py

This will:

  1. Run example compressions with various test inputs
  2. Generate data for Canterbury benchmark
  3. Save results to results.csv
  4. Generate swarm plots comparing with literature results

Code Assessment

Criterion Score
Correctness 20/20
Robustness 4/5
Efficiency 3/5
Code quality and comments 4/5
Total 31/35

Code Feedback

The code achieves the minimum correctness requirement by compressing and decompressing all the files in the Canterbury benchmark without any loss of information.

The code is robust. Appreciate the explicit discussion of the KwKwK edge case, even though the relevant testing was somewhat limited.

The employed data structures are fit for purpose. In many cases it seems that in the management of the dictionary some safer but sub-optimal implementation choices were made.

Clean code structure. Well-written and informative README file, the code is well-commented. Some room for improvement in runtime user-messaging.

Report Assessment

Criterion Score
Introduction 3/5
The LZW algorithm 7/10
Implementation and testing 10/15
Results and analysis 11/15
Conclusions 3/5
References 7/10
Writing and presentation quality 4/5
Total 45/65

Report Feedback

Introduction: Clear introduction to the area of data compression and clear summary of the rest of the paper. A high-level description of LZW motivating the work would have been appreciated.

The LZW algorithm: Clear description of LZW, informative discussion of its properties and its variants, well supported by references to the literature.

Implementation and testing: A clear discussion of implementation issue. The implementation choices are reasonable but needed a little more context through a comparative discussion of more complex approaches to dictionary management.

Results and analysis: Appropriate analysis of the results, based on extensive testing, but a little more discussion on their dependence on implementation decisions would have been appreciated. Suitable comparison with previous methods, nicely visualised with a swarm plot.

Conclusions: Well-written but brief summary of the main findings. Appropriate conclusions are drawn but should had been qualified as dependent on several implementation choices. Appropriate recommendations for future investigations.

References: Well-chosen set of references, well-integrated into the main text to support your observations and analysis.

Writing and presentation quality: Generally, a very well-written paper, the material is well-structured and well-balanced across section. Appropriate use of figures, tables and pseudocode.

genAI Usage Statement

genAI was used to help write the plotting code for figures in the report (marked in comment). It wasn't used for anything else - the report and the rest of the code was written solely by myself.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages