Project grade: 76% (1st class)
.
├── README.md # This file
├── src/
│ ├── compression_utils.py # Main compression/decompression functions
│ ├── dictionary_struc.py # Dictionary data structures (list and trie)
│ ├── nb_analysis.py # Benchmark analysis and plotting
│ └── literature_results.csv # Reference results from literature
└── test_data/
├── artificl/ # Small artificial test files
├── calgary/ # Calgary corpus benchmark files
├── large/ # Large test files
└── misc/ # Miscellaneous test files
Download the canterbury corpus into the root in a folder called 'canterbury'.
To compress and decompress data, import from compression_utils.py.
Trie and simple array data structures are stored in dictionary_struc.py (imported from within compression_utils.py).
You can choose between two dictionary structures: struct='list' for simple array of strings, or struct='trie' for trie.
Literature results used in the report are saved in literature_results.csv. Sourced from https://corpus.canterbury.ac.nz/details/cantrbry/
Note: The modules need to be run from the src/ directory due to relative imports.
from compression_utils import compress, decompress
import json
MAX_DICT_SIZE = 4096 #set max dictionary size (must match for compress/decompress)
SAVE_JSON = True #optionally save/load compressed data to/from JSON
#example usage (inputs and outputs in bytes!)
input_data = b"sir sid eastman easily"
compressed = compress(input_data, max_dict_size=MAX_DICT_SIZE)
#optionally save/load compressed data to/from JSON
compressed_filename = 'compressed.json'
if SAVE_JSON:
with open(compressed_filename, 'w') as f:
json.dump(compressed, f)
with open(compressed_filename, 'r') as f:
compressed = json.load(f)
decompressed = decompress(compressed, max_dict_size=MAX_DICT_SIZE)
#verify output matches input
assert decompressed == input_dataTo generate Canterbury benchmark results and plots from the report, run nb_analysis.py:
python nb_analysis.pyThis will:
- Run example compressions with various test inputs
- Generate data for Canterbury benchmark
- Save results to
results.csv - Generate swarm plots comparing with literature results
| Criterion | Score |
|---|---|
| Correctness | 20/20 |
| Robustness | 4/5 |
| Efficiency | 3/5 |
| Code quality and comments | 4/5 |
| Total | 31/35 |
The code achieves the minimum correctness requirement by compressing and decompressing all the files in the Canterbury benchmark without any loss of information.
The code is robust. Appreciate the explicit discussion of the KwKwK edge case, even though the relevant testing was somewhat limited.
The employed data structures are fit for purpose. In many cases it seems that in the management of the dictionary some safer but sub-optimal implementation choices were made.
Clean code structure. Well-written and informative README file, the code is well-commented. Some room for improvement in runtime user-messaging.
| Criterion | Score |
|---|---|
| Introduction | 3/5 |
| The LZW algorithm | 7/10 |
| Implementation and testing | 10/15 |
| Results and analysis | 11/15 |
| Conclusions | 3/5 |
| References | 7/10 |
| Writing and presentation quality | 4/5 |
| Total | 45/65 |
Introduction: Clear introduction to the area of data compression and clear summary of the rest of the paper. A high-level description of LZW motivating the work would have been appreciated.
The LZW algorithm: Clear description of LZW, informative discussion of its properties and its variants, well supported by references to the literature.
Implementation and testing: A clear discussion of implementation issue. The implementation choices are reasonable but needed a little more context through a comparative discussion of more complex approaches to dictionary management.
Results and analysis: Appropriate analysis of the results, based on extensive testing, but a little more discussion on their dependence on implementation decisions would have been appreciated. Suitable comparison with previous methods, nicely visualised with a swarm plot.
Conclusions: Well-written but brief summary of the main findings. Appropriate conclusions are drawn but should had been qualified as dependent on several implementation choices. Appropriate recommendations for future investigations.
References: Well-chosen set of references, well-integrated into the main text to support your observations and analysis.
Writing and presentation quality: Generally, a very well-written paper, the material is well-structured and well-balanced across section. Appropriate use of figures, tables and pseudocode.
genAI was used to help write the plotting code for figures in the report (marked in comment). It wasn't used for anything else - the report and the rest of the code was written solely by myself.