Skip to content

pdb protein file compression (work in progress, still only does huffman coding)

Notifications You must be signed in to change notification settings

xnought/protein-zip

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pzip .pz

PDB file compression with no external dependencies only built-in python libraries used.

Trying to compress a pdb file format down as low as possible. Currently can make 2x smaller. See A.pdb versus the zipped protein A.pz.

Right now, just doing Huffman coding on a character level. Note that PDB's have very repetitive words like ATOM and the three letter coded residues as well as tons of spaces in between. Can likely treat these as characters in the future to get down to 10x smaller.

gzip can do 5x smaller on A.pdb, let's see if I can beat that!

Terminal Usage

You can directly call the python code with

python3 pzip.py <mode> <input_filename> <output_filename>

where mode can be zip (from .pdb to .pz) or unzip (from .pz to .pdb). Alternatively, you can use separate pzip and punzip executables as shown below:

.pdb compressing to .pz

./pzip ./data/A.pdb ./data/A.pz

.pz uncompressing to .pdb

./punzip ./data/A.pz ./data/A.pdb

Python Usage

from pzip import pzip, punzip

# compresses A.pdb into A.pz
pzip("./data/A.pdb", "./data/A.pz")

# decompresses the A.pz back into A.pdb (renamed)
punzip("./data/A.pz", "./data/A.pdb")

File Format

See A.pz or click/toggle the details to see an example

toggle screenshot of A.pz from vscode Screenshot 2024-01-01 at 7 59 08 PM

These are fields are separated by new lines

  1. Length of bits of the encoded bitstring (not of the padded bytes) {int}
  2. The starting location of the body data bytes {int}
  3. The huffman tree representation in chars {str}
  4. The actual encoded data in bytes {bytes}

Note

The fourth field, the actual data, is in chunks of bytes, but the encoding is probably less than a multiple of 8 since I encoded a bit string. So use the first field to offset the padded 0s to get the actual start. Example in the code.

Visual Tree

For example transforming files like A.pdb into huffman trees for compression

Max Depth=12

Node("2,167,121")
├── Leaf(' ', "887,763")
└── Node("1,279,358")
    ├── Node("523,681")
    │   ├── Node("255,963")
    │   │   ├── Leaf('1', "127,074")
    │   │   └── Node("128,889")
    │   │       ├── Leaf('6', "64,746")
    │   │       └── Node("64,143")
    │   │           ├── Leaf('C', "34,267")
    │   │           └── Leaf('T', "29,876")
    │   └── Node("267,718")
    │       ├── Leaf('.', "133,754")
    │       └── Node("133,964")
    │           ├── Leaf('5', "69,215")
    │           └── Leaf('A', "64,749")
    └── Node("755,677")
        ├── Node("312,572")
        │   ├── Node("146,541")
        │   │   ├── Leaf('4', "71,186")
        │   │   └── Node("75,355")
        │   │       ├── Leaf('O', "37,836")
        │   │       └── Node("37,519")
        │   │           ├── Node("17,619")
        │   │           │   ├── Leaf('G', "8,583")
        │   │           │   └── Leaf('S', "9,036")
        │   │           └── Node("19,900")
        │   │               ├── Leaf('E', "10,295")
        │   │               └── Node("9,605")
        │   │                   ├── Leaf('U', "4,521")
        │   │                   └── Node("5,084")
        │   │                       ├── Leaf('I', "2,907")
        │   │                       └── Node("2,177")
        │   │                           ├── Leaf('V', "1,462")
        │   │                           └── Leaf('Z', "715")
        │   └── Node("166,031")
        │       ├── Leaf('3', "76,945")
        │       └── Node("89,086")
        │           ├── Leaf('-', "38,823")
        │           └── Node("50,263")
        │               ├── Leaf('\n', "26,754")
        │               └── Node("23,509")
        │                   ├── Leaf('N', "11,958")
        │                   └── Node("11,551")
        │                       ├── Leaf('R', "6,146")
        │                       └── Leaf('Y', "5,405")
        └── Node("443,105")
            ├── Node("207,597")
            │   ├── Leaf('0', "111,676")
            │   └── Leaf('2', "95,921")
            └── Node("235,508")
                ├── Node("113,409")
                │   ├── Leaf('9', "57,091")
                │   └── Node("56,318")
                │       ├── Leaf('M', "27,120")
                │       └── Node("29,198")
                │           ├── Leaf('L', "14,156")
                │           └── Node("15,042")
                │               ├── Node("6,594")
                │               │   ├── Leaf('B', "3,201")
                │               │   └── Leaf('D', "3,393")
                │               └── Node("8,448")
                │                   ├── Leaf('H', "4,195")
                │                   └── Leaf('P', "4,253")
                └── Node("122,099")
                    ├── Leaf('7', "62,000")
                    └── Leaf('8', "60,099")

About

pdb protein file compression (work in progress, still only does huffman coding)

Resources

Stars

Watchers

Forks