Compression and WinRar software:How it’s made?


Compression software can compress data to smaller size from it’s original size.The well know WinRar software is of such type.Compression software comes in handy when you need to save some space in your hard drive.To make such software we need an algorithm.The better is your algorithm the better your software can compress which means saving larger amount of space.WinRar does a good job in compressing text data.I have google on what algorithm is used in WinRar but it seems the algorithm creator did not make it public.So,I don’t know what algorithm is used however,I will make use of two algorithms in building a simple console compression software-

i)The first one is a simple 7 bits compression method and
ii)The second is LZW(Lempel-Ziv-Welch) algorithm

Winrar may produce better compression than these two algorithms but knowing them will give you an insight on the origin of compression software.If you want to make a better compression software than WinRar or any other existing software you will have to come up with a better algorithm and ‘how the concept of compressing data was born‘ may prove useful in the process.Let’s get on with the first algorithm!



7 bits compression method

The smallest size a character can occupy in our computer is 8 bits.However,not all character occupy 8 bits.The characters that you see in your keyboard occupy up to maximum of 7 bits,the last bits is free.8 bits is only used for some other characters not found in the keyboard.Since a text document consist only of English Alphabets ,number ,punctuation,some special symbols,etc. which are all found in the keyboard,so we need up to maximum of 7 bits to store them.So what if we allocate only 7 bits for each character instead of 8 bits we can save some space.For instance,if a text document have 500 characters which is 500 bytes *8=4000 bits size but,if we allocate only 7 bits size for each character the size becomes 500 byte *7=3500 bits,so the size decreases by 500(4000-3500) bits or we would have saved 62.5 bytes of memory.And if the text document size is larger we would have saved more space.The concept is simple here “allocate a fixed size of 7 bits instead of 8 bits” for a character.

To implement 7 bits compression method our algorithm will consist of two parts: i)Encoding and ii)Decoding.

 





Encoding

 
In encoding process we will allocate 7 bits for a character.We cannot use bit field to allocate 7 bits for a character and write out the 7 bits data to the file because any data must have a minimum size of 8 bits whether it makes sense or not.Encoding process involve taking 7 bits from the character and storing it in a sequence of 8 bits unsigned character.To understand this process look at the chart below.
 
Chart 1
 
var1 , var2 , var3 , var4 … are the 8 bits characters to be encoded.And encoded1 , encoded2 , encoded3 , encoded4 , encoded5 … are the sequence of 8 bits unsigned character.If we look at the chart ,the 7 bits of var1 is stored in encoded1 however,encoded1 has still one free bit left so the first bit of var2 is stored in it.The remaining 6 bits is store in the next 8 bit sequence which is encoded2.The last two free bits of encoded2 is use to store the first two bits of var3. Again,the remaining 5 bits of var3 is store in the next 8 bit sequence of encoded3.The process goes on but,after 8 characters have been encoded the process of encoding the next character is same as the first character.So,the process repeats itself after every 8 characters.The sequence encoded1 , encoded2 , encoded3 , encoded4 , encoded5… are the sequence of data for the encoded file,so save it in a file.We will use this file later to decode the data .From the chart it is clear that for every 8 characters we obtained seven, 8 bits encoded character so,compression is achieved.
 
The source code of the program implementing the above algorithm can be Download here(7_bits_encoding_method.zip) .The .zip file contain an encoded compressed file of William Shakespeare poem “Shall I compare thee to a summer’s day?“.You can compare the difference in their size.
 


Decoding

 
In decoding the reverse of encoding is performed.From the encoded 8 bits sequence 7 bits are assigned to unique unsigned character and it is save in the file.This file is the decoded file and the content is the same with the original file.Decoding process can be understood easily by looking at the chart below.
 
Chart 2
 
Consider encoded1 , encoded2 , encoded3 , encoded4 , encoded5… as the encoded sequence from the encoded file and decoded1 , decoded2 , decoded3 , decoded4 , decoded5… as the decoded unsigned characters obtained after assigning a fix size of 7 bit which is extracted from the encoded sequence.Looking at the chart the first seven bits of encoded1 is assigned to decoded1 and the last bit is assigned to the next character , decoded2. However,decoded2 still needs 6 more bits so 6 bits is taken from the next encoded sequence,encoded2.The two bits of encoded2 is assigned to decoded3 and the missing five bits is taken from encoded3 which is the next sequence.The process goes on until the last character has been decoded.Decoding the encoded8 character is same as decoding encoded1 so the decoding pattern repeats after every 7 characters.
 
You can download the source code of the program implementing the decoding algorithm here(7bit_decoding_method.zip) .You can test the decoding program by decoding the encoded file included in the zip file 7_bits_encoding_method.zip.
 


 


Stop wasting time,earn money($$$) from your website-Join Now!

LZW(Lempel-Ziv-Welch) Algorithm


 
It is a dictionary base compression method.The algorithm uses a fix size bits assigned to the encoded sequence.This fix size bits is stored in the output file which is used for decoding.The first 256(0-255) bits is reserve for the 256 characters.The next encoded sequence is given 256 and the coming encoded sequence is given 257 and so on.

*Note:
Dictionary:A dictionary is nothing but a temporary file created to store an entries/string which is assigned certain value .These values assigned to the entries is used later to encoded the data read from the file. It can be of any type .txt or .bin or any other type.

Sequence:A sequence means a data or a character taken from the file to be encoded.

I will be using a fix size of 10 bits to encode the sequence in the example.

 

Encoding the sequence

 
Read the first sequence say x from the file and consider this as the current sequence. Read the next sequence say y,from the file and combine the current and next sequence.This gives the string “xy“. Assign a fix size bits encoded value(256) to “xy” and add the string and 256 to the dictionary.And add a 10 bits size int value of the current sequence (‘x‘) to the output file.This file is the encoded file which will be used for decoding.The next sequence ‘y‘ will be considered as the current sequence and read the next sequence say ‘z‘ from the file.Combine the current and next sequence to give the string “yz” and assign a fix size bits 257 and add the string and the value to the dictionary.To the output file add 10 bits int value of y. The next sequence z is considered as the current sequence and read the next sequence from the file.Suppose if a string “ab” exist in the dictionary then the next sequence say ‘c‘ is read from the file and added to the string “ab“. In such case the integer value assigned to the string “ab” in the dictionary is added to the output file and the sequence ‘c‘ is taken as the current sequence.And if “abc” also exist in the dictionary the next sequence is read from the file and added to the string “abc“. The new string and a new encoded value is added to the dictionary while,the encoded value of “abc” is added to the output file.

Let’s try to implement the algorithm on the string “ghkjlncvghkjnmgghkjlnghkjlnmr“.
 

Current
sequence
Next
sequence
Dictionary entryOutput
code
Nullg
ghgh — 256103
hkhk — 257104
kjkj — 258107
jljl — 259106
lnln — 260108
ncnc — 261110
cvcv — 26299
vgvg — 263118
ghkghk —264256
kjnkjn — 265258
nmnm — 266110
mgmg — 267109
gggg — 268103
ghkjghkj — 269264
jlnjln — 270259
ngng — 271110
ghkjlghkjl – 272269
lnmlnm — 273260
mrmr — 274109
r~~114

 
The string size is 29*8 bits=232 bits.The compressed data size is 20*10 bits=200 bits.The size difference is 32 bits or 4 bytes.For files having thousand characters or more the algorithm would produce a noticeable difference in the size of the encoded file and the original file.The dictionary file can be deleted after the compression because the dictionary can be built again while decoding.

The source code of the encoding algorithm is given here(LZW_encoding.zip). The dictionary file and the encoded file of the text file containing the string “ghkjlncvghkjnmgghkjlnghkjlnmr” is also included.The program is made simple by not using class and C++ core features.To understand the program you also need to know how to allocate 10 bits for a data using 8 bits unsigned character.If you are not interested in the given source code then try to write your own program! in that way you will also learn more.



Decoding the sequence

 
To decode the sequence we need to know the initial dictionary entry,the remaining dictionary can be created from the next sequence obtain from the encoded file.Read the first 10 bits from the sequence and convert it to unsigned char value say x. This character is the first character of the decoded file.Read the next 10 bits from the file and convert it to unsigned character type say y. This is the second character of the decoded file.Combine the first and second character which gives the string “xy” and send this string to the dictionary file and assign it’s value as 256.By now we have obtained two characters of the decoded file and one dictionary entry.Read the next 10 bits, consider it’s character value as z. Combine the previous character and the current character and assign it’s value as 257.The string “yz” and 257 is the new entry in the dictionary and ‘z‘ is the decoded character.This process continue for all the sequence read from the file.

CASE 1-When the current 10 bits sequence have a size greater than 255:Suppose while decoding you come across a sequence of 10 bits having integer value greater than 255.In such case look up the corresponding string in the dictionary and send the string as the output value.For instance,if the value was say 256 than the corresponding string is “xy“. Now “xy” is the output value and if the previous output value was say ‘p‘ then the dictionary entry for this case is “px” (only the first character of the output string is taken for the dictionary entry) and the assigned value is the incremented value of the previous entry.

CASE 2-When the current 10 bits sequence have a size greater than 255 and the previous output value is a string:In this case the process is same as the case 1(above) but the dictionary entry is the combine string of the precious output string and the first character of the current output.

There is a case in which the decoding algorithm fails,we will discuss that later in case 3.The table below shows the decoding steps of the file which was encoded earlier.
 

Input bitschar
value
dictionary entryoutput data
0001100111‬g (103)g
0001101000‬ h (104)gh — 256h
‭0001101011‬k (107)hk — 257k
‭0001101010‬j (106)kj — 258j
‭0001101100‬l (108)jl — 259l
‭0001101110‬n (110)ln — 260n
‭0001100011‬c (99)nc — 261c
‭0001110110‬v (118)cv — 262v
0100000000‬gh (256)vg — 263gh
0100000010‬kj (258)ghk – 264kj
‭0001101110‬n (110)kjn – 265n
‭0001101101‬m (109)nm — 266m
‭0001100111‬g (103)mg — 267g
0100001000ghk (268)gg — 268ghk
0011111010jl (259)ghkj – 269jl
‭0001101110‬n (110)jln — 270n
0100001101ghkj (269)ng — 271ghkj
0100000100ln (260)ghkjl – 272ln
‭0001101101‬m (109)lnm – 273m
‭0001110010‬r (114)mr — 274r

 
If you compare the encoded and decoded dictionary the entries are the same.So,making sure that your encoding and decoding program render the same dictionary entry is the first step in writing a correct program .
 
CASE 3-When the current fix 10 bits value is greater than 255 and the dictionary entry for the value has not been made.:Suppose while decoding you come across a sequence whose value is greater than 255 but the dictionary entry corresponding to that value has not been made.In such case you will take the previous decoded string and add the first character of the string to itself.The new string is the current decoded output and a dictionary entry for the current sequence.For instance,let’s encode and decode the string “abcabcabcabcabcabcabc”

Current
sequence
Next
sequence
Dictionary entryOutput
NULLa
abab — 25697
bcbc — 25798
caca — 25899
abcabc — 259256
cabcab — 260258
bcabca -261257
abcaabca – 262259
abcababcab – 263262
bcabbcab – 264261
bc~~257

 
There is no problem with encoding the string.Now,let’s decode the string.
 

Input bitsCharDictionary entrydecoded
output
‭01100001‬a (97)a
‭01100010‬b (98)ab — 256b
‭01100011‬c (99)bc — 257c
‭0100000000‬ab (256)ca — 258ab
‭0100000010‬ca (258) abc — 259ca
‭0100000001‬bc (257)cab — 260bc
‭0100000011‬abc (259)bca — 261abc
‭*0100000110‬*abca (262)* abca — 262abca
‭0100000101‬bca (261)abcab – 263bca
‭0100000001‬bc (257)bcabc – 264bc

If you look at the 8th column marked with ‘*‘ the sequence has a value of 262,but the dictionary does not contain any entry with the value 262.So, the previous decoded string “abc” is taken and it’s first character ‘a‘ is added to it.The new string “abca” is the decoded string of input bits value 262 and the dictionary entry for the current sequence.
 
The source code of the decoding algorithm is given here(LZW_decode_algorithm_implementation.zip).