# Compression and WinRar software:How it’s made?

Compression software can compress data to smaller size from it’s original size.The well know WinRar software is of such type.Compression software comes in handy when you need to save some space in your hard drive.To make such software we need an algorithm.The better is your algorithm the better your software can compress which means saving larger amount of space.WinRar does a good job in compressing text data.I have google on what algorithm is used in WinRar but it seems the algorithm creator did not make it public.So,I don’t know what algorithm is used however,I will make use of two algorithms in building a simple console compression software-

Winrar may produce better compression than these two algorithms but knowing them will give you an insight on the origin of compression software.If you want to make a better compression software than WinRar or any other existing software you will have to come up with a better algorithm and ‘how the concept of compressing data was born‘ may prove useful in the process.Let’s get on with the first algorithm!

#### 7 bits compression method

The smallest size a character can occupy in our computer is 8 bits.However,not all character occupy 8 bits.The characters that you see in your keyboard occupy up to maximum of 7 bits,the last bits is free.8 bits is only used for some other characters not found in the keyboard.Since a text document consist only of English Alphabets ,number ,punctuation,some special symbols,etc. which are all found in the keyboard,so we need up to maximum of 7 bits to store them.So what if we allocate only 7 bits for each character instead of 8 bits we can save some space.For instance,if a text document have 500 characters which is 500 bytes *8=4000 bits size but,if we allocate only 7 bits size for each character the size becomes 500 byte *7=3500 bits,so the size decreases by 500(4000-3500) bits or we would have saved 62.5 bytes of memory.And if the text document size is larger we would have saved more space.The concept is simple here “allocate a fixed size of 7 bits instead of 8 bits” for a character.

To implement 7 bits compression method our algorithm will consist of two parts: i)Encoding and ii)Decoding.

## Encoding

In encoding process we will allocate 7 bits for a character.We cannot use bit field to allocate 7 bits for a character and write out the 7 bits data to the file because any data must have a minimum size of 8 bits whether it makes sense or not.Encoding process involve taking 7 bits from the character and storing it in a sequence of 8 bits unsigned character.To understand this process look at the chart below.

Chart 1

var1 , var2 , var3 , var4 … are the 8 bits characters to be encoded.And encoded1 , encoded2 , encoded3 , encoded4 , encoded5 … are the sequence of 8 bits unsigned character.If we look at the chart ,the 7 bits of var1 is stored in encoded1 however,encoded1 has still one free bit left so the first bit of var2 is stored in it.The remaining 6 bits is store in the next 8 bit sequence which is encoded2.The last two free bits of encoded2 is use to store the first two bits of var3. Again,the remaining 5 bits of var3 is store in the next 8 bit sequence of encoded3.The process goes on but,after 8 characters have been encoded the process of encoding the next character is same as the first character.So,the process repeats itself after every 8 characters.The sequence encoded1 , encoded2 , encoded3 , encoded4 , encoded5… are the sequence of data for the encoded file,so save it in a file.We will use this file later to decode the data .From the chart it is clear that for every 8 characters we obtained seven, 8 bits encoded character so,compression is achieved.

The source code of the program implementing the above algorithm can be Download here(7_bits_encoding_method.zip) .The .zip file contain an encoded compressed file of William Shakespeare poem “Shall I compare thee to a summer’s day?“.You can compare the difference in their size.

## Decoding

In decoding the reverse of encoding is performed.From the encoded 8 bits sequence 7 bits are assigned to unique unsigned character and it is save in the file.This file is the decoded file and the content is the same with the original file.Decoding process can be understood easily by looking at the chart below.

Chart 2

Consider encoded1 , encoded2 , encoded3 , encoded4 , encoded5… as the encoded sequence from the encoded file and decoded1 , decoded2 , decoded3 , decoded4 , decoded5… as the decoded unsigned characters obtained after assigning a fix size of 7 bit which is extracted from the encoded sequence.Looking at the chart the first seven bits of encoded1 is assigned to decoded1 and the last bit is assigned to the next character , decoded2. However,decoded2 still needs 6 more bits so 6 bits is taken from the next encoded sequence,encoded2.The two bits of encoded2 is assigned to decoded3 and the missing five bits is taken from encoded3 which is the next sequence.The process goes on until the last character has been decoded.Decoding the encoded8 character is same as decoding encoded1 so the decoding pattern repeats after every 7 characters.

You can download the source code of the program implementing the decoding algorithm here(7bit_decoding_method.zip) .You can test the decoding program by decoding the encoded file included in the zip file 7_bits_encoding_method.zip.

## LZW(Lempel-Ziv-Welch) Algorithm

It is a dictionary base compression method.The algorithm uses a fix size bits assigned to the encoded sequence.This fix size bits is stored in the output file which is used for decoding.The first 256(0-255) bits is reserve for the 256 characters.The next encoded sequence is given 256 and the coming encoded sequence is given 257 and so on.

*Note:
Dictionary:A dictionary is nothing but a temporary file created to store an entries/string which is assigned certain value .These values assigned to the entries is used later to encoded the data read from the file. It can be of any type .txt or .bin or any other type.

Sequence:A sequence means a data or a character taken from the file to be encoded.

I will be using a fix size of 10 bits to encode the sequence in the example.

## Encoding the sequence

Let’s try to implement the algorithm on the string “ghkjlncvghkjnmgghkjlnghkjlnmr“.

 Current sequence Next sequence Dictionary entry Output code Null g g h gh — 256 103 h k hk — 257 104 k j kj — 258 107 j l jl — 259 106 l n ln — 260 108 n c nc — 261 110 c v cv — 262 99 v g vg — 263 118 gh k ghk —264 256 kj n kjn — 265 258 n m nm — 266 110 m g mg — 267 109 g g gg — 268 103 ghk j ghkj — 269 264 jl n jln — 270 259 n g ng — 271 110 ghkj l ghkjl – 272 269 ln m lnm — 273 260 m r mr — 274 109 r ~ ~ 114

The string size is 29*8 bits=232 bits.The compressed data size is 20*10 bits=200 bits.The size difference is 32 bits or 4 bytes.For files having thousand characters or more the algorithm would produce a noticeable difference in the size of the encoded file and the original file.The dictionary file can be deleted after the compression because the dictionary can be built again while decoding.

The source code of the encoding algorithm is given here(LZW_encoding.zip). The dictionary file and the encoded file of the text file containing the string “ghkjlncvghkjnmgghkjlnghkjlnmr” is also included.The program is made simple by not using class and C++ core features.To understand the program you also need to know how to allocate 10 bits for a data using 8 bits unsigned character.If you are not interested in the given source code then try to write your own program! in that way you will also learn more.

## Decoding the sequence

To decode the sequence we need to know the initial dictionary entry,the remaining dictionary can be created from the next sequence obtain from the encoded file.Read the first 10 bits from the sequence and convert it to unsigned char value say x. This character is the first character of the decoded file.Read the next 10 bits from the file and convert it to unsigned character type say y. This is the second character of the decoded file.Combine the first and second character which gives the string “xy” and send this string to the dictionary file and assign it’s value as 256.By now we have obtained two characters of the decoded file and one dictionary entry.Read the next 10 bits, consider it’s character value as z. Combine the previous character and the current character and assign it’s value as 257.The string “yz” and 257 is the new entry in the dictionary and ‘z‘ is the decoded character.This process continue for all the sequence read from the file.

CASE 1-When the current 10 bits sequence have a size greater than 255:Suppose while decoding you come across a sequence of 10 bits having integer value greater than 255.In such case look up the corresponding string in the dictionary and send the string as the output value.For instance,if the value was say 256 than the corresponding string is “xy“. Now “xy” is the output value and if the previous output value was say ‘p‘ then the dictionary entry for this case is “px” (only the first character of the output string is taken for the dictionary entry) and the assigned value is the incremented value of the previous entry.

CASE 2-When the current 10 bits sequence have a size greater than 255 and the previous output value is a string:In this case the process is same as the case 1(above) but the dictionary entry is the combine string of the precious output string and the first character of the current output.

There is a case in which the decoding algorithm fails,we will discuss that later in case 3.The table below shows the decoding steps of the file which was encoded earlier.

 Input bits char value dictionary entry output data 0001100111‬ g (103) g 0001101000‬ h (104) gh — 256 h ‭0001101011‬ k (107) hk — 257 k ‭0001101010‬ j (106) kj — 258 j ‭0001101100‬ l (108) jl — 259 l ‭0001101110‬ n (110) ln — 260 n ‭0001100011‬ c (99) nc — 261 c ‭0001110110‬ v (118) cv — 262 v 0100000000‬ gh (256) vg — 263 gh 0100000010‬ kj (258) ghk – 264 kj ‭0001101110‬ n (110) kjn – 265 n ‭0001101101‬ m (109) nm — 266 m ‭0001100111‬ g (103) mg — 267 g 0100001000 ghk (268) gg — 268 ghk 0011111010 jl (259) ghkj – 269 jl ‭0001101110‬ n (110) jln — 270 n 0100001101 ghkj (269) ng — 271 ghkj 0100000100 ln (260) ghkjl – 272 ln ‭0001101101‬ m (109) lnm – 273 m ‭0001110010‬ r (114) mr — 274 r

If you compare the encoded and decoded dictionary the entries are the same.So,making sure that your encoding and decoding program render the same dictionary entry is the first step in writing a correct program .

CASE 3-When the current fix 10 bits value is greater than 255 and the dictionary entry for the value has not been made.:Suppose while decoding you come across a sequence whose value is greater than 255 but the dictionary entry corresponding to that value has not been made.In such case you will take the previous decoded string and add the first character of the string to itself.The new string is the current decoded output and a dictionary entry for the current sequence.For instance,let’s encode and decode the string “abcabcabcabcabcabcabc”

 Current sequence Next sequence Dictionary entry Output NULL a a b ab — 256 97 b c bc — 257 98 c a ca — 258 99 ab c abc — 259 256 ca b cab — 260 258 bc a bca -261 257 abc a abca – 262 259 abca b abcab – 263 262 bca b bcab – 264 261 bc ~ ~ 257

There is no problem with encoding the string.Now,let’s decode the string.

 Input bits Char Dictionary entry decoded output ‭01100001‬ a (97) a ‭01100010‬ b (98) ab — 256 b ‭01100011‬ c (99) bc — 257 c ‭0100000000‬ ab (256) ca — 258 ab ‭0100000010‬ ca (258) abc — 259 ca ‭0100000001‬ bc (257) cab — 260 bc ‭0100000011‬ abc (259) bca — 261 abc ‭*0100000110‬ *abca (262) * abca — 262 abca ‭0100000101‬ bca (261) abcab – 263 bca ‭0100000001‬ bc (257) bcabc – 264 bc

If you look at the 8th column marked with ‘*‘ the sequence has a value of 262,but the dictionary does not contain any entry with the value 262.So, the previous decoded string “abc” is taken and it’s first character ‘a‘ is added to it.The new string “abca” is the decoded string of input bits value 262 and the dictionary entry for the current sequence.

The source code of the decoding algorithm is given here(LZW_decode_algorithm_implementation.zip).