File Compression and Decompression (Huffman Coding)

Keywords: encoding less

This article uses Huffman coding to compress and decompress files (text files). The overall idea is as follows. To compress a file, we count how often each character occurs, generate a Huffman code for each character from those frequencies, and then write the codes of all characters, packed into bytes, to the compressed file. To decompress, we translate the compressed file back into characters and write them to the decompressed file; this requires the configuration file generated during compression. The compression and decompression steps are described in detail below.

The core of file compression is generating the Huffman codes, and building the Huffman tree repeatedly requires the smallest and second-smallest weights in a collection of data. A heap is a natural fit: insertion and removal are cheap (O(log n)), and the minimum is always at the top, so the two smallest values can be taken out one after the other. The heap source code is in the Heap.h file (see below). Now let's compress a file.
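To make the "smallest and second-smallest" step concrete, here is a minimal sketch that uses std::priority_queue from the standard library instead of the article's own Heap class. The function name HuffmanEncodedBits is hypothetical and not part of the article's code; it only computes how many bits the Huffman codes would need in total, without building the tree.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (an assumption, not the article's code): returns the
// total number of bits a Huffman code needs for the given character counts.
std::uint64_t HuffmanEncodedBits(const std::vector<std::uint64_t>& counts)
{
    // std::greater turns std::priority_queue into a min-heap.
    std::priority_queue<std::uint64_t,
                        std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> minHeap;
    for (std::uint64_t c : counts)
    {
        if (c > 0)  // characters that never occur get no node
            minHeap.push(c);
    }

    std::uint64_t totalBits = 0;
    while (minHeap.size() > 1)
    {
        std::uint64_t smallest = minHeap.top();  // minimum weight
        minHeap.pop();
        std::uint64_t second = minHeap.top();    // second-smallest weight
        minHeap.pop();
        // Each merge makes the code of every character below it one bit longer.
        totalBits += smallest + second;
        minHeap.push(smallest + second);         // the merged node goes back in
    }
    return totalBits;
}
```

For the counts 5, 9, 12, 13, 16 and 45, the merges produce 14, 25, 30, 55 and 100, so the encoded text needs 224 bits, compared with 800 bits at one byte per character.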

1. Count the occurrences of every character in the file. A byte can take 256 different values, of which only the first 128 are ASCII characters, so the character variable must be declared as unsigned char ch rather than char ch; otherwise bytes above 127 can become negative and cannot be used as array subscripts. An unsigned char can never compare equal to the end-of-file mark EOF (-1), so we use the function feof to detect the end of the file instead. Most importantly, the file must be opened in binary mode; in text mode some bytes may be translated or treated as the end of the file (on Windows, for example), and multi-byte characters such as Chinese characters come out garbled. For storage we use a hash table indexed directly by the character value, which maps each character to its number of occurrences. Note that each slot of the table holds not just a count but a FileInfo node. This weighted node stores the character, its number of occurrences, and the Huffman code we will generate later, so that everything can be looked up by index.

bool Compress(const char *filename)//Counts how often each character occurs
{
    FILE *fout = fopen(filename, "rb");//Open the file in binary mode
    assert(fout);
    unsigned char ch = fgetc(fout);
    while (!feof(fout))//feof becomes true only after a read has hit the end of the file
    {
        _Infos[ch]._count++;//Occurrences of this character
        ch = fgetc(fout);
        COUNT++;//Total number of characters in the file
    }
    fclose(fout);
    return true;
}
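As an alternative to the feof loop, the counting step can keep the int that fgetc returns, which can hold every byte value 0..255 as well as EOF. The following is a minimal sketch under that assumption; the helper name CountBytes and its signature are hypothetical and do not appear in the article's code.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: counts every byte value read from a stream that the
// caller has already opened in binary mode, and returns the total number of
// bytes. The caller is expected to zero the counts array beforehand.
unsigned long long CountBytes(std::FILE* fin, unsigned long long (&counts)[256])
{
    assert(fin);
    unsigned long long total = 0;
    int ch;  // int, not char: it must distinguish the byte 0xFF from EOF
    while ((ch = std::fgetc(fin)) != EOF)
    {
        ++counts[static_cast<unsigned char>(ch)];  // index by byte value
        ++total;
    }
    return total;
}
```

With this form no separate feof call is needed, and a byte with the value 0xFF is counted like any other byte.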

2. Create a min-heap and push the counted nodes into it.
3. Take nodes from the heap and build the Huffman tree (see the HuffMan.h header file).
4. Generate the Huffman codes from the Huffman tree and store them in the nodes.
5. Traverse the file to be compressed and write the Huffman code of each character, packed into bytes, to the compressed file.
6. Save the number of occurrences of each character in the configuration file. When we reach the last character of the file, the remaining code bits may not fill a whole byte; in that case we pad the remaining positions with 0. So that the decoder can tell where the last character ends, we write the total number of characters in the file being compressed to the first line of the configuration file, and then store each character and its number of occurrences on its own line in the form "X,n". This completes the file compression process.
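Steps 5 and 6 rely on packing code bits into bytes and zero-padding the last byte. The sketch below shows that packing in isolation. It takes the concatenated codes as a string of '0' and '1' characters, which is an assumption made for illustration; the function name PackBits is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: packs a string of '0'/'1' characters into bytes,
// most significant bit first, padding the last byte with 0 bits.
std::vector<unsigned char> PackBits(const std::string& bits)
{
    std::vector<unsigned char> out;
    unsigned char current = 0;  // byte being assembled
    int used = 0;               // number of bits already placed in current
    for (std::size_t i = 0; i < bits.size(); ++i)
    {
        assert(bits[i] == '0' || bits[i] == '1');
        current = static_cast<unsigned char>(current << 1);
        if (bits[i] == '1')
            current |= 1;
        if (++used == 8)        // a full byte is ready
        {
            out.push_back(current);
            current = 0;
            used = 0;
        }
    }
    if (used != 0)              // leftover bits: shift them to the top, pad with 0
        out.push_back(static_cast<unsigned char>(current << (8 - used)));
    return out;
}
```

For example, the three bits "101" become the single byte 10100000 (0xA0). The five padding bits are indistinguishable from real code bits, which is why the character count has to be stored in the configuration file.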

The idea of file decompression is simple, but reading the configuration file takes some care, particularly in handling the newline character, which can itself appear as an entry. The decompression steps are as follows (see Uncompress.h for the source):

1. Read the configuration file.
2. Rebuild the Huffman tree from the configuration file.
3. Decompress the file: read the compressed file byte by byte, follow the bits through the Huffman tree to find the corresponding characters, and write those characters to the decompressed file. The COUNT read from the configuration file determines where decoding stops, so the padding bits after the last character are not decoded. This completes the decompression.
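The role of the character count in step 3 can be seen in a small, self-contained sketch. It stores the tree in a vector and refers to children by index, which is a simplification chosen for illustration and differs from the pointer-based HuffManNode used in the article; DecodeNode and DecodeBits are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node layout: children are indices into the vector, -1 = none.
struct DecodeNode
{
    int left;          // child followed on bit 0
    int right;         // child followed on bit 1
    unsigned char ch;  // decoded character, meaningful only in a leaf
};

// Walks the tree (root at index 0) bit by bit, most significant bit first,
// and stops after `count` characters so that padding bits are ignored.
std::string DecodeBits(const std::vector<DecodeNode>& tree,
                       const std::vector<unsigned char>& packed,
                       unsigned long long count)
{
    std::string out;
    // An empty tree or a root without two children has no bits to follow.
    if (tree.empty() || tree[0].left < 0 || tree[0].right < 0)
        return out;

    int cur = 0;
    for (unsigned char byte : packed)
    {
        for (int bit = 7; bit >= 0 && count > 0; --bit)
        {
            cur = (byte & (1 << bit)) ? tree[cur].right : tree[cur].left;
            assert(cur >= 0 && cur < static_cast<int>(tree.size()));
            if (tree[cur].left < 0 && tree[cur].right < 0)  // reached a leaf
            {
                out += static_cast<char>(tree[cur].ch);
                --count;
                cur = 0;                                    // back to the root
            }
        }
        if (count == 0)
            break;
    }
    return out;
}
```

With the codes a=0, b=10 and c=11, the text "abca" packs into the single byte 01011000 (0x58). Decoding with a count of 4 returns "abca"; without that limit, the two padding bits would be decoded as two extra 'a' characters.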

Heap.h

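#pragma once  

#include <cassert>   // assert  
#include <iostream>  // cout, endl (used by Print)  
#include <utility>   // swap  
#include <vector>  
using namespace std; // the code below uses vector, swap and cout unqualified  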

template<class T>  
struct Less  
{  
    bool operator() (const T& l, const T& r)  
    {  
        return l < r; // operator<  
    }  
};  

template<class T>  
struct Greater  
{  
    bool operator() (const T& l, const T& r)  
    {  
        return l > r; // operator>  
    }  
};  


template<class T, class Compare=Less<T>>//Compare is a functor (function object); the default Less gives a min-heap  
class Heap  
{  
public:  
    Heap()  
    {}  
    Heap(const T* a, size_t size)  
    {  
        for (size_t i = 0; i < size; ++i)  
        {  
            _arrays.push_back(a[i]);//Insert all data into the heap  
        }  

        // Build heap: adjust every non-leaf node downward, starting from the last one.  
        // The signed index avoids unsigned wrap-around when the heap has fewer than two elements.  
        for (int i = (int)_arrays.size() / 2 - 1; i >= 0; --i)  
        {  
            AdjustDown(i);  
        }  
    }  

    void Push(const T& x)  
    {  
        _arrays.push_back(x);  
        AdjustUp(_arrays.size() - 1);//The process of inserting nodes is actually the process of adjusting the heap up.  
    }  

    void Pop()  
    {  
        assert(_arrays.size() > 0);  
        swap(_arrays[0], _arrays[_arrays.size() - 1]);  
        _arrays.pop_back();  

        AdjustDown(0);  
    }  

    T& Top()  
    {  
        assert(_arrays.size() > 0);  
        return _arrays[0];  
    }  

    bool Empty()  
    {  
        return _arrays.empty();  
    }  

    size_t Size()  
    {  
        return _arrays.size();  
    }  

    void AdjustDown(int root)  
    {  
        int child = root * 2 + 1;  

        Compare com;  
        while (child < (int)_arrays.size())  
        {  
            // Pick the child that belongs closer to the top (the smaller one for a min-heap)  
            if (child + 1 < (int)_arrays.size() &&  
                com(_arrays[child + 1], _arrays[child]))  
            {  
                ++child;  
            }  

            if (com(_arrays[child], _arrays[root]))  
            {  
                swap(_arrays[child], _arrays[root]);  
                root = child;  
                child = 2 * root + 1;  
            }  
            else  
            {  
                break;  
            }  
        }  
    }  

    void AdjustUp(int child)  
    {  
        int parent = (child - 1) / 2;  

        while (child > 0)  
        {  
            if (Compare()(_arrays[child], _arrays[parent]))  
            {  
                swap(_arrays[parent], _arrays[child]);  
                child = parent;  
                parent = (child - 1) / 2;  
            }  
            else  
            {  
                break;  
            }  
        }  
    }  

    void Print()  
    {  
        for (size_t i = 0; i < _arrays.size(); ++i)  
        {  
            cout << _arrays[i] << " ";  
        }  
        cout << endl;  
    }  

public:  
    vector<T> _arrays;  
};  

//Heap test  
//void Test1()  
//{  
//  int a[10] = { 10, 11, 13, 12, 16, 18, 15, 17, 14, 19 };  
//  Heap<int, Greater<int> > hp1(a, 10);  
//  hp1.Push(1);  
//  hp1.Print();  
//  
//  Heap<int> hp2(a, 10);  
//  hp2.Push(1);  
//  hp2.Print();  
//  
//  
//  Less<int> less;  
//  cout<<less(1, 2)<<endl;  
//  
//  Greater<int> greater;  
//  cout<<greater(1, 2)<<endl;  
//}

HuffMan.h

#pragma once  

#include "Heap.h"  

template<class T>  
struct HuffManNode  
{  
    HuffManNode<T> *_left;  
    HuffManNode<T> *_right;  
    HuffManNode<T> *_parent;  
    T _weight;  
    HuffManNode(const T&x)  
        : _left(NULL)  
        , _right(NULL)  
        , _parent(NULL)  
        , _weight(x)  
    {}  
};  

template<class T>  
class HuffMan  
{  
    typedef HuffManNode<T> Node;  

    template<class U>//Named U because a nested template may not reuse the outer parameter name T  
    struct NodeCompare  
    {  
        bool operator() ( const Node*l, const Node*r)//Templates cannot be compiled separately,  
        //so NodeCompare is defined in the same file that uses it.  
        {  
            return l->_weight < r->_weight;  
        }  
    };  

protected:  
    Node* _root;  

public:  
    HuffMan()  
        :_root(NULL)  
    {}  

    ~HuffMan()  
    {}  

public:  
    Node* GetRootNode()  
    {  
        return _root;  
    }  

    Node* CreatTree(T*a, size_t size,const T& invalid)  
    {  
        //Wrap each valid weight in a Huffman node and push it into the min-heap  
        assert(a);  
        Heap<Node*, NodeCompare<T>> minHeap;  
        for (size_t i = 0; i < size; ++i)  
        {  
            if (a[i] != invalid)  
            {  
                Node*node = new Node(a[i]);  
                minHeap.Push(node);  
            }  

        }  
        /*for (int i = 0; i<10; i++) 
        { 
            Node *temp = minHeap._arrays[i];//Code for testing 
            cout << temp->_weight << " "; 
        }*/  
        //Build the Huffman tree by repeatedly taking the smallest and second-smallest nodes from the min-heap  
        while (minHeap.Size()>1)  
        {  
            Node* left = minHeap.Top();//Smallest  
            minHeap.Pop();  
            Node* right = minHeap.Top();//Second smallest  
            minHeap.Pop();  
            Node *parent = new Node(left->_weight + right->_weight);  
            parent->_left = left;  
            parent->_right = right;  
            left->_parent = parent;  
            right->_parent = parent;//Link the parent and its children  

            minHeap.Push(parent);//Push the merged node back so the heap re-orders it  
        }  
        _root = minHeap.Empty() ? NULL : minHeap.Top();//The last node left in the heap is the root of the Huffman tree (NULL if there were no valid weights)  
        return _root;  
    }  

    HuffManNode<T>* GetRoot()  
    {  
        return _root;  
    }  
    void PrintHuff()  
    {  
        Node *root = _root;  
        _Print(root);  
    }  
protected:  
    void _Print(Node *root)  
    {  
        if (root == NULL)  
        {  
            return;  
        }  
        else  
        {  
            cout << root->_weight;  
        }  
        _Print(root->_left);  
        _Print(root->_right);  
    }  

};  

//void TestHuff()  
//{  
//  int a[] = { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9 };  
//  HuffMan<int> t;  
//  t.CreatTree(a, sizeof(a) / sizeof(int), -1);  
//  
//}

filecompress.h

# include<iostream>  
# include<cassert>  
# include<string>  
# include<algorithm>  
# include"HuffMan.h"  
using namespace std;  
typedef unsigned long long LongType;  
struct FileInfo  
{  
  unsigned  char _ch;  
  LongType  _count;  
  string  _code;  
  FileInfo(unsigned char ch=0)  
      :_ch(ch)  
      , _count(0)  
  {}  
 FileInfo operator+(FileInfo filein)  
  {  
     FileInfo temp;  
     temp._count=_count + filein._count;  
     return temp;  
  }  
 bool operator<( const FileInfo filein)const                 
 {  
     return _count < filein._count;  
 }  
 bool operator!=(const FileInfo  Invalid)const  
 {  
     return _count != Invalid._count;  
 }  
};  
class FileCompress  
{  
protected:  
    FileInfo _Infos[256];  
    LongType COUNT = 0;  
public:  
    FileCompress()  
    {  
        for (int i = 0; i < 256;i++)  
        {  
            _Infos[i]._ch = i;  
        }  
    }  
    bool Compress(const char *filename)//Counts how often each character occurs  
    {  
        FILE *fout = fopen(filename, "rb");//Open the file in binary mode  
        assert(fout);  
        unsigned char ch = fgetc(fout);  
        while (!feof(fout))//feof becomes true only after a read has hit the end of the file  
        {  
            _Infos[ch]._count++;//Occurrences of this character  
            ch = fgetc(fout);  
            COUNT++;//Total number of characters in the file  
        }  
        fclose(fout);  
        return true;  
    }  
    void GenerateHuffManCode()  
    {  
        HuffMan<FileInfo> t;  
        FileInfo invalid;  
        t.CreatTree(_Infos, 256, invalid);  
        HuffManNode<FileInfo>*root = t.GetRoot();  
        _GenrateHuffManCode(root);  
    }  
    void _GenrateHuffManCode(HuffManNode<FileInfo>* root)  
    {  
        if (root == NULL)  
        {  
            return;  
        }  
        _GenrateHuffManCode(root->_left);  
        _GenrateHuffManCode(root->_right);  

        if ((root->_left == NULL) && (root->_right == NULL))  
        {  
            HuffManNode<FileInfo>*cur = root;  
            HuffManNode<FileInfo>*parent = cur->_parent;  
            string &code = _Infos[cur->_weight._ch]._code;  
            while (parent)//From the leaf node to the root node  
            {  
                if (parent->_left == cur)  

                    code += '0';          
                else          
                    code += '1';      
                cur = parent;  
                parent = cur->_parent;  
            }  
            reverse(code.begin(), code.end());  
        }         
    }  

    //Following is file compression  
    void CompressFile(const char *filename)  
    {  
        Compress(filename);  
        string compressFile = filename;  
        compressFile += ".huffman";  
        FILE *FinCompress = fopen(compressFile.c_str(), "wb");  
        assert(FinCompress);//The compressed file is named <filename>.huffman  

        GenerateHuffManCode();//Generate code  
        FILE *fout = fopen(filename, "rb");  
        assert(fout);  

        //Compression of files  
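        unsigned char inch = 0;//Byte being assembled  
        int index = 0;//Number of bits already placed in inch  
        int ch = fgetc(fout);//int, so that the byte 0xFF is not mistaken for EOF  
        while (ch!=EOF)  
        {  
            string&code = _Infos[(unsigned char)ch]._code;  
            for (size_t i = 0; i < code.size(); ++i)  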
            {  
                ++index;  
                inch <<= 1;  
                if (code[i] == '1')  
                {  
                    inch |= 1;  
                }  
                if (index == 8)  
                {  
                    fputc(inch, FinCompress);  
                    index = 0;  
                    inch = 0;  
                }         
            }  
            ch = fgetc(fout);  
        }  
        if (index != 0)//Pad the last byte with 0 bits  
        {  
            inch <<= (8 - index);  
            fputc(inch,FinCompress);  
        }  
        fclose(fout);  
        fclose(FinCompress);//Flush and close the compressed file  
        FileInfo invalid;  
        CreateConfig(filename,invalid);  
    }  
    void CreateConfig( const char* filename,FileInfo invalid)  
    {  
        string ConfigFile = filename;  
        ConfigFile += ".config";  
        FILE *FileConfig = fopen(ConfigFile.c_str(), "wb");  
        assert(FileConfig);  

        string tempcount = to_string(COUNT);//_itoa is non-standard and takes an int, which would truncate COUNT  
        for (size_t i = 0; i < tempcount.size(); i++)  
        {  
            fputc(tempcount[i],FileConfig);  
        }//Write the total number of characters to the configuration file  
        fputc('\n', FileConfig);  
        for (size_t i = 0; i < 256; i++)  
        {  
            if (_Infos[i] != invalid)  
            {  
                string chInfo;  
                chInfo.clear();  

                if (_Infos[i]._count>0)  
                {  
                    chInfo += _Infos[i]._ch;  
                    chInfo += ',';  
                    chInfo += to_string(_Infos[i]._count);//Decimal text of the count  
                    for (size_t j = 0; j < chInfo.size(); j++)  
                    {  
                        fputc(chInfo[j], FileConfig);  
                    }  

                        fputc('\n', FileConfig);                      
                }  
            }  
        }  
        fclose(FileConfig);  
    }  

};  
void TestFileCompress()  
{  
    FileCompress FC;  
    FC.CompressFile("fin.txt");  
    cout << "Compression success" << endl;  
}

Uncompress.h

# include<iostream>  
using namespace std;  
# include"HuffMan.h"  
# include"filecompress.h"  

class Uncompress  
{  
private:  
    FileInfo _UNinfos[256];  
    LongType Count;  
public:  
    Uncompress()//Initialization of Hash Table  
    {  
        for (int i = 0; i < 256; i++)  
        {  
            _UNinfos[i]._ch = i;  
        }  
        Count = 0;  
    }  
    bool _Uncompress(const char *Ufilename)//read configuration file  
    {  
        string Configfile = Ufilename;  
        Configfile += ".config";  
        FILE *fpConfig = fopen(Configfile.c_str(), "rb");  
        assert(fpConfig);  

        string line;  
        unsigned char ch = fgetc(fpConfig);  
        while (ch != '\n')  
        {     
            line += ch;  
            ch =fgetc(fpConfig);          
        }//Read the first line  
        Count = strtoull(line.c_str(), NULL, 10);//Total number of characters in the original file (atoi would truncate to int)  
        ch = fgetc(fpConfig);//Read the first character of the next line  
        line.clear();  
        int j = 0;  
        while (!feof(fpConfig))  
        {  

            j++;  
            while (ch != '\n')  
            {  
                line += ch;  
                ch = fgetc(fpConfig);  

            }  
            if (line.empty())//An empty line means the counted character is '\n' itself  
            {  
                line += '\n';  
                ch = fgetc(fpConfig);  
                while (ch != '\n')  
                {  
                    line += ch;  
                    ch = fgetc(fpConfig);  
                }  
            }  
            ch = fgetc(fpConfig);  
            unsigned char tempch = line[0];//Convert the first character to unsigned so it can be used as the subscript  
                                           //Without this, bytes above 127 give a negative index and the result is garbage.    
            _UNinfos[tempch]._count = strtoull(line.substr(2).c_str(), NULL, 10);//Skip "X," and convert the rest to the count  
            line.clear();  
        }  
        fclose(fpConfig);  
        return true;  
    }  
    void GenrateHuffManCode(HuffManNode<FileInfo>* root)//Generate the Huffman code of every leaf  
    {  
        if (root == NULL)  
        {  
            return;  
        }  
        GenrateHuffManCode(root->_left);  
        GenrateHuffManCode(root->_right);  

        if ((root->_left == NULL) && (root->_right == NULL))  
        {  
            HuffManNode<FileInfo>*cur = root;  
            HuffManNode<FileInfo>*parent = cur->_parent;  
            string &code = _UNinfos[cur->_weight._ch]._code;  
            while (parent)//From the leaf node to the root node  
            {  
                if (parent->_left == cur)  

                    code += '0';  
                else  
                    code += '1';  
                cur = parent;  
                parent = cur->_parent;  
            }  
            reverse(code.begin(), code.end());  
        }  
    }  

    bool UncomPress(const char *UNfilename)//Decompression of files  
    {  
        _Uncompress(UNfilename);  
        HuffMan<FileInfo> Re_huffTree;  
        FileInfo invalid;  
        HuffManNode<FileInfo>*root = Re_huffTree.CreatTree(_UNinfos, 256, invalid);//Reconstructing Huffman Tree  
        GenrateHuffManCode(root);  

        //Open the output file and the compressed file  
        string UnComPressFile = UNfilename;  
        UnComPressFile += ".Unhuffman";  
        FILE *UCPfile = fopen(UnComPressFile.c_str(), "wb");  
        assert(UCPfile);  
        string ComPressFile = UNfilename;  
        ComPressFile += ".huffman";  
        FILE *CPfile = fopen(ComPressFile.c_str(), "rb");  
        assert(CPfile);  

        //Decode characters and write them to the output file  


        HuffManNode<FileInfo>*tempRoot = root;//Get its root node  
        while (!feof(CPfile))  
        {  
            unsigned char ch = fgetc(CPfile);  
            int bitCount = 7;  
            for (int i = bitCount; i >= 0; i--)  
            {  
                if (ch&(1 << i))  
                {  
                    tempRoot = tempRoot->_right;  
                }  
                else  
                {  
                    tempRoot = tempRoot->_left;  
                }  
                if (!tempRoot->_left&&!tempRoot->_right)//Reached a leaf: one character has been decoded  
                {  
                    fputc(tempRoot->_weight._ch, UCPfile);  
                    Count--;  
                    tempRoot = root;  
                }  
                if (Count <= 0)  
                {  
                    break;  
                }  
            }  
            if (Count <= 0)  
            {  
                break;  
            }  
        }  
        fclose(CPfile);  
        fclose(UCPfile);//Flush and close the decompressed file  
        return true;  
    }  

};  
void TestUNCP()  
{  
    Uncompress Uncp;  
    Uncp.UncomPress("fin.txt");  
}

Posted by Online Connect on Sun, 26 May 2019 16:32:31 -0700