Introduction

Zip files are one of the most popular archive formats out there, and there are a lot of things you can do with them. While working with the ePub ebook format I spent a lot of time working with zip files. Thankfully, the standards committee for ePub used zip as the container format instead of designing their own.

If you’ve ever done anything with Android you’ve dealt with zip files and probably didn’t even realize it. The apk files are really zip files. Now, you might never have needed to go into the apk itself, but I’ve had to in order to check manifest merging with libraries.

Probably the most fun use of zip files is packaging plugins. Think about Firefox and its xpi extension format (they’re zip files). Having everything in one file, with the app installing the plugin, makes things really easy for users.

Now let’s think about a game with add-on skins. A skin might consist of multiple images, all in one zip file. This makes it easy to transmit, and compression saves storage if there are a large number of skins. I’m ignoring the fact that zipping already compressed images provides little if any space savings. Use your imagination if you don’t like this example.

In this situation the images need to be read out of the zip file and will be used directly. We don’t need a stream API. Just an easy way to open the zip file and pull everything out.

It’s also nice to provide tools for developers that can automatically package their plugins. Most of the time a plugin needs a specific layout and some metadata and a packaging app can take care of this.

Suffice it to say, at some point you’ll probably need to read and write zip files. You could implement the zip spec yourself because it’s really simple, but why bother when there are easier ways to go about this?

Utilizing MiniZip

Minizip is one of the better zip libraries. However, while it’s easier than other zip libraries, it’s still not particularly easy to use. To deal with this we can make a simple API that hides a lot of the verbosity of Minizip. I’m not going to focus on things like encryption or random file access within the zip archive. All this is going to do is make it easy to add files to a zip archive and get files out of one.

If you’re curious you can get the spec (free) from PKWARE.

The zlib compression library provides MiniZip in its contrib directory. It’s a full-fledged zip library and is relatively easy to use. It’s an updated version of the original MiniZip with various patches pulled in. You want to use the zlib provided one and not the outdated original.

Also, Deflate compression is always used when adding files to the archive. This API is meant to be simple and to cover basic zip file manipulation. It is not meant to be a full-featured archive library like libarchive.

General Flow

Let’s look at the general process for working with MiniZip.

Zipping:

  1. Open the zip (doesn’t have to exist).
  2. Open a new file.
  3. Write to the file.
  4. Close the file.
  5. Close the zip.

Unzipping:

  1. Open the zip (has to exist).
  2. Open the current file in the zip.
  3. Read the file.
  4. Close the file.
  5. Move to the next file.
  6. Continue until there are no more files.
  7. Close the zip.

While this looks fairly straightforward, we want to streamline it with a few helpers.

Minizip Functions

This is a wrapper around Minizip and some Minizip functions will need to be used directly. I say some, but only the open and close functions:

  • zipOpen64
  • zipClose
  • unzOpen64
  • unzClose

There are two sets of open functions MiniZip provides, but we will only ever use the ones ending in 64. The other versions are legacy and shouldn’t be used. The 64 ones indicate support for 64-bit zip (Zip64), which is what’s used today. Don’t worry, because zipOpen64 and unzOpen64 can still open 32-bit zip files.

Remember, these are all helpers, so we still need to make common operations easier. We’re only merging some of the zipping and unzipping steps, not creating our own API to completely replace MiniZip’s.

We’re also giving ourselves some extras like zipping files from disk and from memory, because the library doesn’t natively provide these interfaces (zipper_add_file, zipper_add_buf). Plus some things just aren’t clear (adding a directory), so we have helpers for those.

#ifndef __ZIPPER_H__
#define __ZIPPER_H__

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <minizip/unzip.h>
#include <minizip/zip.h>

typedef enum {
    ZIPPER_RESULT_ERROR = 0,
    ZIPPER_RESULT_SUCCESS,
    ZIPPER_RESULT_SUCCESS_EOF
} zipper_result_t;

typedef void (*zipper_read_cb_t)(const unsigned char *buf, size_t size, void *thunk);

bool zipper_add_file(zipFile zfile, const char *filename);
bool zipper_add_buf(zipFile zfile, const char *zfilename, const unsigned char *buf, size_t buflen);
bool zipper_add_dir(zipFile zfile, const char *dirname);

zipper_result_t zipper_read(unzFile zfile, zipper_read_cb_t cb, void *thunk);
zipper_result_t zipper_read_buf(unzFile zfile, unsigned char **buf, size_t *buflen);

bool zipper_skip_file(unzFile zfile);
char *zipper_filename(unzFile zfile, bool *isutf8);
bool zipper_isdir(unzFile zfile);
uint64_t zipper_filesize(unzFile zfile);

#endif /* __ZIPPER_H__ */

Result Enum

The first thing you’ll see in the header is a result enum called zipper_result_t. Its main use is to let the read functions report when the last file has been read.

Callback Prototype

The zipper_read_cb_t function prototype allows for a callback based read function. This gives some flexibility in how the data is loaded. While I said earlier this wasn’t going to be a stream based wrapper, I was partly lying. The callback will let you stream data if you really want to…
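To make the streaming idea concrete, here is a minimal sketch of a callback that never stores the file data. It only keeps a running byte count and a 32-bit FNV-1a hash in the thunk. The digest_t type and the digest_* names are made up for this illustration and are not part of the wrapper; the typedef is repeated only so the snippet compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Same signature as zipper_read_cb_t in zipper.h, repeated here so the
 * sketch is self-contained. */
typedef void (*zipper_read_cb_t)(const unsigned char *buf, size_t size, void *thunk);

/* Hypothetical thunk: state carried from one chunk to the next. */
typedef struct {
    uint64_t total; /* bytes seen so far */
    uint32_t hash;  /* running 32-bit FNV-1a hash */
} digest_t;

static void digest_init(digest_t *d)
{
    d->total = 0;
    d->hash  = 2166136261u; /* FNV-1a 32-bit offset basis */
}

/* Called once per chunk; nothing is buffered. */
static void digest_cb(const unsigned char *buf, size_t size, void *thunk)
{
    digest_t *d = thunk;
    size_t    i;

    for (i = 0; i < size; i++) {
        d->hash ^= buf[i];
        d->hash *= 16777619u; /* FNV 32-bit prime */
    }
    d->total += size;
}
```

With the wrapper this would be used as digest_init(&d) followed by zipper_read(uzfile, digest_cb, &d). Because the state lives in the thunk, it doesn’t matter how the data happens to be split into chunks.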

Add functions

The add functions are mostly self explanatory. The big thing to keep in mind is this wrapper does not track filenames. Every time you add a file to the archive it is added as a new entry (same for directories). Meaning, if you add the same file multiple times, its data will be in the archive multiple times. When you extract it will write (overwrite) the file for each entry.
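If duplicate entries are a concern, the caller can track names itself. Below is a minimal sketch of one way to do that, assuming a small number of entries so a linear search is fine. name_set_t and its functions are hypothetical helpers for illustration and are not part of zipper.h.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: remembers which entry names were already added. */
typedef struct {
    char   **names;
    size_t   count;
    size_t   cap;
} name_set_t;

static bool name_set_contains(const name_set_t *set, const char *name)
{
    size_t i;

    for (i = 0; i < set->count; i++) {
        if (strcmp(set->names[i], name) == 0)
            return true;
    }
    return false;
}

/* Returns true if the name was newly recorded. Returns false if it was
 * already present or if memory could not be allocated. */
static bool name_set_add(name_set_t *set, const char *name)
{
    char   *copy;
    size_t  len;

    if (name_set_contains(set, name))
        return false;

    if (set->count == set->cap) {
        size_t   cap   = set->cap ? set->cap * 2 : 8;
        char   **grown = realloc(set->names, cap * sizeof(*grown));
        if (grown == NULL)
            return false;
        set->names = grown;
        set->cap   = cap;
    }

    len  = strlen(name) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return false;
    memcpy(copy, name, len);
    set->names[set->count++] = copy;
    return true;
}

static void name_set_free(name_set_t *set)
{
    size_t i;

    for (i = 0; i < set->count; i++)
        free(set->names[i]);
    free(set->names);
    set->names = NULL;
    set->count = 0;
    set->cap   = 0;
}
```

The idea is to call name_set_add before zipper_add_file or zipper_add_buf and skip the add when it returns false.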

Read Functions

The read functions are also pretty self explanatory. If there is a read failure, ZIPPER_RESULT_ERROR is returned. On success where there are more files, ZIPPER_RESULT_SUCCESS is returned. If this read was of the last file in the archive and no more remain, then ZIPPER_RESULT_SUCCESS_EOF is returned.

Reading is sequential and not random access. Once you read a file it’s read and the position in the archive advances to the next file. There is no going backwards. If you don’t read a file and realize you should have, you need to open the archive again and start going through the files until you reach the one you want.

Info Functions

When reading you might want or need some information about the file before proceeding with the read. For example, you might need to know what size buffer to allocate. Or you might not want to read the file at all.

At the very least, the filename function must be called before reading a file from the archive. Remember, the wrapper is sequential: once you’ve read a file, these functions report on the next file in the archive.

zipper_skip_file

Skip this file and move to the next one. This returns true if there is a next file. If the current file is the last file this will return false.

zipper_filename

Read the filename for the file in the archive. This includes the full path. You really need to call this before extracting a file.

zipper_isdir

Reports whether the current entry is a directory rather than a file.

zipper_filesize

The size of the file. This is the uncompressed size and can be used to allocate a buffer to hold the file data.
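Since the size comes back as a uint64_t and malloc takes a size_t, it’s worth checking that the value fits before allocating. Here is a minimal sketch of that check. alloc_entry_buf is a hypothetical helper that is not part of the wrapper, and the extra terminating byte is just a convenience I’m assuming for text files.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for an entry of entry_size bytes plus
 * one terminating zero byte. Returns NULL if the size cannot be represented
 * as a size_t or if the allocation fails. */
static unsigned char *alloc_entry_buf(uint64_t entry_size, size_t *out_len)
{
    unsigned char *buf;

    if (out_len == NULL)
        return NULL;

    /* uint64_t can be wider than size_t (32-bit targets), and we also need
     * room for the extra byte, so reject anything at or above SIZE_MAX. */
    if (entry_size >= (uint64_t)SIZE_MAX)
        return NULL;

    buf = malloc((size_t)entry_size + 1);
    if (buf == NULL)
        return NULL;

    buf[entry_size] = '\0';
    *out_len = (size_t)entry_size;
    return buf;
}
```

The caller would pass in the value from zipper_filesize and free the returned buffer when done.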

Implementation

Headers and Defines

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#  include <minizip/iowin32.h>
#endif

#include "str_builder.h"
#include "zipper.h"

#define BUF_SIZE 8192
#define MAX_NAMELEN 256

We have BUF_SIZE defined because we’re going to read through a temporary buffer (because we have to). 8192 gives us a good balance between memory and performance. We can always tweak this (or make it configurable) later.

Notice we include str_builder.h, which is a string builder. We’ll use it for some of our buffered reading.

The MAX_NAMELEN is required because we’ll need to use a temporary buffer to pull out filenames for files within the zip. The name within the zip includes the path. 256 was chosen because it’s a safe number and it’s unlikely you’ll see any paths within an archive longer than this.

Adding Files

bool zipper_add_file(zipFile zfile, const char *filename)
{
    FILE          *f;
    unsigned char  buf[BUF_SIZE];
    int            ret;
    size_t         red;
    long           flen;

    if (zfile == NULL || filename == NULL)
        return false;

    /* Binary mode so Windows does not translate line endings. */
    f = fopen(filename, "rb");
    if (f == NULL)
        return false;

    /* Find the file size so we know whether Zip64 is needed. */
    if (fseek(f, 0, SEEK_END) != 0 || (flen = ftell(f)) < 0) {
        fclose(f);
        return false;
    }
    rewind(f);

    ret = zipOpenNewFileInZip64(zfile, filename, NULL, NULL, 0, NULL, 0, NULL,
            Z_DEFLATED, Z_DEFAULT_COMPRESSION,
            ((unsigned long long)flen >= 0xffffffffULL)?1:0);
    if (ret != ZIP_OK) {
        fclose(f);
        return false;
    }

    while ((red = fread(buf, sizeof(*buf), sizeof(buf), f)) > 0) {
        ret = zipWriteInFileInZip(zfile, buf, (unsigned int)red);
        if (ret != ZIP_OK) {
            fclose(f);
            zipCloseFileInZip(zfile);
            return false;
        }
    }

    /* Close the source file on the success path too. */
    fclose(f);
    zipCloseFileInZip(zfile);
    return true;
}

bool zipper_add_buf(zipFile zfile, const char *zfilename, const unsigned char *buf, size_t buflen)
{
    int ret;

    if (zfile == NULL || buf == NULL || buflen == 0)
        return false;

    ret = zipOpenNewFileInZip64(zfile, zfilename, NULL, NULL, 0, NULL, 0, NULL,
            Z_DEFLATED, Z_DEFAULT_COMPRESSION, (buflen >= 0xffffffff)?1:0);
    if (ret != ZIP_OK)
        return false;

    ret = zipWriteInFileInZip(zfile, buf, buflen);
    zipCloseFileInZip(zfile);
    return ret==ZIP_OK?true:false;
}

We can either add files read directly from disk or dump a buffer into a file. We’re combining steps 2-4 of the zipping process into these helpers. Not to mention, adding a file from disk does a bit more on top of that by reading it for us.
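The last argument to zipOpenNewFileInZip64 is the Zip64 flag. The comment for that function in MiniZip’s zip.h says it must be 1 when the uncompressed size is >= 0xffffffff. A small sketch of that decision pulled out into its own function is below. The name zipper_needs_zip64 is made up for this example.

```c
#include <stdint.h>

/* Largest size field value a classic (32-bit) zip entry can carry. */
#define ZIP32_SIZE_LIMIT 0xffffffffULL

/* Hypothetical helper: returns 1 when an entry of this uncompressed size
 * needs the Zip64 extended information block, otherwise 0. */
static int zipper_needs_zip64(uint64_t uncompressed_size)
{
    return uncompressed_size >= ZIP32_SIZE_LIMIT ? 1 : 0;
}
```

Taking a uint64_t matters here. If the size is held in a type that is only 32 bits wide on a given platform, the comparison can never be true, so the size has to be obtained and passed around as a 64-bit value for the flag to do its job.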

bool zipper_add_dir(zipFile zfile, const char *dirname)
{
    char   *temp;
    size_t  len;
    int     ret;

    if (zfile == NULL || dirname == NULL || *dirname == '\0')
        return false;

    /* Room for the name, an optional trailing '/', and the terminator.
     * calloc zero fills, so the copy is always terminated. */
    len  = strlen(dirname);
    temp = calloc(1, len+2);
    if (temp == NULL)
        return false;
    memcpy(temp, dirname, len);
    if (temp[len-1] != '/')
        temp[len] = '/';

    /* A directory is a zero length, uncompressed entry whose name ends in '/'. */
    ret = zipOpenNewFileInZip64(zfile, temp, NULL, NULL, 0, NULL, 0, NULL, 0, 0, 0);
    free(temp);
    if (ret != ZIP_OK)
        return false;
    ret = zipCloseFileInZip(zfile);
    return ret==ZIP_OK?true:false;
}

Directories get weird in zips because there isn’t a specific directory type. Instead a directory is signified by a 0 length file whose name ends with a ‘/’. I should point out that all paths in a zip use Unix style forward slashes and they cannot be absolute paths.

To deal with directories our helper will ensure the directory name we’re passing in ends with a ‘/’ by adding it if necessary. It will then write a zero length file for us so we don’t have to worry about that either.
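The helper only takes care of the trailing ‘/’. If directory names come from somewhere less controlled (a Windows path, say), the other two rules about forward slashes and relative paths could be handled in the same spot. Here is a rough sketch of what that might look like. zip_dir_entry_name is a hypothetical function, not something the wrapper provides, and it doesn’t try to deal with drive letters.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a directory entry name from a user supplied
 * path. Backslashes become '/', leading separators are dropped so the name
 * is relative, and exactly one trailing '/' is kept. Returns a malloc'd
 * string the caller frees, or NULL if nothing is left or on alloc failure. */
static char *zip_dir_entry_name(const char *dirname)
{
    char   *out;
    size_t  len;
    size_t  i;

    if (dirname == NULL)
        return NULL;

    /* Entry names are relative, so skip any leading separators. */
    while (*dirname == '/' || *dirname == '\\')
        dirname++;

    /* Drop trailing separators; a single '/' is added back below. */
    len = strlen(dirname);
    while (len > 0 && (dirname[len-1] == '/' || dirname[len-1] == '\\'))
        len--;
    if (len == 0)
        return NULL;

    out = malloc(len + 2);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++)
        out[i] = (dirname[i] == '\\') ? '/' : dirname[i];
    out[len]   = '/';
    out[len+1] = '\0';
    return out;
}
```

The result could then be handed to zipper_add_dir, which would find the trailing ‘/’ already in place.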

We don’t have to add a directory before adding files that live in it, so zipper_add_dir only needs to be called if a directory is empty. That said, you can call it for every directory you’ll have files in.

Reading

zipper_result_t zipper_read(unzFile zfile, zipper_read_cb_t cb, void *thunk)
{
    unsigned char tbuf[BUF_SIZE];
    int           red;
    int           ret;

    if (zfile == NULL || cb == NULL)
        return ZIPPER_RESULT_ERROR;

    ret = unzOpenCurrentFile(zfile);
    if (ret != UNZ_OK)
        return ZIPPER_RESULT_ERROR;

    while ((red = unzReadCurrentFile(zfile, tbuf, sizeof(tbuf))) > 0) {
        cb(tbuf, red, thunk);
    }

    if (red < 0) {
        unzCloseCurrentFile(zfile);
        return ZIPPER_RESULT_ERROR;
    }

    unzCloseCurrentFile(zfile);
    if (unzGoToNextFile(zfile) != UNZ_OK)
        return ZIPPER_RESULT_SUCCESS_EOF;
    return ZIPPER_RESULT_SUCCESS;
}

To make reading flexible we’ll use a central callback based read function which takes a user provided thunk. This way we don’t need wrappers for every conceivable place we could store the read data. This combines unzipping steps 2-5 for us.

static void zipper_read_buf_cb(const unsigned char *buf, size_t buflen, void *thunk)
{
    str_builder_t *sb = thunk;
    str_builder_add_str(sb, (const char *)buf, buflen);
}

zipper_result_t zipper_read_buf(unzFile zfile, unsigned char **buf, size_t *buflen)
{
    str_builder_t   *sb;
    zipper_result_t  ret;

    sb = str_builder_create();
    ret = zipper_read(zfile, zipper_read_buf_cb, sb);
    if (ret != ZIPPER_RESULT_ERROR)
        *buf = (unsigned char *)str_builder_dump(sb, buflen);
    str_builder_destroy(sb);
    return ret;
}

Reading into a buffer is quite common so we’ll provide a wrapper for that. Since we have a central callback based read we can leverage it for buffer reading. This also gives users an easy to understand example of how to write their own callbacks.

Back in the header we had the zipper_result_t enum which the read functions use. Since our read function automatically moves to the next file for us, we need to tell the caller whether or not we’ve just read the last file. We could have used an out parameter flag or something, but I find this easier to use in loops.

Info

MiniZip provides us with all kinds of info about the files in the zip but there are only a few that are really important 99% of the time. Now, all of the info from our helpers comes from the same MiniZip function but it’s pretty quick to call so having to call it (potentially) 3 times isn’t really a big deal.

char *zipper_filename(unzFile zfile, bool *isutf8)
{
    char            name[MAX_NAMELEN];
    unz_file_info64 finfo;
    int             ret;

    if (zfile == NULL)
        return NULL;

    ret = unzGetCurrentFileInfo64(zfile, &finfo, name, sizeof(name), NULL, 0, NULL, 0);
    if (ret != UNZ_OK)
        return NULL;
    if (isutf8 != NULL)
        *isutf8 = (finfo.flag & (1<<11))?true:false;
    return strdup(name);
}

zipper_filename has an isutf8 out parameter because historically zip only supported filenames in IBM Code Page 437 encoding. Later utf8 was added so we need to let the caller know how they should handle any decoding of the filename. The encoding is stored in bit 11 of the info flags and thankfully this reads that for us.
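For callers that want to do their own flag checks, here is a tiny sketch of the same bit test as a standalone function, plus a helper that names the encoding to decode from. Both function names and the ZIP_FLAG_UTF8 macro are made up for illustration, and the returned strings are just labels I picked, not something MiniZip defines.

```c
#include <stdbool.h>

/* Bit 11 of the general purpose bit flag marks the name as UTF-8. */
#define ZIP_FLAG_UTF8 (1UL << 11)

/* Hypothetical helper: flag is the value MiniZip reports in finfo.flag. */
static bool zip_flag_is_utf8(unsigned long flag)
{
    return (flag & ZIP_FLAG_UTF8) != 0;
}

/* Hypothetical helper: label of the encoding the filename bytes are in,
 * suitable for passing to whatever conversion routine the app uses. */
static const char *zip_name_encoding(unsigned long flag)
{
    return zip_flag_is_utf8(flag) ? "UTF-8" : "CP437";
}
```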

bool zipper_isdir(unzFile zfile)
{
    char            name[MAX_NAMELEN];
    unz_file_info64 finfo;
    size_t          len;
    int             ret;

    if (zfile == NULL)
        return false;

    ret = unzGetCurrentFileInfo64(zfile, &finfo, name, sizeof(name), NULL, 0, NULL, 0);
    if (ret != UNZ_OK)
        return false;

    len = strlen(name);
    if (finfo.uncompressed_size == 0 && len > 0 && name[len-1] == '/')
        return true;
    return false;
}

zipper_isdir is handy because of the weird way zip handles directories. If it’s a directory you’ll want to make a directory and not write a 0 byte file.

bool zipper_skip_file(unzFile zfile)
{
    if (unzGoToNextFile(zfile) != UNZ_OK)
        return false;
    return true;
}

uint64_t zipper_filesize(unzFile zfile)
{
    unz_file_info64 finfo;
    int             ret;

    if (zfile == NULL)
        return 0;

    ret = unzGetCurrentFileInfo64(zfile, &finfo, NULL, 0, NULL, 0, NULL, 0);
    if (ret != UNZ_OK)
        return 0;
    return finfo.uncompressed_size;
}

These two don’t need much explanation.

Example App

Now that we have our zipper wrapper we should look at using it. For this example we’ll also need some file helpers (rw_files.h), as well as a recursive make directory function (recurse_mkdir in the code below).

#ifdef _WIN32
#  include <Windows.h>
#else
#  include <sys/stat.h>
#endif
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "rw_files.h"
#include "zipper.h"

static const char *zipname = "test.zip";
static const char *f1_name = "Moneypenny.txt";
static const char *f2_name = "Bond.txt";
static const char *f3_name = "Up/M.txt";
static const char *f4_name = "Down/Q.txt";
static const char *f1_data = "secretary\n";
static const char *f2_data = "secret agent\n";
static const char *f3_data = "top guy\n";
static const char *f4_data = "bottom guy\n";
static const char *d1_name = "Up/";
static const char *d2_name = "Around/";
static const char *d3_name = "Bound/A/B";

static bool create_test_zip(void)
{
    zipFile zfile;

    zfile = zipOpen64(zipname, 0);
    if (zfile == NULL) {
        /* Nothing was opened, so there is nothing to close. */
        printf("Could not open %s for zipping\n", zipname);
        return false;
    }

    printf("adding dir: %s\n", d2_name);
    if (!zipper_add_dir(zfile, d2_name)) {
        printf("failed to write dir %s\n", d2_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding dir: %s\n", d3_name);
    if (!zipper_add_dir(zfile, d3_name)) {
        printf("failed to write dir %s\n", d3_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding file: %s\n", f1_name);
    if (!zipper_add_buf(zfile, f1_name, (const unsigned char *)f1_data, strlen(f1_data))) {
        printf("failed to write %s\n", f1_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding file: %s\n", f2_name);
    if (!zipper_add_buf(zfile, f2_name, (const unsigned char *)f2_data, strlen(f2_data))) {
        printf("failed to write %s\n", f2_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding file: %s\n", f3_name);
    if (!zipper_add_buf(zfile, f3_name, (const unsigned char *)f3_data, strlen(f3_data))) {
        printf("failed to write %s\n", f3_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding dir: %s\n", d1_name);
    if (!zipper_add_dir(zfile, d1_name)) {
        printf("failed to write dir %s\n", d1_name);
        zipClose(zfile, NULL); 
        return false;
    }

    printf("adding file: %s\n", f4_name);
    if (!zipper_add_buf(zfile, f4_name, (const unsigned char *)f4_data, strlen(f4_data))) {
        printf("failed to write %s\n", f4_name);
        zipClose(zfile, NULL); 
        return false;
    }

    zipClose(zfile, NULL); 
    return true;
}

static bool unzip_test_zip(void)
{
    unzFile          uzfile;
    char            *zfilename;
    unsigned char   *buf;
    size_t           buflen;
    zipper_result_t  zipper_ret;
    uint64_t         len;

    uzfile = unzOpen64(zipname);
    if (uzfile == NULL) {
        printf("Could not open %s for unzipping\n", zipname);
        return false;
    }

    do {
        zipper_ret = ZIPPER_RESULT_SUCCESS;
        zfilename  = zipper_filename(uzfile, NULL);
        if (zfilename == NULL)
            break;

        if (zipper_isdir(uzfile)) {
            printf("reading dir: %s\n", zfilename);
            recurse_mkdir(zfilename);
            free(zfilename);
            /* Stop if the directory was the last entry in the archive. */
            if (!zipper_skip_file(uzfile))
                zipper_ret = ZIPPER_RESULT_SUCCESS_EOF;
            continue;
        }

        len = zipper_filesize(uzfile);
        printf("reading file (%llu bytes): %s\n", (unsigned long long)len, zfilename);
        zipper_ret = zipper_read_buf(uzfile, &buf, &buflen);
        if (zipper_ret == ZIPPER_RESULT_ERROR) {
            free(zfilename);
            break;
        }

        recurse_mkdir(zfilename);
        write_file(zfilename, buf, buflen, false);
        free(buf);
        free(zfilename);
    } while (zipper_ret == ZIPPER_RESULT_SUCCESS);

    /* Close the archive on every path, including errors. */
    unzClose(uzfile);
    if (zipper_ret == ZIPPER_RESULT_ERROR) {
        printf("failed to read file\n");
        return false;
    }
    return true;
}

int main(int argc, char **argv)
{
    if (!create_test_zip())
        return 1;
    if (!unzip_test_zip())
        return 1;
    return 0;
}

This example app creates a zip file, then extracts everything from it.

Building

Assuming you have the string builder and the read/write file helpers in the same directory as the zipper code, you can use the following CMakeLists.txt to build the example.

cmake_minimum_required (VERSION 3.0)
project(zipper)

include(FindPkgConfig)
pkg_check_modules(MZIP minizip REQUIRED)

link_directories(
    ${MZIP_LIBDIR}
)

add_executable(${PROJECT_NAME}
    str_builder.c
    rw_files.c
    zipper.c
    main.c
)

target_include_directories(${PROJECT_NAME}
    PRIVATE ${CMAKE_CURRENT_BINARY_DIR}
            ${CMAKE_CURRENT_SOURCE_DIR}
            ${MZIP_INCLUDE_DIRS}
)

target_link_libraries(${PROJECT_NAME}
    PRIVATE ${MZIP_LIBRARIES}
)

Conclusion

Zip has the small problem of not being able to remove files, and MiniZip doesn’t provide any helpers for this. To remove a file you have to create a new zip, add the files you want to keep, save it, and delete the old zip. Not very elegant, but that’s how it is. We could write a function that does all this for us, but removing multiple files one at a time this way would be very intensive and slow. The alternative is to create a zip object that stores the operations that need to take place and runs them all when a save function is called.
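To give a feel for the deferred approach, here is a very rough sketch of the bookkeeping side only. It records which names should be removed and then makes a single pass over the existing entry names to decide what survives. Everything here (zip_plan_t, the function names, the fixed limit) is hypothetical, and the actual copying of kept entries into a new archive is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PLAN_MAX_REMOVALS 32

/* Hypothetical "zip object": just the queued removals for now. The name
 * pointers are borrowed, so they must outlive the plan. */
typedef struct {
    const char *remove[PLAN_MAX_REMOVALS];
    size_t      count;
} zip_plan_t;

/* Queue a removal. Nothing touches the archive until the plan is applied. */
static bool zip_plan_remove(zip_plan_t *plan, const char *name)
{
    if (plan->count == PLAN_MAX_REMOVALS)
        return false;
    plan->remove[plan->count++] = name;
    return true;
}

static bool zip_plan_keeps(const zip_plan_t *plan, const char *name)
{
    size_t i;

    for (i = 0; i < plan->count; i++) {
        if (strcmp(plan->remove[i], name) == 0)
            return false;
    }
    return true;
}

/* One pass over the existing entries. A real save function would copy each
 * kept entry into the new archive here instead of just listing it. Returns
 * how many entries were kept; kept must have room for count pointers. */
static size_t zip_plan_apply(const zip_plan_t *plan, const char *const *entries,
                             size_t count, const char **kept)
{
    size_t i;
    size_t n = 0;

    for (i = 0; i < count; i++) {
        if (zip_plan_keeps(plan, entries[i]))
            kept[n++] = entries[i];
    }
    return n;
}
```

The point is that any number of removals costs one rewrite of the archive instead of one rewrite per removed file.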

MiniZip does not natively support working with zips in memory and always needs them to be on disk. Interestingly, the zipper includes have the _WIN32 include for Windows-specific I/O. MiniZip internally uses a pluggable, callback-based I/O system which by default works with files. So we could add an in-memory layer.
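As a starting point for such a layer, here is a sketch of the read, seek, and tell logic over a memory buffer. This is deliberately not wired into MiniZip. The real callback types live in MiniZip’s ioapi.h and have their own signatures, so the mem_* functions below are stand-ins that only show the state an in-memory backend would need to track. I’m also assuming buffer sizes fit in a long.

```c
#include <stddef.h>
#include <stdio.h>  /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* Hypothetical read-only stream over a caller owned buffer. */
typedef struct {
    const unsigned char *data;
    size_t               size;
    size_t               pos;
} mem_stream_t;

/* Copy up to want bytes from the current position. Returns bytes copied,
 * which is short (possibly 0) at the end of the buffer. */
static size_t mem_read(mem_stream_t *s, void *buf, size_t want)
{
    size_t left = s->size - s->pos;

    if (want > left)
        want = left;
    memcpy(buf, s->data + s->pos, want);
    s->pos += want;
    return want;
}

/* fseek style positioning. Returns 0 on success, -1 if the target falls
 * outside the buffer or the origin is unknown. */
static int mem_seek(mem_stream_t *s, long offset, int origin)
{
    long base;
    long target;

    switch (origin) {
        case SEEK_SET: base = 0;             break;
        case SEEK_CUR: base = (long)s->pos;  break;
        case SEEK_END: base = (long)s->size; break;
        default:       return -1;
    }

    target = base + offset;
    if (target < 0 || (size_t)target > s->size)
        return -1;
    s->pos = (size_t)target;
    return 0;
}

static long mem_tell(const mem_stream_t *s)
{
    return (long)s->pos;
}
```

Seeking relative to the end matters for zip in particular because the central directory lives at the end of the archive.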

The test app we wrote has all the components of a zip and unzip app making it really easy to pull this into a real app. For fun we could also expand it out and make a full fledged app like unzip.