Introduction

A few years ago I switched from WordPress to Jekyll and I lost full text search. But then I remembered I’m hosting my blog on my own server! So I wrote a Full Text Search service to restore the only feature I lost in the transition.

This was working great until I moved from Jekyll to Hugo. The search Javascript needed a slight tweak, but otherwise search was still working. Mostly. I realized the date wasn’t coming through properly, and that prompted me to take a deeper look at how I wrote the search index generator. Which then prompted me to look at the application in its entirety.

As with most projects I start, I went a bit further than I originally intended. I completely rewrote the generator and expanded its functionality. I also updated the search service to allow specifying a different location for the search index via an environment variable. Finally, I created a Dockerfile to allow easier deployment and testing.

Directory Layout

I restructured the project layout because it keeps expanding and having everything in a single directory was getting unwieldy. This is the directory layout I’m now using for the application.

search_service
- docker
  - Dockerfile
  - Dockerfile.pyal
- js
  - search_box.js
  - search_results.js
- src
  - generate_search_index.py
  - search_service.py
- requirements.txt

Docker

I decided to containerize the service because I wanted to learn more about Docker. I had already updated the service to run locally as a standalone application, so containerizing wasn’t needed for testing. However, it’s making me rethink the server deployment.

That said, this makes deployment easier because I can bundle the gunicorn WSGI server within the image alongside the service. This reduces service configuration and maintenance. On my server I currently use uWSGI fronting the service. Moving that part into the container will reduce server setup and maintenance.

Dockerfile

FROM alpine:latest AS build
WORKDIR /app

COPY ./requirements.txt .

RUN apk add --no-cache git
RUN apk add --update --no-cache python3
RUN python3 -m ensurepip

RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 uninstall -y pip setuptools packaging

FROM alpine:latest AS release
EXPOSE 80
WORKDIR /app

RUN apk add --update --no-cache python3

COPY ./src/search_service.py .
COPY ./src/generate_search_index.py .

COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

VOLUME /data
ENV SEARCH_INDEX="/data/search_index"

CMD ["gunicorn", "--bind", "0.0.0.0:80", "search_service:app" ]

I’m using Alpine Linux as my base image because it’s super small. Also, it supports Python, and the service is a pure Python application. I don’t need any dependencies outside of Python and some Python libraries I can install with pip.

Anatomy

Creating the image is a two-step process because my goal was to create as small an image as possible.
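For reference, building the release image from the project root (so the COPY paths resolve) looks something like this. The search_service tag is just the name I’d pick; use whatever you like:

docker build -f docker/Dockerfile -t search_service .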

Step 1: Build

In the build step, FROM alpine:latest AS build, all dependencies are installed. git is installed because one of the packages defined in requirements.txt is pulled from GitHub.

The application dependencies are installed into a Python virtual environment at /opt/venv. After everything is installed, a few unnecessary things are uninstalled from the VENV. Things like pip and setuptools aren’t needed in the release image.

A VENV is used instead of installing the dependencies directly onto the system because Python packages get installed into /.../python<VERSION>/site-packages. It’s easier to install everything into a tidy, self-contained location and copy that instead of a system directory. It also ensures we’re only pulling over things the application will use and not anything that might have been installed as a build dependency, hence removing pip and friends from the VENV.

Step 2: Release

In the release step, FROM alpine:latest AS release, Python is installed because we’ll need it to run the service. No other packages need to be system installed, and the only things in the image are exactly what is needed. Build items, like git, are confined to the build stage to ensure as few unnecessary packages as possible are included in the image.

The VENV is copied from the build step into the release image and the environment path is updated to include the VENV. A few other things are defined, like the default location the service uses for the search index and the port it will listen on.

Finally, the command to run gunicorn and run the service is specified. gunicorn was installed via pip and is contained in the VENV.

Versioning

Note that nothing I’m doing is versioned. If Alpine, Python, or any of my requirements makes API changes, I’ll need to update the service. For my needs, this is fine. I’d want to update the service anyway if I’m making changes that necessitate building a new image.
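If I ever wanted to lock things down, it would only take pinning the base image tag and the requirements. A sketch (these version numbers are examples, not what I’m running):

FROM alpine:3.19 AS build

# requirements.txt
falcon==3.1.3
gunicorn==21.2.0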

Second (.pyal) Dockerfile

There are two Docker files which are slightly different. The .pyal file uses the python:3.11-alpine image which has Python already included. The file is slightly simpler than the default file and ensures a consistent Python version. It creates an image about 4 MB larger than the default file does. It’s mainly here as a leftover from when I was working through creating the Docker build. In the end it’s not really any different than the standard file and can be disregarded.

Dockerfile.pyal

FROM python:3.11-alpine AS build
WORKDIR /app

COPY ./requirements.txt .

RUN apk add --no-cache git

RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 uninstall -y pip setuptools packaging

FROM python:3.11-alpine AS release
EXPOSE 80
WORKDIR /app

COPY ./src/search_service.py .
COPY ./src/generate_search_index.py .

COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

VOLUME /data
ENV SEARCH_INDEX="/data/search_index"

CMD ["gunicorn", "--bind", "0.0.0.0:80", "search_service:app" ]

Size

You might think the container is a bad idea from a size standpoint because it has to bundle Python and all of the service’s dependencies. I was pleasantly surprised at how small the image ended up being: 61.74 MB. This sounds large until you realize the Python system package on Arch Linux has an install size of 75.2 MB.
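You can check for yourself after building (assuming the search_service tag from the build example):

docker image ls search_service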

Deployment

I have the service set to listen on port 80 within the container. We can bind the container’s port to any host port we choose, so using port 80 works fine.

The service will look for the search index in /data/search_index within the container. However, the environment variable SEARCH_INDEX can be set to override the default location of the search index. Again, the path is the path within the container.
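Putting that together, a run might look something like this (the host port, host path, and names are examples; mounting the host directory at /data means the default SEARCH_INDEX location just works):

docker run -d --name search_service -p 8080:80 -v /srv/search:/data search_service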

Javascript

I’ve described the two Javascript files previously and they’ve only had minor updates since. Those posts can be referenced for a deeper dive into these two files.

search_box.js

"use strict";

let submit = document.getElementById("searchInput");
submit.addEventListener("keydown", function(event) {
    if (event.key !== "Enter") {
        return;
    }
    // Redirect to the search page with the query as the "s" parameter.
    let params = new URLSearchParams();
    let query = document.getElementById("searchInput");
    params.set("s", query.value);
    window.location.href = "/search/?" + params;
    event.preventDefault();
});

The search box script listens for Enter to be pressed in the search entry and redirects back to the search page with the search query so the next script can run.

search_results.js

"use strict";

setupPage();

function setupPage() {
    let params = new URLSearchParams(window.location.search);
    let search = params.get("s");
    let page = Math.floor(Number(params.get("p")));
    // Number() yields NaN for junk, and NaN/Infinity never match ==, so use
    // Number.isFinite() and clamp anything unusable to the first page.
    if (!Number.isFinite(page) || page < 1) {
        page = 1;
    }

    if (search) {
        // Pull the search results from the server and display them
        runSearch(search, page);
    } else {
        populateNavNoPages();
    }
}

async function runSearch(search, page)
{
    if (runSearch.tries === undefined) {
        runSearch.tries = 0;
    }

    page = page || 1;

    let params = new URLSearchParams();
    params.set("s", search);
    params.set("p", page);

    // fetch the results
    let results = null;
    try {
        let response = await fetch("https://search.nachtimwald.com/?" + params);
        results = await response.json();
    } catch (e) {
        // Retry a few times in case of a hiccup.
        if (runSearch.tries >= 3) {
            populateSearchError();
            return;
        }

        runSearch.tries++;
        setTimeout(runSearch, 250, search, page);
        return;
    }

    if (!results || results.pages == 0) {
        populateNoResults();
        populateNavNoPages();
    } else {
        populateResults(search, results.hits);
        populateNav(search, results.page, results.pages);
    }
}

function populateSearchError()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.textContent = "Search Failed";
    let view = document.getElementById("searchResults");
    view.appendChild(entry);
}

function populateNoResults()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.textContent = "No results";
    let view = document.getElementById("searchResults");
    view.replaceChildren(entry);
}

function populateNavNoPages()
{
    let page_nav = document.getElementById("pages");
    page_nav.replaceChildren();
}

function populateResults(search, results)
{
    let view = document.getElementById("searchResults");
    for (let result of results) {
        let article = document.createElement("article");
        article.classList.add("post-entry");

        let header = document.createElement("header");
        header.classList.add("entry-header");
        let h2 = document.createElement("h2");
        h2.textContent = result.title + "\u00A0»";
        header.appendChild(h2);
        article.appendChild(header);

        let preview = document.createElement("div");
        preview.classList.add("entry-content");
        preview.textContent = `${result.excerpt}... `;
        article.appendChild(preview);

        if (result.post_date) {
            let footer = document.createElement("footer");
            footer.classList.add("entry-footer");
            footer.textContent = result.post_date;
            article.appendChild(footer);
        }

        let link = document.createElement("a");
        link.href = result.uri;
        link.ariaLabel = result.title;
        article.appendChild(link);

        view.appendChild(article);
    }
}

function populateNav(search, page, pages)
{
    let page_params = new URLSearchParams();
    page_params.set("s", search);

    let page_nav = document.getElementById("pages");
    page_nav.replaceChildren();

    if (page != 1) {
        let nav_link = document.createElement("a");
        nav_link.classList.add("prev");
        page_params.set("p", page-1);
        nav_link.href = "/search?" + page_params;
        nav_link.textContent = '«\u00A0Prev';
        page_nav.appendChild(nav_link);
    }
    if (page != pages) {
        let nav_link = document.createElement("a");
        nav_link.classList.add("next");
        page_params.set("p", page+1);
        nav_link.href = "/search?" + page_params;
        nav_link.textContent = 'Next\u00A0»';
        page_nav.appendChild(nav_link);
    }
}

When the search results script sees a query in the query parameters, it sends the search to the search service. Then it renders the returned results into the search page.

Python Dependencies

There are a few dependencies that need to be installed with pip. One of them is pulled from GitHub using git instead of from PyPI. pip still handles the install, but you need git available on the system. That said, git is only needed for the pip install step and not once the package is installed.

requirements.txt

# Generator
Markdown
Plain-Text-Markdown-Extention @ git+https://github.com/kostyachum/python-markdown-plain-text.git#egg=plain-text-markdown-extention
python-frontmatter

# Service
falcon
gunicorn

# Both
Whoosh

I’m using one file that combines the requirements for the generator and the service because when I create the Docker image I’m including both. That way you can use a container to generate the index. The generator dependencies are minimal so it’s not a big impact to have both in one image. Alternatively, I could have separate images for the generator and service, but that isn’t as manageable as putting them both in the same image.
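For working outside of Docker, a local VENV mirrors what the build stage does (remember git needs to be on your PATH for the GitHub requirement):

python3 -m venv venv
. venv/bin/activate
pip install --no-cache-dir -r requirements.txt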

Generator

High level, the generator reads posts and generates a search index. It always creates a search_index directory under the output directory. If the directory exists, it will be deleted first to ensure a clean index is generated.
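Running the generator locally looks something like this (the paths are examples):

python3 src/generate_search_index.py -p ~/blog/content/posts -o /tmp/site -v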

Original Design

The search index generator was pretty simple and read the HTML that Jekyll/Hugo generated. It would extract text from the HTML based on element classes. This works fine as long as the HTML structure never changes. I got really lucky that the search index generator worked after switching to Hugo. It just so happened the Jekyll and Hugo themes used the same classes for the title and content elements. Author and date were generated differently in the output HTML and were missing from the search results.

Reading the generated HTML has the benefit of only including content that will be deployed. No checks or validation needs to take place if a post should or shouldn’t be included because that already happened when generating the HTML.

While I got lucky and search didn’t completely break this time, the next time I change the look and feel of the blog I might not be so lucky. I needed to come up with a theme independent solution.

Updated Design

The updated design is quite a bit different. Instead of reading the generated HTML, it’s reading the markdown posts. This requires a bit more logic but it’s the right way to build the index.

Keep in mind we’re not just reading markdown files; posts are markdown files that have front matter that needs to be handled. Thankfully, there is a package on PyPI, python-frontmatter, that can read this for us.
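As a quick sketch of what python-frontmatter gives us (the file path here is hypothetical):

import frontmatter

# load() splits a post into its front matter metadata and markdown body.
post = frontmatter.load('posts/example.md')

print(post.get('title'))   # a front matter value, or None if missing
print(post.content[:80])   # the markdown body, front matter stripped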

Script

The new generation script is a bit more complex but also has a few extra features. Specifically, flags to include drafts and future posts, which I found useful for testing, as well as better logging that can be turned on with a verbose flag.

generate_search_index.py

#!/usr/bin/env python

import argparse
import frontmatter
import logging
import os
import re
import shutil
import sys
import time

from datetime import datetime
from markdown_plain_text.extention import convert_to_plain_text
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

INDEX_DIR_NAME = 'search_index'
EXCERPT_LENGTH = 300

def get_arguments():
    parser = argparse.ArgumentParser(description='Generate Whoosh search index from frontmatter formatted markdown file directory')
    parser.add_argument('-p', '--post_dir', help='posts directory', default='posts')
    parser.add_argument('-o', '--output_directory', help='"{}" will be created here'.format(INDEX_DIR_NAME), default='.')
    parser.add_argument('-D', '--drafts', help='Index drafts', action='store_true')
    parser.add_argument('-F', '--future', help='Index future posts', action='store_true')
    parser.add_argument('-v', '--verbose', help='verbose output', action='store_true')

    return parser.parse_args()

def get_directories(args):
    post_dir = os.path.abspath(args.post_dir)
    out_dir = os.path.abspath(args.output_directory)
    index_dir = os.path.join(out_dir, INDEX_DIR_NAME)

    return post_dir, index_dir

def create_index_dir(location):
    logging.info('Creating search index at: "{}"'.format(location))

    try:
        if os.path.exists(location):
            shutil.rmtree(location)
        os.mkdir(location)
    except OSError:
        logging.critical('Failed to create index "{}" directory'.format(location))
        return False

    return True

def search_writer(index_dir):
    schema = Schema(uri=ID(stored=True), title=TEXT(stored=True), content=TEXT, excerpt=TEXT(stored=True), post_date=TEXT(stored=True))
    ix = create_in(index_dir, schema)
    return ix.writer()

def validate_post(post_data, drafts, future):
    if not post_data['dt']:
        logging.warning('Post has no date: {}'.format(post_data['fname']))
        return False

    if post_data['ts'] <= 0:
        logging.warning('Post timestamp is invalid: {}'.format(post_data['fname']))
        return False

    if post_data['draft'] and not drafts:
        logging.info('Skipping draft {}'.format(post_data['fname']))
        return False

    if post_data['ts'] > time.time() and not future:
        logging.info('Skipping future post {}'.format(post_data['fname']))
        return False

    if not post_data['title']:
        logging.warning('Skipping post without title {}'.format(post_data['fname']))
        return False

    if not post_data['content']:
        logging.info('Skipping post without content {}'.format(post_data['fname']))
        return False

    return True

def post_uri(post_data):
    url = post_data['url']
    if url:
        return url

    logging.warning('Post has no url: {}. Falling back to slug'.format(post_data['fname']))

    slug = post_data['slug']
    if not slug:
        logging.warning('Post has no slug: {}. Falling back to title'.format(post_data['fname']))
        slug = re.sub('[^A-Za-z0-9*]', '-', post_data.get('title', '')).lower()

        if not slug:
            logging.warning('Failed to determine uri: {}'.format(post_data['fname']))
            return None

        logging.info('Calculated slug from title for {} as {}'.format(post_data['fname'], slug))

    uri = '/{}/{}/'.format(post_data['dt'].strftime('%Y/%m/%d'), slug)
    logging.info('Calculated uri for {} as {}'.format(post_data['fname'], uri))
    return uri

def populate_post_data(post, fname):
    post_data = {
        'fname': fname,
        'title': post.get('title'),
        'url': post.get('url'),
        'slug': post.get('slug'),
        'dt': post.get('date'),
        'ts': post.get('date', datetime.fromtimestamp(0)).timestamp(),
        'content': convert_to_plain_text(post.content).strip().replace('\n', ' '),
        'draft': post.get('draft', False)
    }

    return post_data

def index_posts(post_dir, index_writer, drafts, future):
    for dirpath, dirnames, filenames in os.walk(post_dir):
        for fname in filenames:
            _, ext = os.path.splitext(fname)
            if ext not in ['.md', '.markdown']:
                continue

            logging.info('Found file: "{}"'.format(fname))

            post = frontmatter.load(os.path.join(dirpath, fname))
            post_data = populate_post_data(post, fname)

            if not validate_post(post_data, drafts, future):
                continue

            uri = post_uri(post_data)
            if not uri:
                continue

            post_date = post_data['dt'].strftime("%B %d, %Y")

            logging.info('Adding post: {}'.format(uri))
            index_writer.add_document(uri=uri, title=post_data['title'], content=post_data['content'], excerpt=post_data['content'][:EXCERPT_LENGTH], post_date=post_date)

def main():
    args = get_arguments()
    logging.basicConfig(level=logging.INFO if args.verbose else logging.WARN, format='%(levelname)s - %(message)s')

    post_dir, index_dir = get_directories(args)

    if not create_index_dir(index_dir):
        sys.exit(1)

    writer = search_writer(index_dir)
    index_posts(post_dir, writer, args.drafts, args.future)
    writer.commit()

if __name__ == '__main__':
    main()

Anatomy

The first few functions are helpers that should be easy enough to understand.

  • def get_arguments():
  • def get_directories(args):
  • def create_index_dir(location):

def search_writer(index_dir): creates the Whoosh index with our schema and returns a writer used to add documents to it.

def validate_post(post_data, drafts, future): does two things. It verifies the post has all the required front matter data. A post we can’t use doesn’t cause a hard failure; it’s simply excluded. The function also determines if a post should be included based on things like the draft or future flags.

def post_uri(post_data): will return the front matter url element, which is an override for permalinks. It tells Hugo to use that instead of generating a permalink for the post. All of my posts have url set, but the function does fall back to generating the permalink using the format /:year/:month/:day/:slug/, which is what I have Hugo configured to use when url is missing. This isn’t really needed, but I have it anyway just in case.

def populate_post_data(post, fname): seems like it’s not necessary because the post object has pretty much all the information. However, some of the information needs to be manipulated and then used in multiple places. The returned post_data dict is for convenience.

For example, you’ll see convert_to_plain_text(post.content).strip().replace('\n', ' '), which gives us just the text of the post without any markdown formatting to mess up the indexing or the excerpt.
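As an illustration, here’s roughly what that line does with a made-up snippet:

from markdown_plain_text.extention import convert_to_plain_text

md = '# My Post\n\nSome *emphasized* text and a [link](https://example.com).\n'

# Strip the markdown syntax, then flatten newlines into spaces.
print(convert_to_plain_text(md).strip().replace('\n', ' '))
# Prints something like: My Post Some emphasized text and a link.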

def index_posts(post_dir, index_writer, drafts, future): is where the real magic happens. Not really, but it goes through all the posts and determines if they should be included. If they are, they get added to the index using the index_writer.

Finally, def main(): runs the script.

Running With Docker

The generate_search_index.py script is installed in /app within the image and can be run using Docker in two ways.

If the container is running:

docker exec -it <NAME_OF_CONTAINER> /opt/venv/bin/python /app/generate_search_index.py -p /posts -o /data

This assumes your posts are already mounted at /posts. Most likely you won’t have your posts on the same machine that’s running the container unless you’re testing.

If the container is not running:

This is the more likely scenario, where you have the site on a work/build machine and not on the production server. In that case you don’t need to create a long-lived container from the image; instead you can have Docker create and tear down a container for the run.

docker run --rm -it -v /<PATH_TO_POSTS>:/posts -v /<PATH_TO_OUTPUT>:/data <NAME_OF_IMAGE> /opt/venv/bin/python /app/generate_search_index.py -p /posts -o /data

The search_index directory will be created and populated in the data directory.

Service

The service hasn’t changed much. It’s still a WSGI service using the falcon framework. The updates to the service are switching to falcon.App(), which replaces falcon.API, and adding an if __name__ == '__main__': section to allow running the script directly. This makes testing the search index a bit easier because I don’t need a WSGI server to run the script when testing locally. That said, it’s easy enough to run the container for testing.

Additionally, I added the environment variable SEARCH_INDEX to determine where the search index is located. If not set, it defaults to the directory search_index in the same location as the service.
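That makes local testing as simple as (the index path here is an example):

SEARCH_INDEX=/tmp/site/search_index python3 src/search_service.py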

Script

search_service.py

#!/usr/bin/env python

import json
import falcon
import os
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

class SearchResource:

    def __init__(self):
        index_dir = os.getenv('SEARCH_INDEX', 'search_index')
        self._ix = open_dir(os.path.abspath(index_dir))

    def _do_search(self, query_str, page):
        ret = {}

        with self._ix.searcher() as searcher:
            qp = QueryParser('content', self._ix.schema)
            q = qp.parse(query_str)
            results = searcher.search_page(q, page, 15)

            ret['page'] = results.pagenum
            ret['pages'] = results.pagecount
            ret['hits'] = []

            for h in results:
                match = {
                    'uri': h['uri'],
                    'title': h['title'],
                    'post_date': h['post_date'],
                    'excerpt': h['excerpt']
                }
                ret['hits'].append(match)

        return ret


    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        # get_param_as_int rejects a non-numeric page with a 400 instead of
        # letting int() raise and turn the request into a 500.
        res = self._do_search(req.get_param('s', default=''), req.get_param_as_int('p', default=1))
        resp.text = json.dumps(res)

app = falcon.App()
searcher = SearchResource()
app.add_route('/', searcher)

if __name__ == '__main__':
    from wsgiref.simple_server import make_server

    with make_server('', 8000, app) as httpd:
        print('Serving on port 8000...')

        # Serve until process is killed
        httpd.serve_forever()

Anatomy

This is a much simpler script than the generator. It’s a REST service that opens the search index, plugs in the query, and returns the result as JSON.
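A quick way to poke at the service when it’s running locally (the query values are examples):

curl 'http://localhost:8000/?s=docker&p=1'

The response is a JSON object with page, pages, and a hits array where each hit carries uri, title, post_date, and excerpt.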

Conclusion

This project started as fixing author and date being left out of the search results and turned into a much larger project. Now my simple little search service feels like an actual application.

I could have done a minimal update to the generator, updating the xpath I was using to pull text from the correct elements, but I decided to do it right. The script is now more robust and will continue to work if I change the theme. This should save time in the future because I shouldn’t have to update it again for a long time.

Updating the generator made me realize I should be using a requirements.txt file so I went ahead and added one. With more files being added I decided I needed to restructure the layout.

Then it was time to put some attention on the service itself. Testing was proving difficult so I added the ability to run the service locally. That is nice but I’ve been learning about Docker recently and this looked like a good way to learn more.

I’m very happy I took the time to create a Dockerfile and make the application a fully deployable service that just works, complete with a real WSGI server (gunicorn) powering it. I’m also surprised by how small I was able to get the Docker image and how little work actually went into it. Well, not really a little; I spent a lot of time learning.

Now I just have to decide if I want to set up Docker on my server and deploy the image or if I want to stick with the current, more traditional setup I already have in place. Either way, this was a very fun project.