Introduction

When I switched from WordPress to Jekyll I gave up searching posts. I knew this was a limitation of running a static site and I didn’t think it would be too big a deal. After a year I’ve found I actually look up some old reference posts just enough that search to be useful. Trying to remember which tags given posts use became a bit of an annoyance. I really started to miss the ability to search.

Options

I don’t want to stop using Jekyll, and I have no intention to do so. However, I did need to find a way to add searching. Not just search but full text search of post content. There are really only a few options to choose from.

  • Client side search
  • Use an external search engine, such as a Google Custom Search
  • Use an external indexing and search service

Client Side

Client side search really isn’t an option. It require loading all data onto every page so the client browser can handle searching. For this to work, I’d have to load the content of every single post into each page. I have way to many posts for this to be feasible. This would be untenable, even on a small blow with few posts.

External search engine

Using something like Google Custom search world work but I’d rather not use it simply because it’s ad supported. I have no problem with Google (or others) offering a service that they monetize, but I already pay to host my blog so I don’t have to run someone else’s ads.

Indexing and Search Service

Which leads to the final option, using an indexing service. At first I was thinking about using Amazon’s Elasticsearch. However, to use Elasticsearch I’d have to.

  • Write something that would upload all post content to the service
  • Write front end Javascript to query and pull the results from the service

I didn’t want to do this much work, so I tried for preexisting Jekyll plugins that handle everything automatically. Sadly, most of the ones I found were out data. Also, they only handled uploading posts to the service. None handled the front end code because of the vast differences in look and feel.

The closest I found to a complete solution is algolia. However, their jekyll plugin is outdated and no longer maintained. Their front end instantsearch Javascript library works great, but takes a lot of work to integrate the look and feel.

The solution

Using a search service is the only solution that will work. At this point, I was resigned to write code, which isn’t a big deal. That said, the more I thought about it, why should I pay someone to host the index and handle the search for me. I have my own server, which runs this blog. I can run my own search service for free.

The tools

Since I like Python, it has a library for everything, and since it’s been awhile since I wrote anything in Python, I decided to write the service in Python. I probably should have though more about the solution before the language but Python ended up having everything I needed.

To handle searching I decided to use Whoosh, which is a very easy to use and very fast full text search engine. The search syntax is very complete and complex queries can be created.

Creating and searching an index is one thing but I also have to expose the service publicly. To this end I decided to use a REST API exposed using Falcon.

This combination turned out to be very easy to work with and I was amazed at how little code was necessary. The front end Javascript ended but being most of the code.

Search service

Indexing posts

The first part of this project that needs to be tackled is generating the Whoosh index. Since Whoosh is written in Python, I decided to use an external script instead of writing a Ruby Jekyll plugin.

The index generator is going to use the final HTML from the “_site” directory instead of parsing the Markdown posts. It was easier to parse the final HTML because things like Liquid tags and variables have already been handled. If this was a Jekyll plugin, then it would index the posts directly.

Each post has it’s own title, data, and content CSS class that are only used on posts. I used these to filter out non-post pages. This means, the generator is somewhat tied to the layout of my blog.

generate_search_index.py

#!/usr/bin/env python

import argparse
import os
import shutil
from bs4 import BeautifulSoup
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

INDEX_DIR_NAME = '_search_index'
EXERPT_LENGTH = 300

def main():
    parser = argparse.ArgumentParser(description='Generate Whoosh search index from Jekyll _site directory')
    parser.add_argument('-s', '--site_dir', help='Jekyll _site directory', default='_site')
    parser.add_argument('-o', '--output_directory', help='directory _search_index directory will be created in', default='.')

    args = parser.parse_args()

    site_dir = os.path.abspath(args.site_dir)
    out_dir = os.path.abspath(args.output_directory)
    index_dir = os.path.join(out_dir, INDEX_DIR_NAME)

    print('Creating index at: "%s"' % index_dir)

    schema = Schema(uri=ID(stored=True), title=TEXT(stored=True), content=TEXT, excerpt=TEXT(stored=True), post_date=TEXT(stored=True))
    try:
        if os.path.exists(index_dir):
            shutil.rmtree(index_dir)
        os.mkdir(index_dir)
    except:
        print('Failed to create %s directory' % index_dir)
        return

    ix = create_in(index_dir, schema)
    writer = ix.writer()

    for dirpath, dirnames, filenames in os.walk(site_dir):
        for fname in filenames:
            _, ext = os.path.splitext(fname)
            if ext not in ['.html']:
                continue

            if dirpath.endswith('/amp'):
                continue

            uri = '%s/' % dirpath[len(site_dir):]
            data = ''
            title = ''
            post_date = ''
            content = ''
            with open(os.path.join(dirpath, fname), 'rb') as f:
                data = f.read().decode('utf-8')

            tree = BeautifulSoup(data, 'lxml')
            if not tree.find(class_='post-title') or not tree.find(class_='post-content'):
                continue

            print('Adding: %s' % uri)

            node = tree.find(class_='post-title')
            title = node.text.strip()

            node = tree.find(class_='entry-date')
            if node:
                post_date = node.text.strip()

            node = tree.find(class_='post-content')
            content = node.text.strip().replace('\n', ' ');

            writer.add_document(uri=uri, title=title, content=content, excerpt=content[:EXERPT_LENGTH], post_date=post_date)

    writer.commit()

if __name__ == '__main__':
    main()

In addition to Whoosh the generator uses Beautiful Soup.

Schema

The search index uses a schema to know what data to index and how it needs to be used. The index uses these entries

  • uri - The relative path to the post. Used to link to the post
  • title - The title of the post
  • content - The post content
  • excerpt - Short excerpt preview of the post
  • post_date - When the post was posted

Everything except the content is stored. The content only needs to be searchable and we’ll never pull the data back out of the index.

The uri is the path of the post, pre html file, below the “_site” directory. This works because Jekyll’s “_site” directory is the file site and its exact layout.

Creating the index

The code always creates and puts the index in a “_search_index” directory.

The main part of the code is the os.walk. It goes through the “_site” directory and parses all HTML. If it contains a post, it pulls out the data needed for the index and loads it as a record.

Finally, the index is committed and ready for use.

The generator will stay on my local machine with blog source. The index generated will be uploaded to my server just like the “_site” contents.

Search web service

The service is a Whoosh search exposed using a REST API provided by Falcon. This is what uses the index we generate to provide search results. The piece will be running on the server.

search_service.py

#!/usr/bin/env python

import json
import falcon
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

class SearchResource(object):

    def __init__(self):
        self._ix = open_dir('search_index')

    def _do_search(self, query_str, page):
        ret = {}

        with self._ix.searcher() as searcher:
            qp = QueryParser('content', self._ix.schema)
            q = qp.parse(query_str)
            results = searcher.search_page(q, page, 15)

            ret['page'] = results.pagenum
            ret['pages'] = results.pagecount
            ret['hits'] = []

            for h in results:
                match = {
                    'uri': h['uri'],
                    'title': h['title'],
                    'post_date': h['post_date'],
                    'excerpt': h['excerpt']
                }
                ret['hits'].append(match)

        return ret


    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        res = self._do_search(req.get_param('s', default=''), int(req.get_param('p', default=1)))
        resp.body = json.dumps(res)

app = falcon.API()
searcher = SearchResource()
app.add_route('/', searcher)

Falcon made wrapping Whoosh in a web service very easy. The service uses parameters s for the search query and p for the page number.

The search query is passed into Whoosh, which pulls out the post data and passes it into a JSON object for the front end to consume.

Being able to query by search page is intentional. Whoosh allows for paginated searching, which we want to use. It allows access to all results but still keeps the front end responsive because only a small number of posts are returned.

15 results per page was chosen because that’s the number of posts that are on the front, paginated pages.

It’s very important that we opened the search index during the SearchResource class initialization. Falcon creates one object searcher from the class and uses it for every on_get. This should make searching faster, because as long as the search service is persistently running, the index only needs to be loaded when it starts instead of on every request.

Front end

The front ends has two parts, the search box and results page. I have the search box on every page above the right metadata sidebar.

HTML

Before we can search we need the UI to be able to accept the query and display the results.

The search box is a single text entry within a form.

meta.html

<div class="search-box">
    <form name="search">
        <input type="text" id="search-query" placeholder="Search..."></input>
    </form>
</div>
...

This little bit of code was inserted at the top of my meta template. Since this is on every page, the script to load the associated Javascript is loaded in the template where the other global Javascript is loaded.

Search results

The search results needs its own, new, dedicated page.

search.html

<h3 class="search-title" id="search-title">
    Search Results
</h3>

<div id="results">
</div>

<nav aria-label="Page navigation">
    <ul class="pagination justify-content-center" id="nav_pages">
    </ul>
</nav>

<script src="{{ site.baseurl }}/assets/js/search_results.js" async></script>

The results are going to be added dynamically to the page based on the response from the search service. With the results and page components in place, the search Javascript will do the heavy lifting.

Javascript

Since this is a static site, posting to the web server isn’t going to work. Instead everything is handled by Javascript in the web browser.

All of the Javascript is intended to be loaded after the content. The code will hook into events for elements on the page. We could have it load first and use an page loaded event, but it’s easier to put the script tags at the end.

The search box has a little bit of Javascript which overrides the POST behavior when the query is submitted.

Search box

The Javascript waits for enter to be pressed in the search box. When that happens it redirects the browser to the search results page setting up the query arguments with the search query.

search_box.js

"use strict";

let submit = document.getElementById("search-query");
submit.addEventListener("keydown", function(even) {
    if (event.key !== "Enter") {
        return;
    }
    let params = new URLSearchParams();
    let query = document.getElementById("search-query");
    params.set("s", query.value);
    window.location.href = "/search?" + params;
    event.preventDefault();
});

Using a separate search box with a redirect mimics the look and feel of how a search via POSTing to the web server work work.

Search results

In a nutshell, the search results are pulled from the server via an async request and populated into the DOM. There may be a small delay before results are populated but in my testing the server is fast enough, and the amount of work the web browser has to do is so minimal, it populates without any visible delay. It looks and feels just like a traditional server POST search.

search_results.js

"use strict";

setupPage();

function setupPage() {
    let params = new URLSearchParams(window.location.search);
    let search = params.get("s");
    let page = Number(params.get("p"));
    if (page == 0 || page == NaN || page == Infinity) {
        page = 1;
    }
    page = Math.floor(page);

    if (search) {
        // Put the query into the title so we know what was searched
        let title = document.getElementById("search-title");
        title.innerText = "Search Results for: ";
        let query_span = document.createElement("span");
        query_span.classList.add("text-muted");
        query_span.classList.add("search-query-text");
        query_span.innerText = search;
        title.appendChild(query_span);

        // Pull the search results from the server and display them
        runSearch(search, page);
    } else {
        populateNavNoPages();
    }
}

async function runSearch(search, page)
{
    if (runSearch.tries === undefined) {
        runSearch.tries = 0;
    }

    page = page || 1;

    let params = new URLSearchParams();
    params.set("s", search);
    params.set("p", page);

    // fetch the results
    let results = null;
    try {
        let response = await fetch("https://search.nachtimwald.com/?" + params);
        results = await response.json();
    } catch (e) {
        // Retry a few times in case of a hiccup.
        if (runSearch.tries >= 3) {
            populateSearchError();
            return;
        }

        runSearch.tries++;
        setTimeout(runSearch, 250, search, page);
        return;
    }

    if (!results || results.pages == 0) {
        populateNoResults();
        populateNavNoPages()
    } else {
        populateResults(search, results.hits);
        populateNav(search, results.page, results.pages)
    }
}

function populateSearchError()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.innerText = "Search Failed";
    let view = document.getElementById("results");
    view.appendChild(entry);
}

function populateNoResults()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.innerText = "No results";
    let view = document.getElementById("results");
    view.appendChild(entry);
}

function generate_nav_item(text, href, active, disabled) {
    let nav_link = document.createElement("a");
    nav_link.classList.add("page-link");
    nav_link.href = href;
    nav_link.innerText = text;
    let nav_item = document.createElement("li");
    nav_item.classList.add("page-item");
    if (active) {
        nav_item.classList.add("active");
        nav_link.href = "#";
    }
    if (disabled) {
        nav_item.classList.add("disabled");
    }
    nav_item.appendChild(nav_link);
    return nav_item;
}

function populateNavNoPages()
{
    let page_nav = document.getElementById("nav_pages");
    page_nav.appendChild(generate_nav_item("First", "#", false, true));
    page_nav.appendChild(generate_nav_item("1", "1", true, false));
    page_nav.appendChild(generate_nav_item("Last", "#", false, true));
}

function populateResults(search, results)
{
    let view = document.getElementById("results");
    for (let result of results) {
        let article = document.createElement("article");
        article.classList.add("post-preview-box");
        let h1 = document.createElement("h1");
        h1.classList.add("post-title");
        h1.classList.add("title-text");
        let link = document.createElement("a");
        link.href = result.uri;
        link.innerText = result.title
        h1.appendChild(link);
        article.appendChild(h1);

        if (result.post_date) {
            let date_span = document.createElement("span");
            date_span.classList.add("post-meta");
            date_span.classList.add("entry-date");
            date_span.classList.add("text-muted");
            date_span.innerText = result.post_date;

            article.appendChild(date_span);
        }

        article.appendChild(document.createElement("br"));
        article.appendChild(document.createElement("br"));

        let preview = document.createElement("div");
        preview.classList.add("post-preview-content");
        preview.innerText = `${result.excerpt}... `;

        let continue_link = document.createElement("a");
        continue_link.classList.add("readMoreLink");
        continue_link.href = result.uri
        continue_link.innerText = "Continue reading";
        preview.appendChild(continue_link);

        article.appendChild(preview);
        view.appendChild(article);
    }
}

function populateNav(search, page, pages)
{
    const PAGES_PER_SIDE = 4;

    // Put the serach into a url param so we can use
    // it in links
    let page_params = new URLSearchParams();
    page_params.set("s", search);
    page_params.set("p", 1);

    let page_nav = document.getElementById("nav_pages");

    // Add the first link
    page_nav.appendChild(generate_nav_item("First", "/search?" + page_params, false, page == 1));

    // Determine the range, start and end pages.
    let rmin = Math.max(1, page-PAGES_PER_SIDE);
    let rmax = Math.min(page+PAGES_PER_SIDE, pages);
    let lshort = PAGES_PER_SIDE-(page-rmin);
    let rshort = PAGES_PER_SIDE-(rmax-page);
    rmin -= rshort;
    rmin = Math.max(1, rmin);
    rmax += lshort;
    rmax = Math.min(rmax, pages);

    for (let i = rmin; i <= rmax; i++) {
        page_params.set("p", i);
        page_nav.appendChild(generate_nav_item(i, "/search?" + page_params, i == page, false));
    }

    page_params.set("p", pages);
    page_nav.appendChild(generate_nav_item("Last", "/search?" + page_params, false, page == pages));
}

This is fairly site specific since it uses DOM elements and classes specific to my layout. This isn’t a generic drop in search library but the concept will work elsewhere.

The vast majority of the code creates element from the search results and adds the various classes that are applied to the elements. Since I use Bootstrap it also uses the Bootstrap pagination UI component. This will show a max of 9 items total and have 4 on each side of the current page. Assuming there are 4 or pages on either side.

Server

On the server we need to install a few packages.

$ sudo pacman -S uwsgi uwsgi-plugin-python python-whoosh python-falcon

To expose the search service we’re going to run it using uWSGI and proxy it to the world through nginx.

Service install

The service is going to be put in “/srv/wsgi/search/” The search_service.py script and the “search_index” index directory are both there. They should be set to read only because there is no reason for anything to be written to disk. This is a GET only service; It doesn’t make any modification to the index so we should be safe and ensure that can never happen.

uWSGI

Once uWSGI is installed and the service application in the right place, we need to setup uWSGI to run the service.

/etc/uwsgi/search.ini

Drop in this file which defines the wsgi service that will run the search service.

[uwsgi]
vacuum = true
enable-threads = true
thunder-lock = true
threads = 4
processes = 2
plugins = python
uid = http
gid = http
socket = /run/uwsgi/%n.sock
master = true
chdir = /srv/wsgi/%n
callable = app
wsgi-file = search_service.py

The service is setup to run using a socket and not on a network port. This makes it easier to keep the service internal and proxied though nginx.

Systemd

Now that we have uWSGI configured with the search service, we need the system (systemd) to start and run it.

/etc/systemd/system/uwsgi-app@search.service

[Unit]
Description=%i uWSGI app
After=syslog.target

[Service]
ExecStart=/usr/bin/uwsgi \
        --ini /etc/uwsgi/%i.ini \
        --socket /run/uwsgi/%i.socket
Restart=on-failure
KillSignal=SIGQUIT
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -INT $MAINPID
Type=notify
SuccessExitStatus=15 17 29 30
StandardError=syslog
NotifyAccess=all
PrivateDevices=yes
PrivateTmp=yes
ProtectSystem=full
ProtectHome=yes
NoNewPrivileges=yes

[Install]
WantedBy=multi-user.target

We need a service file which will start uWSGI and tell it to run the search service.

This will keep a persistent uWSGI process running for the search service. This is a good thing because the search index won’t need to be loaded on every request.

/etc/systemd/system/uwsgi-app@search.socket

[Unit]
Description=Socket for uWSGI app %i

[Socket]
ListenStream=/run/uwsgi/%i.socket
SocketGroup=http
SocketMode=0600

[Install]
WantedBy=sockets.target

Next we need to define the socket so systemd can manage that too.

nginx

So far we have the service, we have uWSGI running it, and we have systemd starting it. All that’s left is for nginx to proxy requests.

/etc/nginx/sites-enabled/nachtimwald.com

...
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name search.nachtimwald.com;

    error_log  /var/log/nginx/search.nachtimwald.com.error.log;
    access_log /var/log/nginx/search.nachtimwald.com.access.log;

    include conf.d/ssl_nachtimwald.conf;

    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    add_header Access-Control-Allow-Origin "https://nachtimwald.com";

    location / {
        include uwsgi_params;
        uwsgi_pass unix:///run/uwsgi/search.sock;
    }
}
...

This section was added to my site’s nginx configuration file. It tells nginx to take any request from search.nachtimwald.com and pass it onto the search service socket.

The most important piece is the Access-Control-Allow-Origin entry. It allows for Cross-Origin Resource Sharing (CORS) between the main site and the search service sub domain. Without this the browser will throw an error instead of loading the search results.

Conclusion

Search has returned! This project has many interconnected components but was surprisingly easy to implement. One of the major components I lost when moving to Jekyll is back. If I had known it would have been this easy I would have added searching sooner.