Introduction
A few years ago I switched from WordPress to Jekyll and I lost full text search. But then I remembered I’m hosting my blog on my own server! So I wrote a Full Text Search service to restore the only feature I lost in the transition.
This was working great until I moved from Jekyll to Hugo. The search Javascript needed a slight tweak, but otherwise search was still working. Mostly. I realized the date wasn’t coming through properly, and that prompted me to take a deeper look at how I wrote the search index generator. Which then prompted me to look at the application in its entirety.
As with most projects I start, I went a bit further than I originally intended. I completely rewrote the generator and expanded its functionality. I also updated the search service to allow specifying a different location for the search index via an environment variable. Finally, I created a Dockerfile to allow easier deployment and testing.
Directory Layout
I restructured the project layout because it keeps expanding and having everything in a single directory was getting unwieldy. This is the directory layout I’m now using for the application.
search_service
- docker
- Dockerfile
- Dockerfile.pyal
- js
- search_box.js
- search_results.js
- src
- generate_search_index.py
- search_service.py
- requirements.txt
Docker
I decided to containerize the service because I wanted to learn more about Docker. I had already updated the service to run locally as a standalone application, so containerizing wasn’t needed for testing. However, it’s making me rethink the server deployment.
That said, this makes deployment easier because I can bundle the gunicorn WSGI server within the image alongside the service. This reduces service configuration and maintenance. On my server I currently use uWSGI fronting the service. Containing that part within the container will reduce server setup and maintenance.
Dockerfile
FROM alpine:latest AS build
WORKDIR /app
COPY ./requirements.txt .
RUN apk add --no-cache git
RUN apk add --update --no-cache python3
RUN python3 -m ensurepip
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 uninstall -y pip setuptools packaging
FROM alpine:latest AS release
EXPOSE 80
WORKDIR /app
RUN apk add --update --no-cache python3
COPY ./src/search_service.py .
COPY ./src/generate_search_index.py .
COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
VOLUME /data
ENV SEARCH_INDEX="/data/search_index"
CMD ["gunicorn", "--bind", "0.0.0.0:80", "search_service:app" ]
I’m using Alpine Linux as my base image because it’s super small. Also, it supports Python, and the service is a pure Python application. I don’t need any dependencies outside of Python and some Python libraries I can install with pip.
Anatomy
Creating the image is a two step process because my goal was to create as small an image as possible.
Step 1: Build
In the build step, FROM alpine:latest AS build, all dependencies are installed. git is installed because one of the packages defined in requirements.txt is pulled from GitHub.
The application dependencies are installed into a Python virtual environment at /opt/venv. After everything is installed, a few unnecessary things are uninstalled from the VENV. Things like pip and setuptools aren’t needed in the release image.
A VENV is used instead of installing the dependencies directly onto the system because Python packages get installed into /.../python<VERSION>/site-packages. It’s easier to install everything into a tidy, self-contained location and copy that instead of a system directory. It also ensures we’re only pulling over things the application will use and not anything that might have been installed as a build dependency, like pip and friends, which are removed from the VENV.
Step 2: Release
In the release step, FROM alpine:latest AS release, Python is installed because we’ll need it to run the service. No other packages need to be installed at the system level, and the only things in the image are exactly what is needed. Build items, like git, are confined to the build stage to ensure as few unnecessary packages as possible end up in the image.
The VENV is copied from the build step into the release image and the environment path is updated to include the VENV. A few other things are defined as well, like the default location the service uses for the search index and the port it will listen on.
Finally, the command to run gunicorn and start the service is specified. gunicorn was installed via pip and is contained in the VENV.
Versioning
Note that nothing I’m doing is versioned. If Alpine, Python, or any of my requirements make API changes, I’ll need to update the service. For my needs, this is fine. I’d want to update the service anyway if I’m making changes that necessitate building a new image.
Second (.pyal) Dockerfile
There are two Dockerfiles which are slightly different. The .pyal file uses the python:3.11-alpine image, which already includes Python. That file is slightly simpler than the default file and ensures a consistent Python version. It creates an image about 4 MB larger than the default file. It’s mainly here as an exercise from when I was working through creating the Docker build. In the end it’s not really any different from the standard file and can be disregarded.
Dockerfile.pyal
FROM python:3.11-alpine AS build
WORKDIR /app
COPY ./requirements.txt .
RUN apk add --no-cache git
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip3 install --no-cache-dir -r requirements.txt
RUN pip3 uninstall -y pip setuptools packaging
FROM python:3.11-alpine AS release
EXPOSE 80
WORKDIR /app
COPY ./src/search_service.py .
COPY ./src/generate_search_index.py .
COPY --from=build /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
VOLUME /data
ENV SEARCH_INDEX="/data/search_index"
CMD ["gunicorn", "--bind", "0.0.0.0:80", "search_service:app" ]
Size
You might think the container is a bad idea from a size standpoint because it has to bundle Python and all of the service’s dependencies. I was pleasantly surprised at how small the image ended up being at 61.74 MB. This sounds large until you realize the Python system package on Arch Linux has an install size of 75.2 MB.
Deployment
I have the service set to listen on port 80 within the container. We can bind the container’s port to any system port we choose so using port 80 works fine.
The service will look for the search index in /data/search_index within the container. However, the environment variable SEARCH_INDEX can be set to override the default location of the search index. Again, the path is the path within the container.
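For example, running the image on a server might look something like this. The image and container names here are hypothetical, and the host port and paths are whatever fits your setup:

```shell
# Map host port 8080 to the container's port 80 and mount the directory
# holding search_index at /data, the default location.
docker run -d --name search -p 8080:80 -v /srv/search:/data search_service

# Or override the index location inside the container with SEARCH_INDEX.
docker run -d --name search -p 8080:80 \
    -v /srv/search:/idx -e SEARCH_INDEX=/idx/search_index search_service
```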
Javascript
I’ve described the two Javascript files previously, and they’ve only received minor updates here. Those posts can be referenced for a deeper dive into these two files.
search_box.js
"use strict";

let submit = document.getElementById("searchInput");
submit.addEventListener("keydown", function(event) {
    if (event.key !== "Enter") {
        return;
    }

    let params = new URLSearchParams();
    let query = document.getElementById("searchInput");
    params.set("s", query.value);
    window.location.href = "/search/?" + params;
    event.preventDefault();
});
The search box script listens for Enter to be pressed in the search entry and redirects back to the search page with the search query so the next script can run.
search_results.js
"use strict";

setupPage();

function setupPage() {
    let params = new URLSearchParams(window.location.search);
    let search = params.get("s");
    let page = Number(params.get("p"));

    if (!Number.isFinite(page) || page < 1) {
        page = 1;
    }
    page = Math.floor(page);

    if (search) {
        // Pull the search results from the server and display them
        runSearch(search, page);
    } else {
        populateNavNoPages();
    }
}

async function runSearch(search, page)
{
    if (runSearch.tries === undefined) {
        runSearch.tries = 0;
    }
    page = page || 1;

    let params = new URLSearchParams();
    params.set("s", search);
    params.set("p", page);

    // fetch the results
    let results = null;
    try {
        let response = await fetch("https://search.nachtimwald.com/?" + params);
        results = await response.json();
    } catch (e) {
        // Retry a few times in case of a hiccup.
        if (runSearch.tries >= 3) {
            populateSearchError();
            return;
        }
        runSearch.tries++;
        setTimeout(runSearch, 250, search, page);
        return;
    }

    if (!results || results.pages == 0) {
        populateNoResults();
        populateNavNoPages();
    } else {
        populateResults(search, results.hits);
        populateNav(search, results.page, results.pages);
    }
}

function populateSearchError()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.textContent = "Search Failed";

    let view = document.getElementById("searchResults");
    view.appendChild(entry);
}

function populateNoResults()
{
    // Set the no text
    let entry = document.createElement("div");
    entry.textContent = "No results";

    let view = document.getElementById("searchResults");
    view.replaceChildren(entry);
}

function populateNavNoPages()
{
    let page_nav = document.getElementById("pages");
    page_nav.replaceChildren();
}

function populateResults(search, results)
{
    let view = document.getElementById("searchResults");

    for (let result of results) {
        let article = document.createElement("article");
        article.classList.add("post-entry");

        let header = document.createElement("header");
        header.classList.add("entry-header");
        let h2 = document.createElement("h2");
        h2.textContent = result.title + "\u00A0»";
        header.appendChild(h2);
        article.appendChild(header);

        let preview = document.createElement("div");
        preview.classList.add("entry-content");
        preview.textContent = `${result.excerpt}... `;
        article.appendChild(preview);

        if (result.post_date) {
            let footer = document.createElement("footer");
            footer.classList.add("entry-footer");
            footer.textContent = result.post_date;
            article.appendChild(footer);
        }

        let link = document.createElement("a");
        link.href = result.uri;
        link.ariaLabel = result.title;
        article.appendChild(link);

        view.appendChild(article);
    }
}

function populateNav(search, page, pages)
{
    let page_params = new URLSearchParams();
    page_params.set("s", search);

    let page_nav = document.getElementById("pages");
    page_nav.replaceChildren();

    if (page != 1) {
        let nav_link = document.createElement("a");
        nav_link.classList.add("prev");
        page_params.set("p", page-1);
        nav_link.href = "/search?" + page_params;
        nav_link.textContent = '«\u00A0Prev';
        page_nav.appendChild(nav_link);
    }

    if (page != pages) {
        let nav_link = document.createElement("a");
        nav_link.classList.add("next");
        page_params.set("p", page+1);
        nav_link.href = "/search?" + page_params;
        nav_link.textContent = 'Next\u00A0»';
        page_nav.appendChild(nav_link);
    }
}
When the search results script sees a query in the query parameters, it sends the search to the search service. Then it populates the returned JSON into the search page.
Python Dependencies
There are a few dependencies that need to be installed with pip. One of them is pulled from GitHub using git instead of PyPI. pip is still used to do this, but you need git installed on the system. That said, git is only needed for the pip install step and isn’t needed once the package is installed.
requirements.txt
# Generator
Markdown
Plain-Text-Markdown-Extention @ git+https://github.com/kostyachum/python-markdown-plain-text.git#egg=plain-text-markdown-extention
python-frontmatter
# Service
falcon
gunicorn
# Both
Whoosh
I’m using one file that combines the requirements for the generator and service because when I create the Docker image I’m including both. That way you can use a container to generate the index. The generator dependencies are minimal so it’s not a big impact to have both in one image. Alternatively I could have separate images for the generator and service but that isn’t as manageable as putting them both in the same image.
Generator
High level, the generator reads posts and generates a search index. It always creates a search_index directory under the output directory. If the directory exists, it will be deleted first to ensure a clean index is generated.
Original Design
The search index generator was pretty simple and read the HTML that Jekyll/Hugo generated. It would extract text from the HTML based on element classes. This works fine as long as the HTML structure never changes. I got really lucky that the search index generator still worked after switching to Hugo. It just so happened the Jekyll and Hugo themes used the same classes for the title and content elements. Author and date were generated differently in the output HTML and were missing from the search results.
Reading the generated HTML has the benefit of only including content that will be deployed. No checks or validation need to take place to decide if a post should or shouldn’t be included, because that already happened when generating the HTML.
While I got lucky and search didn’t completely break this time, the next time I change the look and feel of the blog I might not be so lucky. I needed to come up with a theme-independent solution.
Updated Design
The updated design is quite a bit different. Instead of reading the generated HTML, it’s reading the markdown posts. This requires a bit more logic but it’s the right way to build the index.
Keep in mind we’re not just reading markdown files; posts are markdown files that have front matter that needs to be handled. Thankfully, there is a package on pip that can read this for us.
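To illustrate the idea, here is a simplified, stdlib-only sketch of splitting front matter from the markdown body. This is just for illustration; real front matter is full YAML, which is why the generator uses the python-frontmatter package instead of anything like this:

```python
import re

def parse_front_matter(text):
    """Split a post into (metadata dict, markdown body).

    Simplified sketch: only handles flat 'key: value' pairs, unlike
    python-frontmatter, which parses the YAML properly.
    """
    match = re.match(r'^---\s*\n(.*?)\n---\s*\n(.*)$', text, re.DOTALL)
    if not match:
        return {}, text
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(':')
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta, match.group(2)

post = """---
title: My Post
date: 2023-05-01
---
Some *markdown* content.
"""
meta, body = parse_front_matter(post)
# meta['title'] is 'My Post'; body is the markdown below the second ---
```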
Script
The new generation script is a bit more complex but also has a few extra features. Specifically, flags to include drafts and future posts, which I found useful for testing, as well as better logging that can be turned on with a verbose flag.
generate_search_index.py
#!/usr/bin/env python

import argparse
import frontmatter
import logging
import os
import re
import shutil
import sys
import time

from datetime import datetime
from markdown_plain_text.extention import convert_to_plain_text
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

INDEX_DIR_NAME = 'search_index'
EXERPT_LENGTH = 300

def get_arguments():
    parser = argparse.ArgumentParser(description='Generate Whoosh search index from frontmatter formatted markdown file directory')
    parser.add_argument('-p', '--post_dir', help='posts directory', default='posts')
    parser.add_argument('-o', '--output_directory', help='"{}" will be created here'.format(INDEX_DIR_NAME), default='.')
    parser.add_argument('-D', '--drafts', help='Index drafts', action='store_true')
    parser.add_argument('-F', '--future', help='Index future posts', action='store_true')
    parser.add_argument('-v', '--verbose', help='verbose output', action='store_true')
    return parser.parse_args()

def get_directoires(args):
    post_dir = os.path.abspath(args.post_dir)
    out_dir = os.path.abspath(args.output_directory)
    index_dir = os.path.join(out_dir, INDEX_DIR_NAME)
    return post_dir, index_dir

def create_index_dir(location):
    logging.info('Creating search index at: "{}"'.format(location))
    try:
        if os.path.exists(location):
            shutil.rmtree(location)
        os.mkdir(location)
    except OSError:
        logging.critical('Failed to create index "{}" directory'.format(location))
        return False
    return True

def search_writer(index_dir):
    schema = Schema(uri=ID(stored=True), title=TEXT(stored=True), content=TEXT, excerpt=TEXT(stored=True), post_date=TEXT(stored=True))
    ix = create_in(index_dir, schema)
    return ix.writer()

def validate_post(post_data, drafts, future):
    if not post_data['dt']:
        logging.warning('Post has no date: {}'.format(post_data['fname']))
        return False
    if post_data['ts'] <= 0:
        logging.warning('Post timestamp is invalid: {}'.format(post_data['fname']))
        return False
    if post_data['draft'] and not drafts:
        logging.info('Skipping draft {}'.format(post_data['fname']))
        return False
    if post_data['ts'] > time.time() and not future:
        logging.info('Skipping future post {}'.format(post_data['fname']))
        return False
    if not post_data['title']:
        logging.warning('Skipping post without title {}'.format(post_data['fname']))
        return False
    if not post_data['content']:
        logging.info('Skipping post without content {}'.format(post_data['fname']))
        return False
    return True

def post_uri(post_data):
    url = post_data['url']
    if url:
        return url

    logging.warning('Post has no url: {}. Falling back to slug'.format(post_data['fname']))
    slug = post_data['slug']
    if not slug:
        logging.warning('Post has no slug: {}. Falling back to title'.format(post_data['fname']))
        slug = re.sub('[^A-Za-z0-9*]', '-', post_data.get('title', '')).lower()
        if not slug:
            logging.warning('Failed to determine uri: {}'.format(post_data['fname']))
            return None
        logging.info('Calculated slug from title for {} as {}'.format(post_data['fname'], slug))

    uri = '/{}/{}/'.format(post_data['dt'].strftime('%Y/%m/%d'), slug)
    logging.info('Calculated uri for {} as {}'.format(post_data['fname'], uri))
    return uri

def populate_post_data(post, fname):
    post_data = {
        'fname': fname,
        'title': post.get('title'),
        'url': post.get('url'),
        'slug': post.get('slug'),
        'dt': post.get('date'),
        'ts': post.get('date', datetime.fromtimestamp(0)).timestamp(),
        'content': convert_to_plain_text(post.content).strip().replace('\n', ' '),
        'draft': post.get('draft', False)
    }
    return post_data

def index_posts(post_dir, index_writer, drafts, future):
    for dirpath, dirnames, filenames in os.walk(post_dir):
        for fname in filenames:
            _, ext = os.path.splitext(fname)
            if ext not in [ '.md', '.markdown' ]:
                continue
            logging.info('Found file: "{}"'.format(fname))

            post = frontmatter.load(os.path.join(dirpath, fname))
            post_data = populate_post_data(post, fname)

            if not validate_post(post_data, drafts, future):
                continue

            uri = post_uri(post_data)
            if not uri:
                continue

            post_date = post_data['dt'].strftime("%B %d, %Y")

            logging.info('Adding post: {}'.format(uri))
            index_writer.add_document(uri=uri, title=post_data['title'], content=post_data['content'], excerpt=post_data['content'][:EXERPT_LENGTH], post_date=post_date)

def main():
    args = get_arguments()
    logging.basicConfig(level=logging.INFO if args.verbose else logging.WARN, format='%(levelname)s - %(message)s')

    post_dir, index_dir = get_directoires(args)
    if not create_index_dir(index_dir):
        sys.exit(1)

    writer = search_writer(index_dir)
    index_posts(post_dir, writer, args.drafts, args.future)
    writer.commit()

if __name__ == '__main__':
    main()
Anatomy
The first few functions are helpers that should be easy enough to understand.
def get_arguments():
def get_directoires(args):
def create_index_dir(location):
def search_writer(index_dir): creates the Whoosh index writer that we add documents to.
def validate_post(post_data, drafts, future): does two things. It verifies the post has all the required data in the front matter. A post we can’t use doesn’t cause a hard failure; it’s only excluded. The function also determines if a post should be included based on things like the draft or future flags.
def post_uri(post_data): returns the front matter url element, which is an override for permalinks. It tells Hugo to use that instead of generating a permalink for the post. All of my posts have url set, but the function does fall back to trying to generate the permalink using the format /:year/:month/:day/:slug/, which is what I have Hugo configured to use if url is missing. This isn’t really needed but I have it anyway just in case.
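As a worked example of that fallback, here is a hypothetical helper that mirrors the script’s logic: derive a slug from the title the same way generate_search_index.py does, then build the /:year/:month/:day/:slug/ permalink:

```python
import re
from datetime import datetime

def fallback_uri(title, date):
    """Hypothetical helper mirroring the script's fallback path:
    slugify the title, then build /:year/:month/:day/:slug/."""
    slug = re.sub('[^A-Za-z0-9*]', '-', title).lower()
    return '/{}/{}/'.format(date.strftime('%Y/%m/%d'), slug)

uri = fallback_uri('My First Post', datetime(2023, 5, 1))
# → '/2023/05/01/my-first-post/'
```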
def populate_post_data(post, fname): seems like it’s not necessary because the post object has pretty much all the information. However, some of the information needs to be manipulated and then used in multiple places. The returned post_data dict is for convenience.
For example, you’ll see convert_to_plain_text(post.content).strip().replace('\n', ' '), which gives us just the text of the post without any markdown formatting to mess up the indexing or the excerpt.
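To get a feel for what that conversion does, here is a crude stdlib-only sketch that strips a few common markdown markers. This is purely illustrative and not a replacement for the markdown-plain-text package the generator actually uses:

```python
import re

def rough_plain_text(md):
    """Crude markdown stripper, for illustration only.

    The generator uses convert_to_plain_text() from the
    markdown-plain-text package; this sketch only handles a few
    common markers.
    """
    text = re.sub(r'`([^`]*)`', r'\1', md)                # inline code
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)  # links
    text = re.sub(r'[*_#>]+', '', text)                   # emphasis, headers, quotes
    return text.strip().replace('\n', ' ')

plain = rough_plain_text('# Title\n\nSome *bold* `code` and [a link](https://example.com).')
```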
def index_posts(post_dir, index_writer, drafts, future): is where the real magic happens. Not really, but it goes through all the posts and determines if they should be included. If they’re included, they get added to the index using the index_writer.
Finally, def main(): runs the script.
Running With Docker
The generate_search_index.py script is installed in /app within the image and can be run using Docker in two ways.
If the container is running:
docker exec -it <NAME_OF_CONTAINER> /opt/venv/bin/python /app/generate_search_index.py -p /posts -o /data
This assumes your posts are already mounted at /posts. Most likely you won’t have your posts on the same machine that’s running the container unless you’re testing.
If the container is not running:
This is a more likely scenario where you have the site on a work/build machine and not on the production server. In which case you don’t need to create a container from the image and instead you can have Docker create and tear down a container.
docker run --rm -it -v /<PATH_TO_POSTS>:/posts -v /<PATH_TO_OUTPUT>:/data <NAME_OF_IMAGE> /opt/venv/bin/python /app/generate_search_index.py -p /posts -o /data
The search_index directory will be created and populated in the data directory.
Service
The service hasn’t changed much. It’s still a WSGI service using the falcon framework.
The updates to the service are switching to falcon.App(), which replaces falcon.API, and adding an if __name__ == '__main__': section to allow running the script directly. This makes testing the search index a bit easier because I don’t need a WSGI server to run the script when testing locally. That said, it’s easy enough to run the container for testing.
Additionally, I added the environment variable SEARCH_INDEX to determine where the search index is located. If not set, it defaults to the directory search_index in the same location as the service.
Script
search_service.py
#!/usr/bin/env python

import json
import falcon
import os

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

class SearchResource:

    def __init__(self):
        index_dir = os.getenv('SEARCH_INDEX', 'search_index')
        self._ix = open_dir(os.path.abspath(index_dir))

    def _do_search(self, query_str, page):
        ret = {}

        with self._ix.searcher() as searcher:
            qp = QueryParser('content', self._ix.schema)
            q = qp.parse(query_str)

            results = searcher.search_page(q, page, 15)
            ret['page'] = results.pagenum
            ret['pages'] = results.pagecount
            ret['hits'] = []
            for h in results:
                match = {
                    'uri': h['uri'],
                    'title': h['title'],
                    'post_date': h['post_date'],
                    'excerpt': h['excerpt']
                }
                ret['hits'].append(match)

        return ret

    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        res = self._do_search(req.get_param('s', default=''), int(req.get_param('p', default=1)))
        resp.text = json.dumps(res)

app = falcon.App()
searcher = SearchResource()
app.add_route('/', searcher)

if __name__ == '__main__':
    from wsgiref.simple_server import make_server
    with make_server('', 8000, app) as httpd:
        print('Serving on port 8000...')

        # Serve until process is killed
        httpd.serve_forever()
Anatomy
This is a much simpler script than the generator. It’s a REST service that opens the search index, plugs in the query, and returns the result as JSON.
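For reference, a response from the service has the page/pages/hits shape that _do_search builds. A quick sketch of consuming one (the values here are made up, but the field names match what the service returns):

```python
import json

# Hypothetical response body; the field names mirror the dict
# that _do_search() returns.
raw = '''{"page": 1, "pages": 3,
          "hits": [{"uri": "/2023/05/01/my-first-post/",
                    "title": "My First Post",
                    "post_date": "May 01, 2023",
                    "excerpt": "Some text..."}]}'''

results = json.loads(raw)
for hit in results['hits']:
    print(hit['title'], hit['uri'])
```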
Conclusion
This project started as fixing author and date being left out of the search results and turned into a much larger project. Now my simple little search service feels like an actual application.
I could have done a minimal update to the generator and updated the xpath I was using to pull text from the correct elements but I decided to do it right. The script is now more robust and will continue to work if I change the theme. This should save time in the future because I shouldn’t have to update it again for a long time.
Updating the generator made me realize I should be using a requirements.txt file so I went ahead and added one. With more files being added I decided I needed to restructure the layout.
Then it was time to put some attention on the service itself. Testing was proving difficult so I added the ability to run the service locally. That is nice but I’ve been learning about Docker recently and this looked like a good way to learn more.
I’m very happy I took the time to create a Dockerfile and make the application a fully deployable service that just works, complete with a real WSGI server (gunicorn) powering it. I’m also surprised by how small I was able to get the Docker image and how little work actually went into it. Well, not really a little; I spent a lot of time learning.
Now I just have to decide if I want to setup Docker on my server and deploy the image or if I want to stick with the current, more traditional setup I already have in place. Either way, this was a very fun project.