For a project I’ve been working on, I needed to split a string into its
component parts. There is strtok, which I find useless for pretty much any
task. It is not thread-safe, nor is it re-entrant, which makes it impossible to
parse two strings (in a loop) at once. Yet another issue with strtok is that
after splitting, parts are returned by multiple calls to the function. The only
way to know the number of parts is to loop until you’ve gone through all of
them. Also, you can’t specify a maximum number of parts you want the string
split into. Finally, strtok modifies the input string, which might not be
desirable.
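To make the last two complaints concrete, here is a minimal sketch. The helper name count_tokens is made up for this example and is not part of any library. It counts tokens with strtok: the count is only known after walking every token, and the caller’s buffer is altered along the way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any library: counts the tokens in buf.
 * strtok overwrites each delimiter it consumes with '\0', so buf is
 * modified in place, and the count is only known once the loop finishes. */
size_t count_tokens(char *buf, const char *delims)
{
    size_t n = 0;
    char *tok;

    assert(buf != NULL && delims != NULL);

    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    return n;
}
```

Calling count_tokens on a writable buffer holding "a,b,c" with "," as the delimiter returns 3, and afterwards the buffer reads as just "a" because the commas have been overwritten with terminators.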
To deal with the shortcomings of strtok, I wrote a simple string splitting
function. It duplicates the input string and also takes a length so it can
split a substring. It also supports a maximum number of splits in case I want
to use it to partition a string.
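#include <stdlib.h>
#include <string.h>

char **str_split(const char *in, size_t in_len, char delm, size_t *num_elm, size_t max)
{
    char   *parsestr;
    char  **out;
    size_t  cnt = 1;
    size_t  i;

    if (in == NULL || in_len == 0 || num_elm == NULL)
        return NULL;

    /* Copy exactly in_len bytes. The input may be a substring, so there
     * might not be a terminator to copy and we add our own. */
    parsestr = malloc(in_len+1);
    if (parsestr == NULL)
        return NULL;
    memcpy(parsestr, in, in_len);
    parsestr[in_len] = '\0';

    /* First pass: count the elements, stopping early once max is reached. */
    *num_elm = 1;
    for (i=0; i<in_len; i++) {
        if (parsestr[i] == delm)
            (*num_elm)++;
        if (max > 0 && *num_elm == max)
            break;
    }

    out = malloc(*num_elm * sizeof(*out));
    if (out == NULL) {
        free(parsestr);
        return NULL;
    }

    /* Second pass: the copy itself is the first element. */
    out[0] = parsestr;
    for (i=0; i<in_len && cnt<*num_elm; i++) {
        if (parsestr[i] != delm)
            continue;

        /* Terminate the current element and add the pointer to the
         * start of the next one to the array of elements. */
        parsestr[i] = '\0';
        out[cnt] = parsestr+i+1;
        cnt++;
    }

    return out;
}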
Before we start the actual splitting, we need to determine the number of
elements. There will always be at least one element because if there is no
delimiter within the data, then the entire input will be returned as the only
element. Counting stops once max is reached, unless max was set to 0, in which
case we find the real total. This has to happen before the actual splitting
takes place because we need to know the number of elements to allocate in the
output array.
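The counting pass can be sketched on its own as follows. The helper name count_parts is hypothetical and exists only for this illustration; it mirrors the counting loop without doing any allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the counting pass: returns how many elements
 * a split of the first len bytes of s on delm would produce. A max of 0
 * means "no limit"; otherwise counting stops once max elements are found. */
size_t count_parts(const char *s, size_t len, char delm, size_t max)
{
    size_t n = 1;   /* the whole input is always at least one element */
    size_t i;

    assert(s != NULL);

    for (i = 0; i < len; i++) {
        if (s[i] == delm)
            n++;
        if (max > 0 && n == max)
            break;
    }
    return n;
}
```

For "a,b,c,d" split on a comma this gives 4 with max set to 0 and 2 with max set to 2. An input with no delimiter gives 1.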
The data is copied into a new string that we’ll chop up. In the code this happens first, because the counting loop reads from the copy. We ensure a NUL terminator at the end of this string so we don’t have to worry about the last element in a split. Since we could be splitting a substring, it’s possible the data we’re copying isn’t already NUL terminated.
Then we loop through the string again so we can start pulling out elements. If there was only one element (no delimiter in the string), this loop will not run. That doesn’t matter, because the duplicated string was already set as the first element and this loop only deals with the remaining elements.
As we go through the string, delimiters are changed into NUL terminators. The pointer to the character just after each replaced delimiter is stored as the start of the next element. If the last data character in the string is the delimiter, then the next character will be the NUL terminator. In this situation the last element in the array will point to the NUL terminator, so we end up with an empty element.
What we’ve done is take a string and put NUL terminators throughout it. We’ve
also created an array of pointers into locations within the string after each
terminator. The string itself is the first element in the array. This way we
only have one malloc for all the string data instead of needing one for each
substring. Due to this, we can’t have the caller free each part individually.
Instead we need a separate function to handle freeing memory allocated by the
split function.
void str_split_free(char **in, size_t num_elm)
{
    if (in == NULL)
        return;
    if (num_elm != 0)
        free(in[0]);
    free(in);
}
The split function makes two allocations, so naturally there are only two deallocations (the array itself and the first element in the array). Don’t forget the first element is the full string, with the rest of the array containing pointers to specific locations within the string.
The number of elements isn’t really needed, but it’s an additional safety check to prevent mistakes.
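To show how the two functions fit together, here is a minimal usage sketch. The two functions are repeated in condensed form only so the example compiles on its own. This copy checks the malloc results and copies exactly in_len bytes, which is my own assumption about how they should behave. The print_parts helper is hypothetical and exists only for this demonstration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Condensed copy of str_split, repeated so this example is self-contained. */
char **str_split(const char *in, size_t in_len, char delm, size_t *num_elm, size_t max)
{
    char *parsestr;
    char **out;
    size_t cnt = 1;
    size_t i;

    if (in == NULL || in_len == 0 || num_elm == NULL)
        return NULL;
    parsestr = malloc(in_len + 1);
    if (parsestr == NULL)
        return NULL;
    memcpy(parsestr, in, in_len);
    parsestr[in_len] = '\0';

    /* Count the elements, stopping early once max is reached. */
    *num_elm = 1;
    for (i = 0; i < in_len; i++) {
        if (parsestr[i] == delm)
            (*num_elm)++;
        if (max > 0 && *num_elm == max)
            break;
    }

    out = malloc(*num_elm * sizeof(*out));
    if (out == NULL) {
        free(parsestr);
        return NULL;
    }

    /* Replace delimiters with terminators and record each element start. */
    out[0] = parsestr;
    for (i = 0; i < in_len && cnt < *num_elm; i++) {
        if (parsestr[i] != delm)
            continue;
        parsestr[i] = '\0';
        out[cnt++] = parsestr + i + 1;
    }
    return out;
}

void str_split_free(char **in, size_t num_elm)
{
    if (in == NULL)
        return;
    if (num_elm != 0)
        free(in[0]);
    free(in);
}

/* Hypothetical demo helper: splits in on delm, prints each element in
 * brackets, frees everything, and returns the element count (0 on failure). */
size_t print_parts(const char *in, char delm, size_t max)
{
    size_t num = 0;
    size_t i;
    char **parts;

    if (in == NULL)
        return 0;
    parts = str_split(in, strlen(in), delm, &num, max);
    if (parts == NULL)
        return 0;
    for (i = 0; i < num; i++)
        printf("[%s]\n", parts[i]);
    str_split_free(parts, num);
    return num;
}
```

With max set to 2, print_parts("key=value=more", '=', 2) prints [key] and [value=more]: splitting stops after the first delimiter, so the remainder stays intact as the second element. With max set to 0, print_parts("a,b,", ',', 0) prints [a], [b], and an empty [] for the trailing delimiter.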