About Silicon Valley
If you haven't heard of the Silicon Valley series, I highly recommend giving it a watch! It came out a decade ago, but it still gives today's tech and non-tech world a laugh with its timeless problems; the only difference is that back then it was about compression, and now we race for the best LLM.
The technical plot points were clearly consulted with real engineers, so as a tech person, I'm quite satisfied.
The problem
Why am I bringing this up here? Well, here’s the story:
Recently, I had to clean up our Codeowners module at Tidio. So, what exactly does it do?
We have a config file in our backend that reflects our organizational chart and assigns specific modules in code to each team, specifying who is responsible for what. We use it in Sentry to automatically assign error owners and to generate CODEOWNERS files for GitLab, which automatically assigns reviewers.
The last place where we use this definition is adding the team name to the log message context.
Basically, this file looks like this:
class CodeownersDefinition
{
    public function getCodeowners(): Codeowners
    {
        return (new CodeownersBuilder())
            ->addTeam(
                Team::Billing,
                new Path('app/Modules/Invoice', TeamMember::JOHN_DOE, TeamMember::JANE_DOE),
                ...
            )
            ->addTeam(
                ...
            )
            ->build();
    }
}
Very nice object-oriented interface.
To find which team owns the current request, we take a specific controller path from the PHP backtrace and compare it to each path. The Path class has a belongsTo(string $path) method that compares path prefixes. We have over 200 of those paths.
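For illustration only, the check inside belongsTo() is presumably something like this minimal sketch (the $prefix property name is just my placeholder):

// A minimal sketch, assuming belongsTo() simply checks whether the given
// file path starts with the prefix stored in this Path object.
public function belongsTo(string $path): bool
{
    return str_starts_with($path, $this->prefix);
}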
During this cleanup, I found that we have some places where a part of one module is owned by another team. For example:
app/Modules/Contacts → First Team
app/Modules/Contacts/Import → Second Team
With the current way of building the list, we cannot guarantee that the longer path will be found first.
So we have two requirements:
- performance (at least it shouldn’t be slower)
- prioritising longer paths
First step - indexing
To handle this case, we had to sort our list of paths by length. If we sort it in descending order, the first match will be the longer, more specific path, and our requirement will be fulfilled.
It's very simple, but sorting this list on every request would add extra load. To avoid that, I created an index: a PHP file that is generated automatically together with the CODEOWNERS file.
We don’t need team members in the logger, so we are able to make a simpler index:
return [
    'app/Modules/Invoice' => Team::Billing,
    ...
];
This way, we can sort it once, before the application even starts, and just search the same way as we’ve done before.
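As a rough sketch, the generation step could look something like this (the concrete paths, the team names other than Billing, and the file name are just illustrative, not our exact code):

// Illustrative only: flatten the definition into a path => team map,
// sort it by path length in descending order, and dump it as plain PHP.
$index = [
    'app/Modules/Contacts'        => Team::Contacts,        // hypothetical team names
    'app/Modules/Contacts/Import' => Team::ContactsImport,  // hypothetical team names
    'app/Modules/Invoice'         => Team::Billing,
];

// Longest key first, so the first prefix match is always the most specific one.
uksort($index, static fn (string $a, string $b): int => strlen($b) <=> strlen($a));

file_put_contents(
    __DIR__ . '/codeowners_index.php',
    "<?php\n\nreturn " . var_export($index, true) . ";\n",
);

At runtime, a single require of that file gives us the already-sorted array.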
After that change was deployed, I was curious how much time it would take, so I checked the Blackfire profile.
It works, and it also shaves 2ms off the time spent searching for a team. Success!
Two things made this solution fast:
- The first one was skipping object creation. The indexed array is just a hash table with string keys and values.
- The second was probably OPCache, which can keep the entire index in cache instead of loading it on every request.
N sorted list brute-force search
There is still room for improvement, because it takes ~6ms on every request, and what's worse, shorter paths take even longer, because they sit lower in the index file.
Meanwhile, I was rewatching Silicon Valley (probably for the tenth time, if not more).
Luckily, I happened to be at the episode where Richard's old manager teases him about his old code, which brute-force searched over an N sorted list.
I mean this episode. You can probably find a short version of this scene on YouTube.
It got me thinking about our case and how similar the two scenarios are!
Solution 1 - binary searching
So, my first thought was, what is the best solution for searching over an N sorted list?
Yup, binary search.
How does it work?
To simplify, let's say we count indexes starting from 1, not from 0.
If you have a sorted array of numbers, like this one:
[1, 3, 6, 10, 15, 29, 30, 37, 52, 60]
If you want to find the index of a specific number, let's say 30, you can virtually split this array in half. Let's take the middle index, which is floor(10/2) = 5:
              ▼
[1, 3, 6, 10, 15, 29, 30, 37, 52, 60]
So it's 15. Is 30 greater or less than 15? Greater, so take the middle index of the second half. There are 5 items on the right, so floor(5/2) = 2. The previous index 5, plus 2, is 7:
                      ▼
[1, 3, 6, 10, 15, 29, 30, 37, 52, 60]
At index 7, we have 30. Is 30 equal to 30? Yes, we found it, the index is 7!
So, that's how it works. Instead of making the comparison 7 times, we made it only twice.
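To make it concrete, here is what a minimal binary search could look like in PHP (a generic sketch, not code from our codebase; note it uses 0-based indexes, unlike the walkthrough above):

// Returns the index of $needle in the sorted array $haystack, or null if absent.
function binarySearch(array $haystack, int $needle): ?int
{
    $low = 0;
    $high = count($haystack) - 1;

    while ($low <= $high) {
        // Middle of the current search window.
        $mid = intdiv($low + $high, 2);

        if ($haystack[$mid] === $needle) {
            return $mid;
        }

        if ($haystack[$mid] < $needle) {
            $low = $mid + 1;  // needle is in the right half
        } else {
            $high = $mid - 1; // needle is in the left half
        }
    }

    return null;
}

// binarySearch([1, 3, 6, 10, 15, 29, 30, 37, 52, 60], 30) returns 6 (0-based).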
But this algorithm does not apply to our case, because we don't have a sorted list of numbers, and prefix matching doesn't give us an ordering that binary search could use.
Solution 2 - search tree
The solution that we could apply here is a search tree.
We could build a tree of path segments, split the path we are looking for into directories, and walk down the tree node by node. Sounds great and easy, but let's try to build that tree:
app
    Modules
        Invoice
        Payment
        Contacts
            Imports
        Auth
        Users
    ...
In our case, almost all paths live under Modules, and we are usually looking for a specific module; at most we go one level deeper, to a directory inside a module, and that's a very rare situation.
So we end up roughly where we started. There is some optimisation, because we have some legacy files outside Modules, but it's still not enough. Also, we need to prioritise the longest path, so we may have to search over all nodes of the tree.
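Just to show what that would have looked like, here is a rough sketch of a lookup over such a tree of nested path segments (the array shape, the '_team' marker, and the team names other than Billing are all hypothetical):

// Each node is an array of child segments; a '_team' key marks an owner.
$tree = [
    'app' => [
        'Modules' => [
            'Invoice'  => ['_team' => Team::Billing],
            'Contacts' => [
                '_team'  => Team::Contacts,                    // hypothetical
                'Import' => ['_team' => Team::ContactsImport], // hypothetical
            ],
        ],
    ],
];

function findTeamInTree(array $tree, string $path): ?Team
{
    $node = $tree;
    $team = null;

    // Walk down segment by segment; remember the deepest owner seen so far,
    // so the longest matching path wins.
    foreach (explode('/', trim($path, '/')) as $segment) {
        if (!isset($node[$segment]) || !is_array($node[$segment])) {
            break;
        }
        $node = $node[$segment];
        $team = $node['_team'] ?? $team;
    }

    return $team;
}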
Let’s look for something else.
Solution 3 - inverted searching, our middle out
Do you remember how Richard got a new idea while watching his colleagues argue about, uhm, manipulating data? Then he found the middle-out solution for compression.
In my case, it wasn’t that spectacular. I was just thinking about other solutions, and found this one. But it feels like I found “middle out” for this smaller case.
What if, instead of trying to adjust indexing and comparing paths, we just start adjusting the current file path during searching?
The idea of the algorithm is simple. We take the file's path and try to find that exact path in our index by its key. The index is a hash map, so looking up a key this way is O(1):
$path = 'app/Modules/Invoice/Infrastructure/Http/Controllers/InvoiceController.php';
array_key_exists($path, $this->index);
Next, if there is no match, just cut the last segment off the path and try again:
$path = 'app/Modules/Invoice/Infrastructure/Http/Controllers/';
array_key_exists($path, $this->index);
And again:
$path = 'app/Modules/Invoice/Infrastructure/Http/';
array_key_exists($path, $this->index);
And again:
$path = 'app/Modules/Invoice/Infrastructure/';
array_key_exists($path, $this->index);
Until we get a match:
$path = 'app/Modules/Invoice/';
array_key_exists($path, $this->index);
This way, we can find the team very quickly, with longest-path prioritisation and roughly constant search time for every path.
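Put together, the whole lookup can be sketched like this (a simplified version, not our exact code; I'm assuming the index keys and the trimmed paths share the trailing-slash convention used in the examples above, and the method name is mine):

// Walk from the most specific prefix to the least specific one and
// return the first owner found; longer paths therefore always win.
private function findTeam(string $path): ?Team
{
    while ($path !== '') {
        if (array_key_exists($path, $this->index)) {
            return $this->index[$path];
        }

        // Cut the last segment off (file name first, then directories),
        // keeping the trailing slash to match the index key format.
        $lastSlash = strrpos(rtrim($path, '/'), '/');
        if ($lastSlash === false) {
            return null; // nothing left to trim, no owner found
        }
        $path = substr($path, 0, $lastSlash + 1);
    }

    return null;
}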
I did some research to see if this approach has a name, and it's probably a variation of the Longest Prefix Match algorithm.
From O(n), we are now at O(1). Strictly speaking, it's O(d), where d is the depth of the path, which in our case is at most 7 and around 5 on average; but it behaves like O(1) in practice, because its complexity is independent of the number of paths in the index.
That optimization gave us such a big reduction that Blackfire stopped reporting this method at all, because only calls taking more than 1% of the request time show up in the profile.
So we are probably below 1ms now.
Summary
While 7 milliseconds might seem like a very small improvement, the investment was also surprisingly small. Was it worth it? Much of the solution came from "involuntary" thinking after rewatching one of my favorite TV series, and the actual development took me only about 20 minutes.
I’ve done this task thanks to our "Improvement Fridays" initiative.
However, the impact of 7ms, when applied to every single request, can bring very nice results. We can't measure this precisely in production because response times are too inconsistent. Still, ignoring these small improvements can let performance problems build up over time, particularly when they affect every single request.
This whole thing reminded me how powerful basic algorithms and data structures are, and how easily I, as a software engineer, forget that. It was genuinely fun and refreshing to apply these ideas to a real problem, and it proves that learning them is priceless. You truly never know when they'll hand you a perfectly fitting solution.