Suricata Engine’s 20x performance upgrade on rule grouping

Table of Contents

  1. The Issue with Engine start time
    • Preliminary information
    • Case Study
  2. Methodology before the fix
    • Simplified pseudocode
    • Runtime analysis
  3. Points of contention
  4. Understanding the problem
    • Solution to Part 1 of the problem
    • Solution to Part 2 of the problem
    • Solution to Part 3 of the problem
  5. Mistakes with the solutions
  6. Rectifying the mistakes
    • Solution to Issue #1
    • Solution to Issue #2
    • Solution to Issue #3
  7. Runtime analysis
  8. Last steps
  9. Progressive results
  10. Results on unaffected rulesets
  11. Conclusion

Note: The professional diagrams in this blog were made by a professional artist on a professional tool.

The Issue with Engine start time

Have you ever tried to load a large ruleset in Suricata? Say about 50k rules?

The answer is probably yes.

Now, have you tried to load a large ruleset with many different ports and port ranges in its rules?

I’m thinking you haven’t; otherwise, you’d have filed a ticket on our Redmine. 😉

In a recent report, AWS informed us that one such ruleset, comprising ~47k rules, took a very long time to load, making the engine startup time quite high. For reference, on my Intel(R) Core(TM) i5-8365U CPU it took 9m45s to load.

Preliminary information

  1. Along with the report, we were informed that a certain function, DetectPortInsert, made an excessive number of recursive calls when that ruleset was loaded. So, our job was to figure out what was happening and, if possible, bring the size of the ruleset down for better analysis. See the picture below.
  2. The Suricata engine performs something called “rule grouping”: before the engine starts, rules that are similar to one another are grouped together so that they are better organized and ready for matching against real-world traffic.

Case Study

To break the problem down, I created several different sets of rules based on different properties. Below is a brief overview of each case and the conclusions drawn from it.

| Case # | Test criteria | Observations | High stack trace? | Conclusion |
|---|---|---|---|---|
| Case 0 | Rules as received from AWS. | The PortInsert function takes a lot of time. There is a combination of TLS, DNS, TCP and HTTP protocols, and a combination of “any”, individual ports (e.g. 80), ranges of ports (e.g. 100:5000) and lists of ports (e.g. [1, 5, 8, 20]). | Yes | There seems to be something with the TLS rules and how ports are grouped together. |
| Case 1 | Just TLS rules, which also happened to be rules with individual ports only. | Load time was considerably lower. | No | TLS probably does not have much to do with the issue. |
| Case 2 | Mix of TLS, HTTP, DNS and TCP rules, with a mix of different port syntaxes. | Load time was considerably lower. | Yes | A high stack trace does not necessarily mean a high loading time for rules; on a small dataset, resource consumption is relative despite the recursion. |
| Case 3 | The rules with the “any” port type were excluded. | Took more time to load than Case 2 but less than Case 1. | Yes | If more individual ports are given to PortInsert along with ranges, the load time increases. |
| Case 4 | Only the rules with individual explicit ports were used. | Load time was considerably lower. | No | The issue does not show up when dealing with individual explicit ports only. |
| Case 5 | Only the rules with port ranges or explicit lists of ports. The dataset was quite small: about 150 rules. | Load time was considerably lower. | No | Port-range rules alone, with only small overlaps, do not cause much trouble. |
| Case 6 | All TLS rules with individual ports plus only one port range. | Took measurably more time than Case 1. | Yes | The issue shows up with a combination of many explicit ports and port ranges. |
| Case 7 | All TLS rules with individual ports plus 4 rules with different, non-overlapping port ranges. | Time to load these rules was higher than in Case 6. | Yes | A combination of port ranges and several explicit individual ports does indeed seem to be causing the issue. |
| Case 8 | All TLS rules converted to HTTP, still with individual ports. | Loaded in considerably less time (almost the same as Case 1). | No | It is not a protocol-specific problem. |

Conclusion: the problem had to do with the combination of port ranges and individual ports. The next step was to understand how and why.

Methodology before the fix

Upon diving into the code, I discovered that, in order to resolve port range overlaps, we have complicated recursive logic in place. A brief overview and simplified pseudocode follow. Note that this logic does give correct results; the problem is what it costs.

Also, note that every port object is a tuple [low, high]. In the case of single ports, low == high.

Simplified pseudocode
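The original post shows this logic as hand-drawn pseudocode. As a stand-in, here is a minimal, hypothetical C sketch of the cutting step; it is not the actual Suricata source, just the shape of the idea, with the signature bookkeeping left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* A port object is a tuple [low, high]; for single ports, low == high. */
typedef struct Port_ {
    uint16_t low, high;
    struct Port_ *next;   /* the list under construction links these together */
    /* ... per-port signature data lives here in the real engine ... */
} Port;

static Port *PortNew(uint16_t low, uint16_t high)
{
    Port *p = calloc(1, sizeof(*p));  /* allocation assumed to succeed in this sketch */
    p->low = low;
    p->high = high;
    return p;
}

static int PortsOverlap(const Port *a, const Port *b)
{
    return a->low <= b->high && b->low <= a->high;
}

/* Cut two overlapping ranges into up to three non-overlapping pieces,
 * e.g. [80, 100] and [90, 500] become [80, 89], [90, 100], [101, 500]. */
static int PortCut(const Port *a, const Port *b, Port *out[3])
{
    uint16_t lo  = a->low  < b->low  ? a->low  : b->low;   /* overall start */
    uint16_t olo = a->low  > b->low  ? a->low  : b->low;   /* overlap start */
    uint16_t ohi = a->high < b->high ? a->high : b->high;  /* overlap end   */
    uint16_t hi  = a->high > b->high ? a->high : b->high;  /* overall end   */

    int n = 0;
    if (lo < olo)
        out[n++] = PortNew(lo, olo - 1);
    out[n++] = PortNew(olo, ohi);
    if (ohi < hi)
        out[n++] = PortNew(ohi + 1, hi);
    return n;
}
```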

Well… this doesn’t look too bad on its own. So why are things so expensive? Let’s take a look at the callers now.
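Here is a rough sketch of the caller, reusing the Port type and the PortsOverlap()/PortCut() helpers from the previous snippet (again hypothetical and simplified, not the real DetectPortInsert):

```c
/* Unlink 'p' from the list and free it. */
static Port *PortRemove(Port *list, Port *p)
{
    if (list == p) {
        list = p->next;
    } else {
        Port *prev = list;
        while (prev->next != p)
            prev = prev->next;
        prev->next = p->next;
    }
    free(p);
    return list;
}

/* Insert 'add' into new_list. On the first overlap found, both ranges are cut
 * into non-overlapping pieces and every piece is recursively inserted back
 * into the list, which keeps changing underneath us. */
static Port *PortInsert(Port *new_list, Port *add)
{
    for (Port *p = new_list; p != NULL; p = p->next) {
        if (PortsOverlap(p, add)) {
            Port *pieces[3];
            int n = PortCut(p, add, pieces);
            new_list = PortRemove(new_list, p);   /* drop the old range ...   */
            free(add);                            /* ... and the incoming one */
            for (int i = 0; i < n; i++)           /* re-insert every piece    */
                new_list = PortInsert(new_list, pieces[i]);
            return new_list;
        }
    }
    /* no overlap anywhere: simply prepend */
    add->next = new_list;
    return add;
}
```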

See the recursive calls and the number of times we may end up revisiting each function, not to mention that new_list is always changing?

Runtime analysis

Since there could be any number of overlaps for a given port, and since that determines both the size of the new list and how its items change, it was hard to put a bound on how much time this procedure could take.

We also noticed that in the later stages, sorting the same list by certain criteria was also taking considerable time: less than the port cutting, but still significant.

Points of contention

It was clear at this point that

  1. Recursion was not the only problem.
  2. Port cutting should happen more efficiently.
  3. There should be a way to put a cap on how much time the procedure can take.
  4. Fix-ups may be required in the later stage where the list is sorted.

Understanding the problem

To understand the problem better, I leaned on my love for numbers, mathematics and patterns, and laid out some example ports from the rules to look for something they had in common.

The issue could be broken down into:

  1. Finding all overlaps among ranges
  2. Breaking down/cutting the ranges
  3. Copying appropriate signatures to the designated ranges

Solving the problem

Solution to Part 1 of the problem

Since the issue was about finding overlaps on these ranges, I drew them in 1-D on a number line.

Now, it was fairly simple for me to see the overlaps visually.

Solution to Part 2 of the problem

Now the overlaps are visible, but what should the final ranges be? To solve this, I decided to consider the range between any two consecutive boundary points a valid candidate range, and to use elimination in the next step. This means we always work with the smallest possible port ranges at any given point.

This also means that, at most, we can end up with 65536 port objects (one per possible port). So the procedure is fairly deterministic.
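As a hypothetical sketch of that idea (again, not the actual Suricata code): if points holds the sorted, de-duplicated boundary points of all ranges, every pair of consecutive points becomes one candidate range.

```c
#include <stddef.h>
#include <stdint.h>

/* For the ranges [0, 3] and [13, 70], the boundary points are {0, 3, 13, 70}
 * and the candidate ranges become [0, 3], [3, 13] and [13, 70]. */
static size_t BuildCandidateRanges(const uint16_t *points, size_t npoints,
                                   uint16_t (*ranges)[2])
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < npoints; i++) {
        ranges[n][0] = points[i];      /* candidate low  */
        ranges[n][1] = points[i + 1];  /* candidate high */
        n++;
    }
    return n;
}
```

Note how consecutive candidates share their endpoints; that is exactly the first of the issues described further below.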

Solution to Part 3 of the problem

To end up with the proper port ranges and the appropriate signatures, I decided to eliminate any ranges that were considered valid but did not have any signatures associated with them.

While this solution sounds promising and simple, several issues surfaced when I started to implement it.

Mistakes with the solutions

Issue #1: Boundary overlaps

For the given ports [0, 3] and [13, 70], we’ll have ranges [0, 3], [3, 13] and [13, 70].

Note how 3 and 13 occur twice.

Issue #2: Invalid port ranges

For the same example as above, the range [3, 13] should be invalid: while ports 3 and 13 have associated signatures, the ports 4 through 12 do not.

Note: I have listed a simple case here. There were several corner cases like this, for which I created unit tests.

Issue #3: Searching for overlaps with existing signatures was a bit expensive

This info was stored in a linked list, which meant a time complexity of O(m·n), where n = the number of ports coming from signatures and m = the number of smallest port ranges (whether or not they have an overlap).

Rectifying the mistakes

Solution to Issue #1

To fix the boundary overlaps, I decided to store more information with each port, such as whether it is a single port or the right boundary of a range. So now I had something like:
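A hypothetical sketch of that extra metadata (the actual Suricata field names and values differ):

```c
#include <stdint.h>

typedef enum {
    PORT_POINT_SINGLE,       /* a single port, low == high               */
    PORT_POINT_RANGE_START,  /* the low end of a range                   */
    PORT_POINT_RANGE_END,    /* the high end (right boundary) of a range */
} PortPointType;

/* Each boundary point remembers what kind of point it is, so adjacent
 * candidate ranges no longer have to share their endpoints. */
typedef struct PortPoint_ {
    uint16_t port;
    PortPointType type;
} PortPoint;
```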

With this, the new list would look like:
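For the earlier [0, 3] and [13, 70] example, the candidate ranges would then presumably be [0, 3], [4, 12] and [13, 70]: since 3 is known to be a right boundary, the next candidate starts at 4, and since 13 is a range start, the gap candidate ends at 12. That gap range [4, 12] carries no signatures and is dealt with in the next step.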

Solution to Issue #2

This follows from the solution to Issue #1: I simply needed to eliminate the ranges that did not have any signatures corresponding to them. As an added step, on Victor’s suggestion, I also merged adjacent port ranges that carried the exact same signatures.
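A rough, hypothetical sketch of this clean-up pass, using a toy bitmask to stand in for the real signature groups:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Range_ {
    uint16_t low, high;
    uint64_t sigs;   /* toy stand-in for the set of signatures on this range */
} Range;

/* Drop ranges with no signatures and merge adjacent ranges carrying exactly
 * the same set. Works in place on a sorted array, returns the new count. */
static size_t CompactRanges(Range *r, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (r[i].sigs == 0)
            continue;                        /* no signatures: eliminate   */
        if (out > 0 && r[out - 1].sigs == r[i].sigs &&
            r[out - 1].high + 1 == r[i].low) {
            r[out - 1].high = r[i].high;     /* adjacent, same sigs: merge */
        } else {
            r[out++] = r[i];
        }
    }
    return out;
}
```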

Solution to Issue #3

Now it was time to find a less expensive way to look up overlaps, in order to know which signatures should belong to a given port range. Soon enough, I stumbled upon something called interval trees, which looked promising. Spoiler: they were!

Interval Tree

An interval tree is a special type of red-black tree that stores an interval in each node, along with an extra variable holding the maximum high value of any interval in the subtree rooted at that node.

See example below.
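The original post illustrates this with a drawing. As a stand-in, here is a minimal, hypothetical C sketch of the node layout and of the query that reports every stored interval overlapping [qlow, qhigh]; the red-black balancing and the insertion logic are omitted, and the names are illustrative rather than Suricata’s.

```c
#include <stdint.h>
#include <stdio.h>

/* One node: an interval [low, high] plus 'max', the maximum 'high' of any
 * interval in the subtree rooted at this node. */
typedef struct ITNode_ {
    uint16_t low, high;
    uint16_t max;
    struct ITNode_ *left, *right;
} ITNode;

static void FindOverlaps(const ITNode *n, uint16_t qlow, uint16_t qhigh)
{
    if (n == NULL)
        return;
    /* the left subtree can only hold an overlap if its max reaches qlow */
    if (n->left != NULL && n->left->max >= qlow)
        FindOverlaps(n->left, qlow, qhigh);
    /* report this node if its interval overlaps the query */
    if (n->low <= qhigh && qlow <= n->high)
        printf("overlap: [%u, %u]\n", n->low, n->high);
    /* intervals to the right start at or after n->low; they can only overlap
     * if the query reaches past that point and the subtree's max reaches qlow */
    if (n->right != NULL && n->low <= qhigh && n->right->max >= qlow)
        FindOverlaps(n->right, qlow, qhigh);
}
```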

Cost of finding all possible overlaps in an interval tree = min(O(n · n), O(k · lg n)),

where n = the number of nodes in the tree and k = the number of overlapping intervals.

Runtime Analysis

This means that the time complexity of our final solution corresponds to that of searching the interval tree, since that is now the most expensive operation.

[Benchmark 1 in the Progressive results below measures the state of the work up to this point.]

Last steps

  • While we had now solved most of the problem, to our surprise this already helped and brought down the time spent on point 4 of the Points of Contention (the later sorting stage).

So I tried sorting the final port list in a better way, but it yielded only a very small perf upgrade. Still, my analysis showed that this could be improved further with a better sorting algorithm.

Just for reference, the sorting at this stage was to be done by two criteria:

  1. Score: determines how important a signature is based on whether it has MPM etc.
  2. Signature count: stores how many signatures use this port range.

[Benchmark 2 in the Progressive results below measures the state of the work up to this point.]

  • Then, Victor figured out an even better solution for sorting using qsort and it yielded a good perf upgrade.
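For illustration only, here is a hypothetical qsort(3) comparator sorting by score first and signature count second; the real Suricata comparator, field names and sort direction may differ.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct PortGroup_ {
    uint16_t low, high;
    uint32_t score;    /* importance, e.g. based on whether the group has MPM */
    uint32_t sig_cnt;  /* number of signatures using this port range          */
} PortGroup;

static int PortGroupCompare(const void *pa, const void *pb)
{
    const PortGroup *a = pa, *b = pb;
    if (a->score != b->score)
        return a->score > b->score ? -1 : 1;      /* higher score first   */
    if (a->sig_cnt != b->sig_cnt)
        return a->sig_cnt > b->sig_cnt ? -1 : 1;  /* then more signatures */
    return 0;
}

/* usage: qsort(groups, ngroups, sizeof(groups[0]), PortGroupCompare); */
```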

[Benchmark 3 in the Progressive results below measures the state of the work up to this point.]

  • Now, Victor was unhappy with the time being spent in one function, SigGroupHeadCopySigs, and felt it could be improved. See the perf output below.

He realized that it could benefit from SIMD optimizations because of the nature of the work the function does, i.e. the same operation (bitwise ORing) applied to a large dataset. So he went ahead and implemented SIMD optimizations that require the SSE2 instruction set, which is fairly common, and it improved things even further! See the perf results below.
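To give an idea of the kind of optimization this is (not the actual SigGroupHeadCopySigs code), here is a hypothetical SSE2 snippet that ORs one signature bitmap into another, 128 bits at a time:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* OR 'src' into 'dst', 16 bytes per iteration. For simplicity this sketch
 * assumes 'nbytes' is a multiple of 16. */
static void SigBitmapOr(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i += 16) {
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_or_si128(d, s));
    }
}
```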

[Benchmark 4 in the Progressive results below measures the final state of the work.]

Progressive results

Now, let’s see what the results looked like at different revisions of the solution. For the sake of simplicity, I have removed any intermediary results which contained issues during the development.

| State | Time taken |
|---|---|
| Base revision | 9m45s |
| Benchmark 1 | 1m17s |
| Benchmark 2 | 1m11s |
| Benchmark 3 | 40s |
| Benchmark 4 [FINAL] | 29s |

All these benchmarking tests were performed on Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz.

Results on unaffected rulesets

Please note that this perf upgrade applies to rules that affect port grouping. Any ruleset that does not have such complex multiple-port settings will load in roughly the same timeframe as before this change in the codebase. As an example, here are the results for ET Open on my system.

| State | Time taken |
|---|---|
| Before this fix | 0m21.955s |
| After this fix | 0m21.615s |

Conclusion

For a ruleset with complicated port overlaps and settings, we were able to bring the load time down from 9m45s to 29s, a whopping 20x perf upgrade!

To our surprise, when we shared the solution with AWS, they informed us that the performance was terrific for them AND that they were able to load ALL of their problematic rulesets at once, about 170k rules in total, in just about 50s! This was beyond our expectations!

Thank you AWS for the thorough testing of Suricata and bringing us an opportunity to improve!

This fix is available in the latest development branch and the latest stable release 7.0.6 of Suricata. If you’re interested, this is the redmine ticket that tracked this issue.

For more detailed updates on this work, please consider signing up for SuriCon 2024, where we will present a deeper and better analysis of this problem and its solution, along with a lot of other interesting topics and talks from the team and the community! Until the next blog post. Cheers! 😀