Making a Rust application slower than the Node version. And then fixing it.

Whilst working on a new version of Browsersync I ran into an interesting example where my Rust version was not faster than the original NodeJS implementation - in fact it was 2x slower! 😱

In this previous post I outlined how the Browsersync proxy works: it sends requests to the target url with modified headers and then buffers the response body in memory before applying string replacements.

I had been following the idea of not spending too much time optimizing the Rust version whilst I was deep in development. Being liberal with .clone() and trying to avoid lifetimes where possible was proving very productive. I was getting features completed really quickly and feeling great about the overall direction.

The pressures of a public demo

Getting to feature parity on the proxy feature was an exciting milestone for me since it proves out a lot of the new architectural patterns (more on this in a future post) and I was keen to make a demo 👀.

So, imagine my horror when I spun up a simple example and it turned out to be much slower than the original NodeJS implementation I'd written years ago! 😱😱

So, did I just forget to run cargo build with --release like everyone does?

A common problem that people run into when benchmarking Rust programs (normally on Twitter when promoting an alternative) is that they forget to build in release mode! Even if you've been writing Rust for years, it's easy to forget. I think that's because the local development story with Rust and cargo is so ergonomic that we get used to just running cargo run 🤣.

Regardless, in my case this alone didn't help!
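For what it's worth, a small guard can make it harder to benchmark a debug build by accident. This is only a sketch: build_profile and warn_if_debug_build are hypothetical helper names, and it assumes the default Cargo profiles, where debug assertions are on for cargo build and off for cargo build --release (a custom profile can change that).

```rust
/// Reports which kind of build this is, based on whether debug assertions
/// are enabled. Assumption: default Cargo profiles, where they are on for
/// a plain `cargo build` and off for `cargo build --release`.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Hypothetical helper: call this before printing any benchmark numbers.
/// Returns true when a warning was emitted.
fn warn_if_debug_build() -> bool {
    let is_debug = build_profile() == "debug";
    if is_debug {
        eprintln!("warning: debug build, timings will not be representative");
    }
    is_debug
}
```

Calling warn_if_debug_build() at startup of a demo or benchmark binary is enough to catch the missing --release flag, although as this post shows, that flag is not always the real problem.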

So what exactly was slow

In my demo I just wanted to show a localhost server that proxied all requests through to a live website (in the Browsersync style).

GET localhost:3000/ -> GET
GET localhost:3000/css/core.css -> GET
... etc

When using you get a total of just 19 requests (when viewed in the browser).

This was hardly a stress-test, so why the crazy numbers?

            | baseline  | proxy in node js | proxy in rust (slowest)
load time   | 250-320ms | 780-900ms        | 950ms-1.2sec
page weight | 368kb     | 562kb            | 368kb

Something was wrong here. Even though I hadn't applied any specific optimisations yet, I had expected the Rust version, even in debug mode, to far out-perform the NodeJS implementation...

The Importance of Connection Pooling

The short version is that my implementation was creating a new HTTP client inside the main request handler! 🤣

So, for every single HTTP request that my application dealt with, it was building a brand new client, complete with its own empty connection pool and everything else it prepares behind the scenes to re-use resources where possible. Only for the handler to exit, tearing all of that down and then re-creating it immediately on the next request 😱. Nothing was ever re-used, so every upstream request paid for a fresh connection.

No wonder the Rust version was even slower than the NodeJS version!
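To make the cost concrete, here is a toy model of the difference. It is not hyper's client: ToyClient, its fields and both functions are made up for illustration, and the only thing it models is that open connections live in a pool owned by the client. The host name is a placeholder, and 19 matches the request count from the demo above.

```rust
use std::collections::HashSet;

/// Toy stand-in for an HTTP client (NOT the real hyper client).
/// It owns a pool of hosts it already has an open connection to.
struct ToyClient {
    pool: HashSet<String>,
    /// How many times a brand new connection had to be opened.
    handshakes: u32,
}

impl ToyClient {
    fn new() -> Self {
        ToyClient { pool: HashSet::new(), handshakes: 0 }
    }

    /// "Sends" a request: reuse a pooled connection if there is one,
    /// otherwise pay for a new one (TCP + TLS in real life).
    fn get(&mut self, host: &str) {
        // `insert` returns true only if the host was not pooled yet.
        if self.pool.insert(host.to_string()) {
            self.handshakes += 1;
        }
    }
}

/// The mistake: a fresh client, and so an empty pool, for every request.
fn handshakes_with_client_per_request(requests: u32, host: &str) -> u32 {
    let mut total = 0;
    for _ in 0..requests {
        let mut client = ToyClient::new();
        client.get(host);
        total += client.handshakes;
        // `client` is dropped here, and its pooled connection with it.
    }
    total
}

/// The fix: one client that outlives all requests.
fn handshakes_with_shared_client(requests: u32, host: &str) -> u32 {
    let mut client = ToyClient::new();
    for _ in 0..requests {
        client.get(host);
    }
    client.handshakes
}
```

In this model, 19 requests to one host cost 19 handshakes with a client per request and a single handshake with a shared client. A real client handling concurrent requests would open more than one connection, so treat the exact numbers as illustrative only.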

The problem

Since I was deep in prototyping mode and was thinking more about architecture than individual handlers, I let this slip by:

Imagine an HTTP server being started like this:

async fn main() {
    // snip
    let app = Router::new().nest_service("/", any(handler));
    // snip

That's a reasonable way to allow a single service to handle all routes under /. The problem is what occurs within handler:

pub async fn handler(req: Request) -> Result<Response, StatusCode> {

    let https = HttpsConnector::new();
    let client = Client::builder(TokioExecutor::new()).build(https);

    // snip: modify the response object

        .map_err(|_| StatusCode::BAD_REQUEST)?

  • This async function, handler, will be called for every request - HTML, CSS, JS, images, web sockets, everything!
  • On lines 3 and 4 we create the HTTP client and its HTTPS connector
  • But this happens over and over again with every incoming request 🙈🙈🙈

I was very confident this was the cause of the poor performance 🤣 - so, it was just going to be a case of moving the client creation code out of the handler, and coming up with a way to access the client in the handler.

The solution

Step 1) Ensure the client is not re-created on every request. This can be done by pulling it up into main:

async fn main() {
+   let https = HttpsConnector::new();
+   let client = Client::builder(TokioExecutor::new()).build(https);
    let app = Router::new().nest_service("/", any(handler));
    // snip

Step 2) Next, we need to access that client from within the handler.

Luckily Axum has a really nice way to do this, known as extractors. Extractors allow a type-safe way of accessing various useful things from within your handler functions.

async fn handler(State(client): State<Client>, req: Request) -> Result<Response, StatusCode> {
   // implementation as before.
  • Notice how we added a new parameter, State(client): State<Client>
  • This uses the State extractor along with the fact that parameters in Rust can also be patterns - this allows direct access to the element inside the tuple struct (client in this case).
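The parameter pattern is plain Rust and can be tried without Axum at all. In this sketch State and Client are tiny stand-ins defined on the spot, not the real axum or hyper types, and both handlers are hypothetical:

```rust
/// Minimal stand-in for an extractor like axum's `State`: a tuple struct
/// with a single public field. (Illustration only, not the real type.)
struct State<T>(pub T);

/// Hypothetical client type standing in for the real HTTP client.
struct Client {
    name: String,
}

/// Without a parameter pattern: accept the wrapper, then unwrap it by hand.
fn handler_verbose(state: State<Client>) -> String {
    let client = state.0;
    format!("using {}", client.name)
}

/// With a parameter pattern: destructure the tuple struct in the signature,
/// so `client` is bound directly to the inner value.
fn handler(State(client): State<Client>) -> String {
    format!("using {}", client.name)
}
```

Both functions take exactly the same argument type and behave the same; the pattern form just removes the extra unwrapping line from the body.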

Step 3) Provide state to the Router

If we tried to run the code at this point we'd get an error stating that our handler cannot be used with the Router configured in main.

This is the type-safe part - each handler contributes to the overall type of the outer Router - so to satisfy the type checker we need to call .with_state(...) with a type that satisfies all extractors.

We only have 1 of our own (we use Request too, but that's supported already), so it's just a case of making this small change:

async fn main() {
    let https = HttpsConnector::new();
    let client = Client::builder(TokioExecutor::new()).build(https);
-   let app = Router::new().nest_service("/", any(handler));
+   let app = Router::new().nest_service("/", any(handler).with_state(client));
    // snip
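One detail worth spelling out: state handed to a router has to be Clone, and as far as I understand it the framework clones it for each request. That is only cheap and correct if a clone shares the underlying pool rather than copying it. The sketch below models that with an Arc; SharedClient and new_connections_across_clones are hypothetical names, not the real client.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Toy client handle. Cloning it copies the `Arc`, not the pool, so every
/// clone sees the same set of open connections. (Illustration only.)
#[derive(Clone)]
struct SharedClient {
    pool: Arc<Mutex<HashSet<String>>>,
}

impl SharedClient {
    fn new() -> Self {
        SharedClient { pool: Arc::new(Mutex::new(HashSet::new())) }
    }

    /// Returns true if a new connection had to be opened for `host`.
    fn get(&self, host: &str) -> bool {
        let mut pool = self.pool.lock().unwrap();
        // `insert` returns true only when the host was not already pooled.
        pool.insert(host.to_string())
    }
}

/// Simulates a framework cloning the state once per incoming request and
/// counts how many of those requests needed a new connection.
fn new_connections_across_clones(requests: usize, host: &str) -> usize {
    let state = SharedClient::new();
    (0..requests)
        .filter(|_| {
            let per_request = state.clone();
            per_request.get(host)
        })
        .count()
}
```

Even though every simulated request works with its own clone, only the first one opens a connection, because all clones point at the same pool.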

With that change in place, HTTP connection pooling works as expected, and the numbers start to look a lot better :)

            | baseline  | proxy in rust | proxy in rust, no pooling
load time   | 250-320ms | 250-320ms     | 950ms-1.2sec
page weight | 368kb     | 368kb         | 368kb


The mistake highlighted in this post was a silly one 🙈, and was easily fixed - but it does show how easy it is to tank performance when dealing with network-bound workloads.

And there I was, thinking I just needed that simple --release flag 🤣
