Making a Rust application slower than the Node version. And then fixing it.
Whilst working on a new version of Browsersync I ran into an interesting example where my Rust version was not faster than the original NodeJS implementation - in fact it was 2x slower! 😱
In this previous post I outlined how the Browsersync proxy works: it sends requests to the target URL with modified headers, then buffers the response body in memory before applying string replacements.
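To make that step concrete, here's a rough sketch of the buffer-then-rewrite idea described above - not the actual Browsersync code. It assumes hyper 1.x with http-body-util, and the function name and example replacement are mine:

```rust
use http_body_util::{BodyExt, Full};
use hyper::body::{Bytes, Incoming};
use hyper::Response;

// Rough sketch only: collect the upstream response body into memory,
// apply a string replacement, and rebuild the response.
async fn rewrite_body(
    upstream: Response<Incoming>,
) -> Result<Response<Full<Bytes>>, hyper::Error> {
    let (parts, body) = upstream.into_parts();

    // Buffer the entire body in memory (cheap for HTML, costly for big assets).
    let bytes = body.collect().await?.to_bytes();
    let html = String::from_utf8_lossy(&bytes);

    // The kind of replacement Browsersync performs, e.g. pointing absolute
    // URLs back at the local proxy. A real implementation would also fix up
    // Content-Length / Content-Encoding in `parts.headers`.
    let rewritten = html.replace("https://browsersync.io", "http://localhost:3000");

    Ok(Response::from_parts(parts, Full::new(Bytes::from(rewritten))))
}
```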
I had been following the idea of not spending too much time optimizing the Rust version whilst I was deep in development. Being liberal with `.clone()` and trying to avoid lifetimes where possible was proving very productive. I was getting features completed really quickly and feeling great about the overall direction.
The pressures of a public demo
Getting to feature parity on the proxy was an exciting milestone for me, since it proves out a lot of the new architectural patterns (more on this in a future post), and I was keen to make a demo.
So, imagine my horror when I spun up a simple example and it turned out to be much slower than the original NodeJS implementation I'd written years ago! 😱😱
So, did I just forget to run `cargo build` with `--release` like everyone does?
A common problem that people run into when benchmarking Rust programs (normally on Twitter when promoting an alternative) is that they forget to build in release mode! Even if you've been writing Rust for years, it's easy to forget. I think that's just because the local development story with Rust and cargo is so ergonomic that we get so used to running `cargo run` 🤣.
Regardless, in my case this alone didn't help!
So what exactly was slow?
In my demo I just wanted to show a `localhost` server that proxied all requests through to a live website (in the Browsersync style).
GET localhost:3000/ -> GET https://browsersync.io/
GET localhost:3000/css/core.css -> GET https://browsersync.io/css/core.css
... etc
When using https://browsersync.io/ you get a total of just 19 requests (when viewed in the browser).
This was hardly a stress-test, so why the crazy numbers?
| | baseline | proxy in node js | proxy in rust (slowest) |
|---|---|---|---|
| load time | 250-320ms | 780-900ms | 950ms-1.2sec |
| page weight | 368kb | 562kb | 368kb |
| compression | ✅ | ❌ | ✅ |
Something was wrong here. Even though I hadn't applied any specific optimisations yet, I had expected the Rust version, even in debug mode, to far out-perform the NodeJS implementation...
The Importance of Connection Pooling
The short version is that my implementation was creating a new HTTP client inside the main request handler! 🤣
So, for every single HTTP request that my application dealt with, it was spinning up a new client and connection pool, preparing things behind the scenes to re-use connections where possible - only for the handler to exit, tearing all of that down and then re-creating it immediately on the next request 😱.
No wonder the Rust version was even slower than the NodeJS version!
The problem
Since I was deep in prototyping mode and was thinking more about architecture than individual handlers, I let this slip by:
Imagine an HTTP server being started like this:
#[tokio::main]
async fn main() {
// snip
let app = Router::new().nest_service("/", any(handler));
// snip
}
That's a reasonable way to allow a single service to handle all routes under `/`. The problem is what occurs within `handler`:
pub async fn handler(req: Request) -> Result<Response, StatusCode> {
let https = HttpsConnector::new();
let client = Client::builder(TokioExecutor::new()).build(https);
// snip: modify the response object
Ok(client
.request(req)
.await
.map_err(|_| StatusCode::BAD_REQUEST)?
.into_response())
}
- This async function, `handler`, will be called for every request - HTML, CSS, JS, images, web sockets, everything!
- At the top of the function we create the HTTP client and its `https` connector
- But this happens over and over again with every incoming request
I was very confident this was the cause of the poor performance 🤣 - so it was just going to be a case of moving the client creation code out of the handler, and coming up with a way to access the client from within it.
The solution
Step 1) Ensure the client is not re-created on every request. This can be done by pulling it up into `main`:
#[tokio::main]
async fn main() {
+ let https = HttpsConnector::new();
+ let client = Client::builder(TokioExecutor::new()).build(https);
let app = Router::new().nest_service("/", any(handler));
// snip
}
Step 2) Next, we need to access that client from within the handler.
Luckily Axum has a really nice way to do this, known as extractors. Extractors provide a type-safe way to access various useful things from within your handler functions.
async fn handler(State(client): State<Client>, req: Request) -> Result<Response, StatusCode> {
// implementation as before.
}
- Notice how we added a new parameter, `State(client): State<Client>`
- This uses the `State` extractor, along with the fact that function parameters in Rust can also be patterns - giving direct access to the element inside the tuple struct (`client` in this case). There's a tiny standalone example of this just below.
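As a quick aside, here's that language feature in isolation - a toy example unrelated to Axum, with made-up names, showing a tuple struct being destructured directly in a parameter position:

```rust
// A tuple struct wrapping a single value (names here are illustrative only).
struct Wrapper(String);

// The parameter is a pattern: `Wrapper(inner)` destructures the argument,
// binding the wrapped String to `inner` - the same trick `State(client)` uses.
fn print_inner(Wrapper(inner): Wrapper) {
    println!("{inner}");
}

fn main() {
    print_inner(Wrapper(String::from("hello")));
}
```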
Step 3) Provide state to the Router
If we tried to run the code at this point, we'd get an error stating that our `handler` cannot be used with the Router configured in `main`.
This is the type-safe part - each handler contributes to the overall type of the outer Router - so to satisfy the type checker we need to call `.with_state(...)` with a type that satisfies all extractors.
We only have one of our own (we use `Request` too, but that's supported out of the box), so it's just a case of making this small change:
#[tokio::main]
async fn main() {
let https = HttpsConnector::new();
let client = Client::builder(TokioExecutor::new()).build(https);
- let app = Router::new().nest_service("/", any(handler));
+ let app = Router::new().nest_service("/", any(handler).with_state(client));
// snip
}
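Putting the three steps together, here's a minimal end-to-end sketch with the types the snippets above omit filled in. The `ProxyClient` alias, the `axum::serve` boilerplate and the dependency choices (axum 0.7, hyper-util, hyper-tls, tokio) are my assumptions for illustration - the linked demo in the notes is the real reference.

```rust
use axum::{
    body::Body,
    extract::{Request, State},
    http::StatusCode,
    response::{IntoResponse, Response},
    routing::any,
    Router,
};
use hyper_tls::HttpsConnector;
use hyper_util::{
    client::legacy::{connect::HttpConnector, Client},
    rt::TokioExecutor,
};

// The concrete client type: an HTTPS-capable connector driving hyper-util's
// pooled "legacy" client, with axum's `Body` as the outgoing request body.
type ProxyClient = Client<HttpsConnector<HttpConnector>, Body>;

#[tokio::main]
async fn main() {
    // Built once - the connection pool now lives for the whole program.
    let https = HttpsConnector::new();
    let client: ProxyClient = Client::builder(TokioExecutor::new()).build(https);

    // Every route goes through `handler`, which receives the shared client
    // via the `State` extractor.
    let app = Router::new().nest_service("/", any(handler).with_state(client));

    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn handler(
    State(client): State<ProxyClient>,
    req: Request,
) -> Result<Response, StatusCode> {
    // snip: rewrite the URI/headers for the proxy target
    Ok(client
        .request(req)
        .await
        .map_err(|_| StatusCode::BAD_REQUEST)?
        .into_response())
}
```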
With that change in place, HTTP connection pooling works as expected, and the numbers start to look a lot better :)
| | baseline | proxy in rust | proxy in rust, no pooling |
|---|---|---|---|
| load time | 250-320ms | 250-320ms | 950ms-1.2sec |
| page weight | 368kb | 368kb | 368kb |
| compression | ✅ | ✅ | ✅ |
Summary
The mistake highlighted in this post was a silly one, and was easily fixed - but it does show how easy it is to tank performance when dealing with network-bound workloads.
And there I was, thinking I just needed that simple `--release` flag 🤣
Notes:
- I omitted some types in the snippets above for brevity - have a look at the working demo I made here for details: https://github.com/shakyShane/t-stream/blob/main/https-proxy/src/main.rs
- There are other perf things I looked at too, like swapping `Mutex`es for `RwLock`s, but for the type of tool I'm building those are not going to produce much of a difference :)