Rust is famous for its “zero-cost abstractions,” but as I’ve learned through building several production-grade tools, zero-cost doesn’t mean “automatic performance.” You can still write slow Rust code if you aren’t mindful of how the compiler and hardware actually interact.
In this guide, I’m sharing my best tips for writing high-performance Rust code, drawn from my experience optimizing CLI tools and backend services. Whether you’re fighting the borrow checker or trying to squeeze every last cycle out of your CPU, these strategies will help you move the needle.
1. Prefer Slices over Owned Vectors in Function Signatures
One of the most common mistakes I see is passing &Vec<T> into functions. This adds a layer of indirection (a pointer to a pointer). Instead, use &[T]. Slices are more flexible because they can accept both vectors and fixed-size arrays without forcing an allocation.
```rust
// Avoid this: &Vec<T> adds a layer of indirection
fn process_data(data: &Vec<String>) { /* ... */ }

// Do this: &[T] accepts Vecs and arrays alike
fn process_data(data: &[String]) { /* ... */ }
```
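To make the flexibility claim concrete, here is a small illustrative example (the `total_len` function is hypothetical): the same slice-taking function works on a `Vec` and on a fixed-size array, with no conversion or allocation at the call site.

```rust
// Sums the lengths of all strings in a slice.
fn total_len(items: &[String]) -> usize {
    items.iter().map(|s| s.len()).sum()
}

fn main() {
    let owned: Vec<String> = vec!["a".into(), "bc".into()];
    let fixed: [String; 2] = ["d".into(), "ef".into()];

    // The same function accepts both, because both deref to &[String].
    assert_eq!(total_len(&owned), 3);
    assert_eq!(total_len(&fixed), 3);
}
```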
2. Minimize Heap Allocations with SmallVec or TinyVec
Frequent allocations are the silent killer of performance. If you have a vector that usually contains only 2-4 elements but occasionally grows, using SmallVec or TinyVec allows you to store those elements on the stack. This avoids the overhead of calling the allocator for the common case.
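The sketch below shows the core idea behind SmallVec using only the standard library: keep a few elements inline on the stack and spill to the heap only when the buffer grows past its inline capacity. This is a simplified illustration, not the real `smallvec` API — use the crate in production code.

```rust
// A toy small-buffer type: up to 4 elements live on the stack;
// the allocator is only touched once growth exceeds that.
enum SmallBuf {
    Inline { data: [i32; 4], len: usize },
    Heap(Vec<i32>),
}

impl SmallBuf {
    fn new() -> Self {
        SmallBuf::Inline { data: [0; 4], len: 0 }
    }

    fn push(&mut self, value: i32) {
        match self {
            SmallBuf::Heap(v) => v.push(value),
            SmallBuf::Inline { data, len } => {
                if *len < data.len() {
                    // Common case: store inline, zero allocations.
                    data[*len] = value;
                    *len += 1;
                } else {
                    // Rare case: spill to the heap with one allocation.
                    let mut v = data.to_vec();
                    v.push(value);
                    *self = SmallBuf::Heap(v);
                }
            }
        }
    }

    fn len(&self) -> usize {
        match self {
            SmallBuf::Inline { len, .. } => *len,
            SmallBuf::Heap(v) => v.len(),
        }
    }
}

fn main() {
    let mut buf = SmallBuf::new();
    for i in 0..6 {
        buf.push(i); // the first 4 pushes never allocate
    }
    assert_eq!(buf.len(), 6);
}
```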
3. Leverage the Power of Iterators
Newcomers often reach for for i in 0..len loops. However, Rust’s iterators are highly optimized and often eliminate bounds checking. I’ve found that switching from indexed loops to .iter().map().collect() patterns frequently results in tighter assembly code because the compiler can prove that indices will never be out of bounds.
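As a small illustration (both functions here are hypothetical names), the two versions below compute the same sum. In the indexed version the compiler must prove every `data[i]` is in bounds; the iterator version has no indices at all, which also tends to be friendlier to autovectorization.

```rust
// Indexed loop: each access carries a potential bounds check.
fn sum_indexed(data: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..data.len() {
        total += data[i];
    }
    total
}

// Iterator: no indices, so nothing to bounds-check.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}

fn main() {
    let data = vec![1, 2, 3, 4];
    assert_eq!(sum_indexed(&data), 10);
    assert_eq!(sum_iter(&data), 10);
}
```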
4. Avoid Unnecessary Cloning
The .clone() method is a convenient way to satisfy the borrow checker, but it can be expensive. Whenever I see too many clones in a PR, I look for ways to use references, or std::sync::Arc for shared ownership. Dig into Rust memory management and you’ll find that minimizing data movement is key to speed.
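A minimal sketch of the Arc approach: cloning an `Arc` copies a pointer and bumps a reference count, so a large buffer can be shared across threads without ever duplicating its contents.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Cloning this Vec directly would copy a million bytes.
    let big = Arc::new(vec![0u8; 1_000_000]);

    let mut handles = Vec::new();
    for _ in 0..4 {
        // Arc::clone is a cheap refcount bump, not a deep copy.
        let shared = Arc::clone(&big);
        handles.push(thread::spawn(move || shared.len()));
    }

    for h in handles {
        assert_eq!(h.join().unwrap(), 1_000_000);
    }
}
```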
5. Use Specialized Hashers for Small Keys
The default HashMap in Rust uses SipHash to prevent DoS attacks, but it’s relatively slow. If you are using a map for internal lookups where the keys are trusted, switching to rustc-hash or ahash can provide a 20-50% speedup for map-heavy workloads.
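The mechanism that makes this swap possible is that HashMap accepts any BuildHasher. Below is a minimal standard-library-only FNV-1a hasher to illustrate how crates like rustc-hash and ahash plug in; it is a teaching sketch, not a replacement for those crates (which are faster and better engineered).

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A minimal FNV-1a hasher, for illustration only.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        const PRIME: u64 = 0x100_0000_01b3; // FNV 64-bit prime
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(PRIME);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

// Same HashMap API, different hashing strategy.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut m: FastMap<u32, &str> = FastMap::default();
    m.insert(1, "one");
    assert_eq!(m.get(&1), Some(&"one"));
}
```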
6. Profile Before You Optimize
I used to spend hours optimizing a function only to realize it accounted for 0.1% of total execution time. Use criterion for micro-benchmarking and flamegraph to see where your program is actually spending its time. Targeting the widest frames in the flamegraph yields the highest ROI.
7. Opt In to LTO and Tune Codegen Units
For production releases, your Cargo.toml should be tuned. Enabling Link Time Optimization (LTO) allows the compiler to optimize across crate boundaries, which is essential for high-performance binaries.
```toml
[profile.release]
lto = true          # optimize across crate boundaries ("fat" LTO)
codegen-units = 1   # better optimization, at the cost of compile time
panic = 'abort'     # smaller binary, no unwinding machinery
```
8. Embrace SIMD via Autovectorization
While you can write explicit SIMD code using std::arch, the easiest way to get SIMD performance is to write “compiler-friendly” code. Use contiguous memory (arrays/slices) and avoid branching inside tight loops. This allows LLVM to autovectorize your code, processing multiple data points in a single CPU instruction.
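As an illustrative sketch (function names are mine), the two functions below compute the same result, but the second expresses the condition as arithmetic over a contiguous slice, a shape LLVM can often lower to SIMD mask/select instructions. Whether vectorization actually happens depends on the target CPU features and opt level, so verify with a profiler or by inspecting the assembly.

```rust
// Branchy: the `if` inside the tight loop can block vectorization.
fn sum_positive_branchy(data: &[i32]) -> i64 {
    let mut total = 0i64;
    for &x in data {
        if x > 0 {
            total += x as i64;
        }
    }
    total
}

// Branch-free: the condition becomes `max`, plain arithmetic
// over contiguous memory that autovectorizes more readily.
fn sum_positive_branchless(data: &[i32]) -> i64 {
    data.iter().map(|&x| x.max(0) as i64).sum()
}

fn main() {
    let data = [3, -1, 4, -1, 5];
    assert_eq!(sum_positive_branchy(&data), 12);
    assert_eq!(sum_positive_branchless(&data), 12);
}
```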
9. Understand the Cost of Dynamic Dispatch
dyn Trait (dynamic dispatch) involves a vtable lookup, which prevents the compiler from inlining functions. In my experience, switching from Box<dyn Trait> to generics (static dispatch) with impl Trait can lead to significant performance gains in hot paths.
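A minimal before/after sketch of that switch (the `Shape` trait and helpers are hypothetical): the `dyn` version calls through a vtable on every iteration, while the generic version is monomorphized per concrete type, so the method body can be inlined into the loop.

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

// Dynamic dispatch: every `area` call goes through a vtable,
// which prevents inlining.
fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

// Static dispatch: monomorphized for each concrete S,
// so `area` can be inlined into the hot loop.
fn total_area_static<S: Shape>(shapes: &[S]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let boxed: Vec<Box<dyn Shape>> = vec![Box::new(Square(2.0)), Box::new(Square(3.0))];
    let plain = vec![Square(2.0), Square(3.0)];
    assert_eq!(total_area_dyn(&boxed), 13.0);
    assert_eq!(total_area_static(&plain), 13.0);
}
```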
If you’re curious about how this stacks up against other languages, check out my Rust vs. Go performance comparison to see the impact of these low-level choices.
10. Use the Right Collection for the Job
Not everything should be a HashMap. If your keys are small contiguous integers, indexing into a Vec is almost always faster. If you need a set but the data is small, a sorted Vec with binary search can often outperform a HashSet thanks to cache locality.
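A quick sketch of both substitutions, assuming dense integer keys and a small lookup set:

```rust
use std::collections::HashMap;

fn main() {
    // Dense integer keys 0..n: indexing a Vec skips hashing entirely.
    let scores_map: HashMap<usize, u32> = (0..4).map(|i| (i, (i * 10) as u32)).collect();
    let scores_vec: Vec<u32> = (0..4).map(|i| (i * 10) as u32).collect();
    assert_eq!(scores_map[&2], scores_vec[2]);

    // Small "set": a sorted Vec + binary_search is cache-friendly.
    let mut small_set = vec![30, 10, 20];
    small_set.sort_unstable();
    assert!(small_set.binary_search(&20).is_ok());
    assert!(small_set.binary_search(&25).is_err());
}
```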
Common Performance Mistakes
- Overusing Arc<Mutex<T>>: Contention on a single mutex can bottleneck a multi-threaded app. Use `dashmap` or atomic types where possible.
- Ignoring Cache Misses: Processing data in a non-linear fashion forces the CPU to fetch from RAM instead of L1/L2 cache. Always prefer linear memory layouts.
- String Concatenation in Loops: Using `format!` inside a loop creates a new `String` every time. Use a single `String` and `push_str` or `write!` into a buffer.
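To illustrate the last mistake, here is a small sketch: the slow version allocates a fresh String on every iteration, while the fast version appends into one reusable buffer with `write!`.

```rust
use std::fmt::Write;

fn main() {
    let items = ["a", "b", "c"];

    // Slow: format! builds a brand-new String each iteration.
    let mut slow = String::new();
    for item in &items {
        slow = format!("{}{},", slow, item);
    }

    // Fast: one buffer, appended to in place.
    let mut fast = String::with_capacity(items.len() * 2);
    for item in &items {
        write!(fast, "{},", item).unwrap();
    }

    assert_eq!(slow, fast);
    assert_eq!(fast, "a,b,c,");
}
```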
Measuring Success
The only way to know if these tips worked is through measurement. I recommend this workflow:
- Establish a baseline with `cargo bench`.
- Identify the bottleneck using `cargo-flamegraph`.
- Apply one of the tips above.
- Verify the improvement and check for regressions.
Ready to scale your Rust apps? Subscribe to my newsletter for more deep dives into automation and system performance.