Unsafe can't guarantee performance improvements, but you can sometimes get performance boosts using it.
It should be a last resort, for when you have a bottleneck that results from memory-safety checks.
It's also how FFI works. If you're calling a C library, you can no longer guarantee memory safety, since the library is outside of Rust's domain, so the call needs to be unsafe.
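As a minimal sketch of the FFI point: calling any function declared in an `extern "C"` block requires an unsafe block, because the compiler cannot verify what the foreign code does with memory. Here the foreign function is `abs` from the C standard library, which links without any extra setup:

```rust
// Declaration of a function provided by the C standard library.
extern "C" {
    fn abs(input: i32) -> i32;
}

fn main() {
    // SAFETY: `abs` from libc is safe to call with any i32
    // except i32::MIN, whose result is undefined in C.
    let result = unsafe { abs(-42) };
    assert_eq!(result, 42);
    println!("{result}");
}
```

Note that the unsafety isn't about `abs` being dangerous; it's that Rust has no way to check the foreign declaration matches the actual C symbol.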
> Unsafe can't ensure performance improvements, but you can sometimes get performance boosts using it.
Maybe to back this up with a concrete example: let's say I'm parsing something and I have a &[u8] (a slice of bytes) which I have already verified contains only ASCII digits, and I want to get the numerical value of that number. I don't want to write my own number-parsing code, but std only has a parsing function for &str (UTF-8 strings), not for &[u8]. I could do
```rust
let string = str::from_utf8(bytes)?;
let value: u64 = string.parse()?;
```
But then str::from_utf8 would iterate over the byte slice again to verify that it's valid UTF-8. This check is useless because I already know the slice only contains ASCII digits, which are always valid UTF-8. So in this case, I can improve performance with an unsafe block:
```rust
// SAFETY: `bytes` was already proven to only contain ASCII digits
let string = unsafe { str::from_utf8_unchecked(bytes) };
let value: u64 = string.parse()?;
```
The performance gain comes not from unsafe per se, but from using a different function that skips the UTF-8 check. Since that function cannot guarantee on its own that str's invariant (valid UTF-8) is upheld, it's marked unsafe.
The intention is to extract as much valid &str from the given buf as possible, and only fail if there is no more valid &str to read from the buf. Unfortunately std::str::from_utf8's error does not also yield a &str corresponding to the part that did parse successfully ( * ), so I have to compute it myself. Using std::str::from_utf8 for this second pass unfortunately ends up verifying the UTF-8-ness of the buf again, as confirmed in the generated assembly, because the compiler isn't sufficiently smart to elide the redundant check. Similarly, slicing buf with valid_up_to safely also reruns the bounds check, which is why the code uses get_unchecked for that instead.
( * ) Nothing prevents it from doing that, and one of these days I might get around to PRing it.
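The valid_up_to approach described above can be sketched as follows; `valid_prefix` is a hypothetical helper name, and for simplicity this version uses an ordinary (bounds-checked) slice rather than get_unchecked:

```rust
use std::str;

/// Extract the longest valid UTF-8 prefix of `buf` as a &str.
fn valid_prefix(buf: &[u8]) -> &str {
    match str::from_utf8(buf) {
        Ok(s) => s,
        Err(e) => {
            // `valid_up_to()` is the length of the longest prefix
            // of `buf` that is valid UTF-8.
            let valid = &buf[..e.valid_up_to()];
            // SAFETY: everything before `valid_up_to()` was already
            // verified to be valid UTF-8 by `from_utf8` itself.
            unsafe { str::from_utf8_unchecked(valid) }
        }
    }
}

fn main() {
    // 0xFF is never valid in UTF-8, so parsing stops there.
    let buf = b"hello\xFFworld";
    assert_eq!(valid_prefix(buf), "hello");
    println!("{}", valid_prefix(buf));
}
```

The second `from_utf8` pass mentioned above is avoided here because the error value from the first pass already carries the proof, which is exactly why re-verifying in safe code is redundant work.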
Isn't this a bit fragile though? It's totally fine as long as you own and understand the code. But what's to stop someone new and/or inexperienced from coming along and violating the 'already proven' claim? How will you know?
Isn't there a risk of Rust code having "Don't ever change this" comments?
That's a general problem of coding, that the next edit can violate an invariant. The advantage with Rust is that, for several types of invariants, you only need to check places with an unsafe block during code review or audits. Everywhere else, the compiler upholds these invariants for you.
Why not lift the encoding to the type system? The type system could remember which strings are ASCII and which are UTF-8. More generally, why not lift such proofs to types?
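One way to do exactly that is a newtype whose constructor is the only place the invariant is checked; the name `AsciiDigits` here is a made-up illustration, not a std type:

```rust
/// A byte slice proven to contain only ASCII digits.
/// The field is private, so the only way to construct one
/// is through `new`, which checks the invariant.
struct AsciiDigits<'a>(&'a [u8]);

impl<'a> AsciiDigits<'a> {
    fn new(bytes: &'a [u8]) -> Option<Self> {
        if bytes.iter().all(|b| b.is_ascii_digit()) {
            Some(AsciiDigits(bytes))
        } else {
            None
        }
    }

    fn as_str(&self) -> &str {
        // SAFETY: the constructor only accepts ASCII digits,
        // and ASCII is always valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(self.0) }
    }
}

fn main() {
    let digits = AsciiDigits::new(b"12345").expect("all ASCII digits");
    let value: u64 = digits.as_str().parse().unwrap();
    assert_eq!(value, 12345);
}
```

This addresses the "fragile comment" concern above: instead of a "don't ever change this" note, the proof lives in one constructor, and reviewers only need to audit this one unsafe block no matter how many call sites exist.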