You are misunderstanding me. Values less than 128 take only one byte.
The obvious fast-path implementation is:
1. Check that the current position is at least 8 bytes from the end of the buffer. Otherwise, go to the short-buffer slow path.
2. Count the number of leading 1s (or 0s) using BSR/LZCNT.
3. If the number of leading 1s (or 0s) indicates more than 8 bytes are necessary, use the multi-word slow path.
4. Shift and mask out your value.
5. Increment your position pointer by the length of your varint.
Values less than 128 will have no leading 1s (or no leading 0s, if your scheme uses leading 0s instead). The shift value is 56 bits. Unless you have a 1-byte optimized path, your calculated mask value is (((uint64_t)-1LL) >> 57). The position pointer is incremented by one byte at the end of the decode.
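To make the five steps concrete, here's a rough C sketch of the fast path for the leading-1s variant. It's only an illustration under my own assumptions: the names varint_decode_fast and leading_ones are made up, the 8 bytes are loaded big-endian so the prefix lands in the top bits, and the portable loop in leading_ones is a stand-in for what would really be a single BSR/LZCNT on the inverted word. A return value of 0 means "take one of the slow paths", which aren't shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count leading 1 bits. A real fast path would use
 * BSR/LZCNT on ~w; this loop is a portable stand-in. */
static unsigned leading_ones(uint64_t w)
{
    unsigned n = 0;
    while (n < 64 && (w & (UINT64_C(1) << 63))) {
        w <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical fast path. Returns the number of bytes consumed, or 0 if
 * the caller has to fall back to a slow path (not shown here). */
static size_t varint_decode_fast(const uint8_t *pos, const uint8_t *end,
                                 uint64_t *out)
{
    assert(pos <= end);

    /* Step 1: need 8 readable bytes, otherwise short-buffer slow path. */
    if ((size_t)(end - pos) < 8)
        return 0;

    /* Load 8 bytes big-endian so the length prefix sits in the top bits. */
    uint64_t word = 0;
    for (int i = 0; i < 8; i++)
        word = (word << 8) | pos[i];

    /* Step 2: n leading 1s means the varint is n + 1 bytes long. */
    unsigned len = leading_ones(word) + 1;

    /* Step 3: more than 8 bytes means multi-word slow path. */
    if (len > 8)
        return 0;

    /* Step 4: shift the varint down and mask off the prefix.
     * For len == 1 this is a shift of 56 and a mask of ~0 >> 57. */
    unsigned shift = 64 - 8 * len;
    uint64_t mask = UINT64_MAX >> (64 - 7 * len);
    *out = (word >> shift) & mask;

    /* Step 5: the caller advances its position pointer by this much. */
    return len;
}
```

The caller would do something like pos += n when n is nonzero. For a 1-byte value, len is 1, so the shift is 56 and the mask is 7 bits wide, matching the numbers above.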
Now that I think about it, for the uint128_t encoding case, you actually need to inspect up to the first 3 bytes. I've only implemented this for encoding/decoding uint64_ts, in which case there's a little trick that saves one byte, but the trick only works because 64 % 7 == 1.
Maybe it's clearest if I spell out the 19 cases necessary to encode any uint128_t (for a system using BSR (leading 1s) instead of LZCNT (leading 0s)).
0 xxxxxxx -> 0 to 127
10 xxxxxx yyyyyyyy -> 128 to 2**14 - 1
110 xxxxx yyyyyyyy zzzzzzzz -> 2**14 to 2**21-1
1110 xxxx yyyyyyyy zzzzzzzz aaaaaaaa -> 2**21 to 2**28-1
11110 xxx yyyyyyyy zzzzzzzz ... 2 more bytes -> 2**28 to 2**35-1
111110 xx yyyyyyyy zzzzzzzz ... 3 more bytes -> 2**35 to 2**42-1
1111110 x yyyyyyyy zzzzzzzz ... 4 more bytes -> 2**42 to 2**49-1
11111110 yyyyyyyy zzzzzzzz ... 5 more bytes -> 2**49 to 2**56-1
11111111 0 yyyyyyy zzzzzzzz ... 6 more bytes -> 2**56 to 2**63-1
11111111 10 yyyyyy zzzzzzzz ... 7 more bytes -> 2**63 to 2**70-1
11111111 110 yyyyy zzzzzzzz ... 8 more bytes -> 2**70 to 2**77-1
11111111 1110 yyyy zzzzzzzz ... 9 more bytes -> 2**77 to 2**84-1
11111111 11110 yyy zzzzzzzz ... 10 more bytes -> 2**84 to 2**91-1
11111111 111110 yy zzzzzzzz ... 11 more bytes -> 2**91 to 2**98-1
11111111 1111110 y zzzzzzzz ... 12 more bytes -> 2**98 to 2**105-1
11111111 11111110 zzzzzzzz ... 13 more bytes -> 2**105 to 2**112-1
11111111 11111111 0 zzzzzzz ... 14 more bytes -> 2**112 to 2**119-1
11111111 11111111 10 zzzzzz ... 15 more bytes -> 2**119 to 2**126-1
11111111 11111111 110 zzzzz ... 16 more bytes -> 2**126 to 2**133-1
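If it helps to see the table as code, here's a small C sketch of the encoding side for uint64_t inputs. Again, this is my own illustration: varint_len and varint_encode_small are made-up names, the bytes are written big-endian to match the layout above, and the encoder only handles the first 8 rows (values below 2**56). It follows the table literally, so it does not use the byte-saving uint64_t trick I mentioned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encoded length in bytes per the table, where an
 * n-byte varint carries 7 * n payload bits. */
static size_t varint_len(uint64_t v)
{
    unsigned bits = 0;
    while (v) {
        bits++;
        v >>= 1;
    }
    if (bits == 0)
        bits = 1;               /* zero still occupies one byte */
    return (bits + 6) / 7;      /* ceil(bits / 7) */
}

/* Hypothetical encoder for values below 2**56 (at most 8 bytes).
 * Returns the number of bytes written, or 0 if the value needs one of
 * the longer rows of the table, which this sketch does not handle. */
static size_t varint_encode_small(uint64_t v, uint8_t *out)
{
    assert(out != NULL);

    size_t len = varint_len(v);
    if (len > 8)
        return 0;

    /* The first byte starts with (len - 1) ones followed by a zero. */
    uint8_t prefix = (uint8_t)(0xFFu << (9 - len));
    uint64_t tagged = v | ((uint64_t)prefix << (8 * (len - 1)));

    /* Emit the low len bytes of the tagged value, most significant first. */
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(tagged >> (8 * (len - 1 - i)));
    return len;
}
```

So 5 comes out as the single byte 0x05, and 300 comes out as 0x81 0x2C (prefix 10, then the 14-bit payload).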
No, because values under 128 (very common in my data) take 2 bytes instead of 1. Unless I'm misunderstanding you.