16, 8, and 4-bit Floating Point Formats: How Does It Work? | by Dmitrii Eliuseev | Sep, 2023

For 50 years, from the time of Kernighan, Ritchie, and the first edition of their C Language book, it was known that a single-precision “float” type has a 32-bit size and a double-precision type has 64 bits. There was also an 80-bit “long double” type with extended precision, and together these types covered almost all the needs of floating-point data processing. However, during the last few years, the advent of large neural network models has required developers to move to the other end of the spectrum and shrink floating-point types as much as possible.
Honestly, I was surprised when I discovered that a 4-bit floating-point format exists. How on Earth can that be possible? The best way to find out is to test it on our own. In this article, we will explore the most popular floating-point formats, make a simple neural network, and see how it works.
Let’s get started.
A “Standard” 32-bit Floating Point
Before going into “extreme” formats, let’s recall a standard one. The IEEE 754 standard for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical number in the 32-bit float type looks like this:
Here, the first bit is the sign, the next 8 bits represent the exponent, and the last 23 bits represent the mantissa. The final value is calculated using the formula:
This simple helper function allows us to print a floating-point value in binary form:
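For normalized numbers, the IEEE 754 single-precision value is commonly written as below. The symbols $s$, $e$, and $b_i$ are notation chosen here for the sign bit, the raw 8-bit exponent, and the mantissa bits (with $b_{22}$ the most significant):

```latex
\text{value} = (-1)^{s} \times 2^{\,e - 127} \times \left(1 + \sum_{i=1}^{23} b_{23-i}\, 2^{-i}\right)
```

As a worked example, 0.15625 has $s = 0$, $e = 124$, and only $b_{21}$ set, which gives $2^{-3} \times (1 + 2^{-2}) = 0.125 \times 1.25 = 0.15625$.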
import struct

def print_float32(val: float):
    """ Print Float32 in a binary form """
    m = struct.unpack('I', struct.pack('f', val))[0]
    return format(m, 'b').zfill(32)

print_float32(0.15625)
# > 00111110001000000000000000000000
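To make the three fields visible, the same bit pattern can be split with shifts and masks. This is a minimal sketch: `float32_fields` is a hypothetical helper name that does not appear in the article, and the reassembly line assumes a normalized number (no zeros, subnormals, infinities, or NaNs).

```python
import struct

def float32_fields(val: float):
    """Split a float32 into (sign, raw exponent, mantissa) integers."""
    # '<f' and '<I' pin the byte order so pack and unpack agree
    bits = struct.unpack('<I', struct.pack('<f', val))[0]
    sign = bits >> 31                    # top bit
    exponent_raw = (bits >> 23) & 0xFF   # next 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF           # low 23 bits
    return sign, exponent_raw, mantissa

sign, exponent_raw, mantissa = float32_fields(0.15625)
print(sign, format(exponent_raw, '08b'), format(mantissa, '023b'))
# > 0 01111100 01000000000000000000000

# Reassemble the value by hand (normalized numbers only)
value = (-1) ** sign * 2.0 ** (exponent_raw - 127) * (1 + mantissa / 2 ** 23)
print(value)
# > 0.15625
```

The raw exponent 124 corresponds to 124 - 127 = -3, and the single set mantissa bit contributes 0.25, so the result is 0.125 × 1.25.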
Let’s also make another helper for backward conversion, which will be useful later:
def ieee_754_conversion(sign, exponent_raw, mantissa, exp_len=8, mant_len=23):
    """ Convert binary data into the floating point value """
    sign_mult = -1 if sign == 1 else 1
    # Remove the exponent bias (127 for an 8-bit exponent)
    exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
    mant_mult = 1
    # Walk the mantissa bits from most to least significant
    for b in range(mant_len - 1, -1, -1):
        if mantissa & (2 **…