# 16, 8, and 4-bit Floating Point Formats: How Does It Work? | by Dmitrii Eliuseev | Sep, 2023

## Let’s go into bits and bytes

For 50 years, from the time of Kernighan, Ritchie, and the first edition of their C Language book, it was known that a single-precision “float” type has a 32-bit size and a double-precision type has 64 bits. There was also an 80-bit “long double” type with extended precision, and together these types covered almost all the needs of floating-point data processing. However, during the last few years, the advent of large neural network models has required developers to move to the other end of the spectrum and to shrink floating point types as much as possible.

Honestly, I was surprised when I discovered that a 4-bit floating-point format exists. How on Earth can that be possible? The best way to find out is to test it ourselves. In this article, we’ll explore the most popular floating point formats, make a simple neural network, and see how it works.

Let’s get began.

## A “Standard” 32-bit Floating Point

Before going into “extreme” formats, let’s recall a standard one. The IEEE 754 standard for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical number in the 32-bit float type looks like this:

`[sign: 1 bit][exponent: 8 bits][mantissa: 23 bits]`

Here, the first bit is the sign, the next 8 bits represent the exponent, and the last 23 bits represent the mantissa. The final value is calculated using the formula:
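For normal values (not zero, subnormal, infinity, or NaN), the standard IEEE 754 single-precision formula can be written as below. The symbol names are chosen here for illustration: $b_{22}$ is the most significant mantissa bit and $b_0$ the least significant one.

$$
\text{value} = (-1)^{\text{sign}} \times 2^{\,\text{exponent} - 127} \times \left(1 + \sum_{i=1}^{23} b_{23-i}\, 2^{-i}\right)
$$

As a worked example, take the bit pattern `0 01111100 01000000000000000000000`. The sign is 0, the exponent field is $01111100_2 = 124$, and the only set mantissa bit is $b_{21}$, which contributes $2^{-2}$:

$$
\text{value} = (-1)^{0} \times 2^{124 - 127} \times \left(1 + 2^{-2}\right) = 0.125 \times 1.25 = 0.15625
$$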

This simple helper function allows us to print a floating point value in binary form:

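```python
import struct


def print_float32(val: float):
    """ Print Float32 in a binary form """
    # Pack the value as a 4-byte float, then reinterpret those bytes as an
    # unsigned 32-bit integer; unpack() returns a tuple, so take element 0
    m = struct.unpack('I', struct.pack('f', val))[0]
    return format(m, 'b').zfill(32)


print_float32(0.15625)
# > 00111110001000000000000000000000
```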
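To connect the binary string with the sign, exponent, and mantissa fields, the same bit pattern can be split with shifts and masks. This is only a minimal sketch: `split_float32` is a hypothetical helper name, not part of the original article, and the reconstruction at the end is valid for normal numbers only (no zeros, subnormals, infinities, or NaNs).

```python
import struct


def split_float32(val: float):
    """Return (sign, exponent_raw, mantissa) bit fields of a float32 (hypothetical helper)."""
    # Reinterpret the 4 bytes of the float32 as an unsigned 32-bit integer
    bits = struct.unpack('<I', struct.pack('<f', val))[0]
    sign = bits >> 31                    # top bit
    exponent_raw = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF           # low 23 bits
    return sign, exponent_raw, mantissa


sign, exponent_raw, mantissa = split_float32(0.15625)
print(sign, format(exponent_raw, '08b'), format(mantissa, '023b'))
# > 0 01111100 01000000000000000000000

# Rebuild the value of a normal number:
# (-1)^sign * 2^(exponent_raw - 127) * (1 + mantissa / 2^23)
value = (-1) ** sign * 2.0 ** (exponent_raw - 127) * (1 + mantissa / 2 ** 23)
print(value)
# > 0.15625
```

Dividing the 23-bit mantissa by 2^23 is the same as summing the individual bit weights 2^-1, 2^-2, and so on, just more compact.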

Let’s also make another helper for backward conversion, which will be useful later:

```python
def ieee_754_conversion(sign, exponent_raw, mantissa, exp_len=8, mant_len=23):
    """ Convert binary data into the floating point value """
    sign_mult = -1 if sign == 1 else 1
    # Remove the exponent bias (127 for an 8-bit exponent)
    exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
    mant_mult = 1
    for b in range(mant_len - 1, -1, -1):
        if mantissa & (2 **…
```