Explaining 32-Bit Float Audio

When it comes to recording microphones, 24-bit memory is more than you’ll ever need.

Max Can't Help It!
7 min read · Sep 7, 2023

I’ve been on a mission to show prospective buyers of audio recording equipment that 32-bit float does not prevent clipping or in any way improve the fidelity of microphones recorded in 24-bit fixed.

32-bit float has a benefit and a trade-off, both worth keeping in mind after you record an analog source. I’ll get to that later.

This story is for anyone who tried to learn 32-bit float, but ended up more confused than ever. I hope to explain it in a way I wish someone had explained it to me. I’m no math expert! (So if I’ve made a mistake, please let me know in the comments.)

What makes 32-bit float a headache is its use of A) bit-math B) fractions and C) exponents — together.

To help you understand the fundamentals of 32-bit float, I’m going to create a 4-bit version, without bit-math, fractions or large exponents.

Imagine we have a restaurant. Even though we don’t know a fraction from an exponent, we have a line out the door. Someone built us a computer to keep track of that line. We are told it has 4-bits of memory.

It was explained that we can work with a line of up to 16 people in 4-bit.

What’s important is we can count up to 16. Which is fine, because our line never goes above 8. (That only fills up 3-bits. HA! We ain’t so dumb!)

Then we move to a new restaurant with more space for people to line up. In the first restaurant, there were only 8 feet from the front door to the curb. In our new restaurant, there are 64 feet between our door and the curb.

If we want to reference people on a line from 1 to 64, we’d need a 6-bit computer, since 6 bits can tell apart 64 different values. We’re told the computer would be expensive and slow.
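Here is a minimal sketch of that bit counting, in Python. The function name `bits_needed` is my own invention for illustration; it just asks how many bits it takes to give every position in the line its own label.

```python
def bits_needed(distinct_values: int) -> int:
    """Smallest number of bits that can label `distinct_values` different things."""
    # Labels run from 0 to distinct_values - 1, so we size for the largest label.
    return max(1, (distinct_values - 1).bit_length())


for line_length in (8, 16, 64):
    print(line_length, "positions need", bits_needed(line_length), "bits")
```

So the old line of 8 fits in 3 bits, a line of 16 fills all 4 bits, and 64 positions would take 6.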

Is there another way?

Thinking about it, we realized we were more interested in where each person is in those 64 feet than in the exact number of people, which we expected would remain at 8. We went back to the computer person.

They said okay, if we’ll always have 8 people or fewer and it’s more important to know where in those 64 feet they stood, we could change how we used our 4-bits. We’d keep our 3-bits for the count of people, 8, but use the 4th bit to pick an exponent that would extend our system, its scale, out to 64 feet.

We could create a new scale by raising our 3-bit value to an exponent of 1 or 2, chosen by the 4th bit.

HUH?!

Please bear with me.

Every number would be calculated by taking one of our 3-bit numbers and applying one of the exponents. So, for the first of our two exponents, that is, 1…

1¹, 2¹, 3¹, 4¹, 5¹, 6¹, 7¹ and 8¹

OR

1,2,3,4,5,6,7,8

Then for our second exponent 2…

1²,2²,3²,4²,5²,6²,7²,8²

OR

1,4,9,16,25,36,49 and 64

Putting them all together, from our min to our max, our computer can work with these numbers using our 4-bit exponential format:

1,1,2,3,4,4,5,6,7,8,9,16,25,36,49,64

Each one is generated by a 3-bit number from 1 to 8 raised to an exponent of 1 or 2.
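If you’d like to see the toy format built by a computer, here is a small Python sketch. It is only my restaurant example written out, not how real 32-bit float works, and the name `toy_values` is made up for this post.

```python
def toy_values():
    """All 16 numbers our 4-bit toy format can produce, from smallest to largest."""
    values = []
    for exponent in (1, 2):          # the 4th bit picks the exponent
        for count in range(1, 9):    # the other 3 bits pick a count from 1 to 8
            values.append(count ** exponent)
    return sorted(values)


values = toy_values()
print(values)
print(len(values), "codes, but only", len(set(values)), "distinct numbers")
```

Sixteen codes go in, but because 1 and 4 each show up twice, fewer distinct numbers come out.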

You’ll notice that those numbers have duplicates and gaps. They’re imprecise!

There are similar imprecisions in 32-bit float too, but there they’re super, super small. Nonetheless, the fact remains: 32-bit float is imprecise.
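To put a number on “super, super small”, here is a rough Python sketch that estimates the gap between one 32-bit float value and the next one above it. I’m assuming the standard IEEE 754 single-precision layout (23 stored mantissa bits) and only positive, ordinary-sized numbers; `float32_gap` is a name I made up.

```python
import math


def float32_gap(x: float) -> float:
    """Gap between x and the next larger 32-bit float (positive, normal-range x only)."""
    # frexp writes x as m * 2**e with 0.5 <= m < 1.
    _, e = math.frexp(x)
    # 24 significant bits (1 implied + 23 stored) means the last bit is worth 2**(e - 24).
    return 2.0 ** (e - 24)


for x in (1.0, 1600.0, 16_777_216.0):
    print(x, "-> gap of", float32_gap(x))
```

The gap is tiny near 1, and it grows as the numbers get bigger. By the time you reach 16,777,216 the values are 2 apart.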

That’s okay because we’re trading SOME PRECISION for GREATER SCALE.

Because we are more interested in the scale than the precision, we went with an exponential way of working with the wait list. We could go to the 49th foot and most likely find our 7th guest (assuming they’re spread out). We could find our 6th guest around 36 feet. We could find our last guest at our 64th foot.

Also, we can easily go back to 8 precise values should we want them. We just reverse the exponent by taking the root (the square root in this case, since we use 2 as an exponent). Fun formulas below!
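As a quick sketch of that reversal in Python (again, just the toy restaurant format, with variable names of my own choosing): square the counts to get positions in feet, then take the square root to get the counts back.

```python
import math

counts = list(range(1, 9))                          # our 8 guests
feet = [count ** 2 for count in counts]             # scaled out to 64 feet
recovered = [math.isqrt(position) for position in feet]  # undo the exponent

print(feet)
print(recovered)
```

Nothing is lost on the round trip: the 8 original counts come back exactly.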

32-bit float is only different in the size of numbers it deals with.

It creates numbers so large that we need to represent them as large fractions to make them usable. 32-bit float uses 23 bits for the mantissa. That’s a range of 0 to 8,388,607 in decimal.

But the numbers it’s really using are fractional, from 1 to 1.9999998807907104. Scary, right? The first bit in the mantissa is 1 (it is implied rather than stored). The 23 stored bits create the fractional part, the 0 to .9999998807907104 (remember, when the exponents are applied there will be small gaps in the range).

How does it create those numbers? Let’s say you want to convert the address of the White House to 32-bit float. You move up your decimal place from 1600 to 1.600. (This is done in binary, so this is oversimplifying, and it doesn’t explain how you would do the number 9.) Anyway, the 1.600 becomes your “mantissa”.

In a sense, 32-bit float is using the EXACT same numbers used in 24-bit, except it converts them to decimal numbers between 0 and 2 first.
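For the curious, here is a hedged Python sketch of what the computer actually stores for 1600, assuming the standard IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits). The helper name `decompose_float32` is mine, and it only handles ordinary (normal) numbers.

```python
import struct


def decompose_float32(x: float):
    """Split a normal 32-bit float into (sign, power-of-two exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32 bits as an integer
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # stored with an offset of 127
    stored_fraction = bits & 0x7FFFFF        # the 23 stored mantissa bits
    mantissa = 1 + stored_fraction / 2 ** 23  # add back the implied leading 1
    return sign, exponent, mantissa


sign, exponent, mantissa = decompose_float32(1600.0)
print(sign, exponent, mantissa)
print(mantissa * 2 ** exponent)
```

In binary the decimal point moves by powers of 2 rather than powers of 10, so the mantissa comes out as 1.5625 with an exponent of 10 (1.5625 × 1024 = 1600) instead of my simplified 1.600.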

Why do this?

I know this stuff is hard (at least for me), but when I grasped this concept it made my morning!

In 24-bit fixed memory, I can count anything up to 16,777,216. But what if I’m an investment banker and my client has 160,777,216 dollars? What if I must work with large numbers? He would have about 144 million more dollars than my memory can hold.

32-bit float allows me to trade a small amount of precision for a greater range by essentially shrinking my number down (from 0 to 16,777,216 down to something like 0.0000001 to 1.9999999), then telling the computer how it can expand it back using an exponent.

So that number becomes 1.60777216 times 10 to an exponent of 8 (the number of decimal places I moved to the left).
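Here is a small Python sketch of that trade, assuming standard IEEE 754 single precision. The helper `to_float32` is my own; it squeezes a number through 32-bit float storage and hands back whatever survived.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest value a 32-bit float can hold."""
    return struct.unpack("f", struct.pack("f", x))[0]


balance = 160_777_216
print(f"{balance:.8e}")             # the same number in scientific notation
print(to_float32(160_777_216.0))    # fits, even though it is far beyond 16,777,216
print(to_float32(160_777_217.0))    # one dollar more does not survive
```

The big balance fits, which is the greater range. But add a single dollar and it rounds away, because up at that size the representable values are 16 apart. That’s the precision we gave up.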

The imprecisions are so minuscule they rarely affect us in our day-to-day work. What has to be borne in mind is that if we remove the imprecisions we get back to the same amount of discrete data in our 24-bit memory.

That’s a hard thing for me to grasp too!

My way of thinking of it is this: say we have 1, 2, 3 in our 24-bit memory, and 1, 3, 3.5, 4, 5 in our 32-bit float memory. We can create a 3.5 representation in our 32-bit memory and it will show up as 3.5 if we ask for it. BUT, and this is the big BUT: if we subtract it from 4, let’s say, we will not get 0.5. The reason is that the .5 is NOT scaled the same way the other numbers are.

When we’re working with microphone data (to get back to my motivation to learn all this stuff) we need each measure of amplitude to be precisely relative to the ones above and below it. So if we have 100 millivolts, it needs to be exactly 1 millivolt above 99 and 1 below 101.

All distances between values must be precise — it’s the very definition of true fidelity. If we don’t remove those imperfections we end up with distortion!

In 32-bit float, there are numbers like 100.5 which will not calculate as 0.5 away from 100. Therefore, they cannot be counted on (excuse the pun). If we remove all the imprecise numbers in 32-bit float we get back to 99,100 and 101.

Another way to intuit these imprecisions, or “gaps”, is to think of a pizza cut into 3rds. If we have a pizza in our restaurant that’s 10 inches, a 3rd would be 3.333333… inches. If we call it 3.3 then we’re 0.033333333… imprecise. That would show up as a gap if we did something as simple as addition or subtraction.

For example, if we subtracted a 3rd of the pizza from 5 we’d get 5 minus (10/3); that is, 1.6666666666… We’d have to truncate somewhere, like 1.66. What if later on we ask the computer to subtract 1.66 from the exact answer (5/3)? We’d get 0.006666… instead of 0.
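If you want to check the pizza arithmetic yourself, here is a minimal Python sketch using exact fractions, so that Python’s own rounding doesn’t muddy the picture. The variable names are just mine.

```python
from fractions import Fraction

exact_third = Fraction(10, 3)      # a 3rd of a 10-inch pizza, kept exact
rounded_third = Fraction(33, 10)   # what we get if we call it 3.3

error_per_slice = exact_third - rounded_third
missing_pizza = 10 - 3 * rounded_third  # add three rounded slices back together

print(error_per_slice, "=", float(error_per_slice))
print(missing_pizza, "=", float(missing_pizza))
```

Each rounded slice is off by 1/30 of an inch, and when we put three of them back together a tenth of an inch of pizza has gone missing. That’s the gap.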

Not clear? Don’t worry. All you really need to understand is there is a trade-off.

Whenever we work with fractions we will encounter some, like 1/3 or 11/10 (numbers that are difficult to calculate using exponents), that can’t be written out with total precision the way 1, 2, 3… etc. can.
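Here is a hedged Python sketch of that, assuming standard IEEE 754 single precision. It stores a few fractions as 32-bit floats and then checks whether what came back is exactly what went in. `to_float32` and `stored_exactly` are names I made up.

```python
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round x to the nearest value a 32-bit float can hold."""
    return struct.unpack("f", struct.pack("f", x))[0]


stored_exactly = {}
for exact in (Fraction(1, 3), Fraction(11, 10), Fraction(1, 2)):
    stored = Fraction(to_float32(float(exact)))  # exact value of what was stored
    stored_exactly[str(exact)] = (stored == exact)
    print(exact, "stored exactly?", stored == exact, "| off by", float(stored - exact))
```

1/3 and 11/10 come back slightly off, while 1/2 comes back perfect, because a half is a clean power of 2 and the other two are not.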

That doesn’t mean 32-bit float can’t be precise! If you reverse that “scaling” you’d get back to 24-bits worth of real numbers (as shown above with our 4-bit system).

32-bit float is useful in audio work because you can move your values along a wider scale. Sure, you lose some precision, but it might be more important to have a wider scale to work in.

Remember our restaurant analogy? It’s easier to move people forward and back in our 64 feet. Even though in the end, we’ll need to record “party of 8”, or “party of 3”, etc.

Because microphones output a limited number of usable millivolt values for amplitude, well under 65,536 (16-bit), if not 4,096 (12-bit), we don’t need to widen our scale to record them; it’s just wasted memory. Even 24-bits is more than we need! Unused memory aside, it’s no harm to place our analog values in 32-bit float.

That holds as long as we don’t believe it’s providing more discrete values for our microphone output than 24-bit, and as long as we understand that, from a fidelity point of view, 24-bit EQUALS 32-bit float.
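One way to convince yourself is a little Python spot check, assuming standard IEEE 754 single precision and signed 24-bit samples. It pushes a spread of 24-bit values through 32-bit float storage and sees whether they come back unchanged. I only sample the range (every 4,099th value) to keep it quick, and `to_float32` is my own helper.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest value a 32-bit float can hold."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A spread of signed 24-bit sample values, including both extremes.
samples = list(range(-2 ** 23, 2 ** 23, 4099)) + [2 ** 23 - 1]
all_exact = all(to_float32(float(s)) == s for s in samples)
print("24-bit samples survive unchanged:", all_exact)

# Just past 24 bits, whole numbers start to fall into the gaps.
print(to_float32(16_777_217.0) == 16_777_217)
```

Every sampled 24-bit value survives the trip, and one step beyond 24 bits the whole numbers stop fitting. Same count of usable steps, just stretched over a wider scale.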

###

This is the essay, by Fabien Sanglard, that got me over the hump in understanding 32-bit float.
