AMD Disables Zen 4's Loop Buffer (chipsandcheese.com)
94 points by luyu_wu 3 hours ago | 31 comments





From another article:

"Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”."


This is a wild guess, but could this feature be disabled in an attempt at preventing some publicly undisclosed hardware vulnerability?

Bingo.

I can't say more. :(


Have we learned nothing from Spectre and Meltdown?... :(

Complex systems are complex?

The article more or less speculates that:

> Zen 4 is AMD's first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It's not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can't think of any other reason AMD would mess with Zen 4's frontend this far into the core's lifecycle.


Yeah, my first thoughts too.

It sounds to me like it was too small to make any real difference except in very specific scenarios and a larger one would have been too expensive to implement compared to the benefit.

That being said, some workloads will see a small regression; however, AMD has made some small performance improvements since launch.

They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.


Them *quietly* disabling a feature that few users will notice, yet that complicates the frontend, suggests they pulled this chicken bit because they wanted to avoid or delay disclosing a hardware bug to the general public while already pushing the mitigation. Fucking vendors! Will they ever learn? sigh

> Strangely, the game sees a 5% performance loss with the loop buffer disabled when pinned to the non-VCache die. I have no explanation for this, […]

With more detailed power measurements, it could be possible to determine if this is thermal/power budget related? It does sound like the feature was intended to conserve power…


The article seems to suggest that the loop buffer provides no performance benefit and no power benefit.

If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".

I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.

In both cases, the project shouldn't have shipped.


> but was shipped anyway, possibly so someone could save face

Was shipped anyway because it can be disabled with a firmware update and because drastically altering physical hardware layouts mid design was likely to have worse impacts.


That bathroom with a door to the kitchen.

> but was shipped anyway, possibly so someone could save face

No. Once the core has it and you realize it doesn't help much, it absolutely is a risk to remove it.


No kidding. I was adjacent to a tape-out with some last-minute tweaks - ugh. The problem is that the cycle time is very slow and costly, and you spend as much time validating things as you do designing. It’s not programming.

I once interviewed at a place that made sensors used a lot in the oil industry. Once you put a sensor on the bottom of the ocean 100+ meters (300+ feet) down, it's not getting serviced any time soon.

They showed me the facilities, and the vast majority was taken up by testing and validation rigs. The sensors would go through many stages, taking several weeks.

The final stage had an adjacent room with a viewing window and a nice couch, so a representative for the client could watch the final tests before bringing the sensors back.

Quite the opposite to the "just publish a patch" mentality that's so prevalent these days.


If you work on a critical piece of software (especially one you can't update later), you absolutely can spend way more time validating than you do writing code.

The ease of pushing updates encourages lazy coding.


> The ease of pushing updates encourages lazy coding.

Certainly in some cases, but in others, it just shifts the economics: Obviously, fault tolerance can be laborious and time consuming, and that time and labor is taken from something else. When the natures of your dev and distribution pipelines render faults less disruptive, and you have a good foundational codebase and code review process that pay attention to security and core stability, quickly creating 3 working features can be much, much more valuable than making sure 1 working feature will never ever generate a support ticket.


"the project shouldn't have shipped."

Tell that to the shareholders. As a public company, they can very quickly lose enormous amounts of money by being behind or below on just about anything.


Interesting read. One thing I don’t understand is how much space the loop buffer takes on the die. With it removed, could future chips use the space for something more useful, like a bigger L2 cache?

Judging from the diagrams, the loop buffer is using the same storage as the micro-op queue that's there anyway. If that is accurate (and it does seem plausible), then the area cost is just some additional control logic. I suspect the most expensive part is detecting a loop in the first place, but that's probably quite small compared to the size of the queue.

I think most modern chips are routing constrained and not floorspace constrained. You can build tons of features but getting them all power and normalized signals is an absolute chore.

My understanding is that it's a pretty small optimization on the front end. It doesn't have a lot of entries to begin with (144) so the amount of space saved is probably negligible. Theoretically, the loop buffer would let you save power or improve performance in a tight loop. In practice, it doesn't seem to do either, and AMD removed it completely for Zen 5.

It says 144 micro-op entries per core. Not sure how many bytes that is, but L2 caches these days are around 1MB per core, so assuming the loop buffer die space is mostly storage (sounds like it) then it wouldn't make a notable difference.
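To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The 144 entries and the roughly 1 MB of L2 come from the comments above; the bytes-per-entry values are pure guesses, since the size of a Zen 4 micro-op queue entry isn't stated anywhere in the thread.

```python
ENTRIES = 144            # micro-op queue entries per core, per the article
L2_BYTES = 1024 * 1024   # roughly 1 MB of L2 per core, as stated above


def queue_fraction_of_l2(bytes_per_entry: int) -> float:
    """Fraction of a 1 MiB L2 that the whole queue would occupy."""
    return ENTRIES * bytes_per_entry / L2_BYTES


# Hypothetical entry sizes; the real width is not given in the thread.
for bytes_per_entry in (8, 16, 32):
    total = ENTRIES * bytes_per_entry
    share = queue_fraction_of_l2(bytes_per_entry)
    print(f"{bytes_per_entry:>2} B/entry -> {total} bytes, {share:.2%} of L2")
```

Even with the largest guess the queue storage stays well under 1% of the L2, and that storage exists for the micro-op queue whether or not it is used as a loop buffer.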

In the "power" section, it seems the analysis doesn't divide by the number of instructions executed per second.

Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).
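A minimal sketch of that normalization, assuming you have package power in watts and a retired-instruction rate for each run. The numbers below are made up for illustration and are not taken from the article.

```python
def energy_per_instruction(power_watts: float, instructions_per_second: float) -> float:
    """Joules per instruction = watts / (instructions per second)."""
    if instructions_per_second <= 0:
        raise ValueError("instructions_per_second must be positive")
    return power_watts / instructions_per_second


# Hypothetical measurements, NOT from the article.
run_a = energy_per_instruction(power_watts=10.0, instructions_per_second=20e9)
run_b = energy_per_instruction(power_watts=10.5, instructions_per_second=22e9)

print(f"run A: {run_a * 1e9:.3f} nJ/instruction")
print(f"run B: {run_b * 1e9:.3f} nJ/instruction")
```

With these made-up figures, run B draws more power but spends less energy per instruction, which is exactly the kind of difference a watts-only comparison would hide.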


Anecdotally one of very few differences between 1979 68000 and 1982 68010 was addition of "loop mode", a 6 byte Loop Buffer :)

Much more importantly, they fixed the MMU support. The original 68000 lost some state required to recover from a page fault. The workaround was ugly and expensive: run two CPUs "time shifted" by one cycle and inject a recoverable interrupt on the second CPU. Apparently it was still cheaper than the alternatives at the time if you wanted a CPU with an MMU, a 32-bit ISA and a 24-bit address bus. Must have been a wild time.


