Monday, April 16, 2001
Also, new complexities uncovered regarding P4 bandwidth utilization and power consumption.
2001 provides an unfortunate backdrop for Intel as it attempts to lure the mainstream market to a new generation of microprocessors. The timing is just plain bad. The economy is weak. Intel has lost market share and customer respect based on a series of unwise platform strategies and poor execution. Mainstream PC buyers are re-focusing on value, while enthusiasts are exploring their many new alternatives. In the face of all this, the market has sought, but not found, very many reasons to be thrilled with the P4.
While Intel has predicted its fastest ramp ever for a new processor generation, the word from Asia is that P4 motherboard shipments in Q1 were 50% below expectations. Conversely, leading motherboard makers have reported healthy overall results in spite of low P4 motherboard sales. Intel should be able to identify this as a particularly bad omen for the future of the P4.
The market has spoken. It will not pay a premium for the P4 compared to AMD, regardless of a difference in clock speed. Intel is hoping that drastic price cuts can invigorate the market and help it to overlook its frustration with the new processor.
Is now a good time for manufacturers to aggressively fill their pipeline with P4+850 systems? Will the P4 platform gain any additional traction in the market? Are the P4’s woes tied exclusively to its unfamiliar performance balance, or does the platform have anything to do with it?
P4 Lacks Platform Stability
Intel coined the term “Platform Stability” while criticizing AMD’s transition to Athlon. Intel claimed that customers need a stable platform (one that is not in transition) in order to consider aggressive adoption of a CPU. In contrast, the P3 platform was well supported with alternate chip sets and low cost DRAM.
But now it seems that the shoe is on the other foot. The instability of the P4 platform appears to be a serious impediment affecting both the supply side and demand side of the market. Bottom line - the P4 lacks a platform that the market can have confidence in.
The necessary corrective action has been publicly identified since before the launch of the P4. The P4 platform will undergo five major changes that will transpire over the second half of 2001 – namely, a die shrink, a package change, a clock speed boost, alternate chip set availability, and support for mainstream DRAM types.
The anticipation of this massive re-shuffling of the P4’s platform infrastructure has directly contributed to the market’s conspicuously lukewarm reaction to Intel’s transitory P4+850 platform. No one wants to invest in a dead end road.
Below are results from two of Anand’s recent reader polls supporting this observation about the mood of the market on this issue.
Users seem keenly aware of the P4’s weaknesses. Only 9% favor the current platform. But that number jumps to 72% when DDR and Northwood show up. This suggests that a huge portion of Intel’s target customers are willing to wait until Q4’01 to see how things turn out. Meanwhile, an overwhelming majority (78%) are bullish on AMD’s processor and platform strategy today. If Intel’s problem is the platform, we must wonder whether CPU pricing alone is enough to band-aid the situation.
Though the current P4 (Willamette) has its shortcomings, the Northwood processor (die shrink to 0.13 micron) will reduce costs, improve performance and enable much faster clock speeds. Its L1 and L2 caches will be enlarged and its raw computational efficiency will be improved.
Will Northwood significantly expand the P4 market this year? According to Intel’s memory platform roadmap presented at IDF, the outlook may not be entirely rosy. Page 11 of the presentation states that SDRAM platforms are intended to take the P4 into the mainstream business markets, but less than 1/3rd of P4 sales in 2001 will use SDRAM platforms. Not much penetration into the mainstream – it seems. This suggests that the P4 will probably remain in its current niche for the rest of this year.
We agree that Northwood will not necessarily be a quick save for the P4 brand name, but it will help to build a more positive mood about the P4 in the market. Production ramp is another matter entirely. Though it will be introduced in late Q3, Northwood might not achieve meaningful broad scale production levels until Q1’02.
Brookdale – No Salvation in 2001
Brookdale, Intel’s SDRAM chip set for the P4, is scheduled for the summer of 2001. Brookdale will have to compete with new DDR chip sets from VIA, Acer Labs and others that will begin to show up at about the same time. It may be difficult for the PC133 based Brookdale to gain much traction against its DDR enabled competition.
Recently, Intel has mysteriously ceased using the term ‘PC133’ when describing Brookdale. We think that there may be a good reason for this. PC133 is required to enable AGP4x, but when PC133 is synchronized to the P4 front side bus, we believe that it will synchronize at a 100MHz clock, not a 133MHz clock. This means that all performance critical operations on the bus will occur at the same performance level as Celeron platforms (100MHz). Even though the system may be populated with PC133, it will deliver the read performance of PC100.
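To put a number on the PC133-at-100MHz concern, here is a minimal sketch of peak SDRAM bandwidth. It assumes a standard 64-bit (8-byte) single data rate bus with one transfer per clock and no protocol overhead. The 100MHz synchronization itself is our expectation, not a confirmed Brookdale specification, and the function name is ours.

```python
# Peak theoretical bandwidth of a 64-bit SDRAM bus at a given clock.
# Assumption: one 8-byte transfer per clock (single data rate), no overhead.

BUS_WIDTH_BYTES = 8  # 64-bit SDRAM data bus


def peak_bandwidth_mb_s(clock_mhz: float, bus_width_bytes: int = BUS_WIDTH_BYTES) -> float:
    """Return peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return clock_mhz * bus_width_bytes


pc133_native = peak_bandwidth_mb_s(133)  # PC133 at its rated clock
pc133_at_100 = peak_bandwidth_mb_s(100)  # PC133 parts clocked at 100MHz (assumed)
shortfall = 1 - pc133_at_100 / pc133_native

print(f"PC133 at 133MHz: {pc133_native:.0f} MB/s")
print(f"PC133 at 100MHz: {pc133_at_100:.0f} MB/s")
print(f"Peak bandwidth given up: {shortfall:.0%}")
```

If the assumption holds, roughly a quarter of the memory's rated peak bandwidth would go unused, which is the PC100-level read performance described above.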
PC100 was introduced on performance PCs when the fastest CPUs were running at 400MHz. It was replaced by PC133 by the time CPUs hit about 800MHz. Now Intel expects the market to get excited about PC100 performance on a 2GHz platform. We think Intel is going to experience yet another horrific market backlash on this one.
Perhaps Intel was a little frustrated that its own 815+PC133 platform beat RDRAM last year. This time Intel will not screw up – they will doubly cripple SDRAM to make certain that it cannot match RDRAM. In other words, Intel seems willing to publicly impale Northwood on an entirely inadequate Brookdale platform in order to make its point. It is this type of agenda that causes further doubt about Intel’s commitment to the P4 platform and leaves the door wide open for all of Intel’s competitors.
Need for a Balanced Platform
After a yearlong bout with austerity, the market will ease out of its current slump with a focus on balanced platforms. Clock speeds exceeding 2GHz could have an exhilarating effect on PC buyers, but in its short life the P4 has already left a deep legacy in the mind of users that clock speed is NOT the whole story. The P4 and RDRAM are both labeled as products that run at a high RPM, but don’t always deliver a lot of horsepower.
If Intel is to regain traction, the P4 needs a stellar platform strategy. DDR could easily become the saving grace for the P4 late this year, thanks to Intel’s competitors. DDR props up the Intel platform in a balanced and logical manner that somehow escapes Intel’s marketing folks. We suppose the chip set vendors don’t mind – to them, this is becoming familiar territory.
P4 Bandwidth and Power Analysis
With RDRAM’s unfortunate legacy on the P3, the P4 inherits a negative stigma regarding performance, cost, manufacturability and availability. From the perspective of the end user and reseller, this may seem unavoidably clear, but Intel seems to deny the existence of a problem. In its memory roadmap presentation at IDF, Intel mounted a vigorous defense of Rambus with the proclamation that ‘RDRAM is the best solution’.
In an attempt to support this statement, Intel released some interesting new lab test results on platform bandwidth utilization. Using external hardware, Intel measured the long-term average bus utilization of DRAM under certain benchmarks. Two were quoted in the IDF presentation – both from the SPEC benchmark.
Our objective is to dissect this data to learn as much as possible from it. The diagram below was reconstructed from Intel’s presentation. The vortex object oriented database benchmark was provided as a low bandwidth application example, but contains some interesting information, nonetheless. Most benchmarks and applications are not DRAM bandwidth limited. This is simply because the cache is doing its job. If this were not true we would all be quite frustrated by the performance of our PCs (remember Covington?).
The chart is a bandwidth histogram that shows bus utilization averaged in one-second intervals while the benchmark runs.
Intel quotes average bandwidth consumption at 100MB/s for the P3+840, while the P4+850 platform roars ahead to about 400MB/s on average. On the surface this appears to be a magnificent revelation of the power of the P4. If the P4 consumes 4x more bandwidth, is that an indication that it can do 4x more work? Perhaps it will deliver a magnificent performance boost… Not surprisingly, one vital piece of information was selectively omitted from the presentation – Actual Benchmark Scores!
We located Intel’s internally generated SPEC scores at www.spec.org, and discovered essentially no difference in performance between the two platforms.
Even with a 50% clock speed advantage and an unexplained 4x hunger for bandwidth, the P4 produced a performance delta of only 3.5%. (For contrast we included the 1.33GHz Athlon+DDR, which delivers the best score of all.) Clearly, something is fishy. We all know about the P4’s pipeline problems, excessive branch misprediction penalties, a poor L1 cache implementation, etc., but perhaps something else is out of balance here.
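The imbalance is easier to see when the three quoted ratios are put side by side. The sketch below uses only the figures cited above (a 50% clock advantage, 400MB/s versus 100MB/s, and a 3.5% score delta). The variable names and the "work per clock" and "work per byte" framing are ours.

```python
# Ratios of P4 to P3 on the vortex test, taken from the figures quoted above.
clock_ratio = 1.5            # 50% clock speed advantage
bandwidth_ratio = 400 / 100  # average MB/s, P4+850 vs. P3+840
score_ratio = 1.035          # 3.5% benchmark score delta

# Benchmark work delivered per clock cycle and per byte moved, relative to the P3.
work_per_clock = score_ratio / clock_ratio
work_per_byte = score_ratio / bandwidth_ratio

print(f"Work per clock vs. P3: {work_per_clock:.2f}x")
print(f"Work per byte moved vs. P3: {work_per_byte:.2f}x")
```

On these numbers the P4 gets about 0.69x as much done per clock and about 0.26x as much per byte of DRAM traffic as the P3 in this test.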
The Culprit – Longer Burst Length
Since the P4 is not getting any more work done than the P3 in this application, its excess bandwidth demand is probably just extraneous, meaningless bus noise. If so, this is a poor marketing justification for higher bandwidth.
The P4 uses a 128-byte sectored cache line. This means that most external burst accesses will be 128 bytes long, though some can be abbreviated to 64 bytes (perhaps code fetches, some write backs or cache misses to the second sector). By the way, this type of long sectored cache design can negatively impact cache-hit rates. If 40% of external bus accesses are 64 bytes, then perhaps 40% of the cache lines are only using 64 of the 128 bytes available per line. This would mean that up to 20% of cache memory is empty (unused, invalid or unallocated). This would negatively impact P4 cache hit rates and thus, performance.
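The 20% figure follows from simple arithmetic, sketched here. The 40% share is the hypothetical from the paragraph above, not a measured value, and the function name is ours.

```python
LINE_BYTES = 128   # P4 cache line size
SECTOR_BYTES = 64  # one sector of a sectored line


def unused_cache_fraction(half_filled_share: float) -> float:
    """Fraction of cache capacity left empty when the given share of
    lines hold only one valid 64-byte sector."""
    unused_per_half_line = (LINE_BYTES - SECTOR_BYTES) / LINE_BYTES  # 0.5
    return half_filled_share * unused_per_half_line


# Hypothetical case from the text: 40% of lines are only half filled.
print(f"Unused cache capacity: {unused_cache_fraction(0.40):.0%}")
```

The result is an upper bound under this assumption, since a half-filled line may later have its second sector filled.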
The P3 uses a very granular 32-byte cache line (non-sectored). All external accesses are 32-bytes and one should assume that the cache will always fill up nicely and have excellent hit rates.
External accesses are ordered ‘demand word first’, meaning that code or data immediately required by the CPU is delivered first, after which the burst continues to fill the cache line with neighboring code or data. In highly random applications (such as cached business apps) a long burst can result in a lot of extraneous (unneeded) code or data being filled into the cache with each ‘demand word’.
The chart below shows the best case burst cycle times for the P3 and P4 measured in nanoseconds. The P3 shows superior latency by 19% and frees up the bus 33% sooner than the P4.
In applications that show very random external access patterns (such as mainstream benchmarks), very often the demand word is all that is required by the CPU. There is always a steeply diminishing possibility that additional burst data is also required, but in this case, Intel’s external bandwidth histogram provides evidence that the longer burst is NOT needed or even helpful. In fact it might even hurt performance.
Intel’s data essentially proves that both processors accomplish the same amount of work over the same period of time and are performing the same number of individual external burst transactions. The telltale sign is that the P4’s bandwidth consumption is precisely 4 times higher than the P3’s. This is evidence of a 1:1 relationship in the number of burst cycles on the bus (since the P4 burst is 4x longer). If the P3 can accomplish the same work in the same time while reading in only 1/4th as much information, then the P4’s additional 75% overhead is actually just wasted bus activity.
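The 1:1 relationship can be checked with the quoted numbers. This sketch assumes, as a simplification, that every P4 burst is a full 128 bytes and every P3 burst is 32 bytes, and it takes Intel's average bandwidth figures at face value. The function name is ours.

```python
P3_BURST_BYTES = 32    # P3 cache line, non-sectored
P4_BURST_BYTES = 128   # P4 full-line burst (assumed for every access)

P3_BANDWIDTH = 100e6   # bytes/s, average quoted for the P3+840
P4_BANDWIDTH = 400e6   # bytes/s, average quoted for the P4+850


def bursts_per_second(bandwidth_bytes_s: float, burst_bytes: int) -> float:
    """Average number of burst transactions per second on the bus."""
    return bandwidth_bytes_s / burst_bytes


p3_bursts = bursts_per_second(P3_BANDWIDTH, P3_BURST_BYTES)
p4_bursts = bursts_per_second(P4_BANDWIDTH, P4_BURST_BYTES)

# Share of each P4 burst beyond what a 32-byte P3 burst would have carried.
extra_share = 1 - P3_BURST_BYTES / P4_BURST_BYTES

print(f"P3 bursts/s: {p3_bursts:,.0f}")
print(f"P4 bursts/s: {p4_bursts:,.0f}")
print(f"P4 traffic beyond the P3 equivalent: {extra_share:.0%}")
```

Both platforms come out at the same transaction rate, with three quarters of the P4's traffic being bytes the P3 never needed to fetch for the same result.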
Two negative side effects are that the cache is being filled with useless data (resulting in lower hit rates) and that subsequent bus transactions must wait 33% longer to begin on the P4 as compared to the P3 (see diagram above). With this in mind, for Intel to demonstrate a bandwidth consumption increase from the P4 without a commensurate performance increase is actually disclosing a hidden weakness of the processor. Intel is merely demonstrating the P4’s ability to meaninglessly burn off excess bandwidth without improving performance. Rather annoying.
Another Data Point on Bandwidth
A second example supplied by Intel in its IDF presentation is completely different in nature. It uses the floating-point intensive SWIM test from the SPEC benchmark - a shallow water modeling benchmark. Unlike the others, this test is entirely limited by external bandwidth because the active data set significantly exceeds the cache size. The figures are below.
As we observe performance vs. bandwidth, they scale in perfect unison in this benchmark. This is entirely different from the vortex database test above. Comparing the P3 vs P4, bandwidth scales by 2.5x and the score also scales by 2.5x. Thus, the performance benefit is due exclusively to the external bus. The faster clock speed of the P4 had no visible performance impact. In fact, Intel’s 1.3GHz P4 earns a score of 1238 on this test – a difference of only 0.5%. We expect that the score will not budge even with the upcoming 1.7GHz processor.
Applications of this sort are uncommon, but they make a great demonstration for fast and wide buses. It is also possible that a big L3 cache could make a great showing as well, but we will have to wait to find out about that.
Pentium 4 Power Management – A Performance Limiter
Last year, as details of the Pentium 4 were emerging, Intel’s elaborate thermal and power regulation requirements for P4 systems raised many eyebrows. Platforms required new power supplies capable of pumping out more current, and enormous copper heat sinks for CPU cooling. Humorous jabs ensued… "Turns your PC into a toaster oven," some said.
But just prior to the P4 launch, Intel sought to confront the issue by publicizing power specifications for the soon to be released CPU. To many people’s amazement, the 1.5GHz Pentium 4 was said to consume only 54.7 Watts - a maximum power dissipation rate well below the fastest Athlon.
Numerous independent product reviews have quoted this value (and still do) particularly in contrast to AMD’s latest 73 Watt, 1.33 GHz Athlon. The only problem is that the quoted Pentium 4 power dissipation figure is wrong - or at least the number is misleading.
These two specifications are defined under entirely different conditions. AMD reports the true absolute maximum power dissipation rates without constraint. Intel on the other hand, publishes a compromised figure that is open to lots of interpretation. Page 70 of Intel’s P4 datasheet shows the dissimilarity.
Since most applications are ‘unlikely’ to cause the processor to consume its absolute maximum power, Intel quotes a figure that looks more like a ‘not-to-be-exceeded’ figure. In theory this power level could easily be exceeded under stress tests, benchmarks and particularly challenging application loads. If this were to happen, a thermal diode (required on all P4 platforms) would trigger a power management mechanism that instantly cuts CPU performance, and allows it to cool down.
The P4 spec reads, “The Thermal Monitor feature … is intended to protect the processor from overheating when running high power code that exceeds the recommendations in this table,” (54.7 Watts for the 1.5GHz model).
Looking forward, if you buy a 1.7GHz P4, it will run at that speed when it is idle, or under light loads, when CPU utilization is nominal, or in applications that don’t really need a 1.7GHz CPU. But when you drive it to extremes, or wish to extract all available performance from the processor, you may find yourself spontaneously and unavoidably power managed to a lower effective clock speed. Intel’s motto… “1.7GHz. It’s there. Unless you need it.”
Update: Intel’s Thermal Design Guide has revealed that the absolute maximum power dissipation of the 1.5GHz P4 is actually 72.9 watts. This is 33% higher than the published system design specification, and essentially identical to the 1.33 GHz Athlon. If power dissipation is sustained at a level higher than 54.7 watts, thermal overload can occur. In order to deal with this, a mechanism called thermal throttling is used. If performance critical applications drive the CPU above a predetermined temperature, the CPU is halted with a 50% duty cycle (alternating 2 microseconds on; 2 microseconds off) until it cools down. This effectively turns your 1.5GHz processor into a 750MHz processor – just at the moment you demand peak performance. On the other hand, you will probably still be able to check your email at 1.5GHz. This scheme is described on page 23 of Intel’s P4 Thermal Design Guide.
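Both figures in the update can be reproduced with a few lines. This is a sketch that treats "effective clock" as nominal clock times duty cycle, which is a simplification of how throttled performance actually behaves. The function name is ours.

```python
DESIGN_POWER_W = 54.7  # published design figure, 1.5GHz P4
MAX_POWER_W = 72.9     # absolute maximum per the Thermal Design Guide


def effective_clock_mhz(nominal_mhz: float, on_us: float, off_us: float) -> float:
    """Nominal clock scaled by the fraction of time the CPU is not halted."""
    duty_cycle = on_us / (on_us + off_us)
    return nominal_mhz * duty_cycle


power_excess = MAX_POWER_W / DESIGN_POWER_W - 1
throttled = effective_clock_mhz(1500, on_us=2, off_us=2)

print(f"Maximum power above design figure: {power_excess:.0%}")
print(f"Effective clock while throttled: {throttled:.0f} MHz")
```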
Commentary is already floating around the web that perhaps Intel feels guilty about selling 750MHz CPUs in 1.5GHz clothing, and thus has decided to cut the price by 50% as well.
After all of this, our intent is still the same - to evaluate the road ahead for P4. It seems that the jury is still out. At some point the P4 will entirely replace the P3 in Intel’s product mix, so Intel will be free to declare it a success at any time, on its own terms. But will Intel continue to lose market share along the way due to its frustrating and confusing platform strategy and performance profile?
With application throughput confounded by its core processor architecture, external bus irregularities, and thermal regulation compromises, the Pentium 4 presents an unfamiliar and seemingly unbalanced performance spectrum. Worse still, the P4 is weakest on the core applications that have thus far driven mainstream PC usage and sales demand.
With an oversized die, production capacity per wafer is reduced by about 60% relative to the P3. Intel’s near term P4 ramp is constrained by its ability to bear these elevated costs. Now, the P4 must undergo horrific premature price slashing in order to sustain any hope of popularity in the market.
Intel’s dependence on Rambus for the Pentium 4 serves as a major barrier to widespread acceptance in business domains. Pricing, availability problems and a recall fiasco have branded RDRAM as an unattractive path in the minds of many corporate buyers. Few IT managers want to spend their budget, or risk their reputations, on technology that burned them only a few months ago.
In light of these issues, the P4 motherboard sales shortfall begins to make sense.
Although the move to Northwood and third party chip sets will help alleviate several of these problems, other issues arise. Not only is Intel moving to a new 0.13 micron process with Northwood, but this process utilizes copper technology that is foreign to Intel. We cannot help but wonder if Intel’s pronounced aggressive ramp of this process is a bit over-ambitious, and whether 0.13 micron copper wafers and chips will be in short supply until 2002.
Finally, as the industry creeps toward a cautious economic recovery, circumspect consumers will first seek refuge in the low risk value segments. We expect an upswing of demand in the low-end segments in the latter half of the year. Unfortunately, high dollar P4s will be left on the shelf.
At the opposite end of the spectrum, the dot-com collapse has taken the wind out of the sails of the server market – but not to the same extent as the rest of the market. P3 platforms are in desperate need of upgrade and the P4 provides an excellent solution, aided by ServerWorks. The ramp to production is expected to begin late in the year though, and volumes cannot compare to the mainstream markets.
So, can the P4 recover? This year looks bleak. Expect continued margin erosion and declining ASPs. The erosion of market share to rival AMD will endure relatively unabated, and expect upstarts VIA and Transmeta to also make more gains at Intel’s expense.
By: Bert McComas, Inquest Inc.
Copyright © 2001 CST, Inc. All Rights Reserved