quote: If a branch is encountered, no branch prediction is performed: it would only waste power and transistors. No, the condition on which the branch is based is simply resolved. The CPU doesn't have to guess anymore. The pipeline is not stalled because other threads are switched in while the branch is resolved. So, instead of accelerating the little bit of compute time (10-15%) that there is, the long wait periods (memory latencies, branches) of each thread is overlapped with the compute time of 3 other threads.
So, I think the points you are trying to stress are quite obvious from the design of the CPU and the types of loads it was designed to handle. Yes, poor FPU performance is obvious from its very limited (and slow) FPU resources. Yes, if you aren't running lots of threads the machine is inefficient, because it is designed to take advantage of server-type threads where there will be lots of I/O stalling, and if there is nothing else to run while waiting on the I/O requests to finish, it sits idle (much like any other machine). Yes, in-order execution and the lack of branch prediction will not mask any stalls the instruction stream generates (which is OK, because the design of the CPU actually counts on these stalls happening so that lots of nice SMT can happen).
It sounds like you are in violent agreement with everyone :)
SMT and CMT appear to be the same type of technology (at least conceptually) with different names from two vendors.
> The very very poor FP performance of T1 is the truth.
> We have to remind ourselves that it is only an integer CPU. Its FP performance is too terrible.
OK. Since you have repeated so many times, I am sure everyone who's reading this will remember, and I do not disagree :-).
Obviously the apps that they used to benchmark in this article like running on the chip. Also, this chip doesn't run windows. It runs Sun's proprietary operating system. (I forgot what it's called.) Sun will give this new chip software support because they want it to do well.
I think I read in the article that the chip is backwards compatible with Sun's previous chip designs, meaning a lot of software is already available that will run on the chip.
The range of apps that run well in parallel across 32 threads is too narrow.
'Have many threads' is not the same as 'runs well in parallel on 32 threads'!
Even if there are 32 threads, if they do not run well in parallel, this new CPU will waste more than 90% of its potential.
The efficiency of Itanium (which is capable of 1.3-1.5 IPC) is much better than that of x86 CPUs (0.7-0.9 IPC). Itanium never used OOO logic or long pipelines.
The efficiency of Itanium 2 is still better than IBM's POWER5: an Itanium 2 core can retire 6 instructions/cycle, and a POWER5 core can retire 5 instructions/cycle.
But a core of this new CPU manages only one instruction/cycle.
I think you missed the part where x86 chips spend 400 cycles waiting on memory accesses when the Sun chip just keeps chugging with another thread while the load is happening.
Those 400 cycles are related to the higher clock speed (if your processor would be twice as slow, it would wait only 200 cycles). I assume the 400 cycles are based on the Xeon processor (that has high clock speed and slower FSB).
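A quick sketch of that scaling, assuming the memory latency is fixed in nanoseconds and only the clock changes. The 125 ns figure and the function name are made up for illustration; they are not from the article.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> int:
    """Core cycles that pass while one memory access of `latency_ns` completes."""
    # cycles = time * frequency; nanoseconds times GHz cancels to plain cycles.
    return round(latency_ns * clock_ghz)


# Hypothetical 125 ns trip to memory, seen from cores at three clock speeds.
for ghz in (3.2, 1.6, 1.2):
    print(f"{ghz} GHz: {stall_cycles(125, ghz)} cycles lost per miss")
```

Under that assumption the same miss costs 400 cycles at 3.2 GHz, 200 at 1.6 GHz, and 150 at 1.2 GHz: the wall-clock wait is identical, a slower core simply counts fewer cycles during it.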
NO!
It is not true for all x86 CPUs. When the Athlon64 spends many cycles waiting on memory accesses,
For the P4 with HT, the P4 just keeps chugging with another thread while the load is happening.
It is hard to find parallelism in one application so you could run it well on two cores. However, if you use 32 applications, you can run it very well on 32 cores.
Most servers don't run a lot of single-threaded apps, or if they do they run many instances of the single-threaded app/process at the same time. This is clearly not a chip designed for all markets, but it is instead focused on doing very well in a niche market.
Johan,
Nice article!
A small point: I don't think it's correct to refer to Sun Microsystems Inc. as 'SUN'; it should be 'Sun'.
Even though it originally stood for Stanford University Network, 'SUN' is no longer the semi-official name, AFAIK.
When the T1-based systems were announced, I was hoping to see some independent benchmarks from Anandtech, especially the MySQL one you guys used to benchmark server performance.
I know it's not scientific, and SPEC is as good as it gets, still I am curious :-)
Have you guys considered using T1000/T2000 to power Anandtech, given it's so cheap and designed for webserver type of workload?
That would be a good win-back story for Sun. I remember you guys migrated from Sun Ultra boxes to PC servers several years ago :-)
Actually, let's start by saying you're missed on aceshardware... and I do have to wonder how you felt about the oath of allegiance to Intel that anandtech requires?
Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away.
In addition: you cite (as a lot of other people do too) this 1.2 GHz "maximum" as if it had reality - it does not. As issued, the T1 incorporates some design trade-offs that make higher cycle rates impractical, but those are the result of engineering vs. marketing (time and cost) trade-offs, not inherent consequences of the technology. Sun has faster test units running now - with very high end products in the pipeline.
"Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away."
The floating point limitation won't go away; 8 FPUs@1.4GHz will just make the floating point capabilities of the chip somewhat useful. For comparison, a dual-core Opteron has 6 FPUs@2.4GHz NOW, and in 2007 there will be quad-core Opterons (12 FPUs) available.
As somebody already mentioned, performance/$ is also very important. While T1 is way faster than any other chip, I guess it will cost much more, probably more than 2 high end dual-core Opterons.
I'm not saying that T1 isn't good. It is, but only in certain tasks.
I don't think it is a tradition at Anandtech to swear allegiance to Intel, or else they have forgotten to tell me. :-)
All jokes aside, When I say Intel has the advantage on hardware VT technology and the software support needed, that is solely based on facts. Sun is actively trying to get full support of Xen (VM), and also Linux and FreeBSD OS support, but for the moment T1 is Solaris only if you want good software support.
AFAIK there is no indication that SUN can go much faster than 1.2 GHz. To let the 4 threads access a 5.7 KB register file in one cycle is probably limiting the clockspeed, and the 6 stage pipeline is another clear indication that this CPU won't clock much higher. SUN counting on 65 nm to increase the clockspeed higher (1.4 GHz and more) is another indication.
It looks like Sun is back with a vengeance. This thing seems perfect for the server market. I am really surprised that they were able to get their $hit back together. I doubt the single-threaded performance on this thing would be that great, but then again, who cares? This thing is a server, not a workstation made for single-threaded use. This thing would be perfect for virtualization. I don't know if this is possible for Solaris, or maybe vmware/ms virtual server will have this feature in the future, but hopefully they will allow you to allocate which core goes to which virtualization layer. So say you're running 4 OSes and you have 8 cores. You allocate 2 cores to each OS. You notice that 2 of the four have really high cpu utilization. You could then dynamically add one more core to each of the virtualized OSes that had high cpu usage, taken from the ones that had low cpu usage. For those of you who think virtualization isn't a big deal... now wouldn't this be cool.
The benchmarks are from Sun's website (http://www.sun.com/servers/coolthreads/t1000/bench...)
"SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers."
So, yes, you can assume they're all using the same TCP/IP stack. But, as the article mentions: "Of course, this is an ideal benchmark for the T1 with many java threads."
49 Comments
sgtroyer - Wednesday, January 4, 2006 - link
Another fascinating article, Johan. It's fun to see Anandtech spending more time delving into architecture and non x86 processors, and doing more analysis and less benchmarking. Keep it up!
Scarceas - Friday, December 30, 2005 - link
Remember the converse: Most cpus give up thread level performance...
Remember the intended market...
Remember not all 32 threads have to come from one app...
Betwon - Friday, December 30, 2005 - link
Remember: "not all from one app" does not mean the apps run well in parallel.
Many apps running together at the same time may well be slower than running them one by one.
A core has only 8KB of L1, and it has to be split among 4 threads. That is too little L1 for 4 threads!
Xeon has 16KB for 2 threads, and POWER5 has 32KB for 2 threads.
"Most cpus give up thread level performance..." ?
Remember: the Xeon from Intel and the POWER5 from IBM are both multithreaded CPUs.
Cerb - Friday, December 30, 2005 - link
Sun stands to gain quite a bit from this, but not really at the expense of IBM, AMD, or Intel. This is doing something that the other guys aren't trying to do, rather than competing against them at what they do well. It is not the future of desktop CPUs. It will not even be a good general-purpose server CPU. It takes a lot of data in, and pushes a lot of data out. A workload that hinges on doing that, without much actual work done to that data, is all it is made to do.
It is basically a network appliance that happens to run generic programs on it. If you need what it offers, it will be Lord and Master of your rack. If you're not sure, you will pass it by, because you know that that Opteron over here can take anything you throw at it pretty well.
Betwon - Friday, December 30, 2005 - link
If everyone thinks you are right, Sun may cry.
Such a small range of apps for it.
Cerb - Friday, December 30, 2005 - link
Why would they cry? They even go as far as pointing out it's crap for FPU tasks (well, if you notice it lacking entirely in the whole of the PR stuff for it), and for tasks with high ILP and IPC (where our mainstream CPUs excel). They also still have a full line-up of other servers, including those based on their own updated SPARCs. It appears their buzzword for this stuff here is 'throughput computing'. Their own brochure for this thing also clearly sells it for high TLP and large data workloads. For more general work, they've got Opterons, and the UltraSPARC IV+ does not appear to be a slouch.
Let's look at their own "key applications":
* Web and application tier workloads
Lots of web server threads. Lots of DB threads. Simple integer logic.
* Multithreaded workloads
See above.
* Java application servers and Java Virtual Machines
They're Sun. Regardless of how good it may or may not be here, they must market Java™. McNealy has to eat, you know :).
* Consolidated web servers
Basically the first one/two, but worded differently, to point out that it can do 2x as much web serving work as other servers in the rack with it, and maybe even more, while using little power.
* Infrastructure services (portal, directory, identity)
Data in, shuffle it, pump it out. Only slightly different than the rest so far (except Java).
* Enterprise applications (ERP, CRM, SCM business logic)
Again, mostly simple DB work where a lot of things may be going on at once, but plenty of them will really be separate from each other. What each task lags in will be made up for by being able to run another 30 at the same time.
Note that nothing like engineering, scientific simulations, etc., is on that list (things that do a lot of FPU work in parallel). It's basically web and DB said in different ways, and a plug for Java. In addition, their benchmarks look carefully chosen, but not cooked, like Apple's.
Betwon - Friday, December 30, 2005 - link
Just for web servers/links and very local Java apps?
You think that the key to DB work and web serving is multithreaded parallel performance. Then it seems that multithreaded processors (such as the P4 Xeon with HT and the POWER5 with SMT) are more competitive than a single-threaded processor (such as the Opteron).
Cerb - Friday, December 30, 2005 - link
quote: Just for web servers/links and very local Java apps?
No, not local Java apps. Java is only there as marketing, because this is something Sun is trying to sell. Java probably works fine on it, but really has nothing to do with any of it, except that the same company is behind it and this chip.
quote: You think that the key to DB work and web serving is multithreaded parallel performance. Then it seems that multithreaded processors (such as the P4 Xeon with HT and the POWER5 with SMT) are more competitive than a single-threaded processor (such as the Opteron).
All of those are multithreaded processors. A 386 is a multithreaded processor (in fact, its ability to handle threading in hardware is part of how Linux got created!). However, except for the Power5, none of those can run more than a single thread at a time per core. They can run tons and tons of instructions at a time, but not separate threads (yes, even with HT).
I don't know how IBM's SMT works, but Intel's is nothing like what the T1 is doing. The T1 seems to be made to send out threads without regard to whether one needs replacing or not.
Let's say your task has an IPC of 3, and you have 4 paths to use at a time.
***0
Not bad, 75% used. Now, let's say it's only 1.
*000
25%, not so great. But, because you have to send them in sets of 4, you can't get 3 more in.
OK, now, enter Hyperthreading. Let's go to 2 of those 3-IPC tasks.
***0 ***0
Hey, wait, it didn't use that extra one. 75% again. With HT, it switches the whole thing between threads fast. This helps make up for the stalling that will happen a fair bit on the Netburst chips. You can't actually get more done--you just don't have to wait as long when something can't go on, because another one is ready to take its place. Yes, it may help a little, but there is also the possibility the CPU will get too loaded down and decrease performance, too.
So, apply that to two 1 IPC tasks.
*000 *000
You're still only using 25%, there. You may get a little boost here or there, when one stalls and the other does not, but you've still got about 3/4 of it wasted.
Now, let's take one core of that T1. It runs four threads, each single-width. So, to that 3 IPC task:
*000, *000, *000
One path of the four is used, going over it three times, because it can't span them out and run them in parallel. So, it will take 3 passes to do the work the others can do in 1. Even with a very short pipeline, that hurts. For this task, the 'fat' CPUs, like the Opteron, are excellent.
But, let's go and run 3 1 IPC tasks, instead:
***0
Now, it got 75% used. Now, running 3x 1 IPC tasks on the Xeon or Opteron:
*000, *000, *000
Not so great. The OoOE, branch prediction, and large local caches help, but it just can't keep up, because it's only one thread at a time.
While this is a very specific kind of workload, the majority of machines that you use on the internet, and many that you may use within a large company, are basically running that kind of workload.
Get request for data.
Fetch data.
Check where data needs to go.
Send it there.
The thing about it is that this workload accounts for the majority of what goes on over the internet, and most other networks. As long as your servers have enough work to do during peak times to keep one of these machines somewhat busy, it could save rack space, power use, and increase performance in the process.
Hopefully I didn't screw too much of that up--I did ramble a bit.
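The slot-counting picture above can be put into a few lines of Python. This is only a sketch of that simplified model (a 4-wide core running one thread at a time versus four single-width hardware threads), not of how any of these chips actually schedule instructions; all function names are made up for illustration.

```python
import math


def wide_core_utilization(task_ipc: int, width: int = 4) -> float:
    """One thread at a time on a core with `width` issue slots per cycle."""
    return min(task_ipc, width) / width


def narrow_threads_utilization(num_tasks: int, threads: int = 4) -> float:
    """`threads` single-width hardware threads, one task per thread.

    Each thread issues at most one instruction per cycle, no matter how
    much parallelism its task could offer.
    """
    return min(num_tasks, threads) / threads


def passes_needed(task_ipc: int, issue_width: int) -> int:
    """Passes needed to issue what a `task_ipc`-wide task offers in one cycle."""
    return math.ceil(task_ipc / issue_width)


print(wide_core_utilization(3))       # one 3-IPC task on the wide core
print(wide_core_utilization(1))       # one 1-IPC task on the wide core
print(narrow_threads_utilization(1))  # one task on the four narrow threads
print(narrow_threads_utilization(3))  # three 1-IPC tasks on the narrow threads
print(passes_needed(3, 1))            # the 3-IPC task squeezed through one slot
```

In this model the wide core is 75% used by a 3-IPC task and 25% used by a 1-IPC task, while the four narrow threads are 25% used by a single task (which also needs 3 passes if it could have issued 3 at once) and 75% used by three independent 1-IPC tasks.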
Schmide - Saturday, December 31, 2005 - link
As always correct me where I’m wrong.
Ok, this all works fine if you’re dealing with a non-superscalar 386. But the processors you’re referring to are fully pipelined, out-of-order, micro-op architectures.
I believe the Opteron can have 72 instructions in flight at any one given time, the Power something like 200(x2?), and the P4 126. Each in various levels of decode, process and write.
As for the thread level parallelism, it is in no way as granular as you portray it. Think more in milliseconds, not ticks. I believe thread quantums (time slices) for Windows are on the order of 30 ms. So on a 2 GHz processor, a task switch for a thread that holds its slice for the full allotted time occurs only after about 60 million ticks.
HyperThreading does by definition feed the execution units from two threads at a time; however, this never reaches the level of instruction level parallelism that you portray. It just kind of fills in the gaps.
Each core of the Niagara can in theory achieve an IPC of 0.7. Multiply that by 8 and you get a theoretical 5.6 IPC (but even the Itanium 2 never reaches its theoretical peak). Something always gets in the way.
I think the Niagara has some promise.
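Working the two back-of-the-envelope numbers through, as a minimal sketch. The 30 ms quantum, 2 GHz clock, 8 cores and 0.7 per-core figure are the ones quoted in the comment; the function names are made up for illustration.

```python
def cycles_per_quantum(clock_hz: int, quantum_ms: int) -> int:
    """Clock ticks that elapse during one scheduler time slice."""
    return clock_hz * quantum_ms // 1000


def peak_chip_ipc(cores: int, per_core_ipc: float) -> float:
    """Aggregate instructions per cycle if every core hits its per-core figure."""
    return cores * per_core_ipc


print(cycles_per_quantum(2_000_000_000, 30))  # ticks in a 30 ms slice at 2 GHz
print(round(peak_chip_ipc(8, 0.7), 1))        # 8 cores at 0.7 each
```

That gives 60,000,000 ticks per time slice and an aggregate of 5.6 instructions per cycle, so OS-level thread switching is many orders of magnitude coarser than the cycle-by-cycle interleaving the hardware does.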
Betwon - Thursday, December 29, 2005 - link
How terrible for the single thread apps!
NO branch prediction!
Someone must be crazy!
Betwon - Thursday, December 29, 2005 - link
Why? Really?
It shows that the performance of FP apps is very very poor!!!
We can't believe it.
It is terrible for many FP apps.
Now, we know that the new CPU is only for the integer/32-thread-parallel-well apps.
JarredWalton - Friday, December 30, 2005 - link
How many FP instructions do you think a high-end web server runs? Try to think outside the box for a minute, rather than comparing it to HPC-oriented chips. Itanium wastes more than 80% of its potential when running many database loads, and it does better than some of the other alternatives. Spending lots of die space on OOO logic and long pipelines isn't always the best solution, especially if you can guarantee that most code will have many threads. Quit thinking Half-Life and other games for a minute and try to shift to the big iron server world.
Betwon - Friday, December 30, 2005 - link
Only one FP unit? No less than 40 cycles of latency?
If that is true:
The new CPU will be slower than a P3@450MHz for FP apps.
Brian23 - Friday, December 30, 2005 - link
Who cares? That's not what it's designed to do. The only reason that it has the floating point unit is for the rare occasion when an FP op is needed.
Betwon - Friday, December 30, 2005 - link
It means that this new CPU is not a fit for FP apps. Maybe an old CPU (10 years old) can beat it.
Now, we know that the apps-area of this new CPU is very very spec...
It is too difficult to find apps (that run well in parallel on 2 threads or more) for the P4; how can apps that run well in parallel on 32 threads be found easily?
thesix - Friday, December 30, 2005 - link
Betwon,
You seem to be confusing the concepts of multi-threading and multi-tasking.
You do NOT need to find an app that runs 32 parallel threads in one process.
You can simply run 32 _instances_ of that app, for example,
or run 32 different apps, even if every one of them is single-threaded.
A typical server environment is just like that.
When we talk about Chip Multi-Threading (CMT), we mean the _hardware_ thread, which is a totally different concept from a software thread. One hardware thread represents the capability of running one computing task; it does not care whether this task comes from the same app/process or not.
A perfect example is the Apache webserver. IIRC, at least in version 1.x.y (which is still the most popular version), the Apache HTTP server process is single-threaded. A new process is forked for each new HTTP request (or group of requests). The more hardware threads you have, the more requests you can handle in parallel. Of course, the faster each hardware thread (or core, or CPU) is, the more requests it can handle in a given amount of time, but not in parallel.
It is also true that _most_ databases out there don't use _any_ floating-point computation.
So, if you think about it, the market for a T1 type of CPU/server is not a small one.
The bottom line is, T1 excels at throughput/Watt and throughput/chip.
It's a well-known fact that it sucks at single-task or floating-point computation.
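To put rough numbers on the "32 instances" argument above, here is a minimal sketch in Python. It assumes equal-length, fully independent tasks (such as separate single-threaded server processes) and ignores cache, memory, and scheduling effects. The function names are made up for illustration and do not come from Sun or the article.

```python
import math


def rounds_needed(num_tasks: int, hw_threads: int) -> int:
    """Rounds needed to finish num_tasks equal-length, independent tasks
    when hw_threads of them can run at the same time (idealized model)."""
    if num_tasks < 0 or hw_threads < 1:
        raise ValueError("need num_tasks >= 0 and hw_threads >= 1")
    return math.ceil(num_tasks / hw_threads)


def busy_fraction(num_tasks: int, hw_threads: int) -> float:
    """Fraction of hardware threads that have a task to run."""
    if num_tasks < 0 or hw_threads < 1:
        raise ValueError("need num_tasks >= 0 and hw_threads >= 1")
    return min(num_tasks, hw_threads) / hw_threads


if __name__ == "__main__":
    # 32 single-threaded processes on 32 hardware threads (8 cores x 4).
    print(rounds_needed(32, 32))
    # The same 32 processes on a chip with only 2 hardware threads.
    print(rounds_needed(32, 2))
    # A single single-threaded task on 32 hardware threads.
    print(busy_fraction(1, 32))
```

In this model the 32 tasks never need to be threads of one program; they only need to be runnable at the same time. It also shows the flip side: with a single task, 31 of the 32 hardware threads have nothing to do.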
Betwon - Friday, December 30, 2005 - link
NO!
You seem to be confused about the concept of SMT vs. CMT.
It is very inefficient if T1 uses only CMT and not SMT.
T1 has no branch prediction, only one instruction issue per core, and very, very poor FP performance.
The only explanation for how to improve the (very poor) efficiency is to use SMT to hide the latency (from branch misses, cache misses, etc.).
But it has only 8KB of L1 (which will be shared by 4 threads), so cache misses will increase. It may even get worse.
thesix - Friday, December 30, 2005 - link
Explain to me the conceptual difference between SMT and CMT?
All you have said is the (component) _implementation_ difference between T1 and POWER in achieving hardware threading.
Since you appear to know this topic quite well, why make an ignorant comment like this:
"It is too difficult to find apps (2-thread-parallel-well or more) for P4; how to find the apps (32-thread-parallel-well) easily?"
and keep screaming about the lack of floating-point performance?
I simply don't understand why you're so upset.
Betwon - Friday, December 30, 2005 - link
My English has some problems.
I think that T1 uses both CMT and SMT.
SMT -- one core with four threads
CMT -- one CPU with eight cores
Without SMT, the cores of T1 would be very inefficient (because of the stall latency caused by branch misses and cache misses).
The very very poor FP performance of T1 is the truth.
We have to remind ourselves that it is only an integer CPU. Its FP performance is too terrible.
fitten - Sunday, January 1, 2006 - link
There is no "reminding" anyone of the poor FPU performance. The thing was never designed to be strong in FPU (quite obviously). It has "enough" FPU so that it doesn't have to do software emulation, and that's it. So... going on and on about FPU performance is a useless argument here. Neither Sun nor anyone else talking about the T1 has ever said that it would be good at FPU performance, because it wasn't designed to be.
This CPU was designed for servers. Servers typically have high cache miss rates anyway because of a number of things (streaming any kind of I/O doesn't benefit much from data locality). Server processes also typically have lots of I/O stalls. When a context stalls, each core has multiple other contexts to choose from in order to keep running.
So, I think the points you are trying to stress are quite obvious from the design of the CPU and the types of loads it was designed to handle. Yes, poor FPU performance is obvious from having very limited (and slow) FPU resources. Yes, if you aren't running lots of threads the machine is inefficient, because the thing is designed to take advantage of server-type threads where there will be lots of I/O stalling, and if there is nothing else to run while waiting on the I/O requests to finish, it sits idle (much like any other machine). Yes, in-order execution and the lack of branch prediction will not mask any stalls the instruction stream generates (which is OK, because the design of the CPU actually counts on these stalls happening so that lots of nice SMT can happen).
It sounds like you are in violent agreement with everyone :)
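The latency-hiding argument in this comment can be sketched with a simple textbook-style model, written here in Python. It assumes a single-issue core where each thread alternates a fixed stretch of work with a fixed stall, thread switches cost nothing, and stalls of different threads overlap perfectly. The function name and the example figures are hypothetical, not measurements of the T1.

```python
def core_utilization(threads: int, run_cycles: float, stall_cycles: float) -> float:
    """Idealized utilization of a single-issue core that switches to another
    ready thread whenever the current one stalls.

    Each thread alternates run_cycles of work with stall_cycles of waiting.
    Zero switch cost and perfectly overlapping stalls are assumed, so the
    result is an upper bound rather than a prediction.
    """
    if threads < 1 or run_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, run_cycles > 0, stall_cycles >= 0")
    # One thread keeps the core busy for run/(run + stall) of the time;
    # extra threads fill the gaps until the core is saturated.
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))


if __name__ == "__main__":
    # Made-up workload: 10 cycles of work between misses, 30 cycles per stall.
    for n in (1, 2, 4):
        print(n, core_utilization(n, 10, 30))
```

With these made-up figures, one thread keeps the core 25% busy and four threads fill it completely. The model does not capture Betwon's objection that four threads sharing an 8KB L1 may miss more often, which would raise the stall figure.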
thesix - Friday, December 30, 2005 - link
If you're talking about POWER5's SMT, it currently provides two HW threads per core:
http://publib.boulder.ibm.com/infocenter/pseries/i...
If you look closer at T1, the best one has 8 cores, and each core supports four HW threads.
http://www.sun.com/processors/UltraSPARC-T1/
SMT and CMT appear to be the same type of technology (at least conceptually) with different names from two vendors.
> The very very poor FP performance of T1 is the truth.
> We have to remind ourselves that it is only an integer CPU. Its FP performance is too terrible.
OK. Since you have repeated it so many times, I am sure everyone who's reading this will remember, and I do not disagree :-).
Thanks.
Betwon - Friday, December 30, 2005 - link
We think that CMT and SMT are different.
For example:
P4 630 is a kind of SMT CPU, but not a CMT CPU.
AthlonX2 is a kind of CMT CPU, but not an SMT CPU.
From anandtech:
T1 has no branch prediction, and it has only one instruction issue per core and 8KB of L1D per core (too little for 4 threads to share).
POWER5 has 32KB L1D/core, which is used by two threads.
We think that the SMT of T1 may be OK only if the 4 threads use very little L1D cache (which is impossible in most cases).
Betwon - Friday, December 30, 2005 - link
edit:
The only explanation for how to improve the (very poor) efficiency is to use SMT to hide the stall latency (from branch misses, cache misses, etc.).
But a core has only 8KB of L1 (which will be shared by 4 threads), so cache misses will increase. It may even get worse.
Betwon - Friday, December 30, 2005 - link
edit: T1 has no branch prediction and it has only one instruction issue per core.
Brian23 - Friday, December 30, 2005 - link
Obviously the apps that they used to benchmark in this article like running on the chip. Also, this chip doesn't run Windows. It runs Sun's proprietary operating system. (I forgot what it's called.) Sun will give this new chip software support because they want it to do well.
I think I read in the article that the chip is backwards compatible with Sun's previous chip designs, meaning a lot of software is already available that will run on the chip.
Betwon - Friday, December 30, 2005 - link
NO!
The area of 32-thread-parallel-well apps is too narrow.
'have many threads' is not equal to '32-thread-parallel-well'!
Even if there are 32 threads, if they do not run well in parallel, this new CPU will waste more than 90% of its potential.
The efficiency of Itanium (Itanium is capable of 1.3-1.5 IPC) is much better than that of x86 CPUs (0.7-0.9 IPC). Itanium never used OOO logic and long pipelines.
Betwon - Friday, December 30, 2005 - link
The efficiency of Itanium2 is still better than that of IBM's POWER5: an Itanium2 core may retire 6 instructions/cycle, and a POWER5 core can retire 5 instructions/cycle.
But a core of this new CPU retires only one instruction/cycle.
Brian23 - Friday, December 30, 2005 - link
I think you missed the part where x86 chips spend 400 cycles waiting on memory accesses when the Sun chip just keeps chugging with another thread while the load is happening.
Calin - Tuesday, January 3, 2006 - link
Those 400 cycles are related to the higher clock speed (if your processor ran at half the clock speed, it would wait only 200 cycles). I assume the 400 cycles are based on the Xeon processor (which has a high clock speed and a slower FSB).
Betwon - Friday, December 30, 2005 - link
NO!
It is not true for all x86 CPUs.
While an Athlon64 spends many cycles waiting on memory accesses, a P4 with HT just keeps chugging with another thread while the load is happening.
Do you understand what I want to say?
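As a footnote to Calin's point earlier in this sub-thread that the cycle count scales with clock speed, the arithmetic is just latency in nanoseconds times clock in GHz. A minimal Python sketch follows; the 100 ns memory latency is a made-up round number chosen so that 4.0 GHz gives 400 cycles, not a figure from the article.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core cycles that pass while one memory access of latency_ns completes.

    One GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    """
    if latency_ns < 0 or clock_ghz <= 0:
        raise ValueError("need latency_ns >= 0 and clock_ghz > 0")
    return latency_ns * clock_ghz


if __name__ == "__main__":
    latency_ns = 100.0  # hypothetical memory latency, for illustration only
    for clock_ghz in (4.0, 2.0, 1.2):
        print(clock_ghz, stall_cycles(latency_ns, clock_ghz))
```

Halving the clock halves the number of cycles lost per access, but the wall-clock wait stays the same, which is why a slower core with other threads ready to run loses less from each miss.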
Brian23 - Saturday, December 31, 2005 - link
While it's true that HT helps fight this issue, it's not the complete solution. Sun's approach is much better.
Betwon - Thursday, December 29, 2005 - link
How terrible!
A single-issue pipeline per core!
People always complain that we fail to find enough threads (2 or 4 threads) in most apps for a multi-thread CPU.
Now it is very difficult to find an app with 8x4=32 threads that run well in parallel.
Calin - Tuesday, January 3, 2006 - link
It is hard to find parallelism in one application so you could run it well on two cores. However, if you use 32 applications, you can run them very well on 32 cores.
JarredWalton - Thursday, December 29, 2005 - link
Most servers don't run a lot of single-threaded apps, or if they do they run many instances of the single-threaded app/process at the same time. This is clearly not a chip designed for all markets, but it is instead focused on doing very well in a niche market.
thesix - Thursday, December 29, 2005 - link
Johan,
Nice article!
A small point: I don't think it's correct to refer to Sun Microsystems Inc. as 'SUN'; it should be 'Sun'.
Even though it originally stood for Stanford University Network, 'SUN' is no longer the semi-official name, AFAIK.
When the T1-based systems were announced, I was hoping to see some independent benchmarks from Anandtech, especially the MySQL one you guys used to benchmark server performance.
I know it's not scientific, and SPEC is as good as it gets, still I am curious :-)
Have you guys considered using T1000/T2000 to power Anandtech, given it's so cheap and designed for webserver type of workload?
That would be a good win-back story for Sun. I remember you guys migrated from Sun Ultra boxes to PC servers several years ago :-)
steveha - Thursday, December 29, 2005 - link
Why drop the Opteron from the SPECweb2005 results? Did it destroy the T1?
stephenbrooks - Monday, January 2, 2006 - link
We think we should be told.
NullSubroutine - Thursday, December 29, 2005 - link
How are these priced? It seems the performance per watt is very good, but what if the CPU and the platform cost more?
I might have missed it, but what was the die size?
icarus4586 - Thursday, December 29, 2005 - link
I'm assuming that should read,
I wouldn't guess Sun is using IBM technology or marketing terms.
JohanAnandtech - Thursday, December 29, 2005 - link
As thesix already commented (thanks :-), hypervisor is indeed IBM talk. AFAIK, IBM was first.
thesix - Thursday, December 29, 2005 - link
"Hypervisor" is a technology used mostly by IBM since the mainframe days. Every system vendor can implement this technology in their systems.
pmurphy - Thursday, December 29, 2005 - link
Actually, let's start by saying you're missed on aceshardware... and I do have to wonder how you felt about the oath of allegiance to Intel anandtech requires?
Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away.
In addition: you cite (as a lot of other people do too) this 1.2GHz "maximum" as if it had reality - it does not. As issued, the T1 incorporates some design trade-offs that make higher cycle rates impractical, but those are the result of engineering vs. marketing (time and cost) trade-offs, not inherent consequences of the technology. Sun has faster test units running now - with very high end products in the pipeline.
defter - Thursday, December 29, 2005 - link
"Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away."
The floating point limitation won't go away; 8 FPUs@1.4GHz will just make the floating point capabilities of the chip somewhat useful. For comparison, a dual-core Opteron has 6 FPUs@2.4GHz NOW, and in 2007 there will be quad-core Opterons (12 FPUs) available.
As somebody already mentioned, performance/$ is also very important. While T1 is way faster than any other chip, I guess it will cost much more, probably more than 2 high end dual-core Opterons.
I'm not saying that T1 isn't good. It is, but only in certain tasks.
JohanAnandtech - Thursday, December 29, 2005 - link
I don't think it is a tradition at Anandtech to swear allegiance to Intel, or else they have forgotten to tell me. :-)
All jokes aside, when I say Intel has the advantage on hardware VT technology and the software support needed, that is solely based on facts. Sun is actively trying to get full support of Xen (VM), and also Linux and FreeBSD OS support, but for the moment T1 is Solaris only if you want good software support.
AFAIK there is no indication that SUN can go much faster than 1.2 GHz. Letting the 4 threads access a 5.7 KB register file in one cycle is probably limiting the clockspeed, and the 6 stage pipeline is another clear indication that this CPU won't clock much higher. That SUN is counting on 65 nm to increase the clockspeed (to 1.4 GHz and more) is another indication.
ravedave - Thursday, December 29, 2005 - link
When might we expect to see Anandtech benchmarks? 1-2 months?
Puddleglum - Thursday, December 29, 2005 - link
The [2] SUN T1 benchmarks reference link is pointing to a bizarre location at intel.com. The text says sun.com, but the link points to intel.com.
It should be fixed to point to: http://www.sun.com/servers/coolthreads/t1000/bench...
ncage - Thursday, December 29, 2005 - link
ncage - Thursday, December 29, 2005 - link
It looks like Sun is back with a vengeance. This thing seems perfect for the server market. I am really surprised that they were able to get their $hit back together. I doubt the single-threaded performance on this thing would be that great, but then again, who cares? This thing is a server, not a workstation made for single-threaded use. This thing would be perfect for virtualization. I don't know if this is possible for Solaris, or maybe VMware/MS Virtual Server will have this feature in the future, but hopefully they will allow you to allocate whichever cores you want to each virtualization layer. So say you're running 4 OSes and you have 8 cores. You allocate 2 cores to each OS. You notice that 2 of the four have really high CPU utilization. You could then dynamically add one more core to each of the virtualized OSes that had high CPU usage, taken from the ones that had low CPU usage. For those of you who think virtualization isn't a big deal... now wouldn't this be cool?
Slaimus - Thursday, December 29, 2005 - link
Are these benchmarks all running similar TCP/IP stacks? We all know Solaris 10 has a new TCP/IP stack that is much faster than Linux's.
Puddleglum - Thursday, December 29, 2005 - link
The benchmarks are from Sun's website (http://www.sun.com/servers/coolthreads/t1000/bench...)
"SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers."
So, yes, you can assume they're all using the same TCP/IP stack. But, as the article mentions: "Of course, this is an ideal benchmark for the T1 with many java threads."