Need a Phenom benchmark


Dark Shikari
11th November 2008, 23:02
Run checkasm --bench="sad", paste results here or in a pastebin.

I've written some new Nehalem assembly and I figure some of it might be faster on the Phenom as well, but I won't know until someone tests it.

Checkasm download (http://www.mediafire.com/?zowiiegzlom).

kemuri-_9
12th November 2008, 01:56
nop: 684
sad_4x4_c: 1098
sad_4x4_mmx: 164
sad_4x8_c: 1925
sad_4x8_mmx: 242
sad_8x4_c: 1750
sad_8x4_mmx: 167
sad_8x4_mmx_c64: 182
sad_8x4_mmx_c32: 189
sad_8x8_c: 3172
sad_8x8_mmx: 243
sad_8x8_mmx_c64: 270
sad_8x8_mmx_c32: 282
sad_8x8_sse2: 252
sad_8x16_c: 6074
sad_8x16_mmx: 402
sad_8x16_mmx_c64: 440
sad_8x16_mmx_c32: 464
sad_8x16_sse2: 404
sad_16x8_c: 7017
sad_16x8_mmx: 354
sad_16x8_mmx_c64: 436
sad_16x8_mmx_c32: 504
sad_16x8_sse2: 279
sad_16x8_sse2_c64: 414
sad_16x8_sse3_c64: 281
sad_16x16_c: 13891
sad_16x16_mmx: 648
sad_16x16_mmx_c64: 838
sad_16x16_mmx_c32: 1025
sad_16x16_sse2: 467
sad_16x16_sse2_c64: 623
sad_16x16_sse3_c64: 468
sad_aligned_4x4_c: 1125
sad_aligned_4x4_mmx: 154
sad_aligned_4x8_c: 1909
sad_aligned_4x8_mmx: 225
sad_aligned_8x4_c: 1753
sad_aligned_8x4_mmx: 150
sad_aligned_8x8_c: 3150
sad_aligned_8x8_mmx: 234
sad_aligned_8x8_sse2: 232
sad_aligned_8x16_c: 6048
sad_aligned_8x16_mmx: 374
sad_aligned_8x16_sse2: 359
sad_aligned_16x8_c: 7006
sad_aligned_16x8_mmx: 328
sad_aligned_16x8_sse2: 242
sad_aligned_16x16_c: 13877
sad_aligned_16x16_mmx: 619
sad_aligned_16x16_sse2: 366
sad_x3_4x4_c: 3298
sad_x3_4x4_mmx: 279
sad_x3_4x8_c: 5632
sad_x3_4x8_mmx: 425
sad_x3_8x4_c: 5332
sad_x3_8x4_mmx: 318
sad_x3_8x4_sse2: 331
sad_x3_8x8_c: 9619
sad_x3_8x8_mmx: 515
sad_x3_8x8_mmx_c64: 671
sad_x3_8x8_mmx_c32: 774
sad_x3_8x8_sse2: 492
sad_x3_8x16_c: 18272
sad_x3_8x16_mmx: 942
sad_x3_8x16_mmx_c64: 1197
sad_x3_8x16_mmx_c32: 1345
sad_x3_8x16_sse2: 858
sad_x3_16x8_c: 21340
sad_x3_16x8_mmx: 961
sad_x3_16x8_mmx_c64: 1301
sad_x3_16x8_mmx_c32: 1587
sad_x3_16x8_sse2: 532
sad_x3_16x8_sse2_c64: 1099
sad_x3_16x8_sse3_c64: 532
sad_x3_16x16_c: 42275
sad_x3_16x16_mmx: 1825
sad_x3_16x16_mmx_c64: 2553
sad_x3_16x16_mmx_c32: 3243
sad_x3_16x16_sse2: 962
sad_x3_16x16_sse2_c64: 1656
sad_x3_16x16_sse3_c64: 963
sad_x4_4x4_c: 4356
sad_x4_4x4_mmx: 367
sad_x4_4x8_c: 7485
sad_x4_4x8_mmx: 564
sad_x4_8x4_c: 7110
sad_x4_8x4_mmx: 422
sad_x4_8x4_sse2: 402
sad_x4_8x8_c: 12865
sad_x4_8x8_mmx: 701
sad_x4_8x8_mmx_c64: 912
sad_x4_8x8_mmx_c32: 1067
sad_x4_8x8_sse2: 608
sad_x4_8x16_c: 24345
sad_x4_8x16_mmx: 1265
sad_x4_8x16_mmx_c64: 1561
sad_x4_8x16_mmx_c32: 1807
sad_x4_8x16_sse2: 1017
sad_x4_16x8_c: 28290
sad_x4_16x8_mmx: 1264
sad_x4_16x8_mmx_c64: 1729
sad_x4_16x8_mmx_c32: 2168
sad_x4_16x8_sse2: 701
sad_x4_16x8_sse2_c64: 1480
sad_x4_16x8_sse3_c64: 701
sad_x4_16x16_c: 56305
sad_x4_16x16_mmx: 2393
sad_x4_16x16_mmx_c64: 3380
sad_x4_16x16_mmx_c32: 4347
sad_x4_16x16_sse2: 1234
sad_x4_16x16_sse2_c64: 2221
sad_x4_16x16_sse3_c64: 1234


is that all you needed?

Dark Shikari
12th November 2008, 02:49
Yup, looks good. Seems this patch will help a bit on Phenom, but I'll have to be careful which ones I enable. Thanks.
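
One way to see which variants are worth enabling is to group the posted numbers by block size and pick the lowest one (lower is faster). This is only a rough sketch: `parse_bench` and `fastest` are hypothetical helper names, not part of x264 or checkasm, and the sample lines are copied from the results above.

```python
import re
from collections import defaultdict

# A few lines copied from the posted checkasm output.
SAMPLE = """\
nop: 684
sad_8x8_c: 3172
sad_8x8_mmx: 243
sad_8x8_mmx_c64: 270
sad_8x8_mmx_c32: 282
sad_8x8_sse2: 252
sad_16x16_c: 13891
sad_16x16_mmx: 648
sad_16x16_mmx_c64: 838
sad_16x16_mmx_c32: 1025
sad_16x16_sse2: 467
sad_16x16_sse2_c64: 623
sad_16x16_sse3_c64: 468
"""

# "<function>_<W>x<H>_<variant>: <number>"; lines without a block size
# (such as "nop") do not match and are skipped.
LINE = re.compile(r"^(?P<func>.+?_\d+x\d+)_(?P<variant>\w+):\s*(?P<cycles>\d+)$")


def parse_bench(text):
    """Return {function: {variant: reported number}} for the given output."""
    results = defaultdict(dict)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            results[m["func"]][m["variant"]] = int(m["cycles"])
    return dict(results)


def fastest(variants):
    """Name of the variant with the lowest reported number."""
    return min(variants, key=variants.get)


results = parse_bench(SAMPLE)
for func, variants in results.items():
    best = fastest(variants)
    print(f"{func}: fastest is {best} ({variants[best]}, C is {variants['c']})")
```

On this sample the plain mmx version wins for 8x8 while sse2 wins for 16x16, which is the kind of per-function difference that calls for enabling variants selectively.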

kemuri-_9
12th November 2008, 03:31
cool, looking forward to it then.

plonk420
20th November 2008, 08:23
a) kemuri-_9, do you have a 9x00 or a 9x50? and b) does it matter to you, DS?

edit: (i have a 9x50)

also: back when i bought my phenom (around the B3 stepping launch date) .. it seemed to keep up with C2Quads in a price-to-performance comparison... at least with Anandtech benchmarks. is that what was experienced by you all/in real world testing?

kemuri-_9
20th November 2008, 15:48
i have a 9850, as shown in my spec signature.

besides, the patch has already been committed.

burfadel
20th November 2008, 23:47
What about Shanghai? haven't got one myself and won't be changing to one, but the results may be different...

Dark Shikari
20th November 2008, 23:51
What about Shanghai? haven't got one myself and won't be changing to one, but the results may be different...
If results are at all different, we'll need a completely new benchmark of all of checkasm.

Sharktooth
21st November 2008, 04:02
Shanghai core has some differences from Barcelona core. plus more L3 cache.
dunno if they added SSE5 though.
initial and unpublished benches show up to 40% speedup, but we won't know for sure until numbers are published...

Dark Shikari
21st November 2008, 04:05
Shanghai core has some differences from Barcelona core. plus more L3 cache.
dunno if they added SSE5 though.
initial and unpublished benches show up to 40% speedup, but we won't know for sure until numbers are published...
No, SSE5 is for Bulldozer core.

Dark Shikari
21st November 2008, 06:43
Time for another Phenom benchmark!

I need these (http://www.mediafire.com/?hvowmjntkyz) run 5 times each (to get a standard deviation and make sure the differences are not just chance) with the following options:

./checkasm --bench="hpel" 2> /dev/null

An example output of this would be:

hpel_filter_c: 188983
hpel_filter_mmx: 94321
hpel_filter_sse2slow: 91880
hpel_filter_sse2: 65004
hpel_filter_ssse3: 41563

This is testing whether using a bunch of unaligned loads instead of the SSE2 PALIGNR macro is faster on Phenom, where unaligned loads are just as fast as aligned loads. The change hurts on Conroe and helps on k8.
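
For anyone wondering why the two approaches are interchangeable: PALIGNR is an SSSE3 instruction, so the SSE2 code has to emulate it with shifts and an OR. The byte-level model below is only a sketch in Python, not x264's macro; `palignr` and `unaligned_load` are hypothetical names. It shows that combining two aligned 16-byte loads gives the same bytes as one unaligned load.

```python
def palignr(hi: bytes, lo: bytes, shift: int) -> bytes:
    """Model of PALIGNR on 16-byte registers.

    Concatenate hi:lo (hi in the upper half), shift right by `shift`
    bytes, and keep the low 16 bytes. Index 0 is the lowest address.
    """
    assert len(hi) == len(lo) == 16 and 0 <= shift <= 16
    return (lo + hi)[shift:shift + 16]


def unaligned_load(buf: bytes, offset: int) -> bytes:
    """Model of a 16-byte load from an arbitrary byte offset."""
    return buf[offset:offset + 16]


buf = bytes(range(48))
for offset in range(32):
    base = offset & ~15               # round down to a 16-byte boundary
    lo = buf[base:base + 16]          # first aligned load
    hi = buf[base + 16:base + 32]     # second aligned load
    assert palignr(hi, lo, offset - base) == unaligned_load(buf, offset)
print("aligned loads + PALIGNR match unaligned loads for every offset")
```

Both routes produce identical data, so the choice between them is purely a question of which is cheaper on a given CPU.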

As usual with assembly benches, I only need one, so don't spam any more benches after the first person has responded.

P.S. if anyone working at AMD is reading this, x264 would run a lot faster on AMD chips if we had SSH access to a Phenom to do testing on. In fact, how about you just give us SSH access to prerelease chips, too? I mean, Intel did.

shon3i
21st November 2008, 08:47
AMD Phenom 9550

hpel_filter_c: 234217
hpel_filter_mmx: 48442
hpel_filter_sse2slow: 46446
hpel_filter_sse2: 35192

Dark Shikari
21st November 2008, 08:48
AMD Phenom 9550

hpel_filter_c: 234217
hpel_filter_mmx: 48442
hpel_filter_sse2slow: 46446
hpel_filter_sse2: 35192
As I said, I need both checkasms tested (to compare)... and 3-5 times each, so I can make sure the numbers aren't deviating too much.

shon3i
21st November 2008, 09:29
As I said, I need both checkasms tested (to compare)... and 3-5 times each, so I can make sure the numbers aren't deviating too much.
Sorry :) I will update when I'm back home :)

kemuri-_9
21st November 2008, 09:44
checkasm_old:

nop: 680
hpel_filter_c: 240371
hpel_filter_mmx: 48328
hpel_filter_sse2slow: 46299
hpel_filter_sse2: 40085
nop: 680
hpel_filter_c: 237375
hpel_filter_mmx: 48175
hpel_filter_sse2slow: 45973
hpel_filter_sse2: 40072
nop: 685
hpel_filter_c: 237202
hpel_filter_mmx: 48480
hpel_filter_sse2slow: 46390
hpel_filter_sse2: 40188
nop: 684
hpel_filter_c: 238450
hpel_filter_mmx: 48816
hpel_filter_sse2slow: 46181
hpel_filter_sse2: 40076
nop: 681
hpel_filter_c: 241929
hpel_filter_mmx: 48269
hpel_filter_sse2slow: 46242
hpel_filter_sse2: 40139


checkasm:

nop: 680
hpel_filter_c: 236413
hpel_filter_mmx: 48444
hpel_filter_sse2slow: 46217
hpel_filter_sse2: 33148
nop: 685
hpel_filter_c: 235305
hpel_filter_mmx: 48257
hpel_filter_sse2slow: 46248
hpel_filter_sse2: 33168
nop: 699
hpel_filter_c: 237522
hpel_filter_mmx: 48318
hpel_filter_sse2slow: 46379
hpel_filter_sse2: 33109
nop: 680
hpel_filter_c: 237765
hpel_filter_mmx: 48153
hpel_filter_sse2slow: 45976
hpel_filter_sse2: 33085
nop: 680
hpel_filter_c: 235893
hpel_filter_mmx: 48209
hpel_filter_sse2slow: 46274
hpel_filter_sse2: 33225


is that what you needed for this round?

Dark Shikari
21st November 2008, 09:47
Oh wow, 21% faster hpel filter, putting it almost as fast as Nehalem's performance on the SSSE3 version (31k clocks).
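
The 21% figure can be checked against the five hpel_filter_sse2 runs posted above for each build. A minimal sketch using only the Python standard library; `summarize` is a hypothetical helper name, and "faster" is taken here to mean old time divided by new time, minus one.

```python
from statistics import mean, stdev

# hpel_filter_sse2 results from the five runs of each build posted above.
old_sse2 = [40085, 40072, 40188, 40076, 40139]   # checkasm_old
new_sse2 = [33148, 33168, 33109, 33085, 33225]   # checkasm


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    return mean(samples), stdev(samples)


old_mean, old_sd = summarize(old_sse2)
new_mean, new_sd = summarize(new_sse2)

# Speedup expressed as "how much faster the new build is".
speedup = old_mean / new_mean - 1

print(f"old: mean {old_mean}, stdev {old_sd:.1f}")
print(f"new: mean {new_mean}, stdev {new_sd:.1f}")
print(f"speedup: {speedup:.1%}")
```

The means come out to 40112 and 33147, and the spread within each set of runs is tiny compared with the gap between them, so the difference is not noise.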

This patch is definitely going in...

P.S. If anyone can give me access to a Phenom box on SSH, it would be quite welcome... I have quite a few things I want to test.

Easy123
21st November 2008, 11:11
Great to see that performance enhancements of x264 for Phenom are getting done. Greatly appreciated, as I've had a 9550 for some weeks now.

@Dark Shikari:
The only thing I could contribute is installing Ubuntu onto it and letting it run for a few days, so you could use it over SSH. It's my personal computer @ home though ;) Let me know if that would be useful *gg*

Dark Shikari
21st November 2008, 11:18
Great to see that performance enhancements of x264 for Phenom are getting done. Greatly appreciated, as I've had a 9550 for some weeks now.

@Dark Shikari:
The only thing I could contribute is installing Ubuntu onto it and letting it run for a few days, so you could use it over SSH. It's my personal computer @ home though ;) Let me know if that would be useful *gg*
That'd be fine; having one on-call indefinitely would be much more useful, but I could try a few things this weekend as well (more tests of the misaligned-SSE options, fast unaligned loads, etc).

Easy123
21st November 2008, 11:24
Okay, I will install it this afternoon so you can test some things ;)

Denner
21st November 2008, 11:35
Hi, I am on a Phenom X4 9750 (95 watt version). Is there any way that I can help out?

I'm a NOOB in the x264 encoding game, so you might need to guide me a bit if I am to help out :-)

Sharktooth
21st November 2008, 15:18
@D_S: i'll soon (a couple of months or so) have a Phenom II box. i can set up SSH access but we also need a VPN since the IP of the machine will not be public... so a tunnel will be required.

btw, i think this is interesting for you all: http://www.pcper.com/comments.php?nid=6455

hajj_3
21st November 2008, 21:41
hope you can get AMD to send you a Phenom II rig to play with and improve x264 on before they launch it. Will be interesting to see how the Phenom II performs, hopefully similar to Core i7 and hopefully overclocking by the same amount too, Intel needs a competitor for the mid-range CPUs.

keep up the great work, love speed improvements you guys regularly add:)

Easy123
18th December 2008, 20:21
Is there any News regarding this Topic, or as I like to call it, for the Dark Side of my Quadcore??? *gg*

nurbs
18th December 2008, 20:49
commit f9dba8bb274dffb19394db20912823464efcb8e1 r1030
Author: Jason Garrett-Glaser <>
Date: Fri Nov 21 03:39:11 2008 -0800

Phenom CPU optimizations
Faster hpel_filter by using unaligned loads instead of emulated PALIGNR
Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it).
Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref.
Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom.
Merge cpu-32.asm and cpu-64.asm
Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.

Easy123
18th December 2008, 21:25
Didn't see that ^^ Thx...

Dark Shikari
30th December 2008, 16:20
Time for another phenom benchmark!

Download Checkasm (http://kovensky.project357.com/checkasm.exe) and run it. If all tests pass (just to be sure), run:

checkasm --bench="coeff"

and post the results.

Note that the numbers are not in fact correct because the benchmark does not accurately measure the speed cost of these functions, so C is not actually faster than asm ;) But it's still useful for comparative tests among the different asm functions.

kemuri-_9
30th December 2008, 17:54
C:\>checkasm.exe
x264: using random seed 1774247936
x264: MMX
- pixel sad : [OK]
- pixel sad_aligned : [OK]
- pixel ssd : [OK]
- pixel satd : [OK]
- pixel sa8d : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- pixel var : [OK]
- pixel hadamard_ac : [OK]
- intra satd_x3 : [OK]
- intra sad_x3 : [OK]
- ssim : [OK]
- esa ads: [OK]
- sub_dct4 : [OK]
- sub_dct8 : [OK]
- add_idct4 : [OK]
- add_idct8 : [OK]
- dct4x4dc : [OK]
- idct4x4dc : [OK]
- zigzag_frame : [OK]
- zigzag_field : [OK]
- mc luma : [OK]
- mc chroma : [OK]
- mc wpredb : [OK]
- hpel filter : [OK]
- lowres init : [OK]
- integral init : [OK]
- intra pred : [OK]
- deblock : [OK]
- quant : [OK]
- dequant : [OK]
- denoise dct : [OK]
- decimate_score : [OK]
- coeff_last : [OK]
- coeff_level_run : [OK]
- cabac : [OK]
x264: MMX Cache64
- pixel sad : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- mc luma : [OK]
- lowres init : [OK]
x264: MMX Cache32
- pixel sad : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- mc luma : [OK]
- lowres init : [OK]
x264: SSE2Slow
- pixel sad_aligned : [OK]
- pixel ssd : [OK]
- pixel satd : [OK]
- pixel sa8d : [OK]
- pixel var : [OK]
- ssim : [OK]
- sub_dct4 : [OK]
- sub_dct8 : [OK]
- add_idct4 : [OK]
- add_idct8 : [OK]
- hpel filter : [OK]
- integral init : [OK]
- intra pred : [OK]
- deblock : [OK]
- quant : [OK]
- dequant : [OK]
- denoise dct : [OK]
- decimate_score : [OK]
- coeff_last : [OK]
- coeff_level_run : [OK]
x264: SSE2Fast
- pixel sad : [OK]
- pixel sad_aligned : [OK]
- pixel satd : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- pixel var : [OK]
- pixel hadamard_ac : [OK]
- intra sad_x3 : [OK]
- esa ads: [OK]
- zigzag_frame : [OK]
- mc luma : [OK]
- mc chroma : [OK]
- mc wpredb : [OK]
- hpel filter : [OK]
- lowres init : [OK]
- intra pred : [OK]
x264: SSE2Fast Cache64
- pixel sad : [OK]
- pixel sad_aligned : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- mc luma : [OK]
x264: SSE_Misalign
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- mc luma : [OK]
- hpel filter : [OK]
x264: SSE_LZCNT
coeff_last4: [FAILED]
coeff_last15: [FAILED]
coeff_last16: [FAILED]
coeff_last64: [FAILED]
- coeff_last : [FAILED]
coeff_level_run4: [FAILED]
coeff_level_run15: [FAILED]
coeff_level_run16: [FAILED]
- coeff_level_run : [FAILED]
x264: SSE3
- pixel sad : [OK]
- pixel sad_aligned : [OK]
- pixel sad_x3 : [OK]
- pixel sad_x4 : [OK]
- mc luma : [OK]
x264: at least one test has failed. Go and fix that Right Now!

Dark Shikari
30th December 2008, 18:17
Thanks, bug resolved on IRC.