Alliance for Open Media codecs [Archive] - Page 25


Mr_Khyron

9th November 2018, 20:16

https://mspoweruser.com/microsoft-release-av1-video-codec-for-windows-10/
Microsoft has released support for the new AV1 royalty-free video codec for Windows 10 via the Microsoft Store.

AOMedia Video 1 (AV1), is an open, royalty-free video coding format designed for video transmissions over the Internet. It is being developed by the Alliance for Open Media (AOMedia) and is meant to be a successor to VP9 without relying on any MPEG patents.

The AV1 extension in the Microsoft Store is an early beta version of the AV1 software decoder. Since this is an early release, users may see some performance issues when playing AV1 videos.

Microsoft says they will be regularly updating the codec via automatic store updates.

Find the new codec in the Microsoft Store here.

hydra3333

10th November 2018, 10:08

he he, clicked on "Get" 30 times in the microsoft store and nothing happens ... that may be saying something about quality.

v0lt

10th November 2018, 13:51

I tried benchmarking with ffmpeg 4.2
from 16fps with libaom to 77fps with Dav1d
Where can I download ffmpeg with libdav1d library?

lvqcl

10th November 2018, 14:22

from 16fps with libaom to 77fps with Dav1d

AFAICS dav1d has only x86-64 AVX2 assembly code, right?
I wonder what their plans are for older hardware...

SmilingWolf

10th November 2018, 15:04

aomenc -v --threads=8 --cpu-used=4 --row-mt=1 --lag-in-frames=25 --auto-alt-ref=1 --passes=2 --pass=2 --bit-depth=10 --input-bit-depth=10 --end-usage=q --cq-level=28 -o Chimera_DCI4k2398p_HDR_P3PQ.ivf Chimera_DCI4k2398p_HDR_P3PQ.y4m
That worked, thanks!

Where can I download ffmpeg with libdav1d library?
Win64 GCC 8.2 static build:
ffmpeg-4.2-92396-g55e021f39b: https://mega.nz/#!IgAAVayA!jpzHzBaE6hZpmCb4_1Fdj-es2oRV-FnbjR-ruOD8lCI
- libaom 1.0.0-902-g03d8ebedc
- libdav1d 58fc516

NikosD

10th November 2018, 18:58

In order to GET the new MS AV1 codec from MS Store, you need to install the forbidden (banned) Windows October 2018 Update.

Test:
MS Windows October x64
Core i3 4170
DXVA Checker (new beta version)

Sample:
Chimera AV1 1080p 8bit (Netflix free sample)

LAV x64 0.73.1 vs MS MFT AV1

LAV x64 19/34/144 (min/avg/max fps) CPU Usage: 57/70/83 (%)

MS MFT AV1 15/26/156 CPU Usage: 50/68/81

It seems that AOM AV1 codec is ~30% faster than MS MFT AV1 on average fps
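
The ~30% figure can be checked from the average fps values in the test above. A minimal sketch of the arithmetic (the function and variable names are just illustrative):

```python
# Average decode speeds reported in the test above (fps).
lav_avg_fps = 34.0      # LAV x64 0.73.1 (the "AOM AV1 codec" in the post)
ms_mft_avg_fps = 26.0   # MS MFT AV1 extension from the Microsoft Store


def relative_speedup(fast: float, slow: float) -> float:
    """Return how much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


speedup = relative_speedup(lav_avg_fps, ms_mft_avg_fps)
print(f"LAV is {speedup:.1f}% faster on average fps")
```

Note that the minimum and maximum fps do not follow the same pattern (the MS decoder has the higher maximum), so the ~30% applies to the average only.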

v0lt

10th November 2018, 19:16

@SmilingWolf
Thanks.
But my results are different from those that were announced here.
I ran the following tests:
ffmpeg -hide_banner -t 10 -c:v libaom-av1 -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -
ffmpeg -hide_banner -t 10 -c:v libdav1d -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -
ffmpeg -hide_banner -t 10 -c:v libdav1d -threads 4 -tilethreads 4 -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -
And got the following results:
libaom-av1 - max 14 fps
libdav1d - max 7.4 fps
libdav1d -threads 4 -tilethreads 4 - max 9.7 fps

Added:
Intel i5-3570k, Windows 7 Sp1 x64.

richardpl

10th November 2018, 20:20

Probably because you are not using right arch and CPU combo.

Wolfberry

11th November 2018, 04:36

I ran the following tests:
ffmpeg -hide_banner -t 10 -c:v libaom-av1 -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -
ffmpeg -hide_banner -t 10 -c:v libdav1d -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -
ffmpeg -hide_banner -t 10 -c:v libdav1d -threads 4 -tilethreads 4 -i Stream2_AV1_4K_22.7mbps.webm -benchmark -f null -

I ran the same test as above and got 16/38/46 fps.
What is the CPU you use for testing?
It might be related to the AVX2 code used in dav1d.

v0lt

11th November 2018, 05:12

it might be related to the avx2 code used in dav1d.
sse2, sse4.1?

Aleksoid1978

11th November 2018, 07:37

Very "good" optimisation dav1d - much slower on my system...

Nintendo Maniac 64

11th November 2018, 09:04

AFAICS dav1d has only x86-64 AVX2 assembly code, right?
I wonder what their plans are for older hardware...

Don't forget that Pentiums and Celerons don't support AVX, and this includes the models that use full-fat Sky/Kaby/Coffee cores such as the ever-popular 2c/4t G4560 and its successor the G5400 (as well as the variants with the faster iGPU like the G4600 and G5500).

And of course, it's those very same AVX-lacking Celerons and Pentiums and such that would stand to gain the biggest benefit from any such software decoder optimizations because those processors simply lack the raw "moar cores!" computational grunt that their i7 and Ryzen brethren have for brute-forcing their way through.

So needless to say, it'd be pretty disappointing to me if dav1d pretty much required having an AVX-capable CPU in order to have any benefit.

Very "good" optimisation dav1d - mush slower on my system...

...that's not a Fernando Alonso reference (https://www.redditmedia.com/mediaembed/6ar8fb), is it?

Mystery Keeper

11th November 2018, 13:14

Selur

11th November 2018, 13:25

I wish aomenc/vpxenc had GOP-level parallelism.
Which would require 2pass encoding and a fixed GOP structure (in regard to the GOP sizes), iirc the 2nd pass normally should be able to override the GOP to achieve vbv limits (not totally sure).

Gravitator

11th November 2018, 13:49

ffmpeg -hide_banner -t 10 -c:v libaom-av1 -i 1.mp4 -benchmark -f null - (43 fps)
ffmpeg -hide_banner -t 10 -c:v libdav1d -i 1.mp4 -benchmark -f null - (52 fps)
ffmpeg -hide_banner -t 10 -c:v libdav1d -threads 1 -tilethreads 2 -i 1.mp4 -benchmark -f null - (61 fps)
ffmpeg -hide_banner -t 10 -c:v libdav1d -threads 2 -tilethreads 2 -i 1.mp4 -benchmark -f null - (65 fps)

lvqcl

11th November 2018, 14:08

sse2, sse4.1?

It seems that one of dav1d developers said: "we don't care about mmx/sse2 support anyway" (link (http://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2018-October/005348.html)). Have no idea about sse4.1.

SmilingWolf

11th November 2018, 15:12

It seems that one of dav1d developers said: "we don't care about mmx/sse2 support anyway" (link (http://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2018-October/005348.html)). Have no idea about sse4.1.

BBB is part of TwoOrioles, so he might have been referring to the company's position, based on its userbase.
Still, MMX is hardly relevant nowadays. SSE4.1 as the lowest bar doesn't sound too unreasonable

Also relevant: https://code.videolan.org/videolan/dav1d/issues/15#note_22262

NikosD

11th November 2018, 19:06

It seems that one of dav1d developers said: "we don't care about mmx/sse2 support anyway"
Have no idea about sse4.1.

Still, MMX is hardly relevant nowadays. SSE4.1 as the lowest bar doesn't sound too unreasonable.

MMX is too old and not that beneficial as it can reach only 64bits (maybe 80bits max)
SSEx should be the base as it is 128bit with very fast implementation on all CPUs of the last 10 years.
Especially SSE2 is mandatory for x64 architecture.
From the last link it's obvious that dav1d developers targeted AVX2 for 256bit acceleration using ASM, but not exclusively.
They are going to optimise for SSEx later.
So no worries, I think.

marcomsousa

11th November 2018, 20:50

if they want to go with 4k and 8k videos they have to use AVX2.

Nintendo Maniac 64

12th November 2018, 00:30

Especially SSE2 is mandatory for x64 architecture.

You can also usually safely target SSE3 (no, not SSSE3) as well since it's supported on all DDR2-capable 64bit x86 CPUs and newer.

(the only 64bit x86 CPUs that don't support SSE3 are some socket 754 and 939 Athlon 64s which used DDR1)

Mystery Keeper

12th November 2018, 05:22

Which would require 2pass encoding and a fixed GOP structure (in regard to the GOP sizes), iirc the 2nd pass normally should be able to override the GOP to achieve vbv limits (not totally sure).

I'm totally fine with that. I usually use 2pass anyway. And, of course, I meant I wish they had it as an option.

LigH

12th November 2018, 12:47

@ Nintendo Maniac 64:

Even AMD Athlon64/Phenom (K8-K10 arch.) support some SSE3; but x264/x265 do not use it, considering their implementation "too slow", I believe.

marcomsousa

12th November 2018, 23:16

SSE3-optimised av1_nn_predict

https://aomedia.googlesource.com/aom/+/486cc9894b7e76b09b4ee37dff6f313f27b1c501

I have developed a SIMD-optimised neural network implementation using
SSE3. I have also added functional equivalence tests between this and
the original implementation. I added aom_clear_system_state() to a few
places where FPU operations are used after av1_nn_predict.

Speed-ups over the original C implementation for various network shapes:
10x64x16: 1.72x
12x12x1: 2.72x
12x24x1: 2.35x
12x32x1: 3.34x
18x24x4: 0.94x
18x32x4: 0.93x
4x16x1: 2.01x
8x16x1: 1.89x
8x16x4: 2.02x
8x24x1: 2.77x
8x32x1: 2.98x
8x64x1: 3.76x
9x32x3: 1.08x
4x8x4: 1.66x

A few awkwardly-shaped networks are slightly slower: these could be
padded to more convenient sizes to use the SIMD kernels.

I also wrote an AVX/AVX2 implementation but on these relatively small
networks it was barely faster than the SSE3 code.
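
The padding idea from the commit message can be illustrated with a plain-Python sketch. This is not libaom's code: it assumes the slow shapes are ones whose input count is not a multiple of the SIMD width (4 floats per 128-bit SSE register), and the layer sizes and weights below are made up. Padding inputs and weights with zeros leaves the result unchanged, because each extra term contributes 0 to the dot product.

```python
def dense_layer(inputs, weights, bias):
    """Plain fully connected layer: out[j] = bias[j] + sum_i inputs[i] * weights[j][i]."""
    return [b + sum(x * w for x, w in zip(inputs, row))
            for row, b in zip(weights, bias)]


def pad_to_multiple(values, multiple, fill=0.0):
    """Append `fill` until len(values) is a multiple of `multiple`."""
    remainder = len(values) % multiple
    if remainder == 0:
        return list(values)
    return list(values) + [fill] * (multiple - remainder)


# Hypothetical 9-input, 3-output layer (9 is not a multiple of 4).
inputs = [float(i) for i in range(1, 10)]
weights = [[0.1 * (j + 1)] * 9 for j in range(3)]
bias = [0.5, -0.5, 0.0]

# Pad both the inputs and each weight row to the next multiple of 4.
padded_inputs = pad_to_multiple(inputs, 4)
padded_weights = [pad_to_multiple(row, 4) for row in weights]

original = dense_layer(inputs, weights, bias)
padded = dense_layer(padded_inputs, padded_weights, bias)

# Zero padding must not change the layer output.
assert all(abs(a - b) < 1e-12 for a, b in zip(original, padded))
print(len(inputs), "->", len(padded_inputs))
```

A real SIMD kernel would then process the padded rows four floats at a time without a scalar tail loop, which is presumably what "more convenient sizes" refers to.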

Nintendo Maniac 64

12th November 2018, 23:23

Even AMD Athlon64/Phenom (K8-K10 arch.) support some SSE3

...but this is exactly what I alluded to?

Athlon 64 CPUs are available on socket 754, 939, and AM2; 754 and 939 used DDR1 memory while AM2 used DDR2, and all AM2 CPUs support SSE3.

(there are some socket 754 and 939 CPUs that support SSE3, though it's kind of hit and miss).

Phenom for reference requires at least DDR2.

LigH

13th November 2018, 08:50

I'm sorry, I don't know socket numbers... :o - so we looked at the same topic from different angles. :D

v0lt

13th November 2018, 19:26

I ran the same test as above and got 16/38/46 fps.
What is the CPU you use for testing?
It might be related to the AVX2 code used in dav1d.
Intel i5-3570k (SSE4.1, SSE4.2, AVX), Windows 7 Sp1 x64.

SmilingWolf

15th November 2018, 00:29

Status report!
Previous edition: http://forum.doom9.org/showthread.php?p=1852449#post1852449
Whatever paragraph I don't repeat here can be assumed to be the same as in the aforementioned post

First of all: graphs! Click to enlarge
Y axis: chosen metric
X axis: bits per pixel

720p:
https://thumb.ibb.co/hDPSs0/msssim-720.png (https://ibb.co/hDPSs0)
https://thumb.ibb.co/j8ObkL/psnrhvsm-720.png (https://ibb.co/j8ObkL)

1080p:
https://thumb.ibb.co/it4XQL/msssim-1080.png (https://ibb.co/it4XQL)
https://thumb.ibb.co/izFvef/psnrhvsm-1080.png (https://ibb.co/izFvef)

BD rates for 720p:
x264 -> rav1e (yeah you read that right!)
RATE (%) DSNR (dB)
MSSSIM -0.736889 0.0375593
PSNRHVS -5.5274 0.375081

rav1e -> x265
RATE (%) DSNR (dB)
MSSSIM -26.5291 1.29942
PSNRHVS -27.1134 1.70509

x265 -> libaom
RATE (%) DSNR (dB)
MSSSIM -18.9088 0.7852
PSNRHVS -15.3123 0.761791

BD rates for 1080p:
x264 -> rav1e (yeah you read that right again!)
RATE (%) DSNR (dB)
MSSSIM -4.92009 0.235151
PSNRHVS -7.23088 0.473125

rav1e -> x265
RATE (%) DSNR (dB)
MSSSIM -26.7063 1.16103
PSNRHVS -28.0007 1.53902

x265 -> libaom
RATE (%) DSNR (dB)
MSSSIM -26.486 0.938124
PSNRHVS -21.7431 0.905916

Encoders:
x264 157-2935-545de2f
x265 2.9-4-471726d3a046
rav1e 0.1.0-702-ab4d23e2
libaom 1.0.0-908-g3a607f7b0

Cmdlines:
x264 --preset veryslow --tune ssim --crf 16 -o test.x264.crf16.264 orig.i420.y4m
x265 --preset veryslow --tune ssim --crf 16 -o test.x265.crf16.hevc orig.i420.y4m
rav1e --low_latency false -o test.rav1e.cq80.ivf --quantizer 80 -s 2 --tune psnr orig.i420.y4m
aomenc --frame-parallel=0 --tile-columns=3 --auto-alt-ref=1 --cpu-used=4 --tune=psnr --passes=2 --threads=2 --end-usage=q --cq-level=20 --test-decode=fatal -o test.av1.cq20.webm orig.i420.y4m

Notes:
So as you can see, the rav1e and aomenc cmdlines have been slightly adjusted to take advantage of the bugfixes and updates from the last months.
In particular, rav1e has been gifted by Frank Bossen the ability to create a B-pyramid, which almost single handedly decreed rav1e's advantage over x264.
A word of warning on this last point: it's still kind of a mixed bag. In very flat, static scenes like PresageFlowerWalk x264 still rules by quite a margin, while rav1e takes the crown in clips like F.Y.C and PresageFlowerFight
F.Y.C, x264 -> rav1e:
RATE (%) DSNR (dB)
MSSSIM -18.451 1.01281
PSNRHVS -25.7463 2.03419

PresageFlowerFight, x264 -> rav1e:
RATE (%) DSNR (dB)
MSSSIM -31.4953 1.80761
PSNRHVS -31.0827 2.27546

PresageFlowerWalk, x264 -> rav1e:
RATE (%) DSNR (dB)
MSSSIM 66.2264 -1.70084
PSNRHVS 70.8208 -2.28853
(as always, a negative BD rate means improvement, positive means regression)
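
For readers unfamiliar with the sign convention, here is a simplified sketch of how a BD rate can be computed. It is not the tool used for the numbers above: common implementations fit a cubic curve through the (log bitrate, quality) points, while this sketch swaps in piecewise-linear interpolation to stay dependency-free. All data points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Linear interpolation of y at x; xs must be strictly increasing."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the curve")


def _mean_log_rate(quality, log_rate, lo, hi):
    """Average log-rate over the quality interval [lo, hi] (trapezoid rule)."""
    knots = sorted({lo, hi, *(q for q in quality if lo < q < hi)})
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        ya = _interp(a, quality, log_rate)
        yb = _interp(b, quality, log_rate)
        area += 0.5 * (ya + yb) * (b - a)
    return area / (hi - lo)


def bd_rate(anchor, test):
    """Average bitrate difference (%) of `test` vs `anchor` at equal quality.

    Both arguments are lists of (bitrate, quality) points with distinct
    quality values. Negative means `test` needs fewer bits.
    """
    def prep(points):
        pts = sorted(points, key=lambda p: p[1])
        return [q for _, q in pts], [math.log(r) for r, _ in pts]

    qa, la = prep(anchor)
    qt, lt = prep(test)
    # Only compare over the quality range covered by both curves.
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])
    if lo >= hi:
        raise ValueError("quality ranges do not overlap")
    diff = _mean_log_rate(qt, lt, lo, hi) - _mean_log_rate(qa, la, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical curves: codec_b reaches the same quality at half the bitrate.
codec_a = [(1000, 30.0), (2000, 34.0), (4000, 38.0), (8000, 42.0)]
codec_b = [(rate / 2, quality) for rate, quality in codec_a]
print(f"BD rate: {bd_rate(codec_a, codec_b):.1f}%")
```

With these made-up points the second codec comes out at about -50%, i.e. an improvement, which is the same reading as the negative values in the tables above.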

Considerations about times with libaom:
I'm using my desktop PC to run all the encodes. It is also my main study/work PC, so the timings can be quite off. Plus, I run multiple encodes in parallel, which further messes up timings.
HOWEVER, between annoying bugs and a lot of stuff, the first report did cost me nearly a week of time (this includes having to re-run some encodes because sh*t happened) ONLY to encode with libaom.
Taking advantage of the recent bugfixes and improvements I have been able to rework my workflow and bring down that time to a couple days only, WITHOUT having to touch the --cpu-used parameter and no night time encoding.
All in all, I am pretty satisfied.

This concludes my (bi-monthly?) report.
As always, I'm open to any kind of feedback to improve my comparisons and my encodes.

benwaggoner

16th November 2018, 19:53

SmilingWolf

16th November 2018, 22:45

So, what's everyone's favorite AV1 decoder app on Windows? Chrome looks to be not converting from video to PC range correctly (blacks are washed out, contrast is low, etcetera). Is there a nightly of something that does AV1 correctly for apples-apples?

VLC 3.0.5 (Nightly). I fixed my nVidia settings just today because I had that same problem while playing back the ToS fragment I use for the tests. Plays out correctly now.
Alternatively, ffplay for quick stuff when I already have a bunch of command prompts open in the right path.

LigH

16th November 2018, 23:27

I use almost only MPC-HC, which uses LAV Filters with a direct API. It was able to play AV1 clips from the YouTube beta playlist and some tiny encodes of my own (I don't have powerful CPUs available). So, only limited experience yet, but it appears to work.

SmilingWolf

18th November 2018, 12:40

32/64bits binaries (GCC 9.0):
av1-1.0.0-941-gd2a592e1c: https://mega.nz/#!F5Am2KyK!9aQ6_7mM2pDJMsW11CM01Jjsa1R7S6_OaZahvKCHPWQ

mandarinka

19th November 2018, 10:51

I wish aomenc/vpxenc had GOP-level parallelism. When each thread is encoding one GOP, and then they are stitched together. That would make use of all CPU power without compromising quality/compression.

You could get the same results by splitting manually into X parts and encoding them separately at once. I'm not sure how much libvpx/libaom account for that. It works great with x264 and x265 (using raw output at least).
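
A minimal sketch of the manual split described above, assuming a fixed GOP size so that every chunk boundary falls on a keyframe position. The function name and the example numbers are made up for illustration. It only computes frame ranges; it does not run any encoder and does nothing about rate control across chunk boundaries.

```python
def split_into_chunks(total_frames: int, parts: int, gop_size: int):
    """Split [0, total_frames) into up to `parts` chunks of whole GOPs.

    Returns a list of (start, end) frame ranges, `end` exclusive. Every
    boundary is a multiple of `gop_size`, so each chunk would start on a
    keyframe of a fixed-GOP encode.
    """
    if total_frames <= 0 or parts <= 0 or gop_size <= 0:
        raise ValueError("all arguments must be positive")
    total_gops = -(-total_frames // gop_size)  # ceiling division
    parts = min(parts, total_gops)             # never more chunks than GOPs
    base, extra = divmod(total_gops, parts)
    chunks = []
    start_gop = 0
    for i in range(parts):
        # The first `extra` chunks get one additional GOP each.
        n_gops = base + (1 if i < extra else 0)
        start = start_gop * gop_size
        end = min((start_gop + n_gops) * gop_size, total_frames)
        chunks.append((start, end))
        start_gop += n_gops
    return chunks


# Example numbers only: 17620 frames, 4 parallel encodes, GOP of 250 frames.
for start, end in split_into_chunks(17620, 4, 250):
    print(f"frames {start}..{end - 1} ({end - start} frames)")
```

Each range could then be fed to a separate encoder instance and the outputs concatenated afterwards.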

mandarinka

19th November 2018, 10:58

@ Nintendo Maniac 64:

Even AMD Athlon64/Phenom (K8-K10 arch.) support some SSE3; but x264/x265 do not use it, considering their implementation "too slow", I believe.

SSE3 is not particularly useful for multimedia and it's just a few instructions introduced in Prescott P4 and Venice 90nm K8.

You probably mean SSSE3 (SSS instead of SS) aka "Supplemental SSE3" which is a confusing and dumb name. Probably should have been SSE4 but got renamed for marketing reasons. Or SSE3 was not supposed to be SSE3 originally.

SSSE3 is very useful for encoding and decoding, but only comes on Core 2 chips, and Bobcat/Bulldozer and later cores from AMD. K10 and K8 end at the not-so-important SSE3.
(Note that x265 actually needs SSSE3 + SSE4 to be useful, you are barred from most of assembly optimization if you only have SSSE3, like with 65nm Core 2s or pre-Sandy Bridge Pentium/Celeron).

LigH

19th November 2018, 13:09

Thanks, mandarinka, that explains a bit. I meant SSE3 of 2004 (a.k.a. "Prescott New Instructions" PNI, according to Wikipedia), originally. SSSE3 of 2006 did not arrive in AMD CPUs before the "Cat" (Fusion APU) and "Heavy Equipment" series, so Athlon64/Phenom are clearly out of business.

benwaggoner

20th November 2018, 01:08

You could get the same results by splitting manually into X parts and encoding them separately at once. I'm not sure how much libvpx/libaom account for that. It works great with x264 and x265 (using raw output at least).

Naïve split-and-stitch risks violating VBV at the stitch boundaries and/or reducing quality at those boundaries in order to ensure VBV.

Not that VBV is being used in any AV1 testing I've seen so far.

utack

20th November 2018, 04:13

So I am not entirely sure about what the stats file from first pass includes.
When using pure "q" mode for constant quality, is there a benefit to doing a first pass, or does the first pass only determine how to distribute bitrate when a target bitrate and vbr is specified?

marcomsousa

20th November 2018, 16:37

Building Modern Web Media Experiences: AV1 (Chrome Dev Summit 2018)
https://youtu.be/iTC3mfe0DwE?t=612

VP9 vs H.264
AV1 is 30% smaller in size than VP9
Support in companies
Support in browsers, WebRTC, web
Switch Codecs and Containers in MSE (AV1,VP9,H.264)
DEMO Switch codecs MSE http://storage.googleapis.com/change_type/index.html

uneedme

21st November 2018, 11:24

Hi all

Still, is there anywhere I could find a detailed explanation of the parameters' functions and argument ranges?

forgive my poor wording...

high-end spree means nothing...

utack

22nd November 2018, 00:09

dav1d is doing well
http://www.jbkempf.com/blog/post/2018/dav1d-toward-the-first-release

Wolfberry

22nd November 2018, 11:48

64-bit GCC 8.2.0 binaries: av1-1.0.0-962-1468e60d7 (https://drive.google.com/drive/folders/1xZQABtoaSFgGu11YstmHKYLzO3elemlC)

AVX2 ver of highbd dr predictions Z1,Z3
performance increase 1.22x-20x depending on input params

NikosD

22nd November 2018, 18:47

dav1d is doing well
http://www.jbkempf.com/blog/post/2018/dav1d-toward-the-first-release

Dav1d is very fast indeed and, although it is optimized for AVX2, RyZen manages to be a lot faster than Haswell, even though Haswell has an AVX2 implementation that is twice as fast.

Scaling to more threads and better hyperthreading implementation along with better clocks (?) for the specific SKUs, probably gave RyZen the clear lead.

mandarinka

22nd November 2018, 19:52

One of the Haswell chips they use for testing is a mobile 4C/8T quad-core which probably runs at low clocks (probably some MacBook, so...) and the other is a lower-priced 4C/4T desktop SKU, which is why they have lower performance than Ryzen. BTW, that Ryzen is a 6C/12T hexa-core anyway (yay for AMD!).

littleD

22nd November 2018, 20:49

I wonder how they ran the six/eight-thread benchmark on a 4-core/4-thread CPU. If they did, that means the decoder has an internal switch for thread count. What's more, a single core is underutilized, since more CPU threads give more performance. And since the benchmark on the 6-core Zen gives better results than on the 4-thread Haswell, the AOM decoder they compare against must be highly single-threaded. Both decoders still have room to improve anyway.
Also, if they compare speed on a 12-thread Zen, this means they are comparing threading. The global comparison benchmark shows it.

utack

26th November 2018, 00:17

I encoded the full 4096x1714 4K version of Tears of Steel with libaom.
Bitrate of the final file is about 3.5mbit/s, and quality is definitely on a scale of at least "very good".
Have fun looking what aom can do with that little bitrate, or at benchmarking:
https://drive.google.com/drive/folders/1qp4SvIxzitLipFiudG3iJDautldl0luL?usp=sharing

benwaggoner

26th November 2018, 18:47

easyfab

26th November 2018, 19:22

@utack

with my AMD 2700x and libdav1d : 98 fps.

frame=17620 fps= 98 q=-0.0 Lsize=N/A time=00:12:14.16 bitrate=N/A speed=4.09x
video:9223kB audio:412879kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=1003.859s stime=182.547s rtime=179.705s
bench: maxrss=2521828kB
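
The ffmpeg output above can be cross-checked with a few lines of Python. In the `bench:` line, `utime` and `stime` are user and system CPU time and `rtime` is wall-clock time. This is only a sketch; the parsing function is hypothetical and handles just the lines shown in this post.

```python
import re

# The relevant lines from the ffmpeg -benchmark output above.
FFMPEG_OUTPUT = """\
frame=17620 fps= 98 q=-0.0 Lsize=N/A time=00:12:14.16 bitrate=N/A speed=4.09x
bench: utime=1003.859s stime=182.547s rtime=179.705s
bench: maxrss=2521828kB
"""


def parse_benchmark(text: str) -> dict:
    """Extract frame count and timings, then derive fps and CPU load."""
    frames = int(re.search(r"frame=\s*(\d+)", text).group(1))
    times = {key: float(value)
             for key, value in re.findall(r"(utime|stime|rtime)=([\d.]+)s", text)}
    # Decoded frames per second of wall-clock time.
    fps = frames / times["rtime"]
    # CPU seconds spent per wall-clock second, i.e. average busy threads.
    cpu_load = (times["utime"] + times["stime"]) / times["rtime"]
    return {"frames": frames, "fps": fps, "cpu_load": cpu_load, **times}


result = parse_benchmark(FFMPEG_OUTPUT)
print(f"{result['fps']:.1f} fps, {result['cpu_load']:.1f} threads busy on average")
```

The derived fps matches the 98 fps that ffmpeg reports, and roughly 6.6 busy threads out of the 2700X's 16 is consistent with the 40-50% CPU usage mentioned in this post.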

Did you encode with tiles? Because it only uses 40-50% of the CPU.
You should: it encodes faster (uses more threads) and also decodes faster.

SmilingWolf

26th November 2018, 20:39

Did you encode with tiles? Because it only uses 40-50% of the CPU.
You should: it encodes faster (uses more threads) and also decodes faster.

He used row-mt and no tiles. At the moment trying to use both makes aomenc sputter invalid bitstreams.
row-mt can maximise CPU usage at encode time, but as you said the lack of tiles reduces playback performance

utack

26th November 2018, 21:47

Wow, what was the encoding time like?
A little over 3 days

Can you share the command line you used?
He used row-mt and no tiles.

It is in the google drive folder
xz -dc tearsofsteel-4k.y4m.xz | aomenc --cpu-used=4 --row-mt=1 --threads=8 --kf-max-dist=250 --bias-pct=75 --webm -o sk.webm --end-usage=q --aq-mode=1 --cq-level=44 --codec=av1 --passes=2 --pass=2 --fpf=/tmp/fpf -

At the moment trying to use both makes aomenc sputter invalid bitstreams.
It seems to do so in the outro, aom craps out at frame 16xxx and I reported it:
https://bugs.chromium.org/p/aomedia/issues/detail?id=2262

row-mt can maximise CPU usage at encode time, but as you said the lack of tiles reduces playback performance
Using dav1d for playback with frame parallel decoding works great, and also does not crash when it reaches the "corrupt" frame

easyfab

26th November 2018, 21:55

My try with the first 1500 frames @1000kb/s with row-mt + tiles

My 2 pass command line :

"7z.exe" x "ToS_1920x800_xdither.7z" -so | aomenc.exe --cpu-used=4 --row-mt=1 --threads=16 --tile-columns=4 --tile-rows=2 --kf-max-dist=250 --bias-pct=75 --webm -o tos.webm --target-bitrate=1000 --codec=av1 --passes=2 --pass=2 --fpf=fpf --limit=1500 -
Pass 2/2 frame 1500/1481 8290530B 2633147 ms 34.18 fpm [ETA unknown]
Pass 2/2 frame 1500/1500 8314991B 44346b/f 1064304b/s 2656175 ms (0.56 fps)

And the result file : https://www.sendspace.com/file/dcf6ii

It seems to decode fine for me.

SmilingWolf

26th November 2018, 22:01

Using dav1d for playback with frame parallel decoding works great, and also does not crash when it reaches the "corrupt" frame

Still, using dav1d and having tiles makes for 30-50% faster decoding on my system using Elecard's Holi Festival 4K clip (tested with ffmpeg, latest libdav1d, -tilethreads 1/2/4).
No reason to give that up IMO

BTW, just to be clear, using row-mt does NOT automatically introduce tiling. Using AOMAnalyzer to take a look at the clip confirms that every frame is just one big tile. There are no columns nor rows

@easyfab:
That's interesting, that it decodes properly. It used to croak in the first couple frames for me (http://forum.doom9.org/showthread.php?p=1856831#post1856831)
Maybe it has been fixed and it flew under my radar. Guess I'll check now
EDIT: Holysmoly you're right, mixing tiles, threads and row-mt together has been fixed!