The bitstream parsing is incredibly fast; the CABAC is actually faster than AVC's, and with wavefront, it's perfectly parallelizable up to a few cores. (There are optimized CABAC implementations so you don't have to re-implement that wheel.) Nearly all of the time is spent in qpel expansion and reduction, copying pixels around, and applying sao/deblocking, plus odds and ends like IDCT. I can do parsing in python, easily, almost as fast as it can read from the disk.
|