FFmpeg/libavcodec/arm
Martin Storsjö dd299a2d6d arm: vp9: Add NEON loop filters
This work is sponsored by, and copyright, Google.

The implementation tries to have smart handling of cases
where no pixels need the full filtering for the 8/16 width
filters, skipping both calculation and writeback of the
unmodified pixels in those cases. The actual effect of this
is hard to test with checkasm though, since it tests the
full filtering, and the benefit depends on how many filtered
blocks use the shortcut.

Examples of relative speedup compared to the C version, from checkasm:
                          Cortex       A7     A8     A9    A53
vp9_loop_filter_h_4_8_neon:          2.72   2.68   1.78   3.15
vp9_loop_filter_h_8_8_neon:          2.36   2.38   1.70   2.91
vp9_loop_filter_h_16_8_neon:         1.80   1.89   1.45   2.01
vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
vp9_loop_filter_v_4_8_neon:          4.53   4.47   3.31   6.05
vp9_loop_filter_v_8_8_neon:          3.58   3.99   2.92   5.17
vp9_loop_filter_v_16_8_neon:         3.40   3.50   2.81   4.68
vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02

The speedup vs C code is around 2-6x. The numbers are quite
inconclusive though, since the checkasm test runs multiple filterings
on top of each other, so later rounds might end up with different
codepaths (different decisions on which filter to apply, based
on input pixel differences). Disabling the early-exit in the asm
doesn't give a fair comparison either though, since the C code
only does the necessary calcuations for each row.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-9x.

This is pretty similar in runtime to the corresponding routines
in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
and vp9_loop_filter_v_16_16_neon - note that the naming of horizonal
and vertical is flipped between the libraries.)

In order to have stable, comparable numbers, the early exits in both
asm versions were disabled, forcing the full filtering codepath.

                           Cortex           A7      A8      A9     A53
vp9_loop_filter_h_16_8_neon:             597.2   472.0   482.4   415.0
libvpx vpx_lpf_vertical_16_neon:         626.0   464.5   470.7   445.0
vp9_loop_filter_v_16_8_neon:             500.2   422.5   429.7   295.0
libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
vp9_loop_filter_v_16_16_neon:            905.0   784.7   791.5   546.0
libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2

Our version is consistently faster on on A7 and A53, marginally slower on
A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
three tests in this particular test run).

Signed-off-by: Martin Storsjö <martin@martin.st>
2016-11-11 14:16:42 +02:00
..
aac.h
aacpsdsp_init_arm.c
aacpsdsp_neon.S
ac3dsp_arm.S
ac3dsp_armv6.S
ac3dsp_init_arm.c
ac3dsp_neon.S
apedsp_init_arm.c dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
apedsp_neon.S dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
asm-offsets.h mpegvideo: move the MpegEncContext fields used from arm asm to the beginning 2014-04-29 14:49:42 +02:00
audiodsp_arm.h dsputil: Split audio operations off into a separate context 2014-06-22 06:20:15 -07:00
audiodsp_init_arm.c dsputil: Split audio operations off into a separate context 2014-06-22 06:20:15 -07:00
audiodsp_init_neon.c audiodsp: reorder arguments for vector_clipf 2016-09-22 09:47:52 +02:00
audiodsp_neon.S audiodsp: reorder arguments for vector_clipf 2016-09-22 09:47:52 +02:00
blockdsp_arm.h blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_init_arm.c blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_init_neon.c blockdsp: drop the high_bit_depth parameter 2016-09-22 09:47:52 +02:00
blockdsp_neon.S dsputil: Split clear_block*/fill_block* off into a separate context 2014-06-18 14:07:23 -07:00
cabac.h arm: get_cabac inline asm 2014-03-09 00:45:34 +01:00
dca.h dcadec: simplify decoding of VQ high frequencies 2014-02-28 13:03:22 +01:00
dcadsp_init_arm.c dca: remove unused decode_hf function and quant_d tables 2015-12-24 13:58:18 +01:00
dcadsp_neon.S dca: remove unused decode_hf function and quant_d tables 2015-12-24 13:58:18 +01:00
dcadsp_vfp.S dcadec: remove scaling in lfe_interpolation_fir 2014-02-28 13:00:47 +01:00
fft_fixed_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
fft_fixed_neon.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
fft_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
fft_neon.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
fft_vfp.S arm: Use .data.rel.ro for const data with relocations 2014-12-09 11:43:25 +02:00
flacdsp_arm.S
flacdsp_init_arm.c
fmtconvert_init_arm.c arm: add ff_int32_to_float_fmul_array8_neon 2015-12-14 16:45:02 +01:00
fmtconvert_neon.S arm: add ff_int32_to_float_fmul_array8_neon 2015-12-14 16:45:02 +01:00
fmtconvert_vfp.S
g722dsp_init_arm.c g722: Add ARM NEON implementation for g722_apply_qmf() 2015-02-15 22:47:21 +02:00
g722dsp_neon.S g722: Add ARM NEON implementation for g722_apply_qmf() 2015-02-15 22:47:21 +02:00
h264chroma_init_arm.c h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
h264cmc_neon.S h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
h264dsp_init_arm.c h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
h264dsp_neon.S
h264idct_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
h264pred_init_arm.c h264: arm: use intra pred8x8 functions only for chroma_format_idc <= 1 2015-07-18 00:28:49 +02:00
h264pred_neon.S
h264qpel_init_arm.c qpeldsp: Mark source pointer in qpel_mc_func function pointer const 2014-07-25 02:52:54 -07:00
h264qpel_neon.S
hpeldsp_arm.h arm: Use full filenames as multiple inclusion guards 2014-01-14 00:04:52 +01:00
hpeldsp_arm.S hpeldsp: arm: Update comments left behind in 25841dfe80 2016-09-29 14:48:03 +02:00
hpeldsp_armv6.S arm: hpeldsp: fix put_pixels8_y2_{,no_rnd_}armv6 2014-03-08 18:31:57 +01:00
hpeldsp_init_arm.c dsputil: Refactor duplicated CALL_2X_PIXELS / PIXELS16 macros 2014-03-22 06:17:29 -07:00
hpeldsp_init_armv6.c
hpeldsp_init_neon.c
hpeldsp_neon.S
idct.h idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_arm.h dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
idctdsp_arm.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_armv6.S dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
idctdsp_init_arm.c idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_init_armv5te.c idct: Move arm-specific declarations to a header in the arm directory 2014-07-20 13:02:17 -07:00
idctdsp_init_armv6.c idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
idctdsp_init_neon.c idct: Move arm-specific declarations to a header in the arm directory 2014-07-20 13:02:17 -07:00
idctdsp_neon.S dsputil: Split off IDCT bits into their own context 2014-06-30 07:58:46 -07:00
int_neon.S dsputil: Move APE-specific bits into apedsp 2014-05-29 06:41:15 -07:00
jrevdct_arm.S
Makefile arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
mathops.h
mdct_fixed_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
mdct_fixed_neon.S
mdct_init_arm.c fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
mdct_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
mdct_vfp.S armv6: Accelerate ff_imdct_half for general case (mdct_bits != 6) 2014-07-18 01:34:08 +03:00
me_cmp_armv6.S dsputil: Split motion estimation compare bits off into their own context 2014-07-17 09:07:10 -07:00
me_cmp_init_arm.c motion_est: convert stride to ptrdiff_t 2014-11-24 01:30:10 +00:00
mlpdsp_armv5te.S arm: mlpdsp: handle pic offset calculation in a macro 2014-12-09 22:00:08 +01:00
mlpdsp_armv6.S cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
mlpdsp_init_arm.c truehd: add hand-scheduled ARM asm version of ff_mlp_pack_output. 2014-03-26 19:54:32 +02:00
mpegaudiodsp_fixed_armv6.S
mpegaudiodsp_init_arm.c
mpegvideo_arm.c mpegvideo: cosmetics: Lowercase ugly uppercase MPV_ function name prefixes 2014-08-15 01:26:33 -07:00
mpegvideo_arm.h mpegvideo: cosmetics: Lowercase ugly uppercase MPV_ function name prefixes 2014-08-15 01:26:33 -07:00
mpegvideo_armv5te_s.S
mpegvideo_armv5te.c cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
mpegvideo_neon.S arm: Add X() around all references to extern symbols 2014-02-07 15:13:58 +02:00
mpegvideoencdsp_armv6.S dsputil: Move pix_sum, pix_norm1, shrink function pointers to mpegvideoenc 2014-07-06 14:26:53 -07:00
mpegvideoencdsp_init_arm.c dsputil: Move pix_sum, pix_norm1, shrink function pointers to mpegvideoenc 2014-07-06 14:26:53 -07:00
neon.S
neontest.c lavc: add clobber tests for the new encoding/decoding API 2016-09-28 10:01:52 +02:00
pixblockdsp_armv6.S dsputil: Split off pixel block routines into their own context 2014-07-09 08:05:26 -07:00
pixblockdsp_init_arm.c pixblockdsp: Change type of stride parameters to ptrdiff_t 2016-09-14 14:12:36 +02:00
rdft_init_arm.c rdft: arm: Split RDFT initialization into a separate file 2016-02-26 14:34:58 +01:00
rdft_neon.S
rv34dsp_init_arm.c
rv34dsp_neon.S
rv40dsp_init_arm.c qpeldsp: Mark source pointer in qpel_mc_func function pointer const 2014-07-25 02:52:54 -07:00
rv40dsp_neon.S
sbrdsp_init_arm.c
sbrdsp_neon.S
simple_idct_arm.S cosmetics: Fix spelling mistakes 2016-05-04 18:16:21 +02:00
simple_idct_armv5te.S simple_idct: arm: Drop disabled code variant 2016-08-17 12:21:54 +02:00
simple_idct_armv6.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
simple_idct_neon.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
startcode_armv6.S h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
startcode.h h264: Move start code search functions into separate source files. 2014-08-04 22:22:54 +02:00
synth_filter_neon.S
synth_filter_vfp.S arm: cosmetics: Consistently use lowercase for shift operators 2014-07-18 11:17:40 +03:00
vc1dsp_init_arm.c vc-1: Add platform-specific start code search routine to VC1DSPContext. 2014-08-04 22:22:54 +02:00
vc1dsp_init_neon.c h264chroma: Change type of stride parameters to ptrdiff_t 2016-09-29 14:48:04 +02:00
vc1dsp_neon.S idct: Change type of array stride parameters to ptrdiff_t 2016-09-29 14:48:03 +02:00
vc1dsp.h
videodsp_arm.h
videodsp_armv5te.S arm: use a local label instead of the function symbol in ff_prefetch_arm 2015-07-20 23:10:29 +02:00
videodsp_init_arm.c
videodsp_init_armv5te.c
vorbisdsp_init_arm.c
vorbisdsp_neon.S
vp3dsp_init_arm.c vp3: Change type of stride parameters to ptrdiff_t 2016-08-26 11:36:26 +02:00
vp3dsp_neon.S arm: Add a missing # as prefix for an immediate constant 2014-01-07 19:30:13 +02:00
vp6dsp_init_arm.c vp56: Separate VP5 and VP6 dsp initialization 2016-08-26 11:50:22 +02:00
vp6dsp_neon.S
vp8_armv6.S
vp8.h arm: asm decode_block_coeffs_internal is vp8 specific 2014-04-04 10:39:29 +02:00
vp8dsp_armv6.S vp8: Update some assembly comments left unchanged in bd66f073fe 2016-08-26 11:36:53 +02:00
vp8dsp_init_arm.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_init_armv6.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_init_neon.c On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp8dsp_neon.S arm: Fix a typo in a comment 2016-07-06 22:58:51 +03:00
vp8dsp.h On2 VP7 decoder 2014-04-04 04:00:11 +02:00
vp9dsp_init_arm.c arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
vp9itxfm_neon.S arm: vp9: Add NEON itxfm routines 2016-11-11 11:09:05 +02:00
vp9lpf_neon.S arm: vp9: Add NEON loop filters 2016-11-11 14:16:42 +02:00
vp9mc_neon.S arm: vp9mc: Use a different helper register for PIC loads 2016-11-10 14:01:04 +02:00
vp56_arith.h