FFmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2024-09-20 13:26:39 +00:00

Author	SHA1	Message	Date
Clément Bœsch	b12a36170b	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	2017-06-28 12:22:39 +02:00
Clément Bœsch	e4a27e2f2d	lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon The code originally pre-multiply by 2 the steps, causing the running sum of the h factors to drift away due to the lack of precision. It quickly causes an inaccuracy > 0.01. I tried diverse approaches such as multiply by 2.0 (instead of adding the value itself) without success. I'm unable to bench the impact of this change, feel free to compare. This commit fixes the incoming aacpsdsp tests. Following is an alternative simplified function (matching the incoming AArch64 code) that may be used: function ff_ps_stereo_interpolate_neon, export=1 vld1.32 {q0}, [r2] vld1.32 {q1}, [r3] ldr r12, [sp] vmov.f32 q8, q0 vmov.f32 q9, q1 vzip.32 q8, q0 vzip.32 q9, q1 1: vld1.32 {d4}, [r0,:64] vld1.32 {d6}, [r1,:64] vadd.f32 q8, q8, q9 vadd.f32 q0, q0, q1 vmov.f32 d5, d4 vmov.f32 d7, d6 vmul.f32 q2, q2, q8 vmla.f32 q2, q3, q0 vst1.32 {d4}, [r0,:64]! vst1.32 {d5}, [r1,:64]! subs r12, r12, #1 bgt 1b bx lr endfunc	2017-06-28 11:59:34 +02:00
Ronald S. Bultje	40cbd686dc	idct_arm: remove use of ff_put/add_pixels_clamped function pointer. Instead, hardcode the use of the _arm implementation of add_pixels, and use the C version for put_pixels (as no arm-optimized version exists). Since there's separate implementations of idct{,_put,_add} for neon, this has no practical impact on performance.	2017-04-06 10:03:27 -04:00
Ronald S. Bultje	0c46641784	vp9: split out generic decoding skeleton interface API from VP9 types. This allows vp9dsp.h to only include the VP9 types header, and not the decoder skeleton interface which is for hardware decoders (dxva2/vaapi).	2017-03-28 18:04:27 -04:00
Ronald S. Bultje	f8c019944d	vp9: re-split the decoder/format/dsp interface header files. The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.	2017-03-28 18:04:26 -04:00
Clément Bœsch	1c9f4b5078	lavc/vp9: split into vp9{block,data,mvs} This is following Libav layout to ease merges.	2017-03-27 21:38:21 +02:00
James Almer	9a0fbb9ca9	Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18' * commit '2caa93b813adc5dbb7771dfe615da826a2947d18': mpegaudiodsp: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 16:04:22 -03:00
James Almer	a8474df944	Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c' * commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c': h264chroma: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 15:20:45 -03:00
James Almer	5a49097b42	Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428' * commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428': idct: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-21 14:29:52 -03:00
Clément Bœsch	51b5672f49	Merge commit '92c5755a185086067fe49e7e64c23a8e7011be31' * commit '92c5755a185086067fe49e7e64c23a8e7011be31': hpeldsp: arm: Update comments left behind in `25841dfe80` Merged-by: Clément Bœsch <u@pkh.me>	2017-03-21 15:10:46 +01:00
Clément Bœsch	ad98af27f7	Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0' * commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0': lavc: add clobber tests for the new encoding/decoding API The merge only re-order what we already have. Merged-by: Clément Bœsch <u@pkh.me>	2017-03-21 14:43:53 +01:00
Clément Bœsch	83cd80d10a	Merge commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5' * commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5': audiodsp/x86: yasmify vector_clipf_sse audiodsp: reorder arguments for vector_clipf Merged the version from Libav after a discussion with James Almer on IRC: 19:22 <ubitux> jamrial: opinion on 12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5? 19:23 <ubitux> it was apparently yasmified differently 19:23 <ubitux> (it depends on the previous commit arg shuffle) 19:24 <ubitux> i don't see the magic movsxdifnidn in your port btw 19:24 <ubitux> it's a port from `1d36defe94` 19:25 <jamrial> seems better thanks to said arg shuffle 19:25 <jamrial> the loop is the same, but init is simpler 19:25 <jamrial> probably worth merging 19:25 <ubitux> OK 19:25 <ubitux> thanks 19:26 <jamrial> curious they didn't make len ptrdiff_t after the previous bunch of commits, heh 19:26 <ubitux> yeah indeed Both commits are merged at the same time to prevent a conflict with our existing yasmified ff_vector_clipf_sse. Merged-by: Clément Bœsch <u@pkh.me>	2017-03-20 22:35:07 +01:00
Clément Bœsch	b78243c504	lavc/arm: fix indent in blockdsp_init_neon	2017-03-20 19:01:25 +01:00
Clément Bœsch	e07fa3008b	Merge commit 'de452e503734ebb0fdbce86e9d16693b3530fad3' * commit 'de452e503734ebb0fdbce86e9d16693b3530fad3': pixblockdsp: Change type of stride parameters to ptrdiff_t Merged-by: Clément Bœsch <u@pkh.me>	2017-03-20 15:58:32 +01:00
Martin Storsjö	eabc5abf94	arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14516 bytes to 22484 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0 270.7 418.5 295.4 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 3840.2 3244.8 3700.1 2337.9 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4212.5 3575.4 3996.9 2571.6 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 5174.4 4270.5 4615.5 3031.9 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 5676.0 4908.5 5226.5 3491.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6403.9 5589.0 5839.8 3948.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1710.7 944.7 1582.1 1045.4 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 21040.7 16706.1 18687.7 13193.1 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22197.7 18282.7 19577.5 13918.6 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 24511.5 20911.5 21472.5 15367.5 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 26939.5 24264.3 23239.1 16830.3 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 29419.5 26845.1 25020.6 18259.9 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 31146.4 29633.5 26803.3 19721.7 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 33376.3 32507.8 28642.4 21174.2 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 35629.4 35439.6 30416.5 22625.7 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37269.9 37914.9 32271.9 24078.9 After: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0 276.0 418.5 295.1 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2336.2 1886.0 2251.0 1458.6 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 2531.0 2054.7 2402.8 1591.1 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 3848.6 3491.1 3845.7 2554.8 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 5703.8 4831.6 5230.8 3493.4 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6399.5 5567.0 5832.4 3951.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1722.1 938.5 1577.3 1044.5 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 15003.5 11576.8 13105.8 9602.2 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 15768.5 12677.2 13726.0 10138.1 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 17278.8 14825.4 14907.5 11185.7 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 22335.7 21544.5 20379.5 15019.8 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 24165.6 23881.7 21938.6 16308.2 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 31082.2 30860.9 26835.3 19711.3 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 33102.6 31922.8 28638.3 21161.0 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 35104.9 34867.5 30411.7 22621.2 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37438.1 39103.4 32217.8 24067.6 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:33 +02:00
Martin Storsjö	0ea603203d	arm: vp9itxfm16: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4237.4 3561.5 3971.8 2525.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6371.9 5452.0 5779.3 3910.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22068.8 17867.5 19555.2 13871.6 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37268.9 38684.2 32314.2 23969.0 After: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4375.1 3571.9 4283.8 2567.2 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6415.6 5578.9 5844.6 3948.3 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22653.7 18079.7 19603.7 13905.3 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37593.2 38862.2 32235.8 24070.9 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:54:19 +02:00
Martin Storsjö	32e273c111	arm: vp9itxfm16: Avoid reloading the idct32 coefficients Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22653.8 18268.4 19598.0 14079.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37699.0 38665.2 32542.3 24472.2 After: vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22270.8 18159.3 19531.0 13865.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37523.3 37731.6 32181.7 24071.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:57 +02:00
Martin Storsjö	c1619318e5	arm: vp9itxfm16: Fix vertical alignment Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:48 +02:00
Martin Storsjö	b46d37e93a	arm: vp9itxfm16: Use the right lane size This makes the code slightly clearer, but doesn't make any functional difference. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:43 +02:00
Martin Storsjö	21c89f3a26	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit `7995ebfad1`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:32 +02:00
Martin Storsjö	70317b25aa	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit `3a0d5e206d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:28 +02:00
Martin Storsjö	b7a565fe71	arm: vp9itxfm: Template the quarter/half idct32 function This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. This is cherrypicked from libav commit `98ee855ae0`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-19 22:53:22 +02:00
James Almer	6966a5e4d7	Merge commit '721d57e608dc4fd6c86f27c5ae76ef559d646220' * commit '721d57e608dc4fd6c86f27c5ae76ef559d646220': vp56: Separate VP5 and VP6 dsp initialization Merged-by: James Almer <jamrial@gmail.com>	2017-03-19 17:15:24 -03:00
James Almer	4e4dfcac58	Merge commit '802727b538b484e3f9d1345bfcc4ab24cfea8898' * commit '802727b538b484e3f9d1345bfcc4ab24cfea8898': vp8: Update some assembly comments left unchanged in `bd66f073fe` Merged-by: James Almer <jamrial@gmail.com>	2017-03-19 15:18:31 -03:00
James Almer	4004d33fcb	Merge commit 'd9d26a3674f31f482f54e936fcb382160830877a' * commit 'd9d26a3674f31f482f54e936fcb382160830877a': vp56: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>	2017-03-19 14:54:25 -03:00
Clément Bœsch	6a42a54b9d	Merge commit '6892df9294d93322d43255ada299507465bc93c8' * commit '6892df9294d93322d43255ada299507465bc93c8': vp3: Change type of stride parameters to ptrdiff_t Merged-by: Clément Bœsch <u@pkh.me>	2017-03-19 18:41:26 +01:00
Clément Bœsch	a754fae4a7	Merge commit '014852e932dab6e9cf2a53e7a17ce8321f3e922c' * commit '014852e932dab6e9cf2a53e7a17ce8321f3e922c': simple_idct: arm: Drop disabled code variant Merged-by: Clément Bœsch <u@pkh.me>	2017-03-19 16:12:07 +01:00
Martin Storsjö	b2e20d8984	arm: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit `08074c092d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:52 +02:00
Martin Storsjö	4f693b56bd	arm: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit `de06bdfe6c`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:51 +02:00
Martin Storsjö	600f4c9b03	arm: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16. While keeping these coefficients in registers, we still can skip pushing q7. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_sub32_add_neon: 18553.8 17182.7 14303.3 12089.7 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 18470.3 16717.7 14173.6 11860.8 This is cherrypicked from libav commit `402546a172`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:51 +02:00
Martin Storsjö	a88db8b9a0	arm: vp9lpf: Implement the mix2_44 function with one single filter pass For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though. Before: Cortex A7 A8 A9 A53 vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2 After: vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0 This is cherrypicked from libav commit `575e31e931`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:51 +02:00
Martin Storsjö	3fbbad2984	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 This is cherrypicked from libav commit `c582cb8537`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:50 +02:00
Martin Storsjö	83399cf569	arm: vp9lpf: Interleave the start of flat8in into the calculation above This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit `e18c39005a`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:49 +02:00
Martin Storsjö	92ab8374b1	arm: vp9lpf: Use orrs instead of orr+cmp This is cherrypicked from libav commit `435cd7bc99`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:49 +02:00
Martin Storsjö	f0ecbb13cf	arm/aarch64: vp9lpf: Calculate !hev directly Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 This is cherrypicked from libav commit `e1f9de86f4`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:48 +02:00
Martin Storsjö	758302e4bc	arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling This work is sponsored by, and copyright, Google. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 752.0 459.2 862.2 553.9 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 226.5 145.0 225.1 171.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 721.2 415.7 727.6 475.0 This is cherrypicked from libav commit `a76bf8cf12`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:48 +02:00
Martin Storsjö	bff0771590	arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter Before: Cortex A7 A8 A9 A53 vp9_put_8tap_smooth_4h_neon: 378.1 273.2 340.7 229.5 After: vp9_put_8tap_smooth_4h_neon: 352.1 222.2 290.5 229.5 This is cherrypicked from libav commit `fea92a4b57`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:47 +02:00
Martin Storsjö	1d8ab576a7	arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function This is cherrypicked from libav commit `3933b86bb9`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:26 +02:00
Martin Storsjö	824589556c	arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 12388 bytes to 19784 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 212.0 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2102.1 1521.7 1736.2 1265.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 2104.5 1533.0 1736.6 1265.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2484.8 1828.7 2014.4 1506.5 vp9_inv_dct_dct_16x16_sub12_add_neon: 2851.2 2117.8 2294.8 1753.2 vp9_inv_dct_dct_16x16_sub16_add_neon: 3239.4 2408.3 2543.5 1994.9 vp9_inv_dct_dct_32x32_sub1_add_neon: 758.3 456.7 864.5 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10776.7 7949.8 8567.7 6819.7 vp9_inv_dct_dct_32x32_sub4_add_neon: 10865.6 8131.5 8589.6 6816.3 vp9_inv_dct_dct_32x32_sub8_add_neon: 12053.9 9271.3 9387.7 7564.0 vp9_inv_dct_dct_32x32_sub12_add_neon: 13328.3 10463.2 10217.0 8321.3 vp9_inv_dct_dct_32x32_sub16_add_neon: 14176.4 11509.5 11018.7 9062.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 15301.5 12999.9 11855.1 9828.2 vp9_inv_dct_dct_32x32_sub24_add_neon: 16482.7 14931.5 12650.1 10575.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17589.5 15811.9 13482.8 11333.4 vp9_inv_dct_dct_32x32_sub32_add_neon: 18696.2 17049.2 14355.6 12089.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 1203.5 998.2 1035.3 763.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1203.5 998.1 1035.5 760.8 vp9_inv_dct_dct_16x16_sub8_add_neon: 1926.1 1610.6 1722.1 1271.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2873.2 2129.7 2285.1 1757.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 3221.4 2520.3 2557.6 2002.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 753.0 457.5 866.6 554.6 vp9_inv_dct_dct_32x32_sub2_add_neon: 7554.6 5652.4 6048.4 4920.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 7549.9 5685.0 6046.9 4925.7 vp9_inv_dct_dct_32x32_sub8_add_neon: 8336.9 6704.5 6604.0 5478.0 vp9_inv_dct_dct_32x32_sub12_add_neon: 10914.0 9777.2 9240.4 7416.9 vp9_inv_dct_dct_32x32_sub16_add_neon: 11859.2 11223.3 9966.3 8095.1 vp9_inv_dct_dct_32x32_sub20_add_neon: 15237.1 13029.4 11838.3 9829.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 16293.2 14379.8 12644.9 10572.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17424.3 15734.7 13473.0 11326.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.3 17457.0 14298.6 12080.0 This is cherrypicked from libav commit `5eb5aec475`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:25 +02:00
Martin Storsjö	3bd9b39108	arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit `47b3c2c18d`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:23 +02:00
Martin Storsjö	f8fcee0daf	arm: vp9itxfm: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from 15324 to 12388 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub4_add_neon: 2063.4 1516.0 1719.5 1245.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3 vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 2060.8 1608.5 1735.7 1262.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1 vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9 This is cherrypicked from libav commit `0331c3f5e8`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:20 +02:00
Martin Storsjö	31e41350d2	arm: vp9itxfm: Avoid .irp when it doesn't save any lines This makes it more readable. This is cherrypicked from libav commit `3bc5b28d5a`. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-03-11 13:14:00 +02:00
Clément Bœsch	d0e132bab6	Merge commit '1bd890ad173d79e7906c5e1d06bf0a06cca4519d' * commit '1bd890ad173d79e7906c5e1d06bf0a06cca4519d': hevc: Separate adding residual to prediction from IDCT This commit should be a noop but isn't because of the following renames: - transform_add → add_residual - transform_skip → dequant - idct_4x4_luma → transform_4x4_luma Merged-by: Clément Bœsch <cboesch@gopro.com>	2017-01-31 15:31:34 +01:00
Martin Storsjö	1e5d87eec3	arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter This work is sponsored by, and copyright, Google. This is pretty much similar to the 8 bpp version, but in some senses simpler. All input pixels are 16 bits, and all intermediates also fit in 16 bits, so there's no lengthening/narrowing in the filter at all. For the full 16 pixel wide filter, we can only process 4 pixels at a time (using an implementation very much similar to the one for 8 bpp), but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with a different implementation of the core filter. Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_loop_filter_h_4_8_10bpp_neon: 1.83 2.16 1.40 2.09 vp9_loop_filter_h_8_8_10bpp_neon: 1.39 1.67 1.24 1.70 vp9_loop_filter_h_16_8_10bpp_neon: 1.56 1.47 1.10 1.81 vp9_loop_filter_h_16_16_10bpp_neon: 1.94 1.69 1.33 2.24 vp9_loop_filter_mix2_h_44_16_10bpp_neon: 2.01 2.27 1.67 2.39 vp9_loop_filter_mix2_h_48_16_10bpp_neon: 1.84 2.06 1.45 2.19 vp9_loop_filter_mix2_h_84_16_10bpp_neon: 1.89 2.20 1.47 2.29 vp9_loop_filter_mix2_h_88_16_10bpp_neon: 1.69 2.12 1.47 2.08 vp9_loop_filter_mix2_v_44_16_10bpp_neon: 3.16 3.98 2.50 4.05 vp9_loop_filter_mix2_v_48_16_10bpp_neon: 2.84 3.64 2.25 3.77 vp9_loop_filter_mix2_v_84_16_10bpp_neon: 2.65 3.45 2.16 3.54 vp9_loop_filter_mix2_v_88_16_10bpp_neon: 2.55 3.30 2.16 3.55 vp9_loop_filter_v_4_8_10bpp_neon: 2.85 3.97 2.24 3.68 vp9_loop_filter_v_8_8_10bpp_neon: 2.27 3.19 1.96 3.08 vp9_loop_filter_v_16_8_10bpp_neon: 3.42 2.74 2.26 4.40 vp9_loop_filter_v_16_16_10bpp_neon: 2.86 2.44 1.93 3.88 The speedup vs C code measured in checkasm is around 1.1-4x. These numbers are quite inconclusive though, since the checkasm test runs multiple filterings on top of each other, so later rounds might end up with different codepaths (different decisions on which filter to apply, based on input pixel differences). Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-4x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:35:59 +02:00
Martin Storsjö	2ed67eba96	arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm This work is sponsored by, and copyright, Google. This is structured similarly to the 8 bit version. In the 8 bit version, the coefficients are 16 bits, and intermediates are 32 bits. Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit content, the intermediates also fit in 32 bits, but for all other transforms (4x4 for 12 bit content, and 8x8 and larger for both 10 and 12 bit) the intermediates are 64 bit. For the existing 8 bit case, the 8x8 transform fit all coefficients in registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8 transform also has to be done in slices of 4 pixels (just as 16x16 and 32x32 for 8 bit). The slice width also shrinks from 4 elements to 2 elements in parallel for the 16x16 and 32x32 cases. The 16 bit coefficients from idct_coeffs and similar tables also need to be lenghtened to 32 bit in order to be used in multiplication with vectors with 32 bit elements. This leads to the fixed coefficient vectors needing more space, leading to more cases where they have to be reloaded within the transform (in iadst16). This technically would need testing in checkasm for subpartitions in increments of 2, but that slows down normal checkasm runs excessively. Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_inv_adst_adst_4x4_sub4_add_10_neon: 4.83 11.36 5.22 6.77 vp9_inv_adst_adst_8x8_sub8_add_10_neon: 4.12 7.60 4.06 4.84 vp9_inv_adst_adst_16x16_sub16_add_10_neon: 3.93 8.16 4.52 5.35 vp9_inv_dct_dct_4x4_sub1_add_10_neon: 1.36 2.57 1.41 1.61 vp9_inv_dct_dct_4x4_sub4_add_10_neon: 4.24 8.66 5.06 5.81 vp9_inv_dct_dct_8x8_sub1_add_10_neon: 2.63 4.18 1.68 2.87 vp9_inv_dct_dct_8x8_sub4_add_10_neon: 4.52 9.47 4.24 5.39 vp9_inv_dct_dct_8x8_sub8_add_10_neon: 3.45 7.34 3.45 4.30 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 3.56 6.21 2.47 4.32 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 5.68 12.73 5.28 7.07 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 4.42 9.28 4.24 5.45 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3.41 7.29 3.35 4.19 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 4.52 8.35 3.83 6.40 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 5.86 13.19 6.14 7.04 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 4.29 8.11 4.59 5.06 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 3.31 5.70 3.56 3.84 vp9_inv_wht_wht_4x4_sub4_add_10_neon: 1.89 2.80 1.82 1.97 The speedup compared to the C functions is around 1.3 to 7x for the full transforms, even higher for the smaller subpartitions. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:35:56 +02:00
Martin Storsjö	a4d4bad75c	arm: Add NEON optimizations for 10 and 12 bit vp9 MC This work is sponsored by, and copyright, Google. The plain pixel put/copy functions are used from the 8 bit version, for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new copy128 is added. Compared with the 8 bit version, the filters can no longer use the trick to accumulate in 16 bit with only saturation at the end, but now the accumulators need to be 32 bit. This avoids the need to keep track of which filter index is the largest though, reducing the size of the executable code for these filters. For the horizontal filters, we only do 4 or 8 pixels wide in parallel (while doing two rows at a time), since we don't have enough register space to filter 16 pixels wide. For the vertical filters, we still do 4 and 8 pixels in parallel just as in the 8 bit case, but we need to store the output after every 2 rows instead of after every 4 rows. Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_avg4_10bpp_neon: 2.25 2.44 3.05 2.16 vp9_avg8_10bpp_neon: 3.66 8.48 3.86 3.50 vp9_avg16_10bpp_neon: 3.39 8.26 3.37 2.72 vp9_avg32_10bpp_neon: 4.03 10.20 4.07 3.42 vp9_avg64_10bpp_neon: 4.15 10.01 4.13 3.70 vp9_avg_8tap_smooth_4h_10bpp_neon: 3.38 6.22 3.41 4.75 vp9_avg_8tap_smooth_4hv_10bpp_neon: 3.89 6.39 4.30 5.32 vp9_avg_8tap_smooth_4v_10bpp_neon: 5.32 9.73 6.34 7.31 vp9_avg_8tap_smooth_8h_10bpp_neon: 4.45 9.40 4.68 6.87 vp9_avg_8tap_smooth_8hv_10bpp_neon: 4.64 8.91 5.44 6.47 vp9_avg_8tap_smooth_8v_10bpp_neon: 6.44 13.42 8.68 8.79 vp9_avg_8tap_smooth_64h_10bpp_neon: 4.66 9.02 4.84 7.71 vp9_avg_8tap_smooth_64hv_10bpp_neon: 4.61 9.14 4.92 7.10 vp9_avg_8tap_smooth_64v_10bpp_neon: 6.90 14.13 9.57 10.41 vp9_put4_10bpp_neon: 1.33 1.46 2.09 1.33 vp9_put8_10bpp_neon: 1.57 3.42 1.83 1.84 vp9_put16_10bpp_neon: 1.55 4.78 2.17 1.89 vp9_put32_10bpp_neon: 2.06 5.35 2.14 2.30 vp9_put64_10bpp_neon: 3.00 2.41 1.95 1.66 vp9_put_8tap_smooth_4h_10bpp_neon: 3.19 5.81 3.31 4.63 vp9_put_8tap_smooth_4hv_10bpp_neon: 3.86 6.22 4.32 5.21 vp9_put_8tap_smooth_4v_10bpp_neon: 5.40 9.77 6.08 7.21 vp9_put_8tap_smooth_8h_10bpp_neon: 4.22 8.41 4.46 6.63 vp9_put_8tap_smooth_8hv_10bpp_neon: 4.56 8.51 5.39 6.25 vp9_put_8tap_smooth_8v_10bpp_neon: 6.60 12.43 8.17 8.89 vp9_put_8tap_smooth_64h_10bpp_neon: 4.41 8.59 4.54 7.49 vp9_put_8tap_smooth_64hv_10bpp_neon: 4.43 8.58 5.34 6.63 vp9_put_8tap_smooth_64v_10bpp_neon: 7.26 13.92 9.27 10.92 For the larger 8tap filters, the speedup vs C code is around 4-14x. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:35:50 +02:00
Martin Storsjö	cda9a3e80b	arm: vp9dsp: Restructure the bpp checks This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. Signed-off-by: Martin Storsjö <martin@martin.st>	2017-01-24 22:35:44 +02:00
Martin Storsjö	656d910981	arm: vp9mc: Fix vertical alignment of operands This is cherrypicked from libav commit `c536e5e869`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:37 +01:00
Martin Storsjö	388f6e6715	arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3 By skipping individual 4x16 or 4x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8 vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5 vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6 vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6 vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7 vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1 I.e. in general a very minor overhead for the full subpartition case due to the additional loads and cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. In common VP9 content in a few inspected clips, 70-90% of the non-dc-only 16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left 8x8 or 16x16 subpartitions respectively. This is cherrypicked from libav commit `9c8bc74c2b`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:30 +01:00
Martin Storsjö	ecd343aa1f	arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination This avoids reloading them if they haven't been clobbered, if the first pass also was idct. This is similar to what was done in the aarch64 version. This is cherrypicked from libav commit `3c87039a40`. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2017-01-14 21:13:27 +01:00

1 2 3 4 5 ...

849 Commits