FFmpeg

mirror of https://git.ffmpeg.org/ffmpeg.git synced 2024-09-19 21:06:42 +00:00

Author	SHA1	Message	Date
Anton Khirnov	e4601cc339	lavc/hevc*: move to hevc/ subdir	2024-06-04 11:46:27 +02:00
Ramiro Polla	d4d09c8e42	lavc/aarch64/fdct: add neon-optimized fdct for aarch64 The code is imported from libjpeg-turbo-3.0.1. The neon registers used have been changed to avoid modifying v8-v15. Reviewed-by: Martin Storsjö <martin@martin.st>	2024-05-13 14:54:10 +02:00
Ramiro Polla	27f6211c74	lavc/aarch64: fix include for cpu.h	2024-05-13 14:50:38 +02:00
Lynne	134dba9544	opusdsp: add ability to modify deemphasis constant xHE-AAC relies on the same postfilter mechanism that Opus uses to improve clarity (albeit with a steeper deemphasis filter). The code to apply it is identical, it's still just a simple IIR low-pass filter. This commit makes it possible to use alternative constants.	2024-04-27 11:12:07 +02:00
Martin Storsjö	359b6a7f8a	aarch64/ac3dsp: simplify the end of ff_ac3_sum_square_butterfly_float_neon Before: Cortex A53 A72 A78 M1 ac3_sum_square_bufferfly_float_neon: 1005.7 516.5 224.5 194.0 After: ac3_sum_square_bufferfly_float_neon: 981.7 504.5 223.2 189.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2024-04-09 16:50:49 +02:00
Geoff Hill	ee1bc723de	avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	42e88f18f3	avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	69cb34f885	avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	6f6bd10531	avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:40 +03:00
Geoff Hill	b69486ea18	avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON Signed-off-by: Geoff Hill <geoff@geoffhill.org> Signed-off-by: Martin Storsjö <martin@martin.st>	2024-04-08 13:36:28 +03:00
Martin Storsjö	f872b19714	aarch64: hevc: Produce plain neon versions of qpel_bi_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_bi_hv4_8_c: 385.7 put_hevc_qpel_bi_hv4_8_neon: 131.0 put_hevc_qpel_bi_hv4_8_i8mm: 92.2 put_hevc_qpel_bi_hv6_8_c: 701.0 put_hevc_qpel_bi_hv6_8_neon: 239.5 put_hevc_qpel_bi_hv6_8_i8mm: 191.0 put_hevc_qpel_bi_hv8_8_c: 1162.0 put_hevc_qpel_bi_hv8_8_neon: 228.0 put_hevc_qpel_bi_hv8_8_i8mm: 225.2 put_hevc_qpel_bi_hv12_8_c: 2305.0 put_hevc_qpel_bi_hv12_8_neon: 558.0 put_hevc_qpel_bi_hv12_8_i8mm: 483.2 put_hevc_qpel_bi_hv16_8_c: 3965.2 put_hevc_qpel_bi_hv16_8_neon: 732.7 put_hevc_qpel_bi_hv16_8_i8mm: 656.5 put_hevc_qpel_bi_hv24_8_c: 8709.7 put_hevc_qpel_bi_hv24_8_neon: 1555.2 put_hevc_qpel_bi_hv24_8_i8mm: 1448.7 put_hevc_qpel_bi_hv32_8_c: 14818.0 put_hevc_qpel_bi_hv32_8_neon: 2763.7 put_hevc_qpel_bi_hv32_8_i8mm: 2468.0 put_hevc_qpel_bi_hv48_8_c: 32855.5 put_hevc_qpel_bi_hv48_8_neon: 6107.2 put_hevc_qpel_bi_hv48_8_i8mm: 5452.7 put_hevc_qpel_bi_hv64_8_c: 57591.5 put_hevc_qpel_bi_hv64_8_neon: 10660.2 put_hevc_qpel_bi_hv64_8_i8mm: 9580.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	d21b9a0411	aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. AWS Graviton 3: put_hevc_qpel_uni_w_hv4_8_c: 422.2 put_hevc_qpel_uni_w_hv4_8_neon: 140.7 put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7 put_hevc_qpel_uni_w_hv8_8_c: 1208.0 put_hevc_qpel_uni_w_hv8_8_neon: 268.2 put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5 put_hevc_qpel_uni_w_hv16_8_c: 4297.2 put_hevc_qpel_uni_w_hv16_8_neon: 802.2 put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2 put_hevc_qpel_uni_w_hv32_8_c: 15518.5 put_hevc_qpel_uni_w_hv32_8_neon: 3085.2 put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2 put_hevc_qpel_uni_w_hv64_8_c: 57254.5 put_hevc_qpel_uni_w_hv64_8_neon: 11787.5 put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	5ab138673b	aarch64: hevc: Produce plain neon versions of qpel_uni_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_uni_hv4_8_c: 384.2 put_hevc_qpel_uni_hv4_8_neon: 127.5 put_hevc_qpel_uni_hv4_8_i8mm: 85.5 put_hevc_qpel_uni_hv6_8_c: 705.5 put_hevc_qpel_uni_hv6_8_neon: 224.5 put_hevc_qpel_uni_hv6_8_i8mm: 176.2 put_hevc_qpel_uni_hv8_8_c: 1136.5 put_hevc_qpel_uni_hv8_8_neon: 216.5 put_hevc_qpel_uni_hv8_8_i8mm: 214.0 put_hevc_qpel_uni_hv12_8_c: 2259.5 put_hevc_qpel_uni_hv12_8_neon: 498.5 put_hevc_qpel_uni_hv12_8_i8mm: 410.7 put_hevc_qpel_uni_hv16_8_c: 3824.7 put_hevc_qpel_uni_hv16_8_neon: 670.0 put_hevc_qpel_uni_hv16_8_i8mm: 603.7 put_hevc_qpel_uni_hv24_8_c: 8113.5 put_hevc_qpel_uni_hv24_8_neon: 1474.7 put_hevc_qpel_uni_hv24_8_i8mm: 1351.5 put_hevc_qpel_uni_hv32_8_c: 14744.5 put_hevc_qpel_uni_hv32_8_neon: 2599.7 put_hevc_qpel_uni_hv32_8_i8mm: 2266.0 put_hevc_qpel_uni_hv48_8_c: 32800.0 put_hevc_qpel_uni_hv48_8_neon: 5650.0 put_hevc_qpel_uni_hv48_8_i8mm: 5011.7 put_hevc_qpel_uni_hv64_8_c: 57856.2 put_hevc_qpel_uni_hv64_8_neon: 9863.5 put_hevc_qpel_uni_hv64_8_i8mm: 8767.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	5cbeefc79e	aarch64: hevc: Produce plain neon versions of qpel_hv As the plain neon qpel_h functions process two rows at a time, we need to allocate storage for h+8 rows instead of h+7. By allocating storage for h+8 rows, incrementing the stack pointer won't end up at the right spot in the end. Store the intended final stack pointer value in a register x14 which we store on the stack. AWS Graviton 3: put_hevc_qpel_hv4_8_c: 386.0 put_hevc_qpel_hv4_8_neon: 125.7 put_hevc_qpel_hv4_8_i8mm: 83.2 put_hevc_qpel_hv6_8_c: 749.0 put_hevc_qpel_hv6_8_neon: 207.0 put_hevc_qpel_hv6_8_i8mm: 166.0 put_hevc_qpel_hv8_8_c: 1305.2 put_hevc_qpel_hv8_8_neon: 216.5 put_hevc_qpel_hv8_8_i8mm: 213.0 put_hevc_qpel_hv12_8_c: 2570.5 put_hevc_qpel_hv12_8_neon: 480.0 put_hevc_qpel_hv12_8_i8mm: 398.2 put_hevc_qpel_hv16_8_c: 4158.7 put_hevc_qpel_hv16_8_neon: 659.7 put_hevc_qpel_hv16_8_i8mm: 593.5 put_hevc_qpel_hv24_8_c: 8626.7 put_hevc_qpel_hv24_8_neon: 1653.5 put_hevc_qpel_hv24_8_i8mm: 1398.7 put_hevc_qpel_hv32_8_c: 14646.0 put_hevc_qpel_hv32_8_neon: 2566.2 put_hevc_qpel_hv32_8_i8mm: 2287.5 put_hevc_qpel_hv48_8_c: 31072.5 put_hevc_qpel_hv48_8_neon: 6228.5 put_hevc_qpel_hv48_8_i8mm: 5291.0 put_hevc_qpel_hv64_8_c: 53847.2 put_hevc_qpel_hv64_8_neon: 9856.7 put_hevc_qpel_hv64_8_i8mm: 8831.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:55 +02:00
Martin Storsjö	20c38f4b8d	aarch64: hevc: Reorder qpel_hv functions to prepare for templating This is a pure reordering of code without changing anything in the individual functions. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:50 +02:00
Martin Storsjö	4f71e4ebf2	aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions The hv32 and hv64 functions were identical - both loop and process 16 pixels at a time. The hv16 function was near identical, except for the outer loop (and using sp instead of a separate register). Given the size of these functions, the extra cost of the outer loop is negligible, so use the same function for hv16 as well. This removes over 200 lines of duplicated assembly, and over 4 KB of binary size. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:40 +02:00
Martin Storsjö	4063e50eec	aarch64: hevc: Split the qpel_*_hv functions into two parts The first horizontal filter can use either i8mm or plain neon versions, while the second part is a pure neon implementation. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:05:29 +02:00
Martin Storsjö	ad01d06f91	aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 AWS Graviton 3: put_hevc_qpel_uni_w_h4_8_c: 159.0 put_hevc_qpel_uni_w_h4_8_neon: 64.2 put_hevc_qpel_uni_w_h4_8_i8mm: 40.0 put_hevc_qpel_uni_w_h6_8_c: 344.7 put_hevc_qpel_uni_w_h6_8_neon: 114.5 put_hevc_qpel_uni_w_h6_8_i8mm: 82.0 put_hevc_qpel_uni_w_h8_8_c: 596.2 put_hevc_qpel_uni_w_h8_8_neon: 132.2 put_hevc_qpel_uni_w_h8_8_i8mm: 106.0 put_hevc_qpel_uni_w_h12_8_c: 1325.0 put_hevc_qpel_uni_w_h12_8_neon: 299.0 put_hevc_qpel_uni_w_h12_8_i8mm: 211.5 put_hevc_qpel_uni_w_h16_8_c: 2300.0 put_hevc_qpel_uni_w_h16_8_neon: 422.0 put_hevc_qpel_uni_w_h16_8_i8mm: 286.2 put_hevc_qpel_uni_w_h24_8_c: 5059.0 put_hevc_qpel_uni_w_h24_8_neon: 912.2 put_hevc_qpel_uni_w_h24_8_i8mm: 664.2 put_hevc_qpel_uni_w_h32_8_c: 9198.2 put_hevc_qpel_uni_w_h32_8_neon: 1638.2 put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7 put_hevc_qpel_uni_w_h48_8_c: 20754.7 put_hevc_qpel_uni_w_h48_8_neon: 3633.7 put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7 put_hevc_qpel_uni_w_h64_8_c: 36854.7 put_hevc_qpel_uni_w_h64_8_neon: 6435.7 put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:03:18 +02:00
Martin Storsjö	de23b384fd	aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm In addition to just templating, this contains one change to ff_hevc_put_hevc_epel_bi_hv32_8, by setting the w6 register which ff_hevc_put_hevc_epel_h32_8_neon requires. AWS Graviton 3: put_hevc_epel_bi_hv4_8_c: 176.5 put_hevc_epel_bi_hv4_8_neon: 62.0 put_hevc_epel_bi_hv4_8_i8mm: 58.0 put_hevc_epel_bi_hv6_8_c: 343.7 put_hevc_epel_bi_hv6_8_neon: 109.7 put_hevc_epel_bi_hv6_8_i8mm: 105.7 put_hevc_epel_bi_hv8_8_c: 536.0 put_hevc_epel_bi_hv8_8_neon: 112.7 put_hevc_epel_bi_hv8_8_i8mm: 111.7 put_hevc_epel_bi_hv12_8_c: 1107.7 put_hevc_epel_bi_hv12_8_neon: 254.7 put_hevc_epel_bi_hv12_8_i8mm: 239.0 put_hevc_epel_bi_hv16_8_c: 1927.7 put_hevc_epel_bi_hv16_8_neon: 356.2 put_hevc_epel_bi_hv16_8_i8mm: 334.2 put_hevc_epel_bi_hv24_8_c: 4195.2 put_hevc_epel_bi_hv24_8_neon: 736.7 put_hevc_epel_bi_hv24_8_i8mm: 715.5 put_hevc_epel_bi_hv32_8_c: 7280.5 put_hevc_epel_bi_hv32_8_neon: 1287.7 put_hevc_epel_bi_hv32_8_i8mm: 1162.2 put_hevc_epel_bi_hv48_8_c: 16857.7 put_hevc_epel_bi_hv48_8_neon: 2836.2 put_hevc_epel_bi_hv48_8_i8mm: 2908.5 put_hevc_epel_bi_hv64_8_c: 29248.2 put_hevc_epel_bi_hv64_8_neon: 5051.7 put_hevc_epel_bi_hv64_8_i8mm: 4491.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 09:03:16 +02:00
Martin Storsjö	96e5adda9f	aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm AWS Graviton 3: put_hevc_epel_uni_w_hv4_8_c: 191.2 put_hevc_epel_uni_w_hv4_8_neon: 87.7 put_hevc_epel_uni_w_hv4_8_i8mm: 83.2 put_hevc_epel_uni_w_hv6_8_c: 349.5 put_hevc_epel_uni_w_hv6_8_neon: 153.0 put_hevc_epel_uni_w_hv6_8_i8mm: 148.5 put_hevc_epel_uni_w_hv8_8_c: 581.2 put_hevc_epel_uni_w_hv8_8_neon: 166.7 put_hevc_epel_uni_w_hv8_8_i8mm: 163.5 put_hevc_epel_uni_w_hv12_8_c: 1230.0 put_hevc_epel_uni_w_hv12_8_neon: 387.7 put_hevc_epel_uni_w_hv12_8_i8mm: 370.2 put_hevc_epel_uni_w_hv16_8_c: 2003.2 put_hevc_epel_uni_w_hv16_8_neon: 501.5 put_hevc_epel_uni_w_hv16_8_i8mm: 490.2 put_hevc_epel_uni_w_hv24_8_c: 4448.7 put_hevc_epel_uni_w_hv24_8_neon: 1092.2 put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7 put_hevc_epel_uni_w_hv32_8_c: 7817.2 put_hevc_epel_uni_w_hv32_8_neon: 1916.2 put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5 put_hevc_epel_uni_w_hv48_8_c: 16728.2 put_hevc_epel_uni_w_hv48_8_neon: 4263.7 put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7 put_hevc_epel_uni_w_hv64_8_c: 29563.2 put_hevc_epel_uni_w_hv64_8_neon: 7474.2 put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:58 +02:00
Martin Storsjö	d7294199ab	aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm AWS Graviton 3: put_hevc_epel_uni_hv4_8_c: 163.5 put_hevc_epel_uni_hv4_8_neon: 59.7 put_hevc_epel_uni_hv4_8_i8mm: 57.5 put_hevc_epel_uni_hv6_8_c: 344.7 put_hevc_epel_uni_hv6_8_neon: 105.0 put_hevc_epel_uni_hv6_8_i8mm: 102.7 put_hevc_epel_uni_hv8_8_c: 552.2 put_hevc_epel_uni_hv8_8_neon: 111.2 put_hevc_epel_uni_hv8_8_i8mm: 104.0 put_hevc_epel_uni_hv12_8_c: 1195.0 put_hevc_epel_uni_hv12_8_neon: 248.7 put_hevc_epel_uni_hv12_8_i8mm: 229.5 put_hevc_epel_uni_hv16_8_c: 1910.2 put_hevc_epel_uni_hv16_8_neon: 339.5 put_hevc_epel_uni_hv16_8_i8mm: 323.2 put_hevc_epel_uni_hv24_8_c: 4048.2 put_hevc_epel_uni_hv24_8_neon: 737.7 put_hevc_epel_uni_hv24_8_i8mm: 713.7 put_hevc_epel_uni_hv32_8_c: 6865.7 put_hevc_epel_uni_hv32_8_neon: 1285.0 put_hevc_epel_uni_hv32_8_i8mm: 1206.0 put_hevc_epel_uni_hv48_8_c: 15830.5 put_hevc_epel_uni_hv48_8_neon: 2844.7 put_hevc_epel_uni_hv48_8_i8mm: 2914.0 put_hevc_epel_uni_hv64_8_c: 27912.7 put_hevc_epel_uni_hv64_8_neon: 4970.5 put_hevc_epel_uni_hv64_8_i8mm: 4653.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:28 +02:00
Martin Storsjö	7bf3d14769	aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm AWS Graviton 3: put_hevc_epel_hv4_8_c: 163.7 put_hevc_epel_hv4_8_neon: 52.5 put_hevc_epel_hv4_8_i8mm: 49.5 put_hevc_epel_hv6_8_c: 292.2 put_hevc_epel_hv6_8_neon: 97.7 put_hevc_epel_hv6_8_i8mm: 101.2 put_hevc_epel_hv8_8_c: 471.0 put_hevc_epel_hv8_8_neon: 106.7 put_hevc_epel_hv8_8_i8mm: 102.5 put_hevc_epel_hv12_8_c: 1030.2 put_hevc_epel_hv12_8_neon: 240.5 put_hevc_epel_hv12_8_i8mm: 215.0 put_hevc_epel_hv16_8_c: 1711.5 put_hevc_epel_hv16_8_neon: 340.2 put_hevc_epel_hv16_8_i8mm: 319.2 put_hevc_epel_hv24_8_c: 3670.0 put_hevc_epel_hv24_8_neon: 702.0 put_hevc_epel_hv24_8_i8mm: 666.5 put_hevc_epel_hv32_8_c: 6785.5 put_hevc_epel_hv32_8_neon: 1247.0 put_hevc_epel_hv32_8_i8mm: 1169.0 put_hevc_epel_hv48_8_c: 14689.7 put_hevc_epel_hv48_8_neon: 2665.2 put_hevc_epel_hv48_8_i8mm: 2740.0 put_hevc_epel_hv64_8_c: 25899.2 put_hevc_epel_hv64_8_neon: 4801.2 put_hevc_epel_hv64_8_i8mm: 4487.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:19 +02:00
Martin Storsjö	5b5666e5ab	aarch64: hevc: Reorder epel_hv functions to prepare for templating This is a pure reordering of code without changing anything in the individual functions. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:07 +02:00
Martin Storsjö	e6d4c0e117	aarch64: hevc: Split the epel_*_hv functions into two parts The first horizontal filter can use either i8mm or plain neon versions, while the second part is a pure neon implementation. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:59:00 +02:00
Martin Storsjö	54af555bfa	aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 AWS Graviton 3: put_hevc_epel_uni_w_h4_8_c: 97.2 put_hevc_epel_uni_w_h4_8_neon: 41.2 put_hevc_epel_uni_w_h4_8_i8mm: 35.2 put_hevc_epel_uni_w_h6_8_c: 203.7 put_hevc_epel_uni_w_h6_8_neon: 84.7 put_hevc_epel_uni_w_h6_8_i8mm: 74.7 put_hevc_epel_uni_w_h8_8_c: 345.7 put_hevc_epel_uni_w_h8_8_neon: 94.0 put_hevc_epel_uni_w_h8_8_i8mm: 80.7 put_hevc_epel_uni_w_h12_8_c: 768.7 put_hevc_epel_uni_w_h12_8_neon: 196.7 put_hevc_epel_uni_w_h12_8_i8mm: 169.7 put_hevc_epel_uni_w_h16_8_c: 1313.0 put_hevc_epel_uni_w_h16_8_neon: 290.7 put_hevc_epel_uni_w_h16_8_i8mm: 238.0 put_hevc_epel_uni_w_h24_8_c: 2877.5 put_hevc_epel_uni_w_h24_8_neon: 650.0 put_hevc_epel_uni_w_h24_8_i8mm: 512.0 put_hevc_epel_uni_w_h32_8_c: 5113.5 put_hevc_epel_uni_w_h32_8_neon: 1129.5 put_hevc_epel_uni_w_h32_8_i8mm: 739.2 put_hevc_epel_uni_w_h48_8_c: 11757.0 put_hevc_epel_uni_w_h48_8_neon: 2518.7 put_hevc_epel_uni_w_h48_8_i8mm: 1688.5 put_hevc_epel_uni_w_h64_8_c: 20478.0 put_hevc_epel_uni_w_h64_8_neon: 4411.7 put_hevc_epel_uni_w_h64_8_i8mm: 2884.0 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:47 +02:00
Martin Storsjö	6d384298ec	aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 AWS Graviton 3: put_hevc_epel_h4_8_c: 64.7 put_hevc_epel_h4_8_neon: 25.0 put_hevc_epel_h4_8_i8mm: 21.2 put_hevc_epel_h6_8_c: 130.0 put_hevc_epel_h6_8_neon: 40.7 put_hevc_epel_h6_8_i8mm: 36.5 put_hevc_epel_h8_8_c: 209.0 put_hevc_epel_h8_8_neon: 45.2 put_hevc_epel_h8_8_i8mm: 41.2 put_hevc_epel_h12_8_c: 465.5 put_hevc_epel_h12_8_neon: 104.5 put_hevc_epel_h12_8_i8mm: 86.5 put_hevc_epel_h16_8_c: 830.7 put_hevc_epel_h16_8_neon: 134.2 put_hevc_epel_h16_8_i8mm: 114.0 put_hevc_epel_h24_8_c: 1844.7 put_hevc_epel_h24_8_neon: 282.2 put_hevc_epel_h24_8_i8mm: 277.2 put_hevc_epel_h32_8_c: 3227.5 put_hevc_epel_h32_8_neon: 501.5 put_hevc_epel_h32_8_i8mm: 396.0 put_hevc_epel_h48_8_c: 7229.2 put_hevc_epel_h48_8_neon: 1120.2 put_hevc_epel_h48_8_i8mm: 901.2 put_hevc_epel_h64_8_c: 12869.0 put_hevc_epel_h64_8_neon: 1999.2 put_hevc_epel_h64_8_i8mm: 1610.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:29 +02:00
Martin Storsjö	8f03c30a17	aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:20 +02:00
Martin Storsjö	717cc82d28	aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping For widths of 32 pixels and more, loop first horizontally, then vertically. Previously, this function would process a 16 pixel wide slice of the block, looping vertically. After processing the whole height, it would backtrack and process the next 16 pixel wide slice. When doing 8tap filtering horizontally, the function must load 7 more pixels (in practice, 8) following the actual inputs, and this was done for each slice. By iterating first horizontally throughout each line, then vertically, we access data in a more cache friendly order, and we don't need to reload data unnecessarily. Keep the original order in put_hevc_\type\()_h12_8_neon; the only suboptimal case there is for width=24. But specializing an optimal variant for that would require more code, which might not be worth it. For the h16 case, this implementation would give a slowdown, as it now loads the first 8 pixels separately from the rest, but for larger widths, it is a gain. Therefore, keep the h16 case as it was (but remove the outer loop), and create a new specialized version for horizontal looping with 16 pixels at a time. Before: Cortex A53 A72 A73 Graviton 3 put_hevc_qpel_h16_8_neon: 710.5 667.7 692.5 211.0 put_hevc_qpel_h32_8_neon: 2791.5 2643.5 2732.0 883.5 put_hevc_qpel_h64_8_neon: 10954.0 10657.0 10874.2 3241.5 After: put_hevc_qpel_h16_8_neon: 697.5 663.5 705.7 212.5 put_hevc_qpel_h32_8_neon: 2767.2 2684.5 2791.2 920.5 put_hevc_qpel_h64_8_neon: 10559.2 10471.5 10932.2 3051.7 Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:11 +02:00
Martin Storsjö	e3a54cabde	aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon This gets rid of a couple instructions, but the actual performance is almost identical on Cortex A72/A73. On Cortex A53, it is a handful of cycles faster. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:01 +02:00
Martin Storsjö	78db8405c0	aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon store temporary buffers on the stack. When consuming it, many of these functions use the stack pointer as incremental pointer for reading the data (instead of storing it in another register), which is rather unusual. Technically, this is fine as long as the pointer remains properly aligned. However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, after incrementing sp when reading data (within each 16 pixel wide stripe) it would then reset the stack pointer back to a lower value, for reading the next 16 pixel wide stripe, expecting the data to remain untouched. This can't be assumed; data on the stack below the stack pointer can be clobbered (e.g. by a signal handler). Some OS ABIs allow for a little margin that won't be touched, aka a red zone, but not all do. The ones that do, guarantee 16 or 128 bytes, not 9 KB. Convert this function to use a separate pointer register to iterate through the data, retaining the stack pointer to point at the bottom of the data we require to remain untouched. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:57:55 +02:00
Martin Storsjö	e66858fbab	aarch64: hevc: Reorder a misplaced function init line Group the epel and qpel functions together. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:57:50 +02:00
Martin Storsjö	0c5da7be59	aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm The first 32 elements of each row were correct, while the last 16 were scrambled. This hasn't been noticed, because the checkasm test erroneously only checked half of the output (for 8 bit functions), and apparently none of the samples as part of "fate-hevc" seem to trigger this specific function. Signed-off-by: J. Dekker <jdek@itanimul.li>	2024-03-14 13:42:39 +01:00
J. Dekker	570052cd2a	avcodec/aarch64/hevc: add luma deblock NEON Benched using single-threaded full decode on an Ampere Altra. Bpp Before After Speedup 8 73,3s 65,2s 1.124x 10 114,2s 104,0s 1.098x 12 125,8s 115,7s 1.087x Signed-off-by: J. Dekker <jdek@itanimul.li>	2024-02-28 10:14:58 +01:00
Mikhail Nitenko	0f745b74ec	lavc/aarch64: h264qpel, add 10-bit lowpass_8_10 based functions Benchmarks A53 A55 A72 A76 avg_h264_qpel_8_mc01_10_c: 936.5 924.0 656.0 504.7 avg_h264_qpel_8_mc01_10_neon: 234.7 202.0 120.7 63.2 avg_h264_qpel_8_mc02_10_c: 921.0 920.0 669.2 493.7 avg_h264_qpel_8_mc02_10_neon: 202.0 173.2 102.7 58.5 avg_h264_qpel_8_mc03_10_c: 936.5 924.0 656.0 509.5 avg_h264_qpel_8_mc03_10_neon: 236.2 203.7 120.0 63.2 avg_h264_qpel_8_mc10_10_c: 1441.0 1437.7 806.7 478.5 avg_h264_qpel_8_mc10_10_neon: 325.7 324.0 153.7 94.2 avg_h264_qpel_8_mc11_10_c: 2160.7 2148.2 1366.7 906.7 avg_h264_qpel_8_mc11_10_neon: 492.0 464.0 242.5 134.5 avg_h264_qpel_8_mc13_10_c: 2157.0 2138.2 1357.0 908.2 avg_h264_qpel_8_mc13_10_neon: 494.0 467.2 242.0 140.0 avg_h264_qpel_8_mc20_10_c: 1433.5 1410.0 785.2 486.0 avg_h264_qpel_8_mc20_10_neon: 293.7 289.7 138.0 91.5 avg_h264_qpel_8_mc30_10_c: 1458.5 1461.7 813.7 483.2 avg_h264_qpel_8_mc30_10_neon: 341.7 339.2 154.0 95.2 avg_h264_qpel_8_mc31_10_c: 2194.7 2197.2 1358.7 928.0 avg_h264_qpel_8_mc31_10_neon: 520.0 495.0 245.5 142.5 avg_h264_qpel_8_mc33_10_c: 2188.0 2205.5 1356.7 910.7 avg_h264_qpel_8_mc33_10_neon: 521.0 494.5 245.7 145.7 avg_h264_qpel_16_mc01_10_c: 3717.2 3595.0 2610.0 2012.0 avg_h264_qpel_16_mc01_10_neon: 920.5 791.5 483.2 240.5 avg_h264_qpel_16_mc02_10_c: 3684.0 3633.0 2659.0 1919.7 avg_h264_qpel_16_mc02_10_neon: 790.7 678.2 409.2 217.0 avg_h264_qpel_16_mc03_10_c: 3726.5 3596.0 2606.7 2010.0 avg_h264_qpel_16_mc03_10_neon: 922.0 792.5 483.2 239.7 avg_h264_qpel_16_mc10_10_c: 5912.0 5803.2 3241.5 1916.7 avg_h264_qpel_16_mc10_10_neon: 1267.5 1277.2 616.5 365.0 avg_h264_qpel_16_mc11_10_c: 8599.2 8482.5 5338.0 3616.2 avg_h264_qpel_16_mc11_10_neon: 1913.0 1827.0 956.2 542.2 avg_h264_qpel_16_mc13_10_c: 8643.7 8488.5 5388.0 3628.5 avg_h264_qpel_16_mc13_10_neon: 1914.7 1828.7 969.2 530.5 avg_h264_qpel_16_mc20_10_c: 5719.5 5641.0 3147.0 1946.2 avg_h264_qpel_16_mc20_10_neon: 1139.5 1150.0 539.5 344.0 avg_h264_qpel_16_mc30_10_c: 5930.0 5872.5 3267.5 1918.0 avg_h264_qpel_16_mc30_10_neon: 1331.5 1341.2 616.5 369.5 avg_h264_qpel_16_mc31_10_c: 8758.7 8697.7 5353.0 3630.7 avg_h264_qpel_16_mc31_10_neon: 2018.7 1941.7 982.2 574.7 avg_h264_qpel_16_mc33_10_c: 8683.2 8675.2 5339.2 3634.7 avg_h264_qpel_16_mc33_10_neon: 2019.7 1940.2 994.5 566.0 put_h264_qpel_8_mc01_10_c: 854.2 843.0 599.2 478.0 put_h264_qpel_8_mc01_10_neon: 192.7 168.0 101.7 56.7 put_h264_qpel_8_mc02_10_c: 766.5 760.0 550.2 441.0 put_h264_qpel_8_mc02_10_neon: 160.0 139.2 88.7 53.0 put_h264_qpel_8_mc03_10_c: 854.2 843.0 599.2 479.0 put_h264_qpel_8_mc03_10_neon: 194.2 169.7 102.0 56.2 put_h264_qpel_8_mc10_10_c: 1352.7 1353.7 749.7 446.7 put_h264_qpel_8_mc10_10_neon: 289.7 294.2 135.5 88.5 put_h264_qpel_8_mc11_10_c: 2080.0 2066.2 1309.5 876.7 put_h264_qpel_8_mc11_10_neon: 450.0 429.7 229.7 131.2 put_h264_qpel_8_mc13_10_c: 2074.7 2060.2 1294.5 870.5 put_h264_qpel_8_mc13_10_neon: 452.5 434.5 226.5 130.0 put_h264_qpel_8_mc20_10_c: 1221.5 1216.0 684.5 399.7 put_h264_qpel_8_mc20_10_neon: 257.7 262.5 121.2 78.7 put_h264_qpel_8_mc30_10_c: 1379.0 1374.7 757.2 449.5 put_h264_qpel_8_mc30_10_neon: 305.7 310.2 135.5 86.5 put_h264_qpel_8_mc31_10_c: 2109.2 2119.7 1299.5 878.0 put_h264_qpel_8_mc31_10_neon: 478.0 458.5 226.0 137.2 put_h264_qpel_8_mc33_10_c: 2101.5 2115.2 1306.5 887.0 put_h264_qpel_8_mc33_10_neon: 479.0 458.7 229.7 141.7 put_h264_qpel_16_mc01_10_c: 3485.7 3396.7 2460.5 1914.5 put_h264_qpel_16_mc01_10_neon: 752.5 665.5 397.0 213.2 put_h264_qpel_16_mc02_10_c: 3103.5 3023.2 2154.7 1720.7 put_h264_qpel_16_mc02_10_neon: 622.7 551.2 347.7 196.2 put_h264_qpel_16_mc03_10_c: 3486.2 3394.0 2436.5 1917.7 put_h264_qpel_16_mc03_10_neon: 754.0 666.5 397.0 215.7 put_h264_qpel_16_mc10_10_c: 5533.0 5488.5 2989.0 1783.0 put_h264_qpel_16_mc10_10_neon: 1123.5 1165.2 535.2 334.7 put_h264_qpel_16_mc11_10_c: 8437.7 8281.2 5209.0 3510.7 put_h264_qpel_16_mc11_10_neon: 1745.0 1697.0 878.5 513.5 put_h264_qpel_16_mc13_10_c: 8567.7 8468.0 5221.5 3528.0 put_h264_qpel_16_mc13_10_neon: 1751.7 1698.2 889.2 507.0 put_h264_qpel_16_mc20_10_c: 4907.5 4885.0 2786.2 1607.5 put_h264_qpel_16_mc20_10_neon: 995.5 1034.5 475.5 307.0 put_h264_qpel_16_mc30_10_c: 5579.7 5537.7 3045.2 1789.5 put_h264_qpel_16_mc30_10_neon: 1187.5 1231.2 532.5 334.5 put_h264_qpel_16_mc31_10_c: 8677.2 8672.5 5204.2 3516.0 put_h264_qpel_16_mc31_10_neon: 1850.7 1813.2 893.0 545.2 put_h264_qpel_16_mc33_10_c: 8688.7 8671.2 5223.2 3512.0 put_h264_qpel_16_mc33_10_neon: 1851.7 1814.2 908.5 535.2 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-07 23:20:14 +02:00
Logan Lyu	fa0470347e	lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_hv put_hevc_qpel_bi_hv4_8_c: 433.7 put_hevc_qpel_bi_hv4_8_i8mm: 117.9 put_hevc_qpel_bi_hv6_8_c: 803.9 put_hevc_qpel_bi_hv6_8_i8mm: 252.7 put_hevc_qpel_bi_hv8_8_c: 1296.4 put_hevc_qpel_bi_hv8_8_i8mm: 316.2 put_hevc_qpel_bi_hv12_8_c: 2867.4 put_hevc_qpel_bi_hv12_8_i8mm: 669.2 put_hevc_qpel_bi_hv16_8_c: 4709.4 put_hevc_qpel_bi_hv16_8_i8mm: 929.9 put_hevc_qpel_bi_hv24_8_c: 9639.7 put_hevc_qpel_bi_hv24_8_i8mm: 2072.4 put_hevc_qpel_bi_hv32_8_c: 16663.7 put_hevc_qpel_bi_hv32_8_i8mm: 3391.4 put_hevc_qpel_bi_hv48_8_c: 36972.9 put_hevc_qpel_bi_hv48_8_i8mm: 7505.7 put_hevc_qpel_bi_hv64_8_c: 64106.4 put_hevc_qpel_bi_hv64_8_i8mm: 13145.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	595f97028b	lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_v put_hevc_qpel_bi_v4_8_c: 166.1 put_hevc_qpel_bi_v4_8_neon: 61.9 put_hevc_qpel_bi_v6_8_c: 309.4 put_hevc_qpel_bi_v6_8_neon: 75.6 put_hevc_qpel_bi_v8_8_c: 531.1 put_hevc_qpel_bi_v8_8_neon: 78.1 put_hevc_qpel_bi_v12_8_c: 1139.9 put_hevc_qpel_bi_v12_8_neon: 238.1 put_hevc_qpel_bi_v16_8_c: 2063.6 put_hevc_qpel_bi_v16_8_neon: 308.9 put_hevc_qpel_bi_v24_8_c: 4317.1 put_hevc_qpel_bi_v24_8_neon: 629.9 put_hevc_qpel_bi_v32_8_c: 8241.9 put_hevc_qpel_bi_v32_8_neon: 1140.1 put_hevc_qpel_bi_v48_8_c: 18422.9 put_hevc_qpel_bi_v48_8_neon: 2533.9 put_hevc_qpel_bi_v64_8_c: 37508.6 put_hevc_qpel_bi_v64_8_neon: 4520.1 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	00290a64f7	lavc/aarch64: new optimization for 8-bit hevc_epel_bi_hv put_hevc_epel_bi_hv4_8_c: 242.9 put_hevc_epel_bi_hv4_8_i8mm: 68.6 put_hevc_epel_bi_hv6_8_c: 402.4 put_hevc_epel_bi_hv6_8_i8mm: 135.9 put_hevc_epel_bi_hv8_8_c: 636.4 put_hevc_epel_bi_hv8_8_i8mm: 145.6 put_hevc_epel_bi_hv12_8_c: 1363.1 put_hevc_epel_bi_hv12_8_i8mm: 324.1 put_hevc_epel_bi_hv16_8_c: 2222.1 put_hevc_epel_bi_hv16_8_i8mm: 509.1 put_hevc_epel_bi_hv24_8_c: 4793.4 put_hevc_epel_bi_hv24_8_i8mm: 1091.9 put_hevc_epel_bi_hv32_8_c: 8393.9 put_hevc_epel_bi_hv32_8_i8mm: 1720.6 put_hevc_epel_bi_hv48_8_c: 19526.6 put_hevc_epel_bi_hv48_8_i8mm: 4285.9 put_hevc_epel_bi_hv64_8_c: 33915.4 put_hevc_epel_bi_hv64_8_i8mm: 6783.6 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	0448f27f41	lavc/aarch64: new optimization for 8-bit hevc_epel_bi_v put_hevc_epel_bi_v4_8_c: 138.4 put_hevc_epel_bi_v4_8_neon: 33.7 put_hevc_epel_bi_v6_8_c: 302.9 put_hevc_epel_bi_v6_8_neon: 46.7 put_hevc_epel_bi_v8_8_c: 408.7 put_hevc_epel_bi_v8_8_neon: 48.7 put_hevc_epel_bi_v12_8_c: 779.4 put_hevc_epel_bi_v12_8_neon: 139.7 put_hevc_epel_bi_v16_8_c: 1344.9 put_hevc_epel_bi_v16_8_neon: 160.2 put_hevc_epel_bi_v24_8_c: 2981.7 put_hevc_epel_bi_v24_8_neon: 344.9 put_hevc_epel_bi_v32_8_c: 5280.9 put_hevc_epel_bi_v32_8_neon: 618.4 put_hevc_epel_bi_v48_8_c: 12494.9 put_hevc_epel_bi_v48_8_neon: 1364.4 put_hevc_epel_bi_v64_8_c: 22127.7 put_hevc_epel_bi_v64_8_neon: 2473.7 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	216275bd80	lavc/aarch64: new optimization for 8-bit hevc_epel_bi_h put_hevc_epel_bi_h4_8_c: 96.0 put_hevc_epel_bi_h4_8_neon: 36.3 put_hevc_epel_bi_h6_8_c: 288.3 put_hevc_epel_bi_h6_8_neon: 59.3 put_hevc_epel_bi_h8_8_c: 358.5 put_hevc_epel_bi_h8_8_neon: 61.5 put_hevc_epel_bi_h12_8_c: 759.8 put_hevc_epel_bi_h12_8_neon: 159.5 put_hevc_epel_bi_h16_8_c: 1307.0 put_hevc_epel_bi_h16_8_neon: 182.0 put_hevc_epel_bi_h24_8_c: 2778.3 put_hevc_epel_bi_h24_8_neon: 430.5 put_hevc_epel_bi_h32_8_c: 4952.3 put_hevc_epel_bi_h32_8_neon: 679.5 put_hevc_epel_bi_h48_8_c: 11803.3 put_hevc_epel_bi_h48_8_neon: 1443.5 put_hevc_epel_bi_h64_8_c: 20654.8 put_hevc_epel_bi_h64_8_neon: 2737.0 put_hevc_qpel_bi_h4_8_c: 140.0 put_hevc_qpel_bi_h4_8_neon: 111.5 put_hevc_qpel_bi_h6_8_c: 318.0 put_hevc_qpel_bi_h6_8_neon: 85.8 put_hevc_qpel_bi_h8_8_c: 536.5 put_hevc_qpel_bi_h8_8_neon: 95.3 put_hevc_qpel_bi_h12_8_c: 1188.5 put_hevc_qpel_bi_h12_8_neon: 291.3 put_hevc_qpel_bi_h16_8_c: 2064.3 put_hevc_qpel_bi_h16_8_neon: 365.3 put_hevc_qpel_bi_h24_8_c: 4757.5 put_hevc_qpel_bi_h24_8_neon: 1010.0 put_hevc_qpel_bi_h32_8_c: 8351.8 put_hevc_qpel_bi_h32_8_neon: 2917.8 put_hevc_qpel_bi_h48_8_c: 19299.8 put_hevc_qpel_bi_h48_8_neon: 2976.8 put_hevc_qpel_bi_h64_8_c: 34182.5 put_hevc_qpel_bi_h64_8_neon: 5236.3 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	40cf4a5ca3	lavc/aarch64: new optimization for 8-bit hevc_pel_bi_pixels put_hevc_pel_bi_pixels4_8_c: 54.7 put_hevc_pel_bi_pixels4_8_neon: 43.0 put_hevc_pel_bi_pixels6_8_c: 94.7 put_hevc_pel_bi_pixels6_8_neon: 37.0 put_hevc_pel_bi_pixels8_8_c: 171.0 put_hevc_pel_bi_pixels8_8_neon: 24.0 put_hevc_pel_bi_pixels12_8_c: 354.0 put_hevc_pel_bi_pixels12_8_neon: 68.7 put_hevc_pel_bi_pixels16_8_c: 588.2 put_hevc_pel_bi_pixels16_8_neon: 77.5 put_hevc_pel_bi_pixels24_8_c: 1670.7 put_hevc_pel_bi_pixels24_8_neon: 173.0 put_hevc_pel_bi_pixels32_8_c: 2267.7 put_hevc_pel_bi_pixels32_8_neon: 281.2 put_hevc_pel_bi_pixels48_8_c: 5787.5 put_hevc_pel_bi_pixels48_8_neon: 673.5 put_hevc_pel_bi_pixels64_8_c: 9897.0 put_hevc_pel_bi_pixels64_8_neon: 1159.5 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
xufuji456	cc86343b96	lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b" Signed-off-by: xufuji456 <839789740@qq.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-11-28 15:54:49 +02:00
Logan Lyu	55f28eb627	lavc/aarch64: new optimization for 8-bit hevc_qpel_hv checkasm bench: put_hevc_qpel_hv4_8_c: 422.1 put_hevc_qpel_hv4_8_i8mm: 101.6 put_hevc_qpel_hv6_8_c: 756.4 put_hevc_qpel_hv6_8_i8mm: 225.9 put_hevc_qpel_hv8_8_c: 1189.9 put_hevc_qpel_hv8_8_i8mm: 296.6 put_hevc_qpel_hv12_8_c: 2407.4 put_hevc_qpel_hv12_8_i8mm: 552.4 put_hevc_qpel_hv16_8_c: 4021.4 put_hevc_qpel_hv16_8_i8mm: 886.6 put_hevc_qpel_hv24_8_c: 8992.1 put_hevc_qpel_hv24_8_i8mm: 1968.9 put_hevc_qpel_hv32_8_c: 15197.9 put_hevc_qpel_hv32_8_i8mm: 3209.4 put_hevc_qpel_hv48_8_c: 32811.1 put_hevc_qpel_hv48_8_i8mm: 7442.1 put_hevc_qpel_hv64_8_c: 58106.1 put_hevc_qpel_hv64_8_i8mm: 12423.9 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:14:21 +02:00
Logan Lyu	97a9d12657	lavc/aarch64: new optimization for 8-bit hevc_qpel_v checkasm bench: put_hevc_qpel_v4_8_c: 138.1 put_hevc_qpel_v4_8_neon: 41.1 put_hevc_qpel_v6_8_c: 276.6 put_hevc_qpel_v6_8_neon: 60.9 put_hevc_qpel_v8_8_c: 478.9 put_hevc_qpel_v8_8_neon: 72.9 put_hevc_qpel_v12_8_c: 1072.6 put_hevc_qpel_v12_8_neon: 203.9 put_hevc_qpel_v16_8_c: 1852.1 put_hevc_qpel_v16_8_neon: 264.1 put_hevc_qpel_v24_8_c: 4137.6 put_hevc_qpel_v24_8_neon: 586.9 put_hevc_qpel_v32_8_c: 7579.1 put_hevc_qpel_v32_8_neon: 1036.6 put_hevc_qpel_v48_8_c: 16355.6 put_hevc_qpel_v48_8_neon: 2326.4 put_hevc_qpel_v64_8_c: 33545.1 put_hevc_qpel_v64_8_neon: 4126.4 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:14:21 +02:00
Logan Lyu	265450b89e	lavc/aarch64: new optimization for 8-bit hevc_epel_hv checkasm bench: put_hevc_epel_hv4_8_c: 213.7 put_hevc_epel_hv4_8_i8mm: 59.4 put_hevc_epel_hv6_8_c: 350.9 put_hevc_epel_hv6_8_i8mm: 130.2 put_hevc_epel_hv8_8_c: 548.7 put_hevc_epel_hv8_8_i8mm: 136.9 put_hevc_epel_hv12_8_c: 1126.7 put_hevc_epel_hv12_8_i8mm: 302.2 put_hevc_epel_hv16_8_c: 1925.2 put_hevc_epel_hv16_8_i8mm: 459.9 put_hevc_epel_hv24_8_c: 4301.9 put_hevc_epel_hv24_8_i8mm: 1024.9 put_hevc_epel_hv32_8_c: 7509.2 put_hevc_epel_hv32_8_i8mm: 1680.4 put_hevc_epel_hv48_8_c: 16566.9 put_hevc_epel_hv48_8_i8mm: 3945.4 put_hevc_epel_hv64_8_c: 29134.2 put_hevc_epel_hv64_8_i8mm: 6567.7 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:02:53 +02:00
Logan Lyu	22c7291506	lavc/aarch64: new optimization for 8-bit hevc_epel_v checkasm bench: put_hevc_epel_v4_8_c: 79.9 put_hevc_epel_v4_8_neon: 25.7 put_hevc_epel_v6_8_c: 151.4 put_hevc_epel_v6_8_neon: 46.4 put_hevc_epel_v8_8_c: 250.9 put_hevc_epel_v8_8_neon: 41.7 put_hevc_epel_v12_8_c: 542.7 put_hevc_epel_v12_8_neon: 108.7 put_hevc_epel_v16_8_c: 939.4 put_hevc_epel_v16_8_neon: 169.2 put_hevc_epel_v24_8_c: 2104.9 put_hevc_epel_v24_8_neon: 307.9 put_hevc_epel_v32_8_c: 3713.9 put_hevc_epel_v32_8_neon: 524.2 put_hevc_epel_v48_8_c: 8175.2 put_hevc_epel_v48_8_neon: 1197.2 put_hevc_epel_v64_8_c: 16049.4 put_hevc_epel_v64_8_neon: 2094.9 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:02:53 +02:00
Logan Lyu	772865717b	lavc/aarch64: new optimization for 8-bit hevc_epel_pixels and and hevc_qpel_pixels checkasm bench: put_hevc_pel_pixels4_8_c: 33.7 put_hevc_pel_pixels4_8_neon: 20.2 put_hevc_pel_pixels6_8_c: 61.4 put_hevc_pel_pixels6_8_neon: 25.4 put_hevc_pel_pixels8_8_c: 121.4 put_hevc_pel_pixels8_8_neon: 16.9 put_hevc_pel_pixels12_8_c: 199.9 put_hevc_pel_pixels12_8_neon: 40.2 put_hevc_pel_pixels16_8_c: 355.9 put_hevc_pel_pixels16_8_neon: 43.4 put_hevc_pel_pixels24_8_c: 774.7 put_hevc_pel_pixels24_8_neon: 78.9 put_hevc_pel_pixels32_8_c: 1345.2 put_hevc_pel_pixels32_8_neon: 152.2 put_hevc_pel_pixels48_8_c: 2963.7 put_hevc_pel_pixels48_8_neon: 309.4 put_hevc_pel_pixels64_8_c: 5236.2 put_hevc_pel_pixels64_8_neon: 514.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:02:53 +02:00
Martin Storsjö	a4877f1ec1	aarch64: Only enable extensions in the intended files/regions This eases actual development of the assembly functions, by only allowing extension instructions within the sections that explicitly enable them, instead of having all extensions enabled everywhere. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-24 14:46:20 +03:00
Martin Storsjö	1762975ba1	libavcodec/aarch64/hevc: Require consistent use of trailing semicolon Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-23 10:39:12 +03:00
Martin Storsjö	a76b409dd0	aarch64: Reindent all assembly to 8/24 column indentation libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-21 23:25:54 +03:00
Martin Storsjö	7f905f3672	aarch64: Make the indentation more consistent Some functions have slightly different indentation styles; try to match the surrounding code. libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-21 23:25:29 +03:00

1 2 3 4 5 ...

402 Commits