Commit Graph

455 Commits

Author SHA1 Message Date
James Darnley
5336887867 avcodec/h264: sse2, avx h luma mbaff deblock/loop filter
x86-64 only

Yorkfield:
- sse2: ~2.17x (434 vs. 200 cycles)

Nehalem:
- sse2: ~2.94x (409 vs. 139 cycles)

Skylake:
- sse2: ~3.10x (370 vs. 119 cycles)
- avx:  ~3.29x (370 vs. 112 cycles)
2017-02-18 20:26:52 +01:00
James Darnley
7627df15d4 x86util: import MOVHL macro
Originally committed to x264 in 1637239a by Henrik Gramner who has
agreed to re-license it as LGPL.  Original commit message follows.

    x86: Avoid some bypass delays and false dependencies

    A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
    between int and float domains, so try to avoid that if possible.
2017-02-18 20:26:51 +01:00
James Darnley
9d815b7424 avcodec/x86: deduplicate PASS8ROWS macro
2017-02-18 20:26:49 +01:00
James Almer
8d5df204d0 Merge commit '8e9cd81d291b1010c625b2766058aadf4affb537'
* commit '8e9cd81d291b1010c625b2766058aadf4affb537':
  x86: cpu: Detect Conroe CPUs and their slow shuffle unit

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 15:20:54 -03:00
James Almer
2eab48177d Merge commit '7d7355aa92bb36ca0765c49a569a999bcb96f332'
* commit '7d7355aa92bb36ca0765c49a569a999bcb96f332':
  x86: Add SSSE3_SLOW CPU flag and related convenience macros

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 15:17:19 -03:00
Henrik Gramner
cd09e3b349 x86inc: Avoid using eax/rax for storing the stack pointer
When allocating stack space with an alignment requirement that is larger
than the current stack alignment we need to store a copy of the original
stack pointer in order to be able to restore it later.

If we choose to use another register for this purpose we should not pick
eax/rax since it can be overwritten as a return value.
2017-01-09 16:00:29 +01:00
Michael Niedermayer
051517648b avutil/x86/emms: Document the emms_c() vs alloc/free relation.
Reviewed-by: Andreas Cadhalpun <andreas.cadhalpun@googlemail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2016-10-23 13:02:37 +02:00
Fiona Glaser
8e9cd81d29 x86: cpu: Detect Conroe CPUs and their slow shuffle unit
2016-07-20 18:43:28 +02:00
Diego Biurrun
7d7355aa92 x86: Add SSSE3_SLOW CPU flag and related convenience macros
2016-07-20 18:43:28 +02:00
James Almer
fd5e6a095f x86util: Extend SPLATW for avx2
Integration to Libav by Josh de Kock <josh@itanimul.li>.

Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
2016-07-18 15:27:13 +02:00
Ronald S. Bultje
f0a2b6249b vp9: add 16x16 idct avx2 (8-bit).
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:

nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
2016-07-11 10:14:58 -04:00
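The speedup quoted in the vp9 commit above can be recomputed from its checkasm figures. The following is a minimal Python sketch, not part of the commit: the `speedups` helper and the `BENCH` string are illustrative assumptions, with the AVX and AVX2 timings copied from the message.

```python
# Illustrative only: recompute AVX -> AVX2 speedups from checkasm output.
# BENCH holds the avx/avx2 lines copied from the commit message above.
BENCH = """\
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
"""


def speedups(text, base="avx", opt="avx2"):
    """Return {sub_idct: base_time / opt_time} for every size present."""
    timings = {}
    for line in text.splitlines():
        name, value = line.split(":")
        # Names end in _<sub_idct>_<opt>, e.g. ..._add_8_16_avx2.
        _prefix, sub_idct, variant = name.rsplit("_", 2)
        timings[(int(sub_idct), variant)] = float(value)
    return {
        sub: timings[(sub, base)] / timings[(sub, opt)]
        for (sub, variant) in timings
        if variant == opt and (sub, base) in timings
    }


if __name__ == "__main__":
    for sub, ratio in sorted(speedups(BENCH).items()):
        print(f"sub-IDCT {sub}: {ratio:.2f}x")
```

For the full IDCT (sub-IDCT 16) this yields a ratio of roughly 1.64, in line with the "about 1.65x" stated in the message.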
Matthieu Bouron
9eb3da2f99 asm: FF_-prefix internal macros used in inline assembly
See merge commit '39d6d3618d48625decaff7d9bdbb45b44ef2a805'.
2016-06-27 17:21:18 +02:00
Clément Bœsch
8ef57a0d61 Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb'
* commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
  cosmetics: Fix spelling mistakes

Merged-by: Clément Bœsch <u@pkh.me>
2016-06-21 21:55:34 +02:00
Matt Oliver
5ca44ebd99 lavu/intmath.h: fix compilation with msvc10.
Signed-off-by: Matt Oliver <protogonoi@gmail.com>
2016-06-13 13:49:24 +10:00
James Almer
172af20852 x86/showcqt: use three operand format for some instructions
Fixes failures with yasm 1.1.0 and older

Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-08 19:37:08 -03:00
James Almer
99b899483e avutil/x86util: move haddps sse emulation from showcqt
Signed-off-by: James Almer <jamrial@gmail.com>
2016-06-08 14:18:00 -03:00
Diego Biurrun
1e9c5bf4c1 asm: FF_-prefix internal macros used in inline assembly
These macros conflict with system macros on Solaris, producing
truckloads of warnings about macro redefinition.
2016-05-28 19:18:26 +02:00
Anton Mitrofanov
2fb1d17a5a x86inc: Enable AVX emulation in additional cases
Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-05-16 10:31:24 +02:00
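The rule described in the commit above can be modelled in a few lines. This is a hedged Python sketch of the idea only, not the actual x86inc macro code: the function name `emulate_avx`, the `movaps` copy and the error text are assumptions made for illustration.

```python
# Toy model: lower a three-operand `op dst, src1, src2` to the
# two-operand form available without AVX. Not the real x86inc logic.
def emulate_avx(op, dst, src1, src2, commutative):
    """Return the list of two-operand instructions emulating the AVX form."""
    if dst == src1:
        # dst already holds src1: a single destructive op is enough.
        return [f"{op} {dst}, {src2}"]
    if dst == src2:
        if not commutative:
            # Copying src1 into dst first would clobber src2.
            raise ValueError(f"cannot emulate {op} {dst}, {src1}, {src2}")
        # dst already holds src2; for a commutative op the operands can
        # simply be swapped.
        return [f"{op} {dst}, {src1}"]
    # General case: copy src1 into dst, then apply the op with src2.
    return [f"movaps {dst}, {src1}", f"{op} {dst}, {src2}"]


if __name__ == "__main__":
    print(emulate_avx("addps", "m0", "m1", "m0", commutative=True))
    print(emulate_avx("addps", "m2", "m0", "m1", commutative=True))
```

In this model a scalar instruction such as `addss` would be passed with `commutative=False`, matching the reasoning in the related commit on scalar float instructions.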
Anton Mitrofanov
300fb0df84 x86inc: Improve handling of %ifid with multi-token parameters
The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-05-16 10:31:20 +02:00
Anton Mitrofanov
8d02579fae x86inc: Fix AVX emulation of some instructions
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-05-16 10:31:17 +02:00
Henrik Gramner
ba3eb745cc x86inc: Fix AVX emulation of scalar float instructions
Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-05-16 10:31:13 +02:00
Vittorio Giovara
41ed7ab45f cosmetics: Fix spelling mistakes
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2016-05-04 18:16:21 +02:00
Anton Mitrofanov
e428f3b30c x86inc: Enable AVX emulation in additional cases
Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.
2016-04-20 19:16:22 +02:00
Anton Mitrofanov
4bd5583ace x86inc: Improve handling of %ifid with multi-token parameters
The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.
2016-04-20 19:16:22 +02:00
Anton Mitrofanov
42be240ad6 x86inc: Fix AVX emulation of some instructions
2016-04-20 19:16:22 +02:00
Henrik Gramner
8dd3ee9ddd x86inc: Fix AVX emulation of scalar float instructions
Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.
2016-04-20 19:16:22 +02:00
James Almer
70d685a77f x86: use the new helper macros where useful
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-02-14 20:00:21 -03:00
James Almer
73a4589d4b x86: add some more helper macros to check for slow cpuflags
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-02-14 20:00:17 -03:00
James Almer
be22bd32fe x86/cpu: set avxslow cpuflag on btver2 CPUs
They are also slow when using 256-bit wide registers.

Reviewed-by: Hendrik Leppkes <h.leppkes@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-02-07 16:39:21 -03:00
James Almer
b3b0ecee15 x86/emms: empty the mmx state unconditionally on supported targets
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-02-04 01:49:01 -03:00
Timothy Gu
44304ae322 all: Add missing header guards
2016-01-28 19:49:48 -08:00
James Almer
b624f0660b x86: Add ymm_reg struct
Needed to declare 32-byte long constants

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2016-01-28 00:41:19 +01:00
Geza Lore
cc602061ee x86inc: Add debug symbols indicating sizes of compiled functions
Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they can get confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.

Currently only implemented for ELF.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:46:28 +01:00
Henrik Gramner
002c47798d x86inc: Avoid creating unnecessary local labels
The REP_RET workaround is only needed on old AMD cpus, and the labels clutter
up the symbol table and confuse debugging/profiling tools, so use EQU to
create SHN_ABS symbols instead of creating local labels. Furthermore, skip
the workaround completely in functions that definitely won't run on such cpus.

Note that EQU is just creating a local label when using nasm instead of yasm.
This is probably a bug, but at least it doesn't break anything.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:44:25 +01:00
Henrik Gramner
fd6ecac38e x86inc: Simplify AUTO_REP_RET
cpuflags is never undefined any more, it's set to 0 instead.

Also fix an incorrect comment.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:43:39 +01:00
Henrik Gramner
5ca8e195e5 x86inc: Use more consistent indentation
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:42:59 +01:00
Henrik Gramner
91ed050f42 x86inc: Preserve arguments when allocating stack space
When allocating stack space with a larger alignment than the known stack
alignment a temporary register is used for storing the stack pointer.
Ensure that this isn't one of the registers used for passing arguments.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:41:59 +01:00
Henrik Gramner
715eb7ca24 x86inc: Improve FMA instruction handling
* Correctly handle FMA instructions with memory operands.
* Print a warning if FMA instructions are used without the correct cpuflag.
* Simplify the instantiation code.
* Clarify documentation.

Only the last operand in FMA3 instructions can be a memory operand. When
converting FMA4 instructions to FMA3 instructions we can utilize the fact
that multiply is a commutative operation and reorder operands if necessary
to ensure that a memory operand is used only as the last operand.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:30:30 +01:00
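The operand reordering described in the FMA commit above can be sketched as follows. This is an illustrative Python model, not the x86inc implementation: the helper names, the `[...]` convention for memory operands and the hard-coded `ps` suffix are assumptions. It relies on the documented FMA3 forms, where `vfmadd132 x, y, z` computes `x*z + y`, `vfmadd213 x, y, z` computes `y*x + z`, and `vfmadd231 x, y, z` computes `y*z + x`.

```python
# Toy model of rewriting the FMA4-style `fmaddps dst, a, b, c`
# (dst = a*b + c) as a single FMA3 instruction. Illustration only.
def is_mem(operand):
    """Assumed convention: memory operands are written as [addr]."""
    return operand.startswith("[")


def fma4_to_fma3(dst, a, b, c):
    if is_mem(dst):
        raise ValueError("destination must be a register")
    if dst == a or dst == b:
        other = b if dst == a else a  # the factor that is not dst
        if is_mem(other) and is_mem(c):
            raise ValueError("only one memory operand is allowed")
        if is_mem(other):
            # 132 form: dst = dst*other + c, memory factor goes last.
            return f"vfmadd132ps {dst}, {c}, {other}"
        # 213 form: dst = other*dst + c, c may be memory.
        return f"vfmadd213ps {dst}, {other}, {c}"
    if dst == c:
        if is_mem(a):
            # Multiplication is commutative, so swap the factors to
            # move the memory operand into the last position.
            a, b = b, a
        if is_mem(a):
            raise ValueError("only one memory operand is allowed")
        # 231 form: dst = a*b + dst.
        return f"vfmadd231ps {dst}, {a}, {b}"
    raise ValueError("dst must equal one of the sources for FMA3")


if __name__ == "__main__":
    print(fma4_to_fma3("m0", "m0", "[r1]", "m2"))
    print(fma4_to_fma3("m0", "[r1]", "m1", "m0"))
```

In both calls the memory operand ends up last, which is the constraint the commit message states for FMA3.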
Henrik Gramner
f60f06d989 x86inc: Be more verbose in assertion failures
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:30:07 +01:00
Henrik Gramner
7adcd4e841 x86inc: Make cpuflag() and notcpuflag() return 0 or 1
Makes it possible to use them in arithmetic expressions.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2016-01-23 20:19:19 +01:00
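To show why a strict 0-or-1 result is convenient, here is a small Python sketch of the idea behind the commit above. It is not the x86inc macro itself: the flag bit values and the `vector_bytes` helper are hypothetical and exist only for illustration.

```python
# Hypothetical flag bits, chosen for illustration only.
CPUFLAGS_SSE2 = 1 << 0
CPUFLAGS_AVX = 1 << 1
CPUFLAGS_AVX2 = 1 << 2


def cpuflag(active, flag):
    """Return 1 if every bit of `flag` is set in `active`, else 0."""
    return int((active & flag) == flag)


def notcpuflag(active, flag):
    """Return the logical inverse of cpuflag(), also as 0 or 1."""
    return 1 - cpuflag(active, flag)


def vector_bytes(active):
    # Because cpuflag() is exactly 0 or 1 it can be used directly in
    # arithmetic: 16 bytes as a baseline, 32 when AVX2 is available.
    return 16 + 16 * cpuflag(active, CPUFLAGS_AVX2)


if __name__ == "__main__":
    print(vector_bytes(CPUFLAGS_SSE2))
    print(vector_bytes(CPUFLAGS_SSE2 | CPUFLAGS_AVX | CPUFLAGS_AVX2))
```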
Geza Lore
d39c229e54 x86inc: Add debug symbols indicating sizes of compiled functions
Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they can get confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.

Currently only implemented for ELF.
2016-01-21 23:19:46 +01:00
Henrik Gramner
d3662777e0 x86inc: Avoid creating unnecessary local labels
The REP_RET workaround is only needed on old AMD cpus, and the labels clutter
up the symbol table and confuse debugging/profiling tools, so use EQU to
create SHN_ABS symbols instead of creating local labels. Furthermore, skip
the workaround completely in functions that definitely won't run on such cpus.

Note that EQU is just creating a local label when using nasm instead of yasm.
This is probably a bug, but at least it doesn't break anything.
2016-01-21 23:19:46 +01:00
Henrik Gramner
87b587d4fe x86inc: Simplify AUTO_REP_RET
cpuflags is never undefined any more, it's set to 0 instead.

Also fix an incorrect comment.
2016-01-21 23:19:46 +01:00
Henrik Gramner
2d60b18cf0 x86inc: Use more consistent indentation
2016-01-21 23:19:46 +01:00
Henrik Gramner
dfe771dc5a x86inc: Preserve arguments when allocating stack space
When allocating stack space with a larger alignment than the known stack
alignment a temporary register is used for storing the stack pointer.
Ensure that this isn't one of the registers used for passing arguments.
2016-01-21 23:19:46 +01:00
Henrik Gramner
b1496008ee x86inc: Improve FMA instruction handling
* Correctly handle FMA instructions with memory operands.
* Print a warning if FMA instructions are used without the correct cpuflag.
* Simplify the instantiation code.
* Clarify documentation.

Only the last operand in FMA3 instructions can be a memory operand. When
converting FMA4 instructions to FMA3 instructions we can utilize the fact
that multiply is a commutative operation and reorder operands if necessary
to ensure that a memory operand is used only as the last operand.
2016-01-21 23:19:46 +01:00
Henrik Gramner
6cbd0fdf28 x86inc: Be more verbose in assertion failures
2016-01-21 23:19:46 +01:00
James Almer
a2e1b66460 x86/intmath: disable sse av_clip functions when using ICC
It seems to miscompile them

Should fix fate-ra-288 and fate-twinvq

Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-21 16:50:51 -03:00
James Almer
dee579ffcd x86/fixed_dsp: add ff_butterflies_fixed_sse2
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-16 21:09:38 -03:00
Ganesh Ajjanagadde
5989add4ab lavu/x86/lls: add fma3 optimizations for update_lls
This improves accuracy (very slightly) and speed for processors having
fma3.

Sample benchmark (fate flac-16-lpc-cholesky, Haswell):
old:
5993610 decicycles in ff_lpc_calc_coefs,      64 runs,      0 skips
5951528 decicycles in ff_lpc_calc_coefs,     128 runs,      0 skips

new:
5252410 decicycles in ff_lpc_calc_coefs,      64 runs,      0 skips
5232869 decicycles in ff_lpc_calc_coefs,     128 runs,      0 skips

Tested with FATE and --disable-fma3, also examined contents of
lavu/lls-test.

Reviewed-by: James Almer <jamrial@gmail.com>
Reviewed-by: Henrik Gramner <henrik@gramner.com>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
2016-01-15 16:46:13 -05:00