Commit Graph

751 Commits

Author SHA1 Message Date
Diego Biurrun
4d1f69f244 x86: h264_qpel_10bit: port to cpuflags 2012-11-09 21:17:05 +01:00
Diego Biurrun
6ca60d4ddd x86: h264_intrapred: port to cpuflags 2012-11-08 18:05:23 +01:00
Diego Biurrun
930e26a3ea x86: h264qpel: Only define mmxext QPEL functions if H264QPEL is enabled
This fixes compilation with --disable-everything and components enabled.
2012-11-05 20:48:43 +01:00
Diego Biurrun
dbb37e7711 x86: PABSW: port to cpuflags 2012-11-05 14:51:10 +01:00
Diego Biurrun
6c104826bd x86: vc1dsp: port to cpuflags 2012-11-05 14:51:10 +01:00
Diego Biurrun
0a7a94f2e5 x86: Refactor PSWAPD fallback implementations and port to cpuflags 2012-11-02 17:05:29 +01:00
Diego Biurrun
26f01bd106 x86: PMINUB: port to cpuflags 2012-11-02 15:38:15 +01:00
Diego Biurrun
9ce02e14f0 x86: ac3dsp: port to cpuflags 2012-11-02 15:24:50 +01:00
Diego Biurrun
c37322e68c x86: Move optimization suffix to end of function names
This simplifies cpuflags porting.
2012-10-31 18:21:55 +01:00
Diego Biurrun
fa8fcab1e0 x86: h264_chromamc_10bit: drop pointless PAVG %define
It is only used in one place so there is no need for the abstraction.
2012-10-31 18:21:55 +01:00
Diego Biurrun
d8eda37080 x86: mmx2 ---> mmxext in function names 2012-10-31 17:53:57 +01:00
Diego Biurrun
be2c456e96 x86: fmtconvert: Refactor cvtps2pi emulation through cpuflags 2012-10-31 01:05:03 +01:00
Diego Biurrun
be923ed659 x86: fmtconvert: port to cpuflags 2012-10-31 01:05:03 +01:00
Diego Biurrun
588fafe7f3 x86: MMX2 ---> MMXEXT in macro names 2012-10-31 01:04:55 +01:00
Diego Biurrun
652f518594 x86: mmx2 ---> mmxext in comments and messages 2012-10-31 00:37:42 +01:00
Diego Biurrun
04581c8c77 x86: yasm: Use complete source path for macro helper %includes
This is more consistent with the way we handle C #includes and
it simplifies the build system.
2012-10-31 00:37:42 +01:00
Diego Biurrun
6860b4081d x86: include x86inc.asm in x86util.asm
This is necessary to allow refactoring some x86util macros with cpuflags.
2012-10-31 00:37:42 +01:00
Ronald S. Bultje
95c89da36e Use ptrdiff_t instead of int for intra pred "stride" function parameter.
This way, SIMD-optimized functions don't have to sign-extend their
stride argument manually to be able to do pointer arithmetic.
2012-10-29 17:49:13 -07:00
Ronald S. Bultje
bad8e33dc9 x86: use PRED4x4/8x8/8x8L/16x16 macros to declare intrapred prototypes. 2012-10-29 17:48:23 -07:00
Ronald S. Bultje
c285edd06e Remove usage of INIT_AVX in h264_intrapred_10bit.asm.
Replace INIT_AVX by INIT_XMM avx. Port the whole file to use cpuflag
based function declarations. Remove (now unused) cputype argument in
function declaration macros. Change function prototypes to have mmx2
instead of mmxext as suffix, since that's required by cpuflags.
2012-10-29 14:10:51 -07:00
Luca Barbato
2d6caade22 dsputil: split out mlp dsp function 2012-10-11 12:01:08 +02:00
Janne Grunau
7e522859fc x86: vc1: call ff_vc1dsp_init_x86() under if (ARCH_X86) 2012-10-08 11:54:05 +02:00
Janne Grunau
cb36febcbc x86: cavs: call ff_cavsdsp_init_x86() under if (ARCH_X86) 2012-10-08 11:54:05 +02:00
Janne Grunau
f101eab1be x86: call most of the x86 dsp init functions under if (ARCH_X86)
Rename the called dsp init functions to *_init_x86.
2012-10-08 11:54:05 +02:00
Diego Biurrun
e4cbf7529b Give all anonymously typedeffed structs in headers a name
Anonymous structs cannot be forward declared and have no benefit.
2012-10-06 09:27:11 +02:00
Mans Rullgard
bcf07a15a0 x86: dsputil: kill VLA in gmc_mmx()
Instead of using an evil VLA, fall back to C version when edge
emulation is needed.  MPEG4 GMC is a rarely used fringe feature
so the speed loss is an acceptable cost for safer code.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-10-05 22:33:32 +01:00
Diego Biurrun
9c6cf7f2c9 avcodec: Drop silly and/or broken printf debug output 2012-10-01 10:24:28 +02:00
Michael Niedermayer
791b5954bc dsputil_mmx: fix reading prior of the src array in sub_hfyu_median_prediction()
This should fix the utvideoenc valgrind failure

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2012-09-28 12:25:07 -04:00
Diego Biurrun
58139e141b x86: dsputil: Move Xvid IDCT put/add functions to a more suitable place 2012-09-14 01:59:47 +02:00
Diego Biurrun
2017f0fdb7 x86: Remove some leftover declarations for non-existent functions 2012-09-13 21:38:47 +02:00
Martin Storsjö
91ff4e83ca x86: ac3dsp: Only refer to the ac3_downmix_sse symbol if it has been declared
This fixes building without inline assembly.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-09-13 13:51:52 +03:00
Mans Rullgard
97cb9236cf ac3: move ac3_downmix() from dsputil to ac3dsp
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-09-12 23:39:50 +01:00
Diego Biurrun
1648a508fa x86: dsputil: Move specific optimization settings out of global init function
They belong in the init functions specific to each CPU capability.
2012-09-11 10:12:17 +02:00
Diego Biurrun
a84edbacaf x86: dsputil: Only compile motion_est code when encoders are enabled 2012-09-10 08:31:47 +02:00
Diego Biurrun
e0c6cce447 x86: Replace checks for CPU extensions and flags by convenience macros
This separates code relying on inline from that relying on external
assembly and fixes instances where the coalesced check was incorrect.
2012-09-08 18:18:34 +02:00
Hendrik Leppkes
fb4e983e0c x86: mlpdsp: mlp_filter_channel_x86 requires inline asm
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-09-08 15:41:44 +03:00
Diego Biurrun
1169f0d0af x86: more specific checks for availability of required assembly capabilities 2012-09-07 18:16:04 +02:00
Diego Biurrun
8cb7ed5562 x86: avcodec: Drop silly "_mmx" suffix from dsputil template names 2012-09-07 13:50:52 +02:00
Mans Rullgard
6efb698883 cavsdsp: set idct permutation independently of dsputil
CAVS uses its own idct so using dsputil to set the permutation
is fragile.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-09-07 11:42:35 +01:00
Mans Rullgard
5fe64d88f6 x86: allow using add_hfyu_median_prediction_cmov on any cpu with cmov
For some reason add_hfyu_median_prediction_cmov is only selected
on 3Dnow-capable CPUs, even though it uses no 3Dnow instructions.
This patch allows it to be selected on any cpu with cmov with the
possibility of being overridden by the mmxext version.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-09-07 11:42:35 +01:00
Diego Biurrun
ef6ba1f237 x86: dsputil: Do not redundantly check for CPU caps before calling init funcs
The init functions check for CPU capabilities on their own already.
2012-09-06 09:05:52 +02:00
Hendrik Leppkes
d914ea6fd8 x86: vp56: cmov version of vp56_rac_get_prob requires inline asm
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-09-05 21:30:46 +02:00
Diego Biurrun
a84ac7a860 x86: h264dsp: drop some unnecessary ifdefs around prototype declarations 2012-09-04 01:44:59 +02:00
Diego Biurrun
17337f54c0 x86: Split inline and external assembly #ifdefs 2012-08-31 01:53:25 +02:00
Diego Biurrun
ec36aa6944 x86: Fix linking with some or all of yasm, mmx, optimizations disabled
Some optimized template functions reference optimized symbols, so they
must be explicitly disabled when those symbols are unavailable.
2012-08-30 19:37:32 +02:00
Diego Biurrun
a886b279a0 x86: cosmetics: Comment some #endifs for better readability 2012-08-30 18:50:33 +02:00
Diego Biurrun
2e6f93a284 x86: Always compile files with functions that are called unconditionally 2012-08-29 00:27:06 +02:00
Diego Biurrun
2f2aa2e542 x86: mpegvideoenc: fix linking with --disable-mmx
The optimized dct_quantize template functions reference optimized
fdct symbols, so these functions must only be enabled if the relevant
optimizations have been enabled by configure.
2012-08-29 00:26:56 +02:00
Diego Biurrun
d39791bf39 x86: mpegvideoenc: Do not abuse HAVE_ variables for template instantiation
This avoids trouble if HAVE_ variables are used elsewhere in the file.
2012-08-29 00:14:52 +02:00
Diego Biurrun
bcc45d6348 x86: avcodec: Drop silly "_mmx" suffixes from filenames 2012-08-28 18:37:34 +02:00
Diego Biurrun
efbd04c332 x86: avcodec: Drop silly "_sse" suffixes from filenames 2012-08-28 18:37:33 +02:00
Diego Biurrun
3f02c533f3 build: fft: x86: Drop unused YASM-OBJS-FFT- variable 2012-08-27 03:10:58 +02:00
Mans Rullgard
db70730291 x86: fft: remove unused fft_dispatch* functions
These functions are not used since the yasm conversion.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-25 23:58:26 +01:00
Diego Biurrun
dc40285427 x86: mpegvideo: more sensible names for optimization file and init function 2012-08-24 02:23:16 +02:00
Diego Biurrun
d211547ddd x86: mpegvideoenc: Split optimizations off into a separate file 2012-08-24 02:23:16 +02:00
Diego Biurrun
26ce9aec03 dnxhdenc: x86: more sensible names for optimization file and init function 2012-08-24 02:23:15 +02:00
Diego Biurrun
6fa488678f build: x86: Only compile mpegvideo optimizations when necessary 2012-08-22 01:06:33 +02:00
Diego Biurrun
6961bdface x86: avcodec: Consistently name all init files 2012-08-16 11:05:38 +02:00
Martin Storsjö
1d9c2dc89a Don't include common.h from avutil.h
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-08-15 22:32:06 +03:00
Diego Biurrun
29cfdd3767 x86: avcodec: Appropriately name files containing only init functions 2012-08-15 03:24:08 +02:00
Diego Biurrun
be12958937 mpegvideo_mmx_template: drop some commented-out cruft 2012-08-15 03:24:07 +02:00
Mans Rullgard
8ec0204ee4 x86: cabac: allow building with suncc
This fixes two issues preventing suncc from building this code.

The undocumented 'a' operand modifier, causing gcc to omit a $ in
front of immediate operands (as required in addresses), is not
supported by suncc.  Luckily, the also undocumented 'c' modifer
has the same effect and is supported.

On some asm statements with a large number of operands, suncc for no
obvious reason fails to correctly substitute some of the operands.
Fortunately, some of the operands in these statements are plain
numbers which can be inserted directly into the code block instead
of passed as operands.

With these changes, the code builds correctly with both gcc and
suncc.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-13 14:51:52 +01:00
Mans Rullgard
c8252e80eb x86: mlpdsp: avoid taking address of void
This code contains a C array of addresses of labels defined in
inline asm.  To do this, the names must be declared as external
in C.  The declared type does not matter since only the address is
used, and for some reason, the author of the code used the 'void'
type despite taking the address of a void expression being invalid.

Changing the type to char, a reasonable choice since the alignment
of the code labels cannot be known or guaranteed, eliminates gcc
warnings and allows building with suncc.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-13 14:51:52 +01:00
Diego Biurrun
3b9e832e17 x86: Drop silly "_yasm" suffixes from filenames 2012-08-12 17:13:05 +02:00
Mans Rullgard
d7a4f8f8b9 Move MASK_ABS macro to libavcodec/mathops.h
This macro is only used in two places, both in libavcodec, so this
is a more sensible place for it.

Two small tweaks to the macro are made:

- removing the trailing semicolon
- dropping unnecessary 'volatile' from the x86 asm

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-09 00:58:20 +01:00
Mans Rullgard
c318626ce2 x86: rename libavutil/x86_cpu.h to libavutil/x86/asm.h
This puts x86-specific things in the x86/ subdirectory where they
belong.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-09 00:58:20 +01:00
Dave Yeo
197439c1ef x86: pngdsp: Fix assembly for OS/2
The a.out object format does not allow aligning sections.
On OS/2 LD aligns sections to 16 bytes.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-08-08 15:45:09 +02:00
Mans Rullgard
2b140a3d09 x86: use 32-bit source registers with movd instruction
yasm tolerates mismatch between movd/movq and source register size,
adjusting the instruction according to the register.  nasm is more
strict.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-07 15:21:20 +01:00
Mans Rullgard
a3df4781f4 x86: add colons after labels
nasm prints a warning if the colon is missing.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-07 15:20:56 +01:00
Anton Khirnov
36ef5369ee Replace all CODEC_ID_* with AV_CODEC_ID_* 2012-08-07 16:00:24 +02:00
Diego Biurrun
2096857551 x86: h264_idct: Rename x264_add8x4_idct_sse2 --> h264_add8x4_idct_sse2 2012-08-05 21:40:49 +02:00
Ronald S. Bultje
4a8143e73c fft: 3dnow: fix register name typo in DECL_IMDCT macro
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-08-04 00:16:02 +02:00
Diego Biurrun
0c3ff1982c x86: dct32: port to cpuflags 2012-08-03 22:51:06 +02:00
Diego Biurrun
239fdf1b4a x86: build: replace mmx2 by mmxext
Refactoring mmx2/mmxext YASM code with cpuflags will force renames.
So switching to a consistent naming scheme beforehand is sensible.
The name "mmxext" is more official and widespread and also the name
of the CPU flag, as reported e.g. by the Linux kernel.
2012-08-03 22:51:05 +02:00
Ronald S. Bultje
da6505ad2f dsputil: make add_hfyu_left_prediction_sse4() support unaligned src.
This makes add_hfyu_left_prediction_sse4() handle sources that are not
16-byte aligned in its own function rather than by proxying the call to
add_hfyu_left_prediction_ssse3(). This fixes a crash on Win64, since the
sse4 version clobberes xmm6, but the ssse3 version (which uses MMX regs)
does not restore it, thus leading to XMM clobbering and RSP being off.

Fixes bug 342.
2012-08-03 11:09:14 -07:00
Diego Biurrun
ca844b7be9 x86: Use consistent 3dnowext function and macro name suffixes
Currently there is a wild mix of 3dn2/3dnow2/3dnowext.  Switching to
"3dnowext", which is a more common name of the CPU flag, as reported
e.g. by the Linux kernel, unifies this.
2012-08-03 14:00:47 +02:00
Diego Biurrun
03737412a3 x86: proresdsp: improve SIGNEXTEND macro comments 2012-08-02 22:30:44 +02:00
Diego Biurrun
81905088a1 x86: h264dsp: K&R formatting cosmetics 2012-08-02 20:20:21 +02:00
Ronald S. Bultje
c728518b3c x86: fft: fix imdct_half() for AVX
Some calculations were changed in b6a3849 to use mmsize, which was not correct
for the AVX version, which uses INIT_YMM and therefore has mmsize == 32.

Fixes Bug 341.

Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-08-02 13:40:11 -04:00
Mans Rullgard
ec7c501ed5 x86: remove libmpeg2 mmx(ext) idct functions
These functions are not faster than other mmx implementations on
any hardware I have been able to test on, and they are horribly
inaccurate.  There is thus no reason to ever use them.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-02 12:14:52 +01:00
Ronald S. Bultje
b6a3849adb fft: port FFT/IMDCT 3dnow functions to yasm, and disable on x86-64.
64-bit CPUs always have SSE available, thus there is no need to compile
in the 3dnow functions. This results in smaller binaries.
2012-07-31 21:20:47 -07:00
Ronald S. Bultje
53dfaedc01 x86/dsputilenc: bury inline asm under HAVE_INLINE_ASM. 2012-07-31 20:28:52 -07:00
Diego Biurrun
6376a3ad24 x86: h264dsp: Remove unused variable ff_pb_3_1 2012-08-01 00:17:16 +02:00
Diego Biurrun
8728b381cb x86: h264dsp: Adjust YASM #ifdefs
This fixes compilation with YASM disabled.
2012-07-31 13:54:07 +02:00
Ronald S. Bultje
b829b4ce29 h264: convert loop filter strength dsp function to yasm.
This completes the conversion of h264dsp to yasm; note that h264 also
uses some dsputil functions, most notably qpel. Performance-wise, the
yasm-version is ~10 cycles faster (182->172) on x86-64, and ~8 cycles
faster (201->193) on x86-32.
2012-07-30 19:39:47 -07:00
Ronald S. Bultje
c83f44dba1 h264_idct_10bit: port x86 assembly to cpuflags. 2012-07-28 08:29:45 -07:00
Ronald S. Bultje
b3c5ae5607 fft: rename "z" to "zc" to prevent name collision.
Without this, cglobal will expand "z" to "zh" to access the high byte
in a register's word, which causes a name collision with the ZH(x) macro
further up in this file.
2012-07-28 08:29:44 -07:00
Ronald S. Bultje
4d777eedfd vp3: don't compile mmx IDCT functions on x86-64.
64-bit CPUs always have SSE2, and a SSE2 version exists, thus the MMX
version will never be used.
2012-07-27 20:12:30 -07:00
Ronald S. Bultje
a5bbb1242c h264_loopfilter: port x86 simd to cpuflags. 2012-07-27 20:12:11 -07:00
Ronald S. Bultje
d07ff3cd5a h264_chromamc_10bit: port x86 simd to cpuflags. 2012-07-27 17:35:49 -07:00
Ronald S. Bultje
4a26fdd852 vp3: port x86 SIMD to cpuflags. 2012-07-27 17:35:49 -07:00
Ronald S. Bultje
76888c64b0 rv34: port x86 SIMD to cpuflags. 2012-07-27 15:13:26 -07:00
Ronald S. Bultje
158744a4cd vp56: only compile MMX SIMD on x86-32.
All x86-64 CPUs have SSE2, so the MMX version will never be used. This
leads to smaller binaries.
2012-07-27 14:40:27 -07:00
Ronald S. Bultje
2734ba787b vp56: port x86 simd to cpuflags. 2012-07-27 14:39:07 -07:00
Ronald S. Bultje
5361e10a5e proresdsp: port x86 assembly to cpuflags. 2012-07-27 11:43:06 -07:00
Ronald S. Bultje
bde73f28af mpegaudio: bury inline asm under HAVE_INLINE_ASM. 2012-07-26 13:43:16 -07:00
Ronald S. Bultje
30b45d9c38 x86inc: automatically insert vzeroupper for YMM functions. 2012-07-26 13:43:16 -07:00
Ronald S. Bultje
a1878a88a1 vp3: don't use calls to inline asm in yasm code.
Mixing yasm and inline asm is a bad idea, since if either yasm or inline
asm is not supported by your toolchain, all of the asm stops working.
Thus, better to use either one or the other alone.

Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2012-07-25 14:24:30 -04:00
Ronald S. Bultje
79195ce565 x86/dsputil: put inline asm under HAVE_INLINE_ASM.
This allows compiling with compilers that don't support gcc-style
inline assembly.

Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2012-07-25 14:24:27 -04:00
Yang Wang
845e92fd6a dsputil_mmx: fix incorrect assembly code
In ff_put_pixels_clamped_mmx(), there are two assembly code blocks.
In the first block (in the unrolled loop), the instructions
"movq 8%3, %%mm1 \n\t", and so forth, have problems.

From above instruction, it is clear what the programmer wants: a load from
p + 8. But this assembly code doesn’t guarantee that. It only works if the
compiler puts p in a register to produce an instruction like this:
"movq 8(%edi), %mm1". During compiler optimization, it is possible that the
compiler will be able to constant propagate into p. Suppose p = &x[10000].
Then operand 3 can become 10000(%edi), where %edi holds &x. And the instruction
becomes "movq 810000(%edx)". That is, it will stride by 810000 instead of 8.

This will cause a segmentation fault.

This error was fixed in the second block of the assembly code, but not in
the unrolled loop.

How to reproduce:
    This error is exposed when we build using Intel C++ Compiler, with
    IPO+PGO optimization enabled. Crashed when decoding an MJPEG video.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2012-07-25 14:22:18 -04:00
Jason Garrett-Glaser
85a3c19ed1 dsputil: x86: add SHUFFLE_MASK_W macro
Simplifies pshufb masks that operate on words.
2012-07-22 16:56:58 -04:00
Diego Biurrun
9f97af2688 x86: dsputil: drop some unused CPU flag debug code 2012-07-19 10:17:56 +02:00
Mans Rullgard
28f9ab7029 vp3: move idct and loop filter pointers to new vp3dsp context
This moves all VP3-specific function pointers from dsputil to a
new vp3dsp context.  There is no reason to ever use the VP3 IDCT
where an MPEG2 IDCT is expected or vice versa.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-07-18 10:32:19 +01:00
Mans Rullgard
ab9f987661 build: add CONFIG_VP3DSP, reduce repetition in OBJS lists
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-07-18 10:32:18 +01:00
Martin Storsjö
f27386cdc7 x86: h264_intrapred: Don't add the 'd' suffix to the SPLATB_REG macro
The SPLATB_REG macro already adds the 'd' suffix internally.

This fixes building on Win64, which has been broken since 878e66902.

This worked for unix, where r2 happened to be rdx in this case, which
with the first suffix rdxd was mapped to eax, and eaxd is defined back
to eax. On win64 however, r2 happened to be R8 in this case, and
R8d mapps to R8D just fine, but there's no mapping for R8Dd to anything.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-07-06 21:07:23 +03:00
Diego Biurrun
878e669029 x86: h264_intrapred: use newly introduced SPLAT* and PSHUFLW macros 2012-07-05 17:37:11 +02:00
Loren Merritt
4d4752366f x86inc: add SPLATB_LOAD, SPLATB_REG, PSHUFLW macros
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-07-05 17:37:11 +02:00
Diego Biurrun
d20f133ef9 x86: h264_intrapred: port to cpuflag macros 2012-07-05 17:37:10 +02:00
Martin Storsjö
07eeeb1d4f vp8: Add ifdef guards around the sse2 loopfilter in the sse2slow branch too
This was missed in the the previous commit in 70a1c800.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-07-05 09:39:01 +03:00
Martin Storsjö
70a1c8000f vp8: loopfilter >=sse2 functions need aligned stack on x86-32.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-07-04 08:25:50 -07:00
Ronald S. Bultje
723b266d72 dsputilenc: group yasm and inline asm function pointer assignment. 2012-07-04 07:46:27 -07:00
Ronald S. Bultje
ceabc13f12 dsputilenc_mmx: split assignment of ff_sse16_sse2 to SSE2 section. 2012-06-30 09:24:52 -07:00
Ronald S. Bultje
66a02159ea x86: fmtconvert: add special asm for float_to_int16_interleave_misc_*
This gets rid of a variable-length array and a for loop in C code.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-06-30 19:10:36 +03:00
Mans Rullgard
f2fd167835 x86: vc1: fix and enable optimised loop filter
The problem is that the ssse3 psign instruction does the wrong
thing here.  Commit ea60dfe incorrectly removed a macro emulating
this instruction for pre-ssse3 code.  However, the emulation is
incorrect, and the code relies on the behaviour of the macro.
Specifically, the psign sets destination elements to zero where
the corresponding source element is zero, whereas the emulation
only negates destination elements where the source is negative.

Furthermore, the PSIGNW_MMX macro in x86util.asm is totally bogus,
which is why the original VC-1 code had an additional right shift
when using it.  Since the psign instruction cannot be used here,
skip all the macro hell and use the working instruction sequence
directly.

None of this was noticed due a stray return statement in
ff_vc1dsp_init_mmx() which meant that only the mmx version of the
loop filter was ever used (before being removed in ea60dfe).

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-30 00:12:05 +01:00
Christophe Gisquet
a5bfa66df5 x86: fft: replace call to memcpy by a loop
The function call was a mess to handle, and memcpy cannot make
the assumptions we do in the new code.

Tested on an IMC sample: 430c -> 370c.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-27 12:49:33 +01:00
Mans Rullgard
0595334892 x86: fft: elf64: fix PIC build
In a 64-bit PIC build, external functions must be called
through the PLT.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-25 22:58:18 +01:00
Mans Rullgard
8725da49a2 x86: fft: win64: fix stack alignment for memcpy() call 2012-06-25 15:10:39 +01:00
Mans Rullgard
8299260470 x86: fft: convert sse inline asm to yasm 2012-06-25 13:31:00 +01:00
Ronald S. Bultje
8123e0901f x86: place some inline asm under #if HAVE_INLINE_ASM
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-25 13:23:12 +01:00
Mans Rullgard
0b6f973635 h264: use asm cabac reader under a generic condition
This removes a dependency on implementation details from generic
code and allows easy addition of the equivalent optimisation for
other architectures than x86.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 22:14:21 +01:00
Diego Biurrun
fe07c9c6b5 x86: Only use optimizations with cmov if the CPU supports the instruction 2012-06-23 16:21:50 +02:00
Mans Rullgard
29686d6ea3 x86: remove unused inline asm macros from dsputil_mmx.h
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 14:14:06 +01:00
Mans Rullgard
685f5438bb x86: move some inline asm macros to the only places they are used
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 14:14:06 +01:00
Diego Biurrun
a5a93fa8f5 cosmetics: do not use full path for local headers 2012-06-22 10:49:40 +02:00
Ronald S. Bultje
d9669eab0b dwt: remove variable-length arrays
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-17 23:20:10 +01:00
Justin Ruggles
d5a7229ba4 Add a float DSP framework to libavutil
Move vector_fmul() from DSPContext to AVFloatDSPContext.
2012-06-08 13:14:38 -04:00
Vitor Sessak
bac0729d9e x86: use new schema for ASM macros
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2012-05-29 14:49:45 +02:00
Justin Ruggles
713548cbad x86: lavc: use %if HAVE_AVX guards around AVX functions in yasm code.
This is needed for older versions of yasm/nasm that do not support AVX.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-22 20:46:02 +02:00
Kieran Kunhya
5ff01259a8 Convert vector_fmul range of functions to YASM and add AVX versions
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-05-21 17:13:05 -04:00
Michael Kostylev
6797d1948b x86: rv40: Mark rv40_weight functions as MMX2; they use MMX2 instructions. 2012-05-15 23:54:08 +02:00
Justin Ruggles
95a98ab3f0 ac3dsp: simplify x86 versions of ac3_max_msb_abs_int16
Simplifies the code by using cpuflags and a new macro.
Also fixes the invalid use of the MMX2 pshufw operation in the MMX-only
function.
2012-05-15 15:23:59 -04:00
Vitor Sessak
fcc456b829 x86: use more standard construct for setting ASM functions in FFT code
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-14 15:38:42 +02:00
Michael Kostylev
ea60dfe284 x86: vc1: drop MMX loop filter implementation, which uses MMX2 instructions. 2012-05-12 14:02:45 +02:00
Christophe Gisquet
110d0cdc9d rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC
Code mostly inspired by vp8's MC, however:
- its MMX2 horizontal filter is worse because it can't take advantage of
  the coefficient redundancy
- that same coefficient redundancy allows better code for non-SSSE3 versions

Benchmark (rounded to tens of unit):
        V8x8  H8x8  2D8x8  V16x16  H16x16  2D16x16
C       445    358   985    1785    1559    3280
MMX*    219    271   478     714     929    1443
SSE2    131    158   294     425     515     892
SSSE3   120    122   248     387     390     763

End result is overall around a 15% speedup for SSSE3 version (on 6 sequences);
all loop filter functions now take around 55% of decoding time, while luma MC
dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-10 18:42:43 +02:00
Ronald S. Bultje
bec207f9f9 snowdsp: explicitily state instruction size.
Fixes a compile error with clang at -O0.
2012-05-02 09:57:12 -07:00
Christophe GISQUET
e75d1d4f73 dsputil x86: revert a test back to its previous value
Commit 356ee8d caused the initial inversion.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 11:00:51 -07:00
Christophe Gisquet
fe5ed69dc7 rv34dsp x86: implement MMX2 inverse transform
141 cycles down to 51.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 10:58:47 -07:00
Roland Scheidegger
9b9df1cdff h264: new assembly version of get_cabac for x86_64 with PIC
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 09:43:25 -07:00
Roland Scheidegger
14e9ffc1e4 h264: use one table instead of several for cabac functions
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.

The assembly uses the new table but still won't work for PIC case.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:26:12 -07:00
Roland Scheidegger
444f47b55c h264: (trivial) remove unneeded macro argument in x86/cabac.h
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:24:56 -07:00
Mans Rullgard
2bcbd98459 Remove lowres video decoding
This feature is complex, of questionable utility, and slows down
normal decoding.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:56:19 +01:00
Mans Rullgard
95510be8c3 avcodec: remove AVCodecContext.dsp_mask
This removes all references to AVCodecContext.dsp_mask and marks
it for eviction at the next version bump.  It has been superseded
by av_set_cpu_flag_mask() which, unlike this field, works everywhere.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:30:01 +01:00
Ronald S. Bultje
87a246341b h264: use proper PROLOGUE statement for a function using 8 registers.
Fixes crashes when using biweight on win64.
2012-04-16 08:07:21 -07:00
Ronald S. Bultje
b089ca871a dsputil: fix optimized emu_edge function on Win64.
Recent register allocation changes (x86inc.asm update) changed the
register order and thus opcodes for the inner loops. One of them became
>128bytes, which confuses other parts of this function where it jumps
to fixed-offset positions to extend the edge by fixed amounts. A simple
register change fixes this.
2012-04-13 11:28:30 -07:00
Justin Ruggles
de7f22ab0c ac3dsp: call femms/emms at the end of float_to_fixed24() for 3DNow and SSE
Fixes ac3-encode and eac3-encode FATE test failures with SSE2 disabled.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-12 21:33:04 -07:00
Ronald S. Bultje
76538d7a78 h264: fix 10bit biweight functions after recent x86inc.asm fixes.
This should have been updated in the x86inc.asm update, but was
accidently forgotten.
2012-04-12 21:13:57 -07:00
Diego Biurrun
7bb3a302fe build: Consistently handle conditional compilation for all optimization OBJS. 2012-04-12 09:00:49 +02:00
Henrik Gramner
729f90e268 x86inc improvements for 64-bit
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments

Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-04-11 15:47:00 -04:00
Christophe GISQUET
2130bd8f5b rv40dsp x86: use only one register, for both increment and loop counter
Around 10 cycles faster for luma.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:07:09 -07:00
Christophe GISQUET
272b252c01 rv40dsp: implement prescaled versions for biweight.
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.

The x86 code already used that trick.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:06:48 -07:00