Commit Graph

694 Commits

Author SHA1 Message Date
Diego Biurrun
d20f133ef9 x86: h264_intrapred: port to cpuflag macros 2012-07-05 17:37:10 +02:00
Martin Storsjö
07eeeb1d4f vp8: Add ifdef guards around the sse2 loopfilter in the sse2slow branch too
This was missed in the previous commit, 70a1c800.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-07-05 09:39:01 +03:00
Martin Storsjö
70a1c8000f vp8: loopfilter >=sse2 functions need aligned stack on x86-32.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-07-04 08:25:50 -07:00
Ronald S. Bultje
723b266d72 dsputilenc: group yasm and inline asm function pointer assignment. 2012-07-04 07:46:27 -07:00
Ronald S. Bultje
ceabc13f12 dsputilenc_mmx: split assignment of ff_sse16_sse2 to SSE2 section. 2012-06-30 09:24:52 -07:00
Ronald S. Bultje
66a02159ea x86: fmtconvert: add special asm for float_to_int16_interleave_misc_*
This gets rid of a variable-length array and a for loop in C code.

Signed-off-by: Martin Storsjö <martin@martin.st>
2012-06-30 19:10:36 +03:00
Mans Rullgard
f2fd167835 x86: vc1: fix and enable optimised loop filter
The problem is that the ssse3 psign instruction does the wrong
thing here.  Commit ea60dfe incorrectly removed a macro emulating
this instruction for pre-ssse3 code.  However, the emulation is
incorrect, and the code relies on the behaviour of the macro.
Specifically, the psign sets destination elements to zero where
the corresponding source element is zero, whereas the emulation
only negates destination elements where the source is negative.

Furthermore, the PSIGNW_MMX macro in x86util.asm is totally bogus,
which is why the original VC-1 code had an additional right shift
when using it.  Since the psign instruction cannot be used here,
skip all the macro hell and use the working instruction sequence
directly.

None of this was noticed due to a stray return statement in
ff_vc1dsp_init_mmx() which meant that only the mmx version of the
loop filter was ever used (before being removed in ea60dfe).

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-30 00:12:05 +01:00
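The difference between psign and its emulation, as described in f2fd167835 above, can be modelled one 16-bit lane at a time in scalar C. This is only an illustrative sketch: the function names are hypothetical, it is not the VC-1 loop filter code, and the INT16_MIN edge case is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of the SSSE3 psignw instruction:
 * negate dst where src is negative, zero it where src is zero,
 * and leave it unchanged where src is positive. */
static int16_t psignw_lane(int16_t dst, int16_t src)
{
    if (src < 0)
        return (int16_t)-dst;
    if (src == 0)
        return 0;
    return dst;
}

/* Scalar model of the emulation as the commit message describes it:
 * it only negates where src is negative and never zeroes a lane. */
static int16_t negate_if_negative_lane(int16_t dst, int16_t src)
{
    return src < 0 ? (int16_t)-dst : dst;
}
```

The two models agree for positive and negative src and differ only when src is zero, which is the case the commit message singles out.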
Christophe Gisquet
a5bfa66df5 x86: fft: replace call to memcpy by a loop
The function call was a mess to handle, and memcpy cannot make
the assumptions we do in the new code.

Tested on an IMC sample: 430c -> 370c.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-27 12:49:33 +01:00
Mans Rullgard
0595334892 x86: fft: elf64: fix PIC build
In a 64-bit PIC build, external functions must be called
through the PLT.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-25 22:58:18 +01:00
Mans Rullgard
8725da49a2 x86: fft: win64: fix stack alignment for memcpy() call 2012-06-25 15:10:39 +01:00
Mans Rullgard
8299260470 x86: fft: convert sse inline asm to yasm 2012-06-25 13:31:00 +01:00
Ronald S. Bultje
8123e0901f x86: place some inline asm under #if HAVE_INLINE_ASM
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-25 13:23:12 +01:00
Mans Rullgard
0b6f973635 h264: use asm cabac reader under a generic condition
This removes a dependency on implementation details from generic
code and allows easy addition of the equivalent optimisation for
other architectures than x86.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 22:14:21 +01:00
Diego Biurrun
fe07c9c6b5 x86: Only use optimizations with cmov if the CPU supports the instruction 2012-06-23 16:21:50 +02:00
Mans Rullgard
29686d6ea3 x86: remove unused inline asm macros from dsputil_mmx.h
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 14:14:06 +01:00
Mans Rullgard
685f5438bb x86: move some inline asm macros to the only places they are used
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-23 14:14:06 +01:00
Diego Biurrun
a5a93fa8f5 cosmetics: do not use full path for local headers 2012-06-22 10:49:40 +02:00
Ronald S. Bultje
d9669eab0b dwt: remove variable-length arrays
Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-06-17 23:20:10 +01:00
Justin Ruggles
d5a7229ba4 Add a float DSP framework to libavutil
Move vector_fmul() from DSPContext to AVFloatDSPContext.
2012-06-08 13:14:38 -04:00
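A DSP context of the kind d5a7229ba4 introduces is essentially a struct of function pointers that an init routine fills in. The sketch below shows only that pattern, with hypothetical names (FloatDSP, float_dsp_init, vector_fmul_c); it is not the libavutil declaration, and a real init function would presumably pick optimized versions based on CPU flags.

```c
#include <assert.h>

/* Hypothetical context: one function pointer per DSP routine. */
typedef struct FloatDSP {
    void (*vector_fmul)(float *dst, const float *src0,
                        const float *src1, int len);
} FloatDSP;

/* Plain C fallback: element-wise product of two float vectors. */
static void vector_fmul_c(float *dst, const float *src0,
                          const float *src1, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}

/* Hypothetical init: installs the C version; an optimized build
 * would overwrite the pointer with a SIMD implementation here. */
static void float_dsp_init(FloatDSP *dsp)
{
    dsp->vector_fmul = vector_fmul_c;
}
```

Callers then go through dsp.vector_fmul(...) and never need to know which implementation was selected.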
Vitor Sessak
bac0729d9e x86: use new schema for ASM macros
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2012-05-29 14:49:45 +02:00
Justin Ruggles
713548cbad x86: lavc: use %if HAVE_AVX guards around AVX functions in yasm code.
This is needed for older versions of yasm/nasm that do not support AVX.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-22 20:46:02 +02:00
Kieran Kunhya
5ff01259a8 Convert vector_fmul range of functions to YASM and add AVX versions
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-05-21 17:13:05 -04:00
Michael Kostylev
6797d1948b x86: rv40: Mark rv40_weight functions as MMX2; they use MMX2 instructions. 2012-05-15 23:54:08 +02:00
Justin Ruggles
95a98ab3f0 ac3dsp: simplify x86 versions of ac3_max_msb_abs_int16
Simplifies the code by using cpuflags and a new macro.
Also fixes the invalid use of the MMX2 pshufw operation in the MMX-only
function.
2012-05-15 15:23:59 -04:00
Vitor Sessak
fcc456b829 x86: use more standard construct for setting ASM functions in FFT code
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-14 15:38:42 +02:00
Michael Kostylev
ea60dfe284 x86: vc1: drop MMX loop filter implementation, which uses MMX2 instructions. 2012-05-12 14:02:45 +02:00
Christophe Gisquet
110d0cdc9d rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC
Code mostly inspired by vp8's MC, however:
- its MMX2 horizontal filter is worse because it can't take advantage of
  the coefficient redundancy
- that same coefficient redundancy allows better code for non-SSSE3 versions

Benchmark (rounded to tens of units):
        V8x8  H8x8  2D8x8  V16x16  H16x16  2D16x16
C        445   358    985    1785    1559     3280
MMX*     219   271    478     714     929     1443
SSE2     131   158    294     425     515      892
SSSE3    120   122    248     387     390      763

End result is overall around a 15% speedup for SSSE3 version (on 6 sequences);
all loop filter functions now take around 55% of decoding time, while luma MC
dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-10 18:42:43 +02:00
Ronald S. Bultje
bec207f9f9 snowdsp: explicitly state instruction size.
Fixes a compile error with clang at -O0.
2012-05-02 09:57:12 -07:00
Christophe GISQUET
e75d1d4f73 dsputil x86: revert a test back to its previous value
Commit 356ee8d caused the initial inversion.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 11:00:51 -07:00
Christophe Gisquet
fe5ed69dc7 rv34dsp x86: implement MMX2 inverse transform
141 cycles down to 51.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 10:58:47 -07:00
Roland Scheidegger
9b9df1cdff h264: new assembly version of get_cabac for x86_64 with PIC
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 09:43:25 -07:00
Roland Scheidegger
14e9ffc1e4 h264: use one table instead of several for cabac functions
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.

The assembly uses the new table but still won't work for PIC case.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:26:12 -07:00
Roland Scheidegger
444f47b55c h264: (trivial) remove unneeded macro argument in x86/cabac.h
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-28 08:24:56 -07:00
Mans Rullgard
2bcbd98459 Remove lowres video decoding
This feature is complex, of questionable utility, and slows down
normal decoding.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:56:19 +01:00
Mans Rullgard
95510be8c3 avcodec: remove AVCodecContext.dsp_mask
This removes all references to AVCodecContext.dsp_mask and marks
it for eviction at the next version bump.  It has been superseded
by av_set_cpu_flag_mask() which, unlike this field, works everywhere.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-04-21 18:30:01 +01:00
Ronald S. Bultje
87a246341b h264: use proper PROLOGUE statement for a function using 8 registers.
Fixes crashes when using biweight on win64.
2012-04-16 08:07:21 -07:00
Ronald S. Bultje
b089ca871a dsputil: fix optimized emu_edge function on Win64.
Recent register allocation changes (x86inc.asm update) changed the
register order and thus opcodes for the inner loops. One of them became
>128bytes, which confuses other parts of this function where it jumps
to fixed-offset positions to extend the edge by fixed amounts. A simple
register change fixes this.
2012-04-13 11:28:30 -07:00
Justin Ruggles
de7f22ab0c ac3dsp: call femms/emms at the end of float_to_fixed24() for 3DNow and SSE
Fixes ac3-encode and eac3-encode FATE test failures with SSE2 disabled.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-12 21:33:04 -07:00
Ronald S. Bultje
76538d7a78 h264: fix 10bit biweight functions after recent x86inc.asm fixes.
This should have been updated in the x86inc.asm update, but was
accidentally forgotten.
2012-04-12 21:13:57 -07:00
Diego Biurrun
7bb3a302fe build: Consistently handle conditional compilation for all optimization OBJS. 2012-04-12 09:00:49 +02:00
Henrik Gramner
729f90e268 x86inc improvements for 64-bit
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments

Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
2012-04-11 15:47:00 -04:00
Christophe GISQUET
2130bd8f5b rv40dsp x86: use only one register, for both increment and loop counter
Around 10 cycles faster for luma.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:07:09 -07:00
Christophe GISQUET
272b252c01 rv40dsp: implement prescaled versions for biweight.
Quite often, the original weights are multiples of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.

The x86 code already used that trick.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:06:48 -07:00
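The prescaling idea in 272b252c01 above rests on a simple identity: when a weight w is a multiple of 512, (w * s) >> 9 equals (w >> 9) * s, so the shift can be done once per frame instead of once per pixel. The sketch below assumes a per-pixel term of the form (w * s) >> 9 with non-negative weights; that form and the function names are assumptions made for illustration, not the decoder's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-pixel term: apply the weight, then shift down by 9 bits. */
static int weighted_shifted(int w, uint8_t s)
{
    assert(w >= 0);
    return (w * s) >> 9;
}

/* Same term with the weight already divided by 512 outside the pixel
 * loop. Matches weighted_shifted() only when w is a multiple of 512. */
static int weighted_prescaled(int w_prescaled, uint8_t s)
{
    assert(w_prescaled >= 0);
    return w_prescaled * s;
}
```

For a weight that is not a multiple of 512 the two forms can differ (for example w = 700, s = 3 gives 4 versus 3), which is why the prescaled path only applies to the multiple-of-512 case.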
Christophe GISQUET
6b81da2fd0 dsputil x86: use SSE float instruction instead of SSE2 integer equivalent
All the more required since the users are pure SSE functions.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:24:27 -07:00
Christophe GISQUET
cd88105f6f dsputil x86: remove deprecated parameter from scalarproduct_int16 prototype
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:24:08 -07:00
Christophe GISQUET
f9888520cc vp8dsp x86: perform rounding shift with a single instruction
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-04 11:23:36 -07:00
Ronald S. Bultje
a940198130 cabac: add overread protection to BRANCHLESS_GET_CABAC().
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
2012-03-28 08:01:29 -07:00
Ronald S. Bultje
448dc42571 cabac: increment jump locations by one in callers of BRANCHLESS_GET_CABAC(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
16f6e83f74 cabac: remove unused argument from BRANCHLESS_GET_CABAC_UPDATE(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
951014e5bb cabac: use struct+offset instead of memory operand in BRANCHLESS_GET_CABAC(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
a0bdcb019e h264: add overread protection to get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
95bfa4ead7 h264: reindent get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Ronald S. Bultje
db025929f2 h264: use struct offsets in get_cabac_bypass_sign_x86(). 2012-03-28 08:01:29 -07:00
Diego Biurrun
ad0e31f134 build: prettyprinting cosmetics 2012-03-26 13:00:10 +02:00
Diego Biurrun
62ce9defb8 x86: dsputil: prettyprint gcc inline asm 2012-03-25 11:50:48 +02:00
Diego Biurrun
3b54912113 x86: K&R prettyprinting cosmetics for dsputil_mmx.c 2012-03-25 11:50:48 +02:00
Diego Biurrun
915a2a0a65 x86: conditionally compile H.264 QPEL optimizations 2012-03-25 11:50:45 +02:00
Diego Biurrun
3816642eab dsputil_mmx: Surround QPEL macros by "do { } while (0);" blocks.
This makes them safe to use in non-fully braced if-blocks and similar.
2012-03-25 11:48:37 +02:00
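The reason for the do/while wrapper in 3816642eab above can be shown with a small example. The macro and function here are hypothetical and unrelated to the QPEL macros; in this sketch the caller supplies the final semicolon.

```c
#include <assert.h>

/* Hypothetical multi-statement macro, wrapped so that it expands to
 * exactly one statement. */
#define SWAP_INT(a, b)      \
    do {                    \
        int tmp_ = (a);     \
        (a) = (b);          \
        (b) = tmp_;         \
    } while (0)

/* Without the wrapper, only the first statement of the macro would
 * belong to the unbraced `if`, and the `else` would fail to compile. */
static void order_pair(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INT(*lo, *hi);
    else
        (void)0; /* already ordered, nothing to do */
}
```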
Ronald S. Bultje
71ea26811c aacsbr: handle m_max values smaller than 4.
Prevents a signflip in the counter, and a subsequent crash because of
overreads/overwrites.

Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
2012-03-23 12:56:08 -07:00
Ronald S. Bultje
a928ed3751 vp8: convert mbedge loopfilter x86 assembly to use named arguments. 2012-03-10 11:36:33 -08:00
Ronald S. Bultje
bee330e300 vp8: convert inner loopfilter x86 assembly to use named arguments. 2012-03-10 11:36:33 -08:00
Reimar Döffinger
6eda85e15b sbrdsp.asm: convert all instructions to float/SSE ones.
Since the values are floats, using the float operations
makes sense, improves performance on some CPUs and
makes the code SSE compatible instead of needing SSE2.

Based on suggestion by Jason.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-07 13:50:13 -08:00
Christophe GISQUET
7e1ce6a6ac dsputil: remove shift parameter from scalarproduct_int16
There is only one caller, which does not need the shifting. Other use cases
are situations where different roundings would be needed.

The x86 and neon versions are modified accordingly.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-07 10:29:52 -08:00
Diego Biurrun
1e9d55e45e x86: Remove duplicated AVG_3DNOW_OP / AVG_MMX2_OP macros from h264_qpel_mmx.c. 2012-03-07 09:36:04 +01:00
Reimar Döffinger
b5161908e0 SBR DSP: fix SSE code to not use SSE2 instructions.
movq from SSE register _to_ memory is an SSE2 instruction.
Use the SSE movlps function instead that does the same thing.

Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-03-06 13:40:35 -08:00
Mans Rullgard
356ee8d7de x86: clean up ff_dsputil_init_mmx()
This splits ff_dsputil_init_mmx() into multiple functions, one for
each MMX/SSE level, somewhat simplifying the nested conditions.

Signed-off-by: Mans Rullgard <mans@mansr.com>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-03-05 14:40:03 +01:00
Ronald S. Bultje
b4188f0d46 vp8: convert simple loopfilter x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
8476ca3b4e vp8: convert idct x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
21ffc78fd7 vp8: convert mc x86 assembly to use named arguments. 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
28170f1a39 vp8: convert loopfilter x86 assembly to use cpuflags(). 2012-03-03 20:40:00 -08:00
Ronald S. Bultje
e25be47154 vp8: convert idct/mc x86 assembly to use cpuflags(). 2012-03-03 20:39:59 -08:00
Ronald S. Bultje
291c9b6285 h264: change underread for 10bit QPEL to overread.
This prevents us from reading before the start of the buffer, and thus
prevents crashes resulting from this behaviour. Fixes bug 237.
2012-03-02 10:33:05 -08:00
Ronald S. Bultje
45549339bc vp8: disable mmx functions with sse/sse2 counterparts on x86-64.
x86-64 is guaranteed to have at least SSE2, therefore the MMX/MMX2
functions will never be used in practice.
2012-03-02 10:32:05 -08:00
Ronald S. Bultje
bd66f073fe vp8: change int stride to ptrdiff_t stride.
On 64bit platforms with 32bit int, this means we won't have to sign-
extend the integer anymore.
2012-03-02 10:31:50 -08:00
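The stride change in bd66f073fe above can be illustrated with a small C helper. Because the stride parameter already has pointer width, it can be added to a pointer directly, and negative strides work the same way. The helper name and layout are hypothetical; this is not vp8 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum the first pixel of `rows` consecutive rows.
 * `stride` is ptrdiff_t, so y * stride is computed at pointer width and
 * no sign extension from a 32-bit int is needed on 64-bit targets. */
static int sum_first_column(const uint8_t *src, ptrdiff_t stride, int rows)
{
    int sum = 0;

    assert(rows >= 0);
    for (int y = 0; y < rows; y++)
        sum += src[y * stride];
    return sum;
}
```

Passing a pointer to the last row together with a negative stride walks the same rows bottom-up.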
Ronald S. Bultje
b0c4f04338 h264: fix mmxext chroma deblock to use correct TC values. 2012-02-27 09:38:44 -08:00
Christophe GISQUET
2784d18791 SBR DSP x86: implement SSE sbr_hf_g_filt
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.

Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.

Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:09 -08:00
Christophe GISQUET
34454c761f SBR DSP x86: implement SSE sbr_sum_square_sse
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C  /32bits: 82c (unrolled)/102c
               C  /64bits: 69c (unrolled)/82c
               SSE/32bits: 42c
               SSE/64bits: 31c

Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-23 15:50:06 -08:00
Ronald S. Bultje
3ab9a2a557 rv34: change most "int stride" into "ptrdiff_t stride".
This prevents having to sign-extend on 64-bit systems with 32-bit ints,
such as x86-64. Also fixes crashes on systems where we don't do it and
arguments are not in registers, such as Win64 for all weight functions.
2012-02-20 14:58:25 -08:00
Ronald S. Bultje
8fb26950ed h264: don't use redzone in loopfilter on win64.
Red zone usage is not allowed in the Win64 ABI.
2012-02-19 15:31:03 -08:00
Christophe GISQUET
f3e084909b mpegaudio: replace memcpy by SIMD code
By replacing memcpy with an unrolled loop using the alignment knowledge
it has, some speedup can be obtained.

Before (gcc 4.6.1): ~400 cycles
After: ~370 cycles

Overall, around 2% speed increase when decoding a 2400s mp3 to f32le.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-15 20:11:54 -08:00
Martin Storsjö
efd29844eb mpegvideo: Add ff_ prefix to nonstatic functions
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:07:23 +02:00
Martin Storsjö
873c89e2a6 dsputil: Add ff_ prefix to inv_zigzag_direct16
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:06:42 +02:00
Martin Storsjö
9cf0841ef3 dsputil: Add ff_ prefix to the dsputil*_init* functions
Signed-off-by: Martin Storsjö <martin@martin.st>
2012-02-15 22:06:34 +02:00
Justin Ruggles
d483bb58c3 ac3dsp: do not use pshufb in ac3_extract_exponents_ssse3()
We need to do unsigned saturation in order to cover the corner case when the
absolute coefficient value is 16777215 (the maximum value).

Fixes Bug #216
2012-02-09 21:04:44 -05:00
Diego Biurrun
0bba26466f cosmetics: Delete empty lines at end of file. 2012-02-09 12:26:45 +01:00
Ronald S. Bultje
ce1e250ee9 h264: manually save/restore XMM registers for functions using INIT_MMX.
On Win64, these registers are callee-save, so not saving/restoring them
correctly is a violation of ABI and can lead to crashes or corrupt data.
2012-02-08 10:31:14 -08:00
Ronald S. Bultje
4ff6dea390 pngdsp: swap argument inversion. 2012-02-07 14:32:26 -08:00
Michael Kostylev
3206cccc0e h264: mark h264_idct_add8_10 with number of XMM registers.
This fixes XMM register clobber problems on Win64.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-02-07 11:37:13 -08:00
Ronald S. Bultje
7e4d9d5d45 win64: add a XMM clobber test configure option.
This will be useful to test more aggressively for failures to mark XMM
registers as clobbered in Win64 builds, and prevent regressions thereof.

Based on a patch by Ramiro Polla <ramiro.polla@gmail.com>
2012-02-02 12:00:48 -08:00
Justin Ruggles
236a550c3f Fix a typo in the x86 asm version of ff_vector_clip_int32()
Specifies the correct number of xmm registers used so that they can be saved
and restored on Win64 if necessary.
2012-02-01 19:02:32 -05:00
Christophe Gisquet
e5c9de2ab7 rv40: x86 SIMD for biweight
Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
multiples of 512 (which is often the case when the values round up nicely).

*_TIMER report for the 16x16 and 8x8 cases:
C:
9015 decicycles in 16, 524257 runs, 31 skips
2656 decicycles in 8, 524271 runs, 17 skips
MMX:
4156 decicycles in 16, 262090 runs, 54 skips
1206 decicycles in 8, 262131 runs, 13 skips
MMX on fast-path:
2760 decicycles in 16, 524222 runs, 66 skips
995 decicycles in 8, 524252 runs, 36 skips
SSE2:
2163 decicycles in 16, 262131 runs, 13 skips
832 decicycles in 8, 262137 runs, 7 skips
SSE2 with fast path:
1783 decicycles in 16, 524276 runs, 12 skips
711 decicycles in 8, 524283 runs, 5 skips
SSSE3:
2117 decicycles in 16, 262136 runs, 8 skips
814 decicycles in 8, 262143 runs, 1 skips
SSSE3 with fast path:
1315 decicycles in 16, 524285 runs, 3 skips
578 decicycles in 8, 524286 runs, 2 skips

This means around a 4% speedup for some sequences.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-01-30 23:58:25 +01:00
Diego Biurrun
91bafb52ae x86: Give RV40 init file a more suitable name. 2012-01-30 23:58:24 +01:00
Diego Biurrun
c30b198381 x86: Place mm_flags variable declaration below the appropriate #ifdef.
This fixes some unused variable warnings with YASM disabled.
2012-01-30 23:58:23 +01:00
Christophe Gisquet
6b03900382 x86 dsputil: provide SSE2/SSSE3 versions of bswap_buf
While pshufb allows emulating bswap on XMM registers for SSSE3, more
shuffling is needed for SSE2. Alignment is critical, so specific codepaths
are provided for this case.

For the huffyuv sequence "angels_480-huffyuvcompress.avi":
C (using bswap instruction): ~ 55k cycles
SSE2:                        ~ 40k cycles
SSSE3 using unaligned loads: ~ 35k cycles
SSSE3 using aligned loads:   ~ 30k cycles

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-01-30 10:19:55 +01:00
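For reference, the operation that 6b03900382 above vectorizes is a byte swap of every 32-bit word in a buffer. A plain C version might look like the following; the names bswap32 and bswap_buf_c are chosen for this sketch and the signature is an assumption, not necessarily the dsputil prototype.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Byte-swap n 32-bit words from src into dst, one word at a time.
 * A SIMD version processes several words per iteration instead. */
static void bswap_buf_c(uint32_t *dst, const uint32_t *src, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = bswap32(src[i]);
}
```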
Ronald S. Bultje
af79a0c48a png: add support for bpp>4 to paeth x86 SIMD code.
This fixes playback of e.g. RGB48 (bpp=6) content on x86 CPUs. Fixes
bug 214.
2012-01-29 21:22:50 -08:00
Ronald S. Bultje
f91c4b7824 png: add SSE2 version for add_bytes_l2. 2012-01-29 18:52:17 -08:00
Ronald S. Bultje
59f474b49d png: convert DSP functions to yasm. 2012-01-29 18:47:50 -08:00
Ronald S. Bultje
20a7d3178f png: add missing #if HAVE_SSSE3 around function pointer assignment. 2012-01-29 12:31:59 -08:00
Ronald S. Bultje
331e7c4cb3 imdct36: mark SSE functions as using all 16 XMM registers.
On x86-64, it indeed uses all 16 registers (and on x86-32, this gets
clipped to 8). Not marking it properly causes callers of this function
to fail randomly because of XMM register clobbering.
2012-01-29 08:14:05 -08:00
Ronald S. Bultje
e92003514d png: move DSP functions to their own DSP context. 2012-01-29 08:11:18 -08:00
Ronald S. Bultje
3b15a6d742 config.asm: change %ifdef directives to %if directives.
This allows combining multiple conditionals in a single statement.
2012-01-27 10:19:57 +08:00
Ronald S. Bultje
c3af52fa8b dsputil: use vertical component for drawing bottom edge.
Current code only writes 8 pixels of vertical edge for YUV422, which
causes MC artifacts when subsequent frames use data from that edge.
2012-01-25 18:06:36 +08:00
Christophe GISQUET
9ba9c34024 rv34: 1-pass inter MB reconstruction
Implement 1-pass inverse transform and reconstruction for inter blocks.
2012-01-16 19:26:41 +01:00
Christophe GISQUET
d78062386e rv34: Intra 16x16 handling
Extract processing of intra 16x16 blocks from intra macroblock
processing.
Also implement a function performing inverse transform and block
reconstruction for DC-only blocks in 1 pass instead of 2.
2012-01-16 00:41:51 +01:00
Christophe GISQUET
3faa303a47 rv34: DC-only inverse transform
When decoding coefficients, detect whether the block is DC-only, and take
advantage of this knowledge to perform DC-only inverse transform.

This is achieved by:
- first, changing the 108x4 element modulo_three_table into a 108 element
  table (kind of base4), and accessing each value using mask and shifts.
- then, checking low bits for 0 (as they represent the presence of higher
  frequency coefficients)

Also provide x86 SIMD code for the DC-only inverse transform.

Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
2012-01-12 09:52:33 +01:00
Henrik Gramner
e7d02b04dc fft: init functions with INIT_XMM/YMM.
This is required to handle clobbering of XMM registers on Win64
correctly. Fixes FFT and all tests depending on FFT on Win64.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2012-01-11 20:12:26 +01:00
Vitor Sessak
39df0c434c mpegaudiodec: optimized iMDCT transform
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-01-08 17:40:55 -08:00
Martin Storsjö
676a9ee1d2 x86: Fix constraints for decode_significance*_x86
Originally, prior to 8742a4ff8, the caller code was compiled
within this condition:

ARCH_X86 && HAVE_7REGS && HAVE_EBX_AVAILABLE && !defined(BROKEN_RELOCATIONS)

Since HAVE_7REGS is defined as
(ARCH_X86_64 || (HAVE_EBX_AVAILABLE && HAVE_EBP_AVAILABLE))
the subcondition HAVE_7REGS && HAVE_EBX_AVAILABLE is equal
to HAVE_7REGS (for 32 bit at least). The correct simplification
of the original condition thus is HAVE_7REGS, not
HAVE_EBX_AVAILABLE.

This fixes compilation in some cases where HAVE_EBP_AVAILABLE = 0
and HAVE_EBX_AVAILABLE = 1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2011-12-27 09:05:14 +02:00
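The simplification argued in 676a9ee1d2 above can be checked by brute force over the 32-bit case. The sketch below models the configure macros as booleans; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* HAVE_7REGS as defined in the commit message. */
static bool have_7regs(bool x86_64, bool ebx, bool ebp)
{
    return x86_64 || (ebx && ebp);
}

/* Returns true if HAVE_7REGS && HAVE_EBX_AVAILABLE equals HAVE_7REGS
 * for every 32-bit combination of the two register flags. */
static bool simplification_holds_32bit(void)
{
    for (int ebx = 0; ebx <= 1; ebx++) {
        for (int ebp = 0; ebp <= 1; ebp++) {
            bool r7 = have_7regs(false, ebx != 0, ebp != 0);
            if ((r7 && ebx) != r7)
                return false;
        }
    }
    return true;
}
```

The case the commit fixes is ebx available, ebp unavailable: HAVE_EBX_AVAILABLE is true there but HAVE_7REGS is false, so guarding on HAVE_EBX_AVAILABLE alone enables code that needs a register it cannot have.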
Diego Biurrun
6fdb2ce34a x86: Tighten register constraints for decode_significance*_x86.
On 32-bit OS X with gcc 4.0/4.2 and shared libraries enabled, the ebx register
is not available, but required to assemble the functions.

This reverts commit 8742a4f to a simplified version of the original constraints.
2011-12-21 12:06:37 +01:00
Diego Biurrun
30bbd5cbc0 x86: conditionally compile dnxhd encoder optimizations 2011-12-19 13:54:10 +01:00
Diego Biurrun
88b9735753 build: conditionally compile x86 H.264 chroma optimizations 2011-12-14 11:58:45 +01:00
Martin Storsjö
f1dba9e498 x86: Require 7 registers for the cabac asm
The change in 599b4c6ef didn't turn out to work properly on
i386 on OS X, where it broke building with PIC enabled.

Signed-off-by: Martin Storsjö <martin@martin.st>
2011-12-12 15:36:20 +02:00
Mans Rullgard
599b4c6efd x86: cabac: replace explicit memory references with "m" operands
This replaces the explicit offset(reg) memory references with
"m" operands for the same locations.  As a result, one fewer
register operand is needed for these inline asm statements.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-12-11 22:29:22 +00:00
Diego Biurrun
da9cea77e3 Fix a bunch of common typos. 2011-12-11 00:32:25 +01:00
Justin Ruggles
0e8fdd41c2 dsputil: use cpuflags in x86 emu_edge_core
avoids passing around the extra argument among all the macros it uses
2011-11-22 15:40:51 -05:00
Justin Ruggles
395f2e70dd dsputil: use movups instead of movdqu in ff_emu_edge_core_sse()
This allows emulated_edge_mc_sse() and gmc_sse() to be used under
AV_CPU_FLAG_SSE.
2011-11-22 15:40:51 -05:00
Justin Ruggles
9d06037d48 twinvq: add SSE/AVX optimized sum/difference stereo interleaving 2011-11-11 14:13:58 -05:00
Diego Biurrun
ce33320b30 Remove redundant filename self-references inside files.
Filenames are brittle across renames and add no useful information.
2011-11-08 17:52:56 +01:00
Diego Biurrun
276b995d85 x86: drop pointless ARCH_X86 #ifdef from files in x86 subdirectory 2011-11-08 17:52:55 +01:00
Justin Ruggles
b8f02f5b4e dsputil: use cpuflags in x86 versions of vector_clip_int32() 2011-11-06 20:50:06 -05:00
Ronald S. Bultje
717401aff2 h264_weight: remove duplication functions. 2011-11-05 07:16:30 -07:00
Justin Ruggles
5463e83dbc fmtconvert: fix int32_to_float_fmul_scalar() for windows x86_64
The calling convention only allows 4 non-stack parameters, with each
float or int register being skipped if not used.

fixes Bug 64
2011-11-02 21:44:58 -04:00
Daniel Kang
ded3e9f054 H.264: Cosmetics to dsputil_mmx.c
Add whitespace.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-10-26 06:41:32 -07:00
Ronald S. Bultje
b0b3231074 h264_weight: initialize "height" function argument properly.
Right now it's not actually initialized on 32-bit, leading to crashes
on win32.
2011-10-22 00:23:24 -07:00
Justin Ruggles
aad3429d4e fmtconvert: port float_to_int16_interleave() 2-channel x86 inline asm to yasm 2011-10-21 10:13:05 -04:00
Justin Ruggles
4e8e262476 fmtconvert: port int32_to_float_fmul_scalar() x86 inline asm to yasm 2011-10-21 10:13:05 -04:00
Justin Ruggles
185142a5ea fmtconvert: check compile-time x86 instruction set flags 2011-10-21 10:13:05 -04:00
Justin Ruggles
708ab7dd69 fmtconvert: port float_to_int16() x86 inline asm to yasm 2011-10-21 10:13:05 -04:00
Ronald S. Bultje
c2d337429c H264: change weight/biweight functions to take a height argument.
Neon parts by Mans Rullgard <mans@mansr.com>.
2011-10-21 01:00:45 -07:00
Ronald S. Bultje
229d263cc9 Support for lossless and inter H264 4:2:2. 2011-10-21 01:00:45 -07:00
Baptiste Coudurier
76741b0e56 h264: 4:2:2 intra decoding support
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-10-21 01:00:41 -07:00
Diego Biurrun
265980dabc x86: Move some variable declarations below the appropriate #ifdef.
This avoids some unused variable warnings with YASM disabled.
2011-10-20 16:19:27 +02:00
Diego Biurrun
2cb7c81669 x86: Fix linking of ProRes DSP ASM with YASM disabled. 2011-10-20 16:19:13 +02:00
Ronald S. Bultje
05c8f119cc proresdsp: fix function prototypes.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2011-10-14 21:34:46 +02:00
Ronald S. Bultje
e3f530feca prores: idct sse2/sse4 optimizations.
~3.0-3.5x as fast as original C version, 1.6x as fast overall.
2011-10-11 07:50:48 -07:00
Sean McGovern
c2d3f56107 fft: avoid a signed overflow
As a signed integer, 1<<31 overflows, so force it to unsigned.

Signed-off-by: Alex Converse <alex.converse@gmail.com>
2011-09-23 17:02:58 -07:00
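The overflow fixed in c2d3f56107 above is the classic 1 << 31 on a 32-bit int. A minimal sketch of the unsigned form follows; bit_mask is a hypothetical helper, not the FFT code.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with only bit n set, for 0 <= n <= 31.
 * The shifted operand is unsigned, so n == 31 is well defined;
 * with a plain int, 1 << 31 would overflow a 32-bit signed type. */
static uint32_t bit_mask(unsigned n)
{
    assert(n < 32);
    return (uint32_t)1 << n;
}
```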
Ronald S. Bultje
38e06c2969 Move clipd macros to x86util.asm.
This allows sharing them between multiple .asm files.
2011-08-17 20:56:06 -07:00
Dave Yeo
cc73511e8e Fix NASM include directive
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-08-15 11:24:35 -07:00
Alex Converse
48f7163f13 dsputil_mmx: Honor HAVE_AMD3DNOW 2011-08-15 11:20:08 -07:00
Ronald S. Bultje
b2c087871d Move x86util.asm from libavcodec/ to libavutil/.
This allows using it in swscale also.
2011-08-12 11:43:03 -07:00
Ronald S. Bultje
3a39195b1d Move x86inc.asm to libavutil/.
This allows using it in libswscale/ also.
2011-08-12 11:43:02 -07:00
Kostya Shishkov
d241f51e0f Move RV3/4-specific DSP functions into their own context
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-08-11 16:07:15 -07:00
Vitor Sessak
18b131de04 dct32: Add SSE2 ASM optimizations
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-08-02 10:17:29 -07:00
Jason Garrett-Glaser
a3bf7b864a H.264: tweak some other x86 asm for Atom 2011-07-29 12:24:15 -07:00
Mans Rullgard
3ad1684126 x86: cabac: add operand size suffixes missing from 6c32576
This fixes build with clang.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-28 18:59:23 -07:00
Mans Rullgard
f5f004bc5a x86: cabac: don't load/store context values in asm
Inspection of compiled code shows gcc handles these fine on its own.
Benchmarking also shows no measurable speed difference.

Removing the remaining cases in get_cabac_bypass_sign_x86() does
cause more substantial changes to the compiled code with uncertain
impact.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-28 22:25:21 +01:00
Jason Garrett-Glaser
6c32576548 H.264: optimize CABAC x86 asm for Atom 2011-07-28 13:06:13 -07:00
Mans Rullgard
da4c7cce21 x86: fix build with gcc 4.7
The upcoming gcc 4.7 has more advanced constant propagation,
resulting in some inline asm operands becoming constants and thus
being emitted as literals, sometimes in contexts where this results
in invalid instructions.

This patch changes the constraints of the relevant operands
to "rm" thus forcing a valid type.  While obviously suboptimal,
this is what older gcc versions already did, and there is no
change to the code generated with these.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-26 22:17:43 +01:00
Daniel Kang
406fbd24dc H.264: Add optimizations to predict x86 assembly.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-22 14:54:33 -07:00
Joseph Artsimovich
5ab21439fd dnxhd: 10-bit support
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-21 18:44:40 +01:00
Mans Rullgard
a617c6aaa3 dsputil: update per-arch init funcs for non-h264 high bit depth
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-21 18:10:58 +01:00
Mans Rullgard
874f1a901d dsputil: template get_pixels() for different bit depths
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-21 18:10:58 +01:00
Mans Rullgard
0a72533e98 jfdctint: add 10-bit version
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-21 18:10:58 +01:00
Mans Rullgard
e7a972e113 simple_idct: add 10-bit version
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-20 17:49:48 +01:00
Diego Biurrun
65083b4911 dsputil: remove disabled code 2011-07-18 11:48:35 +02:00
Martin Storsjö
8f62ef0f95 x86: Use LOCAL_ALIGNED in mpegvideo_mmx_template
Signed-off-by: Martin Storsjö <martin@martin.st>
2011-07-18 00:10:45 +03:00
Diego Biurrun
e0ae2174db simple_idct: remove disabled code 2011-07-17 17:32:37 +02:00
Daniel Kang
ac4a85f476 H.264: Add more x86 assembly for 10-bit H.264 predict functions
Mainly ported from 8-bit H.264 predict.

Some code ported from x264. LGPL ok by author.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-13 18:44:51 -07:00
Jason Garrett-Glaser
b5bbc84fe2 H.264: add filter_mb_fast support for >8-bit decoding
Much faster high bit depth deblocking.
2011-07-11 14:58:50 -07:00
Mans Rullgard
710b8df949 dsputil: remove ff_emulated_edge_mc macro used in one place
This macro can cause problems in conjunction with the bitdepth
template expansion.  It was presumably added to keep source
compatibility when high bitdepth support was added.  However,
emulated_edge_mc is a dsputil pointer and should not be called
directly, so there is little reason to keep such a macro.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-07-10 17:55:58 +01:00
Daniel Kang
c0483d0c7a H.264: Add x86 assembly for 10-bit H.264 predict functions
Mainly ported from 8-bit H.264 predict.

Some code ported from x264. LGPL ok by author.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-08 15:59:29 -07:00
Daniel Kang
3c7c16fde3 YASM: Shut up unused variable compiler warning with --disable-yasm.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-07-04 18:49:09 +02:00
Daniel Kang
567a32b5b2 x86_32: Fix build on x86_32 with --disable-yasm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-04 08:47:09 -07:00
Daniel Kang
58f7aad051 Fix build with --disable-yasm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-03 22:56:09 -07:00
Daniel Kang
9bfa5363da H.264: Add x86 assembly for 10-bit H.264 qpel functions.
Mainly ported from 8-bit H.264 qpel.

Some code ported from x264. LGPL ok by author.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-07-03 07:43:38 -07:00
Justin Ruggles
f99a5ef92e ac3dsp: add x86-optimized versions of ac3dsp.extract_exponents(). 2011-07-01 13:02:11 -04:00
Justin Ruggles
6054cd25b4 ac3enc: add int32_t array clipping function to DSPUtil, including x86 versions. 2011-07-01 13:02:11 -04:00
Diego Biurrun
d2ee495fb2 configure: Drop check for availability of ten assembler operands.
This was done to support gcc 2.95, which is an old legacy compiler
that fails to compile the current codebase anyway.
2011-06-28 13:14:37 +02:00
Diego Biurrun
adbfc605f6 doxygen: Consistently use '@' instead of '\' for Doxygen markup.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-06-24 00:37:49 +02:00
Daniel Kang
84e70ef004 h264: Add x86 assembly for 10-bit weight/biweight H.264 functions.
Mainly ported from 8-bit H.264 weight/biweight.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-06-21 15:24:13 +02:00
Mans Rullgard
c5ee740745 x86: cabac: fix register constraints for 32-bit mode
Some operands need to be accessed in byte mode, which restricts the
available registers in 32-bit mode.  Using the 'q' constraint selects
a suitable register.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 23:36:40 +01:00
Mans Rullgard
2143d69bdd cabac: move x86 asm to libavcodec/x86/cabac.h
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:31 +01:00
Mans Rullgard
d075e7d540 x86: h264: cast pointers to intptr_t rather than int
Only the low-order bits are used here so the type is not important,
but this avoids a compiler warning.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:31 +01:00
Mans Rullgard
3a4edb76d6 x86: h264: remove hardcoded edi in decode_significance_8x8_x86()
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:31 +01:00
Mans Rullgard
b92c1a6d26 x86: h264: remove hardcoded esi in decode_significance[_8x8]_x86()
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:31 +01:00
Mans Rullgard
3fc4e36c78 x86: h264: remove hardcoded edx in decode_significance[_8x8]_x86()
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:31 +01:00
Mans Rullgard
e4b5a204aa x86: h264: remove hardcoded eax in decode_significance[_8x8]_x86()
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:30 +01:00
Mans Rullgard
018c33838e x86: cabac: remove hardcoded ebx in inline asm
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:30 +01:00
Mans Rullgard
6b712acc0e x86: cabac: remove hardcoded struct offsets from inline asm
Signed-off-by: Mans Rullgard <mans@mansr.com>
2011-06-20 22:36:30 +01:00
Ronald S. Bultje
ed63f527f2 Fix build if yasm is not available. 2011-06-18 08:34:14 -04:00
Daniel Kang
f188a1e0ca H.264: Add x86 assembly for 10-bit MC Chroma H.264 functions.
Mainly ported from 8-bit H.264 MC Chroma.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-06-18 07:52:19 -04:00
Jason Garrett-Glaser
c90b94424c 4:4:4 H.264 decoding support
Note: this is 4:4:4 from the 2007 spec revision, not the previous (now deprecated) 4:4:4 mode in H.264.
2011-06-13 21:16:30 -07:00
Jason Garrett-Glaser
504811baea Roll back 4:4:4 H.264 for now
Needs some ARM/PPC asm modifications.
2011-06-13 13:38:46 -07:00
Jason Garrett-Glaser
c9c493872c 4:4:4 H.264 decoding support
Note: this is 4:4:4 from the 2007 spec revision, not the previous (now deprecated) 4:4:4 mode in H.264.
2011-06-13 12:21:39 -07:00
Oskar Arvidsson
6c031a3338 h264: Fix 10-bit H.264 x86 chroma v loopfilter asm.
The tc variable was not splatted correctly.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-06-10 14:44:57 -04:00
Daniel Kang
4de83b7b6d H264: x86 predict init cosmetics.
Change indentation and whitespace; also move HAVE_YASM blocks.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-06-08 00:22:52 +02:00
Daniel Kang
a8d44f9dd5 Add x86 assembly for some 10-bit H.264 intra predict functions.
Parts are inspired by the 8-bit H.264 predict code in Libav.
Other parts ported from x264 with relicensing permission from author.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-06-06 01:31:02 +02:00
Loren Merritt
53be7b23e9 Cosmetic changes to h264_idct_10bit.asm.
Removes redundant dword tags and whitespace changes.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-06-02 07:07:15 -07:00
Loren Merritt
994c3550ff 2x faster h264_idct_add8_10.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-06-02 07:07:02 -07:00
Ronald S. Bultje
e6635a9a19 h264: remove CONFIG_GPL from x86 intra prediction code.
The authors permitted relicensing to LGPL a long time ago (Holger,
Loren and Jason).
2011-06-02 07:02:46 -07:00
Daniel Kang
f3aa65af3a h264/10bit: add HAVE_ALIGNED_STACK checks.
Fixes regression in 836f47d34b in ICC-10.x,
since ICC<=11.0 doesn't align stack upon function calls.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-05-31 21:43:20 -07:00
Daniel Kang
348493db60 Update 8-bit H.264 IDCT function names to reflect bit-depth.
Signed-off-by: Ronald S. Bultje <rbultje@google.com>
2011-05-31 15:02:32 -07:00
Daniel Kang
836f47d34b Add IDCT functions for 10-bit H.264.
Ports the majority of IDCT functions for 10-bit H.264.

Parts are inspired by 8-bit IDCT code in Libav; other parts ported from x264 with relicensing permission from author.

Signed-off-by: Ronald S. Bultje <rbultje@google.com>
2011-05-31 15:02:32 -07:00
Justin Ruggles
70bb747a57 ac3dsp: do not use the ff_* prefix when referencing ff_ac3_bap_bits.
This should fix the Windows builds.

Signed-off-by: Martin Storsjö <martin@martin.st>
2011-05-28 22:43:40 +03:00
Justin Ruggles
6ca23db9cc ac3enc: modify mantissa bit counting to keep bap counts for all values of bap
instead of just 0 to 4.

This does all the actual bit counting as a final step.
2011-05-28 12:39:28 -04:00
Diego Biurrun
5e528cffcf x86: Add appropriate ifdefs around certain AVX functions.
nasm versions prior to 2.09 have trouble assembling some of our AVX code.
Protect these sections by preprocessor macros to allow compilation to pass.
2011-05-27 21:18:12 +02:00
Dave Yeo
a10fb79070 x86 asm: Add SECTION_TEXT to dct32_sse.asm.
This fixes the following error on OS/2:
error: segment name `.text align=16' not recognized

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2011-05-23 12:47:53 +02:00
Loren Merritt
422b2362fc dct32_sse: eliminate some spills
125->104 cycles on penryn (x86_64 only)
2011-05-22 19:27:18 +02:00
Vitor Sessak
165c7c420d Fix dct32() compilation with --disable-yasm
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2011-05-22 07:10:19 -04:00
Vitor Sessak
6204feb160 dct32: Add AVX implementation of 32-point DCT 2011-05-21 17:42:26 +02:00