What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and each y_i are of length 10k or so?

- Shove the y's in a matrix and use an optimized `s/dgemv` routine?
- Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).

I'm just looking for general guidance here, so any suggestions will be useful.

And yes, I do need the performance. Thanks for any light.


## 1:


I think GPUs are specifically designed to perform operations like this quickly (among others).


So you could probably make use of DirectX or OpenGL libraries to perform the vector operations.


For example, `D3DXVec2Dot`. This will also save you CPU time.


## 2:


Alternatives for optimised BLAS routines:

## 3:

Handcoding an SSE2 solution is not very difficult and will bring a nice speedup over a pure C routine.

How much it gains over a BLAS routine is something you will have to measure yourself. The greatest speedup comes from structuring the data in a format that lets you exploit data parallelism and alignment.
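As a rough illustration of what such a hand-coded routine might look like, here is a minimal sketch of a double-precision dot product using SSE2 intrinsics. The function name `dot_sse2` is made up for this example, and it assumes an x86 compiler that provides `<emmintrin.h>`. It uses unaligned loads so it works on any pointer; if your arrays are 16-byte aligned you could try `_mm_load_pd` instead and measure whether it helps.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: dot product of two double arrays of length n. */
double dot_sse2(const double *x, const double *y, size_t n)
{
    /* Two independent accumulators so consecutive adds do not
       depend on each other; each holds two partial sums. */
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    size_t i = 0;

    /* Process four doubles per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128d p0 = _mm_mul_pd(_mm_loadu_pd(x + i),     _mm_loadu_pd(y + i));
        __m128d p1 = _mm_mul_pd(_mm_loadu_pd(x + i + 2), _mm_loadu_pd(y + i + 2));
        acc0 = _mm_add_pd(acc0, p0);
        acc1 = _mm_add_pd(acc1, p1);
    }

    /* Horizontal sum by storing the lanes; avoids SSE3's haddpd. */
    double lanes[2];
    _mm_storeu_pd(lanes, _mm_add_pd(acc0, acc1));
    double sum = lanes[0] + lanes[1];

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

Note that the summation order differs from a plain loop, so results can differ in the last bits for general floating-point input.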


## 4:

I use GotoBLAS.

It provides high-performance kernel routines.

It is many times better than MKL and plain BLAS.
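For reference, this is what the matrix formulation from the question computes, written as a plain, unoptimized C loop. The function name `dot_all_rows` and the row-major layout are assumptions made for this sketch. With a library that exposes the CBLAS interface, the whole function would typically be replaced by the single call shown in the comment.

```c
#include <stddef.h>

/*
 * Hypothetical reference routine: out[i] = dot(y_i, x), where the m
 * vectors y_i (each of length n) are stored as rows of a row-major
 * matrix Y.
 *
 * Assuming a CBLAS header is available, the equivalent call is:
 *   cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
 *               1.0, Y, n, x, 1, 0.0, out, 1);
 */
void dot_all_rows(const double *Y, const double *x, double *out,
                  size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        const double *row = Y + i * n;  /* start of vector y_i */
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += row[j] * x[j];
        out[i] = sum;
    }
}
```

Timing this loop against the library call on your own data is a simple way to see how much the optimized routine actually buys you.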


## 5:

The following provides BLAS level 1 (vector operations) routines using SSE: http://www.applied-mathematics.net/miniSSEL1BLAS/miniSSEL1BLAS.html

If you have an nVidia graphics card you can get cuBLAS, which will perform the operation on the graphics card: http://developer.nvidia.com/cublas

For ATI (AMD) graphics cards: http://developer.amd.com/libraries/appmathlibs/pages/default.aspx
