Lightsoft: AltiVec Overview

Document History
30th Sept 98 - General update. Additions to streams.

29th Sept 98 - Update to MSR information (yet again!).

28th Sept 98 - Update to MSR information.

27th Sept 98 - Original Version

An overview of Altivec on the PowerPC

by Stuart Ball and Robert Probin.

What is Altivec?

AltiVec is a technology that accelerates software by extending the PowerPC instruction set. It is implemented as an independent execution unit, much as the Floating Point Unit (FPU) is, so AltiVec instructions can be dispatched in parallel (during the same clock cycle) with other instructions. AltiVec's prime mission is to speed up the processing of large amounts of data.

For example, graphics processing requires the processor to take input data, modify it in some way and then output it as quickly as possible. Let's take a very simple case of increasing the brightness of an 8 bit greyscale picture by roughly 10%. We'll keep our sample simple and say that each pixel is an unsigned byte. Thus the brightness can vary between 0 and 255.

Our non AltiVec algorithm looks like this:

1. Read the byte

2. Add 25 to the byte

3. Did the maths overflow?

4. If yes, correct (make the byte 255)

5. Store the byte

6. Continue until all done

Five steps are required to modify one byte.

(Actually, a far quicker non-AltiVec way to do it is with a look up table, which avoids the add and required overflow check)
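As a rough sketch, the scalar steps above might look like this in C. The function names brighten_scalar and build_brighten_table are ours, chosen purely for illustration; the second function shows the look up table idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Steps 1 to 6 above: add 25 to each pixel, clamping at 255. */
void brighten_scalar(uint8_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        unsigned sum = pixels[i] + 25u;              /* widen so overflow is visible */
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* Look up table variant: precompute all 256 possible results once,
   after which each pixel costs one table read and no compare. */
void build_brighten_table(uint8_t table[256])
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned sum = v + 25u;
        table[v] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

With the table built, the inner loop reduces to pixels[i] = table[pixels[i]].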

AltiVec can do sixteen (not just one) pixels in three instructions and no compares:

1. Read 16 bytes into a vector register

2. Unsigned add 25 to all the bytes in the register with saturation enabled.

3. Store the 16 bytes

4. Continue until all done

Three steps are required to modify sixteen bytes.
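To make the difference concrete, here is a plain C model of the saturating add in step 2. It is only a model: the type vec_u8x16 and the function add_saturate_u8x16 are illustrative names of our own, and the loop stands in for work that the vector unit performs on all sixteen bytes with one instruction.

```c
#include <stdint.h>

/* A plain C stand-in for one 128 bit vector register holding 16 bytes. */
typedef struct { uint8_t lane[16]; } vec_u8x16;

/* Models an unsigned byte add with saturation across all 16 lanes.
   No lane can affect its neighbour, and no compare is visible to the caller. */
vec_u8x16 add_saturate_u8x16(vec_u8x16 a, vec_u8x16 b)
{
    vec_u8x16 r;
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a.lane[i] + b.lane[i];
        r.lane[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
    return r;
}
```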

And don't forget, normal integer and floating point processing can continue whilst the AltiVec unit is processing sixteen pixels. And even better, AltiVec can utilize data bus bandwidth that might otherwise be idle!

Altivec versus MMX?

We mention MMX(tm) <http://developer.intel.com/drg/mmx/Manuals/overview/index.htm> because it has (allegedly) similar capabilities. It turns out that AltiVec is superior on several counts, but this isn't (or shouldn't be) just because of speed. If AltiVec is not included on every G4, then its usage will be limited to niche markets. If, however, it is available on every G4 Mac, then the potential applications are limitless.

MMX(tm), although better than nothing, has several current problems: it is built on a legacy platform, is a limited implementation, and is extremely difficult for compilers to use effectively. It also shares (and so kills) the FPU registers, which prohibits using the FPU alongside MMX instructions.

Altivec however is built to a full specification; indeed, it is a superset of MMX. It allows all execution units to dispatch and be used simultaneously assuming a well scheduled instruction stream. It provides 32 new registers and many more instructions than MMX's basic 57. It also provides some very specific "media processing" and digital signal processing instructions that MMX lacks.

Conclusion: MMX does not compare well to AltiVec.

Is it important?

Rather than just "multimedia extensions", the possible uses are many:-

multiple software modems

graphic (pixel) manipulation - texture mapping etc.

matrix multiplications - 3d etc.

video decoding (mpeg)

sound manipulation (both production and recognition)

General acceleration of standard software functions - array modification for example.

But more so than this - we do it a dis-service by listing specific applications, since the real power in Altivec is that it can be applied to nearly any problem.

It is really a paradigm shift, and one that, once made, can be leveraged into the next generation of computing speed.

What will make it successful?

We ask this question, because it is not always obvious with other competing technologies such as MMX.

Take the case of floating point. Originally contained in separate units, it now forms a key part of all modern microprocessors.

Altivec will need to do several things:-

(a) be present in all new PowerPC microprocessors (and hence must be cheap enough)

(b) be highly orthogonal

(c) be easy to use, both from assembler and from high-level languages. Although this is a document about assembler usage, it would be foolish not to recognize that much of the world's software is written in languages such as C. Currently, it is impossible to efficiently use MMX extensions in C. Apple have addressed this with AltiVec by including specific vector extensions in MrC, allowing the use of vector processing in C programs.

(d) have the same sort of optimisation rules as the rest of the processor architecture.

Intel custom wrote MMX filters for Photoshop in assembler in order to compete with the speed figures of the basic PowerPC. This is not the sign of a generic processing unit, but of a redundant technology that nobody will adopt properly!

So, what is AltiVec?

AltiVec is a vector unit "add-on" to the PowerPC architecture.

Definition: A vector processor is a processor that can operate on entire vectors with one instruction. Normal processors are termed "scalar" because they operate on scalar quantities - individual, discrete data items.

To simplify somewhat; a lot of high performance computer programs need to process a lot of data in the same way. In processor terms this means a relatively few instructions executed many times over many sets of data.

A vector processor aids this by providing specific instructions (vector instructions) which deal with multiple sets of data. Naturally, there is an acronym for this; "SIMD" or single instruction, multiple data stream. Another name for a vector processor is "Array Processor."

In order to realistically achieve this on a microprocessor, it is actually necessary to reduce what is possible with classic vector machines <http://ei.cs.vt.edu/~history/Parallel.html>, which have very versatile processing capabilities - Altivec limits the types and sizes of data that can be held in the registers, and the manipulation that can be performed upon that data.

This should not be seen as a problem, however, for the whole microprocessor revolution has been on the back of such simplification of processing.

A vector processor at the simplest level is a processor that is able to work on fixed sizes of data in parallel. For example 16 bytes, eight 16 bit halves, four 32 bit words or four 32 bit floating singles.

Each vector register is 128 bits in size, and the data held in a vector register isn't interpreted as a single 128 bit quantity, but rather n items where n is derived from the size of each vector - if each item is byte sized, then the register is actually holding 16 individual items. Maths performed on these vectors behaves as if the register was actually n registers. In the case of byte sized data it's equivalent to performing 16 operations on 16 byte sized operands all at the same time. The operation on byte 0 does not affect byte 1 as it might if we placed 4 bytes into an integer register (a 32 bit number) and added 4 bytes (another 32 bit number) to it.
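One way to picture this is as a C union that views the same 128 bits at each item size. This is a sketch only: vec128_view and lanes_per_register are names we have made up, and it assumes a C float is a 32 bit single, which holds on common platforms.

```c
#include <stdint.h>

/* The same 16 bytes of storage, viewed at each supported item size. */
typedef union {
    uint8_t  bytes[16];   /* sixteen 8 bit items  */
    uint16_t halves[8];   /* eight 16 bit items   */
    uint32_t words[4];    /* four 32 bit items    */
    float    singles[4];  /* four 32 bit floats (assumes a 32 bit float) */
} vec128_view;

/* n, the number of items in a register, for a given item size in bytes. */
int lanes_per_register(int item_bytes)
{
    return 16 / item_bytes;
}
```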

Example examining the lower two bytes of a vector register as byte sized items in modulo mode (see below for info on modulo mode):

The vector register contains 0xffff

To this we add 0x0101 (held in another vector register).

The result in this case is 0x0000

We have performed two byte sized additions, allowing the result to "roll over" at the same time. In a traditional integer unit the result would be 0x0100 (it would really be 0x10100 but we're only looking at the lower two bytes).
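The same comparison can be written out in C. The two functions below are illustrative (the names are ours): the first treats a 16 bit value as two independent byte lanes, the second adds it as one ordinary integer.

```c
#include <stdint.h>

/* Two independent byte lanes packed in a 16 bit value, each added modulo 256.
   A carry out of the low lane is discarded rather than passed to the high lane. */
uint16_t add_two_byte_lanes(uint16_t a, uint16_t b)
{
    uint8_t lo = (uint8_t)((a & 0xFFu) + (b & 0xFFu));
    uint8_t hi = (uint8_t)((a >> 8) + (b >> 8));
    return (uint16_t)((hi << 8) | lo);
}

/* Ordinary integer add: the carry from the low byte reaches the high byte. */
uint16_t add_as_one_integer(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);   /* keeps only the lower two bytes */
}
```

Called with 0xffff and 0x0101, the first returns 0x0000 and the second returns 0x0100, matching the example above.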

A vector processor can be used in general purpose programming. Consider:

FOR N = 1 TO 16
A(N) = B(N) + C(N)
NEXT N

If A, B and C are all byte sized arrays, then this loop can be replaced with a single vector instruction (ignoring the required load and stores).

Architecture Overview

The AltiVec unit contains thirty two 128 bit vector registers identified as v0 through v31. Data is represented in vector registers as either integer (byte, half, word size) or single sized (32 bit) floating point data.

A vector Status and Control Register (VSCR) contains two bits:

Bit 15: The Non-Java (NJ) floating point mode bit. When clear, which is the default, floating point operations follow the Java/IEEE rules; when set, a simplified non-Java mode is used.

Bit 31: A sticky "saturation" bit which is set, and remains set until cleared, if the result of a vector operation saturates (see below).

Two instructions are used to access this register: mfvscr copies the VSCR into a vector register, and mtvscr copies a vector register into the VSCR.

VRSAVE is a user accessible register that MUST be maintained by your software. It indicates which vector registers are currently in use. The leftmost bit of this register indicates v0 is in use, the rightmost bit indicates v31 is in use. Immediately before using a register you should set the relevant bit in this register and clear it when you are finished using the vector register. See our assembly language programmers guide for more details.
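The bit bookkeeping can be sketched in C as follows. This models only the mask arithmetic, with helper names of our own invention; reading and writing the real VRSAVE register is not shown here.

```c
#include <stdint.h>

/* Mask for vector register n (0..31). v0 is the leftmost (most
   significant) bit and v31 the rightmost, as described above. */
uint32_t vrsave_mask(unsigned n)
{
    return UINT32_C(0x80000000) >> n;
}

/* Value to store in VRSAVE just before starting to use register vn. */
uint32_t vrsave_mark_in_use(uint32_t vrsave, unsigned n)
{
    return vrsave | vrsave_mask(n);
}

/* Value to store in VRSAVE once register vn is no longer needed. */
uint32_t vrsave_mark_free(uint32_t vrsave, unsigned n)
{
    return vrsave & ~vrsave_mask(n);
}
```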

MSR (Updated): A new bit, MSR[VEC], is defined in the Machine State Register. It is used by the System Software to enable and disable all vector units, much as the MSR[FP] bit is used for floating-point units. It does not indicate that vector capabilities are present in the processor -- that is determined by the Nanokernel strictly by accessing the Processor Version Register (PVR) and looking up the capabilities of the processor in a table.

Using Gestalt is the recommended way of determining whether AltiVec is available as this will return true if the emulator OR hardware is available.
A new selector "ppcf" will return the following bits set (if the Gestalt selector "ppcf" is present):

Bit 0 = has fres, frsqrte, and fsel instructions
Bit 1 = has stfiwx instruction
Bit 2 = has fsqrt and fsqrts instructions
Bit 3 = has dcba instruction
Bit 4 = has vector instructions
Bit 5 = has dst, dstt, dstst, dss, and dssall instructions
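Decoding the response might look like the sketch below. It assumes that bit 0 is the least significant bit of the 32 bit response, which you should check against the Gestalt documentation; the PPCF_ names and ppcf_has_feature are our own, and the Gestalt call that obtains the response is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers from the list above (illustrative names). */
enum {
    PPCF_BIT_FRES_FRSQRTE_FSEL = 0,
    PPCF_BIT_STFIWX            = 1,
    PPCF_BIT_FSQRT             = 2,
    PPCF_BIT_DCBA              = 3,
    PPCF_BIT_VECTOR            = 4,
    PPCF_BIT_DATA_STREAMS      = 5
};

/* Assumption: bits are counted from the least significant end. */
bool ppcf_has_feature(uint32_t response, int bit)
{
    return ((response >> bit) & 1u) != 0;
}
```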

Alignment: AltiVec data must be 16 byte aligned. If your vector data is not aligned, you will lose data!
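The reason for the warning, as we understand it, is that the vector load and store instructions ignore the low four bits of the address, so a misaligned access silently lands on the enclosing aligned 16 byte block. The helpers below (names are ours) model that address arithmetic in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr sits on a 16 byte boundary. */
bool is_vector_aligned(uintptr_t addr)
{
    return (addr & 15u) == 0;
}

/* The address a vector load or store would actually use, on the
   assumption that the low four address bits are ignored. */
uintptr_t vector_access_address(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

A simple defence is to check is_vector_aligned on every buffer before handing it to vector code.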

Instruction Set Overview

The Altivec instruction set can be broken down as follows:

Vector integer instructions - These include computational, compare, logical, rotate, and shift instructions.
Vector integer arithmetic instructions
Vector integer compare instructions (cr6 is the cr field used if the record form is used).
Vector integer logical instructions
Vector integer rotate and shift instructions

Vector floating-point arithmetic instructions - These include floating-point arithmetic instructions defined by the User Instruction Set Architecture (UISA).
Vector floating-point arithmetic instructions
Vector floating-point multiply/add instructions
Vector floating-point rounding and conversion instructions
Vector floating-point compare instruction
Vector floating-point estimate instructions

Vector load and store instructions - These include load and store instructions for vector registers.

Vector permutation and formatting instructions - These include:
Vector pack instructions
Vector unpack instructions
Vector merge instructions
Vector splat instructions
Vector permute instructions
Vector select instructions
Vector shift instructions

Processor control instructions - These instructions are used to read from and write to the AltiVec status and control register (VSCR).

Memory control instructions - These instructions are used for managing caches (user level and supervisor level).

Most of the above are fairly self explanatory. The pack/merge/shift/select instructions are used to re-order or modify the location of individual items within a vector register. These instructions have many uses including conversion from one data format to another, and accelerating various vector maths and graphics operations.

For example (from the Programming Environments Manual): One special purpose form of Vector Pack Pixel (vpkpx) instruction is provided that packs eight 32-bit (8/8/8/8) pixels from two concatenated source operands into a single result of eight 16-bit 1/5/5/5 aRGB pixels. The least significant bit of the first 8-bit element becomes the 1-bit a field, and each of the three 8-bit R, G, and B fields are reduced to 5 bits by discarding the 3 lsbs.
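For a single pixel, the packing described above can be modelled in C as below. This is a sketch, not the instruction itself: pack_pixel_1555 is our own name, and it assumes the 32 bit pixel is held as 0xAARRGGBB, with the first 8-bit element in the top byte. The real instruction does this for eight pixels at once.

```c
#include <stdint.h>

/* Pack one 8/8/8/8 pixel (assumed layout 0xAARRGGBB) into 1/5/5/5. */
uint16_t pack_pixel_1555(uint32_t p)
{
    uint16_t a = (uint16_t)((p >> 24) & 0x01u);  /* lsb of the first element */
    uint16_t r = (uint16_t)((p >> 19) & 0x1Fu);  /* top 5 bits of R */
    uint16_t g = (uint16_t)((p >> 11) & 0x1Fu);  /* top 5 bits of G */
    uint16_t b = (uint16_t)((p >> 3)  & 0x1Fu);  /* top 5 bits of B */
    return (uint16_t)((a << 15) | (r << 10) | (g << 5) | b);
}
```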

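The vector splat instructions replicate a single value into every element of a vector register, ready for use in a later vector operation such as an add. The splat-immediate forms only accept a small signed 5 bit value (-16 to 15); a larger value, such as 123 in all 16 bytes of v0, has to be splatted from an element already held in a vector register.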
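The effect of a byte splat is easy to model in C. The type splat_vec and the function splat_u8 are illustrative names only, and the sketch shows the result of a splat, not how any particular splat instruction obtains its source value.

```c
#include <stdint.h>

/* A plain C stand-in for a vector register of 16 bytes. */
typedef struct { uint8_t lane[16]; } splat_vec;

/* Copy one byte value into all 16 lanes. */
splat_vec splat_u8(uint8_t value)
{
    splat_vec v;
    for (int i = 0; i < 16; i++) {
        v.lane[i] = value;
    }
    return v;
}
```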

Saturation

Most integer instructions have both signed and unsigned versions and many have both modulo and saturating "clamping" modes.

Saturation clamps a result that would otherwise overflow to the "ceiling" or "floor" of its range. For example, with byte sized data, adding 1 to 255 when unsigned saturation is in effect will result in 255. Subtracting 1 from 0 will result in 0.

Modulo operation would allow the result to wrap-around. For example, if 1 is added to a byte sized 255, then the result is 0.
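The two modes can be compared side by side with a few C helpers for a single byte sized item. The function names are ours, and each function models what one lane of a vector operation would produce.

```c
#include <stdint.h>

/* Modulo add: the result wraps around. */
uint8_t add_u8_modulo(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturating add: clamp at the ceiling of 255. */
uint8_t add_u8_saturate(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Unsigned saturating subtract: clamp at the floor of 0. */
uint8_t sub_u8_saturate(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Signed saturating add: clamp to the range -128..127. */
int8_t add_s8_saturate(int8_t a, int8_t b)
{
    int sum = a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}
```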

Modulo, saturation and signed/unsigned operation are specified in the instruction, i.e. by the mnemonic you use. See Lightsoft's Introduction to programming Altivec in assembler for more details.

Data Streams

These instructions give the processor a hint that a block of data at a certain address is about to be read or written. The processor may then take steps to ensure the data is easily accessed, for example by bringing it into cache. The specification allows for up to four streams (0 to 3) to be defined at any one time.

There are three basic instructions:

dst - Data Stream Touch (reading data from memory)
dstst - Data Stream Touch for STore (writing to memory)
These take an address and a definition of the stream. The definition defines how large the data set is and how it's made up. Unlike a traditional Vector Processor which may have just one machine specific stream definition register, AltiVec can use any general purpose register.

There are variations on these instructions (formed by appending a "t" to the mnemonic) that allow the program to specify that a stream is transient, that is, likely to be read from or written to only once or twice, or only infrequently. This helps the memory system decide how best to deal with the stream.

These instructions are best used in small steps. For example, if you are processing a large block of data, rather than trying to touch the whole lot in one go, it's advised that small sections are touched just prior to being needed.

dss - Data Stream Stop
Stop a given stream. A variation, dssall, stops all data streams which can be quite handy if a program is not sure which streams are currently active.
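As an illustration of a stream definition, the sketch below builds the 32 bit value placed in the general purpose register. The field layout is an assumption based on our reading of the Programming Environments Manual and should be checked there: a 5 bit block size in 16 byte vectors, an 8 bit block count, and a signed 16 bit stride in bytes, packed from the top of the word downwards. The function name make_stream_control is ours.

```c
#include <stdint.h>

/* Build a stream definition word.
   Assumed layout (verify against the manual):
     bits 24..28  block size, in 16 byte vectors
     bits 16..23  number of blocks
     bits  0..15  signed stride between blocks, in bytes */
uint32_t make_stream_control(unsigned block_size_vectors,
                             unsigned block_count,
                             int stride_bytes)
{
    return ((uint32_t)(block_size_vectors & 0x1Fu) << 24)
         | ((uint32_t)(block_count & 0xFFu) << 16)
         | ((uint32_t)stride_bytes & 0xFFFFu);
}
```

In keeping with the advice above, a small definition such as make_stream_control(2, 4, 32) touches only 128 bytes at a time.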

Conclusion

The AltiVec unit will be a very useful addition to the PowerPC architecture. If you are interested in programming it with Fantasm, you may want to try Lightsoft's Introduction to programming Altivec in assembler.

AltiVec is a trademark of Motorola. MMX is a trademark of Intel. All other trademarks acknowledged.