Unlocking the Power of the PLAYSTATION 3

The Cell Broadband Engine™ (Cell) is a microprocessor jointly developed by SONY, Toshiba, and IBM. It is a multi-core chip containing a 64-bit Power processor core and multiple independent SIMD processor cores. This architecture makes real-time multimedia processing possible, so the chip can be used not only in next-generation electronics and computer entertainment systems, but also in fields such as defence, communications, science and medical systems. The Cell BE is currently used in the PLAYSTATION®3, IBM and Mercury blade servers, and Mercury and Fixstars PCIe boards.

An inherent problem with the Cell is its complexity. With its 9 cores, DMA engines and 128-bit vector registers, there is quite a learning curve to climb before a programmer can leverage the power of the Cell. This is where the Cell BE Execution Framework (CEF) comes in.

The CEF works with the IBM Cell BE Software Development Kit on any hardware that utilises a Cell microprocessor, including the SONY PLAYSTATION3 (PS3). By utilising the CEF, a programmer can shorten the steep learning curve needed to understand the complexities of the chip, and focus on porting or writing an algorithm that takes advantage of the enormous power of the Cell processor.

This article is intended to explain the details of the CEF, so that the programmer can understand how their algorithm can be implemented.


Cell BE Execution Framework Data Flow
 


The Cell BE is a complex processor with 9 cores on a single die: one PPE (Power Processing Element) and 8 SPEs (Synergistic Processing Elements). The diagram above shows how data flows through the CEF, from the PPE to one of the eight SPEs. The CEF can be configured to execute the algorithm on several SPEs simultaneously. Note that the Cell chip in the PLAYSTATION3 makes only 6 SPEs available to applications, whereas a standard Cell has 8.
 
The PPE is the control centre: it passes data to the SPEs and gathers the processed data back. This parallel processing of data is at the heart of the Cell's processing power. Data is moved between the PPE and the SPEs by DMA (Direct Memory Access) transfers. Note that each SPE is driven by its own thread, so these transfers proceed in parallel. The PPE creates one thread per SPE; within that thread the SPE DMAs data into its local memory, processes it (in userAlgo.c) and then DMAs the results back to the PPE.

CEF Code Snippets 

Let's first look at the structures defining the PPE and SPE buffers.

cef/core/ppe/cef.h defines PPE input and output buffers as follows: 
typedef struct
{
    vector float    buf[2][PPE_DATA_WINDOW];
} ppeBufIn;

typedef struct
{
    vector float    buf[2][PPE_DATA_WINDOW];
} ppeBufOut;


cef/core/spe/cef_spu.h defines SPE input and output buffers as follows:
typedef struct
{
    volatile vector float   inbufs[2][SPE_DATA_WINDOW];
    volatile vector float   outbufs[2][SPE_DATA_WINDOW];

    /* Context */
    volatile vector float * pInputDMASource;
    volatile vector float * pNextInputDMASource;
    volatile vector float * pOutputDMASource;
    volatile vector float * pNextOutputDMASource;

} SPE_DATA;

 
cef/core/ppe/cef.h also defines one more important data structure, called context:
typedef union
{
    struct
    {
        int     nData;           /* number of data to process per thread */
        vector float *  pBufIn;  /* address of PPE input buffer in global
                                    address space */
        vector float *  pBufOut; /* address of PPE output buffer in global
                                    address space */
        int mode;                /* to pass mode info to SPE */
        /* generic parameters from command line */
        double p1;
        double p2;
        double p3;
        double p4;
        double p5;
        double p6;
        double p7;
        double p8;
    } p;
    int pad[32];
} context;

 

The context passes important information to the SPE: the PPE I/O buffer addresses, the number of data elements the SPE should process, verbose/benchmark mode information, generic command-line parameters, and so on. The SPE DMAs the context in first and stores it locally. This is essential because the SPE needs the PPE buffer addresses to set up its DMA transfers. The context is also used to pass the generic parameters given on the command line to the user's algorithm. The context is a union so that the pad member fixes its size at 128 bytes (32 four-byte ints), because the DMA transfer size must be an integer multiple of 16 bytes.
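The padding argument can be checked on an ordinary host compiler. The sketch below mirrors the context union under two assumptions: `float *` stands in for `vector float *` so that it builds without the Cell SDK, and `int` is 4 bytes. The names `context_sketch`, `is_dma_multiple_of_16()` and `context_dma_size()` are illustrative and not part of the CEF.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side mirror of the CEF context. ASSUMPTION: float * replaces
   vector float * so the sketch compiles without the Cell SDK. */
typedef union
{
    struct
    {
        int     nData;
        float * pBufIn;
        float * pBufOut;
        int     mode;
        double  p1, p2, p3, p4, p5, p6, p7, p8;
    } p;
    int pad[32];   /* 32 ints * 4 bytes = 128 bytes */
} context_sketch;

/* Returns 1 when nbytes is a non-zero multiple of 16, else 0. */
static int is_dma_multiple_of_16(size_t nbytes)
{
    return nbytes > 0 && nbytes % 16 == 0;
}

/* Size of the context as it would be moved in a single DMA transfer. */
static size_t context_dma_size(void)
{
    /* The payload must fit inside the padding, otherwise the union
       would grow beyond the intended fixed size. */
    assert(sizeof(((context_sketch *)0)->p) <=
           sizeof(((context_sketch *)0)->pad));
    return sizeof(context_sketch);
}
```

Because the union is as large as its largest member, adding or removing a parameter in `p` does not change the transfer size as long as the payload stays within the padding.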

As mentioned above, one thread is created per SPE, and each thread is given its own context so that each SPE knows which portion of the PPE_DATA_WINDOW it needs to process.
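One way to picture the per-SPE split is an even division of the data window. The following is a minimal sketch, not the CEF's actual partitioning code: `spe_slice` and `slice_for_spe()` are hypothetical names, and the window is assumed to divide exactly by the number of SPEs.

```c
#include <assert.h>

/* Portion of the PPE data window assigned to one SPE (hypothetical type). */
typedef struct
{
    int offset;   /* index of the first element for this SPE */
    int nData;    /* number of elements this SPE processes */
} spe_slice;

/* Split `window` elements evenly over `nSpes` SPEs.
   ASSUMPTION: window is an exact multiple of nSpes. */
static spe_slice slice_for_spe(int window, int nSpes, int speIndex)
{
    spe_slice s;

    assert(nSpes > 0 && window % nSpes == 0);
    assert(speIndex >= 0 && speIndex < nSpes);

    s.nData  = window / nSpes;
    s.offset = speIndex * s.nData;
    return s;
}
```

With a window of 1536 elements and the 6 SPEs available on a PS3, each SPE would receive 256 elements, and the SPE with index 2 would start at element 512.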

In Depth

Now let's have a peek into cef.c and get an idea of how it works. cef.c contains the code that creates the contexts, creates the threads and distributes the data across the SPEs. spe_context_create() creates an SPE context for each thread. We then initialise the context parameters and call pthread_create() to create a thread that will drive an SPE. The SPE program starts executing when the thread calls spe_context_run().
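The thread-per-SPE pattern can be sketched with plain pthreads on any host. This is not the code in cef.c: the thread body does the work itself where the real framework would call spe_context_run(), and `thread_arg`, `spe_thread()`, `run_on_all_spes()`, `run_demo_sum()` and the `WINDOW` size are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_SPES 6                 /* SPEs usable on a PS3 */
#define WINDOW   (NUM_SPES * 4)    /* illustrative data window size */

/* Per-thread arguments; a stand-in for the CEF context (hypothetical). */
typedef struct
{
    const float *in;
    float       *out;
    int          nData;
} thread_arg;

/* Thread body. The real framework would call spe_context_run() here;
   this host-only stand-in simply doubles every element of its slice. */
static void *spe_thread(void *argp)
{
    thread_arg *a = (thread_arg *)argp;
    int i;

    for (i = 0; i < a->nData; i++)
        a->out[i] = 2.0f * a->in[i];
    return NULL;
}

/* Create one thread per SPE, then wait for all of them to finish.
   Returns 0 on success, -1 if a thread could not be created or joined. */
static int run_on_all_spes(const float *in, float *out)
{
    pthread_t  tid[NUM_SPES];
    thread_arg arg[NUM_SPES];
    const int  per = WINDOW / NUM_SPES;
    int        i, started = 0, rc = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < NUM_SPES; i++) {
        arg[i].in    = in  + i * per;
        arg[i].out   = out + i * per;
        arg[i].nData = per;
        if (pthread_create(&tid[i], NULL, spe_thread, &arg[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++)
        if (pthread_join(tid[i], NULL) != 0)
            rc = -1;
    return rc;
}

/* Usage example: processes 0..WINDOW-1 and returns the sum of the results. */
static float run_demo_sum(void)
{
    float in[WINDOW], out[WINDOW], sum = 0.0f;
    int   i;

    for (i = 0; i < WINDOW; i++)
        in[i] = (float)i;
    if (run_on_all_spes(in, out) != 0)
        return -1.0f;
    for (i = 0; i < WINDOW; i++)
        sum += out[i];
    return sum;
}
```

Each thread only touches its own slice of the input and output arrays, so no locking is needed; the join loop plays the role of the PPE gathering the processed data back.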

cef_spu.c contains the code for the DMA transfers from the PPE to the SPE and back. We use the composite intrinsic spu_mfcdma32(): with MFC_GET_CMD as its last argument it DMAs data into the SPE, and with MFC_PUT_CMD as its last argument it DMAs the results back to the PPE. To maximise performance and cut down the time spent waiting for these DMAs to finish, we use double buffering: the SPE processes the data that has already arrived while the next set of data is being transferred, which improves throughput.
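The buffer rotation behind double buffering can be sketched on a host as follows. This is a simplified model rather than the code in cef_spu.c: the hypothetical helpers `dma_get()` and `dma_put()` use a synchronous memcpy where the SPE would issue spu_mfcdma32() with MFC_GET_CMD or MFC_PUT_CMD, so the sketch shows only the order of operations, not the overlap in time. `CHUNK`, `process_chunk()` and the demo function are also assumptions.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4   /* elements per buffer; stands in for SPE_DATA_WINDOW */

/* Stand-ins for the DMA calls (hypothetical). On an SPE these would be
   asynchronous spu_mfcdma32() transfers; here they are plain memcpy. */
static void dma_get(float *local, const float *remote, int n)
{
    memcpy(local, remote, (size_t)n * sizeof(float));
}

static void dma_put(const float *local, float *remote, int n)
{
    memcpy(remote, local, (size_t)n * sizeof(float));
}

/* Placeholder for the user's algorithm: adds 1 to every element. */
static void process_chunk(const float *in, float *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] + 1.0f;
}

/* Process nChunks chunks of CHUNK floats using two alternating buffers. */
static void double_buffered_run(const float *src, float *dst, int nChunks)
{
    float inbufs[2][CHUNK];
    float outbufs[2][CHUNK];
    int   cur = 0, c;

    assert(nChunks > 0);

    dma_get(inbufs[cur], src, CHUNK);          /* prime the first buffer */
    for (c = 0; c < nChunks; c++) {
        int next = cur ^ 1;

        /* Start fetching chunk c+1 into the other buffer. */
        if (c + 1 < nChunks)
            dma_get(inbufs[next], src + (c + 1) * CHUNK, CHUNK);

        /* With real asynchronous DMA, the SPE would wait here until the
           transfer into inbufs[cur] and any earlier put from
           outbufs[cur] have completed. */
        process_chunk(inbufs[cur], outbufs[cur], CHUNK);
        dma_put(outbufs[cur], dst + c * CHUNK, CHUNK);

        cur = next;                            /* swap buffers */
    }
}

/* Usage example: runs 3 chunks over 0..11 and returns the output sum. */
static float double_buffer_demo_sum(void)
{
    float src[3 * CHUNK], dst[3 * CHUNK], sum = 0.0f;
    int   i;

    for (i = 0; i < 3 * CHUNK; i++)
        src[i] = (float)i;
    double_buffered_run(src, dst, 3);
    for (i = 0; i < 3 * CHUNK; i++)
        sum += dst[i];
    return sum;
}
```

The two-element `inbufs` and `outbufs` arrays correspond to the `[2]` dimension in the SPE_DATA structure shown earlier: one buffer is being processed while the other is in flight.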

This is where userAlgo.c comes into the picture. It contains the implementation of a function called process_buffer(). The CEF provides the code that creates the threads and DMAs data in and out of the SPE, so the data is ready for the user's algorithm to process in whatever way is required. Users can focus on implementing their algorithm and are spared the other complexities. However, the user still has to understand how to utilise the 128-bit vector registers to get the maximum processing power out of the SPE. For this, see the Cell SDK.
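To give a feel for what a user algorithm might look like, here is a host-only sketch. The real signature of process_buffer() is not shown in this article, so the signature below is an assumption, as are the names `vec4f`, `process_buffer_sketch()` and `process_buffer_demo()`. The `vec4f` struct stands in for a 128-bit `vector float`; on an SPE the inner loop over the four lanes would typically be a single SIMD operation such as spu_mul().

```c
#include <assert.h>

/* Stand-in for a 128-bit vector float: four packed floats (hypothetical). */
typedef struct
{
    float f[4];
} vec4f;

/* Scale every element of the input buffer by `gain`.
   ASSUMPTION: this signature is illustrative, not the CEF's own. */
static void process_buffer_sketch(const vec4f *in, vec4f *out,
                                  int nVec, float gain)
{
    int i, lane;

    assert(nVec >= 0);
    for (i = 0; i < nVec; i++)
        for (lane = 0; lane < 4; lane++)   /* one SIMD multiply on an SPE */
            out[i].f[lane] = gain * in[i].f[lane];
}

/* Usage example: halves two vectors and returns the last output lane. */
static float process_buffer_demo(void)
{
    const vec4f in[2] = { {{1.0f, 2.0f, 3.0f, 4.0f}},
                          {{5.0f, 6.0f, 7.0f, 8.0f}} };
    vec4f out[2];

    process_buffer_sketch(in, out, 2, 0.5f);
    return out[1].f[3];
}
```

A gain such as the one used here is the kind of value that could arrive through one of the generic parameters p1 to p8 in the context.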

The CEF can be downloaded from sourceforge.net or, if you are registered, from the downloads section of this website: cef-0.1.0.tar.bz2.