Basic UVM testbench for a Stream Processor

The generic testbenches

From the outside, a oneproc unit has a lot in common with the top level of a stream processor, with its theResource of N oneprocs, each with M separate functions. So the unit-level testbench for oneproc will be nearly identical to the testbench for the stream processor subsystem. Here is a diagram of the generic testbench for a oneproc DUT:

This is mostly a straight-up UVM verification environment. But some elaboration is needed for the sequences, the scoreboard, the checkers in the DUT assertion package, and the dependency of the transactions and the drivers on the data record.

Sequences

Scoreboard

Checkers

Data Records and base transactions and drivers

Sequences

The write and read channels operate independently, so using a virtual sequence at the top level makes sense. Each read is a read_request sequence followed by a read sequence, and each write is a write request followed by one or more write-type sequences, so using hierarchical sequences makes sense too. The first diagram below shows the current collection of top-level virtual sequences, the mid-level constructor sequences, and the bottom-level atomic sequences. The diagram following shows the connections for the actual collection of sequences that make up the most complicated virtual sequence currently in the set.
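The layering described above can be sketched as follows. This is a minimal illustration, not the testbench's actual sequence library: every class name here (chan_item, atomic_seq, pair_seq, top_vseq, print_driver, vseq_test) is hypothetical, and the driver only prints what it receives. It assumes a simulator with UVM available.

```systemverilog
`include "uvm_macros.svh"
import uvm_pkg::*;

// Hypothetical placeholder item: it only carries a label.
class chan_item extends uvm_sequence_item;
  string kind;
  `uvm_object_utils(chan_item)
  function new(string name = "chan_item"); super.new(name); endfunction
endclass

// Bottom level: an atomic sequence that sends exactly one item.
class atomic_seq extends uvm_sequence #(chan_item);
  string kind = "unset";
  `uvm_object_utils(atomic_seq)
  function new(string name = "atomic_seq"); super.new(name); endfunction
  task body();
    chan_item it = chan_item::type_id::create("it");
    start_item(it);
    it.kind = kind;
    finish_item(it);
  endtask
endclass

// Mid level: one request followed by n_data data sequences.
class pair_seq extends uvm_sequence #(chan_item);
  string req_kind, data_kind;
  int unsigned n_data = 1;
  `uvm_object_utils(pair_seq)
  function new(string name = "pair_seq"); super.new(name); endfunction
  task body();
    atomic_seq s = atomic_seq::type_id::create("s");
    s.kind = req_kind;
    s.start(m_sequencer, this);
    repeat (n_data) begin
      s = atomic_seq::type_id::create("s");
      s.kind = data_kind;
      s.start(m_sequencer, this);
    end
  endtask
endclass

// Top level: a virtual sequence running both channels independently.
class top_vseq extends uvm_sequence;
  uvm_sequencer #(chan_item) wr_sqr, rd_sqr;
  `uvm_object_utils(top_vseq)
  function new(string name = "top_vseq"); super.new(name); endfunction
  task body();
    pair_seq w = pair_seq::type_id::create("w");
    pair_seq r = pair_seq::type_id::create("r");
    w.req_kind = "WRITE_REQ"; w.data_kind = "WRITE"; w.n_data = 2;
    r.req_kind = "READ_REQ";  r.data_kind = "READ";
    fork
      w.start(wr_sqr, this);
      r.start(rd_sqr, this);
    join
  endtask
endclass

// Stand-in driver: prints each item and consumes some time.
class print_driver extends uvm_driver #(chan_item);
  `uvm_component_utils(print_driver)
  function new(string name, uvm_component parent); super.new(name, parent); endfunction
  task run_phase(uvm_phase phase);
    forever begin
      seq_item_port.get_next_item(req);
      `uvm_info(get_name(), req.kind, UVM_LOW)
      #10;
      seq_item_port.item_done();
    end
  endtask
endclass

class vseq_test extends uvm_test;
  uvm_sequencer #(chan_item) wr_sqr, rd_sqr;
  print_driver wr_drv, rd_drv;
  `uvm_component_utils(vseq_test)
  function new(string name, uvm_component parent); super.new(name, parent); endfunction
  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
    wr_sqr = new("wr_sqr", this);
    rd_sqr = new("rd_sqr", this);
    wr_drv = print_driver::type_id::create("wr_drv", this);
    rd_drv = print_driver::type_id::create("rd_drv", this);
  endfunction
  function void connect_phase(uvm_phase phase);
    wr_drv.seq_item_port.connect(wr_sqr.seq_item_export);
    rd_drv.seq_item_port.connect(rd_sqr.seq_item_export);
  endfunction
  task run_phase(uvm_phase phase);
    top_vseq v = top_vseq::type_id::create("v");
    phase.raise_objection(this);
    v.wr_sqr = wr_sqr;
    v.rd_sqr = rd_sqr;
    v.start(null);   // a virtual sequence runs on no sequencer of its own
    phase.drop_objection(this);
  endtask
endclass

module vseq_tb;
  initial run_test("vseq_test");
endmodule
```

The fork/join in top_vseq is what lets the write and read channels proceed independently, while pair_seq captures the "request, then one or more data sequences" pattern shared by both channels.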

Now a note about the naming of the tests, sequences, and transactions.
There is a lot about the transactions and sequences here that cannot be randomized: the command and command argument fields, for example. It makes no more sense to pass a random value to the command field than it does to pass a "fly" command to an automobile unequipped for such activity. Likewise, in the actual system, the operation in progress, and whether it is a single operation, one of many in a series, or the final operation in a series, will be known to both the write and read channels. Some of this operational-level information is set by the test in the uvm_config_db, while some is in the command field written to the oneproc unit itself. So there is no notion of an atomic generic "write". That's why even the atomic sequences at the bottom are named with specific functionality implied.
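As a small illustration of this point, the sketch below randomizes only the fields that can sensibly vary and pins the command. The package, function, and class names (hdr_sketch_pkg, make_header, corr_write_hdr) are hypothetical; the enum is copied from dut_pkg so the example stands alone, and the header layout follows the record description in dut_pkg (frameID in byte 0, CMMD with the "final" bit in byte 1, then arg0 and arg1).

```systemverilog
package hdr_sketch_pkg;
  // Copied from dut_pkg so this sketch is self-contained.
  typedef enum bit [6:0] { NOP, CORR, SMOOTH, KS, WINDOW, LFSR16X, TEST, BADCMMD } cmmd_type;

  // Build word 0 of a 32-bit-wide DataRec: {arg1, arg0, {final, CMMD}, frameID}.
  function automatic bit [31:0] make_header(bit [7:0] frameID, cmmd_type cmmd,
                                            bit is_final, bit [7:0] arg0, bit [7:0] arg1);
    return {arg1, arg0, is_final, cmmd, frameID};
  endfunction

  // Header for a CORR write: the command is fixed, never randomized.
  class corr_write_hdr;
    rand bit [7:0] frameID;
    rand bit [7:0] threshold;         // becomes arg1
    rand bit       end_at_threshold;  // becomes arg0[0]
    bit            is_final = 1'b1;   // chosen by the test, not randomized

    function bit [31:0] word0();
      return make_header(frameID, CORR, is_final, {7'b0, end_at_threshold}, threshold);
    endfunction
  endclass
endpackage

module hdr_sketch_tb;
  import hdr_sketch_pkg::*;
  corr_write_hdr h;
  initial begin
    h = new();
    if (!h.randomize()) $error("randomize failed");
    $display("CORR header word0 = %h", h.word0());
  end
endmodule
```

Whatever the random fields turn out to be, bits [14:8] of the header always hold CORR, which is the sense in which a "generic write" does not exist.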

Scoreboard

The predictor part of the scoreboard will require out-of-order compares and will use a reference model written in C. Because the class of problems the stream processors will address includes things like curve fitting, universal approximators, matrix operations, and some signal analysis and generation, and because these already exist in C, there is no reason to rewrite them in behavioral SystemVerilog. Those reference models will come in via the DPI.
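A minimal sketch of an out-of-order compare is shown below. It assumes, purely for illustration, that results can be matched to predictions by frameID (byte 0 of the record); the class and package names are hypothetical, and the DPI import in the comment is an invented signature, not the project's actual reference model interface.

```systemverilog
package oo_sb_sketch_pkg;
  // Mirrors dut_pkg: 16 rows of 32 bits.
  typedef bit [31:0] DataRec [16];

  // In the real testbench the prediction would come from C via the DPI,
  // along the lines of this hypothetical declaration:
  //   import "DPI-C" function void ref_model(input  bit [31:0] din  [16],
  //                                          output bit [31:0] dout [16]);

  class oo_scoreboard;
    DataRec      expected [bit [7:0]];   // pending predictions, keyed by frameID
    int unsigned matches, mismatches, unexpected;

    // Store a predicted record until the matching result shows up.
    function void predict(DataRec rec);
      expected[rec[0][7:0]] = rec;
    endfunction

    // Compare a result against its prediction, in whatever order it arrives.
    function void check(DataRec actual);
      bit [7:0] id = actual[0][7:0];
      if (!expected.exists(id)) begin
        unexpected++;
        return;
      end
      if (expected[id] == actual) matches++;
      else                        mismatches++;
      expected.delete(id);
    endfunction
  endclass
endpackage

module oo_sb_sketch_tb;
  import oo_sb_sketch_pkg::*;
  oo_scoreboard sb;
  DataRec a, b;
  initial begin
    sb = new();
    a = '{default: '0};  a[0] = 32'h0000_8101;   // frameID 1
    b = '{default: '0};  b[0] = 32'h0000_8102;   // frameID 2
    sb.predict(a);
    sb.predict(b);
    sb.check(b);                                  // results arrive out of order
    sb.check(a);
    $display("matches=%0d mismatches=%0d unexpected=%0d",
             sb.matches, sb.mismatches, sb.unexpected);
  end
endmodule
```

Keying on frameID is only one possible choice; if frame IDs can be reused while a frame is still in flight, the key would need to include more than the ID.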

Checkers

In the usual way, functional coverage will be collected from the monitors and protocol sequence checkers.

But based on the information in The generic Device Under Test, if it is true that the external user of the stream processor need not know how many processors are available, whether each of them has all functions available, or even whether they are using external or internal memory, then some checkers will be required within the DUT. At minimum they will need coverage bins for per-processor x per-function, parallel and serial mode verifiers, and sequence checkers on internal control lines to ensure, for example, that a processor actually aborted and didn't just happen to end near the abort time. That's just to give an idea of what's in the assertion package. Full details are provided in DUT HWSP specification and integrated testplan and A specific testbench.
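The per-processor x per-function coverage could look roughly like the sketch below. The names (cov_sketch_pkg, proc_func_cov) and the choice of which commands get bins are assumptions made for illustration; the processor count and command enum are copied from dut_pkg.

```systemverilog
package cov_sketch_pkg;
  // Copied from dut_pkg so this sketch is self-contained.
  typedef enum bit [6:0] { NOP, CORR, SMOOTH, KS, WINDOW, LFSR16X, TEST, BADCMMD } cmmd_type;
  localparam int HWSP_PROCS = 8;

  class proc_func_cov;
    int unsigned proc_id;
    cmmd_type    cmmd;

    covergroup cg;
      // One bin per processor.
      cp_proc : coverpoint proc_id { bins proc[] = {[0:HWSP_PROCS-1]}; }
      // One bin per function of interest (NOP and BADCMMD left out here).
      cp_cmmd : coverpoint cmmd { bins func[] = {CORR, SMOOTH, KS, WINDOW, LFSR16X, TEST}; }
      // Every function observed on every processor.
      x_proc_func : cross cp_proc, cp_cmmd;
    endgroup

    function new();
      cg = new();
    endfunction

    // Called by a checker inside the DUT whenever a processor starts a function.
    function void sample(int unsigned p, cmmd_type c);
      proc_id = p;
      cmmd    = c;
      cg.sample();
    endfunction
  endclass
endpackage

module cov_sketch_tb;
  import cov_sketch_pkg::*;
  proc_func_cov c;
  initial begin
    c = new();
    c.sample(0, CORR);
    c.sample(3, SMOOTH);
    $display("coverage so far: %0.2f%%", c.cg.get_inst_coverage());
  end
endmodule
```

Because the external user cannot see which processor ran which function, the sample call has to come from inside the DUT, which is why this kind of coverage belongs in the assertion package rather than in the bus monitors.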

Data Records and base transactions and drivers

Data records

The content and form of write and read transactions depend on the data record as specified in dut_pkg.sv. A data record, DataRec, is an array of bytes, with the first 4 bytes reserved:

frameID
CMMD
arg0
arg1
byte4
byte5
byte6
byte7
byte8
byte9
... 
byteM-1

The DataRec is whatever it needs to be to service the computation performed by the particular DUT. Data, like polynomial coefficients, neural network weights, genetic algorithm "genes", and matrix elements can be passed directly. Pointers to DMA or streaming channels can also be given in place of data, and the arg0 and arg1 fields are used to indicate meaning. Data records can contain instructions, or even be viewed as instructions with large argument lists. Space for results can be reserved as part of the record.

The geometry of the data record may need to take into consideration the intended target, such as the natural sizes of FPGA SRAM blocks. The simplest view of a data record object is an array of bytes, as shown below on the left. The drivers and monitors of the testbench will need to use the REC_R and REC_C values to determine the write data width of the particular DUT, its natural word size, and number of addresses.

There are two "natural" geometries of a DataRec. The first is the organization according to the computation. If the processing is done on an M-byte stream, then the natural geometry is a byte array. If it is a collection of N triplets of points P, each represented in 10 bits, then the natural geometry may be a two-dimensional array of N 30-bit words (three 10-bit points per word). Those two cases are the conceptual geometry. The second is the databus geometry.

The block below shows both the byte stream and the point triplets in a natural conceptual form, side by side with both an 8-bit and a 32-bit data bus width. The final version of this testbench needs to pick information out of a DUT configuration file to turn transactions into sequences and drivers of the correct widths and lengths.

For an M-byte stream of data:

Conceptual
byte0
byte1
byte2
byte3
byte4
byte5
byte6
byte7
byte8
byte9
... 
byteM-1

8-bit bus
byte0
byte1
byte2
byte3
byte4
byte5
byte6
byte7
byte8
byte9
... 
byteM-1

32-bit bus
byte3 byte2 byte1 byte0 
byte7 byte6 byte5 byte4
... 
byteM-1 byteM-2 byteM-3 byteM-4
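The 32-bit layout above, with byte0 in the least significant byte lane, can be sketched as a packing function. The names (bytepack_sketch_pkg, pack32) are hypothetical, and zero-padding a short final word is an assumption; the source does not say what happens when M is not a multiple of 4.

```systemverilog
package bytepack_sketch_pkg;
  typedef bit [7:0]  byte_q [$];
  typedef bit [31:0] word_q [$];

  // Pack an M-byte stream into 32-bit bus words, byte0 in bits [7:0].
  // Assumption: a short final word is padded with zeros.
  function automatic word_q pack32(byte_q bytes);
    word_q     words;
    bit [31:0] w;
    for (int i = 0; i < bytes.size(); i += 4) begin
      w = '0;
      for (int b = 0; b < 4 && i + b < bytes.size(); b++)
        w[8*b +: 8] = bytes[i + b];
      words.push_back(w);
    end
    return words;
  endfunction
endpackage

module bytepack_sketch_tb;
  import bytepack_sketch_pkg::*;
  byte_q stream;
  word_q words;
  initial begin
    stream = '{8'h00, 8'h01, 8'h02, 8'h03, 8'h04};
    words  = pack32(stream);
    foreach (words[i]) $display("word %0d = %h", i, words[i]);
    // prints: word 0 = 03020100, word 1 = 00000004
  end
endmodule
```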

For a set of N 10-bit triplets:

Conceptual
  word 0, P[2] : P[1] : P[0] 
  word 1, P[2] : P[1] : P[0] 
  word 2, P[2] : P[1] : P[0] 
... 
word N-1, P[2] : P[1] : P[0]

8-bit bus
word   [0]P[0]      [7:0]
word   [0]P[0]000000[9:8]
word   [0]P[1]      [7:0]
word   [0]P[1]000000[9:8]
word   [0]P[2]      [7:0]
word   [0]P[2]000000[9:8]
word   [1]P[0]      [7:0]
word   [1]P[0]000000[9:8]
word   [1]P[1]      [7:0]
word   [1]P[1]000000[9:8]
word   [1]P[2]      [7:0]
word   [1]P[2]000000[9:8]
...
word [N-1]P[0]      [7:0]
word [N-1]P[0]000000[9:8]
word [N-1]P[1]      [7:0]
word [N-1]P[1]000000[9:8]
word [N-1]P[2]      [7:0]
word [N-1]P[2]000000[9:8]

32-bit bus
word     [0]{00,P[2]P[1]P[0]}
word     [1]{00,P[2]P[1]P[0]}
word     [2]{00,P[2]P[1]P[0]}
...
word   [N-1]{00,P[2]P[1]P[0]}
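Both bus layouts for the triplets can be sketched as two small conversion functions. The type and function names (triplet_sketch_pkg, to_word32, to_bytes8) are hypothetical; the bit placement follows the layouts above, with two zero bits padding the 32-bit word and six zero bits padding each high byte on the 8-bit bus.

```systemverilog
package triplet_sketch_pkg;
  typedef bit [9:0] point_t;
  typedef point_t   triplet_t [3];
  typedef bit [7:0] bytes6_t  [6];

  // 32-bit bus: one word per triplet, {2'b00, P[2], P[1], P[0]}.
  function automatic bit [31:0] to_word32(triplet_t p);
    return {2'b00, p[2], p[1], p[0]};
  endfunction

  // 8-bit bus: six bytes per triplet. For each point, the low byte goes
  // first, followed by a byte holding bits [9:8] padded with six zeros.
  function automatic bytes6_t to_bytes8(triplet_t p);
    bytes6_t b;
    foreach (p[i]) begin
      b[2*i]     = p[i][7:0];
      b[2*i + 1] = {6'b0, p[i][9:8]};
    end
    return b;
  endfunction
endpackage

module triplet_sketch_tb;
  import triplet_sketch_pkg::*;
  triplet_t p;
  bytes6_t  b;
  initial begin
    p = '{10'h3FF, 10'h001, 10'h200};   // P[0], P[1], P[2]
    b = to_bytes8(p);
    $display("32-bit bus word: %h", to_word32(p));
    foreach (b[i]) $display("8-bit bus byte %0d: %h", i, b[i]);
  end
endmodule
```

The same conceptual triplet costs one bus cycle on the 32-bit bus and six on the 8-bit bus, which is why the driver needs the geometry rather than a hard-coded width.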

This is the current description of a record in dut_pkg.svh for the HWSP:

////////////////////////////////////////////////////////////////////
//                    What A Record Is
//  A "record" is what is contained in the oneproc's memory. 
//  The content and meaning of a record depend on the process, meaning 
//  that they depend on the particulars of the Stream Processor's 
//  implemented computational unit. 
//
//  A record is abstracted here into the form of a matrix, R x C, or
//  REC_ADDRWIDTH x REC_DATAWIDTH
//  The geometry R x C is: 
//  R is the number of rows of data, (addresses) 
//  C is the number of columns (byte selects),
//  so, bare, it's just a memory. 
//
//  Each column is one byte, so a 4x3 record would be 4 rows of 3 bytes:
//    byte2, byte1, byte0 | addr m+0
//    byte2, byte1, byte0 | addr m+1
//    byte2, byte1, byte0 | addr m+2
//    byte2, byte1, byte0 | addr m+3
//
//  The beginning of a record is column 0, row 0. 
//  The natural word size is based on the number of columns. Above, the
//  natural word size is 3 columns. 3 bytes. 24 bits.
//  
//  While the geometry of a oneproc unit's memory is flexible, all 
//  computational units in a oneproc unit must adhere to one-in-the-same 
//  geometry as specified by REC_DATAWIDTH and REC_ADDRWIDTH.
//  
//  The only incontrovertible fields are the first four bytes, and they are:
//  byte00 - frameID, 256 possible values which are NOT interpreted at all
//           by the stream processor. They are for an external task manager.
//  byte01 - CMMD. There are 127 possible commands. Bit 7 is dedicated as a 
//           "final" bit. This is part of the mechanism for pipelining. When
//            bit 7==0 it serves as an indication to the read channel that
//            the processing is not totally done, and that following the 
//            read of the first 4 bytes of the DataRec the read channel
//            should say thanks and release the oneproc. 
//   byte02 - arg0. See the specific command for details.
//   byte03 - arg1. See the specific command for details.
////////////////////////////////////////////////////////////////////

`ifndef DUT_PKG
`define DUT_PKG

package dut_pkg ;

//----------------------------------------------------------------------------
// This is the *specific* record format for the "Hello World" StreamProc,
// HWSP, pearsons_r computational unit. CMMD == CORR.
// It is a 64 byte record of 16 32-bit words organized as follows:
//     31:24   23:16   15:8     7:0 
//     byte3   byte2   byte1   byte0
//  0   arg1    arg0   CMMD   frameID   when arg0==1 it means end when cc >= arg1
//  1   SL       SS      RL     RS    // S[]Length("32"), S[0]Start("24"), Ref[]Length("11"), Ref[0]Start("12")
//  2   res    status   offset  cc    
//  3    l       l       e       H
//  4    o       w               o  
//  5   NA       d       l       r
//  6   S[3]    S[2]    S[1]    S[0]
//  7   S[7]    S[6]    S[5]    S[4]
//  8  S[11]   S[10]    S[9]    S[8]
//  9  S[15]   S[14]   S[13]   S[12]
// 10  S[19]   S[18]   S[17]   S[16]
// 11  S[23]   S[22]   S[21]   S[20]
// 12  S[27]   S[26]   S[25]   S[24] 
// 13  S[31]   S[30]   S[29]   S[28] 
// 14  unused  unused  unused  unused 
// 15  unused  unused  unused  unused 
//
// CORR : with arg0 == 0, arg1 is ignored
// CORR : with arg0 == 1 the end-at-threshold is enabled, and arg1 is 
//        the threshold value for pearson's r.
//        Normally device will try all 21 possible correlations by
//        using the 21 different possible offsets, but with end-at-threshold
//        enabled it will stop the first time the cc comes back >= arg1.
`define HWLENGTH 11
// The define "SLENGTH 32" is NOT part of the DUT. It's the TB's decision. See the TBdefines



//----------------------------------------------------------------------------
// This is the *specific* record format for the "Hello World" StreamProc,
// HWSP, smoothing computational unit. CMMD == SMOOTH.
// It is a 64 byte record of 16 32-bit words organized as follows:
//     31:24   23:16   15:8     7:0
//     byte3   byte2   byte1   byte0
//  0   arg1    arg0   CMMD   frameID  // 
//  1   res    status    DL      DS    // D[]Length, D[0]Start
//  2   S[3]    S[2]    S[1]    S[0]
//  3   S[7]    S[6]    S[5]    S[4]
//  4  S[11]   S[10]    S[9]    S[8]
//  ...
//  X  S[N]   S[N-1]   S[N-2]   S[N-3]  (for example)
//  15 ...
// 
// 
// For the array S, S[0] to S[N-1], where
// S[0]   is located at offset (byte4), and
// S[N-1] is located at offset (byte4) + (byte5) - 1,
// SMOOTH overwrites values by averaging:
//
//     S0 = S[0] ;   // keep the original S[0] for the wrap-around case
//     for (i=0; i < N-1; i++)
//       S[i] = (S[i] + S[i+1])/2;
//     S[N-1] = (arg0[0]) ? (S[N-1] + S0)/2 : S[N-1] ;
//
// The value in byte5 indicates N, the length of the array.
// The value of byte4 indicates the offset from 0 where the array begins.
// Note that the array cannot begin at an offset less than 8 without
//     overwriting the header information and reserved bytes.
//     This puts a lower limit of 'hC for DS/byte4.
// Note also that with a 64 byte memory, the absolute maximum number for
//     length, for the value of N, is 63-12 = 51.
// If arg0[0] == 0, leave the value of S[N-1] untouched.
// If arg0[0] == 1, S[N-1] is averaged with the original value of S[0].
// A value of DL > 63-12 yields unconsidered writing behaviour. 



//----------------------------------------------------------------------------
// This is the *specific* record format for the "Hello World" StreamProc,
// HWSP, test computational unit, CMMD == KS.
// It is a 64 byte record of 16 32-bit words organized as follows:
//     31:24   23:16    15:8     7:0 
//     byte3   byte2   byte1   byte0
//  0   arg1    arg0    CMMD  frameID    arg0 and arg1 not yet specified
//  1   duration     duration          SA0      Source Address
//  1   SA3     SA2     SA1     SA0      Source Address
//  2   DA3     DA2     DA1     DA0      Destination Address
//  3  len3    len2    len1    len0      Length
//  4  unused  unused  unused  unused 
// ...
// 15  unused  unused  unused  unused 
// 
// KS, the Karplus-Strong plucked-string algorithm is not yet implemented.

//----------------------------------------------------------------------------
// This is the *specific* record format for the "Hello World" StreamProc,
// HWSP, test computational unit, CMMD == WINDOW.
// It is a 64 byte record of 16 32-bit words organized as follows:
//     31:24   23:16   15:8     7:0 
//     byte3   byte2   byte1   byte0
//  0   arg1    arg0   CMMD   frameID    arg0 and arg1 not yet specified
//  1   SA3     SA2     SA1     SA0      Source Address
//  2   DA3     DA2     DA1     DA0      Destination Address
//  3  len3    len2    len1    len0      Length
//  4  unused  unused  unused  unused 
// ...
// 15  unused  unused  unused  unused 
// 
// The windowing functions for DFT are not yet implemented

//----------------------------------------------------------------------------
// This is the *specific* record format for the "Hello World" StreamProc,
// HWSP, test computational unit, CMMD == TEST.
// It is a 64 byte record of 16 32-bit words organized as follows:
//     31:24   23:16   15:8     7:0 
//     byte3   byte2   byte1   byte0
//  0   arg1    arg0   CMMD   frameID        arg0 and arg1 not yet specified
//  1  rslt3   rslt2  rslt1   rslt0     for initial DMA tests, details coming later
//  2  <     address pointer 1    >     for initial DMA tests, details coming later
//  3  <     address pointer 2    >     for initial DMA tests, details coming later
//  4  unused  unused  unused  unused 
// ...
// 15  unused  unused  unused  unused 
// 
// TEST is not yet implemented


`define RECORDSIZE 64

// The defines below specify a 32-bit write data width and 4-bit write address
// configuration.
// One will note that with REC_ADDRWIDTH and RECORDSIZE that REC_ROWS could be 
// determined where needed. HOWEVER, rather than spread that little calculation 
// around to every block that needs the non-given values, I'll just put them both
//  here.
`define REC_DATAWIDTH 32
`define REC_ADDRWIDTH  4
`define REC_ROWS 16

// This is a placeholder for loading all oneprocs in parallel or serial.
//  It is not connected to any functioning yet.
`define LOADMODE 1

// This is for the verification environment. The number of computational units
// should be of no concern to the RTL integrator of the stream processor *as a
// unit*. 
`define HWSP_PROCS 8

typedef      bit[`REC_DATAWIDTH-1:0] DataRec [`REC_ROWS]  ;
typedef enum bit[6:0] { NOP, CORR, SMOOTH, KS, WINDOW, LFSR16X, TEST, BADCMMD } cmmd_type ;
// Keep pushing BADCMMD to the right.

endpackage

`endif
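Since the package states RECORDSIZE, REC_DATAWIDTH, REC_ADDRWIDTH, and REC_ROWS independently rather than deriving one from the others, a small consistency check can catch a geometry edit that updates only some of them. The sketch below copies the values into localparams so it stands alone; the package and module names are hypothetical.

```systemverilog
package geom_sketch_pkg;
  // Values copied from dut_pkg.
  localparam int RECORDSIZE    = 64;  // bytes in a record
  localparam int REC_DATAWIDTH = 32;  // bits per row
  localparam int REC_ADDRWIDTH = 4;   // address bits
  localparam int REC_ROWS      = 16;  // rows in a record

  // Quantities a driver or monitor would derive from the geometry.
  localparam int BYTES_PER_ROW = REC_DATAWIDTH / 8;
  localparam int DERIVED_ROWS  = RECORDSIZE / BYTES_PER_ROW;
endpackage

module geom_check;
  import geom_sketch_pkg::*;
  initial begin
    // The stated row count must agree with the one implied by size and width.
    assert (DERIVED_ROWS == REC_ROWS)
      else $error("REC_ROWS=%0d but RECORDSIZE/width gives %0d", REC_ROWS, DERIVED_ROWS);
    // The address bus must be wide enough to reach every row.
    assert ((1 << REC_ADDRWIDTH) >= REC_ROWS)
      else $error("REC_ADDRWIDTH=%0d cannot address %0d rows", REC_ADDRWIDTH, REC_ROWS);
    $display("geometry: %0d rows x %0d bytes", REC_ROWS, BYTES_PER_ROW);
  end
endmodule
```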

Base transactions

When we go looking for a base transaction, we find that there is very little we can count on besides those first 4 bytes: only that it must have a frameID, a command, and two argument fields, arg0 and arg1. So the base transaction is pretty boring (the same is true for read channel transactions):

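// Assumes dut_pkg (DataRec and the geometry defines) and the
// bus_access_type typedef are already compiled and visible here.
import uvm_pkg::*;
`include "uvm_macros.svh"
import dut_pkg::*;

class Base_txn extends uvm_sequence_item;

  // Tag for the driver, RECORDSIZE bits wide: one bit per DataRec byte
  bit       [`RECORDSIZE-1:0] DataRecAddressTag ;
  rand  bit [`REC_ADDRWIDTH-1:0] addr; // 64 bytes, as 16 32-bit words
  randc bit [7:0] frameID;             // cycles through all IDs before repeating
  rand      DataRec data ;             // the whole record, header bytes included
  rand      bus_access_type b_a_t;     // WRITE_REQ, WRITE, READ_REQ or READ

  function new(string name = "");
    super.new(name);
  endfunction: new

  `uvm_object_utils_begin(Base_txn)
    `uvm_field_int(DataRecAddressTag, UVM_ALL_ON)
    `uvm_field_int(addr, UVM_ALL_ON)
    `uvm_field_int(frameID, UVM_ALL_ON)
    `uvm_field_sarray_int(data, UVM_ALL_ON)
    `uvm_field_enum(bus_access_type,  b_a_t, UVM_ALL_ON)
  `uvm_object_utils_end

endclass: Base_txn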

Where the bus_access_type is:

typedef enum bit[2:0] { WRITE_REQ, WRITE, READ_REQ, READ } bus_access_type ;

So what this means for the testbench is that the transactions have to be built with knowledge of the dut_pkg, which is mostly a normal thing; but in order to make the drivers and monitors reusable they will have to be able to pick up geometry information from the dut_pkg file as well. It would be nice not to rewrite the driver or monitor just because the bus width changes.

The Driver

In addition to the 4 basic bus access types, what I want at the conceptual level of a transaction is to specify what gets written, and what gets read, in the most natural way. Here, the natural way is to mimic the behavior of a task scheduler that would be connected to the stream processor. Here is a small table of the possibilities:

High/transaction level	Sequence level	Driver level
Write
Write an entire data frame, which implies that whatever size the DataRec is, it's all getting written.	Based on config info from the test and the sequence type, fill out a write transaction and create a sequence that will write the entire data frame.	Choose the right strategy for writing only the portions of the data frame requested to be written, and write them according to the geometry of the DUT as specified in the dut_pkg.vh file.
Write or overwrite only a portion of a data frame.	Based on config info from the test and the sequence type, fill out a write transaction and create a sequence that will write/update only the required values.	(Same driver strategy as for a full write.)
Perform a null write, updating no portion of the data frame. This starts the same computation over again. Note that in the case of a computation being performed on external data, the data operated on can be different each time.	Create a sequence that will perform a write access with no actual writes.	(Same driver strategy as for a full write.)
Read
Read an entire data frame, which implies that whatever size the DataRec is, it all gets read out.	Based on config info from the test and the status of the read channel, create a sequence that will read the entire data frame.	Choose the right strategy for reading only the portions of the data frame requested to be read, and read them according to the geometry of the DUT as specified in the dut_pkg.vh file.
Read only a portion of a DataRec.	Based on config info from the test and the status of the read channel, create a sequence that will read only the required values.	(Same driver strategy as for a full read.)
Perform a null read, just acknowledge the read channel and free it up.	Based on config info from the test and the status of the read channel, create a sequence that will initiate only an acknowledge.	(Same driver strategy as for a full read.)

This testbench has an unusual requirement when it comes to dividing knowledge among the Test, Sequence, and Driver levels: how much should each know about the others? There are three approaches here. All require the test to set up config information, but then one could:

1. Make a lot of specialized sequences that use specialized drivers (full data records, partial data records, no data record updates), meaning that every new conceptual type of (trans)action would require a new sequence/driver pair,
2. Make a lot of specialized sequences that use one driver with a lot of specialized capability. This is like (1) above, but it means an ever-growing number of sequences and one driver with an ever-growing control structure like this:
   else if (seq_type == this or that) begin ... , or
3. Make one sequence that figures out from the test config how to "tag" the transaction, and make one driver that figures out from test config info and the transaction itself both how and what to drive.

I implemented the last one. This means that the test-to-driver flow looks like this:

1. The test sets up the config based on the dut_pkg.
2. Sequences pick up that information.
3. That information is used to create tagged transactions. Part of the transaction is for the DUT, part for the driver.
4. The drivers use information from the config and from the tag part of the transaction to determine what and how to drive.

For example:
It is desired to update only arg0 and arg1, leaving all other fields unchanged. So,

An entire DataRec transaction will be created.

The sequence will tag only bytes 2 and 3 of the transaction.

The driver, assuming that it is a 32-bit bus, will only write the first word and then end the write cycle.

This means that the user needs to maintain the frameID and CMMD fields as well, since they will get rewritten. That is not a problem: a user who knows enough to know that only arg0 and arg1 change also knows the CMMD and the frameID associated with the frame.
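The driver's side of this example can be sketched as a function that turns the tag into the set of rows to write. It assumes that bit i of the tag marks byte i of the DataRec, which is a guess at the tag encoding rather than something the source states; the names (tag_sketch_pkg, rows_to_write) are hypothetical and the geometry values are copied from dut_pkg.

```systemverilog
package tag_sketch_pkg;
  // Values copied from dut_pkg.
  localparam int RECORDSIZE    = 64;  // bytes, so also the tag width
  localparam int REC_DATAWIDTH = 32;
  localparam int REC_ROWS      = 16;
  localparam int BYTES_PER_ROW = REC_DATAWIDTH / 8;

  // Assumption: tag bit i set means "byte i of the DataRec must be written".
  // A row has to be driven if any of its bytes is tagged.
  function automatic bit [REC_ROWS-1:0] rows_to_write(bit [RECORDSIZE-1:0] tag);
    bit [REC_ROWS-1:0] rows = '0;
    for (int r = 0; r < REC_ROWS; r++)
      rows[r] = |tag[r*BYTES_PER_ROW +: BYTES_PER_ROW];
    return rows;
  endfunction
endpackage

module tag_sketch_tb;
  import tag_sketch_pkg::*;
  bit [RECORDSIZE-1:0] tag;
  initial begin
    tag    = '0;
    tag[2] = 1'b1;   // arg0
    tag[3] = 1'b1;   // arg1
    $display("rows to write: %b", rows_to_write(tag));
    // prints: rows to write: 0000000000000001
  end
endmodule
```

With a 32-bit bus, tagging bytes 2 and 3 selects row 0 only, so frameID and CMMD ride along in the same word. An all-zero tag selects no rows, which corresponds to the null write in the table above.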

Perhaps it is time to look at a specific DUT.

This is a work in progress