LOW LATENCY MATRIX MULTIPLICATION UNIT
Patent Abstract:
Methods, systems, and apparatus for a matrix multiplication unit implemented as a systolic array are disclosed. Each cell of the array of cells includes: a weight matrix register configured to receive a weight input from either a transposed or an untransposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; an untransposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiplication unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register by a vector data input to obtain a multiplication result.
Publication number: BR112019023395B1
Application number: R112019023395-4
Filing date: 2018-05-17
Publication date: 2021-08-17
Inventors: Andrew Everett Phelps; Norman Paul Jouppi
Applicant: Google Llc
Patent Description:
BACKGROUND
[001] This specification relates to performing neural network computation in hardware.
[002] Neural networks are machine learning models that employ one or more layers of models to generate an output, for example, a classification, for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, that is, the next hidden layer or the output layer of the network. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
SUMMARY
[003] This specification describes technologies relating to special-purpose hardware circuits that train neural networks, compute neural network inferences, or both, and specifically to special-purpose hardware circuits that decrease latency through a matrix multiplication unit by increasing the rate at which weight values are loaded into weight matrix registers within the matrix multiplication unit.
[004] A systolic array is wired to perform matrix multiplications and typically has a uniform structure throughout the array. A matrix multiplication unit of a systolic array is composed of multiply-add subunits, each of which takes an input operand, multiplies the operand by a stored weight to obtain a result, and adds the result to a partial sum to produce a new partial sum.
[005] One way to decrease latency is to increase the rate at which weights are loaded into the multiply-add units.
[006] In general, one innovative aspect of the subject matter described in this specification can be embodied in a special-purpose hardware circuit that trains neural networks, computes neural network inferences, or both.
[007] Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. For a system of one or more computers, being configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs, being configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[008] The foregoing and other embodiments can each include one or more of the following features, alone or in combination. In particular, one embodiment includes all the following features in combination.
[009] A matrix multiplication unit can be implemented as a systolic array of cells. Each cell of the array of cells may include a weight matrix register configured to receive a weight input from either a transposed or an untransposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; an untransposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiplication unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register by a vector data input to obtain a multiplication result.
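To make the multiply-add behavior of [004] and [009] concrete, the following is a minimal Python sketch of a cell and of partial sums flowing down one column; the class and variable names are illustrative assumptions, not part of the specification, and no hardware timing is modeled.

```python
# Minimal sketch (an illustration, not the patented hardware): each cell
# multiplies its vector data input by the weight held in its weight
# matrix register and adds the result to the incoming partial sum.

class MultiplyAddCell:
    def __init__(self, weight: float = 0.0):
        self.weight = weight  # value held in the weight matrix register

    def step(self, operand: float, partial_sum_in: float) -> float:
        # Multiply the vector data input by the stored weight and
        # accumulate into the partial sum flowing down the column.
        return partial_sum_in + operand * self.weight

# A column of cells computes one dot product as partial sums flow down.
column = [MultiplyAddCell(w) for w in (0.5, -1.0, 2.0)]
partial_sum = 0.0
for cell, x in zip(column, (1.0, 2.0, 3.0)):
    partial_sum = cell.step(x, partial_sum)
assert partial_sum == 0.5 * 1.0 + -1.0 * 2.0 + 2.0 * 3.0  # dot product
```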
Each cell may include a multiplexer configured to select between the weight input of the transposed weight shift register and that of the untransposed weight shift register and to forward the selected weight input to the weight matrix register.
[0010] The matrix multiplication unit may include a first weight holding register configured to hold a weight value from either the transposed weight shift register or the untransposed weight shift register.
[0011] The matrix multiplication unit may include a second weight holding register configured to hold a weight value from either the transposed weight shift register or the untransposed weight shift register.
[0012] Weight values can be loaded into the matrix multiplication unit from the transposed weight shift register in a horizontal direction into the first weight holding register, and from the untransposed weight shift register in a vertical direction into the second weight holding register.
[0013] The weight matrix register can be loaded with a value from either the first or the second weight holding register.
[0014] In another embodiment, a matrix multiplication unit implemented as a systolic array may include a plurality of cells arranged in columns of the systolic array; two chains of weight shift registers per column of the systolic array; a weight matrix register per cell configured to store a weight input received from a weight shift register; and a multiplication unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register by a vector data input to obtain a multiplication result. Each weight shift register is connected to only one chain, and each cell is connected to only one weight shift register.
[0015] Weight values can be sent into the two chains of weight shift registers from a vector register that holds pairs of weight values.
[0016] A holding register at the top of each column can hold a weight value when two weight values are unavailable from the vector register.
[0017] When two weight values are available, the two weight values are shifted, in one clock cycle, into the weight shift registers in the cells.
[0018] When two weight values are unavailable, on a first clock cycle in which a first weight value is available, the holding register is loaded with the first weight value as a held value and no shifting is done. On the next clock cycle, when a second weight value is available, the second weight value and the held value are shifted through the two shift chains, one value per chain, into the weight shift registers connected to the chains.
[0019] Each shift chain can have two injection points for injecting weight values, one at the top of the column and the other at a second point in the column. A vector register can hold packed sets of four 8-bit integers, each representing a separate weight value. Two of the four integers can be injected at the top of the column and the other two of the four integers can be injected at the second point in the array.
[0020] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. A matrix multiplication unit with two chains of weight shift registers per column of the systolic array can deliver weights from a vector register to the matrix multiplication unit at twice the rate of a matrix multiplication unit with only one chain of weight shift registers.
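A rough way to see this rate claim is the cycle-count sketch below, under the simplifying assumption that each chain delivers one weight value into its column per clock cycle; the function name is illustrative.

```python
# Hedged sketch of the twofold claim above: cycle counts only, with each
# shift chain assumed to deliver one weight value into its column per
# clock cycle (as in the even-row/odd-row chains of FIG. 4).

def cycles_to_fill_column(num_rows: int, chains_per_column: int) -> int:
    # Ceiling division: the last cycle may deliver fewer values.
    return -(-num_rows // chains_per_column)

assert cycles_to_fill_column(128, 1) == 128  # one chain per column
assert cycles_to_fill_column(128, 2) == 64   # two chains: twice the rate
```

The same counting argument, applied with the two injection points per chain of [0019], gives 32 cycles, which is the fourfold rate discussed next.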
Additionally, a matrix multiplication unit with two chains of weight shift registers per column that sends weight values to two points in the array, i.e., the top and the midpoint of the array, can deliver weights from a vector register to the matrix multiplication unit at four times the rate of a matrix multiplication unit with only one chain of weight shift registers.
[0021] Additionally or alternatively, a matrix multiplication unit can have cells that each contain an untransposed weight shift register and a transposed weight shift register. The matrix multiplication unit can then use separate registers for the vertical and horizontal weight shift chains, which results in the matrix multiplication unit being able to load weight values at twice the rate of matrix multiplication units that do not have separate registers for the two weight shift chains.
[0022] These weight shift loading methods can be combined to obtain an eightfold increase in loading rate over a matrix multiplication unit without two chains of weight shift registers per column and without separate registers for the vertical and horizontal weight shift chains. These shift chains and/or separate registers can be added to a matrix multiplication unit without significantly increasing the complexity or the dimensions of the matrix multiplication unit.
[0023] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and in the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] FIG. 1A shows a high-level diagram of an example special-purpose hardware chip for training a neural network.
[0025] FIG. 1B shows a high-level example of a computing core.
[0026] FIG. 1C shows an example neural network processing system.
[0027] FIG. 2 illustrates an example core architecture that includes matrix multiplication units. Each matrix multiplication unit is a two-dimensional systolic array.
[0028] FIG. 3 illustrates an example architecture of a multiple cell within a systolic array.
[0029] FIG. 4 shows an example architecture of a matrix multiplication unit with two chains of weight shift registers per column to increase the rate of loading weight values.
[0030] FIG. 5 is a flowchart of an example method for loading weight values into a column of a given multiple cell.
[0031] FIG. 6 shows an example architecture of a matrix multiplication unit with two chains of weight shift registers per column that injects weight values at two points in the column to increase the rate of loading weight values.
[0032] FIG. 7 shows an example architecture of a matrix multiplication unit with separate registers for horizontal weight shifting and vertical weight shifting to increase the rate of loading weight values.
[0033] FIG. 8 shows an example cell with a set of holding registers to increase the rate of loading weight values.
[0034] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0035] A neural network with multiple layers can be trained and then used to compute inferences. For example, the neural network has parameters that are each initialized with a value.
During training, the neural network performs a neural network training procedure to adjust the values of the parameters of the neural network, for example, to determine trained values of the parameters from initial values of the parameters using backpropagation. The trained neural network can then compute inferences, that is, process input through the layers of the neural network to generate a neural network output for the input.
[0036] For example, given an input, the neural network can compute an inference for the input. The neural network computes this inference by processing the input through each of the layers of the neural network. In some implementations, the layers of the neural network are arranged in a sequence.
[0037] Therefore, in order to compute an inference from a received input, the neural network receives the input and processes it through each of the neural network layers in the sequence to generate the inference, with the output of one neural network layer being provided as input to the next neural network layer. Data inputs to a neural network layer, for example, either the input to the neural network or the outputs of the layer below the layer in the sequence, can be referred to as activation inputs to the layer.
[0038] In some implementations, the layers of the neural network are arranged in a directed graph. That is, any particular layer can receive multiple inputs, multiple outputs, or both. The layers of the neural network can also be arranged so that an output of a layer can be sent back as an input to a previous layer.
[0039] FIG. 1A shows a high-level diagram of an example special-purpose hardware chip for training a neural network. As illustrated, a single special-purpose hardware chip includes two independent processors, e.g., 102a, 102b. Each processor 102a, 102b contains two distinct cores: (1) a computing core, i.e., a very long instruction word (VLIW) machine (103a, 103b), and (2) a sparse computing core, i.e., an embedding layer accelerator (105a, 105b).
[0040] Each computing core, e.g., 103a and 103b, is optimized for dense linear algebra problems. Each computing core is controlled by a single, very long instruction word. Each computing core executes its own stream of very long instruction words.
[0041] A sparse computing core, e.g., 105a or 105b, maps very sparse, high-dimensional data into dense, low-dimensional data so that the rest of the layers process densely packed input data. For example, the sparse computing core can perform the computation of any embedding layers in the neural network being trained.
[0042] To perform this sparse-to-dense mapping, the sparse computing core uses a pre-built lookup table, an embedding table. For example, when there is a series of search words as user input, each search word is converted to an identifier or a one-hot encoded vector. Using the identifier as a table index, the embedding table returns the corresponding dense vector, which can be an input activation vector for the next layer. The sparse computing core can also perform reduction operations across the search words to create one dense activation vector. The sparse computing cores work together to perform efficient sparse, distributed lookups, since the embedding table can be huge and not fit in the limited-capacity high-bandwidth memory of one of the special-purpose hardware chips. More details regarding sparse computing core functionality can be found in US Patent Application No.
15/016,486, entitled MATRIX PROCESSING APPARATUS, which was filed on February 5, 2016.
[0043] FIG. 1B shows a high-level example of a computing core (101). The computing core can be a machine, i.e., a VLIW machine, that controls several computing units in parallel. Each computing core (101) contains: a scalar memory (104), a vector memory (108), a scalar processing unit (107), vector registers (106), and extended vector units (i.e., a matrix multiplication unit (MXU) (113), a transpose unit (XU) (114), and a reduction and permutation unit (RPU) (116)).
[0044] An example scalar processor performs the VLIW instruction fetch/execute loop and controls the computing core. After fetching and decoding an instruction bundle, the scalar processor itself executes the instructions found in the scalar slots of the bundle using the multiple, multi-bit registers, i.e., the 32-bit scalar processor registers (107), and the scalar memory (104). The scalar instruction set includes normal arithmetic operations, for example, as used in computing addresses, load/store instructions, and branch instructions. The remaining instruction slots encode instructions for the vector processing unit or the other extended vector units (113, 114, 116). The decoded vector instructions are forwarded to the vector processing unit.
[0045] Along with the vector instructions, the scalar processor (107) can forward values of up to three scalar registers to the other processor and the units, to perform operations on. The scalar processor can also directly retrieve computation results from the vector processor. However, in some implementations, the example chip has a low-bandwidth path from the vector processor to the scalar processor.
[0046] A vector instruction dispatcher sits between the scalar processor and the vector processor. This dispatcher receives decoded instructions from the non-scalar VLIW slots and broadcasts those instructions to the vector processing unit. The vector processing unit is described in detail with respect to FIG. 1C.
[0047] An example scalar processor (107) accesses a small, fast, private scalar memory (104), which is backed by a much larger but slower high-bandwidth memory (HBM) (110). Similarly, an example vector processing unit accesses a small, fast, private vector memory (108), which is also backed by the HBM (110). Word-granularity access occurs either between the scalar processor (107) and the scalar memory (104) or between the vector processing unit and the vector memory (108). The granularity of loads and stores between the vector processor and the vector memory is a vector of 128 32-bit words. Direct memory access occurs between the scalar memory (104) and the HBM (110), and between the vector memory (108) and the HBM (110). In some implementations, memory transfers from the HBM (110) to the processing units (107) can only be done through the scalar and vector memories. Additionally, there are no direct memory transfers between the scalar memory and the vector memory.
[0048] Instructions may specify extended vector unit operations. Along with each executed vector unit instruction, there are two-dimensional, i.e., 128 by 8, vector units that can each send a register value to the extended vector units as input operands. Each extended vector unit takes the input operands, performs corresponding operations, and returns the results back to the vector processor (306). The extended vector units are described below with respect to FIG. 4.
[0049] FIG. 1C shows an example of a special-purpose integrated circuit 100 for performing neural network computations.
As illustrated, the chip contains two computing cores (103a, 103b) and two sparse computing cores (152a, 152b).
[0050] The chip has a shared area that includes a host interface to a host computer (150), four stacks of high-bandwidth memory along the bottom (156a-156d), and an inter-chip interconnect (148) that connects the interfaces and memory together, as well as carrying data from other chips. Two stacks of high-bandwidth memory (156a-b, 156c-d) are associated with each computing core (103a, 103b).
[0051] The chip stores data in the high-bandwidth memory (156c-d), reads the data into and out of the vector memory (108), and processes the data. The computing core (103b) itself includes a vector memory (108), which is on-chip SRAM divided into two dimensions. The vector memory has an address space in which addresses hold floating point numbers, that is, 128 numbers that are each 32 bits. The computing core (103b) also includes a computational unit that computes values and a scalar unit that controls the computational unit.
[0052] The vector processing unit consists of a two-dimensional array of vector units, i.e., 128 x 8, which all execute the same instruction in single instruction, multiple data (SIMD) fashion. The vector processor has lanes and sublanes, i.e., 128 lanes and 8 sublanes. Within a lane, the vector units communicate with each other via load and store instructions. Each vector unit can access one 4-byte value at a time. Vector units that do not belong to the same lane cannot communicate directly. These vector units must use the reduction/permutation unit described below.
[0053] The computational unit includes vector registers, that is, 32 vector registers, in a vector processing unit (106) that can be used for both floating point and integer operations. The computational unit includes two arithmetic logic units (ALUs) (126c-d) to perform computations. One ALU (126c) performs floating-point addition and the other ALU (126d) performs floating-point multiplication. Both ALUs (126c-d) can perform various other operations such as shifts, masks, and comparisons. For example, a computing core (103b) may want to add a vector register, V1, to a second vector register, V2, and place the results in a third vector register, V3. In order to compute the addition, the computing core (103b) performs multiple, i.e., 1024, operations in one clock cycle. Using these registers as operands, each of the vector units can simultaneously execute two ALU instructions and one load and one store instruction every clock cycle. A base address for a load or a store instruction can be computed in the scalar processor and forwarded to the vector processor. Each of the vector units in each sublane can compute its own offset address using various methods such as transposition and a special indexed address register.
[0054] The computational unit also contains an extended unary pipeline (EUP) (116) that performs operations such as square root and reciprocal. The computing core (103b) takes three clock cycles to issue these operations, as they take in one operand at a time. Since the EUP processing takes more than one clock cycle, there is a first-in, first-out (FIFO) buffer to store the results. When an operation finishes, the results are stored in the FIFO. The computing core can later use a separate instruction to pull the data out of the FIFO and into a vector register.
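The FIFO decoupling just described can be sketched as follows; this is a minimal Python illustration with hypothetical names, showing why a long-latency unit plus a FIFO frees vector registers from being reserved for the duration of the operation.

```python
# Minimal sketch (hypothetical names, no hardware timing): a multi-cycle
# unit such as the EUP pushes finished results into a FIFO, and a
# separate, later instruction pops them into a vector register, so no
# vector register stays reserved while the long operation completes.
from collections import deque

result_fifo = deque()  # stands in for the hardware result FIFO

def unit_finishes(result: float) -> None:
    result_fifo.append(result)    # the unit parks its result in the FIFO

def pop_into_vector_register() -> float:
    return result_fifo.popleft()  # drained later by a separate instruction

unit_finishes(2.0 ** 0.5)         # e.g., a square-root result from the EUP
v0 = pop_into_vector_register()   # moved into a vector register afterwards
```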
A random number generator (120) allows the computing core (103b) to generate random numbers each cycle, that is, 128 random numbers per cycle.
[0055] As described above, each processor has three extended vector units: a matrix multiplication unit (113) that performs matrix multiplication operations; a cross-lane unit (XLU) that includes a transpose unit (XU) (114), which performs a transpose operation on a matrix, i.e., a 128 by 128 matrix; and a reduction and permutation unit, illustrated as separate units in FIG. 1C, reduction unit 115 and permutation unit 116.
[0056] The matrix multiplication unit performs matrix multiplications between two matrices. The matrix multiplication unit (113) receives data, since the computing core needs to load in the set of numbers that makes up the matrix to be multiplied. As illustrated, the data comes from the vector registers (106). Each vector register contains a number, that is, a 32-bit number. However, floating point conversion may occur as data is sent to the matrix multiplication unit (113) to change the numbers to a smaller bit size, i.e., from 32 bits to 16 bits. A serializer (130) ensures that when numbers are read out of the vector registers, a two-dimensional array, i.e., a 128 by 8 matrix, is read as sets of 128 numbers that are sent to the matrix multiplication unit (113) on each of the next eight clock cycles. After the matrix multiplication has completed its computations, the results are deserialized (132a, b), which means that the result matrix is held for a number of clock cycles. For example, for a 128 x 8 array, 128 numbers are held for each of 8 clock cycles and then pushed to an appropriate FIFO, e.g., the transpose result FIFO (TRF) 134 or the multiplication result FIFO (MRF) 136, so that a two-dimensional array of 128 x 8 numbers can be picked up in one clock cycle and stored in the vector registers contained in the vector processing unit (106).
[0057] Over a period of cycles, i.e., 128 cycles, weights are shifted into the matrix multiplication unit (113) as the numbers by which to multiply the matrix. Once the matrix and the weights have been loaded, the computing core (103b) can send sets of numbers, i.e., 128 x 8 numbers, to the matrix multiplication unit (113). Each row of the set can be multiplied by the matrix to produce a number of results, i.e., 128 results per clock cycle. While the computing core performs matrix multiplications, the computing core also shifts, in the background, the new sets of numbers making up the next matrix by which the computing core will multiply, so that the next matrix is available when the computation for the previous matrix has finished. The matrix multiplication unit (113) can process weight inputs, which are the data in a matrix that is to be multiplied, and left-hand side data inputs, which are the data in a vector that is to be multiplied by the matrix, and provide a vector of outputs to the vector processing unit. The vector processing unit can process the vector of outputs and store a vector of processed outputs in the vector memory. For example, the vector processing unit can apply a non-linear function to the outputs of the matrix multiplication unit to generate vector data values. In some implementations, the vector processing unit 106 generates normalized values, pooled values, or both. The vector of processed outputs can be used as the left-hand side data inputs to the matrix multiplication unit 113, for example, for use in a subsequent layer of the neural network.
[0058] The transpose unit transposes a matrix.
The transpose unit (114) receives numbers and transposes them so that a number across a lane is transposed with a number in the other dimension. In some implementations, the vector processor includes 128 x 8 vector units. Therefore, to transpose a 128 x 128 matrix, sixteen individual transpose instructions are needed for the complete matrix transpose. Once the transpose is finished, the transposed matrix is available. However, an explicit instruction is needed to move the transposed matrix into the vector register file.
[0059] The reduction/permutation unit (or units 115, 116) addresses the problem of cross-lane communication by supporting operations such as permute, lane rotate, rotating permute, lane reduction, permuted lane reduction, and segmented permuted lane reduction. As illustrated, these computations are separate; however, a computing core can use one or the other, or one chained to the other. The reduction unit (115) reduces each row of numbers and feeds the numbers to the permutation unit (116). The permutation unit moves data between different lanes. The transpose unit, the reduction unit, the permutation unit, and the matrix multiplication unit each take more than one clock cycle to complete. Therefore, each unit has a FIFO associated with it, so that the results of computations can be pushed into the FIFO and a separate instruction can be executed later to pull the data out of the FIFO and into a vector register. By using FIFOs, the computing core does not need to reserve multiple vector registers for the duration of long operations. As illustrated, each of the units takes data from the vector registers in the vector processing unit (106).
[0060] The computing core uses a scalar unit to control the computational unit. The scalar unit has two main functions: (1) performing loop counting and addressing, and (2) generating direct memory access (DMA) requests so that the DMA controller moves data in the background between the high-bandwidth memory (156c-d) and the vector memory (108), and then over the inter-chip interconnect (148) to other chips in an example system. The scalar unit contains an instruction memory (104), an instruction decoder and issuer (102), a scalar processing unit (107) that contains scalar registers, i.e., 32 bits, a scalar memory (104), and two ALUs (126a, b) to perform two operations per clock cycle. The scalar unit can supply operands and immediate values to vector operations. Each instruction can be sent from the instruction decoder and issuer (102) as an instruction bundle containing the instructions that execute on the vector registers in the vector processing unit (106). Each instruction bundle is a very long instruction word (VLIW), with each instruction a number of bits wide, divided into a number of instruction fields.
[0061] FIG. 2 illustrates an example core architecture 200 that includes matrix multiplication units (MXUs) 201a and 201b. Each MXU is a two-dimensional systolic array. The array is wired to perform matrix multiplication operations. An MXU multiplies a 128-element vector by a preloaded 128 x 128 matrix, with a constant throughput of one multiplication per clock cycle.
[0062] Each MXU can have 128 rows and 128 columns. An MXU can be divided into identical blocks, referred to as tiles. For example, an MXU can be divided into 32 tiles, each of which contains 32 rows by 16 columns. Each tile can be further divided into multiply-add subunit cells.
Each cell takes a vector data input operand, multiplies the operand by stored weights to obtain a result, and adds the result to a partial sum to produce a new partial sum. In some implementations, the subunit cells can be grouped into larger multiple cells, i.e., 2 x 2 arrays of multiply-add subunit cells or 4 x 4 arrays of multiply-add subunit cells, referred to as multi-cells. Rather than moving input data from one multiply-add subunit cell to the next at a rate of one per clock cycle, the data can move across the systolic array at one multiple cell per clock cycle.
[0063] Before beginning a series of vector-matrix multiplications, a matrix needs to be preloaded into the MXU. The data for this matrix is called the "weights" data. The weight matrix is delivered to the MXU over source buses connected to the MXU and shifted into weight shift registers. The contents of the weight shift registers are then loaded into a weight matrix register so that the matrix multiplication can begin. This weight-loading process is described in more detail with respect to FIGS. 3-8.
[0064] As illustrated in FIG. 2, each MXU, e.g., 113a and 113b, is connected to three buses: a first source bus for untransposed weights (230a, 230b), a second source bus for transposed weights (220a, 220b), and a left-hand side bus (210a, 210b) for the vector data to be multiplied by the matrix stored in the MXU. The MXUs are connected to the buses by wires that attach to the edges of the MXU. Each transpose unit (XU), e.g., 114a and 114b, is also connected to the first source bus and the second source bus.
[0065] The first and second source buses are multipurpose buses that carry data sent from the vector processing unit to be consumed by either the XU or the MXU. Data processing takes place in the vector processing data path, which includes vector registers 206, a serialization processing unit 202, and a selection unit 204. There are several ways in which the vector processing unit can send weights on a bus. The weights can be sent normal, "high", or "low". Eight 32-bit floating-point numbers per lane (one per sublane) are rounded to bfloats, 16-bit floating-point numbers. These values are packed into four pairs and sent to the MXU every other cycle over the course of 8 cycles. The difference between normal, "high", and "low" is how the vector processing unit does the 32-bit floating point to bfloat conversion. The weights can be packed, meaning that each of the eight 32-bit values per lane contains a packed pair of bfloats. Sixteen values, rather than eight, are sent to the MXU, using the source bus, every cycle for eight consecutive cycles. During the odd cycles, the low 16 bits of each sublane are sent to the MXU, and during the even cycles, the high 16 bits of each sublane are sent. The weights can additionally or alternatively be sent by byte. Each 32-bit operand contains a packed set of four signed 8-bit two's complement integers. Each byte is converted to a modified sign-magnitude value. These values are sent to the MXU over a source bus for eight consecutive cycles.
[0066] The weights can be sent as untransposed or transposed instructions using the first and second source buses and shifted into the weight shift registers. When triggered with a load operation, the contents of the weight shift registers are loaded into the weight matrix registers as described below. The load path from the weight shift registers to the weight matrix registers is also where the conversion from modified sign-magnitude to bfloat is done for byte-mode data.
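A rough sketch of these value formats follows. It assumes, as one simple variant, that the bfloat conversion truncates to the high 16 bits of an IEEE float32 (the exact "normal"/"high"/"low" rounding behavior of the hardware is not reproduced), and the byte order used to unpack the four 8-bit integers is an assumption.

```python
# Illustrative sketch of the packed formats in [0065]: a float32 reduced
# to a 16-bit bfloat by truncation, and a 32-bit operand unpacked into
# four signed 8-bit integers. Byte order here is an assumption.
import struct

def fp32_to_bfloat16_bits(x: float) -> int:
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # keep sign, exponent, and top 7 mantissa bits

def unpack_four_int8(word: int) -> tuple:
    # One 32-bit operand carries a packed set of four signed integers.
    return struct.unpack("<4b", struct.pack("<I", word))

assert hex(fp32_to_bfloat16_bits(1.5)) == "0x3fc0"
assert unpack_four_int8(0x04FD02FF) == (-1, 2, -3, 4)
```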
A load control bus indicates whether this conversion is to be done.
[0067] Depending on the instruction being executed, the 32-bit values from the source buses may contain a packed pair of 16-bit floating point values, with the value in bits [15:0] representing the earlier value (in time), or a packed set of four 8-bit integers in modified sign-magnitude format, with the value in bits [7:0] representing the earliest value (in time) and the other values following sequentially. When the MXU receives data from the buses, the data values are spread evenly across the MXU, with value 0 on the left side and value 127 on the right side.
[0068] The left-hand side (LHS) data bus delivers 128 16-bit floating point numbers in a specific format, e.g., bfloat, to be multiplied by the matrix stored in the connected MXU. The data on the LHS data bus comes from the vector processing unit and passes through the transpose unit, e.g., 114a and 114b. When the LHS input arrives at the MXU, the values are spread evenly across the MXU, with value 0 on the left side and value 127 on the right side.
[0069] The result of the matrix multiplication is spread evenly across the MXU and sent from the MXU to the matrix result FIFO (MRF), e.g., 136a and 136b. XU results are sent to the corresponding transpose result FIFO (TRF), e.g., 134a and 134b.
[0070] FIG. 3 illustrates an example architecture of a multiple cell within a matrix multiplication unit. As described above, the matrix multiplication unit is a two-dimensional systolic array. The array includes multiple multiply-add subunits that can be grouped into multiple cells. In some implementations, a first dimension of the systolic array corresponds to columns of cells and a second dimension of the systolic array corresponds to rows of cells. The systolic array can have more rows than columns, more columns than rows, or an equal number of rows and columns. This specification describes certain processing for columns, or vertically. However, different designs can perform the processing for rows, or horizontally.
[0071] In the illustrated example, left-hand side data registers 315a, 315b send vector data inputs to rows of the array. Weight shift chains 301a and 301b send weight input values to columns of the array, and weight shift chains 302a and 302b send weight input values to rows of the array. A shift chain is a wired path along which values can be passed, for example, from a memory to each of the various registers within the matrix multiplication unit.
[0072] Each weight shift register 305 is designed to shift its weight content values from a source bus along the chain of weight shift registers 305. After the data has been shifted in, a parallel copy operation ensures that all the data is copied from the weight shift registers 305 to the corresponding weight matrix registers 325. When the data is in the weight matrix registers 325, the data is used in any number of multiplication cycles. During this time, more weights can be (and typically are) shifted into the weight shift registers 305 in the background in preparation for the next set of multiplications.
[0073] The left-hand side data registers 315a, 315b can receive the vector data inputs. Each left-hand side data register holds one LHS data item each clock cycle. Each vector data input received by a multiple cell can flow freely into a corresponding left-hand side register of the multiple cell, such as the left-hand side data registers 315a, 315b.
The left-hand side data registers store vector data inputs that can be provided either by a vector register or by an adjacent multiple cell located to the left of the given multiple cell, depending on the position of the multiple cell within the array. For example, if the multiple cell 300 is located at the leftmost position within the systolic array of the matrix multiplication unit, the vector data inputs are provided by a vector register. The vector register can provide multiple different vector data inputs to the multiple cell 300, in which case each received vector data input can be stored by a different one of the left-hand side registers 315. Each row receives one value each clock cycle, regardless of the number of rows grouped into a multiple cell.
[0074] Each left-hand side register can be coupled to cells along a first dimension of the array of multiple cells. The connections of the left-hand side registers to the cells are indicated by dotted lines in FIG. 3. For example, the left-hand side data register 315a (a first left-hand side data register) in the multiple cell is coupled to the cells 350a and 350c of the first row. Similarly, the left-hand side data register 315b (a second left-hand side register) in the multiple cell is coupled to the cells 350b and 350d of the second row. Each left-hand side register 315 transfers the stored vector data input to the cells 350 to which the register is coupled. Therefore, for a given number of cells spanning a first dimension (for example, along a given row or along a given column), the vector data inputs can be passed to all cells in the multiple cell, and not just to a single cell, thereby causing the vector data input to disperse quickly across the array of cells and improving the efficiency of operation of the multiple cell.
[0075] The multiple vector data inputs can also be sent to an adjacent left-hand side register so that the multiple vector data inputs can be used in another multiple cell of the array. This process allows vector data inputs to be shifted for use in another particular multiple cell of the array.
[0076] Each cell 350 of a multiple cell 300 contains a stored weight value. Before beginning a matrix multiplication process, weights are loaded by shifting them into the cells of the systolic array. Dedicated chains and weight shift registers are provided for the weight shifting so that new weights can be shifted in concurrently with the execution of the previous matrix multiplication processing. Weight inputs can be loaded into multiple cells in ways that decrease the latency of the overall matrix multiplication operation.
[0077] As discussed above, the weight shift chains 301, 302 can receive weight inputs from a memory unit, e.g., the vector memory 108 of FIG. 1. The shift chains can send multiple corresponding weight inputs to the weight matrix registers 325 associated with the multiple cell 300.
[0078] In some implementations, the weight shift registers shift vector data inputs across the array along one dimension, for example, to the right, while shifting weight input across the array along one or both dimensions, for example, to the right or downward. For example, over the course of one clock cycle, each vector data input of the multiple vector data inputs in the multiple cell 300 can move to a corresponding left-hand side data register in the next multiple cell in the same row. Horizontal data (left-hand side data) and vertical data (partial sums) each move one multiple cell per clock cycle, every clock cycle.
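This movement rate can be pictured with the toy model below; it only shuffles labels to show one-multiple-cell-per-clock movement, and all names are illustrative.

```python
# Toy model of the movement described above: on every clock cycle,
# left-hand side data advances one multiple cell to the right and
# partial sums advance one multiple cell downward. Labels only; no
# arithmetic or hardware timing is modeled.

def clock_tick(lhs_rows, psum_cols):
    # Shift LHS data one multiple cell to the right within each row,
    # injecting a fresh value at the left edge.
    lhs_rows = [["fresh"] + row[:-1] for row in lhs_rows]
    # Shift partial sums one multiple cell downward within each column.
    psum_cols = [["fresh"] + col[:-1] for col in psum_cols]
    return lhs_rows, psum_cols

lhs = [["a0", "a1"], ["b0", "b1"]]     # per-row LHS registers
psums = [["p0", "p1"], ["q0", "q1"]]   # per-column partial sums
lhs, psums = clock_tick(lhs, psums)
assert lhs[0] == ["fresh", "a0"] and psums[0] == ["fresh", "p0"]
```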
The weights move only when instructed by the system and, depending on the implementation and the instructions executed, they can move 1, 2, or 4 rows (or columns).
[0079] A multiplexer 330 selects a weight either from a weight shift register 305 of the first shift chain 301 or of the second shift chain 302 and forwards the selected input to a single row of the weight matrix register 325. Although the multiplexers 330 are shown outside the boundary lines of the cells 350, in some implementations the multiplexers 330 exist within the cells 350.
[0080] In one clock cycle, each multiple cell can process the multiple given weight inputs and the multiple given vector data inputs to generate multiple accumulated outputs. In general, the processing includes a multiplication operation to multiply a vector data input by a stored weight. The accumulated outputs can also be passed downward to an adjacent multiple cell along the same dimension as the given weight inputs. In some implementations, the weights are shifted by more than one multiple cell during a given clock cycle to transition from one convolution computation to another.
[0081] The accumulated outputs can be passed along the same columns as the weight inputs, for example, toward the bottom of the column in the array. In some implementations, a partial sum register 310a, 311a passes a partial sum value into the multiple cell from a previous multiple cell. The array can include partial sum registers 310b, 311b that store the accumulated outputs of each column of multiple cells. For each column of the multiple cell, the products generated by the subunit cells in the column are combined with the partial sum received from the multiple cell above, and then sent on as the next partial sum. For certain multiple cells, for example, the multiple cells at the bottom of the columns of the systolic array, the accumulated outputs can include final accumulated values that can be transferred to a vector computation unit. In some implementations, the final accumulated values are transferred directly from the bottom multiple cells of the array to the vector computation unit, while in other implementations, the final accumulated values are first stored in a memory or are processed by a different component before being sent to the vector computation unit.
[0082] FIG. 4 shows an example architecture of a multiple cell of a matrix multiplication unit with two chains of weight shift registers per column of the multiple cell sub-array in order to increase the rate of loading weight values. As shown in FIG. 4, cell 435a and cell 435b make up one column of the multiple cell 400, and cell 435c and cell 435d make up a second column of the multiple cell 400. Each column has two chains of weight shift registers. Each cell in a given column is configured to receive weight inputs from only one of the two chains of the column. As shown in FIG. 4, one chain 401 connects to the weight shift registers in even-numbered rows and one chain 402 connects to the weight shift registers in odd-numbered rows. On each cycle, two new values are shifted into each column and all the existing weight values are shifted down by two rows. Weights can thus be loaded into a multiple cell at twice the rate of matrix multiplication units that do not have two chains of weight shift registers per column of the systolic array.
[0083] As illustrated, the weight values are shifted in from vector registers 403. In one implementation, there is one vector register 403 per column of the matrix multiplication unit.
Although the vector registers 403 are illustrated at the top of the matrix multiplication unit in the example of FIG. 4, the vector registers 403 can be physically located at various positions relative to the matrix multiplication unit, for example, at the bottom of the unit.
[0084] A vector register 403 can hold register values that are some magnitude greater or smaller than the values operated on by the matrix multiplication unit. For example, a register can hold n-bit values while the matrix multiplication unit operates on n/2-bit values. In some implementations, each vector register holds 32-bit values and the matrix multiplication unit operates on 16-bit values. An example matrix multiplication unit has a mode to treat each 32-bit value of the register as a pair of 16-bit values, where one 16-bit value of the pair is sent to the first weight shift chain 401 and the second 16-bit value of the pair is sent to the second weight shift chain 402. Although one vector register 403 is shown per column, there may be only one vector register 403 per multiple cell. Additionally or alternatively, each chain can be connected to a separate vector register 403 that provides a single 16-bit weight value to the chain. In that case, the 32-bit floating point values in the vector register 403 are converted to 16-bit values.
[0085] In some implementations, weight values may not be available to send at twice the rate of a matrix multiplication unit without two shift chains per column. To handle this situation, a holding register 445 is placed at the top of each column to hold a weight value until two weight values are available, one for each vertical shift chain. On the first clock cycle in which only one weight value is available, the available weight value is copied into the holding register 445. On the next clock cycle in which a new weight value is available, the weight value in the holding register will be shifted from the holding register into a weight shift register by one weight shift chain, and the new weight value available on that clock cycle will be shifted into a second weight shift register by the second weight shift chain.
[0086] A horizontal shift chain 405 can provide weight values to the cells as described above. In some implementations, there can be two horizontal shift chains that function to decrease the weight load latency in the same way as the vertical shift chains 401, 402 described above.
[0087] A multiplexer 430 determines whether a weight value sent to a weight matrix register within a cell comes from the horizontal shift chain 405 or from the vertical shift chain 401b or 402b. Once a weight value has been loaded into the weight matrix register and the left-hand side data register 415 provides a vector data input, a matrix multiplication can then be performed by the cell 435.
[0088] FIG. 5 is a flowchart of an example process 500 for loading weight values into a column of a given multiple cell. The interface receives at least one weight value from a vector register (501).
[0089] The interface determines whether multiple weight values are available (502).
[0090] If multiple weight values are available, the interface shifts the weight values, in the clock cycle, through the shift chains into the weight shift registers in the cells 435 within the multiple cell (504).
[0091] The interface continues loading weight values until all the weight values of a weight matrix have been loaded into the matrix multiplication unit (506).
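The whole of example process 500, including the single-value branch that the next paragraphs spell out, can be sketched as follows; the function and variable names are illustrative, and the shift chains are modeled as plain lists.

```python
# Sketch of example process 500, using the holding-register behavior
# described in [0085]: a lone weight waits in the holding register; when
# its partner arrives, both values shift down the two chains together.

def load_column(weight_stream, even_chain, odd_chain):
    held = None  # models holding register 445 at the top of the column
    for arrivals in weight_stream:          # values arriving per clock cycle
        if held is None and len(arrivals) == 1:
            held = arrivals[0]              # hold it; no shift this cycle
            continue
        if held is not None:
            arrivals = [held, arrivals[0]]  # pair held value with new one
            held = None
        even_chain.insert(0, arrivals[0])   # one value per shift chain,
        odd_chain.insert(0, arrivals[1])    # shifted in the same cycle

even, odd = [], []
load_column([[1.0, 2.0], [3.0], [4.0]], even, odd)
assert even == [3.0, 1.0] and odd == [4.0, 2.0]
```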
[0092] If two weight values are not available in the same clock cycle, then on the first cycle in which a single weight value is available, the holding register 445 is loaded with the available weight value and no shifting is done (503).
[0093] On the next cycle, when another weight value becomes available, the interface shifts the new value and the value held in the holding register 445 through the two shift chains into the weight shift registers in the multiple cell (505).
[0094] The interface then continues loading weight values until all the weight values of a weight matrix have been loaded into the matrix multiplication unit (506).
[0095] In the case where multiple weight values are not available per cycle, the interface activates the shift chains only every other cycle.
[0096] FIG. 6 shows an example architecture of a matrix multiplication unit with two chains of weight shift registers per column that injects weight values at two points in the column in order to increase the rate of loading weight values by four times. As shown in FIG. 6, the matrix multiplication unit has two shift chains per column of the systolic array. Each cell 650 contains a shift register 635 that is connected to only one shift chain. As discussed above, a vector register 603 can hold register values that are some magnitude greater or smaller than the values operated on by the matrix multiplication unit. For example, a register can hold n-bit values while the matrix multiplication unit operates on n/2-bit values. The values in the vector register can be split or otherwise transformed to match the value size expected by the matrix multiplication unit.
[0097] In one implementation, each register 603 can hold 32-bit values. The values in each vector register 603 are treated as a packed set of four signed 8-bit integers, each a separate weight value. Each signed 8-bit integer is sent over the two 16-bit chains as illustrated in FIG. 3. However, the integers are sent to two injection points 680, 681 per column of the systolic array. The integers are sent to the top (680a, 681a) and to another point further down the array (680b, 681b). The described embodiment with multiple injection points can be combined with the other embodiments and features discussed in this document.
[0098] In some implementations, if the integers are sent halfway down the array, no extra wiring is needed to inject the integers, since the chains running from the vector registers to the top of the array pass along the entire length of the array from bottom to top. At the top of each column, two of the integers are converted to 16-bit floating point values of the format used by the array, which are then injected into the two weight shift chains (680a, 681a) as described above. The shift chains are cut at the midpoint by a multiplexer, and a second set of integer-to-float converters at that point takes the other two integers of each 32-bit value, converts them, and injects them at that point (680b, 681b). For example, a 32-bit word can be divided into four equal 8-bit parts: A, B, C, and D. A weights interface can send parts A and B to the top of the array and convert them to 16-bit values to be operated on by the matrix multiplication unit. The weights interface can also send parts C and D to the midpoint of the array via a multiplexer. In this implementation, parts C and D are not sent to the top of the array but are injected into the cell weight shift registers at the midpoint of the shift chains.
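A small sketch of this A/B/C/D split follows; the byte order used to extract the four parts and the function name are assumptions, and only the cycle count, not the wiring, is modeled.

```python
# Sketch of the four-part split above: each 32-bit word packs four
# 8-bit weights; two go to the top injection point and two go to the
# midpoint, so four values enter the column per cycle. Byte order is an
# assumption made for illustration.
import struct

def split_word(word: int):
    a, b, c, d = struct.unpack("<4b", struct.pack("<I", word))
    return (a, b), (c, d)  # (top injection pair, midpoint injection pair)

top_pair, mid_pair = split_word(0x04030201)
assert top_pair == (1, 2) and mid_pair == (3, 4)

# Four values enter a column per cycle, so a 128-row column fills in
# 128 // 4 == 32 cycles, versus 128 cycles with one chain and one point.
```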
There is a multiplexer on the midpoint shift chains so that weight values are taken from the injection point and not from the previous weight shift register in the shift chain.
[0099] It is possible for the injection point of the second pair of weights into the array to be not at the midpoint but at some other point. For example, it could be a point one quarter of the way down the array. In that case, the weights injected at the top are shifted to the first quarter and the third quarter of the cells of the matrix multiplication unit, while the weights injected at the quarter-way point are shifted to the second and fourth quarters of the cells of the matrix multiplication unit. This process requires additional wiring, but allows the weights to start shifting sooner while a previous matrix multiplication is finishing.
[00100] As shown, the two shift chains occur per column. However, in some implementations, the two shift chains can occur, additionally or alternatively, per row, with two injection points per shift chain.
[00101] FIG. 7 shows an example architecture of a matrix multiplication unit with separate registers for shifting transposed weights and for shifting normal, untransposed weights, in order to increase the rate of loading weight values. Each multiple cell 700 includes cells 750 and can be loaded with weight values from either the vertical or the horizontal direction. Loading weights from the top in the vertical direction results in a matrix of weights being stored in the matrix multiplication unit. Loading the same weights in the same order but from the side results in the transpose of the weight matrix being stored in the matrix multiplication unit. In neural network system training, both the untransposed weight matrix and the transposed weight matrix must be loaded at different steps of the training algorithm. When the weights are loaded in the vertical direction from the top, the weight values are shifted downward through the cells. When the weights are loaded from the left in the horizontal direction, the weight values are shifted to the right through the multiple cell 700. FIG. 7 illustrates normal shift chains 701a, 701b connected to normal shift registers 705. Transposed shift chains 702a, 702b are connected to transposed shift registers 705. A multiplexer 730 determines from which shift chain 701, 702 the weight matrix register 725 is loaded.
[00102] In some implementations, it takes n cycles to shift a set of weights into the weight matrix registers of a matrix multiplication unit. A second set of weights can start shifting n/2 cycles after the first weight value is loaded, and a new set of weights can be loaded from the shift registers into the weight matrix registers every n/2 cycles.
[00103] In some implementations, it is not always necessary to use a whole set of 128 x 128 weights. The weights at unused positions can be set to zero, making the weight matrix effectively smaller. A matrix multiplication unit then does not need to shift data into all rows and columns of the weight shift registers. Each weight shift instruction shifts 8 rows of data, or for transposed loads, 8 columns of data, into the systolic array. Sixteen weight shift instructions load an entire 128 x 128 matrix, overwriting all previous data. Each weight shift register is cleared when the data is copied from the weight shift register into the corresponding weight matrix register.
Shifting new data into the weight shift registers can begin immediately after this load-and-clear signal begins propagating. The weight shift signal is inhibited for all cells below and to the right of the load-and-clear wavefront, so that data does not shift before it has had the opportunity to be loaded. Since the old data is cleared entirely, there is no need to shift in all rows or columns of data. Only the top (or left) portion of the shift registers will be filled with new data, and the rest will remain zero, causing the input data for those rows to be ignored (or the output data for those columns to be zero).
[00104] FIG. 8 shows an example cell 800 with a set of holding registers to increase the rate of loading weight values. The cell 800 includes one or more sets of weight holding registers that are used as temporary storage for sets of weights that have been shifted in. The values of one set of weight shift registers 805a can be copied, instead of or in addition to being copied into the weight matrix registers 825, into one set of weight holding registers 845a. The values of a second set of weight shift registers 805b can be copied, instead of or in addition to being copied into the weight matrix registers 825, into a second set of weight holding registers 845b. At the time when a set of weight values is to be loaded into the weight matrix registers, the set of weight values can be taken from one of the sets of holding registers 845 instead of directly from the weight shift registers 805a, 805b. This process allows a set of weight values to be loaded more than once after being shifted into the array. For example, if an algorithm calls for switching between two sets of weights, the weight values from one shift chain can be shifted into the holding registers between loads. This process also allows the timing of the weight shifting to be decoupled from the weight loading. For example, when a new set of weight values starts shifting every n/2 cycles, it is possible to shift both sets of weight values at the same time, and when the first set is loaded into the weight matrix registers, the other set is moved into a set of weight holding registers. After n/2 additional cycles, the second set is loaded from the holding registers into the weight matrix registers.
[00105] In some implementations, the two shift chains of FIGS. 4 and/or 6 can be combined with the addition of separate normal and transposed shift registers to increase the number of weight values that can be loaded into the matrix multiplication unit at any given time.
[00106] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a serial or random access memory device, or a combination of one or more of them.
Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[00107] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special-purpose logic circuitry, for example, an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[00108] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, for example, one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example, files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[00109] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, for example, an FPGA or an ASIC, or by a combination of special-purpose logic circuitry and one or more programmed computers.
[00110] Computers suitable for the execution of a computer program can be based on general- or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory, or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example, magnetic, magneto-optical, or optical disks. However, a computer need not have such devices.
In addition, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[00111] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

[00112] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone running a messaging application, and receiving responsive messages from the user in return.

[00113] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[00114] The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
[00115] Embodiment 1 is a matrix multiplication unit implemented as a systolic array of cells, each cell of the array of cells comprising: a weight matrix register configured to receive a weight input from either a transposed or an untransposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; an untransposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

[00116] Embodiment 2 is the matrix multiplication unit of embodiment 1, in which each cell further comprises: a multiplexer configured to select between the weight input of the transposed weight shift register and that of the untransposed weight shift register and to forward the selected weight input to the weight matrix register.

[00117] Embodiment 3 is the matrix multiplication unit of embodiment 1 or 2, further comprising a first weight holding register configured to hold a weight value from either the transposed weight shift register or the untransposed weight shift register.

[00118] Embodiment 4 is the matrix multiplication unit of any one of embodiments 1-3, further comprising a second weight holding register configured to hold a weight value from either the transposed weight shift register or the untransposed weight shift register.

[00119] Embodiment 5 is the matrix multiplication unit of any one of embodiments 1-4, in which a weight value is loaded from a transposed weight shift register into the first weight holding register and a weight value is loaded from a vertical direction into the second weight holding register.

[00120] Embodiment 6 is the matrix multiplication unit of any one of embodiments 1-5, in which the weight matrix register is loaded with a value from either the first or the second weight holding register.

[00121] Embodiment 7 is a matrix multiplication unit implemented as a systolic array, comprising: a plurality of cells arranged in columns of the systolic array; two chains of weight shift registers per column of the systolic array, in which each weight shift register is connected to only one chain and each cell is connected to only one weight shift register; a weight matrix register per cell configured to store a weight input received from a weight shift register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

[00122] Embodiment 8 is the matrix multiplication unit of embodiment 7, in which weight values are sent into the two chains of weight shift registers from a vector register that holds pairs of weight values.

[00123] Embodiment 9 is the matrix multiplication unit of embodiment 7 or 8, further comprising a holding register at the top of each column to hold a weight value when two weight values are not available from the vector register.

[00124] Embodiment 10 is the matrix multiplication unit of any one of embodiments 7-9, in which, when two weight values are available, the two weight values are shifted in one clock cycle into the weight shift registers in the cells.
[00125] Embodiment 11 is the matrix multiplication unit of any one of embodiments 7-10, in which, when two weight values are not available: on a first clock cycle in which a first weight value is available, the holding register is loaded with the first weight value as a held value and no shifting is done; and on a next clock cycle, when a second weight value is available, the second weight value and the held value are shifted by the two shift chains, one value shifted by each shift chain, into the weight shift registers connected to the shift chains.

[00126] Embodiment 12 is the matrix multiplication unit of any one of embodiments 7-11, in which each shift chain has two injection points for injecting weight values, one at the top of the column and the other at a second point in the column.

[00127] Embodiment 13 is the matrix multiplication unit of any one of embodiments 7-12, further comprising a vector register containing packed sets of four 8-bit integers, each representing a separate weight value.

[00128] Embodiment 14 is the matrix multiplication unit of any one of embodiments 7-13, further configured to inject two of the four integers at the top of the column and the other two of the four integers at a second point in the array. Illustrative software sketches of embodiments 1 and 2 and of embodiments 7-14 are given after paragraph [00131] below.

[00129] Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

[00130] Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[00131] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
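As an informal illustration of embodiments 1 and 2 above, the following Python sketch models one cell with a transposed and an untransposed weight shift register, a multiplexer selecting between them, and a multiply unit. It is a simplified software analogy under assumed semantics, not a description of the actual circuit; all identifiers are invented here.

```python
class MatMulCell:
    """One cell of the systolic array (simplified software model)."""

    def __init__(self):
        self.transposed_shift = 0    # fed with weights from the horizontal direction
        self.untransposed_shift = 0  # fed with weights from the vertical direction
        self.weight_matrix_reg = 0   # the weight actually used for multiplication

    def shift_in(self, horizontal=None, vertical=None):
        # Each clock cycle a new weight may arrive on either chain; in a
        # full array the displaced value would pass on to the next cell.
        if horizontal is not None:
            self.transposed_shift = horizontal
        if vertical is not None:
            self.untransposed_shift = vertical

    def load_weight(self, use_transposed: bool):
        # The multiplexer of embodiment 2: select which shift register
        # feeds the weight matrix register.
        src = self.transposed_shift if use_transposed else self.untransposed_shift
        self.weight_matrix_reg = src

    def multiply_add(self, vector_in, partial_sum_in):
        # Multiply the stored weight by the vector data input and
        # accumulate into the incoming partial sum.
        return partial_sum_in + self.weight_matrix_reg * vector_in


cell = MatMulCell()
cell.shift_in(vertical=5)  # a weight arrives on the untransposed chain
cell.load_weight(use_transposed=False)
print(cell.multiply_add(vector_in=2, partial_sum_in=10))  # prints 20
```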
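Embodiments 7-14 can likewise be paraphrased in software. The sketch below, again with invented names and Python integers standing in for 8-bit hardware values, models one column with two shift chains, the holding register at the top of the column for the case where only one of a pair of weight values is available, and the unpacking of four packed 8-bit weights. The assignment of the two chains to alternating cells is an assumption made for illustration, not something the embodiments specify.

```python
import struct


def unpack_four_int8(packed: bytes):
    # Embodiment 13: a vector register entry packs four 8-bit integers,
    # each representing a separate weight value.
    return list(struct.unpack("4b", packed))


class Column:
    """One systolic-array column with two weight shift chains (sketch)."""

    def __init__(self, n_cells):
        self.chain_0 = [0] * (n_cells // 2)  # assumed: serves cells 0, 2, 4, ...
        self.chain_1 = [0] * (n_cells // 2)  # assumed: serves cells 1, 3, 5, ...
        self.top_hold = None                 # embodiment 9: holding register at top

    def shift_pair(self, w0, w1):
        # Embodiment 10: with two values available, both shift in one
        # clock cycle, one value down each chain.
        self.chain_0 = [w0] + self.chain_0[:-1]
        self.chain_1 = [w1] + self.chain_1[:-1]

    def shift_one(self, w):
        # Embodiment 11: with only one value available, hold it and do
        # not shift; on the next cycle, shift the held and new values.
        if self.top_hold is None:
            self.top_hold = w
        else:
            held, self.top_hold = self.top_hold, None
            self.shift_pair(held, w)


col = Column(n_cells=8)
w = unpack_four_int8(struct.pack("4b", 1, -2, 3, -4))
col.shift_pair(w[0], w[1])  # two of the four weights enter at the top...
# ...embodiments 12 and 14: the other two would enter at a second
# injection point partway down the column (not modeled here).
```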
Claims (11) [0001] 1. A matrix multiplication unit implemented as a systolic array of cells (750), the matrix multiplication unit characterized in that each cell of the array of cells comprises: a weight matrix register (725) configured to receive a weight input from either a transposed (705) or an untransposed (706) weight shift register; a transposed weight shift register (705) configured to receive a weight input from a horizontal direction to be stored in the weight matrix register (725); an untransposed weight shift register (706) configured to receive a weight input from a vertical direction to be stored in the weight matrix register (725); and a multiply unit that is coupled to the weight matrix register (725) and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result. [0002] 2. The matrix multiplication unit of claim 1, characterized in that each cell further comprises: a multiplexer (730) configured to select between the weight input of the transposed weight shift register (705) and that of the untransposed weight shift register (706) and to forward the selected weight input to the weight matrix register (725). [0003] 3. The matrix multiplication unit of claim 1 or 2, characterized in that it further comprises a first weight holding register configured to hold a weight value from either the transposed weight shift register (705) or the untransposed weight shift register (706). [0004] 4. The matrix multiplication unit of claim 3, characterized in that it further comprises a second weight holding register configured to hold a weight value from either the transposed weight shift register (705) or the untransposed weight shift register (706). [0005] 5. The matrix multiplication unit of claim 4, characterized in that a weight value is loaded from a transposed weight shift register (705) into the first weight holding register and a weight value is loaded from a vertical direction into the second weight holding register. [0006] 6. The matrix multiplication unit of claim 5, characterized in that the weight matrix register (725) is loaded with a value from either the first or the second weight holding register. [0007] 7. The matrix multiplication unit of claim 6, characterized in that, once the data is in the weight matrix register (725), the data is used in any number of cycles of multiplications. [0008] 8. The matrix multiplication unit of claim 7, characterized in that, during the number of cycles of multiplications, more weights are shifted into the weight shift registers in the background in preparation for a next set of multiplications. [0009] 9. The matrix multiplication unit of claim 7, characterized in that, during the number of cycles of multiplications, the weight input of the weight matrix register is multiplied with a vector data input in order to obtain a multiplication result. [0010] 10. The matrix multiplication unit of claim 1, characterized in that the vector data input moves by one multi-cell per clock cycle. [0011] 11. The matrix multiplication unit of claim 1, characterized in that weights are shifted when instructions to do so are received.