Patent abstract:
The present invention relates to a deep neural network ("DNN") module that can compress and decompress activation data generated by neurons in order to reduce memory bus bandwidth utilization. The compression unit can receive an uncompressed block of data generated by a neuron in the DNN module. The compression unit generates a mask part and a data part of a compressed output block. The mask part encodes the presence and location of the zero and non-zero bytes in the uncompressed data block. The data part stores truncated non-zero bytes of the uncompressed block of data. A decompression unit can receive a compressed block of data from memory in the DNN processor or from memory of an application host. The decompression unit decompresses the compressed data block using the mask part and the data part. This can reduce memory bus utilization, allow a DNN module to complete processing operations more quickly and reduce power consumption.
Publication number: BR112019021541A2
Application number: R112019021541-7
Filing date: 2018-04-16
Publication date: 2020-05-12
Inventors: Leon Corkery Joseph; Eliot Lundell Benjamin; Marvin Wall Larry; Balling Mcbride Chad; Ashok Ambardekar Amol; Petre George; D. Cedola Kent; Bobrov Boris
Applicant: Microsoft Technology Licensing, LLC
IPC main classification:
Patent description:

Invention Patent Report for NEURAL NETWORK PROCESSOR USING COMPRESSION AND DECOMPRESSION OF ACTIVATION DATA TO REDUCE MEMORY BANDWIDTH UTILIZATION.
Background
[001] Deep neural networks (DNNs) are loosely modeled after the information processing and communication patterns of biological nervous systems, such as the human brain. DNNs can be used to solve complex classification problems such as, but not limited to, object detection, semantic labeling and feature extraction. As a result, DNNs form the basis for many artificial intelligence applications, such as computer vision, speech recognition and machine translation. DNNs can match or exceed human accuracy in many of these domains.
[002] The high level of performance of DNNs stems from their ability to extract high-level features from input data after using statistical learning on a large data set to obtain an effective representation of an input space. However, the superior performance of DNNs comes at the cost of high computational complexity. High-performance general-purpose processors, such as graphics processing units (GPUs), are commonly used to provide the high level of computational performance required by many DNN applications.
[003] Although general-purpose processors, such as GPUs, can provide a high level of computational performance for implementing DNNs, these types of processors are typically not suitable for use when performing DNN operations over long durations on computing devices where low power consumption is critical. For example, general-purpose processors such as
GPUs may not be suitable for use when performing long-running DNN tasks on portable, battery-powered devices, such as smartphones or augmented/virtual reality (AR/VR) devices, where reduced power consumption is required to extend battery life.
[004] Reduced power consumption while performing continuous DNN tasks, such as detecting the movement of a person, can also be important in non-battery-powered devices, such as a power-over-Ethernet (POE) security camera, for example. In this specific example, POE switches can supply only a limited amount of power, and reducing the power consumption of POE devices such as security cameras permits the use of POE switches that provide less power.
[005] Application specific integrated circuits (ASICs) have been developed that can provide high performance DNN processing while at the same time reducing energy consumption when compared to general purpose processors. Despite advances in this area, however, there is a continuing need to improve performance and reduce energy consumption for ASICs that perform DNN processing, particularly for use in computing devices where low power consumption is critical.
[006] It is with respect to these and other technical challenges that the disclosure made in this document is presented.
Summary
[007] A DNN module, or DNN processor, is described that can compress and decompress activation data to reduce the use of memory bus bandwidth. In particular, the DNN module can use compression to reduce the utilization of memory bus bandwidth between neuron output and internal or external memory. The DNN module can also use decompression to reduce the memory bus bandwidth utilization between internal or external memory and neuron input. Reduced bandwidth utilization can enable faster processing and, consequently, can also reduce energy consumption. Other technical benefits not specifically mentioned in this document can also be realized through implementations of the subject matter in question.
[008] In order to realize the technical benefits mentioned briefly above, a DNN processor is described that includes one or more neurons and a compression unit. The compression unit can receive an uncompressed block of data generated by one or more of the neurons. The uncompressed block of data includes a fixed number of bytes, such as 64 bytes, in some embodiments.
[009] In order to compress the uncompressed data block, the compression unit can generate a mask part and a data part of a compressed output block. The mask portion of the compressed output block includes a number of bits equivalent to the fixed number of bytes in the uncompressed data block. For example, if the uncompressed block of data includes 64 bytes of data, the mask portion will include 64 bits (that is, 8 bytes).
[010] Each bit in the mask part of the compressed output block corresponds to a byte in the uncompressed data block in some embodiments. For example, bit one of the mask part can correspond to the first byte in the uncompressed data block, bit two of the mask part can correspond to the second byte in the uncompressed data block, and so on. In other embodiments, two or more bits in the mask part of the compressed output block correspond to a byte in the uncompressed block of data. In these embodiments, the bits in the mask part of the compressed output block can indicate not only that there is a corresponding non-zero byte in the uncompressed block, but also its approximate magnitude.
[011] When individual bits of the mask part correspond to bytes in the uncompressed block, the compression unit sets each bit in the mask part of the compressed output block to a logical false (which can also be referred to in this document as a logical zero) where a corresponding byte in the uncompressed data block contains all zeros (that is, a zero byte). The compression unit also sets each bit in the mask part of the compressed output block to a logical true (which can also be referred to in this document as a logical one) where a corresponding byte in the uncompressed data block contains at least one non-zero bit (that is, a non-zero byte). In this way, the mask part of the compressed output block encodes the presence and location of the zero and non-zero bytes in the uncompressed data block.
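A minimal software sketch of the mask-generation step just described, assuming a 64-byte uncompressed block and one mask bit per byte; the function name and the Python representation are illustrative only and do not reflect the hardware design:

    # Illustrative sketch only: one mask bit per input byte, 1 (logical true) for a
    # non-zero byte and 0 (logical false) for a zero byte, yielding 64 bits (8 bytes)
    # of mask for a 64-byte uncompressed block.
    def build_mask(uncompressed_block: bytes) -> list[int]:
        assert len(uncompressed_block) == 64
        return [1 if byte_value != 0 else 0 for byte_value in uncompressed_block]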
[012] The compression unit generates the data part of the compressed output block by determining the number of non-zero bytes in the uncompressed data block. The compression unit then determines, based on the number of non-zero bytes in the uncompressed data block and the number of bytes available in the data part of the compressed output block, the number of bits in the data part of the output block that are available to store each non-zero byte of the uncompressed block of data. For example, if the data part of the compressed data block is 24 bytes in size (that is, 192 bits) and there are 47 non-zero bytes in the uncompressed data block, four bits are available in the data part to store each non-zero byte of the uncompressed data block.
[013] In some embodiments, the compression unit can also determine the number of additional bits, if any, in the data part of the compressed output block that are available to store non-zero bytes of the uncompressed data block. In the example given above, for instance, four additional bits are available to store non-zero bytes (that is, 192 mod 47 = four bits). The compression unit may assign these additional bits to one or more of the non-zero bytes in the uncompressed block of data before truncating the one or more of the non-zero bytes. For example, the compression unit can assign these additional bits to the first few non-zero bytes stored in the data part of the compressed output block.
[014] The compression unit then truncates the non-zero bytes in the uncompressed data block to the determined number of bits available in the data part to store each non-zero byte (that is, four in the example given above). The compression unit truncates the least significant bits (LSBs) of the non-zero bytes to fit within the available number of bits in the data part in one embodiment. In another embodiment, the compression unit truncates the most significant bits (MSBs) of the non-zero bytes. The compression unit then stores the truncated non-zero bytes in the data part of the compressed output block. The compressed output block, including the mask part and the data part, can then be sent, for example, to internal memory in the DNN processor or to external memory of an application host of the DNN processor.
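The following Python sketch ties the preceding paragraphs together under the example parameters used above (a 64-byte input block and a 24-byte, 192-bit data part). It is a simplified software model, not the hardware compression unit; the function name is hypothetical, LSB truncation is assumed, and the data part is represented as a list of (truncated value, bit width) pairs rather than a packed bit stream:

    # Simplified model of the compression scheme described above (not the hardware design).
    def compress_block(uncompressed_block: bytes) -> tuple[list[int], list[tuple[int, int]]]:
        DATA_BITS = 24 * 8  # 192 bits available in the data part
        mask = [1 if b != 0 else 0 for b in uncompressed_block]
        nonzero = [b for b in uncompressed_block if b != 0]

        if len(nonzero) <= 24:
            # Every non-zero byte fits in the data part without truncation.
            return mask, [(b, 8) for b in nonzero]

        bits_per_byte = DATA_BITS // len(nonzero)   # e.g. 192 // 47 = 4
        extra_bits = DATA_BITS % len(nonzero)       # e.g. 192 %  47 = 4

        data = []
        for index, b in enumerate(nonzero):
            width = bits_per_byte + (1 if index < extra_bits else 0)
            data.append((b >> (8 - width), width))  # keep the MSBs, truncate the LSBs
        return mask, data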
[015] The DNN module can also include a decompression unit that can decompress blocks of data that have been compressed in the manner described above. For example, the decompression unit can receive a compressed block of data from memory in the DNN processor or from memory of an application host. The decompression unit can then determine the number of non-zero bytes in the data part of the uncompressed data block based on the number of logically true bits in the mask part of the compressed output block. The decompression unit can also determine the locations of the non-zero bytes in the uncompressed data block based on the locations of the logically true bits in the mask part of the compressed output block. The decompression unit can determine the locations of the zero bytes in the uncompressed block of data in a similar way.
[016] The decompression unit can also determine the number of bits used by the compression unit to store the truncated nonzero bytes in the data portion of the compressed output block. The decompression unit can determine the number of bits used to store each truncated nonzero byte based on the number of nonzero bytes in the compressed data block and the number of bytes available in the data portion of the uncompressed output block.
[017] In the example given above, for instance, if the data part of the compressed data block is 24 bytes in size (that is, 192 bits) and there are 47 non-zero bytes in the uncompressed data block, the compression unit uses four bits to store each truncated non-zero byte of the uncompressed block of data in the data part. The decompression unit can also determine the number of additional bits, if any, that the compression unit allocated to one or more of the truncated non-zero bytes stored in the data part of the compressed output block.
[018] For each bit position in the mask part of the compressed output block that is a logical zero, the decompression unit inserts a zero byte at the corresponding position of the decompressed output block. For each bit position in the mask part that is a logical one, the decompression unit inserts the truncated non-zero byte from the corresponding position of the compressed input block at the corresponding position of the decompressed output block, along with a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block. The zero bits can be inserted into the LSBs or MSBs of the truncated non-zero bytes, depending on which bits were truncated during compression.
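A counterpart sketch of the decompression path described in the preceding paragraphs, again a simplified software model under the same assumptions as the compression sketch above (a 64-byte block, LSB truncation and a data part represented as (truncated value, bit width) pairs); the names are hypothetical:

    # Simplified model of the decompression path described above (not the hardware design).
    def decompress_block(mask: list[int], data: list[tuple[int, int]]) -> bytes:
        output = bytearray(64)
        next_value = iter(data)
        for position, mask_bit in enumerate(mask):
            if mask_bit == 1:
                value, width = next(next_value)
                output[position] = value << (8 - width)  # pad the truncated LSBs with zeros
            # mask_bit == 0: the byte at this position stays zero
        return bytes(output)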
[019] In some embodiments, the decompression unit also adds an offset (for example, 00000001) to one or more of the truncated non-zero bytes stored in the decompressed output block. For example, an offset can be added to the non-zero bytes of the uncompressed block of data that become zero bytes after compression. In this way, non-zero bytes will not become zero bytes when compressed and decompressed. An offset can be added to all bytes in the decompressed output block in other embodiments.
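A brief usage example building on the compress_block and decompress_block sketches above, showing one round trip and the optional offset described in the preceding paragraph. The example block and the offset value 00000001 are illustrative values drawn from the examples in this document:

    # Hypothetical round trip over one 64-byte block with 47 non-zero bytes.
    block = bytes([0, 0, 113, 121] + [1] * 44 + [0] * 15 + [2])
    mask, data = compress_block(block)
    restored = bytearray(decompress_block(mask, data))

    # Optional offset: any byte that was non-zero before compression but decompressed
    # to zero is nudged back to a non-zero value (offset 00000001 in this example).
    for position, mask_bit in enumerate(mask):
        if mask_bit == 1 and restored[position] == 0:
            restored[position] = 0x01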
[020] As briefly discussed above, implementations of the technologies described in this document can reduce memory bus bandwidth utilization in a DNN module, allow a DNN module to complete processing operations more quickly and reduce power consumption. Other technical benefits not specifically identified in this document can also be realized through implementations of the technologies described.
[021] It should be appreciated that the subject matter described above can be implemented as a computer-controlled device, a computer-implemented method, a computing device or as an article of manufacture such as a computer-readable medium. These and various other features will be apparent from a reading of the Detailed Description below and an analysis of the associated drawings.
[022] This Summary is provided to introduce, in simplified form, a brief description of some aspects of the technologies described that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. In addition, the claimed subject matter is not limited to implementations that address some or all of the disadvantages noted elsewhere in this description.
Brief Description of Drawings
[023] Figure 1 is a computer architecture diagram that shows aspects of the configuration and operation of a DNN module that implements aspects of the technologies described in this document, according to one embodiment;
[024] Figures 2A and 2B are computing system architecture diagrams showing aspects of the configuration and operation of a DNN module for compressing activation data, according to one embodiment;
[025] Figure 3 is a data structure diagram that illustrates aspects of the operation of a DNN module for compressing activation data with reference to an example block of uncompressed activation data, according to one embodiment;
[026] Figure 4 is a flowchart showing a routine that illustrates aspects of the operation of the DNN module described for compressing activation data, according to one embodiment described in this document;
[027] Figures 5A and 5B are computing system architecture diagrams showing aspects of the configuration and operation of a DNN module for decompressing activation data, according to one embodiment;
[028] Figure 6 is a data structure diagram that illustrates aspects of the operation of a DNN module for decompressing activation data with reference to an example block of compressed activation data, according to one embodiment;
[029] Figure 7 is a flowchart showing a routine that illustrates aspects of the operation of the DNN module described for decompressing activation data, according to one embodiment described in this document;
[030] Figure 8 is a computer architecture diagram showing an illustrative computer hardware and software architecture for a computing device that can act as an application host for the DNN module presented in this document, according to one embodiment; and
[031] Figure 9 is a network diagram illustrating a distributed computing environment in which aspects of the described technologies can be implemented, according to the various embodiments presented in this document.
Detailed Description
[032] The following detailed description is directed to a DNN module that can compress and decompress activation data to reduce memory bus bandwidth utilization. As briefly discussed above, implementations of the described technologies can reduce memory bus bandwidth utilization in a DNN module, allow a DNN module to complete processing operations more quickly and reduce energy consumption. Other technical benefits not specifically mentioned in this document can also be realized through implementations of the subject matter in question.
[033] Although the subject matter described in this document is presented in the general context of a hardware DNN module, those skilled in the art will recognize that other implementations can be performed in combination with other types of computing systems and modules. Those skilled in the art will also appreciate that the subject matter described in this document can be practiced with other computer system configurations, including handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, computing or processing systems embedded in devices (such as wearable computing devices, automobiles, home automation and so on), minicomputers, mainframe computers and more.
[034] As will be described in more detail below, a DNN module is described that is configured to compress the output of its neurons. The compressed output can be stored in memory in the DNN module or in memory that is external to the DNN module, such as memory provided by an application host of the DNN module. The DNN module can later decompress the previously compressed data and provide the decompressed data to the neurons.
[035] According to one embodiment, a compression unit in the DNN processor compresses fixed-length blocks (for example, 64 bytes) of uncompressed activation data at a fixed compression ratio (for example, 2:1). The compressed activation data generated by the compression unit can include blocks of data having a fixed length (for example, 32 bytes), which include a fixed-length mask part (for example, 8 bytes) and a fixed-length data part (for example, 24 bytes).
[036] The bits of the mask part of a compressed output block correspond to bytes within an uncompressed input block in one embodiment. For example, the first bit of a mask part can correspond to the first byte in an uncompressed input block, the second bit of the mask part can correspond to the second byte in the uncompressed input block, and so on. Bits in the mask part of the compressed activation data can be set to a logical zero if the corresponding byte in the uncompressed input block is a zero byte and can be set to a logical one if the corresponding byte in the uncompressed input block is a non-zero byte.
[037] As briefly discussed above, two or more bits in the mask part of the compressed output block correspond to a byte in the uncompressed data block in some embodiments. In these embodiments, the bits in the mask part of the compressed output block can indicate not only that there is a corresponding non-zero byte in the uncompressed block, but also its approximate magnitude.
[038] The data part of a compressed output block includes the non-zero bytes of an uncompressed input block that have been truncated so as to represent the non-zero bytes of the input block using the number of bits available in the compressed data. The number of bits available in the data part of the compressed output block for each non-zero byte is determined in some embodiments by dividing the total number of bits available in the data part (for example, 192 bits) by the number of non-zero bytes in the uncompressed input block. The result of this computation indicates the number of bits in the data part of the compressed output block that are available to represent each byte of non-zero data in the uncompressed input block. Any remaining bits can be used to provide an additional bit for representing some of the non-zero values in the data part of the compressed output block.
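As a worked instance of the division described in the preceding paragraph, assuming the 192-bit data part and 47 non-zero bytes used in the examples elsewhere in this document (the values and names below are illustrative only):

    # Illustrative arithmetic only: 192 bits shared among 47 non-zero bytes.
    data_bits, nonzero_count = 192, 47
    bits_per_nonzero = data_bits // nonzero_count  # 4 bits for each non-zero byte
    remaining_bits = data_bits % nonzero_count     # 4 extra bits for the first few bytes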
[039] Once the number of bits available in the data part of the compressed output block to represent each non-zero byte in the uncompressed input block has been determined, the LSBs of the non-zero values in the uncompressed input block are truncated to fit within the available number of bits. The MSBs of the non-zero values can be truncated in other embodiments. The truncated non-zero values can then be stored in the data part of the compressed output block. This process can be repeated for each block of uncompressed input activation values. The compressed output blocks can then be stored in memory internal or external to the module for later decompression and use by the neurons.
[040] The DNN module described can also include a decompression unit for decompressing activation values that have been compressed by the compression unit in the manner previously described. The decompression unit receives compressed blocks of activation data that include a mask part and a data part. The decompression unit can use the bits of the mask part to identify the number of non-zero bytes that will be present in a decompressed output block and their locations within the decompressed output block. The mask also indicates the locations of the zero bytes in the decompressed output block.
[041] In some embodiments, the decompression unit determines the number of bits that was used by the compression unit to represent each non-zero byte by dividing the total number of bits available in the data part (for example, 192 bits) of a compressed block by the number of non-zero bytes in the uncompressed input block as specified by the mask. The decompression unit can also assume that the compression unit used any remaining bits to provide an additional bit for representing some of the non-zero values in the data part of the compressed block (for example, the first N values).
[042] For each bit position in the mask that is a logical zero, the decompression unit can insert a zero byte into the decompressed output block at its corresponding position. For each bit position in the mask that is a logical one, the decompression unit inserts the truncated non-zero byte from the corresponding position in the data part of the compressed input block at the corresponding position in the decompressed output block. The decompression unit also inserts zeros into the LSBs, or MSBs as appropriate, of the non-zero values to replace the bits that were truncated during compression.
[043] In some embodiments, the decompression unit adds an offset value to the truncated non-zero values to ensure that non-zero values do not become zero bytes when decompressed. The decompressed output block can then be stored in memory internal or external to the module for use by the neurons. Additional details regarding the operation of the DNN module, the compression unit and the decompression unit are provided below.
[044] In the following detailed description, references are made to the accompanying drawings that form a part of this document and that are shown by way of illustration of specific configurations or examples. Referring now to the drawings, in which like numbers represent like elements across the various figures, aspects of a DNN module that can compress and decompress activation data to reduce memory bus bandwidth utilization will be described.
[045] Figure 1 is a computer architecture diagram that shows aspects of the configuration and operation of a DNN 105 module that implements the technologies described in this document, according to one embodiment. The DNN 105 module described in this document is configured in some embodiments to solve classification problems (and related problems) such as, but not limited to, object detection, semantic labeling and feature extraction.
[046] In order to provide this functionality, the DNN 105 module can implement a recall-only neural network and programmatically support a wide variety of network structures. Training for the network implemented by the DNN 105 module can be performed offline on a set of servers, in a data center or in another suitable computing environment. The result of training a DNN is a set of parameters that can be known as weights or kernels. These parameters represent a transformation function that can be applied to an input, with the result being a classification or semantically labeled output.
[047] The DNN 105 module described in this document can be considered a superscalar processor. The DNN 105 module can dispatch one or more instructions to multiple execution units, called 105F neurons. The execution units can be simultaneous dispatch, simultaneous completion, where each execution unit is synchronized with each of the other execution units. The DNN 105 module can be classified as a single instruction stream, multiple data stream (SIMD) architecture.
[048] The DNN 105 module includes a number of 105F neurons (for example, a power of two). A 105F neuron is the base unit in artificial neural networks that is used to model a biological neuron in the brain. The 105F neuron model can include the inner product of an input vector with a weight vector, added to a bias, with an activation function applied. The processing performed by a 105F neuron in the DNN 105 module described in this document is tightly mapped to an artificial neuron.
[049] Each 105F neuron in the DNN 105 module is capable of performing weighted sum, max pooling, bypass and potentially other types of operations. The 105F neurons process input and weight data with each clock cycle. Each 105F neuron is synchronized with all other 105F neurons in terms of progress within a kernel in order to minimize the flow of kernel data within the DNN 105 module.
[050] Each 105F neuron can contain a multiplier, an adder, a comparator and several accumulators (not shown in figure 1). By having multiple accumulators, the 105F neurons are able to maintain context for multiple different active kernels at a time. Each accumulator is capable of being loaded from a read of the BaSRAM 150 (described below). The accumulators can sum themselves with the contents of other accumulators from other 105F neurons.
[051] The DNN 105 module accepts planar data as input, such as image data. Input to the DNN 105 module, however, is not limited to image data. In particular, the DNN 105 module can operate on any input data presented to the DNN 105 module in a uniform planar format. In a particular embodiment, the DNN 105 module can accept single-byte or two-byte multiplanar data frames as input.
[052] Each input frame can be convolved with an NxKxHxW set of kernels, where N is the number of kernels, K is the number of channels per kernel, H is the height and W is the width. Convolution is performed over overlapping intervals across the input data, where the interval is defined by strides in the X and Y directions. These functions are performed by the 105F neurons and managed by the DNN 105 module and software-visible control registers.
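As an informal illustration of the kernel layout and strides described in the preceding paragraph, the following sketch computes the spatial size of a convolution output using the conventional sliding-window arithmetic; the formula and the names are assumptions made for illustration and are not taken from the DNN 105 module itself:

    # Conventional sliding-window output size for strides in the X and Y directions.
    def conv_output_size(in_h: int, in_w: int, kernel_h: int, kernel_w: int,
                         stride_y: int, stride_x: int) -> tuple[int, int]:
        out_h = (in_h - kernel_h) // stride_y + 1
        out_w = (in_w - kernel_w) // stride_x + 1
        return out_h, out_w

    # Example: a 224 x 224 input convolved with H = W = 3 kernels at stride 1.
    print(conv_output_size(224, 224, 3, 3, 1, 1))  # -> (222, 222)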
[053] The DNN 105 module supports three main data types: weights; input/feature maps; and activation data. Input/feature maps and activation data are, in many cases, two names for the same data, with the distinction that, when referring to the output of a layer, the term activation data is used. When referring to the input of a layer, the term input/feature maps is used.
[054] The 105F neurons in the DNN 105 module compute a weighted sum of their inputs and pass the weighted sum through an activation function or transfer function. The transfer function commonly has a sigmoid shape, but it can also take the form of a piecewise linear function, a step function or another type of function. The activation function allows the 105F neurons to be trained for a larger set of desired inputs and outputs where the classification boundaries are non-linear.
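As a small illustration of the computation described in the preceding paragraph, the following sketch models a single neuron in software as a weighted sum passed through a sigmoid transfer function; it is a mathematical illustration only, not a description of the hardware 105F neuron:

    import math

    # Weighted sum of the inputs plus a bias, passed through a sigmoid transfer function.
    def neuron_output(inputs: list[float], weights: list[float], bias: float) -> float:
        weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid-shaped activation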
[055] The DNN 105 module operates on a list of layer descriptors that correspond to the layers of a neural network. The list of layer descriptors can be treated by the DNN 105 module as instructions. These descriptors can be prefetched into the DNN 105 module and executed in order. The descriptor list acts as a set of instructions for the DNN 105 module. Tools and/or software compilers can be run on devices external to the DNN 105 module to create the descriptor list that is executed on the DNN 105 module.
[056] In general, there can be two main classes of descriptors: memory-to-memory move (M2M) descriptors; and operation descriptors. M2M descriptors can be used to move data to/from main memory to/from a local buffer (that is, the line buffer 125 described below) for consumption by the operation descriptors. M2M descriptors follow a different execution thread than the operation descriptors. The target thread for M2M descriptors can be the internal DMA engine 105B or the configuration registers 105G, while the target thread for the operation descriptors can be the 105F neurons.
[057] Operation descriptors specify a specific operation that the 105F neurons must perform on a data structure located in local static random access memory (SRAM). Operation descriptors are processed in order and are capable of performing many different layer operations, at least some of which are described in this document.
[058] As shown in figure 1, the DNN 105 module has a memory subsystem with a unique L1 and L2 buffer structure. The L1 and L2 buffers shown in figure 1 are designed specifically for neural network processing. For example, the L2 buffer 150 can maintain a selected storage capacity with a high-speed private interface operating at a selected frequency. The L1 buffer 125 can maintain a selected storage capacity that can be divided between kernel and activation data. The L1 buffer 125 can be referred to in this document as the line buffer 125, and the L2 buffer 150 can be referred to in this document as the BaSRAM 150.
[059] The computational data (that is, the input data, weights and activation data) is stored in the BaSRAM 150 in row-major order in some embodiments. The computational data can be organized as two line buffers, where one line buffer contains input data, which can be referred to in this document as the input buffer, and the other line buffer, which can be referred to in this document as the weight buffer, contains kernel weights. The line buffers are filled from the BaSRAM 150 by the load/store unit 105C. Data is accumulated in each line buffer until it has reached its predetermined capacity. The line buffer data is then copied to a parallel buffer in some embodiments and presented to the 105F neurons.
[060] The DNN 105 module can also comprise several other components including, but not limited to, a prefetch unit 105A, a save/restore unit 105E, a layer controller 105D and a register interface 105G. The DNN 105 module can include additional or alternative components in some embodiments.
[061] The DNN 105 module operates in combination with other external computing components in some configurations. For example, the DNN 105 module is connected to a host application processor system on chip (the host SoC) 130 in some embodiments. The DNN 105 module can be connected to the host SoC 130 through a PCIe interface, for example. Appropriate PCIe components, such as the PCIe endpoint 135, can be used to enable these connections.
[062] The host SoC 130 serves as the application processor for the DNN 105 module. The main operating system, applications and auxiliary sensor processing are performed by the host SoC 130. The host SoC 130 can also be connected to an input data source 102, such as an external camera, which provides input data, such as image data, to the DNN 105 module.
[063] DDR DRAM 155 can also be connected to the host SoC 130 and can be used as the main system memory. This memory is accessible by the host SoC 130 through the high-bandwidth fabric 120 (for example, a PCIe bus) by way of a memory controller 145. The high-bandwidth fabric 120 provides bidirectional direct memory access (DMA) small-message transactions and larger DMA transactions. A bridge 115 and a low-bandwidth fabric 110 can connect the DNN 105 module to the host SoC 130 for sub-module configuration and other functions.
[064] The DNN 105 module can include a DMA engine 105B that is configured to move data to and from main memory 155. The DMA engine 105B has two channels in some embodiments. One channel is dedicated to fetching operation descriptors while the other channel is dedicated to M2M operations. A DMA descriptor can be embedded in the M2M descriptor. Descriptors in this context are DMA descriptors that are used to move the contents of memory, and should not be confused with the operation descriptors described above.
[065] To relieve the local BaSRAM memory 150, and to provide more space for input data and weight data, the activation output can optionally be streamed directly to DDR memory 155. When streaming data to DDR memory 155, the DNN 105 module will accumulate enough data for a burst transaction on the high-bandwidth fabric 120 and will buffer enough transactions to minimize back pressure on the 105F neurons. Additional details regarding the operation of the DNN 105 module are provided below.
[066] Figures 2A and 2B are computing system architecture diagrams showing aspects of the configuration and operation of the DNN 105 module for compressing activation data, according to one embodiment. As shown in figure 2A and discussed briefly earlier, the DNN 105 module includes one or more 105F neurons and a compression unit 200. The compression unit 200 is implemented by the load/store unit 105C in some embodiments, but may be implemented in other ways in other embodiments.
[067] The compression unit 200 can receive an uncompressed block of activation data 202 generated by one or more of the 105F neurons. The uncompressed data block 202 includes a fixed number of bytes, such as 64 bytes, in some embodiments.
[068] The compression unit 200 can compress the uncompressed block of data 202 to generate a compressed block of activation data 204. The compressed block of activation data 204 can then be stored in memory 206. For example, the compressed block of activation data 204 can be stored in the LPDDR4 memory 155 provided by the application host or can be stored in the BaSRAM 150 provided by the DNN 105 module. As will be described in more detail below, the technologies described in this document can use compression and decompression to reduce memory bus utilization when storing or retrieving compressed or decompressed activation data in the LPDDR4 memory 155 or the BaSRAM 150. Additional details regarding these technologies are described below with reference to figures 2A-9.
[069] As shown in figure 2B, the compression unit 200 can generate a mask part 208 and a data part 210 of a compressed data output block 204. The mask part 208 of the compressed output block 204 includes a number of bits equivalent to the fixed number of bytes in the uncompressed data block 202. For example, if the uncompressed data block 202 includes 64 data bytes, the mask portion 208 of the compressed output block 204 will include 64 bits ( that is, 8 bytes).
[070] Each bit in the mask portion 208 of the compressed output block 204 corresponds to a byte in the uncompressed data block 202 in some embodiments. For example, bit one of mask part 208 can correspond to the first byte in uncompressed data block 202, bit two of mask part 208 can correspond to the second byte in uncompressed data block 202, and so on.
[071] The compression unit 200 sets each bit in the mask part 208 of the compressed output block 204 to a logical zero where a corresponding byte in the uncompressed data block 202 is a zero byte. The compression unit 200 also sets each bit in the mask part 208 of the compressed output block 204 to a logical one where a corresponding byte in the uncompressed data block 202 is a non-zero byte. In this way, the mask part 208 of the compressed output block 204 encodes the presence and location of the zero and non-zero bytes in the uncompressed data block 202.
[072] The compression unit 200 generates the data portion 210 of the compressed output block 204 by determining the number of non-zero bytes in the uncompressed data block 202. The compression unit 200 then determines, based on the number of nonzero bytes in the uncompressed data block 202 and the number of bytes available in the data portion 210 of the compressed output block 204, the number of bits in the data portion 210 of the compressed output block 204 that are available to store each nonzero byte of uncompressed data block 202. For example, if data portion 210 of compressed data block 204 is 24 bytes in size (ie 192 bits) and there are 47 nonzero bytes in the block uncompressed data 202, four bits will be available in data portion 210 to store each nonzero byte of uncompressed data block 202.
[073] In some embodiments, the compression unit 200 can also determine the number of additional bits, if any, in the data part 210 of the compressed output block 204 that are available to store non-zero bytes of the uncompressed data block 202. In the example given above, four additional bits are available to store non-zero bytes (that is, 192 mod 47 = four bits). The compression unit 200 may assign these additional bits to one or more of the non-zero bytes in the uncompressed data block 202 before truncating the one or more of the non-zero bytes. For example, the compression unit 200 may assign these additional bits to the first N bytes in the data part 210 of the compressed output block 204.
[074] The compression unit 200 then truncates the non-zero bytes in the uncompressed data block 202 to the determined number of bits available in the data part 210 to store each non-zero byte (that is, four in the example given above). The compression unit 200 truncates the LSBs of the non-zero bytes to fit within the available number of bits in the data part 210 in one embodiment. In another embodiment, the compression unit 200 truncates the MSBs of the non-zero bytes. The compression unit 200 then stores the truncated non-zero bytes in the data part 210 of the compressed output block 204. The compressed output block 204, including the mask part 208 and the data part 210, can then be sent, for example, to internal memory in the DNN 105 module or to external memory of an application host of the DNN 105 module. Additional details regarding the compression process described above are provided below with reference to figures 3 and 4.
[075] As briefly discussed above, two or more bits in the mask part 208 of the compressed output block 204 correspond to a byte in the uncompressed data block 202 in some embodiments. In these embodiments, the bits in the mask part 208 of the compressed output block 204 may indicate not only that there is a corresponding non-zero byte in the uncompressed block 202, but also its approximate magnitude. For example, and without limitation, the mask part 208 may include two bits per byte in the uncompressed data block 202. In this example, 00 may indicate that the corresponding value in the uncompressed data block 202 is zero, 01 may indicate that the value is <64, 10 may indicate that the value is <128, and 11 may indicate that the value is 128 or greater. These values can be used to identify which MSBs of the bytes in the uncompressed data block 202 can be truncated. For example, if the value of a particular byte is <64, then the top two MSBs can be truncated without data loss.
[076] Figure 3 is a data structure diagram illustrating aspects of the operation of the DNN 105 module for compressing uncompressed blocks of activation data 202 with reference to an example uncompressed activation data block 202, according to one embodiment. In the example shown in figure 3, the uncompressed block of activation data 202 is 64 bytes in size. Bytes zero, one and 63 of the uncompressed activation data block 202 are zero bytes. Bytes two, three and 62 of the uncompressed activation data block 202 are non-zero bytes, storing the values 113, 121 and two, respectively. Bytes 4 to 61 of the example uncompressed activation data block 202 can store zero or non-zero bytes.
[077] As discussed earlier, the compression unit 200 can generate a mask part 208 that encodes the presence and location of the zero and non-zero bytes in the uncompressed activation data block 202. In this example, bits zero, one and 63 of the mask part 208 are set to logical zeros to indicate the presence of zero bytes at the corresponding locations in the uncompressed activation data block 202. Similarly, bits two, three and 62 of the mask part 208 are set to a logical one to indicate that bytes two, three and 62 of the uncompressed activation data block 202 store non-zero bytes.
[078] As discussed earlier, the compression unit 200 generates the data part 210 of the compressed output block 204 by determining the number of non-zero bytes in the uncompressed data block 202. In the example shown in figure 3, the uncompressed data block 202 includes 47 non-zero bytes (not all of which are shown in figure 3). The compression unit 200 then determines, based on the number of non-zero bytes in the uncompressed data block 202 and the number of bytes available in the data part 210 of the compressed output block 204, the number of bits in the data part 210 of the compressed output block 204 that are available to store each non-zero byte of the uncompressed data block 202.
[079] In the example shown in figure 3, the data part 210 of the compressed data block 204 is 24 bytes in size (that is, 192 bits) and there are 47 non-zero bytes in the uncompressed data block 202. As a result, four bits are available in the data part 210 to store each non-zero byte of the uncompressed data block 202 (that is, 192/47 = 4 with a remainder of 4).
[080] As also discussed earlier, the compression unit 200 can also determine the number of additional bits, if any, in the data part 210 of the compressed output block 204 that are available to store non-zero bytes of the uncompressed block of data 202. In the example shown in figure 3, four additional bits are available to store non-zero bytes (that is, 192 mod 47 = four bits). The compression unit 200 may assign these additional bits to one or more of the non-zero bytes in the uncompressed data block 202 before truncating the one or more of the non-zero bytes. In the example shown in figure 3, one of the four additional bits is assigned to each of the first four non-zero bytes in the uncompressed activation data block 202. As a result, the first four non-zero bytes of the uncompressed activation data block 202 will be truncated to five bits instead of four.
[081] The compression unit 200 then truncates the non-zero bytes in the uncompressed data block 202 to the determined number of bits available in the data part 210 to store each non-zero byte (that is, five bits for the first four non-zero bytes in the example given above). In the example shown in figure 3, the compression unit 200 truncates the LSBs of the non-zero bytes to fit within the available number of bits in the data part 210 (that is, four or five bits in this example) in one embodiment. In another embodiment, the compression unit 200 truncates the MSBs of the non-zero bytes.
[082] As shown in figure 3, byte two of the uncompressed activation data block 202 stores the value 113 (01110001). Because five bits have been assigned to the first four non-zero values in the uncompressed activation data block 202, the three LSBs of this value are truncated, resulting in the value 01110 being stored at the first location in the data part 210. Byte three of the uncompressed activation data block 202 stores the value 121 (01111001). Because five bits have been assigned to the first four non-zero values in the uncompressed activation data block 202, the three LSBs of this value are truncated, resulting in the value 01111 being stored at the second location in the data part 210.
[083] In the example shown in figure 3, byte 62 of the uncompressed activation data block 202 stores the value two (00000010). Because four bits are assigned to each of the non-zero values after the first four in the uncompressed activation data block 202, the four LSBs of this value are truncated, resulting in the value 0000 being stored at the corresponding location in the data part 210. The other non-zero bytes in the uncompressed activation data block 202 can be truncated and stored in the data part 210 of the compressed activation data block 204 in a similar manner.
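The truncation arithmetic of this walkthrough can be checked with a few lines of Python; the values and bit widths below come from the figure 3 example above and are illustrative only:

    # First four non-zero bytes keep five bits; the remaining non-zero bytes keep four.
    assert 113 >> 3 == 0b01110  # byte two: three LSBs truncated
    assert 121 >> 3 == 0b01111  # byte three: three LSBs truncated
    assert 2 >> 4 == 0b0000     # byte 62: value collapses to zero after truncation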
[084] Once all of the non-zero bytes in the uncompressed activation data block 202 have been stored in the data part 210, the compression unit 200 stores the compressed output block 204, including the mask part 208 and the data part 210, for example, in internal memory in the DNN 105 module or in external memory of an application host of the DNN 105 module. Further details regarding the compression process are provided below with reference to figure 4.
[085] Figure 4 is a flowchart showing a routine 400 that illustrates aspects of the operation of the DNN 105 module for compressing uncompressed blocks of activation data 202, according to one embodiment described in this document. It should be noted that the logical operations described in this document with reference to figure 4, and to the other figures, can be implemented (1) as a sequence of computer-implemented acts or program modules running on a computing device and/or (2) as interconnected machine logic circuits or circuit modules within a computing device.
[086] The particular implementation of the technologies described in this document is a matter of choice dependent on the performance and other requirements of the computing device. Accordingly, the logical operations described in this document are referred to variously as states, operations, structural devices, acts or modules. These states, operations, structural devices, acts and modules can be implemented in hardware, software, firmware, special-purpose digital logic and any combination thereof. It should be noted that more or fewer operations than those shown in the figures and described in this document can be executed. These operations can also be performed in an order other than that described in this document.
[087] The routine 400 starts at operation 402, where the compression unit 200 determines the number of non-zero bytes in the uncompressed activation data block 202. The routine 400 then proceeds to operation 404, where the compression unit 200 determines whether the number of non-zero bytes in the uncompressed activation data block 202 is equal to or less than the number of bytes available in the data part 210 of the compressed activation data block 204. The non-zero bytes of the uncompressed activation data block 202 do not need to be truncated if the number of non-zero bytes is equal to or less than the number of bytes available in the data part 210 of the compressed activation data block 204. Therefore, in this case, the routine 400 proceeds to operation 408, where the non-zero bytes are stored in the data part 210 without truncation.
[088] If the number of non-zero bytes in the uncompressed activation data block 202 is greater than the number of bytes available in the data part 210 of the compressed activation data block 204, the routine 400 proceeds from operation 406 to operation 412. At operation 412, the compression unit 200 determines the number of bits of the data part 210 of the compressed output block 204 available to store each truncated non-zero byte of the uncompressed activation data block 202, in the manner described above. The routine 400 then proceeds from operation 412 to operation 414.
[089] At operation 414, the compression unit 200 determines the number of additional bits, if any, in the data part 210 of the compressed output block 204 that are available to store non-zero bytes of the uncompressed data block 202. As discussed previously, the compression unit 200 may assign these additional bits to one or more of the non-zero bytes in the uncompressed data block 202 before truncating the one or more of the non-zero bytes. This occurs at operation 416.
[090] From operation 416, the routine 400 proceeds to operation 418, where the compression unit 200 sets bits in the mask part 208 of the compressed activation data block 204 to a logical one where the corresponding byte in the uncompressed activation block 202 is non-zero. The compression unit 200 also sets bits in the mask part 208 of the compressed activation data block 204 to a logical zero where the corresponding byte in the uncompressed activation block 202 is a zero byte.
[091] From operation 418, the routine 400 then proceeds to operation 420, where the compression unit 200 truncates the LSBs or MSBs of the non-zero bytes in the uncompressed data block 202 to the determined number of bits available in the data part 210 for each non-zero byte. The truncated non-zero bytes are then stored in the data part 210 of the compressed activation data block 204. The compression unit 200 then stores the compressed output block 204, including the mask part 208 and the data part 210, in internal memory in the DNN 105 module or in external memory of an application host of the DNN 105 module. From operations 408 and 420, the routine 400 proceeds to operation 410, where it ends.
[092] Figures 5A and 5B are computing system architecture diagrams showing aspects of the configuration and operation of the DNN 105 module for decompressing compressed activation data, according to one embodiment. As briefly discussed above, and as shown in figures 5A and 5B, the DNN 105 module can also include a decompression unit 500 that can decompress blocks of activation data 204 that have been compressed in the manner described above.
[093] For example, the decompression unit 500 can receive a compressed block of activation data 204 from the storage 206, such as memory in the DNN processor or memory of an application host. The decompression unit 500 can then determine the number of non-zero bytes in the data part 210 of the compressed data block 204 based on the number of logically true bits in the mask part 208 of the compressed block 204. The decompression unit 500 can also determine the locations of the non-zero bytes in the decompressed data block 502 based on the locations of the logically true bits in the mask part 208 of the compressed output block 204. The decompression unit 500 can determine the locations of the zero bytes in the decompressed data block 502 in a similar way.
[094] The decompression unit 500 can also determine the number of bits used by the compression unit 200 to store each of the truncated non-zero bytes in the data portion 210 of the compressed output block 204. The decompression unit 500 can determine the number of bits used to store each truncated nonzero byte based on the number of nonzero bytes in the compressed data block 204 (as indicated by mask portion 208) and the target size of the uncompressed output block 502.
[095] In the example given above, for instance, if the data part of the compressed data block 204 is 24 bytes in size (that is, 192 bits) and there are 47 non-zero bytes in the uncompressed data block 202, this means that the compression unit 200 used four bits to store each truncated non-zero byte of the uncompressed data block 202 in the data part 210. The decompression unit 500 can also determine the number of additional bits, if any, that the compression unit 200 allocated to one or more of the truncated non-zero bytes stored in the data part 210 of the compressed output block 204.
[096] For each bit position in the mask part 208 of the compressed output block 204 that is a logical zero, the decompression unit 500 inserts a zero byte at the corresponding position of the decompressed output block 502. For each bit position in the mask part 208 that is a logical one, the decompression unit 500 inserts the truncated non-zero byte from the corresponding position of the compressed input block 204 at the corresponding position of the decompressed output block 502, along with a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block 204. The zero bits can be inserted into the LSBs or MSBs of the truncated non-zero bytes, depending on which bits were truncated during compression.
[097] As mentioned earlier, the decompression unit 500 also adds an offset (for example, 00000001) to one or more of the truncated non-zero bytes stored in the decompressed output block 502 in some embodiments. For example, an offset can be added to non-zero bytes in the uncompressed data block 202 that become zero bytes after compression. In this way, non-zero bytes will not become zero bytes when decompressed.
[098] Figure 6 is a data structure diagram that illustrates aspects of the operation of the DNN 105 module for decompressing activation data with reference to an example block of compressed activation data, according to one embodiment. The example shown in figure 6 illustrates decompression of the compressed activation data 204 generated in the example described previously with reference to figure 3. As shown in figure 6, the mask part 208 stores zeros in bits zero, one and 63 and stores ones in bits two, three and 62. The data part 210 stores the values 01110, 01111 and 0000 in the example shown in figure 6.
[099] As the decompression unit 500 performs the processing operations described above, the logical zero in the first bit position of the mask part 208 will cause the decompression unit 500 to store a zero byte as the first byte of the decompressed activation data block 502. Similarly, the logical zero in the second bit position of the mask part 208 will cause the decompression unit 500 to store a zero byte as the second byte of the decompressed data block 502.
[0100] The logical one in the third bit position of the mask part 208 will cause the decompression unit 500 to retrieve the first five bits (that is, 01110) of the data part 210 and to insert three zero LSBs, resulting in the value 01110000 (112) being stored as the third byte of the decompressed activation data block 502. Similarly, the logical one in the fourth bit position of the mask part 208 will cause the decompression unit 500 to retrieve the second five bits (that is, 01111) of the data part 210 and to insert three zero LSBs, resulting in the value 01111000 (120) being stored as the fourth byte of the decompressed activation data block 502.
[0101] The logical one at bit position 62 of the mask part 208 will cause the decompression unit 500 to retrieve the last four bits of the data part 210 (that is, 0000) and to insert four zero LSBs, resulting in a zero value being stored at byte position 62 of the decompressed activation data block 502. The logical zero in the last bit position of the mask part 208 will cause the decompression unit 500 to store a zero byte as the last byte of the decompressed activation data block 502.
[0102] As previously discussed, the decompression unit 500 can add an offset value to certain bytes in the decompressed activation data block 502. For example, the decompression unit 500 can add an offset value, such as 00000001, to bytes that were non-zero in the uncompressed activation data block 202 but were compressed to zero bytes in the compressed activation data block 204.
[0103] In the example shown in figure 6, the last byte in the data part 210 was non-zero (that is, two) in the uncompressed activation data block 202, but became all zeros in the compressed activation data block 204. Therefore, the decompression unit 500 can add an offset value, such as 00000001, to this byte, thereby ensuring that bytes that were non-zero in the uncompressed blocks of activation data 202 will not become zero bytes when decompressed.
[0104] Figure 7 is a flowchart showing a routine 700 that illustrates aspects of the operation of the DNN module 105 to decompress activation data, according to one embodiment described in this document. Routine 700 begins at operation 702, where the decompression unit 500 uses the mask portion 208 of a compressed activation data block 204 to determine the number of non-zero bytes and their locations in the uncompressed activation data block 502.
[0105] Routine 700 proceeds from operation 702 to operation 704, where the decompression unit 500 determines whether the number of non-zero bytes in the compressed activation data block 204 is equal to or less than the number of bytes in the data portion 210 of the compressed activation data block 204. As previously discussed, the non-zero bytes in the compressed activation data block 204 do not need to be decompressed if the number of non-zero bytes is equal to or less than the number of bytes in the data portion 210, because in that case they were stored without truncation. Therefore, in this case routine 700 proceeds to operation 708, where the non-zero bytes in the compressed activation data block 204 are stored in the uncompressed activation data block 502 without modification.
[0106] If the number of non-zero bytes in the compressed activation data block 504 is greater than the number of bytes in the data portion 210 of the compressed activation data block, routine 700 proceeds from operation 706 to operation 712. In operation 712, the decompression unit 500 determines the number of bits of the data portion 210 of the compressed output data block 204 that the compression unit 200 used to store each truncated non-zero byte of the uncompressed activation data block 202. Routine 700 then proceeds from operation 712 to operation 714 in the manner previously described.
[0107] In operation 714, the decompression unit 500 determines the number of additional bits, if any, that were used to store nonzero bytes of the uncompressed data block 202. The decompression unit 500 can designate these additional bits for one or more of the non-zero bytes in the uncompressed data block 502 in the manner previously described. This occurs in operation 716.
[0108] From operation 716, routine 700 proceeds to operation 718, where the decompression unit 500 inserts a zero byte in the corresponding position of the decompressed output block 502 for each bit position in the mask part 208 of the compressed output block 204 that is a logical zero. For each bit position in the mask part 208 of the compressed output block 204 that is a logical 1, the decompression unit 500 inserts the truncated non-zero byte from the corresponding position of the compressed input block 204 into a corresponding position of the decompressed output block 502, together with a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block 204. The zero bits can be inserted into the LSBs or MSBs of the truncated non-zero bytes, depending on which bits were truncated during compression. This occurs in operation 720.
[0109] The decompression unit 500 can also add an offset value to one or more of the truncated non-zero bytes stored in the uncompressed output block 502 in some embodiments. For example, an offset can be added to bytes that were non-zero in the uncompressed data block 202 but become zero bytes after compression. In this way, non-zero bytes will not become zero bytes when compressed and then decompressed. An offset can be added to all bytes in the uncompressed block of activation data 502 in other embodiments.
[0110] The decompression unit 500 then stores the uncompressed output block 502 in internal memory in the DNN module 105, or in the external memory of an application host of the DNN module 105, for use by the neurons 105F. From operations 708 and 720, routine 700 proceeds to operation 710, where it ends.
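For completeness, a similarly hypothetical sketch of the compression side that routine 700 reverses is given below. The fixed data-part budget, the allocation of leftover bits to the earliest non-zero bytes, the assumption of at least one available bit per non-zero byte, and the name compress_block are illustrative choices only; the untruncated branch mirrors the case, noted in paragraph [0105], in which the non-zero bytes fit in the data part without truncation.

```python
from typing import List, Tuple


def compress_block(block: bytes, data_part_bits: int = 256,
                   lsb_truncated: bool = True) -> Tuple[List[int], str]:
    """Produce a mask part and a data part for one uncompressed activation block.

    mask[i] is 1 where block[i] is non-zero; data_bits packs the (possibly
    truncated) non-zero bytes as a string of '0'/'1' characters.
    """
    mask = [1 if b else 0 for b in block]
    nonzero = [b for b in block if b]
    if not nonzero:
        return mask, ""

    # If every non-zero byte fits in the data part untruncated, store all eight bits.
    if len(nonzero) * 8 <= data_part_bits:
        return mask, "".join(format(b, "08b") for b in nonzero)

    base = data_part_bits // len(nonzero)       # bits available per non-zero byte (< 8 here)
    extra = data_part_bits % len(nonzero)       # leftover bits, given to the first values

    pieces = []
    for idx, b in enumerate(nonzero):
        width = base + (1 if idx < extra else 0)
        if lsb_truncated:
            truncated = b >> (8 - width)        # drop the least significant bits
        else:
            truncated = b & ((1 << width) - 1)  # drop the most significant bits
        pieces.append(format(truncated, "0{}b".format(width)))
    return mask, "".join(pieces)
```

Feeding the output of compress_block into the decompress_block sketch above returns the original block with the truncated low-order bits zeroed out, which is why the optional offset discussed earlier can be useful for bytes that would otherwise collapse to zero.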
[0111] Figure 8 is a computer architecture diagram showing an illustrative computer hardware and software architecture for a computing device that can act as an application host for the DNN module 105 presented in this document. In particular, the architecture illustrated in figure 8 can be used to implement a server computer, a mobile phone, an electronic reader, a smartphone, a desktop computer, an AR / VR device, a tablet, a laptop or another type of computing device suitable for use with the DNN module 105.
[0112] The computer 800 illustrated in figure 8 includes a central processing unit 802 (CPU), a system memory 804, including a random access memory 806 (RAM) and a read-only memory (ROM) 808, and a system bus 810 that couples memory 804 to the CPU 802. A basic input / output system (BIOS or firmware) containing the basic routines that help transfer information between elements within the computer 800, such as during startup, can be stored in the ROM 808. Computer 800 additionally includes a mass storage device 812 to store an operating system 822, application programs and other types of programs. The mass storage device 812 can also be configured to store other types of programs and data.
[0113] The mass storage device 812 is connected to the 802 CPU via a mass storage controller (not shown) connected to the 810 bus. The mass storage device 812 and its associated computer-readable media provide non-volatile storage for computer 800. Although the description of computer-readable media in this document refers to a mass storage device, such as a hard disk, CD-ROM drive, DVD-ROM drive, or USB storage key, it should be understood by those skilled in the art that computer-readable media can be any available computer storage media or communication media that can be accessed by computer 800.
[0114] Communication media include computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any delivery media. The term modulated data signal means a signal that has one or more of its characteristics changed or established in a way to encode information in the signal. For example, and not by way of limitation, communication media include wired media such as a wired network or direct wired connection, and wireless media such as acoustic, radio frequency, infrared and other wireless media. Combinations of any media from those indicated above must also be included in the scope of computer-readable media.
[0115] By way of example, and not by way of limitation, computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. For example, computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, digital versatile discs (DVD), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other media that can be used to store the desired information and that can be accessed by the computer 800. For purposes of the claims, the phrase computer storage media, and variations thereof, does not include waves or signals per se or communication media.
[0116] According to various configurations, computer 800 can operate in a network environment using logical connections to remote computers over a network such as network 820. Computer 800 can connect to network 820 through a network interface unit 816 connected to the 810 bus. It should be noted that the network interface unit 816 can also be used to connect to other types of networks and remote computer systems. Computer 800 may also include an input / output controller 818 for receiving and processing inputs from a variety of other devices, including a keyboard, mouse, touch input, an electronic pointer (not shown in figure 8), or a physical sensor such as a video camera. Similarly, the 818 input / output controller can provide output to a display screen or other type of output device (also not shown in figure 8).
[0117] It should be noted that the software components described in this document, when loaded into the 802 CPU and run, can transform the 802 CPU and the overall computer 800 from a general purpose computing device into a custom special purpose computing device to facilitate the functionality presented in this document. The 802 CPU can be constructed from any number of transistors or other discrete circuit elements, which can assume any number of states individually or collectively. More specifically, the 802 CPU can operate as a finite state machine, in response to executable instructions contained in the software modules described in this document. These computer executable instructions can transform the 802 CPU by specifying how the 802 CPU changes between states, thereby transforming the transistors or other discrete hardware elements making up the 802 CPU.
[0118] Encoding the software modules presented in this document can also transform the physical structure of the computer-readable media presented in this document. The specific transformation of physical structure depends on several factors, in different implementations of this description. Examples of such factors include, but are not limited to, the technology used to implement the computer-readable media, whether the computer-readable media is characterized as primary or secondary storage, and more. For example, if the computer-readable media is implemented as semiconductor-based memory, the software described in this document can be encoded on the computer-readable media by transforming the physical state of the semiconductor memory. For example, the software can transform the state of transistors, capacitors, or other discrete circuit elements making up the semiconductor memory. The software can also transform the physical state of such components in order to store data on them.

[0119] As another example, the computer-readable media described in this document can be implemented using magnetic or optical technology. In such implementations, the software presented in this document can transform the physical state of magnetic or optical media, when the software is encoded in them. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations may also include changing the physical characteristics or features of particular locations within given optical media, to change the optical characteristics of those locations. Further transformations of physical media are possible without departing from the scope and spirit of this description, with the examples given above provided only to facilitate this discussion.
[0120] Considering the above, it must be realized that many types of physical transformations take place in computer 800 in order to store and execute the software components presented in this document. It should also be noted that the architecture shown in figure 8 for computer 800, or a similar architecture, can be used to implement other types of computing devices, including handheld computers, video game devices, embedded computer systems, mobile devices such as smartphones, tablets and AR / VR devices, and other types of computing devices known to those skilled in the art. It is also considered that computer 800 may not include all of the components shown in figure 8, may include other components that are not shown explicitly in figure 8, or may use an architecture completely different from that shown in figure 8.
[0121] Figure 9 is a network diagram illustrating a distributed network computing environment 900 in which aspects of the technologies described can be implemented, according to the various embodiments presented in this document. As shown in figure 9, one or more 900A server computers can be interconnected via an 820 communications network (which can be any one or a combination of a wired or wireless LAN, WAN, intranet, extranet, non-hierarchical network, virtual private network, the Internet, a Bluetooth communications network, a proprietary low voltage communications network or another communications network) with various client computing devices such as, but not limited to, a 900B tablet computer, a gaming console 900C, a smart watch 900D, a phone 900E, such as a smartphone, a personal computer 900F and an AR / VR 900G device.
[0122] In a network environment where the 820 communications network is the Internet, for example, the 900A server computer can be a dedicated server computer operable to process and send data to and from the 900B-900G client computing devices via any of several known protocols, such as hypertext transfer protocol (HTTP), file transfer protocol (FTP) or simple object access protocol (SOAP). In addition, the networked computing environment 900 can use several data security protocols such as secure sockets layer (SSL) or pretty good privacy (PGP). Each of the 900B-900G client computing devices can be equipped with an operating system operable to support one or more computing applications or terminal sessions, such as a network browser (not shown in figure 9), another graphical user interface (not shown in figure 9), or a mobile desktop environment (not shown in figure 9) to gain access to the 900A server computer.
[0123] The 900A server computer can be connected communicatively to other computing environments (not shown in figure 9) and can receive data regarding a network of interactions / resources of the participating user. In an illustrative operation, a user (not shown in figure 9) can interact with a computing application running on a 900B-900G client computing device to obtain desired data and / or run other computing applications.
[0124] The data and / or computing applications can be stored on the 900A server, or on the 900A servers, and transmitted to cooperating users via the 900B-900G client computing devices over an exemplary 820 communications network. A participating user (not shown in figure 9) can request access to specific applications and data hosted entirely or in part on the 900A server computer. This data can be transmitted between the 900B-900G client computing devices and the 900A server computer for processing and storage.
[0125] The 900A server computer can host computing applications, processes and applets for generating, authenticating, encrypting and sending data and applications, and can cooperate with other server computing environments (not shown in figure 9), external entity service providers (not shown in figure 9), network-attached storage (NAS) and storage area networks (SAN) to perform application / data transactions.
[0126] It should be noted that the computing architecture shown in figure 8 and the distributed network computing environment shown in figure 9 are simplified for ease of discussion. It should also be realized that the computing architecture and the distributed computing network can include and use many more computing components, devices, software programs, network devices and other components not specifically described in this document.
[0127] The description presented in this document also covers the subject matter set forth in the following clauses:
[0128] Clause 1. A neural network processor, comprising: one or more neurons; and a compression unit configured to receive an uncompressed block of data generated by at least one of the neurons in the neural network processor, the uncompressed block of data comprising a fixed number of bytes; generate a mask part of a compressed output block, the mask part comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask part corresponding to a byte in the uncompressed data block, and where each bit in the mask part is set to a logical zero where a corresponding byte in the uncompressed block of data is zero and is set to a logical 1 where a corresponding byte in the uncompressed data block is non-zero; generate a data part of the compressed output block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data part of the compressed output block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed output block; and producing the compressed output block, the compressed output block comprising the mask part and the data part.
[0129] Clause 2. The neural network processor of clause 1, in which the neural network processor additionally comprises a decompression unit configured to: receive the compressed output block; determining the number of non-zero bytes in the data portion of the uncompressed data block based on the mask portion of the compressed output block; determining locations of non-zero bytes in the uncompressed data block based on the mask portion of the compressed output block; determining the number of bits used by the compression unit to store the truncated non-zero bytes in the data portion of the compressed output block; for each position in the mask part of the compressed output block, which is a logical zero, insert a zero byte in a corresponding position of an uncompressed output block; and for each position in the mask part which is a logical 1, insert the truncated non-zero byte of the corresponding position of the compressed input block in a corresponding position of the uncompressed output block and a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block.
[0130] Clause 3. The neural network processor of any one of clauses 1 and 2, in which the compression unit is additionally configured to: determine a number of additional bits in the data portion of the compressed output block available to store truncated non-zero bytes of the uncompressed data block; and allocate the additional bits to one or more of the non-zero bytes in the uncompressed data block before truncating the one or more of the non-zero bytes.
[0131] Clause 4. The neural network processor of any one of clauses 1-3, in which the decompression unit is additionally configured to determine the number of additional bits allocated to the one or more of the non-zero bytes stored in the data part of the compressed output block.
[0132] Clause 5. The neural network processor of any one of clauses 1-4, in which the decompression unit is additionally configured to add an offset to one or more of the truncated nonzero bytes stored in the uncompressed output block.
[0133] Clause 6. The neural network processor of any of clauses 1-5, in which one or more least significant bits (LSBs) of the non-zero bytes are truncated.
[0134] Clause 7. The neural network processor of any of clauses 1-6, in which one or more most significant bits (MSBs) of the non-zero bytes are truncated.
[0135] Clause 8. A neural network processor, comprising: one or more neurons; and a decompression unit configured to receive a compressed block of data comprising a mask portion and a data portion; determining a number of non-zero bytes in an uncompressed block of data based on bits in the mask part; determine, based at least in part on the number of non-zero bytes, a number of bits used to store truncated non-zero bytes in the data part of the compressed data output block; for each bit position in the mask part of the compressed data block which is a logical zero, insert a zero byte in a corresponding position in the uncompressed data block; and for each position in the mask part of the compressed data block which is a logical 1, insert a truncated non-zero byte from the corresponding position in the data part of the compressed data block in a corresponding position in the uncompressed data block and a number of zero bits equivalent to a number of bits truncated during compression of the compressed block of data.
[0136] Clause 9. The neural network processor of clause 8, additionally comprising a compression unit configured to: receive an uncompressed block of data generated by at least one of the neurons in the neural network processor, the uncompressed block of data comprising a fixed number of bytes; generating the mask portion of the compressed data block, the mask portion comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask portion corresponding to one byte in the uncompressed data block, and wherein each bit in the mask part comprises a logical zero where a corresponding byte in the uncompressed data block is zero and comprises a logical 1 where a corresponding byte in the uncompressed data block is non-zero; generate the data portion of the compressed data block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data portion of the compressed data block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed data block; and producing the compressed data block, the compressed data block comprising the mask part and the data part.
[0137] Clause 10. The neural network processor of any of clauses 8 and 9, in which the compression unit is additionally configured to store the non-zero bytes in the uncompressed data block in the data portion of the compressed data block without truncation if the number of non-zero bytes in the uncompressed data block is equal to or less than a number of bytes in the data portion of the compressed data block.
[0138] Clause 11. The neural network processor of any of clauses 8-10, in which the compression unit is additionally configured to: determine a number of additional bits in the data portion of the compressed output block available to store truncated non-zero bytes of the uncompressed data block; and allocate the additional bits to one or more of the non-zero bytes in the uncompressed data block before truncating the one or more of the non-zero bytes.
[0139] Clause 12. The neural network processor of any of clauses 8-11, in which the decompression unit is additionally configured to determine the number of additional bits allocated to the one or more of the non-zero bytes stored in the data part of the compressed output block.
[0140] Clause 13. The neural network processor of any of clauses 8-12, in which one or more least significant bits (LSBs) of the non-zero bytes are truncated during compression of the compressed block of data.
[0141] Clause 14. The neural network processor of any of clauses 8-13, in which one or more most significant bits (MSBs) of the non-zero bytes are truncated during compression of the compressed block of data.
[0142] Clause 15. A method implemented by computer, comprising: receiving, in a compression unit of a neural network processor, an uncompressed block of data generated by at least one neuron in the neural network processor, the uncompressed block of data comprising a fixed number of bytes; generate a mask part of a compressed output block, the mask part comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask part corresponding to a byte in the uncompressed data block, and wherein each bit in the mask part comprises a logical zero where a corresponding byte in the uncompressed data block is zero and comprises a logical 1 where a corresponding byte in the uncompressed data block is non-zero; generate a data part of the compressed output block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data part of the compressed output block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed output block; and storing the compressed output block in a memory of the neural network processor, the compressed output block comprising the mask part and the data part.
[0143] Clause 16. The computer-implemented method of clause 15, further comprising: determining a number of additional bits in the data portion of the compressed output block available for storing truncated non-zero bytes of the uncompressed data block; and allocating the additional bits to one or more of the non-zero bytes in the uncompressed block of data before truncating the one or more of the non-zero bytes.
[0144] Clause 17. The computer-implemented method of any of clauses 15 and 16, further comprising storing the non-zero bytes in the uncompressed data block in the data portion of the compressed data block without truncation if the number of non-zero bytes in the uncompressed data block is less than or equal to a number of bytes in the data portion of the compressed data block.
[0145] Clause 18. The computer-implemented method of any of clauses 15-17, further comprising: receiving, in a decompression unit of a neural network processor, the compressed output block; determining the number of non-zero bytes in the data portion of the uncompressed data block based on the mask portion of the compressed output block; determining locations of non-zero bytes in the uncompressed data block based on the mask portion of the compressed output block; determining the number of bits used by the compression unit to store the truncated non-zero bytes in the data portion of the compressed output block; for each bit position in the mask part of the compressed output block, which is a logical zero, insert a zero byte in a corresponding position of an uncompressed output block; and for each position in the mask part of the compressed output block which is a logical 1, insert the truncated non-zero byte of the corresponding position of the compressed output block in a corresponding position of the uncompressed output block and a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block.
[0146] Clause 19. The computer-implemented method of any of clauses 15-18, further comprising adding an offset to one or more of the truncated non-zero bytes stored in the uncompressed output block.
[0147] Clause 20. The computer-implemented method of any of clauses 15-19, in which the offset is added to one or more least significant bits (LSBs) of the truncated non-zero bytes stored in the uncompressed output block.
[0148] Based on the above, it should be noted that a DNN module that can compress and decompress activation data to reduce memory bus bandwidth utilization has been described in this document. Although the subject matter presented in this document has been described in language specific to computer structural features, methodological and transformational procedures, specific computing devices, and computer-readable media, it is to be understood that the subject matter set out in the attached claims is not necessarily limited to the specific features, media or procedures described in this document. In particular, the specific features, media and procedures are described as examples of how to implement the claimed subject matter.
[0149] The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes can be made to the subject matter described in this document without following the example configurations and applications illustrated and described, and without departing from the scope of this description, which is set out in the following claims.
Claims (15)
[1]
1. Neural network processor, characterized by the fact that it comprises:
one or more neurons; and a compression unit configured to receive an uncompressed block of data generated by at least one of the neurons in the neural network processor, the uncompressed block of data comprising a fixed number of bytes;
generate a mask part of a compressed output block, the mask part comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask part corresponding to a byte in the uncompressed data block , and where each bit in the mask part is set to a logical zero where a corresponding byte in the uncompressed data block is zero and is set to a logical 1 where a corresponding byte in the uncompressed data block is nonzero ;
generate a data part of the compressed output block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data part of the compressed output block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed output block; and produce the compressed output block, the compressed output block comprising the mask part and the data part.
[2]
2. Neural network processor, according to claim 1, characterized by the fact that it additionally comprises a decompression unit configured for:
receiving the compressed output block;
determining the number of non-zero bytes in the data portion of the uncompressed data block based on the mask portion of the compressed output block;
determining locations of non-zero bytes in the uncompressed data block based on the mask portion of the compressed output block;
determining the number of bits used by the compression unit to store the truncated non-zero bytes in the data portion of the compressed output block;
for each position in the mask part of the compressed output block, which is a logical zero, insert a zero byte in a corresponding position of an uncompressed output block; and for each position in the mask part which is a logical 1, insert the truncated non-zero byte of the corresponding position of the compressed input block in a corresponding position of the uncompressed output block and a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block.
[3]
3. Neural network processor, according to claim 1, characterized by the fact that the compression unit is additionally configured for:
determining a number of additional bits in the data portion of the compressed output block available to store truncated non-zero bytes of the uncompressed data block; and allocate the additional bits to one or more of the non-zero bytes in the uncompressed data block before truncating the one or more of the non-zero bytes.
[4]
4. Neural network processor according to claim 3, characterized by the fact that the decompression unit is additionally configured to determine the number of additional bits allocated to the one or more of the non-zero bytes stored in the data part of the compressed output block.
[5]
5. Neural network processor, according to claim 2, characterized by the fact that the decompression unit is additionally configured to add an offset to one or more of the truncated non-zero bytes stored in the decompressed output block.
[6]
6. Neural network processor, according to claim 1, characterized by the fact that one or more least significant bits (LSBs) of the non-zero bytes are truncated.
[7]
7. Neural network processor, characterized by the fact that it comprises:
one or more neurons; and a decompression unit configured to receive a compressed block of data comprising a mask portion and a data portion;
determining a number of non-zero bytes in an uncompressed block of data based on bits in the mask part;
determining, based at least in part on the number of non-zero bytes, a number of bits used to store truncated non-zero bytes in the data portion of the compressed data output block;
for each bit position in the mask part of the compressed data block which is a logical zero, insert a zero byte in a corresponding position in the uncompressed data block; and
for each position in the mask part of the compressed data block which is a logical 1, insert a truncated non-zero byte from the corresponding position in the data part of the compressed data block in a corresponding position in the uncompressed data block and a number of zero bits equivalent to a number of bits truncated during compression of the compressed block of data.
[8]
8. Neural network processor, according to claim 7, characterized by the fact that it additionally comprises a compression unit configured for:
receiving an uncompressed block of data generated by at least one of the neurons in the neural network processor, the uncompressed block of data comprising a fixed number of bytes;
generating the mask portion of the compressed data block, the mask portion comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask portion corresponding to one byte in the uncompressed data block, and wherein each bit in the mask part comprises a logical zero where a corresponding byte in the uncompressed data block is zero and comprises a logical 1 where a corresponding byte in the uncompressed data block is non-zero;
generate the data portion of the compressed data block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data portion of the compressed data block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed data block; and producing the compressed data block, the compressed data block comprising the mask part and the data part.
[9]
9. Neural network processor, according to claim 8, characterized by the fact that the compression unit is additionally configured to store the non-zero bytes in the uncompressed data block in the data part of the compressed data block without truncation if the number of non-zero bytes in the uncompressed data block is equal to or less than a number of bytes in the data portion of the compressed data block.
[10]
10. Neural network processor, according to claim 8, characterized by the fact that the compression unit is additionally configured for:
determining a number of additional bits in the data portion of the compressed output block available to store truncated nonzero bytes of the uncompressed data block; and allocating the additional bits to one or more of the nonzero bytes in the uncompressed block of data before truncating the one or more of the nonzero bytes.
[11]
11. Neural network processor, according to claim 8, characterized by the fact that the decompression unit is additionally configured to determine the number of additional bits allocated to the one or more of the non-zero bytes stored in the data part of the compressed output block.
[12]
12. Method implemented by computer, characterized by the fact that it comprises:
receive, in a compression unit of a neural network processor, an uncompressed block of data generated by at least one neuron in the neural network processor, the uncompressed block of data comprising a fixed number of bytes;
generate a mask part of a compressed output block, the mask part comprising a number of bits equivalent to the fixed number of bytes in the uncompressed data block, each bit in the mask part corresponding to a byte in the uncompressed data block , and wherein each bit in the mask part comprises a logical zero where a corresponding byte in the uncompressed data block is zero and comprises a logical 1 where a corresponding byte in the uncompressed data block is non-zero;
generate a data part of the compressed output block by determining a number of non-zero bytes in the uncompressed data block, determining, based on the number of non-zero bytes in the uncompressed data block, a number of bits in the data part of the compressed output block available to store truncated non-zero bytes of the uncompressed data block, truncating the non-zero bytes in the uncompressed data block to the specified number of bits, and storing the truncated non-zero bytes in the data portion of the compressed output block; and storing the compressed output block in a memory of the neural network processor, the compressed output block comprising the mask part and the data part.
[13]
13. Method implemented by computer, according to claim 12, characterized by the fact that it additionally comprises:
determine a number of additional bits in the data portion of the compressed output block available to store truncated non-zero bytes of the uncompressed data block; and allocate the additional bits to one or more of the non-zero bytes in the uncompressed data block before truncating the one or more of the non-zero bytes.
[14]
14. Method implemented by computer, according to claim 12, characterized by the fact that it additionally comprises storing the non-zero bytes in the uncompressed data block in the data portion of the compressed data block without truncation if the number of non-zero bytes in the uncompressed data block is less than or equal to a number of bytes in the data portion of the compressed data block.
[15]
15. Method implemented by computer, according to claim 12, characterized by the fact that it additionally comprises:
receiving, in a decompression unit of a neural network processor, the compressed output block;
determining the number of non-zero bytes in the data portion of the uncompressed data block based on the mask portion of the compressed output block;
determining locations of non-zero bytes in the uncompressed data block based on the mask portion of the compressed output block;
determining the number of bits used by the compression unit to store the truncated non-zero bytes in the data portion of the compressed output block;
for each bit position in the mask part of the compressed output block, which is a logical zero, insert a zero byte in a corresponding position of an uncompressed output block; and for each position in the mask part of the compressed output block which is a logical 1, insert the truncated non-zero byte of the corresponding position of the compressed output block in a corresponding position of the uncompressed output block and a number of zero bits equivalent to the number of bits truncated during compression of the compressed output block.
KR102298766B1|2021-02-15|2021-09-07|주식회사 딥이티|Apparatus and method for converting deep learning model for target device|
Legal status:
2021-10-19|B350|Update of information on the portal [chapter 15.35 patent gazette]|
Priority:
Application number | Filing date | Patent title
US201762486432P|true|2017-04-17|2017-04-17|
US62/486,432|2017-04-17|
US15/953,356|2018-04-13|
US15/953,356|US20180300606A1|2017-04-17|2018-04-13|Neural network processor using compression and decompression of activation data to reduce memory bandwidth utilization|
PCT/US2018/027840|WO2018194998A1|2017-04-17|2018-04-16|Neural network processor using compression and decompression of activation data to reduce memory bandwidth utilization|