SYNCHRONIZATION IN A MULTI-TILE PROCESSING ARRAY
Patent abstract:
A computer includes: a plurality of processing units, each comprising instruction storage containing a local program, an execution unit executing the local program, and data storage for holding data; an input interface having a set of input wires, and an output interface having a set of output wires; a switching fabric connected to each of the processing units by the respective set of output wires and connectable to each of the processing units by the respective input wires through switching circuitry controllable by each processing unit; and a synchronization module operable to generate a synchronization signal for controlling the computer to switch between a compute phase and an exchange phase. The processing units are arranged to execute their local programs according to a common clock, the local programs being such that in the exchange phase at least one processing unit executes a send instruction from its local program to transmit, at a transmit time, a data packet on its set of output connection wires, the packet being destined for at least one recipient processing unit but having no destination identifier; and at a predetermined switch time the recipient processing unit executes a switch control instruction from its local program to control its switching circuitry to connect its set of input wires to the switching fabric so as to receive the data packet at a receive time, the transmit time, the switch time and the receive time being governed by the common clock with respect to the synchronization signal.
Publication number: FR3072801A1
Application number: FR1859639
Filing date: 2018-10-18
Publication date: 2019-04-26
Inventors: Simon Christian Knowles; Daniel John Pelham WILKINSON; Richard Luke Southwell Osborne; Alan Graham Alexander; Stephen Felix; Jonathan Mangnall; David Lacey
Applicant: Graphcore Ltd
IPC main class:
Patent description:
DESCRIPTION
B17816 FR-408525FR
TITLE: SYNCHRONIZATION IN A MULTI-TILE PROCESSING ARRAY
Technical Field
[0001] The present description relates to synchronizing the workloads of multiple different tiles in a processor comprising multiple tiles, each tile comprising a processing unit with a local memory. In particular, the description relates to a bulk synchronous parallel (BSP) computing protocol in which each tile of a group of tiles must complete a compute phase before any tile of the group can proceed to an exchange phase.
Prior Art
[0002] Parallelism in computing takes different forms. Program fragments can be organized to run concurrently (in which case they overlap in time but may share execution resources) or in parallel, in which case they run on different resources, possibly at the same time. Parallelism in computing can be achieved in a number of ways, such as by means of an array of multiple interconnected processor tiles, or a multi-threaded processing unit, or indeed a multi-tile array in which each tile comprises a multi-threaded processing unit. When parallelism is achieved by means of a processor comprising an array of multiple tiles on the same chip (or chips in the same integrated circuit package), each tile comprises its own separate processing unit with a local memory (comprising a program memory and a data memory). Separate portions of program code can thus run simultaneously on different tiles. The tiles are connected to one another via an on-chip interconnect which allows the code running on the different tiles to communicate between tiles. In some cases the processing unit on each tile may take the form of a barrel-threaded processing unit (or another multi-threaded processing unit). Each tile may have a set of contexts and an execution pipeline so that each tile can run multiple interleaved threads simultaneously.
In general, there may be dependencies between the portions of a program executing on different tiles in the array. A technique is therefore required to prevent a piece of code on one tile from running ahead of data on which it depends that is made available by another piece of code on another tile. There are a number of possible schemes for achieving this, but the scheme of interest here is known as bulk synchronous parallel (BSP). According to the BSP scheme, each tile performs a compute phase and an exchange phase in alternation. During the compute phase each tile performs one or more computation tasks locally on the tile, but does not communicate any result of its computations to any other tile. In the exchange phase each tile is allowed to exchange one or more results of the preceding compute phase to and/or from one or more other tiles in the group, but does not begin a new compute phase until the tile has completed its exchange phase. Further, according to this form of the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase, or both. That is, either: (a) all tiles must complete their respective compute phases before any tile in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group must complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. When the phrase "between a compute phase and an exchange phase" is used here, it encompasses all of these options. An example use of multi-threaded and/or multi-tile parallel processing is found in artificial intelligence.
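The alternating compute/exchange structure with a barrier synchronization at each phase transition can be sketched as follows. This is a minimal illustration of the BSP principle, not the patent's implementation; the tile count, the per-tile work and the neighbour-passing pattern are invented for the example:

```python
from threading import Barrier, Thread

N_TILES = 4
barrier = Barrier(N_TILES)      # barrier synchronization between phases
outbox = [0] * N_TILES          # stands in for the switching fabric
results = [0] * N_TILES

def tile(tid):
    # Compute phase: purely local work, no communication with other tiles.
    local = tid * tid
    barrier.wait()              # all tiles must finish compute...
    # Exchange phase: each tile posts its result to its neighbour's slot.
    outbox[(tid + 1) % N_TILES] = local
    barrier.wait()              # ...and all must finish exchange
    results[tid] = outbox[tid]  # safe: all writes happened before the barrier

threads = [Thread(target=tile, args=(t,)) for t in range(N_TILES)]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # each tile ends up holding its predecessor's square
```

Option (c) of the text (barriers at both transitions) corresponds to the two `barrier.wait()` calls bracketing the exchange.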
As will be familiar to those skilled in the art of artificial intelligence, artificial intelligence algorithms are capable of producing knowledge models and using the knowledge model to run learning and inference algorithms. An artificial intelligence model incorporating the knowledge model and algorithms can be represented as a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes. The output activation of some nodes forms the inputs of other nodes; the output of some nodes provides the output of the graph, and the inputs of the graph provide the inputs to some nodes. Further, the function at each node is parameterized by one or more respective parameters, e.g. weights. During a learning stage the aim is, on the basis of a set of experiential input data, to find values for the various parameters such that the graph as a whole generates a desired output for a possible range of inputs. Various algorithms for achieving this are known in the art, such as a back-propagation algorithm based on stochastic gradient descent. Over multiple iterations the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs, or to make inferences as to inputs (causes) given a specified set of outputs; other introspective forms of analysis can also be performed on it. The implementation of each node involves the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all of the other nodes in the graph, and therefore large graphs expose opportunities for massive parallelism.
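The graph model just described, with nodes as parameterized functions of their inputs and edges as data to be exchanged, can be illustrated with a toy static graph. The node structure, weights and node functions below are invented for illustration only:

```python
# A static graph: structure does not change during execution, as the text notes.
# Each non-input node computes a weighted sum of its input nodes.
graph = {
    "a": {"inputs": [], "weight": 1.0},        # graph input node
    "b": {"inputs": [], "weight": 1.0},        # graph input node
    "c": {"inputs": ["a", "b"], "weight": 0.5},
    "d": {"inputs": ["c"], "weight": 2.0},     # graph output node
}

def evaluate(graph, external):
    """Evaluate nodes in dependency order; 'external' feeds the input nodes."""
    values = {}
    for name, node in graph.items():  # insertion order is a valid topological order here
        if not node["inputs"]:
            values[name] = external[name]
        else:
            values[name] = node["weight"] * sum(values[i] for i in node["inputs"])
    return values

out = evaluate(graph, {"a": 2.0, "b": 4.0})
print(out["d"])  # c = 0.5*(2+4) = 3.0, then d = 2.0*3.0 = 6.0
```

Since node "c" depends only on "a" and "b", independent nodes could be evaluated on different tiles in parallel, which is the opportunity the text identifies.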
SUMMARY OF THE INVENTION As previously mentioned, an artificial intelligence model representing the knowledge model and algorithmic information about how the knowledge model is used for learning and inference can generally be represented by a graph of multiple interconnected nodes, each node having a data processing requirement. Interconnections of the graph indicate data to be exchanged between the nodes, and consequently cause dependencies between the program fragments executed at the nodes. In general, processing at a node can be carried out independently of another node, and therefore large graphs expose enormous parallelism. A highly distributed parallel machine is a machine structure suitable for the computation of such artificial intelligence models. This property makes it possible to design a machine to provide certain guarantees of temporal determinism. An element of knowledge models exploited in the present description is the generally static nature of the graph. That is, the structure of the nodes and connections constituting the graph does not usually change during execution of artificial intelligence algorithms. The inventors have produced a machine which gives certain guarantees of temporal determinism in order to optimize computation on artificial intelligence models. This allows a compiler to partition and schedule work across the nodes in a manner that is deterministic in time. It is this temporal determinism which is exploited in the embodiments described below for significant optimizations in the design of a computer optimized for processing workloads based on knowledge models.
According to one aspect of the invention, there is provided a computer comprising: a plurality of processing units, each comprising instruction storage containing a local program, an execution unit executing the local program, and data storage for holding data; an input interface provided with a set of input wires, and an output interface provided with a set of output wires; a switching fabric connected to each of the processing units by the respective set of output wires and connectable to each of the processing units by the respective input wires via switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal to control the computer to switch between a compute phase and an exchange phase, the processing units being arranged to execute their local programs according to a common clock, the local programs being such that in the exchange phase at least one processing unit executes a send instruction from its local program to transmit, at a transmit time, a data packet on its set of output connection wires, the data packet being destined for at least one recipient processing unit but comprising no destination identifier, and at a predetermined switch time the recipient processing unit executes a switch control instruction from its local program to control its switching circuits to connect its set of input wires to the switching fabric so as to receive the data packet at a receive time, the transmit time, the switch time and the receive time being governed by the common clock with respect to the synchronization signal.
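The timing relationship in this aspect — a send at a known cycle, a switch control at a known cycle, a receive at a known cycle, and no destination identifier carried in the packet — can be sketched as a small cycle-driven simulation. The fabric latency and the two-tile schedule below are invented values for illustration, not figures from the description:

```python
FABRIC_DELAY = 3  # assumed cycles from sender's output wires to recipient's mux

# Compile-time schedule, in cycles relative to the sync signal (cycle 0):
send_time   = 2                              # sending tile executes SEND here
switch_time = send_time + FABRIC_DELAY - 1   # recipient points its mux one cycle early
recv_time   = send_time + FABRIC_DELAY       # packet appears on recipient's input wires

fabric = {}             # cycle -> packet in flight (note: no destination id in packet)
mux_connected = False   # state of the recipient's input multiplexer
received = []

for cycle in range(10):
    if cycle == send_time:
        fabric[cycle + FABRIC_DELAY] = "packet-from-tile0"   # SEND instruction
    if cycle == switch_time:
        mux_connected = True                                 # switch control instruction
    if mux_connected and cycle in fabric:
        received.append((cycle, fabric.pop(cycle)))          # receive at recv_time

print("received at cycle", received[0][0])
```

Because all three times are fixed relative to the synchronization signal, routing needs no addressing: the packet lands where it does solely because the recipient's multiplexer is listening at the right cycle.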
Another aspect of the invention provides a method of computing a function in a computer comprising: a plurality of processing units, each comprising instruction storage containing a local program, an execution unit for executing the local program, data storage for holding data, an input interface provided with a set of input wires, and an output interface provided with a set of output wires; a switching fabric connected to each of the processing units by the respective sets of output wires and connectable to each of the processing units by its respective input wires via switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal to control the computer to switch between a compute phase and an exchange phase; the method comprising: the processing units executing their local programs in a compute phase according to a common clock; at a predetermined time in the exchange phase, at least one processing unit executing a send instruction from its local program to transmit, at a transmit time, a data packet on its set of output connection wires, the data packet being destined for at least one recipient processing unit but comprising no destination identifier; and at a predetermined switch time, the recipient processing unit executing a switch control instruction from its local program to control the switching circuits to connect its set of input wires to the switching fabric to receive the data packet at a receive time; the transmit time, the switch time and the receive time being governed by the common clock with respect to the synchronization signal. In principle, the synchronization signal could be generated to control the switch from a compute phase to the exchange phase, or from the exchange phase to the compute phase.
For the time-deterministic architecture defined here, it is however preferable that the synchronization signal be generated so as to start the exchange phase. In one embodiment, each processing unit indicates to the synchronization module that its own compute phase is over, and the synchronization signal is generated by the synchronization module, when all the processing units have indicated that their own compute phase is over, to start the exchange phase.
[0013] The transmit time should be predetermined to enable the time-deterministic exchange to complete correctly. It can be determined by being a known number of clock cycles after the time at which the send instruction is executed, assuming that the time at which the send instruction is executed is itself predetermined. Alternatively, the transmit time could be a known time determined in some other way relative to the execution of the send instruction. What matters is that the transmit time is known relative to the receive time at an intended recipient processing unit. The features of the send instruction may be such that the send instruction explicitly defines a send address identifying a location in the data storage from which the data packet is to be sent. Alternatively, no send address is explicitly defined in the send instruction, and the data packets are transmitted from a send address held in a register implicitly defined by the send instruction. The local program may include a send address update instruction to update the send address in the implicit register. In the embodiments described here, the switching circuitry comprises a multiplexer having a set of output wires connected to its processing unit, and multiple sets of input wires connected to the switching fabric, whereby one of the multiple sets of input wires is selected under control of the processing unit. Each set may comprise 32 bits.
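The two addressing variants of the send instruction described above — an explicit send address in the instruction versus an implicitly defined send-address register with a separate update instruction — can be sketched as follows. The register name, addresses and stored values are invented for illustration:

```python
# Sketch of the two SEND addressing variants described in the text.
data_storage = {0x100: "A", 0x104: "B"}   # invented contents of a tile's data storage
send_address_reg = 0x100                  # the implicitly defined send-address register

def send_explicit(addr):
    """SEND variant with an explicit send address encoded in the instruction."""
    return data_storage[addr]

def send_implicit():
    """SEND variant with no address field: reads the implicit send-address register."""
    return data_storage[send_address_reg]

def update_send_address(addr):
    """The send address update instruction from the local program."""
    global send_address_reg
    send_address_reg = addr

assert send_explicit(0x104) == "B"
assert send_implicit() == "A"     # register still points at 0x100
update_send_address(0x104)
assert send_implicit() == "B"     # register updated by the local program
```

The implicit-register variant frees instruction bits, which matters in the compact instruction formats discussed later.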
When 64-bit data is used, a pair of multiplexers can be connected to a processing unit and controlled jointly. In the embodiment described, the recipient processing unit is arranged to receive the data packet and load it into the data storage at a memory location identified by a memory pointer. The memory pointer can be automatically incremented after each data packet has been loaded into the data storage. Alternatively, the local program at the recipient processing unit may include a memory pointer update instruction which updates the memory pointer. The send instruction can be arranged to identify a number of data packets to be sent, each data packet being associated with a different transmit time, since they are sent in series from the processing unit. One of the sets of input wires of the multiplexer can be controlled so as to be connected to a null input. This could be used to ignore data that would otherwise arrive at the processing unit. The recipient processing unit intended to receive a particular data packet could be the same processing unit as that which executed a send instruction at an earlier time, such that the same processing unit is arranged to send a data packet and to receive that data packet at a later time. The purpose of a processing unit sending to itself can be to achieve an arrangement in its memory of incoming data interleaved with data received from other processing units. In some embodiments, at least two of the processing units can cooperate as a transmitting pair, wherein a first data packet is transmitted from a first processing unit of the pair via its set of output connection wires, and a second data packet is transmitted from the first processing unit of the pair via the set of output connection wires of the second processing unit of the pair, to achieve a double-width transmission.
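The receive side described above — loading each incoming packet at a memory pointer that auto-increments, and a null multiplexer input used to discard unwanted traffic — can be sketched as follows. The word size is the 32-bit set width from the text; the addresses and packet names are invented:

```python
WORD = 4            # 32-bit words, matching the 32-bit wire sets in the text
NULL_INPUT = None   # mux position used to ignore data that would otherwise arrive

memory = {}
mem_pointer = 0x200   # memory pointer on the recipient tile (invented address)

def receive(packet, mux_input):
    """Load a packet at the memory pointer, auto-incrementing afterwards.
    If the mux is connected to the null input, the data is silently dropped."""
    global mem_pointer
    if mux_input is NULL_INPUT:
        return                      # effectively discarded, no positive action needed
    memory[mem_pointer] = packet
    mem_pointer += WORD             # automatic post-increment described in the text

receive("p0", mux_input="fabric")
receive("junk", mux_input=NULL_INPUT)   # discarded; pointer unchanged
receive("p1", mux_input="fabric")
print(sorted(memory))   # two packets at consecutive word addresses
```

A tile sending to itself would simply be a case where the same tile's schedule contains both the send and, a fixed number of cycles later, the receive.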
In some embodiments, at least two of the processing units can operate as a receiving pair, wherein each processing unit of the pair controls its switching circuits to connect its respective set of input wires to the switching fabric to receive respective data packets from respective tiles of a transmitting pair. The multiple processing units can be arranged to execute respective send instructions to transmit respective data packets, at least some of the data packets not being destined for any recipient processing unit. The function being computed can be provided in the form of a static graph comprising a plurality of interconnected nodes, each node being implemented by a codelet of the local programs. A codelet defines a vertex (node) in the graph, and can be regarded as an atomic thread, as described later in the description. In the compute phase, each codelet can process data to produce a result, some of the results not being needed by a subsequent compute phase and not being received by any recipient processing unit. They are effectively discarded, but without any positive discard action being needed. In the exchange phase, the data packets are transmitted between processing units via the switching fabric and the switching circuits. Note that in the exchange phase certain instructions are executed from the local program to implement the exchange phase. These instructions include the send instruction. Although the compute phase is responsible for computation, it should be noted that it may be possible to include certain arithmetic or logical functions during the exchange phase, provided these functions do not involve any data dependency on the timing of the local program, so that it remains synchronous. The time-deterministic architecture described here is particularly useful in contexts where the graph represents an artificial intelligence function.
The switching fabric can be arranged such that in the exchange phase data packets are propagated through it in a pipelined manner via a succession of temporary stores, each store holding a data packet for one cycle of the common clock. In another aspect, there is provided a computer-implemented method of generating multiple programs to deliver a computer function, each program being intended to be executed in a processing unit of a computer comprising: a plurality of processing units, each comprising instruction storage for holding a local program, an execution unit for executing the local program and data storage for holding data; a switching fabric connected to an output interface of each processing unit and connectable to an input interface of each processing unit by switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal; the method comprising: generating a local program for each processing unit comprising a sequence of executable instructions; determining for each processing unit a relative execution time of the instructions of each local program, whereby a local program allocated to a processing unit is scheduled to execute, with a predetermined delay relative to the synchronization signal, a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronization signal, destined for a recipient processing unit but having no destination identifier; and a local program allocated to the recipient processing unit is scheduled to execute, at a predetermined switch time, a switch control instruction to control the switching circuits to connect its processing unit wires to the switching fabric to receive the data packet at a receive time.
In certain embodiments, the processing units have a fixed positional relationship to one another, and the determining step comprises determining a fixed delay based on the positional relationship between each pair of processing units in the computer. In certain embodiments, the fixed positional relationship comprises an array of rows and columns, each processing unit having an identifier which identifies its position in the array. In certain embodiments, the switching circuits comprise a multiplexer having a set of output wires connected to its processing unit and multiple sets of input wires connectable to the switching fabric, the multiplexer being located on the computer at a predetermined physical position relative to its processing unit, and the determining step comprises determining fixed delays such that the switch control instruction reaches the multiplexer and an output data packet from the multiplexer reaches the input interface of its processing unit. In some embodiments, the method comprises the step of providing in each program a synchronization instruction which indicates to the synchronization module that a compute phase at the processing unit has completed. In some embodiments, the determining step comprises determining for each processing unit a fixed delay between a synchronization event on the chip and the return reception, at the processing unit, of an acknowledgment indicating that a synchronization event has occurred. In certain embodiments, the determining step comprises accessing a lookup table containing delay information enabling the predetermined send time and the predetermined switch time to be determined. In some embodiments, the computer function is a machine learning function.
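The idea of deriving fixed delays from the fixed positional relationship of tiles in a rows-and-columns array, and tabulating them for the compiler, can be sketched as follows. The cost model (one cycle per row or column hop plus a fixed multiplexer overhead) and the array dimensions are assumptions for illustration, not the patent's actual delay model:

```python
# Sketch: computing a fixed fabric delay between two tiles from their
# positions in a rows-by-columns array, then tabulating the delays.
COLS = 8           # columns per array (invented for this sketch)
MUX_OVERHEAD = 2   # assumed fixed cycles for mux control + mux-to-tile wiring

def position(tile_id):
    """Recover (row, column) from the tile identifier, as in the text."""
    return divmod(tile_id, COLS)

def fabric_delay(sender, recipient):
    """Assumed cost model: Manhattan distance in hops plus mux overhead."""
    r1, c1 = position(sender)
    r2, c2 = position(recipient)
    return abs(r1 - r2) + abs(c1 - c2) + MUX_OVERHEAD

# The compiler can tabulate delays once, like the lookup table mentioned above,
# and schedule every switch control instruction against this table.
delay_table = {(s, r): fabric_delay(s, r) for s in range(16) for r in range(16)}
print(delay_table[(0, 9)])   # one row hop + one column hop + overhead
```

Because the positions never change, the table is computed once at compile time; every transmit, switch and receive time follows from it deterministically.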
In some embodiments, the switching circuitry comprises a multiplexer having a set of output wires connected to its processing unit, and multiple sets of input wires connectable to the switching fabric, the multiplexer being located on the computer at a predetermined physical position relative to its processing unit, and the determining step comprises determining the delays such that the switch control instruction reaches the multiplexer and an output data packet from the multiplexer reaches the input interface of its processing unit. In certain embodiments, a synchronization instruction is provided in each program which indicates to the synchronization module that a compute phase at the processing unit has completed. In certain embodiments, the determining step comprises determining for each processing unit a fixed delay between a synchronization event on the chip and the return reception, at the processing unit, of an acknowledgment that a synchronization event has occurred. In certain embodiments, the determining step comprises accessing a lookup table containing delay information enabling the predetermined send time and the predetermined switch time to be determined. In some embodiments, the computer function is a machine learning function.
In another aspect, there is provided a compiler comprising a processor programmed to carry out a method of generating multiple programs for delivering a computer function, each program being intended to be executed in a processing unit of a computer comprising: a plurality of processing units, each comprising instruction storage for holding a local program, an execution unit for executing the local program and data storage for holding data; a switching fabric connected to an output interface of each processing unit and connectable to an input interface of each processing unit by switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal; the method comprising: generating a local program for each processing unit comprising a sequence of executable instructions; determining for each processing unit a relative execution time of the instructions of each local program, whereby a local program allocated to a given processing unit is scheduled to execute, with a predetermined delay relative to the synchronization signal, a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronization signal, destined for a recipient processing unit but comprising no destination identifier, and a local program allocated to the recipient processing unit is scheduled to execute, at a predetermined switch time, a switch control instruction to control the switching circuits to connect its processing unit wires to the switching fabric to receive the data packet at a receive time; the compiler being connected so as to receive a fixed graph structure representing the computer function and a table containing delays enabling the predetermined send time and the predetermined switch time to be determined for each processing unit. In some embodiments, the computer function is a machine learning function.
In some embodiments, the fixed graph structure comprises a plurality of nodes, each node being represented by a codelet in a local program. In another aspect, there is provided a computer program recorded on a non-transitory medium, comprising computer-readable instructions which, when executed by a processor of a compiler, implement a method of generating multiple programs for delivering a computer function, each program being intended to be executed in a processing unit of a computer comprising: a plurality of processing units, each comprising instruction storage for holding a local program, an execution unit for executing the local program and data storage for holding data; a switching fabric connected to an output interface of each processing unit and connectable to an input interface of each processing unit by switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal; the method comprising: generating a local program for each processing unit comprising a sequence of executable instructions; determining for each processing unit a relative execution time of the instructions of each local program, whereby a local program allocated to a given processing unit is scheduled to execute, with a predetermined delay relative to the synchronization signal, a send instruction to transmit at least one data packet at a predetermined transmit time, relative to the synchronization signal, destined for a recipient processing unit but comprising no destination identifier, and a local program allocated to the recipient processing unit is scheduled to execute, at a predetermined switch time, a switch control instruction for controlling the switching circuits to
connect its processing unit wires to the switching fabric to receive the data packet at a receive time. In another aspect, there is provided a computer program comprising a sequence of instructions intended to be executed on a processing unit comprising instruction storage for holding the computer program, an execution unit for executing the computer program and data storage for holding data, the computer program comprising one or more computer-executable instructions which, when executed, implement: a send function which causes a data packet destined for a recipient processing unit to be transmitted on a set of connection wires connected to the processing unit, the data packet comprising no destination identifier but being transmitted at a predetermined transmit time; and a switch control function which causes the processing unit to control switching circuits to connect a set of connection wires of the processing unit to a switching fabric to receive a data packet at a predetermined receive time. In some embodiments, said one or more instructions comprise a switch control instruction and a send instruction which defines a send address identifying a location in the instruction storage from which the data packet is to be sent. In some embodiments, the send instruction defines a number of data packets to be sent, each packet being associated with a different predetermined transmit time. In some embodiments, the send instruction does not explicitly define a send address but implicitly defines a register in which a send address is held. In some embodiments, the computer program includes a further function for updating the send address in the implicitly defined register.
In certain embodiments, the computer program comprises at least one further instruction defining a memory pointer update function which updates a memory pointer identifying a memory location in the data storage for storing the data packet received at the recipient processing unit. In some embodiments, said one or more instructions are a merged instruction which merges the send function and the switch control function in a single execution cycle, whereby the processing unit is arranged to act to transmit a data packet and to control its switching circuits to receive a different data packet from another processing unit. In some embodiments, said at least one further instruction is a merged instruction which merges the send function and the memory pointer update function. In some embodiments, the merged instruction is configured in a common format with an operation code portion which indicates whether it merges the send function with the memory pointer update function or with the switch control function. In some embodiments, said one or more instructions are a single instruction which merges the send function, the switch control function and the memory pointer update function in a single execution cycle. In some embodiments, each of said one or more instructions has a first bit width which matches the bit width of a fetch stage of the execution unit. In certain embodiments, each of said one or more instructions has a first bit width which matches the bit width of a fetch stage of the execution unit, and the instruction which merges the send function, the switch control function and the memory pointer update function has a second bit width which is double the bit width of the fetch stage of the execution unit.
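The common merged-instruction format with an operation-code portion selecting which functions are merged can be sketched with an invented encoding. The 32-bit word size, the 4-bit opcode field and the opcode values below are all assumptions for illustration; the patent does not specify this layout:

```python
# Sketch of decoding a merged instruction under an invented 32-bit encoding:
# a 4-bit opcode portion selects which functions the instruction merges.
OP_SEND             = 0x1   # send only
OP_SEND_WITH_SWITCH = 0x2   # merges send + switch control in one execution cycle
OP_SEND_WITH_PTR    = 0x3   # merges send + memory pointer update

def decode(word):
    """Split an instruction word into the functions it triggers and its operand."""
    opcode  = (word >> 28) & 0xF        # operation code portion
    operand = word & 0x0FFFFFFF         # remaining bits: address/operand field
    merged = {
        OP_SEND_WITH_SWITCH: ("send", "switch_control"),
        OP_SEND_WITH_PTR:    ("send", "update_pointer"),
    }
    return merged.get(opcode, ("send",)), operand

funcs, operand = decode((OP_SEND_WITH_SWITCH << 28) | 0x123)
print(funcs)   # both functions dispatched in the same execution cycle
```

A triple merge (send + switch control + pointer update) would, per the text, need a second instruction word, i.e. double the fetch-stage width.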
In certain embodiments, each of said one or more instructions has a first bit width which matches the bit width of a fetch stage of the execution unit, and the instruction of the first bit width identifies an operand of the first bit width, the operand implementing the switch control function and the memory pointer update function. In certain embodiments, the computer program includes a synchronization instruction which generates an indication when the compute phase of the processing unit has completed. In some embodiments, the computer program is recorded on a non-transitory computer-readable medium. In certain embodiments, the computer program takes the form of a transmissible signal. According to another aspect, there is provided a processing unit comprising instruction storage, an execution unit arranged to execute a computer program, and data storage for holding data, wherein the instruction storage contains a computer program comprising one or more computer-executable instructions which, when executed by the execution unit, implement: a send function which causes a data packet destined for a recipient processing unit to be transmitted on a set of connection wires connected to the processing unit, the data packet comprising no destination identifier but being transmitted at a predetermined transmit time; and a switch control function which causes the processing unit to control switching circuits to connect a set of connection wires of the processing unit to a switching fabric to receive a data packet at a predetermined receive time.
In another aspect, a computer is provided comprising one or more chips in an integrated circuit package, the computer comprising a plurality of processing units, each processing unit comprising an instruction storage for containing a computer program, an execution unit arranged to execute the computer program, and a data storage for maintaining data, in which the instruction storage of each processing unit maintains a computer program comprising one or more computer-executable instructions which, when executed, implement: a send function which causes a data packet intended for a destination processing unit to be transmitted over a set of connection wires connected to the processing unit, the data packet comprising no destination identifier but being transmitted at a predetermined transmission time; and a switch control function which causes the processing unit to control switching circuits to connect a set of connection wires of the processing unit to a switching fabric in order to receive a data packet at a predetermined reception time.
BRIEF DESCRIPTION OF THE DRAWINGS
To facilitate understanding of the present description and to show how it can be implemented, reference will now be made by way of example to the accompanying drawings.
[Fig. 1] Figure 1 schematically illustrates the architecture of a single-chip processor;
[Fig. 2] Figure 2 is a block diagram of a block connected to the switching fabric;
[Fig. 3] Figure 3 is a diagram illustrating a BSP protocol;
[Fig. 4] Figure 4 is a diagram representing two blocks in a time-deterministic exchange;
[Fig. 5] Figure 5 is a schematic timing diagram illustrating a time-deterministic exchange;
[Fig. 6] Figure 6 is an example of an artificial intelligence graph;
[0067] [Fig. 7] Figure 7 is a schematic architecture illustrating the operation of a compiler generating time-deterministic programs;
[Fig. 8] Figures 8 to 11 illustrate the instruction formats of various instructions used in a time-deterministic architecture;
[0069] [Fig. 12] Figure 12 is a diagram of two blocks operating as a transmitting pair; and
[0070] [Fig. 13] Figure 13 is a diagram of two blocks operating as a receiving pair.
Detailed description of preferred embodiments
Figure 1 schematically illustrates the architecture of a single-chip processor 2. The processor is referred to here as an IPU (Intelligence Processing Unit) to indicate its suitability for artificial intelligence applications. In a computer, single-chip processors can be connected together, as described later, using chip-to-chip links to form a computer. The present description focuses on the architecture of the single-chip processor 2. The processor 2 comprises multiple processing units called blocks. In one embodiment, there are 1216 blocks organized in matrices 6a, 6b, referred to here as North and South. In the described example, each matrix has eight columns of 76 blocks (in fact there will generally be 80 blocks, for redundancy purposes). Note that the concepts described here extend to a number of different physical architectures, one example being given here to aid understanding. The chip 2 has two chip-to-host links 8a, 8b and 4 chip-to-chip links 30a, 30b arranged on the western edge of the chip 2. The chip 2 receives work from a host (not shown), connected to the chip via one of the chip-to-host links, in the form of input data to be processed by the chip 2. The chips can be connected together into cards by a further 6 chip-to-chip links 30a, 30b arranged along the eastern edge of the chip. A host may access a computer structured as a single-chip processor 2, as described here, or as a group of multiple interconnected single-chip processors 2, depending on the workload required by the host application. 
The chip 2 includes a clock 3 which controls the timing of chip activity. The clock is connected to all of the chip's circuits and components. The chip 2 also includes a time-deterministic switching fabric 34 to which all of the blocks and links are connected by sets of connection wires, the switching fabric being stateless, i.e. having no program-visible state. Each set of connection wires is fixed end to end. The wires are pipelined. In this embodiment, a set comprises 32 data wires plus control wires, e.g. a validity bit. Each set can carry a 32-bit data packet, but note that the word "packet" here denotes a set of bits representing a datum (sometimes called a data item here), perhaps with one or more validity bits. Packets do not have headers or any other form of destination identifier which would permit an intended recipient to be uniquely identified, nor do they have end-of-packet information. Instead, each represents a numerical or logical value input to or output from a block. Each block has its own local memory (described later); the blocks do not share memory. The switching fabric constitutes a crossed set of connection wires connected only to multiplexers and blocks, as described later, and holds no program-visible state. The switching fabric is considered stateless and uses no memory. Data exchange between blocks is carried out in a time-deterministic manner, as described here. A pipelined connection wire comprises a series of temporary stores, e.g. latches or flip-flops, which hold a datum for a clock cycle before releasing it to the next store. The travel time along the wire is determined by these temporary stores, each one using up a clock cycle of time in a path between any two points. Figure 2 illustrates an example of a block 4 in accordance with embodiments of the present description. 
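Before turning to the block itself, the travel-time rule for the pipelined wires described above can be sketched in a few lines (an illustrative model only; the class and its API are our invention, not part of the embodiment):

```python
from collections import deque

class PipelinedWire:
    """Toy model of one set of connection wires in the switching fabric:
    a chain of temporary stores, each holding a datum for one clock
    cycle, so the travel time equals the number of pipeline stages."""

    def __init__(self, n_stages):
        # None represents an idle wire (no valid datum in that store).
        self.stages = deque([None] * n_stages)

    def clock(self, datum_in=None):
        """Advance one clock cycle; return the datum leaving the wire."""
        datum_out = self.stages.pop()        # datum leaving the last store
        self.stages.appendleft(datum_in)     # new datum enters the first store
        return datum_out

# A datum driven onto a 3-stage wire emerges three cycles later.
wire = PipelinedWire(3)
outputs = [wire.clock("E0")] + [wire.clock() for _ in range(4)]
# outputs[3] == "E0"; every other cycle the wire output is idle
```

This makes the time-determinism concrete: the delay between any two points is a fixed, compile-time-knowable number of cycles.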
In the block, multiple threads are interleaved through a single execution pipeline. The block 4 comprises: a plurality of contexts 26, each of which is arranged to represent the state of a respective one of a plurality of threads; a shared instruction memory 12 common to the plurality of threads; a shared data memory 22 which is also common to the plurality of threads; a shared execution pipeline 14, 16, 18 which is again common to the plurality of threads; and a thread scheduler 24 for scheduling the plurality of threads for execution through the shared pipeline in an interleaved manner. The thread scheduler 24 is represented schematically in the diagram by a sequence of time slots S0 ... S5, but in practice it is a hardware mechanism managing the threads' program counters in relation to their time slots. The execution pipeline comprises an extraction stage 14, a decoding stage 16 and an execution stage 18 comprising an execution unit (EXU) and a load/store unit (LSU). Each of the contexts 26 comprises a respective set of registers R0, R1 ... for representing the program state of the respective thread. The extraction stage 14 is connected so as to extract instructions to be executed from the instruction memory 12, under the control of the thread scheduler 24. The thread scheduler 24 is arranged to control the extraction stage 14 to extract instructions from the local program for execution in each time slot, as will be described in more detail below. The extraction stage 14 has access to a program counter (PC) of each of the threads currently allocated to a time slot. For a given thread, the extraction stage 14 extracts the next instruction of that thread from the next address in the instruction memory 12, as indicated by that thread's program counter. 
Note that an "instruction" here denotes a machine code instruction, i.e. an instance of one of the basic instructions of the computer's instruction set, consisting of an operation code and zero or more operands. Note also that the program loaded into each block is determined by a programmer or a compiler so as to allocate work on the basis of the graph of the artificial intelligence model being supported. The extraction stage 14 then passes the extracted instruction to the decoding stage 16 to be decoded, and the decoding stage 16 then passes an indication of the decoded instruction to the execution stage 18, accompanied by the decoded addresses of the operand registers of the current context specified in the instruction, so that the instruction is executed. In this example, the thread scheduler 24 interleaves the threads according to a round-robin scheme in which, within each round of the scheme, the round is divided into a sequence of time slots S0, S1, S2, S3, each for executing a respective thread. Typically each slot is one processor cycle long and the different slots are of equal size (although this is not necessarily the case in all possible embodiments). This pattern is then repeated, each round comprising a respective instance of each of the time slots (in embodiments in the same order each time, although again this is not necessarily so in all possible embodiments). Note therefore that a time slot as referred to here means the repeatedly allocated place in the sequence, not a particular instance of the time slot in a given repetition of the sequence. In the illustrated embodiment there are eight time slots, but other numbers are possible. Each time slot is associated with a hardware resource, e.g. a register, for managing the context of an executing thread. 
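The round-robin interleaving just described can be sketched as follows (purely illustrative; the real scheduler is a hardware mechanism driving per-thread program counters, not a Python loop, and the thread names are assumptions):

```python
def interleave(slots, n_rounds):
    """One instruction issue per time slot per round, with the slot
    order repeating identically each round: a round-robin of slots.
    Returns a trace of (round, slot_index, thread) tuples."""
    return [(rnd, s, slots[s])
            for rnd in range(n_rounds)
            for s in range(len(slots))]

# Four slots over two rounds: each thread issues once per round,
# always from the same slot position.
trace = interleave(["W0", "W1", "SV", "W2"], 2)
```

Note how a "time slot" names the repeated position in the sequence: `W1` issues at slot index 1 in every round, whichever round it is.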
One of the contexts 26, denoted SV, is reserved for a special function, namely to represent the state of a supervisor (SV) whose job is to coordinate the execution of worker threads. The supervisor can be implemented as a program organized as one or more supervisor threads which may run concurrently. The supervisor thread may also be responsible for performing the barrier synchronizations described below, or may be responsible for exchanging data on and off the block, as well as into and out of the local memory so that it can be shared between worker threads between computations. The thread scheduler 24 is arranged so that, when the program as a whole starts, it begins by allocating the supervisor thread to all of the time slots, i.e. the supervisor SV starts out executing in all of the time slots S0 ... S5. However, the supervisor thread is provided with a mechanism for, at some later point (either immediately or after performing one or more supervisory tasks), temporarily relinquishing each of the slots in which it is executing to a respective one of the worker threads. C0, C1 denote slots to which worker threads have been allocated. This is achieved by the supervisor thread executing a relinquish instruction, called RUN by way of example here. In embodiments, this instruction takes two operands: an address of a worker thread in the instruction memory 12 and an address of some data for that worker thread in the data memory 22: RUN task_addr, data_addr Each worker thread is a codelet intended to represent a vertex in the graph and to execute atomically. That is, all the data it consumes is available at launch, and all the data it produces is not visible to other threads until it exits. It runs to completion (barring error conditions). The data address may specify some data upon which the codelet is to act. 
Alternatively, the relinquish instruction could take a single operand specifying the address of the codelet, and the data address could be included in the code of the codelet; or the single operand could point to a data structure specifying the address of the codelet and of the data. Codelets may be run concurrently and independently of one another. In any event, the relinquish instruction (RUN) acts on the thread scheduler 24 so as to relinquish the current time slot, i.e. the time slot in which this instruction is executed, to the worker thread specified by the operand. Note that it is implicit in the relinquish instruction that it is the time slot in which this instruction is executed that is being relinquished ("implicit" in the context of machine code instructions means that no operand is needed to specify this — it is implicit from the operation code itself). Thus the slot which is given away is the slot in which the supervisor executes the relinquish instruction. In other words, the supervisor executes in the same space that it gives away. The supervisor says "run this codelet in this time slot", and from that point on the slot is (temporarily) owned by the worker thread in question. Note that when the supervisor uses a slot it does not use the context associated with that slot but uses its own context SV. The supervisor thread SV performs a similar operation in each of the time slots, to give away all of its slots C0, C1 to different respective ones of the worker threads. Once it has done so for the last slot, the supervisor pauses its execution, since it has no slots in which to execute. Note that the supervisor may not give away all of its slots; it may retain some slots in which to run itself. When the supervisor thread determines that it is time to run a codelet, it uses the relinquish instruction (RUN) to allocate that codelet to the slot in which it executes the RUN instruction. 
[0083] Each of the worker threads in the slots C0, C1 proceeds to perform its computation task or tasks. On completing its task(s), the worker thread hands back the time slot in which it is executing to the supervisor thread. This is achieved by the worker thread executing an exit instruction (EXIT). In one embodiment, the EXIT instruction takes at least one operand, and preferably a single operand, exit_state (e.g. a binary value), to be used for any purpose desired by the programmer to indicate a state of the respective codelet upon termination. EXIT exit_state In one embodiment, the EXIT instruction acts on the scheduler 24 so that the time slot in which it is executed is handed back to the supervisor thread. The supervisor thread can then perform one or more supervisory tasks (e.g. a synchronization barrier and/or a movement of data in memory to facilitate data exchange between worker threads), and/or go on to execute another relinquish instruction to allocate a new worker thread (W4, etc.) to the slot in question. Note also that, consequently, the total number of execution threads in the instruction memory 12 may be greater than the number the barrel-threaded processing unit 10 can interleave at any one time. It is the role of the supervisor thread SV to schedule which of the worker threads W0 ... Wj from the instruction memory 12, at which stage in the overall program, are to be executed. In another embodiment, the EXIT instruction does not need to define an exit state. This instruction acts on the thread scheduler 24 so that the time slot in which it is executed is handed back to the supervisor thread. The supervisor thread can then perform one or more supervisory tasks (e.g. barrier synchronization and/or data exchange), and/or go on to execute another relinquish instruction, and so on. 
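The RUN/EXIT hand-over of time slots can be sketched as follows (a minimal model assuming string thread names; the real mechanism is the hardware scheduler acting on contexts, and the class and method names here are ours):

```python
class SlotScheduler:
    """Sketch of slot ownership: the supervisor gives away the slot it
    is executing in to a worker thread (RUN), and the worker hands it
    back on EXIT."""

    def __init__(self, n_slots):
        # The supervisor SV starts out owning every time slot.
        self.owner = ["SV"] * n_slots

    def run(self, slot, worker):
        # RUN is executed by the supervisor *in* `slot`, so that same
        # slot is implicitly the one being given away.
        assert self.owner[slot] == "SV", "only the supervisor relinquishes a slot"
        self.owner[slot] = worker

    def exit(self, slot):
        # EXIT returns the slot to the supervisor thread.
        self.owner[slot] = "SV"
```

Note that the slot given away by `run` is the one the supervisor executes in, mirroring the implicit-operand semantics of RUN, and that after `exit` the supervisor may reallocate the same slot to a different codelet.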
As briefly mentioned above, data is exchanged between blocks on the chip. Each chip operates a bulk synchronous parallel (BSP) protocol, comprising a calculation phase and an exchange phase. The protocol is illustrated for example in Figure 3. The left-hand diagram in Figure 3 represents a calculation phase in which each block 4 is in a phase where the codelets execute on local memory (12, 22). Although in Figure 3 the blocks 4 are shown arranged in a circle, this is for the purpose of explanation only and does not reflect the actual architecture. After the calculation phase there is a synchronization, indicated by arrow 30. To achieve this, a SYNC (synchronization) instruction is provided in the processor's instruction set. The SYNC instruction has the effect of causing the supervisor thread SV to wait until all currently executing worker threads W have exited by means of an EXIT instruction. In embodiments, the SYNC instruction takes a mode as an operand (in embodiments it is the only operand), the mode specifying whether the SYNC is to act only locally in relation to only those worker threads running locally on the same processor module 4, e.g. the same block, or whether instead it is to apply across multiple blocks or even across multiple chips. SYNC mode // mode ∈ {tile, chip, zone_1, zone_2} The BSP scheme is in itself known in the art. According to the BSP scheme, each block 4 performs a calculation phase 52 and an exchange phase 50 (sometimes called communication or message passing) in an alternating cycle. The calculation phase and the exchange phase are performed by executing instructions on the block. During the calculation phase 52 each block 4 performs one or more computation tasks locally on the block, but does not communicate the results of these computations with any other blocks 4. 
In the exchange phase 50, each block 4 is allowed to exchange (communicate) one or more results of the preceding calculation phase to and/or from one or more others of the blocks in the group, but does not yet perform any new computations that might depend on a task performed on another block 4, or upon which a task on another block 4 might potentially depend (it is not excluded that other operations, such as internal control-related operations, may be performed in the exchange phase). Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the calculation phases 52 into the exchange phase 50, or at the juncture transitioning from the exchange phases 50 into the calculation phase 52, or both. That is to say, either: (a) all blocks 4 are required to complete their respective calculation phases 52 before any block in the group is allowed to proceed to the next exchange phase 50; or (b) all blocks 4 in the group are required to complete their respective exchange phases 50 before any block in the group is allowed to proceed to the next calculation phase 52; or (c) both of these conditions are enforced. This sequence of exchange and calculation phases may then be repeated over multiple iterations. In BSP terminology, each iteration of exchange phase and calculation phase is referred to here as a "superstep", consistent with usage in some prior descriptions of BSP. It is noted here that the term "superstep" is sometimes used in the art to denote each of the exchange phase and the calculation phase individually. The execution unit (EXU) of the execution stage 18 is arranged so that, in response to the operation code of the SYNC instruction, when qualified by the on-chip (inter-block) operand, it causes the supervisor thread in which "SYNC chip" was executed to be paused until all the blocks 4 of the matrix 6 have finished running their worker threads. 
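The BSP alternation above can be sketched with ordinary threads and a barrier (a minimal model, not the hardware mechanism; the function, phase names and logging are ours):

```python
import threading

def run_bsp(n_tiles, n_supersteps):
    """BSP sketch: no participant enters the exchange phase of a
    superstep until every participant has finished its calculation
    phase (condition (a)), and none starts the next calculation phase
    until every exchange has finished (condition (b))."""
    barrier = threading.Barrier(n_tiles)
    lock = threading.Lock()
    log = []  # ordered record of (phase, superstep, tile)

    def tile(tid):
        for step in range(n_supersteps):
            with lock:
                log.append(("compute", step, tid))
            barrier.wait()  # barrier between compute and exchange
            with lock:
                log.append(("exchange", step, tid))
            barrier.wait()  # barrier before the next superstep

    threads = [threading.Thread(target=tile, args=(t,)) for t in range(n_tiles)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

Whatever order the threads are scheduled in, every "compute" entry of a superstep precedes every "exchange" entry of that superstep in the log — this is exactly the guarantee the barrier provides.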
This can be used to implement a barrier for the next BSP superstep, i.e. after all the blocks 4 on the chip 2 have passed the barrier, the inter-block program as a whole can progress to the next exchange phase 50. Each block indicates its synchronization state to a synchronization module 36. Once it has been established that each block is ready to send data, the synchronization process 30 causes the system to enter the exchange phase, which is represented on the right-hand side of Figure 3. In this exchange phase, data values move between the blocks (in fact, between the memories of the blocks, in a memory-to-memory data movement). In the exchange phase there are no computations that could induce concurrency hazards between block programs. In the exchange phase, each datum moves along the connection wires on which it leaves a block, from a sending block to one or more destination blocks. At each clock cycle, a datum moves a certain distance along its path (store to store), in pipelined fashion. When a datum is sent out from a block, it is not sent with a header identifying a destination block. Instead, the destination block knows that it will be expecting a datum from a certain sending block at a certain time. Thus the computer described here is time-deterministic. Each block runs a program which has been allocated to it by the programmer or by a compiler function, the programmer or compiler function having knowledge of what will be transmitted by a particular block at a certain time and of what needs to be received by a destination block at a certain time. In order to achieve this, SEND instructions are included in the local programs executed by the processor on each block, the execution time of the SEND instruction being predetermined relative to the timing of other instructions executed on other blocks in the computer. 
This is described in more detail below, but first the mechanism by which a destination block can receive a datum at a predetermined time will be described. Each block 4 is associated with its own multiplexer 210, and thus the chip has 1216 multiplexers. Each multiplexer has 1216 inputs, each input being 32 bits wide (plus, optionally, some control bits). Each input is connected to a respective set 140_x of connection wires in the switching fabric 34. The connection wires of the switching fabric are also connected to a set of data-output connection wires 218 from each block (a broadcast exchange bus, described below), and thus there are 1216 sets of connection wires which, in this embodiment, extend in one direction across the chip. For ease of illustration, a single set of wires 140_sc is shown in bold, connected to the data-output wires 218_s coming from a block, not shown in Figure 2, in the south matrix 6b. This set of wires is labelled 140_x to indicate that it is one of a number of sets of crossed wires 140_0 to 140_1215. As can now be seen from Figure 2, when the multiplexer 210 is switched to the input labelled 220_x, this will connect it to the crossed wires 140_x and thus to the data-output wires 218_s of the block (not shown in Figure 2) of the south matrix 6b. If the multiplexer is controlled to switch to that input (220_sc) at a certain time, then a datum received on the data-output wires connected to the set of connection wires 140_x will appear at the output 230 of the multiplexer 210 at a certain time. It will arrive at the block 4 a certain delay after that, the delay depending on the distance of the multiplexer from the block. Since the multiplexers form part of the switching fabric, the delay from the block to the multiplexer can vary depending on the block's location. 
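The role of the per-block output bus and input multiplexer can be sketched abstractly as follows (pipeline and wire delays are omitted for brevity; the class, the method names and the None-for-idle convention are ours, not the patent's):

```python
class Fabric:
    """Toy model of the stateless exchange: each block drives a fixed
    set of output wires, and each block's input multiplexer selects
    which sender's wires it listens to. There are no headers: delivery
    is decided entirely by the multiplexer settings."""

    def __init__(self, n_blocks):
        self.out_bus = [None] * n_blocks   # value each block is driving
        self.mux_sel = [None] * n_blocks   # each block's input selection

    def send(self, block, value):
        # A send drives the block's own output wires; no destination is given.
        self.out_bus[block] = value

    def select_input(self, block, source):
        # Switch the block's own input multiplexer to listen to `source`.
        self.mux_sel[block] = source

    def recv(self, block):
        # What appears at the block's multiplexer output ('null' if unselected).
        sel = self.mux_sel[block]
        return None if sel is None else self.out_bus[sel]
```

The model makes the statelessness visible: the fabric holds no routing tables, and the same sent value reaches exactly those blocks whose multiplexers happen to select the sender's wires.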
To implement the switching, the local programs executed on the blocks include switch control instructions (PUTi) which cause a multiplexer control signal 214 to be issued, controlling the multiplexer associated with that block to switch its input a certain time before the time at which a particular datum is expected to be received at the block. In the exchange phase, multiplexers are switched and packets (data) are exchanged between blocks using the switching fabric. From this explanation it can be seen that the switching fabric has no state — the movement of each datum is predetermined by the particular set of wires to which each multiplexer's input is switched. In the exchange phase, all-blocks-to-all-blocks communication is enabled. The exchange phase can have multiple cycles. Each block 4 has control of its own unique input multiplexer 210. Incoming traffic from any other block in the chip, or from one of the connection links, can be selected. Note that it is possible for a multiplexer to be set to receive a 'null' input — that is, no input from any other block in that particular exchange phase. The selection can change cycle by cycle within an exchange phase; it does not have to remain constant throughout. Data may be exchanged on chip, or from chip to chip, or from chip to host, depending on the link that is selected. The present application is concerned mainly with inter-block communication on a chip. To perform synchronization on the chip, a small number of pipelined signals are provided from all of the blocks to a synchronization controller 36 on the chip, and a pipelined sync-ack signal is broadcast from the synchronization controller back to all of the blocks. In one embodiment, the pipelined signals are one-bit-wide daisy-chained AND/OR signals. One mechanism by which synchronization between blocks is achieved is the SYNC instruction mentioned above and described further below. 
Other mechanisms can be used: what matters is that all the blocks can be synchronized between a calculation phase of the chip and an exchange phase of the chip (Figure 3). The SYNC instruction triggers the following functionality in dedicated synchronization logic on the block 4 and in the synchronization controller 36. The synchronization controller 36 may be implemented in the hardware interconnect 34 or, as shown, in a separate module on the chip. This functionality of both the on-block synchronization logic and the synchronization controller 36 is implemented in dedicated hardware circuitry so that, once SYNC chip is executed, the rest of the functionality proceeds without further instructions being executed to carry it out. First, the on-block synchronization logic causes instruction issue for the supervisor on the block 4 in question to pause automatically (it causes the extraction stage 14 and the scheduler 24 to suspend issuing instructions of the supervisor). Once all the outstanding worker threads on the local block 4 have executed an EXIT, the synchronization logic automatically sends a synchronization request sync_req to the synchronization controller 36. The local block 4 then continues to wait, with supervisor instruction issue paused. A similar process is also implemented on each of the other blocks 4 in the matrix 6 (each comprising its own instance of the synchronization logic). Thus, at some point, once all the final worker threads in the current calculation phase 52 have exited on all the blocks 4 in the matrix 6, the synchronization controller 36 will have received a respective synchronization request (sync_req) from all the blocks 4 of the matrix 6. 
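The request-gathering behaviour of the synchronization controller can be sketched as follows (a minimal model, assuming that the acknowledgment is released only once every block has requested synchronization; the class, method and attribute names are ours):

```python
class SyncController:
    """Sketch of the on-chip barrier logic: sync_ack is asserted only
    once every block has raised sync_req, i.e. once every block's
    worker threads have all exited."""

    def __init__(self, n_blocks):
        self.pending = set(range(n_blocks))  # blocks yet to request sync
        self.ack = False                     # broadcast sync_ack state

    def sync_req(self, block):
        """Register a block's synchronization request; return ack state."""
        self.pending.discard(block)
        if not self.pending:
            self.ack = True   # all requests received: broadcast sync_ack
        return self.ack
```

Until the final request arrives, every block observes a de-asserted ack and keeps its supervisor paused; the last request flips the ack for all blocks at once.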
Only then, in response to receipt of the sync_req from every block 4 of the matrix 6 on the same chip 2, does the synchronization controller 36 send a synchronization acknowledgment signal sync_ack back to the synchronization logic on each of the blocks 4. Until that point, each of the blocks 4 has had its supervisor instruction issue paused awaiting the synchronization acknowledgment signal (sync_ack). On receiving the sync_ack signal, the synchronization logic in the block 4 automatically unpauses supervisor instruction issue for the respective supervisor thread on that block 4. The supervisor is then free to carry out data exchange with other blocks 4 via the interconnect 34 in a subsequent exchange phase 50. Preferably, the sync_req and sync_ack signals are respectively sent to and received from the synchronization controller via one or more dedicated synchronization wires connecting each block 4 to the synchronization controller 36 in the interconnect 34. The connection structure of the block will now be described in more detail. Each block has three interfaces:
- an exin interface 224 which passes data from the switching fabric 34 to the block 4;
- an exout interface 226 which passes data from the block to the switching fabric over the broadcast exchange bus 218; and
- an exmux interface 228 which passes the multiplexer control signal 214 (mux-select) from the block 4 to its multiplexer 210.
In order to ensure that each individual block executes SEND instructions and switch control instructions at the appropriate times to transmit and receive the correct data, exchange scheduling requirements must be met by the programmer or the compiler that allocates the individual programs to the individual blocks in the computer. This function is performed by an exchange scheduler, which needs to be aware of the following exchange timing parameters (BNET). 
To understand the parameters, a simplified version of Figure 2 is shown in Figure 4. Figure 4 also shows a destination block as well as a transmitting block. I. The relative SYNC acknowledgment delay of each block, BNET_RSAK (TID). TID is the block identifier held in a TILE_ID register described later. It is a number of cycles, always greater than or equal to 0, indicating when each block receives the acknowledgment signal from the synchronization controller 36 relative to the earliest-receiving block. It can be calculated from the block ID, noting that the block ID indicates the particular location on the chip of that block and therefore reflects physical distances. Figure 4 shows a transmitting block 4_T and a receiving block 4_R. Although shown only schematically and not to scale, the block 4_T is shown closer to the synchronization controller and the block 4_R further away, with the consequence that the synchronization acknowledgment delay will be shorter for the block 4_T than for the block 4_R. A specific value will be associated with each block for the synchronization acknowledgment delay. These values can be held, for example, in a delay table, or can be computed on the fly each time on the basis of the block ID. II. The exchange multiplexer control loop delay, BNET_MXP (TID of receiving block). This is the number of cycles between issuing an instruction (PUTi-MUXptr) that changes a block's input multiplexer selection and the earliest point at which the same block could issue a (hypothetical) load instruction for exchange data stored in memory as a result of the new multiplexer selection. Looking at Figure 4, this delay comprises the delay for the control signal to travel from the exmux interface 228_R of the destination block 4_R to its multiplexer 210_R, plus the length of the line from the output of the multiplexer to the exin data input interface 224. [0101] III. 
The block-to-block exchange delay, BNET_TT (TID of sending block, TID of receiving block). This is the number of cycles between a SEND instruction being issued on one block and the earliest point at which the receiving block could issue a (hypothetical) load instruction pointing to the sent value in its own memory. This can be determined from the TIDs of the sending and receiving blocks, either by accessing a table as already described, or by calculation. Looking again at Figure 4, the delay comprises the time for a datum to travel from the transmitting block 4_T, from its exout interface 226_T, to the switching fabric 34 along its exchange bus 218_T, and then via the input multiplexer 210_R at the receiving block 4_R to the exin interface 224_R of the receiving block. IV. The exchange traffic memory pointer update delay, BNET_MMP(). This is the number of cycles between issuing an instruction (PUTi-MEMptr) that changes a block's exchange input traffic memory pointer and the earliest point at which that same block could issue a (hypothetical) load instruction for exchange data stored in memory as a result of the new pointer. It is a small, fixed number of cycles. The memory pointer has not yet been described, but is shown in Figure 2 with reference 232. It acts as a pointer into the data memory 22 and indicates where incoming data from the exin interface 224 is to be stored. This is described in more detail below. Figure 5 shows the exchange timing in more detail. On the left-hand side of Figure 5 are the IPU clock cycles, running from 0 to 30. Activity on the sending block 4_T occurs between IPU clock cycles 0 and 9, starting with issuance of a send instruction (SEND F3). In IPU clock cycles 10 to 24, the datum follows its pipelined path through the switching fabric 34. 
Looking at the receiving block 4R, in IPU clock cycle 11 a PUTi instruction is executed to change the block's input multiplexer selection: PUTi-MXptr (F3). In FIG. 5, this PUTi instruction is labeled PUTi INCOMING MUX (F3). In cycle 18, the memory pointer instruction PUTi-MEMptr (F3) is executed, allowing a load instruction in IPU clock cycle 25. In FIG. 5, this PUTi instruction is labeled PUTi INCOMING ADR (F3). On the sending block 4T, IPU clock cycles 1, 3 and 5 are labeled Transport (). This is an internal block delay between the issue of a SEND instruction and the appearance of the SEND instruction's data on the exout interface. F4, E1, E3, etc. represent data from earlier SEND instructions in transport to the exout interface. IPU clock cycle 2 is allocated to forming an address, E0, for a SEND instruction (Form addr (E0)). Note that this is the address from which E0 is to be fetched, not its destination address. In IPU clock cycle 4, a memory macro is executed to fetch E2 from memory. In IPU clock cycle 6, a parity check is performed on E4. In IPU clock cycle 7, a MUX output instruction is executed to send E5. In IPU clock cycle 8, E6 is encoded, and in IPU clock cycle 9, E7 is output. In the exchange fabric 34, IPU clock cycles 10 to 24 are labeled exchange pipeline stage. In each cycle, data moves one step along the pipeline (between temporary stores). [0108] Cycles 25 to 28 represent the delay on the receiving block 4R between the reception of data at the exin interface (see Mem Macro (E2) for Exc), while cycles 25 to 39 represent the delay between receiving data at the exin interface and loading it into memory (see Mem Macro (E2) for LD). Other functions can be carried out in this interval - see Earliest LD (F3), Reg file rd (F4), Form addr (E0), Transport (E1).
In simple terms, if the processor of the receiving block 4R wants to act on a datum (for example F3) which was the output of a process on the transmitting block 4T, then the transmitting block 4T must execute a SEND instruction [SEND (F3)] at a certain time (for example IPU clock cycle 0 in FIG. 5), and the receiving block must execute a switch control instruction PUTi EXCH MXptr (as in IPU clock cycle 11) a certain time after the execution of the SEND [SEND (F3)] instruction on the transmitting block. This ensures that the data arrives at the receiving block in time to be loaded [Earliest LD (F3)] in IPU cycle 25 for use in a codelet being executed at the receiving block. Note that the reception process at a receiving block need not involve setting the memory pointer, as with the PUTi MEMptr instruction. Instead, the memory pointer 232 (FIG. 2) is automatically incremented after receipt of each datum at the exin interface 224. The received data is then simply loaded into the next available memory location. However, the ability to change the memory pointer enables the receiving block to alter the memory location at which the datum is written. All of this can be determined by the compiler or programmer who writes the individual programs for the individual blocks so that they communicate correctly. This makes the timing of an internal exchange (exchanges internal to the chip) completely time-deterministic. This time determinism can be used by the exchange scheduler to heavily optimize exchange sequences. FIG. 6 illustrates an example application of the processor architecture described here, namely an artificial intelligence application. As mentioned earlier, and as is known to those skilled in the art of artificial intelligence, artificial intelligence begins with a learning phase in which the artificial intelligence algorithm learns a knowledge model.
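The scheduling relationship just described can be sketched as a small calculation the compiler might perform. The delay arithmetic is an assumption for illustration; the cycle numbers mirror the FIG. 5 example (SEND at cycle 0, PUTi at cycle 11, earliest load at cycle 25).

```python
def schedule_switch_control(send_cycle: int, tile_to_tile_delay: int,
                            mux_loop_delay: int) -> int:
    """Latest cycle at which the receiving block must issue its
    PUTi-MUXptr switch control so that the earliest (hypothetical)
    load still observes the sent value.  The subtraction model is an
    assumed simplification of the delays described in the text."""
    earliest_load = send_cycle + tile_to_tile_delay
    return earliest_load - mux_loop_delay

# Mirroring FIG. 5: SEND at cycle 0, earliest load at cycle 25, and an
# assumed mux control loop delay of 14 cycles put the PUTi at cycle 11.
```

Because every term is a fixed number of common-clock cycles, the compiler can place the SEND and PUTi instructions statically and be certain the data arrives in time.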
The model can be represented as a graph 60 of interconnected nodes 102 and links 104. Nodes and links may also be called vertices and edges. Each node 102 of the graph has one or more input edges and one or more output edges, some of the input edges of some of the nodes 102 being the output edges of some of the other nodes, thereby connecting the nodes together to form the graph. Further, one or more of the input edges of one or more of the nodes 102 form the inputs of the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Each edge 104 communicates a value, commonly in the form of a tensor (n-dimensional matrix), these forming the inputs and outputs provided to and from the nodes 102 on their input and output edges respectively. Each node 102 represents a function of its one or more inputs received on its input edge(s), the result of this function being the output(s) provided on its output edge(s). These results are sometimes called activations. Each function is parameterized by one or more respective parameters (sometimes called weights, although they need not necessarily be multiplicative weights). In general, the functions represented by the different nodes 102 may be different forms of function and/or may be parameterized by different parameters. Further, each of said one or more parameters of each node's function is characterized by a respective error value. Moreover, a respective error condition may be associated with the error(s) in the parameter(s) of each node 102. For a node 102 representing a function parameterized by a single parameter, the error condition may be a simple threshold, i.e. the error condition is satisfied if the error falls within the specified threshold but not satisfied if the error is beyond the threshold.
For a node 102 parameterized by more than a single respective parameter, the error condition for that node 102 may be more complex. For example, the error condition may be satisfied only if each of the parameters of that node 102 falls within its respective threshold. As another example, a combined metric may be defined combining the errors in the different parameters for the same node 102, and the error condition may be satisfied on condition that the value of the combined metric falls within a specified threshold, the error condition otherwise not being satisfied if the value of the combined metric is beyond the threshold (or vice versa, depending on the definition of the metric). Whatever the error condition, it gives a measure of whether the error in the parameter(s) of the node falls below a certain level or degree of acceptability. In the learning stage, the algorithm receives experience data, that is to say multiple data points representing different possible combinations of inputs to the graph. As more and more experience data is received, the algorithm gradually adjusts the parameters of the various nodes 102 of the graph based on the experience data so as to try to minimize the errors in the parameters. The goal is to find values of the parameters such that the output of the graph is as close as possible to a desired result. As the graph as a whole tends toward such a state, the calculation is said to converge. For example, in a supervised approach, the input experience data takes the form of training data, that is to say inputs which correspond to known outputs. With each data point, the algorithm can adjust the parameters so that the output matches the known output as closely as possible for the given input. In the subsequent prediction stage, the graph can then be used to map an input query to an approximate predicted output (or vice versa if an inference is being made). Other approaches are also possible.
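The two forms of error condition described above can be sketched as a single predicate. The use of a plain sum as the combined metric is an assumed choice for illustration; the description leaves the metric's definition open.

```python
def error_condition_met(errors, thresholds, combined_threshold=None):
    """Sketch of the per-node error conditions described above.
    With per-parameter thresholds: every error must fall within its
    respective threshold.  With a combined metric (here a plain sum,
    an assumed choice): the combined value must fall within
    combined_threshold."""
    if combined_threshold is not None:
        return sum(errors) <= combined_threshold
    return all(e <= t for e, t in zip(errors, thresholds))
```

During learning, convergence could then be tested by requiring this predicate to hold at every node of the graph.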
For example, in an unsupervised approach, there is no concept of a reference result per input datum; instead the artificial intelligence algorithm is left to identify its own structure in the output data. Or, in a reinforcement approach, the algorithm tries out at least one possible output for each data point in the input experience data, and is told whether that output is positive or negative (and potentially the degree to which it is positive or negative), for example win or lose, reward or punishment, or the like. Over many trials the algorithm can gradually adjust the parameters of the graph to be able to predict inputs that will result in a positive output. The various approaches and algorithms for learning a graph are known to those skilled in the art of artificial intelligence. According to an example application of the techniques described here, each worker thread is programmed to perform the calculations associated with a respective individual node of the nodes 102 in an artificial intelligence graph. In this case, the edges 104 between the nodes 102 correspond to the exchanges of data between threads, at least some of which may involve exchanges between blocks. FIG. 7 is a diagram illustrating the function of a compiler 70. The compiler receives a graph such as graph 60 and compiles the functions in the graph into a multitude of codelets, which are contained in local programs labeled 72 in FIG. 7. Each local program is designed to be loaded into a particular block of the computer. Each program comprises one or more codelets 72a, 72b ... plus a supervisor sub-program 73, each formed of a sequence of instructions. The compiler generates the programs such that they are linked to one another in time, that is to say that they are time-deterministic.
In order to do this, the compiler accesses block data 74 which includes block identifiers indicative of the locations of the blocks, and therefore of the delays that the compiler needs to understand in order to generate the local programs. The delays have already been mentioned above, and can be calculated on the basis of the block data. Alternatively, the block data can incorporate a data structure in which these delays are available through a look-up table. A description will now be given of new instructions which have been developed as part of the instruction set for the computer architecture defined here. FIG. 8 shows a 32-bit SEND instruction. A SEND instruction indicates a data transmission from block memory. It causes one or more data items stored at a particular address in the local memory 22 of a block to be transmitted at the exout interface of the block. Each data item (called an 'element' in the instruction) can be one or more words long. A SEND instruction acts on one word or multiple words to implement a send function. The SEND instruction has an operation code 80, a field 82 indicating a message count, that is, the number of elements to be sent in the form of one or more packets starting from the send address indicated in an address field 84. The field 84 defines the address in local memory from which the elements are to be sent, as an immediate value which is added to a base value stored in a base address register. The SEND instruction also has a control field 86 (SCTL) which indicates the word size, selected between 4 and 8 bytes. The packet carries no destination identifier: in other words, the receiving block that is to receive the elements is not uniquely identified in the instruction. The send function causes the specified number of data items starting from the send address to be accessed from local memory and placed at the ex_out interface of the block, to be transmitted on the next clock cycle.
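A possible packing of the SEND fields named above (opcode 80, message count 82, address immediate 84, SCTL 86) into a 32-bit word can be sketched as follows. The field widths and bit positions are assumptions for illustration; the description does not specify them.

```python
# Hypothetical field layout for the 32-bit SEND instruction:
# [31:26] opcode | [25:18] message count | [17:1] address immediate | [0] SCTL
# Widths are assumed; only the field names come from the description.

def encode_send(opcode: int, count: int, addr_imm: int, sctl: int) -> int:
    assert sctl in (0, 1)  # assumed encoding: 0 -> 4-byte, 1 -> 8-byte words
    return ((opcode & 0x3F) << 26) | ((count & 0xFF) << 18) \
           | ((addr_imm & 0x1FFFF) << 1) | sctl

def send_address(base_register: int, addr_imm: int) -> int:
    """The send address is the immediate added to the base register."""
    return base_register + addr_imm
```

Note that nothing in the encoded word identifies a destination: the recipient is determined entirely by which block switches its input multiplexer at the right time.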
In another variant of the SEND instruction, the address from which the elements are to be sent could be implicit, taken from a base value in the base address register and a delta value in an outgoing delta register. The delta value may be set based on information in a previous SEND instruction. In place of a unique identifier of the intended recipient block, the compiler has arranged for the correct recipient block to switch its local multiplexer(s) at the correct time to receive the datum (data elements), as already described here. Note that an intended recipient block could in some cases be the transmitting block itself. For this purpose, a switch control function is provided, as described above. FIG. 9 illustrates a PUT-i-MUX instruction which performs this function. An operation code field 90 identifies the instruction as a PUT-i-MUX instruction. A delay period can be specified by an immediate delay value 92. This delay value can be used to replace 'no op' instructions, and is a way of optimizing code compression. This instruction, when executed, defines in an incoming_mux field 98 which input of the multiplexer 210 is to be selected to 'listen' for elements that have been sent from another block. For compactness, this multiplexer control function could be combined in a single instruction with the send function defined above, as shown in FIG. 10. Note that there is no connection between the send function, which causes the block to act as a transmitting block, and the switch control function, which is a function for when the block is acting as a recipient block, other than that they can be performed in a single execution cycle on the same block. FIG. 10 is an example of a 'merge' instruction. In this context, a merge instruction means an instruction which defines two or more functions which can be carried out at the same time (in one execution cycle) on one block. FIG.
10 illustrates a form of 'merge' send instruction, in which a send function is combined with a second function which can modify state held in registers at the block. One function is to change the memory pointer for data received at that block. Another function is to set the incoming MUX. The PUTi_MEMptr function identifies a memory location in local memory at which the next datum received by the block is to be loaded. This function could be carried out by a dedicated 'receive' instruction, although its function is not to enable the receipt of a datum but to modify the memory pointer. In fact, no specific instruction needs to be executed to receive data at a block. Data arriving at the exin interface will be loaded into the next memory location identified by the memory pointer, under the control of the exin interface. The instruction of FIG. 10 has an operation code field 100 and a field 102 indicating a number of elements to be sent. The immediate value in an incoming state modification field 106 is written to an exchange configuration state register specified by field 104. In a first form, the state modification field 106 can write an incoming delta for calculating the receive address at which the memory pointer is to be set. In another form, the exchange configuration state is written with the incoming MUX multiplexer value which sets the multiplexer input. For this form of merge instruction, the send function uses a send address determined from values stored in one or more registers, which is implicit in the instruction. For example, the send address can be determined from the base register and the delta register. FIG. 11 shows a double-width instruction, referred to as an exchange instruction (EXCH).
This instruction initiates a data transmission from an address indicated in block memory and sets the incoming exchange configuration state (the multiplexer and/or the memory pointer for receiving data). The EXCH instruction is unique in that it is immediately followed by an inline 32-bit payload, located at the memory location immediately following the instruction. The EXCH instruction has an operation code field 110 which denotes an exchange instruction EXCH. The payload includes a 'co-issue' flag 119. The EXCH instruction comprises a format field 112 of a single bit which specifies the incoming format data width (32 bits or 64 bits). The data width can have implications for the setting of the multiplexer lines, as will be explained below. An element field 114 defines the number of elements which the exchange instruction causes to be sent. These elements are sent from a send address calculated using the immediate field 116, as in the send instruction of FIG. 8. The value in this field is added to the value in the base register. [0126] Reference numeral 118 denotes a control field which defines the word size for the sent data. The payload includes a switch control field 120 which acts as the switch control for the incoming multiplexer, as described above in connection with FIG. 9. Reference numeral 122 denotes a payload field defining an incoming delta for calculating the address at which the incoming data is to be stored, as described above in connection with the instruction of FIG. 10. The 64-bit wide EXCH exchange instruction of FIG. 11 can be executed every clock cycle and thus allows, simultaneously: • a send from a particular address • an update of the incoming mux • an update of the incoming address [0127] Thus, any exchange schedule can be encoded in a single instruction.
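The three simultaneous effects of a single EXCH instruction can be sketched as an update of per-block exchange state. The register names follow Table 1 below; the behavioral model itself is an assumption for illustration.

```python
class ExchangeState:
    """Sketch of the incoming exchange configuration state that one
    EXCH instruction updates while also triggering a send.  Register
    names follow the description; the behavior shown is an assumed
    simplification."""
    def __init__(self, incoming_base: int = 0):
        self.incoming_mux = None          # which cross wires to listen to
        self.incoming_base = incoming_base
        self.incoming_delta = 0

    def exec_exch(self, send_imm: int, base_reg: int,
                  mux_value: int, incoming_delta: int) -> int:
        # One instruction, three simultaneous effects:
        self.incoming_mux = mux_value         # update the incoming mux
        self.incoming_delta = incoming_delta  # update the incoming address
        return base_reg + send_imm            # send address for the datapath

    def receive_address(self) -> int:
        return self.incoming_base + self.incoming_delta
```

This is why the instruction is double-width: the inline payload carries the switch control and incoming delta fields that would not fit in a 32-bit encoding.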
The instructions of FIGS. 8, 9 and 10 perform similar functions, but as they are only 32 bits long they can be used to minimize the size of the exchange code in the local memory of each block. The decision about which instruction to use in any particular context is made at the compiler 70 when constructing the codelets for a local program 72. [0128] There follows a list of key registers and their semantics to support the above instructions. These registers form part of the register file on each block. [Table 1]
TILE_ID - Holds a unique identifier for this block.
INCOMING_MUX [INCOMING_MUXPAIR] - Holds the block ID of the source block for incoming messages, which acts to select the 'listening' input for the multiplexer associated with the receiving block.
INCOMING_DELTA - Holds an auto-incrementing value for calculating the address at which incoming data is to be stored: it can be overwritten by an explicit field [see for example FIG. 10]. It is added to INCOMING_BASE.
INCOMING_BASE - Holds a common base address for updating the memory pointer (added to INCOMING_DELTA).
OUTGOING_BASE - Holds a common base address for send instructions.
OUTGOING_DELTA - Holds a delta for calculating the addresses of send instructions. A 'send' address is equal to OUTGOING_BASE + OUTGOING_DELTA.
INCOMING_FORMAT - Identifies incoming data as 32-bit or 64-bit.
Note that the INCOMING_DELTA and INCOMING_MUX registers form part of the exchange state of the block. Reference will now be made to FIGS. 12 and 13 to explain block pairing, which is a feature whereby a physical pair of blocks can collaborate to make more effective use of their combined exchange resources. Block pairing can be used to double
the transmit bandwidth of a single block by using a neighbor's transmit bus, or to double the receive bandwidth for both blocks of a pair of blocks by sharing a neighbor's receive bus and associated incoming multiplexer. FIG. 12 illustrates the logic associated with the blocks of a pair of blocks for making a double-width transmission. A double-width transmission is achieved by borrowing the outgoing exchange resources of a neighbor for the duration of a SEND. The neighboring block cannot perform its own data transmission during this time. A SEND instruction can perform a single-width or double-width data transfer, the width of the transfer being specified by a value held in a register, or an immediate field. The width can be specified as 32 bits (one word), in which case the field has the value 0, or 64 bits (two words), in which case the field has the value 1. Other logical definitions are possible. The specified width is passed from a register on the block 4 to a control store 1200 of the Ex Out interface 226 of the block. FIG. 12 shows two blocks paired in this way, TID00 and TID01. The Ex Out interface 226 has buffers for holding the least significant word (LSW) and the most significant word (MSW). In this context each word is 32 bits. The least significant word is connected directly to one input of a width control multiplexer 1202. The output of the multiplexer is connected to the corresponding cross wires of the exchange fabric 34, the cross wires being the output wires for that particular block. If the transmit width is set to 32 bits, the width control multiplexers 1202 are set to receive the inputs from the respective LSWs of the paired blocks, thus allowing the blocks of the pair each to transmit a respective 32-bit word simultaneously.
If one member of the pair wishes to send a 64-bit word, the width control multiplexer 1202 of the neighboring block is set to receive the most significant word supplied by the sending block and pass it to the multiplexer output. This causes the most significant word of the 64-bit output of the sending block to be placed on the cross wires of the exchange bus associated with the neighboring block (which at this point is inhibited and cannot itself send anything). For the sake of clarity, the MUX control line from the width control flag in the store 1200 of the sending block TID00 is shown connected to the control input of the multiplexer 1202 of the (non-transmitting) neighboring block TID01. Similarly, the neighboring block TID01 likewise has a MUX control line connected from its control store 1200 to the input of the width control multiplexer 1202 of its paired block, although this is not shown in FIG. 12 for the sake of clarity. Reference will now be made to FIG. 13 to explain a double-width reception using paired blocks. The paired blocks in FIG. 13 are labeled TID03 and TID04, although it will readily be understood that this functionality can be used in combination with the double-width transmission functionality, such that a block like TID00 could also have the functionality shown for TID03, for example. Double-width reception is achieved by sharing a neighbor's incoming exchange resources for the duration of a transfer. When configured for double-width reception, each block in a pair of blocks can choose to sample or to ignore the incoming data. If both choose to sample, they will see the same incoming data. Double-width reception is enabled in collaboration with the neighboring block via the INCOMING_FORMAT value described above, which identifies whether the incoming data is 32-bit or 64-bit. The value of the incoming multiplexer 210 of the primary block of the pair of blocks must be set to the block ID of the transmitting block.
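The width control multiplexer behavior of FIG. 12 can be sketched as a pure function of one exchange cycle. The signal names and the split of a 64-bit value into LSW and MSW halves are illustrative assumptions.

```python
def width_control_mux(own_lsw: int, neighbor_msw: int,
                      neighbor_sends_64: bool) -> int:
    """One width control multiplexer 1202, as a function: when the
    neighbor sends 64 bits, this block's cross wires carry the
    neighbor's MSW (and this block is inhibited); otherwise they carry
    this block's own LSW.  Names are illustrative assumptions."""
    return neighbor_msw if neighbor_sends_64 else own_lsw

def pair_outputs(tid00_value: int, tid01_lsw: int, tid00_is_64: bool):
    """Values driven onto the cross wires of TID00 and TID01 in one
    cycle, assuming TID00 is the (possibly 64-bit) sender."""
    lsw = tid00_value & 0xFFFFFFFF
    msw = (tid00_value >> 32) & 0xFFFFFFFF
    return lsw, width_control_mux(tid01_lsw, msw, tid00_is_64)
```

In the 32-bit case both blocks drive their own words; in the 64-bit case TID01's wires are borrowed for TID00's most significant word, doubling TID00's transmit bandwidth for that SEND.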
The 'listening' input of the incoming multiplexer 210 of the secondary block of the pair of blocks must be set to the ID of the other block of the transmitting pair. Note that in this case, strictly speaking, that block of the transmitting pair (for example TID01) is not itself currently transmitting, but has lent its exchange resources to carry the most significant word supplied by block TID00. Thus, the incoming multiplexers 210 of the blocks of the receiving pair must be connected respectively to the cross wires on which the individual words of the double-width transmission output of the transmitting pair are placed. Note that in some embodiments, even if the incoming multiplexers 210 are switched so as to listen simultaneously to their respective cross wires of the exchange, this does not necessarily mean that the incoming values will be received at the blocks of the receiving pair simultaneously, due to the different travel latencies between the exchange and the individual blocks. There are therefore three possibilities to consider in a receiving pair of blocks. In the first possibility, the two incoming buses of the Exin interface are to be treated independently (neither block of the pair is participating in a double-width reception). In the second possibility, the local incoming exchange bus is being used to transfer the earlier component of a double-width item (and that component must now be delayed). This implies that the neighbor's bus is being used to transfer the later component of the same double-width item. In the third possibility, the local incoming exchange bus is being used to transfer the later component of a double-width item. This implies that the neighbor's bus was used to transfer the earlier component of the same double-width item (and therefore the earlier data component on the neighbor's bus will have had to be delayed). FIG.
13 shows circuits 1300 which handle these scenarios using multiplexers 1302 and 1304. Note that the circuits 1300 are duplicated on the input of each block of the receiving pair, but are shown only on the input of TID03 for the sake of clarity. The control of the multiplexers comes from the incoming format control, which is supplied by a register in the Exin interface 224. If block TID03 is to operate in 32-bit mode, it controls the multiplexer 1302 to pass a 32-bit word at the upper input of the multiplexer in FIG. 13 through a pipeline stage 1306 and a control buffer 1308. If the receiving blocks are operating as a pair, the multiplexer 1302 is controlled so as to block its upper input and to pass the least significant word from the lower input through the pipeline stage 1306. In the next cycle, the most significant word is selected to pass through the multiplexer 1304 into the control buffer 1308, along with the least significant word which has been clocked on through the pipeline stage 1306. The control buffer 1308 can decide whether or not to accept the 64-bit word. Note that, by the same logic, the 64-bit word will be received simultaneously at the neighboring block (TID04). In some circumstances both blocks may want to read the same 64-bit value, while in other circumstances one of the blocks may wish to ignore it. Note that there may be embodiments in which the LSW and the MSW of a 64-bit transfer can be received simultaneously at their paired receiving blocks, in which case the relative delay of the pipeline stage 1306 would not be required. We have described here a new computer paradigm which is particularly effective in the context of knowledge models for artificial intelligence. An architecture is provided which uses time determinism, as in the exchange phase of a BSP paradigm, to process very large amounts of data efficiently.
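The realignment performed by circuits 1300 can be sketched as follows. Holding the earlier-arriving half for one cycle (the role of pipeline stage 1306) lets both 32-bit halves reach the control buffer together; the word ordering chosen here is an assumption.

```python
def assemble_incoming(local_word: int, neighbor_word: int,
                      local_is_earlier: bool, format_64: bool) -> int:
    """Sketch of circuits 1300: in 32-bit mode the local word passes
    straight through; in 64-bit mode the earlier-arriving half is held
    one cycle (pipeline stage 1306) so both halves reach the control
    buffer 1308 together.  The LSW/MSW ordering is an assumption."""
    if not format_64:
        return local_word
    if local_is_earlier:
        lsw, msw = local_word, neighbor_word   # local half was delayed
    else:
        lsw, msw = neighbor_word, local_word   # neighbor half was delayed
    return (msw << 32) | lsw
```

The control buffer then accepts or ignores the assembled 64-bit value independently at each block of the receiving pair.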
Although particular embodiments have been described, other applications and variants of the disclosed techniques may become apparent to those skilled in the art once given the present disclosure. The scope of the present disclosure is not limited by the described embodiments but only by the appended claims.
Claims (22)
1. A computer comprising: a plurality of processing units, each comprising an instruction store holding a local program, an execution unit executing the local program, and a data store for holding data; an input interface having a set of input wires, and an output interface having a set of output wires; a switching fabric connected to each of the processing units by the respective set of output wires and connectable to each of the processing units by the respective input wires via switching circuitry controllable by each processing unit; and a synchronization module operable to generate a synchronization signal to control the computer to switch between a compute phase and an exchange phase, the processing units being arranged to execute their local programs according to a common clock, the local programs being such that in the exchange phase at least one processing unit executes a send instruction from its local program to transmit, at a transmission time, a data packet on its set of output connection wires, the data packet being destined for at least one recipient processing unit but having no destination identifier, and at a predetermined switching time the recipient processing unit executes a switch control instruction from its local program to control its switching circuitry to connect its set of input wires to the switching fabric to receive the data packet at a reception time, the transmission time, the switching time and the reception time being governed by the common clock with respect to the synchronization signal.
2. The computer of claim 1, wherein the send instruction explicitly defines a send address identifying a location in the data store from which the data packet is to be sent.
3. The computer of claim 1, wherein no send address is explicitly defined in the send instruction, and the data packet is transmitted from a send address held in a register implicitly defined by the send instruction.
4. The computer of claim 3, wherein the local program comprises a send address update instruction for updating the send address in the implicitly defined register.
5. The computer of any preceding claim, wherein the transmission time is a known number of clock cycles after the time at which the send instruction is executed.
6. The computer of any preceding claim, wherein the switching circuitry comprises a multiplexer having a set of output wires connected to its processing unit, and multiple sets of input wires connected to the switching fabric, whereby one of the multiple sets of input wires is selected as controlled by the processing unit.
7. The computer of any preceding claim, wherein the recipient processing unit is arranged to receive the data packet and to load it into the data store at a memory location identified by a memory pointer.
8. The computer of claim 7, wherein the memory pointer is automatically incremented after each data packet has been loaded into the data store.
9. The computer of claim 7, wherein the local program at the recipient processing unit comprises a memory pointer update instruction which updates the memory pointer.
10. The computer of claim 1, wherein the send instruction identifies a number of data packets to be sent, each data packet being associated with a different transmission time.
11. The computer of claim 6, wherein one of the sets of input wires is connected to a null input.
12.
The computer of any preceding claim, wherein the recipient processing unit is the same processing unit as the processing unit which executed a send instruction at an earlier time, whereby the same processing unit is arranged to send a data packet and to receive that data packet at a later time.
13. The computer of any preceding claim, wherein multiple processing units are arranged to execute respective send instructions to transmit respective data packets, and wherein at least some of the data packets are not destined for any recipient processing unit.
14. The computer of any preceding claim, wherein at least two of the processing units cooperate as a transmitting pair, wherein a first data packet is transmitted from a first processing unit of the pair via its set of output connection wires, and a second data packet is transmitted from the first processing unit of the pair via the set of output connection wires of the second processing unit of the pair, so as to carry out a double-width transmission.
15. The computer of any preceding claim, wherein at least two of the processing units operate as a receiving pair, wherein each processing unit of the pair controls its switching circuitry to connect its respective set of input wires to the switching fabric to receive respective data packets from respective blocks of a transmitting pair.
16.
A method of computing a function in a computer comprising: a plurality of processing units each comprising an instruction store containing a local program, an execution unit for executing the local program, a data store for holding data, an input interface provided with a set of input wires, and an output interface provided with a set of output wires; a switching fabric connected to each of the processing units by the respective sets of output wires and connectable to each of the processing units by their respective input wires via switching circuits controllable by each processing unit; and a synchronization module operable to generate a synchronization signal to control the computer to switch between a calculation phase and an exchange phase; the method comprising: the processing units executing their local programs in the calculation phase according to a common clock, and in the exchange phase at least one processing unit executing a sending instruction from its local program to transmit, at a transmission instant, a data packet on its set of output connection wires, the data packet being intended for at least one destination processing unit but not comprising a destination identifier, and at a predetermined switching instant the destination processing unit executing a switch control instruction from its local program to control its switching circuits to connect its set of input wires to the switching fabric to receive the data packet at a reception instant, the transmission instant, the switching instant and the reception instant being governed by the common clock with respect to the synchronization signal. [17" id="c-fr-0017] 17. The method of claim 16, wherein the function is provided in the form of a static graph comprising a plurality of interconnected nodes, each node being implemented by a codelet of the local programs. [18" id="c-fr-0018] 18.
The method of claim 17, wherein in the calculation phase each codelet processes data to produce a result, some of the results not being required for a subsequent calculation phase and not being received by any recipient processing unit. [19" id="c-fr-0019] 19. Method according to any one of claims 16 to 18, in which in the exchange phase the data packets are transmitted between processing units via the switching fabric and the switching circuits. [20" id="c-fr-0020] 20. Method according to any one of claims 16 to 19, in which each processing unit indicates to the synchronization module that its own calculation phase has been completed, and in which the synchronization signal is generated by the synchronization module when all the processing units have indicated that their own calculation phase has been completed, to start the exchange phase. [21" id="c-fr-0021] 21. The method of claim 17, wherein the graph represents a machine learning function. [22" id="c-fr-0022] 22. Method according to any one of claims 16 to 21, in which in the exchange phase data packets are transmitted through the switching fabric in a pipelined manner via a succession of temporary stores, each store holding a data packet for one cycle of the common clock.
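The transmitting-pair and receiving-pair arrangements of claims 14 and 15 can be illustrated with a short sketch. This is my own construction, not code from the patent; the 32-bit wire width and all function names are assumptions for illustration. One tile's double-width value leaves the pair over both tiles' output wire sets in the same cycle, and a receiving pair reassembles it from its two input wire sets.

```python
# Hedged illustration of claim 14 (transmitting pair) and claim 15
# (receiving pair). WIRE_WIDTH is an assumed per-tile wire-set width.

WIRE_WIDTH = 32
MASK = (1 << WIRE_WIDTH) - 1

def transmitting_pair_send(value):
    """Split one 2*WIRE_WIDTH-bit value across the pair's two wire sets:
    the low half on the first tile's own output wires, the high half on
    its partner's output wires (both halves originate from the first
    tile, per claim 14)."""
    return value & MASK, (value >> WIRE_WIDTH) & MASK

def receiving_pair_recv(low_half, high_half):
    """Each tile of the receiving pair captures one half from the
    fabric (claim 15); together they restore the full-width value."""
    return (high_half << WIRE_WIDTH) | low_half

low, high = transmitting_pair_send(0x1122334455667788)
assert receiving_pair_recv(low, high) == 0x1122334455667788
```

The point of the pairing is bandwidth, not addressing: the fabric still carries no destination identifier, and each half travels over an ordinary single-width wire set.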
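The method claims above can be summarized in a toy, cycle-accurate model. The sketch below is my own reconstruction, not the patent's implementation; every name (`Tile`, `run_exchange`, `FABRIC_DELAY`, the listen-cycle set) is hypothetical, and the constant fabric delay is an assumption. It shows the claimed mechanisms working together: a SEND with no destination identifier reading an implicit send-address register (claims 3-4, 16), a receive path loading packets at an auto-incrementing memory pointer (claims 7-8), a destination that connects its input multiplexer at a precomputed cycle so the packet is captured on arrival (claim 16), packets with no listener being dropped (claim 13), and in-flight packets held in one-cycle pipeline stages (claim 22).

```python
# Toy model of the time-deterministic BSP exchange phase (assumed
# names throughout; not the patent's code).

FABRIC_DELAY = 3  # assumed constant send-to-arrival delay, in clock cycles

class Tile:
    def __init__(self, memory):
        self.memory = list(memory)
        self.send_address = 0       # implicit register read by SEND
        self.mem_pointer = 0        # auto-incremented receive pointer
        self.listen_cycles = set()  # cycles at which the input mux connects

    def send(self):
        # SEND names neither a destination nor an address operand
        return self.memory[self.send_address]

    def receive(self, packet):
        self.memory[self.mem_pointer] = packet
        self.mem_pointer += 1       # auto-increment (claim 8)

def run_exchange(tiles, send_schedule, n_cycles):
    """send_schedule: {cycle: sending tile index}. A packet sent at
    cycle c arrives at any listening tile at cycle c + FABRIC_DELAY."""
    pipeline = []  # (arrival_cycle, packet): one-cycle temporary stores
    for cycle in range(n_cycles):
        if cycle in send_schedule:
            sender = tiles[send_schedule[cycle]]
            pipeline.append((cycle + FABRIC_DELAY, sender.send()))
        arrivals = [p for t, p in pipeline if t == cycle]
        for tile in tiles:
            # a packet is captured only by a tile whose mux is connected
            # on exactly the arrival cycle; otherwise it is dropped
            if cycle in tile.listen_cycles and arrivals:
                tile.receive(arrivals[0])
        pipeline = [(t, p) for t, p in pipeline if t > cycle]

# tile 0 sends the word at its send address on cycle 1; tile 1 was
# compiled to listen at cycle 1 + FABRIC_DELAY
t0, t1 = Tile([10, 20, 30, 40]), Tile([0, 0, 0, 0])
t0.send_address = 2
t1.listen_cycles = {1 + FABRIC_DELAY}
run_exchange([t0, t1], {1: 0}, n_cycles=8)
assert t1.memory[0] == 30 and t1.mem_pointer == 1
```

Because both the send cycle and the listen cycle are fixed at compile time relative to the synchronization signal, correctness depends only on the common clock and the known fabric delay, with no routing headers or handshakes at run time.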
Similar technologies:
Publication number | Publication date | Patent title
FR3072801A1|2019-04-26|SYNCHRONIZATION IN A MULTI-PAVEMENT PROCESS MATRIX
FR3072797A1|2019-04-26|SYNCHRONIZATION IN A MULTI-PAVING AND MULTI-CHIP TREATMENT ARRANGEMENT
EP2232368B1|2019-07-17|System comprising a plurality of processing units making it possible to execute tasks in parallel, by mixing the mode of execution of control type and the mode of execution of data flow type
FR3072798A1|2019-04-26|ORDERING OF TASKS IN A MULTI-CORRECTION PROCESSOR
FR3072800A1|2019-04-26|SYNCHRONIZATION IN A MULTI-PAVEMENT PROCESSING ARRANGEMENT
FR2578071A1|1986-08-29|MULTITRAITE INSTALLATION WITH SEVERAL PROCESSES
CA3021414C|2021-08-10|Instruction set
FR3072799A1|2019-04-26|COMBINING STATES OF MULTIPLE EXECUTIVE WIRES IN A MULTIPLE WIRE PROCESSOR
FR2875982A1|2006-03-31|SEMI-AUTOMATIC COMMUNICATION ARCHITECTURE NOC FOR "DATA FLOWS" APPLICATIONS
WO2007051935A1|2007-05-10|Method and system for conducting intensive multitask and multiflow calculation in real-time
JP6797878B2|2020-12-09|How to compile
US10963003B2|2021-03-30|Synchronization in a multi-tile processing array
EP1158405A1|2001-11-28|System and method for managing a multi-resource architecture
US20200183878A1|2020-06-11|Controlling timing in computer processing
FR3090924A1|2020-06-26|EXCHANGE OF DATA IN A COMPUTER
US11275661B1|2022-03-15|Test generation of a distributed system
US20200201794A1|2020-06-25|Scheduling messages
FR3047821A1|2017-08-18|METHOD AND DEVICE FOR MANAGING A CONTROL DEVICE
Petrović|2015|Efficient Communication and Synchronization on Manycore Processors
Patent family:
Publication number | Publication date
JP6722251B2|2020-07-15
DE102018126001A1|2019-04-25
CN109697185A|2019-04-30
KR102167059B1|2020-10-16
GB2569430B|2021-03-24
TWI708186B|2020-10-21
GB2569430A|2019-06-19
CA3021450C|2021-11-02
TW201928666A|2019-07-16
JP2019079529A|2019-05-23
US10936008B2|2021-03-02
GB201816892D0|2018-11-28
KR20190044574A|2019-04-30
US20190121387A1|2019-04-25
CA3021450A1|2019-04-20
GB201717295D0|2017-12-06
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US5434861A|1989-02-02|1995-07-18|Pritty; David|Deterministic timed bus access method
US5734826A|1991-03-29|1998-03-31|International Business Machines Corporation|Variable cyclic redundancy coding method and apparatus for use in a multistage network
US5408646A|1991-03-29|1995-04-18|International Business Machines Corp.|Multipath torus switching apparatus
KR100304063B1|1993-08-04|2001-11-22|Sun Microsystems, Inc.|2-point interconnection communication utility
US5541921A|1994-12-06|1996-07-30|National Semiconductor Corporation|Isochronous serial time division multiplexer
GB2303274B|1995-07-11|1999-09-08|Fujitsu Ltd|Switching apparatus
US6876652B1|2000-05-20|2005-04-05|Ciena Corporation|Network device with a distributed switch fabric timing system
US20040172631A1|2001-06-20|2004-09-02|Howard James E|Concurrent-multitasking processor
US20020165947A1|2000-09-25|2002-11-07|Crossbeam Systems, Inc.|Network application apparatus
US7100021B1|2001-10-16|2006-08-29|Cisco Technology, Inc.|Barrier synchronization mechanism for processors of a systolic array
JP2005032018A|2003-07-04|2005-02-03|Semiconductor Energy Lab Co Ltd|Microprocessor using genetic algorithm
JP2005167965A|2003-11-12|2005-06-23|Matsushita Electric Ind Co Ltd|Packet processing method and apparatus
US7904905B2|2003-11-14|2011-03-08|Stmicroelectronics, Inc.|System and method for efficiently executing single program multiple data programs
US7635987B1|2004-12-13|2009-12-22|Massachusetts Institute of Technology|Configuring circuitry in a parallel processing environment
US8018849B1|2005-03-25|2011-09-13|Tilera Corporation|Flow control in a parallel processing environment
US7818725B1|2005-04-28|2010-10-19|Massachusetts Institute of Technology|Mapping communication in a parallel processing environment
US7577820B1|2006-04-14|2009-08-18|Tilera Corporation|Managing data in a parallel processing environment
US8194690B1|2006-05-24|2012-06-05|Tilera Corporation|Packet processing in a parallel processing environment
JP5055942B2|2006-10-16|2012-10-24|Fujitsu Limited|Computer cluster
US8571021B2|2009-06-10|2013-10-29|Microchip Technology Incorporated|Packet based data transmission with reduced data size
GB2471067B|2009-06-12|2011-11-30|Graeme Roy Smith|Shared resource multi-thread array processor
GB201001621D0|2010-02-01|2010-03-17|Univ Louvain|A tile-based processor architecture model for high efficiency embedded homogenous multicore platforms
JP5568048B2|2011-04-04|2014-08-06|Hitachi, Ltd.|Parallel computer system and program
JP2013069189A|2011-09-26|2013-04-18|Hitachi Ltd|Parallel distributed processing method and parallel distributed processing system
US8990497B2|2012-07-02|2015-03-24|Grayskytech, LLC|Efficient memory management for parallel synchronous computing systems
US9116738B2|2012-11-13|2015-08-25|International Business Machines Corporation|Method and apparatus for efficient execution of concurrent processes on a multithreaded message passing system
US9733847B2|2014-06-02|2017-08-15|Micron Technology, Inc.|Systems and methods for transmitting packets in a scalable memory system protocol
US20160164943A1|2014-12-05|2016-06-09|Qualcomm Incorporated|Transport interface for multimedia and file transport
TWI580199B|2015-12-18|2017-04-21|Realtek Semiconductor Corp.|Receiving apparatus and packet processing method thereof
JP6450330B2|2016-02-03|2019-01-09|Nippon Telegraph and Telephone Corporation|Parallel computing device and parallel computing method
GB2580165B|2018-12-21|2021-02-24|Graphcore Ltd|Data exchange in a computer with predetermined delay
GB201904265D0|2019-03-27|2019-05-08|Graphcore Ltd|A partitionable networked computer
GB201904266D0|2019-03-27|2019-05-08|Graphcore Ltd|A networked computer with embedded rings
GB201904267D0|2019-03-27|2019-05-08|Graphcore Ltd|A networked computer with multiple embedded rings
GB201904263D0|2019-03-27|2019-05-08|Graphcore Ltd|A networked computer
JP2021051351A|2019-09-20|Fujitsu Limited|Information processing apparatus, information processing system and communication management program
CN113222126B|2020-01-21|2022-01-28|Shanghai SenseTime Intelligent Technology Co., Ltd.|Data processing device and artificial intelligence chip
KR20220003621A|2020-03-26|2022-01-10|Graphcore Ltd|Network computer with two built-in rings
Legal status:
2019-10-15|PLFP|Fee payment|Year of fee payment: 2
2020-10-29|PLFP|Fee payment|Year of fee payment: 3
2021-10-27|PLFP|Fee payment|Year of fee payment: 4
Priority:
Application number | Filing date | Patent title
GBGB1717295.8A|2017-10-20|2017-10-20|Synchronization in a multi-tile processing array
GB1717295.8|2017-10-20
GB1816892.2A|2018-10-17|Synchronization in a multi-tile processing array