SCHEDULING OF TASKS IN A MULTI-THREADED PROCESSOR
Patent abstract:
A processor comprising: an execution unit arranged to execute a respective thread in each time slot of a repeating sequence of different time slots; and a plurality of sets of context registers, each comprising a respective set of registers for representing a respective state of a respective thread. The sets of context registers comprise a respective set of worker context registers for each of the time slots that the processor is capable of interleaving, and at least one additional set of context registers. The sets of worker context registers represent the respective states of respective worker threads, and the additional set of context registers represents the state of a supervisor thread. The processor is arranged to begin executing the supervisor thread in each of the time slots, and to allow the supervisor thread then to relinquish each of the time slots in which it is executing, individually, to a respective one of the worker threads.
Publication number: FR3072798A1 Application number: FR1859640 Filing date: 2018-10-18 Publication date: 2019-04-26 Inventor: Simon Christian Knowles Applicant: Graphcore Ltd; IPC main class:
Patent description:
DESCRIPTION TITLE: SCHEDULING OF TASKS IN A MULTI-THREADED PROCESSOR
Technical Field
This description relates to the scheduling of tasks to be performed by different concurrent threads of execution in a multi-threaded processor.
BACKGROUND ART
A multi-threaded processor is a processor capable of executing multiple program threads side by side. The processor may comprise some hardware that is common to the multiple different threads (e.g. an instruction memory, a data memory and/or a common execution pipeline); but to support multi-threaded operation, the processor also comprises some dedicated hardware specific to each thread. The dedicated hardware comprises at least a respective bank of context registers for each of the number of threads that can be executed at once. A "context", when talking of multi-threaded processors, refers to the program state of a respective one of the threads being executed side by side (e.g. program counter value, status, and current operand values). The context register bank refers to the respective set of registers for representing this program state of the respective thread. Registers in a register bank are distinct from general-purpose memory in that register addresses are fixed as bits in instruction words, whereas memory addresses can be computed by executing instructions. The registers of a given context typically comprise a respective program counter for the respective thread, and a respective set of operand registers for temporarily holding the data acted upon and output by the respective thread during the computations it performs. Each context may also have a respective status register for storing a status of the respective thread (e.g. whether it is paused or running). Thus each of the currently running threads has its own separate program counter, and optionally operand registers and one or more status registers.
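Purely as an illustrative sketch (a software model with assumed register counts, not the patent's hardware), a per-thread context of the kind described above, with a program counter, operand registers and a status register, might be represented as:

```python
from dataclasses import dataclass, field
from enum import Enum

class ThreadStatus(Enum):
    RUNNING = "running"
    PAUSED = "paused"

@dataclass
class Context:
    """Software model of one context register bank: a program counter (PC),
    a set of operand registers (OP), and a status register (SR)."""
    pc: int = 0
    operands: list = field(default_factory=lambda: [0] * 8)  # 8 is an assumed count
    status: ThreadStatus = ThreadStatus.RUNNING

# Each concurrently executing thread gets its own separate context.
contexts = [Context() for _ in range(4)]
contexts[1].pc = 0x40
contexts[1].status = ThreadStatus.PAUSED
```

Each context is independent: updating one thread's program counter or status leaves the others untouched, which is the defining property of the dedicated per-thread hardware described above.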
One possible form of multi-threaded operation is parallelism. That is, as well as multiple contexts, multiple execution pipelines are provided: i.e. a separate execution pipeline for each stream of instructions to be executed in parallel. However, this requires a great deal of duplication in terms of hardware. Therefore, instead, another form of multi-threaded processor employs concurrency rather than parallelism, whereby the threads share a common execution pipeline (or at least a common part of a pipeline) and different threads are interleaved through this same shared execution pipeline. The performance of a multi-threaded processor may still be improved compared to no concurrency or parallelism, thanks to improved opportunities for hiding pipeline latency. Also, this approach does not require as much extra hardware dedicated to each thread as a fully parallel processor with multiple execution pipelines, and so does not incur so much extra silicon. A multi-threaded processor also requires some means of coordinating the execution of the different concurrent threads. For example, it needs to be determined which computation tasks are to be allocated to which threads. As another example, one or more first ones of the concurrent threads may contain a computation that depends on the result of a computation by one or more other ones of the concurrent threads. In this case a barrier synchronization must be performed to bring the threads in question to a common point of execution, so that the one or more first threads do not attempt the dependent computations before the one or more other threads have performed the computations upon which they depend. That is, barrier synchronization requires the other thread or threads to reach a specified point before the first thread or threads may proceed.
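A minimal software sketch of the barrier synchronization just described (illustrative only; in the patent this coordination is a supervisor responsibility, whereas here it is modelled with OS threads):

```python
import threading

results = [None, None]
barrier = threading.Barrier(3)  # two producer threads + one dependent consumer

def producer(i):
    results[i] = (i + 1) * 10   # computation another thread depends on
    barrier.wait()              # reach the common execution point

def consumer():
    barrier.wait()              # must not proceed until the producers are done
    results.append(results[0] + results[1])  # dependent computation

threads = [threading.Thread(target=producer, args=(i,)) for i in range(2)]
threads.append(threading.Thread(target=consumer))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The consumer's dependent computation is guaranteed to see both producers' results, because no thread passes the barrier until all three have reached it.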
Summary of the invention
One or more of such functions for coordinating the execution of concurrent threads could be implemented in dedicated hardware. However, this would increase the silicon footprint of the processor and would not be as flexible as a programmatic, software approach. On the other hand, a fully programmatic software approach would not be efficient in terms of code density. It would be desirable to find a more subtle approach to coordinating threads, striking a balance between these two approaches. According to one aspect disclosed herein, there is provided a processor comprising: an execution unit arranged to execute a respective thread in each time slot of a repeating sequence of different time slots, the sequence consisting of a plurality of time slots in which the execution logic is operable to interleave the execution of the respective threads; and a plurality of sets of context registers, each comprising a respective set of registers for representing a respective state of a respective thread, the sets of context registers comprising a respective set of worker context registers for each of the time slots the processor is capable of interleaving in said sequence, and at least one additional set of context registers, such that the number of sets of context registers is at least one greater than the number of time slots the processor is capable of interleaving; the sets of worker context registers being arranged to represent the respective states of respective worker threads which perform computation tasks, and the additional set of context registers being arranged to represent the state of a supervisor thread which schedules the execution of the tasks performed by the worker threads; wherein the processor is arranged to begin executing the supervisor thread in each of the time slots, and to allow the supervisor thread then to relinquish each of the time slots in which it is executing, individually, to a respective one of the worker threads.
In embodiments, the processor may be arranged to enable the supervisor thread to perform said relinquishment by executing one or more relinquish instructions in the time slot in which it is executing. In embodiments, said one or more relinquish instructions are a single relinquish instruction. In embodiments, the execution unit may be arranged to operate according to an instruction set defining the types of machine code instruction recognized by the processor, each machine code instruction being defined by a respective operation code; and at least one of said one or more relinquish instructions may be a dedicated instruction of the instruction set having an operation code which, when executed, triggers said relinquishment. In embodiments, it is implicit in the operation code of said at least one relinquish instruction that the time slot being relinquished is the time slot in which said at least one relinquish instruction is executed. In embodiments, said one or more instructions of the instruction set, comprising at least said one or more relinquish instructions, may be reserved for use by the supervisor thread and not executable by the worker threads. In embodiments, said one or more relinquish instructions may specify as an operand an address of the worker thread to which the relinquished time slot is being relinquished. In embodiments, the processor may be arranged to enable the worker thread to which one of the time slots has been relinquished to hand back the time slot in which it is executing to the supervisor thread by executing an exit instruction in the time slot in which it is executing. In embodiments, the execution unit may be arranged to operate according to an instruction set defining the types of machine code instruction recognized by the processor, each machine code instruction being defined by a respective operation code; and the exit instruction may be a dedicated instruction of the instruction set having an operation code which, when executed, effects the handing back of the relinquished time slot to the supervisor thread. In embodiments, it is implicit in the operation code of the exit instruction that the time slot being handed back is the time slot in which the exit instruction is executed. In embodiments, it is implicit in the operation code of the exit instruction that the thread to which the time slot is handed back is the supervisor thread. In embodiments, one or more instructions of the instruction set, comprising at least the exit instruction, may be reserved for use by the worker threads and not executable by the supervisor thread. In embodiments, the supervisor thread may perform barrier synchronization to synchronize the worker threads. In embodiments, the supervisor thread may communicate with an external resource on behalf of one or more of the worker threads. In embodiments, the relinquish instruction may additionally copy one or more modes from one or more status registers of the supervisor's set of context registers into one or more corresponding status registers of the worker thread launched by the relinquish instruction, thereby controlling the worker thread to adopt said one or more modes. In embodiments, the processor may be further arranged to execute an instruction which launches a set of more than one worker thread together, in respective ones of said slots, all executing the same code.
In embodiments, the instruction set the processor is configured to execute may further comprise a multi-run instruction which launches a plurality of worker threads together in respective ones of said slots, the plurality of worker threads being three or more; wherein one of the worker threads comprises code fetched from a first address specified by an operand of the multi-run instruction, and the other worker threads of the plurality comprise code fetched from respective addresses stepped relative to the first address by a stride value, the stride value being specified by another operand of the multi-run instruction. That is, each other worker thread of the plurality comprises code fetched from an address offset from the first address by a respective integer multiple of the stride value, the integer multiples forming the sequence of natural numbers (1, 2, 3, ...), i.e. the sequence of positive integers starting at 1 and spaced apart in increments of 1 (incrementing by one per time slot). In embodiments, the number of worker threads may be equal to the number of time slots. That is, the multi-run instruction launches a thread in each of the time slots, each from a different respective one of a set of strided addresses specified by the first-address and stride-value operands of the multi-run instruction.
According to another aspect disclosed herein, there is provided a method of operating a processor, the method comprising: using an execution unit to execute a respective thread in each time slot of a repeating sequence of different time slots, the sequence consisting of a plurality of time slots in which the execution logic is operable to interleave the execution of the respective threads; wherein the processor comprises a plurality of sets of context registers, each comprising a respective set of registers for representing a respective state of a respective thread, wherein the sets of context registers comprise a respective set of worker context registers for each of the plurality of time slots the processor is capable of interleaving in said sequence, and at least one additional set of context registers, such that the number of sets of context registers is at least one greater than the number of time slots the processor is capable of interleaving, the sets of worker context registers being used to represent the respective states of respective worker threads which perform computation, and the additional set of context registers being used to represent the state of a supervisor thread which schedules the execution of the tasks performed by the worker threads; and the method further comprises beginning execution of the supervisor thread in each of the time slots, and the supervisor thread then individually relinquishing each of the time slots in which it is executing to a respective one of the worker threads. According to another aspect disclosed herein, there is provided a computer program product comprising code embodied on computer-readable storage and arranged to run on the processor of any of the embodiments disclosed herein, wherein the code comprises the supervisor thread and the worker threads.
BRIEF DESCRIPTION OF THE DRAWINGS
To aid understanding of the present description and to show how it may be put into effect, reference will be made, by way of example, to the accompanying drawings in which: [0029] [Fig. 1] Figure 1 is a block diagram of a multi-threaded processor; [Fig. 2] Figure 2 is a block diagram of a plurality of thread contexts; [Fig. 3] Figure 3 illustrates a scheme of interleaved time slots; [Fig. 4] Figure 4 schematically illustrates a supervisor thread and a plurality of worker threads executing in a plurality of interleaved time slots; [Fig. 5] Figure 5 is a block diagram of a processor comprising an array of constituent processors; and [0034] [Fig. 6] Figure 6 schematically illustrates a graph used in a machine intelligence algorithm.
Detailed description of preferred embodiments
[0035] Figure 1 illustrates an example of a processor 4 in accordance with embodiments of the present description. For example, the processor 4 may be one tile of an array of like processor tiles on the same chip, or may be implemented on its own chip. The processor 4 comprises a multi-threaded processing unit 10 in the form of a barrel-threaded processing unit, and a local memory 11 (i.e. on the same tile in the case of a multi-tile array, or on the same chip in the case of a single-processor chip). A barrel-threaded processing unit is a type of multi-threaded processing unit in which the execution time of the pipeline is divided into a repeating sequence of interleaved time slots, each of which can be owned by a given thread. This will be described in more detail shortly. The memory 11 comprises an instruction memory 12 and a data memory 22 (which may be implemented in different addressable memory modules or in different regions of the same addressable memory module).
The instruction memory 12 stores machine code to be executed by the processing unit 10, while the data memory 22 stores both data to be operated on by the executed code and data output by the executed code (e.g. as a result of such operations). The memory 12 stores a variety of different threads of a program, each thread comprising a respective sequence of instructions for performing a certain task or tasks. Note that an instruction as referred to herein means a machine code instruction, i.e. an instance of one of the fundamental instructions of the processor's instruction set, consisting of a single operation code and zero or more operands. The program described herein comprises a plurality of worker threads, and a supervisor subprogram which may be structured as one or more supervisor threads. This will be described in more detail shortly. In embodiments, each of some or all of the worker threads takes the form of a respective "codelet". A codelet is a particular type of thread, sometimes also called an "atomic" thread. It has all the input information it needs to execute from the beginning of the thread (from the time of being launched), i.e. it does not take any input from any other part of the program or from memory after being launched. Further, no other part of the program will use any outputs (results) of the thread until it has terminated (finished). Unless it encounters an error, it is guaranteed to finish. Note that some literature also defines a codelet as being stateless, i.e. if run twice it could not inherit any information from its first run, but that additional definition is not adopted here. Note also that not all of the worker threads need be codelets, and in some embodiments some or all of the workers may instead be able to communicate with one another.
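Purely to illustrate the codelet property described above (all inputs supplied at launch, no further input taken, outputs consumed only after termination), a software analogue might look like this; the function name and computation are arbitrary examples, not from the patent:

```python
def codelet(inputs):
    """Atomic unit of work: depends only on the inputs passed at launch,
    reads nothing else after launch, and yields its outputs on termination.
    Barring an error, it is guaranteed to run to completion."""
    acc = 0
    for x in inputs:
        acc += x * x  # some self-contained computation
    return acc        # outputs are only consumed once the codelet has finished

# Launch with all input data up front; collect the result after termination.
result = codelet([1, 2, 3])
```

Because a codelet takes no input after launch and its output is not read until it terminates, it can safely run concurrently with any other thread without synchronization in between.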
Within the processing unit 10, multiple different ones of the threads from the instruction memory 12 can be interleaved through a single execution pipeline 13 (though typically only a subset of the total threads stored in the instruction memory can be interleaved at any given point in the overall program). The multi-threaded processing unit 10 comprises: a plurality of register banks 26, each arranged to represent the state (context) of a different respective one of the threads to be executed concurrently; a shared execution pipeline 13 that is common to the concurrently executed threads; and a scheduler 24 for scheduling the concurrent threads for execution through the shared pipeline in an interleaved manner, preferably in a round-robin manner. The processing unit 10 is connected to a shared instruction memory 12 common to the plurality of threads, and a shared data memory 22 that is again common to the plurality of threads. The execution pipeline 13 comprises a fetch stage 14, a decode stage 16, and an execute stage 18 comprising an execution unit which may perform arithmetic and logical operations, address calculations, load and store operations, and other operations, as defined by the instruction set architecture. Each of the context register banks 26 comprises a respective set of registers for representing the program state of a respective thread. An example of the registers making up each of the context register banks 26 is illustrated in Figure 2. Each of the context register banks 26 comprises one or more respective control registers 28, comprising at least a program counter (PC) for the respective thread (for keeping track of the instruction address at which the thread is currently executing), and in embodiments also a set of one or more status registers (SR) recording a current status of the respective thread (such as whether it is currently running or paused, e.g. because it has encountered an error).
Each of the context register banks 26 also comprises a respective set of operand registers (OP) 32, for temporarily holding operands of the instructions executed by the respective thread, i.e. values operated upon or resulting from operations defined by the operation codes of the respective thread's instructions when executed. Note that each of the context register banks 26 may optionally comprise one or more other types of respective register (not shown). Note also that whilst the term "register bank" is sometimes used to refer to a group of registers in a common address space, this does not necessarily have to be the case in the present description, and each of the hardware contexts 26 (each of the register sets 26 representing each context) may more generally comprise one or more such register banks. As will be described in more detail shortly, the arrangement described has one worker context register bank CX0...CX(M-1) for each of the M threads that can be executed concurrently (M = 3 in the example illustrated, but this is not limiting), and one additional supervisor context register bank CXS. The worker context register banks are reserved for storing the contexts of worker threads, and the supervisor context register bank is reserved for storing the context of a supervisor thread. Note that in embodiments the supervisor context is special, in that it comprises a different number of registers than the workers. Each of the worker contexts preferably has the same number of status registers and operand registers as one another. In embodiments, the supervisor context may have fewer operand registers than each of the workers.
Examples of operand registers the worker context may have that the supervisor does not include are: floating-point registers, accumulator registers, and/or dedicated weight registers (for holding neural network weights). In embodiments the supervisor may also have a different number of status registers. Further, in embodiments the instruction set architecture of the processor 4 may be arranged such that the worker threads and supervisor thread(s) execute some different types of instruction but also share some instruction types. The fetch stage 14 is connected so as to fetch instructions to be executed from the instruction memory 12, under control of the scheduler 24. The scheduler 24 is arranged to control the fetch stage 14 to fetch an instruction from each of a set of concurrently executing threads in turn in a repeating sequence of time slots, thus dividing the resources of the pipeline 13 into a plurality of temporally interleaved time slots, as will be described in more detail shortly. For example, the scheduling scheme could be round-robin or weighted round-robin. Another term for a processor operating in such a manner is a barrel-threaded processor. In some embodiments, the scheduler 24 may have access to one of the status registers SR of each thread indicating whether the thread is paused, so that the scheduler 24 in fact controls the fetch stage 14 to fetch the instructions of only those of the threads that are currently active.
In embodiments, preferably each time slot (and corresponding context register bank) is always owned by one thread or another, i.e. each slot is always occupied by some thread, and each slot is always included in the sequence of the scheduler 24; though the thread occupying any given slot may happen to be paused at the time, in which case when the sequence comes around to that slot, the instruction fetch for the respective thread is passed over. Alternatively it is not excluded, for example, that in less preferred alternative implementations some slots can be temporarily vacant and excluded from the scheduled sequence. Where reference is made to the number of time slots the processor is capable of interleaving, or the like, this means the maximum number of slots the processor is capable of executing concurrently, i.e. the number of concurrent slots the processor's hardware supports. The fetch stage 14 has access to the program counter (PC) of each of the contexts. For each respective thread, the fetch stage 14 fetches the next instruction of that thread from the next address in the program memory 12 as indicated by the program counter. The program counter increments each execution cycle unless branched by a branch instruction. The fetch stage 14 then passes the fetched instruction to the decode stage 16 to be decoded, and the decode stage 16 then passes an indication of the decoded instruction to the execution unit 18 along with the decoded addresses of any operand registers 32 specified in the instruction, in order for the instruction to be executed.
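The fetch behaviour just described can be sketched as a simplified software model (illustrative assumptions: one instruction fetch per slot per round, and a paused slot remains in the sequence but its fetch is passed over):

```python
def fetch_sequence(statuses, rounds):
    """Cycle round-robin over the slots for the given number of rounds.
    Every slot stays in the scheduler's sequence, but the instruction
    fetch for a paused thread is passed over."""
    fetched = []
    for _ in range(rounds):
        for slot, status in enumerate(statuses):
            if status == "running":   # fetch only from currently active threads
                fetched.append(slot)
    return fetched

# Four slots with the thread in slot 2 paused: its fetch is skipped each round.
order = fetch_sequence(["running", "running", "paused", "running"], rounds=2)
```

Note that slot 2 still occupies its place in every round; only the fetch is omitted, matching the description that the slot remains owned and scheduled while its occupant is paused.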
The execution unit 18 has access to the operand registers 32 and the control registers 28, which it may use in executing the instruction based on the decoded register addresses, such as in the case of an arithmetic instruction (e.g. by adding, multiplying, subtracting or dividing the values in two operand registers and outputting the result to another operand register of the respective thread). Or if the instruction defines a memory access (load or store), the load/store logic of the execution unit 18 loads a value from the data memory into an operand register of the respective thread, or stores a value from an operand register of the respective thread into the data memory 22, in accordance with the instruction. Or if the instruction defines a branch or a status change, the execution unit changes the value in the program counter PC or one of the status registers SR accordingly. Note that while one thread's instruction is being executed by the execution unit 18, an instruction from the thread in the next time slot in the interleaved sequence can be being decoded by the decode stage 16; and/or while one instruction is being decoded by the decode stage 16, the instruction from the thread in the next time slot after that can be being fetched by the fetch stage 14 (though in general the scope of the description is not limited to one instruction per time slot; e.g. in alternative scenarios a batch of two or more instructions could be issued from a given thread per time slot). Thus the interleaving advantageously hides latency in the pipeline 13, in accordance with known barrel-threading techniques. An example of the interleaving scheme implemented by the scheduler 24 is illustrated in Figure 3.
Here the concurrent threads are interleaved according to a round-robin scheme whereby, within each round of the scheme, the round is divided into a sequence of time slots S0, S1, S2..., each for executing a respective thread. Typically each slot is one processor cycle long and the different slots are evenly sized, though this is not necessary in all possible embodiments; e.g. a weighted round-robin scheme is also possible, whereby some threads get more cycles than others per execution round. In general the barrel threading may employ either an even round-robin or a weighted round-robin scheme, where in the latter case the weighting may be fixed or adaptive. Whatever the sequence per execution round, this pattern then repeats, each round comprising a respective instance of each of the time slots. Note therefore that a time slot as referred to herein means the repeating allocated place in the sequence, not a particular instance of the slot in a given repetition of the sequence. Put another way, the scheduler 24 apportions the execution cycles of the pipeline 13 into a plurality of temporally interleaved (time-division multiplexed) execution channels, each comprising a recurrence of a respective time slot in a repeating sequence of time slots. In the illustrated embodiment there are four time slots, but this is just for illustrative purposes and other numbers are possible; e.g. in one preferred embodiment there are in fact six time slots. Whatever the number of time slots the round-robin scheme is divided into, then according to the present description, the processing unit 10 comprises one more context register bank 26 than there are time slots, i.e. it supports one more context than the number of interleaved time slots it is capable of barrel-threading. This is illustrated by way of example in Figure 2: if there are four time slots S0...S3 as shown in Figure 3, then there are five context register banks, labelled here CX0, CX1, CX2, CX3 and CXS. That is, even though there are only four execution time slots S0...S3 in the barrel-threaded scheme and so only four threads can be executed concurrently, it is disclosed herein to add a fifth context register bank CXS, comprising a fifth program counter (PC), a fifth set of operand registers 32, and in embodiments also a fifth set of one or more status registers (SR). Note however that, as mentioned, in some embodiments the supervisor context may differ from the other contexts CX0...3, and the supervisor thread may support a different set of instructions for operating the execution pipeline 13. Each of the first four contexts CX0...CX3 is used to represent the state of a respective one of a plurality of "worker threads" currently assigned to one of the four execution time slots S0...S3, for performing whatever application-specific computation tasks are desired by the programmer (note again that these may be only a subset of the total number of worker threads of the program as stored in the instruction memory 12). The fifth context CXS, however, is reserved for a special function, to represent the state of a "supervisor thread" (SV) whose role it is to coordinate the execution of the worker threads, at least in the sense of assigning which of the worker threads W is to be executed in which of the time slots S0, S1, S2... and at what point in the overall program. Optionally the supervisor thread may have other "overseer" or coordinating responsibilities. For example, the supervisor thread may be responsible for performing barrier synchronizations to ensure a certain order of execution.
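The even and weighted round-robin schemes described above can be sketched as follows (an illustrative software model; the slot names and weights are assumed example values, and in the weighted case the weighting is shown as fixed):

```python
from itertools import chain

def round_robin(slots, weights=None, rounds=1):
    """Even round-robin gives each slot one cycle per round; weighted
    round-robin gives slot i weights[i] cycles per round. The same
    per-round pattern then repeats for each round."""
    weights = weights or [1] * len(slots)
    one_round = list(chain.from_iterable([s] * w for s, w in zip(slots, weights)))
    return one_round * rounds

even = round_robin(["S0", "S1", "S2", "S3"], rounds=2)
weighted = round_robin(["S0", "S1"], weights=[2, 1], rounds=2)
```

In both variants each "time slot" is the repeating place in the sequence rather than any single instance of it, which is the sense in which the term is used throughout this description.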
For example, in the case where one or more second threads depend on data to be output by one or more first threads run on the same processor module 4, the supervisor may perform a barrier synchronization to ensure that none of the second threads begins before the first threads have finished. In addition or instead, the supervisor may perform a barrier synchronization to ensure that one or more threads on the processor module 4 do not begin until a certain external source of data, such as another processor tile or chip, has completed the processing required to make that data available. The supervisor thread may also be used to perform other functionality relating to the multiple worker threads. For example, the supervisor thread may be responsible for communicating data externally to the processor module 4 (to receive external data to be acted upon by one or more of the threads, and/or to transmit data output by one or more of the worker threads). In general the supervisor thread may be used to provide any kind of overseeing or coordinating function desired by the programmer. For instance, in another example, the supervisor may oversee transfers between the tile's local memory 12 and one or more resources in the wider system (external to the array 6) such as a storage disk or a network card. Note of course that four slots is just an example, and generally in other embodiments there may be other numbers, such that if there is a maximum of M time slots 0...M-1 per round, the processor 4 comprises M+1 contexts CX0...CX(M-1) & CXS, i.e. one for each worker thread that can be interleaved at any given time and an extra context for the supervisor. E.g. in one exemplary implementation there are six time slots and seven contexts. Referring to Figure 4, in accordance with the teachings herein, the supervisor thread SV does not have its own time slot per se in the scheme of interleaved execution time slots.
Nor do the workers, since the allocation of slots to worker threads is flexibly defined. Rather, each time slot has its own dedicated context register bank (CX0...CX(M-1)) for storing a worker context, which is used by the worker when the slot is allocated to the worker, but not used when the slot is allocated to the supervisor. When a given slot is allocated to the supervisor, that slot instead uses the supervisor's context register bank CXS. Note that the supervisor always has access to its own context, and no worker is able to occupy the supervisor context register bank CXS. The supervisor thread SV has the ability to run in any and all of the time slots S0...S3 (or more generally S0...S(M-1)). The scheduler 24 is arranged so that, when the program as a whole starts, it begins by allocating the supervisor thread to all of the time slots, i.e. the supervisor SV starts out running in all of S0...S3. However, the supervisor thread is provided with a mechanism for, at some subsequent point (either straight away or after performing one or more supervisor tasks), temporarily relinquishing each of the slots in which it is running to a respective one of the worker threads, e.g. initially the workers W0...W3 in the example shown in Figure 4. This is achieved by the supervisor thread executing a relinquish instruction, called "RUN" by way of example herein. In embodiments this instruction takes two operands: an address of a worker thread in the instruction memory 12 and an address of some data for that worker thread in the data memory 22: RUN task_addr, data_addr [0053] The worker threads are portions of code that can be run concurrently with one another, each representing one or more respective computation tasks to be performed. The data address may specify some data to be acted upon by the worker thread.
Alternatively, the relinquish instruction may take a single operand specifying the address of the worker thread, and the address of the data could be included in the code of the worker thread; or in another example the single operand could point to a data structure specifying the addresses of the worker thread and the data. As mentioned, in embodiments at least some of the worker threads may take the form of codelets, i.e. atomic units of concurrently executable code. Alternatively or in addition, some of the worker threads need not be codelets and may instead be able to communicate with one another. The relinquish instruction ("RUN") acts on the scheduler 24 so as to relinquish the current time slot, in which this instruction is itself executed, to the worker thread specified by the operand. Note that it is implicit in the relinquish instruction that it is the time slot in which this instruction is executed that is being relinquished (implicit, in the context of machine code instructions, means that no operand is needed to specify this; it is understood implicitly from the operation code itself). Thus the time slot that is given away is the time slot in which the supervisor executes the relinquish instruction. To put it another way, the supervisor executes in the same space as the one it gives away. The supervisor says "run this piece of code at this location", and from that point on the recurring slot is (temporarily) owned by the worker thread in question. The supervisor thread SV performs a similar operation in each of one or more others of the time slots, in order to give away some or all of its time slots to different respective threads among the worker threads W0...W3 (selected from a larger set W0...Wj in the instruction memory 12). Once it has done so for the last slot, the supervisor is suspended (it will resume later, where it left off, when one of the slots is handed back by a worker thread W).
The supervisor thread SV is thus able to allocate different worker threads, each performing one or more tasks, to different slots among the interleaved execution time slots S0...S3. When the supervisor thread determines that it is time to run a worker thread, it uses the relinquish instruction RUN to allocate that worker thread to the time slot in which the RUN instruction was executed. In some embodiments, the instruction set also includes a variant of the run instruction, RUNALL ("run all"). This instruction is used to launch a set of several worker threads together, all executing the same code. In embodiments, this launches a worker thread in each of the processing unit's slots S0...S3 (or more generally S0...S(M-1)). Instead of, or in addition to, the RUNALL instruction, in some embodiments the instruction set may include a "multi-run" instruction, MULTIRUN. This instruction also launches multiple worker threads, each in a respective one of the time slots. In preferred embodiments, it launches a respective worker thread W in each and every one of the slots S0...S(M-1) (i.e. the total number of worker threads launched is equal to the number M of hardware worker contexts). However, the MULTIRUN instruction differs from the RUNALL instruction in that the multiple threads launched are not all formed of the same code taken from the same task address. Instead, the MULTIRUN instruction takes at least two operands: a first which is an explicit task address, and a stride value: MULTIRUN task_addr, stride. A first of the multiple threads launched is taken from the address task_addr specified by the address operand of the MULTIRUN instruction. Each of the other multiple threads launched is taken from an address equal to that of the first thread plus a respective incremental integer multiple of the stride value, the multiples being the sequence of positive integers starting at 1 and incrementing by 1 with each time slot.
In other words, the launched worker threads are taken in a stepped progression, at intervals equal to the stride value, relative to the first address. That is, a second thread is taken from an address equal to task_addr + stride, a third thread from an address equal to task_addr + 2*stride, and a fourth thread from an address equal to task_addr + 3*stride (and so on depending on the number of threads launched, a number which in embodiments is equal to the number of slots S). Execution of the MULTIRUN instruction triggers each of the M multiple worker threads to be launched into a respective one of the slots S0...S(M-1), each starting with a program counter set to the respective address value determined as specified above. In addition, in some embodiments, the RUN, RUNALL and/or MULTIRUN instructions, when executed, also automatically copy some state from one or more supervisor state registers (SR) of CXS into one or more corresponding state registers of the worker thread(s) launched by the RUN or RUNALL instruction. For example, the copied state may include one or more modes, such as a floating-point rounding mode (e.g. round to nearest or round to zero) and/or an overflow mode (e.g. saturate, or use a separate value representing infinity). The copied state or mode then controls the worker thread in question to operate according to the copied state or mode. In embodiments, the worker thread can later overwrite this in its own state register (but cannot change the supervisor's state). In other alternative or additional embodiments, the worker threads can choose to read some state from one or more supervisor state registers (and again can change their own state later). For example, here again this could consist in adopting a mode from the supervisor state register, such as a floating-point mode or a rounding mode. However, in embodiments, the supervisor cannot read any of the context registers CX0... of the worker threads.
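The MULTIRUN address computation described above is simple arithmetic, sketched here for illustration. The function name is an assumption; only the operand names `task_addr` and `stride` come from the text.

```python
def multirun_addresses(task_addr, stride, num_slots):
    """Start address for each of the M launched worker threads.

    Thread n (n = 0..M-1) starts at task_addr + n * stride, matching
    the stepped progression described for the MULTIRUN instruction.
    """
    return [task_addr + n * stride for n in range(num_slots)]

addrs = multirun_addresses(task_addr=0x1000, stride=0x40, num_slots=4)
print([hex(a) for a in addrs])   # ['0x1000', '0x1040', '0x1080', '0x10c0']
```

Each address then seeds the program counter of the worker context launched into the corresponding slot.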
Once launched, each of the currently allocated worker threads W0...W3 proceeds to perform the one or more computation tasks defined in the code specified by the respective relinquish instruction. At the end of this, the respective worker thread hands the time slot in which it is running back to the supervisor thread. This is achieved by executing an exit instruction ("EXIT"). In some embodiments it takes no operand: EXIT. Alternatively, in other embodiments, the EXIT instruction takes a single operand exit_state (e.g. a binary value), to be used for any purpose desired by the programmer to indicate a state of the respective codelet upon its termination (e.g. to indicate whether a certain condition was met or whether an error occurred): EXIT exit_state. Either way, the EXIT instruction acts on the scheduler 24 so that the time slot in which it is executed is returned to the supervisor thread. The supervisor thread can then perform one or more subsequent supervisor tasks (e.g. barrier synchronization and/or data exchange), and/or go on to execute another relinquish instruction to allocate a new worker thread (W4, etc.) to the slot in question. It will again be noted that, consequently, the total number of worker threads in the instruction memory 12 may be greater than the number of threads that the barrel-threaded processing unit 10 can interleave at any one time. It is the role of the supervisor thread SV to schedule which of the worker threads W0...Wj from the instruction memory 12, at which stage in the overall program, are to be assigned to which of the interleaved time slots S0...SM in the round-robin schedule of the scheduler 24. In embodiments, there is also another way in which a worker thread can return its time slot to the supervisor thread.
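The RUN/EXIT slot lifecycle described above can be sketched as a toy model under assumed names; the real mechanism is hardware in the scheduler 24, not software.

```python
SUPERVISOR = "SV"

class BarrelScheduler:
    """Toy model of slot ownership across a RUN/EXIT lifecycle."""
    def __init__(self, num_slots):
        self.slots = [SUPERVISOR] * num_slots   # supervisor owns everything at start

    def run(self, slot, task):
        assert self.slots[slot] == SUPERVISOR   # only the supervisor relinquishes
        self.slots[slot] = task

    def exit(self, slot):
        # EXIT is implicit about its slot: the slot in which it executes
        # is the one handed back to the supervisor.
        assert self.slots[slot] != SUPERVISOR
        self.slots[slot] = SUPERVISOR

sched = BarrelScheduler(4)
sched.run(2, "W0")       # supervisor gives S2 to worker W0
sched.exit(2)            # W0 terminates: S2 returns to the supervisor
sched.run(2, "W4")       # supervisor can now allocate a new worker to S2
assert sched.slots[2] == "W4"
```

The final two calls illustrate why the pool of worker threads W0...Wj in memory can exceed the number of slots: slots are reused as workers exit.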
That is, the execution unit 18 comprises an exception mechanism arranged so that, when a worker thread encounters an exception, it can automatically return its time slot to the supervisor. In this case, the individual exit state may be set to a default value or may be left unchanged. Furthermore, in embodiments, the processing unit 10 may be arranged so that one or more instructions of the instruction set are reserved for use by the supervisor thread and not by the worker threads, and/or so that one or more instructions of the instruction set are reserved for use by the worker threads and not by the supervisor thread. For example, this could be enforced in the execution stage 18, the decode stage 16 or the fetch stage 14, assuming that the relinquish (RUN) and exit (EXIT) instructions act on the stage concerned to inform it of which type of thread currently occupies the slot in question. In such cases, the supervisor-specific instructions include at least the relinquish instruction, but could also include other instructions such as one or more barrier synchronization instructions if the processing unit 10 contains dedicated logic for performing barrier synchronization. Likewise, the worker-specific instructions include at least the exit instruction, but may also include other instructions such as floating-point operations (which are liable to raise errors). The processor 4 described above can be used as a single stand-alone processor comprising a single instance of the processing unit 10 and of the memory 11. Alternatively, however, as illustrated in Figure 5, in certain embodiments the processor 4 can be one of several processors in an array 6, integrated on the same chip, or spread over several chips.
In this case, the processors 4 are connected to one another via a suitable interconnect 34 allowing them to exchange data with one another, including the results of one or more computations performed by one, some or all of the different worker threads across the extent of the array. For example, the processor 4 may be one of multiple tiles in a wider, multi-tile processor implemented on a single chip, each tile comprising its own respective instance of the barrel-threaded processing unit 10 and the associated memory 11, each arranged as described above in relation to Figures 1 to 4. For completeness, it will also be noted that an "array" as referred to here does not necessarily imply any particular number of dimensions or physical arrangement of the tiles or processors 4. In some such embodiments, the supervisor may be responsible for performing exchanges between tiles. In some embodiments, the EXIT instruction is given an additional special function, namely to cause an exit state specified in the operand of the EXIT instruction to be automatically aggregated (by dedicated hardware logic) with the exit states of a plurality of other worker threads running through the same pipeline 13, each of these worker threads having a respective exit state specified as the operand of its own instance of the EXIT instruction. This may consist in aggregating the specified exit state with the exit states of all the worker threads run by the same processor module 4 (i.e. through the same pipeline 13 of a given processing unit 10), or at least all those in a specified phase. In some embodiments, further instructions can be executed to aggregate with the exit states of worker threads run on one or more other processors in an array 6 (which may be other tiles on the same chip or even on other chips).
Either way, the processor 4 comprises at least one register 38 specifically arranged to store the locally aggregated exit state of the processor 4. In some embodiments, this is one of the supervisor's state registers in the supervisor context register bank CXS. As each EXIT instruction is executed by the respective thread, the dedicated aggregation logic causes the exit state specified in the operand of the EXIT instruction to contribute to the aggregated exit state stored in the exit state register 38. At any time, for example once all the worker threads of interest have terminated by means of a respective exit instruction, the supervisor thread can then access the exit state in the exit state register 38. This may include accessing its own state register SR. The aggregation logic is implemented in dedicated hardware circuitry in the execution unit 18. Thus an additional implicit facility is included in the instruction for terminating a worker thread. Dedicated circuitry or hardware means circuitry having a hard-wired function, as opposed to being programmed in software using general-purpose code. The update of the locally aggregated exit state (in the register 38) is triggered by the execution of the operation code of the special EXIT exit instruction, this being one of the fundamental machine code instructions in the instruction set of the processor 4, having the inherent functionality of aggregating the exit states. Also, the locally aggregated exit state is stored in a register 38, meaning a dedicated storage element (in some embodiments a single bit of storage) whose value can be accessed by the code executing in the pipeline. Preferably, the exit state register 38 forms one of the supervisor's state registers.
In one example, the exit states of the individual threads and the aggregated exit state may each take the form of a single bit, i.e. 0 or 1, and the aggregation logic may be arranged to take a logical AND of the individual worker exit states. This means that any input of 0 results in an aggregate of 0, but if all inputs are 1 then the aggregate is 1. That is, if a 1 is used to represent a true or successful outcome, it means that if any of the local exit states of any of the worker threads is false or unsuccessful, the aggregated exit state will also be false, or will represent an unsuccessful outcome. For example, this could be used to determine whether or not the worker threads have all satisfied a termination condition. Thus the supervisor can query a single register (a single bit, in embodiments) to ask "did anything go wrong, yes or no?", rather than having to examine the individual states of the individual worker threads on each individual tile. In fact, in some embodiments, the supervisor is not able to query a worker thread at an arbitrary point and does not have access to the state of the worker threads, so that the exit state register 38 is the only means of determining the outcome of a worker thread. The supervisor does not know which context register bank corresponds to which worker thread, and after the worker thread has exited by EXIT, the worker thread's state disappears. The only other way for the supervisor to determine an output of a worker thread would be for the worker thread to leave a message in the general-purpose data memory 22. An equivalent to the above logic would be to replace the AND with an OR gate and to invert the interpretation of the exit states 0 and 1 in software, i.e. 0 means true and 1 means false.
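The single-bit AND aggregation described above can be modelled in a few lines. The function name is an assumption; in the processor this is dedicated hardware, not code.

```python
def aggregate_and(exit_states):
    """Fold single-bit worker exit states into one aggregate bit.

    Models the dedicated AND logic feeding the exit state register 38:
    the register starts at 1 (its assumed reset value) and any worker
    exiting with 0 forces the aggregate to 0.
    """
    agg = 1
    for s in exit_states:
        agg &= s
    return agg

assert aggregate_and([1, 1, 1, 1]) == 1   # all workers succeeded
assert aggregate_and([1, 0, 1, 1]) == 0   # one failure poisons the aggregate
```

The supervisor's "did anything go wrong?" query then reduces to reading this one bit.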
Equivalently, if the AND gate is replaced by an OR gate but the interpretation of the exit states is not inverted, and neither is the reset value, then the aggregated state in $LC will record whether any (rather than all) of the worker threads exited with state 1. In other embodiments, the exit states need not be single bits. For example, the exit state of each individual worker thread may be a single bit, but the aggregated exit state may comprise two bits representing a trinary state: all worker threads exited with state 1, all worker threads exited with state 0, or the worker threads' exit states were mixed. As an example of the logic for implementing this, one of the two bits encoding the trinary value may be a Boolean AND of the individual exit states, and the other bit of the trinary value may be a Boolean OR of the individual exit states. The third encoded case, indicating that the worker threads' exit states were mixed, can then be formed as the EXCLUSIVE OR of these two bits. The exit states can be used to represent whatever the programmer wishes, but one particularly envisaged example consists in using an exit state of 1 to indicate that the respective worker thread exited with a "successful" or "true" state, while an exit state of 0 indicates that the respective worker thread exited with an "unsuccessful" or "false" state (or vice versa if the aggregation circuitry performs an OR instead of an AND and the register $LC 38 is initially reset to 0). For example, consider an application in which each worker thread performs a computation having an associated condition, such as a condition indicating whether the error or errors in the one or more parameters of a respective node in the graph of an artificial intelligence algorithm are at an acceptable level according to a predetermined metric.
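The two-bit trinary encoding described above reduces to an AND bit, an OR bit, and an XOR of the two. A minimal sketch (the function name and the string return values are assumptions for readability):

```python
def trinary_aggregate(exit_states):
    """Encode 'all 1', 'all 0' or 'mixed' from single-bit exit states."""
    all_bit = 1   # Boolean AND of the individual exit states
    any_bit = 0   # Boolean OR of the individual exit states
    for s in exit_states:
        all_bit &= s
        any_bit |= s
    if all_bit ^ any_bit:         # AND=0 but OR=1: states were mixed
        return "mixed"
    return "all 1" if all_bit else "all 0"

assert trinary_aggregate([1, 1, 1]) == "all 1"
assert trinary_aggregate([0, 0, 0]) == "all 0"
assert trinary_aggregate([1, 0, 1]) == "mixed"
```

Note that the two stored bits are exactly (all_bit, any_bit); the XOR is derived on demand rather than stored.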
In this case, an individual exit state of a given logic level (e.g. 1) may be used to indicate that the condition is satisfied (e.g. that the error or errors in the node's one or more parameters are at an acceptable level according to some metric); while an individual exit state of the opposite logic level (e.g. 0) may be used to indicate that the condition was not satisfied (e.g. the error or errors are not at an acceptable level according to the metric in question). The condition may for example be an error threshold placed on a single parameter or on each parameter, or could be a more complex function of a plurality of parameters associated with the respective computation performed by the worker thread. In another, more complex example, the individual exit states of the worker threads and the aggregated exit state may each comprise two or more bits, which can be used, for example, to represent a degree of confidence in the worker threads' results. For instance, the exit state of each individual worker thread may represent a probabilistic measure of confidence in a result of the respective worker thread, and the aggregation logic may be replaced by more complex circuitry for performing a probabilistic aggregation of the individual confidence levels in hardware. Whatever meaning the programmer gives to the exit states, the supervisor thread SV can then obtain the aggregated value from the exit state register 38 to determine the aggregated exit state of all the worker threads that have exited since it was last reset, for example at the last synchronization point, e.g. to determine whether or not all the worker threads exited in a successful or true state. Depending on this aggregated value, the supervisor thread can then make a decision in accordance with the programmer's design choice.
The programmer can choose to make whatever use he or she wishes of the locally aggregated exit state, for example to determine whether to raise an exception, or to make a branch decision depending on the aggregated exit state. For example, the supervisor thread can consult the locally aggregated exit state in order to determine whether a certain portion of the program, made up of a plurality of worker threads, has ended as expected or desired. If it has not (e.g. at least one of the worker threads exited in an unsuccessful or false state), it can report to a host processor, or can perform another iteration of the part of the program comprising the same worker threads; but if it has (e.g. all the worker threads exited with a successful or true state), it can instead branch to another part of the program comprising one or more new worker threads. Preferably, the supervisor thread should not access the value in the exit state register 38 until all the worker threads in question have exited, so that the value stored therein represents the correct, up-to-date aggregated state of all the desired threads. This wait may be enforced by a barrier synchronization performed by the supervisor thread to wait for all the concurrently running local worker threads (i.e. those on the same processor module 4, running through the same pipeline 13) to have exited. That is, the supervisor thread resets the exit state register 38, launches a plurality of worker threads, and then initiates a barrier synchronization in order to wait for all the outstanding worker threads to have exited before the supervisor is allowed to obtain the aggregated exit state from the state register 38. Figure 6 illustrates an example application of the processor architecture described here, namely an application to artificial intelligence.
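The reset, launch, barrier, read sequence described above can be sketched as a sequential simulation. This is purely illustrative: in the processor the workers run interleaved in hardware, the barrier is a dedicated mechanism, and all names here are assumptions.

```python
def supervisor_round(worker_fns):
    """One supervisor cycle: reset register, launch workers, barrier, branch."""
    agg = 1                                # 1. reset the aggregate register
    exit_states = [fn() for fn in worker_fns]   # 2. "launch" workers (run to completion)
    for s in exit_states:                  # 3. barrier passed: every worker has
        agg &= s                           #    contributed its exit state by AND
    # 4. Only now may the supervisor read the aggregate and branch on it.
    return "next phase" if agg else "retry phase"

assert supervisor_round([lambda: 1, lambda: 1]) == "next phase"
assert supervisor_round([lambda: 1, lambda: 0]) == "retry phase"
```

The key ordering constraint being modelled is step 3 before step 4: the aggregate is meaningless until every worker of interest has exited.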
As will be familiar to a person skilled in the art of artificial intelligence, artificial intelligence begins with a learning stage in which the artificial intelligence algorithm learns a knowledge model. The model comprises a graph of interconnected nodes (i.e. vertices) 102 and edges (i.e. links) 104. Each node 102 in the graph has one or more input edges and one or more output edges. Some of the input edges of some of the nodes 102 are the output edges of some of the other nodes, thereby connecting the nodes together to form the graph. Further, one or more of the input edges of one or more of the nodes 102 form the inputs of the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Sometimes a given node may even have all of these: inputs to the graph, outputs from the graph, and connections to other nodes. Each edge 104 communicates a value, or more often a tensor (n-dimensional matrix), these forming the inputs and outputs provided to and taken from the nodes 102 on their input and output edges respectively. Each node 102 represents a function of its one or more inputs received on its input edge(s), the result of this function being the output(s) provided on the output edge(s). Each function is parameterized by one or more respective parameters (sometimes called weights, although they need not necessarily be multiplicative weights). In general, the functions represented by the different nodes 102 may take different forms of function and/or may be parameterized by different parameters. Further, each of the one or more parameters of each node's function is characterized by a respective error value. Moreover, a respective condition may be associated with the error or errors in the parameter or parameters of each node 102.
For a node 102 representing a function parameterized by a single parameter, the condition may be a simple threshold, i.e. the condition is satisfied if the error is within the specified threshold but is not satisfied if the error is beyond the threshold. For a node 102 parameterized by more than one respective parameter, the condition for that node 102 to have reached an acceptable level of error may be more complex. For example, the condition may be satisfied only if each of the parameters of that node 102 remains below the respective threshold. In another example, a combined metric may be defined combining the errors in the different parameters for the same node 102, and the condition may be satisfied if the value of the combined metric remains below a specified threshold, but otherwise the condition is not satisfied if the value of the combined metric is above the threshold (or vice versa depending on the definition of the metric). Whatever the condition, this gives a measure of whether the error in the node's parameter or parameters remains below a certain level or degree of acceptability. In general, any suitable metric may be used. The condition or the metric may be the same for all the nodes, or may differ between certain respective different nodes. In the learning stage, the algorithm receives experience data, i.e. multiple data points representing different possible combinations of inputs to the graph. As more and more experience data is received, the algorithm gradually adjusts the parameters of the various nodes 102 of the graph on the basis of the experience data so as to try to minimize the errors in the parameters. The goal is to find values of the parameters such that the output of the graph is as close as possible to a desired output for a given input. As the graph as a whole tends towards such a state, the graph is said to converge.
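The per-node convergence conditions just described can be illustrated as follows. The root-mean-square combined metric is an assumed example, since the text deliberately leaves the choice of combined metric open; the function names are likewise illustrative.

```python
import math

def single_param_converged(error, threshold):
    """Simple threshold condition for a node with one parameter."""
    return abs(error) <= threshold

def combined_metric_converged(errors, threshold):
    """Combined-metric condition for a multi-parameter node.

    Here the assumed metric is the root-mean-square of the per-parameter
    errors; the condition holds when that metric stays within threshold.
    """
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rms <= threshold

assert single_param_converged(0.02, 0.05)
assert not combined_metric_converged([0.5, 0.9], 0.1)
```

A worker thread evaluating such a condition would then supply the Boolean result as the operand of its EXIT instruction.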
After a suitable degree of convergence, the graph can be used to perform predictions or inferences, i.e. to predict an output for some given input, or to infer a cause for some given output. The learning stage can take a number of different possible forms. For example, in a supervised approach, the input experience data takes the form of training data, i.e. inputs which correspond to known outputs. With each data point, the algorithm can adjust the parameters so that the output more closely matches the known output for the given input. In the subsequent prediction stage, the graph can then be used to map an input query to an approximate predicted output (or vice versa if an inference is being made). Other approaches are also possible. For example, in an unsupervised approach, there is no concept of a reference result per input data point, and instead the artificial intelligence algorithm is left to identify its own structure in the output data. Or, in a reinforcement approach, the algorithm tries out at least one possible output for each data point in the input experience data, and is told whether that output is positive or negative (and potentially the degree to which it is positive or negative), e.g. win or lose, reward or punishment, or the like. Over many trials, the algorithm can gradually adjust the parameters of the graph so as to be able to predict inputs which will lead to a positive output. The various approaches and algorithms for learning a graph are known to a person skilled in the art of artificial intelligence. According to an example application of the techniques described here, each worker thread is programmed to perform the computations associated with a respective individual one of the nodes 102 in an artificial intelligence graph. In this case, at least some of the edges 104 between the nodes 102 correspond to exchanges of data between threads, and some may involve exchanges between tiles.
Furthermore, the individual exit states of the worker threads are used by the programmer to represent whether or not the respective node 102 has satisfied its respective condition for convergence of that node's parameter or parameters, i.e. whether the error in the parameter or parameters remains within the acceptable level or region in error space. For example, this is one example use of the embodiments in which each of the individual exit states is an individual bit and the aggregated exit state is an AND of the individual exit states (or equivalently an OR if 0 is taken as positive); or in which the aggregated exit state is a trinary value representing whether the individual exit states were all true, all false or mixed. Thus, by examining a single register value in the exit state register 38, the program can determine whether the whole graph, or at least a sub-region of the graph, has converged to an acceptable degree. In another variant of this, embodiments can be used in which the aggregation takes the form of a statistical aggregation of individual confidence values. In this case, each individual exit state represents a confidence level (e.g. a percentage) that the parameters of the node represented by the respective thread have reached an acceptable degree of error. The aggregated exit state can then be used to determine an overall confidence level as to whether the graph, or a sub-region of the graph, has converged to an acceptable degree. In the case of a multi-tile arrangement 6, each tile runs a subgraph of the graph. Each subgraph comprises a supervisor subprogram comprising one or more supervisor threads, and a set of worker threads in which some or all of the worker threads may take the form of codelets.
In such applications, or indeed in any graph-based application where each worker thread is used to represent a respective node in a graph, the "codelet" comprised by each worker thread may be defined as a software procedure acting on the persistent state and the inputs and/or outputs of one vertex, wherein the codelet:
• is launched on a worker thread register context, to run in a barrel slot, by the supervisor thread executing a "run" instruction;
• runs to completion without communication with other codelets or the supervisor (except for the return to the supervisor when the codelet exits);
• has access to the persistent state of a vertex via a memory pointer provided by the "run" instruction, and to a non-persistent working area in memory which is private to that barrel slot; and
• executes "EXIT" as its last instruction, whereupon the barrel slot it was using is returned to the supervisor, and the exit state specified by the exit instruction is aggregated with the local exit state of the tile, which is visible to the supervisor.
Updating a graph (or a subgraph) means updating each constituent vertex once, in any order consistent with the causality defined by the edges. Updating a vertex means running a codelet on the vertex state. A codelet is an update procedure for vertices; one codelet is usually associated with many vertices. The supervisor executes one RUN instruction per vertex, each such instruction specifying a vertex state address and a codelet address. Note that the above embodiments have been described only by way of example. For instance, the applicability of the present description is not limited to the particular processor architecture described in relation to Figures 2 and 3, and in general the concepts described here could be applied to any processor architecture having a plurality of execution time slots, by providing at least one more context than the number of time slots.
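The "one RUN per vertex" update loop just described can be sketched as a sequential simulation. The names and the dispatch-by-kind scheme are assumptions; in the processor each RUN pairs a codelet address with a vertex state address, and many vertices share one codelet.

```python
def update_graph(vertices, codelet_for):
    """Update each vertex once: one (simulated) RUN per vertex.

    vertices    -- list of dicts with the vertex 'kind' and its 'state'
                   (standing in for a vertex state address)
    codelet_for -- maps a vertex kind to its shared codelet (standing in
                   for a codelet address)
    """
    results = []
    for v in vertices:
        codelet = codelet_for[v["kind"]]      # one codelet, many vertices
        results.append(codelet(v["state"]))   # RUN codelet_addr, vertex_state_addr
    return results

double = lambda state: state * 2
out = update_graph(
    [{"kind": "k", "state": 3}, {"kind": "k", "state": 5}],
    {"k": double},
)
assert out == [6, 10]
```

Any iteration order consistent with edge causality would do; the simple list order here stands in for one such valid order.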
It will also be noted that it is not excluded that still further contexts, beyond the number of time slots plus one, may be included for other purposes. For example, some processors include a debug context which never represents an actually running thread, but is used by a thread when it encounters an error, in order to store the program state of the erroring thread for later analysis by the program developer for debugging purposes. Furthermore, the role of the supervisor thread is not limited to barrier synchronization and/or the exchange of data between threads, and in other embodiments it could instead or additionally be responsible for other functionality involving visibility of two or more of the worker threads. For example, in embodiments where the program comprises multiple iterations of a graph, the supervisor thread could be responsible for determining the number of iterations of the graph to be performed, which could depend on a result of a previous iteration. Other variants or other applications of the techniques described may become apparent to a person skilled in the art given the description provided here. The scope of the present description is not limited by the example embodiments described here, but only by the appended claims.
CLAIMS (20)

1. Processor comprising: an execution unit arranged to execute a respective thread in each time slot of a repeating sequence of different time slots, the sequence consisting of a plurality of time slots in which the execution logic is operable to interleave the execution of the respective threads; and a plurality of sets of context registers, each comprising a respective set of registers for representing a respective state of a respective thread, the sets of context registers comprising a respective set of worker thread context registers for each of the time slots that the execution unit is capable of interleaving in said sequence, and at least one additional set of context registers, such that the number of sets of context registers is at least one greater than the number of time slots that the execution unit is capable of interleaving, the sets of worker context registers being arranged to represent the respective states of respective worker threads which perform computation tasks, and the additional set of context registers being arranged to represent the state of a supervisor thread which schedules the execution of the tasks performed by the worker threads; wherein the processor is arranged to begin by executing the supervisor thread in each of the time slots, and to allow the supervisor thread then to individually relinquish each of the time slots in which it is running to a respective one of the worker threads.

2. Processor according to claim 1, the processor being arranged to allow the supervisor thread to perform said relinquishing by executing one or more relinquish instructions in the time slot in which it is running.

3. Processor according to claim 2, wherein said one or more relinquish instructions consist of a single relinquish instruction.

4.
Processor according to claim 2 or 3, in which the execution unit is arranged to operate according to a set of instructions defining types of machine code instructions recognized by the processor, each machine code instruction being defined by a respective operation code; and wherein at least one of said one or more abandonment instructions is a dedicated instruction from the instruction set having an operation code which when executed triggers said abandonment. [5" id="c-fr-0005] 5. Processor according to claim 4, in which it is implicit in the operation code of said at least one abandonment instruction that the time slot which is abandoned is the time slot in which said at least one abandonment instruction is executed. [6" id="c-fr-0006] 6. Processor according to any one of claims 4 or 5, in which one or more instructions of the instruction set comprising at least said one or more abandon instructions are reserved for use by the supervisor wire and are not executable by the working threads. [7" id="c-fr-0007] 7. Processor according to any one of the preceding claims, in which said one or more abandonment instructions specify as operand an address of the work thread to which the abandoned time slot is abandoned. [8" id="c-fr-0008] 8. Processor according to any one of the preceding claims, the processor being arranged to authorize the working thread, to which one of the time slots has been abandoned, to return the time slot in which it is executed to the supervisor thread by executing an exit instruction in the time slot in which it is executed. [9" id="c-fr-0009] 9. The processor as claimed in claim 8, in which the execution unit is arranged to operate according to a set of instructions defining types of machine code instructions recognized by the processor, each machine code instruction being defined by a code. 
respective operation; and in which the output instruction is a dedicated instruction from the instruction set having an operation code which when executed performs said return of the time slot abandoned to the supervisor wire. [10" id="c-fr-0010] 10. The processor as claimed in claim 9, in which it is implicit in the operation code of the output instruction that the time slot which is returned is the time slot in which the output instruction is executed. [11" id="c-fr-0011] 11. The processor as claimed in claim 9 or 10, in which it is implicit in the operation code of the output instruction that the thread to which the returned time slot is returned is the supervisor thread. [12" id="c-fr-0012] 12. Processor according to any one of claims 8 to 11, in which one or more instructions of the instruction set comprising at least the output instruction are reserved for use by the working threads and cannot be executed by the supervisor wire. [13" id="c-fr-0013] 13. Processor according to any one of the preceding claims, in which the supervisor wire is arranged to perform barrier synchronization to synchronize the working wires. [14" id="c-fr-0014] 14. Processor according to any one of the preceding claims, in which the supervisor wire is arranged to carry out communication with an external resource on the part of one or more of the working wires. [15" id="c-fr-0015] 15. The processor of claim 2 or any of the dependent claims, wherein the abort instruction further copies one or more modes from one or more state registers of the set of context registers. supervisor in one or more corresponding state registers of the working thread launched by the abandon instruction, thereby controlling the working thread so that it adopts said one or more modes. [16" id="c-fr-0016] 16. 
Processor according to any one of the preceding claims, the processor being further arranged to execute an instruction which launches a set of more than a single work thread jointly in certain respective slots of said slots, all the threads executing the same code . [17" id="c-fr-0017] 17. The processor of claim 4 or any of the dependent claims, wherein the set of instructions that the processor is intended to execute further comprises a multi-execution instruction which launches a plurality of work threads together in respective slots of said slots, the plurality of working wires being three or more; wherein one of the working threads includes code extracted from a first address specified by an operand of the multi-execution instruction, and wherein the other threads of the plurality of working threads include code extracted from respective addresses spaced by a step of progression of a step value with respect to the first address, the step value being specified by another operand of the multi-execution instruction. [18" id="c-fr-0018] 18. The processor of claim 17, wherein the number of work threads is equal to the number of time slots. [19" id="c-fr-0019] 19. 
A method of operating a processor, the method comprising: using an execution unit to execute a respective thread in each slot of a repeating sequence of different time slots, the sequence consisting of a plurality of time slots wherein the execution logic is operable to interleave the execution of the respective threads; wherein the processor comprises a plurality of sets of context registers, each comprising a respective set of registers for representing a respective state of a respective thread, wherein the sets of context registers comprise a set of working context registers respective for each of the plurality of time slots that the thread is capable of interleaving in said sequence and at least one additional set of context registers, so that the number of sets of context registers is greater at least one relative to the number of time slots that the thread is capable of interleaving, the sets of work context registers being used to represent the respective states of respective work threads which perform computation, and the set of additional context registers being used to represent the state of a supervisor thread that i plan the execution of the tasks performed by the work threads; and the method further comprises starting the execution of the supervisor thread in each of the time slots, and the supervisor thread then individually abandoning each of the time slots in which it runs at a respective one of the work threads. [20" id="c-fr-0020] 20. A computer program product comprising code incorporated on a storage readable by a computer and which is arranged to execute on the processor of any one of claims 1 to 18, in which the code comprises the supervisor wire and the wires of job.
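The slot-sharing scheme of the claims above (the supervisor starts in every slot, relinquishes slots one by one to work threads, and receives each slot back when its work thread exits; claim 17 adds a multi-run launch at strided code addresses) can be sketched as a toy model. Names such as `BarrelScheduler`, `relinquish` and `multi_run` are illustrative assumptions, not the patent's instruction mnemonics:

```python
SUPERVISOR = "SV"   # tag for the single supervisor context
NUM_SLOTS = 4       # time slots in the repeating sequence

class BarrelScheduler:
    def __init__(self):
        # One work-thread context per slot plus one extra supervisor
        # context (NUM_SLOTS + 1 register sets in total); the supervisor
        # begins by running in every slot.
        self.slots = [SUPERVISOR] * NUM_SLOTS

    def relinquish(self, slot, worker):
        # Supervisor's relinquish instruction: hand one slot to a work
        # thread (supervisor-only, per claim 6).
        assert self.slots[slot] == SUPERVISOR, "only the supervisor may relinquish"
        self.slots[slot] = worker

    def exit_worker(self, slot):
        # Work thread's exit instruction: the slot implicitly returns to
        # the supervisor, as in claims 10 and 11.
        assert self.slots[slot] != SUPERVISOR, "supervisor cannot exit"
        self.slots[slot] = SUPERVISOR

    def multi_run(self, first_addr, stride):
        # Claim 17's multi-run launch, here filling every slot: worker k
        # fetches code from first_addr + k * stride.
        for k in range(NUM_SLOTS):
            self.slots[k] = f"W@{first_addr + k * stride:#x}"

    def round(self):
        # One pass of the repeating sequence: who runs in each slot.
        return list(self.slots)

sched = BarrelScheduler()
sched.relinquish(0, "W0")
sched.relinquish(2, "W1")
print(sched.round())     # ['W0', 'SV', 'W1', 'SV']
sched.exit_worker(0)
print(sched.round())     # ['SV', 'SV', 'W1', 'SV']
```

The model makes the counting argument of claim 1 visible: four slots can interleave at most four work threads, yet a fifth context set is needed so the supervisor's state survives while its slots are lent out.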
Priority applications:
- GB1717303.0, filed 2017-10-20: Scheduling tasks in a multi-threaded processor
- US 15/885,925, filed 2018-02-01: Scheduling tasks in a multi-threaded processor
- GB1816891.4, filed 2018-10-17: Scheduling tasks in a multi-threaded processor