NEURAL NETWORKS FOR ATTENTION-BASED SEQUENCE TRANSDUCTION
Patent abstract:
Methods, systems and apparatus, including computer programs encoded on a computer storage medium, for generating an output sequence from an input sequence. In one aspect, one of the systems includes an encoder neural network configured to receive the input sequence and generate encoded representations of the network inputs, the encoder neural network comprising a sequence of one or more encoder subnetworks, each encoder subnetwork configured to receive a respective encoder subnetwork input for each of the input positions and to generate a respective subnetwork output for each of the input positions, and each encoder subnetwork comprising: an encoder self-attention sub-layer that is configured to receive the subnetwork input for each of the input positions and, for each particular input position in the input order: apply an attention mechanism over the encoder subnetwork inputs using one or more queries derived from the encoder subnetwork input at the particular input position.

Publication number: BR112019014822A2
Application number: R112019014822-1
Filing date: 2018-05-23
Publication date: 2020-02-27
Inventors: M. Shazeer Noam; Nicholas Gomez Aidan; Mieczyslaw Kaiser Lukasz; D. Uszkoreit Jakob; Owen Jones Llion; J. Parmar Niki; Polosukhin Illia; Teku Vaswani Ashish
Applicant: Google Llc
IPC main classification:
Patent description:
NEURAL NETWORKS FOR ATTENTION-BASED SEQUENCE TRANSDUCTION

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

[001] This patent application is non-provisional and claims priority to US Provisional Patent Application No. 62/510,256, filed on May 23, 2017, and US Provisional Patent Application No. 62/541,594, filed on August 4, 2017. The entire contents of the foregoing patent applications are incorporated herein by reference.

BACKGROUND

[002] This patent application relates to transducing sequences using neural networks.

[003] Neural networks are machine learning models that use one or more layers of non-linear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as an input to the next layer in the network, that is, the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with the current values of a respective set of parameters.

SUMMARY

[004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that generates an output sequence that includes a respective output at each of multiple positions in an output order from an input sequence that includes a respective input at each of multiple positions in an input order, that is, it converts the input sequence into the output sequence. In particular, the system generates the output sequence using an encoder neural network and a decoder neural network that are both attention-based.

[005] Particular embodiments of the subject matter described in this specification can be implemented so as to achieve one or more of the following advantages.

[006] Many existing techniques for transducing sequences using neural networks use recurrent neural networks in both the encoder and the decoder. Although these kinds of networks can achieve good performance on sequence transduction tasks, their computation is sequential in nature, that is, a recurrent neural network generates an output at a current time step conditioned on the hidden state of the recurrent neural network at the preceding time step. This sequential nature precludes parallelization, resulting in long training and inference times and, consequently, in workloads that consume a large amount of computational resources.

[007] On the other hand, because the encoder and the decoder of the described sequence transduction neural network are attention-based, the sequence transduction neural network can transduce sequences more quickly, be trained more quickly, or both, since the operation of the network can be more easily parallelized. That is, because the sequence transduction neural network relies entirely on an attention mechanism to draw global dependencies between input and output and does not employ any recurrent neural network layers with long training and inference times, the problems of long training and inference times and of high resource usage caused by the sequential nature of recurrent neural network layers are mitigated.

[008] In addition, the sequence transduction neural network can transduce sequences more accurately than existing networks that are based on convolutional layers or recurrent layers, even though training and inference times are shorter.
In particular, in conventional models the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions, for example linearly or logarithmically depending on the model architecture. This makes it more difficult to learn dependencies between distant positions during training. In the presently described sequence transduction neural network, this number of operations is reduced to a constant number of operations through the use of attention (and, in particular, self-attention) while not relying on recurrence or convolutions. Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. The use of attention mechanisms allows the sequence transduction neural network to effectively learn dependencies between distant positions during training, improving the accuracy of the sequence transduction neural network on various transduction tasks, for example machine translation. In fact, the described sequence transduction neural network can achieve state-of-the-art results on the machine translation task even though it is easier to train and generates outputs more quickly than conventional machine translation neural networks. The sequence transduction neural network can also exhibit improved performance relative to conventional machine translation neural networks without task-specific tuning, through the use of the attention mechanism.

[009] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and in the description below. Other features, aspects and advantages of the subject matter will become apparent from the description, the drawings and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Figure 1 shows an example neural network system.

[0011] Figure 2 is a diagram showing attention mechanisms that are applied by the attention sub-layers in the subnetworks of the encoder neural network and of the decoder neural network.

[0012] Figure 3 is a flowchart of an example process for generating an output sequence from an input sequence.

[0013] Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0014] This specification describes a system implemented as computer programs on one or more computers in one or more locations that generates an output sequence that includes a respective output at each of multiple positions in an output order from an input sequence that includes a respective input at each of multiple positions in an input order, that is, it converts the input sequence into the output sequence.

[0015] For example, the system may be a neural machine translation system. That is, if the input sequence is a sequence of words in an original language, for example a sentence or phrase, the output sequence may be a translation of the input sequence into a target language, that is, a sequence of words in the target language that represents the sequence of words in the original language.

[0016] As another example, the system may be a speech recognition system.
That is, if the input sequence is a sequence of audio data that represents a spoken utterance, the output sequence may be a sequence of graphemes, characters or words that represents the utterance, that is, a transcription of the input sequence.

[0017] As another example, the system may be a natural language processing system. For example, if the input sequence is a sequence of words in an original language, for example a sentence or phrase, the output sequence may be a summary of the input sequence in the original language, that is, a sequence that has fewer words than the input sequence but that retains the essential meaning of the input sequence. As another example, if the input sequence is a sequence of words that form a question, the output sequence may be a sequence of words that form an answer to the question.

[0018] As another example, the system may be part of a computer-assisted medical diagnosis system. For example, the input sequence may be a sequence of data from an electronic medical record and the output sequence may be a sequence of predicted treatments.

[0019] As another example, the system may be part of an image processing system. For example, the input sequence may be an image, that is, a sequence of color values from the image, and the output may be a sequence of text that describes the image. As another example, the input sequence may be a sequence of text or a different context and the output may be an image that describes the context.

[0020] In particular, the neural network includes an encoder neural network and a decoder neural network. Generally, both the encoder and the decoder are attention-based, that is, both apply an attention mechanism over their respective received inputs while transducing the input sequence. In some cases, neither the encoder nor the decoder includes any convolutional layers or any recurrent layers.

[0021] Figure 1 shows an example neural network system 100. The neural network system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components and techniques described below can be implemented.

[0022] The neural network system 100 receives an input sequence 102 and processes the input sequence 102 to transduce the input sequence 102 into an output sequence 152.

[0023] The input sequence 102 has a respective network input at each of multiple input positions in an input order and the output sequence 152 has a respective network output at each of multiple output positions in an output order. That is, the input sequence 102 has multiple inputs arranged according to an input order and the output sequence 152 has multiple outputs arranged according to an output order.

[0024] As described above, the neural network system 100 can perform any of a variety of tasks that require processing sequential inputs to generate sequential outputs.

[0025] The neural network system 100 includes an attention-based sequence transduction neural network 108, which in turn includes an encoder neural network 110 and a decoder neural network 150.

[0026] The encoder neural network 110 is configured to receive the input sequence 102 and to generate a respective encoded representation of each of the network inputs in the input sequence.
Generally, an encoded representation is a vector or another ordered collection of numeric values.

[0027] The decoder neural network 150 is then configured to use the encoded representations of the network inputs to generate the output sequence 152.

[0028] Generally, and as will be described in more detail below, both the encoder 110 and the decoder 150 are attention-based. In some cases, neither the encoder nor the decoder includes any convolutional layers or any recurrent layers.

[0029] The encoder neural network 110 includes an embedding layer 120 and a sequence of one or more encoder subnetworks 130. In particular, as shown in Figure 1, the encoder neural network includes N encoder subnetworks 130.

[0030] The embedding layer 120 is configured to, for each network input in the input sequence, map the network input to a numerical representation of the network input in an embedding space, for example to a vector in the embedding space. The embedding layer 120 then provides the numerical representations of the network inputs to the first subnetwork in the sequence of encoder subnetworks 130, that is, to the first encoder subnetwork 130 of the N encoder subnetworks 130.

[0031] In particular, in some implementations, the embedding layer 120 is configured to map each network input to an embedded representation of the network input and then to combine, for example sum or average, the embedded representation of the network input with a positional embedding of the input position of the network input in the input order to generate a combined embedded representation of the network input. That is, each position in the input sequence has a corresponding embedding, and for each network input the embedding layer 120 combines the embedded representation of the network input with the embedding of the position of the network input in the input sequence. Such positional embeddings can allow the model to make full use of the order of the input sequence without relying on recurrence or convolutions.

[0032] In some cases, the positional embeddings are learned. As used in this specification, the term "learned" means that an operation or a value has been adjusted during the training of the sequence transduction neural network 108. The training of the sequence transduction neural network 108 is described below with reference to Figure 3.

[0033] In some other cases, the positional embeddings are fixed and are different for each position. For example, the embeddings can be made up of sine and cosine functions of different frequencies and can satisfy:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)),

where pos is the position, i is the dimension within the positional embedding and dmodel is the dimensionality of the positional embedding (and of the other vectors processed by the neural network 108). The use of sinusoidal positional embeddings can allow the model to extrapolate to longer sequence lengths, which can increase the range of applications for which the model can be used.

[0034] The combined embedded representation is then used as the numerical representation of the network input.
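For illustration, the fixed sinusoidal positional embeddings above can be computed in a few lines. The following is a minimal NumPy sketch under stated assumptions: the function and argument names are ours, not from the specification, and dmodel is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_embeddings(num_positions: int, d_model: int) -> np.ndarray:
    """Returns an array of shape (num_positions, d_model) where
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model)).
    Assumes d_model is even."""
    positions = np.arange(num_positions)[:, np.newaxis]      # (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]      # values of 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even embedding dimensions
    pe[:, 1::2] = np.cos(angles)   # odd embedding dimensions
    return pe

# The combined representation of a network input would then be, for example, the
# sum of its embedded representation and the positional embedding of its position:
# combined = token_embeddings + sinusoidal_positional_embeddings(seq_len, d_model)
```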
[0035] Each of the encoder subnetworks 130 is configured to receive a respective encoder subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions.

[0036] The subnetwork outputs generated by the last encoder subnetwork in the sequence are then used as the encoded representations of the network inputs.

[0037] For the first encoder subnetwork in the sequence, the encoder subnetwork input is the numerical representation generated by the embedding layer 120 and, for each encoder subnetwork other than the first encoder subnetwork in the sequence, the encoder subnetwork input is the encoder subnetwork output of the preceding encoder subnetwork in the sequence.

[0038] Each encoder subnetwork 130 includes an encoder self-attention sub-layer 132. The encoder self-attention sub-layer 132 is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order, apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position. In some cases, the attention mechanism is a multi-head attention mechanism. The attention mechanism and how it is applied by the encoder self-attention sub-layer 132 will be described in greater detail below with reference to Figure 2.

[0039] In some implementations, each of the encoder subnetworks 130 also includes a residual connection layer that combines the outputs of the encoder self-attention sub-layer with the inputs to the encoder self-attention sub-layer to generate an encoder self-attention residual output, and a layer normalization layer that applies layer normalization to the encoder self-attention residual output. These two layers are collectively referred to as an "Add & Norm" operation in Figure 1.

[0040] Some or all of the encoder subnetworks may also include a position-wise feed-forward layer 134 that is configured to operate on each position in the input sequence separately. In particular, for each input position, the feed-forward layer 134 is configured to receive an input at the input position and apply a sequence of transformations to the input at the input position to generate an output for the input position. For example, the sequence of transformations can include two or more learned linear transformations each separated by an activation function, for example an elementwise non-linear activation function, for example a ReLU activation function, which can allow for faster and more effective training on large and complex data sets. The inputs received by the position-wise feed-forward layer 134 can be the outputs of the layer normalization layer when the residual and layer normalization layers are included, or the outputs of the encoder self-attention sub-layer 132 when the residual and layer normalization layers are not included. The transformations applied by the layer 134 will generally be the same for each input position (but different feed-forward sub-layers in different subnetworks will apply different transformations).
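For illustration, a minimal NumPy sketch of the position-wise feed-forward layer and of the residual-plus-layer-normalization ("Add & Norm") operation described above follows. The parameter names, the example sizes (d_model = 512, d_ff = 2048), the random initialization and the epsilon value are illustrative assumptions, not values fixed by the specification.

```python
import numpy as np

def position_wise_feed_forward(x, w1, b1, w2, b2):
    """Applies two learned linear transformations separated by a ReLU,
    identically and independently at every position. x: (seq_len, d_model)."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # first linear transformation + ReLU
    return hidden @ w2 + b2                  # second linear transformation

def layer_norm(x, gain, bias, eps=1e-6):
    """Normalizes each position's vector to zero mean and unit variance over the
    model dimension, then applies a learned per-dimension gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def add_and_norm(sublayer_input, sublayer_output, gain, bias):
    # Residual connection followed by layer normalization ("Add & Norm").
    return layer_norm(sublayer_input + sublayer_output, gain, bias)

# Illustrative usage with assumed sizes:
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 10
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
gain, bias = np.ones(d_model), np.zeros(d_model)
out = add_and_norm(x, position_wise_feed_forward(x, w1, b1, w2, b2), gain, bias)
```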
[0041] In cases where an encoder subnetwork 130 includes a position-wise feed-forward layer 134, the encoder subnetwork may also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate an encoder position-wise residual output, and a layer normalization layer that applies layer normalization to the encoder position-wise residual output. These two layers are also collectively referred to as an "Add & Norm" operation in Figure 1. The outputs of this layer normalization layer can then be used as the outputs of the encoder subnetwork 130.

[0042] Once the encoder neural network 110 has generated the encoded representations, the decoder neural network 150 is configured to generate the output sequence in an autoregressive manner. That is, the decoder neural network 150 generates the output sequence by, at each of a plurality of generation time steps, generating a network output for a corresponding output position conditioned on (i) the encoded representations and (ii) the network outputs at output positions that precede the output position in the output order.

[0044] In particular, for a given output position, the decoder neural network generates an output that defines a probability distribution over possible network outputs at the given output position. The decoder neural network can then select a network output for the output position by sampling from the probability distribution or by selecting the network output with the highest probability.

[0045] Because the decoder neural network 150 is autoregressive, at each generation time step the decoder 150 operates on the network outputs that were already generated before that generation time step, that is, the network outputs at output positions that precede the corresponding output position in the output order. In some implementations, to ensure that this is the case during both inference and training, at each generation time step the decoder neural network 150 shifts the already generated network outputs to the right by one output order position (that is, it introduces a one-position offset into the already generated sequence of network outputs) and (as will be described in more detail below) masks certain operations so that positions can only attend to positions up to and including that position in the output sequence (and not to subsequent positions). Although the remainder of the description below describes that, when generating a given output at a given output position, various components of the decoder 150 operate on data at output positions that precede the given output position (and not on data at any other positions), it will be understood that this type of conditioning can be effectively implemented using the shifting described above.

[0046] The decoder neural network 150 includes an embedding layer 160, a sequence of decoder subnetworks 170, a linear layer 180 and a softmax layer. In particular, as shown in Figure 1, the decoder neural network includes N decoder subnetworks 170. However, although the example in Figure 1 shows the encoder 110 and the decoder 150 including the same number of subnetworks, in some cases the encoder 110 and the decoder 150 include different numbers of subnetworks. That is, the decoder 150 may include more or fewer subnetworks than the encoder 110.
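For illustration, the autoregressive generation procedure described above can be sketched as a simple loop. The decoder is represented here by a hypothetical function, decoder_step, that, given the encoded representations and the already generated outputs shifted right by one position, returns a probability distribution over possible network outputs for the next output position; the placeholder START and END symbols, the greedy (argmax) selection and the dummy stand-in function are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

START, END = 1, 2   # illustrative placeholder symbols

def generate(encoded, decoder_step, max_len=50):
    """Greedy autoregressive generation: at each step, feed the shifted
    already-generated outputs and the encoded representations to the decoder,
    take the most probable next output, and stop at the END symbol."""
    outputs = []
    for _ in range(max_len):
        shifted = [START] + outputs                 # shift right by one position
        probs = decoder_step(encoded, shifted)      # distribution over outputs
        next_output = int(np.argmax(probs))         # or sample from probs
        if next_output == END:
            break
        outputs.append(next_output)
    return outputs

# Dummy stand-in decoder for demonstration only: emits symbol 3 twice, then END.
def dummy_decoder_step(encoded, shifted_outputs):
    probs = np.zeros(10)
    probs[3 if len(shifted_outputs) < 3 else END] = 1.0
    return probs

print(generate(encoded=None, decoder_step=dummy_decoder_step))   # [3, 3]
```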
[0047] The embedding layer 160 is configured to, at each generation time step, for each network output at an output position that precedes the current output position in the output order, map the network output to a numerical representation of the network output in the embedding space. The embedding layer 160 then provides the numerical representations of the network outputs to the first subnetwork 170 in the sequence of decoder subnetworks, that is, to the first decoder subnetwork 170 of the N decoder subnetworks.

[0048] In particular, in some implementations, the embedding layer 160 is configured to map each network output to an embedded representation of the network output and to combine the embedded representation of the network output with a positional embedding of the output position of the network output in the output order to generate a combined embedded representation of the network output. The combined embedded representation is then used as the numerical representation of the network output. The embedding layer 160 generates the combined embedded representation in the same way as described above with reference to the embedding layer 120.

[0049] Each decoder subnetwork 170 is configured to, at each generation time step, receive a respective decoder subnetwork input for each of the plurality of output positions that precede the corresponding output position and to generate a respective decoder subnetwork output for each of the plurality of output positions that precede the corresponding output position (or, equivalently, when the output sequence has been shifted to the right, for each network output at a position up to and including the current output position).

[0050] In particular, each decoder subnetwork 170 includes two different attention sub-layers: a decoder self-attention sub-layer 172 and an encoder-decoder attention sub-layer 174.

[0051] Each decoder self-attention sub-layer 172 is configured to, at each generation time step, receive an input for each output position that precedes the corresponding output position and, for each of those particular output positions, apply an attention mechanism over the inputs at the output positions that precede the corresponding position using one or more queries derived from the input at the particular output position to generate an updated representation for the particular output position. That is, the decoder self-attention sub-layer 172 applies an attention mechanism that is masked so that it does not attend to, or otherwise process, any data that is not at a position that precedes the current output position in the output sequence.

[0052] Each encoder-decoder attention sub-layer 174, on the other hand, is configured to, at each generation time step, receive an input for each output position that precedes the corresponding output position and, for each of those output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input at the output position to generate an updated representation for the output position. Therefore, the encoder-decoder attention sub-layer 174 applies attention over the encoded representations while the decoder self-attention sub-layer 172 applies attention over the inputs at the output positions.

[0053] The attention mechanism applied by each of these attention sub-layers will be described in more detail below with reference to Figure 2.
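For illustration, the masking used by the decoder self-attention sub-layer (positions may only attend to themselves and to earlier positions) can be built as an additive mask. A minimal NumPy sketch follows; the helper name is ours, not from the specification.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Returns a (seq_len, seq_len) matrix that is 0 where attention is allowed
    and negative infinity above the main diagonal. Adding it to the scaled
    compatibility scores before the softmax zeroes out the attention weights
    for positions that come after the current output position."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

mask = causal_mask(4)
# mask is 0 on and below the diagonal, -inf above it.
```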
[0054] In Figure 1, the decoder self-attention sub-layer 172 is shown as coming before the encoder-decoder attention sub-layer in the processing order within the decoder subnetwork 170. In other examples, however, the decoder self-attention sub-layer 172 may come after the encoder-decoder attention sub-layer 174 in the processing order within the decoder subnetwork 170, or different subnetworks may have different processing orders.

[0055] In some implementations, each decoder subnetwork 170 includes, after the decoder self-attention sub-layer 172, after the encoder-decoder attention sub-layer 174, or after each of the two sub-layers, a residual connection layer that combines the outputs of the attention sub-layer with the inputs to the attention sub-layer to generate a residual output, and a layer normalization layer that applies layer normalization to the residual output. Figure 1 shows these two layers being inserted after each of the two sub-layers, both referred to as an "Add & Norm" operation.

[0056] Some or all of the decoder subnetworks 170 also include a position-wise feed-forward layer 176 that is configured to operate in a similar manner to the position-wise feed-forward layer 134 of the encoder 110. In particular, the layer 176 is configured to, at each generation time step, for each output position that precedes the corresponding output position: receive an input at the output position and apply a sequence of transformations to the input at the output position to generate an output for the output position. For example, the sequence of transformations can include two or more learned linear transformations each separated by an activation function, for example an elementwise non-linear activation function, for example a ReLU activation function. The inputs received by the position-wise feed-forward layer 176 can be the outputs of the layer normalization layer (after the last attention sub-layer in the subnetwork 170) when the residual and layer normalization layers are included, or the outputs of the last attention sub-layer in the subnetwork 170 when the residual and layer normalization layers are not included.

[0057] In cases where a decoder subnetwork 170 includes a position-wise feed-forward layer 176, the decoder subnetwork may also include a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate a decoder position-wise residual output, and a layer normalization layer that applies layer normalization to the decoder position-wise residual output. These two layers are also collectively referred to as an "Add & Norm" operation in Figure 1. The outputs of this layer normalization layer can then be used as the outputs of the decoder subnetwork 170.

[0058] At each generation time step, the linear layer 180 applies a learned linear transformation to the output of the last decoder subnetwork 170 in order to project the output of the last decoder subnetwork 170 into the space suitable for processing by the softmax layer 190. The softmax layer 190 then applies a softmax function over the outputs of the linear layer 180 to generate the probability distribution over the possible network outputs at the generation time step. As described above, the decoder 150 can then select a network output from the possible network outputs using the probability distribution.
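For illustration, the output head described above (learned linear projection followed by a softmax, then selection of a network output) can be sketched as follows. The names, the vocabulary size and the random initialization are illustrative assumptions, not values from the specification.

```python
import numpy as np

def next_output_distribution(decoder_state, projection):
    """decoder_state: (d_model,) output of the last decoder subnetwork at the
    current position; projection: (d_model, vocab_size). Returns a probability
    distribution over the possible network outputs."""
    logits = decoder_state @ projection      # learned linear transformation
    logits -= logits.max()                   # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    return probs

# Illustrative usage with assumed sizes:
rng = np.random.default_rng(0)
state = rng.normal(size=(512,))
proj = rng.normal(size=(512, 1000)) * 0.02
probs = next_output_distribution(state, proj)
chosen = int(np.argmax(probs))               # or: rng.choice(len(probs), p=probs)
```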
[0059] Figure 2 is a diagram 200 showing the attention mechanisms that are applied by the attention sub-layers in the subnetworks of the encoder neural network 110 and of the decoder neural network 150.

[0060] Generally, an attention mechanism maps a query and a set of key-value pairs to an output, where the query, the keys and the values are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

[0061] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230. In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, for example by the square root of the dimensionality of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values. The attention sub-layer then computes a weighted sum of the values in accordance with these weights. Therefore, for scaled dot-product attention, the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.

[0062] In operation, and as shown on the left-hand side of Figure 2, the attention sub-layer computes the attention over a set of queries simultaneously. In particular, the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the rows of the matrix.

[0063] The attention sub-layer then performs a matrix multiplication (MatMul) between the matrix Q and the transpose of the matrix K to generate a matrix of compatibility function outputs.

[0064] The attention sub-layer then scales the compatibility function output matrix, that is, by dividing each element of the matrix by the scaling factor.

[0065] The attention sub-layer then applies a softmax over the scaled output matrix to generate a matrix of weights and performs a matrix multiplication (MatMul) between the weight matrix and the matrix V to generate an output matrix that includes the output of the attention mechanism for each of the values.

[0066] For sub-layers that use masking, that is, the decoder self-attention sub-layers, the attention sub-layer masks the scaled output matrix before applying the softmax. That is, the attention sub-layer masks out (sets to negative infinity) all of the values in the scaled output matrix that correspond to positions after the current output position.

[0067] In some implementations, to allow the attention sub-layers to jointly attend to information from different representation subspaces at different positions, the attention sub-layers use multi-head attention, as illustrated on the right-hand side of Figure 2.

[0068] In particular, to implement multi-head attention, the attention sub-layer applies h different attention mechanisms in parallel. In other words, the attention sub-layer includes h different attention layers, with each attention layer within the same attention sub-layer receiving the same original queries Q, the same original keys K and the same original values V.
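Before turning to the per-head transformations, the scaled dot-product attention of the preceding paragraphs can be sketched in NumPy as follows. The function and argument names are ours; the optional additive mask (0 where allowed, negative infinity elsewhere) implements the decoder-side masking described above. The multi-head combination is sketched after the next passage.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (num_queries, d_k); k: (num_keys, d_k); v: (num_keys, d_v).
    Returns the attention output, one row per query."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # MatMul + scaling
    if mask is not None:
        scores = scores + mask                      # set disallowed positions to -inf
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ v                              # weighted sum of the values
```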
[0069] Each attention layer is configured to transform the original queries, keys and values using learned linear transformations and then to apply the attention mechanism 230 to the transformed queries, keys and values. Each attention layer will generally learn different transformations from every other attention layer in the same attention sub-layer.

[0070] In particular, each attention layer is configured to apply a learned query linear transformation to each original query to generate a layer-specific query for each original query, to apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and to apply a learned value linear transformation to each original value to generate a layer-specific value for each original value. The attention layer then applies the attention mechanism described above using these layer-specific queries, keys and values to generate outputs for the attention layer.

[0071] The attention sub-layer then combines the outputs of the attention layers to generate the final output of the attention sub-layer. As shown in Figure 2, the attention sub-layer concatenates (concat) the outputs of the attention layers and applies a learned linear transformation to the concatenated output to generate the output of the attention sub-layer.

[0072] In some cases, the learned transformations applied by the attention sub-layer reduce the dimensionality of the original keys and values and, optionally, of the queries. For example, when the dimensionality of the original keys, values and queries is d and there are h attention layers in the sub-layer, the sub-layer can reduce the dimensionality of the original keys, values and queries to d/h. This keeps the computational cost of the multi-head attention mechanism similar to the cost of running the attention mechanism once at full dimensionality while at the same time increasing the representational capacity of the attention sub-layer.

[0073] Although the attention mechanism applied by each attention sub-layer is the same, the queries, keys and values are different for different types of attention. That is, different types of attention sub-layers use different sources for the original queries, keys and values that are received as input by the attention sub-layer.

[0074] In particular, when the attention sub-layer is an encoder self-attention sub-layer, all of the keys, values and queries come from the same place, in this case the output of the previous subnetwork in the encoder, or, for the encoder self-attention sub-layer in the first subnetwork, the input embeddings, and each position in the encoder can attend to all positions in the input order. Therefore, there is a respective key, value and query for each position in the input order.

[0075] When the attention sub-layer is a decoder self-attention sub-layer, each position in the decoder attends to all of the positions in the decoder that precede that position. Therefore, all of the keys, values and queries come from the same place, in this case the output of the previous subnetwork in the decoder, or, for the decoder self-attention sub-layer in the first decoder subnetwork, the embeddings of the outputs already generated. Therefore, there is a respective key, value and query for each position in the output order before the current position.
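For illustration, a minimal NumPy sketch of multi-head attention as described above follows: each of the h attention layers ("heads") applies its own learned linear transformations (reducing the dimensionality to d/h), applies scaled dot-product attention, and the concatenated head outputs are passed through a final learned linear transformation. All parameter names, shapes and the random initialization are illustrative assumptions, not from the patent; d is assumed divisible by h.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    return softmax(scores) @ v

def multi_head_attention(queries, keys, values, params, mask=None):
    """queries: (n_q, d); keys/values: (n_kv, d); params: dict with 'heads', a
    list of h per-head dicts each holding 'wq', 'wk', 'wv' of shape (d, d//h),
    and 'wo' of shape (d, d)."""
    head_outputs = []
    for head in params['heads']:
        q = queries @ head['wq']     # layer-specific queries
        k = keys @ head['wk']        # layer-specific keys
        v = values @ head['wv']      # layer-specific values
        head_outputs.append(attention(q, k, v, mask))
    concat = np.concatenate(head_outputs, axis=-1)    # (n_q, d)
    return concat @ params['wo']                       # final learned linear transform

# Illustrative usage with assumed sizes d = 512 and h = 8 heads:
rng = np.random.default_rng(0)
d, h, n = 512, 8, 10
x = rng.normal(size=(n, d))
params = {'heads': [{'wq': rng.normal(size=(d, d // h)) * 0.02,
                     'wk': rng.normal(size=(d, d // h)) * 0.02,
                     'wv': rng.normal(size=(d, d // h)) * 0.02} for _ in range(h)],
          'wo': rng.normal(size=(d, d)) * 0.02}
out = multi_head_attention(x, x, x, params)   # encoder self-attention: Q, K, V all come from x
```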
[0076] When the attention sub-layer is an encoder-decoder attention sub-layer, the queries come from the previous component in the decoder and the keys and values come from the encoder output, that is, from the encoded representations generated by the encoder. This allows every position in the decoder to attend to all positions in the input sequence. Therefore, there is a respective query for each position in the output order before the current position and a respective key and a respective value for each position in the input order.

[0077] In greater detail, when the attention sub-layer is an encoder self-attention sub-layer, for each particular input position in the input order the encoder self-attention sub-layer is configured to apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position.

[0078] When the encoder self-attention sub-layer implements multi-head attention, each encoder self-attention layer in the encoder self-attention sub-layer is configured to: apply a learned query linear transformation to each encoder subnetwork input at each input position to generate a respective query for each input position, apply a learned key linear transformation to each encoder subnetwork input at each input position to generate a respective key for each input position, apply a learned value linear transformation to each encoder subnetwork input at each input position to generate a respective value for each input position, and then apply the attention mechanism (that is, the scaled dot-product attention mechanism described above) using the queries, keys and values to determine an initial encoder self-attention output for each input position. The sub-layer then combines the initial outputs of the attention layers as described above.

[0079] When the attention sub-layer is a decoder self-attention sub-layer, the decoder self-attention sub-layer is configured to, at each generation time step: receive an input for each output position that precedes the corresponding output position and, for each of those particular output positions, apply an attention mechanism over the inputs at the output positions that precede the corresponding position using one or more queries derived from the input at the particular output position to generate an updated representation for the particular output position.

[0080] When the decoder self-attention sub-layer implements multi-head attention, each attention layer in the decoder self-attention sub-layer is configured to, at each generation time step: apply a learned query linear transformation to the input at each output position that precedes the corresponding output position to generate a respective query for each output position, apply a learned key linear transformation to each input at each output position that precedes the corresponding output position to generate a respective key for each output position, and apply a learned value linear transformation to each input at each output position that precedes the corresponding output position to generate a respective value for each output position, and then apply the attention mechanism (that is, the scaled dot-product attention mechanism described above) using the queries, keys and values to determine an initial decoder self-attention output for each of the output positions.
The sub-layer then combines the initial outputs of the attention layers as described above.

[0081] When the attention sub-layer is an encoder-decoder attention sub-layer, the encoder-decoder attention sub-layer is configured to, at each generation time step: receive an input for each output position that precedes the corresponding output position and, for each of those output positions, apply an attention mechanism over the encoded representations at the input positions using one or more queries derived from the input at the output position to generate an updated representation for the output position.

[0082] When the encoder-decoder attention sub-layer implements multi-head attention, each attention layer is configured to, at each generation time step: apply a learned query linear transformation to the input at each output position that precedes the corresponding output position to generate a respective query for each output position, apply a learned key linear transformation to each encoded representation at each input position to generate a respective key for each input position, apply a learned value linear transformation to each encoded representation at each input position to generate a respective value for each input position, and then apply the attention mechanism (that is, the scaled dot-product attention mechanism described above) using the queries, keys and values to determine an initial encoder-decoder attention output for each output position. The sub-layer then combines the initial outputs of the attention layers as described above.

[0083] Figure 3 is a flowchart of an example process 300 for generating an output sequence from an input sequence. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural network system, for example the neural network system 100 of Figure 1, appropriately programmed in accordance with this specification, can perform the process 300.

[0084] The system receives an input sequence (step 310).
Since the decoding subnets include coding-decoding attention sublayers as well as decoding self-attention sublayers, the decoder makes use of both the outputs already generated and the encoded representations Petition 870190068076, of 07/18/2019, p. 65/91 28/37 when generating the given output. [0087] The system can execute process 300 for sequences of inputs for which the desired output, that is and the sequence of exits that should be generated by the system for the sequence of entries is not known.[0088] The system can also execute the process 300 in input strings on a set of data from training, that is, a set of inputs for which the sequence of outputs that must be generated by the system is known, with the purpose of training the encoder and decoder to determine trained values for the encoder and decoder parameters. Process 300 can be performed repeatedly on selected inputs from a training data set as part of a conventional machine learning training technique to train the initial layers of neural network, for example, a decreasing gradient with training technique backpropagation that uses an optimizer conventional, per example, the optimizer Adam. During training, O system can incorporate any number of techniques to improve the velocity, the effectiveness, or both of process of training . For example, the system can use shutdown, label smoothing, or both to reduce overfitting. As another example, the system can perform training using a distributed architecture that trains multiple cases of the sequence transduction neural network in parallel. [0089] This specification uses the term configured in connection with systems and components of Petition 870190068076, of 07/18/2019, p. 66/91 29/37 computer programs. For a system of one or more computers, being configured to perform specific operations or actions means that the system has installed software, firmware, hardware, or a combination of these that in operation cause the system to perform the operations or actions. For one or more computer programs, being configured to perform specific operations or actions means that the one or more programs include instructions that, when executed by the data processing device, cause the device to perform the operations or actions. [0090] The modalities of the subject and the functional operations described in this specification can be implemented in digital electronic circuits, in software or computer firmware tangibly incorporated, in computer hardware, including the structures revealed in this specification and their structural equivalents, or in combination with one or more of these. The modalities of the subject described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded in a tangible non-transitory storage medium for execution by, or to control the operation of the data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these. Alternatively or in addition, the instructions for Petition 870190068076, of 07/18/2019, p. 67/91 30/37 program can be encoded in an artificially generated propagated signal, for example, an electrical, optical or electromagnetic signal generated by a machine, which is generated to encode information for transmission to a receiving device suitable for execution by a data processing device. 
[0089] This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers, being configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of these that in operation cause the system to perform the operations or actions. For one or more computer programs, being configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.

[0090] The embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of these. The embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these. Alternatively or in addition, the program instructions can be encoded in an artificially generated propagated signal, for example an electrical, optical or electromagnetic signal generated by a machine, which is generated to encode information for transmission to a suitable receiving apparatus for execution by a data processing apparatus.

[0091] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, for example code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of these.

[0092] A computer program, which can also be referred to or described as a program, software, a software application, an application, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program can, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, for example one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

[0093] In this specification, the term "database" is used broadly to refer to any collection of data: the data need not be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which can be organized and accessed differently.

[0094] Similarly, in this specification the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

[0095] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, for example an FPGA or an ASIC, or by a combination of special-purpose logic circuitry and one or more programmed computers.

[0096] Computers suitable for the execution of a computer program can be based on general-purpose or special-purpose microprocessors or both, or on any other kind of central processing unit.
Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example magnetic or magneto-optical disks, or optical disks. However, a computer need not have such devices. In addition, a computer can be embedded in another device, for example a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, for example a universal serial bus (USB) flash drive, to name just a few.

[0097] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including, for example, semiconductor memory devices, for example EPROM, EEPROM, and flash memory devices; magnetic disks, for example internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

[0098] To provide for interaction with a user, the embodiments of the subject matter described in this specification can be implemented on a computer that has a display device, for example a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can also be used to provide for interaction with a user; for example, the feedback provided to the user can be any form of sensory feedback, for example visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to, and receiving documents from, a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. In addition, a computer can interact with a user by sending text messages or other forms of message to a personal device, for example a smartphone that runs a messaging application, and receiving response messages from the user in return.

[0099] The data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing the common and computationally intensive parts of machine learning training or production, that is, inference, workloads.

[00100] Machine learning models can be implemented and deployed using a machine learning framework, for example a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
[00101] The embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, for example a data server, or that includes a middleware component, for example an application server, or that includes a front-end component, for example a client computer that has a graphical user interface, a web browser, or an application through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more of such back-end, middleware or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, for example a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), for example the Internet.

[00102] The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, for example an HTML page, to a user's device, for example for the purpose of displaying data to, and receiving user input from, a user interacting with the device, which acts as a client. Data generated at the user's device, for example a result of the user interaction, can be received at the server from the device.

[00103] Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features of a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

[00104] Similarly, although the operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all of the illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing can be advantageous. Moreover, the separation of various modules and system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[00105] Particular embodiments of the subject matter have been described.
Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing can be advantageous.
Claims:
1. System characterized by comprising one or more computers and one or more storage devices that store instructions that, when executed by the one or more computers, cause the one or more computers to implement a sequence transduction neural network for transducing an input sequence, which has a respective network input at each of a plurality of input positions in an input order, into an output sequence, which has a respective network output at each of a plurality of output positions in an output order, the sequence transduction neural network comprising:
an encoder neural network configured to receive the input sequence and to generate a respective encoded representation of each of the network inputs in the input sequence, the encoder neural network comprising a sequence of one or more encoder subnetworks, each encoder subnetwork configured to receive a respective encoder subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions, and each encoder subnetwork comprising:
an encoder self-attention sub-layer that is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order:
apply an attention mechanism over the encoder subnetwork inputs at the input positions using one or more queries derived from the encoder subnetwork input at the particular input position to generate a respective output for the particular input position; and
a decoder neural network configured to receive the encoded representations and to generate the output sequence.

2. System, according to claim 1, characterized by the fact that the encoder neural network also comprises an embedding layer configured to:
for each network input in the input sequence, map the network input to an embedded representation of the network input, and combine the embedded representation of the network input with a positional embedding of the input position of the network input in the input order to generate a combined embedded representation of the network input; and
provide the combined embedded representations of the network inputs as the encoder subnetwork inputs to a first encoder subnetwork in the sequence of encoder subnetworks.

3. System, according to either of claims 1 or 2, characterized by the fact that the respective encoded representations of the network inputs are the encoder subnetwork outputs generated by the last subnetwork in the sequence.

4. System, according to any one of claims 1 to 3, characterized by the fact that, for each encoder subnetwork other than the first encoder subnetwork in the sequence, the encoder subnetwork input is the encoder subnetwork output of a preceding encoder subnetwork in the sequence.

5. System, according to any one of claims 1 to 4, characterized by the fact that at least one of the encoder subnetworks also comprises:
a position-wise feed-forward layer that is configured to, for each input position:
receive an input at the input position, and
apply a sequence of transformations to the input at the input position to generate an output for the input position.

6. System, according to claim 5, characterized by the fact that the sequence comprises two learned linear transformations separated by an activation function.
7. System, according to either of claims 5 or 6, characterized in that the at least one coding subnet also comprises: a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate a position-wise encoder residual output, and a layer normalization layer that applies layer normalization to the position-wise encoder residual output.
8. System, according to any one of claims 1 to 7, characterized in that each coding subnet also comprises: a residual connection layer that combines the outputs of the coding self-attention sublayer with the inputs to the coding self-attention sublayer to generate an encoder self-attention residual output, and a layer normalization layer that applies layer normalization to the encoder self-attention residual output.
9. System, according to any one of claims 1 to 8, characterized in that each coding self-attention sublayer comprises a plurality of encoder self-attention layers.
10. System, according to claim 9, characterized in that each encoder self-attention layer is configured to: apply a learned query linear transformation to each coding subnet input at each input position to generate a respective query for each input position, apply a learned key linear transformation to each coding subnet input at each input position to generate a respective key for each input position, apply a learned value linear transformation to each coding subnet input at each input position to generate a respective value for each input position, and, for each input position, determine a respective input-position-specific weight for each of the input positions by applying a comparison function between the query for the input position and the keys, and determine an initial encoder self-attention output for the input position by determining a weighted sum of the values weighted by the corresponding input-position-specific weights for the input positions.
11. System, according to claim 10, characterized in that the coding self-attention sublayer is configured to, for each input position, combine the initial encoder self-attention outputs for the input position generated by the encoder self-attention layers to generate the output of the coding self-attention sublayer.
12. System, according to any one of claims 9 to 11, characterized in that the encoder self-attention layers operate in parallel.
13. System, according to any one of claims 1 to 12, characterized in that the decoding neural network auto-regressively generates the output sequence by, at each of a plurality of generation time steps, generating a network output at a corresponding output position conditioned on the coded representations and on the network outputs at output positions that precede the output position in the output order.
14. System, according to claim 13, characterized in that the decoding neural network comprises a sequence of decoding subnets, each decoding subnet configured to, at each generation time step, receive a respective decoder subnet input for each of the plurality of output positions that precede the corresponding output position and to generate a respective decoder subnet output for each of the plurality of output positions that precede the corresponding output position.
15. System, according to claim 14, characterized in that the decoding neural network also comprises: an embedding layer configured to, at each generation time step: for each network output at the output positions that precede the corresponding output position in the output order: map the network output to an embedded representation of the network output, and combine the embedded representation of the network output with a positional embedding of the output position of the network output in the output order to generate a combined embedded representation of the network output; and provide the combined embedded representations of the network outputs as input to a first decoding subnet in the sequence of decoding subnets.
16. System, according to either of claims 14 or 15, characterized in that at least one of the decoding subnets comprises: a position-wise feed-forward layer that is configured to, at each generation time step, for each output position that precedes the corresponding output position: receive an input at the output position, and apply a sequence of transformations to the input at the output position to generate an output for the output position.
17. System, according to claim 16, characterized in that the sequence of transformations comprises two learned linear transformations separated by an activation function.
18. System, according to either of claims 16 or 17, characterized in that the at least one decoding subnet also comprises: a residual connection layer that combines the outputs of the position-wise feed-forward layer with the inputs to the position-wise feed-forward layer to generate a residual output, and a layer normalization layer that applies layer normalization to the residual output.
19. System, according to any one of claims 14 to 18, characterized in that each decoding subnet comprises: a coding-decoding attention sublayer that is configured to, at each generation time step: receive an input for each output position that precedes the corresponding output position and, for each output position: apply an attention mechanism to the coded representations at the input positions using one or more queries obtained from the input at the output position to generate an updated representation for the output position.
20. System, according to claim 19, characterized in that each coding-decoding attention sublayer comprises a plurality of coding-decoding attention layers, and in that each coding-decoding attention layer is configured to, at each generation time step: apply a learned query linear transformation to the input at each output position that precedes the corresponding output position to generate a respective query for each output position, apply a learned key linear transformation to each coded representation at each input position to generate a respective key for each input position, apply a learned value linear transformation to each coded representation at each input position to generate a respective value for each input position, and, for each output position that precedes the corresponding output position, determine a respective output-position-specific weight for each of the input positions by applying a comparison function between the query for the output position and the keys, and determine an initial encoder-decoder attention output for the output position by determining a weighted sum of the values weighted by the corresponding output-position-specific weights for the input positions.
21. System, according to claim 20, characterized in that the coding-decoding attention sublayer is configured to, at each generation time step, combine the initial encoder-decoder attention outputs generated by the coding-decoding attention layers to generate the output of the coding-decoding attention sublayer.
22. System, according to either of claims 20 or 21, characterized in that the coding-decoding attention layers operate in parallel.
23. System, according to any one of claims 19 to 22, characterized in that each decoding subnet also comprises: a residual connection layer that combines the outputs of the coding-decoding attention sublayer with the inputs to the coding-decoding attention sublayer to generate a residual output, and a layer normalization layer that applies layer normalization to the residual output.
24. System, according to any one of claims 14 to 23, characterized in that each decoding subnet comprises: a decoder self-attention sublayer that is configured to, at each generation time step: receive an input for each output position that precedes the corresponding output position and, for each specific output position: apply an attention mechanism to the inputs at the output positions that precede the corresponding output position using one or more queries obtained from the input at the specific output position to generate an updated representation for the specific output position.
25. System, according to claim 24, characterized in that each decoder self-attention sublayer comprises a plurality of decoder self-attention layers, and in that each decoder self-attention layer is configured to, at each generation time step: apply a learned query linear transformation to the input at each output position that precedes the corresponding output position to generate a respective query for each output position, apply a learned key linear transformation to each input at each output position that precedes the corresponding output position to generate a respective key for each output position, apply a learned value linear transformation to each input at each output position that precedes the corresponding output position to generate a respective value for each output position, and, for each output position that precedes the corresponding output position, determine a respective output-position-specific weight for each of the output positions by applying a comparison function between the query for the output position and the keys, and determine an initial decoder self-attention output for the output position by determining a weighted sum of the values weighted by the corresponding output-position-specific weights for the output positions.
26. System, according to claim 25, characterized in that the decoder self-attention sublayer is configured to, at each generation time step, combine the initial decoder self-attention outputs generated by the decoder self-attention layers to generate the output of the decoder self-attention sublayer.
27. System, according to either of claims 25 or 26, characterized in that the decoder self-attention layers operate in parallel.
28. System, according to any one of claims 24 to 27, characterized in that each decoding subnet also comprises: a residual connection layer that combines the outputs of the decoder self-attention sublayer with the inputs to the decoder self-attention sublayer to generate a residual output, and a layer normalization layer that applies layer normalization to the residual output.
29. One or more computer storage media characterized by storing instructions that, when executed by one or more computers, cause the one or more computers to implement the sequence transduction neural network as defined in any one of claims 1 to 28.
30. Method characterized by comprising: receiving a sequence of inputs that has a respective input at each of a plurality of input positions in an input order; processing the input sequence through the coding neural network, as defined in any one of claims 1 to 28, to generate a respective coded representation of each of the inputs in the input sequence; and processing the coded representations through the decoding neural network, as defined in any one of claims 1 to 28, to generate a sequence of outputs that has a respective output at each of a plurality of output positions in an output order.
31. System characterized by comprising one or more computers and one or more storage devices that store instructions which, when executed by the one or more computers, cause the one or more computers to perform the operations of the method as defined in claim 30.
32. One or more computer storage media characterized by storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method as defined in claim 30.
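For illustration only, the following is a minimal NumPy sketch of the attention mechanism recited in claims 1 and 10: learned query, key and value linear transformations are applied to the input at each position, a comparison function between the query and the keys yields position-specific weights, and the output is the weighted sum of the values. A scaled dot product followed by a softmax is assumed as the comparison function, and the weight matrices and dimensions are illustrative placeholders rather than values taken from the claims.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single attention layer over one sequence.

    X:   (seq_len, d_model) coding subnet inputs, one row per input position.
    W_q, W_k, W_v: learned linear transformations producing queries, keys, values.
    Returns one output vector per input position.
    """
    Q = X @ W_q                          # one query per input position
    K = X @ W_k                          # one key per input position
    V = X @ W_v                          # one value per input position
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # assumed comparison function: scaled dot product
    weights = softmax(scores, axis=-1)   # position-specific weights for each query position
    return weights @ V                   # weighted sum of the values

# Illustrative usage with random placeholder parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))             # 6 input positions, model width 16
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 8)
```

In the multi-layer case of claims 9 to 12, several such computations run side by side, as sketched next.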
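Claims 9 to 12 recite a coding self-attention sublayer built from a plurality of attention layers that can operate in parallel, each with its own learned transformations, whose per-position outputs are then combined. The sketch below combines the layer outputs by concatenation followed by a learned projection, which is an assumption consistent with the published Transformer design and not a requirement of the claims; all sizes are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One encoder self-attention layer: learned query/key/value transformations,
    a scaled dot-product comparison, and a weighted sum of the values."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention layer.
    The layers are independent of one another, so they may run in parallel (claim 12);
    their initial outputs are combined here by concatenation and a learned projection W_o
    (an assumption; the claims only require that the per-layer outputs be combined)."""
    outputs = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

# Illustrative usage: 4 heads of width 4 over a width-16 model (placeholder sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
W_o = rng.normal(size=(16, 16))
print(multi_head_self_attention(X, heads, W_o).shape)  # (6, 16)
```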
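Claims 2 and 15 recite mapping each network input (or previously generated network output) to an embedded representation and combining it with a positional embedding of its position in the sequence. A minimal sketch follows, assuming elementwise addition as the combination and a fixed sinusoidal positional embedding; the vocabulary size and embedding width are placeholders, and the claims do not prescribe this particular positional scheme.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional embeddings, one vector per position (an assumption; the
    claims only require some positional embedding of the position in the order)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def embed_sequence(token_ids, embedding_table):
    """Map each token to its embedded representation and combine it with the
    positional embedding of its position by addition."""
    seq_len = len(token_ids)
    d_model = embedding_table.shape[1]
    token_emb = embedding_table[token_ids]            # embedded representations
    return token_emb + sinusoidal_positions(seq_len, d_model)

# Illustrative usage with placeholder sizes.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
table = rng.normal(size=(vocab_size, d_model))
print(embed_sequence([5, 9, 2, 2, 7], table).shape)   # (5, 16)
```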
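Claims 5 to 8 (and their decoder counterparts, claims 16 to 18) recite a position-wise feed-forward layer of two learned linear transformations separated by an activation function, wrapped by a residual connection layer and a layer normalization layer. A minimal sketch, assuming a ReLU activation and standard layer-normalization statistics; the weight shapes are placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization applied independently at each position."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two learned linear transformations separated by an activation (ReLU assumed),
    applied identically and independently at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection combining the feed-forward outputs with its inputs,
    followed by layer normalization, as in claims 7 and 18."""
    return layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))

# Illustrative usage with placeholder shapes.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
print(ffn_sublayer(x, W1, b1, W2, b2).shape)  # (6, 16)
```

The same residual-plus-normalization wrapping recited for the feed-forward layer is also recited around the attention sublayers (claims 8, 23 and 28), so each sublayer leaves the model width unchanged.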
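Claims 13, 19, 20, 24 and 25 recite that the decoding neural network generates the output sequence auto-regressively: the decoder self-attention sublayer attends only over output positions that precede the current position, while the coding-decoding attention sublayer derives its queries from the decoder inputs and its keys and values from the coded representations at the input positions. The sketch below illustrates both ideas with a causal mask and a cross-attention call; it is a simplified single-layer illustration under the same placeholder assumptions as above, not the full decoder stack.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_decoder_self_attention(Y, W_q, W_k, W_v):
    """Decoder self-attention: each output position attends only over the output
    positions that precede it, enforced here by masking out future positions."""
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -1e9, scores)
    return softmax(scores, axis=-1) @ V

def encoder_decoder_attention(Y, encoded, W_q, W_k, W_v):
    """Queries come from the decoder inputs at the output positions; keys and
    values come from the coded representations at the input positions."""
    Q = Y @ W_q
    K, V = encoded @ W_k, encoded @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

# Illustrative usage: 5 already-generated output positions, 6 encoded input positions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 16))
encoded = rng.normal(size=(6, 16))
W = {name: rng.normal(size=(16, 8)) for name in ("q1", "k1", "v1", "q2", "k2", "v2")}
self_out = masked_decoder_self_attention(Y, W["q1"], W["k1"], W["v1"])
cross_out = encoder_decoder_attention(Y, encoded, W["q2"], W["k2"], W["v2"])
print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8)
```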
Patent family:
Publication number | Publication date
AU2020213318A1 | 2020-08-27
KR20200129197A | 2020-11-17
CA3144657A1 | 2018-11-29
US20190392319A1 | 2019-12-26
AU2020213317A1 | 2020-08-27
US20210019623A1 | 2021-01-21
US10719764B2 | 2020-07-21
WO2018217948A1 | 2018-11-29
JP2021121951A | 2021-08-26
US20200372357A1 | 2020-11-26
RU2749945C1 | 2021-06-21
JP2021121952A | 2021-08-26
AU2018271931A1 | 2019-07-11
US10956819B2 | 2021-03-23
CN110192206A | 2019-08-30
AU2018271931B2 | 2020-05-07
JP2020506466A | 2020-02-27
US10452978B2 | 2019-10-22
US11113602B2 | 2021-09-07
JP6884871B2 | 2021-06-09
EP3542316A1 | 2019-09-25
RU2021116658A | 2021-07-05
US20220051099A1 | 2022-02-17
KR20200129198A | 2020-11-17
KR20190089980A | 2019-07-31
US20210019624A1 | 2021-01-21
KR102180002B1 | 2020-11-17
US20180341860A1 | 2018-11-29
CA3144674A1 | 2018-11-29
US20200372358A1 | 2020-11-26
CA3050334A1 | 2018-11-29
Legal status:
Date | Code | Event
2020-10-06 | B07A | Application suspended after technical examination (opinion) [chapter 7.1 patent gazette]
2021-01-26 | B09B | Patent application refused [chapter 9.2 patent gazette]
2021-04-06 | B12B | Appeal against refusal [chapter 12.2 patent gazette]
2021-10-13 | B350 | Update of information on the portal [chapter 15.35 patent gazette]
Priority:
Application number | Priority date | Filing date | Patent title
US201762510256P (US 62/510,256) | 2017-05-23 | 2017-05-23 |
US201762541594P (US 62/541,594) | 2017-08-04 | 2017-08-04 |
PCT/US2018/034224 (WO2018217948A1) | 2017-05-23 | 2018-05-23 | Attention-based sequence transduction neural networks