Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection


Richard Socher, Eric H. Huang, Jeffrey Pennington∗, Andrew Y. Ng, Christopher D. Manning

Computer Science Department, Stanford University, Stanford, CA 94305, USA
∗SLAC National Accelerator Laboratory, Stanford University, Stanford, CA 94309, USA
richard@socher.org, {ehhuang, jpennin, ang, manning}@stanford.edu

Abstract


Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning.


In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed.


We introduce a method for paraphrase detection based on recursive autoencoders (RAE).


Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees.


These features are used to measure the word- and phrase-wise similarity between two sentences.


Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size.


We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices.


The pooled representation is then used as input to a classifier.


Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus.


1 Introduction

Paraphrase detection determines whether two phrases of arbitrary length and form capture the same meaning.


Identifying paraphrases is an important task that is used in information retrieval, question answering [1], text summarization, plagiarism detection [2] and evaluation of machine translation [3], among others.


For instance, in order to avoid adding redundant information to a summary one would like to detect that the following two sentences are paraphrases:


S1: The judge also refused to postpone the trial date of Sept. 29.

S2: Obus also denied a defense motion to postpone the September trial date

We present a joint model that incorporates the similarities between both single word features as well as multi-word phrases extracted from the nodes of parse trees.


Our model is based on two novel components as outlined in Fig. 1.


The first component is an unfolding recursive autoencoder (RAE) for unsupervised feature learning from unlabeled parse trees.


The RAE is a recursive neural network.


It learns feature representations for each node in the tree such that the word vectors underneath each node can be recursively reconstructed.


2 Recursive Autoencoders

In this section we describe two variants of unsupervised recursive autoencoders which can be used to learn features from parse trees.


The RAE aims to find vector representations for variable-sized phrases spanned by each node of a parse tree.


These representations can then be used for subsequent supervised tasks.


Before describing the RAE, we briefly review neural language models which compute word representations that we give as input to our algorithm.


2.1 Neural Language Models


Collobert and Weston [6] introduced a new neural network model to compute such an embedding.


When these networks are optimized via gradient ascent, the derivatives modify the word embedding matrix L ∈ R^{n×|V|}, where |V| is the size of the vocabulary.


The word vectors inside the embedding matrix capture distributional syntactic and semantic information via the word’s co-occurrence statistics.


For further details and evaluations of these embeddings, see [5, 6, 7, 8].


Once this matrix is learned on an unlabeled corpus, we can use it for subsequent tasks by using each word’s vector (a column in L) to represent that word.


In the remainder of this paper, we represent a sentence (or any n-gram) as an ordered list of these vectors (x1, ..., xm).
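
As a toy sketch (with a hypothetical three-word vocabulary and a random matrix standing in for a trained L), representing a sentence as an ordered list of embedding columns amounts to:

```python
import numpy as np

rng = np.random.default_rng(0)
n, vocab_size = 4, 5                      # embedding size n and |V| (toy values)
L = rng.standard_normal((n, vocab_size))  # stands in for a trained L ∈ R^{n×|V|}
word_to_index = {"the": 0, "judge": 1, "refused": 2}  # hypothetical vocabulary

def embed(words):
    """Represent a sentence as the ordered list (x1, ..., xm) of columns of L."""
    return [L[:, word_to_index[w]] for w in words]

x = embed(["the", "judge", "refused"])
```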


This word representation is better suited for autoencoders than the binary number representations used in previous related autoencoder models such as the recursive autoassociative memory (RAAM) model of Pollack [9, 10] or recurrent neural networks [11] since the activations are inherently continuous.


2.2 Recursive Autoencoder

Fig. 2 (left) shows an instance of a recursive autoencoder (RAE) applied to a given parse tree as introduced by [12].


Unlike in that work, here we assume that such a tree is given for each sentence by a parser.


Initial experiments showed that having a syntactically plausible tree structure is important for paraphrase detection.


Assume we are given a list of word vectors x = (x1, ..., xm) as described in the previous section.


The binary parse tree for this input is in the form of branching triplets of parents with children: (p → c1 c2).


The trees are given by a syntactic parser.


Each child can be either an input word vector xi or a nonterminal node in the tree.


For both examples in Fig. 2, we have the following triplets: ((y1 → x2 x3), (y2 → x1 y1)).


Given this tree structure, we can now compute the parent representations.


The first parent vector p = y1 is computed from the children (c1, c2) = (x2, x3) by one standard neural network layer:

p = f(W_e [c1; c2] + b)    (1)

where [c1; c2] is simply the concatenation of the two children, f an element-wise activation function such as tanh, and W_e ∈ R^{n×2n} the encoding matrix that we want to learn.

One way of assessing how well this n-dimensional vector represents its direct children is to decode their vectors in a reconstruction layer and then to compute the Euclidean distance between the original input and its reconstruction:

[c′1; c′2] = f(W_d p + b_d),    E_rec(p) = ||[c1; c2] − [c′1; c′2]||²    (2)
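
A minimal numerical sketch of this encoding/reconstruction step (untrained random parameters; the tanh nonlinearity and bias terms follow the description in the text, so treat the exact form as an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                         # toy embedding size
W_e = 0.1 * rng.standard_normal((n, 2 * n))   # encoding matrix W_e ∈ R^{n×2n}
b_e = np.zeros(n)
W_d = 0.1 * rng.standard_normal((2 * n, n))   # decoding matrix of the reconstruction layer
b_d = np.zeros(2 * n)

def encode(c1, c2):
    """p = f(W_e [c1; c2] + b) with f = tanh (Eq. 1)."""
    return np.tanh(W_e @ np.concatenate([c1, c2]) + b_e)

def reconstruction_error(p, c1, c2):
    """Decode p and compute the squared Euclidean distance to the children (Eq. 2)."""
    c_rec = np.tanh(W_d @ p + b_d)
    return float(np.sum((np.concatenate([c1, c2]) - c_rec) ** 2))

x2, x3 = rng.standard_normal(n), rng.standard_normal(n)
y1 = encode(x2, x3)              # first parent vector p = y1
err = reconstruction_error(y1, x2, x3)
```

Training would adjust W_e and W_d to drive this error down over all nodes of all trees.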


In order to apply the autoencoder recursively, the same steps repeat.


Now that y1 is given, we can use Eq. 1 to compute y2 by setting the children to be (c1, c2) = (x1, y1).


Again, after computing the intermediate parent vector p = y2, we can assess how well this vector captures the content of the children by computing the reconstruction error as in Eq. 2.


The process repeats until the full tree is constructed and each node has an associated reconstruction error.


During training, the goal is to minimize the reconstruction error of all input pairs at nonterminal nodes p in a given parse tree T:

E_rec(T) = Σ_{p ∈ T} E_rec(p)


Since the RAE computes the hidden representations it then tries to reconstruct, it could potentially lower reconstruction error by shrinking the norms of the hidden layers.


In order to prevent this, we add a length-normalization layer p = p/||p|| to this RAE model (referred to as the standard RAE).

Another more principled solution is to use a model in which each node tries to reconstruct its entire subtree and then measure the reconstruction of the original leaf nodes.


Such a model is described in the next section.


Figure 2: Two autoencoder models with details of the reconstruction at node y2.


For simplicity we left out the reconstruction layer at the first node y1 which is the same standard autoencoder for both models.


Left: A standard autoencoder that tries to reconstruct only its direct children.


Right: The unfolding autoencoder which tries to reconstruct all leaf nodes underneath each node.


2.3 Unfolding Recursive Autoencoder

The unfolding RAE has the same encoding scheme as the standard RAE.


The difference is in the decoding step, which tries to reconstruct the entire spanned subtree underneath each node as shown in Fig. 2 (right).


For instance, at node y2, the reconstruction error is the difference between the leaf nodes underneath that node, [x1 x2 x3], and their reconstructed counterparts [x′1 x′2 x′3].

The unfolding produces the reconstructed leaves by starting at y2 and computing

[x′1; y′1] = f(W_d y2 + b_d)


Then it recursively splits y′1 again to produce vectors

[x′2; x′3] = f(W_d y′1 + b_d)


In general, we repeatedly use the decoding matrix W_d to unfold each node with the same tree structure as during encoding.

The reconstruction error is then computed from a concatenation of the word vectors in that node’s span.


For a node y that spans words i to j:

E_rec(y) = ||[xi; ...; xj] − [x′i; ...; x′j]||²

The unfolding autoencoder essentially tries to encode each hidden layer such that it best reconstructs its entire subtree to the leaf nodes.


Hence, it will not have the problem of hidden layers shrinking in norm.
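
The unfolding reconstruction described above can be sketched as follows, using the two-node example tree from Fig. 2 and untrained random matrices (the bias handling is simplified and the parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
W_e = 0.1 * rng.standard_normal((n, 2 * n))   # shared encoding matrix
W_d = 0.1 * rng.standard_normal((2 * n, n))   # shared decoding matrix

def encode(c1, c2):
    return np.tanh(W_e @ np.concatenate([c1, c2]))

def unfold(p, tree):
    """Decode p along the encoding tree; None marks a leaf.
    Returns the reconstructed leaf vectors in order."""
    if tree is None:
        return [p]
    left, right = tree
    d = np.tanh(W_d @ p)
    return unfold(d[:n], left) + unfold(d[n:], right)

x1, x2, x3 = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
y1 = encode(x2, x3)
y2 = encode(x1, y1)
# y2 spans (x1, (x2, x3)): the left child is a leaf, the right child covers two leaves.
leaves_rec = unfold(y2, (None, (None, None)))
error = sum(float(np.sum((x - xr) ** 2)) for x, xr in zip([x1, x2, x3], leaves_rec))
```

The reconstruction error at y2 is thus measured against all three leaves, not just its two direct children.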


Another potential problem of the standard RAE is that it gives equal weight to the last merged phrases even if one is only a single word (in Fig. 2, x1 and y1 have similar weight in the last merge).


In contrast, the unfolding RAE captures the increased importance of a child when the child represents a larger subtree.


2.4 Deep Recursive Autoencoder

Both types of RAE can be extended to have multiple encoding layers at each node in the tree.


Instead of transforming both children directly into parent p, we can have another hidden layer h in between.


While the top layer at each node has to have the same dimensionality as each child (in order for the same network to be recursively compatible), the hidden layer may have arbitrary dimensionality.


For the two-layer encoding network, we would replace Eq. 1 with the following:

h = f(W_e^{(1)} [c1; c2] + b^{(1)}),    p = f(W_e^{(2)} h + b^{(2)})
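
A sketch of such a two-layer encoder with a hidden layer h of a different dimensionality (toy sizes and random parameters; the exact form of the paper's equation is paraphrased, not quoted):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_h = 4, 8                                   # child size and hidden size (toy values)
W1 = 0.1 * rng.standard_normal((n_h, 2 * n))    # first encoding layer
W2 = 0.1 * rng.standard_normal((n, n_h))        # second layer maps h back to child size

def encode_deep(c1, c2):
    """Two-layer encoding: h = f(W1 [c1; c2]); p = f(W2 h)."""
    h = np.tanh(W1 @ np.concatenate([c1, c2]))
    return np.tanh(W2 @ h)

p = encode_deep(rng.standard_normal(n), rng.standard_normal(n))
# p has dimensionality n, so it can serve as a child in the next recursive step.
```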


2.5 RAE Training

For training we use a set of parse trees and then minimize the sum of all nodes’ reconstruction errors.


We compute the gradient efficiently via backpropagation through structure [13].


Even though the objective is not convex, we found that L-BFGS run with mini-batch training works well in practice.


Convergence is smooth and the algorithm typically finds a good locally optimal solution.


After the unsupervised training of the RAE, we demonstrate that the learned feature representations capture syntactic and semantic similarities and can be used for paraphrase detection.


3 An Architecture for Variable-Sized Similarity Matrices

Now that we have described the unsupervised feature learning, we explain how to use these features to classify sentence pairs as being in a paraphrase relationship or not.


3.1 Computing Sentence Similarity Matrices

Our method incorporates both single word and phrase similarities in one framework.


First, the RAE computes phrase vectors for the nodes in a given parse tree.


We then compute Euclidean distances between all word and phrase vectors of the two sentences.


These distances fill a similarity matrix S as shown in Fig. 1.


For computing the similarity matrix, the rows and columns are first filled by the words in their original sentence order.


We then add to each row and column the nonterminal nodes in a depth-first, right-to-left order.
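
Filling S then reduces to pairwise Euclidean distances between all node vectors of the two sentences (hypothetical random vectors stand in for RAE outputs here):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4

def similarity_matrix(nodes1, nodes2):
    """S[i, j] = Euclidean distance between node i of sentence 1 and node j of
    sentence 2; nodes are ordered words first, then nonterminals."""
    S = np.zeros((len(nodes1), len(nodes2)))
    for i, a in enumerate(nodes1):
        for j, b in enumerate(nodes2):
            S[i, j] = np.linalg.norm(a - b)
    return S

nodes1 = [rng.standard_normal(n) for _ in range(5)]  # 3 words + 2 nonterminals
nodes2 = [rng.standard_normal(n) for _ in range(7)]  # 4 words + 3 nonterminals
S = similarity_matrix(nodes1, nodes2)
```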


Simply extracting aggregate statistics of this table such as the average distance or a histogram of distances cannot accurately capture the global structure of the similarity comparison.


For instance, paraphrases often have low or zero Euclidean distances in elements close to the diagonal of the similarity matrix.


This happens when similar words align well between the two sentences.


However, since the matrix dimensions vary based on the sentence lengths, one cannot simply feed the similarity matrix into a standard neural network or classifier.


Figure 3: Example of the dynamic min-pooling layer finding the smallest number in a pooling window region of the original similarity matrix S.


3.2 Dynamic Pooling

Consider a similarity matrix S generated by sentences of lengths n and m.


Since the parse trees are binary and we also compare all nonterminal nodes, S ∈ R^{(2n−1)×(2m−1)}.

We would like to map S into a matrix S_pooled of fixed size, n_p × n_p.

Our first step in constructing such a map is to partition the rows and columns of S into n_p roughly equal parts, producing an n_p × n_p grid.

We then define S_pooled to be the matrix of minimum values of each rectangular region within this grid, as shown in Fig. 3.


The matrix S_pooled loses some of the information contained in the original similarity matrix but it still captures much of its global structure.


Since elements of S with small Euclidean distances show that there are similar words or phrases in both sentences, we keep this information by applying a min function to the pooling regions.


Other functions, like averaging, are also possible, but might obscure the presence of similar phrases. This dynamic pooling layer could make use of overlapping pooling regions, but for simplicity, we consider only non-overlapping pooling regions. After pooling, we normalize each entry to have 0 mean and variance 1.
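
A sketch of dynamic min-pooling (leftover rows and columns go to the last windows, matching the partition procedure described in this section; the up-sampling case n_p > R and the final normalization are omitted):

```python
import numpy as np

def window_sizes(R, n_p):
    """Split R rows into n_p windows of floor(R/n_p) rows each;
    the last R mod n_p windows get one extra row."""
    base, extra = divmod(R, n_p)
    return [base] * (n_p - extra) + [base + 1] * extra

def dynamic_min_pool(S, n_p):
    """Map a variable-sized similarity matrix S onto a fixed n_p x n_p matrix
    by taking the minimum over each grid region."""
    rows = window_sizes(S.shape[0], n_p)
    cols = window_sizes(S.shape[1], n_p)
    pooled = np.zeros((n_p, n_p))
    r0 = 0
    for i, dr in enumerate(rows):
        c0 = 0
        for j, dc in enumerate(cols):
            pooled[i, j] = S[r0:r0 + dr, c0:c0 + dc].min()
            c0 += dc
        r0 += dr
    return pooled

S = np.arange(35, dtype=float).reshape(5, 7)   # toy 5x7 similarity matrix
S_pooled = dynamic_min_pool(S, 3)
```

Taking the minimum preserves the small Euclidean distances that signal matching words or phrases.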


Table 1: Nearest neighbors of randomly chosen phrases.


Recursive averaging and the standard RAE focus mostly on the last merged words and incorrectly add extra information.


The unfolding RAE captures most closely both syntactic and semantic similarities.


1 The partitions will only be of equal size if 2n−1 and 2m−1 are divisible by n_p.

We account for this in the following way, although many alternatives are possible.


Let the number of rows of S be R = 2n-1.


Each pooling window then has ⌊R/n_p⌋ many rows.


Let M = R mod n_p be the number of remaining rows.

We then evenly distribute these extra rows to the last M window regions, which will have ⌊R/n_p⌋ + 1 rows.

The same procedure applies to the number of columns for the windows.


This procedure will have a slightly finer granularity for the single-word similarities, which is desired for our task since word overlap is a good indicator for paraphrases.


In the rare cases when n_p > R, the pooling layer needs to first up-sample.

We achieve this by simply duplicating pixels row-wise until R ≥ n_p.

4 Experiments

For unsupervised RAE training we used a subset of 150,000 sentences from the NYT and AP sections of the Gigaword corpus.


We used the Stanford parser [14] to create the parse trees for all sentences.


For initial word embeddings we used the 100-dimensional vectors computed via the unsupervised method of Collobert and Weston [6] and provided by Turian et al. [8].


For all paraphrase experiments we used the Microsoft Research paraphrase corpus (MSRP) introduced by Dolan et al. [4].


The dataset consists of 5,801 sentence pairs.


The average sentence length is 21, the shortest sentence has 7 words and the longest 36.


3,900 are labeled as being in the paraphrase relationship (technically defined as “mostly bidirectional entailment”).


We use the standard split of 4,076 training pairs (67.5% of which are paraphrases) and 1,725 test pairs (66.5% paraphrases).


All sentences were labeled by two annotators who agreed in 83% of the cases.


A third annotator resolved conflicts.


During dataset collection, negative examples were selected to have high lexical overlap to prevent trivial examples.


For more information see [4, 15].


As described in Sec. 2.4, we can have deep RAE networks with two encoding or decoding layers.


The hidden RAE layer (see h in Eq. 8) was set to have 200 units for both standard and unfolding RAEs.



4.1 Qualitative Evaluation of Nearest Neighbors

In order to show that the learned feature representations capture important semantic and syntactic information even for higher nodes in the tree, we visualize nearest neighbor phrases of varying length.


After embedding sentences from the Gigaword corpus, we compute nearest neighbors for all nodes in all trees.


In Table 1, the first phrase is a randomly chosen phrase and the remaining phrases are the closest phrases in the dataset that are not in the same sentence.


We use Euclidean distance between the vector representations.


Note that we do not constrain the neighbors to have the same word length.


We compare the two autoencoder models above: RAE and unfolding RAE without hidden layers, as well as a recursive averaging baseline (R.Avg).


R.Avg recursively takes the average of both child vectors in the syntactic tree.


We only report results of RAEs without hidden layers between the children and parent vectors.


Even though the deep RAE networks have more parameters to learn complex encodings they do not perform as well in this and the next task.


This is likely due to the fact that they get stuck in local optima during training.


Table 2: Original inputs and generated output from unfolding and reconstruction.


Words are the nearest neighbors to the reconstructed leaf node vectors.


The unfolding RAE can reconstruct perfectly almost all phrases of 2 and 3 words and many with up to 5 words.


Longer phrases start to get incorrect nearest neighbor words.


For the standard RAE good reconstructions are only possible for two words.


Recursive averaging cannot recover any words.


Table 1 shows several interesting phenomena.


Recursive averaging is almost entirely focused on an exact string match of the last merged words of the current phrase in the tree.


This leads the nearest neighbors to incorrectly add various extra information, which would break the paraphrase relationship if we only considered the top node vectors, and to ignore syntactic similarity.


The standard RAE does well though it is also somewhat focused on the last merges in the tree.


Finally, the unfolding RAE captures most closely the underlying syntactic and semantic structure.


4.2 Reconstructing Phrases via Recursive Decoding

In this section we analyze the information captured by the unfolding RAE’s 100-dimensional phrase vectors.


We show that these 100-dimensional vector representations can not only capture and memorize single words but also longer, unseen phrases.


In order to show how much of the information can be recovered we recursively reconstruct sentences after encoding them.


The process is similar to unfolding during training.


It starts from a phrase vector of a nonterminal node in the parse tree.


We then unfold the tree as given during encoding and fifind the nearest neighbor word to each of the reconstructed leaf node vectors.


Table 2 shows that the unfolding RAE can very well reconstruct phrases of up to length five.


No other method that we compared had such reconstruction capabilities.


Longer phrases retain some correct words and usually the correct part of speech but the semantics of the words get merged.


The results are from the unfolding RAE that directly computes the parent representation as in Eq. 1.


4.3 Evaluation on Full-Sentence Paraphrasing

We now turn to evaluating the unsupervised features and our dynamic pooling architecture in our main task of paraphrase detection.


Methods which are based purely on vector representations invariably lose some information.


For instance, numbers often have very similar representations, but even small differences are crucial to reject the paraphrase relation in the MSRP dataset.


Hence, we add three number features.


The first is 1 if two sentences contain exactly the same numbers or no number and 0 otherwise, the second is 1 if both sentences contain the same numbers, and the third is 1 if the set of numbers in one sentence is a strict subset of the numbers in the other sentence.


Since our pooling-layer cannot capture sentence length or the number of exact string matches, we also add the difference in sentence length and the percentage of words and phrases in one sentence that are in the other sentence and vice-versa.
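
These hand-crafted features might be computed roughly as follows (the number extractor and whitespace tokenization are assumptions; the paper does not specify them):

```python
import re

def pair_features(s1, s2):
    """Three number features plus sentence-length difference and
    word-overlap percentages, as described above."""
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    w1, w2 = s1.lower().split(), s2.lower().split()
    return [
        1.0 if nums1 == nums2 else 0.0,                      # same numbers, or none at all
        1.0 if nums1 and nums1 == nums2 else 0.0,            # both contain the same numbers
        1.0 if (nums1 < nums2) or (nums2 < nums1) else 0.0,  # strict-subset relation
        abs(len(w1) - len(w2)),                              # difference in sentence length
        sum(w in w2 for w in w1) / len(w1),                  # words of s1 found in s2
        sum(w in w1 for w in w2) / len(w2),                  # words of s2 found in s1
    ]

f = pair_features("The trial date of Sept. 29", "postpone the September trial date")
```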


We also report performance without these three features (only S).


For all of our models and training setups, we perform 10-fold cross-validation on the training set to choose the best regularization parameters and n_p, the size of the pooling matrix S ∈ R^{n_p×n_p}.

In our best model, the regularization for the RAE was 10^{-5} and 0.05 for the softmax classifier.

The best pooling size was consistently n_p = 15, slightly less than the average sentence length.


For all sentence pairs (S1, S2) in the training data, we also added (S2, S1) to the training set in order to make the most use of the training data.


This improved performance by 0.2%.


Table 3: Test results on the MSRP paraphrase corpus.


Comparisons of unsupervised feature learning methods (left), similarity feature extraction and supervised classification methods (center) and other approaches (right).


In our first set of experiments we compare several unsupervised feature learning methods: recursive averaging as defined in Sec. 4.1, standard RAEs and unfolding RAEs.


For each of the three methods, we cross-validate on the training data over all possible hyperparameters and report the best performance.


We observe that the dynamic pooling layer is very powerful because it captures the global structure of the similarity matrix which in turn captures the syntactic and semantic similarities of the two sentences.


With the help of this powerful dynamic pooling layer and good initial word vectors even the standard RAE and recursive averaging perform well on this dataset with an accuracy of 75.5% and 75.9% respectively.


We obtain the best accuracy of 76.8% with the unfolding RAE without hidden layers.


We tried adding 1 and 2 hidden encoding and decoding layers but performance only decreased by 0.2% and training became slower.


(i) A histogram of values in the matrix S. The low performance shows that our dynamic pooling layer better captures the global similarity information than aggregate statistics.


(ii) Only Feat: 73.2%.


Only the three features described above.


This shows that simple binary string and number matching can detect many of the simple paraphrases but fails to detect complex cases.


(iii) Only S_pooled: 72.6%.


Without the three features mentioned above.


This shows that some information still gets lost in S_pooled and that a better treatment of numbers is needed.


In order to better recover exact string matches it may be necessary to explore overlapping pooling regions.


(iv) Top Unfolding RAE Node: 74.2%.


Instead of S_pooled, use the Euclidean distance between the two top sentence vectors.


The performance shows that while the unfolding RAE is by itself very powerful, the dynamic pooling layer is needed to extract all information from its trees.


Table 3 shows our results compared to previous approaches (see next section).


Our unfolding RAE and dynamic similarity pooling architecture achieves state-of-the-art performance without hand designed semantic taxonomies and features such as WordNet.


Note that the effective range of the accuracy lies between 66% (most frequent class baseline) and 83% (interannotator agreement).


In Table 4 we show several examples of correctly classified paraphrase candidate pairs together with their similarity matrix after dynamic min-pooling.


The first and last pair are simple cases of paraphrase and not paraphrase.


The second example shows a pooled similarity matrix when large chunks are swapped in both sentences.


Our model is very robust to such transformations and gives a high probability to this pair.


Even more complex examples such as the third with very few direct string matches (few blue squares) are correctly classified.


The second to last example is highly interesting.


Even though there is a clear diagonal with good string matches, the gap in the center shows that the first sentence contains much extra information.


This is also captured by our model.


5 Related Work

The field of paraphrase detection has progressed immensely in recent years.


Early approaches were based purely on lexical matching techniques [22, 23, 19, 24].


Since these methods are often based on exact string matches of n-grams, they fail to detect similar meaning that is conveyed by synonymous words.


Several approaches [17, 18] overcome this problem by using WordNet and corpus-based semantic similarity measures.


In their approach they choose for each open-class word the single most similar word in the other sentence.


Fernando and Stevenson [20] improved upon this idea by computing a similarity matrix that captures all pair-wise similarities of single words in the two sentences.


They then threshold the elements of the resulting similarity matrix and compute the mean of the remaining entries.


There are two shortcomings of such methods: They ignore (i) the syntactic structure of the sentences (by comparing only single words) and (ii) the global structure of such a similarity matrix (by computing only the mean).


Table 4: Examples of sentence pairs with: ground truth labels L (P - Paraphrase, N - Not Paraphrase), the probabilities our model assigns to them (Pr(S1, S2) > 0.5 is assigned the label Paraphrase) and their similarity matrices after dynamic min-pooling.

Simple paraphrase pairs have clear diagonal structure due to perfect word matches with Euclidean distance 0 (dark blue).


That structure is preserved by our min-pooling layer.


Best viewed in color. See text for details.


Instead of comparing only single words, [21] adds features from dependency parses.


Most recently, Das and Smith [15] adopted the idea that paraphrases have related syntactic structure.


Their quasisynchronous grammar formalism incorporates a variety of features from WordNet, a named entity recognizer, a part-of-speech tagger, and the dependency labels from the aligned trees.


In order to obtain high performance they combine their parsing-based model with a logistic regression model that uses 18 hand-designed surface features.


We merge these word-based models and syntactic models in one joint framework: Our matrix consists of phrase similarities and instead of just taking the mean of the similarities we can capture the global layout of the matrix via our min-pooling layer.


The idea of applying an autoencoder in a recursive setting was introduced by Pollack [9] and extended recently by [10].


Pollack’s recursive auto-associative memories are similar to ours in that they are a connectionist, feedforward model.


One of the major shortcomings of previous applications of recursive autoencoders to natural language sentences was their binary word representation as discussed in Sec. 2.1.


Recently, Bottou discussed related ideas of recursive autoencoders [25] and recursive image and text understanding but without experimental results.


Larochelle [26] investigated autoencoders with an unfolded “deep objective”.


Supervised recursive neural networks have been used for parsing images and natural language sentences by Socher et al. [27, 28].


Lastly, [12] introduced the standard recursive autoencoder as mentioned in Sect. 2.2.


6 Conclusion

We introduced an unsupervised feature learning algorithm based on unfolding recursive autoencoders.


The RAE captures syntactic and semantic information as shown qualitatively with nearest neighbor embeddings and quantitatively on a paraphrase detection task.


Our RAE phrase features allow us to compare both single word vectors as well as phrases and complete syntactic trees.


In order to make use of the global comparison of variable-length sentences in a similarity matrix, we introduce a new dynamic pooling architecture that produces a fixed-sized representation.


We show that this pooled representation captures enough information about the sentence pair to determine the paraphrase relationship on the MSRP dataset with a higher accuracy than any previously published results.



[1] E. Marsi and E. Krahmer. Explorations in sentence fusion. In European Workshop on Natural Language Generation, 2005.


[2] P. Clough, R. Gaizauskas, S. S. L. Piao, and Y. Wilks. METER: MEasuring TExt Reuse. In ACL, 2002.


[3] C. Callison-Burch. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of EMNLP, pages 196–205, 2008.


[4] B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING, 2004.


[5] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.


[6] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.


[7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.


[8] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semisupervised learning. In Proceedings of ACL, pages 384–394, 2010.


[9] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46, November 1990.


[10] T. Voegtlin and P. Dominey. Linear Recursive Distributed Representations. Neural Networks, 18(7), 2005.


[11] J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3), 1991.


[12] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP, 2011.


[13] C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96), 1996.


[14] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.


[15] D. Das and N. A. Smith. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proc. of ACL-IJCNLP, 2009.


[16] V. Rus, P. M. McCarthy, M. C. Lintean, D. S. McNamara, and A. C. Graesser. Paraphrase identification with lexico-syntactic graph subsumption. In FLAIRS Conference, 2008.


[17] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, 2006.


[18] A. Islam and D. Inkpen. Semantic Similarity of Short Texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2007), 2007.


[19] L. Qiu, M. Kan, and T. Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, 2006.


[20] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, 2008.


[21] S. Wan, M. Dras, R. Dale, and C. Paris. Using dependency-based features to take the “para-farce” out of paraphrase. In Proceedings of the Australasian Language Technology Workshop 2006, 2006.


[22] R. Barzilay and L. Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In NAACL, 2003.


[23] Y. Zhang and J. Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, 2005.


[24] Z. Kozareva and A. Montoyo. Paraphrase Identification on the Basis of Supervised Machine Learning Techniques. In Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL, 2006.


[25] L. Bottou. From machine learning to machine reasoning. CoRR, abs/1102.1808, 2011.


[26] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. JMLR, 10, 2009.


[27] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.


[28] R. Socher, C. Lin, A. Y. Ng, and C.D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML, 2011.

