Java Natural Language Processing: LingPipe


2012/3/1 9:34:56 orange.lpai 程序员俱乐部 (Programmers' Club)


LingPipe is an open-source Java toolkit for natural language processing. It already offers a rich set of APIs, including topic classification, named entity recognition, part-of-speech tagging, sentence detection, query spell checking, interesting phrase detection, clustering, character language modeling, MEDLINE download, parsing and indexing, database text mining, Chinese word segmentation, sentiment analysis, and language identification.

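LingPipe is a natural language processing package developed by Alias-i. As of 2008-04-21 the latest version was 3.5 (http://www.5yiso.cn/2008/04/28856.html). It is very capable and, most importantly, its documentation is extremely detailed: reference papers are listed for each model, so the package is convenient to use and also well suited to studying the models themselves.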

Address: http://alias-i.com/lingpipe/

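The SIGHAN06 proceedings include a paper by Bob Carpenter of Alias-i, the evaluation report "Character Language Models for Chinese Word Segmentation and Named Entity Recognition". That is where I came across their LingPipe NLP Toolkit, an open-source Java toolkit for natural language processing. It is free to download, open source, and supports Chinese. Beyond describing the code structure, it also provides documentation of the underlying algorithms and related resources such as test data sets and papers. A good toolkit.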
Feature Overview
LingPipe's information extraction and data mining tools:
* track mentions of entities (e.g. people or proteins);
* link entity mentions to database entries;
* uncover relations between entities and actions;
* classify text passages by language, character encoding, genre, topic, or sentiment;
* correct spelling with respect to a text collection;
* cluster documents by implicit topic and discover significant trends over time; and
* provide part-of-speech tagging and phrase chunking.

----------------------------------------
How to compute word vectors with LingPipe

An example of extracting a vector space model with LingPipe

import com.aliasi.matrix.SparseFloatVector;
import com.aliasi.matrix.Vector;
import com.aliasi.symbol.MapSymbolTable;
import com.aliasi.symbol.SymbolTable;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.TokenFeatureExtractor;
import java.util.HashMap;
import java.util.Map;

public class ExtractFeatures {

    // Tokenize each text, count its tokens, and convert the counts
    // into a sparse vector indexed by symbol-table id.
    public static Vector[] featureVectors(String[] texts,
                                          SymbolTable symbolTable) {
        Vector[] vectors = new Vector[texts.length];
        TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
        TokenFeatureExtractor featureExtractor =
            new TokenFeatureExtractor(tokenizerFactory);
        for (int i = 0; i < texts.length; ++i) {
            // Maps each token to its count in texts[i].
            Map<String, ? extends Number> featureMap =
                featureExtractor.features(texts[i]);
            vectors[i] = toVectorAddSymbols(featureMap, symbolTable,
                                            Integer.MAX_VALUE);
        }
        return vectors;
    }

    // Convert a feature map into a sparse vector, adding any unseen
    // feature names to the symbol table along the way.
    public static SparseFloatVector toVectorAddSymbols(
            Map<String, ? extends Number> featureVector,
            SymbolTable table,
            int numDimensions) {
        int size = (featureVector.size() * 3) / 2;
        Map<Integer, Number> vectorMap = new HashMap<Integer, Number>(size);
        for (Map.Entry<String, ? extends Number> entry
                 : featureVector.entrySet()) {
            String feature = entry.getKey();
            Number val = entry.getValue();
            int id = table.getOrAddSymbol(feature);
            vectorMap.put(Integer.valueOf(id), val);
        }
        return new SparseFloatVector(vectorMap, numDimensions);
    }

    public static void main(String[] args) {
        // Command-line arguments are ignored; two fixed sample texts are used.
        String[] texts = new String[] {"this is a book", "go to school"};
        SymbolTable symbolTable = new MapSymbolTable();
        Vector[] vectors = featureVectors(texts, symbolTable);
        System.out.println("VECTORS");
        for (int i = 0; i < vectors.length; ++i)
            System.out.println(i + ") " + vectors[i]);
        System.out.println("\nSYMBOL TABLE");
        System.out.println(symbolTable);
    }
}
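To make the mapping in the listing above easier to follow, here is a minimal sketch of the same idea in plain Java with no LingPipe dependency. The class `SymbolVectorSketch` and its methods are hypothetical names made up for this illustration, and whitespace splitting stands in for a real tokenizer; it is not how LingPipe implements `MapSymbolTable` or `SparseFloatVector`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class SymbolVectorSketch {

    // Hypothetical stand-in for a symbol table: token -> integer id,
    // with ids assigned in order of first appearance.
    private final Map<String, Integer> symbolToId = new LinkedHashMap<>();

    // Return the id of a symbol, assigning the next free id if it is new.
    public int getOrAddSymbol(String symbol) {
        Integer id = symbolToId.get(symbol);
        if (id == null) {
            id = symbolToId.size();
            symbolToId.put(symbol, id);
        }
        return id;
    }

    public int numSymbols() {
        return symbolToId.size();
    }

    // Build a sparse count vector (dimension id -> token count).
    // Assumption: tokens are separated by whitespace.
    public Map<Integer, Integer> toSparseVector(String text) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty())
                continue;
            vector.merge(getOrAddSymbol(token), 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        SymbolVectorSketch table = new SymbolVectorSketch();
        System.out.println(table.toSparseVector("this is a book"));
        System.out.println(table.toSparseVector("go to school"));
        System.out.println(table.symbolToId);
    }
}
```

Because the table is shared across texts, a token that appears in several documents always lands in the same vector dimension, which is what makes the resulting vectors comparable.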

-------------------------------
How to compute TF-IDF with LingPipe
By jeffye | May 25, 2008
Hope that the following Java code can help you:
---------------------------------------------------------

import com.aliasi.spell.TfIdfDistance;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class TfIdfDistanceDemo {
    public static void main(String[] args) {
        TokenizerFactory tokenizerFactory =
            IndoEuropeanTokenizerFactory.FACTORY;
        TfIdfDistance tfIdf = new TfIdfDistance(tokenizerFactory);

        // Each command-line argument is treated as one document.
        for (String s : args)
            tfIdf.trainIdf(s);

        // Print document frequency and IDF for every term seen.
        System.out.printf("\n  %18s  %8s  %8s\n",
                          "Term", "Doc Freq", "IDF");
        for (String term : tfIdf.termSet())
            System.out.printf("  %18s  %8d  %8.2f\n",
                              term,
                              tfIdf.docFrequency(term),
                              tfIdf.idf(term));

        // Compare every pair of documents.
        for (String s1 : args) {
            for (String s2 : args) {
                System.out.println("\nString1=" + s1);
                System.out.println("String2=" + s2);
                System.out.printf("distance=%4.2f  proximity=%4.2f\n",
                                  tfIdf.distance(s1, s2),
                                  tfIdf.proximity(s1, s2));
            }
        }
    }
}
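For readers who want to see what a TF-IDF proximity of this kind computes, the following is a rough, self-contained sketch in plain Java. It assumes a square-root damped term frequency, an inverse document frequency of log(N/df), cosine similarity as the proximity, and 1 minus proximity as the distance. The class name `TfIdfSketch` and whitespace tokenization are assumptions made for the illustration; LingPipe's `TfIdfDistance` may differ in its details, so check its Javadoc before relying on exact values.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TfIdfSketch {

    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocs = 0;

    // Assumption: tokens are separated by whitespace.
    private static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Count one more document and update per-term document frequencies.
    public void trainIdf(String doc) {
        ++numDocs;
        Set<String> seen = new HashSet<>();
        for (String token : tokenize(doc))
            seen.add(token);
        for (String term : seen)
            docFrequency.merge(term, 1, Integer::sum);
    }

    // idf(term) = log(N / df); unseen terms get weight 0 in this sketch.
    public double idf(String term) {
        Integer df = docFrequency.get(term);
        return df == null ? 0.0 : Math.log((double) numDocs / df);
    }

    // Weight of each term in a text: sqrt(count) * idf(term).
    private Map<String, Double> weights(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokenize(text))
            counts.merge(token, 1, Integer::sum);
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            result.put(e.getKey(), Math.sqrt(e.getValue()) * idf(e.getKey()));
        return result;
    }

    // Cosine similarity of the two weighted term vectors.
    public double proximity(String s1, String s2) {
        Map<String, Double> w1 = weights(s1);
        Map<String, Double> w2 = weights(s2);
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (Map.Entry<String, Double> e : w1.entrySet()) {
            norm1 += e.getValue() * e.getValue();
            Double other = w2.get(e.getKey());
            if (other != null)
                dot += e.getValue() * other;
        }
        for (double v : w2.values())
            norm2 += v * v;
        if (norm1 == 0.0 || norm2 == 0.0)
            return 0.0; // no informative terms on one side
        return dot / Math.sqrt(norm1 * norm2);
    }

    public double distance(String s1, String s2) {
        return 1.0 - proximity(s1, s2);
    }

    public static void main(String[] args) {
        TfIdfSketch tfIdf = new TfIdfSketch();
        for (String s : args)
            tfIdf.trainIdf(s);
        for (String s1 : args)
            for (String s2 : args)
                System.out.printf("%s | %s  proximity=%4.2f%n",
                                  s1, s2, tfIdf.proximity(s1, s2));
    }
}
```

Note that a term occurring in every training document gets an IDF of zero under this scheme, so it contributes nothing to the proximity between two strings.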
