Open-Source Semantic Analysis in Java


2013/11/21 12:29:40 fengbin2005 程序员俱乐部


Semantic analysis

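LingPipe's strengths are:

It covers the branches of natural language processing fairly comprehensively: text segmentation, clustering, sentiment analysis, domain knowledge learning, and so on.
It comes with a full set of source code, sample code, and test code that is free for research use (the commercial and non-commercial versions share the same code base). The documentation is detailed and cites the papers behind each model, which makes it well suited to research and study.
Open-source resources are relatively scarce in this field, and the project has been updated continuously.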

Modules included:

Topic Classification: trains on text using language models and assigns categories
Named Entity Recognition: recognition in first-best, n-best, and per-entity confidence modes, plus training and evaluation of recognizers
Clustering: hierarchical clustering based on single-link and complete-link, including some cluster evaluation techniques
Part-of-Speech Tagging
Sentence Detection
Spelling Correction: a "did you mean" style checking engine
Database Text Mining
String Comparison: based on distance and similarity measures, including weighted edit distance, TF/IDF distance, Jaccard distance, Jaro-Winkler distance, and others
Interesting Phrase Detection
Character Language Modeling
Chinese Word Segmentation: learns from space-separated training corpora using machine learning, and can discover new words
Sentiment Analysis: based on text classification
Hyphenation and Syllabification
Language Identification
Singular Value Decomposition
Logistic Regression
Expectation Maximization
Word Sense Disambiguation
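To make the string comparison entry more concrete, here is a minimal sketch of two of the measures it names, written in plain Java. It does not use LingPipe's API; the class name `StringDistances` and its methods are hypothetical, the Jaccard distance is computed over whitespace-separated tokens (an assumption), and the edit distance shown is the plain unweighted Levenshtein variant rather than a weighted one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of LingPipe.
class StringDistances {

    // Jaccard distance over whitespace-separated tokens:
    // 1 - |A intersect B| / |A union B|.
    static double jaccardDistance(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.trim().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.trim().split("\\s+")));
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        return 1.0 - (double) intersection.size() / union.size();
    }

    // Unweighted Levenshtein edit distance: every insertion, deletion,
    // and substitution costs 1. Uses two rolling rows of the DP table.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(jaccardDistance("a b c", "b c d"));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

A weighted edit distance would replace the constant costs of 1 with per-operation or per-character weights; the table-filling structure stays the same.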

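Resources included with LingPipe:

Papers and language material: the source code and the introductions all cite the resources they draw on.

I currently use the Chinese word segmentation in the LingPipe package together with its sentiment analysis module to study Chinese sentiment detection and classification. The API is highly abstracted, which makes quick implementation easy, but the underlying algorithms need careful study.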
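As a rough illustration of the kind of algorithm worth studying here, the sketch below classifies text by comparing character bigram counts per category with add-one smoothing. It is a simplified stand-in written in plain Java, not LingPipe's API or its actual models; the class name `CharNGramClassifier`, its methods, and the crude smoothing scheme are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical character n-gram classifier for illustration only.
class CharNGramClassifier {
    private final int n;
    // category -> (n-gram -> count)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // category -> total number of n-grams seen in training
    private final Map<String, Integer> totals = new HashMap<>();

    CharNGramClassifier(int n) {
        this.n = n;
    }

    // Count every character n-gram of the training text under the category.
    void train(String category, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i + n <= text.length(); i++) {
            c.merge(text.substring(i, i + n), 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
        }
    }

    // Return the category whose smoothed n-gram log score is highest,
    // or null if nothing has been trained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> c = e.getValue();
            int total = totals.getOrDefault(e.getKey(), 0);
            int vocab = c.size() + 1; // one extra slot for unseen n-grams
            double score = 0.0;
            for (int i = 0; i + n <= text.length(); i++) {
                int k = c.getOrDefault(text.substring(i, i + n), 0);
                score += Math.log((k + 1.0) / (total + vocab));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CharNGramClassifier clf = new CharNGramClassifier(2);
        clf.train("pos", "great movie, loved it");
        clf.train("neg", "awful movie, hated it");
        System.out.println(clf.classify("loved"));
    }
}
```

Because the model works on characters rather than words, the same code can be fed Chinese text without a segmentation step, although a real system would need far more training data and better smoothing.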

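FudanNLP: Chinese natural language processing toolkit

FudanNLP is a toolkit developed mainly for Chinese natural language processing; it also includes the machine learning…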

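LingPipe: natural language processing for Java

LingPipe is an open-source Java toolkit for natural language processing. LingPipe already has a rich set of features, including…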

OpenNLP: natural language processing tool

OpenNLP is a machine learning toolkit for processing natural language text. It supports most common NLP tasks…

NLTK: Natural Language Toolkit

NLTK is naturally viewed as a stack of layers, each built on top of the ones below. Those familiar with…

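CRF++: natural language processing tool

CRF++ is a well-known open-source conditional random field tool and, at the time of writing, was regarded as the CRF tool with the best overall performance. CRF++ itself already…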

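Jubatus: distributed online machine learning framework

Jubatus is a distributed processing framework and machine learning library with the following features: an online machine learning library, including…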

Mallet: machine learning package

Mallet is a software package dedicated to machine learning, written in Java. With the Mallet tools you can…

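LarKC: Large Knowledge Collider

The goal of the LarKC project, part of the EU Seventh Framework Programme (FP7), is to develop the Large Knowledge Collider (LarKC, pronounced “…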

DKPro Core

DKPro Core is a collection of natural language processing (NLP) software components built on the Apache UIMA framework. DKPro…

TextTeaser

TextTeaser is an automatic summarization algorithm that combines natural language processing and machine learning to produce good results.…

OpenNLP

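OpenNLP is a Java-based machine learning toolkit for processing natural language text. It supports most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, and parsing.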
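To show what the first two tasks in that list look like in practice, here is a small sketch of sentence segmentation and tokenization. It deliberately uses the JDK's rule-based `java.text.BreakIterator` rather than OpenNLP's trained statistical models, so that it runs without extra dependencies; the class name `SimpleSegmenter` and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; uses JDK rules, not OpenNLP models.
class SimpleSegmenter {

    // Split text into sentences at the boundaries BreakIterator reports.
    static List<String> sentences(String text) {
        return split(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    // Split a sentence into word and punctuation tokens, dropping whitespace.
    static List<String> tokens(String sentence) {
        return split(BreakIterator.getWordInstance(Locale.ENGLISH), sentence);
    }

    // Walk consecutive boundaries and keep every non-blank segment.
    private static List<String> split(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "OpenNLP is a toolkit. It supports tokenization.";
        for (String s : sentences(text)) {
            System.out.println(s + " -> " + tokens(s));
        }
    }
}
```

Rule-based boundaries like these tend to break on abbreviations and unusual punctuation, which is one reason toolkits such as OpenNLP train models for the same tasks.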

FudanNLP

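FudanNLP is a toolkit developed mainly for Chinese natural language processing; it also includes the machine learning algorithms and datasets used to implement these tasks. The toolkit and its bundled datasets are released under the LGPL 3.0 license. It is written in Java. Features:
1. Text classification, news clustering
2. Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, dependency parsing, temporal phrase recognition
3. Structured learning, online learning, hierarchical classification, clustering, exact inference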

Stanford NLP

Stanford NLP provides a suite of natural language processing tools.

Machine learning

Support Vector Machine

SVM^light

An implementation of Vapnik's Support Vector Machine

LIBSVM

A Library for Support Vector Machines

Decision Tree

C4.5

The "classic" decision-tree tool, developed by J. R. Quinlan (tutorial available)

Maximum Entropy

YASMET

Yet Another Small MaxEnt Toolkit

Conditional Random Field

CRF++

A simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data

Natural language processing

General

OpenNLP

An organizational center for open source projects related to natural language processing

CMU Statistical Language Modeling Toolkit

A suite of UNIX software tools to facilitate the construction and testing of statistical language models

The Dragon ToolKit

A Java-based development package for academic use in information retrieval (IR) and text mining. Includes many NLP tools.

LingPipe

A suite of Java libraries for the linguistic analysis of human language, including tools that:

track mentions of entities (e.g. people or proteins);
link entity mentions to database entries;
uncover relations between entities and actions;
classify text passages by language, character encoding, genre, topic, or sentiment;
correct spelling with respect to a text collection;
cluster documents by implicit topic and discover significant trends over time; and
provide part-of-speech tagging and phrase chunking.

Natural Language Toolkit

Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.

Antelope

Advanced Natural Language Object-oriented Processing Environment. Includes a set of tools (notably a C# version of the Stanford Parser).

Word segmentation

ICTCLAS

The Chinese word segmentation system from the Chinese Academy of Sciences

Stanford Chinese Word Segmenter

A Java implementation of a CRF-based Chinese Word Segmenter

Part-of-speech tagging

Brill tagger

An error-driven transformation-based tagger implemented by Eric Brill

Stanford POS Tagger

A Java implementation of the log-linear part-of-speech taggers described by Kristina Toutanova et al.

MBT: Memory-based Tagger
TreeTagger

A decision tree based tagger from the University of Stuttgart.

SVMTool, a POS tagger based on SVMs
QTAG Part of speech tagger

An HMM-based Java POS tagger from Birmingham U.

Named entity recognition

Stanford Named Entity Recognizer

A Java implementation of a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition

LingPipe

Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.

YamCha

SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Stemming

Porter Stemming

A process for removing the commoner morphological and inflexional endings from words in English, by Martin Porter
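For a feel of how such suffix stripping works, the sketch below implements only Step 1a of the Porter algorithm, which handles plural endings (SSES to SS, IES to I, SS unchanged, S removed). The remaining steps depend on a measure of the stem and are omitted here; the class name `PorterStep1a` is hypothetical and input is assumed to be lowercase.

```java
// Illustrative fragment: only Step 1a of the Porter stemmer, not the full algorithm.
class PorterStep1a {

    // Apply the first matching rule, checking longer suffixes first.
    static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + apply(w));
        }
    }
}
```

The output stems are not always dictionary words ("ponies" becomes "poni"); the aim is that related word forms map to the same string, which is what matters for retrieval.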

Snowball

A small string processing language designed for creating stemming algorithms for use in Information Retrieval.

Parsing

Stanford Parser

Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.

Berkeley Parser

Text mining

Summarization

Rouge; Configuring Rouge on Windows

Other

Encryption

OpenSSL

Includes many cryptographic algorithms, such as RSA, DES, MD5, and SHA. A Win32 installer is available.

Compression

zlib

A Massively Spiffy Yet Delicately Unobtrusive Compression Library

Logging

Apache Logging Services

Creates and maintains open-source software related to the logging of application behavior and released at no charge to the public, including

log4j for Java,
log4cxx for C++, and
log4net for MS .Net framework.

Note: the official log4cxx release has a memory leak problem.

Unicode

ICU

A mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications

XML

Xerces

A validating XML parser, available in C++ and Java editions

Multi-string matching

AC in C#: Aho-Corasick string matching in C#
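Since the rest of this list centers on Java, here is a compact Java sketch of the Aho-Corasick technique itself, independent of the C# library linked above. It builds a trie of the patterns, adds failure links with a breadth-first pass, and then scans the text once. The class name `AhoCorasick` and the `"offset:pattern"` result format are choices made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative Aho-Corasick matcher; not a port of any particular library.
class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    AhoCorasick(List<String> patterns) {
        // 1. Insert every pattern into the trie.
        for (String p : patterns) {
            Node cur = root;
            for (char ch : p.toCharArray()) {
                cur = cur.next.computeIfAbsent(ch, k -> new Node());
            }
            cur.outputs.add(p);
        }
        // 2. Breadth-first pass to set failure links and merge outputs.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char ch = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(ch)) {
                    f = f.fail;
                }
                Node target = f.next.get(ch);
                child.fail = (target != null) ? target : root;
                // Patterns ending at the fail node also end here.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scan the text once; each hit is reported as "startOffset:pattern".
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            while (cur != root && !cur.next.containsKey(ch)) {
                cur = cur.fail;
            }
            Node nxt = cur.next.get(ch);
            cur = (nxt != null) ? nxt : root;
            for (String p : cur.outputs) {
                hits.add((i - p.length() + 1) + ":" + p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(ac.search("ushers"));
    }
}
```

The scan never backs up in the text, so matching cost grows with the text length plus the number of hits rather than with the number of patterns.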

HTML Parser

Html Agility Pack, an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Majestic-12, an open source high-performance .NET C# module that was created to parse HTML for links, indexing and other purposes. It is fast, but does not build a DOM tree.