Reference: http://mylazycoding.blogspot.com/2012/03/cluster-apache-solr-data-using-apache_13.html
Minimum Requirement:
- Basic understanding of Apache Solr and Apache Mahout
- Understanding of K-Means clustering
- Up and Running Apache Solr and Apache Mahout on your system
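As a quick refresher, K-Means iteratively assigns each point to its nearest centroid and then recomputes each centroid as the mean of its assigned points. A minimal 1-D sketch of that loop (toy code for intuition only, not Mahout's implementation):

```java
import java.util.Arrays;

public class KMeans1D {
    // Toy 1-D k-means: assign each point to its nearest centroid,
    // then move each centroid to the mean of its assigned points.
    static double[] cluster(double[] points, double[] centroids, int iterations) {
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                }
                sum[best] += p;
                count[best]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 10.0};
        // Two well-separated groups converge to centroids near 1.0 and 9.5.
        System.out.println(Arrays.toString(cluster(points, new double[]{0.0, 5.0}, 10)));
    }
}
```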
Before indexing sample data into Solr, make sure the fields are configured in schema.xml:
<field name="field_name" type="text" indexed="true" stored="true" termVectors="true" />
- Add termVectors="true" for the fields which are to be clustered
- Index some sample documents into Solr
mahout lucene.vector --dir <PATH OF INDEXES> --output <OUTPUT VECTOR PATH> --field <field_name> --idField id --dictOut <OUTPUT DICTIONARY PATH> --norm 2
mahout kmeans -i <OUTPUT VECTOR PATH> -c <PATH TO CLUSTER CENTROIDS> -o <PATH TO OUTPUT CLUSTERS> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
Here:
- k: number of clusters/value of K in K-Means clustering
- x: maximum iterations
- o: path to output clusters
- ow: overwrite output directory
- dm: classname of Distance Measure
mahout clusterdump -s <PATH TO OUTPUT CLUSTERS> -d <OUTPUT DICTIONARY PATH> -dt text -n 20 -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir <PATH OF OUTPUT CLUSTERED POINTS> --output <PATH OF OUTPUT DIR>
Here:
- s: Directory containing clusters
- d: Path of dictionary from step #2
- dt: Format of dictionary file
- n: number of top terms
- output: Path of generated clusters
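The CosineDistanceMeasure passed via -dm treats two documents as close when their vectors point in the same direction, regardless of magnitude. The idea can be sketched in a few lines (illustrative only, not Mahout's class):

```java
public class CosineDistance {
    // Cosine distance in the style of Mahout's CosineDistanceMeasure:
    // 1 - (a . b) / (|a| * |b|). 0 means same direction, 1 means orthogonal.
    static double distance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(distance(new double[]{1, 0}, new double[]{0, 1})); // orthogonal, distance near 1
        System.out.println(distance(new double[]{2, 2}, new double[]{1, 1})); // parallel, distance near 0
    }
}
```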
Mahout Vectors from Lucene Term Vectors
In order for Mahout to create vectors from a Lucene index, the first and foremost thing that must be done is that the index must contain Term Vectors. A term vector is a document centric view of the terms and their frequencies (as opposed to the inverted index, which is a term centric view) and is not on by default.
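Concretely, a term vector for a single document is just a map from each term to its frequency in that document. A toy sketch of that view (whitespace tokenization only; Lucene's analyzers do far more):

```java
import java.util.HashMap;
import java.util.Map;

public class TermVectorDemo {
    // Document-centric view: one document mapped to its term frequencies,
    // as opposed to an inverted index, which maps each term to documents.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termVector("to be or not to be"));
    }
}
```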
For this example, I’m going to use Solr’s example, located in <Solr Home>/example
In Solr, storing term vectors is as simple as setting termVectors="true" on the field in the schema, as in:
<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>
For pure Lucene, you will need to set the TermVector option on during Field creation, as in:
Field fld = new Field("text", "foo", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
From here, it's as simple as pointing Mahout's new shell script (try running <MAHOUT HOME>/bin/mahout for a full listing of its capabilities) at the index and letting it rip:
<MAHOUT HOME>/bin/mahout lucene.vector --dir <PATH TO INDEX>/example/solr/data/index/ --output /tmp/foo/part-out.vec --field title-clustering --idField id --dictOut /tmp/foo/dict.out --norm 2
A few things to note about this command:
- This outputs a single vector file, titled part-out.vec, to the /tmp/foo directory
- It uses the title-clustering field. If you want a combination of fields, then you will have to create a single “merged” field containing those fields. Solr’s <copyField> syntax can make this easy.
- The idField is used to provide a label to the Mahout vector such that the output from Mahout’s algorithms can be traced back to the actual documents.
- The –dictOut outputs the list of terms that are represented in the Mahout vectors. Mahout uses an internal, sparse vector representation for text documents (dense vector representations are also available) so this file contains the “key” for making sense of the vectors later. As an aside, if you ever have problems with Mahout, you can often share your vectors with the list and simply keep the dictionary to yourself, since it would be pretty difficult (not sure if it is impossible) to reverse engineer just the vectors.
- The –norm tells Mahout how to normalize the vector. For many Mahout applications, normalization is a necessary process for obtaining good results. In this case, I am using the Euclidean distance (aka the 2-norm) to normalize the vector because I intend to cluster the documents using the Euclidean distance similarity. Other approaches may require other norms.
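For reference, normalizing by the 2-norm just means dividing every component by the vector's Euclidean length, so all documents end up on the unit sphere. A minimal sketch (illustrative, not Mahout's Vector API):

```java
public class L2Norm {
    // Normalize a vector by its Euclidean (2-) norm, as --norm 2 requests.
    static double[] normalize(double[] v) {
        double sumSq = 0;
        for (double x : v) sumSq += x * x;
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[] u = normalize(new double[]{3, 4}); // length 5, so components become 0.6 and 0.8
        System.out.println(u[0] + ", " + u[1]);
    }
}
```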
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Creating Vectors from Text
Introduction
For clustering documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering algorithms. These approaches are described below.
From Lucene
NOTE: Your Lucene index must be created with the same version of Lucene used in Mahout. Check Mahout's POM file to get the version number, otherwise you will likely get "Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11" as an error.
Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index.
For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using Solr as it can ingest things like PDFs, XML, Office, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene website or check out Lucene In Action by Erik Hatcher, Otis Gospodnetic and Mike McCandless.
To get started, make sure you get a fresh copy of Mahout from SVN and are comfortable building it. It defines interfaces and implementations for efficiently iterating over a Data Source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces a Mahout Vector file and term dictionary which can then be used for clustering. The main code for driving this is the Driver program located in the org.apache.mahout.utils.vectors package. The Driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the Driver are included below:
Generating an output file from a Lucene Index
$MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \ --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \ <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> <--idField <Name of the idField in the Lucene index>>
Create 50 Vectors from an Index
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \ --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50
This uses the index specified by --dir and the body field in it and writes out the info to the output dir and the dictionary to dict.txt. It only outputs 50 vectors. If you don't specify --max, then all the documents in the index are output.
Normalize 50 Vectors from a Lucene Index using the L_2 Norm
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \ --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
From Directory of Text documents
Mahout has utilities to generate vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and the value to be the text content in UTF-8 format.
You may find Tika (http://lucene.apache.org/tika) helpful in converting binary documents to text.
Converting directory of documents to SequenceFile format
Mahout has a nifty utility which reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The document id generated is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt
From the examples directory run
$MAHOUT_HOME/bin/mahout seqdirectory \ --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \ <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \ <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \ <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
Creating Vectors from SequenceFile
Mahout_0.3
From the sequence file generated from the above step run the following to generate vectors.
$MAHOUT_HOME/bin/mahout seq2sparse \ -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \ <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \ <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \ <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \ <--minSupport <MINIMUM SUPPORT> 2> \ <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \ <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \ <--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \ <-seq <Create SequentialAccessVectors> {false|true, required for running some algorithms (LDA, Lanczos)}>
--minSupport is the minimum frequency for a word to be considered a feature. --minDF is the minimum number of documents a word must appear in.
--maxDFPercent is the maximum value of the ratio (document frequency of a word / total number of documents) for the word to be considered a good feature. This helps remove high-frequency features such as stop words.
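The effect of --minDF and --maxDFPercent can be sketched as a simple document-frequency filter (illustrative only; seq2sparse applies this pruning while building its dictionary):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DFPruning {
    // Keep a term as a feature only if it appears in at least minDF documents
    // and in at most maxDFPercent percent of all documents (drops stop words).
    static Set<String> prune(Map<String, Integer> docFreq, int totalDocs,
                             int minDF, int maxDFPercent) {
        Set<String> kept = new HashSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            int df = e.getValue();
            double pct = 100.0 * df / totalDocs;
            if (df >= minDF && pct <= maxDFPercent) kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // "the" is in every document (stop word), "rare" in only one.
        Map<String, Integer> df = Map.of("the", 100, "mahout", 12, "rare", 1);
        System.out.println(prune(df, 100, 2, 99)); // only "mahout" survives
    }
}
```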
Background
- http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
- http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
From a Database
TODO:
Other
Converting existing vectors to Mahout's format
If you are in the happy position of already owning a document-processing pipeline (for texts, images, or whatever items you wish to treat), the question arises of how to convert your vectors into the Mahout vector format. Probably the easiest way is to implement your own Iterable<Vector> (called VectorIterable in the example below) and then reuse the existing VectorWriter classes:
VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);