Channel: Hortonworks » All Replies

Creating Vectors from SequenceFile in Mahout


I’m using Mahout 0.9 (installed on HDP 2.2) for topic discovery (LDA algorithm). My text file is stored in the directory inputraw, and I execute the following commands in order:

command#1:

mahout seqdirectory -i inputraw -o output-directory -c UTF-8

command#2:

mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv

command#3:

mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int

command#4:

mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0
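
For reference, the four steps can be chained in one script so they always run in the same order with the same paths. This is only a sketch: DRY_RUN, run, pipeline and the variable names are made up for illustration and are not part of Mahout. By default it only prints the commands; it runs them when DRY_RUN=0 is set and the mahout launcher is on the PATH. The long option is written with a plain double hyphen (--maxDFPercent), which is how Mahout spells its long options.

```shell
#!/bin/sh
# Sketch of the four-step pipeline as a single script.
# DRY_RUN, run, pipeline and the variable names are illustrative, not Mahout features.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

RAW=inputraw              # directory holding the raw text file
SEQ=output-directory      # SequenceFiles written by seqdirectory
VEC=output-vector-str     # sparse vectors and dictionary written by seq2sparse
MATRIX=output-vector-int  # integer-keyed matrix written by rowid
TOPICS=output-topics      # topic model written by cvb
TMP=output-tmp            # temporary model directory used by cvb

# Print a command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@" || exit 1    # stop at the first failing step
    fi
}

pipeline() {
    run mahout seqdirectory -i "$RAW" -o "$SEQ" -c UTF-8
    run mahout seq2sparse -i "$SEQ" -o "$VEC" -wt tf -ng 3 --maxDFPercent 40 -ow -nv
    run mahout rowid -i "$VEC/tf-vectors/" -o "$MATRIX"
    run mahout cvb -i "$MATRIX/matrix" -o "$TOPICS" -k 1 -mt "$TMP" -x 10 -dict "$VEC/dictionary.file-0"
}

pipeline
```

Stopping at the first failing step keeps a broken intermediate result from being passed silently to the next command.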

After executing the second command, as expected, it creates a number of subfolders and files under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). The sizes of these files all look fine considering the size of my input file; however, the file under tf-vectors is very small, only 118 bytes.

Since tf-vectors is the input to the third command, the third command also generates a very small file. Does anyone know:

1- What is the reason the file under the tf-vectors folder is that small? There must be something wrong.

2- Starting from the first command, all the generated files have a strange encoding and are not human readable. Is this expected?
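
On the second point, the outputs of these steps are Hadoop SequenceFiles, a binary format, so they are not expected to be readable in a text editor. One way to look inside them is Mahout's seqdumper tool. The sketch below counts the records in two of the step-2 outputs, which may help narrow down where the documents disappear. The helper names show and inspect_counts are invented for illustration, and the sketch assumes that seqdumper accepts a directory for -i and that -c reports only the record count. It only prints the commands unless the mahout launcher is found on the PATH.

```shell
#!/bin/sh
# Print a command; execute it only when the mahout launcher is on the PATH.
# show and inspect_counts are illustrative helper names, not Mahout commands.
show() {
    echo "+ $*"
    if command -v mahout >/dev/null 2>&1; then
        "$@"
    fi
}

# Ask seqdumper for the record count of two seq2sparse outputs.
# $1 is the seq2sparse output directory (output-vector-str in the post).
inspect_counts() {
    for dir in tokenized-documents tf-vectors; do
        show mahout seqdumper -i "$1/$dir" -c
    done
}

inspect_counts output-vector-str
```

If tokenized-documents reports the expected number of documents but tf-vectors does not, the loss would be happening in the vectorization stage rather than in seqdirectory.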

Besides,

