I’m using Mahout 0.9 (installed on HDP 2.2) for topic discovery (LDA algorithm). I have my text file stored in the directory inputraw
and execute the following commands in order:
command#1:
mahout seqdirectory -i inputraw -o output-directory -c UTF-8
command#2:
mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv
command#3:
mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int
command#4:
mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0
After executing the second command, as expected, it creates a bunch of subfolders and files under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). The sizes of these files all look OK considering the size of my input file; however, the file under tf-vectors is very small (in fact, only 118 bytes).
Since tf-vectors is the input to the third command, the third command also generates a very small file. Does anyone know:
1- Why is the file under the tf-vectors folder so small? There must be something wrong.
2- Starting from the first command, all the generated files have a strange encoding and are not human-readable. Is this expected?
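For checking what these files actually contain, here is a minimal sketch using Mahout's seqdumper utility, which prints the key/value pairs of a Hadoop SequenceFile as text. The paths are the ones from the commands above. The inspect helper and the DRY_RUN switch are made up for illustration, and whether a directory (rather than the single part file inside it) is accepted as input is an assumption that may depend on the setup.

```shell
#!/bin/sh
# Sketch: dump Mahout SequenceFile outputs as text for inspection.
# DRY_RUN is a hypothetical switch: 1 (the default here) only prints the
# commands; set DRY_RUN=0 on a machine where mahout is on the PATH.
DRY_RUN="${DRY_RUN:-1}"

# inspect is a hypothetical helper; $1 is an HDFS path to a SequenceFile
# (or, as an assumption, a directory containing one).
inspect() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "mahout seqdumper -i $1"
    else
        # Show only the first records so large files stay manageable.
        mahout seqdumper -i "$1" | head -n 20
    fi
}

inspect output-directory
inspect output-vector-str/tf-vectors
inspect output-vector-str/dictionary.file-0
```

If the tf-vectors dump shows no records, or only empty vectors, that would suggest the terms were filtered out during seq2sparse rather than lost in a later step.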
Besides,