On Compression-Based Text Classification
ERRATA PAGE
For the proceedings of ECIR’05
Original text (Section 6.2, A Comparison of AMDL and BCN Procedures):
"In our
experiments, BCN ran much more slowly than AMDL. This is not surprising,
because in BCN, each byte of a test file is compressed as many times as there
are training documents, because the test file is concatenated to each
training file before compression. In contrast, in AMDL, each byte of a test
file is compressed as many times as there are training classes.
Thus, for example, if a Reuters-10 experiment takes several hours using
AMDL, it can easily take over a month using BCN. All remaining experiments
reported in this paper use AMDL."
Reason for change:
Although BCN did run several tens of times slower than AMDL on our
machines, the argument given above is wrong. In fact, it can be shown that
if all documents are the same length, then the total number of bytes
compressed by BCN is never more than twice the total number of bytes
compressed by AMDL.[1]
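The factor-of-two bound can be checked with a short byte count. The sketch below is illustrative only and rests on assumptions not stated on this page: that AMDL appends each test file to the concatenation of all training documents of a class, that compressing the training data on its own is ignored, and that every document has the same length. The function names and corpus sizes are hypothetical. With N training documents in C classes, BCN feeds 2N document-lengths to the compressor per test file and AMDL feeds N + C, so the ratio 2N / (N + C) stays below 2.

```python
def bcn_bytes(class_sizes, num_test, doc_len):
    """Bytes fed to the compressor by BCN: every test file is concatenated
    to every training document, so each run sees 2 * doc_len bytes."""
    num_train = sum(class_sizes)
    return num_test * num_train * 2 * doc_len


def amdl_bytes(class_sizes, num_test, doc_len):
    """Bytes fed to the compressor by AMDL (assumed model): every test file
    is appended to the concatenation of each class's training documents."""
    per_test = sum((n + 1) * doc_len for n in class_sizes)
    return num_test * per_test


# Hypothetical corpus: 10 classes of 100 training documents each,
# 500 test documents, every document 2,000 bytes long.
class_sizes = [100] * 10
bcn = bcn_bytes(class_sizes, num_test=500, doc_len=2000)
amdl = amdl_bytes(class_sizes, num_test=500, doc_len=2000)
ratio = bcn / amdl  # equals 2N / (N + C), which is below 2 for C >= 1
print(bcn, amdl, ratio)
```

Under this counting model, the number of bytes compressed cannot by itself account for a slowdown of several tens of times.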
Replace with (new text):
"In our
experiments, BCN ran much more slowly than AMDL. For example, on 10news, gzip
ran for approximately 100 minutes under AMDL, and approximately 69 hours under
BCN. (However, it should be noted that we did not run our experiments on
a dedicated machine, nor did we optimize them for speed.) The reason for the
slower runtime of BCN may be that it performs more system calls and disk
operations than AMDL. All remaining experiments reported in this paper use
AMDL."