On Compression-Based Text Classification

Yuval Marton, Ning Wu, Lisa Hellerstein
ERRATA PAGE

For the proceedings of ECIR'05


The authors regret the following error. Please replace the following paragraph with the new paragraph given below it.

 

Replace this old paragraph (in Section 6.2, A Comparison of AMDL and BCN Procedures):

"In our experiments, BCN ran much more slowly than AMDL. This is not surprising, because in BCN, each byte of a test file is compressed as many times as there are training documents, because the test file is concatenated to each training file before compression. In contrast, in AMDL, each byte of a test file is compressed as many times as there are training classes.  Thus, for example, if a Reuters-10 experiment takes several hours using AMDL, it can easily take over a month using BCN. All remaining experiments reported in this paper use AMDL."

 

Reason for change:

Although BCN was indeed several tens of times slower than AMDL on our machines, the argument given above is wrong. In fact, it can be shown that if all documents are the same length, then the total number of bytes compressed by BCN is never more than twice the total number of bytes compressed by AMDL.[1]
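
The factor-of-two bound can be checked with a small byte count. The sketch below is illustrative only and rests on assumptions not spelled out in this erratum: BCN compresses each training document once by itself and once concatenated with each test document; AMDL compresses each class file (all training documents of a class concatenated) once by itself and once concatenated with each test document. The function names and the sample numbers are hypothetical.

```python
def bcn_bytes(class_sizes, num_test, doc_len):
    """Bytes fed to the compressor by BCN, under the assumptions stated above."""
    num_train = sum(class_sizes)
    # Each training document is compressed once on its own.
    alone = num_train * doc_len
    # Each (training document + test document) concatenation is 2 * doc_len bytes.
    pairs = num_train * num_test * 2 * doc_len
    return alone + pairs


def amdl_bytes(class_sizes, num_test, doc_len):
    """Bytes fed to the compressor by AMDL, under the assumptions stated above."""
    # Each class file (n documents concatenated) is compressed once on its own.
    alone = sum(class_sizes) * doc_len
    # Each (class file + test document) concatenation is (n + 1) * doc_len bytes.
    pairs = num_test * sum((n + 1) * doc_len for n in class_sizes)
    return alone + pairs


if __name__ == "__main__":
    # Hypothetical setting: 10 classes of 100 training documents each,
    # 500 test documents, every document 2000 bytes long.
    sizes = [100] * 10
    ratio = bcn_bytes(sizes, 500, 2000) / amdl_bytes(sizes, 500, 2000)
    print(f"BCN/AMDL byte ratio: {ratio:.3f}")
```

Under these assumptions, with N training documents, M test documents, K classes, and document length l, BCN compresses N*l*(1 + 2M) bytes and AMDL compresses N*l + M*l*(N + K) bytes, which is at least N*l*(1 + M). Since 1 + 2M <= 2*(1 + M), the ratio cannot exceed two.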

 

Replace with (new text):

"In our experiments, BCN ran much more slowly than AMDL. For example, on 10news, gzip ran for approximately 100 minutes under AMDL, and approximately 69 hours under BCN. (However, it should be noted that we did not run our experiments on a dedicated machine, nor did we optimize them for speed.) The reason for the slower runtime of BCN may be that it performs more system calls and disk operations than AMDL. All remaining experiments reported in this paper use AMDL."

 



[1] We thank Jude Shavlik for making this observation and catching our error.