Comparative Analysis on Academic Paper Similarity using Jaccard and Levenshtein and Blocking
Abstract
Paper search engines have made it easier for academics to conduct literature reviews, but easier does not mean more accurate. For certain niche topics, search results are often poor. Snowballing can help, but it is limited by the initial set of articles at hand and, in particular, by what their authors had access to when those articles were written. As an alternative, paper databases recommend articles related to a given article, but the recommendations are limited to that database. A tool that searches for similar articles without relying on a specific database would be very helpful, but first the appropriate method for measuring article similarity needs to be determined. This research measures article similarity based on title, author, and keywords using the Weighted Jaccard Measure and Levenshtein distance, and evaluates the results. It also compares performance when blocking (overlap blocking) and stop word removal are added. The evaluation results for Jaccard alone are quite poor, but those for Levenshtein + Jaccard are decent. In addition, weighting the title most heavily was found to produce the best results. Overlap blocking and stop word removal increase processing time rather than reducing it. Overlap blocking with an overlap of 1 can reduce the number of measurements by almost half, but overlaps above 1 discard many pairs that should be similar. Removing stop words improves Jaccard and Levenshtein performance but requires threshold adjustment.
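As a rough illustration of the techniques named in the abstract, the following Python sketch shows one possible reading of them. It is not the authors' implementation. The field names (`title`, `authors`, `keywords`), the stop word list, the example weights, and all function names are assumptions made for this example, and the way the paper actually combines Levenshtein with Jaccard is not specified in the abstract.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "using", "with"}


def tokens(text, remove_stop_words=False):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return set(words)


def jaccard_similarity(s, t, remove_stop_words=False):
    """Jaccard index of the token sets of two strings: |A & B| / |A | B|."""
    a, b = tokens(s, remove_stop_words), tokens(t, remove_stop_words)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(s, t):
    """Edit distance normalized to [0, 1], where 1 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def weighted_similarity(a, b, weights, measure=jaccard_similarity):
    """Weighted average of a per-field similarity measure over two records."""
    total = sum(w * measure(a[field], b[field]) for field, w in weights.items())
    return total / sum(weights.values())


def passes_overlap_blocking(title_a, title_b, min_overlap=1):
    """Keep a pair for comparison only if the titles share enough tokens."""
    return len(tokens(title_a) & tokens(title_b)) >= min_overlap
```

A pair of records would first be filtered with `passes_overlap_blocking` and then scored with `weighted_similarity`, for example with hypothetical weights such as `{"title": 0.6, "authors": 0.2, "keywords": 0.2}` that emphasize the title. Passing `measure=levenshtein_similarity` swaps the per-field measure, and a pair counts as similar when the score exceeds a chosen threshold.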
Article Details
How to Cite
[1]
M. R. Nur, G. S. Buana, and N. A. Rakhmawati, “Comparative Analysis on Academic Paper Similarity using Jaccard and Levenshtein and Blocking”, JuTISI, vol. 9, no. 2, pp. 272 –, Aug. 2023.
Section
Articles
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium.