A Consideration of JSON format Log File Discrimination using Machine Learning for Automatic Log Collection

Naoki Taniya (谷屋 直樹), Shinta Nakano (中野 心太), Shingo Sekiya (関谷 信吾), Akira Orita (折田 彰), Yorinori Kishimoto (岸本 頼紀), Atsushi Waseda (早稲田 篤志), Masaki Hanada (花田 真樹)

In digital forensics, collecting log files is tedious work. A method has been proposed that automatically identifies and collects text-format log files by applying fasttext to each file's binary data and comparing similarity. However, that method cannot handle JSON-format log files, which are used as a log output format by applications such as nginx and Apache. Because the keys in a JSON file differ from one emitting application to another, classification based on similarity measures such as tf-idf is difficult. We therefore compute similarity with each of four algorithms (fasttext, naive Bayes, random forest, and SVM), analyze the resulting trends, and examine and report which algorithms are suited to discriminating log files from non-log files.
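The key-mismatch problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it uses plain cosine similarity over top-level JSON keys as a simplified stand-in for a tf-idf style measure, and the three sample log lines and their key names are hypothetical, chosen only to show two applications logging the same event under different keys.

```python
import json
import math
from collections import Counter

# Hypothetical JSON log lines. The key names are illustrative assumptions,
# not the log formats examined in the paper.
NGINX_LINE = ('{"time_local": "10/Oct/2023:13:55:36 +0900", '
              '"remote_addr": "192.0.2.1", "request": "GET / HTTP/1.1", "status": 200}')
NGINX_LINE_2 = ('{"time_local": "10/Oct/2023:13:56:01 +0900", '
                '"remote_addr": "192.0.2.7", "request": "GET /a HTTP/1.1", "status": 404}')
APACHE_LINE = ('{"timestamp": "2023-10-10T13:55:36+09:00", "clientip": "192.0.2.1", '
               '"verb": "GET", "path": "/", "response": 200}')


def key_vector(line: str) -> Counter:
    """Count the top-level keys of one JSON object."""
    return Counter(json.loads(line).keys())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two lines from the same (hypothetical) application share every key.
same_app = cosine(key_vector(NGINX_LINE), key_vector(NGINX_LINE_2))
# Lines from different applications describe the same kind of event
# but share no key names at all.
cross_app = cosine(key_vector(NGINX_LINE), key_vector(APACHE_LINE))

print(f"same application:  {same_app:.2f}")
print(f"cross application: {cross_app:.2f}")
```

Under these assumptions, a measure built on key vocabulary treats the two access logs as unrelated even though both are log files, which is the kind of difficulty that motivates comparing learned classifiers instead.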
