[Repost] Some feature dimensions for identifying malicious URLs with machine learning

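Identifying malicious URLs is a binary classification problem. Machine learning for security follows the same steps as conventional machine learning: first prepare the base data, then do the feature engineering, then choose a classification algorithm.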
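
The three steps above can be sketched end to end in a few lines. This is only an illustration, not the system described in this post: the URLs and labels are made up, the two features (number of dots, number of query parameters) are picked for brevity, and a hand-written perceptron stands in for whatever classifier one would actually choose.

```python
from urllib.parse import urlsplit, parse_qsl

# Step 1: base data. These URLs and labels are invented for illustration
# (1 = malicious, 0 = benign); they do not come from any real data set.
SAMPLES = [
    ("http://example.com/index.html", 0),
    ("http://example.org/about", 0),
    ("http://a.b.c.d.example.net/login.php?u=1&p=2&t=3", 1),
    ("http://x.y.z.w.example.net/verify.php?a=1&b=2", 1),
]


def featurize(url):
    """Step 2: feature engineering. Map a URL to a numeric vector."""
    parts = urlsplit(url)
    dot_num = url.count(".")
    param_num = len(parse_qsl(parts.query, keep_blank_values=True))
    return [float(dot_num), float(param_num)]


def score(weights, bias, x):
    """Linear score w.x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def train_perceptron(rows, epochs=20, lr=1.0):
    """Step 3: a classifier. A plain perceptron, chosen only for brevity."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows:
            pred = 1 if score(weights, bias, x) > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            weights = [w + lr * err * v for w, v in zip(weights, x)]
            bias += lr * err
    return weights, bias


def predict(model, url):
    """Return 1 (malicious) or 0 (benign) for a URL."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(url)) > 0 else 0


model = train_perceptron([(featurize(u), y) for u, y in SAMPLES])
```

With only four hand-picked samples the model merely separates its own training data, so `predict(model, url)` says nothing about real-world accuracy. The point is just the shape of the pipeline: data, then features, then classifier.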

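Feature engineering determines whether the result lands at 30% or 80%; tinkering with the classification algorithm then settles whether it is 80%, 98%, or 99.999%. Researchers at Sichuan University of Science and Technology (四川科技大学) have published their results and launched a real-time malicious web page detection system. The feature dimensions they use are reposted below.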

This project now mainly uses four types of features:

(1) Domain-based features, such as WHOIS information, the domain and its sections, the number of days since the domain was registered, and the remaining days before the domain expires.

(2) Host-based features, such as where the server is located, the sponsor of the host, and the technologies used by the site.

(3) Reputation-based features, such as Alexa rank, Alexa linkin, the number of pages indexed by Baidu, and so on.

(4) Lexical features, such as the number of dots in the URL, the number of parameters, and sensitive keywords contained in the URL.
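
As a small illustration of the domain-based date features in (1), the sketch below turns a registration date and an expiry date into the two day counts. How the dates are obtained from WHOIS is left out. The function name, its parameters, and the sample dates are assumptions made for this example, not part of the original system.

```python
from datetime import date


def domain_age_features(registered_on, expires_on, today):
    """Return (register_days, expire_days) for one domain.

    register_days: days elapsed since the registration date.
    expire_days:   days remaining until the expiry date
                   (negative if the domain has already lapsed).
    """
    register_days = (today - registered_on).days
    expire_days = (expires_on - today).days
    return register_days, expire_days


# Hypothetical WHOIS dates, chosen only to make the arithmetic easy to follow.
reg_days, exp_days = domain_age_features(
    registered_on=date(2020, 1, 1),
    expires_on=date(2021, 1, 1),
    today=date(2020, 1, 31),
)
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the features reproducible when a data set is rebuilt later.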

All 62 features used in this system are listed below:

TABLE 1. Description of all 62 features used in this system

| Type | Feature | Description |
| --- | --- | --- |
| Domain | domain_length | The length of the domain. |
| Domain | server_name_length | The length of the “server name” section. |
| Domain | pure_domain_length | The length of the pure domain. |
| Domain | if_com | Whether the top-level domain is “.com”. |
| Domain | if_org_gov | Whether the top-level domain is “.gov” or “.org”. |
| Domain | expire_days | Remaining days before the domain expires. |
| Domain | register_days | Days since the domain was registered. |
| Reputation based | baidu_index | The number of pages indexed by Baidu. |
| Reputation based | baidu_inverse_index | The number of Baidu-indexed pages that link to the target URL. |
| Reputation based | beian | Whether the site is registered in the Chinese “BEIAN” (ICP filing) system. |
| Reputation based | google_pr | Google PageRank. |
| Reputation based | google_index | The number of pages indexed by Google. |
| Reputation based | alexa_rank | The rank of the website in Alexa. |
| Host based | if_godaddy | Whether the website is hosted by “Godaddy.com”. |
| Host based | if_usa | Whether the hosting server is located in the USA. |
| Host based | if_cn | Whether the hosting server is located in China. |
| Host based | if_latin_american | Whether the hosting server is located in Latin America. |
| Host based | if_europe | Whether the hosting server is located in Europe. |
| Host based | if_japan | Whether the hosting server is located in Japan. |
| Host based | if_aus | Whether the hosting server is located in Australia. |
| Host based | if_north_ame | Whether the hosting server is located in North America (except the USA). |
| Host based | if_india | Whether the hosting server is located in India. |
| Host based | if_africa | Whether the hosting server is located in Africa. |
| Host based | if_middle_east | Whether the hosting server is located in the Middle East. |
| Host based | if_se_asia | Whether the hosting server is located in South East Asia. |
| Host based | if_korea | Whether the hosting server is located in Korea. |
| Host based | if_jsp | Whether the website uses “JSP” technology. |
| Host based | if_cgi | Whether the website uses “CGI” technology. |
| Host based | if_html | Whether the website uses static HTML. |
| Host based | if_asp | Whether the website uses “ASP” technology. |
| Host based | if_php | Whether the website uses “PHP” technology. |
| Host based | if_http | Whether the web protocol is “HTTP”. |
| Host based | if_https | Whether the web protocol is “HTTPS”. |
| Host based | if_ftp | Whether the web protocol is “FTP”. |
| Lexical | dot_num | The number of dots in the URL. |
| Lexical | param_num | The number of parameters in the URL. |
| Lexical | if_paypal_u | Whether “paypal” is contained in the PATH section. |
| Lexical | if_taobao_u | Whether “taobao” is contained in the PATH section. |
| Lexical | if_ali_u | Whether “ali” is contained in the PATH section. |
| Lexical | if_jd_u | Whether “jd” is contained in the PATH section. |
| Lexical | if_safety_u | Whether “safety” is contained in the PATH section. |
| Lexical | if_verify_u | Whether “verify” is contained in the PATH section. |
| Lexical | if_google_u | Whether “google” is contained in the PATH section. |
| Lexical | if_apple_u | Whether “apple” is contained in the PATH section. |
| Lexical | if_facebook_u | Whether “facebook” is contained in the PATH section. |
| Lexical | if_amazon_u | Whether “amazon” is contained in the PATH section. |
| Lexical | if_porn_u | Whether “porn”-related words are contained in the PATH section. |
| Lexical | if_gamble_u | Whether “gamble”-related words are contained in the PATH section. |
| Lexical | if_awarding_u | Whether “awarding”-related words are contained in the PATH section. |
| Lexical | if_paypal_d | Whether “paypal” is contained in the DOMAIN section. |
| Lexical | if_taobao_d | Whether “taobao” is contained in the DOMAIN section. |
| Lexical | if_ali_d | Whether “ali” is contained in the DOMAIN section. |
| Lexical | if_jd_d | Whether “jd” is contained in the DOMAIN section. |
| Lexical | if_safety_d | Whether “safety” is contained in the DOMAIN section. |
| Lexical | if_verify_d | Whether “verify” is contained in the DOMAIN section. |
| Lexical | if_google_d | Whether “google” is contained in the DOMAIN section. |
| Lexical | if_apple_d | Whether “apple” is contained in the DOMAIN section. |
| Lexical | if_facebook_d | Whether “facebook” is contained in the DOMAIN section. |
| Lexical | if_amazon_d | Whether “amazon” is contained in the DOMAIN section. |
| Lexical | if_porn_d | Whether “porn”-related words are contained in the DOMAIN section. |
| Lexical | if_gamble_d | Whether “gamble”-related words are contained in the DOMAIN section. |
| Lexical | if_awarding_d | Whether “awarding”-related words are contained in the DOMAIN section. |
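
The keyword features come in pairs, one for the PATH section (`_u`) and one for the DOMAIN section (`_d`). A minimal sketch of how such flags might be computed follows. It assumes that "PATH" means the path component returned by `urllib.parse.urlsplit`, that "DOMAIN" means the host name, and that matching is a case-insensitive substring test; the source does not say how the original system matches keywords. The helper name `keyword_flags` is hypothetical.

```python
from urllib.parse import urlsplit

# Keywords taken from the feature list. The "porn", "gamble" and "awarding"
# features rely on related-word lists that the source does not give, so they
# are left out of this sketch.
KEYWORDS = ["paypal", "taobao", "ali", "jd", "safety", "verify",
            "google", "apple", "facebook", "amazon"]


def keyword_flags(url):
    """Return a dict of if_<kw>_u / if_<kw>_d flags (1 or 0) for one URL."""
    parts = urlsplit(url)
    domain = (parts.hostname or "").lower()  # assumed DOMAIN section
    path = parts.path.lower()                # assumed PATH section
    flags = {}
    for kw in KEYWORDS:
        flags[f"if_{kw}_u"] = int(kw in path)
        flags[f"if_{kw}_d"] = int(kw in domain)
    return flags


flags = keyword_flags("http://paypal.example.com/account/verify/index.php")
```

For this URL, `if_paypal_d` and `if_verify_u` are set and the other flags are 0. Plain substring matching is crude: short keywords such as "ali" or "jd" will also fire inside unrelated words, which a real system would probably need to handle.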

All features are numeric.
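
Since every feature is numeric, the yes/no features are presumably stored as 1/0 next to the counts and lengths. The sketch below encodes a handful of the listed features into one fixed-order row. The column order, the helper name `numeric_row`, and the reading of `param_num` as the number of query-string parameters are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Column order is an assumption; any fixed order works as long as training
# and prediction use the same one.
COLUMNS = ["dot_num", "param_num", "if_com", "if_org_gov",
           "if_http", "if_https", "if_ftp"]


def numeric_row(url):
    """Encode a few of the listed features as a list of integers."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    scheme = parts.scheme.lower()
    features = {
        "dot_num": url.count("."),
        "param_num": len(parse_qsl(parts.query, keep_blank_values=True)),
        "if_com": host.endswith(".com"),
        "if_org_gov": host.endswith((".org", ".gov")),
        "if_http": scheme == "http",
        "if_https": scheme == "https",
        "if_ftp": scheme == "ftp",
    }
    # int() turns True/False into 1/0 and leaves the counts unchanged.
    return [int(features[name]) for name in COLUMNS]


row = numeric_row("https://www.example.com/item.php?id=7&ref=home")
```

Here `row` holds three dots, two parameters, and 1s for `if_com` and `if_https`. The remaining features from the table (WHOIS dates, hosting location, reputation scores) would be appended to the same row once they have been looked up.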

