Verifies that the URL conforms to "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt). Default valid protocols are: http, https, ftp.
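
As a rough illustration of the check described above, the following sketch accepts a URL only when its scheme is one of the default protocols and a host is present. It is an approximation built on Python's standard `urllib.parse`, not a full RFC 2396 validator and not the library's own implementation; the name `is_valid_url` and its `protocols` argument are hypothetical.

```python
from urllib.parse import urlsplit

# Default protocols named in the description above.
DEFAULT_PROTOCOLS = ("http", "https", "ftp")


def is_valid_url(url, protocols=DEFAULT_PROTOCOLS):
    """Loose validity check (assumption: scheme allowed and host present)."""
    if not url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # for example an unterminated IPv6 literal.
        return False
    # urlsplit lowercases the scheme, so "HTTP://..." is accepted too.
    return parts.scheme in protocols and bool(parts.hostname)
```

A stricter validator would also check path, query and fragment syntax against the RFC grammar; this sketch deliberately stops at scheme and host.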
Extracts the URL domain, i.e.
Extracts the URL protocol, i.e.
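
To make the two extraction steps concrete, here is a minimal sketch that pulls the host and the scheme out of a URL string with Python's standard `urllib.parse`. The helper names `url_domain` and `url_protocol` are hypothetical, and returning `None` for input without a host or scheme is an assumption of this sketch rather than documented behavior.

```python
from urllib.parse import urlsplit


def url_domain(url):
    """Return the host part of the URL, or None if there is none."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


def url_protocol(url):
    """Return the scheme part of the URL, or None if there is none."""
    try:
        return urlsplit(url).scheme or None
    except ValueError:
        return None
```

Note that `hostname` drops any port and user information and is lowercased, which is usually what is wanted when grouping URLs by domain.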
Converts a sequence of URL features into a vector, extracting the domains of the valid URLs and keeping the top K occurrences of each feature, along with an extra column per feature indicating how many values were not in the top K.
How many values to keep in the vector
If true, ignores capitalization and punctuation when grouping categories
Minimum number of times a value must occur to be retained in the pivot
Keeps an extra column that indicates whether the feature was null
Other URL features
Maximum percentage of distinct values a categorical feature can have, expressed as a fraction (between 0.0 and 1.0)
The vectorized features
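
The vectorization described above can be sketched for a single URL feature as a top-K pivot over domains. This is a simplified illustration under stated assumptions, not the library's implementation: the names `fit_top_k`, `vectorize`, `top_k`, `min_support` and `track_nulls` are hypothetical, invalid URLs are treated like nulls, and the text-cleaning and maximum-cardinality options are left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def _domain(url, protocols=("http", "https", "ftp")):
    """Return the host of a URL with an allowed scheme, else None."""
    if not url:
        return None
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    return parts.hostname if parts.scheme in protocols else None


def fit_top_k(urls, top_k=2, min_support=1):
    """Pick the top_k most frequent domains seen at least min_support times."""
    counts = Counter(d for d in map(_domain, urls) if d is not None)
    kept = [(d, c) for d, c in counts.items() if c >= min_support]
    # Sort by descending count, breaking ties alphabetically for determinism.
    kept.sort(key=lambda dc: (-dc[1], dc[0]))
    return [d for d, _ in kept[:top_k]]


def vectorize(url, categories, track_nulls=True):
    """One column per kept domain, then an 'other' column, then a null column."""
    d = _domain(url)
    vec = [1.0 if d == c else 0.0 for c in categories]
    # 'Other' column: a valid domain that did not make the top K.
    vec.append(1.0 if d is not None and d not in categories else 0.0)
    if track_nulls:
        # Assumption: missing and invalid URLs both count as null here.
        vec.append(1.0 if d is None else 0.0)
    return vec
```

For example, fitting on a small list and vectorizing a few values:

```python
urls = ["https://a.com/x", "http://a.com/y", "https://b.org",
        "ftp://c.net/f", None, "bad"]
cats = fit_top_k(urls, top_k=2)      # ["a.com", "b.org"]
vectorize("https://a.com/z", cats)   # [1.0, 0.0, 0.0, 0.0]
vectorize("ftp://c.net/f", cats)     # [0.0, 0.0, 1.0, 0.0]
vectorize(None, cats)                # [0.0, 0.0, 0.0, 1.0]
```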