Verifies that the URL is well formed according to RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax" (http://www.ietf.org/rfc/rfc2396.txt).
URL protocols to consider valid, e.g. http, https, ftp.
Verifies that the URL is well formed according to RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax" (http://www.ietf.org/rfc/rfc2396.txt). The default valid protocols are http, https, and ftp.
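The protocol-restricted check described above can be approximated as follows. This is a minimal sketch in Python using only the standard library, not the library's own implementation: the function name `is_valid_url` and the `protocols` parameter are hypothetical, and the check is looser than a full RFC 2396 validation (it only requires an allowed scheme and a non-empty host).

```python
from urllib.parse import urlsplit

# Defaults taken from the description above: http, https, ftp.
DEFAULT_PROTOCOLS = ("http", "https", "ftp")


def is_valid_url(url, protocols=DEFAULT_PROTOCOLS):
    """Rough check (hypothetical helper): allowed scheme plus a non-empty host.

    This is an approximation, not a complete RFC 2396 validator.
    """
    if not url:
        return False
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. an unclosed IPv6 bracket
        return False
    return parts.scheme in protocols and bool(host)
```

Passing a narrower tuple, for example `protocols=("ftp",)`, corresponds to the variant that takes an explicit list of valid protocols; calling it with no second argument corresponds to the variant with the default protocols.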
Extracts the URL domain, i.e.
Extracts the URL protocol, i.e.
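As an illustration of what the two extraction operations return, here is a small Python sketch built on `urllib.parse`. The helper names `url_domain` and `url_protocol` and the example URL are assumptions made for this sketch, not part of the documented API.

```python
from urllib.parse import urlsplit


def url_protocol(url):
    """Return the scheme part of the URL (hypothetical helper), or None if absent."""
    return urlsplit(url).scheme or None


def url_domain(url):
    """Return the host part of the URL (hypothetical helper), lower-cased, without the port."""
    return urlsplit(url).hostname


# Illustrative URL only; example.com is a placeholder.
example = "https://www.example.com:8080/docs/index.html"
protocol = url_protocol(example)
domain = url_domain(example)
```

With the placeholder URL above, `protocol` is the scheme before `://` and `domain` is the host name between `://` and the port or path.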
Converts a sequence of URL features into a vector by extracting the domains of the valid URLs and keeping the top K occurrences of each feature, along with an extra column per feature indicating how many values were not in the top K.
Number of top values to keep in the vector
If true, ignores capitalization and punctuation when grouping categories
Minimum number of times a value must occur to be retained in the pivot
Whether to keep an extra column indicating that the feature was null
Other URL features
The vectorized features
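The vectorization described above can be sketched for a single URL feature as a two-step pivot: first pick the top K domains, then encode each value against them. This Python sketch is an illustration under stated assumptions, not the library's implementation: all names (`fit_top_k`, `vectorize`, `top_k`, `min_support`, `clean_text`, `track_nulls`) and the default values are hypothetical, invalid URLs are treated the same as nulls, and ties are broken alphabetically.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

VALID_PROTOCOLS = ("http", "https", "ftp")


def domain_or_none(url):
    """Domain of a valid URL, else None (assumption: invalid URLs count as null)."""
    if not url:
        return None
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return None
    return host if parts.scheme in VALID_PROTOCOLS and host else None


def clean(value):
    """Lower-case and strip punctuation so near-identical values group together."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def fit_top_k(urls, top_k=2, min_support=1, clean_text=False):
    """Pick the top_k most frequent domains that occur at least min_support times."""
    domains = [d for d in map(domain_or_none, urls) if d is not None]
    if clean_text:
        domains = [clean(d) for d in domains]
    counts = Counter(domains)
    kept = [(value, n) for value, n in counts.items() if n >= min_support]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))  # by count, then name
    return [value for value, _ in kept[:top_k]]


def vectorize(url, categories, clean_text=False, track_nulls=True):
    """One column per kept domain, one 'other' column, optional null indicator."""
    domain = domain_or_none(url)
    if domain is not None and clean_text:
        domain = clean(domain)
    row = [1.0 if domain == c else 0.0 for c in categories]
    row.append(1.0 if domain is not None and domain not in categories else 0.0)
    if track_nulls:
        row.append(1.0 if domain is None else 0.0)
    return row


# Placeholder data for illustration only.
urls = ["https://a.com/x", "https://a.com/y", "http://b.org", "ftp://c.net", "not a url", None]
categories = fit_top_k(urls, top_k=2)
rows = [vectorize(u, categories) for u in urls]
```

For several URL features, the same two steps would be applied per feature and the resulting rows concatenated, giving each feature its own top K columns, its own "other" column, and, if requested, its own null indicator.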