Name: url-classifier
Owner: OWASP
Description: Declarative syntax for defining sets of URLs. No need for error-prone regexes.
Created: 2017-10-06 03:26:17.0
Updated: 2018-05-03 10:31:56.0
Pushed: 2018-03-06 17:01:44.0
Size: 220
Language: Java
Declarative syntax for defining sets of URLs. No need for error-prone regexes.
Classes are all defined under `org.owasp.url`.

```java
import org.owasp.url.*;

class C {
  /** We define a classifier with a declarative syntax */
  static final UrlClassifier CLASSIFIER = UrlClassifiers.builder()
      // We want to allow HTTP and HTTPS for this example
      .scheme(BuiltinScheme.HTTP, BuiltinScheme.HTTPS)
      .authority(
          AuthorityClassifiers.builder()
              // We whitelist some subdomains of hosts we trust.
              .host("**.example.com", "**.example.net")
              .build())
      // We allow access to .html files
      .pathGlob("**.html")
      .build();

  void f() {
    // At runtime, we build a URL value.
    // Pass in a UrlContext if you know the base URL.
    UrlValue url = UrlValue.from("http://example.com/");

    Classification c = CLASSIFIER.apply(
        url,
        // If we want an explanation of why classification failed
        // we can connect diagnostics to our logs.
        Diagnostic.Receiver.NULL);

    // We can switch on the result.
    switch (c) {
      case MATCH:
        // ...
        System.out.println(url.urlText);
        break;
      case NOT_A_MATCH:
        // ...
        break;
      case INVALID:
        // ...
        break;
    }
  }
}
```
To use with Bazel, just add to your `WORKSPACE` file:

```python
maven_jar(
    name = "org_owasp_url",
    artifact = "org.owasp:url:1.2.4",
    sha1 = "f79cace2e811092dff78bc03b520eade0d675d33",
)
```
Then in a `BUILD` file you can use it thus:

```python
java_library(
    name = ...,
    deps = [
        "@org_owasp_url//jar",
        # ...
    ],
)
```
In your `WORKSPACE` file, you can replace the `artifact` version with one chosen from Maven Central. You can check the hash by copying the jar link for the version you want and adding `.sha1` to the end.
Alternatively, you can use any of the release ZIP files with `http_archive` thus:

```python
http_archive(
    name = "org_owasp_url",
    url = "https://github.com/OWASP/url-classifier/archive/v1.2.4.zip",
    sha256 = "0019dfc4b32d63c1392aa264aed2253c1e0c2fb09216f8e2cc269bbfb8bb49b5",
)
```
To use with Maven, add

```xml
<dependency>
  <groupId>org.owasp</groupId>
  <artifactId>url</artifactId>
  <version>1.2.3</version>
</dependency>
```

to your POM's `<dependencies>` or `<dependencyManagement>` section.
A `UrlClassifier` returns `MATCH` for some URLs and `NOT_A_MATCH` for others, but it can also return `INVALID`. An `INVALID` URL is one that different URL consumers are likely to treat in incompatible ways. For example:

- `http:/foo` is invalid. Although it is syntactically valid according to STD 66, it is missing a host, which is required by RFC 7230, the specification that defines the `http` protocol.
- `http://?/` is valid even though it is rejected by a strict interpretation of STD 66, because there is a widely & consistently implemented way of handling non-ASCII characters in host names.
- `http://example.com/../../../../etc/passwd` is equivalent to `http://example.com/etc/passwd` per the specification but has been used in directory traversal attacks.

There are several corner cases that are rejected as `INVALID` by default. If you need to treat one or more as valid, you can tell your `UrlClassifier` to tolerate them thus:
```java
import static org.owasp.url.UrlValue.CornerCase.*;

UrlClassifiers.builder()
    // Allow too many ..
    .tolerate(PATH_SIMPLIFICATION_REACHES_ROOT_PARENT)
    // More policy here ...
    .build();
```
Alternatively, if we're triggering this particular corner case because the default context doesn't capture our application path, we can use a different context when classifying `UrlValue`s:

```java
UrlContext context = UrlContext.DEFAULT.withContextUrl(
    "http://example.com/foo/bar/baz/");
```
Sometimes it's nice to know which URLs do not match a classifier and why. You can tie `UrlClassifier`s into your logging framework by implementing a `Diagnostic.Receiver`:
```java
Classification classifyNoisily(UrlClassifier c, UrlValue x) {
  return c.apply(
      x,
      // Use your favorite logging framework instead of System.err.
      (d, v) -> System.err.println(v + " did not match due to " + d));
}

Classification classifyNoisilyOldStyle(UrlClassifier c, UrlValue x) {
  // Old style anonymous class.
  Diagnostic.Receiver<UrlValue> r = new Diagnostic.Receiver<UrlValue>() {
    @Override public void note(Diagnostic d, UrlValue x) {
      System.err.println(x + " did not match due to " + d);
    }
  };
  return c.apply(x, r);
}
```
Matching URLs with regular expressions is hard. Even experienced programmers who are familiar with the URL spec produce patterns like `/http:\/\/example.com/` which spuriously matches unintended URLs:

- `http://example.com.evil.com/`
- `http://example.com@evil.com/`
- `http://example_com/`
- `javascript:alert(1)//http://example.com`

while failing to match simple variants that probably should match:

- `HTTP://example.com/` which uses an upper-case scheme
- `http://EXAMPLE.com/` which uses an upper-case hostname
- `https://example.com/` which uses a scheme that is equivalent for most intents and purposes

A common “fix” for that example, `/^https?:\/\/example\.com\//i`, spuriously fails to match other variants:

- `http://example.com./` which uses a trailing dot to disable DNS suffix searching
- `http://example.com:80/` which makes the port explicit

Epicycles can be added to a regex to work around problems as they're found, but there is a tradeoff between correctness and readability/maintainability.
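The failure modes above are easy to reproduce with `java.util.regex` alone. A standalone demonstration of the naive pattern from the text:

```java
import java.util.regex.Pattern;

public class NaiveUrlRegex {
  public static void main(String[] args) {
    // The naive pattern: unanchored, and the unescaped "." matches any character.
    Pattern naive = Pattern.compile("http://example.com");

    // Spurious matches:
    System.out.println(naive.matcher("http://example.com.evil.com/").find());  // true
    System.out.println(naive.matcher("http://example.com@evil.com/").find());  // true
    System.out.println(naive.matcher("http://example_com/").find());  // true: "." matches "_"
    System.out.println(
        naive.matcher("javascript:alert(1)//http://example.com").find());  // true

    // Missed variants that probably should match:
    System.out.println(naive.matcher("HTTP://example.com/").find());   // false
    System.out.println(naive.matcher("https://example.com/").find());  // false
  }
}
```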
There are similar hazards when trying to constrain other parts of the URL, like the path. `/^(?:https?:\/\/example\.com)?\/foo\/.*/` looks like it should match only URLs that have a path under `/foo/`, but it spuriously matches `http://example.com/foo/../../../../etc/passwd` which, used in the wrong context, can cause problems.
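A standalone sketch of that spurious match, again with plain `java.util.regex`:

```java
import java.util.regex.Pattern;

public class PathRegexHazard {
  public static void main(String[] args) {
    Pattern underFoo = Pattern.compile("^(?:https?://example\\.com)?/foo/.*");

    // Looks right for the intended inputs:
    System.out.println(
        underFoo.matcher("http://example.com/foo/index.html").matches());  // true

    // But also matches a URL whose path simplifies to /etc/passwd:
    System.out.println(
        underFoo.matcher("http://example.com/foo/../../../../etc/passwd").matches());  // true
  }
}
```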
We assume all `%`-sequences outside `data` or `blob` content can be decoded into UTF-8, and we mark as invalid any inputs that include code-unit sequences that are not valid UTF-8 or that are not minimally encoded.
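That check can be sketched with the JDK's strict UTF-8 decoder, which rejects both invalid byte sequences and over-long (non-minimal) encodings. This is only an illustration of the rule, not the library's actual code; `isWellFormedUtf8` is a hypothetical helper:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
  /** Returns true iff bytes are well-formed, minimally encoded UTF-8. */
  static boolean isWellFormedUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          // REPORT makes the decoder throw instead of substituting U+FFFD.
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // %C3%A9 decodes to bytes 0xC3 0xA9: valid UTF-8 for 'é'.
    System.out.println(isWellFormedUtf8(new byte[] {(byte) 0xC3, (byte) 0xA9}));  // true
    // 0xC0 0xAF is an over-long (non-minimal) encoding of '/': rejected.
    System.out.println(isWellFormedUtf8(new byte[] {(byte) 0xC0, (byte) 0xAF}));  // false
  }
}
```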
We assume that all hostnames are complete. For example, `http://www/` might actually resolve to `http://www.myorganization.org/` after the domain search list is applied. We can't take search lists into account and still have stable predicates that do not depend on external services and that do not potentially leak information about servers inside a firewall to anyone outside the firewall who can specify a partial URL.