OWASP/url-classifier

Name: url-classifier

Owner: OWASP

Description: Declarative syntax for defining sets of URLs. No need for error-prone regexs.

Created: 2017-10-06 03:26:17.0

Updated: 2018-05-03 10:31:56.0

Pushed: 2018-03-06 17:01:44.0

Size: 220

Language: Java

README

URL Classifier

Declarative syntax for defining sets of URLs. No need for error-prone regexs.

Usage (javadoc)
Classes are all defined under org.owasp.url.

import org.owasp.url.*;

class C {

  /** We define a classifier with a declarative syntax. */
  static final UrlClassifier CLASSIFIER = UrlClassifiers.builder()
      // We want to allow HTTP and HTTPS for this example.
      .scheme(BuiltinScheme.HTTP, BuiltinScheme.HTTPS)
      .authority(
          AuthorityClassifiers.builder()
          // We whitelist some subdomains of hosts we trust.
          .host("**.example.com", "**.example.net")
          .build())
      // We allow access to .html files.
      .pathGlob("**.html")
      .build();

  void f() {
    // At runtime, we build a URL value.
    // Pass in a UrlContext if you know the base URL.
    UrlValue url = UrlValue.from("http://example.com/");

    Classification c = CLASSIFIER.apply(
        url,
        // If we want an explanation of why classification failed
        // we can connect diagnostics to our logs.
        Diagnostic.Receiver.NULL);

    // We can switch on the result.
    switch (c) {
      case MATCH:
        // ...
        System.out.println(url.urlText);
        break;
      case NOT_A_MATCH:
        // ...
        break;
      case INVALID:
        // ...
        break;
    }
  }
}


In Bazel

To use the Maven artifact, just add to your WORKSPACE:

# WORKSPACE

maven_jar(
    name = "org_owasp_url",
    artifact = "org.owasp:url:1.2.4",
    sha1 = "f79cace2e811092dff78bc03b520eade0d675d33")

Then in a BUILD you can use it thus:

# BUILD

java_library(
    name = ...,
    deps = [
        "@org_owasp_url//jar",
        # ...
    ],
)

In your WORKSPACE, you can replace the artifact version with one chosen from Maven Central.

You can check the hash by copying the jar link for the version you want and adding .sha1 to the end.
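As a sketch, assuming the standard Maven Central repository layout for org.owasp:url (the exact URLs and file names here are illustrative):

```shell
# Fetch the published checksum for the version you want ...
curl -sL https://repo1.maven.org/maven2/org/owasp/url/1.2.4/url-1.2.4.jar.sha1

# ... and compare it with a locally computed digest of the downloaded jar.
sha1sum url-1.2.4.jar
```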


Alternatively, you can use any of the release ZIP files with http_archive thus:

# WORKSPACE

http_archive(
    name = "org_owasp_url",
    url = "https://github.com/OWASP/url-classifier/archive/v1.2.4.zip",
    sha256 = "0019dfc4b32d63c1392aa264aed2253c1e0c2fb09216f8e2cc269bbfb8bb49b5")

In Maven

Add

<dependency>
  <groupId>org.owasp</groupId>
  <artifactId>url</artifactId>
  <version>1.2.3</version>
</dependency>

to your POM's <dependencies> or <dependencyManagement> section.

Invalid URLs

A UrlClassifier returns MATCH for some URLs and NOT_A_MATCH for others, but it can also return INVALID. An INVALID URL is one that the classifier refuses to process at all, for example because it does not parse or because it triggers a risky corner case of the URL specification.

There are several corner cases that are rejected as INVALID by default.

If you need to treat one or more as valid, you can tell your UrlClassifier to tolerate them thus:

import static org.owasp.url.UrlValue.CornerCase.*;

UrlClassifiers.builder()
    // Allow too many ..
    .tolerate(PATH_SIMPLIFICATION_REACHES_ROOT_PARENT)
    // More policy here ...
    .build();

// Alternatively, if we're triggering this particular corner case
// because the default context doesn't capture our application path,
// we can use a different context when classifying UrlValues.
UrlContext context = UrlContext.DEFAULT.withContextUrl(
    "http://example.com/foo/bar/baz/");

Diagnostics

Sometimes it's useful to know which URLs do not match a classifier and why.

You can tie UrlClassifiers into your logging framework by implementing a Diagnostic.Receiver.

Classification classifyNoisily(UrlClassifier c, UrlValue x) {
  return c.apply(
      x,
      (d, v) -> { System.err.println(v + " did not match due to " + d); }
      // Use your favorite logging framework instead of System.err.
      );
}

Classification classifyNoisilyOldStyle(UrlClassifier c, UrlValue x) {
  // Old style anonymous class.
  Diagnostic.Receiver<UrlValue> r = new Diagnostic.Receiver<UrlValue>() {
    @Override public void note(Diagnostic d, UrlValue x) {
      System.err.println(x + " did not match due to " + d);
    }
  };
  return c.apply(x, r);
}

Problem

Matching URLs with regular expressions is hard. Even experienced programmers who are familiar with the URL spec produce patterns like /http:\/\/example.com/ which spuriously matches unintended URLs (the unescaped dot matches any character, and the unanchored pattern matches anywhere inside a longer string) while failing to match simple variants that probably should match, such as equivalent URLs that use HTTPS or an upper-case scheme.

A common “fix” for that example, /^https?:\/\/example\.com\//i, still spuriously fails to match other equivalent variants, such as URLs with an explicit default port.

Epicycles can be added to a regex to work around problems as they're found, but there is a tradeoff between correctness and readability/maintainability.

There are similar hazards when trying to constrain other parts of the URL, like the path. /^(?:https?:\/\/example\.com)?\/foo\/.*/ looks like it should match only URLs with a path under /foo/, but it spuriously matches path-relative URLs such as /foo/../evil whose simplified path escapes /foo/, which, used in the wrong context, can cause problems.
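These pitfalls are easy to reproduce with java.util.regex; the snippet below is a standalone illustration, not part of the url-classifier API:

```java
import java.util.regex.Pattern;

public class RegexPitfalls {
  public static void main(String[] args) {
    // The naive pattern from the text: unescaped dot, no anchors.
    Pattern naive = Pattern.compile("http://example.com");

    // Spurious match: unanchored, it matches inside a longer,
    // attacker-controlled host name.
    System.out.println(naive.matcher("http://example.com.evil.org/").find());  // true

    // Spurious match: the unescaped '.' matches any character.
    System.out.println(naive.matcher("http://examplezcom/").find());  // true

    // Missed variant: HTTPS is not matched at all.
    System.out.println(naive.matcher("https://example.com/").find());  // false

    // The "fixed" pattern still misses an explicit default port.
    Pattern fixed = Pattern.compile("^https?://example\\.com/", Pattern.CASE_INSENSITIVE);
    System.out.println(fixed.matcher("https://example.com:443/").find());  // false
  }
}
```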

Simplifying Assumptions
UTF-8 centric

We assume all %-sequences outside data: or blob: content can be decoded into UTF-8, and we mark as invalid any inputs that include code-unit sequences that are not valid UTF-8 or that are not minimally encoded.
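To illustrate what that assumption rules out, here is a standalone sketch (not the library's actual implementation) of strict UTF-8 validation. Java's built-in decoder already rejects overlong sequences such as 0xC0 0xAF, a non-minimal encoding of '/':

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
  // Decode strictly: reject malformed or overlong sequences instead of
  // silently substituting replacement characters.
  static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      dec.decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // "%2F" decodes to the single valid byte 0x2F ('/').
    System.out.println(isValidUtf8(new byte[] {0x2F}));  // true

    // "%C0%AF" is an overlong (non-minimal) encoding of '/':
    // a strict decoder rejects it, so the URL would be INVALID.
    System.out.println(isValidUtf8(new byte[] {(byte) 0xC0, (byte) 0xAF}));  // false
  }
}
```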

Empty domain search list

We assume that all hostnames are complete. For example, http://www/ might actually resolve to http://www.myorganization.org/ after the domain search list is applied. We cannot apply such a search list and still have stable predicates that do not depend on external services and that do not potentially leak information about servers inside a firewall to anyone outside the firewall who can specify a partial URL.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.