youzan/ngx_http_html_sanitize_module

Name: ngx_http_html_sanitize_module

Owner: ??

Description: An nginx HTTP module that sanitizes HTML5 using whitelisted elements, whitelisted attributes, and whitelisted CSS properties

Created: 2017-04-19 09:39:12.0

Updated: 2018-01-19 02:34:25.0

Pushed: 2017-10-09 08:46:18.0

Homepage: null

Size: 2263

Language: HTML

README

Name

ngx_http_html_sanitize_module - Based on Google's gumbo-parser as the HTML5 parser and hackers-painters's katana-parser as the inline CSS parser, it sanitizes HTML with whitelisted elements, whitelisted attributes, and whitelisted CSS properties.

Status

Production Ready :-)

Example

The following example nginx configuration follows the element list at https://dev.w3.org/html5/html-author/#the-elements:

server {
listen 8888;

location = /sanitize {
    # Explicitly set utf-8 encoding
    add_header Content-Type "text/html; charset=UTF-8";

    client_body_buffer_size 10M;
    client_max_body_size 10M;

    html_sanitize on;

    # Check https://dev.w3.org/html5/html-author/#the-elements

    # Root Element
    html_sanitize_element html;

    # Document Metadata
    html_sanitize_element head title base link meta style;

    # Scripting
    html_sanitize_element script noscript;

    # Sections
    html_sanitize_element body section nav article aside h1 h2 h3 h4 h5 h6 header footer address;

    # Grouping Content
    html_sanitize_element p hr br pre dialog blockquote ol ul li dl dt dd;

    # Text-Level Semantics
    html_sanitize_element a q cite em strong small mark dfn abbr time progress meter code var samp kbd sub sup span i b bdo ruby rt rp;

    # Edits
    html_sanitize_element ins del;

    # Embedded Content
    html_sanitize_element figure img iframe embed object param video audio source canvas map area;

    # Tabular Data
    html_sanitize_element table caption colgroup col tbody thead tfoot tr td th;

    # Forms
    html_sanitize_element form fieldset label input button select datalist optgroup option textarea output;

    # Interactive Elements
    html_sanitize_element details command bb menu;

    # Miscellaneous Elements
    html_sanitize_element legend div;

    html_sanitize_attribute *.style;
    html_sanitize_attribute a.href a.hreflang a.name a.rel;
    html_sanitize_attribute col.span col.width colgroup.span colgroup.width;
    html_sanitize_attribute data.value del.cite del.datetime;
    html_sanitize_attribute img.align img.alt img.border img.height img.src img.width;
    html_sanitize_attribute ins.cite ins.datetime li.value ol.reversed ol.start ol.type ul.type;
    html_sanitize_attribute table.align table.bgcolor table.border table.cellpadding table.cellspacing table.frame table.rules table.sortable table.summary table.width;
    html_sanitize_attribute td.abbr td.align td.axis td.colspan td.headers td.rowspan td.valign td.width;
    html_sanitize_attribute th.abbr th.align th.axis th.colspan th.rowspan th.scope th.sorted th.valign th.width;

    html_sanitize_style_property color font-size;

    html_sanitize_url_protocol http https tel;
    html_sanitize_url_domain *.google.com google.com;

    html_sanitize_iframe_url_protocol http https;
    html_sanitize_iframe_url_domain  facebook.com *.facebook.com;
}
}

It is recommended to use the following command to sanitize HTML5:

$ curl -X POST -d "<h1>Hello World </h1>" "http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0"

<h1>Hello World </h1>
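A client can also assemble the request URL programmatically. The following is a minimal Python sketch, not part of the module: the helper name `build_sanitize_url` and the `DEFAULT_MODES` table are hypothetical, and the mode values simply mirror the querystring used in the command above.

```python
from urllib.parse import urlencode

# Mode values copied from the curl example; the names are the querystring
# keys documented in this README. DEFAULT_MODES itself is a hypothetical helper.
DEFAULT_MODES = {
    "element": 2,               # output whitelisted elements only
    "attribute": 1,             # output all attributes
    "style_property": 1,        # output all CSS properties
    "style_property_value": 1,  # check CSS property values
    "url_protocol": 1,          # check URL protocols
    "url_domain": 0,            # do not check URL domains
    "iframe_url_protocol": 1,   # check iframe.src protocols
    "iframe_url_domain": 0,     # do not check iframe.src domains
}


def build_sanitize_url(base, **overrides):
    """Return `base` followed by a querystring of sanitize modes."""
    unknown = set(overrides) - set(DEFAULT_MODES)
    if unknown:
        raise ValueError(f"unknown querystring keys: {sorted(unknown)}")
    modes = dict(DEFAULT_MODES)
    modes.update(overrides)
    return base + "?" + urlencode(modes)


# Example: same modes as the curl command, but with the domain check enabled.
print(build_sanitize_url("http://127.0.0.1:8888/sanitize", url_domain=1))
```

Building the querystring from a table keeps every mode explicit, which matters because each mode defaults to 0 when it is omitted.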

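The querystring element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0 has the following meaning:

* element=2: output whitelisted elements
* attribute=1: output all attributes
* style_property=1: output all CSS properties
* style_property_value=1: check CSS property values for the URL function and IE's expression function
* url_protocol=1: check the URL protocol against the whitelist
* url_domain=0: do not check the URL domain
* iframe_url_protocol=1: the same as url_protocol, but only for `iframe.src`
* iframe_url_domain=0: the same as url_domain, but only for `iframe.src`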

With ngx_http_html_sanitize_module, we can control which HTML5 elements, attributes, and inline CSS properties are output by using directives and the querystring, as follows:

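Whitelisted element
-------------------

Disable element:

If we do not want to output any element, we can do the following:

$ curl -X POST -d "<h1>h1</h1>" "http://127.0.0.1:8888/sanitize?element=0"

Enable whitelisted element:

If we want to output only whitelisted elements, we can do the following:

$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" "http://127.0.0.1:8888/sanitize?element=2"

<h1>h1</h1>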

Whitelisted attribute
---------------------

Disable attribute:

If we do not want to output any attribute, we can do the following:

$ curl -X POST -d '<h1 ha="ha">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=0"

<h1>h1</h1>

Enable attribute:

If we want to output every attribute, we can do the following:

$ curl -X POST -d '<h1 ha="ha">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1"

<h1 ha="ha">h1</h1>

Enable whitelisted attribute:

If we want to output only whitelisted attributes, we can do the following:

$ curl -X POST -d '<img src="/" ha="ha" />' "http://127.0.0.1:8888/sanitize?element=1&attribute=2"

Whitelisted style property
--------------------------

Disable style property:

If we do not want to output any style property, we can do the following:

$ curl -X POST -d '<h1 style="color:red;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=0"

Enable style property:

If we want to output every style property, we can do the following:

$ curl -X POST -d '<h1 style="color:red;text-align:center;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=1"

Enable whitelisted style property:

If we want to output only whitelisted style properties, we can do the following:

$ curl -X POST -d '<h1 style="color:red;text-align:center;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=2"

Description
===========

The implementation of [ngx_http_html_sanitize_module] is based on [gumbo-parser] and [katana-parser]. We combine the two and run them on [nginx] as a central web service maintained by security professionals, which removes language-level differences between clients. If higher performance is needed (see the [benchmark](#benchmark)), it is recommended to write a language-level library that wraps the pure C libraries, which avoids the overhead of network transmission.

Benchmark
=========

Testing with `wrk -s benchmarks/shot.lua -d 60s "http://127.0.0.1:8888"` on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 64GB of memory:

| Name | Size | Avg Latency | QPS |
| ----------- |:-------------:| -----:| -----:|
| [hacker_news.html](/benchmarks/hacker_news.html) | 30KB | 9.06ms | 2921.82 |
| [baidu.html](/benchmarks/baidu.html) | 76KB | 13.41ms | 1815.75 |
| [arabic_newspapers.html](/benchmarks/arabic_newspapers.html) | 78KB | 16.58ms | 1112.70 |
| [bbc.html](/benchmarks/bbc.html) | 115KB | 17.96ms | 993.12 |
| [xinhua.html](/benchmarks/xinhua.html) | 323KB | 33.37ms | 275.39 |
| [google.html](/benchmarks/google.html) | 336KB | 26.78ms | 351.54 |
| [yahoo.html](/benchmarks/yahoo.html) | 430KB | 29.16ms | 323.04 |
| [wikipedia.html](/benchmarks/wikipedia.html) | 511KB | 57.62ms | 160.10 |
| [html5_spec.html](/benchmarks/html5_spec.html) | 7.7MB | 1.63s | 2.00 |


TODO
====

* gumbo-parser (hard): Use SSE4.2 to speed up string processing
* gumbo-parser (hard): Additional algorithm-level performance improvements
* katana-parser (hard): Use SSE4.2 to speed up string processing
* katana-parser (hard): Additional algorithm-level performance improvements
* directive (optional): Add mode directives to carefully control HTML5 and inline CSS output
* html_sanitize_attribute (hard): Replace the current hash lookup with a new algorithm to reduce memory allocation
* tests (easy): Pass more XSS security tests
* querystring (optional): Allow foreign whitelisted querystrings to control whitelisted elements, attributes, and style properties

Steps to optimize performance are derived from the on-CPU flame graph below:

[![Flamegraph](https://cdn.rawgit.com/youzan/ngx_http_html_sanitize_module/master/flamegraphs/html_sanitize_gumbo_parse.svg)](https://cdn.rawgit.com/youzan/ngx_http_html_sanitize_module/master/flamegraphs/html_sanitize_gumbo_parse.svg)

Directive
=========

html_sanitize
-------------

**Syntax:** *html_sanitize on | off*

**Default:** *html_sanitize on*

**Context:** *location*

Specifies whether to enable the HTML sanitize handler in the location context.


html_sanitize_hash_max_size
---------------------------

**Syntax:** *html_sanitize_hash_max_size size*

**Default:** *html_sanitize_hash_max_size 2048*

**Context:** *location*

Sets the maximum size of the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables.

html_sanitize_hash_bucket_size
------------------------------

**Syntax:** *html_sanitize_hash_bucket_size size*

**Default:** *html_sanitize_hash_bucket_size 32|64|128*

**Context:** *location*

Sets the bucket size for the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol, and iframe_url_domain hash tables. The default value depends on the size of the processor's cache line.

html_sanitize_element
---------------------

**Syntax:** *html_sanitize_element element ...*

**Default:** -

**Context:** *location*

Sets the whitelisted HTML5 elements. The whitelist applies when the querystring [element] is set to whitelist mode, as follows:

html_sanitize_element html head body;

html_sanitize_attribute
-----------------------

**Syntax:** *html_sanitize_attribute attribute ...*

**Default:** -

**Context:** *location*

Sets the whitelisted HTML5 attributes. The whitelist applies when the querystring [attribute] is set to whitelist mode, as follows:

html_sanitize_attribute a.href h1.class;

The attribute format must be `element.attribute`. The forms `*.attribute` (prefix asterisk) and `element.*` (suffix asterisk) are also supported.
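The three accepted forms can be illustrated with a minimal Python sketch. This is an illustration only, not the module's implementation (which uses nginx hash tables in C); the function name `attribute_allowed` is hypothetical, and case handling is deliberately left out because this README does not specify it.

```python
def attribute_allowed(element, attribute, whitelist):
    """Return True if element.attribute matches any whitelist entry.

    Three entry forms are tried, mirroring the documented format:
    exact `element.attribute`, `*.attribute`, and `element.*`.
    """
    candidates = (
        f"{element}.{attribute}",  # exact form, e.g. a.href
        f"*.{attribute}",          # prefix asterisk: any element
        f"{element}.*",            # suffix asterisk: any attribute
    )
    return any(candidate in whitelist for candidate in candidates)


# Hypothetical whitelist, as it might be written in html_sanitize_attribute.
whitelist = {"a.href", "*.style", "img.*"}

print(attribute_allowed("a", "href", whitelist))    # exact match
print(attribute_allowed("h1", "style", whitelist))  # matched by *.style
print(attribute_allowed("h1", "class", whitelist))  # no entry matches
```

Note that a suffix-asterisk entry such as `img.*` admits every attribute on that element, including event handlers, so it should be used with care.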

html_sanitize_style_property
----------------------------

**Syntax:** *html_sanitize_style_property property ...*

**Default:** -

**Context:** *location*

Sets the whitelisted CSS properties. The whitelist applies when the querystring [style_property] is set to whitelist mode, as follows:

html_sanitize_style_property color background-color;

html_sanitize_url_protocol
--------------------------

**Syntax:** *html_sanitize_url_protocol [protocol] ...*

**Default:** -

**Context:** *location*

Sets the allowed URL protocols for the [linkable attribute](#linkable_attribute). The check applies only when the URL is absolute rather than relative and the URL protocol check is enabled through the querystring [url_protocol], as follows:

html_sanitize_url_protocol http https tel;
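The rule that only absolute URLs are checked can be sketched in Python as follows. This is a hedged illustration, not the module's code: `protocol_allowed` is a hypothetical name, the parsing is delegated to `urllib.parse.urlsplit`, and treating protocol-relative URLs (`//host/path`) as relative is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def protocol_allowed(url, allowed_protocols):
    """Return True if `url` is relative or its scheme is whitelisted."""
    scheme = urlsplit(url).scheme.lower()
    if not scheme:
        # Relative URL (no scheme): the protocol check does not apply.
        return True
    return scheme in allowed_protocols


# Same protocols as the directive example above.
allowed = {"http", "https", "tel"}

print(protocol_allowed("https://google.com/", allowed))   # whitelisted scheme
print(protocol_allowed("javascript:alert(1)", allowed))   # scheme not listed
print(protocol_allowed("/relative/path", allowed))        # relative URL
```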

html_sanitize_url_domain
------------------------

**Syntax:** *html_sanitize_url_domain domain ...*

**Default:** -

**Context:** *location*

Sets the allowed URL domains for the [linkable attribute](#linkable_attribute). The check applies only when the URL is absolute rather than relative and both the URL protocol check and the URL domain check are enabled through the querystring [url_protocol] and the querystring [url_domain](#url_domain), as follows:

html_sanitize_url_domain *.google.com google.com;
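A wildcard domain check of this kind might look like the Python sketch below. It is an illustration under stated assumptions, not the module's algorithm: the name `domain_allowed` is hypothetical, and the sketch assumes that `*.google.com` matches subdomains only, which is why the example above also lists the bare `google.com`.

```python
from urllib.parse import urlsplit


def domain_allowed(url, allowed_domains):
    """Return True if `url` is relative or its host matches the whitelist."""
    host = urlsplit(url).hostname
    if host is None:
        # Relative URL (no host): the domain check does not apply.
        return True
    host = host.lower()
    for pattern in allowed_domains:
        if pattern.startswith("*."):
            # "*.google.com" -> suffix ".google.com" (subdomains only).
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


# Same domains as the directive example above.
allowed = ["*.google.com", "google.com"]

print(domain_allowed("https://www.google.com/search", allowed))
print(domain_allowed("https://google.com.evil.example/", allowed))
```

Comparing against the suffix with its leading dot avoids accepting look-alike hosts such as `evilgoogle.com`.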

html_sanitize_iframe_url_protocol
---------------------------------

**Syntax:** *html_sanitize_iframe_url_protocol [protocol] ...*

**Default:** -

**Context:** *location*

The same as [html_sanitize_url_protocol], but only for the `iframe.src` attribute:

html_sanitize_iframe_url_protocol http https tel;

html_sanitize_iframe_url_domain
-------------------------------

**Syntax:** *html_sanitize_iframe_url_domain domain ...*

**Default:** -

**Context:** *location*

The same as [html_sanitize_url_domain], but only for the `iframe.src` attribute:

html_sanitize_iframe_url_domain *.facebook.com facebook.com;

linkable_attribute
==================
The linkable attributes are the following:

* a.href
* blockquote.cite
* q.cite
* del.cite
* img.src
* ins.cite
* iframe.src
* CSS URL function

Querystring
===========
The querystring of the request URL controls the internal behavior of [ngx_http_html_sanitize_module].

document
--------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to append `<!DOCTYPE>` to the response body.


html
----
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to append `<html></html>` to the response body.


script
------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to allow `<script></script>`.

style
-----
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to allow `<style></style>`.

namespace
---------
**Value:** *0, 1 or 2*

**Default:** *0*

**Context:** querystring

Specifies the namespace mode of gumbo-parser, with one of the following values:

* GUMBO_NAMESPACE_HTML: 0
* GUMBO_NAMESPACE_SVG: 1
* GUMBO_NAMESPACE_MATHML: 2

context
-------
**Value:** *[0, 150)*

**Default:** *38(GUMBO_TAG_DIV)*

**Context:** querystring

Specifies the context of gumbo-parser. The value is taken from the file [tag_enum.h](tag_enum.h).

element
-------
**Value:** *0, 1 or 2*

**Default:** *0*

**Context:** querystring

Specifies the output mode for elements, with one of the following values:

* 0: do not output elements
* 1: output all elements
* 2: output whitelisted elements

attribute
---------
**Value:** *0, 1 or 2*

**Default:** *0*

**Context:** querystring

Specifies the output mode for attributes, with one of the following values:

* 0: do not output attributes
* 1: output all attributes
* 2: output whitelisted attributes

style_property
--------------
**Value:** *0, 1 or 2*

**Default:** *0*

**Context:** querystring

Specifies the output mode for CSS properties, with one of the following values:

* 0: do not output CSS properties
* 1: output all CSS properties
* 2: output whitelisted CSS properties
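The effect of the three modes on an inline `style` value can be sketched in Python. This is only an illustration: the function `filter_style` is hypothetical, and it splits declarations naively on `;` and `:`, whereas the module relies on katana-parser, which handles real CSS syntax.

```python
def filter_style(style, mode, whitelist=frozenset()):
    """Filter an inline style string according to the style_property mode.

    mode 0 drops every property, mode 1 keeps every property,
    mode 2 keeps only the properties named in `whitelist`.
    """
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0, 1 or 2")
    if mode == 0:
        return ""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        if not sep or not name:
            continue  # skip empty or malformed declarations
        if mode == 1 or name in whitelist:
            kept.append(f"{name}:{value.strip()}")
    return ";".join(kept)


style = "color:red;text-align:center;"
print(filter_style(style, 0))
print(filter_style(style, 1))
print(filter_style(style, 2, {"color", "font-size"}))
```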

style_property_value
--------------------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether the values of CSS properties are checked, with one of the following values:

* 0: do not check the CSS property's value
* 1: check the CSS property's value for the [URL] function and IE's expression function to avoid XSS injection
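As a rough illustration of what such a value check looks for, the Python sketch below flags values that call `url(...)` or `expression(...)`. It is an assumption-laden sketch, not the module's logic: the name `value_is_suspicious` is hypothetical, the sketch only detects the two function calls, and it does not handle CSS escapes or comments that can disguise them.

```python
import re

# Matches a call to url(...) or expression(...), case-insensitively.
# \b prevents matches inside longer identifiers.
_RISKY_FUNCTION = re.compile(r"\b(?:url|expression)\s*\(", re.IGNORECASE)


def value_is_suspicious(value):
    """Return True if a CSS property value calls url() or expression()."""
    return _RISKY_FUNCTION.search(value) is not None


print(value_is_suspicious("url(javascript:alert(1))"))
print(value_is_suspicious("expression(alert(1))"))
print(value_is_suspicious("red"))
```

What happens to a flagged value (dropping it, or checking the URL against the protocol and domain whitelists) is not specified here and is left out of the sketch.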

url_protocol
------------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to check the URL protocol of a [linkable_attribute], with one of the following values:

* 0: do not check the URL protocol
* 1: output whitelisted URL protocols

url_domain
----------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

Specifies whether to check the URL domain of a [linkable_attribute] when the [url_protocol] check is enabled, with one of the following values:

* 0: do not check the URL domain
* 1: output whitelisted URL domains

iframe_url_protocol
-------------------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

The same as [url_protocol], but only for `iframe.src`.

iframe_url_domain
-----------------
**Value:** *0 or 1*

**Default:** *0*

**Context:** querystring

The same as [url_domain], but only for `iframe.src`.

Copyright
=========
[ngx_http_html_sanitize_module] is licensed under the Apache License, Version 2.0. See [LICENSE] for the complete license text.

Copyright 2017, By detailyang "Yang Bingwu" Youzan Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Notice
------
Note that [ngx_http_html_sanitize_module] bundles several projects that carry different licenses:

google/gumbo-parser: [https://github.com/google/gumbo-parser](https://github.com/google/gumbo-parser)

hackers-painters/katana-parser: [https://github.com/hackers-painters/katana-parser](https://github.com/hackers-painters/katana-parser)
