Name: ngx_http_html_sanitize_module
Owner: ??
Description: An nginx HTTP module that sanitizes HTML5 using whitelisted elements, whitelisted attributes and whitelisted CSS properties
Created: 2017-04-19 09:39:12.0
Updated: 2018-01-19 02:34:25.0
Pushed: 2017-10-09 08:46:18.0
Homepage: null
Size: 2263
Language: HTML
ngx_http_html_sanitize_module is based on Google's gumbo-parser as the HTML5 parser and hackers-painters's katana-parser as the inline CSS parser. It sanitizes HTML with whitelisted elements, whitelisted attributes and whitelisted CSS properties.
Production Ready :-)
The following is an example nginx configuration that follows the element list at https://dev.w3.org/html5/html-author/#the-elements:
server {
listen 8888;
location = /sanitize {
# Explicitly set utf-8 encoding
add_header Content-Type "text/html; charset=UTF-8";
client_body_buffer_size 10M;
client_max_body_size 10M;
html_sanitize on;
# Check https://dev.w3.org/html5/html-author/#the-elements
# Root Element
html_sanitize_element html;
# Document Metadata
html_sanitize_element head title base link meta style;
# Scripting
html_sanitize_element script noscript;
# Sections
html_sanitize_element body section nav article aside h1 h2 h3 h4 h5 h6 header footer address;
# Grouping Content
html_sanitize_element p hr br pre dialog blockquote ol ul li dl dt dd;
# Text-Level Semantics
html_sanitize_element a q cite em strong small mark dfn abbr time progress meter code var samp kbd sub sup span i b bdo ruby rt rp;
# Edits
html_sanitize_element ins del;
# Embedded Content
html_sanitize_element figure img iframe embed object param video audio source canvas map area;
# Tabular Data
html_sanitize_element table caption colgroup col tbody thead tfoot tr td th;
# Forms
html_sanitize_element form fieldset label input button select datalist optgroup option textarea output;
# Interactive Elements
html_sanitize_element details command bb menu;
# Miscellaneous Elements
html_sanitize_element legend div;
html_sanitize_attribute *.style;
html_sanitize_attribute a.href a.hreflang a.name a.rel;
html_sanitize_attribute col.span col.width colgroup.span colgroup.width;
html_sanitize_attribute data.value del.cite del.datetime;
html_sanitize_attribute img.align img.alt img.border img.height img.src img.width;
html_sanitize_attribute ins.cite ins.datetime li.value ol.reversed ol.start ol.type ul.type;
html_sanitize_attribute table.align table.bgcolor table.border table.cellpadding table.cellspacing table.frame table.rules table.sortable table.summary table.width;
html_sanitize_attribute td.abbr td.align td.axis td.colspan td.headers td.rowspan td.valign td.width;
html_sanitize_attribute th.abbr th.align th.axis th.colspan th.rowspan th.scope th.sorted th.valign th.width;
html_sanitize_style_property color font-size;
html_sanitize_url_protocol http https tel;
html_sanitize_url_domain *.google.com google.com;
html_sanitize_iframe_url_protocol http https;
html_sanitize_iframe_url_domain facebook.com *.facebook.com;
}
}
It is recommended to use the following command to sanitize HTML5:
$ curl -X POST -d "<h1>Hello World </h1>" "http://127.0.0.1:8888/sanitize?element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0"
<h1>Hello World </h1>
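
The same request can also be built from code. The following is a minimal Python sketch, assuming the server from the configuration above listens on 127.0.0.1:8888. The helper names `build_sanitize_url` and `build_sanitize_request` are hypothetical and not part of the module.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: matches `listen 8888` and `location = /sanitize` above.
SANITIZE_ENDPOINT = "http://127.0.0.1:8888/sanitize"


def build_sanitize_url(endpoint=SANITIZE_ENDPOINT, **modes):
    """Append querystring modes (e.g. element=2, attribute=1) to the endpoint."""
    query = urlencode(modes)
    return f"{endpoint}?{query}" if query else endpoint


def build_sanitize_request(html, **modes):
    """Build (but do not send) a POST request whose body is the HTML to sanitize."""
    return Request(
        build_sanitize_url(**modes),
        data=html.encode("utf-8"),
        method="POST",
    )


request = build_sanitize_request(
    "<h1>Hello World </h1>",
    element=2, attribute=1, style_property=1, style_property_value=1,
    url_protocol=1, url_domain=0, iframe_url_protocol=1, iframe_url_domain=0,
)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` sends it; that step needs a running nginx with the module loaded, so it is left out of the sketch.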
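The querystring `element=2&attribute=1&style_property=1&style_property_value=1&url_protocol=1&url_domain=0&iframe_url_protocol=1&iframe_url_domain=0` means the following:

* `element=2`: output only whitelisted elements
* `attribute=1`: output all attributes
* `style_property=1`: output all style properties
* `style_property_value=1`: check style property values for the URL function and IE's expression function
* `url_protocol=1`: check the URL protocol by html_sanitize_url_protocol
* `url_domain=0`: do not check the URL domain
* `iframe_url_protocol=1`: check the URL protocol of `iframe.src` by html_sanitize_iframe_url_protocol
* `iframe_url_domain=0`: do not check the URL domain of `iframe.src` by html_sanitize_iframe_url_domain

With ngx_http_html_sanitize_module, we can control which HTML5 elements, attributes and inline CSS properties are output, by directive and by querystring, as follows: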

Whitelisted element
-------------------

* disable element:

If we do not want to output any element, we can do the following:
$ curl -X POST -d "<h1>h1</h1>" "http://127.0.0.1:8888/sanitize?element=0"

* enable element:

If we want to output any element, we can do the following:
$ curl -X POST -d "<h1>h1</h1><h7>h7</h7>" "http://127.0.0.1:8888/sanitize?element=1"

* enable whitelisted element:

If we want to output only whitelisted elements, we can do the following:
$ curl -X POST -d “

Whitelisted attribute
---------------------

* disable attribute:

If we do not want to output any attribute, we can do the following:
$ curl -X POST -d '<h1 ha="ha">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=0"

* enable attribute:

If we want to output any attribute, we can do the following:
$ curl -X POST -d '<h1 ha="ha">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1"

* enable whitelisted attribute:

If we want to output only whitelisted attributes, we can do the following:
$ curl -X POST -d '<img src="/" ha="ha" />' "http://127.0.0.1:8888/sanitize?element=1&attribute=2"

Whitelisted style property
--------------------------

* disable style property:

If we do not want to output any style property, we can do the following:
$ curl -X POST -d '<h1 style="color:red;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=0"

* enable style property:

If we want to output any style property, we can do the following:
$ curl -X POST -d '<h1 style="color:red;text-align:center;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=1"

* enable whitelisted style property:

If we want to output only whitelisted style properties, we can do the following:
$ curl -X POST -d '<h1 style="color:red;text-align:center;">h1</h1>' "http://127.0.0.1:8888/sanitize?element=1&attribute=1&style_property=2"
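
The three `style_property` modes can be modelled in a few lines. This is only a sketch of the decision logic, not the module's implementation: the real module parses inline CSS with katana-parser, whereas the hypothetical `filter_style` helper below splits naively on `;` and `:`, which is an assumption that breaks on values containing those characters.

```python
def filter_style(style, mode, whitelist=frozenset()):
    """Model of style_property: 0 = drop all, 1 = keep all, 2 = keep whitelisted."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep or not name:
            continue  # skip empty or malformed declarations
        if mode == 1 or (mode == 2 and name in whitelist):
            kept.append(f"{name}:{value};")
    return "".join(kept)


# Whitelist taken from the example configuration:
#   html_sanitize_style_property color font-size;
WHITELIST = frozenset({"color", "font-size"})
STYLE = "color:red;text-align:center;"

print(repr(filter_style(STYLE, 0, WHITELIST)))  # ''
print(repr(filter_style(STYLE, 1, WHITELIST)))  # 'color:red;text-align:center;'
print(repr(filter_style(STYLE, 2, WHITELIST)))  # 'color:red;'
```

The printed strings are the output of this model only; the exact serialization produced by the module may differ.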

Description
===========

[ngx_http_html_sanitize_module] is implemented on top of [gumbo-parser] and [katana-parser]. It combines the two and runs them inside [nginx] as a central web service, maintained by security professionals, so that differences between client languages no longer matter. For higher performance (see the [benchmark](#benchmark)), it is recommended to write a language-level library that wraps the pure C libraries, which avoids the overhead of network transmission.

Benchmark
=========

Testing with `wrk -s benchmarks/shot.lua -d 60s "http://127.0.0.1:8888"` on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 64GB of memory:

| Name | Size | Avg Latency | QPS |
| ----------- |:-------------:| -----:| -----:|
| [hacker_news.html](/benchmarks/hacker_news.html) | 30KB | 9.06ms | 2921.82 |
| [baidu.html](/benchmarks/baidu.html) | 76KB | 13.41ms | 1815.75 |
| [arabic_newspapers.html](/benchmarks/arabic_newspapers.html) | 78KB | 16.58ms | 1112.70 |
| [bbc.html](/benchmarks/bbc.html) | 115KB | 17.96ms | 993.12 |
| [xinhua.html](/benchmarks/xinhua.html) | 323KB | 33.37ms | 275.39 |
| [google.html](/benchmarks/google.html) | 336KB | 26.78ms | 351.54 |
| [yahoo.html](/benchmarks/yahoo.html) | 430KB | 29.16ms | 323.04 |
| [wikipedia.html](/benchmarks/wikipedia.html) | 511KB | 57.62ms | 160.10 |
| [html5_spec.html](/benchmarks/html5_spec.html) | 7.7MB | 1.63s | 2.00 |
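
To compare rows of different sizes, the table can be converted into approximate sanitized throughput (size times QPS). The sketch below is a derived estimate, not a measured result. It assumes that each request posts the whole file as its body, that 1 KB is 1024 bytes, and that the rounded sizes in the table are accurate enough; `throughput_mib_per_s` is a hypothetical helper name.

```python
# (file, request body size in KB, QPS) copied from the table above.
# Assumptions: 1 KB = 1024 bytes, 7.7MB = 7.7 * 1024 KB, and every request
# posts the whole file as its body.
RESULTS = [
    ("hacker_news.html", 30, 2921.82),
    ("baidu.html", 76, 1815.75),
    ("arabic_newspapers.html", 78, 1112.70),
    ("bbc.html", 115, 993.12),
    ("xinhua.html", 323, 275.39),
    ("google.html", 336, 351.54),
    ("yahoo.html", 430, 323.04),
    ("wikipedia.html", 511, 160.10),
    ("html5_spec.html", 7.7 * 1024, 2.00),
]


def throughput_mib_per_s(size_kb, qps):
    """Approximate input volume processed per second, in MiB."""
    return size_kb * qps / 1024


for name, size_kb, qps in RESULTS:
    print(f"{name:24s} {throughput_mib_per_s(size_kb, qps):7.1f} MiB/s")
```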

TODO
====

* gumbo-parser (hard): Improvement with SSE-4.2 to speed up string processing
* gumbo-parser (hard): Additional algorithm-level performance improvements
* katana-parser (hard): Improvement with SSE-4.2 to speed up string processing
* katana-parser (hard): Additional algorithm-level performance improvements
* directive (optional): Add mode directives to carefully control HTML5 and inline CSS output
* html_sanitize_attribute (hard): Replace the current hash lookup with a new algorithm to reduce memory allocation
* tests (easy): Pass more XSS security tests
* querystring (optional): Allow an externally supplied whitelist in the querystring to control whitelisted elements, attributes and style properties

These performance optimization ideas come from the on-CPU flame graph below:

[![Flamegraph](https://cdn.rawgit.com/youzan/ngx_http_html_sanitize_module/master/flamegraphs/html_sanitize_gumbo_parse.svg)](https://cdn.rawgit.com/youzan/ngx_http_html_sanitize_module/master/flamegraphs/html_sanitize_gumbo_parse.svg)

Directive
=========

html_sanitize
-------------

**syntax:** *html_sanitize on | off*

**default:** *html_sanitize on*

**context:** *location*

Specifies whether to enable the HTML sanitize handler in the location context.

html_sanitize_hash_max_size
---------------------------

**syntax:** *html_sanitize_hash_max_size size*

**default:** *html_sanitize_hash_max_size 2048*

**context:** *location*

Sets the maximum size of the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol and iframe_url_domain hash tables.

html_sanitize_hash_bucket_size
------------------------------

**syntax:** *html_sanitize_hash_bucket_size size*

**default:** *html_sanitize_hash_bucket_size 32|64|128*

**context:** *location*

Sets the bucket size for the element, attribute, style_property, url_protocol, url_domain, iframe_url_protocol and iframe_url_domain hash tables. The default value depends on the size of the processor's cache line.

html_sanitize_element
---------------------

**syntax:** *html_sanitize_element element ...*

**default:** -

**context:** *location*

Sets the whitelisted HTML5 elements. The whitelist applies when the querystring [element] is set to whitelist mode, as in the following example:
html_sanitize_element html head body;

html_sanitize_attribute
-----------------------

**syntax:** *html_sanitize_attribute attribute ...*

**default:** -

**context:** *location*

Sets the whitelisted HTML5 attributes. The whitelist applies when the querystring [attribute] is set to whitelist mode, as in the following example:
html_sanitize_attribute a.href h1.class;
The attribute format must be `element.attribute`. The forms `*.attribute` (prefix asterisk) and `element.*` (suffix asterisk) are also supported.
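
The three accepted forms can be illustrated with a small lookup. This is a minimal sketch under stated assumptions, not the module's hash-table implementation: the helper name `attribute_allowed` is hypothetical, and case-insensitive matching is an assumption.

```python
def attribute_allowed(element, attribute, whitelist):
    """Return True if element.attribute matches any of the three whitelist forms."""
    # Assumption: names are compared case-insensitively.
    element, attribute = element.lower(), attribute.lower()
    candidates = (
        f"{element}.{attribute}",  # exact entry, e.g. a.href
        f"*.{attribute}",          # prefix asterisk, e.g. *.style
        f"{element}.*",            # suffix asterisk, e.g. img.*
    )
    return any(candidate in whitelist for candidate in candidates)


whitelist = {"a.href", "h1.class", "*.style", "img.*"}
print(attribute_allowed("a", "href", whitelist))       # True
print(attribute_allowed("div", "style", whitelist))    # True
print(attribute_allowed("img", "onerror", whitelist))  # True
print(attribute_allowed("a", "onclick", whitelist))    # False
```

The third call shows why `element.*` deserves care: it admits every attribute of that element, including event handlers.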

html_sanitize_style_property
----------------------------

**syntax:** *html_sanitize_style_property property ...*

**default:** -

**context:** *location*

Sets the whitelisted CSS properties. The whitelist applies when the querystring [style_property] is set to whitelist mode, as in the following example:
html_sanitize_style_property color background-color;

html_sanitize_url_protocol
--------------------------

**syntax:** *html_sanitize_url_protocol [protocol] ...*

**default:** -

**context:** *location*

Sets the allowed URL protocols for each [linkable attribute](#linkable_attribute). The check applies only when the URL is absolute rather than relative and the querystring [url_protocol] check mode is enabled, as in the following example:
html_sanitize_url_protocol http https tel;

html_sanitize_url_domain
------------------------

**syntax:** *html_sanitize_url_domain domain ...*

**default:** -

**context:** *location*

Sets the allowed URL domains for each [linkable attribute](#linkable_attribute). The check applies only when the URL is absolute rather than relative and both the querystring [url_protocol] check mode and the querystring [url_domain](#url_domain) check mode are enabled, as in the following example:
html_sanitize_url_domain *.google.com google.com;
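
How the protocol and domain checks combine can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names `domain_allowed` and `url_allowed` are hypothetical, `*.google.com` is assumed to cover subdomains only (which is why the example above also lists `google.com`), and an absolute URL without a host is assumed to fail the domain check.

```python
from urllib.parse import urlsplit


def domain_allowed(host, domains):
    """Match a host against exact entries and '*.' wildcard entries."""
    host = host.lower()
    for pattern in domains:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # Assumption: "*.google.com" covers subdomains only, not "google.com".
            suffix = pattern[1:]  # ".google.com"
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


def url_allowed(url, protocols, domains, url_protocol=1, url_domain=1):
    """Apply the url_protocol and url_domain checks to one attribute value."""
    parts = urlsplit(url)
    if not parts.scheme:
        return True  # no scheme: treated as relative, so no check applies
    if not url_protocol:
        return True  # protocol check off; the domain check depends on it
    if parts.scheme not in protocols:
        return False
    if not url_domain:
        return True
    # Assumption: an absolute URL without a host (e.g. tel:) fails the domain check.
    return domain_allowed(parts.hostname or "", domains)


PROTOCOLS = {"http", "https", "tel"}
DOMAINS = {"*.google.com", "google.com"}
print(url_allowed("/about", PROTOCOLS, DOMAINS))                     # True
print(url_allowed("https://mail.google.com/", PROTOCOLS, DOMAINS))   # True
print(url_allowed("https://example.com/", PROTOCOLS, DOMAINS))       # False
print(url_allowed("javascript:alert(1)", PROTOCOLS, DOMAINS))        # False
```

Note that this sketch treats every URL without a scheme as relative, including protocol-relative URLs such as `//host/path`; whether the module does the same is not stated here.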

html_sanitize_iframe_url_protocol
---------------------------------

**syntax:** *html_sanitize_iframe_url_protocol [protocol] ...*

**default:** -

**context:** *location*

The same as [html_sanitize_url_protocol] but only for the `iframe.src` attribute:
html_sanitize_iframe_url_protocol http https tel;

html_sanitize_iframe_url_domain
-------------------------------

**syntax:** *html_sanitize_iframe_url_domain domain ...*

**default:** -

**context:** *location*

The same as [html_sanitize_url_domain] but only for the `iframe.src` attribute:
html_sanitize_iframe_url_domain *.facebook.com facebook.com;

linkable_attribute
==================

The linkable attributes are the following:

* a.href
* blockquote.cite
* q.cite
* del.cite
* img.src
* ins.cite
* iframe.src
* CSS URL function

querystring
===========

The querystring of the request URL controls the internal behavior of [ngx_http_html_sanitize_module].

document
--------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to add `<!DOCTYPE>` to the response body.

html
----

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to add `<html></html>` to the response body.

script
------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to allow `<script></script>`.

style
-----

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to allow `<style></style>`.

namespace
---------

**value:** *0, 1 or 2*

**default:** *0*

**context:** *querystring*

Specifies the namespace mode of gumbo-parser, with the following values:

* GUMBO_NAMESPACE_HTML: 0
* GUMBO_NAMESPACE_SVG: 1
* GUMBO_NAMESPACE_MATHML: 2

context
-------

**value:** *[0, 150)*

**default:** *38 (GUMBO_TAG_DIV)*

**context:** *querystring*

Specifies the context tag of gumbo-parser, using a value from the file [tag_enum.h](tag_enum.h).

element
-------

**value:** *0, 1 or 2*

**default:** *0*

**context:** *querystring*

Specifies the element output mode, with the following values:

* 0: do not output any element
* 1: output all elements
* 2: output whitelisted elements

attribute
---------

**value:** *0, 1 or 2*

**default:** *0*

**context:** *querystring*

Specifies the attribute output mode, with the following values:

* 0: do not output any attribute
* 1: output all attributes
* 2: output whitelisted attributes

style_property
--------------

**value:** *0, 1 or 2*

**default:** *0*

**context:** *querystring*

Specifies the CSS property output mode, with the following values:

* 0: do not output any CSS property
* 1: output all CSS properties
* 2: output whitelisted CSS properties

style_property_value
--------------------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether CSS property values are checked, with the following values:

* 0: do not check the CSS property's value
* 1: check the CSS property's value for the [URL] function and IE's expression function to avoid XSS injection
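
The kind of value check described for mode 1 can be sketched with regular expressions. This is an illustration only: the module parses CSS with katana-parser, and exactly how it judges a `url()` value is not described here. The sketch assumes that `expression(...)` is always rejected and that a `url(...)` argument is rejected when it carries a scheme outside an allowed set; `style_value_allowed` is a hypothetical helper, and regex matching like this does not handle CSS escapes or comments.

```python
import re
from urllib.parse import urlsplit

# IE's expression() function, in any letter case.
EXPRESSION_RE = re.compile(r"expression\s*\(", re.IGNORECASE)
# The argument of url(...), with optional quotes.
URL_RE = re.compile(r"url\(\s*['\"]?\s*([^'\")]*?)\s*['\"]?\s*\)", re.IGNORECASE)


def style_value_allowed(value, protocols=("http", "https")):
    """Return False for expression() or url() with a non-allowed scheme."""
    if EXPRESSION_RE.search(value):
        return False
    for url in URL_RE.findall(value):
        scheme = urlsplit(url).scheme
        # Assumption: URLs without a scheme are relative and pass.
        if scheme and scheme not in protocols:
            return False
    return True


print(style_value_allowed("red"))                                # True
print(style_value_allowed('url("https://example.com/a.png")'))   # True
print(style_value_allowed("url(javascript:alert(1))"))           # False
print(style_value_allowed("expression(alert(1))"))               # False
```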

url_protocol
------------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to check the URL protocol of each [linkable_attribute], with the following values:

* 0: do not check the URL protocol
* 1: output only URLs with a whitelisted protocol

url_domain
----------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

Specifies whether to check the URL domain of each [linkable_attribute] when the [url_protocol] check is enabled, with the following values:

* 0: do not check the URL domain
* 1: output only URLs with a whitelisted domain

iframe_url_protocol
-------------------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

The same as [url_protocol] but only for `iframe.src`.

iframe_url_domain
-----------------

**value:** *0 or 1*

**default:** *0*

**context:** *querystring*

The same as [url_domain] but only for `iframe.src`.

Copyright
=========

[ngx_http_html_sanitize_module] is licensed under the Apache License, Version 2.0. See [LICENSE] for the complete license text.

Copyright 2017, By detailyang "Yang Bingwu" Youzan Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Notice
------

Note that [ngx_http_html_sanitize_module] bundles several projects that carry different licenses:

* google/gumbo-parser: [https://github.com/google/gumbo-parser](https://github.com/google/gumbo-parser)
* hackers-painters/katana-parser: [https://github.com/hackers-painters/katana-parser](https://github.com/hackers-painters/katana-parser)