URL encoding means encoding certain characters in a URL by
replacing each of them with one or more character triplets, each consisting of
the percent character "%" followed by two hexadecimal digits. The two
hexadecimal digits of each triplet represent the numeric value of one byte of
the replaced character.
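The triplet format can be sketched in a few lines of Python. The helper names
percent_encode_char and percent_decode_triplet are hypothetical and only serve
this illustration; the sketch assumes a single ASCII character as input.

```python
def percent_encode_char(char: str) -> str:
    """Return the percent-encoded triplet for one ASCII character."""
    if len(char) != 1 or ord(char) > 0x7F:
        raise ValueError("expected exactly one ASCII character")
    # "%" followed by the character's numeric value as two uppercase hex digits
    return f"%{ord(char):02X}"


def percent_decode_triplet(triplet: str) -> str:
    """Turn a triplet such as '%2F' back into the character it stands for."""
    if len(triplet) != 3 or triplet[0] != "%":
        raise ValueError("expected '%' followed by two hexadecimal digits")
    return chr(int(triplet[1:], 16))


print(percent_encode_char(" "))       # %20
print(percent_encode_char("/"))       # %2F
print(percent_decode_triplet("%3F"))  # ?
```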
The term URL encoding is a bit inexact because the encoding procedure is not
limited to URLs (Uniform Resource Locators), but can also be applied to any
other URIs (Uniform Resource Identifiers) such as URNs (Uniform Resource Names).
Therefore, the term percent-encoding should be preferred.
The characters allowed in a URI are either reserved or unreserved
(or a percent character as part of a percent-encoding).
Reserved characters are those characters that sometimes have special
meaning, while unreserved characters have no such
meaning. Using percent-encoding, characters which otherwise would not be allowed
are represented using allowed characters.
The sets of reserved and unreserved characters and the circumstances under which
certain reserved characters have special meaning
have changed slightly with each revision of specifications that govern URIs and
URI schemes.
According to RFC 3986, the characters in a URL have to be taken from a defined
set of unreserved and reserved ASCII characters. Any other characters are not
allowed in a URL.
The unreserved characters can be encoded, but should not be encoded. The
unreserved characters are:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
The reserved characters have to be encoded only under certain circumstances. The
reserved characters are:
! * ' ( ) ; : @ & = + $ , / ? # [ ]
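As a rough illustration of the two sets, the following sketch uses quote from
Python's standard urllib.parse module. The constant names UNRESERVED and
RESERVED are made up for this example, and the reserved set is the one from
RFC 3986 (the percent character is left out because it only introduces a
triplet). Passing safe="" asks quote to encode every character that is not
unreserved.

```python
import string
from urllib.parse import quote

# Illustrative constants, not part of any library
UNRESERVED = string.ascii_letters + string.digits + "-_.~"
RESERVED = "!*'();:@&=+$,/?#[]"

# Unreserved characters pass through unchanged
print(quote(UNRESERVED, safe="") == UNRESERVED)  # True

# Reserved characters are encoded when they must not act as delimiters,
# e.g. "&" and "=" inside a single query value
print(quote("a&b=c", safe=""))  # a%26b%3Dc

# With safe="" every reserved character becomes a triplet
encoded_reserved = quote(RESERVED, safe="")
print(encoded_reserved)
```

Whether a reserved character has to be encoded depends on the URI component it
appears in, so a real application would choose the safe argument per component
rather than use an empty one everywhere.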
RFC 3986 does not prescribe a single character encoding table according to
which non-ASCII characters (e.g. the umlauts ä, ö, ü) must be encoded in all
URIs. As URL encoding involves a pair of hexadecimal digits and as a pair of
hexadecimal digits is equivalent to 8 bits, it would theoretically be possible
to use one of the 8-bit code pages for non-ASCII characters (e.g. ISO-8859-1
for umlauts).
On the other hand, as many languages have their own 8-bit code page, handling
all these different 8-bit code pages would be quite cumbersome. Some languages
do not even fit into an 8-bit code page (e.g. Chinese). Therefore, RFC 3986
recommends that new URI schemes use the UTF-8 character encoding table (defined
in RFC 3629) for non-ASCII characters.
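The difference between an 8-bit code page and UTF-8 can be seen in a small
sketch using quote and unquote from Python's urllib.parse module. The variable
names are illustrative only.

```python
from urllib.parse import quote, unquote

umlaut = "ä"
utf8_form = quote(umlaut, encoding="utf-8")         # two bytes, two triplets
latin1_form = quote(umlaut, encoding="iso-8859-1")  # one byte, one triplet
print(utf8_form)    # %C3%A4
print(latin1_form)  # %E4

# The decoder has to assume the same table as the encoder
print(unquote(latin1_form, encoding="iso-8859-1"))  # ä
misread = unquote(latin1_form, encoding="utf-8")    # invalid UTF-8, replaced by U+FFFD

# A Chinese character has no place in ISO-8859-1 but is fine in UTF-8
chinese = "中"
chinese_utf8 = quote(chinese, encoding="utf-8")
print(chinese_utf8)  # %E4%B8%AD
try:
    quote(chinese, encoding="iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)  # False
```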
The following tool takes this into account and offers to choose between the
ASCII character encoding table and the UTF-8 character
encoding table. If you opt for the ASCII character encoding table, a warning
message will pop up if the URL encoded/decoded text
contains non-ASCII characters.
When data that has been entered into HTML forms is submitted, the form field
names and values are encoded and sent to the server in an
HTTP request message using method GET or POST, or, historically, via email. The
encoding used by default is based on a very early version
of the general URI percent-encoding rules, with a number of modifications such
as newline normalization and replacing spaces with "+" instead of "%20". The
MIME type of data encoded this way is application/x-www-form-urlencoded,
and it is currently defined (still in a very outdated manner) in the HTML and
XForms specifications. In addition, the
CGI specification contains rules for how web servers decode data of this type
and make it available to applications.
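A minimal sketch of this form encoding, using urlencode, quote and parse_qs
from Python's urllib.parse module. The field names and values are invented for
the example, UTF-8 is assumed for non-ASCII characters, and newline
normalization is not shown.

```python
from urllib.parse import parse_qs, quote, urlencode

# Hypothetical form fields
fields = {"name": "Jane Doe", "city": "Köln"}

# urlencode applies the form rules: a space becomes "+"
form_body = urlencode(fields)
print(form_body)  # name=Jane+Doe&city=K%C3%B6ln

# General percent-encoding turns the same space into "%20"
print(quote("Jane Doe"))  # Jane%20Doe

# Roughly what a server-side decoder does with the submitted data
decoded = parse_qs(form_body)
print(decoded)  # {'name': ['Jane Doe'], 'city': ['Köln']}
```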
When sent in an HTTP GET request, application/x-www-form-urlencoded data is
included in the query component of the request URI.
When sent in an HTTP POST request or via email, the data is placed in the body
of the message, and the name of the media type is included in the message's
Content-Type header.
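The two placements can be compared with Request objects from Python's
urllib.request module. The sketch only builds the requests and does not send
them; the address http://example.com/search and the field names are
placeholders.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

form_body = urlencode({"q": "percent encoding", "page": "2"})

# GET: the encoded data becomes the query component of the request URI
get_request = Request("http://example.com/search?" + form_body)
print(get_request.get_method())                # GET
print(urlsplit(get_request.full_url).query)    # q=percent+encoding&page=2

# POST: the encoded data is the message body, and the media type is named
# in the Content-Type header
post_request = Request(
    "http://example.com/search",
    data=form_body.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(post_request.get_method())               # POST
print(post_request.data)                       # b'q=percent+encoding&page=2'
print(post_request.get_header("Content-type")) # application/x-www-form-urlencoded
```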