Charles M. Kozierok The TCP-IP Guide

Подождите немного. Документ загружается.

that most people have with some of these “esoterics” has led to URLs being abused

through deliberate obscuration, to get people to visit “resources” they would normally want

to avoid.

URL Fragments

Technically, a “<fragment>” is not considered a formal part of the URL by the standards that

describe resource naming. The reason is that it only identifies a portion of a resource, and

is not part of the information required to identify the resource itself. It is not sent to the

server but retained by the client software, to guide it in how to display or use the resource.

Some would make a valid argument, however, that this distinction is somewhat arbitrary;

consider, for example, that the scheme itself is also used only by the client, as is the host

itself.

The most common example of a URL fragment is specifying a particular bookmark to

“scroll” to in displaying a Web page. In practice, a fragment identifier is often treated as if it

were part of a URL since it is part of the string that specifies a URL.

"Unsafe" Characters and Special Encodings

URLs are normally expressed in the standard US ASCII character set, which is the default

used by most TCP/IP application protocols. Certain characters in the set are called unsafe,

because they have special meaning in different contexts, and including them in a URL

would lead to ambiguity or problems in of how they should be interpreted. The “space”

character is the classical “unsafe” character because spaces are normally used to separate

URLs, so including one in a URL would break the URL into “pieces”. Other characters are

“unsafe” because they have special significance in a URL, such as the colon (“:”).

The “safe” characters in a URL are alphanumerics (A to Z, a to z and 0 to 9) and the

following special characters: the dollar sign (“$”), hyphen (“-”), underscore (“_”), period (“.”),

plus sign (“+”), exclamation point (“!”), asterisk (“*”), apostrophe (“'”), left parenthesis (“(”),

and right parenthesis (“)”). All other “unsafe” characters can be represented in a URL using

an encoding scheme consisting of a percent sign (“%”) followed by the hexadecimal ASCII

value of the character. The most common examples are given in Table 224.

Table 224: URL Special Character Encodings

Character Encoding Character Encoding Character Encoding

<space> %20 < %3C > %3E

# %23 % %25 { %7B

} %7D | %7C \ %5C

^ %5E ~ %7E [ %5B

] %5D `%60 ; %3B

/ %2F ? %3F : %3A

@ %40 = %3D % %26

When these sequences are encountered, they are interpreted as the literal character they

represent, without any “significance”. So, the URL “http://www.myfavesite.com/

are%20you%20there%3F” points to a file called “are you there?” on “www.myfavesite.com”.

The “%20” codes prevent the spaces from breaking up the URL, and the “3F” prevents the

question mark in the file name from being interpreted as a special URL character.

Note: Since the percent sign is used for this encoding mechanism, it itself is

“special”; any time it is seen the next values are interpreted as character

encodings, so to embed a literal percent sign, it must be encoded as “%25”.

Again, these encodings are sometimes abused for nefarious purposes, unfortunately, such

as using them for regular ASCII characters to obscure URLs.

URL Schemes (Applications / Access Methods) and Scheme-Specific

Syntaxes

Uniform Resource Locators (URLs) use a general syntax that describes the location and

method for accessing a TCP/IP resource. Each access method, called a scheme, has its

own specific URL syntax, including the various pieces of information required by the

method to identify a resource. RFC 1738 includes a description of the specific syntaxes

used by several popular URL schemes. Others have been defined in subsequent RFCs

using the procedure established for URL scheme registration.

Several of the URL schemes use the common Internet pattern given in my overview of URL

syntax. Other schemes use entirely different (usually simpler) structures based on their

needs. For reference, I will repeat the general syntax again here, as it will help you under-

stand the rest of the topic:

The following are the most common URL schemes and the scheme-specific syntaxes they

use.

World Wide Web / Hypertext Transfer Protocol URL Syntax (http)

The Web potentially uses most of the elements of the common Internet scheme syntax, as

follows:

http://<user>:<password>@<host>:<port>/<url-path>?<query>#<bookmark>

As discussed in the overview, the Web is the primary application using URLs today. A URL

can theoretically contain most of the common URL syntax elements, but in practice most

are omitted. Most URLs contain only a host and a path to a resource. The port number is

usually omitted, implying that the default value of 80 should be used. The “<query>”

construct is often used to pass arguments or information from the client to the Web server.

I have provided full details on how Web URLs are used in a separate topic in the section on

HTTP.

File Transfer Protocol URL Syntax (ftp)

The syntax for FTP URLs is:

ftp://<user>:<password>@<host>:<port>/<url-path>;type=<typecode>

FTP is an interactive command-based protocol, so it may seem odd to use a URL for FTP.

However, one of the most common uses of FTP is to access and read a single, particular

file, and this is what an FTP URL allows a client to do. The <user> and <password> are

used for login and may be omitted for anonymous FTP access. The port number is usually

omitted and defaults to the standard FTP control channel port, 21.

The “<url-path>” is interpreted as a directory structure and file name. The appropriate

“CWD” (“change working directory”) commands are issued to go to the specified directory,

and then a “RETR” (“retrieve”) command is issued for the named file. The optional “type”

parameter can be used to indicate the file type: “a” to specify an ASCII file retrieval or “i” for

an image (binary) file. The “type” parameter is often omitted from the URL, with the correct

mode being set automatically by the client based on the name of the file.

For example, consider this URL:

ftp://ftp.hardwarecompanyx.com/drivers/widgetdriver.zip

This is equivalent to starting an FTP client, making an anonymous FTP connection to

“ftp.hardwarecompanyx.com”, then changing to the “drivers” directory and retrieving the file

“widgetdriver.zip”. The client will retrieve the file in binary mode because it is a compressed

“zip” file.

It is also possible to use an FTP URL to get a listing of the files within a particular directory.

This allows a user to navigate an FTP server's directory structure using URL links to find the

file he or she wants, and then retrieve it. This is done by specifying a directory name for the

<url-path> and using the “type” parameter with a “<typecode>” of “d” to request a directory

listing. Again, the “type” parameter is usually omitted and the software figures out to send a

“LIST” command to the server when a directory name is given in a URL.

URL Syntax for Sending Electronic Mail (mailto)

A special syntax is defined to allow a URL to represent the command to send mail to a user:

mailto:<e-mail-address>

The e-mail address is in standard Internet form: “<username>@<domainname>”. This is

really an unusual type of URL because it does not really represent an object at all, though a

person can be considered a type of “resource”.

Note: Note that optional parameters, such as the subject of the e-mail, can also be

included in a mailto URL. This facility is not often used, however.

Gopher Protocol URL Syntax (gopher)

The syntax for the Gopher Protocol is similar to that of HTTP and FTP:

gopher://<host>:<port>/<gopher-path>

See the topic on the Gopher protocol for more information on Gopher paths and how the

protocol operates.

Network News / Usenet URL Syntaxes (news)

Two syntaxes are defined for URL specification of Usenet (NetNews):

news://<newsgroup-name>

news://<message-id>

Both of these URLs are used to access a Usenet newsgroup or a specific message, refer-

enced by message ID. Like the “mailto” scheme, this is a special type of URL because it

defines an access method but does not provide the detailed information to describe how to

locate a newsgroup or message.

By definition, the first form of this URL is interpreted as being “local”. So for example,

“news://alt.food.sushi” means “access the newsgroup alt.food.sushi on the local news

server, using the default news protocol”. The default news protocol is normally NNTP (see

below). The second URL form is global, because message IDs are unique on Usenet (or at

least, are supposed to be!)

Network News Transfer Protocol URL Syntax (nttp)

This is a different URL type for news access:

nntp://<host>:<port>/<newsgroup-name>/<article-number>

Unlike “news”, this URL form specifically requests the use of NNTP and identifies a

particular NNTP server. Then it tells the server which newsgroup to access and which

article number within that newsgroup. Note that articles are numbered using a different

sequence by each server, so this is still a “local” form of news addressing. The port number

defaults to 119.

Even though the “nntp” form seems to provide a more complete resource specification, the

“news” URL is more often used, because it is simpler. It's easier just to set up the appro-

priate NNTP server in the client software once than to specify it each time, since clients

usually only use one NNTP server.

Telnet URL Syntax (telnet)

This scheme is used to open a Telnet connection to a server. This is the syntax:

telnet://<user>:<password>@<host>:<port>

In practice, the user name and password are often omitted, which causes the Telnet server

to prompt for this information. Alternately, the “<user>” can be supplied and the password

left out (to prevent it being seen) and the server will prompt for just the password. The port

number defaults to the standard port for Telnet, 23, and is also often omitted.

This type of URL is interesting in that it identifies a resource that is not an object but rather

a service.

Local File URL Syntax (file)

This is a special URL type used for referring to files on a particular host computer. The

standard syntax is:

file://<host>:<url-path>

This type of URL is also somewhat interesting, in that it describes the location of an object

but not an access method. It is not sufficiently general to allow access to a file anywhere on

an internetwork, but is often used for referencing files on computers on a local area network

where names have been assigned to different devices.

A special syntax is also defined to refer specifically to files on the local computer:

file:///<url-path>

Here, the entire “//<host>:” element has been replaced by a set of three slashes, meaning

specifically, “look on the local host”.

Additional URL Syntax Rules

Additional syntax rules are often used by browsers to support the quirks of Microsoft

operating systems, especially for the “file” scheme. First, the backslashes used by Microsoft

Windows are expressed as forward slashes as required by TCP/IP. Second, since colons

are used in drive letters specifications in Microsoft operating systems, these are replaced

by the “vertical pipe” character, “|”, which “sorta looks like a colon” (play along, please. ☺)

So, to refer to the file “C:\WINDOWS\SYSTEM32\DRIVERS\ETC\HOSTS”, the following

URL could be used:

file:///C|/WINDOWS/SYSTEM32/DRIVERS/ETC/HOSTS

Note however that some browsers actually do allow the colon in the drive specification.

URL Relative Syntax and Base URLs

The Uniform Resource Locator syntax described in the first topic of this section is

sometimes said to specify an absolute URL. This is because the information in the URL is

sufficient to completely identify the resource. Absolute URLs thus have the property of

being context-independent, meaning that one can access and retrieve the resource using

the URL without any additional information required.

Since the entire point of a URL is to provide the information needed to locate and access a

resource, it makes sense that we would want them to be absolute in definition most of the

time. The problem with absolute URLs is that they can be long and cumbersome. There are

cases where many different resources need to be identified that have a relationship to each

other; the URLs for these resources often have many common elements. Using absolute

URLs in such situations leads to a lot of excess and redundant “verbiage”.

The Motivation for Relative URLs

In my overview of URIs I gave a “real world” analogy to a URL in the form of a description of

an “access method” and location for a person retrieving a book: “Take the train to

Albuquerque, then Bus #11 to 41 Albert Street, a red brick house owned by Joanne

Johnson. The book you want is the third from the right on the bottom of the bookshelf on the

second floor”.

What if I also wanted the same person to get a second book located in the same house on

the ground floor after getting the first one? Should I start by saying again “take the train to

Albuquerque, then Bus #11 to 41 Albert Street, a red brick house owned by Joanne

Johnson”? Why bother, when they are already there at that house? No, I would give a

second instruction in relative terms: “go back downstairs, and also get the blue book on the

wood table”. This instruction only makes sense in the context of the original one.

The same need arises in URLs. Consider a Web page located at “http://www.longdomain-

namesareirritating.com/index.htm” that has 37 embedded graphic images in it. The poor

guy stuck with maintaining this site doesn't want to have to put “http://www.longdomain-

namesareirritating.com/” in front of the URL of every image.

Similarly, if we have just taken a directory listing at “ftp://ftp.somesitesomewhere.org/very/

deep/directory/structures/also/stink/” and we want to explore the parent directory, we would

like to just say “go up one level” without having to say “ftp://ftp.somesitesomewhere.org/

very/deep/directory/structures/also/”.

Creating and Interpreting Relative URLs

It is for these reasons that URL syntax was extended to include a relative form. In simplest

terms, a relative URL is the same as an absolute URL but with pieces of information omitted

that are implied by context. Like our “go downstairs” instruction, a relative URL does not by

itself contain enough information to specify a resource. A relative URL must be interpreted

within a context that provides the missing information.

The context needed to find a resource from a relative URL is provided in the form of a base

URL that provides the missing information. A base URL must be either a specific absolute

URL, or itself a relative URL that refers to some other absolute base. The base URL may be

either explicitly stated or may be inferred from use. The RFCs dealing with URLs define

three methods for determining the base URL, which are arranged into the following

precedence:

1. Base URL Within Document: Some documents allow the base URL to be explicitly

stated. If present, this specification is used for any relative URLs in the document.

2. Base URL From Encapsulating Entity: In cases where no explicit base URL is

specified in a document, but the document is part of a higher-level entity enclosing it,

the base URL is the URL of the “parent” document. For example, a document within a

body part of a MIME multipart message can use the URL of the message as a whole

as the base URL for relative references.

3. Base URL From Retrieval URL: If neither of those two methods are feasible, the

base URL is inferred from the URL used to retrieve the document containing the

relative URL.

Of these three methods, #1 and #3 are the most common. HTML, the language used for the

Web, allows a base URL to be explicitly stated which removes any doubt about how relative

URLs are to be interpreted. Failing this, method #3 is commonly used for images and other

links in HTML documents that are specified in relative terms.

For example, let's go back to the poor slob maintaining “http://www.longdomainnamesareir-

ritating.com/index.htm”. By default, any images referenced from that “index.htm” HTML

document can use relative URLs—the base URL will be assumed from the name of the

document itself. So he can just say “companylogo.gif” instead of “http://www.longdomain-

namesareirritating.com/companylogo.gif”, as long as that file is in the same directory on the

same server as “index.htm”.

If all three of these methods fail for whatever reason, then no base URL can be determined.

Relative URLs in such a document will be interpreted as absolute URLs, and since they do

not contain complete information, they will not work properly.

Also, relative URLs only have meaning for certain URL schemes. For others, they make no

sense and cannot be used. In particular, relative URLs are never used for the “telnet”,

“mailto” and “news” schemes. They are very commonly used for HTTP documents, and

may also be used for FTP and file URLs.

Key Concept: Regular URLs are absolute, meaning that they include all of the infor-

mation needed to fully specify how to access a resource. In situations where many

resources need to be accessed that are approximately in the same place or are

related in some way, completely specifying a URL can be inefficient. Instead, relative URLs

can be used, which specify how to access a resource relative to the location of another one.

A relative URL can only be interpreted within the context of a base URL that provides any

information missing from the relative reference.

Practical Interpretation of Relative URLs

The description above probably seems confusing, but relative URLs are actually fairly easy

to understand, because they are interpreted in a rather “common sense” way. You simply

take the base URL and the relative URL, and you substitute whatever information is in the

relative URL for the appropriate information in the base URL to get the resulting equivalent

absolute reference. In so doing, you must “drop” any elements that are more specific than

the ones being replaced.

What do I mean by “more specific”? Well, most URLs can be considered to move from

“most general” to “most specific” in terms of the location they specify. As you go from left to

right, you go through the host name, then high-level directories, subdirectories, the file

name, and optionally, parameters/query/fragment applied to the file name. If a relative URL

specifies a new file name, it replaces the file name in the base URL, and any parameters/

query/fragment are dropped as they no longer have meaning given that the file name has

changed. If the relative URL changes the host name, the entire directory structure, file

name and everything else “to the right” of the host name “goes away”, replaced with any

that might have been included in the new host name specification.

This is hard to explain in words but easy to understand with a few examples. Let's assume

we start with the following explicit base URL:

http://site.net/dir1/subdir1/file1?query1#bookmark1

Table 225 shows some example of relative URLs and how they would be interpreted.

Table 225: Relative URL Specifications and Absolute Equivalents

Relative URL

Equivalent Absolute

URL

Explanation

#bookmark2

http://site.net/dir1/subdir1/

file1?query1#bookmark2

The URL is the same except that the bookmark is

different. This can be used to reference different

places in the same document in HTML.

(Technically, the URL has not changed here, since

the “fragment” (bookmark) is not part of the actual

URL. A Web browser given a new bookmark name

will usually not try to re-access the resource.)

?query2

http://site.net/dir1/subdir1/

file1?query2

The same file but with a different query string. Note

that the bookmark reference from the base URL is

“stripped off”.

file2

http://site.net/dir1/subdir1/

file2

Here we have referred to a file using the name

“file2”, which replaces “file1” in the base URL. Here

both the query and bookmark are removed.

/file2 http://site.net/file2

Since a single slash was included, this means “file2”

is in the root directory; this relative URL replaces the

entire <url-path> of the base URL.

.. http://site.net/dir1/

The pair of dots refers to the parent directory of the

one in the base URL. Since the directory in the base

URL is “dir1/subdir1”, this refers to “dir1/”.

../file2 http://site.net/dir1/file2

Specifies that we should go up to the parent directory

to find the file “file2” in “dir1”.

../subdir2/file2

http://site.net/dir1/subdir2/

file2

Go up one directory with “..”, then enter the subdi-

rectory “subdir2” to find “file2”.

../../dir2/subdir2/file2

http://site.net/dir2/subdir2/

file2

Same thing as above but going up two directory

levels, then down through “dir2” and “subdir2” to find

“file2”.

//file2 http://file2

Two slashes means that “file2” replaces the host

name, causing everything to the right of the host

name to be stripped. This is probably not what was

intended, and shows how important it is to watch

those slashes. ☺

//www.newsite.net/

otherfile.htm

http://www.newsite.net/

otherfile.htm

Here everything but the scheme has been replaced.

(In practice this form of relative URL is not that

common—the scheme is usually included if the site

name is specified, for completeness.)

file2?query2#bookm

ark2

http://site.net/dir1/subdir1/

file2?query2#bookmark2

Here we replace the file name, query name and

bookmark name.

ftp://differentsite.net/

whatever

ftp://differentsite.net/

whatever

Using a new scheme forces the URL to be inter-

preted as absolute.

Improving Document Portability Using Relative URLs

There is one other very important benefit of using relative URLs: avoiding absolute URLs in

a document allows it to be more portable by eliminating “hard-coded” references to names

that might change. Going back to our previous example, if the guy maintaining the site

“http://www.longdomainnamesareirritating.com/” uses only relative links to refer to graphics

and other embedded objects, then if the site is migrated to “www.muchshortername.com”,

he will not have to edit all of his links to the new name. The significance of this in Web URLs

is explored further in the detailed topic on HTTP URLs.

Key Concept: In addition to being more efficient than absolute URLs, relative URLs

have the advantage that they allow a resource designer to avoid the specific mention

of names. This increases the portability of documents between locations within a

site, or between sites.

URL Length and Complexity Issues

Uniform Resource Locators (URLs) are the most ubiquitous form of resource addressing for

some very good reasons: they represent a simple, convenient and easy-to-understand way

of finding documents. Popularized by their use on the World Wide Web, URLs can now be

seen in everything from electronic document lists to television commercials, a testament to

their universality and ease of use.

At least, this is true most of the time!

When URLs work, they work very well. Unfortunately, there are also some concerns that

arise with respect to how URLs are used. Both accidental and intentional misuse of URLs

occurs on a regular basis. Part of why I have devoted so much effort to describing URLs is

that most people don't really understand how they work, and this is part of why problems

occur.

Many of the issues with URLs are directly due to the related matters of length and

complexity. URLs work best when they are short and simple, so it is clear what they are

about and so they are easy to manipulate. For example, “http://www.ibm.com” is recog-

nizable to almost everyone as the World Wide Web (WWW) site of the International

Business Machines Corporation (IBM). Similarly, you can probably figure out what this URL

does without any explanation: “ftp://www.somecomputercompany.com/drivers/

videodrivers.zip”.

However, as we have seen earlier in this section, URLs can be much more complex. In

particular, the common Internet syntax used by protocols such as HTTP and FTP is

extremely flexible, containing a large number of optional elements that can be used when

required to provide the information necessary for a particular resource access.