Web-Based Formats

Structure-based file formats are difficult to categorize. All file formats have structure, and that structure allows a user to see content as it should be seen. What sets a structure-based format apart is that it is almost always a text-based file: it can be opened in a text reader to see the basic content, but a separate program is needed to render the content as it was meant to be seen. Structure-based files usually use tags, which look like <this>, to tell that program how the file should look. Tags are also called markup. The most common type of structure-based file format is an HTML webpage.
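
The gap between the raw text of a tagged file and the content a reader is meant to see can be illustrated with a minimal Python sketch. The sample markup and the VisibleText class are made up for illustration, and a real rendering program does far more than strip tags.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty runs of text found between tags.
        if data.strip():
            self.parts.append(data.strip())


# Hypothetical file content: what a plain text reader would show.
raw = "<h1>Annual Report</h1><p>Revenue grew <b>12%</b> this year.</p>"

parser = VisibleText()
parser.feed(raw)
parser.close()

# A rough stand-in for what the reader is meant to see (without styling).
visible = " ".join(parser.parts)

print(raw)
print(visible)
```

The first printed line shows the tags, as a text reader would; the second shows only the wording a user would read in a browser.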

HTML and web-based formats

Web-based formats are computer files meant to be viewed through a web browser. These formats, such as HTML, are a special case that does not fit easily into a text, video, or audio category. More often than not, what matters to a typical user on a webpage is its text, but many webpages have audio or video linked to them, which creates a gray area for categorization. Often a webpage or website will use mini-programs (scripts) and special file formats to add functions that would not be present in a basic .html file.

In SCERA, HTML and similar pages are almost always considered text because that is how the file was meant to be seen. However, the raw text of the file (the markup code) is not what most users were meant to view, so a transformation of the file to a different format will try to preserve the final look as seen in an HTML viewer, not the code that creates that look. Linked files such as audio or video will be provided separately as their own records.
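
As an illustration of how linked audio and video can be identified in a page so they can be handled as separate files, here is a minimal Python sketch. It is not SCERA's actual process; the sample page, the file names, and the MediaLinks class are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags that commonly carry a link to an audio or video file.
MEDIA_TAGS = {"audio", "video", "source"}


class MediaLinks(HTMLParser):
    """Collect the src attribute of audio, video, and source tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            for name, value in attrs:
                if name == "src" and value:
                    self.links.append(value)


# Hypothetical webpage with one linked audio file and one linked video file.
page = """<html><body>
<p>Meeting minutes</p>
<audio src="minutes.mp3"></audio>
<video><source src="meeting.mp4"></video>
</body></html>"""

finder = MediaLinks()
finder.feed(page)
finder.close()

media_links = finder.links
print(media_links)
```

The page text stays with the webpage record, while each collected link points to a file that would become its own record.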

Other web-based file formats complementary to .html files are:

  • .htm
  • .xhtml
  • .mhtml
  • .asp/.aspx
  • .warc
  • .php
  • .js
  • .css

XML, etc.

.xml is a standardized markup language format; the ML in the name stands for Markup Language (XML is Extensible Markup Language). XML is a wildcard among file formats because it is very versatile in how it is used and in what the final look and feel should be. Many file extensions, such as .svg, are XML underneath, but not every format that uses markup is XML. For example, HTML uses markup tags to structure the final appearance and is closely related to XML, but only .xhtml is HTML written to XML's rules. .yml (YAML) is a separate, non-XML text format used for special purposes such as configuration. In most cases when the XML is not a webpage, the file can be understood in its basic text view, so SCERA (on a case-by-case basis) will usually preserve the text of the file.

XML files that are meant to be seen using a web-browser are usually found in sets of three (3) related files:

  1. .xml
  2. .xsd/.dtd
  3. .xsl/.xslt

.xml is the file that holds the structured information, marked up with <tags>.
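
To show what "structured information" in a .xml file looks like in practice, the following Python sketch reads a small made-up record with the standard library. The tag names and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical .xml content: each piece of information sits inside a named tag.
record_xml = """<record>
  <title>Board Minutes</title>
  <date>2015-03-02</date>
</record>"""

root = ET.fromstring(record_xml)

# Map each tag name to the text it contains.
fields = {child.tag: child.text for child in root}

print(fields)
```

Because every value is labeled by its tag, a program can pick out the title or the date without guessing where it sits in the text.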

.xsd and .dtd are the file types that contain the rules for how a .xml file should be <tagged>. Without the .xsd/.dtd, a user could use whatever tags they liked, without rhyme or reason. The difference between them is that .xsd is itself written in XML using <tags>, while .dtd uses its own declaration syntax that is not XML. Both can be opened as plain text.
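
The difference between the two rule formats can be seen by handing each to an XML parser. In this Python sketch the rule snippets and the is_xml helper are made up for illustration. The sketch only checks whether the text parses as XML; it does not validate a .xml file against the rules, which needs a validating tool outside Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical .xsd content: the rules are themselves expressed with tags.
xsd_rules = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:string"/>
</xs:schema>"""

# Hypothetical .dtd content: declarations in a separate, non-XML syntax.
dtd_rules = """<!ELEMENT record (title, date)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT date (#PCDATA)>"""


def is_xml(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


xsd_is_xml = is_xml(xsd_rules)
dtd_is_xml = is_xml(dtd_rules)

print(xsd_is_xml, dtd_is_xml)
```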

.xsl and .xslt are the file formats used to convert the content of a .xml file into another format for use or viewing. For example, a .xml file will refer to a .xsl file, and the web browser will use the two together to present the .xml content as an .html page.
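
A .xml file usually points to its .xsl file through an xml-stylesheet instruction near the top of the file. The Python sketch below finds that reference in a made-up document; the file name catalog.xsl is hypothetical. It does not perform the conversion itself, since Python's standard library has no XSLT processor.

```python
from xml.dom import minidom, Node

# Hypothetical .xml content that refers to a stylesheet named catalog.xsl.
doc_xml = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="catalog.xsl"?>
<catalog><item>Map</item></catalog>"""

doc = minidom.parseString(doc_xml)

# Stylesheet references are processing instructions at the top of the document.
stylesheets = [
    node.data
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]

print(stylesheets)
```

A browser reads the same instruction to decide which .xsl file to apply before showing the page.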

Although the relationship between xml, xsd/dtd, and xsl/xslt seems very complicated, in practice it is very effective and common for managing information. For practical purposes when thinking about XML for webpages, the following equation summarizes the relationships: .xml + .xsd/.dtd + .xsl/.xslt = webpage.