Text-Based Formats

Regardless of whether a file also contains other types of content, if it is marked as text, the most important part of the file is its human-readable content.

Text-based formats, generally, come in two forms: Text and Spreadsheets. To protect government records from being altered by the general public, SCERA will provide text and spreadsheet files in .pdf form as often as possible. Because of special functionality concerns, .html pages may or may not be converted. The original text/spreadsheet file may be available on request at the discretion of the Archives.

A note on office productivity software: Office productivity level software is software that can create documents or information of sufficient quality to be usable by a business office. Although it is designed for office purposes, office productivity software is also common (as of 2015) on home computers and is typically packaged as a suite of products, such as the Microsoft Office suite. A key feature of this type of software is the ability to read, write, and edit documents.

Text

A text format is, at its heart, a computer version of a print document like those created using a typewriter or written by hand. The appearance of the characters, which is determined by fonts, is standardized to a large extent, as is the way the characters are coded. Depending on the creating computer program and the type of text format, a text document can have other things embedded in it, such as tables, images, audio, and moving pictures.

The most common file format extensions for text files are:

  • .pdf
  • .doc/.docx
  • .odt
  • .txt
  • .rtf
  • .ppt/.pptx
  • .odp

.pdf

.pdf stands for Portable Document Format. The PDF file format was popularized by Adobe through its Acrobat and Acrobat Pro programs. PDF files are most often primarily text, but they are powerful because they can save information as if it were an image (scanning to PDF), preserve the look-and-feel of the original document regardless of whether it is text-based or image-based, and are very hard to change without a specialized computer program. They can also save a page as an image with a text version embedded to help with searching the document. The most common PDF file in SCERA is the PDF/A. The A stands for archival, and this type of PDF has an “open-standard” encoding that is freely available to the public so that people in the future can create a PDF reader program on their own if necessary. Note that PDF reader programs typically CANNOT read other text-based files properly.

.doc/.docx

.doc/.docx is a shortening of the word “document” and is the main current (as of 2015) Microsoft Word proprietary office productivity level format. (.docx differs from .doc in that it uses XML as part of how it structures its information.) It can embed other content in the file, such as images, tables, and “art” that can be manipulated to create visual aids. Microsoft has released some of the coding standard for its Word documents, but because the format is proprietary it is not required to do so. Word documents are often readable by other text-reading programs, BUT those programs need to be designed for office productivity and cannot be based on plain-text formatting.
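As a rough illustration of what "uses XML" means in practice: a .docx file is a ZIP archive whose main text lives in an XML part named word/document.xml. The sketch below, in Python using only the standard library, pulls paragraph text out of that part. The function name extract_paragraphs and the sample content are hypothetical, and the in-memory archive built here is a stripped-down stand-in for demonstration, not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the main text part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; each run holds its text in a <w:t> element.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


# Hypothetical stand-in archive holding only the main text part.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Meeting minutes</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Approved </w:t></w:r><w:r><w:t>2015</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

print(extract_paragraphs(buffer))  # ['Meeting minutes', 'Approved 2015']
```

This recovers only the words. Images, tables, and layout are stored in other parts of the archive, which is why a full office productivity program is needed to show the document as intended.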

.odt

.odt stands for Open-Document Text. Open-Document Text files are an office productivity level file format based on an “open-standard” coding, where the coding information is freely available so that a member of the general public can create their own program to read the file if necessary. The functionality of an ODT file is similar to, but not as complex as, that of a .doc/.docx file, and a user should be aware that some of the “look-and-feel” may be lost in translation between the two formats.

.txt

.txt is an abbreviation of the word “text”. It is the most basic of text-based files and should be readable by any text-reading program, regardless of whether that program is freely available or for purchase. The font used to display a .txt file may be changed, and the way characters are encoded may vary, but large format changes such as embedding a picture or making “headings” in the file are not possible. Very often, an “exotic” looking file format extension, or a file without an extension, is actually a .txt file that has been specialized to be read by a particular computer program.
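One rough way to test whether a file with an unfamiliar extension is really plain text is to try decoding its bytes and look for control characters. The following Python sketch assumes the text is encoded as UTF-8, which is only one of several possible encodings; the function name looks_like_plain_text and the sample byte strings are hypothetical.

```python
import unicodedata


def looks_like_plain_text(data):
    """Heuristic: True if the bytes decode as UTF-8 and contain no control
    characters other than tab, line feed, and carriage return."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    return all(
        ch in allowed_controls or unicodedata.category(ch) != "Cc"
        for ch in text
    )


# A settings-style file with an unusual extension is still plain text.
print(looks_like_plain_text(b"setting=value\nname=archive\n"))  # True

# Binary content (here, raw control bytes) is rejected.
print(looks_like_plain_text(b"\x00\x01\x02"))  # False

# The same word produces different bytes under different encodings.
print("café".encode("utf-8") == "café".encode("latin-1"))  # False
```

Because the check assumes UTF-8, a genuine text file saved in another encoding (such as the Latin-1 bytes above) would be reported as not plain text. A fuller check would try several encodings in turn.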

.rtf

.rtf stands for Rich Text Format. Its general functionality is very similar to that of a Text file format in that it is fairly basic, BUT it is more complex than the average Text file. A .rtf file provides a middle ground between a Text file and an office productivity software file. A user should be careful about opening a .rtf file in a computer program designed for Text files, as that kind of program may not properly show all of the content in the intended manner.

.ppt/.pptx

.ppt/.pptx stands for PowerPoint Presentation and is the main current (as of 2015) Microsoft PowerPoint proprietary file format. (.pptx differs from .ppt in that it uses XML as part of how it structures the file.) PowerPoint files are designed for presentations, and the typical file is divided into individual “slides” that a speaker can use to accompany their presentation. A PowerPoint file is usually primarily about text content, but almost universally has other visual elements such as pictures and special transitions in content. It is an office productivity level format and can only render properly in an office productivity type of program. A non-Microsoft program used to read a PPT/PPTX file may not preserve the full functionality of the file.

.odp

.odp stands for Open-Document Presentation. ODP files are the open-document counterpart to the Microsoft PowerPoint office productivity level presentation format. ODP is an “open-standard” format, so a user can get information about how the file is coded and could create their own reader if necessary. As with PowerPoint files, an ODP file is designed for presentations, with individual slides that can be only text or have mixed content. Since it is an office productivity type of format, it requires office productivity level software.

Spreadsheets

A spreadsheet format is essentially a table or set of tables divided into rows and columns, with information divided into sections called cells. The column and row give the coordinates of where to find information in a spreadsheet. For example, cell A5 is in column A, row 5. Some spreadsheet file formats can use math functions to make automatic calculations based on numbers put into the cells, which is why a spreadsheet can be so powerful for accounting and other math-based needs.
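The coordinate scheme can be sketched in a few lines of Python. The function name parse_cell_reference is hypothetical, and the sketch assumes the common spreadsheet convention that columns past Z continue as AA, AB, and so on.

```python
import re


def parse_cell_reference(ref):
    """Split a reference such as 'A5' into (column letters, column number, row number)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters = match.group(1).upper()
    row = int(match.group(2))
    # Column letters behave like a base-26 number: A=1 ... Z=26, AA=27, and so on.
    column = 0
    for letter in letters:
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return letters, column, row


print(parse_cell_reference("A5"))    # ('A', 1, 5)
print(parse_cell_reference("AA10"))  # ('AA', 27, 10)
```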

The most common file format extensions for spreadsheets are:

  • .xls/.xlsx
  • .csv
  • .tsv
  • .ods

.xls/.xlsx

.xls/.xlsx is the main current Microsoft Excel proprietary spreadsheet file format. (.xlsx differs from .xls in that it uses XML as part of how it structures its information.) Microsoft has the ability to make the code used to structure its files freely available, but it is not required to. Excel files are often readable by non-Excel computer programs, but this is not guaranteed.

.csv

.csv stands for Comma-Separated Values. A comma-separated values file stores information as plain text that can be read by any text-reading program. Rather than special coding, it uses a comma (,) to mark where one cell ends and another begins. (A cell whose content includes a comma that is not meant to mark a new cell is enclosed in double quotation marks (").) When read by a spreadsheet program, a comma-separated file goes from plain text to being a table with rows and columns.
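The following Python sketch shows how a program might turn comma-separated text into rows and cells, using the csv module from Python's standard library. The function name parse_csv and the sample data are hypothetical.

```python
import csv
import io


def parse_csv(text):
    """Return the rows of a comma-separated text as lists of cell values."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical sample: the first cell of the second line contains a comma,
# so it is enclosed in double quotation marks.
sample = 'Name,Office,Year\n"Smith, Jane",Records,2015\n'

for row in parse_csv(sample):
    print(row)
# ['Name', 'Office', 'Year']
# ['Smith, Jane', 'Records', '2015']
```

The quoted comma stays inside its cell, so the second row still has three cells rather than four.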

.tsv

.tsv stands for Tab-Separated Values. A tab-separated values file stores information as plain text so it can be read by any text-reading program. It uses the tab character (produced by the Tab key) to mark where one cell ends and another begins (hence tab-separated). When read by a spreadsheet program, a tab-separated file goes from plain text to being a table with rows and columns.
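Because the two formats differ only in the separator, converting between them is straightforward. This Python sketch reads tab-separated text and writes it back out as comma-separated text using the standard csv module. The function name tsv_to_csv and the sample data are hypothetical.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    output = io.StringIO()
    # The writer adds quotation marks around any cell that contains a comma.
    writer = csv.writer(output, lineterminator="\n")
    writer.writerows(rows)
    return output.getvalue()


# Hypothetical sample: in a tab-separated file, a comma inside a cell needs no quoting.
sample = "Name\tOffice\tYear\nSmith, Jane\tRecords\t2015\n"

print(tsv_to_csv(sample))
# Name,Office,Year
# "Smith, Jane",Records,2015
```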

.ods

.ods stands for Open-Document Spreadsheet. Open-Document spreadsheets are coded specifically to be read by a spreadsheet computer program, and a variety of computer programs can create .ods files. What separates a .ods file from Excel spreadsheet file formats is that the format has an “open-standard”, where information about how the spreadsheet coding works is freely available and standardized, so an industrious person can adapt a spreadsheet program to read a .ods file or create their own program to read it.