An Introduction to Using Data at DISC

What are Data?

Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.

In the context of data libraries and archives, 'data' means computer-readable data. We acquire, store and disseminate data for secondary research. This implies that the data collected for a primary purpose are then made available for research by other individuals or groups. This research may seek to replicate analyses already carried out by primary researchers in order to verify, extend, or elaborate upon the original results, or to analyse the data from an entirely different perspective. Censuses and large surveys carried out by governments for their own policy purposes are particularly rich sources of data for further exploration.

For most, the "Introduction to Data" occurs in the context of an introductory or advanced course in statistical analysis. Typically, the data you have used have been preformatted by a TA or RA for use with a particular statistical package such as SPSS, SAS or Stata. Once you embark upon your own research, you will rapidly discover that few datasets come prepared for immediate access by your favorite application. In fact, you will find that getting your data into an appropriate form for analysis is often more involved than the most complex statistical analyses you will employ!

Using Secondary Data

Usually your research design will be such that you will not have to collect your own data but can test your hypotheses using data that already exist among the wealth of data available in the public realm. These data might be small, simple, micro-level data such as a public opinion poll, or a survey of social or political attitudes. Or they may be more extensive and complex data, such as the Current Population Surveys or the Panel Study of Income Dynamics. Alternatively, many macro-level data sets (geographically aggregated data such as the County Business Patterns or the International Financial Statistics) are also available. Regardless, the challenge with secondary data is to assure yourself that the data appropriately address your research question such that you are not caught in a dilemma of altering your hypothesis to fit the data.

The sorts of questions you will need to ask yourself when you are evaluating secondary data sources for use include the appropriateness of the study's unit of analysis and sampling, the variables and their values, and levels of measurement.

Finding data

At the Data and Information Services Center our primary source of data is the Inter-university Consortium for Political and Social Research (ICPSR), located in Ann Arbor, Michigan. ICPSR has a searchable database of its extensive collection, which is freely available to all University of Wisconsin students, staff and faculty. DISC also obtains data from other sources such as the U.S. Government, international organizations, and private vendors. You can browse all of the holdings at DISC via our Online Catalog, and you can search for data on the Internet yourself by using the DISC Internet Crossroads.

Up until the early 1990s most data were available only on electronic storage media such as magnetic tape. Data had to be accessed via a mainframe computer, and a great amount of technical knowledge was required to use them -- not only statistical knowledge but also general computer knowledge. More recently the desktop computing revolution has brought data directly to the researcher's desktop: mainframe technology has been replaced by the desktop computer, CD-ROMs and DVDs, FTP, and the World Wide Web. This revolution has not only been technological in nature; many more people can now use data without having to devote tremendous amounts of time to mastering all of the technical knowledge that once was mandatory.

The one thing that hasn't changed is the task of locating the data that will suit your research need. In most cases this will be a time-consuming and meticulous process. There are many tools available to assist you in this process, including online and printed resources, books and articles, and, of course, the knowledge and experience of others.

Once you find the dataset that you want, you can contact DISC staff to help you gain access to it. Many of our datasets, including all those that come from the ICPSR, can be easily downloaded. Some data, specifically those on commercially produced CD-ROMs or DVDs, must be used within the data library. Also be aware that for all of the steps of finding the right data there are potential gotchas. For this reason it is prudent to give yourself plenty of time not only to locate your data but to review all of the associated information (including technical documentation and codebooks).

Accessing the Data

As a starting point, assume that you have identified a dataset containing information you would like to analyze. These data consist of a number of measured attributes--called variables--each describing a set of observations. In the case of a survey, the observations are typically individual respondents and the variables are responses solicited from questions about attitudes, behaviors and traits.

Before getting started, it is essential that you understand the technical language of data. The most basic concepts are the record and the field. A record is simply a line. In some cases, the record contains all of the information about an observation, but as is noted below that isn't always true. A field is a column or columns containing the data for a specific variable. This assumes, of course, that your data are in a fixed column format--meaning that the values for a particular variable are in the same column(s) for all observations--which is true for roughly 90% of the data files you will encounter. The alternative is a variable format, in which the variables are found in a specified sequence within the file but are not in the same locations for each observation.
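
As a minimal sketch of the difference between the two formats, the Python below reads the same two observations first from a fixed column layout and then from a variable layout (here assumed to be separated by spaces). The field names, column positions and values are invented for illustration and do not come from any real file.

```python
# Hypothetical layout: ID in columns 1-3, AGE in columns 4-5, SEX in column 6.
FIXED_LAYOUT = {"id": (1, 3), "age": (4, 5), "sex": (6, 6)}

def read_fixed(line, layout):
    """Pull each field from its fixed column range (1-based, inclusive)."""
    return {name: line[start - 1:end] for name, (start, end) in layout.items()}

def read_delimited(line, names):
    """Pull each field by its position in the sequence of values."""
    return dict(zip(names, line.split()))

fixed_lines = ["001342", "002 71"]         # a blank pads the one-digit age
delimited_lines = ["001 34 2", "002 7 1"]  # same values, separated by spaces

fixed_records = [read_fixed(line, FIXED_LAYOUT) for line in fixed_lines]
delimited_records = [read_delimited(line, ["id", "age", "sex"])
                     for line in delimited_lines]

for rec in fixed_records + delimited_records:
    print(rec)
```

Note that the fixed column reader returns the age of the second observation with its padding blank, which is one reason the codebook's notes about blanks in a field matter.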

The relationship between records and fields constitutes a "data structure". The four most common structures are as follows:

  • Logical Record or Rectangular structure. This means that each line of data or record contains all of the variables for a single observation.
  • Multiple Record or Card-image structures. This means that several lines of data contain all of the variables for a particular observation. Card-image structures are those in which there is a fixed record length of 80 characters.
  • Hierarchical structure. This type of data contains multiple levels of related records within the same data file. For example, a file containing both household records and household member records.
  • Relational structure. This refers to multiple files that can be merged on the basis of a predefined structure or variable--the relationship. Examples include data collected on the same population at different time periods (the relation is the individual), or a file containing data on students and a file containing information about the students' schools.
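
To illustrate the relational case, here is a small Python sketch that merges a student file with a school file. The variable names (student_id, school_id, score, enrollment) and all values are hypothetical; the only point is that the shared variable, here school_id, is the relationship on which the files are joined.

```python
# Hypothetical student file: one record per student, keyed to a school.
students = [
    {"student_id": "S1", "school_id": "A", "score": 81},
    {"student_id": "S2", "school_id": "B", "score": 74},
    {"student_id": "S3", "school_id": "A", "score": 90},
]
# Hypothetical school file: one record per school.
schools = [
    {"school_id": "A", "enrollment": 450},
    {"school_id": "B", "enrollment": 1200},
]

# Index the school file by the linking variable, then attach the school
# attributes to every student record that shares that value.
school_by_id = {school["school_id"]: school for school in schools}
merged = [{**student, **school_by_id[student["school_id"]]}
          for student in students]

for row in merged:
    print(row)
```

The result is rectangular again: one line per student, with the school's attributes repeated on each student from that school.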

Typically, you won't want all of the variables in a file, and in some instances you won't want all of the observations. The process of reducing the number of variables is called an extraction; reducing the number of observations is called subsetting a dataset.
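
The two operations can be sketched in a few lines of Python. The observations and variable names below are made up; in practice you would usually do this in a statistical package.

```python
# Hypothetical rectangular data: one dictionary per observation.
observations = [
    {"id": 1, "age": 34, "sex": 2, "region": "N", "income": 41000},
    {"id": 2, "age": 17, "sex": 1, "region": "S", "income": 0},
    {"id": 3, "age": 58, "sex": 1, "region": "N", "income": 52000},
]

def extract(rows, variables):
    """Extraction: keep only the named variables (fewer columns)."""
    return [{name: row[name] for name in variables} for row in rows]

def subset(rows, keep):
    """Subsetting: keep only observations where keep(row) is true (fewer rows)."""
    return [row for row in rows if keep(row)]

adults = subset(observations, lambda row: row["age"] >= 18)
result = extract(adults, ["id", "age", "income"])
print(result)
```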

Generally, once a suitable dataset has been identified you will need to subset those cases and/or extract those variables which you will want to save for further processing. Many of our proprietary CD-ROMs include extraction software (of varying quality). In general it will allow you to point and click on the variables you want. Your data will be extracted as a raw ASCII file, a spreadsheet, or a statistical package system file. DISC produces or links to on-line Users Guides for many of these products.

In the case of ICPSR data, the place to start is with a codebook, a manual describing a particular study or data collection. While the content and format of codebooks vary considerably between data collections, the typical codebook contains the following information:

  • A description of how the data were collected including sampling design;
  • The variables contained in the data;
  • In the case of surveys, the survey instrument or questionnaire used to solicit responses from the respondent and the coded values of each question;
  • The location and format of the variable within the raw data file.

Most codebooks have at least two major sections: the data dictionary which lists the variables and column locations and the data collection instrument. In a number of cases, there is a section describing how to read the codebook!

You will also need to use a statistical package such as SAS or SPSS to access the supplied raw data and create your extract. With many ICPSR data sets SAS and SPSS command files are included, along with an electronic codebook, to facilitate this process. The following is an example of an ICPSR dataset titled The Euro-barometer 14: Trust in the European Community, October 1980. The study consists of a single raw ASCII data file, a codebook file, and SPSS and SAS command files. The first line of data begins like this:

795820010000101032078001233133113240002002120000131013221120030420117720003

Each line of data takes 106 columns, so there isn't room for them all on a single line of this page. Each line of data is a single observation (or record), and there are a total of 9994 records in the entire data set (this file specification information can be obtained via the ICPSR or DISC web sites). Below, the beginning of the data is shown under a column scale, which we add even though we already have a codebook, and we use only the first 60 columns. Now we can compare the data to the documentation, or the codebook.

1___5___10___15___20___25___30___35___40___45___50___55___60
 
795820010000101032078001233133113240002002120000131013221120
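
If you want to produce a scale like this for your own files, a short Python sketch such as the following will do it. The function name column_scale is our own invention; the record is the 60-column excerpt shown above.

```python
def column_scale(width, step=5):
    """Build a ruler whose labels end in the column they name."""
    ruler = ["_"] * width
    ruler[0] = "1"
    for col in range(step, width + 1, step):
        label = str(col)
        # Right-align the label so its last digit sits in column `col`.
        ruler[col - len(label):col] = label
    return "".join(ruler)

record = "795820010000101032078001233133113240002002120000131013221120"

scale = column_scale(60)
print(scale)
print(record[:60])
```

Printing the scale directly above a record in a fixed-width font lets you check a codebook's column locations by eye before writing any extraction syntax.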

Let's look at the first few pages of the machine-readable codebook. We will snip out the initial introductory material from the document, but you should get into the practice of reading this material very closely. In it you will find out about the population from which the sample was drawn and whether the respondents were selected by random sampling, by cluster random sampling or by a proportional random sampling process. The sampling design determines whether each respondent needs to be weighted in your analysis, and the introductory material should tell you whether weights are required and how to apply them. A variety of other important information will also be found in the introductory material.

In the codebook, the first variable, named VAR0001, is documented like this:


  VAR 0001      ICPSR STUDY NUMBER-7958     NO MISSING DATA CODES
  REF 0001         LOC    1 WIDTH  4             DK   1 COL  3- 6
 
  ICPSR STUDY NUMBER-7958

Here the first variable identifies the dataset by its ICPSR study number. This is important. These data are collected for the European Union, and they originate from the Zentralarchiv für Empirische Sozialforschung (ZA) at the University of Cologne (Köln). If you had obtained these data directly from ZA rather than from ICPSR, the ICPSR documentation might be of little use. This underscores an important point: it is always essential to confirm that the codebook matches the data. A codebook written for a different version of the data can make the data unusable.

Since we are using data that are formatted with one record per case, the first variable starts in column 1 and is 4 columns wide. There are no missing data codes. If we were using a dataset with more than one record per case, an important piece of information that we would need to account for would be the record number (or type), often identified with DK for deck number.
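
To see why the deck number matters, consider this Python sketch of a file with two records (cards) per case. The layout is invented purely for illustration: we assume the deck number sits in columns 1-2 and a case identifier in columns 3-6 of every card. Check the codebook for the actual positions in any real file.

```python
from collections import defaultdict

# Invented cards: deck number in columns 1-2, case ID in columns 3-6,
# and that deck's variables in the remaining columns.
cards = [
    "010001AAAA",
    "020001BBBB",
    "010002CCCC",
    "020002DDDD",
]

# Group the cards by case, then by deck within each case.
cases = defaultdict(dict)
for card in cards:
    deck = int(card[0:2])
    case_id = card[2:6]
    cases[case_id][deck] = card

def read_variable(case, deck, col, width):
    """A variable is addressed by deck AND column (1-based), not column alone."""
    return case[deck][col - 1:col - 1 + width]

value = read_variable(cases["0002"], deck=2, col=7, width=2)
print(value)
```

The same column range holds different variables on different decks, so reading by column alone would silently mix them up.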

The second variable, starting in column 5 and only one column wide, identifies the edition, which for this dataset is the second.


 VAR 0002      ICPSR EDITION NUMBER-2      NO MISSING DATA CODES
 REF 0002         LOC    5 WIDTH  1             DK   1 COL  7
 
 ICPSR EDITION NUMBER
 --------------------
 
 THE NUMBER IDENTIFYING THE RELEASE EDITION OF THIS DATASET.
 
 2.  WINTER, 1983 RELEASE

If we look at the line of data above, we see that the first five digits are 7958 and 2: the study number followed by the edition number.
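
The same comparison can be made programmatically. In this Python sketch the LOC and WIDTH values are copied from the two codebook entries above; the dictionary structure and the function name read_variable are our own choices, not part of any ICPSR tool.

```python
# LOC and WIDTH values copied from the two codebook entries above.
DICTIONARY = {
    "VAR0001": {"loc": 1, "width": 4},  # ICPSR study number
    "VAR0002": {"loc": 5, "width": 1},  # ICPSR edition number
}

def read_variable(record, loc, width):
    """Return the field that starts at 1-based column `loc`."""
    return record[loc - 1:loc - 1 + width]

# Start of the first data record shown earlier.
record = "795820010000101032078001233133113240002002120000131013221120"

values = {name: read_variable(record, **spec)
          for name, spec in DICTIONARY.items()}
print(values)
```

Codebook locations count from 1, while Python string positions count from 0, which is why the function subtracts one from LOC.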

In another example of a typical codebook entry (this time from the American National Election Study) we see the following:


           VAR 0062      R INTREST-POL CAMPGN                MD=0 OR GE  8
           REF 0062         LOC  151 WIDTH  1
 
              In this interview I will be talking with you about the
              recent elections, as well as a number of other things.
              First, I have some questions about the political campaigns
              that took place this election year.
 
              Q.A1.  Some people don't pay much attention to political
              campaigns.  How about you?   Would you say that you were
              VERY MUCH INTERESTED, SOMEWHAT INTERESTED, or NOT MUCH
              INTERESTED in following the political campaigns this year?
              ----------------------------------------------------------
 
             304  1.  VERY MUCH INTERESTED
             635  3.  SOMEWHAT INTERESTED
             419  5.  NOT MUCH INTERESTED
 
                  8.  DK
               1  9.  NA
            1126  0.  INAP, 1992 cross section

This entry shows the actual survey text as well as the coded values, their labels, and frequencies, plus missing data information in the MD field.
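
As a sketch of how the MD field is used, the Python below reads "MD=0 OR GE 8" as "treat code 0 and any code greater than or equal to 8 as missing" and applies it to the frequencies listed in the entry. The codes and counts are copied from the entry; the function name is_missing is our own.

```python
# Codes and counts copied from the VAR 0062 entry above.
# Code 8 (DK) has no count listed in the entry, so it is left out here.
FREQUENCIES = {1: 304, 3: 635, 5: 419, 9: 1, 0: 1126}

def is_missing(code):
    """Apply the codebook rule MD=0 OR GE 8."""
    return code == 0 or code >= 8

valid = {code: n for code, n in FREQUENCIES.items() if not is_missing(code)}
valid_n = sum(valid.values())
missing_n = sum(n for code, n in FREQUENCIES.items() if is_missing(code))

# Percentages are based on valid responses only.
for code, n in sorted(valid.items()):
    print(code, n, f"{100 * n / valid_n:.1f}%")
print("valid:", valid_n, "missing:", missing_n)
```

Forgetting to declare the missing data codes would fold the 1126 inapplicable cases into the analysis as if 0 were a real answer.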

Other data sets have different types of codebooks, and you will find that these sometimes vary widely. For example, an entry from the 1990 Census Public Use Microdata Sample looks quite different from the ICPSR codebooks described above:



DATA SIZE BEGIN
D RECTYPE 1 1
Record Type
V H .Housing Record
D SERIALNO 7 2
V 0000000..
V 9999999 .Housing unit/GQ person serial number unique
V .identifier assigned within state or state group
D SAMPLE 1 9
Sample Identifier
V 1 .5% sample
V 2 .1% sample
V 3 .Elderly
............................................................................
PERSON RECORD

DATA SIZE BEGIN
D RECTYPE 1 1
Record Type
V P .Person Record
D SERIALNO 7 2
V 0000000..
V 9999999 .Housing unit/GQ person serial number unique
V .identifier assigned within state or state group
D RELAT1 2 9
Relationship
V 00 .Householder
V 01 .Husband/wife
V 02 .Son/daughter
V 03 .Stepson/stepdaughter
V 04 .Brother/sister
V 05 .Father/mother
V 06 .Grandchild
V 07 .Other relative
V 08 .Roomer/boarder/foster child
V 09 .Housemate/roommate
V 10 .Unmarried partner
V 11 .Other nonrelative
V 12 .Institutionalized person
V 13 .Other persons in group quarters


Since these are hierarchical data, each record carries a RECTYPE indicator: H for a Housing Record and P for a Person Record. The "BEGIN" column contains the starting column location of the variable and "SIZE" indicates the width of the variable. The lines beginning with "V" contain the coded values of the variable and a description of these codes.
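
A hierarchical file of this kind can be read by checking the record type first and then applying the matching layout. In the Python sketch below the SIZE and BEGIN values are taken from the excerpt above, but the data lines are invented and far shorter than real records.

```python
# (size, begin) pairs copied from the data dictionary excerpt above.
HOUSING = {"RECTYPE": (1, 1), "SERIALNO": (7, 2), "SAMPLE": (1, 9)}
PERSON = {"RECTYPE": (1, 1), "SERIALNO": (7, 2), "RELAT1": (2, 9)}

def parse(line, layout):
    """Slice each variable out of the line using its size and begin column."""
    return {name: line[begin - 1:begin - 1 + size]
            for name, (size, begin) in layout.items()}

# Invented records: each housing record is followed by its person records.
lines = [
    "H00000011",
    "P000000100",
    "P000000101",
    "H00000022",
    "P000000200",
]

households = {}
for line in lines:
    if line[0] == "H":
        record = parse(line, HOUSING)
        record["persons"] = []
        households[record["SERIALNO"]] = record
    elif line[0] == "P":
        record = parse(line, PERSON)
        # SERIALNO links each person back to the housing record.
        households[record["SERIALNO"]]["persons"].append(record)

for serial, household in households.items():
    print(serial, household["SAMPLE"],
          [person["RELAT1"] for person in household["persons"]])
```

The serial number plays the role of the link between levels, so it should always be among the variables you extract.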

In almost all cases, the codebook will contain the information you will need to begin thinking about and writing syntax to extract the variables and cases you need from the raw data.

The basic pieces of information that you will need are:

  • The data structure--rectangular, hierarchical, etc.
  • The variables that you are interested in, including their column location(s), variable type (alpha or numeric), and additional formatting information (number of decimals, whether there are blanks in the field).
  • Essential supplemental variables such as unique case identifiers, weights, and the like that are necessary for using the data correctly.
  • Labels to identify the variables and values on your output.
  • Some baseline marginals to test. For example, if the codebook lists the number of cases by geographic area, you might want to compare your extraction results to this table.
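
The marginals check in the last item might look like the following Python sketch. The region codes and counts are hypothetical stand-ins for a table you would copy from your own codebook.

```python
from collections import Counter

# Hypothetical codebook table: number of cases per region code.
codebook_counts = {"1": 3, "2": 2, "3": 1}

# Hypothetical extract: the region code read for each case.
extracted_regions = ["1", "2", "1", "3", "1", "2"]

observed = Counter(extracted_regions)

# Codes whose extracted count differs from the codebook: (expected, observed).
mismatches = {code: (expected, observed.get(code, 0))
              for code, expected in codebook_counts.items()
              if observed.get(code, 0) != expected}

# Codes that appear in the extract but not in the codebook are suspect too.
unexpected = sorted(set(observed) - set(codebook_counts))

print("mismatches:", mismatches)
print("unexpected codes:", unexpected)
```

A mismatch usually points to a wrong column location, a wrong record type, or a codebook that does not match the data file.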

The time-consuming task, especially with huge, comprehensive studies such as the Panel Study of Income Dynamics, is to read through the long lists of variables, track down their descriptions in the codebook and keep track of their column locations until you have to write your extraction program (or, if you are lucky, to edit an already existing program).

Once you've decided which variables you need, you'll need to decide upon an application to use. The application should be one that is suited to the types of analyses that you expect to conduct, but there isn't always an obvious choice. Some applications are well-suited to specific types of analysis yet are poor choices as "extraction engines" because they either don't handle data manipulation efficiently or cannot work with complex data structures. Most people use either SPSS or SAS for extracting data because both have very robust data manipulation capabilities. After you extract your data you are ready to write a statistical package program to read and analyze them.