Splus: overview & examples

Features:

1. Interpreted, not compiled, programming language.
2. High-quality graphics output.
3. Built-in statistical analysis packages.
4. Slow and memory-hungry when looping.
5. Generally good numerics.
6. Wide variety of modern statistical techniques.
7. Fairly powerful syntax.

A list of things you need to know how to do:

1. Set up a new Splus project.
2. Get data into Splus.
3. Make simple plots: scatterplots and so on.
4. Do regression.
5. Write functions.
6. Do simulations.
7. Avoid pitfalls (scoping, memory, speed).
8. Get help.



 
Table 1: A list of help categories on our system for Splus 5.

All Functions and Datasets; Add to Existing Plot; ANOVA Models; Bootstrap Methods;
Categorical Data; Character Data Operations; Clustering; Complex Numbers;
Computations Related to Plotting; Data Attributes; Data Directories; Data Manipulation;
Data Sets; Data Types; Dates Objects; Defunct Functions;
Deprecated Functions; Documentation; Dynamic Graphics; Error Handling;
Graphical Devices; High-Level Plots; Input/Output-Files; Interacting with Plots;
Interfaces to Other Languages; Jackknife Methods; Library of Chronological Functions; Library of Clustering Methods;
Library of Maps; Linear Algebra; Lists; Loess Objects;
Logical Operators; Looping and Iteration; Mathematical Operations; Matrices and Arrays;
Methods and Generic Functions; Miscellaneous; Missing Values; Mixed Effects Models;
Multivariate Techniques; Non-linear Regression; Nonparametric Statistics; Optimization;
Ordinary Differential Equations; Printing; Probability Distributions and Random Numbers; Programming;
Quality Control; Regression; Regression and Classification Trees; Release Notes;
Resampling (Bootstrap, Jackknife, and Permutations); Robust/Resistant Techniques; S-PLUS Session Environment; Smoothing Operations;
Statistical Inference; Statistical Models; Survival Analysis; Time Series;
Trellis Displays Library; Utilities; Design of Experiments, Response Surfaces and Robust Design; Fractional Factorial Experiments;
Response Surfaces; Robust Experimental Design; GARCH Module for Modeling Time Series Volatility; Geostatistical Data Analysis;
Hexagonal Binning; Lattice Data Analysis; Point Pattern Analysis; Spatial Regression;
Spatial Statistics Module; Wavelet Analysis of Data, Signals, and Images; Discrete Wavelet Transform Analysis; 1-D Wavelet and Cosine Transforms;
2-D Wavelet and Cosine Transforms; Wavelet Convolutions and Filters; Cosine Packet Analysis; Wavelet Packet Analysis;
Wavelet Crystals; Wavelet Molecules and Atoms; Creating Wavelets, Wavelet Packets, and Cosine Packets; Wavelets Module Functions;
Wavelets Module Signals, Images and Datasets



Starting a new Splus project:

[47]ehlehl% mkdir Splus5project

[48]ehlehl% cd Splus5project

[49]ehlehl% Splus5 CHAPTER
Creating data directory for chapter .
Splus5 chapter Splus5project initialized.

[50]ehlehl% Splus5
S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 5.1 Release 1 for Sun SPARC, SunOS 5.5 : 1999
Working data will be in .Data
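
Objects you create are stored in the .Data directory, so they persist from one session to the next. To end a session:

> q()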

Getting data into Splus:

> # Making a vector of numbers
> xshort <- c(0,1,2,3,4)
  # c is a function which can take an 
  # arbitrary number of arguments and
  # strings them out one after another
> # Printing out an object -- type the name
> xshort 
[1] 0 1 2 3 4
> # Make a big object 
> times <- seq(from=0, to=10,length=1001)
> # Print part of times
> times[1:10] 
 [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
> # Generate normal data
> noise <- rnorm(1001)
> # Read data from a file
> gas <- read.table("gas.data",header=TRUE)
> gas[1:3,]
   Cost Distance UnitPrice 
1  9.80    390.4      31.2
2 10.41    413.6      31.4
3 10.50    415.2      31.5
> # gas is now a data.frame
> summary(gas)
       Cost          Distance       UnitPrice    
    Min.: 6.33      Min.:198.4      Min.:31.20  
 1st Qu.:11.21   1st Qu.:423.4   1st Qu.:36.10  
  Median:13.12    Median:443.1    Median:40.10  
    Mean:12.71      Mean:443.0      Mean:39.15  
 3rd Qu.:14.09   3rd Qu.:465.6   3rd Qu.:43.30  
    Max.:15.67      Max.:534.5      Max.:47.10  
> # Now what objects have we created?
> objects()
[1] ".Last.value" "gas"         "last.dump"   "times"      
[5] "xshort"

Manipulating data

> # Most functions operate on vectors componentwise
> y <- sin(2*pi*times)    
> # add columns to the data.frame
> # Compute litres of gas used in tankful
> gas$GasUsed  <- gas$Cost/(gas$UnitPrice/100)
> # Compute litres per hundred kilometres 
> gas$Consumption <- gas$GasUsed/(gas$Distance/100)
> var(gas[,c(2,4,5)])
              Distance   GasUsed Consumption 
   Distance  1886.49916 90.893806 -10.2972991
    GasUsed    90.89381  8.195885   0.3428260
Consumption   -10.29730  0.342826   0.2426734
> sqrt(diag(var(gas[,c(2,4,5)])))
[1] 43.4338480  2.8628456  0.4926189
> apply(gas,2,mean)
   Cost Distance UnitPrice     GasUsed  Consumption 
 12.7143 443.0374  39.15234 32.44423    7.346162
> # some examples of vector behaviour
> c(1,2,3)*c(1,2,3)
[1] 1 4 9
> # some matrix manipulation
> c(1,-1,0)%*%var(gas[,c(2,4,5)]) %*% c(1,-1,0)
         [,1] 
[1,] 1712.907
> solve(var(gas[,c(2,4,5)]))%*%var(gas[,c(2,4,5)])
           Distance       GasUsed   Consumption 
   Distance        1 -1.332268e-15  2.220446e-16
    GasUsed        0  1.000000e+00 -3.552714e-15
Consumption        0 -8.526513e-14  1.000000e+00

Making simple plots

> # Plot a graph of the sin(2*pi*t)
> # Open up a graphics window
> motif()
> plot(times,y,xlab="Time",ylab="Amplitude",
            main="The Sine Function",type='l')
> # Saving a hard copy
> postscript(file='sine.ps',horizontal=F)
> plot(times,y,xlab="Time",ylab="Amplitude"
   ,main="The Sine Function",type='l')
> dev.off()
> postscript("gaspairs.ps",horizontal=F)
> pairs(gas)
> dev.off()

Other important plotting functions:
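For example, hist, boxplot, qqnorm, lines, lowess and title are all worth knowing. A brief sketch of typical calls on the gas data (illustrative, not from a saved session):

> hist(gas$Consumption)                      # histogram
> boxplot(gas$Consumption)                   # boxplot
> qqnorm(gas$Consumption)                    # normal quantile-quantile plot
> plot(gas$Distance, gas$GasUsed)            # scatterplot
> lines(lowess(gas$Distance, gas$GasUsed))   # add a scatterplot smoother
> title("Gas used versus distance driven")   # add a title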

Regression: Splus has a built-in modelling language. Many different modelling functions use the same language.

> Dreg <- lm(Distance ~ GasUsed, data=gas)
> Dreg
Call:
lm(formula = Distance ~ GasUsed, data = gas)

Coefficients:
 (Intercept)  GasUsed 
    83.22513 11.09018

Degrees of freedom: 107 total; 105 residual
Residual standard error: 29.77981 
> summary(Dreg)

Call: lm(formula = Distance ~ GasUsed, data = gas)
Residuals:
    Min     1Q Median    3Q   Max 
 -74.02 -16.77 -2.967 17.76 100.8

Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept) 83.2251 32.9062     2.5292  0.0129 
    GasUsed 11.0902  1.0103    10.9766  0.0000 

Residual standard error: 29.78 on 105 degrees of freedom
Multiple R-Squared: 0.5343 
F-statistic: 120.5 on 1 and 105 degrees of freedom, the p-value is 0 

Correlation of Coefficients:
        (Intercept) 
GasUsed -0.9962    
> # what is Dreg?
> # It is a list with named components
> names(Dreg)
 [1] "coefficients"  "residuals"     "fitted.values" "effects"      
 [5] "R"             "rank"          "assign"        "df.residual"  
 [9] "contrasts"     "terms"         "call"         
> names(Dreg$call)
[1] ""        "formula" "data"   
> Dreg$call
lm(formula = Distance ~ GasUsed, data = gas)
> Dreg$call$formula
Distance ~ GasUsed
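
The fitted model can also be used for plotting and prediction. A short sketch (these calls are illustrative, not part of the saved session):

> plot(gas$GasUsed, gas$Distance)
> abline(coef(Dreg)[1], coef(Dreg)[2])           # add the fitted line
> predict(Dreg, data.frame(GasUsed=c(30, 40)))   # predicted distances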

Categorical independent variables:

> # Some data sets are built in to S
> # Different objects are kept in different places.
> objects(4)
 ...
[124] "freeny.x"            "freeny.y"            "fuel.frame"         
[127] "fuel.frame.orig"     "galaxy"              "gas"                
 ...
> summary(fuel.frame)
      Weight       Disp.        Mileage         Fuel        Type   
   Min.:1845    Min.: 73.0    Min.:18.00    Min.:2.703 Compact:15  
1st Qu.:2571 1st Qu.:113.8 1st Qu.:21.00 1st Qu.:3.704   Large: 3  
 Median:2885  Median:144.5  Median:23.00  Median:4.348  Medium:13  
   Mean:2901    Mean:152.1    Mean:24.58    Mean:4.210   Small:13  
3rd Qu.:3231 3rd Qu.:180.0 3rd Qu.:27.00 3rd Qu.:4.762  Sporty: 9  
   Max.:3855    Max.:305.0    Max.:37.00    Max.:5.556     Van: 7 
> Freg <- lm(Fuel ~ Disp. + Weight + Type, data=fuel.frame)
> summary(Freg)

Call: lm(formula = Fuel ~ Disp. + Weight + Type, data = fuel.frame)
Residuals:
     Min      1Q   Median  3Q    Max
 -0.6973 -0.2444 -0.01367 0.2 0.6363

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  2.9606  0.6005     4.9306  0.0000
      Disp.  0.0076  0.0017     4.3616  0.0001
     Weight  0.0000  0.0003     0.1611  0.8726
      Type1 -0.1453  0.1293    -1.1239  0.2662
      Type2  0.0981  0.0470     2.0850  0.0420
      Type3 -0.1241  0.0505    -2.4556  0.0174
      Type4 -0.0436  0.0252    -1.7314  0.0893
      Type5  0.1915  0.0337     5.6748  0.0000

Residual standard error: 0.3139 on 52 degrees of freedom
Multiple R-Squared: 0.8486
F-statistic: 41.65 on 7 and 52 degrees of freedom, the p-value is 0
Correlation of Coefficients:
       (Intercept)   Disp.  Weight   Type1   Type2   Type3   Type4
 Disp.  0.4864
Weight -0.9415     -0.7477
 Type1  0.3786     -0.2960 -0.1559
 Type2  0.0888      0.3470 -0.2166 -0.4995
 Type3 -0.7591     -0.0589  0.5929 -0.6197  0.1746
 Type4 -0.3099     -0.1641  0.2937 -0.2997  0.1304  0.3021
 Type5  0.7121      0.6013 -0.7688 -0.0053  0.2625 -0.3933 -0.2013
> coef(Freg)
 (Intercept)       Disp.      Weight      Type1      Type2      Type3
    2.960617 0.007594073 4.16396e-05 -0.1452804 0.09808411 -0.1240942

       Type4     Type5
 -0.04358047 0.1915062
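> # Type is a factor (a categorical variable): the Type1-Type5 coefficients
> # above are contrasts among its six levels (Compact, Large, Medium,
> # Small, Sporty, Van).  To list the levels (an extra command, not part
> # of the original session):
> levels(fuel.frame$Type)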
> anova(Freg)
Analysis of Variance Table

Response: Fuel

Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
    Disp.  1  17.25253 17.25253 175.0661 0.000000e+00
   Weight  1   7.93103  7.93103  80.4783 0.000000e+00
     Type  5   3.54878  0.70976   7.2021 3.465263e-05
Residuals 52   5.12453  0.09855
> # Diagnostic plots:
> postscript("FuelDiag.ps",horizontal=F)
> # Lay the plots out in a 3 by 2 grid
> par(mfrow=c(3,2))
> plot(Freg)
> dev.off()
> # Examine the postscript file without leaving S
> !ghostview FuelDiag.ps&

Writing Splus functions: Here I will make a contour plot of a density that is a 50/50 mixture of two bivariate normal densities with different correlations.
> x <- seq(-3, 3, length = 200)
> y <- x
> g <- function(x, y, rho1, rho2 = 0)
{
 #  g computes the density of a 50/50 mixture of two normal densities
 #   at a grid of x,y values
 #  
 #  Use outer to make a matrix whose ijth entry is x[i]*y[j]
        q1 <- outer(x, y, "*")
        y1 <- rep(1, length(y))
        x1 <- rep(1, length(x))
        qx <- outer(x^2, x1, "*")
        qy <- outer(y1, y^2, "*")
        # each component is a standard bivariate normal density with
        # unit variances and correlation rho1 (or rho2)
        den1 <- exp( - (qx - 2 * rho1 * q1 + qy)/
               (2 * (1 - rho1^2)))/(2 * pi * sqrt(1 - rho1^2))
        den2 <- exp( - (qx - 2 * rho2 * q1 + qy)/
               (2 * (1 - rho2^2)))/(2 * pi * sqrt(1 - rho2^2))
        (den1 + den2)/2
}
> postscript("partb.ps",height=6.5,width=6.5,horizontal=F)
>  zb <- g(x, y, 0.8, -0.8)
>  contour(x, y, zb, nlevels = 20, labex = 0)
> dev.off()

Simulations: I will simulate the sample mean when sampling from the Cauchy distribution.
> # generate 10000 samples of size 25 and put them in a 10000 by 25 matrix
> xc <- matrix(rcauchy(10000*25), nrow=10000)
> # compute sample means the really slow way
> xcm0 <- mean(xc[1,])
> for ( i in 2:10000) xcm0 <- c(xcm0,mean(xc[i,]))
> # compute sample means the slow way:
> xcm <- apply(xc,1,mean)
> # faster using vector arithmetic
> one <- rep(1,25)
> xcm1 <- xc %*% one/25
> # Plot a kernel smoothed density estimate for xbar
> plot(ksmooth(xcm1,range.x=c(-6,6)),type='l',
    xlab="x",ylab="Density")
> # Compare estimated density of xbar to
> # Original Cauchy density
> xs <- seq(-6,6,length=1000)
> lines(xs,dcauchy(xs),lty=2)
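
To make a simulation like this reproducible, set the random number seed before generating anything; for example (the seed value is an arbitrary illustration):

> set.seed(17)
> xc <- matrix(rcauchy(10000*25), nrow=10000)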

Pitfalls: When you write a function that calls a function, and so on, you can get caught in a trap: you need to be aware of where functions look for the variables they use.
> # in a new directory
> a <- 3
> g <- function(x) {x^a}
> g(2)
[1] 8
> # now remove all those functions and stick the whole thing
> # inside a function
> remove(ls())
> ls()
character(0)
> w <- function(){
   a <- 3
   g <- function(x) { x^a }
   g(2)
}
> w()
Problem in g(2): Object "a" not found
Use traceback() to see the call stack
The problem is where S looks for the objects it needs inside g: S does not search the frame of the function in which g was defined, so the a created inside w is never found. Look up assign in Splus for help!
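The simplest fix is to pass a to g explicitly, so g never has to search for it; another is to put a where S will look with assign(..., frame=1). A sketch of the first fix (not from the original session):

> w <- function(){
     a <- 3
     g <- function(x, a) { x^a }
     g(2, a)
  }
> w()
[1] 8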

Looping: loops are slow, so use vector calculations instead: make the argument a vector, use outer, and use matrix arithmetic, as sketched below.
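For example, instead of looping over i and j to fill a matrix of the products x[i]*y[j] (the vectors here are illustrative):

> x <- 1:100
> y <- 1:50
> m <- matrix(0, length(x), length(y))
> for(i in seq(along=x)) for(j in seq(along=y)) m[i,j] <- x[i]*y[j]
> # one vectorized call does the same job much faster:
> m2 <- outer(x, y, "*")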

Help: Use the functions help or args when you know the name of a function but not how to use it. Use help.start() to get a Netscape-based on-line help browser. Go to http://lib.stat.cmu.edu/ and look at S-news, the S-Archive and other resources.
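For example (a sketch; output not shown):

> args(seq)      # just the argument list
> help(lm)       # full help page; ?lm is equivalent
> help.start()   # start the browser-based help system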



 
Richard Lockhart
2001-09-28