   Splus: overview & examples

Features:

1.
Interpreted, not compiled, programming language.

2.
High quality graphics output.

3.
Built in statistical analysis packages.

4.
Slow memory hog when looping.

5.
Generally good numerics.

6.
Wide variety of modern statistical techniques

7.
Fairly powerful syntax.

A list of things you need to know how to do:

1.
Set up a new Splus project
2.
Get data into SPlus
3.
Make simple plots: scatterplots and so on.
4.
Do regression
5.
Write functions
6.
Do simulations
7.
Avoid pitfalls (scoping, memory, speed)
8.
How to get help.

 All Functions and Datasets Add to Existing Plot ANOVA Models Bootstrap Methods                  Input/Output-Files               Missing Values                                              Starting a new Splus project:

ehlehl% mkdir Splus5project

ehlehl% cd Splus5project

ehlehl% Splus5 CHAPTER
Creating data directory for chapter .
Splus5 chapter Splus5project initialized.

ehlehl% Splus5
S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 5.1 Release 1 for Sun SPARC, SunOS 5.5 : 1999
Working data will be in .Data

Getting data into Splus:

```> # Making a vector of numbers
> xshort <- c(0,1,2,3,4)
# c is a function which can take an
# arbitrary number of arguments and
# strings them out one after another
> # Printing out an object -- type the name
> xshort
 0 1 2 3 4
> # Make a big object
> times <- seq(from=0, to=10,length=1001)
> # Print part of times
> times[1:10]
 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
> # Generate normal data
noise <- rnorm(1001)
> # Read data from a file
> gas[1:3,]
Cost Distance UnitPrice
1  9.80    390.4      31.2
2 10.41    413.6      31.4
3 10.50    415.2      31.5
> # gas is now a data.frame
> summary(gas)
Cost          Distance       UnitPrice
Min.: 6.33      Min.:198.4      Min.:31.20
1st Qu.:11.21   1st Qu.:423.4   1st Qu.:36.10
Median:13.12    Median:443.1    Median:40.10
Mean:12.71      Mean:443.0      Mean:39.15
3rd Qu.:14.09   3rd Qu.:465.6   3rd Qu.:43.30
Max.:15.67      Max.:534.5      Max.:47.10
> # Now what objects have we created?
> objects()
 ".Last.value" "gas"         "last.dump"   "times"
 "xshort"
```

Manipulating data

```> # Most functions operate on vectors componentwise
> y <- sin(2*pi*times)
> # add columns to the data.frame
> # Compute litres of gas used in tankful
> gas\$GasUsed  <- gas\$Cost/(gas\$UnitPrice/100)
> # Compute litres per hundred kilometres
> gas\$Consumption <- gas\$GasUsed/(gas\$Distance/100)
> var(gas[,c(2,4,5)])
Distance   GasUsed Consumption
Distance  1886.49916 90.893806 -10.2972991
GasUsed    90.89381  8.195885   0.3428260
Consumption   -10.29730  0.342826   0.2426734
> sqrt(diag(var(gas[,c(2,4,5)])))
 43.4338480  2.8628456  0.4926189
> apply(gas,2,mean)
Cost Distance UnitPrice     GasUsed  Consumption
12.7143 443.0374  39.15234 32.44423    7.346162
> # some examples of vector behaviour
> c(1,2,3)*c(1,2,3)
 1 4 9
> # some matrix manipulation
> c(1,-1,0)%*%var(gas[,c(2,4,5)]) %*% c(1,-1,0)
[,1]
[1,] 1712.907
> solve(var(gas[,c(2,4,5)]))%*%var(gas[,c(2,4,5)])
Distance       GasUsed   Consumption
Distance        1 -1.332268e-15  2.220446e-16
GasUsed        0  1.000000e+00 -3.552714e-15
Consumption        0 -8.526513e-14  1.000000e+00
```

Making simple plots

```> # Plot a graph of the sin(2*pi*t)
> # Open up a graphics window
> motif()
> plot(times,y,xlab="Time",ylab="Amplitude",
main="The Sine Function",type='l')
> # Saving a hard copy
> postscript(file='sine.ps',horizontal=F)
> plot(times,y,xlab="Time",ylab="Amplitude"
,main="The Sine Function",type='l')
> dev.off()
> postscript("gaspairs.ps",horizontal=F)
> pairs(gas)
> dev.off()
```

Other important plotting functions:

• tsplot
• contour
• persp
• matplot
• The trellis library.
• Dynamic graphics: spin and brush.

Regression: Splus has a built-in modelling language. Many different modelling functions use the same language.

```> Dreg <- lm(Distance ~ GasUsed, data=gas)
> Dreg
Call:
lm(formula = Distance ~ GasUsed, data = gas)

Coefficients:
(Intercept)  GasUsed
83.22513 11.09018

Degrees of freedom: 107 total; 105 residual
Residual standard error: 29.77981
> summary(Dreg)

Call: lm(formula = Distance ~ GasUsed, data = gas)
Residuals:
Min     1Q Median    3Q   Max
-74.02 -16.77 -2.967 17.76 100.8

Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 83.2251 32.9062     2.5292  0.0129
GasUsed 11.0902  1.0103    10.9766  0.0000

Residual standard error: 29.78 on 105 degrees of freedom
Multiple R-Squared: 0.5343
F-statistic: 120.5 on 1 and 105 degrees of freedom, the p-value is 0

Correlation of Coefficients:
(Intercept)
GasUsed -0.9962
> # what is Dreg?
> # List: with components
> names(Dreg)
 "coefficients"  "residuals"     "fitted.values" "effects"
 "R"             "rank"          "assign"        "df.residual"
 "contrasts"     "terms"         "call"
> names(Dreg\$call)
 ""        "formula" "data"
> Dreg\$call
lm(formula = Distance ~ GasUsed, data = gas)
> Dreg\$call\$formula
Distance ~ GasUsed
```
Categorical independent variables
```> # Some data sets are built in to S
> # Different objects are kept in different places.
> objects(4)
> summary(fuel.frame)
...
 "freeny.x"            "freeny.y"            "fuel.frame"
 "fuel.frame.orig"     "galaxy"              "gas"
...
> summary(fuel.frame)
Weight       Disp.        Mileage         Fuel        Type
Min.:1845    Min.: 73.0    Min.:18.00    Min.:2.703 Compact:15
1st Qu.:2571 1st Qu.:113.8 1st Qu.:21.00 1st Qu.:3.704   Large: 3
Median:2885  Median:144.5  Median:23.00  Median:4.348  Medium:13
Mean:2901    Mean:152.1    Mean:24.58    Mean:4.210   Small:13
3rd Qu.:3231 3rd Qu.:180.0 3rd Qu.:27.00 3rd Qu.:4.762  Sporty: 9
Max.:3855    Max.:305.0    Max.:37.00    Max.:5.556     Van: 7
i> Freg <- lm(Fuel ~ Disp. + Weight +Type, data=fuel.frame)
> summary(Freg)

Call: lm(formula = Fuel ~ Disp. + Weight + Type, data = fuel.frame)
Residuals:
Min      1Q   Median  3Q    Max
-0.6973 -0.2444 -0.01367 0.2 0.6363

Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept)  2.9606  0.6005     4.9306  0.0000
Disp.  0.0076  0.0017     4.3616  0.0001
Weight  0.0000  0.0003     0.1611  0.8726
Type1 -0.1453  0.1293    -1.1239  0.2662
Type2  0.0981  0.0470     2.0850  0.0420
Type3 -0.1241  0.0505    -2.4556  0.0174
Type4 -0.0436  0.0252    -1.7314  0.0893
Type5  0.1915  0.0337     5.6748  0.0000

Residual standard error: 0.3139 on 52 degrees of freedom
Multiple R-Squared: 0.8486
F-statistic: 41.65 on 7 and 52 degrees of freedom, the p-value is 0
```
```Correlation of Coefficients:
(Intercept)   Disp.  Weight   Type1   Type2   Type3   Type4
Disp.  0.4864
Weight -0.9415     -0.7477
Type1  0.3786     -0.2960 -0.1559
Type2  0.0888      0.3470 -0.2166 -0.4995
Type3 -0.7591     -0.0589  0.5929 -0.6197  0.1746
Type4 -0.3099     -0.1641  0.2937 -0.2997  0.1304  0.3021
Type5  0.7121      0.6013 -0.7688 -0.0053  0.2625 -0.3933 -0.2013
> coef(Freg)
(Intercept)       Disp.      Weight      Type1      Type2      Type3
2.960617 0.007594073 4.16396e-05 -0.1452804 0.09808411 -0.1240942

Type4     Type5
-0.04358047 0.1915062
> anova(Freg)
Analysis of Variance Table

Response: Fuel

Terms added sequentially (first to last)
Df Sum of Sq  Mean Sq  F Value        Pr(F)
Disp.  1  17.25253 17.25253 175.0661 0.000000e+00
Weight  1   7.93103  7.93103  80.4783 0.000000e+00
Type  5   3.54878  0.70976   7.2021 3.465263e-05
Residuals 52   5.12453  0.09855
> # Diagnostic plots:
> postscript("FuelDiag.ps",horizontal=F)
> # Lay the plots out in a 3 by 2 grid
> par(mfrow=c(3,2))
> plot(Freg)
> dev.off()
> # Examine the postscript file without leaving S
> !ghostview FuelDiag.ps&
```
Writing SPlus functions: Here I will make a contour plot for a density which is a 50/50 mixture of two bivariate normal densities with different correlations.
```> x <- seq(-3, 3, length = 200)
> y <- x
> g <- function(x, y, rho1, rho2 = 0)
{
#  g computes the density of a 50/50 mixture of two normal densities
#   at a grid of x,y values
#
#  Use outer to make a matrix whose ijth entry is x[i]*y[j]
q1 <- outer(x, y, "*")
y1 <- rep(1, length(y))
x1 <- rep(1, length(x))
qx <- outer(x^2, x1, "*")
qy <- outer(y1, y^2, "*")
den1 <- exp( - (qx - 2 * rho1 * q1 + qy)/
(2 * sqrt(1 - rho1^2)))/(2 * pi * sqrt(1 - rho1^2))
den2 <- exp( - (qx - 2 * rho2 * q1 + qy)/
(2 * sqrt(1 - rho2^2)))/(2 * pi * sqrt(1 - rho2^2))
(den1 + den2)/2
}
> postscript("partb.ps",height=6.5,width=6.5,horizontal=F)
>  zb <- g(x, y, 0.8, -0.8)
>  contour(x, y, zb, nlevels = 20, labex = 0)
> dev.off()
```
Simulations: I will simulate the sample mean when sampling from the Cauchy distribution.
```> # generate 1000 samples of size 25 and put them in  1000 by 25 matrix
> xc <- matrix(rcauchy(10000*25), nrow=10000)
> # compute sample means the really slow way
> xcm0 <- mean(xc[1,1])
> for ( i in 2:10000) xcm0 <- c(xcm0,mean(xc[i,]))
> # compute sample means the slow way:
> xcm <- apply(xc,1,mean)
> # faster using vector arithmetic
> one <- rep(1,25)
> xcm1 <- xc %*% one/25
> # Plot a kernel smoothed density estimate for xbar
> plot(ksmooth(xcm1,range.x=c(-6,6)),type='l',
xlab="x",ylab="Density")
> # Compare estimated density of xbar to
> # Original Cauchy density
> xs <- seq(-6,6,length=1000)
> lines(xs,dcauchy(xs),lty=2)
```
Pitfalls: When you write a function that calls a function and so on you can get caught in a trap: when functions look for variables you need to be aware of where they look.
```> # in a new directory
> a <- 3
> g <- function(x) {x^a}
> g(2)
 8
> # now remove all those functions and stick the whole thing
> # inside a function
> remove(ls())
>ls()
character(0)
> w <- function(){
a <- 3
g <- function(x) { x^a }
g(2)
}
> w()
Use traceback() to see the call stack
```
The problem is in where S looks for things it needs inside g. Look up assign in Splus for help!

Looping: is slow so use the vector calculations - make the argument a vector. Use outer. Use matrix arithmetic.

Help: Function help or args when you know the name of the function but not how to use it. Use help.start() to get a netscape based on line help browser. Go to http://lib.stat.cmu.edu/ and look at S-news, the S-Archive and other things.   