1, A small example dataset
Datasets used in class are usually small enough to enter by hand. Here we use records for 10 calves as a demonstration.
calf = c(1:10) # a vector of the integers 1 to 10
breed = c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH")
sex = c("M","M","M","M","F","F","F","F","F","M")
CE = c("U","E","U","U","H","E","H","E","E","C")
BWT = c(55,68,60,52,65,64,70,61,63,75)
beefdat = data.frame(calf,breed,sex,CE,BWT)
beefdat # looks at the data, exactly like the table
| calf | breed | sex | CE | BWT |
|---|---|---|---|---|
| 1 | AN | M | U | 55 |
| 2 | CH | M | E | 68 |
| 3 | HE | M | U | 60 |
| 4 | AN | M | U | 52 |
| 5 | CH | F | H | 65 |
| 6 | HE | F | E | 64 |
| 7 | CH | F | H | 70 |
| 8 | AN | F | E | 61 |
| 9 | HE | F | E | 63 |
| 10 | CH | M | C | 75 |
1.1 Building design matrices from the factors in the data
You can write your own function, or use model.matrix to generate the matrix.
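desgn <- function(v){
  # Turn each value into a column index 1..k
  if(is.numeric(v)){
    vn = v # assumes positive integer codes
  }else{
    vn = as.numeric(factor(v))
  }
  mrow = length(vn)
  mcol = max(vn) # levels(vn) is NULL for a numeric vector, so use the largest code instead
  X = matrix(data=c(0),nrow=mrow,ncol=mcol)
  # Put a 1 in the column that matches each record's level
  for(i in 1:mrow){
    ic = vn[i]
    X[i,ic] = 1
  }
  return(X)
}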
B = model.matrix(~breed -1,beefdat)
B
| | breedAN | breedCH | breedHE |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 |
| 9 | 0 | 0 | 1 |
| 10 | 0 | 1 | 0 |
S = model.matrix(~sex -1,beefdat)
S
| | sexF | sexM |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 1 | 0 |
| 6 | 1 | 0 |
| 7 | 1 | 0 |
| 8 | 1 | 0 |
| 9 | 1 | 0 |
| 10 | 0 | 1 |
C = model.matrix(~CE -1,beefdat)
C
| | CEC | CEE | CEH | CEU |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 | 0 |
| 6 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 |
| 8 | 0 | 1 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 |
| 10 | 1 | 0 | 0 | 0 |
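To see what such a matrix is used for, here is a minimal sketch that fits a breed-only model to BWT by least squares. It rebuilds the vectors so it runs on its own; the names `BtB`, `b_hat` and `breed_means` are introduced here for illustration and are not part of the original notes.

```r
# Rebuild the two columns needed so this block is self-contained
breed <- c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH")
BWT   <- c(55,68,60,52,65,64,70,61,63,75)

# One indicator column per breed, no intercept
B <- model.matrix(~ breed - 1)

# B'B is diagonal and holds the number of calves in each breed
BtB <- crossprod(B)

# Solving (B'B) b = B'y gives the least-squares estimate for each breed,
# which for a single factor is just the breed mean of BWT
b_hat <- solve(BtB, crossprod(B, BWT))

# Cross-check against a direct group mean
breed_means <- tapply(BWT, breed, mean)

print(b_hat)
print(breed_means)
```

With only one factor in the model, the two results agree; the matrix form becomes useful once several factors are fitted together.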
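1.2 Summary statistics
The per-level counts for breed, sex and CE shown below appear only when those columns are factors; from R 4.0 on, build the data frame with stringsAsFactors = TRUE to get the same output.
summary(beefdat)
| calf | breed | sex | CE | BWT |
|---|---|---|---|---|
| Min. : 1.00 | AN:3 | F:5 | C:1 | Min. :52.00 |
| 1st Qu.: 3.25 | CH:4 | M:5 | E:4 | 1st Qu.:60.25 |
| Median : 5.50 | HE:3 | | H:2 | Median :63.50 |
| Mean : 5.50 | | | U:3 | Mean :63.30 |
| 3rd Qu.: 7.75 | | | | 3rd Qu.:67.25 |
| Max. :10.00 | | | | Max. :75.00 |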
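summary() describes each column separately. A short sketch of how the same data could be summarised by group, using base R's tapply and table; the object names `mean_by_breed`, `mean_by_sex` and `counts` are chosen here for illustration.

```r
# Rebuild the example data so this block is self-contained
beefdat <- data.frame(
  calf  = 1:10,
  breed = c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH"),
  sex   = c("M","M","M","M","F","F","F","F","F","M"),
  BWT   = c(55,68,60,52,65,64,70,61,63,75)
)

# Mean BWT within each breed and within each sex
mean_by_breed <- tapply(beefdat$BWT, beefdat$breed, mean)
mean_by_sex   <- tapply(beefdat$BWT, beefdat$sex, mean)

# Two-way table of record counts: breed in rows, sex in columns
counts <- table(beefdat$breed, beefdat$sex)

print(mean_by_breed)
print(mean_by_sex)
print(counts)
```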
1.3 Mean and variance
Here we use the BWT column.
Mean
mean(BWT)
63.3
Variance
var(BWT)
46.6777777777778
Standard deviation
sd(BWT)
6.83211371229854
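The same quantities can be worked out step by step, which makes it clear that var() divides by n - 1 (the sample variance) rather than n. This is a small sketch; the intermediate names `n`, `m`, `ss`, `v` and `s` are introduced here.

```r
BWT <- c(55,68,60,52,65,64,70,61,63,75)

n  <- length(BWT)          # number of records
m  <- sum(BWT) / n         # mean
ss <- sum((BWT - m)^2)     # sum of squared deviations from the mean
v  <- ss / (n - 1)         # sample variance, same divisor as var()
s  <- sqrt(v)              # standard deviation

# Compare the hand computation with the built-in functions
print(c(mean = m, variance = v, sd = s))
print(c(mean = mean(BWT), variance = var(BWT), sd = sd(BWT)))
```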
1.4 Plotting
Histogram of BWT
hist(BWT)
Boxplot of BWT
boxplot(BWT)
Scatter plot of BWT against record index
plot(BWT)
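The plots above treat all ten records as one group. As a possible extension, the formula interface of boxplot draws one box per level of a factor. This sketch assumes you want BWT split by breed; the title and axis labels are illustrative.

```r
# Rebuild the two columns needed so this block is self-contained
beefdat <- data.frame(
  breed = c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH"),
  BWT   = c(55,68,60,52,65,64,70,61,63,75)
)

# One box per breed; the return value holds the statistics that were drawn
bp <- boxplot(BWT ~ breed, data = beefdat,
              main = "BWT by breed", xlab = "breed", ylab = "BWT")

# Group labels and the number of records behind each box
print(bp$names)
print(bp$n)
```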
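2 Handling large datasets
When a dataset is large, entering it by hand is no longer practical, and you need a function that reads the data from a file. For even larger data, say more than 100,000 records, R can still be used but is relatively slow; FORTRAN or C++ is recommended as the more efficient choice.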
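As a minimal sketch of reading data from a file, the block below writes the small example to a temporary CSV and reads it back with base R's read.csv. The temporary file stands in for a real data file, and the names `path` and `big` are hypothetical.

```r
# Small stand-in for a larger dataset
beefdat <- data.frame(
  calf  = 1:10,
  breed = c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH"),
  sex   = c("M","M","M","M","F","F","F","F","F","M"),
  BWT   = c(55,68,60,52,65,64,70,61,63,75)
)

# Hypothetical file location; replace with the path to your own file
path <- tempfile(fileext = ".csv")
write.csv(beefdat, path, row.names = FALSE)

# Read the file back; stringsAsFactors = TRUE turns text columns into factors
big <- read.csv(path, stringsAsFactors = TRUE)

str(big)       # column types and dimensions
head(big, 3)   # first few records, a quick check on a large file

unlink(path)   # remove the temporary file
```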
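3, Exercise
Requirements
Enter the data above into R, then carry out the analyses below.
Compute the mean Time for each Sex, and the mean Time for each Race
Compute the mean and variance of Time
Create design matrices for horse, sex, and race
Draw a histogram of Time for each race
Save the data
Create a new dataset that does not contain the sex column
Change the Time for Horse 6 from 134 to 136.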
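The last three tasks are data-handling steps. A sketch of the same kind of operations on the calf data from section 1, not on the exercise data: the replacement value 66, the file location and the names `no_sex` and `out` are all made up for illustration.

```r
beefdat <- data.frame(
  calf  = 1:10,
  breed = c("AN","CH","HE","AN","CH","HE","CH","AN","HE","CH"),
  sex   = c("M","M","M","M","F","F","F","F","F","M"),
  BWT   = c(55,68,60,52,65,64,70,61,63,75)
)

# New data frame without the sex column; the original is left unchanged
no_sex <- beefdat[, names(beefdat) != "sex"]

# Change one value, selecting the record by its id (66 is a hypothetical value)
beefdat$BWT[beefdat$calf == 6] <- 66

# Save to an R data file and read it back (hypothetical location)
out <- file.path(tempdir(), "beefdat.rds")
saveRDS(beefdat, out)
restored <- readRDS(out)

print(names(no_sex))
print(restored[restored$calf == 6, ])
```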

