.S 16 18
See http://www.research.att.com/~kwc/tutorials/NAACL-20001/svd.txt
for an electronic version of this handout.
Human-Computer Interaction
.BL
.LI
c1: Human machine interface for Lab ABC computer applications
.LI
c2: A survey of user opinion of computer system response time
.LI
c3: The EPS user interface management system
.LI
c4: System and human system engineering testing of EPS
.LI
c5: Relation of user-perceived response time to error measurement
.LE
Graphs
.BL
.LI
m1: The generation of random, binary, unordered trees
.LI
m2: The intersection graph of paths in trees
.LI
m3: Graph minors IV: Widths of trees and well-quasi-ordering
.LI
m4: Graph minors: A survey
.LE
.ft CW
.nf
.bp
bellcore
          c1 c2 c3 c4 c5 m1 m2 m3 m4
human      1  0  0  1  0  0  0  0  0
interface  1  0  1  0  0  0  0  0  0
computer   1  1  0  0  0  0  0  0  0
user       0  1  1  0  1  0  0  0  0
system     0  1  1  2  0  0  0  0  0
response   0  1  0  0  1  0  0  0  0
time       0  1  0  0  1  0  0  0  0
EPS        0  0  1  1  0  0  0  0  0
survey     0  1  0  0  0  0  0  0  1
trees      0  0  0  0  0  1  1  1  0
graph      0  0  0  0  0  0  1  1  1
minors     0  0  0  0  0  0  0  1  1

# This will define the matrix "bellcore" in S:
.ft R
"bellcore" <- structure(.Data = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
	0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
	1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
	0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),
	.Dim = c(12, 9),
	.Dimnames = list(
		c("human", "interface", "computer", "user", "system",
		  "response", "time", "EPS", "survey", "trees",
		  "graph", "minors"),
		c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))
.ft
.bp
# compute the SVD
b <- svd(bellcore)

# The result of the SVD (U x D x V^T) is almost
# the same as the input matrix
range(bellcore - b$u %*% diag(b$d) %*% t(b$v))
[1] -1.332268e-15  1.221245e-15

# U^T x U is almost the same as I (diag(9))
range(diag(9) - t(b$u) %*% b$u)
[1] -1.110223e-15  2.442491e-15

# V^T x V is almost the same as I (diag(9))
range(diag(9) - t(b$v) %*% b$v)
[1] -1.332268e-15  4.440892e-16
.bp
# Cosines
matcos4 <- function(m) {
	norm <- sweep(m, 2, apply(m, 2, l2norm), "/")
	t(norm) %*% norm
}
l2norm <- function(x) sqrt(sum(x^2))

# here are the document cosines
print(matcos4(bellcore), digits=2)
     c1   c2   c3   c4   c5   m1   m2   m3   m4
c1 1.00 0.24 0.29 0.24 0.00 0.00 0.00 0.00 0.00
c2 0.24 1.00 0.41 0.33 0.71 0.00 0.00 0.00 0.24
c3 0.29 0.41 1.00 0.61 0.29 0.00 0.00 0.00 0.00
c4 0.24 0.33 0.61 1.00 0.00 0.00 0.00 0.00 0.00
c5 0.00 0.71 0.29 0.00 1.00 0.00 0.00 0.00 0.00
m1 0.00 0.00 0.00 0.00 0.00 1.00 0.71 0.58 0.00
m2 0.00 0.00 0.00 0.00 0.00 0.71 1.00 0.82 0.41
m3 0.00 0.00 0.00 0.00 0.00 0.58 0.82 1.00 0.67
m4 0.00 0.24 0.00 0.00 0.00 0.00 0.41 0.67 1.00
.bp
# using SVD to compute cosines
# optional arg r (rank reduction) defaults
# to full rank
matcos5 <- function(m, r = min(dim(m))) {
	norm <- sweep(m, 2, apply(m, 2, l2norm), "/")
	b <- svd(norm, nu = 0)
	b$v[, 1:r] %*% diag(b$d[1:r])^2 %*% t(b$v[, 1:r])
}

# matcos5 is almost the same as matcos4
# (with no rank reduction)
range(matcos4(bellcore) - matcos5(bellcore))
[1] -3.219647e-15  7.606762e-16
.bp
.fi
.ft R
The folks at Bellcore argue that the cosines get better if you reduce
the rank.
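For readers who want to replay the S session above in another language, here is a rough NumPy translation of the same checks. It is a sketch, not part of the original handout; the matrix entries come straight from the bellcore table, and note that NumPy's svd returns V^T where S's svd returns V.

```python
import numpy as np

# The 12x9 term-by-document counts from the bellcore table above.
# Columns: c1 c2 c3 c4 c5 m1 m2 m3 m4
bellcore = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

u, d, vt = np.linalg.svd(bellcore, full_matrices=False)

# U diag(d) V^T reconstructs the input, up to rounding error
assert np.allclose(bellcore, u @ np.diag(d) @ vt)
# U^T U and V^T V are both (nearly) the 9x9 identity
assert np.allclose(u.T @ u, np.eye(9))
assert np.allclose(vt @ vt.T, np.eye(9))

def matcos(m):
    """All pairwise column cosines, like matcos4 above."""
    norm = m / np.linalg.norm(m, axis=0)  # unit-length columns
    return norm.T @ norm

# e.g. cosine(c2, c5) = 3/sqrt(18), which rounds to the 0.71 in the table
print(round(matcos(bellcore)[1, 4], 2))  # → 0.71
```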
I think that they are doing something like this:
.ft
.nf
.ps -2
b <- svd(bellcore)
print(matcos4(b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])), digits=2)
        [,1] [,2]    [,3]  [,4] [,5]  [,6]  [,7]  [,8]    [,9]
 [1,]  1.000 0.91  1.0000  0.99 0.88 -0.19 -0.17 -0.16 -0.0117
 [2,]  0.914 1.00  0.9166  0.87 1.00  0.23  0.25  0.25  0.3945
 [3,]  1.000 0.92  1.0000  0.99 0.88 -0.18 -0.16 -0.15 -0.0057
 [4,]  0.995 0.87  0.9942  1.00 0.83 -0.28 -0.27 -0.26 -0.1137
 [5,]  0.880 1.00  0.8827  0.83 1.00  0.30  0.32  0.33  0.4648
 [6,] -0.185 0.23 -0.1793 -0.28 0.30  1.00  1.00  1.00  0.9848
 [7,] -0.168 0.25 -0.1617 -0.27 0.32  1.00  1.00  1.00  0.9878
 [8,] -0.160 0.25 -0.1541 -0.26 0.33  1.00  1.00  1.00  0.9889
 [9,] -0.012 0.39 -0.0057 -0.11 0.46  0.98  0.99  0.99  1.0000
.bp
# Clustering: better after rank reduction
# both for documents as well as terms

# documents
plclust(hclust(sim=matcos4(bellcore)),
	labels=dimnames(bellcore)[[2]])
plclust(hclust(sim=matcos4(b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2]))),
	labels=dimnames(bellcore)[[2]])

# terms
plclust(hclust(sim=matcos4(t(bellcore))),
	labels=dimnames(t(bellcore))[[2]])
plclust(hclust(sim=matcos4(t(b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])))),
	labels=dimnames(t(bellcore))[[2]])
.bp
.fi
.ft R
They want large cosines (>0.9) between pairs of m* documents and
between pairs of c* documents, and small cosines between m* and c*
documents.
I think that a more legitimate operation is to normalize before rank
reduction.
This produces smaller cosines.
Note that the cosine is no longer one along the main diagonal.
Maybe this is a problem.
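The reconstruct-then-compare-cosines procedure just described can be sketched in NumPy (a sketch, assuming the same 12x9 bellcore matrix as above): keep only the two largest singular values, rebuild the matrix, and then take column cosines. Within-group cosines climb toward 1 and cross-group cosines fall toward (or below) 0.

```python
import numpy as np

# 12x9 term-by-document counts from the handout
# (rows: human, interface, computer, user, system, response,
#        time, EPS, survey, trees, graph, minors; columns: c1..c5, m1..m4)
bellcore = np.array([
    [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1],
], dtype=float)

def matcos(m):
    norm = m / np.linalg.norm(m, axis=0)  # unit-length columns
    return norm.T @ norm                  # all pairwise cosines

u, d, vt = np.linalg.svd(bellcore, full_matrices=False)
r = 2
# rank-2 reconstruction, then cosines (reduce first, normalize second)
reduced = u[:, :r] @ np.diag(d[:r]) @ vt[:r, :]
cos = matcos(reduced)

# within-group pair (c1, c2) is now large; cross-group (c1, m1) is negative,
# as in the printed table above
print(round(cos[0, 1], 2), round(cos[0, 5], 2))
```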
.ft
.nf
.ps -2
round(matcos5(bellcore,2), digits=2)
       [,1] [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8] [,9]
 [1,]  0.19 0.36  0.34  0.28  0.27 -0.02 -0.02 -0.01 0.05
 [2,]  0.36 0.70  0.66  0.54  0.51  0.01  0.04  0.05 0.13
 [3,]  0.34 0.66  0.62  0.51  0.49 -0.04 -0.03 -0.02 0.08
 [4,]  0.28 0.54  0.51  0.43  0.40 -0.04 -0.03 -0.02 0.07
 [5,]  0.27 0.51  0.49  0.40  0.38 -0.02 -0.01  0.00 0.08
 [6,] -0.02 0.01 -0.04 -0.04 -0.02  0.53  0.67  0.69 0.44
 [7,] -0.02 0.04 -0.03 -0.03 -0.01  0.67  0.86  0.88 0.56
 [8,] -0.01 0.05 -0.02 -0.02  0.00  0.69  0.88  0.90 0.58
 [9,]  0.05 0.13  0.08  0.07  0.08  0.44  0.56  0.58 0.39
.ps
.bp
.fi
.ft R
In general, the more rank reduction you do, the more you change the
cosines.
.ft
.nf
for(i in 1:9) {
	print(c(i, range(matcos5(bellcore,i) - matcos4(bellcore))))
}
[1]  1.0000000 -0.9990365  0.1661593
[1]  2.0000000 -0.8096205  0.4391896
[1]  3.0000000 -0.6295303  0.3993735
[1]  4.0000000 -0.6133425  0.2900175
[1]  5.0000000 -0.2181424  0.1628279
[1]  6.00000000 -0.11189952  0.06829901
[1]  7.00000000 -0.08707455  0.05850891
[1]  8.00000000 -0.01757237  0.01464602
[1]  9.000000e+00 -7.606762e-16  3.219647e-15
.bp
.fi
.ft R
Imagine that we have the query: "human computer interface"
Ideally, documents c1-c5 would have large cosines and m1-m4 would have
small cosines.
They represent the query as a vector q:
.ft
.nf
t(q)
          [,1]
human        1
interface    0
computer     1
user         0
system       0
response     0
time         0
EPS          0
survey       0
trees        0
graph        0
minors       0
.fi
.ft R
Note that "interface" is deleted because it wasn't used in the library.
The cosines are
.nf
.ft
.ps -2
q %*% bellcore / outer(l2norm(q), apply(bellcore, 2, l2norm))
            c1        c2 c3        c4 c5 m1 m2 m3 m4
[1,] 0.8164966 0.2886751  0 0.2886751  0  0  0  0  0
.ps 16
.ft
.bp
.ft R
q and c3 get a 0 cosine because they have no words in common:
.ft CW
.nf
q %*% bellcore
     c1 c2 c3 c4 c5 m1 m2 m3 m4
[1,]  2  1  0  1  0  0  0  0  0
.ft R
.fi
but if you reduce the rank, then you get large cosines for the first 5
documents (c1-c5) and small ones for the last 4 (m1-m4).
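The query arithmetic just described is easy to replay in NumPy (a sketch, assuming the same bellcore matrix as earlier; q is the 0/1 vector shown in t(q), with 1s in the "human" and "computer" positions):

```python
import numpy as np

# 12x9 term-by-document counts from the handout
# (rows: human, interface, computer, user, system, response,
#        time, EPS, survey, trees, graph, minors; columns: c1..c5, m1..m4)
bellcore = np.array([
    [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1],
], dtype=float)

q = np.zeros(12)
q[0], q[2] = 1, 1   # "human" and "computer"

# full-rank cosines: c1 = 2/sqrt(6) ~ 0.82, but c3 = 0 (no shared words)
doc_norms = np.linalg.norm(bellcore, axis=0)
cos = (q @ bellcore) / (np.linalg.norm(q) * doc_norms)
print(np.round(cos, 2))

# after rank-2 reduction, every c document outscores every m document
u, d, vt = np.linalg.svd(bellcore, full_matrices=False)
reduced = u[:, :2] @ np.diag(d[:2]) @ vt[:2, :]
cos2 = (q @ reduced) / (np.linalg.norm(q) * doc_norms)
print(np.round(cos2, 2))
```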
.ft
.nf
.ps -2
print(q %*% b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2]) /
	outer(l2norm(q), apply(bellcore, 2, l2norm)), digits=2)
     [,1] [,2] [,3] [,4] [,5]  [,6]   [,7]   [,8]  [,9]
[1,] 0.13 0.26 0.26 0.25 0.17 -0.02 -0.028 -0.029 0.013
.ps
.bp
# SVD and the probabilistic retrieval model

bellcore2 contains -log2 tail probs; for example,

bellcore2["human", "c1"] =
	-log2(sum(bellcore["human",] >= bellcore["human", "c1"]) /
		length(bellcore["human",]))

print(bellcore2, digits=2)
           c1  c2  c3  c4  c5  m1  m2  m3  m4
human     2.2 0.0 0.0 2.2 0.0 0.0 0.0 0.0 0.0
interface 2.2 0.0 2.2 0.0 0.0 0.0 0.0 0.0 0.0
computer  2.2 2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
user      0.0 1.6 1.6 0.0 1.6 0.0 0.0 0.0 0.0
system    0.0 1.6 1.6 3.2 0.0 0.0 0.0 0.0 0.0
response  0.0 2.2 0.0 0.0 2.2 0.0 0.0 0.0 0.0
time      0.0 2.2 0.0 0.0 2.2 0.0 0.0 0.0 0.0
EPS       0.0 0.0 2.2 2.2 0.0 0.0 0.0 0.0 0.0
survey    0.0 2.2 0.0 0.0 0.0 0.0 0.0 0.0 2.2
trees     0.0 0.0 0.0 0.0 0.0 1.6 1.6 1.6 0.0
graph     0.0 0.0 0.0 0.0 0.0 0.0 1.6 1.6 1.6
minors    0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.2 2.2
.ft R
Let me score the documents by:
.ft
q %*% bellcore2
.bp
.ft R
This can be interpreted as saying that
.ft
.nf
score(doc) =    PROD     1/Pr(freq[w,] >= freq[w,doc])
             w in query
.fi
.ft R
(the computed score is the log, base 2, of this product, since the
entries of bellcore2 are -log2 probabilities).
As you can see, this has basically the same problem as the cosine rule.
c3 gets a low score because it has no words in common with the query.
.ft
.nf
print(q %*% bellcore2, digits=2)
      c1  c2 c3  c4 c5 m1 m2 m3 m4
[1,] 4.3 2.2  0 2.2  0  0  0  0  0
.fi
.ft R
If we use SVD to reduce the rank down to 2, we get large scores for the
first 5 documents and small scores for the last 4 (as desired).
.ps -2
.ft
.nf
b <- svd(bellcore2)
print(q %*% b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2]), digits=2)
     [,1] [,2] [,3] [,4] [,5]   [,6]   [,7]  [,8] [,9]
[1,]  1.3  1.8  1.9  2.3 0.72 -0.035 -0.081 -0.15 0.04
.ps
.fi
.ft R
I also feel good that the larger values in q %*% bellcore2 aren't so
large after rank reduction.
In general, likelihood ratios tend to get very large.
I'm hoping that rank reduction might help keep them within reasonable
limits.
.ft
.nf
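The bellcore2 weighting and the rank-reduced scores above can also be sketched in NumPy (a sketch, assuming the same matrix and query as before): each cell of bellcore2 is -log2 of the fraction of documents whose frequency for that term is at least this document's frequency.

```python
import numpy as np

# 12x9 term-by-document counts from the handout
# (rows: human, interface, computer, user, system, response,
#        time, EPS, survey, trees, graph, minors; columns: c1..c5, m1..m4)
bellcore = np.array([
    [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1],
], dtype=float)

# bellcore2[w, d] = -log2(#{docs: freq[w, doc] >= freq[w, d]} / #docs)
n_docs = bellcore.shape[1]
tail = np.array([[np.mean(row >= row[j]) for j in range(n_docs)]
                 for row in bellcore])
bellcore2 = -np.log2(tail)

# "human" appears in 2 of 9 documents, so c1 and c4 get -log2(2/9) ~ 2.2
assert abs(bellcore2[0, 0] - (-np.log2(2 / 9))) < 1e-12

q = np.zeros(12)
q[0], q[2] = 1, 1   # "human" and "computer"

# raw scores: c1 = 2.2 + 2.2 ~ 4.3; c3, c5, and the m documents get 0
print(np.round(q @ bellcore2, 1))

# rank-2 scores: all five c documents beat all four m documents
u, d, vt = np.linalg.svd(bellcore2, full_matrices=False)
scores = q @ (u[:, :2] @ np.diag(d[:2]) @ vt[:2, :])
print(np.round(scores, 2))
```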