Pan Matrix data
Datasets usually provide raw data for analysis. This raw data often comes in spreadsheet form, but can be any collection of data, on which analysis can be performed.
The file pan_matrix.txt is a huge table (tab-separated columns) where each row corresponds to a genome and each column to a domain sequences family. The rows are named by the BIOID-code, see map_ecoli.txt to look up the strain names. The columns are named Cluster 1, Cluster 2,...etc. The corresponding Pfam-A domain sequence is given in the file cluster_info.txt (see below). In cell (i,j) in this table you find the number of occurrences that domain sequence j has in genome number i.