As a less stringent imputation task, we left out batches of individual values and evaluated our ability to impute them

[...] conserved regulatory mechanisms. Here, we report that coupled matrix-tensor factorization (CMTF) can reduce these data into consistent patterns by recognizing the intrinsic structure of these data. We use measurements from two previous studies of HIV- and SARS-CoV-2-infected subjects as examples. CMTF outperforms standard methods like principal components analysis (PCA) in the extent of data reduction while maintaining equivalent prediction of immune functional responses and disease status. Under CMTF, model interpretation improves through effective data reduction, separation of the Fc and antigen-binding effects, and recognition of consistent patterns across individual measurements. Data reduction also helps make prediction models more replicable. Therefore, we propose that CMTF is an effective general strategy for data exploration in systems serology.

[...] receptor-antigen pairs across all subjects (see Materials and Methods). We then performed CMTF, which effectively filled these in, and calculated the Q2X of the inferred values compared with the left-out data (Fig 3A). Factorization imputed these values with accuracy similar to the variance explained within observed measurements up to six components (Fig 2A), supporting that it can identify meaningful patterns even in the presence of missing measurements. Because we were effectively leaving out entire columns of data when arranged in a flattened matrix form, we could not compare this performance with PCA. Using the average along the receptor or antigen dimensions led to Q2X values very close to 0. As a less stringent imputation task, we left out batches of individual values and evaluated our ability to impute them. CMTF showed similar or slightly better performance when imputing individual values compared with PCA (Fig 3B).
This provides additional evidence that the patterns identified by factorization are a meaningful representation of the data.

Figure 3. CMTF accurately imputes missing values. Percent variance predicted (Q2X) versus the number of components used for imputation of 15 randomly held-out receptor-antigen pairs. Error bars indicate standard error of the mean from repeatedly held-out pairs ([...]).

[...] is the total number of components in the factorization. The original tensor is approximated as a sum of rank-one tensors constructed by the vector outer product along each mode. The original matrix is represented by the sum of rank-one matrices formed by the outer product of row and column vectors. For the [...] are vectors indicating variation along the subject, receptor, and antigen dimensions, respectively, and [...] is a vector indicating variation along glycan forms within the glycan matrix.

Decomposition was initialized using singular value decomposition of the unfolded data along each mode, with missing values imputed by a one-component PCA model and entirely missing columns removed. We then optimized the decomposition using an alternating least squares (ALS) scheme (Kolda & Bader, 2009) for up to 2,000 iterations. In each ALS iteration, linear least squares solving was performed on each mode separately (preprint: Acar [...]) [...] are the tensor unfoldings of [...] along each mode, and [...]

[...] and implemented within scikit-learn (Pedregosa [...]) and the SVD algorithm. Missing values were handled by an expectation-maximization approach wherein they were filled in repeatedly by PCA. This filling step was performed for up to 100 iterations or until convergence as determined by a tolerance of 1 × 10^-5.
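To make the coupled decomposition and its ALS updates concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes fully observed data (no missing-value handling), assumes the tensor and the matrix share only the subject factor, and uses hypothetical names (`cmtf_als`, `khatri_rao`, `solve_factor`) and a synthetic rank-2 example that do not come from the original text.

```python
import numpy as np


def khatri_rao(u, v):
    """Column-wise Kronecker product; row (i * len(v) + j) equals u[i] * v[j]."""
    rank = u.shape[1]
    return (u[:, None, :] * v[None, :, :]).reshape(-1, rank)


def solve_factor(design, target):
    """Least-squares solution F of target ≈ F @ design.T."""
    return np.linalg.lstsq(design, target.T, rcond=None)[0].T


def cmtf_als(tensor, matrix, rank, n_iter=200):
    """Factor a 3-mode tensor and a matrix that share mode 0 (assumed: subjects)."""
    # Unfold the tensor along each mode (row-major flattening of the other two modes).
    unfold = [np.moveaxis(tensor, m, 0).reshape(tensor.shape[m], -1) for m in range(3)]
    # Initialize each tensor factor from the leading left singular vectors.
    a, b, c = (np.linalg.svd(x, full_matrices=False)[0][:, :rank] for x in unfold)
    errors = []
    for _ in range(n_iter):
        # Matrix-only factor: matrix ≈ a @ d.T
        d = solve_factor(a, matrix.T)
        # Shared factor: fit the unfolded tensor and the matrix jointly.
        design = np.vstack([khatri_rao(b, c), d])
        a = solve_factor(design, np.hstack([unfold[0], matrix]))
        # Tensor-only factors: one least-squares solve per remaining mode.
        b = solve_factor(khatri_rao(a, c), unfold[1])
        c = solve_factor(khatri_rao(a, b), unfold[2])
        # Track the joint squared reconstruction error.
        tensor_hat = (a @ khatri_rao(b, c).T).reshape(tensor.shape)
        matrix_hat = a @ d.T
        errors.append(np.sum((tensor - tensor_hat) ** 2) + np.sum((matrix - matrix_hat) ** 2))
    return (a, b, c, d), errors


# Synthetic, noise-free rank-2 data (illustrative only).
rng = np.random.default_rng(0)
true = [rng.normal(size=(n, 2)) for n in (12, 6, 5, 4)]
tensor = np.einsum("ir,jr,kr->ijk", *true[:3])
matrix = true[0] @ true[3].T

factors, errors = cmtf_als(tensor, matrix, rank=2)
total = np.sum(tensor ** 2) + np.sum(matrix ** 2)
print("Fraction of variance explained:", 1 - errors[-1] / total)
```

Each update solves its least-squares subproblem exactly, so the joint error cannot increase from one iteration to the next. Handling missing measurements, as the study does, would additionally require masking or imputing entries before each solve.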
Missingness imputation

To evaluate the ability of factorization to impute missing data, we introduced new missing values by removing (i) entire receptor-antigen pairs or (ii) individual values from the antigen-specific tensor as indicated, and then quantified the variance explained on reconstruction (Q2X). More specifically, in the first situation, fifteen randomly selected receptor-antigen pairs were entirely removed (2,715 values) and marked as missing across all subjects, leaving approximately 93,000 values for training. In the second, fifteen randomly selected individual values were removed, leaving approximately 96,000 training values. CMTF decomposition was performed in each trial as described before, and the left-out data were compared with the reconstructed values. There were 20 or 10 trials performed in each imputation situation.
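The two hold-out schemes and the Q2X score can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the tensor is assumed to be ordered subjects × receptors × antigens, and Q2X is assumed to be one minus the squared reconstruction error over the squared magnitude of the held-out values.

```python
import numpy as np


def hold_out_pairs(tensor, n_pairs, rng):
    """Mark n_pairs random (receptor, antigen) pairs as missing across all subjects."""
    _, n_rec, n_ag = tensor.shape
    flat = rng.choice(n_rec * n_ag, size=n_pairs, replace=False)
    rec_idx, ag_idx = np.unravel_index(flat, (n_rec, n_ag))
    mask = np.zeros(tensor.shape, dtype=bool)
    mask[:, rec_idx, ag_idx] = True
    masked = tensor.astype(float)  # astype returns a copy, so the input is untouched
    masked[mask] = np.nan
    return masked, mask


def hold_out_values(tensor, n_values, rng):
    """Mark n_values randomly chosen observed entries as missing."""
    observed = np.flatnonzero(np.isfinite(tensor))
    chosen = rng.choice(observed, size=n_values, replace=False)
    mask = np.zeros(tensor.size, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(tensor.shape)
    masked = tensor.astype(float)
    masked[mask] = np.nan
    return masked, mask


def q2x(original, reconstructed, mask):
    """Assumed definition: 1 - squared error / squared magnitude over held-out entries."""
    keep = mask & np.isfinite(original)  # skip entries that were never measured
    residual = original[keep] - reconstructed[keep]
    return 1.0 - np.sum(residual ** 2) / np.sum(original[keep] ** 2)
```

In use, the masked tensor would be passed to the factorization, and `q2x(original, reconstruction, mask)` would be averaged over repeated trials. With 181 subjects, removing 15 pairs masks 15 × 181 = 2,715 entries, which matches the count reported above if no held-out entry was already missing.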