Glycobiology Advance Access originally published online on September 22, 2004
Glycobiology 2005 15(2):153-164; doi:10.1093/glycob/cwh151
Glycobiology vol. 15 no. 2 © Oxford University Press 2005; all rights reserved.
Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites
Center for Biological Sequence Analysis, BioCentrum, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark
1 To whom correspondence should be addressed; e-mail: karin.julenius{at}sbc.su.se
Received on January 9, 2004; revised on September 15, 2004; accepted on September 15, 2004
| Abstract |
|---|
O-GalNAc-glycosylation is one of the main types of glycosylation in mammalian cells. No consensus recognition sequence for the O-glycosyltransferases is known, making prediction methods necessary to bridge the gap between the large number of known protein sequences and the small number of proteins experimentally investigated with regard to glycosylation status. From O-GLYCBASE a total of 86 mammalian proteins experimentally investigated for in vivo O-GalNAc sites were extracted. Mammalian protein homolog comparisons showed that a glycosylated serine or threonine is less likely to be precisely conserved than a nonglycosylated one. The Protein Data Bank was analyzed for structural information, and 12 glycosylated structures were obtained. All positive sites were found in coil or turn regions. A method for predicting the location of mucin-type glycosylation sites was trained using a neural network approach. The best overall network used as input amino acid composition and averaged surface accessibility predictions, together with substitution matrix profile encoding of the sequence. To improve prediction on isolated (single) sites, networks were trained on isolated sites only. The final method combines predictions from the best overall network and the best isolated site network; it correctly predicted 76% of the glycosylated residues and 93% of the nonglycosylated residues. NetOGlyc 3.1 can predict sites for completely new proteins without losing its performance. The fact that the sites could be predicted from averaged properties, together with the fact that glycosylation sites are not precisely conserved, indicates that mucin-type glycosylation in most cases is a bulk property and not a very site-specific one. NetOGlyc 3.1 is made available at www.cbs.dtu.dk/services/netoglyc.
Key words: machine learning / mucin-type / neural networks / O-glycosylation / prediction
| Introduction |
|---|
Protein glycosylation is more abundant and structurally diverse than all other types of posttranslational modifications combined (Hart, 1992
In one of the most abundant types of mammalian glycosylation, an N-acetylgalactosamine (GalNAc) is α-1 linked to the hydroxyl group of a serine or threonine residue. This type of glycosylation is also called mucin-type. Mucin-type glycans are found on many secreted and membrane-bound mucins, but also on other glycoproteins. Mucins typically have very high carbohydrate content (>50% of the dry weight) and are the principal component of mucus, the gel that protects epithelial surfaces from dehydration, mechanical injury, proteases, and pathogens (Carraway and Hull, 1991; Strous and Dekker, 1992). The protein backbone of a mucin contains a number of repetitive sequences, including virtually all the O-linked oligosaccharide attachment sites. Although these differ in terms of length and sequence from mucin to mucin, they all have a high serine, threonine, and proline content and are sometimes referred to as Ser/Thr/Pro-rich domains. Due to the steric hindrance introduced by the glycans, these domains adopt a stiff extended conformation, with an average length of 2.5 Å per amino acid residue (Coltart et al., 2002; Jentoft, 1990).
The biosynthesis of mucin-type glycosylation takes place in the rough endoplasmic reticulum and the Golgi complex after N-glycosylation, folding, and oligomerization (Asker et al., 1995; Peters et al., 1989). As opposed to the en bloc transfer of the high-mannose oligosaccharide involved in N-glycosylation, O-glycosylation is a stepwise process adding one monosaccharide at a time. The addition of GalNAc to serine and threonine residues is what governs the site specificity, and this process is mediated by at least 14 different UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferases (Wang et al., 2003). From sequence similarity, it is estimated that there are up to 24 unique GalNAc-transferase genes; see Ten Hagen et al. (2003) for a recent review. The different transferases have overlapping but different specificities and are differentially expressed (Sørensen et al., 1995; Ten Hagen et al., 2003; Van den Steen et al., 1998). Although no consensus sequence has been formulated, many studies have noted the skew in amino acid composition around mucin-type O-glycosylation sites (Christlet and Veluraja, 2001; Elhammer et al., 1993; Hansen et al., 1998; Wilson et al., 1991, for example), with a higher frequency of prolines, serines, threonines, and alanines than expected. A number of studies have investigated the effect of flanking residues in in vitro experiments on synthetic peptides (Nishimori et al., 1994; O'Connell et al., 1992; Yoshida et al., 1997; Young et al., 1979), and especially the importance of prolines at certain positions has been confirmed. There is now strong support for the theory that mucin-type glycosylation of multisite substrates proceeds in a hierarchical manner, because some of the characterized UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferases seem to glycosylate only peptides that are already partly glycosylated (Bennett et al., 1999; Ten Hagen et al., 1999, 2001). This could partly be explained by a recent nuclear magnetic resonance (NMR) study showing that the preferred substrates of different transferases had different secondary structure, in terms of slightly different dihedral angles, and that previous glycosylation of a nearby residue affected these structural propensities (Kinarsky et al., 2003).
Prediction of glycosylation sites is a valuable tool when trying to characterize a new protein, for example, to help interpret mass spectrometry results. Predicted mucin-type O-glycosylation is one of the important features when predicting orphan protein function (Jensen et al., 2002, 2003), and because O-glycosylation affects the structure of the protein and occurs primarily in surface-exposed regions, predicted glycosylation sites may be used to improve protein structure prediction as well. Prediction can also be useful in protein engineering to engineer or abolish O-glycosylation sites and to design competitive inhibitors of glycosyltransferases (Hansen et al., 1998).
The most well-known and tested prediction methods for mucin-type O-glycosylation sites are a matrix statistics method (Elhammer et al., 1993), a vector projection method (Chou et al., 1995; Chou, 1995), and a neural network method (Hansen et al., 1995, 1998). All these methods have been based on quite limited data, and when compared in independent experimental studies, none has shown convincing predictive performance (Gerken et al., 1997; Neumann et al., 1998). Gerken et al. (1997) failed to find any correlation between the outputs of the predictor methods and the experimentally determined degree of glycosylation for individual serines and threonines in a highly glycosylated mucin peptide, something none of the methods was intended for. There are also three other predictors developed using different neural network methods (Cai and Chou, 1996; Cai et al., 1997, 2002). The main problem with these predictors is that although modern machine learning approaches have been used, the data sets have not been updated. The training set consists of 195 positive and 110 negative sites and the test set of only 26 positive and 4 negative sites. In two of the articles (Cai and Chou, 1996; Cai et al., 2002) the only performance reported is the number of correct predictions: 26 and 23 out of 30, respectively. Note that a prediction method that predicts all sites to be positive will be correct for 26 out of 30 sites, but not very useful.
The neural network method developed by Hansen et al. (1998) is available online (www.cbs.dtu.dk/services/netoglyc-2.0) and had 5000 queries/month during 2003. It was trained on data available at that time, in total 299 O-GalNAc sites from mammalian proteins. Through continuous updates of our glycosylation database OGlycBase (www.cbs.dtu.dk/databases/oglycbase), we now have access to 421 experimentally verified sites, an increase of more than 40%. When working with small data sets like this, the increase in available data motivates an update, and we also wanted to try predicting not only from sequence but also from sequence-derived features such as predicted structure. Elhammer et al. (1993) and Hansen et al. (1998) showed that glycosylation correlates with predicted secondary structure, and a number of experimental studies show that UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase substrates adopt an extended β-like or turn-like conformation (Coltart et al., 2002; Kinarsky et al., 2003; Kirnarsky et al., 1998; O'Connell et al., 1991; Schuman et al., 2003) and that mucin-type glycosylation induces a more rigid extended structure (Schuman et al., 2000, 2003; Tagashira et al., 2002).
We have searched the Protein Data Bank (Westbrook et al., 2003) for structural information on 86 mammalian proteins containing a total of 421 experimentally verified mucin-type glycosylation sites. Twelve structures were obtained. All sites were located in coil or turn regions: near the N- or C-termini of the proteins, in linker regions between domains, or in coil regions connecting secondary structure elements. We found that a glycosylated serine or threonine is less likely to be precisely conserved between mammalian protein homologs and more likely to be surface exposed than a nonglycosylated serine or threonine. We have trained a new predictor method, NetOGlyc 3.1, which correctly predicts 76% of the positive sites and 93% of the negative sites. We show that NetOGlyc 3.1 can predict sites for completely new proteins with no loss in performance.
| Results |
|---|
Structural context of O-glycosylation sites
The Protein Data Bank (Westbrook et al., 2003
A summary of where in the structures the mucin-type glycosylation sites are located can be found in Table I. All sites, both the glycosylated and the unglycosylated, were found in coil or turn regions. Seven were found near the N- or the C-termini of the polypeptide chains, four of these in the same structure (2GMF). Four sites were located in linker regions between two domains. There were four intradomain sites, all located in coil regions connecting two α-helices. These coil regions were loosely associated with the globular domains. All sites were localized in or close to mainly α-helical domains, except one found in the β-subunit of human chorionic gonadotropin. This preference for coil regions could potentially be used in a mucin-type O-glycosylation site predictor by providing it with predicted structural information.
Sequence conservation and surface accessibility
We investigated whether glycosylated serine and threonine residues are more likely to be conserved between close protein homologs than nonglycosylated serine and threonine residues. Because there are not enough examples of proteins where more than one homolog has been investigated for glycosylation sites, we aligned each protein in our data set against all its mammalian homologs. Conservation of a threonine or serine residue does not guarantee that the glycosylation site is in fact conserved, but a mutation to anything other than serine or threonine proves that it is not. We were interested to see if there is any additional selective pressure on the glycosylated residues, presumably from the need to conserve the glycan itself, so we investigated both conservation, allowing for no mutations, and what we call semi-conservation, allowing for mutation between serine and threonine only. The results can be seen in Table II and indicate that there is no extra selective pressure on the glycosylated residues in terms of precise site conservation. On the contrary, glycosylation makes serine and threonine less likely to be conserved. Although the difference in sequence conservation is opposite to what we expected, the fact that there is a difference at all could potentially be used for improving a glycosylation site predictor.
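The two conservation measures can be illustrated on aligned columns. This is a minimal sketch under stated assumptions: the function names and the toy alignment columns are hypothetical, a site is counted as conserved only if every homolog matches (the text does not specify how multiple homologs are combined), and an alignment gap is treated as a mismatch.

```python
def site_conservation(column):
    """Classify one alignment column for a serine/threonine in the query.

    `column` lists the aligned residues, query first and homologs after.
    Returns (conserved, semi_conserved):
      conserved      - no mutation in any homolog
      semi_conserved - only serine <-> threonine exchanges allowed
    """
    query, homologs = column[0], column[1:]
    if query not in ("S", "T"):
        raise ValueError("query residue must be S or T")
    conserved = all(res == query for res in homologs)
    semi_conserved = all(res in ("S", "T") for res in homologs)
    return conserved, semi_conserved

def conservation_rates(columns):
    """Fraction of sites that are conserved and semi-conserved."""
    results = [site_conservation(col) for col in columns]
    n = len(results)
    return (sum(c for c, _ in results) / n,
            sum(s for _, s in results) / n)

# Toy columns (made up for illustration): query residue, then two homologs.
columns = [
    ["T", "T", "T"],  # conserved and semi-conserved
    ["T", "S", "T"],  # semi-conserved only
    ["S", "A", "S"],  # neither
    ["S", "S", "-"],  # gap in one homolog: neither (assumption)
]
print(conservation_rates(columns))
```

Computing the same rates separately for glycosylated and nonglycosylated residues gives the kind of comparison reported in Table II.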
To rule out the possibility that the glycosylation sites are selectively conserved compared to other residues in the disordered and surface-exposed regions of the proteins where glycosylation sites are typically found, we specifically investigated sequence conservation for residues in close proximity (distance < 5 amino acids) to glycosylation sites. The sequence conservation varies widely depending on the type of amino acid residue investigated (from 25.9% for methionine to 100% for cysteine), so we chose to restrict our comparisons to serines and threonines (Table II). The sequence conservation for residues in close proximity to glycosylation sites is lower than for other nonglycosylated residues, but not as low as for the glycosylated residues.
Surface accessibility prediction was performed on the 86 proteins in our data set, and the result can be seen in Table III. Glycosylated serines and threonines are more surface exposed, and this information is hidden in the sequence and detected by the surface accessibility predictor. Although in principle a neural network trained on mucin-type O-glycosylation sites should be able to pick this up on its own if enough training examples are supplied, providing the network with this information could help when the data are limited, as in our case. The surface accessibility predictions were already used in NetOGlyc 2.0 (Hansen et al., 1998), where they controlled the threshold for positive assignment at the output. This time we want to incorporate the surface accessibility prediction data in the input information to the network instead.
Predictive performance
The concept of using sequence-derived features for mucin-type glycosylation prediction is illustrated in Figure 1. The sequence itself needs to be translated from letters to numbers before it is presented to the network, and this can be done in various ways: as sparse encoding (the standard way), BLOSUM62 profile encoding (the corresponding row in the BLOSUM62 matrix), PSI-BLAST profile encoding (the corresponding row in the profile computed from PSI-BLAST), reduced alphabet (sparse encoding with fewer letters), or as amino acid composition. Cross-validation was used: the 421 positive and 2063 negative sites were divided into three groups of 828 sites each, with a minimum of sequence similarity between the three groups. Every network was trained three times, each time using two groups as training set and one group as test set. As performance measure we used the joint Matthews correlation coefficient (Matthews, 1975
Predictors were trained using window sizes between 1 and 35 amino acids and different input information (one feature at a time): sparse encoding, BLOSUM62 profile encoding, PSI-BLAST profile encoding, 5-letter reduced alphabet, 8-letter reduced alphabet, amino acid composition, secondary structure, average secondary structure, protein distance constraints, surface accessibility, and average surface accessibility. The performance of these predictors can be seen in Figure 2 and shows that each of these features clearly has predictive potential, because all correlation coefficients are greater than zero. This means that there is in fact a preference of mucin-type O-glycosylation for a certain secondary structure, certain protein distance constraints, a certain value of surface accessibility, and so on (otherwise the correlation coefficients would be close to zero). The comparably low predictive performance of the networks trained only on secondary structure information shows that secondary structure is not likely to be the most discriminating condition that needs to be fulfilled for a serine/threonine to be glycosylated: a network trained only on secondary structure would predict all sites with the correct secondary structure to be positive, and this would lead to a large number of false positives. Averaged information was as powerful as position-specific information. This can be seen from comparing the curve for surface accessibility with the one for average surface accessibility, from comparing secondary structure with average secondary structure, and from comparing amino acid composition with the different sequence encoding methods. The only exception to this rule is that the networks trained on PSI-BLAST encoded sequence information perform better than amino acid composition for window sizes up to 15 amino acid residues.
A PSI-BLAST encoded sequence contains information about sequence conservation between related proteins, and this additional information gives even a network trained only on a three-residue window surprisingly high performance. But because the number of input neurons increases linearly with increasing window size for sequence information, the high network complexity causes problems with overtraining for large windows, and this is probably the reason why PSI-BLAST encoding does not outperform amino acid composition for larger windows.
Overall, a network trained on amino acid composition in a 31-residue window (with only 21 input neurons) outperforms all other single-feature networks. We analyzed a linear network (no hidden neurons) trained on amino acid composition in a 31-residue window to see directly the effect of the different amino acids on the prediction (correlation coefficient = 0.54). The residues that make glycosylation more likely are (in decreasing order of their tendency to promote glycosylation): Pro, Thr, Ser, end of sequence, and Ala. One residue, Glu, is essentially glycosylation neutral, and the rest make glycosylation less likely (in decreasing order of their tendency to promote glycosylation): Val, Gly, Met, Ile, His, Gln, Trp, Asp, Arg, Phe, Tyr, Lys, Cys, Asn, and Leu. Note that these rankings are based on single amino acids and not correlated pairs or other combinations.
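The difference in input size between position-specific sparse encoding and amino acid composition can be made concrete. The following sketch rests on assumptions: the padding symbol `-` for positions beyond the sequence ends, the 21-letter alphabet used for sparse encoding, the function names, and the toy sequence are all hypothetical and not taken from the NetOGlyc implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"                       # hypothetical "end of sequence" symbol
ALPHABET = AMINO_ACIDS + PAD    # 21 symbols in total

def window(seq, center, size):
    """Return the odd-sized window centred on index `center`, padded at the ends."""
    half = size // 2
    return "".join(seq[i] if 0 <= i < len(seq) else PAD
                   for i in range(center - half, center + half + 1))

def sparse_encode(win):
    """One-hot encode every position: len(win) * 21 input values."""
    vec = []
    for res in win:
        vec.extend(1.0 if res == letter else 0.0 for letter in ALPHABET)
    return vec

def composition(win):
    """Fraction of each symbol in the window: always 21 input values."""
    counts = Counter(win)
    return [counts[letter] / len(win) for letter in ALPHABET]

seq = "MKTPAPTTSSAPLGSTPEVKR"   # made-up sequence; index 7 is a threonine
for size in (3, 9, 31):
    win = window(seq, 7, size)
    print(size, len(sparse_encode(win)), len(composition(win)))
```

With this encoding the sparse input grows by 21 values per added window position, whereas the composition input stays at 21 values regardless of window size, which is the property the text credits for the good performance of composition on large windows.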
To find the best possible combination of features, we used a greedy strategy, trying to combine what appeared to be good input information from the results of the single-feature networks. For feature combinations that seemed promising, networks with varying number of hidden neurons (different network complexity) were trained. We also tried linear combinations of different networks and trained networks where the input was the output from a number of single-feature networks. The very best combination was profile encoding in a 1-residue window, plus amino acid composition in a 31-residue window, plus average surface accessibility in a 25-residue window using seven hidden neurons. The performance of this network can be seen in Figure 3a and in Table IV. The figure shows the trade-off between making many positive predictions, of which some are false, and making few predictions and thereby missing some. A curve reaching far up into the upper left corner is to be preferred, and completely random designation would perform along the diagonal. ROC curves are widely used in describing the quality of a classification method such as a predictor or a medical diagnostic tool. When you want to make a classification like sick/healthy or glycosylated/nonglycosylated you typically have to set a threshold. If you set a high threshold you will get few positives, but a higher percentage of the predictions you make will in fact be true (in our example, 40% of the positives can be found with only about 3% of the negatives being wrongly predicted to be positive). If a low threshold is used, you will find more of the true positives, but you will also get more false positives (80% of the positives found will give about 15% wrong predictions of the negative sites). 
Because nonglycosylated serines and threonines typically are much more common than glycosylated ones, it is normally preferred to keep the false positive rate as low as possible, because otherwise the specificity (the fraction of predicted sites that are in fact glycosylated) becomes very low. The maximum Matthews correlation coefficient is obtained when a threshold of 0.5 is used and the resulting detailed performance can be seen in Table IV. This is also the default threshold of the Web server of NetOGlyc 3.0, but ultimately the choice is up to the user.
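The threshold trade-off described above can be reproduced on a handful of scores. This is a sketch with made-up numbers: the scores, labels, and function name are hypothetical, and treating a score equal to the threshold as positive is an assumption.

```python
def roc_point(scores, labels, threshold):
    """Return (false_positive_rate, sensitivity) for one threshold.

    A site is called positive when its score is >= threshold (assumption).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# Made-up network outputs for 4 glycosylated and 6 nonglycosylated sites.
scores = [0.95, 0.80, 0.65, 0.40, 0.70, 0.45, 0.30, 0.20, 0.10, 0.05]
labels = [True, True, True, True, False, False, False, False, False, False]

# Lowering the threshold finds more true sites but admits more false positives.
for threshold in (0.75, 0.5, 0.25):
    fpr, sensitivity = roc_point(scores, labels, threshold)
    print(threshold, fpr, sensitivity)
```

Plotting sensitivity against false positive rate over all thresholds traces out a ROC curve of the kind shown in Figure 3.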
In Table IV the performances of NetOGlyc 3.0 and NetOGlyc 2.0 are compared. Looking only at the reported cross-validation performance, the differences are not that dramatic. NetOGlyc 3.0 has a higher correlation coefficient and a considerably higher specificity, but also a lower sensitivity for the positive sites. (If desired, the balance between sensitivity and specificity can be changed by changing the prediction threshold, so this is not a problem.) It appears that the reported performance of NetOGlyc 2.0 has been somewhat overestimated when tested on completely new proteins (Gerken et al., 1997
Mucin-type O-glycosylation sites seem to fall within two different categories. The majority of the sites occur in highly glycosylated regions where the distance to the closest neighboring glycosylation site is short. NetOGlyc 3.0 performs well on these sites. There is, however, a smaller group of isolated (single) sites in our data set. A previous database study suggests that single and multiple sites may be slightly different from each other (Christlet and Veluraja, 2001). When we examine the performance on isolated sites only, it is much lower than for multiple sites. To improve the prediction on isolated sites, we trained networks only on these (distance to closest neighboring mucin-type glycosylation site > 10 amino acids), in total 65 threonine sites and 21 serine sites. The best network uses substitution matrix profile encoding (BLOSUM62) in a 9-residue window and averaged surface accessibility in a 17-residue window. The Matthews correlation coefficient is 0.46, which is to be compared to 0.24 for NetOGlyc 3.0 on these sites. The ROC curve in Figure 3b shows that the performance on threonine sites is much better than on serine sites. This is due to the small number of isolated serine sites compared to isolated threonine sites. We have tried to improve the performance on serine sites by various means but believe that nothing short of more known sites can solve this problem.
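The isolated-site criterion (closest neighboring site more than 10 residues away) can be written down directly. The function name and the example positions below are hypothetical; treating the only site in a protein as isolated is an assumption consistent with the definition.

```python
def isolated_sites(site_positions, min_distance=10):
    """Return the sites whose nearest neighbouring site is > min_distance away."""
    sites = sorted(site_positions)
    isolated = []
    for i, pos in enumerate(sites):
        gaps = []
        if i > 0:
            gaps.append(pos - sites[i - 1])      # distance to the previous site
        if i < len(sites) - 1:
            gaps.append(sites[i + 1] - pos)      # distance to the next site
        # A protein with a single site has no gaps and counts as isolated.
        if all(gap > min_distance for gap in gaps):
            isolated.append(pos)
    return isolated

# Made-up glycosylation site positions in one protein.
print(isolated_sites([12, 40, 45, 47, 80]))
```

Here positions 40, 45, and 47 form a cluster, so only the sites at 12 and 80 would enter an isolated-site training set.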
To provide an easy-to-use all-around predictor, we devised an algorithm for combining NetOGlyc 3.0 and the single-site predictor:
- The sequence is run through both predictors.
- All NetOGlyc 3.0 predictions above a certain threshold are accepted.
- For serines/threonines where there are no predicted sites within 10 aa on either side, accept single-site predictions above a certain threshold.
The thresholds were optimized independently and found to be 0.5 in both cases for threonine sites, which makes sense because that is the threshold that gives the best performance in each individual case. For serine sites, adding sites predicted by the single-site predictor adds too many false positive sites, and the optimum is actually to stick with the NetOGlyc 3.0 prediction only. The new, combined predictor is called NetOGlyc 3.1, and its performance can be seen in Figure 3c and Table IV. The performance on serine sites is identical between NetOGlyc 3.0 and NetOGlyc 3.1, but for threonine sites NetOGlyc 3.1 performs better.
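The combination rule can be sketched as follows. This is not the NetOGlyc source code: the function name, the dictionary-based score format, and the example scores are hypothetical, and "no predicted sites within 10 aa" is interpreted here as no accepted NetOGlyc 3.0 prediction within 10 residues, which is an assumption.

```python
def combine_predictions(seq, main_scores, single_scores,
                        main_threshold=0.5, single_threshold=0.5, flank=10):
    """Combine a general predictor with a single-site predictor.

    main_scores / single_scores map 0-based Ser/Thr positions to scores.
    Returns the sorted positions predicted to be glycosylated.
    """
    # Accept every general-network prediction above its threshold.
    accepted = {i for i, s in main_scores.items() if s > main_threshold}

    # Add single-site predictions, for threonines only (serines keep the
    # general-network result), when no accepted site lies within `flank`.
    extra = set()
    for i, s in single_scores.items():
        if seq[i] != "T" or i in accepted:
            continue
        has_neighbour = any(abs(i - j) <= flank for j in accepted)
        if not has_neighbour and s > single_threshold:
            extra.add(i)
    return sorted(accepted | extra)

# Made-up 50-residue sequence with a Ser/Thr cluster and isolated residues.
residues = ["A"] * 50
for pos, aa in {3: "T", 5: "T", 7: "S", 12: "T", 30: "T", 45: "S"}.items():
    residues[pos] = aa
seq = "".join(residues)

main_scores = {3: 0.9, 5: 0.7, 7: 0.4, 12: 0.3, 30: 0.3, 45: 0.2}
single_scores = {3: 0.1, 5: 0.2, 7: 0.8, 12: 0.9, 30: 0.8, 45: 0.9}
print(combine_predictions(seq, main_scores, single_scores))
```

In this toy case the threonine at position 30 is rescued by the single-site score, the threonine at 12 is not because accepted sites lie within 10 residues, and the serines at 7 and 45 are never rescued.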
| Discussion |
|---|
The fact that there is no extra evolutionary pressure to conserve site-specific mucin-type glycosylated serines and threonines compared to nonglycosylated serines and threonines was a surprise. To understand why, several points have to be made. One is that nonglycosylated serine and threonine residues often occur in the well-conserved core of a protein, whereas glycosylated serines and threonines occur in disordered and surface-exposed regions with little overall sequence conservation. Our predictions show that only about 35% of nonglycosylated serines and threonines are surface exposed, whereas 70% of glycosylated ones are. A priori, the buried core residues are more likely to be subjected to a high evolutionary pressure, leading to sequence conservation. This does not seem to be the whole explanation, though. Taking only serines and threonines found close to glycosylation sites into account, nonglycosylated residues are still more likely to be conserved than glycosylated residues. The same is true when comparing only serines and threonines predicted to be surface exposed (data not shown). The second point is that the loops where glycosylation occurs often vary in length, and the problem of aligning two loops of different length and weaker sequence similarity (than for the protein core) is not trivial. Although only a low linear conservation of the glycosylated residues can be detected, the structural conservation is quite high, as described in the previous section.
Another point is that the function of mucin-type glycosylation in the highly glycosylated mucin proteins, which are responsible for a large number of glycosylation sites in our data set, is believed mainly to be to change the biophysical properties of the protein: to protect it from cleavage, change the size and charge distribution of the protein, make the protein bind more water, and change the structure to be stiffer and more extended. For none of these functions would the exact number or positions of glycosylated residues be important. Rather, the glycosylation would be conserved more as a bulk property. In fact, this can be observed for highly glycosylated homologs within our data set; see Figure 4. The mucin-type glycosylation is clearly conserved, but only on an overall, bulk level. This does not exclude the possibility that individual mucin-type glycosylation sites may be highly specific and therefore highly conserved between species; one example may be human and bovine corticotropin, COLI_HUMAN and COLI_BOVIN, which have identical sequences from position −10 to +20 relative to their only mucin-type glycosylation site.
The third point is that a large part of the endothelial glycocalyx consists of mucin-type glycosylated proteins (Jentoft, 1990
The action of the different UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferases on a Ser/Thr/Pro-rich domain is highly complex. In a hierarchical manner a number of enzymes glycosylate the serine and threonine residues in the surface-accessible loops that have the right amino acid composition and adopt the right extended conformation. The glycosylation of the different sites takes place in a specific order, depending on the transferases present in the tissue, and due to steric hindrance from the flanking glycosylation sites, some sites may be only partially glycosylated or not at all (Gerken, 2004; Hanisch et al., 2001; Kato et al., 2001; Takeuchi et al., 2002). Unfortunately, NetOGlyc 3.1 does not hold the key to understanding all of this complexity. It is based on in vivo data, which are neither tissue- nor transferase-specific. In a highly glycosylated Ser/Thr/Pro-rich domain, it is likely to predict all the threonines and serines as glycosylation sites, even the ones that are not glycosylated or only to a lesser extent. Nevertheless, it is a powerful tool when it comes to identifying the glycosylated regions in a protein and for finding isolated threonine sites.
NetOGlyc 3.1 is only intended for extracellular protein sequences. Intracellular proteins or the cytosolic domains of membrane proteins will never encounter the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferases performing the mucin-type O-glycosylation, because these are located in the Golgi complex. Therefore, all sequences submitted to NetOGlyc 3.0 are routinely checked for a signal peptide using the SignalP prediction server (Bendtsen et al., forthcoming; Nielsen et al., 1997). For membrane proteins, the responsibility to consider only predictions in the potentially extracellular domains is left up to the user.
In several studies, threonine has been proven to be a better substrate for mucin-type glycosylation than serine (for example, Kinarsky et al., 2003; O'Connell et al., 1992; Yoshida et al., 1997). At the same time, serine is a more common amino acid residue overall. The fact that we would normally expect a smaller percentage of serines to be glycosylated as compared to threonines makes the correct prediction of serine sites harder. In Table IV we can see that we will normally find fewer of the positive sites (the positive site sensitivity) and that a smaller percentage of the predicted sites will be correct (the specificity) for serines than for threonines. The fact that we were able to specifically improve the performance on isolated sites for threonines and not for serines when developing NetOGlyc 3.1 indicates that the recognition sequences are slightly different between isolated serine and threonine sites. With only 21 isolated serine sites, we have every reason to believe that a sufficient increase in the number of known isolated serine sites would make it possible to make a similar improvement in the prediction of serine sites using the method described here for threonine.
| Materials and methods |
|---|
Data set
Eighty-six mammalian protein sequences with one or more experimentally verified mucin-type glycosylation sites were extracted from O-GlycBase v6.00 (www.cbs.dtu.dk/databases/oglycbase) (Gupta et al., 1999
Structural context
The program GetStruct (www.cbs.dtu.dk/services/getstruct) was used with default parameters to extract structural information about the glycosylation sites in our data set from the PDB database (Westbrook et al., 2003). GetStruct performed BLAST (Altschul et al., 1997) alignments of the sequences in our data set versus the sequences in the PDB with the aim of obtaining one hit structure for each query (input) sequence. Only structures with at least 90% sequence identity to the query (input) sequences were considered. With a few notable exceptions (Dalal et al., 1997; Gerstein and Levitt, 1998; Riesner, 2003), a clear amino acid sequence relationship between two proteins implies that they have similar structure (Chothia and Lesk, 1986). Therefore, at the required levels of sequence similarity (90% or more), the found structures can be expected to be good representatives of the structures of the glycoproteins.
The reported locations of the O-glycosylation sites are given relative to their position in the query sequence. Thus, a site that is close to the N-terminus in a structure but in the middle of the query sequence is classified as being in an interdomain region (the assumption being that the structurally determined unit is a full domain).
Sequence conservation and surface accessibility
For each of the 86 proteins in our data set, close protein homologs were identified by searching SWISS-PROT (Boeckmann et al., 2003) for mammalian proteins whose entry names share the same prefix. Example: As homologs to bovine fibronectin (SWISS-PROT entry name FINC_BOVIN), FINC_HUMAN, FINC_MOUSE, and FINC_RAT were identified. To avoid fragment proteins in the study, proteins with less than half the length of the query protein were discarded. Multiple alignment of the sequences was performed using CLUSTAL W (Thompson et al., 1994). The sequence conservation was estimated on a residue-by-residue basis.
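The homolog selection rule can be illustrated with a short sketch. It assumes a hypothetical dictionary mapping SWISS-PROT entry names to sequences that has already been restricted to mammalian entries; the function names are illustrative. The conservation check (all aligned residues identical, no gap) is one plausible reading of "residue-by-residue" and is an assumption, not the authors' stated definition.

```python
from typing import Dict, List


def find_homologs(query_name: str, entries: Dict[str, str]) -> List[str]:
    """Entries sharing the query's name prefix, excluding likely fragments."""
    prefix = query_name.split("_")[0]
    query_length = len(entries[query_name])
    homologs = []
    for name, sequence in entries.items():
        if name == query_name or name.split("_")[0] != prefix:
            continue
        # Discard proteins with less than half the length of the query.
        if len(sequence) < query_length / 2:
            continue
        homologs.append(name)
    return sorted(homologs)


def column_conserved(aligned: List[str], column: int) -> bool:
    """True if every aligned sequence has the same non-gap residue at column."""
    residues = {sequence[column] for sequence in aligned}
    return len(residues) == 1 and "-" not in residues
```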
Surface accessibility was predicted using a neural network method called surfg (Hansen et al., 1998). Surfg gives both a direct output score and a smoothed score. Both are between 0 and 1, with a score above 0.5 if the amino acid residue is predicted to be buried and a score below 0.5 if it is predicted to be surface exposed. Serine and threonine residues with a smoothed score below 0.5 were considered to be predicted surface exposed.
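The thresholding rule for the smoothed score can be written out as a small sketch. The function name and the list-of-floats input are assumptions made for illustration; surfg's actual output format is not described here. Positions are returned as 0-based indices.

```python
from typing import List


def predicted_exposed_sites(sequence: str,
                            smoothed_scores: List[float],
                            threshold: float = 0.5) -> List[int]:
    """0-based positions of Ser/Thr residues whose smoothed score is below threshold."""
    return [
        index
        for index, (residue, score) in enumerate(zip(sequence, smoothed_scores))
        if residue in "ST" and score < threshold
    ]
```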
Neural network training
For readability, this section was shortened to suit the average readers of Glycobiology. For details on sequence encoding, feature encoding, and neural networks, see the supplementary material online.
A neural network does not understand letters, so the amino acid sequence and the different features must be translated into numbers. This is called encoding and can be done in a number of ways. Each number presented to the neural network makes up what is called an input neuron. The goal is to provide the network with as much information as possible while keeping the number of input neurons as low as possible.
- Sparse encoding (Hertz et al., 1991; Qian and Sejnowski, 1988) is the conventional way to convert the amino acid sequence into numerical form.
- With profile encoding, the input for each amino acid consisted of the corresponding row in the BLOSUM62 matrix (Henikoff and Henikoff, 1992).
- With PSI-BLAST encoding, the input for each amino acid consisted of the corresponding row in the position-specific scoring matrix computed from three cycles of PSI-BLAST (Altschul et al., 1997).
- The 5-letter alphabet encoding was conventional sparse encoding, but with a reduced alphabet (Soumpasis, personal communication).
- The 8-letter alphabet is another reduced alphabet.
- Amino acid composition was calculated for a sequence window around each particular site.
- Surface accessibility was predicted using a neural network method called surfg (Hansen et al., 1998).
- Secondary structure was predicted using PSIPRED (Jones, 1999; McGuffin et al., 2000) using position-specific scoring matrices computed from three cycles of PSI-BLAST (Altschul et al., 1997).
- Protein distance constraints were predicted using DistanceP (Gorodkin et al., 1999).
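Two of the encodings listed above, sparse encoding and windowed amino acid composition, are simple enough to sketch. This is an illustration under stated assumptions rather than the encoding used for NetOGlyc: a 20-letter alphabet is assumed, any other character (for example padding beyond a terminus) is encoded as all zeros, and the function names are hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def sparse_encode(window: str) -> List[float]:
    """One input per (position, amino acid): 1.0 where they match, else 0.0."""
    encoded: List[float] = []
    for residue in window:
        encoded.extend(1.0 if residue == letter else 0.0 for letter in AMINO_ACIDS)
    return encoded


def composition(window: str) -> List[float]:
    """Fraction of each amino acid in the window: 20 inputs for any window size."""
    return [window.count(letter) / len(window) for letter in AMINO_ACIDS]
```

The sketch makes the trade-off in input neurons concrete: sparse encoding needs 20 inputs per window position, whereas composition always needs 20 inputs in total but discards the order of the residues.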
The neural networks were of the two-layer feed-forward type, trained by standard back-propagation. Network complexity was varied by changing the number of neurons in the input layer as well as in the hidden layer to find the optimal complexity for this particular prediction problem. This is important, because a network with too little complexity (too few neurons) will lack the ability to learn the training examples, and a network with too much complexity (too many neurons) will learn the examples too well and lose the ability to make predictions for examples that were not in the training set (the ability to generalize). This second problem is sometimes called overtraining and is one of the reasons why it is so important to make sure that the examples in the test set are different from and unrelated to the examples in the training set. If the sets are unrelated to each other, the performance on the test set will decrease when overtraining occurs; once the problem can be detected in this way, it can also be avoided. The risk of overtraining is greater the smaller the data set is.
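To make the notion of network complexity concrete, the following sketch shows the forward pass of a feed-forward network with one hidden layer and counts its adjustable parameters. Sigmoid activations, a single output neuron, and at least one hidden neuron are assumptions made for illustration; the weights shown in any usage are arbitrary and training by back-propagation is not shown.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs: List[float],
            hidden_weights: List[List[float]],
            hidden_biases: List[float],
            output_weights: List[float],
            output_bias: float) -> float:
    """Output of a network with one hidden layer and one output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def count_parameters(n_inputs: int, n_hidden: int) -> int:
    """Weights and biases of the hidden layer plus those of the output neuron."""
    return n_inputs * n_hidden + n_hidden + n_hidden + 1
```

The parameter count grows with both the number of input neurons and the number of hidden neurons, which is why both were varied when searching for a complexity that the available data can support.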
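The predictive performance was monitored using the Matthews correlation coefficient (Matthews, 1975) during training and test of the networks:
$$C = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$ | (1) |
where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions, respectively. The fraction of positive sites correctly predicted, the positive site sensitivity, Sn,pos, was computed as
$$S_{n,\mathrm{pos}} = \frac{TP}{TP + FN}$$ | (2) |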
![]() | (3) |
![]() | (4) |
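The Matthews correlation coefficient and the positive site sensitivity can be computed from the four confusion counts as in the sketch below. It uses the standard definition of the coefficient from Matthews (1975), with TP, TN, FP, and FN standing for the numbers of true positive, true negative, false positive, and false negative predictions. The function names, and the choice to return 0.0 when the denominator vanishes, are assumptions made for illustration.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention assumed here for the degenerate case
    return (tp * tn - fp * fn) / denominator


def positive_sensitivity(tp: int, fn: int) -> float:
    """Fraction of positive sites that are correctly predicted."""
    return tp / (tp + fn)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it better suited than plain accuracy to a data set in which negative sites greatly outnumber positive ones.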
In the data set, proteins were identified as closely related if at least two of the following criteria were fulfilled: (1) similar protein names, (2) SWISS-PROT entry name with identical prefix, and (3) high sequence identity. Examples: human lithostathine 1α, LITA_HUMAN, and human lithostathine 1β, LITB_HUMAN (86% sequence identity); human and bovine corticotropin, COLI_HUMAN and COLI_BOVIN (80% sequence identity). Of these groups of related proteins, only the most well-studied in each group was used for negative site information. The positive sites were scanned for similarities within the group, and those with identical residues from −5 to +5 were excluded. This resulted in one protein (COLI_BOVIN) being altogether masked out, so our data set consisted of 85 proteins. Using only the most well-studied protein from each group, the proteins were divided into three sets of equal size with minimal sequence overlap between the sets using a heuristic described in Jensen et al. (2003). After this division was performed, the closely related proteins were manually placed in the same partition as their representative. For computational reasons, we needed to have the same number of sites in each partition. To achieve this, some negative sites were randomly omitted. The result was a total of 421 positive (265 Thr and 156 Ser) and 2063 negative sites (903 Thr and 1160 Ser) divided into three sets of 828 sites each. These were used so that every network was trained three times, using two sets as training set and one set as test set. The reported cross-validation performance is the joint performance of the three resulting networks on their respective test sets.
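The three-fold scheme described above, in which each partition serves once as test set while the other two form the training set, can be sketched as follows. The function name and the representation of sites as strings are assumptions; the site counts are those reported in the text.

```python
from typing import List, Sequence, Tuple


def three_fold_splits(
    partitions: Sequence[Sequence[str]],
) -> List[Tuple[List[str], List[str]]]:
    """Return (training sites, test sites) for each of the three partitions."""
    if len(partitions) != 3:
        raise ValueError("expected exactly three partitions")
    splits = []
    for test_index in range(3):
        training = [
            site
            for index, partition in enumerate(partitions)
            if index != test_index
            for site in partition
        ]
        splits.append((training, list(partitions[test_index])))
    return splits


# Site counts reported in the text: the totals fill three sets of 828 sites.
POSITIVE_SITES = 265 + 156   # Thr + Ser
NEGATIVE_SITES = 903 + 1160  # Thr + Ser
assert POSITIVE_SITES + NEGATIVE_SITES == 3 * 828
```

Because every site appears in exactly one test set, pooling the three sets of test predictions gives one performance estimate over the whole data set.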
To be able to truly compare our performance to the performance of NetOGlyc 2.0 (Hansen et al., 1998), we also trained on a reduced set, consisting only of proteins entered into O-GLYCBASE (Gupta et al., 1999) before 20 January 1997. These were the 65 proteins available for training of NetOGlyc 2.0 and are referred to as the old set. The same division of sets was used, and the result was 331 positive and 1190 negative sites divided into three sets of 507 sites each. The same best window and feature combination as for the whole set was used, but the number of hidden neurons was varied (0–15), and the best number was chosen based on the cross-validation performance. The 20 proteins entered into the database after NetOGlyc 2.0 was trained could then be used to compare the performance of NetOGlyc 2.0 and NetOGlyc 3.0 directly. This is referred to as the new set and consists of 90 positive sites (50 Thr and 40 Ser) and 489 negative sites (188 Thr and 301 Ser). The reported performance of NetOGlyc 3.0-old on this set is the performance of the average output from the three cross-validation networks trained on the old set.
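Averaging the outputs of the three cross-validation networks for a site in the new set can be sketched as below. Each network is represented by a hypothetical callable that maps an encoded input to a score; how the averaged score is turned into a site assignment is not specified here and is left out.

```python
from typing import Callable, Sequence


def average_output(networks: Sequence[Callable[[Sequence[float]], float]],
                   features: Sequence[float]) -> float:
    """Mean of the scores that the given networks assign to one encoded site."""
    outputs = [network(features) for network in networks]
    return sum(outputs) / len(outputs)
```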
| Acknowledgements |
|---|
Kristoffer Rapacki, Hans Henrik Stærfeldt, and Lars Juhl Jensen are thanked for technical assistance. This work was supported by the Danish National Research Foundation, the Danish Center for Scientific Computing, and the Knut and Alice Wallenberg Foundation.
| References |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Apweiler, R., Hermjakob, H., and Sharon, N. (1999) On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta, 1473, 4–8.
Asker, N., Baeckstrom, D., Axelsson, M.A.B., Carlstedt, I., and Hansson, G.C. (1995) The human MUC2 apoprotein appears to dimerize before O-glycosylation and shares epitopes with the "insoluble" mucin of rat small intestine. Biochem. J., 308, 873–880.
Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. (forthcoming) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol.
Bennett, E.P., Hassan, H., Hollingsworth, M.A., and Clausen, H. (1999) A novel human UDP-N-acetyl-D-galactosamine:polypeptide N-acetyl-galactosaminyltransferase, GalNAc-T7, with specificity for partial GalNAc-glycosylated acceptor substrates. FEBS Lett., 460, 226–230.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., and others. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Cai, Y.D. and Chou, K.C. (1996) Artificial neural network model for predicting the specificity of GalNAc-transferase. Anal. Biochem., 243, 284–285.
Cai, Y.D., Liu, X.J., Xu, X.B., and Chou, K.C. (2002) Support vector machines for predicting the specificity of GalNAc-transferase. Peptides, 23, 205–208.
Cai, Y.D., Yu, H., and Chou, K.C. (1997) Artificial neural network method for predicting the specificity of GalNAc-transferase. J. Protein Chem., 16, 689–700.
Carraway, K.L. and Hull, S.R. (1991) Cell surface mucin-type glycoproteins and mucin-like domains. Glycobiology, 1, 131–138.
Chothia, C. and Lesk, A.M. (1986) Relationship between the divergence of sequence and structure in proteins. EMBO J., 5, 823–827.
Chou, K.C. (1995) A sequence-coupled vector-projection model for predicting the specificity of GalNAc-transferase. Protein Sci., 4, 1365–1383.
Chou, K.C., Zhang, C.T., Kezdy, F.J., and Poorman, R.A. (1995) A vector projection method for predicting the specificity of GalNAc-transferase. Proteins, 21, 118–126.
Christlet, T.H.T. and Veluraja, K. (2001) Database analysis of O-glycosylation sites in proteins. Biophys. J., 80, 952–960.
Coltart, D.M., Royyuru, A.K., Williams, L.J., Glunz, P.W., Sames, D., Kuduk, S.D., Schwarz, J.B., Chen, X.T., Danishefsky, S.J., and Live, D.H. (2002) Principles of mucin architecture: Structural studies on synthetic glycopeptides bearing clustered mono-, di-, tri-, and hexasaccharide glycodomains. J. Am. Chem. Soc., 124, 9833–9844.
Dalal, S., Balasubramanian, S., and Regan, L. (1997) Protein alchemy: changing beta-sheet into alpha-helix. Nat. Struct. Biol., 4, 548–552.
Elhammer, Å.P., Poorman, R.A., Brown, E., Maggiora, L.L., Hoogerheide, J.G., and Kézdy, F.J. (1993) The specificity of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase as inferred from a database of in vivo substrates and from the in vitro glycosylation of proteins and peptides. J. Biol. Chem., 268, 10029–10038.
Gerbaud, V., Pignol, D., Loret, E., Bertrand, J.A., Berland, Y., Fontecilla-Camps, J.C., Canselier, J.P., Gabas, N., and Verdier, J.M. (2000) Mechanism of calcite crystal growth inhibition by the N-terminal undecapeptide of lithostathine. J. Biol. Chem., 275, 1057–1064.
Gerken, T.A. (2004) Kinetic modeling confirms the biosynthesis of mucin core 1 (beta-Gal(1-3)alpha-GalNAc-O-ser/thr) O-glycan structures are modulated by neighboring glycosylation effects. Biochemistry, 43, 4137–4142.
Gerken, T.A., Owen, C.L., and Pasumarthy, M. (1997) Determination of the site-specific O-glycosylation pattern of the porcine submaxillary mucin tandem repeat glycopeptide. J. Biol. Chem., 272, 9709–9719.
Gerstein, M. and Levitt, M. (1998) Comprehensive assessment of automatic structural alignment against a manual standard; the SCOP classification of proteins. Protein Sci., 7, 445–456.
Gorodkin, J., Lund, O., Andersen, C.A., and Brunak, S. (1999) Using sequence motifs for enhanced neural network prediction of protein distance constraints. In Lengauer, T., Schneider, R., Bork, P., Brutlag, D., Glasgow, J., Mewes, H.W., and Zimmer, R. (Eds.), Proceedings of the Seventh International Conference for Molecular Biology. pp. 95–105.
Gupta, R., Birch, H., Rapacki, K., Brunak, S., and Hansen, J.E. (1999) O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic Acids Res., 27, 370–372.
Hanisch, F.G., Reis, C.A., Clausen, H., and Paulsen, H. (2001) Evidence for glycosylation-dependent activities of polypeptide N-acetylgalactosaminyltransferases rGalNAc-T2 and -T4 on mucin glycopeptides. Glycobiology, 11, 731–740.
Hansen, J.E., Lund, O., Engelbrecht, J., Bohr, H., Nielsen, J.O., Hansen, J.E., and Brunak, S. (1995) Prediction of O-glycosylation of mammalian proteins: specificity patterns of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase. Biochem. J., 308, 801–813.
Hansen, J.E., Lund, O., Gooley, A.A., Williams, K.L., and Brunak, S. (1998) NetOGlyc. Prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconj. J., 15, 115–130.
Hart, G.W. (1992) Glycosylation. Curr. Opin. Cell Biol., 4, 1017–1023.
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
Hertz, J., Krogh, A., and Palmer, R. (1991) Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Jensen, L.J., Gupta, R., Blom, N., Devos, D., Tamames, J., Kesmir, C., Nielsen, H., Stærfeldt, H.H., Rapacki, K., Workman, C., and others. (2002) Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol., 319, 1257–1265.
Jensen, L.J., Gupta, R., Stærfeldt, H.H., and Brunak, S. (2003) Prediction of human protein function according to Gene Ontology categories. Bioinformatics, 19, 635–642.
Jentoft, N. (1990) Why are proteins O-glycosylated? Trends Biochem. Sci., 15, 291–294.
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Kato, K., Takeuchi, H., Miyahara, N., Kanoh, A., Hassan, H., Clausen, H., and Irimura, T. (2001) Distinct orders of GalNAc incorporation into a peptide with consecutive threonines. Biochem. Biophys. Res. Commun., 287, 110–115.
Kinarsky, L., Suryanarayanan, G., Prakash, O., Paulsen, H., Clausen, H., Hanisch, F.G., Hollingsworth, M.A., and Sherman, S. (2003) Conformational studies on the MUC1 tandem repeat glycopeptides: implication for the enzymatic O-glycosylation of the mucin protein core. Glycobiology, 13, 929–939.
Kirnarsky, L., Nomoto, M., Ikematsu, Y., Hassan, H., Bennett, E.P., Cerny, R.L., Clausen, H., Hollingsworth, M.A., and Sherman, S. (1998) Structural analysis of peptide substrates for mucin-type O-glycosylation. Biochemistry, 37, 12811–12817.
Knepper, T.P., Arbogast, B., Schreurs, J., and Deinzer, M.L. (1992) Determination of the glycosylation patterns, disulphide linkages, and protein heterogeneities of baculovirus-expressed mouse interleukin-3 by mass spectrometry. Biochemistry, 31, 11651–11659.
Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
McGuffin, L.J., Bryson, K., and Jones, D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404–405.
Neumann, G.M., Marinaro, J.A., and Bach, L.A. (1998) Identification of O-glycosylation sites and partial characterization of carbohydrate structure and disulfide linkages of human insulin-like growth factor binding protein 6. Biochemistry, 37, 6572–6585.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Nishimori, I., Johnson, N.R., Sanderson, S.D., Perini, F., Mountjoy, K., Cerny, R.L., Gross, M.L., and Hollingsworth, M.A. (1994) Influence of acceptor substrate primary amino acid sequence on the activity of human UDP-N-acetylgalactosamine:polypeptide N-acetylgalactosaminyltransferase. J. Biol. Chem., 269, 16123–16130.
O'Connell, B., Tabak, L.A., and Ramasubbu, N. (1991) The influence of flanking sequences on O-glycosylation. Biochem. Biophys. Res. Commun., 180, 1024–1030.
O'Connell, B.C., Hagen, F.K., and Tabak, L.A. (1992) The influence of flanking sequence on the O-glycosylation of threonine in vitro. J. Biol. Chem., 267, 25010–25018.
Peters, B.P., Krzesicki, R.F., Perini, F., and Ruddon, R.W. (1989) O-glycosylation of the α-subunit does not limit the assembly of chorionic gonadotropin αβ dimer in human malignant and nonmalignant trophoblast cells. Endocrinology, 124, 1602–1612.
Qian, N. and Sejnowski, T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865–884.
Riesner, D. (2003) Biochemistry and structure of PrP(C) and PrP(Sc). Br. Med. Bull., 66, 21–33.
Schuman, J., Qiu, D., Koganty, R.R., Longenecker, B.M., and Campbell, A.P. (2000) Glycosylations versus conformational preferences of cancer associated mucin core. Glycoconj. J., 17, 835–848.
Schuman, J., Campbell, A.P., Koganty, R.R., and Longenecker, B.M. (2003) Probing the conformational and dynamical effects of O-glycosylation within the immunodominant region of a MUC1 peptide tumor antigen. J. Peptide Res., 61, 91–108.
Seitz, O. (2000) Glycopeptide synthesis and the effects of glycosylation on protein structure and activity. Chembiochem., 1, 214–246.
Sørensen, T., White, T., Wandall, H.H., Kristensen, A.K., Roepstorff, P., and Clausen, H. (1995) UDP-N-acetyl-α-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase. J. Biol. Chem., 270, 24166–24173.
Spiro, R.G. (2002) Protein glycosylation: nature, distribution, enzymatic formation, and disease implications of glycopeptide bonds. Glycobiology, 12, 43R–56R.
Strous, G.J. and Dekker, J. (1992) Mucin-type glycoproteins. Crit. Rev. Biochem. Mol. Biol., 27, 57–92.
Tagashira, M., Iijima, H., and Toma, K. (2002) An NMR study of O-glycosylation induced structural changes in the α-helix of calcitonin. Glycoconj. J., 19, 43–52.
Takeuchi, H., Kato, K., Hassan, H., Clausen, H., and Irimura, T. (2002) O-GalNAc incorporation into a cluster acceptor site of three consecutive threonines. Distinct specificity of GalNAc-transferase isoforms. Eur. J. Biochem., 269, 6173–6183.
Ten Hagen, K.G., Tetaert, D., Hagen, F.K., Richet, C., Beres, T.M., Gagnon, J., Balys, M.M., VanWuyckhuyse, B., Bedi, G.S., Degand, P., and Tabak, L.A. (1999) Characterization of a UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase that displays glycopeptide N-acetylgalactosaminyltransferase activity. J. Biol. Chem., 274, 27867–27874.
Ten Hagen, K.G., Bedi, G.S., Tetaert, D., Kingsley, P.D., Hagen, F.K., Balys, M.M., Beres, T.M., Degand, P., and Tabak, L.A. (2001) Cloning and characterization of a ninth member of the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase family, ppGaNTase-T9. J. Biol. Chem., 276, 17395–17404.
Ten Hagen, K.G., Fritz, T.A., and Tabak, L.A. (2003) All in the family: the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferases. Glycobiology, 13, 1R–16R.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Van den Steen, P., Rudd, P.M., Dwek, R.A., and Opdenakker, G. (1998) Concepts and principles of O-linked glycosylation. Crit. Rev. Biochem. Mol. Biol., 33, 151–208.
Varki, A. (1993) Biological roles of oligosaccharides: all of the theories are correct. Glycobiology, 3, 97–130.
Wang, H., Tachibana, K., Zhang, Y., Iwasaki, H., Kameyama, A., Cheng, L., Guo, J., Hiruma, T., Togayachi, A., Kudo, T., and others. (2003) Cloning and characterization of a novel UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase, pp-GalNAc-T14. Biochem. Biophys. Res. Commun., 300, 738–744.
Westbrook, J., Feng, Z., Chen, L., Yang, H., and Berman, H.M. (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res., 31, 489–491.
Wilson, I.B., Gavel, Y., and von Heijne, G. (1991) Amino acid distributions around O-linked glycosylation sites. Biochem. J., 275, 529–534.
Yoshida, A., Suzuki, M., Ikenaga, H., and Takeuchi, M. (1997) Discovery of the shortest sequence motif for high level mucin-type O-glycosylation. J. Biol. Chem., 272, 16884–16888.
Young, J.D., Tsuchiya, D., Sandlin, D.E., and Holroyde, M.J. (1979) Enzymic O-glycosylation of synthetic peptides from sequences in basic myelin protein. Biochemistry, 18, 4444–4448.
![]() |
E. J. Tarcha, V. Basrur, C.-Y. Hung, M. J. Gardner, and G. T. Cole Multivalent Recombinant Protein Vaccine against Coccidioidomycosis. Infect. Immun., October 1, 2006; 74(10): 5802 - 5813. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Schneider, A. A. Khalil, J. Poulton, C. Castillejo-Lopez, D. Egger-Adam, A. Wodarz, W.-M. Deng, and S. Baumgartner Perlecan and Dystroglycan act at the basal side of the Drosophila follicular epithelium to maintain epithelial organization Development, October 1, 2006; 133(19): 3805 - 3815. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. B. Johansen, L. Kiemer, and S. Brunak Analysis and prediction of mammalian protein glycation Glycobiology, September 1, 2006; 16(9): 844 - 853. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. E. Van den Steen, I. Van Aelst, V. Hvidberg, H. Piccard, P. Fiten, C. Jacobsen, S. K. Moestrup, S. Fry, L. Royle, M. R. Wormald, et al. The Hemopexin and O-Glycosylated Domains Tune Gelatinase B/MMP-9 Bioavailability via Inhibition and Binding to Cargo Receptors J. Biol. Chem., July 7, 2006; 281(27): 18626 - 18637. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Wernersson, K. Rapacki, H.-H. Staerfeldt, P. W. Sackett, and A. Molgaard FeatureMap3D--a tool to map protein features and sequence conservation onto homologous structures in the PDB. Nucleic Acids Res., July 1, 2006; 34(suppl_2): W84 - W88. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Mavromatis, C. K. Doyle, A. Lykidis, N. Ivanova, M. P. Francino, P. Chain, M. Shin, S. Malfatti, F. Larimer, A. Copeland, et al. The Genome of the Obligately Intracellular Bacterium Ehrlichia canis Reveals Themes of Complex Membrane Structure and Immune Evasion Strategies J. Bacteriol., June 1, 2006; 188(11): 4015 - 4023. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Hashimoto, S. Goto, S. Kawano, K. F. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa KEGG as a glycome informatics resource Glycobiology, May 1, 2006; 16(5): 63R - 70R. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. A. Reeves, J. M. Thornton, and the BioSapiens Network of Excellence Integrating biological data through the genome Hum. Mol. Genet., April 15, 2006; 15(suppl_1): R81 - R87. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Wopereis, D. J. Lefeber, E. Morava, and R. A. Wevers Mechanisms in Protein O-Glycan Biosynthesis and Clinical and Molecular Aspects of Protein O-Glycan Biosynthesis Defects: A Review Clin. Chem., April 1, 2006; 52(4): 574 - 600. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. A. Fritz, J. Raman, and L. A. Tabak Dynamic Association between the Catalytic and Lectin Domains of Human UDP-GalNAc:Polypeptide {alpha}-N-Acetylgalactosaminyltransferase-2 J. Biol. Chem., March 31, 2006; 281(13): 8613 - 8619. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. H. Y. Jiang, B. M. Tyler, S. C. Whisson, A. R. Hardham, and F. Govers Ancient Origin of Elicitin Gene Clusters in Phytophthora Genomes Mol. Biol. Evol., February 1, 2006; 23(2): 338 - 351. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Ahmad, D. C. Hoessli, E. Walker-Nasir, S. M. Rafik, A. R. Shakoori, and Nasir-ud-Din Oct-2 DNA binding transcription factor: functional consequences of phosphorylation and glycosylation Nucleic Acids Res., January 8, 2006; 34(1): 175 - 184. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. K. Doyle, K. A. Nethery, V. L. Popov, and J. W. McBride Differentially Expressed and Secreted Major Immunoreactive Protein Orthologs of Ehrlichia canis and E. chaffeensis Elicit Early Antibody Responses to Epitopes on Glycosylated Tandem Repeats Infect. Immun., January 1, 2006; 74(1): 711 - 720. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Hubmacher, K. Tiedemann, R. Bartels, J. Brinckmann, T. Vollbrandt, B. Batge, H. Notbohm, and D. P. Reinhardt Modification of the Structure and Function of Fibrillin-1 by Homocysteine Suggests a Potential Pathogenetic Mechanism in Homocystinuria J. Biol. Chem., October 14, 2005; 280(41): 34946 - 34955. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Musicki, M. F. Kramer, R. E. Becker, and A. L. Burnett Inactivation of phosphorylated endothelial nitric oxide synthase (Ser-1177) by O-GlcNAc in diabetes-associated erectile dysfunction PNAS, August 16, 2005; 102(33): 11870 - 11875. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Semenov, K. Tamai, and X. He SOST Is a Ligand for LRP5/LRP6 and a Wnt Signaling Inhibitor J. Biol. Chem., July 22, 2005; 280(29): 26770 - 26775. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. T. Zlateva, P. Lemey, E. Moes, A.-M. Vandamme, and M. Van Ranst Genetic Variability and Molecular Evolution of the Human Respiratory Syncytial Virus Subgroup B Attachment G Protein J. Virol., July 15, 2005; 79(14): 9157 - 9167. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. O'Connor, B. Eisenhaber, J. Dalley, T. Wang, C. Missen, N. Bulleid, P. N. Bishop, and D. Trump Species specific membrane anchoring of nyctalopin, a small leucine-rich repeat protein Hum. Mol. Genet., July 1, 2005; 14(13): 1877 - 1887. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. I. Olason Integrating protein annotation resources through the Distributed Annotation System Nucleic Acids Res., July 1, 2005; 33(suppl_2): W468 - W470. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. K. Patnaik and P. Stanley Mouse Large Can Modify Complex N- and Mucin O-Glycans on {alpha}-Dystroglycan to Induce Laminin Binding J. Biol. Chem., May 27, 2005; 280(21): 20851 - 20859. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Nakav, A. Jablonka-Shariff, S. Kaner, P. Chadna-Mohanty, H. E. Grotjan, and D. Ben-Menahem The LH{beta} Gene of Several Mammals Embeds a Carboxyl-terminal Peptide-like Sequence Revealing a Critical Role for Mucin Oligosaccharides in the Evolution of Lutropin to Chorionic Gonadotropin in the Animal Phyla J. Biol. Chem., April 29, 2005; 280(17): 16676 - 16684. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. H. Olesen, L. L. Christensen, F. B. Sorensen, T. Cabezon, S. Laurberg, T. F. Orntoft, and K. Birkenkamp-Demtroder Human FK506 Binding Protein 65 Is Associated with Colorectal Cancer Mol. Cell. Proteomics, April 1, 2005; 4(4): 534 - 544. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. I. MacRae, A. Acosta-Serrano, N. A. Morrice, A. Mehlert, and M. A. J. Ferguson Structural Characterization of NETNES, a Novel Glycoconjugate in Trypanosoma cruzi Epimastigotes J. Biol. Chem., April 1, 2005; 280(13): 12201 - 12211. [Abstract] [Full Text] [PDF] |
||||