# Introduction

Evolution has generated a large collection of homologous proteins with similar structures. The conserved tertiary structure of a protein imposes constraints on the amino acid sequences sampled by the evolutionary dynamics. These constraints can be detected using methods that quantify conservation and direct correlations between positions in an alignment of the sequences of a protein family.

These constraints are not the only ones that shape the ensemble of sequences in protein families. Most proteins have evolved to interact with other protein partners, in ways ranging from quaternary assembly to functional docking, for example in signal transduction or enzymatic interactions.

In this work we perform a large-scale analysis to detect the co-evolutionary signals of quaternary assembly between different chains. To restrict the problem and avoid hard technical problems, we focus on homo-oligomers, which are ubiquitous among quaternary structures.
The statistical imprints of quaternary structural constraints can be difficult to detect because different signals, presumably of different magnitude, co-occur in the sequence statistics: the intra-chain folding signals can mask the inter-chain protein-protein interaction signals that we are looking for. The idea is to select the co-evolutionary signals that do not have an intra-chain structural origin and to test whether they come from quaternary assembly constraints.

# Database

In order to perform a large-scale analysis of structural co-evolutionary signals in homo-oligomers, we collected a database of alignments of protein domain families (from Pfam 27.0) and related PDB structures, against which we test the co-evolutionary signals extracted with the DCA method.

The selection of the protein domain families is based on two main criteria:

• a statistically relevant sequence sample for the protein family, i.e., an effective number of sequences (similarity < 0.8) greater than 500;

• the presence of at least one structure, experimentally solved by X-ray diffraction at good resolution (< 3 Å), of a biological assembly with homo-oligomers containing the domain.
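
The effective number of sequences in the first criterion can be illustrated with the reweighting scheme commonly used in DCA studies. The sketch below assumes that scheme: each sequence is down-weighted by the number of sequences that are at least 80% identical to it. The function names, the use of fractional identity as the similarity measure, and the inclusive threshold are assumptions made for illustration, not details stated in the text.

```python
def fractional_identity(a, b):
    """Fraction of aligned columns at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_sequence_count(msa, threshold=0.8):
    """Sum of sequence weights under similarity reweighting.

    Assumption: a sequence's weight is 1 / (number of sequences, itself
    included, whose identity to it is >= threshold).
    """
    total = 0.0
    for a in msa:
        neighbours = sum(1 for b in msa if fractional_identity(a, b) >= threshold)
        total += 1.0 / neighbours
    return total


# Toy alignment: three near-identical sequences and one outlier.
toy_msa = ["ACDEF", "ACDEF", "ACDEG", "WWWWW"]
m_eff = effective_sequence_count(toy_msa)
```

Under these assumptions a family would satisfy the first criterion when `effective_sequence_count(msa) > 500`. The double loop is quadratic in the number of sequences, so it only serves as an illustration for small alignments.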

Since the quaternary structure signals are likely to depend on the domain architecture of the chains, we compare the DCA co-evolutionary signal with several PDB structures, so as to include all the different biological assemblies that have been experimentally solved.
Some remarks on this point are necessary. For each Pfam domain selected:

• We collect all PDB entries with a biological assembly that contains homo-oligomers of the selected Pfam domain.

• For each PDB entry we take into account repetitions of the domain inside the same chain and the different annotated biological assemblies. Each homo-oligomer domain pairing (structure unit) is thus identified unambiguously by:
Pfam ID, PDB ID, chain 1 ID, chain 2 ID, chain 1 domain number, chain 2 domain number, biological assembly number.

• We create a map between the alignment positions and the 3D residue distances for each PDB entry (backmapping). We extract the minimal distances between the heavy atoms of residues in the domain:

1. In the monomeric structure, i.e. the intra-chain distances,

2. In the homo-oligomer pairing, i.e. between different chains, the inter-chain distances.

• We select only chains that have sufficient coverage of the domain: the backmapped part of the domain must cover more than $$30 \%$$ of the alignment length.

• In order to include only interacting homo-oligomers, we filter out those whose number of interacting residues falls below a given threshold (we keep those with more than 15).
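
The distance extraction and the two filters above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: residues are represented as plain lists of `(element, x, y, z)` tuples, all function names are hypothetical, and the 8 Å cutoff used to decide which residues interact is an assumption, since the text does not state a contact cutoff. Counting interacting residues on one chain only is likewise an assumption.

```python
import math


def min_heavy_atom_distance(res_a, res_b):
    """Smallest distance between non-hydrogen atoms of two residues."""
    heavy_a = [atom[1:] for atom in res_a if atom[0] != "H"]
    heavy_b = [atom[1:] for atom in res_b if atom[0] != "H"]
    return min(math.dist(p, q) for p in heavy_a for q in heavy_b)


def distance_matrix(chain_a, chain_b):
    """Residue-residue minimal heavy-atom distances between two chains.

    Passing the same chain twice gives the intra-chain distances.
    """
    return [[min_heavy_atom_distance(ra, rb) for rb in chain_b] for ra in chain_a]


def passes_coverage(n_backmapped, alignment_length, min_coverage=0.30):
    """True if the backmapped part covers more than 30% of the alignment."""
    return n_backmapped / alignment_length > min_coverage


def count_interface_residues(inter_matrix, cutoff=8.0):
    """Residues of the first chain within `cutoff` of the second chain.

    The cutoff value is an assumption made for this sketch.
    """
    return sum(1 for row in inter_matrix if any(d < cutoff for d in row))


def is_interacting(inter_matrix, min_residues=15, cutoff=8.0):
    """True if more than `min_residues` residues are at the interface."""
    return count_interface_residues(inter_matrix, cutoff) > min_residues


# Toy residues: the hydrogen atom is ignored by the distance computation.
res_a = [("N", 0, 0, 0), ("H", 0, 0, 5)]
res_b = [("C", 0, 0, 6), ("O", 3, 4, 0)]
d_ab = min_heavy_atom_distance(res_a, res_b)
```

In practice the rows and columns of such a matrix would be indexed by alignment position through the backmapping, so that matrices from different PDB entries of the same family can be compared.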

At this step the database includes 1272 Pfam domains with 16483 PDB entries, for a total of 92625 experimentally solved intra-chain structure units and 63107 inter-chain structure units.

To compare the DCA results with the inter-chain and intra-chain distances, we take for each Pfam domain the minimal distance across the individual PDB-chain-assembly distances, for both the intra-chain and the inter-chain distances.
Note that in the latter case the distance matrices are not symmetric: dist(res1,res2) $$\neq$$ dist(res2,res1). To superimpose the individual distance matrices it is necessary to identify the order of the two chains in the interaction. In principle one can perform a structural alignment of the chain-chain complex, but this can be ambiguous for different assembly architectures. We therefore decided to symmetrize the matrix by taking the minimum of the two distances, min(dist(res1,res2), dist(res2,res1)).
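
The symmetrization and the minimum over structure units can be written compactly. The sketch below assumes that every matrix is a square list of lists indexed by alignment position and that positions missing from a structure are stored as `math.inf`. Both conventions, and the function names, are assumptions made for illustration.

```python
import math


def symmetrize_min(d):
    """Replace d[i][j] by min(d[i][j], d[j][i]) for every pair of positions."""
    n = len(d)
    return [[min(d[i][j], d[j][i]) for j in range(n)] for i in range(n)]


def merge_min(matrices):
    """Element-wise minimum over the distance matrices of all structure units."""
    n = len(matrices[0])
    return [[min(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


# An asymmetric inter-chain matrix for a two-position toy domain.
inter = [[4.0, 3.0], [7.0, 6.0]]
sym = symmetrize_min(inter)

# Two structure units; the first one does not cover the off-diagonal pair.
unit_1 = [[4.0, math.inf], [math.inf, 4.0]]
unit_2 = [[6.0, 5.0], [5.0, 2.0]]
merged = merge_min([unit_1, unit_2])
```

Because the minimum is taken element-wise, symmetrizing each matrix first and then merging gives the same result as merging first and symmetrizing afterwards.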

Reducing several PDB contact maps to a single one by taking the union of the contact maps (minimum distance method) can give rise to pathological situations. Because of the heterogeneity of the different biological assembly architectures, the union contact map can have a very high contact density, which can bias the results.
This also happens, to a lesser extent, for the intra-chain distances, because of the intrinsic variability of the domain structure (for example when the structures sample different conformational states).
Alternatively, it is possible to define a “consensus” contact map, in which a pair of residues is considered a contact if it is in contact in at least a given percentage of the selected structures. This procedure requires the space of possible structures to have been sampled fairly in some sense (one possible procedure is to cluster all the solved structures of domain complexes).
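
The difference between the union and the consensus contact map can be made concrete with a short sketch. The 8 Å contact cutoff and the 50% consensus fraction used as defaults are assumptions chosen for illustration, as are the function names; the text does not fix either value.

```python
def union_contact_map(matrices, cutoff=8.0):
    """A pair is a contact if it is in contact in at least one structure."""
    n = len(matrices[0])
    return [[any(m[i][j] < cutoff for m in matrices) for j in range(n)]
            for i in range(n)]


def consensus_contact_map(matrices, cutoff=8.0, min_fraction=0.5):
    """A pair is a contact if it is in contact in at least `min_fraction`
    of the structures (both default values are assumptions)."""
    n = len(matrices[0])
    k = len(matrices)
    return [[sum(m[i][j] < cutoff for m in matrices) / k >= min_fraction
             for j in range(n)]
            for i in range(n)]


# Three toy structures: the pair (0, 1) is in contact in only one of them.
structures = [
    [[0.0, 5.0], [5.0, 0.0]],
    [[0.0, 12.0], [12.0, 0.0]],
    [[0.0, 15.0], [15.0, 0.0]],
]
union = union_contact_map(structures)
consensus = consensus_contact_map(structures)
```

In this toy case the union map marks the pair (0, 1) as a contact while the consensus map does not, which is the mechanism by which the union can inflate the contact density.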

Figures 1, 2, 3 and 4 show the contact densities of each domain for intra-chain ($$d_{intra}$$) and inter-chain ($$d_{inter}$$) contacts, and the scatter plot with the number of structure units (domain pairings).
We decided to filter out the Pfam families that display an inter-chain contact density below a given threshold (we keep those with $$d_{inter}>0.01$$ and $$d_{inter}/d_{intra}>0.1$$). We finally end up with a collection of 750 Pfam families.
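
The density filter can be sketched as follows. The text does not define the contact density explicitly, so this example assumes it is the fraction of position pairs (i < j) whose distance is below a cutoff; that definition, the 8 Å cutoff and the function names are assumptions. Only the two thresholds, 0.01 and 0.1, come from the text.

```python
def contact_density(matrix, cutoff=8.0):
    """Fraction of position pairs i < j with distance below `cutoff`.

    Assumed definition; the cutoff value is also an assumption.
    """
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    contacts = sum(1 for i, j in pairs if matrix[i][j] < cutoff)
    return contacts / len(pairs)


def keep_family(d_inter, d_intra, min_inter=0.01, min_ratio=0.1):
    """Keep a family if d_inter > 0.01 and d_inter / d_intra > 0.1."""
    if d_intra <= 0:
        return False
    return d_inter > min_inter and d_inter / d_intra > min_ratio


# Toy symmetric distance matrix with one contact among three pairs.
toy = [[0.0, 5.0, 20.0], [5.0, 0.0, 9.0], [20.0, 9.0, 0.0]]
density = contact_density(toy)
```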

We compute the DCA predictions, using the pseudo-likelihood approach (see details in []), over all the Pfam multiple sequence alignments (MSAs) of the selected domains.
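
For orientation, the pseudo-likelihood approach is usually formulated on a Potts model over aligned sequences $$(a_1,\dots,a_L)$$ of length $$L$$. The notation below (fields $$h_i$$, couplings $$J_{ij}$$) is the standard one and is assumed here; the regularization and scoring details of the implementation actually used are not given in the text and are not reproduced.

$$P(a_1,\dots,a_L) = \frac{1}{Z}\exp\Big(\sum_{i} h_i(a_i) + \sum_{i<j} J_{ij}(a_i,a_j)\Big)$$

Rather than maximizing the full likelihood, which requires the intractable normalization $$Z$$, the parameters are fitted by maximizing, over the sequences of the MSA, the conditional probability of each position given all the others:

$$P(a_i \mid a_{\setminus i}) = \frac{\exp\Big(h_i(a_i) + \sum_{j\neq i} J_{ij}(a_i,a_j)\Big)}{\sum_{b}\exp\Big(h_i(b) + \sum_{j\neq i} J_{ij}(b,a_j)\Big)}$$

The sum over $$b$$ runs over the amino acid alphabet (including the gap symbol), so each conditional is normalized over a single position only.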