This page lets you set the various parameters available on this platform.
Method for computing molecular similarity
ref: https://medium.com/@santuchal/understanding-molecular-similarity-51e8ebb38886
Methods for Measuring Molecular Similarity
Chemical Fingerprints: These encode structural information into a binary format, allowing for efficient comparisons of molecular structures.
2D and 3D Structural Alignment: Utilizing algorithms to align molecules in two or three dimensions, respectively, to assess similarities based on spatial conformations.
Quantitative Structure-Activity Relationships (QSAR): Analyzing the relationship between molecular structures and biological activities to predict new compounds’ properties.
Machine Learning and AI Techniques: Leveraging advanced algorithms to recognize patterns and similarities among large datasets of molecular structures.
Molecular similarity measures:
1. Tanimoto Similarity: The Tanimoto coefficient compares two chemical compounds through their binary fingerprints. It is the number of features (bits set to 1) common to both fingerprints divided by the number of features present in at least one of them. It ranges from 0 (no shared features) to 1 (identical fingerprints). For instance, a coefficient of 0.7 means that 70% of the features set in either fingerprint are set in both, which suggests substantial molecular resemblance. This metric is widely used in cheminformatics for comparing compounds based on their structural characteristics.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
import pandas as pd
from stqdm import stqdm  # progress bar for Streamlit

def tanimoto_similarity(query_smiles, all_smiles):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    # Morgan (circular) fingerprint, radius 2, hashed into 2048 bits
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=2048)
    for mol in stqdm(all_smiles):
        mol_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=2048)
        similarity = DataStructs.TanimotoSimilarity(query_fp, mol_fp)
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
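The ratio described above can be checked by hand on small fingerprints. The following is a minimal sketch in plain Python, assuming each fingerprint is given as the set of indices of its bits set to 1. The helper name `tanimoto`, the example bit indices, and the choice to return 0.0 for two empty fingerprints are assumptions made for illustration, not part of RDKit.

```python
def tanimoto(on_bits_a: set, on_bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(on_bits_a & on_bits_b)   # bits set in both fingerprints
    total = len(on_bits_a | on_bits_b)    # bits set in at least one fingerprint
    # Convention of this sketch: two empty fingerprints score 0.0.
    return shared / total if total else 0.0


# Hypothetical fingerprints: indices of the bits set to 1.
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 15}

score = tanimoto(fp_a, fp_b)  # 3 shared bits / 6 distinct bits
print(score)  # 0.5
```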
2. RDKit Similarity: The RDKFingerprint is RDKit's Daylight-like topological fingerprint. It enumerates the paths of bonds in the molecule (and, by default, branched subgraphs) up to a maximum length, hashes each one, and uses the hashes to set bits in a fixed-length bit vector. Because of hashing, different fragments can land on the same bit, so the fingerprint is not guaranteed to be unique to each molecule. The resulting binary fingerprint can be used to compare molecules, assess structural similarities, and perform clustering or machine learning tasks in cheminformatics. In the function below, DataStructs.FingerprintSimilarity is called without a metric argument, so it uses its default, the Tanimoto coefficient.
def rdkit_similarity(query_smiles, all_mols):
    results = []
    query_fp = Chem.RDKFingerprint(Chem.MolFromSmiles(query_smiles))
    for mol in stqdm(all_mols):
        rdkit_similarity = DataStructs.FingerprintSimilarity(query_fp, Chem.RDKFingerprint(Chem.MolFromSmiles(mol)))
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : rdkit_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
3. Tversky Similarity: The Tversky index is an asymmetric generalization of the Tanimoto coefficient (Jaccard index) and the Dice coefficient. It divides the number of features shared by both fingerprints by that same number plus α times the number of features found only in the first (query) molecule plus β times the number of features found only in the second (reference) molecule. Setting α = β = 1 gives the Tanimoto coefficient, and α = β = 0.5 gives the Dice coefficient. Choosing α ≠ β makes the measure asymmetric: for example, α = 1 and β = 0 scores how much of the query is contained in the reference, which suits substructure-like searches. The parameters α and β therefore let users tune the calculation to the research context.
def tversky_similarity(query_smiles, all_mols):
    results = []
    # alpha = beta = 0.5 makes the Tversky index equal to the Dice coefficient
    alpha = 0.5 # Weight for features found only in the query molecule
    beta = 0.5 # Weight for features found only in the reference molecule
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024)
    for mol in stqdm(all_mols):
        mol_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=1024)
        tversky_similarity = DataStructs.TverskySimilarity(query_fp, mol_fp, alpha, beta)
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : tversky_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
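The effect of the two weights is easiest to see on a toy example. The sketch below assumes the usual definition of the Tversky index, shared / (shared + α · only_in_first + β · only_in_second), with fingerprints given as sets of on-bit indices. The helper `tversky` and the bit indices are hypothetical and only serve the illustration.

```python
def tversky(first: set, second: set, alpha: float, beta: float) -> float:
    """Tversky index of two fingerprints given as sets of on-bit indices."""
    shared = len(first & second)
    only_first = len(first - second)    # weighted by alpha
    only_second = len(second - first)   # weighted by beta
    denom = shared + alpha * only_first + beta * only_second
    return shared / denom if denom else 0.0


# Hypothetical case: every bit of the query also appears in the reference.
query = {1, 4, 7}
reference = {1, 4, 7, 9, 12, 15}

as_tanimoto = tversky(query, reference, 1.0, 1.0)      # 3 / (3 + 0 + 3)
as_dice = tversky(query, reference, 0.5, 0.5)          # 3 / (3 + 0 + 1.5)
query_in_ref = tversky(query, reference, 1.0, 0.0)     # 3 / (3 + 0 + 0)
ref_in_query = tversky(reference, query, 1.0, 0.0)     # 3 / (3 + 3 + 0)

print(as_tanimoto, as_dice, query_in_ref, ref_in_query)
```

Swapping the two arguments changes the result as soon as α ≠ β, which is what makes the measure asymmetric.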
4. Euclidean Similarity: The Euclidean distance is the straight-line distance between two points in a multidimensional space: the square root of the sum of squared differences between corresponding descriptor values of two molecules. For binary fingerprints each differing bit contributes exactly 1 to that sum, so the distance is the square root of the number of bits that differ. A smaller Euclidean distance implies higher similarity. To obtain a score that grows with resemblance, the function below converts the distance d to 1 / (1 + d), which is 1 for identical fingerprints and tends to 0 as the distance increases.
def euclidian_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2)
    for mol in stqdm(all_mols):
        mol_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2)
        # XOR keeps the differing bits; for 0/1 vectors their count is the squared distance
        distance = (query_fp ^ mol_fp).GetNumOnBits() ** 0.5
        euclidean_similarity = 1 / (1 + distance)
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : euclidean_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
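As a worked illustration of the distance itself, here is a minimal plain-Python sketch on fingerprints given as sets of on-bit indices. The mapping from distance to similarity, 1 / (1 + d), is one common choice and is an assumption of this sketch. The function name and the bit indices are hypothetical.

```python
import math


def euclidean_similarity(on_bits_a: set, on_bits_b: set) -> float:
    """Similarity derived from the Euclidean distance between two 0/1 vectors."""
    # Each differing position contributes (1 - 0)^2 = 1 to the squared distance,
    # so the squared distance is the size of the symmetric difference.
    differing = len(on_bits_a ^ on_bits_b)
    distance = math.sqrt(differing)
    # Assumed mapping to a similarity in (0, 1]; other mappings are possible.
    return 1.0 / (1.0 + distance)


# Hypothetical fingerprints differing in four positions: 9, 12, 15 and 20.
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 15, 20}

sim = euclidean_similarity(fp_a, fp_b)  # distance = sqrt(4) = 2, similarity = 1/3
print(sim)
```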
5. Dice Similarity: The Dice coefficient measures the overlap between the binary fingerprints of two molecules. It is computed as twice the number of bits set in both fingerprints divided by the sum of the numbers of bits set in each fingerprint. It ranges from 0 (no overlap) to 1 (complete overlap). Compared with the Tanimoto coefficient it gives more weight to the shared bits, so for the same pair of molecules the Dice value is never lower than the Tanimoto value.
def dice_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024)
    for mol in stqdm(all_mols):
        dice_similarity = DataStructs.DiceSimilarity(query_fp, AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=1024))
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : dice_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
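The Dice formula, and its relation to the Tanimoto coefficient T (Dice = 2T / (1 + T)), can be verified on a small case. This is a sketch in plain Python with fingerprints given as sets of on-bit indices. The helper names and the example bits are hypothetical.

```python
def dice(on_bits_a: set, on_bits_b: set) -> float:
    """Dice coefficient: twice the shared bits over the sum of the on-bit counts."""
    total = len(on_bits_a) + len(on_bits_b)
    return 2 * len(on_bits_a & on_bits_b) / total if total else 0.0


def tanimoto(on_bits_a: set, on_bits_b: set) -> float:
    """Tanimoto coefficient, included only to show the relation with Dice."""
    union = len(on_bits_a | on_bits_b)
    return len(on_bits_a & on_bits_b) / union if union else 0.0


# Hypothetical fingerprints sharing three bits.
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 15}

d = dice(fp_a, fp_b)      # 2 * 3 / (5 + 4)
t = tanimoto(fp_a, fp_b)  # 3 / 6
print(d, t, 2 * t / (1 + t))
```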
6. Cosine Similarity: Cosine similarity is the cosine of the angle between two fingerprint vectors, which expresses how well their features align. The fingerprints capture structural features, representing the presence or absence of various substructures or molecular characteristics. It is computed as the dot product of the two vectors divided by the product of their magnitudes; for binary fingerprints this is the number of shared bits divided by the square root of the product of the two on-bit counts. In general the cosine ranges from -1 to 1, but fingerprint bits are never negative, so here the value lies between 0 (no shared bits, orthogonal vectors) and 1 (identical fingerprints).
def cosine_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024)
    for mol in stqdm(all_mols):
        cosine_similarity = DataStructs.CosineSimilarity(query_fp, AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=1024))
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : cosine_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
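For 0/1 vectors the dot product is the number of shared on bits and each magnitude is the square root of the on-bit count, so the cosine reduces to a short expression. The sketch below shows this in plain Python. The helper name, the example bits, and the convention of returning 0.0 for an empty fingerprint are assumptions made for illustration.

```python
import math


def cosine(on_bits_a: set, on_bits_b: set) -> float:
    """Cosine similarity of two binary fingerprints given as sets of on-bit indices."""
    if not on_bits_a or not on_bits_b:
        return 0.0  # convention of this sketch for an all-zero vector
    dot = len(on_bits_a & on_bits_b)                       # shared on bits
    norms = math.sqrt(len(on_bits_a) * len(on_bits_b))     # product of magnitudes
    return dot / norms


# Hypothetical fingerprints: 4 and 9 on bits, 3 of them shared.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 7, 12, 15, 20, 22, 30, 31}

score = cosine(fp_a, fp_b)  # 3 / sqrt(4 * 9) = 3 / 6
print(score)  # 0.5
```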
7. Rogot Goldberg Similarity: The Rogot-Goldberg (RG) similarity compares two binary fingerprints using both the bits they share and the bits that are absent from both. With a the number of bits set in both fingerprints, d the number of bits set in neither, and b and c the numbers of bits set in only the first or only the second fingerprint, the score is a / (2a + b + c) + d / (2d + b + c). The first term is half the Dice coefficient; the second rewards common absences. The score ranges from 0 to 1 and equals 1 for identical fingerprints. The function below applies it to Morgan fingerprints, which encode the atom environments within a given radius of each atom, so molecules with similar local atom environments receive higher scores.
def rogot_goldberg_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024)
    for mol in stqdm(all_mols):
        mol_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=1024)
        rogot_goldberg_similarity = DataStructs.FingerprintSimilarity(query_fp, mol_fp, metric=DataStructs.RogotGoldbergSimilarity)
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : rogot_goldberg_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
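The Rogot-Goldberg coefficient is usually written a / (2a + b + c) + d / (2d + b + c), where a counts bits set in both fingerprints, d bits set in neither, and b and c bits set in only one of them. The following plain-Python sketch evaluates that formula on fingerprints given as sets of on-bit indices plus a fingerprint length. The helper name and the 8-bit examples are hypothetical.

```python
def rogot_goldberg(on_bits_a: set, on_bits_b: set, n_bits: int) -> float:
    """Rogot-Goldberg similarity of two binary fingerprints of length n_bits."""
    both_on = len(on_bits_a & on_bits_b)              # a
    mismatches = len(on_bits_a ^ on_bits_b)           # b + c
    both_off = n_bits - both_on - mismatches          # d
    if mismatches == 0:
        return 1.0  # identical fingerprints
    return (both_on / (2 * both_on + mismatches)
            + both_off / (2 * both_off + mismatches))


# Hypothetical 8-bit fingerprints: a = 2, b = c = 1, d = 4.
fp_a = {0, 1, 2}
fp_b = {0, 1, 3}

score = rogot_goldberg(fp_a, fp_b, n_bits=8)  # 2/6 + 4/10
print(score)

# Complementary fingerprints share neither on bits nor off bits.
opposite = rogot_goldberg({0, 1, 2, 3}, {4, 5, 6, 7}, n_bits=8)
print(opposite)  # 0.0
```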
8. MACCS Keys Similarity: The MACCS (Molecular ACCess System) Keys are a set of 166 predefined structural keys used to represent molecular structures in cheminformatics. Each key records the presence or absence of a given substructure, functional group, or atom arrangement. RDKit stores them in a 167-bit vector whose bit 0 is unused. To compute MACCS Keys Similarity, both molecules are first encoded with these keys, then their bit vectors are compared. In the function below the comparison uses DataStructs.FingerprintSimilarity, whose default metric is the Tanimoto coefficient, so the score is the number of keys present in both molecules divided by the number of keys present in at least one.
def MACCS_keys_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMACCSKeysFingerprint(query_mol)
    for mol in stqdm(all_mols):
        MACCS_keys_similarity = DataStructs.FingerprintSimilarity(query_fp, AllChem.GetMACCSKeysFingerprint(Chem.MolFromSmiles(mol)))
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : MACCS_keys_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
9. Manhattan Similarity: The Manhattan distance, also known as the L1 distance or city block distance, is the sum of the absolute differences between corresponding elements of two vectors, in this case the molecular fingerprints. For binary fingerprints it is simply the number of bit positions in which the two fingerprints differ, so it measures the total dissimilarity between the molecules: lower distances mean more similar molecules. To report a similarity instead, the function below divides the distance by the fingerprint length and subtracts the result from 1, so that identical fingerprints score 1 and higher scores indicate greater structural similarity.
def manhattan_similarity(query_smiles, all_mols):
    results = []
    query_mol = Chem.MolFromSmiles(query_smiles)
    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024)
    for mol in stqdm(all_mols):
        mol_fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(mol), 2, nBits=1024)
        # XOR keeps the differing bits; their count is the L1 distance between 0/1 vectors
        distance = (query_fp ^ mol_fp).GetNumOnBits()
        manhattan_similarity = 1 - distance / query_fp.GetNumBits()
        result_entry = {
            'Query SMILE' : query_smiles,
            'Data SMILE' : mol,
            'Similarity' : manhattan_similarity
        }
        results.append(result_entry)
    df1 = pd.DataFrame(results)
    return df1
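On binary vectors the sum of absolute differences is just a count of differing positions. The sketch below works this out in plain Python on fingerprints given as sets of on-bit indices. Normalizing by the fingerprint length to get a score between 0 and 1 is an assumption of this sketch, as are the helper name and the 16-bit example.

```python
def manhattan_similarity(on_bits_a: set, on_bits_b: set, n_bits: int) -> float:
    """Similarity derived from the Manhattan (L1) distance between two 0/1 vectors."""
    # |1 - 0| = 1 for every differing position, so the L1 distance is
    # the size of the symmetric difference of the two on-bit sets.
    distance = len(on_bits_a ^ on_bits_b)
    # Assumed normalization: share of positions on which the fingerprints agree.
    return 1.0 - distance / n_bits


# Hypothetical 16-bit fingerprints differing in positions 9, 12 and 15.
fp_a = {1, 4, 7, 9, 12}
fp_b = {1, 4, 7, 15}

score = manhattan_similarity(fp_a, fp_b, n_bits=16)  # 1 - 3/16
print(score)  # 0.8125
```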
Method for computing cluster similarity
Additional notes
Retrieving the list of protein types
query: MATCH (r:Protein) WITH split(r.ref,"_")[1] AS categ RETURN categ, count(categ)
Retrieving the min and max values of the FIT relationship
Efficiency query: MATCH ()-[r:FIT]->() RETURN MAX(r.efficiency),MIN(r.efficiency),AVG(r.efficiency)
Affinity query: MATCH ()-[r:FIT]->() RETURN MAX(r.affinity),MIN(r.affinity),AVG(r.affinity)
Retrieving the values for the Efficiency/Affinity heatmap
query:
UNWIND [0.25,0.386,0.522,0.658,0.794,0.93,1.066,1.202,1.338,1.474,1.61] AS efficience UNWIND range(6,30) AS affinity
MATCH ()-[r:FIT]-() WHERE affinity <= r.affinity
RETURN affinity,efficience,count(r)
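The efficiency values listed in the heatmap query are evenly spaced, so they can be regenerated rather than typed by hand. The sketch below assumes the bounds 0.25 and 1.61 and the ten intervals that the literal list implies. The variable names are hypothetical. It also builds the grid of (affinity, efficiency) cells that the two UNWIND clauses produce before the MATCH is applied.

```python
# Assumed bounds and number of intervals, read off the literal list in the query.
start, stop, n_intervals = 0.25, 1.61, 10
step = (stop - start) / n_intervals  # 0.136

# Round to three decimals to match the values written in the query.
efficiencies = [round(start + i * step, 3) for i in range(n_intervals + 1)]

# Cypher's range(6, 30) includes both bounds, unlike Python's range.
affinities = list(range(6, 31))

# One cell per (affinity, efficiency) pair; the query returns at most this many rows.
grid = [(affinity, efficiency) for efficiency in efficiencies for affinity in affinities]

print(efficiencies)
print(len(grid))  # 275
```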