Chemical Metadata (chemicals.identifiers)

This module contains a database of metadata on ~70000 chemicals from the PubChem database. It provides comprehensive features for searching the metadata. It also includes a small database of common mixture compositions.

For reporting bugs, adding feature requests, or submitting pull requests, please use the GitHub issue tracker.

Search Functions

chemicals.identifiers.CAS_from_any(ID, autoload=False, cache=True)[source]

Wrapper around search_chemical which returns the CAS number of the found chemical directly.

Parameters
ID : str

One of the name formats described by search_chemical, [-]

autoload : bool, optional

Whether to load new chemical databanks during the search if a hit is not immediately found, [-]

cache : bool, optional

Whether or not to cache the search for faster lookup in subsequent queries, [-]

Returns
CASRN : str

A three-piece, dash-separated set of numbers

Notes

An exception is raised if the name cannot be identified. The PubChem database includes a wide variety of other synonyms, but these may not be present for all chemicals. See search_chemical for more details.

Examples

>>> CAS_from_any('water')
'7732-18-5'
>>> CAS_from_any('InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3')
'64-17-5'
>>> CAS_from_any('CCCCCCCCCC')
'124-18-5'
>>> CAS_from_any('InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N')
'64-17-5'
>>> CAS_from_any('pubchem=702')
'64-17-5'
>>> CAS_from_any('O') # only elements can be specified by symbol
'17778-80-2'
chemicals.identifiers.MW(ID, autoload=False, cache=True)[source]

Wrapper around search_chemical which returns the molecular weight of the found chemical directly.

Parameters
ID : str

One of the name formats described by search_chemical

Returns
MW : float

Molecular weight of chemical, [g/mol]

Notes

An exception is raised if the name cannot be identified. The PubChem database includes a wide variety of other synonyms, but these may not be present for all chemicals. See search_chemical for more details.

Examples

>>> MW('water')
18.01528
>>> MW('InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3')
46.06844
>>> MW('CCCCCCCCCC')
142.286
>>> MW('InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N')
46.06844
>>> MW('pubchem=702')
46.06844
>>> MW('O') # only elements can be specified by symbol
15.9994
chemicals.identifiers.search_chemical(ID, autoload=False, cache=True)[source]

Looks up metadata about a chemical by searching and testing for the input string being any of the following types of chemical identifiers:

  • Name, in IUPAC form or common form or a synonym registered in PubChem

  • InChI name, prefixed by ‘InChI=1S/’ or ‘InChI=1/’

  • InChI key, prefixed by ‘InChIKey=’

  • PubChem CID, prefixed by ‘PubChem=’

  • SMILES (prefix with ‘SMILES=’ to ensure SMILES parsing; e.g. ‘C’ will return carbon because it is an element, whereas the SMILES interpretation of ‘C’ is methane)

  • CAS number (obsolete numbers may point to the current number)

If the input represents an element, the following additional forms may also be used:

  • Atomic symbol (ex ‘Na’)

  • Atomic number (as a string)
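
The identifier formats listed above can be told apart largely by their prefixes. The following is a minimal sketch of that idea, not the library's search code: `classify_identifier` and `CAS_PATTERN` are hypothetical names, and the case-insensitive prefix matching is an assumption made for the sketch.

```python
import re

# Hypothetical helper, not part of chemicals.identifiers: it only labels the
# kind of identifier a string looks like, using the prefixes listed above.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def classify_identifier(ID):
    """Return a label for the identifier type that `ID` appears to be."""
    text = ID.strip()
    lowered = text.lower()
    # Prefixes are matched case-insensitively here (an assumption).
    if lowered.startswith(("inchi=1s/", "inchi=1/")):
        return "InChI"
    if lowered.startswith("inchikey="):
        return "InChI key"
    if lowered.startswith("pubchem="):
        return "PubChem CID"
    if lowered.startswith("smiles="):
        return "SMILES"
    if CAS_PATTERN.match(text):
        return "CAS number"
    if text.isdigit():
        return "atomic number"
    # Names, atomic symbols and unprefixed SMILES cannot be separated by
    # shape alone; a real search has to test them against a database.
    return "name, symbol or unprefixed SMILES"


for example in ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "pubchem=702",
                "SMILES=C", "7732-18-5", "11", "water"]:
    print(example, "->", classify_identifier(example))
```

The last branch is deliberately vague: an input such as ‘C’ is ambiguous by shape, which is why the explicit ‘SMILES=’ prefix exists.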

Parameters
ID : str

One of the name formats described above

autoload : bool, optional

Whether to load new chemical databanks during the search if a hit is not immediately found, [-]

cache : bool, optional

Whether or not to cache the search for faster lookup in subsequent queries, [-]

Returns
chemical_metadata : ChemicalMetadata

A class containing attributes which describe the chemical’s metadata, [-]

Notes

An exception is raised if the name cannot be identified. The PubChem database includes a wide variety of other synonyms, but these may not be present for all chemicals.

Examples

>>> search_chemical('water')
<ChemicalMetadata, name=water, formula=H2O, smiles=O, MW=18.0153>
>>> search_chemical('InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3')
<ChemicalMetadata, name=ethanol, formula=C2H6O, smiles=CCO, MW=46.0684>
>>> search_chemical('CCCCCCCCCC')
<ChemicalMetadata, name=DECANE, formula=C10H22, smiles=CCCCCCCCCC, MW=142.286>
>>> search_chemical('InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N')
<ChemicalMetadata, name=ethanol, formula=C2H6O, smiles=CCO, MW=46.0684>
>>> search_chemical('pubchem=702')
<ChemicalMetadata, name=ethanol, formula=C2H6O, smiles=CCO, MW=46.0684>
>>> search_chemical('O') # only elements can be specified by symbol
<ChemicalMetadata, name=oxygen, formula=O, smiles=[O], MW=15.9994>
chemicals.identifiers.IDs_to_CASs(IDs)[source]

Find the CAS numbers for multiple chemical names at once. Also supports a string input which is a common mixture name in the database. An error will be raised if any of the chemicals cannot be found.

Parameters
IDs : list[str] or str

A list of chemical identifiers, or a string (or 1-element list) holding a name which may represent a mixture, [-]

Returns
CASs : list[str]

CAS numbers of found chemicals, [-]

Notes

White space, ‘-’, and letter case are ignored in the search.

Examples

>>> IDs_to_CASs('R512A')
['811-97-2', '75-37-6']
>>> IDs_to_CASs(['norflurane', '1,1-difluoroethane'])
['811-97-2', '75-37-6']
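
The name normalization mentioned in the Notes can be sketched as follows. This is an illustration under assumptions rather than the library's code: `normalize_mixture_name` and `mixtures_by_synonym` are hypothetical names, and the note is read here as meaning that white space and ‘-’ are stripped and letter case is ignored.

```python
def normalize_mixture_name(name):
    """Hypothetical helper: strip white space and '-' and ignore letter case."""
    return "".join(name.split()).replace("-", "").lower()


# A hypothetical one-entry synonym table; the CAS numbers are the ones shown
# for 'R512A' in the example above.
mixtures_by_synonym = {
    normalize_mixture_name("R512A"): ["811-97-2", "75-37-6"],
}

# Different spellings of the same mixture name reach the same entry.
for spelling in ["R512A", "r-512a", "R 512 A"]:
    print(spelling, "->", mixtures_by_synonym[normalize_mixture_name(spelling)])
```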

CAS Number Utilities

chemicals.identifiers.check_CAS(CASRN)[source]

Checks if a CAS number is valid. Returns False if the parser cannot parse the given string.

Parameters
CASRN : str

A three-piece, dash-separated set of numbers

Returns
result : bool

True if the CASRN is valid; False if it is invalid or cannot be parsed.

Notes

The check follows the method published by the Chemical Abstracts Service. However, no lookup against their service is performed; therefore, this function cannot detect false positives (well-formed numbers that pass the check but were never assigned).

The function does not support separators other than ‘-’.

CAS numbers up to the series 1 XXX XXX-XX-X are now being issued.

A 32-bit signed integer (a long) can hold CAS numbers up to 2 147 483-64-7.

Examples

>>> check_CAS('7732-18-5')
True
>>> check_CAS('77332-18-5')
False
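
The check-digit rule behind this kind of validation can be sketched in a few lines. This is a minimal illustration of the published CAS checksum, not the library's implementation; `cas_checksum_ok` is a hypothetical name.

```python
def cas_checksum_ok(CASRN):
    """Return True if the final digit matches the CAS check-digit rule."""
    parts = CASRN.split("-")
    # Expect exactly three numeric pieces: 2-7 digits, 2 digits, 1 digit.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if not (2 <= len(parts[0]) <= 7 and len(parts[1]) == 2 and len(parts[2]) == 1):
        return False
    body = parts[0] + parts[1]
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one,
    # then compare the sum modulo 10 with the check digit.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == int(parts[2])


print(cas_checksum_ok("7732-18-5"))   # True  (weighted sum is 105)
print(cas_checksum_ok("77332-18-5"))  # False (weighted sum is 134)
```

As with check_CAS, passing the checksum only shows that a number is well formed, not that it was ever assigned to a substance.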
chemicals.identifiers.CAS_to_int(i)[source]

Converts the CAS number of a compound from a string to an int. This is helpful when storing large numbers of CAS numbers, as their strings take up more memory than their numerical representation. All CAS numbers fit into 64-bit ints.

Parameters
CASRN : str

CASRN [-]

Returns
CASRN : int

CASRN [-]

Notes

Accomplishes the conversion by removing the dashes only, and then converting to an int. An incorrect CAS number is converted without raising an exception.

Examples

>>> CAS_to_int('7704-34-9')
7704349
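
The behaviour described in the Notes can be reproduced with a short stand-in. This is a sketch assuming the conversion is nothing more than dash removal followed by int(); `cas_string_to_int` is a hypothetical name.

```python
def cas_string_to_int(CASRN):
    """Hypothetical stand-in: drop the dashes and parse what is left."""
    return int(CASRN.replace("-", ""))


print(cas_string_to_int("7704-34-9"))  # 7704349
# No validation happens: a string with a wrong check digit converts silently.
print(cas_string_to_int("7704-34-0"))  # 7704340
```

When the input comes from an untrusted source, validating it with check_CAS before converting avoids storing such values unnoticed.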
chemicals.identifiers.int_to_CAS(i)[source]

Converts the CAS number of a compound from an int to a string. This is helpful when dealing with int CAS numbers.

Parameters
CASRN : int

CASRN [-]

Returns
CASRN : str

CASRN [-]

Notes

Handles CAS numbers with an unspecified number of digits. Does not work on floats.

Examples

>>> int_to_CAS(7704349)
'7704-34-9'
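
Because the last digit is always the check digit and the two digits before it form the middle piece, the reverse conversion only needs string slicing. The following is a sketch of that idea, not necessarily how the library does it; `cas_int_to_string` is a hypothetical name.

```python
def cas_int_to_string(i):
    """Hypothetical stand-in: re-insert the dashes at fixed offsets from the end."""
    digits = str(i)
    # Everything before the last three digits is the first piece, whatever
    # its length; that is why the number of digits need not be known.
    return f"{digits[:-3]}-{digits[-3:-1]}-{digits[-1]}"


print(cas_int_to_string(7704349))  # 7704-34-9
print(cas_int_to_string(64175))    # 64-17-5
```

Slicing from the right end is what lets one routine handle first pieces of any length.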
chemicals.identifiers.sorted_CAS_key(CASs)[source]

Takes a list of CAS numbers as strings, and returns a tuple of the same CAS numbers, sorted from smallest to largest. This is very convenient for obtaining a unique hash of a set of compounds, so as to see if two groups of compounds are the same.

Parameters
CASs : list[str]

CAS numbers as strings [-]

Returns
CASs_sorted : tuple[str]

Sorted CAS numbers from lowest (first) to highest (last) [-]

Notes

Does not check CAS numbers for validity.

Examples

>>> sorted_CAS_key(['7732-18-5', '64-17-5', '108-88-3', '98-00-0'])
('64-17-5', '98-00-0', '108-88-3', '7732-18-5')
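
The ordering in the example is numeric rather than alphabetical, and the result is a tuple so that it can serve as a dictionary key. A minimal sketch of both points, assuming the sort key is simply the CAS number with its dashes removed (`sorted_cas_tuple` is a hypothetical name):

```python
def sorted_cas_tuple(CASs):
    """Hypothetical stand-in: sort CAS strings by their integer value."""
    return tuple(sorted(CASs, key=lambda cas: int(cas.replace("-", ""))))


CASs = ['7732-18-5', '64-17-5', '108-88-3', '98-00-0']
print(sorted_cas_tuple(CASs))
# A plain string sort gives a different, less meaningful order.
print(tuple(sorted(CASs)))

# The tuple is hashable, so two orderings of the same mixture share one key.
cache = {sorted_cas_tuple(['7732-18-5', '64-17-5']): "water + ethanol"}
print(cache[sorted_cas_tuple(['64-17-5', '7732-18-5'])])
```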

Database Objects

There is an object used to represent a chemical’s metadata, an object used to represent a common mixture’s composition, and an object used to hold the database of chemical metadata.

class chemicals.identifiers.ChemicalMetadata(pubchemid, CAS, formula, MW, smiles, InChI, InChI_key, iupac_name, common_name, synonyms)[source]

Class for storing metadata on chemicals.

Attributes
pubchemid : int

Identification number on pubchem database; access their information online at https://pubchem.ncbi.nlm.nih.gov/compound/<pubchemid> [-]

formula : str

Formula of the compound; in the same format as chemicals.elements.serialize_formula generates, [-]

MW : float

Molecular weight of the compound as calculated with the standard atomic abundances; consistent with the element weights in chemicals.elements.periodic_table, [g/mol]

smiles : str

SMILES identification string, [-]

InChI : str

InChI identification string as given in pubchem (there can be multiple valid InChI strings for a compound), [-]

InChI_key : str

InChI key identification string (meant to be unique to a compound), [-]

iupac_name : str

IUPAC name as given in pubchem, [-]

common_name : str

Common name as given in pubchem, [-]

synonyms : list[str]

List of synonyms of the compound, [-]

CAS : int

CAS number of the compound; stored as an int for memory efficiency, [-]

class chemicals.identifiers.CommonMixtureMetadata(name, CASs, N, source, names, ws, zs, synonyms)[source]

Class for storing metadata on predefined chemical mixtures.

Attributes
name : str

Name of the mixture, [-]

source : str

Source of the mixture composition, [-]

N : int

Number of chemicals in the mixture, [-]

CASs : list[str]

CAS numbers of the mixture, [-]

ws : list[float]

Mass fractions of chemicals in the mixture, [-]

zs : list[float]

Mole fractions of chemicals in the mixture, [-]

names : list[str]

List of names of the chemicals in the mixture, [-]

synonyms : list[str]

List of synonyms of the mixture which can also be used to look it up, [-]
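
The ws and zs attributes describe the same composition on a mass and a mole basis. The two are related through the molecular weights by w_i = z_i·MW_i / Σ_j z_j·MW_j. A small sketch of that relation follows; `mole_to_mass_fractions` is a hypothetical helper, and the equimolar water/ethanol composition is made up for illustration, with the molecular weights taken from the MW examples earlier on this page.

```python
def mole_to_mass_fractions(zs, MWs):
    """Convert mole fractions to mass fractions using molecular weights [g/mol]."""
    weighted = [z * MW for z, MW in zip(zs, MWs)]
    total = sum(weighted)
    return [w / total for w in weighted]


MWs = [18.01528, 46.06844]  # water, ethanol
zs = [0.5, 0.5]             # illustrative equimolar mixture
ws = mole_to_mass_fractions(zs, MWs)
print(ws)        # ethanol carries most of the mass
print(sum(ws))   # the fractions still sum to 1 (up to rounding)
```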

class chemicals.identifiers.ChemicalMetadataDB(elements=True, main_db='/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/chemical identifiers pubchem large.tsv', user_dbs=['/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/chemical identifiers pubchem small.tsv', '/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/chemical identifiers example user db.tsv', '/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/Cation db.tsv', '/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/Anion db.tsv', '/home/docs/checkouts/readthedocs.org/user_builds/chemicals/envs/release/lib/python3.11/site-packages/chemicals-1.1.5-py3.11.egg/chemicals/Identifiers/Inorganic db.tsv'])[source]

Object which holds the main database of chemical metadata.

Warning

To allow the chemicals library to grow and improve, the details of this class may change in the future without notice!

Attributes
finished_loading

Whether or not the database has loaded the main database.

Methods

autoload_main_db()

Load the main database when needed.

finish_loading()

Complete loading the main database, if it has not been fully loaded.

load(file_name)

Load a particular file into the indexes.

load_elements()

Load elements into the indexes.

search_CAS(CAS[, autoload])

Search for a chemical by its CAS number.

search_InChI(InChI[, autoload])

Search for a chemical by its InChI string.

search_InChI_key(InChI_key[, autoload])

Search for a chemical by its InChI key.

search_formula(formula[, autoload])

Search for a chemical by its serialized formula.

search_name(name[, autoload])

Search for a chemical by its name.

search_pubchem(pubchem[, autoload])

Search for a chemical by its pubchem number.

search_smiles(smiles[, autoload])

Search for a chemical by its smiles string.
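
The combination of per-identifier indexes and an optional autoload step can be illustrated with a toy model. Everything in the following block is hypothetical (the class `TinyMetadataDB`, its constructor arguments, its return values and the `autoload=True` default); it is a sketch of the general pattern, not the layout of ChemicalMetadataDB, whose details may change.

```python
class TinyMetadataDB:
    """Hypothetical toy database; not the real ChemicalMetadataDB."""

    def __init__(self, small_rows, main_rows):
        self.CAS_index = {}
        self.name_index = {}
        self._main_rows = main_rows   # stands in for the large file on disk
        self.finished_loading = False
        self._index(small_rows)       # only the small data set is loaded eagerly

    def _index(self, rows):
        # Every record is reachable through each identifier index.
        for CAS, name in rows:
            record = {"CAS": CAS, "name": name}
            self.CAS_index[CAS] = record
            self.name_index[name.lower()] = record

    def autoload_main_db(self):
        # Load the large data set at most once.
        if not self.finished_loading:
            self._index(self._main_rows)
            self.finished_loading = True

    def search_name(self, name, autoload=True):  # default is an assumption
        key = name.lower()
        if key not in self.name_index and autoload:
            self.autoload_main_db()
        return self.name_index.get(key)  # None when nothing matches


db = TinyMetadataDB(small_rows=[(7732185, "water")],
                    main_rows=[(64175, "ethanol")])
print(db.search_name("ethanol", autoload=False))  # None: main data not loaded
print(db.search_name("ethanol"))                  # found after autoloading
```

Deferring the large load keeps common lookups fast, at the cost of a slower first miss.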

chemicals.identifiers.get_pubchem_db()[source]

Helper function to delay the creation of the pubchem_db object.

This avoids loading the database when it is not needed.
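
Delaying object creation in this way is a common lazy-initialization pattern. A generic sketch is shown here, with hypothetical names (`_db`, `_build_db`, `get_db`, `load_count`) and a placeholder dictionary standing in for the expensive database object:

```python
_db = None        # hypothetical module-level cache, empty until first use
load_count = 0    # counts how often the expensive step actually runs


def _build_db():
    """Placeholder for an expensive load, e.g. parsing large files."""
    global load_count
    load_count += 1
    return {"7732-18-5": "water"}


def get_db():
    """Create the object on first call, then keep returning the same one."""
    global _db
    if _db is None:
        _db = _build_db()
    return _db


first = get_db()
second = get_db()
print(first is second, load_count)  # True 1
```

Importing a module written this way costs nothing until some function actually asks for the database.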

Chemical Groups

It is convenient to tag some chemicals with labels like “refrigerant”, or to record whether they appear in a certain database. The following chemical groups are available.

chemicals.identifiers.cryogenics = {'132259-10-0': 'Air', '1333-74-0': 'hydrogen', '630-08-0': 'carbon monoxide', '74-82-8': 'methane', '7439-90-9': 'krypton', '7440-01-9': 'neon', '7440-37-1': 'Argon', '7440-59-7': 'helium', '7440-63-3': 'xenon', '7727-37-9': 'nitrogen', '7782-39-0': 'deuterium', '7782-41-4': 'fluorine', '7782-44-7': 'oxygen'}
chemicals.identifiers.inerts = {'10043-92-2': 'radon', '10102-43-9': 'Nitric Oxide', '10102-44-0': 'Nitrogen Dioxide', '124-38-9': 'Carbon Dioxide', '132259-10-0': 'Air', '7439-90-9': 'krypton', '7440-01-9': 'Neon', '7440-37-1': 'Argon', '7440-59-7': 'Helium', '7440-63-3': 'Xenon', '7727-37-9': 'Nitrogen', '7732-18-5': 'water', '7782-41-4': 'fluorine', '7782-44-7': 'Oxygen', '7782-50-5': 'chlorine'}
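
Since both groups are dictionaries keyed by CAS number, tagging a chemical is a membership test. In the sketch below the two dictionaries are copied from the listing above so that the block runs without the library installed; `tags_for` is a hypothetical helper.

```python
cryogenics = {'132259-10-0': 'Air', '1333-74-0': 'hydrogen',
              '630-08-0': 'carbon monoxide', '74-82-8': 'methane',
              '7439-90-9': 'krypton', '7440-01-9': 'neon',
              '7440-37-1': 'Argon', '7440-59-7': 'helium',
              '7440-63-3': 'xenon', '7727-37-9': 'nitrogen',
              '7782-39-0': 'deuterium', '7782-41-4': 'fluorine',
              '7782-44-7': 'oxygen'}
inerts = {'10043-92-2': 'radon', '10102-43-9': 'Nitric Oxide',
          '10102-44-0': 'Nitrogen Dioxide', '124-38-9': 'Carbon Dioxide',
          '132259-10-0': 'Air', '7439-90-9': 'krypton', '7440-01-9': 'Neon',
          '7440-37-1': 'Argon', '7440-59-7': 'Helium', '7440-63-3': 'Xenon',
          '7727-37-9': 'Nitrogen', '7732-18-5': 'water',
          '7782-41-4': 'fluorine', '7782-44-7': 'Oxygen',
          '7782-50-5': 'chlorine'}


def tags_for(CASRN):
    """Hypothetical helper: list the group labels that apply to a CAS number."""
    tags = []
    if CASRN in cryogenics:
        tags.append("cryogenic")
    if CASRN in inerts:
        tags.append("inert")
    return tags


print(tags_for('7727-37-9'))  # nitrogen: ['cryogenic', 'inert']
print(tags_for('74-82-8'))    # methane:  ['cryogenic']
print(tags_for('64-17-5'))    # ethanol:  []
```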
chemicals.identifiers.dippr_compounds()[source]

Loads and returns a set of compounds known in the DIPPR database. This can be useful for knowing if a chemical is of industrial relevance.

Returns
dippr_compounds : set([str])

A set of CAS numbers from the 2014 edition of the DIPPR database.