amber.utils

amber.utils.data_parser

A set of Data Parsers

fasta_reader(fn)[source]

Reader for FASTA files (each record is a header line starting with ‘>’ followed by a nucleotide sequence).
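A minimal sketch of what reading this format involves, assuming a record's sequence may span several lines. The name parse_fasta is hypothetical and this is not the fasta_reader implementation; it works on text that has already been read from a file.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs.

    Hypothetical helper for illustration only.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"
records = parse_fasta(example)
```

Sequence lines that belong to the same header are joined, so seq1 above becomes a single eight-base string.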

fetch_seq(chrom, start, end, strand, genome)[source]
fetch_seq_pygr(chr, start, end, strand, genome)[source]

Fetch a genomic sequence by calling pygr. The genome is initialized and stored outside the function.

Parameters
  • chr (str) – chromosome

  • start (int) – start locus

  • end (int) – end locus

  • strand (str) – either ‘+’ or ‘-’

Returns

seq – genomic sequence in upper case

Return type

str
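The strand and upper-case conventions described above can be sketched without pygr. This sketch assumes a plain dict of chromosome strings stands in for the genome and that start and end are 0-based and half-open; the source does not state the coordinate convention. The names fetch_seq_sketch, reverse_complement_sketch, and toy_genome are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def fetch_seq_sketch(chrom, start, end, strand, genome):
    """Slice genome[chrom][start:end] and return it in upper case.

    For strand '-', the reverse complement of the slice is returned.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    seq = genome[chrom][start:end]
    if strand == "-":
        seq = reverse_complement_sketch(seq)
    return seq.upper()


# The genome lives outside the function, as in the description above.
toy_genome = {"chr1": "acgtTTgaCC"}
plus = fetch_seq_sketch("chr1", 2, 6, "+", toy_genome)
minus = fetch_seq_sketch("chr1", 2, 6, "-", toy_genome)
```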

get_data_from_fasta_sequence(positive_file, negative_file)[source]

Given two FASTA files, read in the numeric data X and label y

get_data_from_simdata(positive_file, negative_file, targets)[source]

Given two simdata files, read in the numeric data X and label y

get_generator_from_label_df(label_df, genome_fn, batch_size=32, y_idx=None)[source]
matrix_to_seq(mat, index_to_letter=None)[source]
read_label(fn)[source]
reverse_complement(dna)[source]
seq_to_matrix(seq, letter_to_index=None)[source]

Convert a list of characters to an N by 4 integer matrix

Parameters
  • seq (list) – list of characters A, C, G, T, -, =; a heterozygous locus is separated by a semicolon. ‘-’ is a filler for heterozygous indels; ‘=’ marks homozygous deletions

  • letter_to_index (dict or None) – dictionary mapping DNA/RNA/amino-acid letters to array index

Returns

seq_mat – numpy matrix for the sequences

Return type

numpy.array
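A minimal sketch of the N by 4 conversion for the four plain bases. It assumes the column order A, C, G, T when letter_to_index is None, uses nested lists in place of a numpy array, and does not handle the special symbols (‘;’, ‘-’, ‘=’) listed above. The name seq_to_matrix_sketch is hypothetical.

```python
# Assumed default column order; the library's own default may differ.
DEFAULT_LETTER_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def seq_to_matrix_sketch(seq, letter_to_index=None):
    """Convert a list of characters to an N-by-K matrix of 0/1 integers."""
    if letter_to_index is None:
        letter_to_index = DEFAULT_LETTER_TO_INDEX
    width = len(letter_to_index)
    mat = [[0] * width for _ in seq]
    for row, letter in zip(mat, seq):
        col = letter_to_index.get(letter)
        if col is not None:  # unknown symbols leave an all-zero row
            row[col] = 1
    return mat


mat = seq_to_matrix_sketch(list("GAN"))
```

Passing a different letter_to_index (for example one covering RNA or amino-acid letters) changes the number of columns accordingly.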

simdata_reader(fn, targets)[source]

Reader for the .simdata file generated by the simdna simulation program. A .simdata file is a tab-delimited file with three columns: ‘seqName’, ‘sequence’, ‘embeddings’

Parameters
  • fn (str) – filepath for .simdata file

  • targets (list) – a list of target TFs (will also be in the order of y)

Returns

a tuple of (X, y) in np.array format

Return type

tuple
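A rough sketch of reading the three-column layout, under several assumptions: the file has a header row with the column names above, a target counts as present when its name appears anywhere in the ‘embeddings’ field, and plain lists stand in for np.array. The toy embeddings string is made up for illustration, and read_simdata_sketch is a hypothetical name.

```python
import csv
import io


def read_simdata_sketch(text, targets):
    """Return (sequences, labels) from tab-separated .simdata text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sequences, labels = [], []
    for row in reader:
        sequences.append(row["sequence"])
        # Assumption: a target is present when its name occurs in 'embeddings'.
        # One label column per target, in the order given by `targets`.
        labels.append([int(t in row["embeddings"]) for t in targets])
    return sequences, labels


toy = (
    "seqName\tsequence\tembeddings\n"
    "s0\tACGTAC\tpos-1_TAL1-ACGT\n"
    "s1\tTTGACA\t\n"
)
X, y = read_simdata_sketch(toy, targets=["TAL1", "GATA1"])
```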

amber.utils.io

annotate_probs_list(probs_list, model_space, with_input_blocks, with_skip_connection)[source]

For a given probs_list, annotate what each probability refers to.

Parameters
  • probs_list

  • model_space

  • with_skip_connection

  • with_input_blocks

read_action_weights(fn)[source]

Read ‘weight_data.json’ and derive the maximum-likelihood architecture for each run. ‘weight_data.json’ stores the weight probabilities written by the function save_action_weights for a set of independent mock BioNAS optimization runs.
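Deriving a maximum-likelihood architecture amounts to picking, for each layer, the option with the highest probability. The sketch below assumes a hypothetical JSON layout of run, then layer, then option, mapping to a list of probabilities over time; the real layout of ‘weight_data.json’ is not specified here and may differ.

```python
import json


def max_likelihood_architecture(weight_data):
    """Pick, per run and per layer, the option with the highest final probability."""
    best = {}
    for run, layers in weight_data.items():
        best[run] = {
            # options[name][-1] is the probability at the last time step.
            layer: max(options, key=lambda name: options[name][-1])
            for layer, options in layers.items()
        }
    return best


# Hypothetical content, for illustration only.
toy_json = """
{"run0": {"L0": {"conv": [0.5, 0.7], "pool": [0.5, 0.3]},
          "L1": {"relu": [0.4, 0.2], "tanh": [0.6, 0.8]}}}
"""
arch = max_likelihood_architecture(json.loads(toy_json))
```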

read_action_weights_old(fn)[source]

Read ‘weight_data.json’ and derive the maximum-likelihood architecture for each run. ‘weight_data.json’ stores the weight probabilities written by the function save_action_weights for a set of independent mock BioNAS optimization runs.

read_history(fn_list, metric_name_dict={'acc': 0, 'knowledge': 1, 'loss': 2})[source]
read_history_set(fn_list)[source]
save_action_weights(probs_list, state_space, working_dir, with_input_blocks=False, with_skip_connection=False, **kwargs)[source]
Parameters

probs_list (list) – list of probabilities at each time step. The function outputs a series of graphs, each plotting the weights of the options of one layer over time.

Note

If with_input_blocks is True, input_nodes is expected in the keyword arguments. input_nodes is a list of BioNAS.Controller.state_space.State, so the layer name can be accessed by State.Layer_attributes[‘name’].

save_stats(loss_and_metrics_list, working_dir)[source]

amber.utils.logging

setup_logger(working_dir='.', verbose_level=20)[source]

The logger used throughout the training environment.

Parameters
  • working_dir (str) – File path to working directory. Logging will be stored in working directory.

  • verbose_level (int) – Verbosity level; can be specified as in logging

Returns

logger

Return type

the logging object
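A minimal sketch of a logger of this kind built on the standard logging module, where the default verbose_level of 20 corresponds to logging.INFO. The logger name, the log file name, and the message format are assumptions made for illustration; setup_logger_sketch is not the library function.

```python
import logging
import os
import tempfile


def setup_logger_sketch(working_dir=".", verbose_level=logging.INFO):
    """Return a logger that writes to a file inside working_dir."""
    logger = logging.getLogger("amber_sketch")  # hypothetical logger name
    logger.setLevel(verbose_level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        path = os.path.join(working_dir, "log.txt")  # assumed file name
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    logger = setup_logger_sketch(tmp, verbose_level=20)
    logger.info("search started")
    logger.debug("hidden at level 20")  # below INFO, so it is not written
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    with open(os.path.join(tmp, "log.txt")) as fh:
        log_text = fh.read()
```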

amber.utils.motif

class Scale(sx, sy=None)[source]

Bases: matplotlib.backend_bases.RendererBase

http://nbviewer.jupyter.org/github/saketkc/notebooks/blob/master/python/Sequence%20Logo%20Python%20%20–%20Any%20font.ipynb?flush=true

Author: Saket Choudhary [saketkcgmail]

License: GPL v3

Copyright © 2017 Saket Choudhary<saketkc__AT__gmail>

draw_path(renderer, gc, tpath, affine, rgbFace)[source]
convertlog2freq(pssm)[source]
draw_dnalogo_Rscript(pssm, savefn='seq_logo.pdf')[source]
draw_dnalogo_matplot(pssm)[source]

References

http://nbviewer.jupyter.org/github/saketkc/notebooks/blob/master/python/Sequence%20Logo%20Python%20%20–%20Any%20font.ipynb?flush=true

Author: Saket Choudhary [saketkcgmail]

License: GPL v3

Copyright © 2017 Saket Choudhary<saketkc__AT__gmail>

load_binding_motif_pssm(motif_file, is_log, swapbase=None, augment_rev_comp=False)[source]
read_file(filename)[source]

amber.utils.saliency

approx_grad(model, X, epsilon=0.01)[source]
approx_grad_array(model, X, epsilon=0.01)[source]
approx_hessian(model, x, epsilon=0.01)[source]
approx_hessian_array(model, data, epsilon=0.01)[source]

amber.utils.sampler module

This module provides the BioIntervalSource class and its children. These are essentially wrappers for sets of sequence intervals and associated labels.

class BatchedBioIntervalSequence(example_file, reference_sequence, batch_size, shuffle=True, n_examples=None, seed=1337, pad=0)[source]

Bases: amber.utils.sampler.BioIntervalSource, tensorflow.python.keras.utils.data_utils.Sequence

This data sequence type holds intervals in a genome and a label associated with each interval. Unlike a generator, this is based off of keras.utils.Sequence, which shifts things like shuffling elsewhere. The amount of padding added to the end of the intervals is able to be changed during runtime. This allows these functions to be passed to objects such as a model controller. Examples are divided into batches.

Parameters
  • example_file (str) – A path to a file that contains the examples in BED-like format. Specifically, this file will have one example per line, with the chromosome, start, end, and label for the example. Each column is separated by tabs.

  • reference_sequence (Sequence or str) – The reference sequence used to generate the input sequences from the example coordinates; could be a Sequence instance or a filepath to reference sequence.

  • batch_size (int) – Specifies size of the mini-batches.

  • shuffle (bool) – Specifies whether to shuffle the mini-batches.

  • n_examples (int, optional) – Default is None. The number of examples. If left as None, will use all of the examples in the file. If fewer than n_examples are found, an error will be thrown.

  • seed (int, optional) – Default is 1337. The value used to seed random number generation.

Variables
  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • examples (list) – A list of the example coordinates.

  • labels (list) – A list of the labels for the examples.

  • left_pad (int) – The length of padding added to the left side of the interval.

  • right_pad (int) – The length of padding added to the right side of the interval.

  • batch_size (int) – Specifies size of the mini-batches.

  • shuffle (bool) – Specifies whether to shuffle the mini-batches.

  • random_state (numpy.random.RandomState) – A random number generator to use.

  • seed (int) – The value used to seed the random number generator.

close()[source]

Close the file connection of Sequence

on_epoch_end()[source]

If applicable, shuffle the examples at the end of an epoch.
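The batching and end-of-epoch shuffling described for this class can be sketched with a small stand-in that follows the keras.utils.Sequence protocol (__len__, __getitem__, on_epoch_end) without importing Keras. The class name is hypothetical, the standard random module replaces numpy.random.RandomState, and keeping a smaller final batch is a choice made for this sketch only.

```python
import math
import random


class BatchedExamplesSketch:
    """Index-addressable batches over (example, label) pairs."""

    def __init__(self, examples, labels, batch_size, shuffle=True, seed=1337):
        self.examples = list(examples)
        self.labels = list(labels)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.random_state = random.Random(seed)
        self.order = list(range(len(self.examples)))

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.examples) / self.batch_size)

    def __getitem__(self, batch_index):
        start = batch_index * self.batch_size
        idx = self.order[start:start + self.batch_size]
        return [self.examples[i] for i in idx], [self.labels[i] for i in idx]

    def on_epoch_end(self):
        # Reorder the examples between epochs when shuffling is enabled.
        if self.shuffle:
            self.random_state.shuffle(self.order)


seq = BatchedExamplesSketch(
    [("chr1", 0, 10), ("chr1", 10, 20), ("chr2", 5, 15)],
    [1, 0, 1],
    batch_size=2,
    shuffle=False,
)
first_x, first_y = seq[0]
```

Because batches are fetched by index, the caller decides the iteration order, and the object only reorders its examples when on_epoch_end is called.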

class BatchedBioIntervalSequenceGenerator(*args, **kwargs)[source]

Bases: amber.utils.sampler.BatchedBioIntervalSequence

This class builds on BatchedBioIntervalSequence by running the generator loop indefinitely.
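The idea of an endless loop over batches can be sketched with a plain Python generator. This is an illustration of the looping pattern only, under the assumption that batches are revisited in the same order on every pass; infinite_batches is a hypothetical name and not part of the library.

```python
import itertools


def infinite_batches(batches):
    """Yield batches forever, restarting from the first after the last."""
    if len(batches) == 0:
        raise ValueError("need at least one batch")
    while True:
        for batch in batches:
            yield batch


# Take seven items from an endless stream over three batches.
stream = infinite_batches(["b0", "b1", "b2"])
taken = list(itertools.islice(stream, 7))
```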

on_epoch_end()[source]
class BatchedHDF5Generator(hdf5_fp, batch_size, shuffle=True, in_memory=False, seed=None, x_selector=None, y_selector=None)[source]

Bases: tensorflow.python.keras.utils.data_utils.Sequence

close()[source]
class BioIntervalGenerator(example_file, reference_sequence, n_examples=None, seed=1337)[source]

Bases: amber.utils.sampler.BioIntervalSource

This data generator type holds intervals in a genome and a label associated with each interval. This essentially acts as an iterator over the input examples. This approach is useful and preferable to BioIntervalSequence when there are a very large number of examples in the input. The amount of padding added to the end of the intervals is able to be changed during runtime. This allows these functions to be passed to objects such as a model controller.

Parameters
  • example_file (str) – A path to a file that contains the examples in BED-like format. Specifically, this file will have one example per line, with the chromosome, start, end, and label for the example. Each column is separated by tabs.

  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • n_examples (int, optional) – Default is None. The number of examples. If left as None, will use all of the examples in the file. If fewer than n_examples are found, an error will be thrown.

  • seed (int, optional) – Default is 1337. The value used to seed random number generation.

Variables
  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • examples (list) – A list of the example coordinates.

  • labels (list) – A list of the labels for the examples.

  • left_pad (int) – The length of padding added to the left side of the interval.

  • right_pad (int) – The length of padding added to the right side of the interval.

  • random_state (numpy.random.RandomState) – A random number generator to use.

  • seed (int) – The value used to seed random number generation.

class BioIntervalSequence(example_file, reference_sequence, n_examples=None, seed=1337)[source]

Bases: amber.utils.sampler.BioIntervalSource, tensorflow.python.keras.utils.data_utils.Sequence

This data sequence type holds intervals in a genome and a label associated with each interval. Unlike a generator, this is based off of keras.utils.Sequence, which shifts things like shuffling elsewhere. The amount of padding added to the end of the intervals is able to be changed during runtime. This allows these functions to be passed to objects such as a model controller.

Parameters
  • example_file (str) – A path to a file that contains the examples in BED-like format. Specifically, this file will have one example per line, with the chromosome, start, end, and label for the example. Each column is separated by tabs.

  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • n_examples (int, optional) – Default is None. The number of examples. If left as None, will use all of the examples in the file. If fewer than n_examples are found, an error will be thrown.

  • seed (int, optional) – Default is 1337. The value used to seed random number generation.

Variables
  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • examples (list) – A list of the example coordinates.

  • labels (list) – A list of the labels for the examples.

  • left_pad (int) – The length of padding added to the left side of the interval.

  • right_pad (int) – The length of padding added to the right side of the interval.

  • random_state (numpy.random.RandomState) – A random number generator to use.

  • seed (int) – The value used to seed the random number generator.

class BioIntervalSource(example_file, reference_sequence, n_examples=None, seed=1337, pad=400)[source]

Bases: object

A generic class for labeled examples of biological intervals. The amount of padding added to the end of the intervals is able to be changed during runtime. This allows these functions to be passed to objects such as a model controller.

Parameters
  • example_file (str) – A path to a file that contains the examples in BED-like format. Specifically, this file will have one example per line, with the chromosome, start, end, and label for the example. Each column is separated by tabs.

  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • seed (int, optional) – Default is 1337. The value used to seed random number generation.

  • n_examples (int, optional) – Default is None. The number of examples. If left as None, will use all of the examples in the file. If fewer than n_examples are found, an error will be thrown.

Variables
  • reference_sequence (Sequence) – The reference sequence used to generate the input sequences from the example coordinates.

  • examples (list) – A list of the example coordinates.

  • labels (list) – A list of the labels for the examples.

  • left_pad (int) – The length of padding added to the left side of the interval.

  • right_pad (int) – The length of padding added to the right side of the interval.

  • random_state (numpy.random.RandomState) – A random number generator to use.

  • seed (int) – The value used to seed the random number generator.

padding_is_valid(value)[source]

Determine if the specified value is a valid value for padding intervals.

Parameters

value (int) – Proposed amount of padding.

Returns

Whether the input value is valid.

Return type

bool

set_left_pad(value)[source]

Sets the length of the padding added to the left side of the input sequence.

Parameters

value (int) – The length of the padding to add to the left side of an example interval.

set_pad(value)[source]

Sets the length of padding added to both the left and right sides of example intervals.

Parameters

value (int) – The length of the padding to add to the left and right sides of input example intervals.

set_right_pad(value)[source]

Sets the length of the padding added to the right side of an example interval.

Parameters

value (int) – The length of the padding to add to the right side of an example interval.
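The BED-like example format and the padding setters can be sketched together. The sketch assumes that any non-negative integer is valid padding, that labels are kept as the raw strings from the file, and that too few examples raise a ValueError; none of these details are confirmed by the source. IntervalSourceSketch and its padded method are hypothetical.

```python
class IntervalSourceSketch:
    """Holds (chrom, start, end) examples with labels and adjustable padding."""

    def __init__(self, lines, n_examples=None, pad=400):
        self.examples, self.labels = [], []
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            # One example per line: chromosome, start, end, label (tab-separated).
            chrom, start, end, label = line.split("\t")[:4]
            self.examples.append((chrom, int(start), int(end)))
            self.labels.append(label)
            if n_examples is not None and len(self.examples) == n_examples:
                break
        if n_examples is not None and len(self.examples) < n_examples:
            raise ValueError("fewer than n_examples examples were found")
        self.set_pad(pad)

    @staticmethod
    def padding_is_valid(value):
        # Assumption: any non-negative integer is acceptable padding.
        return isinstance(value, int) and value >= 0

    def set_left_pad(self, value):
        if not self.padding_is_valid(value):
            raise ValueError("invalid padding: %r" % (value,))
        self.left_pad = value

    def set_right_pad(self, value):
        if not self.padding_is_valid(value):
            raise ValueError("invalid padding: %r" % (value,))
        self.right_pad = value

    def set_pad(self, value):
        self.set_left_pad(value)
        self.set_right_pad(value)

    def padded(self, i):
        """Return example i with the current padding applied."""
        chrom, start, end = self.examples[i]
        return chrom, start - self.left_pad, end + self.right_pad


src = IntervalSourceSketch(["chr1\t100\t200\t1\n", "chr2\t50\t80\t0\n"], pad=10)
before = src.padded(0)
src.set_left_pad(0)  # padding can be changed after construction
after = src.padded(1)
```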

class Selector(label, index=None)[source]

Bases: object

A helper class that makes x/y selection easier across different HDF5 layouts

Parameters
  • label (str) – key label to get to array in hdf5 store

  • index (tuple) – array index for specific data

Notes

We will always assume that the first dimension is the sample_index dimension and thus will be preserved for batch extraction

amber.utils.sequences module

This module provides the Sequence class, which is an abstract class that defines the interface for loading biological sequence data.

Todo

  • Abstract away the PYFAIDX interface, so that inheritance and extension of the genome class is more straightforward.

class EncodedGenome(*args, **kwargs)[source]

Bases: amber.utils.sequences.EncodedSequence, amber.utils.sequences.Genome

This class allows the user to query a potentially file-backed genome by coordinate. It is essentially a wrapper around the pyfaidx.Fasta class. The returned values have been encoded as numpy arrays.

Parameters
  • input_path (str) – Path to an indexed FASTA file.

  • in_memory (bool) – Specifies whether the genome should be loaded from disk and stored in memory.

Variables
  • data (pyfaidx.Fasta or dict) – The FASTA file containing the genome sequence. Alternatively, this can be a dict object mapping chromosomes to sequences that stores the file in memory.

  • in_memory (bool) – Specifies whether the genomic data is being stored in memory.

  • chrom_len_dict (dict) – A dictionary mapping the chromosome names to their lengths.

  • ALPHABET_TO_ARRAY (dict) – A mapping from characters in the genome to their numpy.ndarray representations.

ALPHABET_TO_ARRAY = {'A': array([1., 0., 0., 0.], dtype=float16), 'C': array([0., 1., 0., 0.], dtype=float16), 'G': array([0., 0., 1., 0.], dtype=float16), 'N': array([0.25, 0.25, 0.25, 0.25], dtype=float16), 'T': array([0., 0., 0., 1.], dtype=float16)}

A dictionary mapping possible characters in the genome to their numpy.ndarray representations.
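A sketch of how a mapping like this turns a string into an L by 4 encoding, using nested lists in place of float16 numpy arrays. The names ALPHABET_TO_ROW, encode_sketch, and reverse_complement_encoded are hypothetical. The second helper relies on a property of the A, C, G, T column order: reversing both the rows and the columns of the encoding yields the encoding of the reverse complement. Whether the library uses this property is not stated in the source.

```python
# Same values as the mapping shown above, as plain lists.
ALPHABET_TO_ROW = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.25, 0.25, 0.25, 0.25],
}


def encode_sketch(s):
    """Encode a string as a list of rows; raises KeyError on unknown letters."""
    return [list(ALPHABET_TO_ROW[base]) for base in s]


def reverse_complement_encoded(encoded):
    """Flip rows (reverse) and columns (complement) of an A,C,G,T encoding."""
    return [row[::-1] for row in reversed(encoded)]


encoded = encode_sketch("AACGN")
flipped = reverse_complement_encoded(encode_sketch("AACG"))
```

The uniform row for ‘N’ keeps every row summing to one, so an unknown base contributes equally to all four channels.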

get_sequence_from_coords(chrom, start, end, strand='+')[source]

Fetches a string representation of a sequence at the specified coordinates.

Parameters
  • chrom (str) – Chromosome to query from.

  • start (int) – First position in queried sequence.

  • end (int) – One past the last position in the queried sequence.

  • strand (str) – The strand to sample from.

Returns

The sequence of bases occurring at the queried coordinates.

Return type

str

Raises

IndexError – If the coordinates are not valid.

class EncodedHDF5Genome(input_path, in_memory=False)[source]

Bases: amber.utils.sequences.EncodedGenome

This class allows the user to specify an HDF5-backed encoded genome.

Parameters
  • input_path (str) – Path to the HDF5 file.

  • in_memory (bool) – Specifies whether the genome should be loaded from disk and stored in memory.

Variables
  • data (h5py.File or dict) – The HDF5 file pointer or a dict containing the sequences in memory.

  • in_memory (bool) – Specifies whether the genomic data is being stored in memory.

  • chrom_len_dict (dict) – A dictionary mapping the chromosome names to their lengths.

close()[source]

Close the file connection to HDF5

get_sequence_from_coords(chrom, start, end, strand='+')[source]

Fetches an array representation of a sequence at the specified coordinates.

Parameters
  • chrom (str) – Chromosome to query from.

  • start (int) – First position in queried sequence.

  • end (int) – One past the last position in the queried sequence.

  • strand (str) – The strand to sample from.

Returns

The sequence of bases occurring at the queried coordinates.

Return type

str

Raises

IndexError – If the coordinates are not valid.

class EncodedSequence[source]

Bases: amber.utils.sequences.Encoding, amber.utils.sequences.Sequence

Mixin of Encoding and Sequence to define the approach for encoding biological sequence data.

abstract property ALPHABET_TO_ARRAY

The alphabet used to encode the input sequence.

encode(s)[source]

Encodes a string with a numpy array.

Parameters

s (str) – The string to encode.

Returns

An array with the encoded string.

Return type

numpy.ndarray

get_sequence_from_coords(*args, **kwargs)[source]

Fetches an encoded sequence at the specified coordinates.

Returns

The numpy array encoding the queried sequence.

Return type

numpy.ndarray

class Encoding[source]

Bases: object

This class is a mostly-abstract class used to represent some dataset that should be transformed with an encoding.

abstract encode(*args, **kwargs)[source]

Method to encode some input.

class Genome(input_path, in_memory=False)[source]

Bases: amber.utils.sequences.Sequence

This class allows the user to query a potentially file-backed genome by coordinate. It is essentially a wrapper around the pyfaidx.Fasta class.

Parameters
  • input_path (str) – Path to an indexed FASTA file.

  • in_memory (bool) – Specifies whether the genome should be loaded from disk and stored in memory.

Variables
  • data (pyfaidx.Fasta or dict) – The FASTA file containing the genome sequence. Alternatively, this can be a dict object mapping chromosomes to sequences that stores the file in memory.

  • in_memory (bool) – Specifies whether the genomic data is being stored in memory.

  • chrom_len_dict (dict) – A dictionary mapping the chromosome names to their lengths.

coords_are_valid(chrom, start, end, strand='+')[source]

Checks if the queried coordinates are valid.

Parameters
  • chrom (str) – The chromosome to query from.

  • start (int) – The first position in the queried coordinates.

  • end (int) – One past the last position in the queried coordinates.

  • strand (str) – Strand of sequence to draw from.

Returns

True if the coordinates are valid, otherwise False.

Return type

bool

get_sequence_from_coords(chrom, start, end, strand='+')[source]

Fetches a string representation of a sequence at the specified coordinates.

Parameters
  • chrom (str) – Chromosome to query from.

  • start (int) – First position in queried sequence.

  • end (int) – One past the last position in the queried sequence.

  • strand (str) – The strand to sample from.

Returns

The sequence of bases occurring at the queried coordinates.

Return type

str

Raises

IndexError – If the coordinates are not valid.
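The validity check and the IndexError behavior can be sketched with an in-memory dict standing in for pyfaidx.Fasta. The sketch assumes 0-based starts (the half-open end follows from "one past the last position"), rejects empty intervals and unknown strands, and returns the reverse complement for strand ‘-’. These rules and the name DictGenomeSketch are assumptions, not the library's confirmed behavior.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


class DictGenomeSketch:
    """Query an in-memory genome by coordinate."""

    def __init__(self, chrom_to_seq):
        self.data = dict(chrom_to_seq)
        # Chromosome lengths are cached once, as in chrom_len_dict above.
        self.chrom_len_dict = {c: len(s) for c, s in self.data.items()}

    def coords_are_valid(self, chrom, start, end, strand="+"):
        if chrom not in self.chrom_len_dict:
            return False
        if strand not in ("+", "-"):
            return False
        return 0 <= start < end <= self.chrom_len_dict[chrom]

    def get_sequence_from_coords(self, chrom, start, end, strand="+"):
        if not self.coords_are_valid(chrom, start, end, strand):
            raise IndexError(
                "invalid coordinates: %s:%s-%s (%s)" % (chrom, start, end, strand)
            )
        seq = self.data[chrom][start:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        return seq


genome = DictGenomeSketch({"chr1": "ACGTACGT"})
forward = genome.get_sequence_from_coords("chr1", 0, 4)
reverse = genome.get_sequence_from_coords("chr1", 1, 4, strand="-")
```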

class HDF5Genome(input_path, in_memory=False)[source]

Bases: amber.utils.sequences.Genome

This class allows the user to query a Genome stored in an HDF5 file.

Parameters
  • input_path (str) – Path to an HDF5 file.

  • in_memory (bool) – Specifies whether the genome should be loaded from disk and stored in memory.

Variables
  • data (h5py.File) – The HDF5 file containing the genome sequence. Alternatively, this can be a dict object mapping chromosomes to sequences that stores the file in memory.

  • in_memory (bool) – Specifies whether the genomic data is being stored in memory.

  • chrom_len_dict (dict) – A dictionary mapping the chromosome names to their lengths.

get_sequence_from_coords(chrom, start, end, strand='+')[source]

Fetches a string representation of a sequence at the specified coordinates.

Parameters
  • chrom (str) – Chromosome to query from.

  • start (int) – First position in queried sequence.

  • end (int) – One past the last position in the queried sequence.

  • strand (str) – The strand to sample from.

Returns

The sequence of bases occurring at the queried coordinates.

Return type

str

Raises

IndexError – If the coordinates are not valid.

class Sequence[source]

Bases: object

This class represents a source of sequence data, which can be fetched by querying different coordinates.

abstract coords_are_valid(*args, **kwargs)[source]

Checks if the queried coordinates are valid.

Returns

True if the coordinates are valid, otherwise False.

Return type

bool

abstract get_sequence_from_coords(*args, **kwargs)[source]

Fetches a string representation of a sequence at the specified coordinates.

Returns

The sequence of bases occurring at the queried coordinates. Behavior is undefined for invalid coordinates.

Return type

str
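An interface with two abstract methods like this one can be expressed with the standard abc module. The sketch below assumes that mechanism; how the library declares its abstract methods is not shown in this reference. SequenceSketch and StringSequence are hypothetical names.

```python
from abc import ABC, abstractmethod


class SequenceSketch(ABC):
    """Interface for a source of sequence data queried by coordinates."""

    @abstractmethod
    def coords_are_valid(self, *args, **kwargs):
        """Return True if the queried coordinates are valid."""

    @abstractmethod
    def get_sequence_from_coords(self, *args, **kwargs):
        """Return the sequence at the queried coordinates."""


class StringSequence(SequenceSketch):
    """Smallest concrete subclass: a single string addressed by [start, end)."""

    def __init__(self, seq):
        self.seq = seq

    def coords_are_valid(self, start, end):
        return 0 <= start < end <= len(self.seq)

    def get_sequence_from_coords(self, start, end):
        return self.seq[start:end]


source = StringSequence("ACGT")
middle = source.get_sequence_from_coords(1, 3)
```

A subclass that omits either method cannot be instantiated, which is what makes the interface enforceable.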

amber.utils.simulator module

amber.utils.tensorflowMem module

get_session(gpu_fraction=0.75)[source]

Assume that you have 6GB of GPU memory and want to allocate ~2GB
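The figures in this description can be related to gpu_fraction by simple arithmetic, assuming gpu_fraction is the share of total GPU memory the session may claim; the source does not define the parameter. The helper fraction_for is hypothetical.

```python
def fraction_for(target_gb, total_gb):
    """Fraction of total GPU memory needed to reserve roughly target_gb."""
    if not 0 < target_gb <= total_gb:
        raise ValueError("target must be positive and no larger than total")
    return target_gb / total_gb


# Reserving ~2GB out of 6GB corresponds to a fraction of about one third.
one_third = fraction_for(2, 6)
# Under the same assumption, the default of 0.75 on a 6GB card is 4.5GB.
default_share_gb = 0.75 * 6
```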

get_session2(CPU, GPU)[source]