Analysis flow#
Here, we'll track typical data transformations, like subsetting, that occur during analysis.
If exploring more generally, read this first: Project flow.
Setup#
# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
💡 creating schemas: core==0.47.5 bionty==0.30.4
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:06:54)
✅ saved: Storage(id='ha6ovdzp', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-09-06 17:06:54, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human" # globally set species
lb.settings.auto_save_parents = False
✅ loaded instance: testuser1/analysis-usecase (lamindb 0.52.2)
ln.track()
💡 notebook imports: lamindb==0.52.2 lnschema_bionty==0.30.4
✅ saved: Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-09-06 17:06:56, created_by_id='DzTjkKse')
✅ saved: Run(id='oV6I1caVkTq1jjPk32QH', run_at=2023-09-06 17:06:56, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
Track cell types, tissues and diseases#
We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:
adata = ln.dev.datasets.anndata_with_obs()
adata
AnnData object with n_obs × n_vars = 40 × 100
obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'
adata.var_names[:5]
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
'ENSG00000000457', 'ENSG00000000460'],
dtype='object')
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
tissue cell_type disease
brain my new cell type Alzheimer disease 10
heart hepatocyte cardiac ventricle disorder 10
kidney T cell chronic kidney disease 10
liver hematopoietic stem cell liver lymphoma 10
Name: count, dtype: int64
Register biological metadata and link to the dataset#
As a first step, we register the AnnData object with LaminDB using from_anndata():
file = ln.File.from_anndata(
adata, key="mini_anndata_with_obs.h5ad", field=lb.Gene.ensembl_gene_id
)
💡 file will be copied to default storage upon `save()` with key 'mini_anndata_with_obs.h5ad'
💡 parsing feature names of X stored in slot 'var'
❗ received 99 unique terms, 1 empty/duplicated term is ignored
❗ 99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗ no validated features, skip creating feature set
💡 parsing feature names of slot 'obs'
❗ 4 terms (100.00%) are not validated for name: cell_type, cell_type_id, tissue, disease
❗ no validated features, skip creating feature set
file.save()
✅ storing file 'KolSjBpTNOF0eWDtR45i' at 'mini_anndata_with_obs.h5ad'
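The parsing log above reports 99 unique terms for 100 variable names, with one empty or duplicated term ignored. As a rough illustration of how a repeated name reduces the unique count, here is a minimal standard-library sketch. The `var_names` list is made up for the example (which name is affected in the real dataset is not shown in the output), and this is not how LaminDB parses features internally.

```python
from collections import Counter

# Hypothetical variable names; the second ID is repeated on purpose.
var_names = [
    "ENSG00000000003",
    "ENSG00000000005",
    "ENSG00000000005",
    "ENSG00000000419",
]

counts = Counter(var_names)
# Keys of a Counter keep first-seen order, so this collapses duplicates.
unique_terms = list(counts)
# Names that occur more than once.
duplicated = [name for name, n in counts.items() if n > 1]

print(f"{len(var_names)} names, {len(unique_terms)} unique, duplicated: {duplicated}")
```

In an AnnData object, `var_names_make_unique()` is the usual way to resolve such duplicates before registering the data.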
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)
✅ created 3 CellType records from Bionty matching name: 'T cell', 'hematopoietic stem cell', 'hepatocyte'
❗ did not create CellType record for 1 non-validated name: 'my new cell type'
✅ created 4 Tissue records from Bionty matching name: 'kidney', 'liver', 'heart', 'brain'
✅ created 4 Disease records from Bionty matching name: 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
All of these look good and contain no typos, so let's save them to their registries:
ln.save(cell_types)
ln.save(tissues)
ln.save(diseases)
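To make the validation step more concrete: `from_values` created records only for names that match the reference ontology and skipped 'my new cell type'. The following standard-library sketch mimics that split. The `REFERENCE_CELL_TYPES` set and the `split_validated` helper are hypothetical stand-ins, not part of LaminDB or Bionty.

```python
# Hypothetical stand-in for an ontology: a small set of accepted cell type names.
REFERENCE_CELL_TYPES = {"T cell", "hematopoietic stem cell", "hepatocyte"}


def split_validated(values, reference):
    """Return (validated, non_validated) unique values in first-seen order."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        if value in reference:
            validated.append(value)
        else:
            non_validated.append(value)
    return validated, non_validated


# Ten observations per cell type, as in the example dataset.
observed = (
    ["my new cell type"] * 10
    + ["hepatocyte"] * 10
    + ["T cell"] * 10
    + ["hematopoietic stem cell"] * 10
)

validated, non_validated = split_validated(observed, REFERENCE_CELL_TYPES)
print("validated:", validated)
print("not validated:", non_validated)
```

Only the validated names would become records; the non-validated name is reported and left out, which matches the 3 cell types listed for the file later on.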
We also need some features to bucket these labels:
ln.Feature(name="cell_type", type="category").save()
ln.Feature(name="tissue", type="category").save()
ln.Feature(name="disease", type="category").save()
features = ln.Feature.lookup()
Link labels against the file:
file.add_labels(cell_types, feature=features.cell_type)
file.add_labels(tissues, feature=features.tissue)
file.add_labels(diseases, feature=features.disease)
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked new feature 'cell_type' together with new feature set FeatureSet(id='oWVZVgnjkFimgYXF3aiA', n=1, registry='core.Feature', hash='vICM-wnOUqHWAQQHQ2oN', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
✅ linked feature 'tissue' to registry 'bionty.Tissue'
💡 no file links to it anymore, deleting feature set FeatureSet(id='oWVZVgnjkFimgYXF3aiA', n=1, registry='core.Feature', hash='vICM-wnOUqHWAQQHQ2oN', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
✅ linked new feature 'tissue' together with new feature set FeatureSet(id='E6eczkWZRK2dg45b1TG1', n=2, registry='core.Feature', hash='EeCGWmEJA_a_SFVMhW33', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
✅ linked feature 'disease' to registry 'bionty.Disease'
💡 no file links to it anymore, deleting feature set FeatureSet(id='E6eczkWZRK2dg45b1TG1', n=2, registry='core.Feature', hash='EeCGWmEJA_a_SFVMhW33', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
✅ linked new feature 'disease' together with new feature set FeatureSet(id='5VkV4avBkc8emMkXY9p3', n=3, registry='core.Feature', hash='-VmjtVLNlQe0_5t2UPz9', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
file.describe()
💡 File(id='KolSjBpTNOF0eWDtR45i', key='mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', updated_at=2023-09-06 17:06:57)
Provenance:
🗃️ storage: Storage(id='ha6ovdzp', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-09-06 17:06:54, created_by_id='DzTjkKse')
💫 transform: Transform(id='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-09-06 17:06:57, created_by_id='DzTjkKse')
👣 run: Run(id='oV6I1caVkTq1jjPk32QH', run_at=2023-09-06 17:06:56, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:06:54)
Features:
external: FeatureSet(id='5VkV4avBkc8emMkXY9p3', n=3, registry='core.Feature', hash='-VmjtVLNlQe0_5t2UPz9', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
🔗 disease (4, bionty.Disease): 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder', 'liver lymphoma'
🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🔗 tissue (4, bionty.Tissue): 'liver', 'kidney', 'brain', 'heart'
Labels:
🏷️ tissues (4, bionty.Tissue): 'liver', 'kidney', 'brain', 'heart'
🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🏷️ diseases (4, bionty.Disease): 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder', 'liver lymphoma'
file.view_flow()
Examine the currently available cell types and tissues:
lb.CellType.filter().df()
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
BxNjby0x | T cell | CL:0000084 | None | T-lymphocyte\|T-cell\|T lymphocyte | A Type Of Lymphocyte Whose Defining Characteri... | A6qM | 2023-09-06 17:07:02 | DzTjkKse |
J7hHC8SK | hepatocyte | CL:0000182 | None | None | The Main Structural Component Of The Liver. Th... | A6qM | 2023-09-06 17:07:02 | DzTjkKse |
m91LZBDZ | hematopoietic stem cell | CL:0000037 | None | blood forming stem cell\|hemopoietic stem cell\|HSC | A Stem Cell From Which All Cells Of The Lympho... | A6qM | 2023-09-06 17:07:02 | DzTjkKse |
lb.Tissue.filter().df()
id | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
HHKnN309 | liver | UBERON:0002107 | None | None | An Exocrine Gland Which Secretes Bile And Func... | JVv0 | 2023-09-06 17:07:02 | DzTjkKse |
j9lTWyWV | kidney | UBERON:0002113 | None | None | A Paired Organ Of The Urinary Tract Which Has ... | JVv0 | 2023-09-06 17:07:02 | DzTjkKse |
7HcGzG0l | brain | UBERON:0000955 | None | None | The Brain Is The Center Of The Nervous System ... | JVv0 | 2023-09-06 17:07:02 | DzTjkKse |
sm45H0wI | heart | UBERON:0000948 | None | vertebrate heart\|chambered heart | A Myogenic Muscular Circulatory Organ Found In... | JVv0 | 2023-09-06 17:07:02 | DzTjkKse |
Processing the dataset#
To track our data transformation, we create a new Transform of type "pipeline":
transform = ln.Transform(
name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)
Set the current tracking to the new transform:
ln.track(transform)
✅ saved: Transform(id='vHEjwnwxAamCp3', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-09-06 17:07:03, created_by_id='DzTjkKse')
✅ saved: Run(id='aADc7ZkRSoQzTGnleerw', run_at=2023-09-06 17:07:03, transform_id='vHEjwnwxAamCp3', created_by_id='DzTjkKse')
Get a backed AnnData object#
file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()
adata = file.backed()
adata
💡 adding file KolSjBpTNOF0eWDtR45i as input for run aADc7ZkRSoQzTGnleerw, adding parent transform eNef4Arw8nNMz8
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object mini_anndata_with_obs.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
hepatocyte cardiac ventricle disorder 10
my new cell type Alzheimer disease 10
Name: count, dtype: int64
Subset dataset to specific cell types and diseases#
Create the subset:
subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
Name: count, dtype: int64
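The mask above is an element-wise AND of two membership tests. As a dependency-free illustration of that logic, the sketch below rebuilds the 40 observations from the value counts shown earlier and applies the same condition with plain Python. The list-of-dicts layout is an assumption made for the example; the real object is a backed AnnData accessor.

```python
from collections import Counter

# One (cell_type, disease) pair per group, ten observations each,
# mirroring the value counts of the full dataset.
pairs = [
    ("T cell", "chronic kidney disease"),
    ("hematopoietic stem cell", "liver lymphoma"),
    ("hepatocyte", "cardiac ventricle disorder"),
    ("my new cell type", "Alzheimer disease"),
]
obs = [{"cell_type": c, "disease": d} for c, d in pairs for _ in range(10)]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# Element-wise AND of two membership tests, like `isin(...) & isin(...)`.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(len(obs), "->", len(subset))
for (cell_type, disease), n in counts.items():
    print(cell_type, "|", disease, "|", n)
```

Both conditions must hold for a row to be kept, so the 40 observations reduce to the 20 that belong to the two selected groups.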
This subset can now be registered:
file_subset = ln.File.from_anndata(
adata_subset.to_memory(),
key="subset/mini_anndata_with_obs.h5ad",
field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
💡 file will be copied to default storage upon `save()` with key 'subset/mini_anndata_with_obs.h5ad'
💡 parsing feature names of X stored in slot 'var'
❗ received 99 unique terms, 1 empty/duplicated term is ignored
❗ 99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
❗ no validated features, skip creating feature set
💡 parsing feature names of slot 'obs'
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: cell_type_id
✅ loaded: FeatureSet(id='5VkV4avBkc8emMkXY9p3', n=3, registry='core.Feature', hash='-VmjtVLNlQe0_5t2UPz9', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
✅ linked: FeatureSet(id='5VkV4avBkc8emMkXY9p3', n=3, registry='core.Feature', hash='-VmjtVLNlQe0_5t2UPz9', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
file_subset.save()
✅ storing file '7C2PYk3YFwoVBrQ8cnBy' at 'subset/mini_anndata_with_obs.h5ad'
Add labels to the features. The labels are derived from the full adata.obs rather than the subset, so 'my new cell type' again fails validation:
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)
file_subset.add_labels(cell_types, feature=features.cell_type)
file_subset.add_labels(tissues, feature=features.tissue)
file_subset.add_labels(diseases, feature=features.disease)
✅ loaded 3 CellType records matching name: 'T cell', 'hematopoietic stem cell', 'hepatocyte'
❗ did not create CellType record for 1 non-validated name: 'my new cell type'
file_subset.describe()
💡 File(id='7C2PYk3YFwoVBrQ8cnBy', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', updated_at=2023-09-06 17:07:03)
Provenance:
🗃️ storage: Storage(id='ha6ovdzp', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-09-06 17:06:54, created_by_id='DzTjkKse')
🧩 transform: Transform(id='vHEjwnwxAamCp3', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-09-06 17:07:03, created_by_id='DzTjkKse')
👣 run: Run(id='aADc7ZkRSoQzTGnleerw', run_at=2023-09-06 17:07:03, transform_id='vHEjwnwxAamCp3', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:06:54)
Features:
obs: FeatureSet(id='5VkV4avBkc8emMkXY9p3', n=3, registry='core.Feature', hash='-VmjtVLNlQe0_5t2UPz9', updated_at=2023-09-06 17:07:02, modality_id='NAJUfjPX', created_by_id='DzTjkKse')
🔗 disease (4, bionty.Disease): 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder', 'liver lymphoma'
🔗 cell_type (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🔗 tissue (4, bionty.Tissue): 'liver', 'brain', 'kidney', 'heart'
Labels:
🏷️ tissues (4, bionty.Tissue): 'liver', 'brain', 'kidney', 'heart'
🏷️ cell_types (3, bionty.CellType): 'T cell', 'hepatocyte', 'hematopoietic stem cell'
🏷️ diseases (4, bionty.Disease): 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder', 'liver lymphoma'
Examine data flow#
Common questions that might arise are:
Which h5ad file is in the subset subfolder?
Which notebook ingested this file?
By whom?
And which file is its parent?
Let's answer this using LaminDB:
Query a subsetted .h5ad file containing "hematopoietic stem cell" and "T cell" to learn which h5ad file is in the subset subfolder:
cell_types_bt_lookup = lb.CellType.lookup()
my_subset = ln.File.filter(
suffix=".h5ad",
key__startswith="subset",
cell_types__in=[
cell_types_bt_lookup.hematopoietic_stem_cell,
cell_types_bt_lookup.t_cell,
],
).first()
my_subset.view_flow()
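The flow view answers the questions above by following the links from a file to the run that created it, and from that run to its input files. The sketch below walks the same links over plain dictionaries. The IDs, keys, transform names and user handle are copied from the outputs in this guide, but the dictionary layout and the `lineage` helper are assumptions for illustration and do not reflect LaminDB's schema.

```python
# Records copied from the outputs in this guide; the dict layout is an assumption.
files = {
    "KolSjBpTNOF0eWDtR45i": {
        "key": "mini_anndata_with_obs.h5ad",
        "run": "oV6I1caVkTq1jjPk32QH",
    },
    "7C2PYk3YFwoVBrQ8cnBy": {
        "key": "subset/mini_anndata_with_obs.h5ad",
        "run": "aADc7ZkRSoQzTGnleerw",
    },
}
runs = {
    "oV6I1caVkTq1jjPk32QH": {
        "transform": "Analysis flow",
        "user": "testuser1",
        "inputs": [],
    },
    "aADc7ZkRSoQzTGnleerw": {
        "transform": "Subset to T-cells and liver lymphoma",
        "user": "testuser1",
        "inputs": ["KolSjBpTNOF0eWDtR45i"],
    },
}


def lineage(file_id):
    """Yield (key, transform, user) for a file, then for each upstream file."""
    run = runs[files[file_id]["run"]]
    yield files[file_id]["key"], run["transform"], run["user"]
    for parent_id in run["inputs"]:
        yield from lineage(parent_id)


# Select the file in the `subset` folder, similar to `key__startswith="subset"`.
subset_id = next(fid for fid, f in files.items() if f["key"].startswith("subset"))
steps = list(lineage(subset_id))
for key, transform, user in steps:
    print(f"{key} <- {transform} (by {user})")
```

The first step describes the subsetted file and the transform that produced it; the second step is its parent file, which was registered by the "Analysis flow" notebook.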
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase