
Validate & register flow cytometry data

Flow cytometry is a technique used to analyze and sort cells or particles based on their physical and chemical characteristics as they flow in a fluid stream through a laser beam.

Here, we’ll transform, validate and register two flow cytometry datasets (Alpert19 and FlowIO sample) to demonstrate how to create and query a custom flow cytometry registry.

!lamin init --storage ./test-flow --schema bionty
💡 creating schemas: core==0.47.5 bionty==0.30.4
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-06 17:07:55)
✅ saved: Storage(id='7BMzjDBU', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow', type='local', updated_at=2023-09-06 17:07:55, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-flow
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
import readfcs

lb.settings.species = "human"
✅ loaded instance: testuser1/test-flow (lamindb 0.52.2)
ln.track()
💡 notebook imports: lamindb==0.52.2 lnschema_bionty==0.30.4 readfcs==1.1.6
✅ saved: Transform(id='OWuTtS4SAponz8', name='Validate & register flow cytometry data', short_name='facs', version='0', type=notebook, updated_at=2023-09-06 17:07:58, created_by_id='DzTjkKse')
✅ saved: Run(id='EHPP1Ldy9I7CweDEXlV6', run_at=2023-09-06 17:07:58, transform_id='OWuTtS4SAponz8', created_by_id='DzTjkKse')

Alpert19

Access

We start with a flow cytometry file from Alpert19:

ln.dev.datasets.file_fcs_alpert19(
    populate_registries=True,  # pre-populate registries to simulate a used instance
)


PosixPath('Alpert19.fcs')

Use readfcs to read the fcs file into memory:

adata = readfcs.read("Alpert19.fcs")
adata
AnnData object with n_obs × n_vars = 166537 × 40
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'

This AnnData object does not require filtering, normalizing or formatting, so there is no transform step.

Validate

First, let’s validate the features in .var using CellMarker:

lb.CellMarker.validate(adata.var.index);
✅ 27 terms (67.50%) are validated for name
❗ 13 terms (32.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead, CD19, CD4, IgD, CD11b, CD14, CCR6, CCR7, PD-1
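To make the log above easier to read, here is a minimal sketch of what validation amounts to conceptually: checking each name for exact membership in a set of registered names and reporting the share that matched. The functions `validate` and `summarize` and the toy registry are illustrative assumptions, not lamindb's implementation.

```python
def validate(values, registry_names):
    """Return one boolean per value: True if it is a registered name."""
    registered = set(registry_names)
    return [value in registered for value in values]


def summarize(values, flags):
    """Format validated / non-validated counts as percentages."""
    n_valid = sum(flags)
    n_total = len(values)
    missing = [v for v, ok in zip(values, flags) if not ok]
    lines = [f"{n_valid} terms ({n_valid / n_total:.2%}) are validated"]
    if missing:
        lines.append(
            f"{len(missing)} terms ({len(missing) / n_total:.2%}) are not validated: "
            + ", ".join(missing)
        )
    return lines


# Toy registry and panel (assumption: a tiny subset of the names in this notebook).
registry = ["CD161", "CD94", "Cd14", "PD1"]
panel = ["CD161", "CD14", "PD-1", "Time"]

flags = validate(panel, registry)
for line in summarize(panel, flags):
    print(line)
```

In this sketch, `CD14` and `PD-1` fail because the match is exact: the registry holds `Cd14` and `PD1`. That is the gap standardization closes.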

Many features aren’t validated. Let’s first standardize the identifiers so that synonyms map to their registered names:

adata.var.index = lb.CellMarker.standardize(adata.var.index)
💡 standardized 35/40 terms
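As a rough illustration of standardization, the sketch below maps every known spelling of a marker (its name and its synonyms) to the registered name. The helper names and the case-insensitive matching are assumptions made for this example; the synonym lists are copied from the registry table in this notebook.

```python
def build_standardizer(records):
    """Map lowercased names and synonyms to the registered name."""
    mapping = {}
    for name, synonyms in records.items():
        mapping[name.lower()] = name
        for synonym in synonyms:
            mapping[synonym.lower()] = name
    return mapping


def standardize(values, mapping):
    """Replace each value by its registered name; leave unknown values as they are."""
    return [mapping.get(value.lower(), value) for value in values]


# Registered name -> synonyms (assumption: matching ignores case).
records = {
    "Cd14": [],
    "PD1": ["PID1", "PD-1", "PD 1"],
    "Ccr7": [],
}
mapping = build_standardizer(records)
print(standardize(["CD14", "PD-1", "CCR7", "Time"], mapping))
```

Values without a match, such as `Time`, pass through unchanged, which is why they still show up as non-validated afterwards.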

After standardizing, we can validate our markers once more:

validated = lb.CellMarker.validate(adata.var.index)
✅ 35 terms (87.50%) are validated for name
❗ 5 terms (12.50%) are not validated for name: Time, Cell_length, Dead, (Ba138)Dd, Bead

More markers are validated now, but 5 features remain that look like metadata rather than cell markers. Hence, let’s curate the AnnData object a bit more.

Let’s move metadata (non-validated cell markers) into adata.obs:

adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
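The two lines above use a boolean mask to split the columns into markers and metadata. A minimal pure-Python sketch of the same idea, assuming a hypothetical column-name-to-values table with made-up numbers:

```python
def split_columns(columns, validated):
    """Split a column-name -> values table into validated and other columns."""
    kept, moved = {}, {}
    for (name, values), ok in zip(columns.items(), validated):
        (kept if ok else moved)[name] = values
    return kept, moved


# Toy measurement table with two events (rows); the values are made up.
columns = {
    "CD3": [0.1, 2.3],
    "Time": [1.0, 2.0],
    "Cd14": [5.2, 0.0],
    "Dead": [0.0, 1.0],
}
validated = [True, False, True, False]

markers, metadata = split_columns(columns, validated)
print(list(markers))   # marker columns that stay in the measurement matrix
print(list(metadata))  # per-event columns that become metadata
```

Because each metadata column holds one value per event, it fits the shape of `.obs`, which also has one row per event.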

Now we have a clean panel of 35 validated cell markers:

lb.CellMarker.validate(adata.var.index);
✅ 35 terms (100.00%) are validated for name

Next, let’s register the metadata features we moved to .obs:

# Feature.from_df creates feature records with type auto-populated
features = ln.Feature.from_df(adata.obs)
ln.save(features)

Lastly, we’d like to annotate this file with its “assay”.

Since we never validated the term “FACS”, let’s search for its ontology term in the public source and register it:

lb.ExperimentalFactor.bionty().search("FACS").head(2)
name                                 ontology_id  definition                                         synonyms          parents        molecule   instrument  measurement  __ratio__
fluorescence-activated cell sorting  EFO:0009108  A Flow Cytometry Assay That Provides A Method ...  FACS|FAC sorting  []             None       None        None         100.0
FACS-seq                             EFO:0008735  Fluorescence-Activated Cell Sorting And Deep S...  None              [EFO:0001457]  RNA assay  None        None         90.0
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108").save()
✅ created 1 ExperimentalFactor record from Bionty matching ontology_id: 'EFO:0009108'

Register

modalities = ln.Modality.lookup()
features = ln.Feature.lookup()
efs = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
file = ln.File.from_anndata(
    adata, description="Alpert19", field=lb.CellMarker.name, modality=modalities.protein
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/Pf9YEZ9F0JlDjZA4SvtR.h5ad')
💡 parsing feature names of X stored in slot 'var'
✅    35 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='5h0dt5FKgiqNbnp4p4bJ', n=35, type='number', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', modality_id='XqL2I9hb', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    5 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='phVUsY5rsUZlQvrZL6W4', n=5, registry='core.Feature', hash='JAXyMu4GMkRC9F1sprVD', modality_id='Lj33xe6m', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'Pf9YEZ9F0JlDjZA4SvtR' at '.lamindb/Pf9YEZ9F0JlDjZA4SvtR.h5ad'
file.add_labels(efs.fluorescence_activated_cell_sorting, features.assay)
file.add_labels(species.human, features.species)
✅ linked new feature 'assay' together with new feature set FeatureSet(id='jiOvvfjC6cw4cLpuCLP4', n=1, registry='core.Feature', hash='xJ3HV6pd02Tfm04qsxw0', updated_at=2023-09-06 17:08:05, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='jiOvvfjC6cw4cLpuCLP4', n=1, registry='core.Feature', hash='xJ3HV6pd02Tfm04qsxw0', updated_at=2023-09-06 17:08:05, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='Ch5FrKS00znfgHQvXmGN', n=2, registry='core.Feature', hash='qX1POCIiCFo8hEXOtLrP', updated_at=2023-09-06 17:08:06, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
file.features
Features:
  var: FeatureSet(id='5h0dt5FKgiqNbnp4p4bJ', n=35, type='number', registry='bionty.CellMarker', hash='ldY9_GmptHLCcT7Nrpgo', updated_at=2023-09-06 17:08:05, modality_id='XqL2I9hb', created_by_id='DzTjkKse')
    'CD161', 'CD94', 'CD127', 'Cd14', 'CD11c', 'CD56', 'ICOS', 'CD86', 'CD33', 'CD38', ...
  obs: FeatureSet(id='phVUsY5rsUZlQvrZL6W4', n=5, registry='core.Feature', hash='JAXyMu4GMkRC9F1sprVD', updated_at=2023-09-06 17:08:05, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
    Time (number)
    Dead (number)
    (Ba138)Dd (number)
    Cell_length (number)
    Bead (number)
  external: FeatureSet(id='Ch5FrKS00znfgHQvXmGN', n=2, registry='core.Feature', hash='qX1POCIiCFo8hEXOtLrP', updated_at=2023-09-06 17:08:06, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'
    🔗 species (1, bionty.Species): 'human'

Check a few validated cell markers in .var:

file.features["var"].df().head()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse

FlowIO sample

Let’s validate and register another flow file:

Access

adata2 = readfcs.read(ln.dev.datasets.file_fcs())

This AnnData object does not require filtering, normalizing or formatting, so there is no transform step.

Validate

First, let’s standardize the cell markers:

adata2.var.index = lb.CellMarker.standardize(adata2.var.index)
💡 standardized 10/16 terms
❗ found 1 synonym in Bionty: ['KI67']
   please add corresponding CellMarker records via `.from_values(['Ki67'])`
validated = lb.CellMarker.validate(adata2.var.index)
✅ 10 terms (62.50%) are validated for name
❗ 6 terms (37.50%) are not validated for name: FSC-A, FSC-H, SSC-A, Ki67, CD45RO, CCR5

Register non-validated markers from Bionty:

records = lb.CellMarker.from_values(adata2.var.index[~validated])
ln.save(records)
✅ created 4 CellMarker records from Bionty matching name: 'SSC-A', 'Ki67', 'CD45RO', 'CCR5'
❗ did not create CellMarker records for 2 non-validated names: 'FSC-A', 'FSC-H'

Now all terms pass validation except for the two non-markers ‘FSC-A’ and ‘FSC-H’:

lb.CellMarker.validate(adata2.var.index);
✅ 14 terms (87.50%) are validated for name
❗ 2 terms (12.50%) are not validated for name: FSC-A, FSC-H

Register

file2 = ln.File.from_anndata(
    adata2,
    description="My fcs file",
    field=lb.CellMarker.name,
    modality=modalities.protein,
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/mb2LKG4iOAuwrSs7SoNS.h5ad')
💡 parsing feature names of X stored in slot 'var'
✅    14 terms (87.50%) are validated for name
❗    2 terms (12.50%) are not validated for name: FSC-A, FSC-H
✅    linked: FeatureSet(id='g89WQZesMhda1UA9ErbD', n=14, type='number', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', modality_id='XqL2I9hb', created_by_id='DzTjkKse')
file2.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file 'mb2LKG4iOAuwrSs7SoNS' at '.lamindb/mb2LKG4iOAuwrSs7SoNS.h5ad'
file2.add_labels(efs.fluorescence_activated_cell_sorting, features.assay)
file2.add_labels(species.human, features.species)
✅ linked new feature 'assay' together with new feature set FeatureSet(id='2n6jCwtAD3VxEmEPIpvg', n=1, registry='core.Feature', hash='xJ3HV6pd02Tfm04qsxw0', updated_at=2023-09-06 17:08:09, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
✅ loaded: FeatureSet(id='Ch5FrKS00znfgHQvXmGN', n=2, registry='core.Feature', hash='qX1POCIiCFo8hEXOtLrP', updated_at=2023-09-06 17:08:06, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='Ch5FrKS00znfgHQvXmGN', n=2, registry='core.Feature', hash='qX1POCIiCFo8hEXOtLrP', updated_at=2023-09-06 17:08:09, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
file2.features
Features:
  var: FeatureSet(id='g89WQZesMhda1UA9ErbD', n=14, type='number', registry='bionty.CellMarker', hash='npy5P7AYbjKLInpXlNvb', updated_at=2023-09-06 17:08:09, modality_id='XqL2I9hb', created_by_id='DzTjkKse')
    'CD57', 'CD127', 'Cd14', 'Ccr7', 'SSC-A', 'Ki67', 'CD27', 'CCR5', 'CD28', 'CD3', ...
  external: FeatureSet(id='Ch5FrKS00znfgHQvXmGN', n=2, registry='core.Feature', hash='qX1POCIiCFo8hEXOtLrP', updated_at=2023-09-06 17:08:09, modality_id='Lj33xe6m', created_by_id='DzTjkKse')
    🔗 assay (1, bionty.ExperimentalFactor): 'fluorescence-activated cell sorting'
    🔗 species (1, bionty.Species): 'human'

View data flow:

file2.view_flow()
[Figure: data flow graph rendered by file2.view_flow(), https://d33wubrfki0l68.cloudfront.net/7e6a6a896441aaaec95f53e2b34a9669a0a0eedd/2ccd6/_images/30ba27ae70380d40fa6e526230c158d06ee92d2a99a60175fb81343f812c3ae7.svg]

Flow marker registry

Check out your flow marker registry:

lb.CellMarker.filter().df()
id            name    synonyms              gene_symbol  ncbi_gene_id  uniprotkb_id  species_id  bionty_source_id  updated_at           created_by_id
4EojtgN0CjBH  CD161                         KLRB1        3820          Q12918        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
0qCmUijBeByY  CD94                          KLRD1        3824          Q13241        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
hVNEgxlcDV10  CD127                         IL7R         3575          P16871        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
roEbL8zuLC5k  Cd14                          CD14         4695          O43678        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
L0WKZ3fufq0J  CD11c                         ITGAX        3687          P20702        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
h4rkCALR5WfU  CD56                          NCAM1        4684          P13591        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
0vAls2cmLKWq  ICOS                          ICOS         29851         Q53QY6        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
L0m6f7FPiDeg  CD86                          CD86         942           A8K632        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
c3dZKHFOdllB  CD33                          CD33         945           P20138        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
CR7DAHxybgyi  CD38                          CD38         952           B4E006        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
4uiPHmCPV5i1  CXCR5                         CXCR5        643           A0N0R2        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
a624IeIqbchl  CD45RA                        None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
bspnQ0igku6c  CD16                          FCGR3A       2215          O75015        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
fpPkjlGv15C9  Ccr6                          CCR6         1235          P51684        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
k0zGbSgZEX3q  HLADR   HLA‐DR|HLA-DR|HLA DR  None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
8OhpfB7wwV32  Cd19                          CD19         930           P15391        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
cFJEI6e6wml3  CD20                          MS4A1        931           A0A024R507    uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
gEfe8qTsIHl0  CD24                          CD24         100133941     B6EC88        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
Nb2sscq9cBcB  CD57                          B3GAT1       27087         Q9P2W7        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
YA5Ezh6SAy10  DNA1                          None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
0evamYEdmaoY  Igd                           None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
ljp5UfCF9HCi  TCRgd   TCRGAMMADELTA|TCRγδ   None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
lRZYuH929QDw  CD85j                         None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
sYcK7uoWCtco  Ccr7                          CCR7         1236          P32248        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
yCyTIVxZkIUz  DNA2                          DNA2         1763          P51530        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
2VeZenLi2dj5  PD1     PID1|PD-1|PD 1        PDCD1        5133          A0A0M3M0G7    uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
CLFUvJpioHoA  CD28                          CD28         940           B4E0L1        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
n40112OuX7Cq  CD123                         IL3RA        3563          P26951        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
N2F6Qv9CxJch  CD11B                         ITGAM        3684          P11215        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
a4hvNp34IYP0  CD3                           None         None          None          uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
50v4SaR2m5zQ  CD25                          IL2RA        3559          P01589        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
HEK41hvaIazP  Cd4                           CD4          920           B4DT49        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
ttBc0Fs01sYk  CD8                           CD8A         925           P01732        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
agQD0dEzuoNA  CXCR3                         CXCR3        2833          P49682        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
uThe3c0V3d4i  CD27                          CD27         939           P26842        uHJU        DcCt              2023-09-06 17:08:02  DzTjkKse
VZBURNy04vBi  SSC-A   SSC A|SSCA            None         None          None          uHJU        DcCt              2023-09-06 17:08:08  DzTjkKse
Qa4ozz9tyesQ  Ki67    Ki-67|KI 67           None         None          None          uHJU        DcCt              2023-09-06 17:08:08  DzTjkKse
UMsp5g0fgMwY  CCR5                          CCR5         1234          P51681        uHJU        DcCt              2023-09-06 17:08:08  DzTjkKse
XvpJ6oL3SG7w  CD45RO                        None         None          None          uHJU        DcCt              2023-09-06 17:08:08  DzTjkKse

Search for a marker (synonym-aware):

Tip

To search for a non-registered marker in the public source, use lb.CellMarker.bionty().search(...)

lb.CellMarker.search("PD-1").head(2)
name  id            synonyms        __ratio__
PD1   2VeZenLi2dj5  PID1|PD-1|PD 1  100.0
CD16  bspnQ0igku6c                  50.0
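One way to picture a synonym-aware search is to score each record by the best string similarity over its name and all of its synonyms. The sketch below uses the standard library's `difflib`; both the scoring function and the helper `search` are assumptions for illustration, and the scores need not match the `__ratio__` values lamindb reports.

```python
from difflib import SequenceMatcher


def search(query, records, limit=2):
    """Rank registered names by their best similarity to the query.

    Each record is scored by the highest ratio over its name and its synonyms.
    """
    query = query.lower()
    scored = []
    for name, synonyms in records.items():
        candidates = [name, *synonyms]
        best = max(
            SequenceMatcher(None, query, candidate.lower()).ratio()
            for candidate in candidates
        )
        scored.append((name, round(100 * best, 1)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Registered name -> synonyms, copied from the registry table in this notebook.
records = {
    "PD1": ["PID1", "PD-1", "PD 1"],
    "CD16": [],
    "Cd14": [],
}
hits = search("PD-1", records)
print(hits)
```

The query `PD-1` ranks `PD1` first because one of its synonyms matches exactly, even though the registered name itself is spelled differently.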

Auto-complete of markers:

cell_markers = lb.CellMarker.lookup()
cell_markers.cd14
CellMarker(id='roEbL8zuLC5k', name='Cd14', synonyms='', gene_symbol='CD14', ncbi_gene_id='4695', uniprotkb_id='O43678', updated_at=2023-09-06 17:08:02, species_id='uHJU', bionty_source_id='DcCt', created_by_id='DzTjkKse')
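Attribute names such as `cd14` or `fluorescence_activated_cell_sorting` suggest that lookup keys are derived from record names by lowercasing them and replacing non-word characters. A minimal sketch of that idea follows; the normalization rule and the helpers `to_attribute` and `make_lookup` are assumptions, not lamindb's actual code.

```python
import re
from types import SimpleNamespace


def to_attribute(name):
    """Turn a record name into a lowercase, underscore-separated identifier."""
    key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    # Identifiers cannot start with a digit, so prefix those keys.
    return f"_{key}" if key[:1].isdigit() else key


def make_lookup(names):
    """Expose records as attributes so that an editor can auto-complete them."""
    return SimpleNamespace(**{to_attribute(name): name for name in names})


lookup = make_lookup(["Cd14", "PD1", "fluorescence-activated cell sorting"])
print(lookup.cd14)
print(lookup.fluorescence_activated_cell_sorting)
```

Here the namespace stores plain strings; in a real registry each attribute would hold the full record, as in the `CellMarker(...)` output above.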

Query panels and datasets based on markers, e.g. which datasets have CD14 in the flow panel:

panels_with_cd14 = ln.FeatureSet.filter(cell_markers=cell_markers.cd14).all()
ln.File.filter(feature_sets__in=panels_with_cd14).df()
id                    storage_id  key   suffix  accessor  description  version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
Pf9YEZ9F0JlDjZA4SvtR  7BMzjDBU    None  .h5ad   AnnData   Alpert19     None     33367624  14w5ElNsR_MqdiJtvnS1aw  md5        OWuTtS4SAponz8  EHPP1Ldy9I7CweDEXlV6  None                2023-09-06 17:08:05  DzTjkKse
mb2LKG4iOAuwrSs7SoNS  7BMzjDBU    None  .h5ad   AnnData   My fcs file  None     6876232   Cf4Fhfw_RDMtKd5amM6Gtw  md5        OWuTtS4SAponz8  EHPP1Ldy9I7CweDEXlV6  None                2023-09-06 17:08:09  DzTjkKse
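The query above runs in two steps: first find the panels (feature sets) that contain the marker, then find the files linked to any of those panels. A rough in-memory sketch of that logic, where the panel ids, the panel contents and `Other file` are hypothetical stand-ins for the registries:

```python
# Hypothetical in-memory stand-ins for the FeatureSet and File registries.
panels = {
    "panel_a": {"CD161", "CD94", "Cd14"},
    "panel_b": {"CD57", "Cd14", "Ki67"},
    "panel_c": {"CD3", "CD8"},
}
file_panels = {
    "Alpert19": ["panel_a"],
    "My fcs file": ["panel_b"],
    "Other file": ["panel_c"],
}


def panels_with(marker):
    """Return the ids of all panels that contain the marker."""
    return {panel_id for panel_id, markers in panels.items() if marker in markers}


def files_with_panels(panel_ids):
    """Return the files linked to at least one of the given panels."""
    return [
        name
        for name, linked in file_panels.items()
        if panel_ids.intersection(linked)
    ]


matching_panels = panels_with("Cd14")
print(files_with_panels(matching_panels))
```

Only the links between files and panels are consulted, so the query never has to open the stored data.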

Shared cell markers between two files:

# no need to load the content of files
files = ln.File.filter(feature_sets__in=panels_with_cd14, species=species.human).list()
file1, file2 = files[0], files[1]
file1_markers = file1.features["var"]
file2_markers = file2.features["var"]

shared_markers = file1_markers & file2_markers
shared_markers.list("name")
['CD127', 'Cd14', 'Cd19', 'CD57', 'Ccr7', 'CD28', 'CD3', 'Cd4', 'CD8', 'CD27']
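The `&` operator on two feature sets behaves like a set intersection. As a sketch, the same operation on plain Python lists, using only the first ten marker names that `file.features` and `file2.features` printed earlier (the full panels are larger, so this result is a subset of the list above):

```python
# First ten marker names of each panel as printed earlier in this notebook;
# the full panels contain more markers.
panel_1 = ["CD161", "CD94", "CD127", "Cd14", "CD11c", "CD56", "ICOS", "CD86", "CD33", "CD38"]
panel_2 = ["CD57", "CD127", "Cd14", "Ccr7", "SSC-A", "Ki67", "CD27", "CCR5", "CD28", "CD3"]

# Set intersection, keeping the order in which markers appear in panel_1.
in_panel_2 = set(panel_2)
shared = [marker for marker in panel_1 if marker in in_panel_2]
print(shared)
```

Because the comparison uses standardized names, a marker recorded as `CD14` in one file and `Cd14` in the other would only count as shared after standardization.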

Load the file into memory:

adata = file1.load()
adata
AnnData object with n_obs × n_vars = 166537 × 35
    obs: 'Time', 'Cell_length', 'Dead', '(Ba138)Dd', 'Bead'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnR'
    uns: 'meta'
# clean up test instance
!lamin delete --force test-flow
!rm -r test-flow
💡 deleting instance testuser1/test-flow
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-flow.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-flow