Track DataTree progress #2015
Comments
Thanks for coming to the xarray community meeting yesterday to ask about datatree @OriolAbril! I've just had a look at the code for
Similarities
I was pleasantly surprised by how similar the two classes are. They both:
This should mean it's not too hard to swap one out for the other.
Differences
However, there are some significant differences:
Implementation
Now the behaviour of
That's why the basic choice to be made on the arviz side is whether you would want to directly use and expose
(Tagging @jhamman, @alexamici, and @aurghs in case any of you are interested)
Thanks for the super detailed comment. I am slowly familiarizing myself with DataTree and I'll try to answer some of the points, but I can't speak to everything yet.
I think this is good. IIUC, one would now be able to do
Note: this will probably mean making some changes to ArviZ functions, probably to plotting, but imo it will be worth it to 1) take advantage of the improved datatree features and 2) stop maintaining InferenceData.
Storing the warmup datasets as hidden attributes is something we have discussed changing a couple of times. These groups are generally not present and are useless to users of Bayesian models; they are mostly useful to people working on MCMC algorithms and, in some cases, for diagnosing why a model is not being sampled correctly. I still have to look into that to see what could make sense. From a conceptual point of view, having those as subgroups inside their non-warmup counterpart makes sense, but I can't think of any case where we'd want functions to be applied to both warmup and non-warmup groups in the same way.
Does this come from this page? It isn't a feature yet; it's more of a useful concept and an assumption made by several of our functions. Here is some intuition on the general idea: the core of Bayes+MCMC methods is generating samples via MCMC from the posterior distribution of the parameters in our model (because getting the posterior analytically is generally impossible). We then get the posterior dataset with one sample (a sample means a pair of chain+draw values) for each variable (potentially multidimensional even before adding the chain and draw dimensions). But also, as the samples were generated with MCMC, we get some summary statistics for each sample, and the pointwise log likelihood values at each sample too. And when we generate predictions, we generally generate one prediction per posterior sample. Currently each of these groups has its own chain and draw dimensions, which are independent of each other because the groups are independent datasets, but we assume they are the same in some plots, in model comparison, and in model criticism. If possible, it might be useful to share those dimension/coordinate values between groups. There are also cases where generating predictions is expensive, for example, and it might be enough to generate them for only a subset of the samples, in which case it would need to be possible to have independent dimensions too. It would also be perfectly fine to keep all dimensions independent and assume they are the same; nobody has ever complained about this.
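To make the "matching samples" assumption concrete, here is a minimal sketch using plain `xarray.Dataset` objects as stand-ins for the groups. The variable names (`mu`, `y`), the sizes, and the helper `same_samples` are made up for illustration and are not part of ArviZ or datatree. It shows one group that shares chain/draw values with the posterior and one that only covers a subset of draws.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
coords = {"chain": np.arange(2), "draw": np.arange(100)}

# Each group carries its own, independent chain/draw coordinates.
posterior = xr.Dataset(
    {"mu": (("chain", "draw"), rng.normal(size=(2, 100)))}, coords=coords
)
log_likelihood = xr.Dataset(
    {"y": (("chain", "draw", "obs"), rng.normal(size=(2, 100, 5)))}, coords=coords
)

# Expensive predictions: generated only for every 10th posterior draw.
pp_draws = coords["draw"][::10]
posterior_predictive = xr.Dataset(
    {"y": (("chain", "draw", "obs"), rng.normal(size=(2, 10, 5)))},
    coords={"chain": coords["chain"], "draw": pp_draws},
)


def same_samples(a, b):
    """Hypothetical check: do two groups have identical chain/draw values?"""
    return all(a.indexes[dim].equals(b.indexes[dim]) for dim in ("chain", "draw"))


full_match = same_samples(posterior, log_likelihood)
subset_match = same_samples(posterior, posterior_predictive)

# Pick out the posterior samples that actually have a prediction.
matched_posterior = posterior.sel(draw=posterior_predictive["draw"].values)
```

Here `full_match` is true and `subset_match` is false, which is the case where the dimensions need to stay independent; selecting by coordinate value recovers the matching samples.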
I honestly have no idea how any of the typing related to InferenceData works.
I think we can remove or deprecate that already (cc @ahartikainen) to make the transition easier. I am not sure anyone is using it, and I wouldn't have recommended it to anyone; I would have told them to use concat explicitly.
If all cases covered in
I might be able to help with that at some point, but I think it will take me some time to be able to do anything useful (both because I don't have a lot of free time on my hands and because understanding datatrees, and especially the cases that differ from ArviZ, will probably take some time by itself too). Is there an issue related to this I could track? Alternatively, feel free to tag me if you open an issue on this or if there is a PR to test.
I still haven't figured out how subsetting/filtering works with DataTrees. I'd be happy to write some docs as I learn this, but I'm still quite unsure how to go about it. For now, I'm trying to look at things like getting a subset of the datatree that consists of multiple groups, or applying a function to the variable x that is present in 3 out of 5 groups of the datatree. Depending on what is possible and your future plans it might be good to extend some functions; i.e. if
Yes, I think we can deprecate
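As a rough illustration of the two operations mentioned above (taking a subset of groups, and applying a function to `x` only where it exists), here is a sketch that uses a plain dict of `xarray.Dataset` objects as a stand-in for a tree. It deliberately avoids any DataTree-specific API; the group and variable names and the helper `make_group` are assumptions for the example.

```python
import numpy as np
import xarray as xr


def make_group(var_names):
    """Build a toy group whose variables are all ones over (chain, draw)."""
    return xr.Dataset(
        {name: (("chain", "draw"), np.ones((2, 10))) for name in var_names},
        coords={"chain": np.arange(2), "draw": np.arange(10)},
    )


# Five groups; the variable "x" is present in three of them.
groups = {
    "posterior": make_group(["x", "sigma"]),
    "prior": make_group(["x", "sigma"]),
    "warmup_posterior": make_group(["x"]),
    "sample_stats": make_group(["lp"]),
    "observed_data": make_group(["y"]),
}

# Subset consisting of multiple groups.
subset = {name: groups[name] for name in ("posterior", "prior")}

# Apply a function to "x" only in the groups that contain it.
x_means = {
    name: ds["x"].mean(("chain", "draw")).item()
    for name, ds in groups.items()
    if "x" in ds.data_vars
}
```

With a real tree the same idea would presumably be a filter over nodes followed by a per-node map, but how that should look in datatree is exactly the open question here.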
For concat, we could always create a couple special functions
Or maybe one which can do both special cases
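A rough sketch of what a single helper covering both special cases might look like, again with plain dicts of `xarray.Dataset` standing in for tree groups. The name `concat_groups`, its signature, and the toy data are hypothetical and are not existing datatree or ArviZ API; only `xr.concat` is a real xarray call.

```python
import numpy as np
import xarray as xr


def concat_groups(trees, dim=None):
    """Hypothetical helper covering two cases.

    dim=None: combine objects that hold *different* groups.
    dim="chain"/"draw": concatenate the *same* groups along that dimension.
    """
    if dim is None:
        merged = {}
        for tree in trees:
            overlap = merged.keys() & tree.keys()
            if overlap:
                raise ValueError(f"duplicate groups: {sorted(overlap)}")
            merged.update(tree)
        return merged

    names = trees[0].keys()
    if any(tree.keys() != names for tree in trees[1:]):
        raise ValueError("all inputs need the same groups to concatenate along a dimension")
    return {name: xr.concat([tree[name] for tree in trees], dim=dim) for name in names}


def make_posterior(n_chain, chain_offset=0):
    """Toy group with `n_chain` chains of 50 draws each."""
    return xr.Dataset(
        {"mu": (("chain", "draw"), np.zeros((n_chain, 50)))},
        coords={"chain": np.arange(n_chain) + chain_offset, "draw": np.arange(50)},
    )


run_a = {"posterior": make_posterior(2)}
run_b = {"posterior": make_posterior(2, chain_offset=2)}
prior_only = {"prior": make_posterior(1)}

more_chains = concat_groups([run_a, run_b], dim="chain")  # same groups, more chains
combined = concat_groups([run_a, prior_only])  # different groups, one object
```

Keeping the two behaviours behind one `dim` argument is only one option; two separately named functions would make the intent at the call site more explicit.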
Yes that's correct.
I agree.
Okay, we'll have to think about this more carefully.
Thanks for the explanation about matching samples. I want to make it easier to refer to one coordinate from multiple different groups, which might help with this.
Great.
So I expect that
This also seems like a good idea.
That's because it's not really been defined yet 😅
There is now!: xarray-contrib/datatree#79
This is useful to know already, but we can discuss details on xarray-contrib/datatree#79
DataTree should replace InferenceData and provide a much better feature set.