Sharing data sets between chapters #16

Open
rgommers opened this issue Jul 27, 2017 · 4 comments

Comments

@rgommers
Contributor

From Debra's email: Matt Rocklin suggested using some data sets in common through the book, so feel free to coordinate with others on the project. The Dask chapter will also be written using the data and projects described in some of the other chapters.

@mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.

@rgommers
Contributor Author

Cc @WarrenWeckesser @ev-br

@mrocklin
Contributor

> @mrocklin do you have an overview of data sets already in use? For the SciPy chapter we'd be happy to reuse something as well.

I personally have no visibility into what people have been doing. I like the idea of coordinating on datasets and examples, but have taken no concrete steps in this direction.

Perhaps this issue is such a step? If others are around, it might be interesting to list both the constraints each of our sections places on datasets and some datasets that we know about and appreciate.

For example, for Dask we have the following constraints:

  • It is useful if the data is inconveniently large, so that parallelism or off-memory approaches can be relevant.
  • It is useful if functions used in other examples are serializable (this is usually the case)
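
A minimal sketch of how a chapter author might check the serializability constraint above before sharing a function with the Dask chapter. It uses only the standard-library `pickle` module; the helper name `is_picklable` is hypothetical, and this check is an assumption about what "serializable" means here. As far as I understand, Dask can fall back to cloudpickle and so may accept more than this conservative test does.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if the standard-library pickle can round-trip obj.

    Hypothetical helper for illustration; it is a conservative stand-in
    for whatever serialization the Dask chapter actually relies on.
    """
    try:
        pickle.loads(pickle.dumps(obj))
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# An importable function pickles by reference; a lambda has no importable
# name, so the standard pickle module rejects it.
candidates = {
    "math.sqrt": math.sqrt,
    "lambda": lambda x: x + 1,
}
for name, func in candidates.items():
    status = "picklable" if is_picklable(func) else "not picklable"
    print(name, status)
```

Functions defined at module level in an importable file generally pass this check, which is consistent with the note that serializability is usually the case.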

Datasets that we've frequently used in tutorials and examples include the following:

  • The NYC Taxi dataset
  • Various meteorology datasets, in particular ECMWF has public downloads
  • Airlines
  • ...

@rgommers
Contributor Author

rgommers commented Aug 2, 2017

> Perhaps this issue is such a step?

+1

For SciPy we are pretty flexible in terms of datasets to use. We do need:

  • time series data, for IIR/FIR functionality. EDIT: we've now added a data set for this, pressure measurements: pressure.dat
  • one dataset that is large enough for using scipy.LowLevelCallable sensibly
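
To illustrate why time series data suits the IIR/FIR material, here is a rough pure-Python sketch of both filter types applied to a synthetic series. The helper `linear_filter` is hypothetical and is not SciPy's implementation; it follows the same difference-equation convention as `scipy.signal.lfilter(b, a, x)`. The synthetic signal is an assumption and does not reproduce the contents of pressure.dat.

```python
import math


def linear_filter(b, a, x):
    """Apply y[n] = (sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]) / a[0].

    Hypothetical helper: with a == [1.0] this is an FIR filter,
    and with feedback terms in a it is an IIR filter.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


# Synthetic stand-in for a measured time series (NOT pressure.dat):
# a slow oscillation plus a fast ripple with a period of 4 samples.
signal = [
    math.sin(2 * math.pi * n / 50) + 0.3 * math.sin(2 * math.pi * n / 4)
    for n in range(200)
]

# FIR: 4-tap moving average, which averages out the period-4 ripple.
fir_out = linear_filter([0.25] * 4, [1.0], signal)

# IIR: one-pole smoother, y[n] = alpha*x[n] + (1 - alpha)*y[n-1].
alpha = 0.2
iir_out = linear_filter([alpha], [1.0, -(1 - alpha)], signal)
```

In a chapter built on SciPy, the same coefficients could presumably be passed to `scipy.signal.lfilter` instead of the hand-written loop.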

@jbednar
Contributor

jbednar commented Apr 28, 2018

In our chapter (#26) we're using the measles incidence dataset that the Wall Street Journal highlighted a while back, along with some NYC taxi data, if anyone wants to use those.
