Hi Floris,
The VM is not booted by the user, but by the site.
Yes, and it runs inside the trusted network with system rights... So you have an "endorsed" user machine running with system rights. This is very unsafe.
Actually, I am not sure it has to run with system rights. The policy says that the applications in the VM should have the same environment as the existing worker nodes. If that can be accomplished with user rights, that would be fine.
The question is: how do we generate images and make Davide trust them enough to run them on his site?
A site admin doesn't need to trust VMs any more than he trusts regular user jobs, because from the system viewpoint they have the same rights. We (Davide and our HPC Cloud) run VMs with USER credentials, as a user job, and, as an added feature, each inside its own virtual network. The only outside connections allowed are those that are allowed for regular user jobs, and they are managed bridged connections, so the VMs can indeed only see traffic that is their own.
Sounds good to me. What about performance issues? I'm especially concerned about file I/O and network I/O.
Could you explain to me, why you think liability is necessary for VMs, but not for normal jobs?
If I were to trust an endorser to say that an image is safe for use, I would need to know how high the insurance policy is. If the usual policy applies ("yes, I think it is safe, but in case of trouble you yourself pay"), then I will not boot their machines under this policy, because I have no means to block a machine without violating the policy.
The policy states that the site can block machines whenever they feel like it, for whatever reason they want, so I don't think this is true. I do see your point about running a VM (even if it is endorsed) as a system-level process, making liability more of a concern. Maybe it is possible to reproduce the Nikhef WN environment in a VM even if it is running as a user process. Would there be ways to give the VMs access to e.g. NFS shared storage? Maybe the hypervisor can get access and then provide it to the VM. Can someone from Nikhef shed some light on this?
As far as I know, the WNoDeS of Davide is not a Class 3 facility (yet), but I'll have a look at his latest slides.
Please do. It is a "class 3 facility", which does scale to run 20,000 jobs each day as virtual machines, including user-submitted ones and Grid jobs!
I'm sorry, I don't see anywhere in the presentation that it is a class 3 facility. As you quoted, it says O(10) supported images. That's what Davide told me last time. He also told me they are working on a mechanism for users to upload arbitrary images (e.g. via http, which I also see mentioned in this presentation), but that the trust issue wasn't solved to a level where it was deployable on 1400 cores and 20k jobs per day. That would require additional monitoring tools and security infrastructure, which at that time were not in place yet. After carefully looking at the slides, I am not convinced that they are in place now. Anyway, as I told you, I'll be in Catania in the week of the 17th of May for the INFN annual meeting and I will ask him about it.
My point is still: people can run any VM in our HPC cloud, but only with rights that are proper for an end-user. I do not trust a VM at all, or at least no more than a regular user job (because we have no control over its contents), and I do not trust users never to make mistakes.
Point taken, we should look into running VMs with end-user rights. But even with end-user rights, if they get access to NFS or other shared resources, we need an endorsement procedure for the VM. I think this is the main difference with Cloudia, where the VMs don't have access to such shared resources.
Users are actually very much afraid of that kind of trust; they expect us to protect them in case something goes wrong.
Interesting insights. Thanks, Sander