In most game engines today, the typical design is that a scene is a graph / tree of components, some of which may be renderable.
In a sense you are correct, every object is basically a node in a tree of things.
You may have heard the terms "Entity Component System" and "Component Composition" in relation to game engines, and more specifically scene objects.
Both terms describe building scene objects out of smaller components. The objects sit in a hierarchy that makes up the renderable scene, and not all of them contain a renderer.
Component Composition is particularly useful for games because it lets you dynamically bolt pieces onto an object to achieve something useful, where each piece is one function / behavior of the object it is bolted to.
For example:
In Unity (arguably one of the most popular engines for indie developers today) I can create an object in my scene which is in itself just a container.
I can then add child objects to that object which are also just containers.
I can then add things to an object: renderers, mesh information, materials, behaviors like input control, or other behaviors like gravity.
So I could say "this is a terrain object, and this child of that terrain is a character object, which has a controller component to allow the user to feed input to the character".
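To make the bolting-together idea concrete, here is a minimal sketch in Python. It is not Unity's API: `GameObject`, `Component`, `Controller`, `Gravity` and the method names are all hypothetical names chosen for illustration.

```python
class Component:
    """Base class for a piece of behavior or data attached to a GameObject."""

    def __init__(self):
        self.owner = None  # set when the component is attached


class Controller(Component):
    """Hypothetical component: marks an object as driven by user input."""


class Gravity(Component):
    """Hypothetical component: marks an object as affected by gravity."""


class GameObject:
    """A plain container: a name, child objects, and attached components."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.components = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def add_component(self, component):
        component.owner = self
        self.components.append(component)
        return component

    def get_component(self, kind):
        """Return the first attached component of the given type, or None."""
        for component in self.components:
            if isinstance(component, kind):
                return component
        return None


# The terrain is just a container; the character is its child and gets
# its behavior from the components bolted onto it.
terrain = GameObject("terrain")
character = terrain.add_child(GameObject("character"))
character.add_component(Controller())
character.add_component(Gravity())
```

Note that neither object subclasses anything game-specific. What the character "is" comes entirely from which components are attached to it, so behavior can be added or removed at runtime.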
So to put this back into terms that match your question ...
Your scene node object would contain a mesh component and a renderer component for rendering that mesh.
The renderer would use the mesh and a material to draw the object in our virtual world each frame.
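As a rough sketch of that relationship (again in Python, with hypothetical names, and a string standing in for a real GPU draw call), a node holding a mesh and a renderer might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mesh:
    vertices: list  # list of (x, y, z) tuples


@dataclass
class Material:
    color: str


@dataclass
class Renderer:
    material: Material

    def draw(self, mesh):
        # Stand-in for a real draw call: just describe what would be drawn.
        return f"draw {len(mesh.vertices)} vertices with {self.material.color} material"


@dataclass
class SceneNode:
    name: str
    mesh: Optional[Mesh] = None
    renderer: Optional[Renderer] = None

    def render(self):
        """Called once per frame; returns None when there is nothing to draw."""
        if self.mesh is None or self.renderer is None:
            return None
        return self.renderer.draw(self.mesh)


triangle = SceneNode(
    "triangle",
    Mesh([(0, 0, 0), (1, 0, 0), (0, 1, 0)]),
    Renderer(Material("red")),
)
empty = SceneNode("empty")  # a pure container, nothing to draw

for frame in range(2):
    print(frame, triangle.render())
```

The node itself knows nothing about drawing. It only hands its mesh to its renderer, and a node without both simply produces nothing for that frame.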
Also:
Other things that might be scene nodes / objects are a camera, an audio source, or a light source. None of these would typically be renderable as such, but they arguably have an impact on the scene. It is the job of the engine to determine how to interpret your defined scene graph / tree.
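One way an engine could do that interpreting is to walk the tree and sort nodes by what they are before drawing anything. The sketch below assumes a hypothetical `kind` tag on each node; real engines differ in how they classify nodes, so treat this as an illustration only.

```python
class Node:
    def __init__(self, name, kind="container", children=None):
        self.name = name
        # Hypothetical tag: "container", "mesh", "camera", "light" or "audio".
        self.kind = kind
        self.children = children or []


def walk(node):
    """Yield every node in the tree, depth first, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def interpret(root):
    """Sort node names into buckets, as an engine might before a frame."""
    buckets = {"mesh": [], "camera": [], "light": [], "audio": []}
    for node in walk(root):
        if node.kind in buckets:  # plain containers fall through
            buckets[node.kind].append(node.name)
    return buckets


scene = Node("root", children=[
    Node("main_camera", "camera"),
    Node("sun", "light"),
    Node("terrain", "mesh", children=[
        Node("character", "mesh"),
        Node("footsteps", "audio"),
    ]),
])

buckets = interpret(scene)
print(buckets)
```

Here only the "mesh" bucket would end up being drawn, while the camera decides the viewpoint, the light affects shading, and the audio source never touches the renderer at all.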